open_local can only be used on a filesystem which has attribute local_file=True #81
Comments
Hi @martindurant and @scottyhq. Yes, the issue seems to be related to #76, but as far as I can see, that is not because of the
The idea of separating opendap and netcdf, which was part of the discussion of #76, was based on the assumption that remote files which are actually netcdf files should be cached locally, while remote resources which are actually opendap resources should not be cached locally (in fact, those resources are not a single file, so downloading and caching a single URL will never work for opendap resources). Complicating the situation is the fact that URLs to plain netCDF files are not distinguishable from URLs to opendap resources, and the standard netCDF library is able to open both. That is why it is useful to distinguish those URLs at the driver level. I think the situation for the
*It would be possible to create an opendap server which serves TIF files as an opendap resource (which is not the case here). But that situation would be handled properly in the current version by opening it with the |
You are right about your understanding of what open_local does, but it only works for a URL which specifies the caching mechanism to be used (typically by prefixing "simplecache::" to the URL). Is it the case, then, that the driver can read directly from a python file-like object? open_local ought to be used only when that doesn't work and you can only operate on file names. |
Thanks for the input @d70-t and @martindurant! Just to clarify, we are ultimately opening a TIF file here via
I'm learning the plumbing as I go along here, but I'm actually a bit confused as to why fsspec is involved in this particular chain, because this works just fine without fsspec:
import xarray as xr
da = xr.open_rasterio('https://landsat-pds.s3.amazonaws.com/c1/L8/152/038/LC08_L1TP_152038_20200611_20200611_01_RT/LC08_L1TP_152038_20200611_20200611_01_RT_B1.TIF')
I noticed the full URLs didn't appear in the traceback, but they are public. Perhaps some flag or change needs to be added
We definitely do not want to download the full TIF locally and then operate on it. |
I think the answer is sometimes... For rasterio, I don't know. For netcdf it would force But apart from that, if |
Yeah, so there are three paths, and I don't really know which driver supports which:
These could be options, but it seems like intake-xarray should know which ones are possible in each case and have sensible defaults. |
I can't claim to understand all the consequences, but this seems like a safe bet for intake-xarray/intake_xarray/raster.py Line 77 in a654542
to be simply |
Yes... except that that would break the case that someone does want caching. How would you know the difference? You could see whether fsspec would resolve the URL into a local-allowed instance or perhaps more simply, special case http(s) for passthrough and still call fsspec otherwise. (self.urlpath will never be a file-like, since you wouldn't be able to put that in a catalog, so we still have ambiguity between creating a file-like, or downloading and passing a local path) |
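The "special case http(s) for passthrough" idea could be sketched roughly as below. This is only an illustration: `is_plain_http` and `choose_route` are hypothetical helper names, not part of intake-xarray or fsspec, and the check assumes that any URL containing `::` is an fsspec chained URL that should still go through fsspec.

```python
def is_plain_http(urlpath: str) -> bool:
    """True for a bare http(s) URL without an fsspec chaining prefix."""
    # A "::" marks a chained URL such as "simplecache::https://...",
    # which the catalog author presumably wants fsspec to handle.
    if "::" in urlpath:
        return False
    return urlpath.lower().startswith(("http://", "https://"))


def choose_route(urlpath: str) -> str:
    """Hypothetical dispatcher: hand plain http(s) URLs straight to the backend."""
    return "passthrough" if is_plain_http(urlpath) else "fsspec"
```

With this rule a bare `https://` URL would be handed to rasterio unchanged, while `simplecache::https://...` or `s3://...` would still be resolved by fsspec.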
right, i'm also confused as to what happens when multiple libraries in the dependency chain do their own caching (fsspec and rasterio/GDAL under the hood). This open PR on xarray is also relevant pydata/xarray#4140, maybe @jhamman has some thoughts on the best way to proceed? |
@martindurant if I understood you correctly, caching as it is currently implemented only works if the urlpath is prefixed by a cache specification (e.g. "simplecache::"). Doesn't this qualify as a way of distinguishing if someone wants the thing to be cached or not? - and could this be used to decide if
On the other hand, is it really useful to do whole-file caching? I could imagine many cases where the user only wants to access parts of the original dataset. In these cases, caching can introduce a lot more traffic on the network than actually needed. E.g. the netCDF4 library has a mechanism for generating HTTP range requests, and I could imagine that rasterio has something similar as well. I see that fsspec file-like objects are supposed to do something similar, but maybe the backend libraries can do that in more clever ways, and sometimes it might be difficult to convince these backend libraries to use file-like objects.
Probably out of scope of this issue, but is there a good way of supporting python file-like objects in cases where the file is actually opened by an underlying C library? |
I'll answer this part quickly:
No. C libraries could choose to accept python file-like objects only by linking against the python runtime (I think h5py can do this), so it's far from typical. You could use FUSE on posix, but that's a whole other bag of worms. |
Yes; except that other implementations could well appear. The class attribute (fsspec/filesystem_spec#438 for another thought on caching, perhaps not too useful in this discussion) |
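To make the prefix-based test concrete, here is a minimal sketch that inspects the chained-URL prefixes by name. The function `requests_cache` and the set `KNOWN_CACHE_PROTOCOLS` are hypothetical and only illustrative; as noted above, a hard-coded list like this would miss any caching implementation that appears later, which is the weakness of matching on names.

```python
# Illustrative, hard-coded set of caching protocol names (an assumption;
# it would silently miss caching filesystems added in the future).
KNOWN_CACHE_PROTOCOLS = {"simplecache", "filecache", "blockcache"}


def requests_cache(urlpath: str) -> bool:
    """True if any chained prefix of the URL names a known caching protocol."""
    # "a::b::target" -> prefixes ["a", "b"]; the last element is the target URL.
    *prefixes, _target = urlpath.split("::")
    # A prefix may carry its own path ("zip://x.tif"), so keep only the protocol.
    return any(p.split("://", 1)[0] in KNOWN_CACHE_PROTOCOLS for p in prefixes)
```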
Hmm, I think I am slowly getting closer to how Would it be an option to ask fsspec if it can provide a local name and just pass on the URL to the backend driver if it can't? In this case, caching maybe would be enabled if it doesn't hurt and in other cases, the catalog author could opt-in by prefixing simplecache? One such case would be that the server is serving very small files such that subsetting them isn't useful in almost every case. |
This does miss the file-object case, though, which might be important for reading only part of a remote file, from a resource that the loader doesn't know how to ask for ranges on (i.e., anything other than HTTP). To tell whether you can expect to get a local file:
but this does instantiate the filesystem instance. (perhaps |
Is it a bad thing to instantiate the filesystem instance? Maybe the procedure should then be (for the
Currently, I think the answer to 1 is no for the current implementations of the In future, 1 may be the toughest one to decide though. It is probable that the backend can do more clever requests but is not able to do caching, which then might be less clever. Probably 3 and 4 could be migrated into one as the backend will fail anyways if it is not able to handle the input. |
It may cause network requests, to establish credentials for instance
would not instantiate, only import - so that would be OK (but import would fail, if there are any URLs that a loader can handle which fsspec cannot, although there may be none of those). |
@martindurant If I understand correctly, the change you referenced is a partial solution and we still need some changes either in intake-xarray or at the catalog level in
Is the following reasonable for raster.py? It would add the requirement
# Decide from the first URL whether fsspec can hand back a local path
if isinstance(self.urlpath, list):
    self._can_be_local = fsspec.utils.can_be_local(self.urlpath[0])
else:
    self._can_be_local = fsspec.utils.can_be_local(self.urlpath)
if self._can_be_local:
    files = fsspec.open_local(self.urlpath, **self.storage_options)
else:
    # Otherwise pass the URL(s) through to the backend unchanged
    files = self.urlpath
If so I can follow up with a PR, perhaps adding a test like:
import intake
import xarray
url = 'https://landsat-pds.s3.amazonaws.com/c1/L8/152/038/LC08_L1TP_152038_20200611_20200611_01_RT/LC08_L1TP_152038_20200611_20200611_01_RT_B1.TIF'
source = intake.open_rasterio(url, chunks={})
da = source.to_dask()
assert isinstance(da, xarray.core.dataarray.DataArray) |
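The same decision can be written as a small standalone function, which makes it easier to test without the network. This is a sketch under stated assumptions: `resolve_files` is a hypothetical name, and the two callables are injected so that stubs can be used in tests; in real use one would presumably pass `fsspec.utils.can_be_local` and `fsspec.open_local`, the functions named in the proposal above.

```python
from typing import Any, Callable, List, Union

UrlPath = Union[str, List[str]]


def resolve_files(
    urlpath: UrlPath,
    can_be_local: Callable[[str], bool],
    open_local: Callable[..., Any],
    **storage_options: Any,
) -> Any:
    """Return local path(s) when available, otherwise the URL(s) unchanged."""
    # For a list of URLs, the first entry decides for all of them
    # (the same simplification as in the proposal above).
    first = urlpath[0] if isinstance(urlpath, list) else urlpath
    if can_be_local(first):
        return open_local(urlpath, **storage_options)
    # No local file possible: let the backend driver open the URL itself.
    return urlpath


# Stub callables standing in for the fsspec functions (illustration only).
def fake_can_be_local(url: str) -> bool:
    return url.startswith("simplecache::")


calls = []


def fake_open_local(urlpath: UrlPath, **storage_options: Any) -> str:
    calls.append(storage_options)
    return "LOCAL"
```

Note that this still does not cover the file-like object route, only the choice between a local path and URL passthrough.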
Yes, but only if the case where we pass a file-like object is not useful. Note that it would be much better to have a test that didn't need the network, which would mean setting up a server process in the CI. |
but yes, let's set that PR up. The truth will be in the test cases! For a test server, you could use https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/tests/test_http.py#L72 , or its fixture (a few lines below), which you can just import if fsspec is a dep anyway. |
I think it is a good idea to add the same change also to |
Yes, please. We should have tests for all the cases... So we can use the local server for HTTP, and perhaps the memory: filesystem for file-like. Then we can find out for ourselves what the loaders can handle. |
Referencing a failing test in intake-stac with intake and intake-xarray dependencies from master: intake/intake-stac#61
Note that the test passes with intake=0.6.0 and intake-xarray=0.3.1
Comment from @martindurant :
Not sure where best to remedy this...