Add ability to load data plugins #318
Linked to #232 but doesn't really close it.
Co-authored-by: Alejandro © <[email protected]>
I think the main issue with this is that Datasource duplicates some functionality we're currently depending on Intake for, which might make things a bit more complicated for users. Someone browsing through a few scivision data sources now has to understand both Intake catalogs and Datasources, that either might get returned, and how to handle each case. The documentation also ends up needing to explain both ways. Someone releasing a new data source has to make a non-obvious choice. I think this is basically your point.

At the same time, a way to use STAC seems genuinely useful, so I think we should find a way to merge and think about the design we want, even if that means living with some duplication for a while. If we can't get Intake to work the way we want, it might be cleaner just to adopt an interface like yours for everything.

I think the main advantage of keeping Intake as the way scivision handles data sources is that there is a lot we can build on already, and scivision might need to reinvent a lot without it. I wonder if it's possible to depend on Intake/fsspec for basic functionality but not use Intake catalogs? There is likely to be a way to interoperate with Intake if we want that.

One question about the STAC functionality, though: could this in fact have been a new Intake driver? Are there any particular difficulties with that, anything that is awkward or non-obvious? The answers could make good arguments for doing something different from using Intake (at all, or at least as currently done), so they would be useful to consider.

Comparing this approach with Intake: stack.ipynb from this PR has
compared to Intake:
The main difference is that 'load_dataset' can install a Python package to obtain the data, while Intake may need a particular driver to be installed. Sometimes the built-in drivers haven't been enough: empiarreader and vne are examples. For both of these, the Intake drivers can't be installed automatically, and there isn't a nice way of handling the failure if they aren't installed (although I think Intake with Conda can install dependencies: https://intake.readthedocs.io/en/latest/glossary.html#term-Driver).
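To make the "no nice way of handling the failure" point concrete, here is a minimal sketch of how a loader could fail with an actionable message when a plugin or driver package is missing. The helper name `load_plugin_module` is hypothetical; it is not part of scivision or Intake, and it assumes the catalog entry records an importable top-level package name.

```python
import importlib
import importlib.util


def load_plugin_module(package_name: str):
    """Import a data plugin package, or explain how to install it.

    Hypothetical helper for illustration only. Assumes `package_name`
    is a top-level importable name (no dots).
    """
    # find_spec returns None when a top-level package is not installed
    if importlib.util.find_spec(package_name) is None:
        raise ImportError(
            f"Data plugin '{package_name}' is not installed. "
            f"Try: pip install {package_name}"
        )
    return importlib.import_module(package_name)
```

Whether to install automatically or only report the missing package is a design choice; the sketch only reports.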
FWIW, there's nothing stopping an Intake driver returning data of any type (although certain types seem to be preferred by the Intake project). In general I think a flexible return type is only an advantage for contributors and catalog curators. It can be a bit of a nightmare for consumers of the data, or for any code intended to be reusable across datasets! Solving this is probably a bit out of scope for scivision though.
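As an illustration of what consumers end up writing when the return type varies, one option is to normalise at the point of use with `functools.singledispatch`. This is only a sketch: the function name `as_records` and the choice of a list of (name, item) pairs as the common shape are assumptions made for the example, not anything scivision or Intake provides.

```python
from functools import singledispatch


@singledispatch
def as_records(data):
    """Normalise a loader's return value into a list of (name, item) pairs."""
    # Fallback for types nobody has registered a handler for
    raise TypeError(f"Don't know how to handle {type(data).__name__}")


@as_records.register
def _(data: dict):
    # Mapping-like results already carry names
    return list(data.items())


@as_records.register(list)
@as_records.register(tuple)
def _(data):
    # Sequence results get positional names
    return [(str(i), item) for i, item in enumerate(data)]
```

Each new return type needs its own registered handler, which is exactly the maintenance cost being described.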
This sounds sensible. Is the docstring of the relevant get_images method enough?
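For reference, a docstring is at least discoverable programmatically. The sketch below uses the standard `inspect` module to show a loader's signature and docstring. The `get_images` here is a stand-in with invented parameters (`bbox`, `start_date`, `end_date`), not the real signature from the plugin.

```python
import inspect


def describe_loader(func) -> str:
    """Return a function's signature and docstring as one string."""
    sig = inspect.signature(func)
    doc = inspect.getdoc(func) or "(no docstring)"
    return f"{func.__name__}{sig}\n{doc}"


# Stand-in for a plugin's get_images; these parameters are hypothetical.
def get_images(bbox, start_date, end_date):
    """Return images within bbox between the two dates."""
    return []


print(describe_loader(get_images))
```

This only helps users who know which function to inspect, so a README or notebook would still be useful alongside it.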
One thing I could look into would be to convert the plugin to an intake driver (using https://github.com/alan-turing-institute/intake-alphabetsoup as inspiration), which might work better than the existing I think currently scivision doesn't have a common output format anyway, it just happens that the examples we have are all
Sounds good - could be a good opportunity to decide if intake is still the best thing to use, especially if this turns out to be difficult.
Was going to try to merge into a dev branch instead of main, but not sure how.
Would one of the following work?
On second thought I think I will merge this - opened new issue #345 |
Changes on this scivision branch enable us to add "data plugins", the name I'm giving to Python packages that are set up specifically as middleware between scivision and a data resource that can be loaded via an existing Python API.
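To illustrate the middleware idea, here is a rough sketch of a wrapper that defers to a plugin module's `get_images` function. It assumes the plugin exposes a module-level `get_images` accepting keyword arguments. The class name `PluginDataset` and the fake module are hypothetical stand-ins, not the code in this PR.

```python
import types

# Fake plugin module standing in for an installed data plugin package
plugin = types.ModuleType("example_data_plugin")


def _get_images(**kwargs):
    # A real plugin would call its underlying data API here
    return {"requested": kwargs}


plugin.get_images = _get_images


class PluginDataset:
    """Thin wrapper that forwards load requests to a plugin module."""

    def __init__(self, module):
        self._module = module

    def load_data(self, **kwargs):
        # Arguments pass straight through to the plugin's get_images
        return self._module.get_images(**kwargs)


data = PluginDataset(plugin)
print(data.load_data(tile="example"))
```

Because arguments pass straight through, the wrapper itself cannot tell users what the plugin expects.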
I have set up an example data plugin, a Python package that loads Sentinel-2 Cloud-Optimized GeoTIFF images via the odc-stac package: https://github.com/alan-turing-institute/scivision_sentinel2_stac

To test

Have a play with the scivision_sentinel2_stac data plugin:
pip install -e . to get the changes

I'm thinking I should document the API for this plugin in the plugin repo itself (not in scivision), i.e. what the arguments for the get_images func are and how to use it via scivision load_dataset, e.g. in the README and/or via the example notebook. Perhaps the convention for data plugins should be to have their own documentation as a prerequisite for inclusion in scivision. Otherwise it's not clear how a user would know what the arguments are when they load the data plugin (in the notebook where I do data.load_data).

Reviewers
One thing that's not super nice about this is that load_dataset can now return different object types depending on the input. The advantage of this is that we can be flexible with the returned data type of any data plugins, which will be useful as different datasources will inevitably come in different formats, and we can still use a single function. What do you think about this?

TODO

scivision_sentinel2_stac as a datasource in the scivision catalog