Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to load data plugins #318

Merged
merged 29 commits into from
Oct 12, 2022
Merged

Add ability to load data plugins #318

merged 29 commits into from
Oct 12, 2022

Conversation

edwardchalstrey1
Copy link
Contributor

@edwardchalstrey1 edwardchalstrey1 commented Sep 15, 2022

Changes on this scivision branch enable us to add "data plugins" which is the name I'm giving for Python packages that are set up specifically as middleware code between scivision and a data resource that can be loaded via an existing Python API.

I have set up an example data plugin, which is a python package that loads Sentinel-2 Cloud-Optimized GeoTIFF images via the odc-stac package: https://github.com/alan-turing-institute/scivision_sentinel2_stac

To test

Have a play with the scivision_sentinel2_stac data plugin:

  1. check out this branch in the scivision repo and run pip install -e . to get the changes
  2. look at the README on https://github.com/alan-turing-institute/scivision_sentinel2_stac and follow the instructions to open the notebook
  3. You can see the args that are available here - Let me know your thoughts and you could either comment here or open some issues in that repo.

I'm thinking I should document the API for this plugin in the plugin repo (not in scivision), i.e. what the arguments for the get_images func are and how to use via scivision load_dataset in the repo plugin itself e.g. in the README and/or via the example notebook. Perhaps the convention for data plugins should be to have their own documentation as a pre-requisite for inclusion in scivision. Otherwise it's not clear how a user would know what the arguments are when they load the data plugin (in the notebook where I do data.load_data).

Reviewers

One thing that's not super nice about this that load_dataset can now return different object types depending on the input. The advantage of this is that we can be flexible with the returned data type of any data plugins, which will be useful as different datasources will inevitably come in different formats, and we can still use a single function. What do you think about this?

TODO

  • Add scivision_sentinel2_stac as a datasource in the scivision catalog
    • make sure catalog entry is accurate
  • Add test that shows data can be loaded via data plugin
  • Document the API for this plugin in the plugin repo (not in scivision)
  • Update docs to explain how to add and use a data plugin
    • Add a section about how the data plugin config differs here
    • Update the API page

@edwardchalstrey1
Copy link
Contributor Author

linked to #232 but doesn't really close it

@acocac acocac added the enhancement New feature or request label Sep 23, 2022
scivision/io/reader.py Outdated Show resolved Hide resolved
scivision/io/reader.py Outdated Show resolved Hide resolved
@edwardchalstrey1 edwardchalstrey1 self-assigned this Sep 26, 2022
edwardchalstrey1 and others added 2 commits September 26, 2022 09:38
Co-authored-by: Alejandro © <[email protected]>
Co-authored-by: Alejandro © <[email protected]>
@ots22
Copy link
Member

ots22 commented Oct 6, 2022

load_dataset can now return different object types depending on the input

I think the main issue with this is that Datasource duplicates some functionality we're currently depending on Intake for, which might make things a bit more complicated for users. Someone browsing through a few scivision data sources now has to understand both Intake catalogs and Datasources, and that either might get returned, and how to handle each case. The documentation also ends up needing to explain both ways. Someone releasing a new data source has to make a non-obvious choice. I think this is basically your point.

At the same time, a way to use stac seems genuinely useful, so think we should find a way to merge and think about the design we want, even if that means living with some duplication for a while. If we can't get Intake to work the way we want, it might be cleaner just to adopt an interface like yours for everything.

I think the main advantage of keeping Intake as the way Scivision handles data sources is mostly that there is a lot we can build on already, and scivision might need to reinvent a lot without it. I wonder if it's possible to depend on Intake/fsspec for basic functionality but not use Intake catalogs? There is likely to be a way to interoperate with Intake if we want that.

One question about the stac functionality though, is: Could this have been a new Intake driver in fact? Are there any particular difficulties with that, anything that is awkward or non-obvious? Answers could make good arguments for doing something different to using Intake (at all, or at least as currently done) so would be useful to consider this.

Comparing this approach with Intake: stack.ipynb from this PR has

data = load_dataset('https://github.com/alan-turing-institute/scivision_sentinel2_stac')

data.load_data(resolution=20, bands=("red", "green", "blue")) # an xarray.Dataset

compared to Intake:

cat = intake.open_catalog("path/to/data.yml")

cat.datasource_name(resolution=20, bands=("red", "green", "blue")).read() # an xarray.Dataset

## The first method call make another catalog, with the extra options, the second chained call returns the array
## Need to know or figure out the correct 'datasource_name'

The main difference is that 'load_dataset' can install a python package to obtain the data, while Intake may need a particular driver to be installed. Sometimes the built-in drivers haven't been enough - empiarreader and vne are examples. For both of these, the intake drivers can't be installed automatically, and there isn't a nice way of handling the failure if they aren't (although Intake with Conda can install dependencies I think https://intake.readthedocs.io/en/latest/glossary.html#term-Driver).

@ots22
Copy link
Member

ots22 commented Oct 6, 2022

The advantage of this is that we can be flexible with the returned data type of any data plugins, which will be useful as different datasources will inevitably come in different formats

FWIW, there's nothing stopping an Intake driver returning data of any type (although certain types seem to be preferred by the Intake project)

In general I think a flexible return type is only an advantage for contributors and catalog curators - it can be a bit of a nightmare for consumers of the data, or to use with any code intended to be reusable across datasets! Solving this is probably a bit out of scope for scivision though.

@ots22
Copy link
Member

ots22 commented Oct 6, 2022

I'm thinking I should document the API for this plugin in the plugin repo (not in scivision), i.e. what the arguments for the get_images func are and how to use via scivision load_dataset in the repo plugin itself e.g. in the README and/or via the example notebook. Perhaps the convention for data plugins should be to have their own documentation as a pre-requisite for inclusion in scivision.

This sounds sensible. Is the docstring of the relevant get_images method enough?

@edwardchalstrey1
Copy link
Contributor Author

The advantage of this is that we can be flexible with the returned data type of any data plugins, which will be useful as different datasources will inevitably come in different formats

FWIW, there's nothing stopping an Intake driver returning data of any type (although certain types seem to be preferred by the Intake project)

In general I think a flexible return type is only an advantage for contributors and catalog curators - it can be a bit of a nightmare for consumers of the data, or to use with any code intended to be reusable across datasets! Solving this is probably a bit out of scope for scivision though.

One thing I could look into would be to convert the plugin to an intake driver (using https://github.com/alan-turing-institute/intake-alphabetsoup as inspiration), which might work better than the existing intake-stac which didn't do everything we wanted. That would at least mean we have a common config format, if not a common output format.

I think currently scivision doesn't have a common output format anyway, it just happens that the examples we have are all xarray.Dataarray

@ots22
Copy link
Member

ots22 commented Oct 6, 2022

One thing I could look into would be to convert the plugin to an intake driver (using https://github.com/alan-turing-institute/intake-alphabetsoup as inspiration)

Sounds good - could be a good opportunity to decide if intake is still the best thing to use, especially if this turns out to be difficult.

@edwardchalstrey1 edwardchalstrey1 marked this pull request as draft October 6, 2022 15:42
@edwardchalstrey1
Copy link
Contributor Author

was going to try and merge into a dev branch instead of main but not sure how

@ots22
Copy link
Member

ots22 commented Oct 6, 2022

Would one of the following work?

  • just merge into main
  • make an ad-hoc integration branch if there are few things you want to combine
  • keep this PR open, and merge main into it occasionally if not ready

@edwardchalstrey1 edwardchalstrey1 marked this pull request as ready for review October 12, 2022 08:39
@edwardchalstrey1
Copy link
Contributor Author

Would one of the following work?

  • just merge into main
  • make an ad-hoc integration branch if there are few things you want to combine
  • keep this PR open, and merge main into it occasionally if not ready

On second thought I think I will merge this - opened new issue #345

@edwardchalstrey1 edwardchalstrey1 merged commit 1d99794 into main Oct 12, 2022
@edwardchalstrey1 edwardchalstrey1 deleted the stac-load branch October 12, 2022 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants