Add ability to load data plugins #318

edwardchalstrey1 · 2022-09-15T14:11:57Z

Changes on this scivision branch enable us to add "data plugins" which is the name I'm giving for Python packages that are set up specifically as middleware code between scivision and a data resource that can be loaded via an existing Python API.

I have set up an example data plugin, which is a python package that loads Sentinel-2 Cloud-Optimized GeoTIFF images via the odc-stac package: https://github.com/alan-turing-institute/scivision_sentinel2_stac

To test

Have a play with the scivision_sentinel2_stac data plugin:

check out this branch in the scivision repo and run pip install -e . to get the changes
look at the README on https://github.com/alan-turing-institute/scivision_sentinel2_stac and follow the instructions to open the notebook
You can see the args that are available here - Let me know your thoughts and you could either comment here or open some issues in that repo.

I'm thinking I should document the API for this plugin in the plugin repo (not in scivision), i.e. what the arguments for the get_images func are and how to use via scivision load_dataset in the repo plugin itself e.g. in the README and/or via the example notebook. Perhaps the convention for data plugins should be to have their own documentation as a pre-requisite for inclusion in scivision. Otherwise it's not clear how a user would know what the arguments are when they load the data plugin (in the notebook where I do data.load_data).

Reviewers

One thing that's not super nice about this that load_dataset can now return different object types depending on the input. The advantage of this is that we can be flexible with the returned data type of any data plugins, which will be useful as different datasources will inevitably come in different formats, and we can still use a single function. What do you think about this?

TODO

Add scivision_sentinel2_stac as a datasource in the scivision catalog
- make sure catalog entry is accurate
Add test that shows data can be loaded via data plugin
Document the API for this plugin in the plugin repo (not in scivision)
Update docs to explain how to add and use a data plugin
- Add a section about how the data plugin config differs here
- Update the API page

edwardchalstrey1 · 2022-09-20T15:50:26Z

linked to #232 but doesn't really close it

scivision/io/reader.py

Co-authored-by: Alejandro © <[email protected]>

ots22 · 2022-10-06T13:49:28Z

load_dataset can now return different object types depending on the input

I think the main issue with this is that Datasource duplicates some functionality we're currently depending on Intake for, which might make things a bit more complicated for users. Someone browsing through a few scivision data sources now has to understand both Intake catalogs and Datasources, and that either might get returned, and how to handle each case. The documentation also ends up needing to explain both ways. Someone releasing a new data source has to make a non-obvious choice. I think this is basically your point.

At the same time, a way to use stac seems genuinely useful, so think we should find a way to merge and think about the design we want, even if that means living with some duplication for a while. If we can't get Intake to work the way we want, it might be cleaner just to adopt an interface like yours for everything.

I think the main advantage of keeping Intake as the way Scivision handles data sources is mostly that there is a lot we can build on already, and scivision might need to reinvent a lot without it. I wonder if it's possible to depend on Intake/fsspec for basic functionality but not use Intake catalogs? There is likely to be a way to interoperate with Intake if we want that.

One question about the stac functionality though, is: Could this have been a new Intake driver in fact? Are there any particular difficulties with that, anything that is awkward or non-obvious? Answers could make good arguments for doing something different to using Intake (at all, or at least as currently done) so would be useful to consider this.

Comparing this approach with Intake: stack.ipynb from this PR has

data = load_dataset('https://github.com/alan-turing-institute/scivision_sentinel2_stac')

data.load_data(resolution=20, bands=("red", "green", "blue")) # an xarray.Dataset

compared to Intake:

cat = intake.open_catalog("path/to/data.yml")

cat.datasource_name(resolution=20, bands=("red", "green", "blue")).read() # an xarray.Dataset

## The first method call make another catalog, with the extra options, the second chained call returns the array
## Need to know or figure out the correct 'datasource_name'

The main difference is that 'load_dataset' can install a python package to obtain the data, while Intake may need a particular driver to be installed. Sometimes the built-in drivers haven't been enough - empiarreader and vne are examples. For both of these, the intake drivers can't be installed automatically, and there isn't a nice way of handling the failure if they aren't (although Intake with Conda can install dependencies I think https://intake.readthedocs.io/en/latest/glossary.html#term-Driver).

ots22 · 2022-10-06T13:54:15Z

The advantage of this is that we can be flexible with the returned data type of any data plugins, which will be useful as different datasources will inevitably come in different formats

FWIW, there's nothing stopping an Intake driver returning data of any type (although certain types seem to be preferred by the Intake project)

In general I think a flexible return type is only an advantage for contributors and catalog curators - it can be a bit of a nightmare for consumers of the data, or to use with any code intended to be reusable across datasets! Solving this is probably a bit out of scope for scivision though.

ots22 · 2022-10-06T14:01:18Z

I'm thinking I should document the API for this plugin in the plugin repo (not in scivision), i.e. what the arguments for the get_images func are and how to use via scivision load_dataset in the repo plugin itself e.g. in the README and/or via the example notebook. Perhaps the convention for data plugins should be to have their own documentation as a pre-requisite for inclusion in scivision.

This sounds sensible. Is the docstring of the relevant get_images method enough?

edwardchalstrey1 · 2022-10-06T15:15:51Z

The advantage of this is that we can be flexible with the returned data type of any data plugins, which will be useful as different datasources will inevitably come in different formats

FWIW, there's nothing stopping an Intake driver returning data of any type (although certain types seem to be preferred by the Intake project)

In general I think a flexible return type is only an advantage for contributors and catalog curators - it can be a bit of a nightmare for consumers of the data, or to use with any code intended to be reusable across datasets! Solving this is probably a bit out of scope for scivision though.

One thing I could look into would be to convert the plugin to an intake driver (using https://github.com/alan-turing-institute/intake-alphabetsoup as inspiration), which might work better than the existing intake-stac which didn't do everything we wanted. That would at least mean we have a common config format, if not a common output format.

I think currently scivision doesn't have a common output format anyway, it just happens that the examples we have are all xarray.Dataarray

ots22 · 2022-10-06T15:30:50Z

One thing I could look into would be to convert the plugin to an intake driver (using https://github.com/alan-turing-institute/intake-alphabetsoup as inspiration)

Sounds good - could be a good opportunity to decide if intake is still the best thing to use, especially if this turns out to be difficult.

edwardchalstrey1 · 2022-10-06T15:42:39Z

was going to try and merge into a dev branch instead of main but not sure how

ots22 · 2022-10-06T15:53:40Z

Would one of the following work?

just merge into main
make an ad-hoc integration branch if there are few things you want to combine
keep this PR open, and merge main into it occasionally if not ready

edwardchalstrey1 · 2022-10-12T08:40:02Z

Would one of the following work?

just merge into main

make an ad-hoc integration branch if there are few things you want to combine

keep this PR open, and merge main into it occasionally if not ready

On second thought I think I will merge this - opened new issue #345

edwardchalstrey1 added 18 commits August 12, 2022 16:10

prototype load_stac_dataset func

82e76b4

make load_stac_dataset callable

96bf14b

import xarray for annotation

093c344

fix args

dc279b6

first attempt at datasource class

8b42de0

have a function that can import the data module

be8ca15

call the predict fn here

f0517ec

make the return object of load_stac_dataset an xarray.Dataset

80121d2

use the name of the ['prediction_fn']['call'] from config

a693b54

change expected config attributes

c658c9d

load_stac_dataset returns Datasource object

e056d43

use plumbing for data plugins

af983aa

don't expect a kwarg from config e.g. image

845e596

DataPlumber works

fc6964f

remove redundant code and comments

6b6961d

rename func to be not stac specific

7e4c8b4

make load_data_from_plugin a sub function of load_dataset

6b4ee5a

tidy docstring

fbea284

edwardchalstrey1 requested review from ots22 and acocac September 15, 2022 14:11

edwardchalstrey1 added 2 commits September 15, 2022 15:21

Merge branch 'main' into stac-load

de11632

flake8

4c17076

edwardchalstrey1 mentioned this pull request Sep 15, 2022

[ENH] add stac reader using the intake-stac driver #278

Closed

1 task

acocac added the enhancement New feature or request label Sep 23, 2022

acocac reviewed Sep 23, 2022

View reviewed changes

scivision/io/reader.py Outdated Show resolved Hide resolved

acocac reviewed Sep 23, 2022

View reviewed changes

scivision/io/reader.py Outdated Show resolved Hide resolved

edwardchalstrey1 self-assigned this Sep 26, 2022

edwardchalstrey1 and others added 2 commits September 26, 2022 09:38

edwardchalstrey1 added 5 commits September 26, 2022 09:39

Merge branch 'main' into stac-load

fb9222a

add data source to catalog

443177c

rename datasource

7469ef5

add test for stac plugin

f44994c

Merge branch 'main' into stac-load

edcc375

Merge branch 'main' into stac-load

c795fb9

edwardchalstrey1 marked this pull request as draft October 6, 2022 15:42

edwardchalstrey1 mentioned this pull request Oct 12, 2022

[ENH] Data plugins continued #345

Closed

edwardchalstrey1 marked this pull request as ready for review October 12, 2022 08:39

Merge branch 'main' into stac-load

aa2ac68

edwardchalstrey1 merged commit 1d99794 into main Oct 12, 2022

edwardchalstrey1 deleted the stac-load branch October 12, 2022 08:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability to load data plugins #318

Add ability to load data plugins #318

edwardchalstrey1 commented Sep 15, 2022 •

edited

Loading

edwardchalstrey1 commented Sep 20, 2022

ots22 commented Oct 6, 2022 •

edited

Loading

ots22 commented Oct 6, 2022 •

edited

Loading

ots22 commented Oct 6, 2022

edwardchalstrey1 commented Oct 6, 2022

ots22 commented Oct 6, 2022

edwardchalstrey1 commented Oct 6, 2022

ots22 commented Oct 6, 2022

edwardchalstrey1 commented Oct 12, 2022

Add ability to load data plugins #318

Add ability to load data plugins #318

Conversation

edwardchalstrey1 commented Sep 15, 2022 • edited Loading

To test

Reviewers

TODO

edwardchalstrey1 commented Sep 20, 2022

ots22 commented Oct 6, 2022 • edited Loading

ots22 commented Oct 6, 2022 • edited Loading

ots22 commented Oct 6, 2022

edwardchalstrey1 commented Oct 6, 2022

ots22 commented Oct 6, 2022

edwardchalstrey1 commented Oct 6, 2022

ots22 commented Oct 6, 2022

edwardchalstrey1 commented Oct 12, 2022

edwardchalstrey1 commented Sep 15, 2022 •

edited

Loading

ots22 commented Oct 6, 2022 •

edited

Loading

ots22 commented Oct 6, 2022 •

edited

Loading