Should open_dataset recommend using open_datatree for HDF5 files? #9891

Open
kafitzgerald opened this issue Dec 14, 2024 · 3 comments
Labels: API design, topic-DataTree, topic-documentation

Comments

@kafitzgerald
Contributor

What is your issue?

For context, I'm at the Pangeo hack day following AGU with the DataTree group, and while getting started I noticed that open_dataset is a bit quiet about not fully reading in the file metadata for HDF5 files.

open_datatree now does this nicely, or you can pass the group keyword to open_dataset, but it would be nice to push users in that direction and let them know that groups aren't being read by default.

Not sure on implementation and/or if this is necessarily desirable in all cases, but just a thought from the perspective of someone new to DataTree.

reproducible example:

import geocat.datafiles as gdf
import xarray as xr

# Fetch the sample IMERG file once and reuse the local path.
path = gdf.get('hdf_files/3B-MO.MS.MRG.3IMERG.20140701-S000000-E235959.07.V03D.HDF5')

# open_datatree reads every group in the file.
dt = xr.open_datatree(path)
dt

# open_dataset reads only the root group unless group= is given.
ds = xr.open_dataset(path)
ds

open_dataset only reads in the root-level attributes and gives no indication that the file contains more.

open_datatree successfully reads the groups, data variables, etc. as expected.
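To make the idea concrete, here is a minimal user-side sketch of the kind of hint described above. It is not part of xarray: `nested_group_names` and `open_root_with_hint` are hypothetical names, and the sketch assumes an xarray version recent enough to provide `xr.open_groups`, which returns a mapping of group path to Dataset.

```python
import warnings

import xarray as xr


def nested_group_names(groups):
    """Return the names of all non-root groups in a {group path: Dataset} mapping."""
    # The root group is assumed to be keyed as "/" (or ""); everything else is nested.
    return sorted(name for name in groups if name.strip("/"))


def open_root_with_hint(path, **kwargs):
    """Hypothetical helper: open the root group, warning if other groups exist."""
    groups = xr.open_groups(path, **kwargs)
    nested = nested_group_names(groups)
    if nested:
        warnings.warn(
            f"File contains additional groups {nested} that are not read here; "
            "consider xr.open_datatree or xr.open_groups.",
            UserWarning,
            stacklevel=2,
        )
    # Opening the file a second time keeps the return value identical to
    # what a plain open_dataset call would give.
    return xr.open_dataset(path, **kwargs)
```

Usage would be `ds = open_root_with_hint(path)` in place of `xr.open_dataset(path)`. The file is opened twice, which is wasteful for large or remote files, so this is only meant for interactive exploration.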

@kafitzgerald kafitzgerald added the needs triage Issue that has not been reviewed by xarray team member label Dec 14, 2024
@kafitzgerald
Contributor Author

So interestingly kerchunk fails and gives you a ValueError when opening this file without a group specified. I'm not sure if this makes sense or how easy it would be, but something like this might be helpful.

../lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:55, in extract_group(vds_refs, group)
     53 else:
     54     if group is None:
---> 55         raise ValueError(
     56             f"Multiple HDF Groups found. Must specify group= keyword to select one of {hdf_groups}"
     57         )
     58     else:
     59         # Ensure supplied group kwarg is consistent with kerchunk keys
     60         if not group.endswith("/"):

ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of ['', 'Grid/']

@TomNicholas TomNicholas added API design topic-documentation topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Dec 15, 2024
@TomNicholas
Member

> open_datatree now does this nicely or you can add in a groups keyword argument, but it could be nice to push users in that direction and let them know the groups aren't being read by default.

Alternatively the new open_groups function also shows all the groups, but is less restrictive than open_datatree...

> So interestingly kerchunk fails and gives you a ValueError when opening this file without a group specified. I'm not sure if this makes sense or how easy it would be, but something like this might be helpful.

If we raised an error in open_dataset when there is more than one group, we would immediately break a very large amount of existing user code. If we raised a warning instead, we would cause warnings for a lot of users, who could only turn them off by passing group='' to open_dataset, which is quite annoying.

I think whilst we might have done that if we were making xarray from scratch today, we don't really want to do that now. Instead we should just treat this as a docs issue, and add notes suggesting that if you don't know if your data has groups you should probably use open_datatree/open_groups.

@kafitzgerald
Contributor Author

Thanks! That makes a lot of sense.

I can try to take a look at the docs for this and submit a PR sometime in the next week or so, but if someone else is excited about it don't wait on me.

2 participants