For context, I'm at the Pangeo hack day following AGU w/ the DataTree group, and while getting started I noticed that open_dataset is a bit quiet about not fully reading in the file metadata for HDF5 files.
open_datatree now does this nicely, or you can pass the group keyword to open_dataset, but it could be nice to push users in that direction and let them know that groups aren't being read by default.
Not sure on implementation and/or if this is necessarily desirable in all cases, but just a thought from the perspective of someone new to DataTree.
reproducible example:
import geocat.datafiles as gdf
import xarray as xr
dt = xr.open_datatree(gdf.get('hdf_files/3B-MO.MS.MRG.3IMERG.20140701-S000000-E235959.07.V03D.HDF5'))
dt
ds = xr.open_dataset(gdf.get('hdf_files/3B-MO.MS.MRG.3IMERG.20140701-S000000-E235959.07.V03D.HDF5'))
ds
open_dataset only reads in the attribute info and doesn't let me know that there is more there.
open_datatree successfully reads the groups, data variables, etc. as expected.
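For anyone landing here with the same problem, here is a minimal sketch of how to see what each group holds. It assumes an xarray version that provides open_groups (mentioned further down the thread), which returns a dict of group path to Dataset. The helper name summarize_groups and the variable name "precipitation" are made up for illustration; the "Grid" group name is taken from the error message quoted below. The stand-in objects are only there so the snippet runs without the data file.

```python
from types import SimpleNamespace


def summarize_groups(groups):
    """Map each group path to the sorted names of its data variables.

    `groups` is any mapping of group path -> object exposing `.data_vars`,
    for example the dict returned by `xr.open_groups(path)`.
    """
    return {path: sorted(ds.data_vars) for path, ds in groups.items()}


# Stand-in for the output of xr.open_groups(...); "precipitation" is a
# hypothetical variable name, not read from the actual file.
fake_groups = {
    "/": SimpleNamespace(data_vars={}),
    "/Grid": SimpleNamespace(data_vars={"precipitation": None}),
}
summary = summarize_groups(fake_groups)
print(summary)

# With a real file (assumes xarray plus an HDF5-capable backend is installed):
# import xarray as xr
# groups = xr.open_groups(path)              # dict of group path -> Dataset
# print(summarize_groups(groups))
# ds = xr.open_dataset(path, group="Grid")   # read one group explicitly
```

Once the group names are known, passing group= to open_dataset reads just that group, while open_datatree keeps the whole hierarchy.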
So interestingly, kerchunk (here through VirtualiZarr's kerchunk translator, going by the traceback) fails and gives you a ValueError when opening this file without a group specified. I'm not sure if this makes sense for xarray or how easy it would be, but something like this might be helpful.
../lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:55, in extract_group(vds_refs, group)
     53 else:
     54     if group is None:
---> 55         raise ValueError(
     56             f"Multiple HDF Groups found. Must specify group= keyword to select one of {hdf_groups}"
     57         )
     58     else:
     59         # Ensure supplied group kwarg is consistent with kerchunk keys
     60         if not group.endswith("/"):

ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of ['', 'Grid/']
> open_datatree now does this nicely or you can add in a groups keyword argument, but it could be nice to push users in that direction and let them know the groups aren't being read by default.
Alternatively the new open_groups function also shows all the groups, but is less restrictive than open_datatree...
> So interestingly kerchunk fails and gives you a ValueError when opening this file without a group specified. I'm not sure if this makes sense or how easy it would be, but something like this might be helpful.
If we raised an error in open_dataset whenever a file has more than one group, we would immediately break a very large amount of existing user code. If we raised a warning instead, a lot of users would start seeing it, and the only way to turn it off would be to pass open_dataset(group=''), which is quite annoying.
I think whilst we might have done that if we were making xarray from scratch today, we don't really want to do that now. Instead we should just treat this as a docs issue, and add notes suggesting that if you don't know if your data has groups you should probably use open_datatree/open_groups.
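Until the docs spell this out, a user-side check is one way to get the nudge without changing open_dataset itself. The sketch below is not xarray API: non_root_groups and warn_if_groups_ignored are hypothetical helpers that take a list of group paths (for example the keys of the dict from open_groups) and warn if anything besides the root group is present.

```python
import warnings


def non_root_groups(group_paths):
    """Return the group paths that are not the root group ('' or '/')."""
    return [p for p in group_paths if p.strip("/")]


def warn_if_groups_ignored(group_paths):
    """Hypothetical helper: warn when a file has groups beyond the root.

    This is a user-side sketch, not something xarray does by default.
    """
    extra = non_root_groups(group_paths)
    if extra:
        warnings.warn(
            f"File contains groups {extra} that open_dataset does not read "
            "by default; consider open_datatree, open_groups, or group=.",
            UserWarning,
            stacklevel=2,
        )
    return extra


# Demo with the group layout from the error message in this thread,
# capturing the warning instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    extra = warn_if_groups_ignored(["/", "/Grid"])

print(extra, [str(w.message) for w in caught])

# With a real file (assumes xarray with open_groups is installed):
# import xarray as xr
# warn_if_groups_ignored(xr.open_groups(path).keys())
```

Keeping the check outside open_dataset means existing code keeps its current behaviour and only people who opt in see the warning.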
I can try to take a look at the docs for this and submit a PR sometime in the next week or so, but if someone else is excited about it don't wait on me.