Load collection-level assets into xarray #90
Thanks @TomAugspurger !
Seems reasonable to me!
Makes sense. Maybe work here could resolve #59
Not really, but definitely have a look at #75 and in particular proposed changes for potentially deviating from the intake-xarray dependency to have drivers defined with intake-stac (https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6). For the specific case of Also, just want to note this PR addresses #59 |
|
I pushed an update so this is a bit simpler. I see that My main question now is around what to call this method. It's really doing two things:
I've called this |
|
Having some second thoughts about the API design around selecting an asset, and I wonder if anyone else has thoughts. We can't use Then the question is: do we have a separate method to get an asset, followed by a

    my_asset = my_collection.get_asset(asset_key)  # type: StacAsset
    ds = my_asset.to_dask()  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

or do we put the

    ds = my_asset.to_dask(asset_key)  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

I suppose that the first option, a
Thinking about this more, I think at least one more STAC extension is appropriate to capture this information. These would be an extension of the STAC collection and Item I want to capture everything necessary to go from STAC Asset to xarray Dataset within the STAC catalog itself. Essentially,

    asset = stac_catalog.assets[key]
    store = fsspec.get_mapper(asset.href, **storage_options)
    ds = xr.open_zarr(store, **xarray_open_kwargs)

So there are two pieces of information to capture:
We could have two new extensions:

    "zarr-abfs": {
        "href": "abfs://daymet-zarr/daily/hi.zarr",
        "type": "application/vnd+zarr",
        "title": "Daily Hawaii Daymet Azure Blob File System Zarr root",
        "description": "Azure Blob File System of the daily Hawaii Daymet Zarr Group on Azure Blob Storage for use with adlfs.",
        "roles": [
            "data",
            "zarr",
            "abfs"
        ],
        "xarray:storage_options": {
            "account_name": "daymeteuwest"
        },
        "xarray:open_kwargs": {
            "consolidated": true
        }
    },
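As a rough illustration of how a client could consume such an asset, the sketch below pulls the two proposed fields out of an asset dictionary and hands them to fsspec and xarray. The helper names `split_xarray_fields` and `open_zarr_asset` are hypothetical, the asset is treated as a plain dict rather than a pystac object, and the field names are the ones proposed in this comment, so they may differ in a final extension.

```python
from typing import Any, Dict, Tuple


def split_xarray_fields(asset: Dict[str, Any]) -> Tuple[str, Dict[str, Any], Dict[str, Any]]:
    """Return (href, storage_options, open_kwargs) for an asset dictionary.

    Missing fields default to empty dicts, so assets without the
    proposed extension fields still work.
    """
    href = asset["href"]
    storage_options = dict(asset.get("xarray:storage_options", {}))
    open_kwargs = dict(asset.get("xarray:open_kwargs", {}))
    return href, storage_options, open_kwargs


def open_zarr_asset(asset: Dict[str, Any]):
    """Open a Zarr asset lazily (needs fsspec, xarray, zarr and a backend such as adlfs)."""
    # Imported here so the metadata helper above has no heavy dependencies.
    import fsspec
    import xarray as xr

    href, storage_options, open_kwargs = split_xarray_fields(asset)
    store = fsspec.get_mapper(href, **storage_options)
    return xr.open_zarr(store, **open_kwargs)


# The asset from the comment above, trimmed to the relevant fields.
asset = {
    "href": "abfs://daymet-zarr/daily/hi.zarr",
    "type": "application/vnd+zarr",
    "xarray:storage_options": {"account_name": "daymeteuwest"},
    "xarray:open_kwargs": {"consolidated": True},
}

href, storage_options, open_kwargs = split_xarray_fields(asset)
```

Keeping the two dictionaries separate mirrors the two calls they feed: `storage_options` goes to the filesystem layer, while `open_kwargs` goes to xarray.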
|
See https://github.com/tomAugspurger/xarray-assets for a proposal. I don't really know how valuable that is, but I think it's worth exploring a bit. If that extension is present, then I think intake-stac could use it like https://github.com/tomAugspurger/xarray-assets#python-example to safely go from a STAC Asset -> xarray.Dataset without any arguments from the user. |
    if isinstance(result, DataSource):
        kwargs = result._captured_init_kwargs
        kwargs = {**kwargs, **dict(storage_options=storage_options), **open_kwargs}
        result = result(*result._captured_init_args, **kwargs)
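The merge in the diff above can be illustrated in isolation with plain dictionaries. In this sketch, `merge_init_kwargs` is a hypothetical helper (not part of intake) and the example values are made up; it only demonstrates the precedence that `{**a, **b, **c}` gives, where keys from later dictionaries override earlier ones.

```python
def merge_init_kwargs(captured, storage_options, open_kwargs):
    """Merge kwargs the same way as the diff: later sources win on conflicts."""
    return {**captured, **dict(storage_options=storage_options), **open_kwargs}


# Hypothetical values standing in for result._captured_init_kwargs and
# for the options read from the STAC asset.
captured = {"urlpath": "https://example.com/data.zarr", "chunks": {}, "consolidated": False}
merged = merge_init_kwargs(
    captured,
    storage_options={"account_name": "daymeteuwest"},
    open_kwargs={"consolidated": True},
)
```

Existing keys such as `urlpath` survive the merge, while `consolidated` is overridden by the value from the asset; the original `captured` dictionary is left unchanged.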
@martindurant currently, StacItem.__getitem__ will return a (subclass of) DataSource. Does this seem like the right way to control the parameters passed to that DataSource? If so, are _captured_init_args and _captured_init_kwargs considered "public"?
This looks essentially the same as DataSourceBase.configure_new (aliased with get for compatibility, and __call__), but yes, seems fine to me.
> are _captured_init_args and _captured_init_kwargs considered "public"
They were meant for internal storage and to be able to recreate things after serialisation, possibly to YAML. They are more "automatic" than "private", I think.
> Does this seem like the right way
Unless configure_new already does the right thing.
I do wonder what result can be if not a DataSource.
> Unless configure_new already does the right thing.
Gotcha. I think configure_new doesn't quite work, since we want to merge these keywords with the "existing" ones that are in ._captured_init_kwargs (we had a test relying on that anyway).
I don't see an easy way to add a keyword to configure_new that controls whether or not to merge the new kwargs: since it passes all the keywords through, there's the potential for a conflict.
> I do wonder what result can be if not a DataSource.
In this case, perhaps a StacAsset, but I might be misunderstanding intake-stac's design.
Just noting for posterity, intake-xarray's datasources define .kwargs and .storage_options properties. We can't use those because they apparently aren't implemented by RasterIOSource.
unfortunately i don't really follow this... i've always been a little confused about what should be handled by intake-xarray or whether intake-stac should just be stand-alone and define all the datasources under this repo. I sort of started down that road in https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6 but abandoned it...
|
The latest commit implements the API described in #90 (comment). So now users call

    In [2]: import intake

    In [3]: collection = intake.open_stac_collection("https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi")

    In [4]: source = collection.get_asset("zarr-https")

    In [6]: source.kwargs
    Out[6]: {'consolidated': True}

    In [7]: source.to_dask()
    Out[7]:
    <xarray.Dataset>
    Dimensions:                  (nv: 2, time: 41, x: 284, y: 584)
    Coordinates:
        lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
        lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
      * time                     (time) datetime64[ns] 1980-07-01T12:00:00 ... 20...
      * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
      * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
    Dimensions without coordinates: nv
    Data variables:
        lambert_conformal_conic  int16 ...
        prcp                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        swe                      (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
        tmax                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        tmin                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        vp                       (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    Attributes:
        Conventions:       CF-1.6
        Version_data:      Daymet Data Version 4.0
        Version_software:  Daymet Software Version 4.0
        citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
        references:        Please see http://daymet.ornl.gov/ for current informa...
        source:            Daymet Software Version 4.0
        start_year:        1980
|
Looks like the narrative docs are a bit out of date, but f1dc6ff added a small section on xarray-assets to the docs. @kthyng did you already have STAC items / collections I could test this against? Or were you waiting for intake-stac to be updated before generating those? @scottyhq do you have a chance to take a look at this? |
|
@TomAugspurger You mean a catalog already set up to use |
That should just require adding the extension's URL to the Catalog / Item's
If you're generating STAC metadata for Zarr datasets, https://github.com/TomAugspurger/xstac might be helpful, or you can generate it "by hand". |
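For the first step mentioned above (adding the extension's URL), here is a minimal sketch on a plain dictionary. It assumes the list in question is the standard STAC `stac_extensions` field and that the schema URL below is where the xarray-assets extension is published; check the extension's README for the current URL. The helper name `add_extension` is hypothetical.

```python
# Assumed schema URL for the xarray-assets extension; verify before relying on it.
XARRAY_ASSETS_SCHEMA = "https://stac-extensions.github.io/xarray-assets/v1.0.0/schema.json"


def add_extension(stac_object: dict, schema_url: str) -> dict:
    """Return a copy of a STAC object dict with schema_url listed once in stac_extensions."""
    updated = dict(stac_object)
    extensions = list(updated.get("stac_extensions", []))
    if schema_url not in extensions:
        extensions.append(schema_url)
    updated["stac_extensions"] = extensions
    return updated


item = {"type": "Feature", "id": "example-item", "stac_extensions": []}
item = add_extension(item, XARRAY_ASSETS_SCHEMA)
item = add_extension(item, XARRAY_ASSETS_SCHEMA)  # calling twice does not duplicate the entry
```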
|
    def __getitem__(self, key):
        result = super().__getitem__(key)
        # TODO: handle non-string assets?
i haven't come across this in the wild. are they always strings? here for example I see asset["0"] https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS/collections/NSIDC-0723.v4/items
Apparently it's possible to look up multiple items by passing a tuple to __getitem__. https://github.com/intake/intake/blob/d9faa048bbc09d74bb6972f672155e3814c3ca62/intake/catalog/base.py#L403
I haven't used it either.
scottyhq left a comment
Thanks for pushing this forward @TomAugspurger! I think this will be a great addition, left some comments for some minor suggested changes, then we should merge it!
|
@TomAugspurger Thanks for the help, that was really clear. I am meeting an issue I think due to using For code here:
|
I'm currently working with netcdf files and couldn't tell if I should be using |
|
Thanks @scottyhq, updated to address your comments. @kthyng I'll take a closer look later, but I think you can update

    properties['xarray_kwargs'] = {'drop_variables': 'obs'}
    item.add_asset(
        key=item.id,
        asset=pystac.Asset(**asset)
    )

Hopefully that does the trick. I haven't tried xstac on a NetCDF file yet. I'll give that a shot tonight or tomorrow and will add it as an example!
@TomAugspurger thanks for the suggestion but unfortunately that hasn't worked for me. Specifically, it has to go into Here's what I mean at the point it gets to the intake GUI. The but I think they need to be available under "args" (second image) to be used in |


This is a prototype for loading collection-level assets from a STAC collection. If you want a full example, install the main branch of pystac:

It's not quite ready, but I have a few points of discussion:

zarr-https) in the example above. The STAC spec doesn't give those any meaning really, but they're used in other places (e.g. stackstac.stack(..., assets=[])) so I think we're OK.
xarray.open_dataset if it had that media type. Right now we're only supporting Zarr.
storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Closes #59
Closes #70