Add STAC <> Zarr report #139
base: staging
This is wonderful, thank you @jsignell!
This is a tremendous resource as-is! I think it could be taken to the next level, separately from this PR, by having a small demonstration of catalogs that employ each of these approaches, perhaps with a STAC browser on top. Do you have any plans to build a resource of that type? I'd be glad to help.
### Straight to xarray
Currently this is the only supported option. You construct the lazily-loaded data cube in xarray and filter once you are in there.
It's not really the only supported option, just the most common/likely useful, right? As in, one could also take the href from the STAC collection and open it with Zarr directly, non-Python Zarr implementations, or GDAL (depending on the Zarr format)?
oh sure sure I think I meant you cannot filter in STAC but of course you can do a million other things.
To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more - this is mostly useful if the STAC collection uses the xarray extension.
Suggested change:
To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more. The `stac` backend is mostly useful if the STAC collection uses the xarray extension.
This is likely to be very fast if there is a consolidated metadata file OR the data is in Zarr-3 and the metadata fetch is highly parallelized.
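A minimal sketch of the `zarr` backend path described above, for readers who want something concrete. It assumes the collection exposes the store as a collection-level asset and uses the xarray extension fields `xarray:open_kwargs` and `xarray:storage_options`; the collection contents, the asset key `zarr-https`, and the `example.com` href are all hypothetical. The helper only assembles arguments, so it has no dependencies; `open_lazily` needs `xarray` and a Zarr-capable install and is not called here.

```python
from typing import Any, Dict, Tuple

# Hypothetical STAC collection fragment; the href and asset key are made up.
collection = {
    "id": "example-cube",
    "assets": {
        "zarr-https": {
            "href": "https://example.com/data/cube.zarr",
            "type": "application/vnd+zarr",
            # xarray extension: keyword arguments the data provider recommends.
            "xarray:open_kwargs": {"consolidated": True, "chunks": {}},
        }
    },
}


def zarr_open_args(collection: Dict[str, Any], asset_key: str) -> Tuple[str, Dict[str, Any]]:
    """Return (href, kwargs) suitable for xarray.open_dataset."""
    asset = collection["assets"][asset_key]
    kwargs = dict(asset.get("xarray:open_kwargs", {}))
    kwargs.setdefault("engine", "zarr")  # fall back to the plain zarr backend
    storage_options = asset.get("xarray:storage_options")
    if storage_options:
        kwargs["storage_options"] = dict(storage_options)
    return asset["href"], kwargs


def open_lazily(collection: Dict[str, Any], asset_key: str):
    """Open the store lazily; filtering happens afterwards, inside xarray."""
    import xarray as xr  # imported here so the helper above stays dependency-free

    href, kwargs = zarr_open_args(collection, asset_key)
    return xr.open_dataset(href, **kwargs)
```

With `consolidated=True` the metadata comes from a single consolidated object rather than one request per array, which is the fast case the paragraph above refers to. Any subsetting (for example `ds.sel(...)`) happens after the open call, not in STAC.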
Not necessary for this PR, but eventually it would be nice to add a glossary with explanations for "consolidated metadata", etc.
yes this needs a read through for linking. I was thinking of this blog post https://earthmover.io/blog/xarray-open-zarr-improvements
Both of those access patterns should be supported by tooling, but depending on how the catalog is set up some patterns may be simpler and faster than others (at this point in time).
Here is what exists so far:
Suggested change:
Here is what exists so far in terms of the typical organization of Zarr stores in STAC and how to use them:
Doesn't need to be this format, but I think it'd be helpful to explain in advance how you're organizing the sections and sub-sections.
- Store the result of a data-cube constructed by concatenating Zarr stores:
  - as a new Zarr store - this option can include filtering and subsetting
  - as a virtual reference file (icechunk or kerchunk)
@mdsumner have you tried out concatenating Zarr stores via VRT/GTI? I'm struggling to keep up with your work, but thought you may have shared this as an option as well.
Right! I had seen https://www.hypertidy.org/posts/2025-03-12-r-py-multidim/r-py-multidim but I need to revisit and link out.
**Pros**
- There is no metadata duplication in STAC, so the STAC side is easy to maintain.
- Simple access interface for Python users - no client-side concatenation.
- Entire data-cube can be lazily constructed with one GET to the reference file.
Suggested change:
- Entire data-cube can be lazily constructed with one GET to the reference file or store.
I'm just being nit-picky about the Icechunk virtual approach producing stores rather than files so that users don't expect that they could simply inspect/modify the file as one can with Kerchunk references.
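To make the "one GET" point concrete, here is a rough sketch of what a Kerchunk-style reference file (version 1 layout) looks like when several source files are concatenated along time. It is illustrative only: the bucket, file names, variable name `temp`, byte offset, and array shape are hypothetical, and real references are generated by tools such as kerchunk or VirtualiZarr rather than written by hand. Icechunk virtual stores carry equivalent information but are not a single inspectable JSON file like this.

```python
import json

# Hypothetical source files, one time step each.
files = [
    "s3://example-bucket/day1.nc",
    "s3://example-bucket/day2.nc",
    "s3://example-bucket/day3.nc",
]


def build_refs(files, var="temp", ny=180, nx=360, offset=8192):
    """Build a reference dict with one uncompressed float32 chunk per file."""
    chunk_bytes = ny * nx * 4  # float32, no compressor
    zarray = {
        "zarr_format": 2,
        "shape": [len(files), ny, nx],
        "chunks": [1, ny, nx],
        "dtype": "<f4",
        "compressor": None,
        "filters": None,
        "fill_value": None,
        "order": "C",
    }
    refs = {
        ".zgroup": json.dumps({"zarr_format": 2}),
        f"{var}/.zarray": json.dumps(zarray),
    }
    for i, url in enumerate(files):
        # Each chunk key points at [url, byte offset, byte length] in a source file.
        refs[f"{var}/{i}.0.0"] = [url, offset, chunk_bytes]
    return {"version": 1, "refs": refs}


references = build_refs(files)
reference_json = json.dumps(references)  # this single document is what the client fetches
```

All array metadata and every chunk location live in that one document, so a client can construct the lazy cube after fetching it and only then issue range requests for the chunks it actually needs.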
no this is super helpful. I am stuck in kerchunkland in my mind
## Comparison Table
The main thing to keep in mind is that Zarr is a file format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.
Suggested change:
The main thing to keep in mind is that Zarr is a data format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.
I wonder if data format would be less confusing due to Zarr stores being split up over multiple files?
yes good call.
| good at search and discovery | good at filtering within a dataset |
| searching returns items | filtering returns chunks |
| stores metadata separately from data | stores metadata alongside data (except when virtualized) |
| can point to anything | has opinions about how the data are laid out on disk |
Suggested change:
| can point to anything with spatio-temporal dimensions | oriented around the storage of large N-dimensional typed arrays |
I hesitate to include that Zarr has opinions about how the data are laid out on disk because technically you can store Zarr compliant data while redirecting the storage paths (e.g., Icechunk, VirtualiZarr)
Actually I think this is already the first row.
| can point to anything | has opinions about how the data are laid out on disk |
| supports arbitrary metadata for catalogs, collections, items, assets | supports arbitrary metadata for groups, arrays |
| good at search and discovery | good at filtering within a dataset |
| searching returns items | filtering returns chunks |
| stores metadata separately from data | stores metadata alongside data (except when virtualized) |
Suggested change:
| storage of STAC metadata is completely decoupled from storage of data | storage of metadata is coupled to data (i.e., in the same directory, except when virtualized) |
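The "searching returns items" versus "filtering returns chunks" row might be easier to picture with a toy example. This is a plain-Python sketch, not how any STAC API or Zarr library is implemented; the item ids, datetimes, and chunk size are hypothetical.

```python
from datetime import datetime


def parse(ts):
    """Parse an RFC 3339 timestamp such as '2020-01-05T00:00:00Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def search_items(items, start, end):
    """STAC-style search: whole items whose datetime falls in [start, end]."""
    lo, hi = parse(start), parse(end)
    return [it["id"] for it in items if lo <= parse(it["properties"]["datetime"]) <= hi]


def chunks_for_slice(start, stop, chunk_size):
    """Zarr-style filtering: chunk indices along one axis touched by [start, stop)."""
    if stop <= start:
        return []
    return list(range(start // chunk_size, (stop - 1) // chunk_size + 1))


# Hypothetical items, each describing one asset.
items = [
    {"id": "a", "properties": {"datetime": "2020-01-05T00:00:00Z"}},
    {"id": "b", "properties": {"datetime": "2020-02-10T00:00:00Z"}},
    {"id": "c", "properties": {"datetime": "2020-03-15T00:00:00Z"}},
]

matched = search_items(items, "2020-02-01T00:00:00Z", "2020-03-31T00:00:00Z")
touched = chunks_for_slice(25, 70, 30)  # time indices 25..69 with 30 steps per chunk
```

The search result is a list of item ids, each of which still has to be opened, while the slice maps directly to the chunk indices that need to be read from a single array.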
This report will discuss the partially overlapping goals of STAC and Zarr and offer suggestions for how to use them together. Answering questions like:
- What do each of these specifications excel at?
- How can they be used together to get the maximum benefit out of both?
- Where do virtualized datasets (kerchunk, VirtualiZarr) fit in?
Suggested change:
- Where do virtualized datasets (kerchunk references and Icechunk virtual stores produced by VirtualiZarr) fit in?
Actually, I'm now wondering if even adding some drawing to these pages would help clarify the concepts as much or more than a full demonstration.
I also wonder if the virtual zarr section should include some guidance on how to organize cataloging of the raw data with cataloging of the virtual Zarr references/stores?
Yes! I think adding drawings could be really helpful. I was struggling with how to structure the sections, but once we settle on that drawings would really be useful. I think it would also help to just point to examples that implement the different setups.
Yeeeaaahhh I wasn't sure how much to talk about virtual zarr... but I can imagine a scenario where you catalog the data in normal stac objects but then also include a top-level reference file/store so that you have best of both worlds in terms of data access, but then you have 2 ways of accessing data and 2 places where you are abstracting metadata so I wasn't sure if that is a good idea.
Maybe a good balance would be just to open an issue for a TODO on additional guidance for cataloguing both raw data and virtual zarrs and mention that issue in this new section, so readers are aware it's not fully comprehensive.
First draft of #134 borrowing heavily from conversation on that issue.