Add STAC <> Zarr report #139

Open
wants to merge 1 commit into
base: staging

jsignell
Collaborator

First draft of #134 borrowing heavily from conversation on that issue.

Collaborator

@maxrjones left a comment

This is wonderful, thank you @jsignell!

This is a tremendous resource as-is! I think it could be taken to the next level, separately from this PR, by having a small demonstration of catalogs that employ each of these approaches, perhaps with a STAC browser on top. Do you have any plans to build a resource of that type? I'd be glad to help.


### Straight to xarray

Currently this is the only supported option. You construct the lazily-loaded data cube in xarray and filter once you are in there.
Collaborator

It's not really the only supported option, just the most common/likely useful, right? As in, one could also take the href from the STAC collection and open it with Zarr directly, non-Python Zarr implementations, or GDAL (depending on the Zarr format)?

Collaborator Author

oh sure sure I think I meant you cannot filter in STAC but of course you can do a million other things.
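
A minimal sketch of the access path described in this thread: take the Zarr href from a STAC collection, open it lazily, and filter afterwards. The collection dict, the `zarr-https` asset key, and the example.com URLs are hypothetical. `application/vnd+zarr` is a commonly used media type for Zarr assets, but a given catalog may label them differently. `open_cube` is only defined here, not called, and assumes xarray, zarr, and dask are installed.

```python
# Hypothetical collection-level assets, trimmed to the fields used here.
collection = {
    "type": "Collection",
    "id": "example-cube",
    "assets": {
        "thumbnail": {"href": "https://example.com/thumb.png", "type": "image/png"},
        "zarr-https": {
            "href": "https://example.com/data/example-cube.zarr",
            "type": "application/vnd+zarr",
        },
    },
}

# Assumption: the catalog marks Zarr assets with this media type.
ZARR_MEDIA_TYPE = "application/vnd+zarr"


def zarr_hrefs(stac_object):
    """Return the href of every asset whose media type marks it as Zarr."""
    return [
        asset["href"]
        for asset in stac_object.get("assets", {}).values()
        if asset.get("type", "").startswith(ZARR_MEDIA_TYPE)
    ]


def open_cube(href):
    """Lazily open the store with xarray's zarr backend."""
    import xarray as xr  # imported here so the href lookup works without xarray

    # chunks={} asks for dask-backed (lazy) arrays using the on-disk chunking.
    return xr.open_dataset(href, engine="zarr", chunks={})


hrefs = zarr_hrefs(collection)
print(hrefs)
```

The same href could instead be handed to zarr-python, a non-Python Zarr implementation, or GDAL. Either way, any filtering happens after opening, not in STAC.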


Currently this is the only supported option. You construct the lazily-loaded data cube in xarray and filter once you are in there.

To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more - this is mostly useful if the STAC collection uses the xarray extension.
Collaborator

Suggested change
To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more - this is mostly useful if the STAC collection uses the xarray extension.
To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more. The `stac` backend is mostly useful if the STAC collection uses the xarray extension.


To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more - this is mostly useful if the STAC collection uses the xarray extension.

This is likely to be very fast if there is a consolidated metadata file OR the data is in Zarr-3 and the metadata fetch is highly parallelized.
Collaborator

Not necessary for this PR, but eventually it would be nice to add a glossary with explanations for "consolidated metadata", etc.

Collaborator Author

yes this needs a read through for linking. I was thinking of this blog post https://earthmover.io/blog/xarray-open-zarr-improvements
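
Until a glossary exists, here is a rough illustration of what "consolidated metadata" buys you, assuming a Zarr v2 layout. The store contents are a made-up toy (real `.zarray` documents carry more fields), and `CountingStore` is a hypothetical stand-in that counts reads the way a client would count GET requests. The existence check and key listing are treated as free here, which a real object store would not give you.

```python
import json

# Toy Zarr v2 store: keys are object names, values are raw metadata bytes.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    ".zattrs": b"{}",
    "temperature/.zarray": b'{"shape": [365, 180, 360], "chunks": [30, 180, 360]}',
    "temperature/.zattrs": b'{"units": "K"}',
    "time/.zarray": b'{"shape": [365], "chunks": [365]}',
    "time/.zattrs": b"{}",
}

METADATA_NAMES = (".zgroup", ".zarray", ".zattrs")


def is_metadata(key):
    """True if the last path segment is one of the v2 metadata documents."""
    return key.rsplit("/", 1)[-1] in METADATA_NAMES


def consolidate(store):
    """Gather every metadata document into a single .zmetadata object."""
    metadata = {k: json.loads(v) for k, v in store.items() if is_metadata(k)}
    return json.dumps({"zarr_consolidated_format": 1, "metadata": metadata}).encode()


class CountingStore:
    """Wrap a mapping and count reads, standing in for GET requests."""

    def __init__(self, data):
        self.data = data
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.data[key]


def read_metadata(counting_store):
    """Read all metadata, preferring .zmetadata when it is present."""
    if ".zmetadata" in counting_store.data:
        return json.loads(counting_store.get(".zmetadata"))["metadata"]
    keys = [k for k in counting_store.data if is_metadata(k)]
    return {k: json.loads(counting_store.get(k)) for k in keys}


plain = CountingStore(dict(store))
consolidated = CountingStore({**store, ".zmetadata": consolidate(store)})

meta_plain = read_metadata(plain)
meta_consolidated = read_metadata(consolidated)
print(plain.gets, consolidated.gets)  # 6 1
```

Both paths recover the same metadata. Without consolidation the number of reads grows with the number of groups and arrays, which is why parallelizing those fetches matters when no consolidated file exists.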


Both of those access patterns should be supported by tooling, but depending on how the catalog is set up some patterns may be simpler and faster than others (at this point in time).

Here is what exists so far:
Collaborator

Suggested change
Here is what exists so far:
Here is what exists so far in terms of the typical organization of Zarr stores in STAC and how to use them:

Doesn't need to be this format, but I think it'd be helpful to explain in advance how you're organizing the sections and sub-sections.

Comment on lines +66 to +68
- Store the result of a data-cube constructed by concatenating Zarr stores:
- as a new Zarr store - this option can include filtering and subsetting
- as a virtual reference file (icechunk or kerchunk)
Collaborator

@mdsumner have you tried out concatenating Zarr stores via VRT/GTI? I'm struggling to keep up with your work, but thought you may have shared this as an option as well.

Collaborator Author

Right! I had seen https://www.hypertidy.org/posts/2025-03-12-r-py-multidim/r-py-multidim but I need to revisit and link out.

**Pros**
- There is no metadata duplication in STAC, so the STAC side is easy to maintain.
- Simple access interface for Python users - no client-side concatenation.
- Entire data-cube can be lazily constructed with one GET to the reference file.
Collaborator

Suggested change
- Entire data-cube can be lazily constructed with one GET to the reference file.
- Entire data-cube can be lazily constructed with one GET to the reference file or store.

I'm just being nit-picky about the Icechunk virtual approach producing stores rather than files so that users don't expect that they could simply inspect/modify the file as one can with Kerchunk references.

Collaborator Author

no this is super helpful. I am stuck in kerchunkland in my mind
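
To illustrate the "one GET to the reference file" point for the Kerchunk case, here is a small sketch assuming the version 1 Kerchunk JSON reference layout, where each key maps either to inline content or to a `[url, offset, length]` byte range. The URLs, offsets, and array metadata are invented, and `resolve` and `range_header` are hypothetical helpers, not Kerchunk API. An Icechunk virtual store records comparable information but as a store rather than a single inspectable file, so this sketch does not apply to it directly.

```python
import json

# Stand-in for the body of the reference file after a single fetch.
reference_json = json.dumps(
    {
        "version": 1,
        "refs": {
            ".zgroup": '{"zarr_format": 2}',
            "temperature/.zarray": '{"shape": [2, 4], "chunks": [1, 4], "dtype": "<f4"}',
            "temperature/0.0": ["https://example.com/raw/day1.nc", 8192, 16],
            "temperature/1.0": ["https://example.com/raw/day2.nc", 8192, 16],
        },
    }
)


def resolve(refs, key):
    """Map a Zarr key to inline bytes or to a (url, offset, length) byte range."""
    ref = refs[key]
    if isinstance(ref, str):
        return ("inline", ref.encode())
    url, offset, length = ref
    return ("range", url, offset, length)


def range_header(offset, length):
    """HTTP Range header value covering exactly `length` bytes from `offset`."""
    return f"bytes={offset}-{offset + length - 1}"


# Parsing the one document gives the layout of the whole cube.
refs = json.loads(reference_json)["refs"]

kind, url, offset, length = resolve(refs, "temperature/1.0")
print(url, range_header(offset, length))
```

All metadata arrives inline with the reference file, and chunk bytes are only requested from the original files when a reader actually touches them.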


## Comparison Table

The main thing to keep in mind is that Zarr is a file format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.
Collaborator

Suggested change
The main thing to keep in mind is that Zarr is a file format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.
The main thing to keep in mind is that Zarr is a data format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.

I wonder if data format would be less confusing due to Zarr stores being split up over multiple files?

Collaborator Author

yes good call.

| good at search and discovery | good at filtering within a dataset |
| searching returns items | filtering returns chunks |
| stores metadata separately from data | stores metadata alongside data (except when virtualized) |
| can point to anything | has opinions about how the data are laid out on disk |
Collaborator

Suggested change
| can point to anything | has opinions about how the data are laid out on disk |
| can point to anything with spatio-temporal dimensions | oriented around the storage of large N-dimensional typed arrays |

Collaborator

I hesitate to include that Zarr has opinions about how the data are laid out on disk because technically you can store Zarr compliant data while redirecting the storage paths (e.g., Icechunk, VirtualiZarr)

Collaborator Author

Actually I think this is already the first row.

Suggested change
| can point to anything | has opinions about how the data are laid out on disk |

| supports arbitrary metadata for catalogs, collections, items, assets | supports arbitrary metadata for groups, arrays |
| good at search and discovery | good at filtering within a dataset |
| searching returns items | filtering returns chunks |
| stores metadata separately from data | stores metadata alongside data (except when virtualized) |
Collaborator

Suggested change
| stores metadata separately from data | stores metadata alongside data (except when virtualized) |
| storage of STAC metadata is completely decoupled from storage of data | storage of metadata is coupled to data (i.e., in the same directory, except when virtualized) |
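
One way to picture the "searching returns items" versus "filtering returns chunks" row is the toy comparison below. The item ids, date ranges, and chunk size are made up, and both functions are simplified stand-ins: `search` mimics a temporal-overlap query over items, and `chunks_for_slice` computes which chunks along one axis an index slice would touch.

```python
from datetime import date

# STAC side: each item describes a whole asset with a temporal extent.
items = [
    {"id": "cube-2020", "start": date(2020, 1, 1), "end": date(2020, 12, 31)},
    {"id": "cube-2021", "start": date(2021, 1, 1), "end": date(2021, 12, 31)},
    {"id": "cube-2022", "start": date(2022, 1, 1), "end": date(2022, 12, 31)},
]


def search(items, start, end):
    """Return ids of items whose extent overlaps [start, end]."""
    return [i["id"] for i in items if i["start"] <= end and i["end"] >= start]


# Zarr side: a slice along one axis maps to a set of chunk indices.
def chunks_for_slice(start, stop, chunk_size):
    """Indices of the chunks touched by the half-open slice [start, stop)."""
    if stop <= start:
        return []
    return list(range(start // chunk_size, (stop - 1) // chunk_size + 1))


matched_items = search(items, date(2021, 6, 1), date(2021, 8, 31))
touched_chunks = chunks_for_slice(45, 100, 30)
print(matched_items, touched_chunks)
```

The search narrows to whole items, each of which may still point at a large store. The slice narrows to the few chunks that actually need to be read from within one store.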

This report will discuss the partially overlapping goals of STAC and Zarr and offer suggestions for how to use them together. Answering questions like:
- What do each of these specifications excel at?
- How can they be used together to get the maximum benefit out of both?
- Where do virtualized datasets (kerchunk, VirtualiZarr) fit in?
Collaborator

Suggested change
- Where do virtualized datasets (kerchunk, VirtualiZarr) fit in?
- Where do virtualized datasets (kerchunk references and Icechunk virtual stores produced by VirtualiZarr) fit in?

@maxrjones
Collaborator

> This is a tremendous resource as-is! I think it could be taken to the next level, separately from this PR, by having a small demonstration of catalogs that employ each of these approaches, perhaps with a STAC browser on top. Do you have any plans to build a resource of that type? I'd be glad to help.

Actually, I'm now wondering if even adding some drawing to these pages would help clarify the concepts as much or more than a full demonstration.

@maxrjones
Collaborator

I also wonder if the virtual zarr section should include some guidance on how to organize cataloging of the raw data with cataloging of the virtual Zarr references/stores?

@jsignell
Collaborator Author

> > This is a tremendous resource as-is! I think it could be taken to the next level, separately from this PR, by having a small demonstration of catalogs that employ each of these approaches, perhaps with a STAC browser on top. Do you have any plans to build a resource of that type? I'd be glad to help.

> Actually, I'm now wondering if even adding some drawing to these pages would help clarify the concepts as much or more than a full demonstration.

Yes! I think adding drawings could be really helpful. I was struggling with how to structure the sections, but once we settle on that, drawings would be really useful. I think it would also help to just point to examples that implement the different setups.

@jsignell
Collaborator Author

> I also wonder if the virtual zarr section should include some guidance on how to organize cataloging of the raw data with cataloging of the virtual Zarr references/stores?

Yeeeaaahhh I wasn't sure how much to talk about virtual zarr... but I can imagine a scenario where you catalog the data in normal STAC objects but then also include a top-level reference file/store so that you have the best of both worlds in terms of data access. But then you have 2 ways of accessing data and 2 places where you are abstracting metadata, so I wasn't sure if that is a good idea.

@maxrjones
Collaborator

> > I also wonder if the virtual zarr section should include some guidance on how to organize cataloging of the raw data with cataloging of the virtual Zarr references/stores?

> Yeeeaaahhh I wasn't sure how much to talk about virtual zarr... but I can imagine a scenario where you catalog the data in normal STAC objects but then also include a top-level reference file/store so that you have the best of both worlds in terms of data access. But then you have 2 ways of accessing data and 2 places where you are abstracting metadata, so I wasn't sure if that is a good idea.

Maybe a good balance would be just to open an issue for a TODO on additional guidance for cataloguing both raw data and virtual zarrs and mention that issue in this new section, so readers are aware it's not fully comprehensive.
