Add STAC <> Zarr report #139

Open

jsignell wants to merge 1 commit into cloudnativegeo:staging from jsignell:stac-zarr

Collaborator

jsignell commented Mar 26, 2025

First draft of #134 borrowing heavily from conversation on that issue.


          Add STAC <> Zarr report

0ee4a0b

maxrjones approved these changes

Collaborator

maxrjones left a comment

This is wonderful, thank you @jsignell!

This is a tremendous resource as-is! I think it could be taken to the next level, separately from this PR, by having a small demonstration of catalogs that employ each of these approaches, perhaps with a STAC browser on top. Do you have any plans to build a resource of that type? I'd be glad to help.

cookbooks/stac-zarr-report/data-consumers/index.qmd


		### Straight to xarray

		Currently this is the only supported option. You construct the lazily-loaded data cube in xarray and filter once you are in there.

Collaborator

maxrjones Mar 27, 2025

It's not really the only supported option, just the most common/likely useful, right? As in, one could also take the href from the STAC collection and open it with Zarr directly, non-Python Zarr implementations, or GDAL (depending on the Zarr format)?

Collaborator Author

jsignell Mar 27, 2025

oh sure sure I think I meant you cannot filter in STAC but of course you can do a million other things.
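
As a rough illustration of the point above about taking the href from the STAC collection, here is a minimal Python sketch using only the standard library. The collection document, its id, the asset keys, and the example.com URLs are all hypothetical, and the `application/vnd+zarr` media type is an assumption about how a given catalog labels its Zarr assets.

```python
import json

# Hypothetical STAC collection, trimmed to the fields used in this sketch.
collection = json.loads("""
{
  "type": "Collection",
  "id": "example-cube",
  "assets": {
    "thumbnail": {"href": "https://example.com/thumb.png", "type": "image/png"},
    "zarr-https": {
      "href": "https://example.com/example-cube.zarr",
      "type": "application/vnd+zarr",
      "roles": ["data"]
    }
  }
}
""")

# Assumed media type; a real catalog may label its Zarr assets differently.
ZARR_MEDIA_TYPE = "application/vnd+zarr"


def zarr_hrefs(stac_object):
    """Return the href of every asset whose media type marks it as Zarr."""
    return [
        asset["href"]
        for asset in stac_object.get("assets", {}).values()
        if asset.get("type") == ZARR_MEDIA_TYPE
    ]


hrefs = zarr_hrefs(collection)
print(hrefs)
```

The resulting href could then be handed to whichever Zarr reader fits, for example `xarray.open_dataset(href, engine="zarr")` in Python. Any filtering still happens after opening, not in STAC.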

cookbooks/stac-zarr-report/data-consumers/index.qmd


		Currently this is the only supported option. You construct the lazily-loaded data cube in xarray and filter once you are in there.

		To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more - this is mostly useful if the STAC collection uses the xarray extension.

Collaborator

maxrjones Mar 27, 2025

Suggested change

      
-    To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more - this is mostly useful if the STAC collection uses the xarray extension.
+    To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more. The `stac` backend is mostly useful if the STAC collection uses the xarray extension.

cookbooks/stac-zarr-report/data-consumers/index.qmd


		To do this you can use the `zarr` backend directly or you can use [the `stac` backend](https://github.com/stac-utils/xpystac) to streamline even more - this is mostly useful if the STAC collection uses the xarray extension.

		This is likely to be very fast if there is a consolidated metadata file OR the data is in Zarr-3 and the metadata fetch is highly parallelized.

Collaborator

maxrjones Mar 27, 2025

Not necessary for this PR, but eventually it would be nice to add a glossary with explanations for "consolidated metadata", etc.

Collaborator Author

jsignell Mar 27, 2025

yes this needs a read through for linking. I was thinking of this blog post https://earthmover.io/blog/xarray-open-zarr-improvements
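
For readers unfamiliar with the term, here is a rough sketch of what consolidated metadata buys you in a Zarr v2 store. Every group and array carries its own small JSON documents, and consolidation copies them into a single `.zmetadata` document so a client needs one read instead of one per document. The store layout, variable names, and trimmed `.zarray` contents below are made up for illustration, and the sketch uses only the standard library rather than the `zarr` package.

```python
import json
import tempfile
from pathlib import Path

METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}


def write_json(path, doc):
    """Write one JSON document, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc))


def consolidate(store):
    """Gather every Zarr v2 metadata document under `store` into one mapping."""
    metadata = {}
    for path in sorted(store.rglob(".z*")):
        if path.name in METADATA_NAMES:
            key = path.relative_to(store).as_posix()
            metadata[key] = json.loads(path.read_text())
    return {"zarr_consolidated_format": 1, "metadata": metadata}


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "cube.zarr"
    write_json(store / ".zgroup", {"zarr_format": 2})
    for name in ("temperature", "precipitation"):
        # Trimmed for brevity; a complete .zarray document has more fields.
        write_json(
            store / name / ".zarray",
            {"zarr_format": 2, "shape": [365, 180, 360], "chunks": [30, 90, 90], "dtype": "<f4"},
        )
        write_json(store / name / ".zattrs", {"_ARRAY_DIMENSIONS": ["time", "lat", "lon"]})

    consolidated = consolidate(store)
    write_json(store / ".zmetadata", consolidated)

    # Without consolidation a client reads each document separately.
    reads_without = len(consolidated["metadata"])
    reads_with = 1

print(reads_without, reads_with)
```

On object storage each of those reads is a separate request, which is why the number of metadata documents matters for how quickly a store opens.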

cookbooks/stac-zarr-report/data-consumers/index.qmd


		Both of those access patterns should be supported by tooling, but depending on how the catalog is set up some patterns may be simpler and faster than others (at this point in time).

		Here is what exists so far:

Collaborator

maxrjones Mar 27, 2025

Suggested change

      
-    Here is what exists so far:
+    Here is what exists so far in terms of typical organization of Zarr stores in STAC and how to use them:

Doesn't need to be this format, but I think it'd be helpful to explain in advance how you're organizing the sections and sub-sections.

cookbooks/stac-zarr-report/data-consumers/index.qmd

Comment on lines +66 to +68

+                - Store the result of a data-cube constructed by concatenating Zarr stores:
+                  - as a new Zarr store - this option can include filtering and subsetting
+                  - as a virtual reference file (icechunk or kerchunk)

Collaborator

maxrjones Mar 27, 2025

@mdsumner have you tried out concatenating Zarr stores via VRT/GTI? I'm struggling to keep up with your work, but thought you may have shared this as an option as well.

Collaborator Author

jsignell Mar 27, 2025

Right! I had seen https://www.hypertidy.org/posts/2025-03-12-r-py-multidim/r-py-multidim but I need to revisit and link out.

cookbooks/stac-zarr-report/data-producers/index.qmd

+              **Pros**
+              - There is no metadata duplication in STAC, so the STAC side is easy to maintain.
+              - Simple access interface for Python users - no client-side concatenation.
+              - Entire data-cube can be lazily constructed with one GET to the reference file.

Collaborator

maxrjones Mar 27, 2025

Suggested change

      
-    - Entire data-cube can be lazily constructed with one GET to the reference file.
+    - Entire data-cube can be lazily constructed with one GET to the reference file or store.

I'm just being nit-picky about the Icechunk virtual approach producing stores rather than files so that users don't expect that they could simply inspect/modify the file as one can with Kerchunk references.

Collaborator Author

jsignell Mar 27, 2025

no this is super helpful. I am stuck in kerchunkland in my mind
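
To make the file-versus-store distinction concrete, here is a minimal sketch of what a Kerchunk-style JSON reference file contains, assuming the version 1 layout where each key maps either to inline metadata or to a `[url, offset, length]` byte range. The bucket, file names, offsets, and array shape are hypothetical. Icechunk virtual stores hold comparable information but are not a single inspectable JSON file, so this applies to the Kerchunk case only.

```python
import json

# Hypothetical reference set: two monthly files exposed as one 2 x 4 array.
reference_file = json.dumps(
    {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            "temperature/.zarray": json.dumps(
                {"shape": [2, 4], "chunks": [1, 4], "dtype": "<f4"}
            ),
            # Each chunk is 1 x 4 float32 values, so 16 bytes long.
            "temperature/0.0": ["s3://example-bucket/2025-01.nc", 8192, 16],
            "temperature/1.0": ["s3://example-bucket/2025-02.nc", 8192, 16],
        },
    }
)


def resolve(refs, key):
    """Return inline JSON metadata, or the (url, offset, length) for a chunk key.

    Assumes every inline string entry is JSON metadata, which holds for this
    sketch but not for reference files that inline chunk data.
    """
    value = refs[key]
    if isinstance(value, str):
        return json.loads(value)
    url, offset, length = value
    return url, offset, length


# Parsing the one document stands in for the single GET.
refs = json.loads(reference_file)["refs"]
array_meta = resolve(refs, "temperature/.zarray")
chunk_location = resolve(refs, "temperature/1.0")
print(array_meta, chunk_location)
```

Because the references are plain JSON, a user can open the file and read or edit an entry by hand, which is the expectation the nit above is trying not to set for Icechunk.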

cookbooks/stac-zarr-report/index.qmd


		## Comparison Table

		The main thing to keep in mind is that Zarr is a file format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.

Collaborator

maxrjones Mar 27, 2025

Suggested change

      
-    The main thing to keep in mind is that Zarr is a file format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.
+    The main thing to keep in mind is that Zarr is a data format and STAC is not. Sometimes we conflate "STAC" with "COGs stored in a STAC catalog", but STAC can be used to catalog anything as long as it has spatial temporal dimensions.

I wonder if data format would be less confusing due to Zarr stores being split up over multiple files?

Collaborator Author

jsignell Mar 27, 2025

yes good call.

cookbooks/stac-zarr-report/index.qmd

+              | good at search and discovery | good at filtering within a dataset |
+              | searching returns items | filtering returns chunks |
+              | stores metadata separately from data | stores metadata alongside data (except when virtualized) |
+              | can point to anything | has opinions about how the data are laid out on disk |

Collaborator

maxrjones Mar 27, 2025

Suggested change

      
-    | can point to anything | has opinions about how the data are laid out on disk |
+    | can point to anything with spatio-temporal dimensions | oriented around the storage of large N-dimensional typed arrays |

Collaborator

maxrjones Mar 27, 2025

I hesitate to include that Zarr has opinions about how the data are laid out on disk because technically you can store Zarr compliant data while redirecting the storage paths (e.g., Icechunk, VirtualiZarr)

Collaborator Author

jsignell Mar 27, 2025

Actually I think this is already the first row.

Suggested change

      
-    | can point to anything | has opinions about how the data are laid out on disk |

cookbooks/stac-zarr-report/index.qmd

+              | supports arbitrary metadata for catalogs, collections, items, assets | supports arbitrary metadata for groups, arrays |
+              | good at search and discovery | good at filtering within a dataset |
+              | searching returns items | filtering returns chunks |
+              | stores metadata separately from data | stores metadata alongside data (except when virtualized) |

Collaborator

maxrjones Mar 27, 2025

Suggested change

      
-    | stores metadata separately from data | stores metadata alongside data (except when virtualized) |
+    | storage of STAC metadata is completely decoupled from storage of data | storage of metadata is coupled to data (i.e., in the same directory, except when virtualized) |
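
A toy illustration of the "searching returns items | filtering returns chunks" row from the table under review, using only the standard library. The items, their fields, and both helper functions are hypothetical stand-ins, not pystac or zarr APIs.

```python
from datetime import date

# Hypothetical STAC-like items, one per monthly file.
items = [
    {"id": "2025-01", "start": date(2025, 1, 1), "end": date(2025, 1, 31)},
    {"id": "2025-02", "start": date(2025, 2, 1), "end": date(2025, 2, 28)},
    {"id": "2025-03", "start": date(2025, 3, 1), "end": date(2025, 3, 31)},
]


def search(items, start, end):
    """STAC-style search: ids of items whose time range intersects [start, end]."""
    return [item["id"] for item in items if item["start"] <= end and item["end"] >= start]


def chunks_for_slice(start, stop, chunk_size):
    """Zarr-style filtering: chunk indices overlapping [start, stop) on one axis.

    Assumes 0 <= start < stop.
    """
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


# Search narrows a catalog to whole items (files or stores).
found = search(items, date(2025, 1, 20), date(2025, 2, 10))

# Filtering narrows one array to the chunks a slice touches:
# elements 45..99 of an axis chunked every 30 elements.
touched = chunks_for_slice(45, 100, 30)

print(found, touched)
```

The search result is a list of whole assets to open, while the slice resolves to a handful of chunks inside one array, so the two operate at different granularities.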

cookbooks/stac-zarr-report/index.qmd

+              This report will discuss the partially overlapping goals of STAC and Zarr and offer suggestions for how to use them together. Answering questions like:
+               - What do each of these specifications excel at?
+               - How can they be used together to get the maximum benefit out of both?
+               - Where do virtualized datasets (kerchunk, VirtualiZarr) fit in?

Collaborator

maxrjones Mar 27, 2025

Suggested change

      
-     - Where do virtualized datasets (kerchunk, VirtualiZarr) fit in?
+     - Where do virtualized datasets (kerchunk references and Icechunk virtual stores produced by VirtualiZarr) fit in?

Collaborator

maxrjones commented Mar 27, 2025

> This is a tremendous resource as-is! I think it could be taken to the next level, separately from this PR, by having a small demonstration of catalogs that employ each of these approaches, perhaps with a STAC browser on top. Do you have any plans to build a resource of that type? I'd be glad to help.

Actually, I'm now wondering if even adding some drawings to these pages would help clarify the concepts as much as or more than a full demonstration.

Collaborator

maxrjones commented Mar 27, 2025

I also wonder if the virtual zarr section should include some guidance on how to organize cataloging of the raw data with cataloging of the virtual Zarr references/stores?

Collaborator Author

jsignell commented Mar 27, 2025

> This is a tremendous resource as-is! I think it could be taken to the next level, separately from this PR, by having a small demonstration of catalogs that employ each of these approaches, perhaps with a STAC browser on top. Do you have any plans to build a resource of that type? I'd be glad to help.
>
> Actually, I'm now wondering if even adding some drawings to these pages would help clarify the concepts as much as or more than a full demonstration.

Yes! I think adding drawings could be really helpful. I was struggling with how to structure the sections, but once we settle on that, drawings would really be useful. I think it would also help to just point to examples that implement the different setups.

Collaborator Author

jsignell commented Mar 27, 2025

> I also wonder if the virtual zarr section should include some guidance on how to organize cataloging of the raw data with cataloging of the virtual Zarr references/stores?

Yeeeaaahhh I wasn't sure how much to talk about virtual zarr... I can imagine a scenario where you catalog the data in normal STAC objects but then also include a top-level reference file/store, so that you have the best of both worlds in terms of data access. But then you have 2 ways of accessing data and 2 places where you are abstracting metadata, so I wasn't sure if that is a good idea.

Collaborator

maxrjones commented Mar 27, 2025

> > I also wonder if the virtual zarr section should include some guidance on how to organize cataloging of the raw data with cataloging of the virtual Zarr references/stores?
>
> Yeeeaaahhh I wasn't sure how much to talk about virtual zarr... I can imagine a scenario where you catalog the data in normal STAC objects but then also include a top-level reference file/store, so that you have the best of both worlds in terms of data access. But then you have 2 ways of accessing data and 2 places where you are abstracting metadata, so I wasn't sure if that is a good idea.

Maybe a good balance would be just to open an issue for a TODO on additional guidance for cataloguing both raw data and virtual zarrs and mention that issue in this new section, so readers are aware it's not fully comprehensive.
