Skip to content

Add refresh_attributes() and implement cache_attrs for Group and Array #3215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

bojidar-bg
Copy link

Should resolve #3178, if one passes cache_attrs=False when creating the various groups.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 8, 2025
Copy link

codecov bot commented Jul 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.75%. Comparing base (378d5af) to head (075a330).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3215      +/-   ##
==========================================
+ Coverage   94.73%   94.75%   +0.01%     
==========================================
  Files          78       78              
  Lines        8646     8670      +24     
==========================================
+ Hits         8191     8215      +24     
  Misses        455      455              
Files with missing lines Coverage Δ
src/zarr/api/synchronous.py 94.66% <ø> (ø)
src/zarr/core/array.py 98.48% <100.00%> (+0.02%) ⬆️
src/zarr/core/group.py 94.94% <100.00%> (+0.06%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines +878 to +881
cache_attrs : bool, optional
If True (default), user attributes will be cached for attribute read
operations. If False, user attributes are reloaded from the store prior
to all attribute read operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this docstring says true is the default, but the parameter itself has a default of None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other places defining a cache_attrs argument did the same, so I decided to follow suit. (:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets break with the past and ensure that the docstring matches the annotation here

@d-v-b
Copy link
Contributor

d-v-b commented Jul 8, 2025

anyone have opinions on whether cache_attrs should have a default of True or False?

@TomAugspurger
Copy link
Contributor

Can someone write up a detailed proposal of the model here? The interaction between in-memory metadata objects and their serialized counterparts is a big topic.

If we're formally committing to something then we should do so carefully, and document it extensively.

@bojidar-bg
Copy link
Author

From my perspective of "just solving the issue", the crux of this PR is the refresh_attributes() function, and not the cache_attrs argument—I implemented cache_attrs only because it looked convenient to knock off another v3 missing feature. (:

As for a model... that would be a good idea, I agree.

If I understand the current main branch code correctly (from a cursory reading, mind it), the current model for caching data is that the store may cache data, but Array/Group never cache data themselves, only metadata. Metadata is cached when opening an array/group, and kept cached until you discard the respective object.
As such, array data and group items are never cached; meanwhile, group attributes and consolidated metadata, as well as array attributes, shapes, codecs, encoders, chunks sizes and such are always cached.
(Critically, since stores can never notify us of remote changes, writing to the same chunk/shard of data from multiple open instances of the same Array, or writing to the metadata of multiple open instances of the same Array or Group results in latest-write-wins behavior; which, with multiple peers resizing the same array, can result in various kinds of inconsistent data.)

This PR adds the possibility of manually (or automatically) reloading just the attributes of an array or group. Ideally, there would be support for reloading the whole metadata in case there might have been changes made by others—but for that, we would need to either make arrays/groups unfrozen, or guide users to re-open the whole array/group (and replace all references to the old version) in case there have been changes made by another process.

As for a better model:

  • Automatically watching for metadata changes seem infeasible. Perhaps support for it can be added in the store, but apart from the local and memory store, I don't think that can work for s3- and http-based stores.
  • Leaving things as-is and relying on users to manually re-open modified groups and arrays sounds like a footgun waiting to happen, especially with longer-running process which might keep a reference to an array around.
  • Adding a refresh_metadata method, with some ways to control metadata caching, seem plausible, but would require the AsyncArray() and AsyncGroup() classes to be unfrozen (so that they can replace their metadata on-the-fly).
  • Merging this PR as-is with its refresh_attributes method is probably also a footgun waiting to happen; as now not only would users have a way to slowly drift out of sync with their arrays' newest shapes, but they would also have a function to sync everything but the array's shape 😅

Thoughts?

@TomAugspurger
Copy link
Contributor

Very briefly, my understand of the current model is that there isn't any metadata caching at the Python Array / Group object level (Store instances are free to do what they want, and I think fsspec-based ones might cache things, but let's focus on the zarr-python objects)

Under that model, you can rely on your Group / Array metadata reflecting the state of the Store at the time that you created the instance. Methods that update the store in some way (creating, deleting, or updating) immediately synchronize with the Store. That's relatively simple to explain and understand.

IIUC, this attribute (or more generally, metadata) caching / refreshing is equivalent to users creating a new Group / Array instance. Should we steer users towards that? I hesitate to overcomplicate our API.

@bojidar-bg
Copy link
Author

bojidar-bg commented Jul 14, 2025

"Group / Array metadata reflecting the state of the Store at the time that you created the instance" is metadata being stored for the lifetime of the Group/Array instance, with no way to invalidate that stored metadata except by throwing away the whole Group / Array instance and recreating it anew. As "cached" data is data stored outside the main source of truth, especially for optimization purposes, I would claim that this metadata stored for the lifetime of the Group/Array instance is a caching mechanism at the Group / Array level.

Now, if Group / Array instances are short-lived, say, you create a group and immediately use it, there shouldn't be any harm coming from such caching. However, if users are likely to reuse and share Group / Array instances between multiple user-defined objects, then such caching can be rather unintuitive. As exemplified by the original issue (#3178), such behavior is confusing if you somehow end up with multiple long-lived instances of the same group. And, what's worse, currently to there's no indication that "creating a new Group / Array instance" is ever going to be necessary when you are just starting out, which can then tempt users to reuse the same Array instance across all their classes, making future recreation of that Array instance somewhat challenging.

--

Since Array already has e.g. an append method which modifies the Array object's metadata in-place, I think it's safe to say that the API is already incompatible with Array being an immutable object that stores metadata. Likewise with Group.attributes allowing direct modification of metadata.
As such, I feel like there are three generally-consistent positions one can take:

  • If metadata is part of the Array / Group, and both don't behave as short-lived immutable frozen pure-data objects, it makes sense to completely unfreeze them, and add methods with which to update the cached metadata.
  • If metadata is not part of the Array / Group, and both don't store anything past their Store and path, it makes sense to remove the metadata property from the Array/Group and push caching of metadata entirely to the store. (*Or: to an extra MetadataCache singleton of some sort, that stores and maintains a mutable mapping from Store/path to metadata)
  • If metadata is part of the Array / Group, yet both do behave as short-lived pure-data objects, it makes sense to adapt the API (likely turning it into a sort of fluent API) as well as the documentation and examples to always use method chaining instead of variables, and to treat Array / Group as short-lived utility objects which just expose a set of methods—with red capital letters around the risks of holding long-lived references to Array / Group, and maybe even warnings when the library can detect a long-lived reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Group attributes not shared across different group instances with identical store_path, path and name
3 participants