Preventing Deadlocks When Reading Metadata Concurrently via asyncio.gather
#3207
Conversation
Good detective work here! I think the ideal solution would keep store implementation details confined to the store classes themselves. So instead of the solution here, what if we override the `get_many` method on the fsspec store to include the logic you have added here, and then use that method instead of multiple gets?
Very good point! Perhaps something along these lines?

```python
class StorePath:
    # ...
    async def _is_concurrency_save(self):
        fs = getattr(self.store, "fs", None)
        return getattr(fs, "asynchronous", True)

    async def get_many(
        self,
        *suffixes: str,
        prototype: BufferPrototype | None = None,
        byte_range: ByteRequest | None = None,
    ):
        tasks = [
            (self / suffix).get(prototype=prototype, byte_range=byte_range) for suffix in suffixes
        ]
        if await self._is_concurrency_save():
            return await gather(*tasks)
        else:
            results = []
            for task in tasks:
                result = await task
                results.append(result)
            return results


class FsspecStore:
    # ...
    async def _get_many(
        self, requests: Iterable[tuple[str, BufferPrototype, ByteRequest | None]]
    ) -> AsyncGenerator[tuple[str, Buffer | None], None]:
        if getattr(self.fs, "asynchronous", True):
            async for result in super()._get_many(requests=requests):
                yield result
        else:
            for key, prototype, byte_range in requests:
                value = await self.get(key, prototype, byte_range)
                yield (key, value)
```
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #3207      +/-   ##
==========================================
+ Coverage   94.62%   94.64%   +0.01%
==========================================
  Files          78       78
  Lines        8696     8718      +22
==========================================
+ Hits         8229     8251      +22
  Misses        467      467
```
src/zarr/storage/_common.py
Outdated
```python
async def _is_concurrency_save(self):
    fs = getattr(self.store, "fs", None)
    return getattr(fs, "asynchronous", True)
```
this should be a method specific to the fsspec store. the fsspec store should call it inside `_get_many` in order to choose which implementation to use.
Done!
src/zarr/core/group.py
Outdated
```python
(_join_paths([path, ZARRAY_JSON]), default_buffer_prototype(), None),
(_join_paths([path, ZGROUP_JSON]), default_buffer_prototype(), None),
(_join_paths([path, ZATTRS_JSON]), default_buffer_prototype(), None),
```
let's bind `_join_paths([path, X])` to a variable so that we don't call the `_join_paths` function so many times. for example:

```python
zarray_path = _join_paths([path, ZARRAY_JSON])
...
```
- This pull request resolves the issue of deadlocks and indefinite hangs when opening Zarr v3 arrays on synchronous fsspec filesystems, by implementing a fallback to sequential reads for non-concurrency-safe filesystems, ensuring robust metadata retrieval without sacrificing performance for safe filesystems. Furthermore, `Store.get_many` was modified to retrieve objects concurrently from storage. The previous implementation was sequential, awaiting each `self.get(*req)` before proceeding, contrary to the docstring.
- Introduced `Store.get_many_ordered` and `StorePath.get_many_ordered` to retrieve multiple metadata files in a single call, optimizing the retrieval process and reducing overhead. `StorePath.get_many_ordered` is used in `get_array_metadata`; `Store._get_many_ordered` is used in `_read_metadata_v2`.
- Modified `FsspecStore._get_many` and `FsspecStore._get_many_ordered` to conditionally use `asyncio.gather` based on the concurrency safety of the underlying file system, enhancing compatibility with synchronous file systems by avoiding deadlocks when accessing metadata concurrently.
- Added tests using a `LockableFileSystem` to exercise both async and sync behavior.
Thanks to the feedback I received and some additional testing, I have substantially improved my initial solution. I look forward to hearing your thoughts on it! :)

PS: I'm also not sure why two of the tests failed, because they don't seem to relate to any of the changes I've made.
```python
async def _get_many_ordered(
    self, requests: Iterable[tuple[str, BufferPrototype, ByteRequest | None]]
) -> tuple[Buffer | None, ...]:
    """
    Retrieve a collection of objects from storage in the order they were requested.
    """
    tasks = [self.get(*req) for req in requests]
```
I'm not sure I see the use for this method. If store users want fetches to happen in a specific order, then users can call `get` in a loop. If users only want the results of their fetches to be ordered, they can re-order the results after receiving them.
I introduced the method `_get_many_ordered` because it is the only use case in the current implementation. Not using it leads to repetitive boilerplate code, because we would always find ourselves awaiting all fetches and then sorting the results afterwards. Using `_get_many_ordered` encapsulates this logic, resulting in code that is less verbose, less repetitive, and easier to read.

Example

Compare, for example,
```python
zarray_bytes, zgroup_bytes, zattrs_bytes = await store._get_many_ordered(
    [
        (_join_paths([path, ZARRAY_JSON]), default_buffer_prototype(), None),
        (_join_paths([path, ZGROUP_JSON]), default_buffer_prototype(), None),
        (_join_paths([path, ZATTRS_JSON]), default_buffer_prototype(), None),
    ]
)
```
with
```python
ordered_keys = [
    (_join_paths([path, ZARRAY_JSON]), default_buffer_prototype(), None),
    (_join_paths([path, ZGROUP_JSON]), default_buffer_prototype(), None),
    (_join_paths([path, ZATTRS_JSON]), default_buffer_prototype(), None),
]

retrieved_objects = {}
async for key, value in store._get_many(ordered_keys):
    retrieved_objects[key] = value

zarray_bytes, zgroup_bytes, zattrs_bytes = tuple(
    retrieved_objects.get(key[0]) for key in ordered_keys
)
```
The first block can be read much more easily.
At the same time, we cannot use the original

```python
zarray_bytes, zgroup_bytes, zattrs_bytes = await asyncio.gather(
    store.get(_join_paths([path, ZARRAY_JSON]), prototype=default_buffer_prototype()),
    store.get(_join_paths([path, ZGROUP_JSON]), prototype=default_buffer_prototype()),
    store.get(_join_paths([path, ZATTRS_JSON]), prototype=default_buffer_prototype()),
)
```

as this would lead to deadlocks in synchronous file systems, and we of course also don't want to call `.get` sequentially, because that reduces performance in asynchronous systems.
> .get sequentially because it reduces performance in asynchronous systems.

`get` is async. calling it returns a coroutine. these can be scheduled together with `asyncio.gather`, which will preserve order. So calling `get` sequentially is not a performance problem.
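A minimal sketch of the pattern being described, assuming a store object with an async `get` (the `fetch_ordered` helper and its arguments are illustrative only, not part of this PR):

```python
import asyncio

async def fetch_ordered(store, keys, prototype):
    # Calling get() only creates coroutines; nothing is awaited yet.
    coros = [store.get(key, prototype=prototype) for key in keys]
    # gather() schedules them together and returns results in the input order.
    return await asyncio.gather(*coros)
```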
Of course, but it is a problem if we were to await them sequentially, e.g. if we instead used

```python
zarray_bytes = await store.get(_join_paths([path, ZARRAY_JSON]), prototype=default_buffer_prototype())
zgroup_bytes = await store.get(_join_paths([path, ZGROUP_JSON]), prototype=default_buffer_prototype())
zattrs_bytes = await store.get(_join_paths([path, ZATTRS_JSON]), prototype=default_buffer_prototype())
```

This would ensure that synchronous file systems don't run into deadlocks, but it would not be a good alternative for asynchronous systems.
i'm confused, aren't you doing all this exactly because the fsspec sftp backend is not async? so then sequential awaiting (inside the logic of `_get_many`) is exactly what we expect to happen, no?
Yes, but in the general case the implementation should be asynchronous. It must be implemented, however, in such a way that we can make it synchronous by overriding store methods in the FsspecStore, making what is generally asynchronous synchronous if `store.fs.asynchronous == False`.

Note also that, originally, we were not using `gather` wrapping multiple `get` statements, so this was not possible. And a general sequential solution is not desirable in an I/O-limited system.
```python
async def _get_with_name(
    key: str, prototype: BufferPrototype, byte_range: ByteRequest | None
) -> tuple[str, Buffer | None]:
    value = await self.get(key, prototype, byte_range)
    return key, value

tasks = [_get_with_name(*req) for req in requests]
for completed in as_completed(tasks):
    task = await completed
    yield task
```
what is the advantage of this new implementation? the previous implementation was extremely simple, which I think is good for an abc.
The claim in the docstring is incorrect given the previous implementation.
This loop is sequential: it awaits each self.get(*req) and yields it before moving on to the next. Each request is handled one at a time, in the exact order provided. Therefore, results are always yielded in the same order as the input requests.
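For reference, a sketch of the kind of loop being described (not a verbatim copy of the previous implementation):

```python
async def _get_many(self, requests):
    for req in requests:
        # Each get is awaited before the next one starts: sequential,
        # and results are yielded in the input order.
        yield (req[0], await self.get(*req))
```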
It is thus not fully concurrent, which would be desirable in an I/O-limited system, and, at least as I understand it, kind of defeats the purpose of having an asynchronous `_get_many` method yielding results in the first place. Because if we stick to the order, we might as well await all results and simply replace the implementation of `_get_many` with that of `_get_many_ordered`, making it faster and arguably easier to use in the asynchronous case. If we want to give the extra flexibility of not awaiting all at once, but still requesting all at the same time, the new implementation would be the right one.
the point of the default implementation is to be the simplest possible implementation of `_get_many` that any child class can safely support, given an implementation of `get`. But child classes should also be able to override this with more efficient methods where applicable, and in these cases the order of results is not guaranteed. hence the type annotation in the original method.
I find this somewhat confusing, as I would have expected the standard implementation to be fully asynchronous. However, if the goal is to maximize simplicity, then having an asynchronous implementation that runs synchronously might be the way to go.

That being said, if we revert this to the original, we would only have to also remove the `FsspecStore._get_many` from my current solution. Unless you think we should not have a `_get_many_ordered` method and should use the `_get_many` method instead and then always sort the values locally, as they could arrive in a different order in other implementations.
As described in #3196, I encountered issues opening Zarr v3 arrays stored over SFTP using `fsspec`. Specifically, Python would freeze when opening Zarr arrays.

Root Cause
The issue stems from the use of `asyncio.gather` in `zarr.core.array.get_array_metadata`, which attempts to read multiple metadata files (e.g., `.zarray`, `.zattrs`, `zarr.json`) concurrently. This works well for truly asynchronous filesystems, but breaks when using systems like `SFTPFileSystem`, which does not seem to be concurrency-safe in async contexts (potentially relying on blocking I/O internally or managing connection state using global locks), leading to deadlocks or indefinite hangs when `asyncio.gather` is used to perform multiple reads simultaneously.

Solution
To address this, I've implemented a fallback to sequential reads for filesystems that are not concurrency-safe. The logic is as follows: for non-asynchronous file systems, the user sets `store.fs.asynchronous=False`. The helper function `is_concurrency_safe(store_path: StorePath) -> bool` checks this via `getattr(fs, "asynchronous", True)`. If True, `asyncio.gather` is used; otherwise we fall back to sequential `await`. This preserves the performance benefit of concurrent reads for safe filesystems (e.g., local disk, S3, GCS), while preventing deadlocks and improving robustness when using backends like SFTP.
These changes may not address all scenarios where non-asynchronous file systems could cause issues, as there are several other instances of `asyncio.gather` in `zarr.core.array` and `zarr.core.group`. However, I opted to focus on this specific problem first, as enabling the opening of arrays and groups is likely the highest priority, and I wanted to discuss this approach before making too many changes.

I look forward to hearing your thoughts and seeing this issue resolved!
TODO:
- docs/user-guide/*.rst
- changes/