-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add chunks='auto' support for cftime datasets #10527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
charles-turner-1
commented
Jul 13, 2025
- Closes #xxxx
- Tests added
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Would these changes also work for cf timedeltas or are they going to still cause problems? |
If you can find something thats specifically a cftimedelta and run the |
…1/xarray into autochunk-cftime
…pect disk chunks sensibly & this should be ready to go, I think
I did some prodding around yesterday and I realised this won't let us do something like import xarray as xr
cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto') yet, only stuff along the lines of import xarray as xr
cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto') I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now so I'll have a decent crack. |
Why so? Are we sending |
Yup, this is the call stack: ----> 3 xr.open_dataset(
4 "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"
/Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
720 )
--> 721 ds = _dataset_from_backend_dataset(
722 backend_ds,
/Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
417 if chunks is not None:
--> 418 ds = _chunk_ds(
419 ds,
/Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
367 for name, var in backend_ds.variables.items():
--> 368 var_chunks = _get_chunk(var, chunks, chunkmanager)
369 variables[name] = _maybe_chunk(
/Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
101
--> 102 chunk_shape = chunkmanager.normalize_chunks(
103 chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks() I've fixed it in the latest commit - but I think the implementation leaves a lot to be desired too. Do I want to refactor to move the changes in Once I've got the structure there cleaned up, I'll work on replacing the |
xarray/structure/chunks.py
Outdated
|
||
from xarray.namedarray.utils import build_chunkspec | ||
|
||
target_chunksize = parse_bytes(dask_config.get("array.chunk-size")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding get_auto_chunk_size
to the ChunkManager class; and put the dask-specific stuff in the DaskManager.
cc @TomNicholas
I guess one bit that's confusing here is that the code-path for backends and normal variables is different? So let's add a test that reads form disk; and one that works iwth a DataArray constructed in memory. |
cubed.Array.rechunk | ||
""" | ||
|
||
if _contains_cftime_datetimes(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this can be deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a play and I don't think I can fully get rid of it, I've reused as much of the abstracted logic as possible though.
xarray/namedarray/utils.py
Outdated
@@ -195,6 +198,30 @@ def either_dict_or_kwargs( | |||
return pos_kwargs | |||
|
|||
|
|||
def build_chunkspec( | |||
data: T_ChunkedArray, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be "duck array"
xarray/structure/chunks.py
Outdated
chunk_shape = chunkmanager.normalize_chunks( | ||
chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape | ||
) | ||
if _contains_cftime_datetimes(var): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _contains_cftime_datetimes(var): | |
if _contains_cftime_datetimes(var) and chunks == "auto": |
xarray/structure/chunks.py
Outdated
chunk_shape = chunkmanager.normalize_chunks( | ||
chunk_shape, shape=shape, previous_chunks=preferred_chunk_shape | ||
) | ||
else: | ||
chunk_shape = chunkmanager.normalize_chunks( | ||
chunk_shape, | ||
shape=shape, | ||
dtype=var.dtype, | ||
previous_chunks=preferred_chunk_shape, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunk_shape = chunkmanager.normalize_chunks( | |
chunk_shape, shape=shape, previous_chunks=preferred_chunk_shape | |
) | |
else: | |
chunk_shape = chunkmanager.normalize_chunks( | |
chunk_shape, | |
shape=shape, | |
dtype=var.dtype, | |
previous_chunks=preferred_chunk_shape, | |
) | |
chunk_shape = chunkmanager.normalize_chunks( | |
chunk_shape, | |
shape=shape, | |
dtype=var.dtype, | |
previous_chunks=preferred_chunk_shape, | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no dtype=var.dtype
in the if _contains_cftime_datetimes(var)
clause. We could do this:
if _contains_cftime_datetimes(var):
...
chunk_shape = build_chunkspec(...)
var_dtype = None
else:
var_dtype = var.dtype
chunk_shape = chunkmanager.normalize_chunks(
chunk_shape,
shape=shape,
dtype=var_dtype,
previous_chunks=preferred_chunk_shape,
)
which seems cleaner than what I've currently got?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore that, I've changed how this works to allow us to use the dask native chunk normalization - by computing a ratio of sizes to a np.float64
and then adjusting our limit by that so we get the correct size chunks.
…djusting target chunksize so we get sensible chunks using dask's default chunking strategy
for more information, see https://pre-commit.ci
…1/xarray into autochunk-cftime
…1/xarray into autochunk-cftime
…re comprehensive tests, etc & Shut mypy up
for more information, see https://pre-commit.ci
…1/xarray into autochunk-cftime
@@ -5427,6 +5427,35 @@ def test_open_multi_dataset(self) -> None: | |||
) as actual: | |||
assert_identical(expected, actual) | |||
|
|||
def test_open_dataset_cftime_autochunk(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fix our min-deps tests
def test_open_dataset_cftime_autochunk(self) -> None: | |
@requires_cftime | |
def test_open_dataset_cftime_autochunk(self) -> None: |
xarray/tests/test_backends.py
Outdated
with create_tmp_file() as tmp: | ||
original.to_netcdf(tmp) | ||
with open_dataset(tmp, chunks="auto") as actual: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with create_tmp_file() as tmp: | |
original.to_netcdf(tmp) | |
with open_dataset(tmp, chunks="auto") as actual: | |
with self.roundtrip(original, open_kwargs={"chunks": "auto"}) as actual: |
xarray/structure/chunks.py
Outdated
# at this point, so check for # this before we manually construct our chunk | ||
# spec- if we've set chunks to auto | ||
_chunks = list(chunks.values()) if is_dict_like(chunks) else chunks | ||
auto_chunks = all(_chunk == "auto" for _chunk in _chunks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think technically a subset of this tuple can be "auto" but we can ignore this wrinkle for now.
xarray/namedarray/daskmanager.py
Outdated
def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]: | ||
from dask import config as dask_config | ||
from dask.utils import parse_bytes | ||
|
||
from xarray.namedarray.utils import fake_target_chunksize | ||
|
||
target_chunksize = parse_bytes(dask_config.get("array.chunk-size")) | ||
return fake_target_chunksize(var, target_chunksize=target_chunksize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]: | |
from dask import config as dask_config | |
from dask.utils import parse_bytes | |
from xarray.namedarray.utils import fake_target_chunksize | |
target_chunksize = parse_bytes(dask_config.get("array.chunk-size")) | |
return fake_target_chunksize(var, target_chunksize=target_chunksize) | |
def get_auto_chunk_size(self) -> int: | |
from dask import config as dask_config | |
from dask.utils import parse_bytes | |
return parse_bytes(dask_config.get("array.chunk-size")) |
Only this much is dask-specific, so that's what the DaskManager should be responsible for.
xarray/structure/chunks.py
Outdated
if _contains_cftime_datetimes(var) and auto_chunks: | ||
limit, var_dtype = chunkmanager.get_auto_chunk_size(var) | ||
else: | ||
limit, var_dtype = None, var.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic would change to use fake_target_chunksize
for more information, see https://pre-commit.ci
…1/xarray into autochunk-cftime
I think most of the work left to do on this is just fixing the typing now... |