
Add chunks='auto' support for cftime datasets #10527


Draft · charles-turner-1 wants to merge 31 commits into main

Conversation

charles-turner-1

  • Closes #xxxx
  • Tests added


welcome bot commented Jul 13, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions bot added the topic-documentation and topic-NamedArray labels Jul 13, 2025
@charles-turner-1 changed the title from "All works, just need to satisfy mypy and whatnot now" to "Add chunks='auto' support for cftime datasets" Jul 13, 2025
@jemmajeffree
Contributor

Would these changes also work for cf timedeltas or are they going to still cause problems?
I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

@charles-turner-1
Author

> Would these changes also work for cf timedeltas or are they going to still cause problems? I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

If you can find something that's specifically a cftimedelta and run the _contains_cftime_datetimes function on it, that'd be super helpful: then we'd know whether it returns True or False.
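Something along these lines should do the check (a minimal sketch; _contains_cftime_datetimes is internal API, so the import path below is my best guess and may move between versions, and arr is a hypothetical variable pulled from one of the datastores):

from xarray.core.common import _contains_cftime_datetimes

# `arr` is a hypothetical DataArray from one of the intake datastores;
# the helper inspects the underlying duck array rather than the DataArray.
print(arr.dtype)  # expect dtype('O') for cftime (and possibly cf-timedelta) data
print(_contains_cftime_datetimes(arr.variable._data))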

@charles-turner-1 marked this pull request as draft July 14, 2025 05:02
@jemmajeffree
Contributor

TL;DR: don't mind me; it's not going to cause any issues.

Firstly, what I thought was a cftimedelta turned out to be a numpy timedelta hanging out with a cftime:
[screenshot]
When I did manage to coerce this timedelta into cftime conventions, it just contained a floating-point number of days, so I can't see anything having issues with its size:

import xarray as xr

# `oops` is the dataset from the screenshot above; average_DT is its
# timedelta variable. Encoding it recovers a numeric dtype.
coder = xr.coding.times.CFTimedeltaCoder()
result = coder.encode(oops.average_DT).load()
print(result.dtype)
result
[screenshot of the result]

@charles-turner-1
Author

I did some prodding around yesterday and I realised this won't let us do something like

import xarray as xr
cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto')

yet, only stuff along the lines of

import xarray as xr
cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto')

I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now so I'll have a decent crack.

@dcherian
Contributor

Why so? Are we sending "auto" into normalize_chunks first?

@charles-turner-1
Author

charles-turner-1 commented Jul 23, 2025

Yup, this is the call stack:

----> 3 xr.open_dataset(
      4     "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"
  /Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
    720     )
--> 721     ds = _dataset_from_backend_dataset(
    722         backend_ds,
  /Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
    417     if chunks is not None:
--> 418         ds = _chunk_ds(
    419             ds,
  /Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
    367     for name, var in backend_ds.variables.items():
--> 368         var_chunks = _get_chunk(var, chunks, chunkmanager)
    369         variables[name] = _maybe_chunk(
  /Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
    101 
--> 102     chunk_shape = chunkmanager.normalize_chunks(
    103         chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks()

I've fixed it in the latest commit, but I think the implementation leaves a lot to be desired too.

Should I refactor to move the changes in xarray/structure/chunks.py into the daskmanager module, if possible?

Once I've got the structure there cleaned up, I'll work on replacing the build_chunkspec function with something more sensible. I just need to work out how to cleanly extract the implementation from dask, I think; normalize_chunks also seems to calculate sensible chunk sizes.
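For context, dask's normalize_chunks is what expands "auto", using the dtype's itemsize together with the array.chunk-size config. A quick illustration (the chunk shape dask picks depends on your config, so the output isn't fixed):

import dask.array as da

# "auto" is resolved from dtype.itemsize and the array.chunk-size config.
# Object-dtype arrays (like cftime) have no meaningful itemsize, which is
# exactly the gap this PR is working around.
print(da.core.normalize_chunks("auto", shape=(10_000, 10_000), dtype="float64"))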


from xarray.namedarray.utils import build_chunkspec

target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
Contributor

How about adding get_auto_chunk_size to the ChunkManager class, and putting the dask-specific stuff in the DaskManager?

cc @TomNicholas

@dcherian
Contributor

dcherian commented Jul 23, 2025

I guess one bit that's confusing here is that the code-path for backends and normal variables is different?

So let's add a test that reads from disk, and one that works with a DataArray constructed in memory.

cubed.Array.rechunk
"""

if _contains_cftime_datetimes(data):
Contributor

I guess this can be deleted

Author

Had a play, and I don't think I can fully get rid of it; I've reused as much of the abstracted logic as possible, though.

@@ -195,6 +198,30 @@ def either_dict_or_kwargs(
    return pos_kwargs


def build_chunkspec(
    data: T_ChunkedArray,
Contributor

should be "duck array"

chunk_shape = chunkmanager.normalize_chunks(
    chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
)
if _contains_cftime_datetimes(var):
Contributor

Suggested change
- if _contains_cftime_datetimes(var):
+ if _contains_cftime_datetimes(var) and chunks == "auto":

Comment on lines 98 to 107

    chunk_shape = chunkmanager.normalize_chunks(
        chunk_shape, shape=shape, previous_chunks=preferred_chunk_shape
    )
else:
    chunk_shape = chunkmanager.normalize_chunks(
        chunk_shape,
        shape=shape,
        dtype=var.dtype,
        previous_chunks=preferred_chunk_shape,
    )
Contributor

Suggested change
-     chunk_shape = chunkmanager.normalize_chunks(
-         chunk_shape, shape=shape, previous_chunks=preferred_chunk_shape
-     )
- else:
-     chunk_shape = chunkmanager.normalize_chunks(
-         chunk_shape,
-         shape=shape,
-         dtype=var.dtype,
-         previous_chunks=preferred_chunk_shape,
-     )
+ chunk_shape = chunkmanager.normalize_chunks(
+     chunk_shape,
+     shape=shape,
+     dtype=var.dtype,
+     previous_chunks=preferred_chunk_shape,
+ )

Author

There's no dtype=var.dtype in the if _contains_cftime_datetimes(var) clause. We could do this:

if _contains_cftime_datetimes(var):
    ...
    chunk_shape = build_chunkspec(...)
    var_dtype = None
else:
    var_dtype = var.dtype

chunk_shape = chunkmanager.normalize_chunks(
    chunk_shape,
    shape=shape,
    dtype=var_dtype,
    previous_chunks=preferred_chunk_shape,
)

which seems cleaner than what I've currently got?

Author

Ignore that; I've changed how this works to let us use dask's native chunk normalization, by computing the ratio of a cftime object's size to an np.float64 and then adjusting our byte limit by that ratio so we get correctly sized chunks.
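For the record, the trick looks roughly like this (a minimal sketch of fake_target_chunksize, assuming it estimates per-object size with sys.getsizeof; the actual helper in this branch may differ):

import sys

import numpy as np


def fake_target_chunksize(var, target_chunksize):
    # Estimate the in-memory size of one cftime object from a sample element.
    sample = np.asarray(var.data).ravel()[0]
    approx_itemsize = sys.getsizeof(sample)

    # Scale the byte limit down by the ratio to float64's 8-byte itemsize,
    # so that normalize_chunks, told the dtype is float64, yields chunks
    # whose true object-dtype footprint matches the configured chunk size.
    ratio = approx_itemsize / np.dtype("float64").itemsize
    return int(target_chunksize / ratio), np.dtype("float64")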

@@ -5427,6 +5427,35 @@ def test_open_multi_dataset(self) -> None:
        ) as actual:
            assert_identical(expected, actual)

    def test_open_dataset_cftime_autochunk(self) -> None:
Contributor
@dcherian Jul 25, 2025

This will fix our min-deps tests

Suggested change
- def test_open_dataset_cftime_autochunk(self) -> None:
+ @requires_cftime
+ def test_open_dataset_cftime_autochunk(self) -> None:

Comment on lines 5452 to 5454
with create_tmp_file() as tmp:
    original.to_netcdf(tmp)
    with open_dataset(tmp, chunks="auto") as actual:
Contributor

Suggested change
- with create_tmp_file() as tmp:
-     original.to_netcdf(tmp)
-     with open_dataset(tmp, chunks="auto") as actual:
+ with self.roundtrip(original, open_kwargs={"chunks": "auto"}) as actual:
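Putting the two suggestions together, the test might end up shaped something like this (a sketch only: the time range, calendar, and assertion are my assumptions; roundtrip and requires_cftime follow the existing test-suite conventions):

import numpy as np
import xarray as xr
from xarray.tests import requires_cftime

@requires_cftime
def test_open_dataset_cftime_autochunk(self) -> None:
    # Build a small dataset whose time coordinate is object-dtype cftime.
    times = xr.date_range(
        "2000-01-01", periods=48, calendar="noleap", use_cftime=True
    )
    original = xr.Dataset(
        {"foo": ("time", np.arange(48, dtype="float64"))},
        coords={"time": times},
    )
    with self.roundtrip(original, open_kwargs={"chunks": "auto"}) as actual:
        # chunks="auto" should now give a chunked (dask-backed) variable.
        assert actual.foo.chunks is not None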

# at this point, so check for this before we manually construct our
# chunk spec: have we set chunks to "auto"?
_chunks = list(chunks.values()) if is_dict_like(chunks) else chunks
auto_chunks = all(_chunk == "auto" for _chunk in _chunks)
Contributor

I think technically a subset of this tuple can be "auto" but we can ignore this wrinkle for now.

Comment on lines 323 to 330
def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
    from dask import config as dask_config
    from dask.utils import parse_bytes

    from xarray.namedarray.utils import fake_target_chunksize

    target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
    return fake_target_chunksize(var, target_chunksize=target_chunksize)
Contributor

Suggested change
- def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
-     from dask import config as dask_config
-     from dask.utils import parse_bytes
-
-     from xarray.namedarray.utils import fake_target_chunksize
-
-     target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
-     return fake_target_chunksize(var, target_chunksize=target_chunksize)
+ def get_auto_chunk_size(self) -> int:
+     from dask import config as dask_config
+     from dask.utils import parse_bytes
+
+     return parse_bytes(dask_config.get("array.chunk-size"))

Only this much is dask-specific, so that's what the DaskManager should be responsible for.

Comment on lines 93 to 96
if _contains_cftime_datetimes(var) and auto_chunks:
    limit, var_dtype = chunkmanager.get_auto_chunk_size(var)
else:
    limit, var_dtype = None, var.dtype
Contributor

This logic would change to use fake_target_chunksize directly, roughly as sketched below.
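A sketch of that call site after the refactor (the exact wiring is an assumption; get_auto_chunk_size here is the slimmed-down, dask-only version suggested above):

from xarray.namedarray.utils import fake_target_chunksize

if _contains_cftime_datetimes(var) and auto_chunks:
    # The chunkmanager only supplies the backend's byte limit...
    target = chunkmanager.get_auto_chunk_size()
    # ...while the dtype-faking lives in backend-agnostic xarray code.
    limit, var_dtype = fake_target_chunksize(var, target_chunksize=target)
else:
    limit, var_dtype = None, var.dtype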

@charles-turner-1
Author

I think most of the work left to do on this is just fixing the typing now...

Labels
topic-documentation, topic-NamedArray