
Add chunks='auto' support for cftime datasets #10527


Draft · charles-turner-1 wants to merge 31 commits into main

Conversation

charles-turner-1

  • Closes #xxxx
  • Tests added


welcome bot commented Jul 13, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions bot added the topic-documentation and topic-NamedArray labels Jul 13, 2025
@charles-turner-1 changed the title from "All works, just need to satisfy mypy and whatnot now" to "Add chunks='auto' support for cftime datasets" Jul 13, 2025
@jemmajeffree
Contributor

Would these changes also work for cf timedeltas or are they going to still cause problems?
I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

@charles-turner-1
Author

> Would these changes also work for cf timedeltas or are they going to still cause problems? I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

If you can find something that's specifically a cftimedelta and run the _contains_cftime_datetimes function on it, that'd be super helpful: then we'd know whether it returns True or False.
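Something along these lines should do the check (a minimal sketch; _contains_cftime_datetimes is internal API, so the import path below is my best guess and may move between versions, and arr is a hypothetical variable pulled from one of the datastores):

from xarray.core.common import _contains_cftime_datetimes

# `arr` is a hypothetical DataArray from one of the intake datastores;
# the helper inspects the underlying duck array rather than the DataArray.
print(arr.dtype)  # expect dtype('O') for cftime (and possibly cf-timedelta) data
print(_contains_cftime_datetimes(arr.variable._data))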

@charles-turner-1 marked this pull request as draft July 14, 2025 05:02
@jemmajeffree
Contributor

TL;DR: don't mind me; it's not going to cause any issues.

Firstly, what I thought was a cftimedelta turned out to be a numpy timedelta hanging out with a cftime:
[screenshot]
When I did manage to coerce this timedelta into cftime conventions, it just contained a floating-point number of days, so I can't see anything having issues with its size:

import xarray as xr

# `oops` is the dataset from the screenshot above; average_DT is its
# timedelta variable. Encoding it recovers a numeric dtype.
coder = xr.coding.times.CFTimedeltaCoder()
result = coder.encode(oops.average_DT).load()
print(result.dtype)
result
[screenshot of the result]

@charles-turner-1
Author

I did some prodding around yesterday and I realised this won't let us do something like

import xarray as xr
cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto')

yet, only stuff along the lines of

import xarray as xr
cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto')

I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now so I'll have a decent crack.

@dcherian
Contributor

Why so? Are we sending "auto" into normalize_chunks first?

@charles-turner-1
Author

charles-turner-1 commented Jul 23, 2025

Yup, this is the call stack:

----> 3 xr.open_dataset(
      4     "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"
  /Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
    720     )
--> 721     ds = _dataset_from_backend_dataset(
    722         backend_ds,
  /Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
    417     if chunks is not None:
--> 418         ds = _chunk_ds(
    419             ds,
  /Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
    367     for name, var in backend_ds.variables.items():
--> 368         var_chunks = _get_chunk(var, chunks, chunkmanager)
    369         variables[name] = _maybe_chunk(
  /Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
    101 
--> 102     chunk_shape = chunkmanager.normalize_chunks(
    103         chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks()

I've fixed it in the latest commit, but I think the implementation leaves a lot to be desired too.

Should I refactor to move the changes in xarray/structure/chunks.py into the daskmanager module, if possible?

Once I've got the structure there cleaned up, I'll work on replacing the build_chunkspec function with something more sensible. I just need to work out how to cleanly extract the implementation from dask, I think; normalize_chunks also seems to calculate sensible chunk sizes.
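For context, dask's normalize_chunks is what expands "auto", using the dtype's itemsize together with the array.chunk-size config. A quick illustration (the chunk shape dask picks depends on your config, so the output isn't fixed):

import dask.array as da

# "auto" is resolved from dtype.itemsize and the array.chunk-size config.
# Object-dtype arrays (like cftime) have no meaningful itemsize, which is
# exactly the gap this PR is working around.
print(da.core.normalize_chunks("auto", shape=(10_000, 10_000), dtype="float64"))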


from xarray.namedarray.utils import build_chunkspec

target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
Contributor

How about adding get_auto_chunk_size to the ChunkManager class, and putting the dask-specific stuff in the DaskManager?

cc @TomNicholas

@dcherian
Contributor

dcherian commented Jul 23, 2025

I guess one bit that's confusing here is that the code-path for backends and normal variables is different?

So let's add a test that reads from disk, and one that works with a DataArray constructed in memory.

cubed.Array.rechunk
"""

if _contains_cftime_datetimes(data):
Contributor

I guess this can be deleted

Author

Had a play, and I don't think I can fully get rid of it; I've reused as much of the abstracted logic as possible, though.

@@ -195,6 +198,30 @@ def either_dict_or_kwargs(
    return pos_kwargs


def build_chunkspec(
    data: T_ChunkedArray,
Contributor

should be "duck array"

chunk_shape = chunkmanager.normalize_chunks(
    chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
)
if _contains_cftime_datetimes(var):
Contributor

Suggested change
- if _contains_cftime_datetimes(var):
+ if _contains_cftime_datetimes(var) and chunks == "auto":

Comment on lines 98 to 107

    chunk_shape = chunkmanager.normalize_chunks(
        chunk_shape, shape=shape, previous_chunks=preferred_chunk_shape
    )
else:
    chunk_shape = chunkmanager.normalize_chunks(
        chunk_shape,
        shape=shape,
        dtype=var.dtype,
        previous_chunks=preferred_chunk_shape,
    )
Contributor

Suggested change
-     chunk_shape = chunkmanager.normalize_chunks(
-         chunk_shape, shape=shape, previous_chunks=preferred_chunk_shape
-     )
- else:
-     chunk_shape = chunkmanager.normalize_chunks(
-         chunk_shape,
-         shape=shape,
-         dtype=var.dtype,
-         previous_chunks=preferred_chunk_shape,
-     )
+ chunk_shape = chunkmanager.normalize_chunks(
+     chunk_shape,
+     shape=shape,
+     dtype=var.dtype,
+     previous_chunks=preferred_chunk_shape,
+ )

Author

There's no dtype=var.dtype in the if _contains_cftime_datetimes(var) clause. We could do this:

if _contains_cftime_datetimes(var):
    ...
    chunk_shape = build_chunkspec(...)
    var_dtype = None
else:
    var_dtype = var.dtype

chunk_shape = chunkmanager.normalize_chunks(
    chunk_shape,
    shape=shape,
    dtype=var_dtype,
    previous_chunks=preferred_chunk_shape,
)

which seems cleaner than what I've currently got?

Author

Ignore that; I've changed how this works to let us use dask's native chunk normalization, by computing the ratio of a cftime object's size to an np.float64 and then adjusting our byte limit by that ratio so we get correctly sized chunks.
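For the record, the trick looks roughly like this (a minimal sketch of fake_target_chunksize, assuming it estimates per-object size with sys.getsizeof; the actual helper in this branch may differ):

import sys

import numpy as np


def fake_target_chunksize(var, target_chunksize):
    # Estimate the in-memory size of one cftime object from a sample element.
    sample = np.asarray(var.data).ravel()[0]
    approx_itemsize = sys.getsizeof(sample)

    # Scale the byte limit down by the ratio to float64's 8-byte itemsize,
    # so that normalize_chunks, told the dtype is float64, yields chunks
    # whose true object-dtype footprint matches the configured chunk size.
    ratio = approx_itemsize / np.dtype("float64").itemsize
    return int(target_chunksize / ratio), np.dtype("float64")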

@@ -5427,6 +5427,35 @@ def test_open_multi_dataset(self) -> None:
        ) as actual:
            assert_identical(expected, actual)

    def test_open_dataset_cftime_autochunk(self) -> None:
Contributor
@dcherian Jul 25, 2025

This will fix our min-deps tests

Suggested change
- def test_open_dataset_cftime_autochunk(self) -> None:
+ @requires_cftime
+ def test_open_dataset_cftime_autochunk(self) -> None:

Comment on lines 5452 to 5454
with create_tmp_file() as tmp:
    original.to_netcdf(tmp)
    with open_dataset(tmp, chunks="auto") as actual:
Contributor

Suggested change
- with create_tmp_file() as tmp:
-     original.to_netcdf(tmp)
-     with open_dataset(tmp, chunks="auto") as actual:
+ with self.roundtrip(original, open_kwargs={"chunks": "auto"}) as actual:
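Putting the two suggestions together, the test might end up shaped something like this (a sketch only: the time range, calendar, and assertion are my assumptions; roundtrip and requires_cftime follow the existing test-suite conventions):

import numpy as np
import xarray as xr
from xarray.tests import requires_cftime

@requires_cftime
def test_open_dataset_cftime_autochunk(self) -> None:
    # Build a small dataset whose time coordinate is object-dtype cftime.
    times = xr.date_range(
        "2000-01-01", periods=48, calendar="noleap", use_cftime=True
    )
    original = xr.Dataset(
        {"foo": ("time", np.arange(48, dtype="float64"))},
        coords={"time": times},
    )
    with self.roundtrip(original, open_kwargs={"chunks": "auto"}) as actual:
        # chunks="auto" should now give a chunked (dask-backed) variable.
        assert actual.foo.chunks is not None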

# at this point, so check for this before we manually construct our
# chunk spec: have we set chunks to "auto"?
_chunks = list(chunks.values()) if is_dict_like(chunks) else chunks
auto_chunks = all(_chunk == "auto" for _chunk in _chunks)
Contributor

I think technically a subset of this tuple can be "auto" but we can ignore this wrinkle for now.

Comment on lines 323 to 330
def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
    from dask import config as dask_config
    from dask.utils import parse_bytes

    from xarray.namedarray.utils import fake_target_chunksize

    target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
    return fake_target_chunksize(var, target_chunksize=target_chunksize)
Contributor

Suggested change
- def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
-     from dask import config as dask_config
-     from dask.utils import parse_bytes
-
-     from xarray.namedarray.utils import fake_target_chunksize
-
-     target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
-     return fake_target_chunksize(var, target_chunksize=target_chunksize)
+ def get_auto_chunk_size(self) -> int:
+     from dask import config as dask_config
+     from dask.utils import parse_bytes
+
+     return parse_bytes(dask_config.get("array.chunk-size"))

Only this much is dask-specific, so that's what the DaskManager should be responsible for.

Comment on lines 93 to 96
if _contains_cftime_datetimes(var) and auto_chunks:
    limit, var_dtype = chunkmanager.get_auto_chunk_size(var)
else:
    limit, var_dtype = None, var.dtype
Contributor

This logic would change to use fake_target_chunksize directly, roughly as sketched below.
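A sketch of that call site after the refactor (the exact wiring is an assumption; get_auto_chunk_size here is the slimmed-down, dask-only version suggested above):

from xarray.namedarray.utils import fake_target_chunksize

if _contains_cftime_datetimes(var) and auto_chunks:
    # The chunkmanager only supplies the backend's byte limit...
    target = chunkmanager.get_auto_chunk_size()
    # ...while the dtype-faking lives in backend-agnostic xarray code.
    limit, var_dtype = fake_target_chunksize(var, target_chunksize=target)
else:
    limit, var_dtype = None, var.dtype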

@charles-turner-1
Author

I think most of the work left to do on this is just fixing the typing now...

Labels
topic-documentation, topic-NamedArray