Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #707   +/-   ##
=======================================
  Coverage   99.96%   99.96%
=======================================
  Files          63       63
  Lines        2712     2726   +14
=======================================
+ Hits         2711     2725   +14
  Misses          1        1
|
Before
In [1]: import numcodecs
In [2]: numcodecs.__version__
Out[2]: '0.16.1'
In [3]: codec = numcodecs.Zstd()
In [4]: bytes_val = b'(\xb5/\xfd\x00Xa\x00\x00Hello World!'
In [5]: codec.decode(bytes_val)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 codec.decode(bytes_val)
File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()
File numcodecs/zstd.pyx:191, in numcodecs.zstd.decompress()
RuntimeError: Zstd decompression error: invalid input data
In [6]: bytes3 = b'(\xb5/\xfd\x00X$\x02\x00\xa4\x03ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz\x01\x00:\xfc\xdfs\x05\x05L\x00\x00\x08s\x01\x00\xfc\xff9\x10\x02L\x00\x00
⋮ \x08k\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08c\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08[\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08S\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08K\x01\x00\xfc\xf
⋮ f9\x10\x02L\x00\x00\x08C\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08u\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08m\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08e\x01\x00\xfc\xff9\x10\x02L\x00\x00\
⋮ x08]\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08U\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08M\x01\x00\xfc\xff9\x10\x02M\x00\x00\x08E\x01\x00\xfc\x7f\x1d\x08\x01'
In [7]: codec.decode(bytes3)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 codec.decode(bytes3)
File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()
File numcodecs/zstd.pyx:191, in numcodecs.zstd.decompress()
RuntimeError: Zstd decompression error: invalid input data
After
|
@d-v-b @jakirkham could you review? |
Consider the following script:
import zarr, tensorstore as ts, numpy as np
arr = ts.open({
'driver': 'zarr',
'kvstore': {
'driver': 'file',
'path': 'ts_zarr2_zstd',
},
'metadata': {
'compressor': {
'id': 'zstd',
'level': 3,
},
'shape': [1024, 1024],
'chunks': [64, 64],
'dtype': '|u1',
}
}, create=True, delete_existing=True).result()
arr[:,:] = np.random.randint(0, 9, size=(1024,1024), dtype='u1')
arr2 = zarr.open_array("ts_zarr2_zstd")
print(arr2[:,:])
Before
After
|
it seems like compatibility with tensorstore is part of the goal here. Would it make sense to add an integration test that uses tensorstore? |
This would be better done in another pull request. It would be good to add zarr and tensorstore as optional test dependencies. Also note that the zstd compatibility issue only occurs when writing Zarr v2 with tensorstore. The closest analog to numcodecs for tensorstore is riegeli, where most of the compression routines are implemented. |
@normanrz , you may be interested in taking a look as well with regard to Zstandard. |
Zstandard can use a streaming compression scheme where the total size of the data is not known at the beginning of the process. In this case, the content size
is not saved in the Zstandard frame header.
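Whether a frame stores its content size can be read from the frame header itself. The following is a minimal sketch in pure Python, written against the Zstandard frame format (RFC 8878). The function name frame_content_size is hypothetical and is not part of numcodecs or the zstd C library. It returns None in the situation where ZSTD_getFrameContentSize would report ZSTD_CONTENTSIZE_UNKNOWN.

```python
ZSTD_MAGIC = 0xFD2FB528  # Zstandard frame magic number (stored little-endian)


def frame_content_size(frame: bytes):
    """Return the content size declared in a Zstandard frame header,
    or None when the header does not store one.

    Hypothetical helper for illustration only.
    """
    if len(frame) < 5 or int.from_bytes(frame[:4], "little") != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")

    descriptor = frame[4]                   # Frame_Header_Descriptor
    fcs_flag = descriptor >> 6              # Frame_Content_Size_flag (bits 7-6)
    single_segment = (descriptor >> 5) & 1  # Single_Segment_flag (bit 5)
    dict_id_flag = descriptor & 3           # Dictionary_ID_flag (bits 1-0)

    pos = 5
    if not single_segment:
        pos += 1                            # skip Window_Descriptor
    pos += (0, 1, 2, 4)[dict_id_flag]       # skip Dictionary_ID

    # Size in bytes of the Frame_Content_Size field.
    if fcs_flag == 0:
        fcs_size = 1 if single_segment else 0
    else:
        fcs_size = (2, 4, 8)[fcs_flag - 1]

    if fcs_size == 0:
        return None                         # content size is not stored
    if len(frame) < pos + fcs_size:
        raise ValueError("truncated frame header")

    value = int.from_bytes(frame[pos:pos + fcs_size], "little")
    if fcs_size == 2:
        value += 256                        # the 2-byte form is offset by 256
    return value


# Frame from the "Before" session: no content size in the header.
print(frame_content_size(b'(\xb5/\xfd\x00Xa\x00\x00Hello World!'))

# Hand-built frame with Single_Segment_flag set and a one-byte content size.
print(frame_content_size(b'(\xb5/\xfd \x0ca\x00\x00Hello World!'))
```

The first call prints None and the second prints 12, the length of "Hello World!".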
Before this pull request, numcodecs would refuse to decompress data if the size were unknown. This pull request adds a routine to decompress data if the size is unknown,
specifically when ZSTD_getFrameContentSize returns ZSTD_CONTENTSIZE_UNKNOWN.
This pull request is based on a prior pull request I made to numcodecs.js:
manzt/numcodecs.js#47
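To illustrate why an unknown content size need not block decoding, here is a toy sketch that walks the block headers of a frame and grows the output buffer as data arrives, rather than allocating it up front. It is not the routine added in this pull request: the function names are hypothetical, it follows the block layout in RFC 8878, and it only handles Raw_Block and RLE_Block. Compressed blocks, such as those in bytes3 above, need the real zstd library.

```python
ZSTD_MAGIC = 0xFD2FB528  # Zstandard frame magic number (stored little-endian)


def header_length(frame: bytes) -> int:
    """Length in bytes of the magic number plus frame header (hypothetical helper)."""
    descriptor = frame[4]
    fcs_flag = descriptor >> 6
    single_segment = (descriptor >> 5) & 1
    dict_id_flag = descriptor & 3
    length = 5                                     # magic + descriptor
    length += 0 if single_segment else 1           # Window_Descriptor
    length += (0, 1, 2, 4)[dict_id_flag]           # Dictionary_ID
    length += (single_segment, 2, 4, 8)[fcs_flag]  # Frame_Content_Size
    return length


def decode_uncompressed_blocks(frame: bytes) -> bytes:
    """Decode a frame made only of raw or RLE blocks, without using
    any content size from the header (hypothetical, illustration only)."""
    if len(frame) < 5 or int.from_bytes(frame[:4], "little") != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")

    pos = header_length(frame)
    out = bytearray()  # grows as blocks are decoded; final size not known up front

    while True:
        if pos + 3 > len(frame):
            raise ValueError("truncated frame")
        block_header = int.from_bytes(frame[pos:pos + 3], "little")
        pos += 3
        last_block = block_header & 1
        block_type = (block_header >> 1) & 3
        block_size = block_header >> 3

        if block_type == 0:    # Raw_Block: block_size literal bytes follow
            if pos + block_size > len(frame):
                raise ValueError("truncated block")
            out += frame[pos:pos + block_size]
            pos += block_size
        elif block_type == 1:  # RLE_Block: one byte repeated block_size times
            if pos + 1 > len(frame):
                raise ValueError("truncated block")
            out += frame[pos:pos + 1] * block_size
            pos += 1
        else:
            raise NotImplementedError(
                "compressed blocks need a real Zstandard decoder")

        if last_block:
            break  # an optional 4-byte checksum may follow; it is ignored here

    return bytes(out)


print(decode_uncompressed_blocks(b'(\xb5/\xfd\x00Xa\x00\x00Hello World!'))
```

For the first frame from the "Before" session this prints b'Hello World!', since that frame holds a single 12-byte raw block.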
Fixes zarr-developers/zarr-python#2056
xref: zarr-developers/zarr-python#2056
TODO: