-
Notifications
You must be signed in to change notification settings - Fork 3k
Description
Feature request
I propose adding a dedicated feature to the datasets library that allows for the efficient storage and retrieval of multi-dimensional ndarray with dynamic shapes. Similar to how Image columns handle variable-sized images, this feature would provide a structured way to store array data where the dimensions are not fixed.
A possible implementation could be a new Array or Tensor feature type that stores the data in a structured format, for example,
{
"shape": (5, 224, 224),
"dtype": "uint8",
"data": [...]
}
This would allow the datasets library to handle heterogeneous array sizes within a single column without requiring a fixed shape definition in the feature schema.
Motivation
I am currently trying to upload data from astronomical telescopes, specifically FITS files, to the Hugging Face Hub. This type of data is very similar to images but often has more than three dimensions. For example, data from the SDSS project contains five channels (u, g, r, i, z), and the pixel values can exceed 255, making the Pillow based Image feature unsuitable.
The current datasets library requires a fixed shape to be defined in the feature schema for multi-dimensional arrays, which is a major roadblock. This prevents me from saving my data, as the dimensions of the arrays can vary across different FITS files.
datasets/src/datasets/features/features.py
Lines 613 to 614 in 985c9be
shape: tuple | |
dtype: str |
A feature that supports dynamic shapes would be incredibly beneficial for the astronomy community and other fields dealing with similar high-dimensional, variable-sized data (e.g., medical imaging, scientific simulations).
Your contribution
I am willing to create a PR to help implement this feature if the proposal is accepted.