Dataset4EO is a Python library designed to streamline the creation, storage, and benchmarking of Earth observation datasets. The library focuses on two primary methods of handling large multi-channel remote sensing data:
- Channel-Wise Storage: Stores each channel of a multi-channel image as an independent chunk, allowing for selective decoding of specific channels.
- Full-Image Storage: Stores entire multi-channel images in chunks and selects specific channels during the decoding phase.
- Channel-Wise Dataset Support: Efficient storage and selective decoding of individual image channels.
- Full-Image Dataset Support: Traditional storage and decoding of entire multi-channel images.
- Performance Benchmarking: Tools to compare storage efficiency, memory usage, and decoding speed between channel-wise and full-image approaches.
- Integration with LitData: Fully leverages LitData’s streaming capabilities for handling large datasets.
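The difference between the two storage layouts described above can be illustrated with plain NumPy. This is only a conceptual sketch, not Dataset4EO's actual encoder: the helper names (`encode_full_image`, `encode_channel_wise`, `decode_full_image`, `decode_channel_wise`) are hypothetical, and `np.save` into an in-memory buffer stands in for whatever serialization the library really uses.

```python
import io

import numpy as np


def encode_full_image(image):
    """Serialize the whole (H, W, C) array as a single blob (hypothetical helper)."""
    buffer = io.BytesIO()
    np.save(buffer, image)
    return buffer.getvalue()


def encode_channel_wise(image):
    """Serialize each channel of an (H, W, C) array as its own blob (hypothetical helper)."""
    blobs = []
    for c in range(image.shape[-1]):
        buffer = io.BytesIO()
        np.save(buffer, np.ascontiguousarray(image[..., c]))
        blobs.append(buffer.getvalue())
    return blobs


def decode_full_image(blob, channels):
    """Decode every channel first, then keep only the requested ones."""
    image = np.load(io.BytesIO(blob))
    return image[..., channels]


def decode_channel_wise(blobs, channels):
    """Decode only the requested channel blobs and stack them along the last axis."""
    planes = [np.load(io.BytesIO(blobs[c])) for c in channels]
    return np.stack(planes, axis=-1)


rng = np.random.default_rng(0)
image = rng.random((64, 64, 13), dtype=np.float32)  # small 13-channel test image
channels = [0, 1, 2, 8, 4, 5]

full_blob = encode_full_image(image)
channel_blobs = encode_channel_wise(image)

from_full = decode_full_image(full_blob, channels)
from_channels = decode_channel_wise(channel_blobs, channels)

# Bytes that must be read to obtain the six selected channels in each layout.
bytes_read_full = len(full_blob)
bytes_read_channel_wise = sum(len(channel_blobs[c]) for c in channels)

print(from_full.shape, from_channels.shape)
print(bytes_read_full, bytes_read_channel_wise)
```

Both layouts return the same pixels for the selected channels, but the channel-wise layout only touches the blobs it needs, which is where the savings in I/O and memory would come from under these assumptions.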
pip install -e .
Dataset4EO is well suited to multispectral or hyperspectral datasets with more than three channels. Because images are chunked channel by channel and only the requested channels are read from disk during streaming, it reduces the memory footprint and speeds up training. This is especially useful when you need to randomly select a subset of channels from multispectral or hyperspectral images. The library extends the functionality of LitData.
import os

import numpy as np
import dataset4eo as eodata


def create_channel_wise_image(index):
    # Build one sample for the channel-wise chunked dataset from a 13-channel multispectral image.
    # Prepare data
    image_array = np.random.rand(256, 256, 13)  # Multi-channel image data
    segmentation_map = np.random.randint(0, 5, (256, 256), dtype='uint8')
    depth_map = np.random.rand(256, 256).astype('float16')
    data_sample = {
        "image": image_array,
        "segmentation_map": segmentation_map,
        "depth_map": depth_map,
        "sample_id": index
    }
    return data_sample


if __name__ == "__main__":
    if not os.path.exists("optimized_channel_dataset"):
        # Channel-Wise Dataset
        eodata.optimize(
            fn=create_channel_wise_image,
            inputs=list(range(1000)),  # Generate 1000 samples
            output_dir="optimized_channel_dataset",
            num_workers=4,
            chunk_bytes="128MB",
        )
# Initialize the dataset
dataset = eodata.StreamingDataset(
    input_dir="optimized_channel_dataset",  # Directory where chunks are stored
    num_channels=13,
    channels_to_select=[0, 1, 2, 8, 4, 5]  # Channels to load
)

# Retrieve a sample
sample = dataset[100]
print(sample.keys())
print(sample["image"].shape)  # Output: (256, 256, 6)
print(sample["segmentation_map"].shape)  # Output: (256, 256)
print(sample["depth_map"].shape)  # Output: (256, 256)
print(sample["sample_id"])  # Metadata
Please check out all the supported datasets. Built-in datasets such as so2sat are looked up by key in eodata.builtin_datasets.
import time

import dataset4eo as eodata
import litdata as ld
from huggingface_hub import snapshot_download

repo_id = eodata.builtin_datasets['so2sat']
local_path = snapshot_download(
    repo_id=repo_id,
    repo_type="dataset",
    cache_dir="./data_so2sat_cls",  # Custom directory
    revision="main"  # Specific branch, tag, or commit
)

split = "train"
train_dataset = eodata.StreamingDataset(
    input_dir=f"{local_path}/{split}",
    num_channels=18,
    channels_to_select=[0, 3, 5, 7, 9],
    shuffle=True,
    drop_last=True,
)
train_dataloader = ld.StreamingDataLoader(train_dataset)

iters = 0
start = time.time()
for sample in train_dataloader:
    img, cls = sample['image'], sample['class']
    iters += 1
    if iters == 100:
        break
end = time.time()
print(end - start)
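The timing loop above can be wrapped in a small reusable helper when comparing configurations, for example different channel selections. The sketch below uses only the standard library; `measure_throughput` is a hypothetical name and not part of Dataset4EO or LitData, and it works on any iterable, so a plain `range` stands in for a data loader here.

```python
import time
from itertools import islice


def measure_throughput(iterable, max_batches=100):
    """Iterate over at most `max_batches` items and report (count, seconds, items per second)."""
    start = time.perf_counter()
    count = 0
    for _ in islice(iterable, max_batches):
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast iterables.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate


# Stand-in iterable; in practice this would be a data loader such as train_dataloader.
batches, seconds, rate = measure_throughput(range(1000), max_batches=100)
print(f"{batches} batches in {seconds:.6f} s ({rate:.1f} batches/s)")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short durations.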
- Reorganize more than 400 datasets in the remote sensing community in a task-oriented way.
- Support high-level repositories for specific tasks such as object detection, segmentation, and more.
- Provide easy-to-use data loaders for custom projects.
Dataset4EO simplifies the management of Earth observation datasets, offering robust performance and seamless integration with LitData to handle large-scale remote sensing data more efficiently.