Hi Scooby authors,
I’m trying to fine-tune Scooby on the provided multiome training data, but training crashes immediately when the DataLoader fetches the first batch. The error happens inside scooby/data/scdata.py when indexing adata.obsm["fragment_single"] with cell_indices, which appears to be a tuple / ragged structure (e.g., something like (indices, weights)), causing SciPy/anndata sparse indexing to fail.
Resources / setup
• Data (Zenodo): https://zenodo.org/records/14018495
• Pretrained model: https://hf-mirror.com/johahi/borzoi-replicate-0
• Scooby repo / install: https://github.com/gagneurlab/scooby
Command:
CUDA_VISIBLE_DEVICES=1 python train_multiome.py --config_file train_config.yaml
What happens
Training starts, wandb initializes, but iteration stops at the very first batch: 0it [00:06, ?it/s]
Then it crashes in the DataLoader worker. The traceback shows:
• scooby/data/scdata.py::__getitem__ → _load_pseudobulk → _process_rna
• Crash line: m = adata.obsm["fragment_single"][cell_indices]
This indexing triggers SciPy/anndata sparse index validation, which errors because cell_indices is not a 1D integer array.
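For reference, the mismatch is easy to reproduce outside Scooby with a toy sparse matrix (toy data below, not the actual obsm contents):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for adata.obsm["fragment_single"]: 4 cells x 3 features.
m = csr_matrix(np.arange(12).reshape(4, 3))

# What scipy expects: a 1D integer array of row (cell) indices.
cell_indices = np.array([0, 2])
print(m[cell_indices].shape)  # -> (2, 3)

# What apparently arrives instead: an (indices, weights)-style tuple.
# scipy interprets any tuple as a multi-axis (row, col) key, so this
# fails during index validation instead of selecting rows.
bad_key = (np.array([0, 2]), np.array([0.5, 0.25, 0.1]))
try:
    m[bad_key]
    print("no error")
except Exception as exc:
    print(type(exc).__name__)
```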
Full traceback
0it [00:06, ?it/s]
Traceback (most recent call last):
File ".../train_multiome_V2.py", line 215, in <module>
train(config)
File ".../train_multiome_V2.py", line 168, in train
for i, [inputs, rc_augs, targets, cell_emb_idx] in tqdm.tqdm(enumerate(training_loader)):
...
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File ".../numpy/_core/fromnumeric.py", line 3557, in ndim
return a.ndim
AttributeError: 'tuple' object has no attribute 'ndim'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".../torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index)
File ".../torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File ".../scooby/data/scdata.py", line 549, in __getitem__
targets.append(self._load_pseudobulk(neighbors_to_load, genome_data))
File ".../scooby/data/scdata.py", line 492, in _load_pseudobulk
seq_cov = self._process_rna(adata, neighbors, seq_coord, strand=strand, custom_read_length=self.custom_read_length)
File ".../scooby/data/scdata.py", line 439, in _process_rna
m=adata.obsm["fragment_single"][cell_indices]
File ".../anndata/_core/sparse_dataset.py", line 452, in __getitem__
sub = self.to_memory()[row_sp_matrix_validated, col_sp_matrix_validated]
File ".../scipy/sparse/_index.py", line 30, in __getitem__
index, new_shape = self._validate_indices(key)
File ".../scipy/sparse/_index.py", line 231, in _validate_indices
elif isinstance(idx, slice) or isintlike(idx):
File ".../scipy/sparse/_sputils.py", line 356, in isintlike
if np.ndim(x) != 0:
File ".../numpy/_core/fromnumeric.py", line 3559, in ndim
return asarray(a).ndim
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It looks like cell_indices (passed into _process_rna) can arrive as a tuple or ragged structure rather than a 1D integer index array. SciPy/anndata sparse slicing expects a 1D np.ndarray of integers but receives a tuple, which it treats as a multi-axis indexing key, so validation fails.
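As a local workaround I've been experimenting with a defensive normalization before the failing line. This is purely a guess at the intended semantics (I don't actually know whether the second tuple element is weights), so treat it as a sketch, not a fix:

```python
import numpy as np

def normalize_cell_indices(cell_indices):
    """Coerce whatever arrives as cell_indices into a 1D int array.

    Hypothetical helper: if the value is an (indices, weights)-style
    tuple, keep only the first element, then flatten to 1D integers.
    """
    if isinstance(cell_indices, tuple):
        cell_indices = cell_indices[0]
    return np.asarray(cell_indices, dtype=np.int64).ravel()

print(normalize_cell_indices((np.array([3, 1, 4]), np.array([0.2, 0.5, 0.3]))))  # -> [3 1 4]
print(normalize_cell_indices([2, 7]))  # -> [2 7]
```

This silences the crash for me, but if the second element really is weights, dropping it would change the pseudobulk aggregation, which is why I'd rather hear the intended format from you.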
Could you please advise:
1. What is the expected format of cell_indices/neighbors in _process_rna and _load_pseudobulk for the multiome dataset?
2. Is this a known issue with certain dependency versions (e.g., Python 3.13 / anndata / scipy), or should scdata.py normalize cell_indices (e.g., unwrap (indices, weights) to indices) before indexing?
3. If you have a recommended environment (Python version + package versions) that you’ve tested this pipeline with, could you share it?
Happy to provide additional info (config file, package version list, or a small dump of type(cell_indices) / repr(cell_indices) right before the failing line) if helpful.
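Concretely, the probe I'd drop in (in my own copy) just before the failing line would look like this; describe_index is a hypothetical helper of mine, not part of Scooby:

```python
import numpy as np

def describe_index(cell_indices):
    """Summarize an index object the way I'd log it before the crash."""
    return {
        "type": type(cell_indices).__name__,
        "ndim": getattr(cell_indices, "ndim", None),  # None for tuples/lists
        "preview": repr(cell_indices)[:120],
    }

# Example with the suspected (indices, weights)-style tuple:
print(describe_index((np.array([0, 2]), np.array([0.5, 0.5]))))
```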