Multiome training crashes at first batch: scdata._process_rna indexes adata.obsm["fragment_single"] with tuple cell_indices #28

@TongWu2022

Hi Scooby authors,

I’m trying to fine-tune Scooby on the provided multiome training data, but training crashes immediately when the DataLoader fetches the first batch. The error happens inside scooby/data/scdata.py when indexing adata.obsm["fragment_single"] with cell_indices, which appears to be a tuple / ragged structure (e.g., something like (indices, weights)), causing SciPy/anndata sparse indexing to fail.

Resources / setup
• Data (Zenodo): https://zenodo.org/records/14018495
• Pretrained model: https://hf-mirror.com/johahi/borzoi-replicate-0
• Scooby repo / install: https://github.com/gagneurlab/scooby

Command:
CUDA_VISIBLE_DEVICES=1 python train_multiome.py --config_file train_config.yaml

What happens
Training starts and wandb initializes, but iteration stops at the very first batch: 0it [00:06, ?it/s]

Then it crashes in the DataLoader worker. The traceback shows:
• scooby/data/scdata.py::__getitem__ → _load_pseudobulk → _process_rna
• Crash line: m = adata.obsm["fragment_single"][cell_indices]

This indexing triggers SciPy/anndata sparse index validation, which errors because cell_indices is not a 1D integer array.
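The innermost failure can be reproduced in isolation, without Scooby, anndata, or SciPy. The sketch below assumes that cell_indices is a tuple of two arrays of different lengths, which I have not confirmed; the (indices, weights) values are made up, and probe_index is a hypothetical helper name.

```python
import numpy as np


def probe_index(key):
    """Return np.ndim(key), or the exception class name if conversion fails."""
    try:
        # np.ndim first tries key.ndim, then falls back to asarray(key).ndim,
        # which is the pair of frames that appears in the traceback.
        return np.ndim(key)
    except ValueError as exc:
        return type(exc).__name__


# A well-formed row index: a 1D integer array.
flat = np.array([3, 7, 11])

# Hypothetical ragged pair standing in for (indices, weights).
ragged = (np.array([3, 7, 11]), np.array([0.6, 0.4]))

print("flat:", probe_index(flat))
print("ragged:", probe_index(ragged))
```

With a recent NumPy (ragged array creation became an error in 1.24, as far as I know), the second call should report a ValueError, while older releases build an object array and only warn. That would make the crash sensitive to the installed NumPy version.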

Full traceback
0it [00:06, ?it/s]
Traceback (most recent call last):
File ".../train_multiome_V2.py", line 215, in <module>
train(config)
File ".../train_multiome_V2.py", line 168, in train
for i, [inputs, rc_augs, targets, cell_emb_idx] in tqdm.tqdm(enumerate(training_loader)):
...
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File ".../numpy/_core/fromnumeric.py", line 3557, in ndim
return a.ndim
AttributeError: 'tuple' object has no attribute 'ndim'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File ".../torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index)
File ".../torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File ".../scooby/data/scdata.py", line 549, in __getitem__
targets.append(self._load_pseudobulk(neighbors_to_load, genome_data))
File ".../scooby/data/scdata.py", line 492, in _load_pseudobulk
seq_cov = self._process_rna(adata, neighbors, seq_coord, strand=strand, custom_read_length=self.custom_read_length)
File ".../scooby/data/scdata.py", line 439, in _process_rna
m=adata.obsm["fragment_single"][cell_indices]
File ".../anndata/_core/sparse_dataset.py", line 452, in __getitem__
sub = self.to_memory()[row_sp_matrix_validated, col_sp_matrix_validated]
File ".../scipy/sparse/_index.py", line 30, in __getitem__
index, new_shape = self._validate_indices(key)
File ".../scipy/sparse/_index.py", line 231, in _validate_indices
elif isinstance(idx, slice) or isintlike(idx):
File ".../scipy/sparse/_sputils.py", line 356, in isintlike
if np.ndim(x) != 0:
File ".../numpy/_core/fromnumeric.py", line 3559, in ndim
return asarray(a).ndim
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

It looks like cell_indices (passed into _process_rna) can be a tuple or ragged structure rather than a 1D integer index array. SciPy/anndata sparse slicing expects something like np.ndarray[int] (1D) but receives a tuple, so it treats the tuple as a multi-axis indexing key and fails.
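If the tuple really is an (indices, weights) pair, a guard before the indexing line might look like the sketch below. This is only my guess at a workaround, not the authors' intended fix: as_row_indices is a hypothetical helper, the tuple layout is an assumption, and a dense NumPy array stands in for adata.obsm["fragment_single"].

```python
import numpy as np


def as_row_indices(cell_indices):
    """Coerce a neighbor specification into a 1D integer array (hypothetical)."""
    # Assumption: a 2-tuple whose first element is 1D means (indices, weights);
    # only the indices are kept, the weights are ignored here.
    if (
        isinstance(cell_indices, tuple)
        and len(cell_indices) == 2
        and np.ndim(cell_indices[0]) == 1
    ):
        cell_indices = cell_indices[0]
    arr = np.asarray(cell_indices)
    if arr.ndim != 1 or not np.issubdtype(arr.dtype, np.integer):
        raise TypeError(
            f"expected 1D integer indices, got shape {arr.shape}, dtype {arr.dtype}"
        )
    return arr


# Dense stand-in for the per-cell fragment matrix (5 cells x 4 positions).
fragments = np.arange(20).reshape(5, 4)

# Made-up neighbor pair with mismatched lengths, as in the error message.
neighbors = (np.array([0, 2, 4]), np.array([0.7, 0.3]))

rows = as_row_indices(neighbors)
print(fragments[rows].shape)
```

Dropping the weights silently may be wrong if the pseudobulk is meant to be weighted, so this only shows where a normalization step could sit.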

Could you please advise:
1. What is the expected format of cell_indices/neighbors in _process_rna and _load_pseudobulk for the multiome dataset?
2. Is this a known issue with certain dependency versions (e.g., Python 3.13 / anndata / scipy), or should scdata.py normalize cell_indices (e.g., unwrap (indices, weights) to indices) before indexing?
3. If you have a recommended environment (Python version + package versions) that you’ve tested this pipeline with, could you share it?
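To make the environment comparison in question 3 concrete, the versions on my side could be collected with a short script like the one below. The package list is an assumption about which dependencies matter, and "scooby" as an installed distribution name is a guess; anything missing is reported as not installed.

```python
import platform
from importlib import metadata

# Assumed list of relevant distributions; adjust as needed.
PACKAGES = ["anndata", "scipy", "numpy", "torch", "scooby"]


def collect_versions(packages=PACKAGES):
    """Map each distribution name to its installed version string."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"{name}=={version}")
```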

Happy to provide additional info (config file, package version list, or a small dump of type(cell_indices) / repr(cell_indices) right before the failing line) if helpful.
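For the dump mentioned above, a small helper along these lines could be called on cell_indices right before the failing line. It is a sketch with a hypothetical name (describe) and uses plain lists as made-up example input; it only reads generic attributes (shape, dtype, length), so it should work for tuples, lists, and NumPy arrays alike.

```python
def describe(obj, depth=0, max_depth=2):
    """Summarize type, size, and dtype of a possibly nested index object."""
    pad = "  " * depth
    parts = [type(obj).__name__]
    if hasattr(obj, "shape"):
        parts.append(f"shape={tuple(obj.shape)}")
    elif hasattr(obj, "__len__"):
        parts.append(f"len={len(obj)}")
    if hasattr(obj, "dtype"):
        parts.append(f"dtype={obj.dtype}")
    lines = [pad + " ".join(parts)]
    # Only tuples are expanded, so long index lists stay on one line.
    if isinstance(obj, tuple) and depth < max_depth:
        for item in obj:
            lines.append(describe(item, depth + 1, max_depth))
    return "\n".join(lines)


# Made-up example of a ragged (indices, weights) pair.
example = ([3, 7, 11], [0.6, 0.4])
print(describe(example))
```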
