Multiome training crashes at first batch: scdata._process_rna indexes adata.obsm["fragment_single"] with tuple cell_indices

Hi Scooby authors,

I’m trying to fine-tune Scooby on the provided multiome training data, but training crashes as soon as the DataLoader fetches the first batch. The error occurs in scooby/data/scdata.py when adata.obsm["fragment_single"] is indexed with cell_indices. That value appears to be a tuple or ragged structure (for example, something like (indices, weights)), which causes SciPy/anndata sparse indexing to fail.

Resources / setup
	•	Data (Zenodo): https://zenodo.org/records/14018495
	•	Pretrained model: https://hf-mirror.com/johahi/borzoi-replicate-0
	•	Scooby repo / install: https://github.com/gagneurlab/scooby

Command:
CUDA_VISIBLE_DEVICES=1 python train_multiome.py --config_file train_config.yaml

What happens
Training starts and wandb initializes, but iteration stops at the very first batch: 0it [00:06, ?it/s]

Then it crashes in the DataLoader worker. The traceback shows:
	•	scooby/data/scdata.py::__getitem__ → _load_pseudobulk → _process_rna
	•	Crash line: m = adata.obsm["fragment_single"][cell_indices]

This indexing triggers SciPy/anndata sparse index validation, which errors because cell_indices is not a 1D integer array.

Full traceback
0it [00:06, ?it/s]
Traceback (most recent call last):
  File ".../train_multiome_V2.py", line 215, in <module>
    train(config)
  File ".../train_multiome_V2.py", line 168, in train
    for i, [inputs, rc_augs, targets, cell_emb_idx] in tqdm.tqdm(enumerate(training_loader)):
  ...
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".../numpy/_core/fromnumeric.py", line 3557, in ndim
    return a.ndim
AttributeError: 'tuple' object has no attribute 'ndim'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)
  File ".../torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File ".../scooby/data/scdata.py", line 549, in __getitem__
    targets.append(self._load_pseudobulk(neighbors_to_load, genome_data))
  File ".../scooby/data/scdata.py", line 492, in _load_pseudobulk
    seq_cov = self._process_rna(adata, neighbors, seq_coord, strand=strand, custom_read_length=self.custom_read_length)
  File ".../scooby/data/scdata.py", line 439, in _process_rna
    m=adata.obsm["fragment_single"][cell_indices]
  File ".../anndata/_core/sparse_dataset.py", line 452, in __getitem__
    sub = self.to_memory()[row_sp_matrix_validated, col_sp_matrix_validated]
  File ".../scipy/sparse/_index.py", line 30, in __getitem__
    index, new_shape = self._validate_indices(key)
  File ".../scipy/sparse/_index.py", line 231, in _validate_indices
    elif isinstance(idx, slice) or isintlike(idx):
  File ".../scipy/sparse/_sputils.py", line 356, in isintlike
    if np.ndim(x) != 0:
  File ".../numpy/_core/fromnumeric.py", line 3559, in ndim
    return asarray(a).ndim
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

It looks like cell_indices (passed into _process_rna) can be a tuple or ragged structure rather than a 1D integer index array. SciPy/anndata sparse slicing expects something like np.ndarray[int] (1D) but receives a tuple, so it treats the value as a multi-axis indexing key and fails.
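
For reference, the last frames of the traceback (np.ndim falling back to asarray) can be reproduced without Scooby. This is only a sketch: the tuple below is my guess at what cell_indices looks like (two arrays of different lengths), and ndim_or_error is a throwaway helper, not anything from the Scooby code base.

```python
import numpy as np

# Assumed shape of the bad value: a 2-tuple of arrays with different lengths.
# This is a guess based on the error message, not a dump from the real run.
cell_indices = (np.array([0, 3, 7]), np.array([0.6, 0.4]))


def ndim_or_error(x):
    """Return np.ndim(x), or the error text if NumPy cannot build an array."""
    try:
        return np.ndim(x)
    except ValueError as exc:
        return f"ValueError: {exc}"


# A plain 1D index array is what sparse row indexing can work with.
print(ndim_or_error(cell_indices[0]))

# The tuple has no .ndim, so NumPy tries asarray() on it. On recent NumPy
# releases a ragged sequence is rejected at that point; older releases may
# instead build an object array with a deprecation warning.
print(ndim_or_error(cell_indices))
```

If the second call reports the same "inhomogeneous shape" message as above, that would support the idea that the tuple reaches the sparse indexing code unchanged.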

Could you please advise:
	1.	What is the expected format of cell_indices/neighbors in _process_rna and _load_pseudobulk for the multiome dataset?
	2.	Is this a known issue with certain dependency versions (e.g., Python 3.13 / anndata / scipy), or should scdata.py normalize cell_indices (e.g., unwrap (indices, weights) to indices) before indexing?
	3.	If you have a recommended environment (Python version + package versions) that you’ve tested this pipeline with, could you share it?
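
To make question 2 concrete, this is roughly the kind of normalization I have in mind. It is a sketch under assumptions: normalize_cell_indices is a hypothetical helper (it does not exist in scdata.py), the (indices, weights) layout is a guess, and a small dense NumPy array stands in for adata.obsm["fragment_single"].

```python
import numpy as np


def normalize_cell_indices(cell_indices):
    """Hypothetical helper: coerce a neighbor selection to a 1D integer array."""
    # Assumption: a tuple means (indices, weights); keep only the indices.
    if isinstance(cell_indices, tuple):
        cell_indices = cell_indices[0]
    idx = np.asarray(cell_indices).ravel()
    if not np.issubdtype(idx.dtype, np.integer):
        raise TypeError(f"expected integer indices, got dtype {idx.dtype}")
    return idx


# Dense stand-in for the per-cell fragment matrix (5 cells x 4 positions).
fragments = np.arange(20).reshape(5, 4)

# Guessed ragged input: two neighbor indices plus an unrelated weights array.
neighbors = (np.array([0, 3]), np.array([0.7, 0.2, 0.1]))

rows = normalize_cell_indices(neighbors)
subset = fragments[rows]  # selects the rows for cells 0 and 3
print(rows, subset.shape)
```

I have not applied this to scdata.py itself, since dropping the second tuple element may discard information the pipeline needs. It is meant only to show the shape of index I would expect at the failing line.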

Happy to provide additional info (config file, package version list, or a small dump of type(cell_indices) / repr(cell_indices) right before the failing line) if helpful.
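
A sketch of how I would collect that information, using only the standard library. The package list is taken from the modules that appear in the traceback, and both function names are made up for this report.

```python
import platform
from importlib import metadata

# Packages that appear in the traceback; adjust to the actual environment.
PACKAGES = ["numpy", "scipy", "anndata", "torch"]


def package_versions(names):
    """Map each distribution name to its installed version string."""
    versions = {"python": platform.python_version()}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def describe(obj):
    """Summarize the type and length of an index object; recurse into tuples."""
    name = type(obj).__name__
    try:
        size = len(obj)
    except TypeError:
        return name  # scalars and other unsized objects
    if isinstance(obj, tuple):
        inner = ", ".join(describe(item) for item in obj)
        return f"{name}[len={size}]({inner})"
    return f"{name}[len={size}]"


for pkg, version in package_versions(PACKAGES).items():
    print(f"{pkg}: {version}")

# Placeholder value; in the real run this would be cell_indices itself,
# printed right before the failing line in _process_rna.
print(describe(([0, 3, 7], [0.6, 0.4])))
```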



Multiome training crashes at first batch: scdata._process_rna indexes adata.obsm["fragment_single"] with tuple cell_indices #28
