Add the dataset ID to the download modals for easier web <-> API transitions #7423

sidneymbell · 2025-02-01T00:00:17Z

Description

In the download modals for Datasets and Collections, please include the dataset_id and a code snippet for downloading this dataset via the Census API.

Context

Use case: today I wanted to pre-filter the Tabula Sapiens dataset based on metadata found in .obs before downloading the count matrix. This is useful because I'm working on my local laptop, and the count data is large-ish, whereas I only need a small fraction of it.

In theory, this should be easy because Census provides a very nice cellxgene_census.get_obs function, which can be run something like this: cellxgene_census.get_obs(census, organism, value_filter="dataset_id == 'foo'").
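For reference, a fuller version of that call might look like the sketch below. It assumes the `cellxgene-census` package is installed, that `get_obs` takes an open Census handle and an organism name, and that its filter keyword is `value_filter` (`obs_value_filter` being the `get_anndata` spelling); please check these against the installed version. The helper names `dataset_filter` and `fetch_obs_for_dataset` are hypothetical, and `foo` stands in for a real dataset ID.

```python
def dataset_filter(dataset_id: str) -> str:
    """Build a value filter that selects the cells of one dataset."""
    if "'" in dataset_id:
        raise ValueError("dataset_id must not contain a single quote")
    # The ID has to be quoted inside the filter expression.
    return f"dataset_id == '{dataset_id}'"


def fetch_obs_for_dataset(dataset_id, organism="homo_sapiens", columns=None):
    """Fetch only the .obs metadata for one dataset (requires network access)."""
    # Imported lazily so the filter helper works without the package installed.
    import cellxgene_census

    with cellxgene_census.open_soma() as census:
        return cellxgene_census.get_obs(
            census,
            organism,
            value_filter=dataset_filter(dataset_id),
            column_names=columns,  # None is assumed to mean "all columns"
        )


if __name__ == "__main__":
    print(dataset_filter("foo"))
```

Only the metadata is pulled here, so the filter can be refined before any count data is downloaded.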

However, this dataset ID is impossible to find unless you query all dataset_id values in the Census and filter based on the collection_name. (H/T to @ebezzi for helping me figure out this workaround!)
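The workaround can be sketched roughly as follows. This assumes the Census exposes a `census_info` / `datasets` table with `collection_name`, `dataset_id`, and `dataset_title` columns, as described in the Census schema; the helper names `find_datasets` and `load_dataset_rows` are made up for illustration.

```python
def find_datasets(rows, collection_substring):
    """Return (dataset_id, dataset_title) pairs whose collection name matches."""
    needle = collection_substring.lower()
    return [
        (row["dataset_id"], row["dataset_title"])
        for row in rows
        if needle in row["collection_name"].lower()
    ]


def load_dataset_rows():
    """Read the Census datasets table as a list of dicts (requires network access)."""
    # Imported lazily so find_datasets can be used on any list of row dicts.
    import cellxgene_census

    with cellxgene_census.open_soma() as census:
        # Assumed location of the datasets table within the Census object.
        table = census["census_info"]["datasets"].read().concat()
    return table.to_pylist()


if __name__ == "__main__":
    for dataset_id, title in find_datasets(load_dataset_rows(), "tabula sapiens"):
        print(dataset_id, title)
```

Matching on a case-insensitive substring of the collection name is a convenience choice; an exact match would be stricter.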

Impact

I usually browse datasets online, and then download via notebook so I can be more precise in which slices of the data I actually need. Making this more seamless would save me a lot of headache trying to track down the data I want once I'm ready to download.

Alternatives you've considered

I really don't think we surface this dataset_id anywhere visible online. I even checked the dataset info box in Explorer. Maybe I'm just missing something? :)

Ideal behavior

In the modal, replace:
old:

Individual datasets and their versions may also be downloaded programmatically using the Discover API.

new:

To download this dataset via the Census API, use this Python snippet:
cellxgene_census.get_anndata(census, organism, obs_value_filter="dataset_id == 'foo'")

ivirshup · 2025-02-04T00:28:23Z

I think this is a good idea.

FWIW I believe the URL has the dataset ID in it. E.g., from the URL:

https://cellxgene.cziscience.com/e/2e5a9b5d-d31b-4e9f-a179-d5d70ba459fb.cxg/

2e5a9b5d-d31b-4e9f-a179-d5d70ba459fb is a dataset_id you can use to filter from census.
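A small sketch of pulling that ID out of an Explorer URL, using only the standard library. It assumes the URL path has the form `/e/<uuid>.cxg/` as in the example above; the function name `dataset_id_from_explorer_url` is hypothetical, and whether the extracted ID is present in Census still has to be checked separately.

```python
import uuid
from urllib.parse import urlparse


def dataset_id_from_explorer_url(url: str) -> str:
    """Extract the UUID from an Explorer URL shaped like /e/<uuid>.cxg/."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "e" or not parts[1].endswith(".cxg"):
        raise ValueError(f"not an Explorer dataset URL: {url}")
    candidate = parts[1][: -len(".cxg")]
    # uuid.UUID raises ValueError if the string is not a valid UUID.
    return str(uuid.UUID(candidate))


if __name__ == "__main__":
    url = "https://cellxgene.cziscience.com/e/2e5a9b5d-d31b-4e9f-a179-d5d70ba459fb.cxg/"
    print(dataset_id_from_explorer_url(url))
```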

brianraymor · 2025-02-04T00:32:20Z

For reproducibility, the dataset version ID is preferred, which is why it's embedded in the citation and also available in the Census schema.

My other comment on the Slack thread is that not all CELLxGENE Discover datasets are available in Census.

sidneymbell transferred this issue from chanzuckerberg/cellxgene-census Feb 3, 2025
