Add the dataset ID to the download modals for easier web <-> API transitions #7423

Open
sidneymbell opened this issue Feb 1, 2025 · 2 comments

Comments

@sidneymbell
Contributor

Description

In the download modals for Datasets and Collections, please include the dataset_id and a code snippet for downloading this dataset via the Census API.

Context

Use case: today I wanted to pre-filter the Tabula Sapiens dataset based on metadata found in .obs before downloading the count matrix. This is useful because I'm working on my local laptop and the count data is fairly large, whereas I only need a small fraction of it.

In theory, this should be easy because Census provides a very nice cellxgene_census.get_obs function, which can be run something like this: cellxgene_census.get_obs(census, organism, value_filter="dataset_id == 'foo'").
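A minimal sketch of that pre-filtering workflow, assuming the `cellxgene_census` package is installed and the dataset ID is already known. The helper names `dataset_filter` and `fetch_slice` are hypothetical, and the `cell_type` column is only used as an illustration of inspecting `.obs` before pulling counts.

```python
def dataset_filter(dataset_id, extra=None):
    """Build a value-filter string that selects one dataset.

    `extra` is an optional additional simple clause, e.g. "tissue_general == 'lung'".
    It is joined with `and` and no parentheses, so keep it free of `or`.
    """
    clause = f"dataset_id == '{dataset_id}'"
    return f"{clause} and {extra}" if extra else clause


def fetch_slice(dataset_id, extra=None, organism="Homo sapiens"):
    """Inspect the cell metadata first, then download only the matching counts."""
    import cellxgene_census  # imported lazily: requires the package and network access

    with cellxgene_census.open_soma() as census:
        # Step 1: metadata only (small), to decide which cells are needed.
        obs = cellxgene_census.get_obs(
            census, organism, value_filter=dataset_filter(dataset_id)
        )
        print(obs["cell_type"].value_counts().head())

        # Step 2: counts for the narrowed selection only.
        return cellxgene_census.get_anndata(
            census,
            organism=organism,
            obs_value_filter=dataset_filter(dataset_id, extra),
        )
```

Calling `fetch_slice("foo", "tissue_general == 'lung'")` would then return an AnnData object restricted to the cells that pass both conditions, with `foo` standing in for a real dataset ID.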

However, this dataset ID is impossible to find unless you query all dataset_id values in the Census and filter based on the collection_name. (H/T to @ebezzi for helping me figure out this workaround!)
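The workaround can be sketched roughly as follows, assuming the Census `census_info` / `datasets` table exposes `collection_name`, `dataset_id`, and `dataset_title` columns. The helper names `match_collection` and `load_dataset_records` are hypothetical.

```python
def match_collection(records, name_fragment):
    """Return (dataset_id, dataset_title) pairs whose collection_name
    contains `name_fragment` (case-insensitive)."""
    needle = name_fragment.lower()
    return [
        (r["dataset_id"], r["dataset_title"])
        for r in records
        if needle in r["collection_name"].lower()
    ]


def load_dataset_records(census_version="stable"):
    """Read the Census dataset listing as a list of dicts (needs network access)."""
    import cellxgene_census  # imported lazily: requires the package and network access

    with cellxgene_census.open_soma(census_version=census_version) as census:
        datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    return datasets.to_dict("records")
```

For example, `match_collection(load_dataset_records(), "tabula sapiens")` lists the candidate dataset IDs for that collection.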

Impact

I usually browse datasets online, and then download via notebook so I can be more precise in which slices of the data I actually need. Making this more seamless would save me a lot of headache trying to track down the data I want once I'm ready to download.

Alternatives you've considered

I really don't think we surface this dataset_id anywhere visible online. I even checked the dataset info box in Explorer. Maybe I'm just missing something? :)

Ideal behavior

In the modal, replace:
old:

Individual datasets and their versions may also be downloaded programmatically using the Discover API.

new:

To download this dataset via the Census API, use this Python snippet:
cellxgene_census.get_anndata(census, organism, obs_value_filter="dataset_id == 'foo'")

@sidneymbell transferred this issue from chanzuckerberg/cellxgene-census Feb 3, 2025
@ivirshup

ivirshup commented Feb 4, 2025

I think this is a good idea.

FWIW, I believe the URL has the dataset ID in it. E.g., from the URL:

https://cellxgene.cziscience.com/e/2e5a9b5d-d31b-4e9f-a179-d5d70ba459fb.cxg/

2e5a9b5d-d31b-4e9f-a179-d5d70ba459fb is a dataset_id you can use to filter the Census.
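A small sketch of pulling the ID out of such a URL, assuming Explorer URLs follow the `/e/<uuid>.cxg/` pattern shown above. The function name `dataset_id_from_explorer_url` is hypothetical.

```python
import re
from urllib.parse import urlparse

_UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Assumed path shape: /e/<uuid>.cxg with an optional trailing slash.
_EXPLORER_PATH = re.compile(rf"^/e/({_UUID})\.cxg/?$", re.IGNORECASE)


def dataset_id_from_explorer_url(url):
    """Return the dataset ID embedded in an Explorer URL, or None if the
    path does not match the assumed pattern."""
    match = _EXPLORER_PATH.match(urlparse(url).path)
    return match.group(1).lower() if match else None
```

The returned string can be dropped into a filter such as `"dataset_id == '<id>'"`. As noted elsewhere in this thread, not every Discover dataset is in Census, so the filter may still return no cells.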

@brianraymor

brianraymor commented Feb 4, 2025

For reproducibility, the dataset version ID is preferred, which is why it's embedded in the citation and also available in the Census schema.

My other comment on the Slack thread was that not all CELLxGENE Discover datasets are available in Census.
