Identical values reported for "var.nnz" and "var.n_measured_obs" for different datasets retrieved via get_anndata() #1281

khughitt opened this issue Sep 16, 2024 · 2 comments
bug Something isn't working


Describe the bug

adata.var.nnz and adata.var.n_measured_obs report the same (global?) values regardless of the dataset queried.

Tested for different dataset ids, but presumably this applies to all queries and not just those pulling a single dataset.

To Reproduce

import cellxgene_census

d1 = "00ff600e-6e2e-4d76-846f-0eec4f0ae417"
d2 = "0c9a8cfb-6649-4d52-b418-6d8e56bd7afe"

with cellxgene_census.open_soma(census_version="2024-07-01") as census:
    ad1 = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"dataset_id == '{d1}'",
    )

    ad2 = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"dataset_id == '{d2}'",
    )
    # True
    (ad1.var.nnz == ad2.var.nnz).all()

    # True
    (ad1.var.n_measured_obs == ad2.var.n_measured_obs).all()

Expected behavior

Query/dataset-specific values should be returned.


Additional context

I checked the docs to make sure this is not expected behavior, and they also suggest that the values should be relative to the dataset(s) queried:

n_measured_obs — the “measured” cells for this gene, effectively the number of cells for which this gene was measured in their respective dataset.



Thanks for all of your work on this!

It's appreciated.

Thanks for the bug report @khughitt!

We are tracking this, and it looks related to #1284. I think your interpretation is correct, but I'm checking with the schema owners on internal channels to make sure.

ivirshup commented Feb 3, 2025

@khughitt, apologies, but I completely misread this issue when I first responded.

What you are seeing is actually the expected behavior. The summary statistics you are seeing in .var are calculated across the whole census object and are statically stored. That means for any query within a Measurement you will get the same values for var.

It sounds like what you are expecting is for these values to be calculated dynamically for each query. You will need to recalculate the values on your side for this.

Sorry for the confusion!
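For anyone landing here later, a minimal sketch of recalculating nnz on the client side might look like the following. It assumes the query result's X holds one row per cell and one column per gene (as in the AnnData returned by get_anndata()), and it counts entries that are actually non-zero. The helper name recompute_nnz and the toy data are made up for illustration and are not part of cellxgene_census.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def recompute_nnz(X, var: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `var` whose `nnz` column is recomputed from X.

    Hypothetical helper: X is a (cells x genes) sparse or dense matrix,
    and `var` has one row per gene, in the same order as the columns of X.
    """
    # Count the non-zero entries in each gene column of the queried slice.
    nnz = np.asarray((X != 0).sum(axis=0)).ravel()
    out = var.copy()
    out["nnz"] = nnz.astype(np.int64)
    return out


# Toy stand-in for adata.X / adata.var: 3 cells x 3 genes, where the
# stored nnz values play the role of the census-wide statistics.
X = sparse.csr_matrix(np.array([[0, 2, 0], [1, 3, 0], [0, 4, 0]]))
var = pd.DataFrame({"feature_id": ["g1", "g2", "g3"], "nnz": [100, 200, 300]})

local_var = recompute_nnz(X, var)
print(local_var)
```

With a real query this would be something like ad1.var = recompute_nnz(ad1.X, ad1.var). Recomputing n_measured_obs is a different matter: it depends on which genes each dataset actually measured, so it needs per-dataset feature presence information in addition to X, which this sketch does not cover.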

