Adding imputed sex labels #2207

erflynn · 2020-03-25T14:12:55Z

Issue Number

This addresses issue number #2181.

Purpose/Implementation Notes

This update includes imputed sex labels for microarray data (mouse, rat, and human).

I will add more labels soon. I am not sure whether you want to integrate this pull request now or come back to it later, once I have more labels. I am sending it now so we can work through methods, format, etc.

Methods

Details are included below (this is also in the README within the config/externally_supplied_metadata/ directory).
All code to produce this update is included in the sl_label repository and can be easily applied to other organisms that have XX/XY sex determination (provided sufficient metadata sex labels are available for model training).

The majority of gene expression data is missing metadata sex labels (see Table 1). This lack of labels prevents us from examining the sex breakdown of many studies. We trained a penalized logistic regression model that uses the expression of X and Y chromosome genes to impute the sex of a given sample. The model was trained using the glmnet R package with the elastic net penalty; lambda was selected by ten-fold cross-validation. We used metadata sex labels as "ground truth" for the training and testing data. To construct the training and testing datasets, we filtered for samples with metadata sex labels and no cell line annotation (using the refine-bio sex and cell_line tags), and grouped these samples into studies. We divided the set of all studies in half, then sampled n=700 samples per sex from one half for training and n=300 samples per sex from the other half for testing. This provided balanced training and testing datasets in which none of the test samples came from a study seen in training.
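To make the study-level split concrete, here is a minimal Python sketch of the sampling scheme described above. It is an illustration only, not the sl_label code: the function name `split_by_study`, the sample dictionary keys (`acc`, `study`, `sex`), and the fixed seed are all assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_study(samples, n_train=700, n_test=300, seed=0):
    """Split labeled samples so that no study contributes to both sets.

    `samples` is a list of dicts with (hypothetical) keys 'acc', 'study',
    and 'sex', where 'sex' is 'male' or 'female'.
    """
    rng = random.Random(seed)

    # Divide the set of studies in half: one half feeds training, the other testing.
    studies = sorted({s["study"] for s in samples})
    rng.shuffle(studies)
    train_studies = set(studies[: len(studies) // 2])

    # Pool samples by (split, sex) so each class can be sampled separately.
    pools = defaultdict(list)
    for s in samples:
        split = "train" if s["study"] in train_studies else "test"
        pools[(split, s["sex"])].append(s)

    # Draw the same number of males and females within each split.
    out = {"train": [], "test": []}
    for split, n in (("train", n_train), ("test", n_test)):
        for sex in ("female", "male"):
            pool = pools[(split, sex)]
            if len(pool) < n:
                raise ValueError(f"only {len(pool)} {sex} samples available for {split}")
            out[split].extend(rng.sample(pool, n))
    return out
```

With the defaults this yields 1400 training and 600 testing samples, matching the set sizes reported in Table 2.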

A previous study [1] indicated that there is widespread mis-annotation in metadata sex labels. Because of this, we set up additional testing datasets: a high confidence mixed sex dataset and single sex datasets. The high confidence mixed sex dataset consists of all mixed sex studies with at least five male and five female samples where the metadata sex labels match expression-based labels from at least one of two clustering-based sex imputation methods [1,2] (we did not apply clustering-based methods to the entire refine-bio dataset because they perform poorly on small studies, single sex studies, and studies with high class imbalance). The single sex datasets are all studies with at least ten samples and all male or all female labels as indicated by metadata sex labels.
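The two additional test sets can be sketched as simple filters. This is a hedged Python illustration, not the sl_label implementation: the keys `metadata_sex`, `cluster_a`, and `cluster_b` are hypothetical names for the metadata label and the two clustering-based labels, and applying the agreement check per sample (rather than per study) is an assumption, since the text above does not specify it.

```python
from collections import Counter, defaultdict


def group_by_study(samples):
    """Group sample dicts by their 'study' key."""
    by_study = defaultdict(list)
    for s in samples:
        by_study[s["study"]].append(s)
    return by_study


def high_confidence_samples(samples, min_per_sex=5):
    """Samples from mixed sex studies whose metadata label matches
    at least one of the two clustering-based labels."""
    kept = []
    for group in group_by_study(samples).values():
        counts = Counter(s["metadata_sex"] for s in group)
        # Require at least five male and five female samples in the study.
        if counts["male"] < min_per_sex or counts["female"] < min_per_sex:
            continue
        kept.extend(
            s for s in group
            if s["metadata_sex"] in (s.get("cluster_a"), s.get("cluster_b"))
        )
    return kept


def single_sex_studies(samples, min_samples=10):
    """Map study -> sex for studies with >= min_samples samples, all one sex."""
    result = {}
    for study, group in group_by_study(samples).items():
        labels = {s["metadata_sex"] for s in group}
        if len(group) >= min_samples and labels in ({"male"}, {"female"}):
            result[study] = next(iter(labels))
    return result
```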

Across all three organisms, our models achieve approximately 95% accuracy in a randomly selected held-out test set as compared to the metadata labels. Additionally, we assessed the accuracy of our model on various subsets of the data, comparing to all metadata sex labels (agreement 93.5-94.8%), a random sample of single sex studies (agreement 92.6-96.5%), and, in human, manually annotated sex labels from a previous analysis [3] (94.2%) (see Table 2).

| organism | Samples (n) | Samples missing metadata annotation | Studies (n) | Studies missing metadata annotation |
| ----- | ---- | ---- | ---- | ---- |
| human | 430119 | 74.90% | 14987 | 87.30% |
| mouse | 228707 | 74.00% | 12995 | 80.90% |
| rat | 31361 | 55.50% | 1295 | 65.30% |

Table 1. Metadata missingness for sex labels.

| dataset | Human | Mouse | Rat |
| ----- | ---- | ---- | ---- |
| Training (n=1400) | 95.60% | 96.10% | 98.60% |
| Testing (n=600) | 95.20% | 95.70% | 95.50% |
| Metadata | 93.5% (107748) | 93.5% (58473) | 94.8% (13995) |
| High-confidence | 97.3% (7301) | 95.5% (3968) | n/a |
| Single sex - f | 96.5% (12919) | 93.4% (13243) | 96.3% (2240) |
| Single sex - m | 92.6% (8128) | 93.6% (30225) | 95.8% (12689) |
| Manual annotations | 94.2% (8289) | n/a | n/a |

Table 2. Concordance of sex labels. Numbers in parentheses indicate the total number of samples; percentages are the number of samples that agree divided by the total number of samples. High confidence labels have matching metadata and clustering-based expression labels.
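The concordance defined in the caption (agreeing samples divided by total samples) can be computed as in the following Python sketch. The function name and the choice to restrict the comparison to accessions present in both label sets are assumptions for illustration.

```python
def concordance(reference, imputed):
    """Return (fraction agreeing, number compared) for two label mappings.

    Both arguments map sample accession -> 'male' or 'female'. Only
    accessions present in both mappings are compared.
    """
    shared = [acc for acc in reference if acc in imputed]
    if not shared:
        raise ValueError("no shared accessions to compare")
    agree = sum(reference[acc] == imputed[acc] for acc in shared)
    return agree / len(shared), len(shared)
```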

The cleaned metadata sex labels are also included in the cleaned_metadata/ directory for microarray and RNA-seq. This process mapped all harmonized sex labels to "male", "female", "mixed", or "unknown". Code for this is included in 01_metadata within the erflynn/sl_label repository.
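As a rough idea of what mapping free-text labels onto the four categories involves, here is a small Python sketch. The synonym sets and matching rules below are hypothetical and are not the rules used in 01_metadata; they only illustrate the shape of such a mapping.

```python
import re

# Hypothetical synonym sets; the real cleaning rules live in erflynn/sl_label.
FEMALE_TERMS = {"female", "f", "woman"}
MALE_TERMS = {"male", "m", "man"}


def map_sex(label):
    """Map a harmonized sex label to 'male', 'female', 'mixed', or 'unknown'."""
    if label is None:
        return "unknown"
    text = label.strip().lower()
    if text in FEMALE_TERMS:
        return "female"
    if text in MALE_TERMS:
        return "male"
    # Labels mentioning both sexes (e.g. "male and female") count as mixed.
    tokens = set(re.split(r"[^a-z]+", text)) - {""}
    if tokens & FEMALE_TERMS and tokens & MALE_TERMS:
        return "mixed"
    return "unknown"
```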

Types of changes

This includes externally supplied metadata files in .json format. These are under config/externally_supplied_metadata/. The format is as follows (and as discussed in Issue #2127):

{"sample_accession": "<SAMPLE_ACCESSION_CODE>",
 "attributes": [{"PATO:0000047": {"value": <VALUE>,
                                  "probability": <PROBABILITY>}}]
}

where value is one of "PATO:0000383" (female) or "PATO:0000384" (male) and probability is P(imputed_sex=value) from the logistic regression model.
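For anyone producing or checking files in this format, a minimal Python sketch of building one record follows. The helper name `make_record` and the accession `GSM0000001` are made up for illustration; the PATO identifiers are the ones given above.

```python
import json

SEX_ATTRIBUTE = "PATO:0000047"
FEMALE = "PATO:0000383"
MALE = "PATO:0000384"


def make_record(sample_accession, value, probability):
    """Build one imputed-sex record in the format described above."""
    if value not in (FEMALE, MALE):
        raise ValueError(f"value must be {FEMALE} or {MALE}, got {value!r}")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return {
        "sample_accession": sample_accession,
        "attributes": [
            {SEX_ATTRIBUTE: {"value": value, "probability": probability}}
        ],
    }


# GSM0000001 is a placeholder accession, not a real sample.
print(json.dumps(make_record("GSM0000001", FEMALE, 0.97)))
```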

Cleaned harmonized metadata for sex is included in the cleaned_metadata/ directory with the columns "acc" (sample accession), "sex" (the harmonized sex label), and "mapped_sex" (the harmonized sex label mapped to "male", "female", "mixed", or "unknown").

Screenshot

Sex sample breakdown using metadata and (imputed) expression labels for each organism.
sex_breakdown_microarray.pdf

References

[1] Toker, L., et al. F1000Research. 2016, 5: 2103.
[2] Buckberry, S., et al. Bioinformatics. 2014, 30(14): 2084–2085.
[3] Giles, C. B., et al. BMC Bioinformatics. 2017, 18(Suppl 14): 509.

jaclyn-taroni

Hi @erflynn, thanks so much for filing this! I'm Jaclyn, I'm a scientist at the CCDL. 👋 @kurtwheeler asked me to take a look at the methods section of your pull request.

In general, these look good to me! I think the level of detail you have included in the README looks good, but I had a few comments about linking to specific parts of the repo you mentioned (erflynn/sl_label) or previous analyses in case folks want to dig in a bit more at a later date. In addition, I had a question about the content of the CSV files included here. They may be as expected, but this was not intuitive to me in the context of the README, which might indicate an area where we should expand the documentation a bit.

Thanks again! Please let us know if you need anything!

jaclyn-taroni · 2020-03-31T15:17:41Z

config/externally_supplied_metadata/README.md

+
+| dataset | Human | Mouse | Rat |
+| ----- | ---- | ---- | ---- | 
+| Training (n=1400) | 95.60% | 96.10% | 98.60% |


I suspect there might be more information about the training and testing sets (things like class balance) in https://github.com/erflynn/sl_label. Can we add a link to that information in the paragraph above that begins with "The majority of gene expression data is missing metadata sex labels", please?

erflynn · 2020-04-01

Added some information! Thanks for the suggestion.

jaclyn-taroni · 2020-03-31T15:17:45Z

config/externally_supplied_metadata/README.md

+
+This update includes imputed sex labels for microarray data (mouse, rat, and human).
+
+The majority of gene expression data is missing metadata sex labels (see Table 1). This lack of labels prevents us from examining the breakdown by sex of many studies. We used the expression of X and Y chromosome genes and metadata sex labels to train a logistic regression model (with elastic net penalty) to predict sample sex. Across all three organisms, our models achieve approximately 95% accuracy in a randomly selected held-out test set as compared to the metadata labels. Additionally, we assessed the accuracy of our model, on various subsets of the data; comparing to all metadata sex labels (agreement 93.5-94.8%), a random sample of single sex studies (agreement 92.6-96.5%), and, in human, manually annotated sex labels from a previous analysis (94.2%) (see Table 2).  


Is there a link to a previous analysis that you would be comfortable putting here?

erflynn · 2020-04-01

Oops! Missing citation here, added in. Apologies!

jaclyn-taroni · 2020-03-31T16:38:36Z

config/externally_supplied_metadata/README.md

+
+| organism | Samples (n) | Samples missing metadata annotation | Studies (n) | Studies missing metadata annotation |
+| ----- | ---- | ---- | ---- | ---- |
+| human | 430119 | 74.90% | 14987 | 87.30% |


I was noticing the line counts for the CSV files - I would expect for human_rnaseq + human_microarray not to exceed the number Samples (n) in this table. I took a closer look at the tabular data in the CSV files and it looks like there are duplicate values in the acc column and in the microarray file, there are run accessions that are consistent with RNA-seq data (e.g., DRR, ERR). There are about 99k accessions shared across the human RNA-seq and microarray files but looking at the human RNA-seq file there are no GEO sample accessions (e.g., GSM). This may totally be intentional, but it is a bit different than what I would expect given the context in this document.

erflynn · 2020-04-01

Re the csv files - you are right, there are duplicates, apologies here! I have updated these, and put in a commit that fixes that. The numbers now exactly match those in the table.

For the run accessions, these are exactly the set of accessions that are included in the aggregated_metadata.json files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same. Is there a different way that you usually organize this?

jaclyn-taroni · 2020-04-01

> For the run accessions, these are exactly the set of accessions that are included in the aggregated_metadata.json files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same.

Ah okay I see! Then I would expect some overlap because the normalized compendium contains both microarray and RNA-seq data. So maybe it would be better to call these files normalized and rnaseq_sample rather than microarray and rnaseq to match the terminology here: https://www.refine.bio/compendia.

erflynn · 2020-04-01

Ah ok! I will update this, thank you!
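The checks discussed in this thread (duplicate values in the acc column, and which accession prefixes appear in a file) could be sketched in Python roughly as follows. The function name and the example rows are hypothetical; only the column names "acc", "sex", and "mapped_sex" come from the PR description, and taking the first three characters as the prefix is an assumption that fits accessions such as GSM, DRR, and ERR.

```python
import csv
import io
from collections import Counter


def accession_report(csv_text):
    """Return (duplicated accessions, counts of 3-letter accession prefixes)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row["acc"] for row in rows)
    duplicates = sorted(acc for acc, n in counts.items() if n > 1)
    # Count each distinct accession once, keyed by prefix (e.g. GSM, DRR, ERR).
    prefixes = Counter(acc[:3] for acc in counts)
    return duplicates, prefixes


# Made-up rows for illustration only.
example = "acc,sex,mapped_sex\nGSM1,F,female\nGSM1,F,female\nDRR2,male,male\nERR3,,unknown\n"
duplicates, prefixes = accession_report(example)
print(duplicates, dict(prefixes))
```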

erflynn · 2020-04-01T15:37:44Z

Thanks for the feedback @jaclyn-taroni. This is very helpful!

I updated the README to include more description, and fixed the duplicates within the .csv files. I am not sure what the coverage difference is between the compendia? But the samples should match the aggregated_metadata.json files. I included a link for the code for this.

I also have the RNA-seq sex labels for all of the files that I could find (Issue #2211), and will amend the pull request to include these.

For linking to the erflynn/sl_label repo - perhaps there is a better way I can organize linking to this? I'm trying to reorganize the repo to make it clearer as to what I did.

jaclyn-taroni · 2020-04-03T16:59:31Z

Hi @erflynn - thanks for updating your comment to include all of this info!

I have given this question a bit of thought:

> For linking to the erflynn/sl_label repo - perhaps there is a better way I can organize linking to this? I'm trying to reorganize the repo to make it clearer as to what I did.

I think we definitely want to make sure all of the information you've now added to your comment makes it into a version-controlled Markdown document. Now the question becomes where because if you have to write things up in two places that's more difficult to maintain. Some other considerations:

- I expect that erflynn/sl_label will always be "ahead" of AlexsLemonade/refinebio in regards to what's in config/externally_supplied_metadata/ just based on how I expect development will go.
- Sometimes (okay, often...nearly always in my personal experience) it's helpful to have someone who has not done an analysis review the documentation around the analysis because they have some distance.

There are two ways to address this that come to mind for me.

1. We can either have all of the documentation that's relevant to a particular "release" of labels (the CSV and JSON files in AlexsLemonade:dev) in config/externally_supplied_metadata/README.md. Someone from our team would review that documentation each time you file a pull request. In that documentation, you can use permalinks to link to the state of the code in erflynn/sl_label that was current as of filing the PR to AlexsLemonade/refinebio.

2. Another approach would be to keep all of the documentation in erflynn/sl_label alongside the code and use permalinks to the documentation that was current as of filing a PR to AlexsLemonade/refinebio in config/externally_supplied_metadata/README.md. A downside of this approach is that the documentation review step is not built-in, but one way to approach that would be to add someone from our team as a collaborator and only file PRs for documentation over on erflynn/sl_label.

Option 1 is probably more straightforward in my opinion, but would like to hear your thoughts. We can talk specifics of organizing the linking once we decide on a general strategy. Please let me know if anything is unclear! Thanks!
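For reference, a GitHub permalink pins a file to a specific commit instead of a branch, so the link keeps pointing at the same content as the repository moves on. A tiny Python sketch of assembling one is below; the commit hash and file path in the example are placeholders, not real values from erflynn/sl_label.

```python
def github_permalink(owner, repo, commit_sha, path):
    """Build a GitHub URL that pins `path` to a specific commit."""
    return f"https://github.com/{owner}/{repo}/blob/{commit_sha}/{path}"


# "0123abc" and "README.md" are placeholders for illustration.
print(github_permalink("erflynn", "sl_label", "0123abc", "README.md"))
```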

erflynn · 2020-04-07T18:33:26Z

Hi @jaclyn-taroni,

Thanks for thinking this through!

I agree that erflynn/sl_label will be ahead. I also agree it is helpful to have someone review the analysis.

I think option 1 makes more sense; I would prefer to have the documentation review step built-in.

One thought would be whether it is ok if the config/externally_supplied_metadata/README.md contains permalinks to other documentation in erflynn/sl_label, e.g. a link to documentation about how to run the code in erflynn/sl_label as well?

I would like to add instructions for how to run the code in the erflynn/sl_label repo to create the version of the output that is added in a pull request, but I think the nitty gritty of that (rather than the high-level detailed description, which can go in the config/externally_supplied_metadata/README.md) could go in the documentation in the erflynn/sl_label repository, and be linked from the README.md.

Thanks again,
Emily

jaclyn-taroni · 2020-04-08T14:21:58Z

Hi @erflynn -

> One thought would be whether it is ok if the config/externally_supplied_metadata/README.md contains permalinks to other documentation in erflynn/sl_label, e.g. a link to documentation about how to run the code in erflynn/sl_label as well?
>
> I would like to add instructions for how to run the code in the erflynn/sl_label repo to create the version of the output that is added in a pull request, but I think the nitty gritty of that (rather than the high-level detailed description, which can go in the config/externally_supplied_metadata/README.md) could go in the documentation in the erflynn/sl_label repository, and be linked from the README.md.

This sounds perfect to me! My biggest concern would be tying it to documentation in erflynn/sl_label that is in sync with what's in this repo, but the permalink solution you've laid out covers that. Thank you!

erflynn · 2020-04-10T00:00:23Z

Awesome - I will get this set up, and send an updated PR.

Thanks for the feedback!

cgreene · 2020-08-03T15:33:40Z

Hi @erflynn - we're starting to adjust how our metadata are stored so that we can import this. How are things going on your end?

erflynn · 2020-08-03T16:49:27Z

Hi @cgreene - thanks for reaching out. That's great news!

Things are going well. I was wrapping up another project, but am back to working on this manuscript.

One note about the metadata labels: I noticed last week that the harmonized metadata sex labels are somewhat lossy in how they're parsed (e.g. https://www.ebi.ac.uk/ena/data/view/SRS1752937&display=xml has sex information, but it does not make it into the harmonized data). I've found this in SRA data; I haven't looked into the compendia data yet, but I expect it to be similar. I'm working on fixing this right now because I need a more accurate assessment of metadata coverage. I can send an update or post an issue later this week with the expanded set of labels and the parsing code that I write.
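To illustrate the kind of parsing involved, here is a hedged Python sketch that pulls a sex attribute out of a sample XML record. The inline XML is a made-up example, not the content of SRS1752937, and the element names (SAMPLE_ATTRIBUTE, TAG, VALUE) and the tag names treated as sex fields are assumptions about the record layout; this is not the parsing code referred to above.

```python
import xml.etree.ElementTree as ET

# Made-up record for illustration; SRS0000000 is a placeholder accession.
EXAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SRS0000000">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>tissue</TAG><VALUE>liver</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>Sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""

# Hypothetical set of attribute names that carry sex information.
SEX_TAGS = {"sex", "gender"}


def sex_values(xml_text):
    """Map sample accession -> raw sex value found in its attributes."""
    root = ET.fromstring(xml_text)
    found = {}
    for sample in root.iter("SAMPLE"):
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = (attr.findtext("TAG") or "").strip().lower()
            if tag in SEX_TAGS:
                found[sample.get("accession")] = (attr.findtext("VALUE") or "").strip()
    return found


print(sex_values(EXAMPLE_XML))
```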

kurtwheeler · 2020-08-14T15:14:13Z

Hi @erflynn! We've added some metadata from MetaSRA to the config/externally_supplied_metadata directory. At some point before you merge this, would you mind moving all your files down one level into config/externally_supplied_metadata/erflynn?

erflynn · 2020-08-14T15:57:02Z

not a problem! will do. and I'll add some updated labels

arielsvn and others added 30 commits July 3, 2019 12:04

Merge pull request AlexsLemonade#1367 from AlexsLemonade/dev

bb9f5fb

[Master] Deploy

Merge pull request AlexsLemonade#1383 from AlexsLemonade/dev

e861290

[Master] Deploy

Merge pull request AlexsLemonade#1425 from AlexsLemonade/dev

8c2b3c6

[hotfix] Bring master up to date with dev and redeploy to bring the API up

Merge pull request AlexsLemonade#1430 from AlexsLemonade/dev

d561189

Deploy small fixes for the foreman.

Merge pull request AlexsLemonade#1434 from AlexsLemonade/dev

2affb92

Revert data cleanup changes

Merge pull request AlexsLemonade#1439 from AlexsLemonade/dev

e58d12e

Improve our ability to not retry jobs that shouldn't be retried.

Merge pull request AlexsLemonade#1444 from AlexsLemonade/dev

0958459

[HOTFIX] Deploy compedium fixes to production

Merge pull request AlexsLemonade#1449 from AlexsLemonade/dev

161f639

[HOTFIX] Deploy config tuning for prod

Merge pull request AlexsLemonade#1451 from AlexsLemonade/dev

77b3b3d

[HOTFIX] Deploy a couple minor changes to prod

Merge pull request AlexsLemonade#1460 from AlexsLemonade/dev

165e5b5

Brings master up to date with dev, deploying many changes including some tuning

Merge pull request AlexsLemonade#1464 from AlexsLemonade/dev

ab826cb

[HOTFIX] Fix migration to run in SQL instead of python

Merge pull request AlexsLemonade#1467 from AlexsLemonade/dev

aec7673

[HOTFIX] Disable stats background scheduler

Merge pull request AlexsLemonade#1471 from AlexsLemonade/dev

a312fb9

[HOTFIX] Fix Affy docker image

Merge pull request AlexsLemonade#1475 from AlexsLemonade/dev

759fa61

[HOTFIX] Deploy foreman and background scheduler fixes

Merge pull request AlexsLemonade#1484 from AlexsLemonade/dev

5de1a74

Bring master up to date with dev to deploy improvements and fixes

Merge pull request AlexsLemonade#1489 from AlexsLemonade/dev

54980c0

[HOTFIX] Bumps max_clients to 8, don't increase RAM when instance cycling.

Merge pull request AlexsLemonade#1493 from AlexsLemonade/dev

eb86446

[HOTFIX] Deploy fix for SRA surveyor

Merge pull request AlexsLemonade#1495 from AlexsLemonade/dev

a37f174

[HOTFIX] Deploy the ENA fallback!

Merge pull request AlexsLemonade#1498 from AlexsLemonade/dev

72137d1

Deploy surveyor jobs running on smasher

Merge pull request AlexsLemonade#1502 from AlexsLemonade/dev

885c0c1

[HOTFIX] Deploy new feed-the-beast, up salmon timeout

Merge pull request AlexsLemonade#1504 from AlexsLemonade/dev

9dfb476

[HOTFIX] Merge missing commit from the feed the beast branch

Merge pull request AlexsLemonade#1506 from AlexsLemonade/dev

d65521e

[HOTFIX} Deploy fixes to the beast feeder

Merge pull request AlexsLemonade#1509 from AlexsLemonade/dev

53d18db

[HOTFIX] Improves the way we manage the queue of downloader jobs.

Merge pull request AlexsLemonade#1511 from AlexsLemonade/dev

a790dbc

Bumps up max clients to scale up a little more

Merge pull request AlexsLemonade#1520 from AlexsLemonade/dev

f10da1e

[HOTFIX] Deploy old salmon version rerunning! (and others)

Merge pull request AlexsLemonade#1522 from AlexsLemonade/dev

807c9cd

[HOTFIX] Don't scale up number of volumes

Merge pull request AlexsLemonade#1525 from AlexsLemonade/dev

6211230

[HOTFIX] Deploy salmon rerun fix and scale up larger

Merge pull request AlexsLemonade#1527 from AlexsLemonade/dev

d74c4b1

[HOTFIX] Bumps max clients up to 14

Merge pull request AlexsLemonade#1529 from AlexsLemonade/dev

93208ce

[HOTFIX] Revert volumes back to 10

Merge pull request AlexsLemonade#1531 from AlexsLemonade/dev

4f06ef7

[HOTFIX] Deploy migration that was left out before

kurtwheeler and others added 11 commits February 25, 2020 13:29

Revert "Revert "[DEPLOY] Fix for microbes and GSE75083""

66748ef

Merge pull request AlexsLemonade#2161 from AlexsLemonade/revert-2157-revert-2151-dev

4d7f834

Revert "Revert "[DEPLOY] Fix for microbes and GSE75083""

Revert "Revert "[DEPLOY] Fix pgbouncer/RDS remaining issues""

2557c8f

Merge pull request AlexsLemonade#2162 from AlexsLemonade/revert-2156-revert-2155-dev

bcd5c38

Revert "Revert "[DEPLOY] Fix pgbouncer/RDS remaining issues""

Merge pull request AlexsLemonade#2163 from AlexsLemonade/dev

b47923f

[DEPLOY] Finally get pgbouncer traffic routing working correctly

Merge pull request AlexsLemonade#2166 from AlexsLemonade/dev

a8476ae

[DEPLOY] Deploy final fixes to pg_bouncer config and fix intermittent test failures.

Merge pull request AlexsLemonade#2169 from AlexsLemonade/dev

28c98de

[HOTFIX] [DEPLOY] Patch long query in smasher jobs

Merge pull request AlexsLemonade#2201 from AlexsLemonade/dev

1235c1a

[DEPLOY] Trigger a deploy of quite a lot of stuff

Merge pull request AlexsLemonade#2203 from AlexsLemonade/dev

da9a344

[DEPLOY] Bump volume size for smasher instance

added imputed sex microarray files

e8a8ad3

added readme and cleaned metadata

e43bd22

jaclyn-taroni reviewed Mar 31, 2020

erflynn and others added 5 commits April 1, 2020 06:50

Update config/externally_supplied_metadata/README.md

d6140ac

Co-Authored-By: Jaclyn Taroni

Update config/externally_supplied_metadata/README.md

732cab5

Co-Authored-By: Jaclyn Taroni

fixed duplicate issue with cleaned metadata files

7adb9ce

Merge branch 'dev' of https://github.com/erflynn/refinebio into dev

35702dc

added RNA-seq sex labels

715ab52

updated microarray to normalized compendia to match refine-bio terminology

d9145d3

jaclyn-taroni reviewed Apr 3, 2020

Update config/externally_supplied_metadata/README.md

4ead408

Co-Authored-By: Jaclyn Taroni

kurtwheeler mentioned this pull request Aug 4, 2020

Change metadata model to use key-value-confidence #2467

Closed
