Adding imputed sex labels #2207
base: dev
[Master] Deploy
[Master] Deploy
[hotfix] Bring master up to date with dev and redeploy to bring the API up
Deploy small fixes for the foreman.
Revert data cleanup changes
Improve our ability to not retry jobs that shouldn't be retried.
[HOTFIX] Deploy compedium fixes to production
[HOTFIX] Deploy config tuning for prod
[HOTFIX] Deploy a couple minor changes to prod
Brings master up to date with dev, deploying many changes including some tuning
[HOTFIX] Fix migration to run in SQL instead of python
[HOTFIX] Disable stats background scheduler
[HOTFIX] Fix Affy docker image
[HOTFIX] Deploy foreman and background scheduler fixes
Bring master up to date with dev to deploy improvements and fixes
[HOTFIX] Bumps max_clients to 8, don't increase RAM when instance cycling.
[HOTFIX] Deploy fix for SRA surveyor
[HOTFIX] Deploy the ENA fallback!
Deploy surveyor jobs running on smasher
[HOTFIX] Deploy new feed-the-beast, up salmon timeout
[HOTFIX] Merge missing commit from the feed the beast branch
[HOTFIX} Deploy fixes to the beast feeder
[HOTFIX] Improves the way we manage the queue of downloader jobs.
Bumps up max clients to scale up a little more
[HOTFIX] Deploy old salmon version rerunning! (and others)
[HOTFIX] Don't scale up number of volumes
[HOTFIX] Deploy salmon rerun fix and scale up larger
[HOTFIX] Bumps max clients up to 14
[HOTFIX] Revert volumes back to 10
[HOTFIX] Deploy migration that was left out before
…revert-2151-dev Revert "Revert "[DEPLOY] Fix for microbes and GSE75083""
…revert-2155-dev Revert "Revert "[DEPLOY] Fix pgbouncer/RDS remaining issues""
[DEPLOY] Finally get pgbouncer traffic routing working correctly
[DEPLOY] Deploy final fixes to pg_bouncer config and fix intermittent test failures.
[HOTFIX] [DEPLOY] Patch long query in smasher jobs
[DEPLOY] Trigger a deploy of quite a lot of stuff
[DEPLOY] Bump volume size for smasher instance
Hi @erflynn, thanks so much for filing this! I'm Jaclyn, I'm a scientist at the CCDL. 👋 @kurtwheeler asked me to take a look at the methods section of your pull request.
In general, these look good to me! I think the level of detail you have included in the README looks good, but I had a few comments about linking to specific parts of the repo you mentioned (`erflynn/sl_label`) or previous analyses in case folks want to dig in a bit more at a later date. In addition, I had a question about the content of the CSV files included here – they may be as expected but it was not intuitive to me in the context of the README, which might indicate that this is an area where we should expand the documentation a bit.
Thanks again! Please let us know if you need anything!
| dataset | Human | Mouse | Rat |
| ----- | ---- | ---- | ---- |
| Training (n=1400) | 95.60% | 96.10% | 98.60% |
I suspect there might be more information about the training and testing sets, such as class balance, in https://github.com/erflynn/sl_label. Can we add a link to that information in the paragraph above that begins with "The majority of gene expression data is missing metadata sex labels", please?
Added some information! thanks for the suggestion
This update includes imputed sex labels for microarray data (mouse, rat, and human).

The majority of gene expression data is missing metadata sex labels (see Table 1). This lack of labels prevents us from examining the breakdown by sex of many studies. We used the expression of X and Y chromosome genes and metadata sex labels to train a logistic regression model (with elastic net penalty) to predict sample sex. Across all three organisms, our models achieve approximately 95% accuracy in a randomly selected held-out test set as compared to the metadata labels. Additionally, we assessed the accuracy of our model, on various subsets of the data; comparing to all metadata sex labels (agreement 93.5-94.8%), a random sample of single sex studies (agreement 92.6-96.5%), and, in human, manually annotated sex labels from a previous analysis (94.2%) (see Table 2).
Is there a link to "a previous analysis" that you would be comfortable putting here?
oops! missing citation here, added in. apologies!
| organism | Samples (n) | Samples missing metadata annotation | Studies (n) | Studies missing metadata annotation |
| ----- | ---- | ---- | ---- | ---- |
| human | 430119 | 74.90% | 14987 | 87.30% |
I was noticing the line counts for the CSV files - I would expect for `human_rnaseq` + `human_microarray` not to exceed the number `Samples (n)` in this table. I took a closer look at the tabular data in the CSV files and it looks like there are duplicate values in the `acc` column and in the microarray file, there are run accessions that are consistent with RNA-seq data (e.g., `DRR`, `ERR`). There are about 99k accessions shared across the human RNA-seq and microarray files but looking at the human RNA-seq file there are no GEO sample accessions (e.g., `GSM`). This may totally be intentional, but it is a bit different than what I would expect given the context in this document.
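A check along these lines can be sketched in a few lines of Python. This is only an illustration, not the script used in the review: it assumes each CSV has an `acc` column as described above, and the accessions and the `sex` column in the inline data are made-up placeholders.

```python
import csv
import io
from collections import Counter


def read_accessions(handle, column="acc"):
    """Return the values of the accession column, in file order."""
    return [row[column] for row in csv.DictReader(handle)]


def duplicated(accessions):
    """Return the set of accessions that appear more than once."""
    counts = Counter(accessions)
    return {acc for acc, n in counts.items() if n > 1}


def shared(first, second):
    """Return the accessions present in both lists."""
    return set(first) & set(second)


# Made-up stand-ins for the microarray and RNA-seq label files.
microarray_csv = io.StringIO(
    "acc,sex\nGSM0000001,male\nGSM0000001,male\nDRR000001,female\n"
)
rnaseq_csv = io.StringIO("acc,sex\nDRR000001,female\nERR000001,male\n")

microarray = read_accessions(microarray_csv)
rnaseq = read_accessions(rnaseq_csv)

print("duplicates in microarray file:", duplicated(microarray))
print("shared across files:", shared(microarray, rnaseq))
```

With real files, `io.StringIO(...)` would be replaced by `open(path, newline="")`.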
Re the csv files - you are right, there are duplicates, apologies here! I have updated these, and put in a commit that fixes that. The numbers now exactly match those in the table.
For the run accessions, these are exactly the set of accessions that are included in the `aggregated_metadata.json` files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same. Is there a different way that you usually organize this?
> For the run accessions, these are exactly the set of accessions that are included in the `aggregated_metadata.json` files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same.

Ah okay I see! Then I would expect some overlap because the normalized compendium contains both microarray and RNA-seq data. So maybe it would be better to call these files `normalized` and `rnaseq_sample` rather than `microarray` and `rnaseq` to match the terminology here: https://www.refine.bio/compendia.
ah ok! I will update this, thank you!
Co-Authored-By: Jaclyn Taroni <[email protected]>
Co-Authored-By: Jaclyn Taroni <[email protected]>
Thanks for the feedback @jaclyn-taroni. This is very helpful! I updated the README to include more description, and fixed the duplicates within the .csv files. I am not sure what the coverage difference is between the compendia? But the samples should match the I also have the RNA-seq sex labels for all of the files that I could find (Issue #2211), and will amend the pull request to include these. For linking to the
Hi @erflynn - thanks for updating your comment to include all of this info! I have given this question a bit of thought:
I think we definitely want to make sure all of the information you've now added to your comment makes it into a version-controlled Markdown document. Now the question becomes where because if you have to write things up in two places that's more difficult to maintain. Some other considerations:
There are two ways to address this that come to mind for me. We can either have all of the documentation that's relevant to a particular "release" of labels (the CSV and JSON files in Another approach would be to keep all of the documentation in Option 1 is probably more straightforward in my opinion, but would like to hear your thoughts. We can talk specifics of organizing the linking once we decide on a general strategy. Please let me know if anything is unclear! Thanks!
Co-Authored-By: Jaclyn Taroni <[email protected]>
Hi @jaclyn-taroni, Thanks for thinking this through! I agree that I think option 1 makes more sense, I would prefer to have the documentation review step built-in. One thought would be whether it is ok if the I would like to add instructions for how to run the code in the Thanks again,
Hi @erflynn -
This sounds perfect to me! My biggest concern would be tying it to documentation in
Awesome - I will get this set up, and send an updated PR. Thanks for the feedback!
Hi @erflynn - we're starting to adjust how our metadata are stored so that we can import this. How are things going on your end?
Hi @cgreene - thanks for reaching out. That's great news! Things are going well. I was wrapping up another project, but am back to working on this manuscript. One note about the metadata labels: I noticed last week that the harmonized metadata sex labels are somewhat lossy in how they're parsed (e.g. https://www.ebi.ac.uk/ena/data/view/SRS1752937&display=xml has sex information but it does not make it into the harmonized data). I've found this in SRA data, but haven't looked into compendia data yet but I expect this might be similar. I'm working on fixing this right now because I need a more accurate assessment of metadata coverage. I can send an update or post an issue later this week with the expanded set of labels and the parsing code that I write.
Hi @erflynn! We've added some metadata from MetaSRA to the
not a problem! will do. and I'll add some updated labels |
Issue Number
This addresses issue number #2181.
Purpose/Implementation Notes
This update includes imputed sex labels for microarray data (mouse, rat, and human).
I will update with more labels soon; I am not sure if you want to integrate this pull request now or come back to it later when I have more labels? I am sending now so we can work through methods, format, etc.
Methods
Details are included below (this is also in the README within the `config/externally_supplied_metadata/` directory).

All code to produce this update is included in the sl_label repository and can be easily applied to other organisms that have XX/XY sex determination (provided sufficient metadata sex labels are available for model training).
The majority of gene expression data is missing metadata sex labels (see Table 1). This lack of labels prevents us from examining the sex breakdown of many studies. We trained a penalized logistic regression model that uses the expression of X and Y chromosome genes to impute the sex of a given sample. The model was trained using the `glmnet` R package, with the elastic net penalty. Lambda was selected in ten-fold cross validation. We used metadata sex labels as "ground truth" for the training and testing data. To construct the training and testing datasets, we filtered for samples with metadata sex labels that did not have a cell line annotation (using the refine-bio `sex` and `cell_line` tags), and grouped these samples into studies. We divided the set of all studies in half and then sampled, for both males and females, n=700 samples for training from one half and n=300 samples for testing from the other. This provided balanced training and testing datasets, where none of the samples in the test data were from a study that had been seen in training.

A previous study [1] indicated that there is widespread mis-annotation in metadata sex labels. Because of this, we set up additional testing datasets: a high confidence mixed sex dataset and single sex datasets. The high confidence mixed sex dataset consists of all mixed sex studies with at least five male and five female samples where the metadata sex labels match expression-based labels from at least one of two clustering-based sex imputation methods [1,2] (we did not apply clustering-based methods to the entire refine-bio dataset because they perform poorly on small studies, single sex studies, and studies with high class imbalance). The single sex datasets are all studies with at least ten samples and all male or all female labels as indicated by metadata sex labels.
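The study-level split described above can be sketched as follows. This is not the code from the sl_label repository (which is in R); it is a minimal Python illustration under stated assumptions: samples are represented as hypothetical `(accession, study, sex)` tuples, the function name `split_by_study` is invented here, and the toy data use 7 and 3 samples per sex in place of 700 and 300.

```python
import random


def split_by_study(samples, n_train, n_test, seed=0):
    """Split (accession, study, sex) tuples into balanced train/test sets.

    Studies are divided in half first, so no study contributes samples to
    both sets. Within each half, the same number of female and male samples
    is drawn at random.
    """
    rng = random.Random(seed)
    studies = sorted({study for _, study, _ in samples})
    rng.shuffle(studies)
    train_studies = set(studies[: len(studies) // 2])

    def draw(from_train_half, n_per_sex):
        picked = []
        for sex in ("female", "male"):
            pool = [
                sample
                for sample in samples
                if sample[2] == sex
                and (sample[1] in train_studies) == from_train_half
            ]
            if len(pool) < n_per_sex:
                raise ValueError(f"only {len(pool)} {sex} samples available")
            picked.extend(rng.sample(pool, n_per_sex))
        return picked

    return draw(True, n_train), draw(False, n_test)


# Made-up toy data: 20 studies with 10 samples each, alternating sex.
samples = [
    (f"sample_{i}_{j}", f"study_{i}", "male" if j % 2 else "female")
    for i in range(20)
    for j in range(10)
]
train, test = split_by_study(samples, n_train=7, n_test=3)
```

Splitting on studies before sampling is what keeps the test set free of studies seen in training; sampling a fixed number per sex is what keeps both sets balanced.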
Across all three organisms, our models achieve approximately 95% accuracy in a randomly selected held-out test set as compared to the metadata labels. Additionally, we assessed the accuracy of our model on various subsets of the data, comparing to all metadata sex labels (agreement 93.5-94.8%), a random sample of single sex studies (agreement 92.6-96.5%), and, in human, manually annotated sex labels from a previous analysis [3] (94.2%) (see Table 2).
Table 1. Metadata missingness for sex labels.
Table 2. Concordance of sex labels. Numbers in parentheses indicate the total number of samples, percentages the number of samples that agree divided by the total number of samples. High confidence labels have matching metadata and clustering based expression labels.
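For readers who want to recompute the concordance figures, the calculation described in the caption (agreeing samples divided by total samples) can be sketched as below. The function name `concordance`, the dictionary representation, and the toy labels are assumptions made for illustration; only samples with both a reference and an imputed label are compared.

```python
def concordance(reference, imputed):
    """Return (fraction agreeing, number compared) for two label mappings.

    Both arguments map sample accession -> "male" or "female". Samples that
    are missing from either mapping are left out of the comparison.
    """
    common = [acc for acc in reference if acc in imputed]
    if not common:
        raise ValueError("no samples in common")
    agree = sum(reference[acc] == imputed[acc] for acc in common)
    return agree / len(common), len(common)


# Made-up labels: sample "c" disagrees, sample "e" has no reference label.
reference = {"a": "male", "b": "female", "c": "female", "d": "male"}
imputed = {"a": "male", "b": "female", "c": "male", "d": "male", "e": "female"}

rate, n = concordance(reference, imputed)
print(f"{rate:.1%} (n={n})")  # prints "75.0% (n=4)"
```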
The cleaned metadata sex labels are also included in the `cleaned_metadata/` directory for microarray and RNA-seq. This process mapped all harmonized sex labels to "male", "female", "mixed", or "unknown". Code for this is included in `01_metadata` within the `erflynn/sl_label` repository.

Types of changes
This includes externally supplied metadata files in .json format. These are under `config/externally_supplied_metadata/`. The format follows the one discussed in Issue #2127, where value is one of "PATO:0000383" (female) or "PATO:0000384" (male) and probability is P(imputed_sex=value) from the logistic regression model.
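As a rough illustration of how a consumer might read such a record, here is a Python sketch. Only the `value` and `probability` fields and the two PATO terms come from the description above; the accession-keyed layout, the placeholder accessions, and the `decode_record` helper are hypothetical and may not match the files in this pull request.

```python
import json

# PATO terms as stated in the description above.
PATO_TO_SEX = {"PATO:0000383": "female", "PATO:0000384": "male"}


def decode_record(record):
    """Turn one {"value": ..., "probability": ...} record into (sex, probability)."""
    value = record["value"]
    probability = float(record["probability"])
    if value not in PATO_TO_SEX:
        raise ValueError(f"unexpected value: {value!r}")
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    return PATO_TO_SEX[value], probability


# Hypothetical document keyed by made-up sample accessions.
raw = """
{
  "SAMPLE_A": {"value": "PATO:0000383", "probability": 0.97},
  "SAMPLE_B": {"value": "PATO:0000384", "probability": 0.88}
}
"""

labels = {acc: decode_record(rec) for acc, rec in json.loads(raw).items()}
print(labels)
```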
Cleaned harmonized metadata for sex is included in the `cleaned_metadata/` directory with the columns "acc" (sample accession), "sex" (the harmonized sex label), and "mapped_sex" (the harmonized sex label mapped to "male", "female", "mixed", or "unknown").

Screenshot
Sex sample breakdown using metadata and (imputed) expression labels for each organism.
sex_breakdown_microarray.pdf
References
[1] Toker, L., et al. F1000Research. 2016, 5: 2103.
[2] Buckberry, S., et al. Bioinformatics. 2014, 30(14): 2084–2085.
[3] Giles, C. B., et al. BMC Bioinformatics. 2017, 18(Suppl 14): 509.