Initial draft of Data Versioning for discussion #28
base: main
Conversation
I think the most straightforward solution is to get a concrete definition of "breaking change" with regard to our pipelines. Then we only increment the major version of a pipeline once a "breaking change" is introduced. All data versions produced by the same major version are suitable for cross-comparison. When we introduce a breaking change, that would trigger a re-run of all the data on the cloud in order to keep everything harmonized. Adopting this approach would require retconning our existing pipeline version numbers.
Absolutely none of these changes are of consequence, unless we're talking about a xenograft sample. IMO, we should rerun all of our xenograft samples with what is currently labelled RNA-Seq v3.
Edited to add:
My biggest issue here is I would like to avoid this being a manual decision point. Can we come up with a set of metrics, then for each release run a small cohort of samples through the new version and compare against those pre-decided metrics? That would give us a more concrete decision point. It would also be good to have a biological basis for saying "these are the same".
We should probably just remove v3.X.X of RNA-Seq at this point. I don't know that anyone else would be using it. Consider it a one-time mistake. The XenoCP features go in a 2.X.X release and we move forward with whatever definition of "breaking changes" we have decided above. I would like to avoid getting in a situation where we have to write a document and try to justify why changes aren't consequential to the output. I'd like to have a single document that outlines our criteria and then we can generate automated reports for each release to show we're following that. If people disagree, they can rerun samples of their choosing. We will also open the RFC for public comments at some point. So hopefully people will comment if they object.
Unfortunately, most of the epigenetic data types that we are working to add to St. Jude Cloud do not have variant calls as an end product. So beyond WES/WGS/RNA-Seq, we'd have to come up with a different metric. That may be reasonable, given that the goals of the experiments are vastly different. I'm not sure if we have done any testing with STAR specifically, but for testing purposes, we could choose a fixed seed.
I think this would be highly dependent on what David comes up with in his examination of the QC metrics. If he can distill a key set of metrics that accurately represent a sample, then I think we can likely reuse that here.
That's too bad it won't be usable for all our data types. I think that's acceptable though; as you said, the experiments are vastly different. I'd say that one test for all data types would be preferable, but we shouldn't twist ourselves into knots coming up with that test. I'd be fine with a different test for each data type if each test "made sense". That being said, it sounds like using QC metrics could be a sensible universal test, regardless of data type. It's definitely worth investigating.
One thing that might be worth keeping in mind is that the file paths on the cloud and in the local cloud backup include the workflow version for RNA files and VCFs. See examples below:
DNA files, on the other hand, don't currently have a version incorporated in the path. The version is assumed to be 1.0. The upload of DNA files started before the integration of RNA-Seq V2, which introduced the need to show two different versions. We do capture the version of the tools in the database to some extent:
We think it was a mistake to include the minor and patch versions in the path. We can consider the current versions in the path as only the major version, e.g. assume that RNA's v2.0.0 is in fact v2, and similarly that the VCFs are at v2 instead of v2.1. The good news is that this is all internal and is not exposed to the users, unlike the public workflow versions, so we can be more flexible.
I was about to mention the above but found it in the comments. I definitely agree with this. This has been a concern for me, since switching to RNA-Seq V3 would mean the file paths are no longer consistent with the version of the workflow, but at the same time we really don't need to re-run anything, which could create ambiguity. Relabeling the workflow back to 2 solves the issue.
@adthrasher
I would go into some more detail about what these comparisons mean. I'm aware because of offline discussion with you, but it needs to make its way into the RFC as well.
To be more specific: I would like to see it mentioned that we are running 2 comparisons (or tests) per iteration of a workflow, and what the differences are in how those 2 tests will be interpreted.
The more straightforward comparison is between the "current version" (for RNA-Seq that would be v2.0.0) and the "new version" being considered. (The "new version" would not yet have a version number. The appropriate version number would be determined by the results of these 2 tests.) This test seeks to answer the questions "are the two versions comparable in results? Are similar variants being called by both versions?"
The other comparison is against the GIAB "gold standard dataset". This comparison seeks to assess "what is the quality of the variants being reported by the new version?" The goal here is to obtain an objective measure of accuracy and precision.
The first test, between versions of the same pipeline, can't tell us whether differences in results are "good" or "bad". The first test only tells us whether something significant in the alignment changed between versions of the alignment workflow. This could be good news for us or bad news. We need the second test, against the gold standard dataset, to determine which.
I did some more digging on the stochasticity in STAR and it isn't as bad as I feared. There's a Google Groups thread with the author (Alex Dobin) providing good answers to questions about the randomness in STAR: https://groups.google.com/g/rna-star/c/64IpLi10VFA/m/7TUysolUHAAJ To summarize: the things that may change from run to run are the order of multi-mapping alignments and some of the associated flags (like which is marked the primary alignment). The actual alignments (where in the genome each read is mapped) are absolutely deterministic, with no variability between runs. That was my primary concern, so it's good news for us. A smidge of bad news: if we wanted our STAR runs to be absolutely deterministic in terms of the output order of alignments, we would (1) need to set the RNG seed (we already do this on …
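For reference, a minimal sketch of what pinning the seed could look like in a STAR invocation. This is illustrative only: the index and FASTQ paths are placeholders, 777 is STAR's documented default seed, and the single-thread setting reflects an assumption that multi-threaded chunking can reorder output.

```bash
# Illustrative sketch, not our actual workflow parameters.
# --runRNGseed fixes the seed used for multi-mapper/primary-flag choices;
# --runThreadN 1 is included on the assumption that a single thread is needed
# for byte-identical output ordering.
STAR \
  --runRNGseed 777 \
  --runThreadN 1 \
  --genomeDir /path/to/star_index \
  --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
  --readFilesCommand zcat \
  --outSAMtype BAM Unsorted \
  --outFileNamePrefix sample.
```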
text/data_versioning.md
## Whole-Genome and Whole-Exome Sequencing
We propose to evaluate whole-genome (WGS) and whole-exome (WES) sequencing by running well-characterized samples through each iteration of the analysis pipeline. The resulting variant calls (gVCFs) will be compared to existing high-quality variant calls. This comparison will be conducted using Illumina's `hap.py` [comparison tool](https://github.com/Illumina/hap.py) as [recommended](https://www.biorxiv.org/content/10.1101/270157v3) by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team. Specifically, we propose to run samples from the National Institute of Standards and Technology (NIST) Genome in a Bottle (GIAB) project. We will perform analysis using samples HG002, HG003, HG004, HG005, HG006, and HG007 for WGS. For WES, we will use samples HG002, HG003, HG004, and HG005. The results from prior iterations of the pipeline will be supplied as the truth set. The confident call sets from GIAB will be provided as the gold standard dataset. The variant calls from the new workflow version will be treated as the query.
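For concreteness, a sketch of what the two `hap.py` comparisons might look like for one WGS sample. File names, paths, and the output prefixes are placeholders, not the pipeline's actual outputs.

```bash
# Illustrative only; HG002 WGS shown. Paths are assumptions.

# Comparison 1: prior pipeline iteration (truth) vs. candidate version (query).
hap.py \
  prior_version/HG002.vcf.gz \
  candidate_version/HG002.vcf.gz \
  -r GRCh38.fa \
  -f GIAB/HG002_GRCh38_highconf.bed \
  -o comparisons/HG002_prior_vs_candidate

# Comparison 2: GIAB confident call set (truth) vs. candidate version (query).
hap.py \
  GIAB/HG002_GRCh38_highconf.vcf.gz \
  candidate_version/HG002.vcf.gz \
  -r GRCh38.fa \
  -f GIAB/HG002_GRCh38_highconf.bed \
  -o comparisons/HG002_giab_vs_candidate
```

Each run writes aggregate precision/recall figures to a `*.summary.csv` file under the output prefix, which could feed the automated per-release reports discussed above.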
A data versioning process should help us find a reasonable balance: it shouldn't stop us from ever making changes, but it should also keep us from incorporating any and all changes without assurance (or, in `msgen`'s case, insight) that those changes are reasonable and not meaningfully impactful to the data results.
# Discussion |
We probably also need to quantify the common uses of our data. For instance, within RNA-Seq, there are several different "primary" use-cases, including:
- Expression quantification (expression maps, DGE, etc)
- Variant calling (SVs, ASE, etc)
- Isoform detection?
- Others?
We could do that. Ultimately, I think we want to check our data endpoints, which for RNA-Seq are a BAM file and a feature counts file. We don't necessarily need to concern ourselves with how those files can be used downstream, as long as we have some mechanism for evaluating levels of change in the output files. In our case, the variant calling comparison gives us a metric for the BAM, and the (undeveloped) comparison of feature count files gives us a measure of expression-related changes.
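As a rough illustration of what an automated feature-count comparison could look like (a hypothetical sketch, not an agreed-upon metric; the file names, column layout, and the 10% threshold are all assumptions):

```bash
# Assumes standard featureCounts output: two header lines, gene ID in column 1,
# counts for a single BAM in column 7, and identical gene order in both files.
# Reports how many genes change by more than 10% relative to the old counts.
paste \
  <(tail -n +3 old_version.featureCounts.txt | cut -f1,7) \
  <(tail -n +3 new_version.featureCounts.txt | cut -f7) |
awk '{
  old = $2; new = $3; diff = new - old
  if (diff < 0) diff = -diff
  if (old > 0 && diff > 0.1 * old) changed++
  total++
} END {
  printf "%d of %d genes changed by more than 10%%\n", changed, total
}'
```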
```bash
gatk \
  --java-options "-Xms6000m" \
  HaplotypeCallerSpark \
```
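The excerpt above is truncated; for context, a hedged sketch of what a complete invocation in this style might look like. The reference, input BAM, and output paths are placeholders rather than the RFC's actual workflow parameters.

```bash
# Illustrative only: emit a per-sample gVCF for downstream hap.py comparison.
# File names are placeholders, not the workflow's real inputs/outputs.
gatk \
  --java-options "-Xms6000m" \
  HaplotypeCallerSpark \
  -R GRCh38.fa \
  -I HG002.markdup.bam \
  -O HG002.g.vcf.gz \
  -ERC GVCF
```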
Okay, so the proposal is focused in particular on germline mutations? Is there any gap with respect to, or benefit from, calling somatic mutations?
Yes, this is focused on germline mutations. The goal is to capture alterations to the pipeline that change a meaningful result, in this case, variant calls. I would expect any alignment-related issues that would affect somatic variant calling would also affect germline variant calling.
Initial draft to facilitate discussion of a data versioning standard.
Rendered