
Initial draft of Data Versioning for discussion #28

Draft · wants to merge 9 commits into main
Conversation

@adthrasher (Member) commented Oct 4, 2023

Initial draft to facilitate discussion of a data versioning standard.


@a-frantz (Member) commented Oct 5, 2023

I think the most straightforward solution is to get a concrete definition of "breaking change" in regards to our pipelines. Then we only increment the major version of a pipeline once a "breaking change" is introduced. All data versions that are produced by the same major version are suitable for cross-comparison. When we introduce a breaking change, that would trigger a re-run of all the data on the cloud in order to keep everything harmonized.

Adopting this approach would require retconning rnaseq-standard v3.0.0. IMO that was a mis-versioned release. It should have been released as v2.1.0, as it's just a feature release. There are no "real" breaking changes. I don't remember the discussions around this release, but I wouldn't be surprised if a deciding factor in the major version bump can be seen here, in the diff between tags v2.0.0 and v3.0.1. It's a messy diff to parse because it's mostly changes in the QC pipeline, but I've enumerated the relevant changes to rnaseq-standard:

  1. allow Single-End data in the sort_star_input.py script
    • Also reflected in the STAR task
    • We don't process SE data, so this doesn't impact harmonization
  2. Changes to how memory is allocated in htseq-count
    • likely affects how Plaid calls the pipeline, but otherwise a negligible change
  3. The output filename of htseq-count went from basename(bam, ".bam") + ".counts.txt" to basename(bam, ".bam") + ".feature-counts.txt"
    • According to the strictest definition of the SEMVER versioning scheme, this would be a "breaking change"
    • In the context of our pipeline, it's silly to increment a major version over this.
    • Yes, it requires changes to how the pipeline is wrapped in Plaid, but none of the data we care about is changing due to this.
  4. ngsderive went from version 1.0.2 to 1.1.1
    • CHANGELOG
    • These changes predate the CHANGELOG
    • If anyone is concerned by this, I can do a similar diff and find out what changed
    • I'd wager they were changes that don't tangibly impact results
  5. The input strand was renamed to strandedness
    • Again, according to an overly strict definition, this would be a breaking change
    • IMO we should not major version bump over something so inconsequential
  6. and finally, the inclusion of XenoCP. A change I would really like to see incorporated into our harmonization.
    • does not impact non-xenograft samples

Absolutely none of these changes are of consequence, unless we're talking about a xenograft sample.

IMO, we should rerun all of our xenograft samples with what is currently labelled v3.0.1 in order to get our users the cleanest data we can provide. But if we leave the version number as is, I think it would raise legitimate concerns from users along the lines of "Why isn't the latest version run on everything?" We address those concerns with 2 moves:

  1. relabel v3.0.1 to v2.1.0
  2. write a user-facing document explaining how there aren't any "re-run worthy" changes in this release. We need to assure users that they can confidently mix versions under the same major version umbrella.

Edited to add:
Turns out we already released a v2.1.0 of rnaseq-standard AND it includes XenoCP: stjudecloud/workflows@rnaseq-standard/v2.1.0...rnaseq-standard/v3.0.1
This v2.1.0 only exists as a GitHub tag, NOT as a GitHub release, which is why I didn't notice it at first. Unfortunately we can't just use this; it suffers from the same problematic cyclical dependency as v3.0.0. We could probably just delete this version and pretend it was never tagged 🤷 Or, maybe a better solution: re-release v3.0.1 as v2.1.1.

@adthrasher (Member, Author) commented:

> I think the most straightforward solution is to get a concrete definition of "breaking change" in regards to our pipelines. Then we only increment the major version of a pipeline once a "breaking change" is introduced. All data versions that are produced by the same major version are suitable for cross-comparison. When we introduce a breaking change, that would trigger a re-run of all the data on the cloud in order to keep everything harmonized.

My biggest issue here is that I would like to avoid this being a manual decision point. Can we come up with some metrics, and then for each release run a small cohort of samples through the new version and compare against the pre-decided metrics? That would give us a more concrete decision point. It would also be good to have a biological basis for saying "these are the same".

> Absolutely none of these changes are of consequence, unless we're talking about a xenograft sample.

> IMO, we should rerun all of our xenograft samples with what is currently labelled v3.0.1 in order to get our users the cleanest data we can provide. But if we leave the version number as is, I think it would raise legitimate concerns from users along the lines of "Why isn't the latest version run on everything?" We address those concerns with 2 moves:
>
>   1. relabel v3.0.1 to v2.1.0
>   2. write a user-facing document explaining how there aren't any "re-run worthy" changes in this release. We need to assure users that they can confidently mix versions under the same major version umbrella.

We should probably just remove v3.X.X of RNA-Seq at this point. I don't know that anyone else would be using it. Consider it a one-time mistake. The XenoCP features go in a 2.X.X release and we move forward with whatever definition of "breaking changes" we have decided above. I would like to avoid getting into a situation where we have to write a document and try to justify why changes aren't consequential to the output. I'd like to have a single document that outlines our criteria, and then we can generate automated reports for each release to show we're following it. If people disagree, they can rerun samples of their choosing. We will also open the RFC for public comments at some point, so hopefully people will comment if they object.

@a-frantz (Member) commented Oct 9, 2023

  1. I like the simplicity of relying on variant calls to determine what constitutes a breaking change. We'll have a VCF from the "before changes" alignment and a VCF from the "post changes" alignment. We call the VCF from the "before changes" alignment the ground truth, and with a few simple set operations we'll have True Positive/False Positive/False Negative rates between changes. We'd need to define thresholds for what constitutes a major version and a minor version release. I imagine most changes such as bug fixes and version bumps will result in 100% matches for variant calls, and we'd call any of those changes "patch releases" regardless of the code content changes.
    Although for RNA-Seq we'd have to contend with stochasticity in STAR alignment; have we done any testing to see if variant calls are stable between runs when everything is consistent besides seed? That may complicate versioning for RNA-Seq, although maybe we can get around it by specifying a constant seed for these "test" runs. (A minimal sketch of this set-operation comparison follows this list.)

  2. The RFC also proposes using metrics gathered by the QC pipeline to determine the level of change between code versions. This sounds like another viable approach, although more complex than comparing variant calls. There is a very large number of metrics that come out of QC. I could see us using the entirety of these metrics to define a "sample signature". Some variation will always be expected, as many QC metrics are derived from randomly sampling the BAM in question (compound that with stochasticity in aligner algorithms such as STAR), but I'd imagine that as a whole we could develop a stable signature that is robust to slight variations in individual metrics. I think this casts a wider net than relying on variant calls, and could lead to more subtle detection of changes we may consider breaking. The biggest downside to such a method is that it's quite opaque. Can we distill all these metrics into something that is usable, reliable, and easy to communicate with users?

  3. It wasn't mentioned in the RFC, but there was a historical effort, which fizzled out a few years ago, to create this kind of test for changes in the rnaseq-standard pipeline. The test devised back then was to calculate an R^2 value between the "before changes" HTSeq-Count gene expression and the "post changes" HTSeq-Count gene expression. I would recommend against using this as the only test for our RNA-Seq data.
    The preliminary QC results have shown that it's relatively common for our RNA-Seq samples to have a non-negligible proportion of intergenic reads. Relying entirely on gene expression values will leave us blind to changes in the intergenic portion of alignments. If something radical changes in the intergenic regions of an RNA sample between versions, that's something we want to know about. I could see us using an expression-based test as one of multiple tests, so long as the tests used in conjunction can detect changes in intergenic regions.
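
To make the set-operation comparison from point 1 concrete, here is a minimal sketch of how it could be done with `bcftools isec`. The file names are placeholders, both VCFs are assumed to be bgzipped and tabix-indexed, and this is only an illustration of the idea rather than a settled part of the proposal:

```bash
# Treat the "before changes" calls as ground truth and derive TP/FP/FN counts
# for the "post changes" calls via simple set operations.
bcftools isec -p isec_out before.vcf.gz after.vcf.gz

# With two inputs, isec writes:
#   isec_out/0000.vcf  records only in before.vcf.gz  -> false negatives
#   isec_out/0001.vcf  records only in after.vcf.gz   -> false positives
#   isec_out/0002.vcf  records present in both        -> true positives
tp=$(grep -vc '^#' isec_out/0002.vcf)
fp=$(grep -vc '^#' isec_out/0001.vcf)
fn=$(grep -vc '^#' isec_out/0000.vcf)
echo "TP=$tp FP=$fp FN=$fn"
```

Thresholds on these counts (or on the precision/recall derived from them) could then decide whether a release is a patch, minor, or major version.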

@adthrasher (Member, Author) commented:

>   1. I like the simplicity of relying on variant calls to determine what constitutes a breaking change. We'll have a VCF from the "before changes" alignment and a VCF from the "post changes" alignment. We call the VCF from the "before changes" alignment the ground truth, and with a few simple set operations we'll have True Positive/False Positive/False Negative rates between changes. We'd need to define thresholds for what constitutes a major version and a minor version release. I imagine most changes such as bug fixes and version bumps will result in 100% matches for variant calls, and we'd call any of those changes "patch releases" regardless of the code content changes.
>     Although for RNA-Seq we'd have to contend with stochasticity in STAR alignment; have we done any testing to see if variant calls are stable between runs when everything is consistent besides seed? That may complicate versioning for RNA-Seq, although maybe we can get around it by specifying a constant seed for these "test" runs.

Unfortunately, most of the epigenetic data types that we are working to add to St. Jude Cloud do not have variant calls as an end product. So beyond WES/WGS/RNA-Seq, we'd have to come up with a different metric. That may be reasonable, given that the goals of the experiments are vastly different.

I'm not sure if we have done any testing with STAR specifically, but for testing purposes, we could choose a fixed seed.

@adthrasher (Member, Author) commented:

>   1. The RFC also proposes using metrics gathered by the QC pipeline to determine the level of change between code versions. This sounds like another viable approach, although more complex than comparing variant calls. There is a very large number of metrics that come out of QC. I could see us using the entirety of these metrics to define a "sample signature". Some variation will always be expected, as many QC metrics are derived from randomly sampling the BAM in question (compound that with stochasticity in aligner algorithms such as STAR), but I'd imagine that as a whole we could develop a stable signature that is robust to slight variations in individual metrics. I think this casts a wider net than relying on variant calls, and could lead to more subtle detection of changes we may consider breaking. The biggest downside to such a method is that it's quite opaque. Can we distill all these metrics into something that is usable, reliable, and easy to communicate with users?

I think this would be highly dependent on what David comes up with in his examination of the QC metrics. If he can distill a key set of metrics that accurately represent a sample, then I think we can likely reuse that here.

@a-frantz (Member) commented Oct 9, 2023

> >   1. I like the simplicity of relying on variant calls to determine what constitutes a breaking change. We'll have a VCF from the "before changes" alignment and a VCF from the "post changes" alignment. We call the VCF from the "before changes" alignment the ground truth, and with a few simple set operations we'll have True Positive/False Positive/False Negative rates between changes. We'd need to define thresholds for what constitutes a major version and a minor version release. I imagine most changes such as bug fixes and version bumps will result in 100% matches for variant calls, and we'd call any of those changes "patch releases" regardless of the code content changes.
> >     Although for RNA-Seq we'd have to contend with stochasticity in STAR alignment; have we done any testing to see if variant calls are stable between runs when everything is consistent besides seed? That may complicate versioning for RNA-Seq, although maybe we can get around it by specifying a constant seed for these "test" runs.
>
> Unfortunately, most of the epigenetic data types that we are working to add to St. Jude Cloud do not have variant calls as an end product. So beyond WES/WGS/RNA-Seq, we'd have to come up with a different metric. That may be reasonable, given that the goals of the experiments are vastly different.
>
> I'm not sure if we have done any testing with STAR specifically, but for testing purposes, we could choose a fixed seed.

That's too bad it won't be usable for all our data types. I think that's acceptable though; as you said, the experiments are vastly different.

I'd say that one test for all data types would be preferable, but we shouldn't twist ourselves into knots coming up with that test. I'd be fine with a different test for each data type if each test "made sense". That being said, it sounds like using QC metrics could be a sensible universal test, regardless of data type. It's definitely worth investigating.

@adthrasher self-assigned this Oct 27, 2023
@DelaramR commented Nov 3, 2023

One thing that might be worth keeping in mind is that the file paths on the cloud and in the local cloud backup include the workflow version for RNA files and VCFs. See the examples below:

  • /G4K/FINAL/VARIANTS/v1.2.0/SOMATIC_VCF/
  • /ClinicalGenomics/RTCG/FINAL/TRANSCRIPTOME/v2.0.0/FEATURE_COUNTS/
  • /RTCG/FINAL/TRANSCRIPTOME/v2.0.0/BAM/

DNA files, on the other hand, don't currently have a version incorporated in the path; the version is assumed to be 1.0. The upload of DNA files started before the integration of RNA-Seq v2, which is what introduced the need to show two different versions. For example:

  • /RTCG/FINAL/EXOME/gVCF/
  • /RTCG/FINAL/EXOME/BAM/

We do store the version of the tools in the database to some extent though, for example:

  • Cromwell 85,DNA-Seq Standard 1.0.0
  • Cromwell 78,DNA-Seq Standard 1.0.0

We think it was a mistake to include the minor and patch versions in the path. We can consider the current versions in the path as only the major version, e.g. assume that RNA's v2.0.0 is in fact v2 and, similarly, that the VCFs are at v2 instead of v2.1. The good news is that this is all internal and not exposed to the users, unlike the public workflow versions, so we can be more flexible.

> We should probably just remove v3.X.X of RNA-Seq at this point. I don't know that anyone else would be using it. Consider it a one-time mistake. The XenoCP features go in a 2.X.X release and we move forward with whatever definition of "breaking changes" we have decided above.

I was about to mention the above but found it already in the comments. I definitely agree with this. This has been a concern for me, since switching to RNA-Seq v3 would mean the file paths are no longer consistent with the version of the workflow, but at the same time we really don't need to re-run anything, which could create ambiguity. Relabeling the workflow back to v2 solves the issue.

@a-frantz (Member) left a review comment:

@adthrasher I would go into some more detail about what these comparisons mean. I'm aware because of offline discussion with you, but it needs to make its way into the RFC as well.

To be more specific: I would like to see it mentioned that we are running 2 comparisons (or tests) per iteration of a workflow, and what the differences are in how those 2 tests will be interpreted.

The more straightforward comparison is between the "current version" (for RNA-Seq that would be v2.0.0) and the "new version" being considered. (The "new version" would not yet have a version number. The appropriate version number would be determined by the results of these 2 tests.) This test seeks to answer the questions "are the two versions comparable in results? Are similar variants being called by both versions?"

The other comparison is against the GIAB "gold standard dataset". This comparison seeks to assess "what is the quality of the variants being reported by the new version?" The goal here is to obtain an objective measure of accuracy and precision.

The first test, between versions of the same pipeline, can't tell us whether differences in results are "good" or "bad"; it only tells us whether something significant in the alignment changed between versions of the alignment workflow. That could be good news for us or bad news. We need the second test, against the gold standard dataset, to determine which.

@a-frantz (Member) commented:

> > Although for RNA-Seq we'd have to contend with stochasticity in STAR alignment; have we done any testing to see if variant calls are stable between runs when everything is consistent besides seed? That may complicate versioning for RNA-Seq, although maybe we can get around it by specifying a constant seed for these "test" runs.
>
> I'm not sure if we have done any testing with STAR specifically, but for testing purposes, we could choose a fixed seed.

I did some more digging on the stochasticity in STAR and it isn't as bad as I feared. There's a Google Groups thread where the author (Alex Dobin) provides good answers to questions about the randomness in STAR: https://groups.google.com/g/rna-star/c/64IpLi10VFA/m/7TUysolUHAAJ

To summarize: the things that may change from run to run are the order of multi-mapping alignments and some of the associated flags (like which one is marked the primary alignment). The actual alignments (where in the genome each read maps) are absolutely deterministic, with no variability between runs. That was my primary concern, so it's good news for us.

A smidge of bad news: if we wanted our STAR runs to be absolutely deterministic in terms of the output order of alignments, we would need to (1) set the RNG seed (we already do this on main; we default to a seed of 777 and I don't see a reason to ever change this) and (2) run STAR with a single thread. Single-threaded STAR is prohibitively slow in my opinion; STAR can already take over a day when we run it with more than a dozen cores. But I don't see the need for this level of determinism, so it's more of an academic point than one of consequence.
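
For reference, a minimal sketch of the STAR options being discussed here; everything besides `--runRNGseed` and `--runThreadN` is a placeholder rather than our actual workflow invocation:

```bash
# --runRNGseed 777 : fixed RNG seed (per the above, main already defaults to 777)
# --runThreadN 1   : single-threaded gives a fully deterministic output order,
#                    but is prohibitively slow for production runs
STAR \
    --runRNGseed 777 \
    --runThreadN 1 \
    --genomeDir /path/to/star_index \
    --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM Unsorted
```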


## Whole-Genome and Whole-Exome Sequencing

We propose to evaluate whole-genome (WGS) and whole-exome (WES) sequencing by running well-characterized samples through each iteration of the analysis pipeline. The resulting variant calls (gVCFs) will be compared to existing high-quality variant calls. This comparison will be conducted using Illumina's `hap.py` [comparison tool](https://github.com/Illumina/hap.py) as [recommended](https://www.biorxiv.org/content/10.1101/270157v3) by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team. Specifically, we propose to run samples from the National Institute of Standards and Technology (NIST)'s Genome in a Bottle (GIAB) project. We will perform analysis using samples HG002, HG003, HG004, HG005, HG006, and HG007 for WGS. For WES, we will use samples HG002, HG003, HG004, and HG005. The results from prior iterations of the pipeline will be supplied as the truth set. The confident call sets from GIAB will be provided as the gold standard dataset. The variant calls from the new workflow version will be treated as the query.
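
For concreteness, a minimal sketch of what one of these `hap.py` comparisons might look like; the file names, reference, and confident-regions BED are placeholders, as the RFC does not pin down the exact command line:

```bash
# Compare the prior pipeline version's HG002 calls (truth) against the new
# version's calls (query), restricted to the GIAB high-confidence regions.
hap.py \
    HG002.prior_version.vcf.gz \
    HG002.new_version.vcf.gz \
    -f HG002_GIAB_confident_regions.bed \
    -r GRCh38.fa \
    -o HG002.prior_vs_new
```

The same invocation with the GIAB confident call set supplied as the truth VCF gives the comparison against the gold standard dataset.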


A data versioning process should help us find a reasonable balance: it shouldn't stop us from ever making changes, but it should also keep us from incorporating any and all changes without assurances (or, in `msgen`'s case, insight) that those changes are reasonable and not meaningfully impactful to the data results.

# Discussion
Review comment (Member):

We probably also need to quantify the common uses of our data. For instance, within RNA-Seq, there are several different "primary" use-cases, including:

  • Expression quantification (expression maps, DGE, etc)
  • Variant calling (SVs, ASE, etc)
  • Isoform detection?
  • Others?

@adthrasher (Member, Author) commented Jan 16, 2024:

We could do that. Ultimately, I think we want to check our data end-points, which for RNA-Seq are a BAM file and a feature-counts file. We don't necessarily need to concern ourselves with how those files can be used downstream, as long as we have some mechanism for evaluating the level of change in the output files. In our case, the variant-calling comparison gives us a metric for the BAM, and the (undeveloped) comparison of feature-count files gives us a measure of expression-related changes.
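
As one possible starting point for that undeveloped feature-count comparison, here is a hedged sketch that joins two htseq-count outputs on gene ID and computes a Pearson correlation. The file names are placeholders and the actual metric and thresholds are still to be decided:

```bash
# Drop htseq-count's summary rows (__no_feature, __ambiguous, ...), join the two
# outputs on gene ID, and compute Pearson's r (and R^2) across all genes.
join -t $'\t' \
    <(grep -v '^__' before.feature-counts.txt | sort -k1,1) \
    <(grep -v '^__' after.feature-counts.txt | sort -k1,1) |
awk -F'\t' '{
    n++; x = $2; y = $3
    sx += x; sy += y; sxx += x*x; syy += y*y; sxy += x*y
}
END {
    r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
    printf "genes=%d  pearson_r=%.6f  r_squared=%.6f\n", n, r, r*r
}'
```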

```bash
gatk \
--java-options "-Xms6000m" \
HaplotypeCallerSpark \
```
Review comment (Member):

Okay so, in particular, the proposal is focused on germline mutations? Is there any gap with respect to and/or benefit to calling somatic mutations?

@adthrasher (Member, Author) replied:

Yes, this is focused on germline mutations. The goal is to capture alterations to the pipeline that change a meaningful result, in this case, variant calls. I would expect that any alignment-related issues affecting somatic variant calling would also affect germline variant calling.
