-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: endedness and ngsderive update #108
Conversation
* main: docs(picard): state what ref FASTA used for feat(picard): pass reference fasta to picard validatesamfile docs(picard): correct parameter_meta docs: add missing '?' mark fix(fqlint): bump memory calculation
* main: feat: STAR rewrite (#99) chore(fqlint): simplify mem calculation
(cherry picked from commit 4918320)
@adthrasher How do we like the voice I used for these
IMO this 2 rules (could be formalized with better language) would sufficiently differentiate these strings that appear next to each other. Adopting these two styles does beg a question: "do we need to formalize the grammar used for |
* main: chore(fqlint): bump mem... again chore(fqlint): teensy tiny mem bump # Conflicts: # tools/fq.wdl
I do think using the active voice for the descriptions is a good approach. I'm less convinced about the At this point, I would avoid formalizing this, but I think it's a good approach and we can continue to refine it. |
@@ -62,7 +62,7 @@ task fqlint { | |||
Float read2_size = size(read_two_fastq, "GiB") | |||
|
|||
Int memory_gb = ( | |||
ceil((read1_size + read2_size) * 0.2) + 4 + modify_memory_gb | |||
ceil((read1_size + read2_size) * 0.25) + 4 + modify_memory_gb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like a high memory requirement for this task. I assume you've encountered a failure. Looking at the list of validators, the only one that should require much memory is the duplicate name check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is definitely overkill for the majority of cases. But I've run into a very small number of failures at the 0.2
scale (which is also more than what most cases need). So following the ethos for our default resource requirement settings (set defaults so we don't have failures in production/never need to think about it), I've been forced to set it this high.
I don't think we have much of a choice in the matter, unless we revise our resource policies. This is how much memory we need to assign so we guarantee no failures in production.
We could scale this back to a flat ~10gb which would be suitable for roughly 90% of cases? That's a gut estimate, no data to back it up. But then we'd need to override and set it higher to ~50gb for those corner cases. Do we want to deal with that? We have to keep in mind that there's no way to predict which samples are the problem samples ahead of time. Size is a very loose corollary.
tools/ngsderive.wdl
Outdated
awk 'NR > 1' ~{outfile_name} | cut -d$'\t' -f6 > strandedness.txt | ||
else | ||
awk 'NR > 1' ~{outfile_name} | cut -d$'\t' -f5 > strandedness.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to specify -d$'\t
? The tab delimiter should be the default for cut
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We can cut that extra argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Also just now realizing this won't work for the first case no matter what. We're going to need to add a grep 'overall'
after v4 of ngsderive is released (in progress here). So this will just be broken till that release. There's no sensical value to put in strandedness.txt
with the current latest release of ngsderive when --split-by-rg
is true. I think that's fine for now. We aren't using the problem configuration in prod.
I will make a note of the broken state though.
maxRetries: max_retries | ||
} | ||
} | ||
|
||
task junction_annotation { | ||
meta { | ||
description: "Annotates junctions found in an RNA-Seq BAM as known, novel, or partially novel" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it is worth defining known
, novel
, and partially novel
here or in a help
text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add to that, I think the definitions are missing in the ngsderive
docs. It might be better there. RSeQC states:
Annotated (known): The junction is part of the gene model. Both splice sites, 5’ splice site (5’SS) and 3’splice site (3’SS) are annotated by reference gene model.
Complete_novel: Both 5’SS and 3’SS are novel.
Partial_novel: One of the splice site (5’SS or 3’SS) is novel, and the other splice site is annotated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add to the ngsderive docs, and link to those docs here using the external_help
key 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitions have been added to the ngsderive v4 PR. specific commit with those definitions
* main: feat: endedness and ngsderive update (#108) chore(fqlint): bump mem... again chore(fqlint): teensy tiny mem bump
title