feat: endedness and ngsderive update #108

a-frantz · 2023-08-18T16:16:49Z

title

a-frantz · 2023-09-06T14:09:28Z

@adthrasher
re: this comment #110 (comment)

How do we like the voice I used for these meta.description and outputs? i.e.

- meta.description starts with a verb. All tasks are doing something, and should be phrased in the active tense (grammar's not my strong suit, I think these are examples of active tense).
- outputs are a little trickier. For this, I'd say "First 'sentence' should be a sentence fragment (just the subject)."
  - strandedness_file: "TSV file containing the ngsderive strandedness report"
    - This is an incomplete sentence. I think it's all we need for most outputs. It can optionally be expanded with one or more full sentences after the initial fragment.

IMO these two rules (which could be formalized with better language) would sufficiently differentiate these strings, which appear next to each other.

Adopting these two styles does raise a question: "do we need to formalize the grammar used for parameter_meta?"
I think the parameter meta is fine as is. But I thought I'd raise the question.

adthrasher · 2023-09-07T18:24:23Z

I do think using the active voice for the descriptions is a good approach. I'm less convinced about the outputs. I think it is fine to write a fragment for those, but I don't think that should be a requirement.

At this point, I would avoid formalizing this, but I think it's a good approach and we can continue to refine it.

adthrasher · 2023-09-07T18:30:12Z

tools/fq.wdl

@@ -62,7 +62,7 @@ task fqlint {
    Float read2_size = size(read_two_fastq, "GiB")

    Int memory_gb = (
-        ceil((read1_size + read2_size) * 0.2) + 4 + modify_memory_gb
+        ceil((read1_size + read2_size) * 0.25) + 4 + modify_memory_gb


That seems like a high memory requirement for this task. I assume you've encountered a failure. Looking at the list of validators, the only one that should require much memory is the duplicate name check.

Yes it is definitely overkill for the majority of cases. But I've run into a very small number of failures at the 0.2 scale (which is also more than what most cases need). So following the ethos for our default resource requirement settings (set defaults so we don't have failures in production/never need to think about it), I've been forced to set it this high.

I don't think we have much of a choice in the matter, unless we revise our resource policies. This is how much memory we need to assign so we guarantee no failures in production.

We could scale this back to a flat ~10gb, which would be suitable for roughly 90% of cases? That's a gut estimate, no data to back it up. But then we'd need to override and set it higher to ~50gb for those corner cases. Do we want to deal with that? We have to keep in mind that there's no way to predict which samples are the problem samples ahead of time. Size is a very loose correlate.
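For reference, the arithmetic in the diff above can be sketched in Python. This is only an illustration of the WDL expression, not code from the repository: the function name `fqlint_memory_gb` and the example file sizes are made up, while the 0.2 and 0.25 factors, the 4 GB floor, and `modify_memory_gb` come from the hunk.

```python
import math

def fqlint_memory_gb(read1_gib: float, read2_gib: float = 0.0,
                     scale: float = 0.25, modify_memory_gb: int = 0) -> int:
    """Mirror ceil((read1_size + read2_size) * scale) + 4 + modify_memory_gb."""
    return math.ceil((read1_gib + read2_gib) * scale) + 4 + modify_memory_gb

# Hypothetical pair of FASTQs totalling 21 GiB (sizes invented for illustration).
old_request = fqlint_memory_gb(10.0, 11.0, scale=0.2)   # factor before this change
new_request = fqlint_memory_gb(10.0, 11.0, scale=0.25)  # factor after this change
print(old_request, new_request)  # prints "9 10"
```

Because the request scales with input size, the bump from 0.2 to 0.25 matters little for small inputs and grows with large ones, which is the trade-off against a flat default discussed in this thread.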

adthrasher · 2023-09-07T18:32:44Z

tools/ngsderive.wdl

+            awk 'NR > 1' ~{outfile_name} | cut -d$'\t' -f6 > strandedness.txt
+        else
+            awk 'NR > 1' ~{outfile_name} | cut -d$'\t' -f5 > strandedness.txt


Do you need to specify -d$'\t'? The tab delimiter should be the default for cut.

Good point. We can cut that extra argument.

Ah. Also just now realizing this won't work for the first case no matter what. We're going to need to add a grep 'overall' after v4 of ngsderive is released (in progress here). So this will just be broken until that release. There's no sensible value to put in strandedness.txt with the current latest release of ngsderive when --split-by-rg is true. I think that's fine for now. We aren't using the problem configuration in prod.

I will make a note of the broken state though.
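To make the problem concrete, here is a rough Python equivalent of the `awk 'NR > 1' ... | cut -f N` pipeline from the diff. It is a sketch under assumptions: `extract_column` is a hypothetical helper, and the TSV contents (column names and values) are invented for illustration and are not the real ngsderive report layout.

```python
from typing import List

def extract_column(tsv_text: str, field: int) -> List[str]:
    """Skip the header line, then return one 1-based tab-separated field per row.

    Roughly what `awk 'NR > 1' FILE | cut -f FIELD` prints, assuming every
    row has at least `field` columns.
    """
    data_rows = tsv_text.splitlines()[1:]  # awk 'NR > 1' drops the header
    return [row.split("\t")[field - 1] for row in data_rows]

# Made-up report with one data row: the result is a single value.
single = "File\tA\tB\tC\tCall\nsample.bam\t1\t2\t3\tUnstranded\n"
print(extract_column(single, 5))  # prints ['Unstranded']

# Made-up report with one row per read group: the result is several values,
# so there is no single string to write to strandedness.txt.
per_rg = (
    "File\tReadGroup\tA\tB\tC\tCall\n"
    "sample.bam\trg1\t1\t2\t3\tForward\n"
    "sample.bam\trg2\t1\t2\t3\tReverse\n"
)
print(extract_column(per_rg, 6))  # prints ['Forward', 'Reverse']
```

With several data rows the pipeline emits several lines, which is why some way of selecting one summary row (the `grep 'overall'` mentioned above) is needed when splitting by read group.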

adthrasher · 2023-09-07T18:35:21Z

tools/ngsderive.wdl

        maxRetries: max_retries
    }
 }

 task junction_annotation {
+    meta {
+        description: "Annotates junctions found in an RNA-Seq BAM as known, novel, or partially novel"


Do you think it is worth defining known, novel, and partially novel here or in a help text?

To add to that, I think the definitions are missing in the ngsderive docs. It might be better there. RSeQC states:

Annotated (known): The junction is part of the gene model. Both splice sites, 5’ splice site (5’SS) and 3’ splice site (3’SS), are annotated by the reference gene model.

Complete_novel: Both 5’SS and 3’SS are novel.

Partial_novel: One of the splice sites (5’SS or 3’SS) is novel, and the other splice site is annotated.

I'll add to the ngsderive docs, and link to those docs here using the external_help key 👍

Definitions have been added to the ngsderive v4 PR (see the specific commit with those definitions).
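The three categories quoted from RSeQC above reduce to a small decision rule, sketched here in Python. The function `classify_junction` and its boolean parameters are hypothetical names for illustration; the definitions that ngsderive ends up documenting may be worded differently.

```python
def classify_junction(five_prime_annotated: bool, three_prime_annotated: bool) -> str:
    """Classify a splice junction using the RSeQC definitions quoted above."""
    if five_prime_annotated and three_prime_annotated:
        return "known"           # both splice sites are in the gene model
    if five_prime_annotated or three_prime_annotated:
        return "partial_novel"   # exactly one splice site is annotated
    return "complete_novel"      # neither splice site is annotated

print(classify_junction(True, True))    # prints "known"
print(classify_junction(True, False))   # prints "partial_novel"
print(classify_junction(False, False))  # prints "complete_novel"
```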

a-frantz added 7 commits August 18, 2023 11:20

chore: rename task 'infer_strandedness' to 'strandedness'

3c387bd

chore: upgrade to latest ngsderive

25e25fb

chore: num_samples -> num_reads

66200b3

feat: add split_by_rg opt to strandedness

f7cf8e3

fix: missing backslash

dec978a

chore: make lower MEM bound consistent (4gb)

52d0c7b

feat: ngsderive endedness task

320e702

a-frantz self-assigned this Aug 18, 2023

a-frantz added 6 commits August 18, 2023 12:19

fix: name collision

246a3af

feat: add endedness to QC

4ba7835

fix: output endedness file and pass to multiqc

8c0fcdc

chore(ngsderive): bump to latest image

01595aa

Merge branch 'main' into endedness

c35d0a8

* main: docs(picard): state what ref FASTA used for feat(picard): pass reference fasta to picard validatesamfile docs(picard): correct parameter_meta docs: add missing '?' mark fix(fqlint): bump memory calculation

Merge branch 'main' into endedness

3f2229c

* main: feat: STAR rewrite (#99) chore(fqlint): simplify mem calculation

a-frantz marked this pull request as ready for review August 21, 2023 15:00

a-frantz requested a review from adthrasher August 21, 2023 15:00

adthrasher approved these changes Aug 21, 2023

a-frantz added 13 commits August 21, 2023 15:07

fix: missing backslash

6df1526

fix(endedness): bump mem calculation

0c19a72

docs: remove TODO about calc_rpt

a7dac58

feat: parse entire file in encoding call

4aaa817

Merge branch 'main' into endedness

49b8ac2

* main: feat: calc PHRED stats for first and last base in all reads (#109)

fix(picard): consistently use "new style" arguments in ValidateSamFile

0c10d81

docs: TODO about legacy VS new style picard params

3de81e4

docs: fix bad task meta

453bb61

fix(endedness): bump mem yet again

546de4b

fix(endedness): bump mem again

33572c3

fix(endedness): bump mem... again...

c7b9afe

chore: bump to latest ngsderive image

f025687

chore(fqlint): teensy tiny mem bump

40e47ed

(cherry picked from commit 8316f32)

a-frantz added 2 commits August 29, 2023 15:51

chore(fqlint): bump mem... again

5ad4082

(cherry picked from commit 4918320)

chore(fq lint): bump mem. Hopefully last time

16b0142

a-frantz mentioned this pull request Sep 4, 2023

docs: full task compliance with style-guide #110

Merged

a-frantz added 3 commits September 5, 2023 14:24

docs(ngsderive): comply to style-guide

7b4d4b4

fix(ngsderive): correctly parse strandedness regardless of split-by-rg

0264ca2

fix: gtf -> gene_model

b834917

a-frantz requested a review from adthrasher September 6, 2023 13:58

fix: gtf -> gene_model

48a998c

Merge branch 'main' into endedness

fccbbc1

* main: chore(fqlint): bump mem... again chore(fqlint): teensy tiny mem bump # Conflicts: # tools/fq.wdl

adthrasher approved these changes Sep 7, 2023

a-frantz added 3 commits September 7, 2023 15:00

docs(ngsderive): TODO about how split_by_rg breaks awk+cut command

a830b2e

style: add rule to style-guide about active voice in descriptions

8b3d565

docs: add ngsderive "external_help" link in junction-annotation

e679db1

a-frantz merged commit 2c7aa90 into main Sep 8, 2023

a-frantz deleted the endedness branch September 8, 2023 16:09

a-frantz added a commit that referenced this pull request Sep 8, 2023

Merge branch 'main' into docs

9f14627

* main: feat: endedness and ngsderive update (#108) chore(fqlint): bump mem... again chore(fqlint): teensy tiny mem bump
