Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: endedness and ngsderive update #108

Merged
merged 36 commits into from
Sep 8, 2023
Merged

feat: endedness and ngsderive update #108

merged 36 commits into from
Sep 8, 2023

Conversation

a-frantz
Copy link
Member

@a-frantz a-frantz commented Aug 18, 2023

title

@a-frantz a-frantz self-assigned this Aug 18, 2023
* main:
  docs(picard): state what ref FASTA used for
  feat(picard): pass reference fasta to picard validatesamfile
  docs(picard): correct parameter_meta
  docs: add missing '?' mark
  fix(fqlint): bump memory calculation
* main:
  feat: STAR rewrite (#99)
  chore(fqlint): simplify mem calculation
@a-frantz a-frantz marked this pull request as ready for review August 21, 2023 15:00
@a-frantz a-frantz requested a review from adthrasher August 21, 2023 15:00
@a-frantz a-frantz requested a review from adthrasher September 6, 2023 13:58
@a-frantz
Copy link
Member Author

a-frantz commented Sep 6, 2023

@adthrasher
re: this comment #110 (comment)

How do we like the voice I used for these meta.description and outputs? i.e.

  • meta.description starts with a verb. All tasks are doing something, and should be phrased in the active tense (grammar's not my strong suit, I think these are examples of active tense).
  • outputs are a little trickier. For this, I'd say "First 'sentence' should be a sentence fragment (just the subject)."
    • strandedness_file: "TSV file containing the ngsderive strandedness report"
      • This is an incomplete sentence. I think it's all we need for most outputs. Can be optionally expanded with a full sentence(s) after the initial fragment.

IMO this 2 rules (could be formalized with better language) would sufficiently differentiate these strings that appear next to each other.

Adopting these two styles does beg a question: "do we need to formalize the grammar used for parameter_meta?"
I think the parameter meta is fine as is. But thought I'd rise the question.

* main:
  chore(fqlint): bump mem... again
  chore(fqlint): teensy tiny mem bump

# Conflicts:
#	tools/fq.wdl
@adthrasher
Copy link
Member

@adthrasher re: this comment #110 (comment)

How do we like the voice I used for these meta.description and outputs? i.e.

  • meta.description starts with a verb. All tasks are doing something, and should be phrased in the active tense (grammar's not my strong suit, I think these are examples of active tense).

  • outputs are a little trickier. For this, I'd say "First 'sentence' should be a sentence fragment (just the subject)."

    • strandedness_file: "TSV file containing the ngsderive strandedness report"

      • This is an incomplete sentence. I think it's all we need for most outputs. Can be optionally expanded with a full sentence(s) after the initial fragment.

IMO this 2 rules (could be formalized with better language) would sufficiently differentiate these strings that appear next to each other.

Adopting these two styles does beg a question: "do we need to formalize the grammar used for parameter_meta?" I think the parameter meta is fine as is. But thought I'd rise the question.

I do think using the active voice for the descriptions is a good approach. I'm less convinced about the outputs. I think it is fine to write a fragment for those, but I don't think that should be a requirement.

At this point, I would avoid formalizing this, but I think it's a good approach and we can continue to refine it.

@@ -62,7 +62,7 @@ task fqlint {
Float read2_size = size(read_two_fastq, "GiB")

Int memory_gb = (
ceil((read1_size + read2_size) * 0.2) + 4 + modify_memory_gb
ceil((read1_size + read2_size) * 0.25) + 4 + modify_memory_gb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a high memory requirement for this task. I assume you've encountered a failure. Looking at the list of validators, the only one that should require much memory is the duplicate name check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is definitely overkill for the majority of cases. But I've run into a very small number of failures at the 0.2 scale (which is also more than what most cases need). So following the ethos for our default resource requirement settings (set defaults so we don't have failures in production/never need to think about it), I've been forced to set it this high.

I don't think we have much of a choice in the matter, unless we revise our resource policies. This is how much memory we need to assign so we guarantee no failures in production.

We could scale this back to a flat ~10gb which would be suitable for roughly 90% of cases? That's a gut estimate, no data to back it up. But then we'd need to override and set it higher to ~50gb for those corner cases. Do we want to deal with that? We have to keep in mind that there's no way to predict which samples are the problem samples ahead of time. Size is a very loose corollary.

Comment on lines 67 to 69
awk 'NR > 1' ~{outfile_name} | cut -d$'\t' -f6 > strandedness.txt
else
awk 'NR > 1' ~{outfile_name} | cut -d$'\t' -f5 > strandedness.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to specify -d$'\t? The tab delimiter should be the default for cut.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We can cut that extra argument

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Also just now realizing this won't work for the first case no matter what. We're going to need to add a grep 'overall' after v4 of ngsderive is released (in progress here). So this will just be broken till that release. There's no sensical value to put in strandedness.txt with the current latest release of ngsderive when --split-by-rg is true. I think that's fine for now. We aren't using the problem configuration in prod.

I will make a note of the broken state though.

maxRetries: max_retries
}
}

task junction_annotation {
meta {
description: "Annotates junctions found in an RNA-Seq BAM as known, novel, or partially novel"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is worth defining known, novel, and partially novel here or in a help text?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add to that, I think the definitions are missing in the ngsderive docs. It might be better there. RSeQC states:

Annotated (known): The junction is part of the gene model. Both splice sites, 5’ splice site (5’SS) and 3’splice site (3’SS) are annotated by reference gene model.

Complete_novel: Both 5’SS and 3’SS are novel.

Partial_novel: One of the splice site (5’SS or 3’SS) is novel, and the other splice site is annotated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add to the ngsderive docs, and link to those docs here using the external_help key 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitions have been added to the ngsderive v4 PR. specific commit with those definitions

@a-frantz a-frantz merged commit 2c7aa90 into main Sep 8, 2023
@a-frantz a-frantz deleted the endedness branch September 8, 2023 16:09
a-frantz added a commit that referenced this pull request Sep 8, 2023
* main:
  feat: endedness and ngsderive update (#108)
  chore(fqlint): bump mem... again
  chore(fqlint): teensy tiny mem bump
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants