Joint germline: GATK GenomicsDBImport chokes on millions of files with WES intervals file #1776

Closed
tdanhorn opened this issue Jan 22, 2025 · 0 comments · Fixed by #1777

Description of the bug

I tried running sarek (3.3.2) on 47 germline WES samples with --joint_germline, starting from --step variant_calling, to get a joint gVCF. I passed the Agilent BED file of target regions via --intervals. The GATK GenomicsDBImport process runs for several days (before getting killed) and generates millions of files occupying terabytes of disk space (see https://nfcore.slack.com/archives/CGFUX04HZ/p1736549603967039). It does, however, print a helpful suggestion:

05:58:53.313 WARN  GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.

And indeed, adding --merge-input-intervals to the process's ext.args via a config file solves the issue.
This should be done automatically whenever the pipeline is run with --wes (which indicates a large number of intervals).
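
The workaround described above can be sketched as a small custom config. This is a minimal sketch, not the pipeline's own configuration: it assumes the process can be selected by the name `GATK4_GENOMICSDBIMPORT` (as it appears in the work dir), and the file name `genomicsdb.config` is hypothetical. Note that assigning `ext.args` replaces whatever arguments the pipeline already sets for this process, so any existing ones may need to be carried over.

```groovy
// genomicsdb.config (hypothetical file name)
process {
    // Assumption: this selector matches the GenomicsDBImport process in sarek
    withName: 'GATK4_GENOMICSDBIMPORT' {
        // Flag taken from the GATK warning quoted above
        ext.args = '--merge-input-intervals'
    }
}
```

The file would then be passed to the run with Nextflow's `-c` option, for example by appending `-c genomicsdb.config` to the command below.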

Command used and terminal output

nextflow run "$pipelinedir" -profile curc_alpine -ansi-log false \
        --step variant_calling --wes --genome GATK.GRCh38 --input "$samplefile" \
        --intervals "$scrproj/targets.bed" --outdir "$pipeoutdir" \
        --tools haplotypecaller,vep --joint_germline


(Runs for days and generates millions of files in the work dir of `GATK4_GENOMICSDBIMPORT`.)

Relevant files

No response

System information

Nextflow 23.04.1
HPC cluster with Red Hat 8.10, SLURM & Apptainer (run as Singularity)
nf-core/sarek 3.3.2

@tdanhorn tdanhorn added the bug Something isn't working label Jan 22, 2025
@tdanhorn tdanhorn self-assigned this Jan 22, 2025
maxulysse pushed a commit that referenced this issue Jan 27, 2025


## PR checklist

- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add
tests!
- [ ] If you've added a new tool - have you followed the pipeline
conventions in the [contribution
docs](https://github.com/nf-core/sarek/tree/master/.github/CONTRIBUTING.md)
- [ ] If necessary, also make a PR on the nf-core/sarek _branch_ on the
[nf-core/test-datasets](https://github.com/nf-core/test-datasets)
repository.
- [x] Make sure your code lints (`nf-core pipelines lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker
--outdir <OUTDIR>`).
- [ ] Check for unexpected warnings in debug mode (`nextflow run .
-profile debug,test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [x] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and
authors/contributors).

Running sarek with `--joint_germline` on WES samples with an intervals
file containing many thousands of targets causes GATK `GenomicsDBImport`
to create millions of files and run for several days without completing.
Adding the `--merge-input-intervals` option to that process fixes this.
This PR adds the option conditional on the `--wes` pipeline parameter.
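
One way such a conditional could be expressed in a Nextflow config is sketched below. This is an illustration only and not necessarily how this PR implements it: it assumes the `--wes` flag is exposed as `params.wes`, that the process is selectable as `GATK4_GENOMICSDBIMPORT`, and it uses the `--merge-input-intervals` flag named in GATK's own warning. Any other arguments the pipeline passes to this process are omitted here.

```groovy
process {
    // Assumption: selector name matches the GenomicsDBImport process
    withName: 'GATK4_GENOMICSDBIMPORT' {
        // The closure is evaluated per task, so the flag is only added
        // when the pipeline is run with --wes (assumed to set params.wes)
        ext.args = { params.wes ? '--merge-input-intervals' : '' }
    }
}
```

Using a closure rather than a plain string defers evaluation until the task runs, which is the usual way to make `ext.args` depend on pipeline parameters.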

Closes #1776

---------

Co-authored-by: Thomas <[email protected]>
Co-authored-by: Friederike Hanssen <[email protected]>