Tree Nine

Place diff files on an existing phylogenetic tree using UShER's usher sampled task, with a bit of help from SRANWRP, then convert that tree to Taxonium, Newick, and Nextstrain formats. SNP distances between samples are calculated and output as a distance matrix, and samples are placed into clusters based on those distances.
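To make the distance-matrix and clustering idea concrete, here is a minimal sketch in Python. It is not the code from this repo: the sample representation (a dict of position to alternate allele), the function names, and the single-linkage grouping rule are all assumptions made for illustration, and the pipeline's own distance calculation may differ (for instance, it may measure distance along the tree rather than comparing samples directly).

```python
from itertools import combinations

def snp_distance(a, b):
    """Count positions where two samples' calls differ.

    Assumption: each sample is a dict of position -> alternate allele,
    and a missing position means the sample matches the reference.
    """
    positions = set(a) | set(b)
    return sum(1 for p in positions if a.get(p) != b.get(p))

def distance_matrix(samples):
    """Build a symmetric matrix (nested dicts) of pairwise SNP distances."""
    names = sorted(samples)
    matrix = {n: {m: 0 for m in names} for n in names}
    for x, y in combinations(names, 2):
        d = snp_distance(samples[x], samples[y])
        matrix[x][y] = matrix[y][x] = d
    return names, matrix

def find_clusters(names, matrix, max_distance):
    """Group samples linked by a chain of pairs within max_distance.

    Uses a small union-find; singletons are not reported as clusters.
    """
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for x, y in combinations(names, 2):
        if matrix[x][y] <= max_distance:
            parent[find(x)] = find(y)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

# Hypothetical toy data, not taken from any real diff file.
samples = {
    "A": {100: "T", 200: "G"},
    "B": {100: "T", 200: "G", 300: "C"},
    "C": {100: "T", 900: "A", 950: "C", 1000: "G"},
}
names, matrix = distance_matrix(samples)
clusters = find_clusters(names, matrix, max_distance=2)
print(matrix["A"]["B"], matrix["A"]["C"], matrix["B"]["C"])
print(clusters)
```

With this toy data, A and B differ at one position and fall into the same cluster at a threshold of 2, while C is too distant from both to join them.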

Verified on Terra-Cromwell and miniwdl. Make sure to add --copy-input-files for miniwdl. Default inputs assume you're working with Mycobacterium tuberculosis; be sure to change them if you aren't working with that bacterium.

This repo also contains the following subworkflows:

Annotate
Convert to Nextstrain (for viewing in Auspice, non-clade sample annotations, etc)
Extract
Mask tree
Mask subtree
Summarize

features

Highly scalable, even on lower-end hardware
Can input a single pre-combined diff file
Includes a sample input tree created from SRA data if no input tree is specified
Trees automatically converted to UShER (.pb), Taxonium (.jsonl.gz), Newick (.nwk), and Nextstrain (.json) formats
Automatic clustering based on configurable genetic distance
- Nextstrain tree(s) will be annotated by cluster
- Clustering can be limited to only samples specified by the user, all newly added samples, or all samples
- Clustering is also performed after backmasking
- (optional) Create per-cluster Nextstrain subtrees
(optional) Reroot the tree to a specified node
(optional) Backmask newly-added samples against each other to hide positions where any newly-added sample lacks data, then create a new set of trees based on the backmasked diff files
- Designed for highly clonal samples which have a plausible direct epidemiological relationship
- Backmasking can only be performed on samples which have sample-level diff files
(optional) Summarize input, reroot, and output trees with matutils
(optional) Filter out positions by coverage at that position and/or entire samples by overall coverage
(optional) Specify your own reference genome if you don't want to work with H37Rv
(optional) Annotate clades via matutils with a specified annotation TSV
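The backmasking option in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's implementation: the data layout (a `calls` dict and a `masked` set per sample) and the function name `backmask` are hypothetical, and the sketch uses a single set union rather than any pairwise comparison the real task may perform.

```python
def backmask(samples):
    """Hide, in every sample, any position that lacks data in at least one sample.

    Assumption: samples maps a sample name to a dict with
      "calls":  {position: allele} for called variants, and
      "masked": set of positions where that sample has no usable data.
    """
    # Collect every position that is missing in any of the new samples.
    shared_mask = set()
    for sample in samples.values():
        shared_mask |= sample["masked"]

    # Drop calls at those positions and give every sample the same mask.
    result = {}
    for name, sample in samples.items():
        result[name] = {
            "calls": {
                pos: allele
                for pos, allele in sample["calls"].items()
                if pos not in shared_mask
            },
            "masked": set(shared_mask),
        }
    return result

# Hypothetical toy input: S1 lacks data at 30, S2 lacks data at 20.
new_samples = {
    "S1": {"calls": {10: "A", 20: "C"}, "masked": {30}},
    "S2": {"calls": {30: "G", 40: "T"}, "masked": {20}},
}
backmasked = backmask(new_samples)
print(backmasked["S1"]["calls"], backmasked["S2"]["calls"])
```

After backmasking, S1 keeps only its call at position 10 and S2 only its call at position 40, so neither sample appears to differ from the other at a position where one of them simply had no data.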

benchmarking

Formal benchmarks have not been established, but a full run (placing 60 new TB samples on an existing tree of 7000+ TB samples, converting to Taxonium and Newick formats, building the distance matrix, finding clusters, and creating cluster-specific Nextstrain trees) executes in about five minutes on a 2019 MacBook Pro.

Backmasking is the least scalable part of the pipeline. The comparison itself theoretically scales as n², and once the comparison is complete, n backmasked diff files must be written to disk. We have observed that memory problems tend to arise during the file-writing step when n≥55 on a local machine. Runtime attributes are adjustable as task-level variables to aid with scaling on cloud backends, although we have seen the defaults handle 60 samples at a time without much issue.
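As a rough illustration of the quadratic growth mentioned above, the snippet below counts unordered sample pairs for a few batch sizes. It assumes each pair of samples is compared exactly once, which is an assumption about the comparison step rather than something this README states.

```python
from math import comb

def pairwise_comparisons(n):
    """Number of unordered pairs among n samples, i.e. n * (n - 1) / 2."""
    return comb(n, 2)

for n in (10, 55, 60):
    print(n, pairwise_comparisons(n))
# 10 45
# 55 1485
# 60 1770
```

Going from 10 to 60 samples multiplies the number of pairs by roughly 39, while the number of output files only grows sixfold.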

