parsnpAligner.log sequence lengths discrepancies #173

AbigailShockey · 2025-01-17T21:33:51Z

Hello Parsnp devs,

I noticed that the sequence lengths reported in the parsnpAligner.log file do not match the actual sequence lengths of the fasta files provided to Parsnp. The values are larger than expected. How are these values calculated, and do these inflated sequence lengths affect the cluster coverages reported in the parsnpAligner.log file?

I've attached an example of a log file demonstrating this phenomenon. For this run, I used the -r ! option to randomly choose a reference. Sample1, which has a sequence length of 3987309 bps, was chosen as a reference. The sequence length was correctly reported for Sequence 1 (Sample1.fa.ref, line 3) but incorrectly reported as 4035049 bps for Sequence 6 (Sample1.fa, line 28). The other sequence lengths are also incorrectly reported.

I appreciate any assistance on this,

Best,
Abigail

parsnpAligner.log

evagunawan · 2025-02-27T20:21:40Z

@skoren @treangen I'm not quite sure who to tag, so I apologize if this is incorrect. We are also encountering this issue in our pipeline. I was curious if you had any suggestions. Thank you!

bkille · 2025-02-27T21:02:37Z

Hi @evagunawan and @AbigailShockey,

You are correct that (1) those lengths may be incorrect and that (2) it may affect the cluster coverage calculation. TLDR, the more contigs you have in a query sequence, the more impact this issue will have. This should be a straightforward fix, though, and we should be able to address it in the next release. Thanks for pointing it out to us!

Internally, Parsnp handles multi-contig query sequences by concatenating them, using blocks of Ns to separate each contig. If you have many short contigs at the end, this will lead to a relatively larger artificial increase in the reported "genome size" and therefore an artificial decrease in the "cluster coverage." The reference sequence doesn't need this treatment as the query sequences are matched against it, so that is why you see the correct length for the reference.

evagunawan · 2025-02-27T21:12:25Z

Ah, thank you so much for responding, that is super helpful to know! Do we have a rough estimate of when the next release will be? I just want to make sure that I get our pipeline updated with the newest version as soon as possible. Thank you so much for looking into this!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

parsnpAligner.log sequence lengths discrepancies #173

parsnpAligner.log sequence lengths discrepancies #173

AbigailShockey commented Jan 17, 2025

evagunawan commented Feb 27, 2025 •

edited

Loading

bkille commented Feb 27, 2025

evagunawan commented Feb 27, 2025

parsnpAligner.log sequence lengths discrepancies #173

parsnpAligner.log sequence lengths discrepancies #173

Comments

AbigailShockey commented Jan 17, 2025

evagunawan commented Feb 27, 2025 • edited Loading

bkille commented Feb 27, 2025

evagunawan commented Feb 27, 2025

evagunawan commented Feb 27, 2025 •

edited

Loading