Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parsnpAligner.log sequence lengths discrepancies #173

Open
AbigailShockey opened this issue Jan 17, 2025 · 3 comments
Open

parsnpAligner.log sequence lengths discrepancies #173

AbigailShockey opened this issue Jan 17, 2025 · 3 comments

Comments

@AbigailShockey
Copy link

Hello Parsnp devs,

I noticed that the sequence lengths reported in the parsnpAligner.log file do not match the actual sequence lengths of the fasta files provided to Parsnp. The values are larger than expected. How are these values calculated, and do these inflated sequence lengths affect the cluster coverages reported in the parsnpAligner.log file?

I've attached an example of a log file demonstrating this phenomenon. For this run, I used the -r ! option to randomly choose a reference. Sample1, which has a sequence length of 3987309 bps, was chosen as a reference. The sequence length was correctly reported for Sequence 1 (Sample1.fa.ref, line 3) but incorrectly reported as 4035049 bps for Sequence 6 (Sample1.fa, line 28). The other sequence lengths are also incorrectly reported.

I appreciate any assistance on this,

Best,
Abigail

parsnpAligner.log

@evagunawan
Copy link

evagunawan commented Feb 27, 2025

@skoren @treangen I'm not quite sure who to tag, so I apologize if this is incorrect. We are also encountering this issue in our pipeline. I was curious if you had any suggestions. Thank you!

@bkille
Copy link
Contributor

bkille commented Feb 27, 2025

Hi @evagunawan and @AbigailShockey,

You are correct that (1) those lengths may be incorrect and that (2) it may affect the cluster coverage calculation. TLDR, the more contigs you have in a query sequence, the more impact this issue will have. This should be a straightforward fix, though, and we should be able to address it in the next release. Thanks for pointing it out to us!

Internally, Parsnp handles multi-contig query sequences by concatenating them, using blocks of Ns to separate each contig. If you have many short contigs at the end, this will lead to a relatively larger artificial increase in the reported "genome size" and therefore an artificial decrease in the "cluster coverage." The reference sequence doesn't need this treatment as the query sequences are matched against it, so that is why you see the correct length for the reference.

@evagunawan
Copy link

Ah, thank you so much for responding, that is super helpful to know! Do we have a rough estimate of when the next release will be? I just want to make sure that I get our pipeline updated with the newest version as soon as possible. Thank you so much for looking into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants