Feature request - account for frameshifts #12

Open
chrisjackson-pellicle opened this issue May 16, 2024 · 3 comments

@chrisjackson-pellicle

Hi Ian,

Thanks very much for this program - overall it's done a great job of our plastid genome (and it's so fast!).

I have a feature request, if possible. I noticed that for one of our genes, the plastid genome contig contains a frameshift that introduces a premature stop codon about halfway through, and the output GFF3 file only annotates this 5' half.

The truncation in the recorded length of this gene occurs when the setlongestORF! function is run for this feature. Would it be possible to extend Chloe to allow for these scenarios, perhaps by optionally allowing multiple non-overlapping ORFs within a given feature, with a corresponding note in field 9 of the GFF3?
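
To make the request concrete, here is a minimal Python sketch of one way several non-overlapping ORFs could be collected from a single feature. It is not Chloe's `setlongestORF!` logic (Chloe is written in Julia, and this does not reproduce its code). The function names, the `min_codons` cutoff, the greedy left-to-right strategy, and the wording of the `Note` attribute are all assumptions made for illustration.

```python
# Illustrative sketch only: this is NOT Chloe's setlongestORF! logic.
# All function names and the min_codons cutoff are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_stretch(seq, start):
    """Extend from start in steps of 3 until a stop codon or the end of seq."""
    end = start
    while end + 3 <= len(seq) and seq[end:end + 3] not in STOP_CODONS:
        end += 3
    return start, end  # 0-based, end-exclusive, stop codon not included

def chained_orfs(seq, min_codons=10):
    """Walk left to right, keeping the longest open stretch in any of the
    three forward frames, then resuming the search after its stop codon."""
    seq = seq.upper()
    orfs, cursor = [], 0
    while cursor + 3 <= len(seq):
        start, end = max(
            (open_stretch(seq, cursor + offset) for offset in range(3)),
            key=lambda s: s[1] - s[0],
        )
        if (end - start) // 3 >= min_codons:
            orfs.append((start, end))
        cursor = end + 3  # step past the stop codon, so ORFs never overlap
    return orfs

def gff3_note(index, total):
    """Column 9 text flagging one fragment of a split gene (hypothetical wording)."""
    return f"Note=ORF fragment {index} of {total}"
```

On a toy sequence with a single-base insertion partway through, this returns two stretches, one on either side of the premature stop, instead of only the 5' half. Each could then be written as its own GFF3 row carrying the `Note` attribute.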

Cheers,

Chris

@chrisjackson-pellicle
Author

Some additional information:

Read mapping suggests that our plDNA assembly is in fact correct. So, rather than a frameshift causing the issue with this gene (ccsA), it's likely a small 19 bp intron, as also seen in the ccsA annotation for the Nepenthes khasiana plDNA (our plDNA is also from a Nepenthes species).

I see that Chloe expects a single exon for ccsA based on the gold-standard reference plDNAs, and hence only a single ORF is searched for in the corresponding feature. Perhaps optionally allowing multiple non-overlapping ORFs within a given feature as suggested above would also allow a more complete annotation in cases where exon number expectations are not met?

Also, a general caveat and apology if my understanding of Chloe's process isn't correct - I'm still getting my head around some of the code!

Cheers,

Chris

@ian-small
Owner

ian-small commented May 17, 2024 via email

@chrisjackson-pellicle
Author

Hi Ian,

Thanks for the reply. Ah, I hadn't stopped to consider the biology of a 19 bp plDNA 'intron' - oops. And yes, I do not see any reads spliced across the 19 bp 'intron' when I map our RNAseq data. So, a likely pseudogene it is!

For the moment, I've forked Chloe and added a warning that is raised when any predicted gene is shorter than 80% of the combined non-intron median_length values (80% is the default and can be changed with --short_gene_warning_threshold).
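
A rough Python sketch of what such a check could look like follows. It is not the code from the fork. The function name, the dictionary layout, and the choice to express the threshold as a fraction (0.8) rather than a percentage are assumptions.

```python
# Hypothetical sketch of the short-gene check described above.
# Names and data layout are assumptions, not taken from Chloe or the fork.
def short_gene_warnings(predicted_lengths, median_lengths, threshold=0.8):
    """Return one warning string per gene that falls below the threshold.

    predicted_lengths: gene name -> summed length of its predicted
        non-intron features.
    median_lengths: gene name -> list of reference median_length values,
        one per non-intron feature.
    threshold: minimum acceptable fraction of the expected length.
    """
    warnings = []
    for gene, predicted in predicted_lengths.items():
        expected = sum(median_lengths[gene])
        if expected > 0 and predicted < threshold * expected:
            warnings.append(
                f"{gene}: predicted length {predicted} is "
                f"{predicted / expected:.0%} of expected {expected}"
            )
    return warnings
```

With invented numbers, `short_gene_warnings({"geneA": 400, "geneB": 900}, {"geneA": [500, 500], "geneB": [1000]})` flags only `geneA`, because 400 is below 80% of the combined 1000 while 900 is not.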

I hope the publication goes smoothly - it's a great tool!

Cheers,

Chris
