Sequences < 1,000bp #24

Open
imk1 opened this issue Jan 14, 2022 · 9 comments
Comments

@imk1

imk1 commented Jan 14, 2022

How does Beluga handle sequences < 1,000bp? Does it center on the input sequence and pad it with N's, or does it do something else? Thanks!

@jzthree
Collaborator

jzthree commented Jan 19, 2022

Beluga requires a 2kb sequence as input. Padding with Ns is not guaranteed to give meaningful results. If your sequence has flanking sequence in its genomic context, you can add that to both sides.
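For readers who want to follow the flanking-sequence suggestion, here is a minimal sketch of how one might compute a 2kb window centered on a shorter region. The function name `centered_window`, the 0-based half-open coordinate convention, and the boundary handling (shifting the window so it stays on the chromosome) are assumptions made for illustration; they are not taken from the Beluga code.

```python
def centered_window(start, end, window=2000, chrom_len=None):
    """Return (win_start, win_end) for a `window`-bp span centered on [start, end).

    Coordinates are assumed to be 0-based and half-open (BED-style).
    `chrom_len` is optional; when given, the window is shifted to stay
    within the chromosome. This helper is hypothetical, not Beluga's API.
    """
    length = end - start
    if length > window:
        raise ValueError("region is longer than the requested window")
    if chrom_len is not None and chrom_len < window:
        raise ValueError("chromosome is shorter than the requested window")

    extra = window - length
    # Put half of the extra bases on the left; any odd base goes to the right.
    win_start = start - extra // 2
    win_end = win_start + window

    # Shift (rather than truncate) the window if it runs off either end.
    if win_start < 0:
        win_start, win_end = 0, window
    if chrom_len is not None and win_end > chrom_len:
        win_start, win_end = chrom_len - window, chrom_len
    return win_start, win_end


# A 1kb region gains 500 bp of genomic context on each side.
print(centered_window(10_000, 11_000))  # (9500, 11500)
```

The resulting coordinates can then be used to fetch the sequence from a reference genome with whatever FASTA reader you already use, so the model sees real flanking bases instead of Ns.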

@imk1
Author

imk1 commented Jan 19, 2022

I ran Beluga (using this site: https://humanbase.flatironinstitute.org/deepsea/) using sequences < 2kb, and Beluga ran to completion. Do you know how Beluga modified the sequences to convert them into 2kb sequences? Thanks!

@jzthree
Collaborator

jzthree commented Jan 31, 2022

Thanks for letting us know. It should actually only allow sequences of at least 2kb. We are looking into this and will update here once it's fixed.

@imk1
Author

imk1 commented Jan 31, 2022

Thanks in advance for keeping me posted!

@imk1
Author

imk1 commented Mar 2, 2022

I was wondering if you have an update on this. Thanks!

@jzthree
Collaborator

jzthree commented Mar 2, 2022

Sorry for the late update. Currently, if the input is shorter than 2kb, it will be padded with "N"s. I don't recommend using FASTA input shorter than 2kb unless it is very close to 2kb, say only a few bp off. I would recommend adding flanking sequence to your sequence of interest. We should update the input length instructions on the website (Beluga uses 2000bp, Sei uses 4096bp, and SeqWeaver uses 1000bp).
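As a quick pre-submission check, something like the sketch below could report how many bases of padding an input would need for each model. The lengths come from the comment above; the dictionary name `MODEL_INPUT_LENGTH` and the helper `padding_needed` are hypothetical and are not part of the server or any released package.

```python
# Input lengths as stated in this thread.
MODEL_INPUT_LENGTH = {"Beluga": 2000, "Sei": 4096, "SeqWeaver": 1000}


def padding_needed(seq, model):
    """Return how many bases short `seq` is of the model's input length.

    Returns 0 when the sequence is already long enough. Hypothetical
    helper for illustration only.
    """
    required = MODEL_INPUT_LENGTH[model]
    return max(0, required - len(seq))


seq = "ACGT" * 250  # a 1,000 bp example sequence
for model in MODEL_INPUT_LENGTH:
    print(model, padding_needed(seq, model))
```

A large shortfall is a signal to extend the input with flanking sequence rather than rely on N padding.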

@imk1
Author

imk1 commented Mar 2, 2022

Thanks! If I were to input, say, a 1kb sequence into Beluga, would it get padded with 500 Ns on either side, or would the input be the sequence I inputted followed by 1,000 Ns? Thanks!

@jzthree
Collaborator

jzthree commented Mar 2, 2022

It will be padded with 500 Ns on either side. How the Ns affect the model prediction is largely untested, so this is not recommended (in training, Ns appear only in assembly gaps and are very rare).
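To make the padding behavior concrete, here is a minimal sketch of symmetric N padding as described above. It is not the server's actual code: the function name `pad_with_n` is hypothetical, and two details are assumptions, namely that an odd shortfall puts the extra N on the right and that sequences already at or above the target length are returned unchanged.

```python
def pad_with_n(seq, target_len=2000):
    """Center `seq` in a string of length `target_len` by adding Ns to both sides.

    Assumptions (not confirmed by the thread): an odd shortfall places the
    extra N on the right, and longer sequences are returned unchanged.
    """
    shortfall = target_len - len(seq)
    if shortfall <= 0:
        return seq
    left = shortfall // 2
    right = shortfall - left
    return "N" * left + seq + "N" * right


padded = pad_with_n("A" * 1000)
print(len(padded), padded.count("N"))  # 2000 1000
```

In this example half of the model input is N, which is far from the training distribution described above, so predictions on such inputs should be treated with caution.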

@imk1
Author

imk1 commented Mar 2, 2022

That makes sense. Thank you!
