length RNA #10

Open
mtinti opened this issue Apr 11, 2022 · 2 comments
Comments

@mtinti

mtinti commented Apr 11, 2022

Hi,
I'm getting an error when I try to predict structures for RNA sequences longer than 600 bases. Here is the error for an input sequence of 700 bases:

Welcome using UFold prediction tool!!!
Traceback (most recent call last):
  File "/cluster/majf_lab/mtinti/UFold/ufold_predict.py", line 328, in <module>
    main()
  File "/cluster/majf_lab/mtinti/UFold/ufold_predict.py", line 302, in main
    test_data = RNASSDataGenerator_input('data/', 'input')
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 217, in __init__
    self.load_data()
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 229, in load_data
    self.data_x = np.array([self.one_hot_600(item) for item in self.seq])
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 229, in <listcomp>
    self.data_x = np.array([self.one_hot_600(item) for item in self.seq])
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 244, in one_hot_600
    one_hot_matrix_600[:len(seq_item),] = feat
ValueError: could not broadcast input array from shape (700,4) into shape (600,4)

Is this expected? I thought I could go up to 1600bp...

Cheers
Michele

@sperfu
Contributor

sperfu commented Apr 11, 2022

Hi Michele,

Thanks for reaching out. UFold can handle sequences up to 1600 bp. However, as sequences get longer, memory usage and computation time grow substantially during both training and testing, and very long inputs can cause severe out-of-memory errors, especially on our backend server. To keep the backend from crashing, we have deliberately limited the sequence length to 600 bp, which gives the best efficiency and accuracy. Thank you for your understanding.

Nevertheless, we have also added a commented-out line to the data_generator.py file (line 244), shown here:

# one_hot_matrix_600 = np.zeros((len(seq_item),4))

You can use this line in place of line 243 so that the feature matrix covers the whole sequence length. As I mentioned earlier, though, this may result in a high computational cost. We therefore still recommend keeping sequences within 900~1000 nt (ideally within 600 bp); you can cut a longer sequence into several shorter ones for prediction.
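As a rough illustration of the two options above, here is a minimal standalone sketch. It is not the actual code in data_generator.py: the base-to-column mapping and the helper names `one_hot_full_length` and `split_sequence` are assumptions made up for this example, and UFold's real encoding may differ.

```python
import numpy as np

# Hypothetical base-to-column mapping, for illustration only;
# the encoding used in UFold's data_generator.py may differ.
BASE_TO_ONE_HOT = {
    "A": [1, 0, 0, 0],
    "U": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}


def one_hot_full_length(seq):
    """One-hot encode seq into a (len(seq), 4) matrix with no fixed 600-row cap."""
    # Sizing the matrix by len(seq) mirrors the commented-out np.zeros line above.
    matrix = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        # Bases outside A/U/C/G are left as all-zero rows.
        matrix[i] = BASE_TO_ONE_HOT.get(base, [0, 0, 0, 0])
    return matrix


def split_sequence(seq, max_len=600):
    """Cut seq into consecutive, non-overlapping pieces of at most max_len bases."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


seq = "ACGU" * 175  # 700 bases, the length from the report above
print(one_hot_full_length(seq).shape)         # (700, 4)
print([len(p) for p in split_sequence(seq)])  # [600, 100]
```

Note that pieces predicted independently cannot contain base pairs that span two pieces, so where to cut is a trade-off between memory use and the long-range pairs you expect.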

Thanks

@mtinti
Author

mtinti commented Apr 11, 2022

Thanks for the speedy response!
I'll try your suggestions.

cheers
Michele
