In `get_esm_embedding.py>process_fasta`, `get_esm_if_embedding.py>embedding`, `data_process.py>prep_test_dataset` and `utils.py>process_fasta_file`, FASTA ids are parsed in different ways:
# in get_esm_embedding.py>process_fasta
ID_list.append(rec.id.split("|")[1])
# in get_esm_if_embedding.py>embedding
ids = [rec.id.split("|")[1] for rec in recs]
seqs = {rec.id.split("|")[1]: str(rec.seq) for rec in recs}
# in data_process.py>prep_test_dataset
ID_list = [rec.id for rec in recs]
# in utils.py>process_fasta_file
for i in range(0, len(lines), 3):  # hard-coded FASTA layout, not robust
    id = lines[i].strip().replace(">", "")
I used `get_esm*_embedding.py` to generate embeddings (e.g. `see.npy`) from a FASTA file like:
>|see # a header like `>sea` (no `|`) makes get_esm*_embedding.py raise IndexError: list index out of range
some sequence
>|sea
some sequence
When I run `inference.py`, the id is parsed as `|sea` (`prep_test_dataset` uses `rec.id` without splitting it) and the script fails. I adjusted `data_process.py` to make it work.
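To make the mismatch concrete, here is a minimal sketch that uses plain strings in place of `rec.id`. The two function names are made up for illustration and only mirror the expressions quoted above; they are not functions from the repository.

```python
def id_from_embedding_scripts(rec_id: str) -> str:
    # Mirrors rec.id.split("|")[1] as used in get_esm*_embedding.py.
    return rec_id.split("|")[1]


def id_from_prep_test_dataset(rec_id: str) -> str:
    # Mirrors prep_test_dataset, which keeps rec.id unchanged.
    return rec_id


# A header ">|sea" gives the raw id "|sea".
print(id_from_embedding_scripts("|sea"))   # sea
print(id_from_prep_test_dataset("|sea"))   # |sea

# A header ">sea" has no "|", so index 1 does not exist.
try:
    id_from_embedding_scripts("sea")
except IndexError as exc:
    print("IndexError:", exc)
```

So the same header yields `sea` in one script and `|sea` in another, and a header without `|` cannot be processed by the embedding scripts at all.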
I suggest:
- the scripts above use the same strategy to parse ids
- refactor the code so that all of them call a single function, to keep the parsing consistent
- define an optional argument `key` that takes a Callable, so users can decide how to parse `rec.id`, like the `key=None` parameter of Python's `list.sort`.
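A rough sketch of what I have in mind, written in plain Python so it runs without the repository. `parse_fasta` and `default_key` are hypothetical names, and the default behaviour (take the field after `|` when there is one, otherwise keep the id) is only one possible choice. As far as I know, taking the first whitespace-delimited token as the raw id matches what Biopython exposes as `rec.id`.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def default_key(raw_id: str) -> str:
    """Return the field after the first "|" if there is one, else the id unchanged."""
    parts = raw_id.split("|")
    return parts[1] if len(parts) > 1 else raw_id


def parse_fasta(lines: Iterable[str],
                key: Optional[Callable[[str], str]] = None) -> List[Tuple[str, str]]:
    """Parse FASTA lines into (id, sequence) pairs, applying `key` to each raw id."""
    key = default_key if key is None else key
    records: List[Tuple[str, str]] = []
    raw_id: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if raw_id is not None:
                records.append((key(raw_id), "".join(chunks)))
            # Raw id: first whitespace-delimited token of the header.
            fields = line[1:].split()
            raw_id = fields[0] if fields else ""
            chunks = []
        elif raw_id is not None:
            # Sequences may span several lines.
            chunks.append(line)
    if raw_id is not None:
        records.append((key(raw_id), "".join(chunks)))
    return records


fasta = """>|see
MKT
AYI
>sea
GAVL
""".splitlines()

print(parse_fasta(fasta))                   # [('see', 'MKTAYI'), ('sea', 'GAVL')]
print(parse_fasta(fasta, key=lambda s: s))  # [('|see', 'MKTAYI'), ('sea', 'GAVL')]
```

If every script called one function like this, the embedding file names and the ids used at inference time would always agree, and users with other header conventions could pass their own `key`.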