Inconsistent parsing of FASTA ids in get_esm*_embedding.py and data_process.py #1

@alchemistcai

The functions get_esm_embedding.py>process_fasta, get_esm_if_embedding.py>embedding, data_process.py>prep_test_dataset and utils.py>process_fasta_file each parse FASTA ids differently:

# in get_esm_embedding.py>process_fasta
ID_list.append(rec.id.split("|")[1])

# in get_esm_if_embedding.py>embedding
ids = [rec.id.split("|")[1] for rec in recs]
seqs = {rec.id.split("|")[1]: str(rec.seq) for rec in recs} 

# in data_process.py>prep_test_dataset
ID_list = [rec.id for rec in recs]

# in utils.py>process_fasta_file
for i in range(0, len(lines), 3): # hard-codes a 3-line record layout, not robust
    id = lines[i].strip().replace(">", "")
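
To make the mismatch concrete, here is a minimal sketch that applies the three strategies to the same header. The helper names (`id_split_pipe`, `id_raw`, `id_from_line`) are hypothetical and only mirror the expressions quoted above. The sketch assumes `rec.id` is the first whitespace-delimited token of the header line, which is how Biopython's FASTA parser defines it.

```python
def id_split_pipe(rec_id: str) -> str:
    # Mirrors rec.id.split("|")[1] from get_esm*_embedding.py
    return rec_id.split("|")[1]


def id_raw(rec_id: str) -> str:
    # Mirrors the unmodified rec.id used in data_process.py
    return rec_id


def id_from_line(line: str) -> str:
    # Mirrors lines[i].strip().replace(">", "") from utils.py
    return line.strip().replace(">", "")


header = ">|sea\n"
# Assumption: rec.id is the first whitespace-delimited token after ">"
rec_id = header[1:].split()[0]

results = {
    "split_pipe": id_split_pipe(rec_id),
    "raw": id_raw(rec_id),
    "from_line": id_from_line(header),
}
print(results)

# A header without "|" breaks the split-based strategy
try:
    id_split_pipe("sea")
except IndexError as exc:
    print("IndexError:", exc)
```

For the header `>|sea`, only the split-based strategy yields `sea`. The other two keep the leading `|`.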

I use get_esm*_embedding.py to generate embeddings (see.npy) from a FASTA file like:

>|see   # a header without "|", e.g. `>sea`, makes get_esm*_embedding.py raise IndexError: list index out of range
some sequence
>|sea
some sequence

When I run inference.py, the id is parsed as |sea (prep_test_dataset uses rec.id without splitting) and the script fails. I adjusted data_process.py to make it work.

I suggest:

  • the scripts above should all use the same strategy to parse ids
  • refactor the code so that every script calls a single shared function, which keeps parsing consistent
  • define an optional argument key that accepts a Callable, so users can decide how to parse rec.id, like Python's list.sort(key=None).
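
The suggestions above could be sketched roughly as follows. This is only an illustration, not the repository's code: `parse_fasta_ids` and `default_key` are hypothetical names, and the sketch reads header lines directly instead of going through Biopython so that it stays self-contained. The default key is an assumption too: it takes the field after the first `|` when there is one and otherwise falls back to the whole id, so `>sea` no longer raises IndexError.

```python
from typing import Callable, Iterable, List, Optional


def default_key(rec_id: str) -> str:
    """Return the field after the first '|' if it exists, else the id unchanged."""
    parts = rec_id.split("|")
    if len(parts) > 1 and parts[1]:
        return parts[1]
    return rec_id


def parse_fasta_ids(
    lines: Iterable[str],
    key: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Collect one id per FASTA header line, transformed by `key`.

    `key` plays the same role as in list.sort(key=None): callers pass a
    function to control how the raw id is turned into the final id.
    """
    key_func = default_key if key is None else key
    ids = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        # The raw id is the first whitespace-delimited token of the header
        raw_id = tokens[0] if tokens else ""
        ids.append(key_func(raw_id))
    return ids


fasta_lines = [">|see", "ACDEFG", ">sea", "HIKLMN"]

# Default behaviour: tolerant of headers with or without "|"
print(parse_fasta_ids(fasta_lines))

# Custom key: keep the raw id exactly as written in the header
print(parse_fasta_ids(fasta_lines, key=lambda rec_id: rec_id))
```

If every script obtained its ids through one function like this, the embedding file names and the ids used at inference time could not drift apart, and users with a different header layout would only need to pass their own key.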
