Inconsistent parsing of FASTA ids in `get_esm*_embedding.py` and `data_process.py`.

The functions `get_esm_embedding.py>process_fasta`, `get_esm_if_embedding.py>embedding`, `data_process.py>prep_test_dataset` and `utils.py>process_fasta_file` each parse FASTA ids in their own way:

```py
# in get_esm_embedding.py>process_fasta
ID_list.append(rec.id.split("|")[1])

# in get_esm_if_embedding.py>embedding
ids = [rec.id.split("|")[1] for rec in recs]
seqs = {rec.id.split("|")[1]: str(rec.seq) for rec in recs} 

# in data_process.py>prep_test_dataset
ID_list = [rec.id for rec in recs]

# in utils.py>process_fasta_file
for i in range(0, len(lines), 3):  # hard-coded FASTA layout (3 lines per record), not robust
    id = lines[i].strip().replace(">", "")
```

I use `get_esm*_embedding.py` to generate embeddings (e.g. `see.npy`) from a FASTA file like:

```txt
>|see   # a header without `|`, such as `>sea`, makes get_esm*_embedding.py raise IndexError: list index out of range
some sequence
>|sea
some sequence
```

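When I then run `inference.py`, the id is parsed as `|sea` (the whole `rec.id`, pipe included) and the script fails, presumably because that id no longer matches the name the embedding was saved under. I had to adjust `data_process.py` to make it work.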
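To make the mismatch concrete, here is a minimal sketch that uses plain strings in place of Biopython's `rec.id`. The variable names are made up for illustration and do not come from the repository.

```python
# Plain strings stand in for `rec.id`; all names here are illustrative only.
header_with_pipe = "|sea"
header_without_pipe = "sea"

# get_esm*_embedding.py style: keep the field after the first "|".
embedding_id = header_with_pipe.split("|")[1]

# data_process.py style: keep the id unchanged.
dataset_id = header_with_pipe

# The two scripts end up with different ids for the same record.
ids_match = embedding_id == dataset_id

# A header without "|" has no second field, so indexing [1] fails.
try:
    header_without_pipe.split("|")[1]
    raised_index_error = False
except IndexError:
    raised_index_error = True

print(embedding_id, dataset_id, ids_match, raised_index_error)
```

So the same record ends up with two different ids depending on which script reads it, and a header without a pipe cannot be read by the embedding scripts at all.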

I suggest:

- make the scripts above use the same strategy to parse ids
- refactor the code so that every script calls a single shared function, which keeps the behaviour consistent
- add an optional `key` argument that accepts a callable, so users can decide how `rec.id` is parsed, like Python's `list.sort(key=None)`
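A rough sketch of what such a shared helper could look like is below. `parse_ids` and `second_pipe_field` are hypothetical names, not existing functions in the repository, and the default behaviour (return the id unchanged) is just one possible choice.

```python
from typing import Callable, Iterable, List, Optional


def parse_ids(
    raw_ids: Iterable[str],
    key: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Hypothetical shared helper: map raw FASTA ids to the ids used downstream.

    With key=None the ids are returned unchanged; otherwise `key` is applied
    to every id, in the spirit of `list.sort(key=...)`.
    """
    if key is None:
        return list(raw_ids)
    return [key(raw_id) for raw_id in raw_ids]


def second_pipe_field(raw_id: str) -> str:
    """Hypothetical key: the field after the first '|', with a clear error."""
    parts = raw_id.split("|")
    if len(parts) < 2:
        raise ValueError(f"expected a '|' in FASTA id {raw_id!r}")
    return parts[1]


raw = ["|see", "|sea"]
unchanged = parse_ids(raw)
pipe_parsed = parse_ids(raw, key=second_pipe_field)
print(unchanged, pipe_parsed)
```

If every script passed its `rec.id` values through one function like this, the embedding step and the inference step would agree by construction, and a malformed header would fail with a readable message instead of a bare `IndexError`.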
