FASTA input filtering

PARM does not accept the full set of IUPAC nucleotide codes. For example, the ambiguity codes 'R' (purine) and 'Y' (pyrimidine) are not supported and will cause PARM to crash.
It would be preferable for PARM to check that the FASTA input meets its requirements before it starts running, rather than crashing unexpectedly partway through a long job.
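
To illustrate what such an up-front check could look like, here is a minimal sketch. It is not PARM code: `read_fasta` and `find_invalid_records` are hypothetical helper names, the allowed alphabet (A, C, G, T, N) is taken from this issue, and treating lowercase letters as equivalent to uppercase is an assumption on my part.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Assumption: the alphabet PARM accepts, as described in this issue.
ALLOWED = frozenset("ACGTN")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def find_invalid_records(
    lines: Iterable[str], allowed: frozenset = ALLOWED
) -> Dict[str, Set[str]]:
    """Map each offending record header to its set of unsupported letters."""
    bad: Dict[str, Set[str]] = {}
    for header, seq in read_fasta(lines):
        # Assumption: case is ignored when checking the alphabet.
        unexpected = set(seq.upper()) - allowed
        if unexpected:
            bad[header] = unexpected
    return bad
```

Running something like this over the input before prediction starts would report every offending record within seconds, instead of hitting the first one hours into the run.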

I suggest that PARM convert every letter other than A, T, G, C, or N to 'N' and emit a warning. Alternatively, PARM could discard the non-compliant FASTA entries and emit a warning. A third option would be to raise an error and refuse to run at all.
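
The three options could be sketched roughly as follows. This is only an illustration and not PARM's API: `sanitize_sequence` and its `policy` argument are hypothetical names, and uppercasing the sequence before checking is an assumption.

```python
import re
import warnings
from typing import Optional

# Matches any character outside the alphabet named in this issue.
_NON_ACGTN = re.compile(r"[^ACGTN]")


def sanitize_sequence(name: str, seq: str, policy: str = "mask") -> Optional[str]:
    """Apply one of three policies to a sequence with unsupported letters.

    policy="mask":    replace unsupported letters with 'N' and warn
    policy="discard": return None so the caller can drop the entry, and warn
    policy="error":   raise ValueError before any prediction is run
    """
    seq = seq.upper()  # assumption: lowercase input is treated as uppercase
    offending = sorted(set(_NON_ACGTN.findall(seq)))
    if not offending:
        return seq
    message = f"{name}: unsupported letters {offending}"
    if policy == "error":
        raise ValueError(message)
    if policy == "discard":
        warnings.warn(message + "; entry discarded")
        return None
    if policy == "mask":
        warnings.warn(message + "; replaced with 'N'")
        return _NON_ACGTN.sub("N", seq)
    raise ValueError(f"unknown policy: {policy!r}")
```

With the default policy, a sequence such as `ACRYT` would come back as `ACNNT` together with a warning naming the record, so the user can still see which entries were altered.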

Example error:
```
20%|█████▌                      | 1026933/5122413 [4:03:00<14:14:37, 79.87it/s]Traceback (most recent call last):
    File "/foo/bin/parm", line 9, in <module>
      sys.exit(main())
    File "/foo/lib/python3.10/site-packages/PARM/__main__.py", line 70, in main
      args.func(args)
    File "/foo/lib/python3.10/site-packages/PARM/__main__.py", line 132, in predict
      PARM_predict(
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 121, in PARM_predict
      get_prediction(tmp.sequence.to_list(), model)
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 322, in get_prediction
      np.float32(sequence_to_onehot(sequence, L_max=len(sequence[0])))
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 354, in sequence_to_onehot
      x = np.array([letter_to_vector[s] for s in seq])
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 354, in <listcomp>
      x = np.array([letter_to_vector[s] for s in seq])
  KeyError: 'R'

   20%|█████▌                      | 1026936/5122413 [4:03:00<16:09:09, 70.43it/s]
```
