FASTA input filtering #16

@MJThiecke

Description

PARM does not accept all IUPAC nucleotide codes. For example, 'R' (purine) and 'Y' (pyrimidine) are not supported and cause PARM to crash.
It would be preferable for PARM to check that the FASTA input meets its requirements before running, rather than crashing unexpectedly partway through.

I would suggest that PARM convert all non-ATGCN letters to 'N' and emit a warning. Alternatively, PARM could discard the non-compliant FASTA entries, again with a warning. A third option would be to raise an error and refuse to run.
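The three options could be sketched roughly as follows. This is not PARM code: `sanitize_sequence` and its `mode` argument are hypothetical names, and uppercasing the input first is an assumption about how lowercase (soft-masked) bases should be handled.

```python
import re
import warnings

# Assumption: only A, T, G, C and N are accepted, as stated in this issue.
_NON_ATGCN = re.compile(r"[^ATGCN]")


def sanitize_sequence(name, seq, mode="mask"):
    """Return a cleaned sequence, or None if the record should be dropped.

    mode="mask":  replace unsupported letters with 'N' and warn
    mode="drop":  discard the record and warn
    mode="error": raise ValueError before any prediction starts
    """
    seq = seq.upper()  # assumption: treat lowercase bases as uppercase
    bad = sorted(set(_NON_ATGCN.findall(seq)))
    if not bad:
        return seq
    msg = f"{name}: unsupported letters {bad}"
    if mode == "error":
        raise ValueError(msg)
    if mode == "drop":
        warnings.warn(msg + "; record discarded")
        return None
    if mode == "mask":
        warnings.warn(msg + "; replaced with 'N'")
        return _NON_ATGCN.sub("N", seq)
    raise ValueError(f"unknown mode: {mode!r}")
```

Running such a check over every FASTA record before prediction starts would surface the problem immediately instead of hours into a run.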

Example error:

20%|█████▌                      | 1026933/5122413 [4:03:00<14:14:37, 79.87it/s]Traceback (most recent call last):
    File "/foo/bin/parm", line 9, in <module>
      sys.exit(main())
    File "/foo/lib/python3.10/site-packages/PARM/__main__.py", line 70, in main
      args.func(args)
    File "/foo/lib/python3.10/site-packages/PARM/__main__.py", line 132, in predict
      PARM_predict(
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 121, in PARM_predict
      get_prediction(tmp.sequence.to_list(), model)
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 322, in get_prediction
      np.float32(sequence_to_onehot(sequence, L_max=len(sequence[0])))
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 354, in sequence_to_onehot
      x = np.array([letter_to_vector[s] for s in seq])
    File "/foo/lib/python3.10/site-packages/PARM/PARM_predict.py", line 354, in <listcomp>
      x = np.array([letter_to_vector[s] for s in seq])
  KeyError: 'R'

   20%|█████▌                      | 1026936/5122413 [4:03:00<16:09:09, 70.43it/s]
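The traceback points at a plain dictionary lookup, `letter_to_vector[s]`, which raises `KeyError` for any letter that is not a key. As a minimal sketch of a more tolerant lookup, assuming a four-channel one-hot encoding in which 'N' is all zeros (the actual contents of `letter_to_vector` in `PARM_predict.py` are not shown here, so the mapping below is hypothetical):

```python
# Hypothetical mapping; the real letter_to_vector may differ.
LETTER_TO_VECTOR = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],
}


def encode(seq):
    """One-hot encode seq, treating any unknown letter like 'N'."""
    fallback = LETTER_TO_VECTOR["N"]
    # dict.get returns the fallback instead of raising KeyError
    return [LETTER_TO_VECTOR.get(s, fallback) for s in seq]
```

A silent fallback like this hides the substitution from the user, so it would probably be best combined with a warning.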
