Review CEP170B #40
base: main
Overall, the work leaves a positive impression. The code is readable and mostly works.
However, some of the declared functionality is not implemented, and in the future it makes sense to use a linter.
from abc import ABC, abstractmethod
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
from typing import Tuple
As far as I know, since Python 3.9 using Tuple from typing is no longer recommended, although it does not affect how the code runs.
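To illustrate the remark about Tuple: starting with Python 3.9 the built-in tuple can be subscripted directly in annotations (PEP 585), so the import from typing is not needed. A minimal sketch; the function describe_bounds is hypothetical and only demonstrates the annotation style, it is not part of the reviewed code.

```python
# Requires Python 3.9+: the built-in tuple is used as a generic (PEP 585).
def describe_bounds(gc_bounds: tuple[int, int] = (0, 100)) -> str:
    """Hypothetical helper that only shows the annotation style."""
    lower, upper = gc_bounds
    return f"{lower}-{upper}"
```

The same replacement would apply to the gc_bounds and length_bounds annotations of filter_fastq.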
def filter_fastq(input_path: str,
                 output_filename: str = None,
                 gc_bounds: Tuple[int, int] = (0, 100),
                 length_bounds: Tuple[int, int] = (0, 2**32),
There is a minor PEP 8 violation: lines should not end with trailing whitespace.
                 quality_threshold: int = 0) -> None:
    '''
    Filter FASTQ-sequences based on entered requirements.

The \t should be removed.
    Arguments:
    - input_path (str): path to the file with FASTQ-sequences
    - output_filename (str): name of the output file with
      filtered FASTQ-sequences
    - gc_bounds (tuple or int, default = (0, 100)): GC-content
      interval (percentage) for filtering. Tuple if contains
      lower and upper bounds, int if only contains an upper bound.
    - length_bounds (tuple or int, default = (0, 2**32)): length
      interval for filtering. Tuple if contains lower and upper
      bounds, int if only contains an upper bound.
    - quality_threshold (int, default = 0): threshold value of average
      read quality for filtering.

    Note: the output file is saved to the /fastq_filtrator_results
    directory. The default output file name is the name of the input file.
Some lines have trailing whitespace.
Much more importantly, the function does not match its declared functionality. If an int is passed to gc_bounds or length_bounds, it fails with an error, because the conversion of that int into the corresponding tuple is not implemented. This could have been implemented, for example, like this:
if isinstance(gc_bounds, int):
    gc_bounds = tuple([0, gc_bounds])
if isinstance(length_bounds, int):
    length_bounds = tuple([0, length_bounds])
Perhaps in the future it makes sense to leave yourself a pass or a comment as a reminder, so that nothing is forgotten.
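The conversion described in this comment could also be factored into one small helper so that both arguments go through the same code path. This is only a sketch: normalize_bounds is a hypothetical name, not something present in the reviewed code.

```python
from typing import Union


def normalize_bounds(bounds: Union[int, tuple]) -> tuple:
    """Hypothetical helper: turn a single upper bound into a (0, upper) pair."""
    if isinstance(bounds, int):
        # A bare int is interpreted as the upper bound, as the docstring states.
        return (0, bounds)
    lower, upper = bounds  # fails loudly if bounds is not a pair
    return (lower, upper)
```

Inside filter_fastq this would be called once per argument, for example gc_bounds = normalize_bounds(gc_bounds).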
        records = [record for record in SeqIO.parse(handle, "fastq")]

    filtered_records = []
    for record in records:
Here there was an opportunity to use SeqIO directly:
- for record in records:
+ for i, record in enumerate(SeqIO.parse(input_path, "fastq")):
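The benefit of the suggested change is that records are consumed one at a time instead of being collected into a list first. The sketch below shows the same pattern without depending on Biopython: iter_fastq is a hypothetical stand-in for SeqIO.parse that assumes well-formed four-line records, and filter_by_length is an illustrative name only.

```python
import io
from typing import Iterator, TextIO


def iter_fastq(handle: TextIO) -> Iterator[tuple]:
    """Hypothetical stand-in for SeqIO.parse: yields (name, seq, qual) lazily."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of input (or a blank line) stops the iteration
        seq = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def filter_by_length(handle: TextIO, min_len: int, max_len: int) -> Iterator[tuple]:
    # Each record is checked as it is read; no intermediate list is built.
    for name, seq, qual in iter_fastq(handle):
        if min_len <= len(seq) <= max_len:
            yield name, seq, qual


sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n")
kept = list(filter_by_length(sample, 3, 10))
```

With Biopython the loop header would simply iterate over SeqIO.parse(input_path, "fastq"), as in the suggestion.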
    print(f"Filtered sequences saved to {output_path}")


class BiologicalSequence(ABC):
As far as I know, it is more customary to place classes before functions.
That's right.
class BiologicalSequence(ABC):
    def __init__(self, seq: str = None):
Although it is not forbidden, it was mentioned in the lecture/consultation that it is better not to write code in an abstract class.
        return self.seq

    def check_alphabet(self):
        return set(self.seq.upper()).issubset(self.ALPHABET)
I find this a rather elegant solution.
    def __repr__(self):
        return self.seq

    def check_alphabet(self):
I think it would be reasonable to make the alphabet check mandatory in __init__.
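One way this could look is to run the same subset check at construction time and refuse invalid input. The class below is a simplified, hypothetical example (ValidatedDNA is not a class from the reviewed code, and raising ValueError is an assumed convention).

```python
class ValidatedDNA:
    """Hypothetical simplified class showing validation in __init__."""

    ALPHABET = set("ACGT")

    def __init__(self, seq: str):
        # Same subset check as check_alphabet, but enforced on creation.
        if not set(seq.upper()).issubset(self.ALPHABET):
            raise ValueError(
                f"Sequence contains symbols outside {sorted(self.ALPHABET)}"
            )
        self.seq = seq
```

With this approach an object holding an invalid sequence can never exist, so later methods do not need to re-check the alphabet.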
    ALPHABET = {"A", "C", "D", "E", "F", "G", "H", "I","K", "L",
                "M", "N","P", "Q", "R", "S", "T", "V", "W", "Y"}
-     ALPHABET = {"A", "C", "D", "E", "F", "G", "H", "I","K", "L",
-                 "M", "N","P", "Q", "R", "S", "T", "V", "W", "Y"}
+     ALPHABET = {"A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
+                 "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"}
Spaces should be added after the commas. Besides this, there are a number of other trailing spaces at line ends and \t characters in empty lines, but I don't think it makes sense to dwell on each one. It is worth checking the code with a linter, which catches flaws that are hard to spot with the naked eye.
It is especially good that you used abstract classes and followed OOP principles. Your code is quite concise and easy to read, which is a plus.
Overall, the code handles the assigned tasks very well, and the remarks concern minor details and style. Great work!
    def gc_content(self):
        gc_count = self.seq.count('G') + self.seq.count('C')
        return gc_count / len(self.seq) * 100
Division by zero could be handled here.
Good catch.
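A possible guard for the empty-sequence case is sketched below as a standalone function rather than a method. Returning 0.0 for an empty sequence is an assumption; raising an exception would be an equally valid choice.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage; 0.0 for an empty sequence (assumed convention)."""
    if not seq:
        # Avoids ZeroDivisionError from len(seq) == 0.
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq) * 100
```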
    if output_filename is None:
        output_filename = input_path.split("/")[-1].split(".")[0] + "_filtered.fastq"

    output_path = "fastq_filtrator_results/" + output_filename
Adding a folder for the results is a great idea. But its existence could be checked before creating it.
This is a very, very good remark. Moreover, the folder could be created separately at the start, by a dedicated function for creating folders.
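A dedicated function of the kind mentioned here could be as small as the sketch below. ensure_dir is a hypothetical name, and the temporary base directory is used only so the example can run anywhere; os.makedirs with exist_ok=True covers the existence check in one call.

```python
import os
import tempfile


def ensure_dir(path: str) -> str:
    """Hypothetical helper: create the directory if missing, otherwise do nothing."""
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()  # throwaway location for the demonstration
results_dir = ensure_dir(os.path.join(base, "fastq_filtrator_results"))
ensure_dir(results_dir)  # calling it again on an existing folder is a no-op
```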