-
Notifications
You must be signed in to change notification settings - Fork 0
Review RGL1 #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Review RGL1 #36
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Хороший код, есть несколько недочетов, но это некритично)
from typing import Union | ||
from Bio import SeqIO, SeqUtils | ||
from abc import ABC, abstractmethod | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Здорово, что используется Union, это упрощает понимание кода
if (quality_threshold < sum(record.letter_annotations['phred_quality']) / | ||
len(record.letter_annotations['phred_quality'])): | ||
return record | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Все три функции выше возвращают record или None. Это подход работает, но его можно упростить, изменив функции так, чтобы они возвращали True или False, что будет более понятно для проверки условий
|
||
def filter_fastq(file_path: str, output_file: str = 'input_fasta_name_filter.fastq', | ||
lower_gc_bound: int = 0, upper_gc_bound: int = 30, | ||
lower_length_bound: int = 0, upper_length_bound: int = 2**32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Тут тоже можно было написать с помощью Union
gc_filter(record, lower_gc_bound, upper_gc_bound) and \ | ||
quality_filter(record, quality_threshold): | ||
SeqIO.write(record, output, 'fastq') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Повторная запись в файл для каждой подходящей записи через SeqIO.write(record, output, 'fastq') в цикле неэффективна, так как каждая запись обрабатывается отдельно. Лучше собрать подходящие записи в список, а затем записать их одним вызовом SeqIO.write(filtered_records, output, 'fastq')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
В данном случае есть 2 варианта
- Если читаем все разом, то и записывать лучше все разом, да
- Но можно обрабатывать данные "на лету" - прочитали одну запись и если надо - записали. В таком случае просто лучше держать все время файлы открытыми (делать все внутри блока with open). Так мне даже чуть больше нравится
gc_filter(record, lower_gc_bound, upper_gc_bound) and \ | ||
quality_filter(record, quality_threshold): | ||
SeqIO.write(record, output, 'fastq') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Нет проверки корректности диапазонов для lower_gc_bound, upper_gc_bound, lower_length_bound, upper_length_bound и quality_threshold
|
||
class SequenceFunction(BiologicalSequence): | ||
alphabet = () | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Стоило бы настроить алфавит
return self.seq[item] | ||
else: | ||
raise IndexError('Your index is incorrect') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Это избыточно, Python сам генерирует IndexError для индексов вне диапазона
def __getitem__(self, item: int) -> str: | |
return self.seq[item] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Верно
def gc_content(self) -> Union[int, float]: | ||
gc_count = self.seq.count('C') + self.seq.count('G') | ||
return gc_count/len(self.seq)*100 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Аннотация типа возвращает Union[int, float], но на практике это всегда будет float
def gc_content(self) -> float: |
def check_alphabet(self): | ||
return super().check_alphabet() | ||
|
||
def count_molecular_weight(self, amino_acid_weights: dict) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
В аннотации указано, что метод count_molecular_weight возвращает int, но на самом деле результатом будет float
return super().check_alphabet() | ||
|
||
def count_molecular_weight(self, amino_acid_weights: dict) -> int: | ||
molecular_weight = sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Метод count_molecular_weight принимает параметр amino_acid_weights, который не используется. Вместо этого, используется атрибут класса
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
molecular_weight = sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq) | |
def count_molecular_weight(self) -> int: | |
molecular_weight = sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq) | |
return molecular_weight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
В целом работа хорошая, код рабочий, написанную утилиту можно использовать для обработки данных. Оформление аккуратное, есть аннотации типов, соблюдены отступы, наименования понятны. Небольшие огрехи довольно легко исправляются.
|
||
def length_filter(record, lower_length_bound: int, upper_length_bound: int): | ||
if lower_length_bound <= len(record.seq) <= upper_length_bound: | ||
return record |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Немного избыточно, можно либо True, либо False
def filter_fastq(file_path: str, output_file: str = 'input_fasta_name_filter.fastq', | ||
lower_gc_bound: int = 0, upper_gc_bound: int = 30, | ||
lower_length_bound: int = 0, upper_length_bound: int = 2**32, | ||
quality_threshold: Union[int, float] = 0) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Хорошее использование модулей
SeqIO.write(record, output, 'fastq') | ||
|
||
|
||
class BiologicalSequence(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Реализованы все необходимые абстрактные методы
pass | ||
|
||
|
||
class SequenceFunction(BiologicalSequence): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Крутое решение с введением вспомогательного класса
self.length = len(self.seq) | ||
return self.length | ||
|
||
def __getitem__(self, item: int) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Работающее решение, но было бы здорово, если бы пользователь мог получить не только отдельный элемент, но целый срез
super().__init__(seq) | ||
|
||
def complement(self) -> str: | ||
complement_seq = self.seq.translate(str.maketrans(self.complement_dict)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Интересное решение!
|
||
|
||
class DNASequence(NucleicAcidSequence): | ||
alphabet = ('A', 'T', 'G', 'C') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Регистрозависимость, на вход будут приниматься только записи большими буквами, возможно, стоит добавить upper в check_alphabet
super().__init__(seq) | ||
|
||
def check_alphabet(self): | ||
return super().check_alphabet() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Опять же, возможно, стоит добавить переведение последовательности в верхний регистр
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Очень классная работа! Было понятно и удобно читать код. Возможно, для человека, который впервые столкнулся бы с модулем, без докстринги было бы сложно понять, что происходит. Есть классные решения, взяла на заметку:)
@@ -0,0 +1,142 @@ | |||
from typing import Union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Классно, что есть Union! Мне самое порой лениво его использовать, так что респект
from abc import ABC, abstractmethod | ||
|
||
|
||
def length_filter(record, lower_length_bound: int, upper_length_bound: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Немножко не хватило аннотации, как-то было не очевидно сразу, что он может выдавать и record, и None
def filter_fastq(file_path: str, output_file: str = 'input_fasta_name_filter.fastq', | ||
lower_gc_bound: int = 0, upper_gc_bound: int = 30, | ||
lower_length_bound: int = 0, upper_length_bound: int = 2**32, | ||
quality_threshold: Union[int, float] = 0) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like!
def __getitem__(self, item: int) -> str: | ||
if 0 <= item < len(self.seq): | ||
return self.seq[item] | ||
else: | ||
raise IndexError('Your index is incorrect') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Немножко странно, что ошибка по индексу будет вылезать при отрицательных значениях индекса: это же ок для питона, мы можем обращаться к [-1] и тд
return complement_seq | ||
|
||
def gc_content(self) -> Union[int, float]: | ||
gc_count = self.seq.count('C') + self.seq.count('G') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gc_count = self.seq.count('C') + self.seq.count('G') | |
gc_count = sum(1 for base in self.seq if base in 'GC') |
Можно попробовать так, как разбирали недавно на паре. Мне кажется, так будет более эффективно по скорости, потому что мы считаем один раз, а не два.
lower_gc_bound: int = 0, upper_gc_bound: int = 30, | ||
lower_length_bound: int = 0, upper_length_bound: int = 2**32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Возможно, стоило бы вынести эти параметры как константы вне функции
alphabet = ('A', 'T', 'G', 'C') | ||
complement_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Я бы добавила строчные буквы или тогда делала бы upper для входной строки
super().__init__(seq) | ||
|
||
def complement(self) -> str: | ||
complement_seq = self.seq.translate(str.maketrans(self.complement_dict)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Прикольно с maketrans:)) не знала о существовании этой функции
Review RGL1