-
Notifications
You must be signed in to change notification settings - Fork 0
Review TCF4 #37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Review TCF4 #37
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Оба задания сделаны, поэтому уже хорошая работа.
К первой части, FastQ-фильтратору, принципиальных замечаний нет.
Во второй, "Biological sequence world", лучше было бы:
- Не наследоваться от встроенных классов.
- Не прописывать какие-либо детали касательно классов-наследников в родительском классе, а задействовать атрибуты классов-наследников.
class InvalidInputError(ValueError): | ||
pass | ||
|
||
class BiologicalSequence(ABC, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
При таком наследовании BiologicalSequence
и его наследники переймут все методы класса str
, которые ты не переопределишь. Например, у экземпляров твоего DNASequence
есть методы count
и transtale
.
class BiologicalSequence(ABC, str): | |
class BiologicalSequence(ABC): |
if letter in freq_dict: | ||
freq_dict[letter] += 1 | ||
else: | ||
freq_dict[letter] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Логика абсолютно правильная, есть способ записать её несколько короче.
У словарей в Python есть метод get
, он принимает два аргумента - ключ и любой аргумент. Если ключ в словаре есть, возвращает значение из словаря по этому ключу, если нет - возвращает второй аргумент.
if letter in freq_dict: | |
freq_dict[letter] += 1 | |
else: | |
freq_dict[letter] = 1 | |
freq_dict[letter] = freq_dict.get(letter, 0) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Его используют сильно реже, но тут кажется отличный кейс для get
def amino_acid_frequency(self): | ||
"""Calculates molecular weight of a protein |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Не из той функции взята докстринга.
class AminoAcidSequence(BiologicalSequence): | ||
def __init__(self, seq): | ||
self.seq = seq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Здесь есть маленькая деталь: ни в классе-родителе, ни в самом AminoAcidSequence
не реализована проверка алфавита последовательности на корректность. Поэтому я могу, например, создать экземпляр AminoAcidSequence
, передав конструктору класса число 1000
. И потом запросить для него частоту аминокислот.
@abstractmethod | ||
def __init__(self, seq): | ||
self.seq = seq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Насколько я успел посмотреть, в абстрактных классах не используют связку __init__
и @abstractmethod
.
Обычно делают так: в родительском, абстрактном классе объявляют переменную класса __slots__
. Это такой способ зарезервировать память под набор атрибутов. Затем в __init__
класса-наследника их определяют.
@abstractmethod | |
def __init__(self, seq): | |
self.seq = seq | |
__slots__ = ('seq',) |
Если вдруг станет интересно, что можно посмотреть:
- Исходный код абстрактного класса для последовательностей в Python.
- Особенности использования
__slots__
детально разобраны тут.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upd: Я всё-таки встретил связку __init__
и @abstractmethod
.
filename = input_path | ||
records = SeqIO.parse(filename, "fastq") | ||
###quality filter | ||
good_reads = (rec for rec in records if min(rec.letter_annotations["phred_quality"]) >= quality_threshold) | ||
result_quality = SeqIO.write(good_reads, "good_quality.fastq", "fastq") | ||
result_quality_GC = SeqIO.parse("good_quality.fastq", "fastq") | ||
###GC content filter | ||
min_gc_content = gc_bounds[0] | ||
max_gc_content = gc_bounds[1] | ||
GC_quality_filt = [] | ||
|
||
for sequence in result_quality_GC: | ||
if min_gc_content <= GC(sequence.seq) <= max_gc_content: | ||
GC_quality_filt.append(sequence) | ||
|
||
result_quality = SeqIO.write(GC_quality_filt, "good_quality_GC.fastq", "fastq") | ||
result_quality_GC_length = SeqIO.parse("good_quality_GC.fastq", "fastq") | ||
|
||
##length filter | ||
filtered_GC_quality_length = [] | ||
|
||
for sequence in result_quality_GC_length: | ||
if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]: | ||
filtered_GC_quality_length.append(sequence) | ||
|
||
result_quality = SeqIO.write(filtered_GC_quality_length, output_filename, "fastq") | ||
|
||
print(result_quality) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Логику последовательных проверок можно реализовать чуть короче, не создавая итератор для каждой новой проверки:
filename = input_path | |
records = SeqIO.parse(filename, "fastq") | |
###quality filter | |
good_reads = (rec for rec in records if min(rec.letter_annotations["phred_quality"]) >= quality_threshold) | |
result_quality = SeqIO.write(good_reads, "good_quality.fastq", "fastq") | |
result_quality_GC = SeqIO.parse("good_quality.fastq", "fastq") | |
###GC content filter | |
min_gc_content = gc_bounds[0] | |
max_gc_content = gc_bounds[1] | |
GC_quality_filt = [] | |
for sequence in result_quality_GC: | |
if min_gc_content <= GC(sequence.seq) <= max_gc_content: | |
GC_quality_filt.append(sequence) | |
result_quality = SeqIO.write(GC_quality_filt, "good_quality_GC.fastq", "fastq") | |
result_quality_GC_length = SeqIO.parse("good_quality_GC.fastq", "fastq") | |
##length filter | |
filtered_GC_quality_length = [] | |
for sequence in result_quality_GC_length: | |
if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]: | |
filtered_GC_quality_length.append(sequence) | |
result_quality = SeqIO.write(filtered_GC_quality_length, output_filename, "fastq") | |
print(result_quality) | |
def filter_fastq_corr(input_path: str, quality_threshold: int, output_filename="final_filtered_corr.fastq", gc_bounds=(40, 60), length_bounds=(50, 350)): | |
if isinstance(gc_bounds, int): | |
gc_bounds = (0, gc_bounds) | |
min_gc_content, max_gc_content = gc_bounds | |
length_min, length_max = length_bounds | |
records = SeqIO.parse(input_path, "fastq") | |
good_reads = [] | |
for rec in records: | |
rec_len, rec_gc_content = len(rec.seq), GC(rec.seq) | |
if ( | |
min(rec.letter_annotations["phred_quality"]) >= quality_threshold | |
and max_gc_content >= rec_gc_content >= min_gc_content | |
and length_max >= rec_len >= length_min | |
): | |
good_reads.append(rec) | |
result_quality = SeqIO.write(good_reads, output_filename, "fastq") | |
print(result_quality) |
for sequence in result_quality_GC_length: | ||
if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for sequence in result_quality_GC_length: | |
if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]: | |
length_min, length_max = length_bounds | |
for sequence in result_quality_GC_length: | |
seq_len = len(sequence.seq) | |
if seq_len >= length_min and seq_len <= length_max: |
from Bio import SeqIO | ||
from Bio.SeqUtils import GC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Работа с SeqIO.parse
реализована хорошо, единственное, он избыточен в паре мест.
|
||
class NucleicAcidSequence(BiologicalSequence): | ||
def __init__(self, seq): | ||
super().__init__(seq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Благодаря этому моменту я разобрался в деталях работы super()
. И что им, оказывается, можно задействовать смысловой код из абстрактного метода класса-родителя.
def __len__(self): | ||
return len(self.seq) | ||
|
||
def __getitem__(self, index): | ||
return self.seq[int(index)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Я бы выполнил это задание в таком ключе, определив каждый метод как абстрактный, без конкретной реализации:
def __len__(self): | |
return len(self.seq) | |
def __getitem__(self, index): | |
return self.seq[int(index)] | |
@abstractmethod | |
def __len__(self): | |
raise NotImplementedError | |
@abstractmethod | |
def __getitem__(self, index): | |
raise NotImplementedError |
Питон допускает реализацию в абстрактном методе, но это нужно для отдельных случаев с множественным наследованием.
Review TCF4