Review TCF4 #37

nvaulin · 2024-02-26T17:57:41Z

Review TCF4

angrygeese

Both tasks are done, so this is already good work.

No fundamental remarks on the first part, the FastQ filter.

In the second part, "Biological sequence world", it would be better:

Not to inherit from built-in classes.
Not to hard-code any details about child classes in the parent class, but to rely on the child classes' own attributes instead.

angrygeese · 2024-03-02T21:56:31Z

TCF4.py

+class InvalidInputError(ValueError):
+    pass
+
+class BiologicalSequence(ABC, str):


With this inheritance, BiologicalSequence and its descendants pick up every method of the str class that you do not override. For example, instances of your DNASequence have the count and translate methods.

Suggested change

-class BiologicalSequence(ABC, str):
+class BiologicalSequence(ABC):
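A minimal sketch of the difference, assuming two hypothetical classes (`StrBackedSequence` and `ComposedSequence`) that are not part of TCF4.py. The first mirrors the original inheritance from str, the second stores the string as an attribute:

```python
from abc import ABC


class StrBackedSequence(ABC, str):
    """Hypothetical class that mirrors the original inheritance from str."""


class ComposedSequence(ABC):
    """Hypothetical class that keeps the string as an attribute instead."""

    def __init__(self, seq: str):
        self.seq = seq


inherited = StrBackedSequence("ATGC")
composed = ComposedSequence("ATGC")

# Every public str method leaks into the str-backed class.
print(hasattr(inherited, "translate"))  # True
print(hasattr(composed, "translate"))   # False

# Inherited str methods also return plain str, so the subclass type is lost.
print(type(inherited.upper()).__name__)  # str
```

With composition, the class exposes only the methods it defines itself, so a str method such as translate cannot be mistaken for a biological operation.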

angrygeese · 2024-03-02T22:01:16Z

TCF4.py

+            if letter in freq_dict:
+                freq_dict[letter] += 1
+            else:
+                freq_dict[letter] = 1


The logic is entirely correct; there is just a way to write it somewhat shorter.

Python dictionaries have a get method that takes two arguments: a key and a default value. If the key is in the dictionary, it returns the value stored under that key; if not, it returns the second argument.

Suggested change

-            if letter in freq_dict:
-                freq_dict[letter] += 1
-            else:
-                freq_dict[letter] = 1
+            freq_dict[letter] = freq_dict.get(letter, 0) + 1

It is used far less often, but this looks like a great case for get.
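A small self-contained sketch of the same counting pattern, using a made-up input string. The comparison with `collections.Counter` from the standard library is only there to show that both approaches give the same counts:

```python
from collections import Counter

sequence = "MKKLV"  # hypothetical input, not taken from TCF4.py

freq_dict = {}
for letter in sequence:
    # get returns 0 for a letter seen for the first time
    freq_dict[letter] = freq_dict.get(letter, 0) + 1

print(freq_dict)  # {'M': 1, 'K': 2, 'L': 1, 'V': 1}

# The standard library offers the same counting out of the box.
print(freq_dict == dict(Counter(sequence)))  # True
```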

angrygeese · 2024-03-02T22:02:04Z

TCF4.py

+    def amino_acid_frequency(self):
+        """Calculates molecular weight of a protein


The docstring was copied from a different function.

angrygeese · 2024-03-02T22:02:35Z

TCF4.py

+class AminoAcidSequence(BiologicalSequence):
+    def __init__(self, seq):
+        self.seq = seq


There is one small detail here: neither the parent class nor AminoAcidSequence itself checks that the sequence alphabet is valid. So I can, for example, create an AminoAcidSequence instance by passing the number 1000 to the constructor, and then ask it for the amino acid frequencies.
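One possible way to add such a check, as a sketch only. It assumes the 20 standard one-letter amino acid codes as the alphabet and drops the parent class for brevity; `ALPHABET` is a hypothetical attribute name, while `InvalidInputError` is the exception already defined in TCF4.py:

```python
class InvalidInputError(ValueError):
    pass


class AminoAcidSequence:
    """Simplified stand-in for the class under review (no parent class here)."""

    # Assumption: only the 20 standard one-letter amino acid codes are valid.
    ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")

    def __init__(self, seq):
        if not isinstance(seq, str):
            raise TypeError(f"seq must be a str, got {type(seq).__name__}")
        invalid = set(seq.upper()) - self.ALPHABET
        if invalid:
            raise InvalidInputError(f"Unexpected symbols: {sorted(invalid)}")
        self.seq = seq


protein = AminoAcidSequence("MKV")   # accepted
# AminoAcidSequence(1000)            # would raise TypeError
# AminoAcidSequence("MKX1")          # would raise InvalidInputError
```

Keeping the alphabet as a class attribute also fits the earlier remark: each child class can declare its own alphabet, and the parent does not need to know the details.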

angrygeese · 2024-03-03T07:13:19Z

TCF4.py

+    @abstractmethod
+    def __init__(self, seq):
+        self.seq = seq


As far as I have been able to check, abstract classes do not usually combine __init__ with @abstractmethod.

The usual approach is this: the parent abstract class declares the class variable __slots__. This is a way to reserve memory for a fixed set of attributes. The attributes are then assigned in the child class's __init__.

Suggested change

-    @abstractmethod
-    def __init__(self, seq):
-        self.seq = seq
+    __slots__ = ('seq',)

In case you get curious, here is what you can look at:

The source code of the abstract base class for sequences in Python.

A detailed breakdown of the specifics of using __slots__.

upd: I did come across the __init__ and @abstractmethod combination after all.
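A rough sketch of the __slots__ layout described above. The method name `check_alphabet` is an assumption made for illustration and may not match TCF4.py:

```python
from abc import ABC, abstractmethod


class BiologicalSequence(ABC):
    # Reserves the attribute name; the value is assigned in child classes.
    __slots__ = ("seq",)

    @abstractmethod
    def check_alphabet(self) -> bool:
        """Each child class decides what counts as a valid sequence."""


class DNASequence(BiologicalSequence):
    # An empty __slots__ keeps instances free of a per-instance __dict__.
    __slots__ = ()

    def __init__(self, seq: str):
        self.seq = seq

    def check_alphabet(self) -> bool:
        return set(self.seq.upper()) <= set("ATGC")


dna = DNASequence("ATGC")
print(dna.check_alphabet())  # True
```

Note that a child class which does not declare __slots__ itself gets a regular __dict__ again, so the restriction holds only if every class in the hierarchy declares it.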

angrygeese · 2024-03-04T19:49:30Z

TCF4.py

+    filename = input_path
+    records = SeqIO.parse(filename, "fastq")
+    ###quality filter
+    good_reads = (rec for rec in records if min(rec.letter_annotations["phred_quality"]) >= quality_threshold)
+    result_quality = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
+    result_quality_GC = SeqIO.parse("good_quality.fastq", "fastq")
+    ###GC content filter
+    min_gc_content = gc_bounds[0]
+    max_gc_content = gc_bounds[1]
+    GC_quality_filt = []
+
+    for sequence in result_quality_GC:
+        if min_gc_content <= GC(sequence.seq) <= max_gc_content:
+            GC_quality_filt.append(sequence)
+
+    result_quality = SeqIO.write(GC_quality_filt, "good_quality_GC.fastq", "fastq")
+    result_quality_GC_length = SeqIO.parse("good_quality_GC.fastq", "fastq")
+
+    ##length filter
+    filtered_GC_quality_length = []
+
+    for sequence in result_quality_GC_length:
+        if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:
+            filtered_GC_quality_length.append(sequence)
+
+    result_quality = SeqIO.write(filtered_GC_quality_length, output_filename, "fastq")
+
+    print(result_quality)


The sequential checks can be implemented a bit more concisely, without creating a new iterator for each check:

Suggested change

-    filename = input_path
-    records = SeqIO.parse(filename, "fastq")
-    ###quality filter
-    good_reads = (rec for rec in records if min(rec.letter_annotations["phred_quality"]) >= quality_threshold)
-    result_quality = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
-    result_quality_GC = SeqIO.parse("good_quality.fastq", "fastq")
-    ###GC content filter
-    min_gc_content = gc_bounds[0]
-    max_gc_content = gc_bounds[1]
-    GC_quality_filt = []
-
-    for sequence in result_quality_GC:
-        if min_gc_content <= GC(sequence.seq) <= max_gc_content:
-            GC_quality_filt.append(sequence)
-
-    result_quality = SeqIO.write(GC_quality_filt, "good_quality_GC.fastq", "fastq")
-    result_quality_GC_length = SeqIO.parse("good_quality_GC.fastq", "fastq")
-
-    ##length filter
-    filtered_GC_quality_length = []
-
-    for sequence in result_quality_GC_length:
-        if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:
-            filtered_GC_quality_length.append(sequence)
-
-    result_quality = SeqIO.write(filtered_GC_quality_length, output_filename, "fastq")
-
-    print(result_quality)
+def filter_fastq_corr(input_path: str, quality_threshold: int, output_filename="final_filtered_corr.fastq", gc_bounds=(40, 60), length_bounds=(50, 350)):
+    if isinstance(gc_bounds, int):
+        gc_bounds = (0, gc_bounds)
+    min_gc_content, max_gc_content = gc_bounds
+    length_min, length_max = length_bounds
+    records = SeqIO.parse(input_path, "fastq")
+    good_reads = []
+    for rec in records:
+        rec_len, rec_gc_content = len(rec.seq), GC(rec.seq)
+        if (
+            min(rec.letter_annotations["phred_quality"]) >= quality_threshold
+            and max_gc_content >= rec_gc_content >= min_gc_content
+            and length_max >= rec_len >= length_min
+        ):
+            good_reads.append(rec)
+    result_quality = SeqIO.write(good_reads, output_filename, "fastq")
+    print(result_quality)
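The same single-pass idea can also be sketched without Biopython, which makes it easy to try in isolation. Everything here is an illustrative assumption: `Read` is a hypothetical stand-in for a SeqRecord, and `gc_percent` and `filter_reads` are made-up helper names. The generator yields reads lazily, so nothing is written to intermediate files:

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    """Hypothetical stand-in for a Biopython SeqRecord."""
    name: str
    seq: str
    quality: list


def gc_percent(seq: str) -> float:
    """GC content in percent; 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    gc_count = sum(base in "GCgc" for base in seq)
    return 100 * gc_count / len(seq)


def filter_reads(
    reads: Iterable[Read],
    quality_threshold: int = 0,
    gc_bounds: tuple = (0, 100),
    length_bounds: tuple = (0, 2**32),
) -> Iterator[Read]:
    """Yield reads that pass all three checks in a single pass."""
    gc_min, gc_max = gc_bounds
    length_min, length_max = length_bounds
    for read in reads:
        if (
            min(read.quality, default=0) >= quality_threshold
            and gc_min <= gc_percent(read.seq) <= gc_max
            and length_min <= len(read.seq) <= length_max
        ):
            yield read


reads = [
    Read("r1", "ATGC", [30, 30, 30, 30]),
    Read("r2", "ATGC", [30, 5, 30, 30]),   # fails the quality check
    Read("r3", "AAAA", [30, 30, 30, 30]),  # fails the GC check
    Read("r4", "GC", [40, 40]),            # fails the GC and length checks
]
kept = list(filter_reads(reads, quality_threshold=20, gc_bounds=(40, 60), length_bounds=(3, 10)))
print([read.name for read in kept])  # ['r1']
```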

angrygeese · 2024-03-04T19:51:50Z

TCF4.py

+    for sequence in result_quality_GC_length:
+        if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:


Suggested change

-    for sequence in result_quality_GC_length:
-        if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:
+    length_min, length_max = length_bounds
+    for sequence in result_quality_GC_length:
+        seq_len = len(sequence.seq)
+        if seq_len >= length_min and seq_len <= length_max:

angrygeese · 2024-03-04T19:54:21Z

TCF4.py

+from Bio import SeqIO
+from Bio.SeqUtils import GC


The work with SeqIO.parse is done well; the only issue is that it is redundant in a couple of places.

angrygeese · 2024-03-06T17:23:21Z

TCF4.py

+
+class NucleicAcidSequence(BiologicalSequence):
+    def __init__(self, seq):
+        super().__init__(seq)


Thanks to this bit I finally worked out the details of how super() works. It turns out it can also be used to run meaningful code from an abstract method of the parent class.
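A stripped-down sketch of that behaviour, keeping only the two constructors from the diff. The abstract __init__ still has a regular body, and the child class reaches it through super():

```python
from abc import ABC, abstractmethod


class BiologicalSequence(ABC):
    @abstractmethod
    def __init__(self, seq):
        # The body is ordinary code even though the method is abstract.
        self.seq = seq


class NucleicAcidSequence(BiologicalSequence):
    def __init__(self, seq):
        # Overriding __init__ makes the class concrete;
        # super() then runs the parent's abstract method body.
        super().__init__(seq)


dna = NucleicAcidSequence("ATGC")
print(dna.seq)  # ATGC
```

The parent itself still cannot be instantiated: calling `BiologicalSequence("ATGC")` raises TypeError because __init__ is abstract there.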

angrygeese · 2024-03-06T17:48:55Z

TCF4.py

+    def __len__(self):
+        return len(self.seq)
+
+    def __getitem__(self, index):
+        return self.seq[int(index)]


I would approach this task along the following lines, declaring each method as abstract, with no concrete implementation:

Suggested change

-    def __len__(self):
-        return len(self.seq)
-
-    def __getitem__(self, index):
-        return self.seq[int(index)]
+    @abstractmethod
+    def __len__(self):
+        raise NotImplementedError
+
+    @abstractmethod
+    def __getitem__(self, index):
+        raise NotImplementedError

Python does allow an implementation inside an abstract method, but that is needed only in particular cases involving multiple inheritance.
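A short sketch of what this buys in practice. `IncompleteSequence` is a hypothetical child class added only to show the failure mode, and the `DNASequence` body here is an assumption, not the code from TCF4.py:

```python
from abc import ABC, abstractmethod


class BiologicalSequence(ABC):
    @abstractmethod
    def __len__(self):
        raise NotImplementedError

    @abstractmethod
    def __getitem__(self, index):
        raise NotImplementedError


class IncompleteSequence(BiologicalSequence):
    """Hypothetical child that forgets to implement __getitem__."""

    def __len__(self):
        return 0


class DNASequence(BiologicalSequence):
    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, index):
        # Passing index through unchanged keeps slices working,
        # unlike the int(index) conversion in the original code.
        return self.seq[index]


dna = DNASequence("ATGC")
print(len(dna))   # 4
print(dna[1:3])   # TG

try:
    IncompleteSequence()
except TypeError as error:
    print("Cannot instantiate:", error)
```

The missing method is reported when the object is created, not later when the method is first called.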

Add TCF4.py

db1b1a6

angrygeese reviewed Mar 6, 2024
