-
Notifications
You must be signed in to change notification settings - Fork 0
Review DIABLO #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Review DIABLO #19
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Сделано неплохо, но в целом впечатление, что код усложнен излишними действиями.
Вместо постоянных проверок типов DNA RNA лучше использовать полиморфизм
def filter_fastq(input_path: str, gc_bounds: Tuple[int, int] = (0, 100), | ||
length_bounds: Tuple[int, int] = (0, 2 ** 32), quality_threshold: int = 0, | ||
output_filename: str = None) -> NoReturn: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Почему не использовать обычный вариант: gc_bound: tuple ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Так мы аннотируем не только кортеж (что дает лишь органиченную инфу), а и содержимое кортежа (говорим что это именно целые числа). Это хорошая практика
if not os.path.isdir("fastq_filtrator_resuls"): | ||
os.mkdir("fastq_filtrator_resuls") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Мне нравится, что предусмотрена директория для результатов
with open(f'fastq_filtrator_resuls/{output_filename}.fastq', mode='w'): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Странный мув для создания файла. Можно открыть его в конце и сразу записать результаты
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Согласен
max_gc = gc_bounds | ||
|
||
for record in SeqIO.parse(open(input_path), "fastq"): | ||
name, seq, description, quality = record.id, record.seq, record.description, record.letter_annotations[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Излишние переменные. name и description не используется, остальные по одному разу.
with open(f'fastq_filtrator_resuls/{output_filename}.fastq', mode='a') as new_file: | ||
new_file.write(f'{record.format("fastq")} \n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Лучше открыть файл перед циклом, чем открывать на каждой иттерации
new_file.write(f'{record.format("fastq")} \n') | ||
|
||
|
||
class BiologicalSequence(str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
не реализованы методы __len__
, __get_item__
for base in self.seq: | ||
res.append(self.complement_pairs[base] if base.islower() else self.complement_pairs[base.lower()].upper()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
можно сразу делать upper
cnt = Counter(self.seq.upper()) | ||
return round((cnt['C'] + cnt['G']) / len(self.seq), 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
лучше переписать с использование метода count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Много интересных решений в коде, которые я раньше особо не встречал. Есть отдельные недопонимания (про отсутствие методов len и getitem уже написали до меня), но работа хорошая :)
AA_SET = set('FLIMVSPTAYHQNKDECWRG') | ||
DNA_NUCLEOTIDES = set('ATGC') | ||
RNA_NUCLEOTIDES = set('AUGC') | ||
PAIRS_DNA = {'a': 't', 't': 'a', 'c': 'g', 'g': 'c'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Мне кажется было бы лаконичнее сделать словари с комплементарными парами с заглавными буквами, раз уж нуклеотиды и аминокислоты тоже написаны капсом
max_length = length_bounds | ||
|
||
try: | ||
min_gc, max_gc = gc_bounds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Не сразу понял даже, как это сделано... Красиво на самом деле, классная задумка)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Задумка интересная, но все таки лучше не злоупотреблять try-except. Все таки если мы хотим какие-то случаи рассмотреть - то это if else. try-except он для ошибок, и не очень хорошо делать конструкции типа "если А, то будет ошибка, а если ошибка то B". Лучше сразу "если А то B"
""" | ||
if not os.path.isdir("fastq_filtrator_resuls"): | ||
os.mkdir("fastq_filtrator_resuls") | ||
with open(f'fastq_filtrator_resuls/{output_filename}.fastq', mode='w'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
опечатка?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Если видете опечатку, то наверное лучше в PR сразу ее и подправить, чтобы другие не искали
min_gc <= gc <= max_gc and \ | ||
q_seq >= quality_threshold: | ||
with open(f'fastq_filtrator_resuls/{output_filename}.fastq', mode='a') as new_file: | ||
new_file.write(f'{record.format("fastq")} \n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Интересное решение с многократным открытием файла на дозапись, но мне кажется, что один раз открыть на запись было бы проще и читабельнее
|
||
def __repr__(self): | ||
return f'Sequence: {self.seq}' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Я вот репр не сделал, а надо было бы...
float: The GC content of the sequence. | ||
""" | ||
cnt = Counter(self.seq.upper()) | ||
return round((cnt['C'] + cnt['G']) / len(self.seq), 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Можно было использовать встроенную функцию в Biopython
seq (str): The biological sequence. | ||
mol_type (str): The type of molecule in the sequence (DNA, RNA, AA_seq). | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Не очень понял необходимости этого класса, если у тебя в BiologicalSequence каких-то других вариантов для последовательности все равно нет. Да, BiologicalSequence бы тогда стал более громоздким, но мне кажется этот функционал присваивания типа молекулы было бы логично поместить туда.
""" | ||
res = [] | ||
for base in self.seq: | ||
res.append(self.complement_pairs[base] if base.islower() else self.complement_pairs[base.lower()].upper()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Вот тут, если бы словари в начале были большими буквами, можно было бы подсократить.
summ_charge.extend([PK3[key] for _ in range(value)]) | ||
except KeyError: | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Все гуд!
Review DIABLO