Skip to content

Added a clustal test file, an example notebook, and alignment parsing… #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

biophyser
Copy link
Contributor

… in the _read function. Also made two hidden funcitons to iterate sequence and alignment biopython objects.

Did a bit of hacking to get the dataframes to have the same form as previously. Multiple sequence alignments are spit out as a multiindexed dataframe.

… in the _read function. Also made two hidden funcitons to iterate sequence and alignment biopython objects.
@Zsailer
Copy link
Owner

Zsailer commented Oct 29, 2017

Awesome! This looks good. I'm testing it now and will leave comments if I see things that need fixing.

@Zsailer
Copy link
Owner

Zsailer commented Oct 29, 2017

One general comment about coding style:

I typically use a "lowercase and underscore" convention when naming instance variables, functions,
and methods (except in the unusual scenario where a function is a factory for a class).

This convention was set by PEP 8. I'd like to follow PEP 8 as best as we can.

I'll leave comments in the PR where this convention is broken.

Copy link
Owner

@Zsailer Zsailer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only marked a few of the places where PEP 8 was broken. There a few more that need fixing as well.


# Distinguish sequence vs. alignment, pass SeqRecords
if schema == 'clustal':
alignIter = AlignIO.parse(filename, format=schema, **kwargs)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow PEP 8: alignIter -> align_iter

if schema == 'clustal':
alignIter = AlignIO.parse(filename, format=schema, **kwargs)
# Parse the multiple sequence alignments
alignDict = _parseAlignRec(alignIter, seq_label)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow PEP 8: alignDict -> align_dict

# Return DataFrame.
return data

def _parseAlignRec(alignIter, seq_label):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow PEP 8:

  • _parseAlignRec -> _parse_align_record
  • alignIter -> align_iter


return alignData

def _parseSeqRec(seqIter, seq_label):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow PEP 8:

  • _parseSeqRec -> _parse_seq_rec
  • seqIter -> seq_iter

def _parseSeqRec(seqIter, seq_label):
"""A Bio.SequenceRecord parser.
"""
seqDict = dict()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow PEP 8: seqDict -> seq_dict

Continue below.

@Zsailer
Copy link
Owner

Zsailer commented Oct 29, 2017

It might be nice to be able to name the alignments instead of having each alignment be multiindexed by a number but maybe that's not a huge issue.

I think I agree with this statement... give each item a value in the name column (like alignment_1 and alignment_2) rather than multi-index.

Could you explain the output in the example clustal? Are each mouse/opossum line-pairs just segments of one longer sequence? Or are each pair a different sequence?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants