collapse-motifs

Code accompanying the paper "Unsupervised learning reveals landscape of local structural motifs across protein classes" by Alexander Derry and Russ B. Altman.

Install dependencies

All dependencies are installed together with COLLAPSE; follow the installation instructions here.

Download fitted clustering model

Fitted clustering models for various numbers of clusters k are available for download here. Download them into the data folder of this repo so that the scripts below can find them.

The number of clusters controls the specificity of the resulting structural motifs. We recommend k=50000 for general use; this is the value used in all analyses in our paper.
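Before running the scripts below, it can help to confirm that the downloaded model sits where they look for it. The following is a minimal sketch, assuming the file name pattern pdb100_cluster_fit_{k}.pkl inside ./data that cluster_featurize.py documents for its --k option. The helper name model_path is hypothetical and not part of this repo.

```python
from pathlib import Path


def model_path(k=50000, data_dir="data"):
    """Return the path where the fitted model for k clusters is expected.

    Assumption: files follow the pattern pdb100_cluster_fit_{k}.pkl.
    """
    return Path(data_dir) / f"pdb100_cluster_fit_{k}.pkl"


path = model_path(50000)
status = "found" if path.is_file() else "missing"
print(f"{path}: {status}")
```

If the file is reported as missing, re-check the download location before running the featurization step.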

Embed a directory of PDB files using COLLAPSE

Given a set of protein structures located at DATA_DIR, this script embeds all residues with COLLAPSE and stores the embeddings at OUT_DIR in LMDB format.

python embed_pdb_dataset.py DATA_DIR OUT_DIR [--filetype] [--split_id] [--num_splits]

Additional arguments are

--filetype (default pdb): file type of input structures, e.g. pdb,pdb.gz,cif

For large datasets, we recommend processing in parallel using the --split_id and --num_splits arguments. Run the command below once per split, with $i set to the index of that split:

python embed_pdb_dataset.py DATA_DIR OUT_DIR --split_id=$i --num_splits=NUM_SPLITS --filetype=pdb

This produces NUM_SPLITS tmp_ files in OUT_DIR. To combine all into the full dataset, run the following:

python -m atom3d.datasets.scripts.combine_lmdb OUT_DIR/tmp_* OUT_DIR/full
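The per-split embedding command can also be launched from a small driver script. The sketch below only builds and prints one command per split; the helper names split_commands and run_all are hypothetical. Whether split ids start at 0 or 1 is not stated here, so first_id is an assumption to check against embed_pdb_dataset.py before use.

```python
import subprocess
import sys


def split_commands(data_dir, out_dir, num_splits, first_id=1, filetype="pdb"):
    """Build one embed_pdb_dataset.py command per split.

    Assumption: split ids are consecutive integers starting at first_id.
    """
    return [
        [sys.executable, "embed_pdb_dataset.py", data_dir, out_dir,
         f"--split_id={i}", f"--num_splits={num_splits}",
         f"--filetype={filetype}"]
        for i in range(first_id, first_id + num_splits)
    ]


def run_all(commands):
    """Start every split at once, wait for all, and return the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


commands = split_commands("DATA_DIR", "OUT_DIR", num_splits=4)
for cmd in commands:
    print(" ".join(cmd))
# To actually launch the jobs: exit_codes = run_all(commands)
```

Starting all splits at once suits a machine with enough cores and memory; on a cluster, submitting each command as a separate job is likely the better fit.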

Featurize proteins using pre-trained clustering model

The following script takes an LMDB dataset of protein structures embedded using COLLAPSE (see above) and computes the clusters associated with each embedded residue. The output of this script can be in one of two formats: (1) LMDB, which simply adds a new key clusters to each element and stores a new LMDB dataset, or (2) pickle, which saves a dictionary mapping from the protein id to the list of clusters in residue sequence order.

python cluster_featurize.py DATA_PATH OUT_PATH [--out_fmt] [--k] [--split_id] [--num_splits]

Optional arguments are

--out_fmt (default pkl): Output format; valid options are either lmdb or pkl
--k (default 50000): Number of clusters from pre-fitted clustering model, assumed to be in ./data/pdb100_cluster_fit_{k}.pkl
--split_id and --num_splits: as above. These are only used when --out_fmt is lmdb
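As an illustration of the pkl output format described above (a dictionary from protein id to the list of clusters in residue order), the sketch below writes a toy dictionary to a temporary file and reads it back with the standard pickle module. The protein ids and cluster numbers are made up for the example; a real file would come from cluster_featurize.py.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

# Toy stand-in for the script output: protein id -> cluster ids in residue order.
toy = {"protA": [12, 7, 7, 4031], "protB": [7, 99]}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clusters.pkl"
    with open(out_path, "wb") as f:
        pickle.dump(toy, f)

    # Loading a real output file works the same way.
    with open(out_path, "rb") as f:
        cluster_sequences = pickle.load(f)

# Count how often each cluster occurs across all proteins.
counts = Counter(c for seq in cluster_sequences.values() for c in seq)
print(len(cluster_sequences), "proteins")
print("most frequent cluster:", counts.most_common(1))
```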

Compute TF-IDF fingerprints for clusterized dataset

This script computes TF-IDF fingerprints for a cluster-featurized dataset in pkl format, as produced by cluster_featurize.py.

python cluster_tfidf.py cluster_sequences out_path [--id_to_label] [--k]

Arguments are:

cluster_sequences: dict mapping id to cluster sequence, saved in pkl format
out_path: path where output representations will be saved (in pkl format with keys {data: TF-IDF fingerprints, labels: labels, ids: ids})
id_to_label (optional): pkl file containing dict mapping from id (same as in cluster_sequences) to labels. If not given, all labels will be None.
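To make the idea concrete, the sketch below computes TF-IDF vectors by treating each protein as a document and each cluster id as a term. It uses one common variant (term frequency normalized by sequence length, idf = log(N / df)); the exact weighting used by cluster_tfidf.py is not specified here, so this is an illustration rather than the script's implementation. The function name tfidf_fingerprints and the toy data are hypothetical; the output keys follow the format described for out_path.

```python
import math
from collections import Counter


def tfidf_fingerprints(cluster_sequences, k, id_to_label=None):
    """Return {data, labels, ids} with one length-k TF-IDF vector per protein."""
    ids = sorted(cluster_sequences)
    n_docs = len(ids)

    # Document frequency: number of proteins in which each cluster occurs.
    df = Counter()
    for pid in ids:
        df.update(set(cluster_sequences[pid]))

    data = []
    for pid in ids:
        seq = cluster_sequences[pid]
        vec = [0.0] * k
        for cluster, n in Counter(seq).items():
            tf = n / len(seq)                     # frequency within this protein
            idf = math.log(n_docs / df[cluster])  # rarity across proteins
            vec[cluster] = tf * idf
        data.append(vec)

    # Labels default to None when no mapping (or no entry) is available.
    labels = [id_to_label.get(pid) if id_to_label else None for pid in ids]
    return {"data": data, "labels": labels, "ids": ids}


toy = {"protA": [0, 1, 1, 3], "protB": [1, 2]}
out = tfidf_fingerprints(toy, k=4, id_to_label={"protA": "kinase"})
print(out["ids"], out["labels"])
```

With this variant, a cluster that occurs in every protein receives zero weight, so the fingerprint emphasizes clusters that distinguish a protein from the rest of the dataset. For k=50000, a sparse representation would be more practical than the dense lists used here.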

Scripts for reproducing analysis in paper

Analysis performed in the paper can be found in the following notebooks: eval_clustering.ipynb, eval_fold_search.ipynb, klifs-analysis.ipynb, and mutations.ipynb. Code for fitting clustering models is found in run_clustering.py.
