Protify

A low-code solution for computationally predicting the properties of chemicals.
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Built With
  6. License
  7. Contact
  8. Cite

About The Project

Protify is an open source platform designed to simplify and democratize workflows for chemical language models. With Protify, deep learning models can be trained to predict chemical properties at the click of a button, without requiring extensive coding knowledge or computational resources.

Why Protify?

  • Benchmark multiple models efficiently: Need to evaluate 10 different protein language models against 15 diverse datasets with publication-ready figures? Protify makes this possible without writing a single line of code.
  • Flexible for all skill levels: Build custom pipelines with code or use our no-code interface depending on your needs and expertise.
  • Accessible computing: No GPU? No problem. Synthyra offers precomputed embeddings for many popular datasets, which Protify can download for analysis with scikit-learn on your laptop.
  • Cost-effective solutions: The upcoming Synthyra API integration will offer affordable GPU training options, while our Colab notebook provides an accessible entry point for GPU-reliant analysis.

Protify is currently in beta. We're actively working to enhance features and documentation to meet our ambitious goals.

Currently Supported Models

pLM - Protein Language Model

| Model Name | Description | Size (parameters) | Type |
|---|---|---|---|
| ESM2-8 | Small pLM from Meta AI that learns evolutionary information from millions of protein sequences. | 8M | pLM |
| ESM2-35 | Medium-sized pLM trained on evolutionary data. | 35M | pLM |
| ESM2-150 | Large pLM with improved protein structure prediction capabilities. | 150M | pLM |
| ESM2-650 | Very large pLM offering state-of-the-art performance on many protein prediction tasks. | 650M | pLM |
| ESM2-3B | Largest ESM2 pLM with exceptional capability for protein structure and function prediction. | 3B | pLM |
| ESMC-300 | pLM optimized for classification tasks. | 300M | pLM |
| ESMC-600 | Larger pLM for classification. | 600M | pLM |
| ProtBert | BERT-based pLM trained on protein sequences from UniRef. | 420M | pLM |
| ProtBert-BFD | BERT-based pLM trained on BFD database with improved performance. | 420M | pLM |
| ProtT5 | T5-based pLM capable of both encoding and generation tasks. | 3B | pLM |
| ANKH-Base | Base version of the ANKH pLM focused on protein structure understanding. | 400M | pLM |
| ANKH-Large | Large version of the ANKH pLM with improved structural predictions. | 1.2B | pLM |
| ANKH2-Large | Improved second generation ANKH pLM. | 1.5B | pLM |
| GLM2-150 | Medium-sized general language model adapted for protein sequences. | 150M | pLM |
| GLM2-650 | Large general language model adapted for protein sequences. | 650M | pLM |
| GLM2-GAIA | Specialized GLM pLM with GAIA architecture improvements. | 650M | pLM |
| DPLM-150 | Diffusion pLM focused on joint sequence and structure. | 150M | pLM |
| DPLM-650 | Larger DPLM. | 650M | pLM |
| DPLM-3B | Largest DPLM. | 3B | pLM |
| DSM-150 | Diffusion language model for proteins. | 150M | pLM |
| DSM-650 | Diffusion language model for proteins. | 650M | pLM |
| Random | Baseline model with randomly initialized weights, serving as a negative control. | Varies | Negative control |
| Random-Transformer | Randomly initialized transformer model serving as a homology-based control. | Varies | Homology control |

Currently Supported Datasets

BC - Binary Classification

MCC - Multi-Class Classification

MLC - Multi-Label Classification

R - Regression

| Dataset Name | Description | Type | Task | Tokenwise | Dual inputs |
|---|---|---|---|---|---|
| EC | Enzyme Commission numbers dataset for predicting enzyme function classification. | MLC | Protein function prediction | No | No |
| GO-CC | Gene Ontology Cellular Component dataset for predicting protein localization in cells. | MLC | Protein localization prediction | No | No |
| GO-BP | Gene Ontology Biological Process dataset for predicting protein involvement in biological processes. | MLC | Protein function prediction | No | No |
| GO-MF | Gene Ontology Molecular Function dataset for predicting protein molecular functions. | MLC | Protein function prediction | No | No |
| MB | Metal ion binding dataset for predicting protein-metal interactions. | BC | Protein-metal binding prediction | No | No |
| DeepLoc-2 | Binary classification dataset for predicting protein localization in 2 categories. | BC | Protein localization prediction | No | No |
| DeepLoc-10 | Multi-class classification dataset for predicting protein localization in 10 categories. | MCC | Protein localization prediction | No | No |
| enzyme-kcat | Dataset for predicting enzyme catalytic rate constants (kcat). | R | Enzyme kinetics prediction | No | No |
| solubility | Dataset for predicting protein solubility properties. | BC | Protein solubility prediction | No | No |
| localization | Dataset for predicting subcellular localization of proteins. | MCC | Protein localization prediction | No | No |
| temperature-stability | Dataset for predicting protein stability at different temperatures. | BC | Protein stability prediction | No | No |
| optimal-temperature | Dataset for predicting the optimal temperature for protein function. | R | Protein property prediction | No | No |
| optimal-ph | Dataset for predicting the optimal pH for protein function. | R | Protein property prediction | No | No |
| fitness-prediction | Dataset for predicting protein fitness in various environments. | R | Protein fitness prediction | No | No |
| SecondaryStructure-3 | Dataset for predicting protein secondary structure in 3+1 classes. | MCC | Protein structure prediction | Yes | No |
| SecondaryStructure-8 | Dataset for predicting protein secondary structure in 8+1 classes. | MCC | Protein structure prediction | Yes | No |
| human-ppi | Dataset for predicting human protein-protein interactions. | BC | PPI prediction | No | Yes |
| human-ppi-pinui | Human protein-protein interaction dataset from PiNUI. | BC | PPI prediction | No | Yes |
| yeast-ppi-pinui | Yeast protein-protein interaction dataset from PiNUI. | BC | PPI prediction | No | Yes |
| peptide-HLA-MHC-affinity | Dataset for predicting peptide binding affinity to HLA/MHC complexes. | BC | Binding affinity prediction | No | Yes |
| gold-ppi | Gold standard dataset for protein-protein interaction prediction. | BC | PPI prediction | No | Yes |
| shs27-ppi | SHS27k dataset containing 27,000 protein-protein interactions. | MCC | PPI type prediction | No | Yes |
| shs148-ppi | SHS148k dataset containing 148,000 protein-protein interactions. | MCC | PPI type prediction | No | Yes |
| PPA-ppi | Protein-Protein Affinity dataset for quantitative binding predictions. | R | Protein-protein affinity prediction | No | Yes |

For more details about supported models and datasets, including programmatic access and command-line utilities, see the Resource Listing Documentation.
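
The four task types in the legend (BC, MCC, MLC, R) differ mainly in how a model's raw outputs become a prediction. The sketch below illustrates the usual conventions for each type. The helper `decode_prediction` and the 0.5 threshold are illustrative assumptions, not part of Protify's API.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_prediction(raw, task, threshold=0.5):
    """Turn the raw outputs for one example into a prediction.

    `task` is one of the abbreviations from the legend: BC, MCC, MLC, R.
    Hypothetical helper for illustration only.
    """
    if task == "BC":
        # One logit: a single 0/1 label.
        return int(sigmoid(raw[0]) >= threshold)
    if task == "MCC":
        # One logit per class: exactly one class wins.
        return max(range(len(raw)), key=lambda i: raw[i])
    if task == "MLC":
        # One logit per label: each label is decided independently.
        return [int(sigmoid(x) >= threshold) for x in raw]
    if task == "R":
        # Regression: the output is the predicted value itself.
        return raw[0]
    raise ValueError(f"unknown task type: {task}")


print(decode_prediction([1.2], "BC"))              # 1
print(decode_prediction([0.1, 2.3, -0.4], "MCC"))  # 1
print(decode_prediction([2.0, -1.0, 0.3], "MLC"))  # [1, 0, 1]
print(decode_prediction([37.5], "R"))              # 37.5
```

For datasets marked as tokenwise, the same decoding would apply once per residue rather than once per protein.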

Current Key Features

  • Multiple interfaces: Run experiments via an intuitive GUI, CLI, or prepared YAML files
  • Efficient embeddings: Leverage fast and efficient embeddings from ESM2 and ESMC via FastPLMs
    • Coming soon: Additional protein, SMILES, SELFIES, codon, and nucleotide language models
  • Flexible model probing: Use efficient MLPs for sequence-wise tasks or transformer probes for token-wise tasks
    • Coming soon: Full model fine-tuning, hybrid probing, and LoRA
  • Automated model selection: Find optimal scikit-learn models for your data with LazyPredict, enhanced by automatic hyperparameter optimization
    • Coming soon: GPU acceleration
  • Complete reproducibility: Every session generates a detailed log that can be used to reproduce your entire workflow
  • Publication-ready visualizations: Generate cross-model and dataset comparisons with radar and bar plots, embedding analysis with PCA, t-SNE, and UMAP, and statistically sound confidence interval plots
  • Extensive dataset support: Access 25 protein datasets by default, or easily integrate your own local or private datasets
    • Coming soon: Additional protein, SMILES, SELFIES, codon, and nucleotide property datasets
  • Advanced interaction modeling: Support for protein-protein interaction datasets
    • Coming soon: Protein-small molecule interaction capabilities
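
The split between sequence-wise and token-wise tasks in the feature list comes down to the shape of the embedding a probe receives: one vector per protein, or one vector per residue. Here is a minimal sketch, assuming mean pooling over residues. Protify's own pooling may differ, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_embeddings):
    """Average per-residue vectors into one fixed-length sequence vector."""
    if not token_embeddings:
        raise ValueError("need at least one residue embedding")
    length = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(length)]


# Toy embedding for a 3-residue protein with hidden size 2 (made-up numbers).
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]

# Sequence-wise task: the probe sees a single pooled vector.
sequence_vector = mean_pool(tokens)
print(sequence_vector)  # [2.0, 2.0]

# Token-wise task: the probe sees every residue vector, one label each.
print(len(tokens))  # 3
```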

Support Protify's Development

Help us grow by sharing online, starring our repository, or contributing through our bounty program.

(back to top)

Getting Started

Installation

From pip

pip install Protify

To get started locally

git clone https://github.com/Synthyra/Protify.git
cd Protify
python -m pip install -r requirements.txt
git submodule update --init --remote --recursive
cd src/protify

If you would like a Python virtual environment with the requirements

chmod +x setup_bioenv.sh
./setup_bioenv.sh
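
After either install route, it can help to confirm that the main libraries from the Built With list are importable in the active environment. The snippet below is a small sketch using only the standard library. The import names are the usual ones for each package, PAUC is left out because its import name is not given here, and `check_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Usual top-level import names for libraries in the "Built With" list.
REQUIRED = [
    "torch",         # PyTorch
    "transformers",  # Transformers
    "datasets",      # Datasets
    "peft",          # PEFT
    "sklearn",       # scikit-learn
    "numpy",         # NumPy
    "scipy",         # SciPy
    "einops",        # Einops
    "lazypredict",   # LazyPredict
]


def check_modules(names):
    """Map each top-level module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


for name, found in check_modules(REQUIRED).items():
    print(f"{name:<14}{'ok' if found else 'MISSING'}")
```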

(back to top)

Usage

To launch the GUI, run

python -m gui

We recommend keeping a terminal open alongside the user interface, as helpful messages and progress bars appear in the terminal while you press the GUI buttons.

An example workflow

Here, we will compare various protein models against a random vector baseline (negative control) and a random transformer (homology-based control).

1.) Start the session

2.) Select the models you would like to benchmark

3.) Select the datasets you are interested in. Here we chose Enzyme Commission numbers (multi-label classification), metal-ion binding (binary classification), solubility (DeepLoc-2, binary classification), and catalytic rate (kcat, regression).

4.) Embed the proteins in the selected datasets. If your machine does not have a GPU, you can download precomputed embeddings for many common sequences. Note: If you download embeddings, it will be faster to use the scikit model tab than the probe tab.

5.) Select which probe and configuration you would like. Here, we will use a simple linear probe, a type of neural network. It is the fastest option by a large margin, and usually the worst performing by only a small margin.

6.) Select your settings for training. Like most of the tabs, the defaults are pretty good. If you need to know what a setting does, the ? button provides a helpful note. The documentation has more extensive information.

This will train your models!
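
To make the idea of a linear probe concrete: the embeddings stay frozen, and only a single linear layer is fit on top of them. The sketch below trains such a probe with plain gradient descent on synthetic vectors. It is pure Python for illustration only and is not Protify's implementation. The data, the helper names, and the hyperparameters are all assumptions.

```python
import math
import random


def train_linear_probe(X, y, epochs=200, lr=0.1):
    """Fit a logistic-regression probe (weights w, bias b) by batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is non-negative, else 0."""
    return int(sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0)


def toy_embeddings(label, n, rng, dim=4):
    """Synthetic stand-ins for frozen embeddings: Gaussian clouds at -1.5 or +1.5."""
    center = 1.5 if label == 1 else -1.5
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
X_train = toy_embeddings(0, 100, rng) + toy_embeddings(1, 100, rng)
y_train = [0] * 100 + [1] * 100
X_test = toy_embeddings(0, 50, rng) + toy_embeddings(1, 50, rng)
y_test = [0] * 50 + [1] * 50

w, b = train_linear_probe(X_train, y_train)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X_test, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because only `dim + 1` parameters are trained, fitting is fast, which matches the speed advantage described in step 5.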

7.) After training, you can render helpful visualizations by passing the log ID from before. If you forget it, you can look for the file generated in the logs folder.

Here's a sample of the many plots produced. You can find them all inside plots/your_log_id/*
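
Confidence intervals like those in the plots can be estimated in several ways, and the method Protify uses is not stated here. As one common option, the sketch below computes a percentile bootstrap interval for the mean of repeated scores. The scores are made-up toy values and `bootstrap_ci` is a hypothetical helper.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[max(0, int((1 - alpha / 2) * n_resamples) - 1)]
    return lower, upper


# Toy accuracies from five hypothetical runs of the same model.
scores = [0.81, 0.79, 0.84, 0.80, 0.82]
low, high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}  95% CI=({low:.3f}, {high:.3f})")
```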

8.) Need to replicate your findings for a report or paper? Just input the generated log into the replay tab.

To run the same session from the command line instead, you would simply execute

python -m main --model_names ESM2-8 ESM2-35 ESMC-300 Random Random-Transformer --data_names EC DeepLoc-2 enzyme-kcat --patience 3
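
When sweeping over many model and dataset combinations, the same command can be assembled from Python lists instead of typed by hand. This sketch uses only the flags shown above (`--model_names`, `--data_names`, `--patience`). The helper `build_command` is hypothetical, and the commented-out launch assumes you run from `src/protify` as in the install steps.

```python
import shlex


def build_command(model_names, data_names, patience):
    """Assemble the CLI call as an argument list suitable for subprocess.run."""
    return [
        "python", "-m", "main",
        "--model_names", *model_names,
        "--data_names", *data_names,
        "--patience", str(patience),
    ]


cmd = build_command(
    model_names=["ESM2-8", "ESM2-35", "ESMC-300", "Random", "Random-Transformer"],
    data_names=["EC", "DeepLoc-2", "enzyme-kcat"],
    patience=3,
)
print(shlex.join(cmd))

# To actually launch the run (assumes the working directory is src/protify):
# import subprocess
# subprocess.run(cmd, check=True)
```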

Or, set up a YAML file with your desired settings, so you don't have to type everything out in the CLI

python -m main --yaml_path yamls/your_custom_yaml_path.yaml

Replaying from the CLI is just as simple

python -m main --replay_path logs/your_log_id.txt
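
If you have forgotten the log ID, a few lines of Python can locate the newest log and print the matching replay command and plot folder. This is a sketch that assumes logs are stored as `logs/<log_id>.txt` and plots under `plots/<log_id>/`, as in the paths shown above. It also assumes the most recently modified file is the session you want. The helper names are hypothetical.

```python
from pathlib import Path


def latest_log(logs_dir="logs"):
    """Return the most recently modified .txt log in `logs_dir`, or None."""
    logs = sorted(Path(logs_dir).glob("*.txt"), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None


def replay_command(log_path):
    """Build the replay call as an argument list for subprocess.run."""
    return ["python", "-m", "main", "--replay_path", str(log_path)]


log = latest_log()
if log is None:
    print("no logs found")
else:
    print("log ID:", log.stem)
    print("plots: ", Path("plots") / log.stem)
    print("replay:", " ".join(replay_command(log)))
```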

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

We work with a bounty system. You can find bounties on this page. Completing bounties will get you listed in the Protify consortium and may earn you coauthorship on published papers involving the framework.

To claim a bounty, simply open a pull request with the bounty ID in the title. For additional features not on the bounty list, simply use a descriptive title.

For bugs and general suggestions, please use GitHub issues.

(back to top)

Built With

  • PyTorch
  • Transformers
  • Datasets
  • PEFT
  • scikit-learn
  • NumPy
  • SciPy
  • Einops
  • PAUC
  • LazyPredict

(back to top)

License

Distributed under the Protify License. See LICENSE.md for more information.

(back to top)

Contact

Email: [email protected]
Website: https://synthyra.com

(back to top)

Cite

If you use this package, please cite the following papers. (Coming soon)