Welcome to the EPInformer framework repository! EPInformer is a scalable deep learning framework that predicts gene expression by integrating promoter-enhancer sequences with epigenomic signals. It is designed for three key applications: 1) predicting gene expression levels from promoter-enhancer sequences, epigenomic signals, and chromatin contacts; 2) identifying cell-type-specific enhancer-gene interactions and conducting in-silico perturbation; 3) predicting enhancer activity and recapitulating transcription factor binding motifs from sequences. The framework is described in the following bioRxiv preprint:
https://www.biorxiv.org/content/10.1101/2024.08.01.606099v1.
This repository can be used to run the EPInformer model to predict gene expression (e.g., CAGE and RNA-seq) and to prioritize enhancer-gene interactions for input DNA sequences and epigenomic signals (e.g., DNase, H3K27ac and Hi-C).
We also provide instructions for training different versions of EPInformer on different inputs, including DNA sequence, epigenomic signals and chromatin contacts.
EPInformer requires Python 3.6+ and the Python package PyTorch (>=2.1). You can follow PyTorch installation steps here.
EPInformer requires ABC enhancer-gene data for training and for predicting gene expression. You can obtain the ABC data from ENCODE, or run the ABC pipeline available on their GitHub to acquire cell-type-specific gene-enhancer links. For the K562 and GM12878 cell lines, you can download the EPInformer training resources from Zenodo by running the command:
sh ./download_data.sh
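For readers new to the ABC data, the sketch below illustrates how an Activity-by-Contact score is commonly defined for one gene: each candidate element's activity (the geometric mean of its DNase and H3K27ac signals) is multiplied by its contact frequency with the gene's promoter, then divided by the sum of that product over all candidate elements. The function name and dictionary keys are hypothetical and are not taken from the ABC pipeline or from EPInformer; the real pipeline also applies normalization and a distance window that are omitted here.

```python
import math


def abc_scores(elements):
    """Return one ABC score per candidate element of a single gene.

    `elements` is a list of dicts with hypothetical keys:
      'dnase', 'h3k27ac': epigenomic signal at the element
      'contact': contact frequency between the element and the promoter
    """
    # Activity is the geometric mean of the two signals; weight it by contact.
    weighted = [
        math.sqrt(e["dnase"] * e["h3k27ac"]) * e["contact"] for e in elements
    ]
    total = sum(weighted)
    if total == 0:
        # No active, contacting element: every score is zero.
        return [0.0] * len(elements)
    # Normalize so that the scores of all candidates for the gene sum to 1.
    return [w / total for w in weighted]


# Toy values for illustration only (not real data).
candidates = [
    {"dnase": 4.0, "h3k27ac": 9.0, "contact": 0.5},
    {"dnase": 1.0, "h3k27ac": 1.0, "contact": 1.0},
]
scores = abc_scores(candidates)
print(scores)
```

Because the scores are normalized per gene, they are comparable across the candidate elements of that gene.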
To try the three applications below, first run the following commands to set up the environment:
# Clone this repository
git clone https://github.com/JasonLinjc/EPInformer.git
cd EPInformer
# Create the 'EPInformer_env' conda environment
conda create --name EPInformer_env python=3.8 pandas scipy scikit-learn jupyter seaborn
source activate EPInformer_env
# Install PyTorch: run ONE of the two commands below
# GPU version of PyTorch
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
# CPU version of PyTorch
conda install pytorch cpuonly -c pytorch
# Other packages
pip install pyranges pyfaidx kipoiseq openpyxl tangermeme
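After installation, a quick sanity check can confirm that the environment is complete. The script below is a minimal sketch and not part of the repository: it assumes the import names listed in `REQUIRED` (for example, scikit-learn is imported as `sklearn` and PyTorch as `torch`), and reports which packages cannot be found and whether PyTorch can see a GPU.

```python
import importlib.util
import sys

# Import names assumed for the packages installed above.
REQUIRED = [
    "torch", "pandas", "scipy", "sklearn", "seaborn",
    "pyranges", "pyfaidx", "kipoiseq", "openpyxl", "tangermeme",
]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Python", sys.version.split()[0])
print("Missing packages:", ", ".join(missing) if missing else "none")

# Only import PyTorch if it was found, so the check itself never crashes.
if "torch" not in missing:
    import torch

    print("PyTorch", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
```

Run it inside the activated `EPInformer_env` environment; any name reported as missing can be installed with the commands above.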
An end-to-end example of predicting gene expression from promoter-enhancer sequences, epigenomic signals and chromatin contacts is in predict_gene_expression.ipynb. You can run this notebook yourself to experiment with different EPInformer variants.
We evaluated EPInformer for enhancer-gene link prediction using the K562 CRISPR and eQTL benchmark datasets from the Engreitz Lab repository (CRISPR and eQTL enrichment).
To predict cell-type-specific enhancer activity, we provide sequence-based predictors trained on H3K27ac and DNase signals in K562 and GM12878 cell lines separately. Enhancer activity was calculated using the ABC score. Additionally, Tangermeme was used to perform in-silico saturation mutagenesis (ISM) on the enhancer sequence to identify key motifs contributing to the predicted activity. The notebook (predict_enhancer_activity.ipynb) is available for experimenting with enhancer activity prediction and transcription factor motif discovery.
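To make the idea of in-silico saturation mutagenesis concrete, the sketch below substitutes every base of a short sequence with each alternative base and records how much a score changes. It is a plain-Python illustration, not the Tangermeme API and not the notebook's code: `toy_score` is a hypothetical stand-in for a trained enhancer activity predictor, and the `GATA` motif and example sequence are made up.

```python
BASES = "ACGT"


def toy_score(seq):
    """Hypothetical stand-in for a trained model: counts one made-up motif."""
    return float(seq.count("GATA"))


def saturation_mutagenesis(seq, score_fn):
    """Return {(position, alt_base): score(mutant) - score(reference)}."""
    ref_score = score_fn(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # only true substitutions are scored
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(mutant) - ref_score
    return effects


def position_importance(seq, effects):
    """Largest absolute effect of any substitution at each position."""
    return [
        max(abs(effects[(pos, alt)]) for alt in BASES if alt != seq[pos])
        for pos in range(len(seq))
    ]


sequence = "CCGATACC"  # uppercase A/C/G/T only
effects = saturation_mutagenesis(sequence, toy_score)
importance = position_importance(sequence, effects)
print(importance)
```

Positions where substitutions change the score the most mark the bases the scorer depends on; with a real model, contiguous runs of such positions are candidate transcription factor binding motifs.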
Please post in the GitHub issues or e-mail Jiecong Lin (jieconglin(at)@outlook.com) with any questions about the repository or requests for more data.



