Commit 1ede2a4 by Zhen Wang: init commit (0 parents; 56 files changed, +936970 -0 lines)

README.md (+116)
# De novo design and optimization of aptamers with AptaDiff

## Abstract

Aptamers are single-stranded nucleic acid ligands renowned for their high affinity and specificity toward target molecules. Traditionally, they are identified from large DNA/RNA libraries by in vitro methods such as Systematic Evolution of Ligands by Exponential Enrichment (SELEX). However, these libraries capture only a small fraction of the theoretical sequence space, and the pool of aptamer candidates is further constrained by the sequencing capacity of the experiment. To address this, we propose AptaDiff, the first in silico aptamer design and optimization method based on a diffusion model. AptaDiff can generate aptamers beyond the constraints of high-throughput sequencing data by leveraging motif-dependent latent embeddings from a variational autoencoder, and can optimize aptamers via affinity-guided generation driven by Bayesian optimization. Comparative evaluations on four high-throughput screening datasets targeting distinct proteins show that AptaDiff surpasses existing aptamer generation methods in quality and fidelity. Moreover, our de novo designed aptamers exhibit higher binding affinity than the top SELEX-screened experimental candidates for two target proteins. These promising results demonstrate that AptaDiff can significantly expedite the discovery of superior aptamers.
## Tested environment

* Ubuntu == 20.04
* python == 3.8
* pytorch == 1.9.1
* cuda == 11.1

## Quick Start
### Train VAE model

The first stage trains a VAE to learn a low-dimensional, motif-dependent aptamer representation.

```
python vae/scripts/real.py data/raw_data/datasetA_IGFBP3_P6.csv \
    0.001
```

The trained VAE model is saved to `vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl`.
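Before training, each aptamer sequence has to be turned into a numeric tensor. A minimal sketch of the usual one-hot encoding for DNA (this helper is illustrative and not part of the AptaDiff codebase):

```python
# Illustrative sketch: turn a DNA string into the L x 4 one-hot matrix a
# sequence VAE typically consumes. Not the repo's own preprocessing code.
BASES = "ACGT"

def one_hot(seq: str) -> list:
    """Map a DNA string to an L x 4 one-hot matrix (columns A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

matrix = one_hot("CGAC")
# Each row contains exactly one 1, marking the base at that position.
```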
### Encode sequences to obtain latent representations

To embed the sequences, use `encoder.py`, which takes the sequences and the trained model as input and outputs each sequence's representation vector. The VAE encodes a sequence into the latent space as a distribution; the output representation vector is the center of that distribution.

Run:

```
python vae/scripts/encoder.py data/raw_data/datasetA_IGFBP3_P6.csv \
    vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl
```

This outputs the sequences' representation vectors in the following format:

```csv
index,seq,dim1,dim2
0,CGACATGGGCCGCCCAAGGA,0.56,0.38
1,GCGTACCGTAAATCTGTCGG,0.18,0.34
...
```

The default saving path is `vae/out/encode/embed_datasetA.csv`.
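For downstream use, the embedding CSV can be loaded into (sequence, latent vector) pairs. A minimal sketch following the column layout shown above (the `load_embeddings` helper is an illustration, not part of the repository):

```python
# Illustrative sketch: parse the embedding CSV shown above into a dict of
# sequence -> 2-D latent coordinate. Columns follow the README example.
import csv
import io

SAMPLE = """index,seq,dim1,dim2
0,CGACATGGGCCGCCCAAGGA,0.56,0.38
1,GCGTACCGTAAATCTGTCGG,0.18,0.34
"""

def load_embeddings(fh) -> dict:
    """Map each sequence to its latent coordinate (dim1, dim2)."""
    reader = csv.DictReader(fh)
    return {row["seq"]: (float(row["dim1"]), float(row["dim2"])) for row in reader}

embeddings = load_embeddings(io.StringIO(SAMPLE))
# embeddings["CGACATGGGCCGCCCAAGGA"] == (0.56, 0.38)
```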
### Train Diffusion model

Convert `vae/out/encode/embed_datasetA.csv` into the input format of the diffusion model. The default path is `data/diffusion_data/datasetA_IGFBP3_x_z.csv`.

Run:

```
python diffusion/train.py --data_path data/diffusion_data/datasetA_IGFBP3_x_z.csv \
    --dataset datasetA \
    --batch_size 32 \
    --update_freq 1 \
    --lr 0.0001 \
    --epochs 1000 \
    --eval_every 2 \
    --check_every 20 \
    --diffusion_steps 1000 \
    --transformer_dim 512 \
    --transformer_heads 16 \
    --transformer_depth 12 \
    --transformer_blocks 1 \
    --transformer_local_heads 8 \
    --transformer_local_size 1 \
    --gamma 0.99 \
    --log_wandb True
```

The AptaDiff model is saved to `diffusion/out/datasetA/aptadiff_z/.../check/checkpoint.pt`.
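The `--diffusion_steps 1000` flag sets the length of the noising schedule. A generic sketch of how such a schedule behaves (this is the standard linear-beta DDPM formulation for illustration; AptaDiff's exact schedule may differ):

```python
# Illustrative sketch, not AptaDiff's exact schedule: with T diffusion steps
# and a linear beta schedule, alpha_bar(t) is the fraction of original signal
# surviving after t forward noising steps.
import math

def alpha_bar(t: int, T: int = 1000, beta_min: float = 1e-4, beta_max: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s = 1..t."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_min + (beta_max - beta_min) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

# alpha_bar decays monotonically from 1 toward 0 as t approaches T, so late
# timesteps are almost pure noise and the model learns to reverse each step.
```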
### Run GMM

```
python vae/scripts/gmm.py data/raw_data/datasetA_IGFBP3_P6.csv \
    vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl \
    8
```

The output file is saved in `vae/out/gmm`.
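Fitting a Gaussian mixture over the latent space yields cluster centers; one common follow-up (an assumption here, not necessarily what `gmm.py` does internally) is to select, for each center, the sequence whose embedding lies closest to it as a cluster representative:

```python
# Illustrative sketch (an assumption, not the repo's gmm.py): pick the
# sequence whose latent embedding is nearest to a given GMM cluster center.
import math

def nearest_sequence(center, embeddings):
    """embeddings: dict of seq -> (dim1, dim2); returns the closest seq."""
    return min(embeddings, key=lambda s: math.dist(center, embeddings[s]))

emb = {"CGACATGGGCCGCCCAAGGA": (0.56, 0.38),
       "GCGTACCGTAAATCTGTCGG": (0.18, 0.34)}
rep = nearest_sequence((0.2, 0.3), emb)  # -> "GCGTACCGTAAATCTGTCGG"
```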
### Sampling

```
python diffusion/eval_sample.py --check_path diffusion/out/datasetA/aptadiff_z/.../ \
    --data_path data/diffusion_data/datasetA_IGFBP3_x_z.csv \
    --eval_path vae/data/sampling_data/datasetA_IGFBP3/gmm_seq.csv \
    --samples 8 \
    --length 36
```

The sequences generated by sampling are saved in `results/datasetA_IGFBP3/samples/datasetA_gmm.txt`.
### Run BO

```
python vae/scripts/bo.py data/raw_data/datasetA_IGFBP3_P6.csv \
    vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl \
    data/spr_data/datasetA_IGFBP3_gmm_RU \
    8
```

The output file is saved in `vae/out/bo`.

Then run `diffusion/eval_sample.py` to obtain the BO-optimized sequences.
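The Bayesian-optimization step uses measured binding responses (the SPR `RU` data above) to propose promising new latent points. A deliberately simplified sketch of the acquisition idea (this stands in for, and is not, the repo's `bo.py`: a real BO would fit a Gaussian-process surrogate rather than the nearest-neighbor proxy used here):

```python
# Illustrative simplification, not the repo's bo.py: score each candidate
# latent point by the affinity measured at its nearest evaluated point plus
# an uncertainty bonus proportional to the distance to that point, then pick
# the candidate maximizing this UCB-style acquisition.
import math

def ucb_score(candidate, measured, kappa=1.0):
    """measured: dict of latent point -> measured binding response (RU)."""
    nearest = min(measured, key=lambda p: math.dist(candidate, p))
    return measured[nearest] + kappa * math.dist(candidate, nearest)

def next_point(candidates, measured, kappa=1.0):
    """Propose the candidate with the highest acquisition score."""
    return max(candidates, key=lambda c: ucb_score(c, measured, kappa))

measured = {(0.0, 0.0): 10.0, (1.0, 1.0): 5.0}
proposal = next_point([(0.1, 0.1), (2.0, 2.0)], measured)
```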
### Sampling

```
python diffusion/eval_sample.py --check_path results/datasetA_IGFBP3 \
    --data_path data/diffusion_data/datasetA_IGFBP3_x_z.csv \
    --eval_path vae/data/sampling_data/datasetA_IGFBP3/bo_seq.csv \
    --samples 8 \
    --length 36
```

The sequences generated by sampling are saved in `results/datasetA_IGFBP3/samples/datasetA_bo.txt`.

__init__.py

Whitespace-only changes.
