OpenReview | arXiv | GitHub | HuggingFace 🤗 (coming soon)
ConvNova demonstrates that, if carefully designed, a pure CNN can serve as a DNA foundation model that surpasses Transformer and SSM-inspired architectures, while retaining the classic convolutional advantages of stronger locality bias, lower memory footprint, and markedly faster training and inference.
- Scripts for Pretraining, NT & Genomic Benchmarks.
- Paper Released.
- Pretrained Weights of ConvNova.
- Source Code and Pretrained Weights on transformers.
- Scripts for DeepSEA & BEND gene-finding.
Clone the repo.
git clone git@github.com:aim-uofa/ConvNova.git
cd ConvNova/convnova
Prepare conda env.
conda create -n convnova python==3.10
conda activate convnova
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
pip install -r requirements.txt --no-deps
pip install pytorch-lightning==1.8.6 --no-deps
pip install packaging --no-deps
pip install lightning_utilities --no-deps
pip install torchmetrics
pip install tensorboardX
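To confirm the environment resolved correctly, you can run a quick import check (optional; it only assumes the torch build installed above):

```bash
# Optional sanity check: prints the torch version and whether a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```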
Download the data (pretraining).
mkdir data
mkdir -p data/hg38/
curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
gunzip data/hg38/hg38.ml.fa.gz  # unzip the fasta file
curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed
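Once the files are in place, a minimal sketch like the one below can check that the reference FASTA and the interval BED line up. It assumes pyfaidx and pandas are available in the environment, and the column layout of human-sequences.bed is an assumption based on the Basenji-style split files:

```python
# Minimal sanity check of the pretraining data (sketch; column names are assumed).
import pandas as pd
from pyfaidx import Fasta

fasta = Fasta("data/hg38/hg38.ml.fa")  # indexed access to the reference genome
bed = pd.read_csv(
    "data/hg38/human-sequences.bed",
    sep="\t",
    names=["chrom", "start", "end", "split"],  # assumed column layout
)

row = bed.iloc[0]
seq = fasta[row.chrom][int(row.start):int(row.end)]  # pull the first interval
print(row.split, row.chrom, len(seq))
```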
You can check out the Nucleotide Transformer and Genomic Benchmarks papers for how to download and process the NT Benchmark & Genomic Benchmark datasets.
The final file structure (data directory) should look like
|____bert_hg38
| |____hg38.ml.fa
| |____hg38.ml.fa.fai
| |____human-sequences.bed
|____nucleotide_transformer
| |____H3K36me3
| |____......
|____genomic_benchmark
| |____dummy_mouse_enhancers_ensembl
| |____....
Coming Soon
python train.py experiment='hg38-pretrain/convnova'
You can adjust the hyperparameters from the command line as in the example below; the detailed hyperparameter settings can be found in configs/experiment/xxx/xxx.yaml.
python train.py experiment='hg38-pretrain/convnova' wandb=null trainer.devices=4
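Any key in the experiment YAML can be overridden the same way. The key names below (trainer.max_epochs, dataset.max_length, optimizer.lr) are illustrative assumptions following the HyenaDNA/Caduceus config layout this repo builds on; check the YAML under configs/experiment/ for the exact names:

```bash
# Hypothetical overrides -- verify each key against the experiment YAML before use
python train.py experiment='hg38-pretrain/convnova' \
    wandb=null \
    trainer.devices=4 \
    trainer.max_epochs=100 \
    dataset.max_length=1024 \
    optimizer.lr=8e-4
```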
GenomicBenchmarks provides 8 binary- and multi-class tasks packaged as a Python library.
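If you only need the raw data, the genomic_benchmarks package can list the tasks and download one directly. A small sketch follows; the dataset name human_nontata_promoters is one of the published tasks, but treat the exact API as an assumption and check the library's documentation:

```python
# Sketch: inspect and download one Genomic Benchmarks task via the genomic_benchmarks package.
from genomic_benchmarks.data_check import list_datasets
from genomic_benchmarks.loc2seq import download_dataset

print(list_datasets())                                   # names of all available tasks
download_dataset("human_nontata_promoters", version=0)   # materialize the sequences on disk
```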
Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).
python train.py experiment='genomic-benchmark/convnova' with-some-arguments
Datasets are hosted on the Hub as InstaDeepAI/nucleotide_transformer_downstream_tasks.
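They can also be pulled with the HuggingFace datasets library. A minimal sketch, where the config name "H3K36me3" mirrors the directory layout above (verify the available configs on the dataset card; recent datasets versions may also require trust_remote_code=True for script-based datasets):

```python
# Sketch: load one Nucleotide Transformer downstream task from the Hub.
from datasets import load_dataset

ds = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks",
    "H3K36me3",  # task/config name, assumed from the directory layout above
)
print(ds["train"][0])  # each example contains a DNA sequence and its label
```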
python train.py experiment='nt-benchmark/convnova' with-some-arguments
@inproceedings{bo2025convnova,
  title     = {Revisiting Convolution Architecture in the Realm of DNA Foundation Models},
  author    = {Yu Bo and Weian Mao and Yanjun Shao and Weiqiang Bai and Peng Ye and Xinzhu Ma and Junbo Zhao and Hao Chen and Chunhua Shen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}
ConvNova builds on the training, logging and data-loading scaffolds of HyenaDNA and Caduceus, and evaluates on Genomic Benchmarks, Nucleotide Transformer tasks, and the Long-Range Benchmark. We thank the maintainers of these open resources for making rigorous comparison possible.