ConvNova

[ICLR2025] ConvNova 🧬 Revisiting Convolution Architecture in the Realm of DNA Foundation Models

OpenReview | arXiv | GitHub | HuggingFace 🤗 (coming soon)

ConvNova demonstrates that, if carefully designed, a pure CNN can serve as a DNA foundation model that surpasses Transformer and SSM-inspired architectures, while retaining the classic convolutional advantages of stronger locality bias, lower memory footprint, and markedly faster training and inference.


🚩 Plan

  • Scripts for Pretraining, NT & Genomic Benchmarks.
  • Paper Released.
  • Pretrained Weights of ConvNova.
  • Source Code and Pretrained Weights on transformers.
  • Scripts for DeepSEA & Bend-gene-finding.

1 Quick start

Clone the repo.

  git clone git@github.com:aim-uofa/ConvNova.git
  cd ConvNova/convnova

Prepare conda env.

  conda create -n convnova python==3.10
  conda activate convnova
  pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 
  pip install -r requirements.txt --no-deps
  pip install pytorch-lightning==1.8.6 --no-deps
  pip install packaging --no-deps

  pip install lightning_utilities --no-deps
  pip install torchmetrics
  pip install tensorboardX
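
Optionally, run a quick sanity check before moving on. This is a minimal sketch, not part of the repo; the expected versions simply mirror the pip commands above, and CUDA availability depends on your local driver setup.

  # Minimal environment check (illustrative, not part of the repo)
  import torch
  import pytorch_lightning as pl

  print("torch:", torch.__version__)            # expected: 2.1.0
  print("pytorch-lightning:", pl.__version__)   # expected: 1.8.6
  print("CUDA available:", torch.cuda.is_available())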

Download the pretraining data.

  mkdir data
  mkdir -p data/hg38/
  curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
  gunzip data/hg38/hg38.ml.fa.gz  # unzip the fasta file
  curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed

See the Nucleotide Transformer and Genomic Benchmarks papers for how to download and process the NT benchmark and Genomic Benchmarks datasets.
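
For the Genomic Benchmarks data, the genomic-benchmarks Python package can fetch individual tasks directly. The sketch below only uses that package's public API and its default cache location; the task name is just an example, and the downloaded files still need to be arranged into the layout shown below.

  # Sketch: fetch one Genomic Benchmarks task with the genomic-benchmarks package
  # (pip install genomic-benchmarks)
  from genomic_benchmarks.loc2seq import download_dataset
  from genomic_benchmarks.data_check import info

  # Downloads into the package's default cache (~/.genomic_benchmarks);
  # move or symlink the result under data/genomic_benchmark/ afterwards.
  download_dataset("dummy_mouse_enhancers_ensembl", version=0)
  info("dummy_mouse_enhancers_ensembl", version=0)  # prints class counts and sequence lengths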

The final file structure (data directory) should look like

  |____bert_hg38
  | |____hg38.ml.fa
  | |____hg38.ml.fa.fai
  | |____human-sequences.bed
  |____nucleotide_transformer
  | |____H3K36me3
  | |____......
  |____genomic_benchmark
  | |____dummy_mouse_enhancers_ensembl
  | |____....
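
Note that hg38.ml.fa.fai (the FASTA index listed above) is not produced by the download commands. Assuming pyfaidx is available in the environment (check requirements.txt; samtools faidx is an equivalent alternative), it can be generated with:

  # Sketch: build the .fai index for the reference FASTA (assumes pyfaidx is installed)
  from pyfaidx import Fasta

  # Opening the FASTA with pyfaidx writes hg38.ml.fa.fai alongside it if missing.
  Fasta("data/hg38/hg38.ml.fa")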

2 Using ConvNova with 🤗 Transformers

Coming Soon


3 Reproducing the paper

3.1 Pre-training on the Human Reference Genome

  python train.py experiment='hg38-pretrain/convnova'

You can adjust hyperparameters from the command line as in the example below; detailed hyperparameter settings can be found in configs/experiment/xxx/xxx.yaml.

  python train.py experiment='hg38-pretrain/convnova' wandb=null trainer.devices=4

3.2 Genomic Benchmarks (short-range)

GenomicBenchmarks provides 8 binary- and multi-class tasks packaged as a Python library.

Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).

  python train.py experiment='genomic-benchmark/convnova' with-some-arguments

3.3 Nucleotide Transformer Benchmark

Datasets are hosted on the Hub as InstaDeepAI/nucleotide_transformer_downstream_tasks.

Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).

  python train.py experiment='nt-benchmark/convnova' with-some-arguments
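
To preview one task from the Hub dataset mentioned above, here is a sketch using the 🤗 datasets library; the config name H3K36me3 and the field layout follow the dataset card, so verify them against your installed datasets version.

  # Sketch: inspect one Nucleotide Transformer downstream task from the Hugging Face Hub
  # (pip install datasets). Depending on your datasets version, the repo may require
  # trust_remote_code=True because older revisions shipped a loading script.
  from datasets import load_dataset

  ds = load_dataset("InstaDeepAI/nucleotide_transformer_downstream_tasks", "H3K36me3")
  print(ds)              # DatasetDict with train/test splits
  print(ds["train"][0])  # a DNA sequence string plus an integer label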

4 Citation

  @inproceedings{bo2025convnova,
    title     = {Revisiting Convolution Architecture in the Realm of DNA Foundation Models},
    author    = {Yu Bo and Weian Mao and Yanjun Shao and Weiqiang Bai and Peng Ye
                 and Xinzhu Ma and Junbo Zhao and Hao Chen and Chunhua Shen},
    booktitle = {International Conference on Learning Representations (ICLR)},
    year      = {2025}
  }

5 Acknowledgements

ConvNova builds on the training, logging and data-loading scaffolds of HyenaDNA and Caduceus, and evaluates on Genomic Benchmarks, Nucleotide Transformer tasks, and the Long-Range Benchmark. We thank the maintainers of these open resources for making rigorous comparison possible.
