seqLens

This repository contains the code for the seqLens project, a study of how to build and optimize DNA language models.
Hugging Face repo: omicseye

Overview

Our research introduces several key innovations in DNA sequence modeling. We assembled two pre-training datasets:

  • 19,551 reference genomes, including over 18,000 prokaryote genomes (over 115B nucleotides)
  • A more balanced dataset composed of 1,355 prokaryote and eukaryote reference genomes (over 180B nucleotides)

We trained five different byte-pair encoding tokenizers and pre-trained 52 DNA language models. The resulting seqLens models use disentangled attention with relative positional encoding and outperform state-of-the-art models on 13 of 19 benchmarking tasks.
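
To make the tokenization step concrete, the sketch below shows the general idea of byte-pair encoding applied to DNA: start from single nucleotides and repeatedly merge the most frequent adjacent pair. It is a toy illustration only, not the tokenizer training code used in this project; the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, and the tie-breaking rule is an arbitrary choice made here for determinism.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of DNA strings."""
    # Start from single-nucleotide symbols.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(sequence, merges):
    """Tokenize a DNA string by applying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

In practice the number of merges controls the vocabulary size, so larger vocabularies yield longer tokens and shorter token sequences for the same stretch of DNA.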

Additionally, we explored:

  • Domain-specific pre-training
  • Token representation strategies
  • Fine-tuning methods

Our findings show that:

  • Using relevant pre-training data significantly boosts performance
  • Alternative pooling techniques can enhance classification
  • Full fine-tuning and parameter-efficient fine-tuning trade off efficiency against accuracy

These insights provide a foundation for optimizing model design and training in biological research.
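
As an illustration of what a pooling technique does, the sketch below collapses a sequence of per-token vectors into a single vector for classification in three common ways. It is written in plain Python for readability; the helper names are hypothetical, and the specific strategies shown (first-token, mean, and max pooling) are generic examples that may differ from the ones evaluated in this study.

```python
def cls_pool(hidden_states):
    """Use the vector of the first token as the sequence representation."""
    return list(hidden_states[0])


def _kept_vectors(hidden_states, attention_mask):
    """Return the token vectors whose mask entry is non-zero (i.e. not padding)."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask excludes every token")
    return kept


def mean_pool(hidden_states, attention_mask):
    """Average the non-padding token vectors, dimension by dimension."""
    kept = _kept_vectors(hidden_states, attention_mask)
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def max_pool(hidden_states, attention_mask):
    """Take the per-dimension maximum over the non-padding token vectors."""
    kept = _kept_vectors(hidden_states, attention_mask)
    dim = len(kept[0])
    return [max(vec[d] for vec in kept) for d in range(dim)]
```

Here `hidden_states` stands for one sequence's token vectors (a list of equal-length lists) and `attention_mask` marks real tokens with 1 and padding with 0. The pooled vector is what a classification head would receive.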

Features

In this repository, we include code for the following tasks:

Benchmarking Results

[Figure: Benchmarking Results]

The following visualization shows how the choice of fine-tuning method affects the models' vector representations:

[Figure: Fine-tuning Effect]

Usage

You can use the pre-trained models in your research, or use the provided scripts to train your own models.

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
model = AutoModelForMaskedLM.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
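
Once a checkpoint is loaded, one way to try it out is to mask a position and ask the model for its most likely tokens. The function below is a minimal sketch of that workflow, not a script from this repository. The name `predict_masked_token` and its parameters are hypothetical, and it assumes the tokenizer defines a mask token and accepts a raw DNA string with the mask token spliced in.

```python
def predict_masked_token(left, right,
                         model_name="omicseye/seqLens_4096_512_89M-at-base",
                         top_k=5):
    """Return the top_k (token, probability) guesses for one masked position."""
    # Imported lazily so that defining this function does not require the libraries.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    # Assumption: the tokenizer defines a mask token, as masked LMs usually do.
    if tokenizer.mask_token is None:
        raise ValueError("this tokenizer has no mask token")
    text = left + tokenizer.mask_token + right
    inputs = tokenizer(text, return_tensors="pt")

    # Locate the masked position in the tokenized input.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    positions = is_mask.nonzero(as_tuple=True)[0]
    if positions.numel() == 0:
        raise ValueError("mask token was not found after tokenization")
    mask_index = int(positions[0])

    # Score every vocabulary entry at the masked position.
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_index], dim=-1)
    top = torch.topk(probs, k=top_k)

    tokens = tokenizer.convert_ids_to_tokens(top.indices.tolist())
    return list(zip(tokens, top.values.tolist()))
```

A call such as `predict_masked_token("ATGGCCATTGTA", "GGCCGCTGAAAG")` downloads the checkpoint on first use and returns a list of `(token, probability)` pairs. The flanking sequences here are arbitrary placeholders.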

Citation

If you use seqLens in your research, please cite our work:

@article {seqLens,
	author = {Baghbanzadeh, Mahdi and Mann, Brendan and Crandall, Keith A and Rahnavard, Ali},
	title = {seqLens: optimizing language models for genomic predictions},
	elocation-id = {2025.03.12.642848},
	year = {2025},
	doi = {10.1101/2025.03.12.642848},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848},
	eprint = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848.full.pdf},
	journal = {bioRxiv}
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) License.
Commercial use of this software or any related models may require a separate licensing agreement due to a pending patent.
For commercial inquiries, please contact Ali Rahnavard at [email protected].

Contact

For any questions, feel free to email or open an issue in this repository.