
seqLens

This repository contains the code for the seqLens project, a study on building DNA language models.
Pre-trained models are available on the Hugging Face Hub under the omicseye organization.

Overview

Our research introduces several key innovations in DNA sequence modeling. We gathered two different pre-training datasets consisting of:

  • 19,551 reference genomes, more than 18,000 of which are prokaryotic (over 115B nucleotides)
  • A more balanced dataset of 1,355 prokaryotic and eukaryotic reference genomes (over 180B nucleotides)

We trained five different byte-pair encoding tokenizers and pre-trained 52 DNA language models. We introduce the seqLens models, which are based on disentangled attention with relative positional encoding and outperform state-of-the-art models in 13 of 19 benchmarking tasks.

Additionally, we explored:

  • Domain-specific pre-training
  • Token representation strategies
  • Fine-tuning methods

Our findings show that:

  • Using relevant pre-training data significantly boosts performance
  • Alternative pooling techniques can enhance classification (see the pooling sketch below)
  • There is a trade-off between efficiency and accuracy when choosing between full and parameter-efficient fine-tuning

These insights provide a foundation for optimizing model design and training in biological research.
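
As an illustration of the pooling point above, one alternative to using only the first-token ([CLS]-style) embedding is to mean-pool the token embeddings before the classification head. The sketch below is a generic example of mean pooling over a padded batch, not the exact seqLens implementation:

# Generic sketch: mean pooling over token embeddings as an alternative to
# first-token ([CLS]-style) pooling for sequence classification.
# Illustrative only; not the exact seqLens implementation.
import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts  # (batch, hidden), ready for a linear classifier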

Features

In this repository, we include code for pre-training, fine-tuning, and benchmarking DNA language models.

Benchmarking Results

[Benchmarking results figure]

The following visualization shows how different fine-tuning methods affect the models' performance and their vector representations:

[Fine-tuning effect figure]
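
For context on the fine-tuning trade-off noted above, parameter-efficient fine-tuning can be sketched with LoRA adapters via the peft library. This is a generic illustration, not the exact seqLens training setup; in particular, the target module names (query_proj, value_proj) are an assumption based on DeBERTa-style disentangled attention and may differ in practice.

# Hedged sketch: parameter-efficient fine-tuning with LoRA adapters (peft).
# The target module names are assumptions and may not match the actual
# seqLens architecture; adjust them to the model's attention projections.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "omicseye/seqLens_4096_512_89M-at-base", num_labels=2
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query_proj", "value_proj"],  # assumed module names
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only adapter weights are trainable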

Usage

You can use these models directly in your research, or use the provided scripts to train your own models.

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
model = AutoModelForMaskedLM.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
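
As a minimal usage sketch (the sequence below is just an illustrative placeholder), you can tokenize a DNA string and obtain masked-language-model logits from the loaded model:

# Minimal inference sketch: tokenize an example DNA sequence and get
# masked-language-model logits from the model loaded above.
import torch

sequence = "ATGCGTACGTTAGC"  # illustrative placeholder sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)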

Citation

If you use seqLens in your research, please cite our work:

@article {seqLens,
	author = {Baghbanzadeh, Mahdi and Mann, Brendan and Crandall, Keith A and Rahnavard, Ali},
	title = {seqLens: optimizing language models for genomic predictions},
	elocation-id = {2025.03.12.642848},
	year = {2025},
	doi = {10.1101/2025.03.12.642848},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848},
	eprint = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848.full.pdf},
	journal = {bioRxiv}
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) License.
Commercial use of this software or any related models may require a separate licensing agreement due to a pending patent.
For commercial inquiries, please contact Ali Rahnavard at [email protected].

Contact

For any questions, feel free to email us or open an issue in this repository.
