
seqLens

This repository contains the code for the seqLens project, a study on building DNA language models.
Pre-trained models are available on the Hugging Face Hub under the omicseye organization.

Overview

Our research introduces several key innovations in DNA sequence modeling. We gathered two different pre-training datasets consisting of:

  • 19,551 reference genomes, more than 18,000 of which are prokaryotic (over 115B nucleotides)
  • A more balanced dataset of 1,355 prokaryotic and eukaryotic reference genomes (over 180B nucleotides)

We trained five different byte-pair encoding tokenizers and pre-trained 52 DNA language models. We introduce the seqLens models, which are based on disentangled attention with relative positional encoding and outperform state-of-the-art models in 13 of 19 benchmarking tasks.

Additionally, we explored:

  • Domain-specific pre-training
  • Token representation strategies
  • Fine-tuning methods

Our findings show that:

  • Using relevant pre-training data significantly boosts performance
  • Alternative pooling techniques can enhance classification (see the pooling sketch below)
  • There is a trade-off between efficiency and accuracy when choosing between full and parameter-efficient fine-tuning

These insights provide a foundation for optimizing model design and training in biological research.
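
As an illustration of the pooling point above, one alternative to using only the first-token ([CLS]-style) embedding is to mean-pool the token embeddings before the classification head. The sketch below is a generic example of mean pooling over a padded batch, not the exact seqLens implementation:

# Generic sketch: mean pooling over token embeddings as an alternative to
# first-token ([CLS]-style) pooling for sequence classification.
# Illustrative only; not the exact seqLens implementation.
import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts  # (batch, hidden), ready for a linear classifier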

Features

In this repository, we include code for pre-training, fine-tuning, and benchmarking DNA language models.

Benchmarking Results

[Benchmarking results figure]

The following visualization shows how different fine-tuning methods affect the models' performance and their vector representations:

[Fine-tuning effect figure]
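
For context on the fine-tuning trade-off noted above, parameter-efficient fine-tuning can be sketched with LoRA adapters via the peft library. This is a generic illustration, not the exact seqLens training setup; in particular, the target module names (query_proj, value_proj) are an assumption based on DeBERTa-style disentangled attention and may differ in practice.

# Hedged sketch: parameter-efficient fine-tuning with LoRA adapters (peft).
# The target module names are assumptions and may not match the actual
# seqLens architecture; adjust them to the model's attention projections.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "omicseye/seqLens_4096_512_89M-at-base", num_labels=2
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query_proj", "value_proj"],  # assumed module names
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only adapter weights are trainable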

Usage

You can use these models directly in your research, or use the provided scripts to train your own models.

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
model = AutoModelForMaskedLM.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
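
As a minimal usage sketch (the sequence below is just an illustrative placeholder), you can tokenize a DNA string and obtain masked-language-model logits from the loaded model:

# Minimal inference sketch: tokenize an example DNA sequence and get
# masked-language-model logits from the model loaded above.
import torch

sequence = "ATGCGTACGTTAGC"  # illustrative placeholder sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)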

Citation

If you use seqLens in your research, please cite our work:

@article {seqLens,
	author = {Baghbanzadeh, Mahdi and Mann, Brendan and Crandall, Keith A and Rahnavard, Ali},
	title = {seqLens: optimizing language models for genomic predictions},
	elocation-id = {2025.03.12.642848},
	year = {2025},
	doi = {10.1101/2025.03.12.642848},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848},
	eprint = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848.full.pdf},
	journal = {bioRxiv}
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) License.
Commercial use of this software or any related models may require a separate licensing agreement due to a pending patent.
For commercial inquiries, please contact Ali Rahnavard at [email protected].

Contact

For any questions, feel free to email us or open an issue in this repository.
