Lixuan Guo*1,2,
Yifei Wang*3,
Tiansheng Wen*1,2,
Yifan Wang1,
Aosong Feng4,
Bo Chen2,
Stefanie Jegelka3,5,
Chenyu You1
1Stony Brook University 2Xidian University 3MIT CSAIL 4Yale University 5Technical University of Munich
This is the official repository for CSRv2. For implementation details and updates, please also visit Lixuan Guo’s GitHub Repository.
- 2026.01 Our paper is accepted at ICLR26!
- 2025.10 Code released! Let's explore ultra sparsity together!
In this repo, we will release (continually updated):

- Environment Dependencies
- Experiment Codes
  - Text Exp on e5-mistral-7b-instruct and Qwen3-Embedding-4B.
    - Inference
    - Data preparation & training.
  - Image Exp on ImageNet-1K.
    - Inference
    - Data preparation & training.
  - Splade Exp.
  - GraphRAG Exp.
    - Text Exp on e5-mistral-7b-instruct and Qwen3-Embedding-4B.
- Checkpoints
  - Text Exp
  - Image Exp
Create a fresh conda environment with Python >= 3.11 and install the packages listed in requirements.txt.
conda create --name csr-v2 python=3.11.13
conda activate csr-v2
pip install -r requirements.txt

You can also create the conda environment directly from environment.yml.

conda env create -f environment.yml

Inference with Hugging Face 🤗 Sentence Transformer
The Sentence Transformers v5.0.0 release adds a new module called SparseEncoder, which supports loading CSR/CSRv2 models. Our checkpoints will be released under Y-Research-Group; they can be loaded with only a few lines of code and evaluated on your own datasets.
Demo for generating embeddings based on a CSRv2/CSR model:
from sentence_transformers import SparseEncoder
model = SparseEncoder("/MODEL/NAME")
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
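As a quick sanity check (not part of the original demo), you can count how many dimensions are active per sentence. This is a minimal sketch assuming encode returned a torch tensor, sparse or dense; adapt it if you request NumPy output instead:

import torch

# Densify if encode returned a sparse COO tensor, then count non-zero entries per row.
dense = embeddings.to_dense() if embeddings.is_sparse else embeddings
active_dims = (dense != 0).sum(dim=1)
print(active_dims)  # for a CSRv2 model, each row should have at most top-k active dimensions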
Demo for evaluating a pretrained CSRv2/CSR model on MTEB benchmark datasets:

import mteb
from sentence_transformers import SparseEncoder
model = SparseEncoder(
"/MODEL/NAME",
tokenizer_kwargs={"padding_side": "left"},
)
model.prompts = {
"TASK_NAME": "PROMPT",
}
task_list = [mteb.get_task("TASK_NAME")]
evaluation = mteb.MTEB(tasks=task_list)
evaluation.run(
model,
eval_splits=["test"],
output_folder="EVAL_RESULT_PATH",
show_progress_bar=True,
encode_kwargs={"convert_to_sparse_tensor": False, "batch_size": 2}
)
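To evaluate on several MTEB tasks in one run, you can resolve a list of tasks instead of a single one. This is a minimal sketch with placeholder task names, reusing the model and settings from the demo above; it is not repo-specific code:

import mteb

# Resolve several tasks by name and evaluate them together with the same settings as above.
tasks = mteb.get_tasks(tasks=["TASK_NAME_1", "TASK_NAME_2"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="EVAL_RESULT_PATH",
    encode_kwargs={"convert_to_sparse_tensor": False, "batch_size": 2},
)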
You need to prepare data for CSRv2 training, backbone finetuning, and MRL training for e5-mistral-7b-instruct. Detailed instructions are available in data preparation instructions.

For CSRv2 training data, you need to execute get_embeddings_of_each_dataset.py and get_embeddings_for_training.py.
python get_embeddings_of_each_dataset.py \
--task_type $TASK_TYPE \
--model_name $NAME_OF_BACKBONE \
--batch_size 2 \
--save_root_path /PATH/TO/SAVE/ROOT/PATH \
--dataset_name $NAME_OF_DATASET \
--gpu 0
python get_embeddings_for_training.py \
--save_root_path /PATH/TO/SAVE/ROOT/PATH \
--task_type $TASK_TYPE \
--embedding_path /PATH/TO/SINGLE/TASK/EMBEDDING

For finetuning and MRL training, you need to execute get_dataset_for_finetuning.py, combine_datasets_in_mteb_based_on_task_type.py, and combine_datasets_in_sentence_transformers.py.
python get_dataset_for_finetuning.py \
--task_type $TASK_TYPE \
--save_root /PATH/TO/DATASET \
--max_samples_per_dataset 20000
python combine_datasets_in_mteb_based_on_task_type.py \
--task_type $TASK_TYPE \
--max_rows_per_dataset 20000 \
--mteb_dataset_path /PATH/TO/DATASET
python combine_datasets_in_sentence_transformers.py \
--max_pairs_per_dataset 20000

We have built a complete training and evaluation pipeline, so you can train and obtain evaluation results with a single command. We offer two pipelines that differ in how they evaluate: one uses the MTEB library, while the other uses our own evaluation procedure to avoid unnecessary repeated backbone embedding inference. Detailed instructions are available in training instructions and evaluation instructions.
python all_step_pipeline_mteb_evaluation.py \
--epochs 10 \
--eval_tasks $TASKS_TO_EVALUATE \
--base_model e5-mistral-7b-instruct \
--gpu 0 \
--embed_dim 4096 \
--hidden_size 16384 \
--topk 32 \
--auxk 1024 \
--auxk_coef 0.1 \
--lr 0.0001 \
--model_suffix $MODEL_SUFFIX \
--training_embedding_path /PATH/TO/EMBEDDING/FOR/TRAINING \
--packaged_model_dir /PATH/TO/PACKAGED/BACKBONE \
--use_label_CL \
--initial_topk 64

python all_step_pipeline_personalized_evaluation.py \
--epochs 10 \
--eval_tasks $TASKS_TO_EVALUATE \
--base_model e5-mistral-7b-instruct \
--gpu 0 \
--embed_dim 4096 \
--hidden_size 16384 \
--topk 32 \
--auxk 1024 \
--auxk_coef 0.1 \
--lr 0.0001 \
--eval_embedding_path /PATH/TO/EMBEDDING/FOR/EVALUATION \
--model_suffix $MODEL_SUFFIX \
--training_embedding_path /PATH/TO/EMBEDDING/FOR/TRAINING
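For intuition about the sparse head this pipeline trains, below is a minimal sketch of a CSR-style TopK sparse autoencoder forward pass; the dimensions mirror the flags above (--embed_dim 4096, --hidden_size 16384, --topk 32). This is an illustrative reimplementation, not the repo's actual model code, which additionally uses the auxiliary-k and contrastive objectives controlled by --auxk, --auxk_coef, and --use_label_CL:

import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Illustrative CSR-style head: encode, keep the top-k activations, decode."""

    def __init__(self, embed_dim=4096, hidden_size=16384, topk=32):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, hidden_size)
        self.decoder = nn.Linear(hidden_size, embed_dim)
        self.topk = topk

    def forward(self, x):
        z = torch.relu(self.encoder(x))                    # over-complete activations
        vals, idx = z.topk(self.topk, dim=-1)               # keep the k largest activations
        z_sparse = torch.zeros_like(z).scatter_(-1, idx, vals)
        return z_sparse, self.decoder(z_sparse)             # sparse code + reconstruction

model = TopKSparseAutoencoder()
emb = torch.randn(2, 4096)                                  # stand-in for backbone embeddings
code, recon = model(emb)
print((code != 0).sum(dim=-1))                              # at most 32 active dims per row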
The complete version of CSRv2 requires backbone finetuning, which can be done with topk_lora_finetuning.py. Detailed instructions are available in training instructions.

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=31233 topk_lora_finetuning.py \
--dataset /PATH/TO/DATASET \
--model_name intfloat/e5-mistral-7b-instruct \
--loss "multiple_negatives_ranking_loss" \
--dataset_suffix $SUFFIX \
--gpu 0,1 \
--batch_size 4 \
--gradient_accumulation_steps 32 \
--topk_k_list "16,32,64,128,256,512,1024,2048,2560" \
--topk_weights "1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0" \
--max_seq_length 512 \
--lora_r 8 \
--lora_alpha 16 \
--lr 1e-5 \
--topk_mode "magnitude" \
--apply_topk_to_backbone \
--load_from_disk \
--save_steps 100
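The --topk_mode "magnitude" and --topk_k_list flags suggest that, during finetuning, embeddings are truncated to their k largest-magnitude entries for several values of k at once. The sketch below shows such a magnitude top-k mask as we understand it; it is illustrative only and may differ from the repo's implementation:

import torch

def magnitude_topk(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries of each row and zero out the rest."""
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, idx, True)
    return x * mask

emb = torch.randn(4, 4096)                        # stand-in for e5-mistral-7b-instruct embeddings
for k in (16, 32, 2560):                          # subset of --topk_k_list
    print(k, int((magnitude_topk(emb, k) != 0).sum(dim=-1)[0]))  # k active dims per row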
We further evaluate on GraphRAG-Bench with the Fast GraphRAG framework. Detailed instructions can be found in the GraphRAG evaluation instructions.

python run_fast-graphrag.py \
--subset medical \
--base_dir $WORKSPACE_DIR \
--model_name gpt-4o-mini \
--embed_model_path $EMBEDDING_MODEL_PATH \
--llm_base_url $LLM_BASE_URL
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=29500 -m Evaluation.retrieval_eval \
--mode API \
--model gpt-4o-mini \
--base_url $LLM_BASE_URL \
--embedding_model $EMBEDDING_MODEL_PATH \
--data_file $PATH_TO_PREDICTIONS \
--output_file $PATH_TO_RESULT \
--detailed_output
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=29600 -m Evaluation.generation_eval \
--mode API \
--model gpt-4o-mini \
--base_url $LLM_BASE_URL \
--embedding_model $EMBEDDING_MODEL_PATH \
--data_file $PATH_TO_PREDICTIONS \
--output_file $PATH_TO_RESULT \
--detailed_output

You need to download ImageNet-1K and follow the FFCV pipeline to generate the dataset. Details are available in data preparation instructions.
python ./dataset_preparation/annotations.py --xml_dir "/path/to/train/annotation/directory" --output_file "/path/to/annotation.txt/directory"
python ./dataset_preparation/to_pytorch_style.py --split_path "/path/to/pytorch/style/dataset"
cd dataset_preparation
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/
./write_imagenet.sh "train" 500 0.50 90
./write_imagenet.sh "val" 500 0.50 90For training and evaluation simplicity, we extract embeddings before training (except backbone finetuning) and stack embeds together.
python pretrained_embeddings.py \
--train_data_ffcv /path/to/train.ffcv \
--eval_data_ffcv /path/to/val.ffcv \
--model_name "pre-trained visual backbone"
python stack_emb.py

We train CSR/CSRv2 with main_visual.py; detailed training instructions are available in training instructions.
python main_visual.py \
--pretrained_emb ./CSR-precompute-embeds/FF2048_RN50_Embeds/1K_train_ff2048.npz \
--model_name resnet50d.ra4_e3600_r224_in1k \
--epochs 10 \
--initial_topk 64 \
--topk 32 \
--auxk 1024 \
--auxk_coef 0.03125 \
--cl_coef 0.1 \
--gpu 0 \
--model_suffix demo \
--use_label_CL

Evaluation takes two steps: generating embeddings for evaluation and computing the evaluation results. Detailed instructions are available in evaluation instructions.
python chunk_npz_file.py \
--input_path "Path/to/original/embeddings" \
--output_path "Path/to/chunk/directory" \
--chunk_size "Number of samples per chunk"
python csr_inference.py \
--train_emb_path /path/to/train_emb \
--eval_emb_path /path/to/val_emb \
--model_name "pre-trained visual backbone" \
--topk 8 \
--hidden-size 8192 \
--csr_ckpt "CSR ckpt path"
python ./retrieval/faiss_nn.py --topk $TOPK
python ./retrieval/compute_metrics.py --topk $TOPK
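Conceptually, the retrieval evaluation above amounts to a faiss nearest-neighbor search over the embeddings followed by label matching. Below is a minimal sketch of such a precision@k computation; it is illustrative only, and faiss_nn.py / compute_metrics.py may differ. It assumes precomputed train/val embeddings and their ImageNet labels as NumPy arrays:

import faiss
import numpy as np

def topk_retrieval_precision(train_emb, train_labels, val_emb, val_labels, topk=8):
    # Normalize so inner product equals cosine similarity.
    train_emb = np.ascontiguousarray(train_emb.astype("float32"))
    val_emb = np.ascontiguousarray(val_emb.astype("float32"))
    faiss.normalize_L2(train_emb)
    faiss.normalize_L2(val_emb)

    index = faiss.IndexFlatIP(train_emb.shape[1])   # exact inner-product search
    index.add(train_emb)
    _, nn_idx = index.search(val_emb, topk)          # (num_val, topk) neighbor indices

    # Precision@k: fraction of retrieved neighbors sharing the query's label.
    hits = (train_labels[nn_idx] == val_labels[:, None]).mean()
    return float(hits)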
If you find this work useful, please cite the accompanying paper:

@inproceedings{guo2026csrv2,
title={{CSR}v2: Unlocking Ultra-sparse Embeddings},
author={Guo, Lixuan and Wang, Yifei and Wen, Tiansheng and Wang, Yifan and Feng, Aosong and Chen, Bo and Jegelka, Stefanie and You, Chenyu},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}

This repository was built on top of CSR and GraphRAG-Benchmark. Thanks for their amazing work!