DFloat11 is a lossless compression framework that reduces the size of Large Language Models (LLMs) by approximately 30% while preserving bit-for-bit identical outputs to the original model. It enables efficient GPU inference on resource-constrained hardware without sacrificing accuracy.
Requires a CUDA-compatible GPU and a working PyTorch installation.
```bash
pip install dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install dfloat11[cuda11]
```
- 📉 Significant size reduction: Compresses LLM weights by ~30%, losslessly.
- ✅ Zero loss in accuracy: Produces bit-for-bit identical outputs to the original BFloat16 model.
- 🧩 Easy to use: Integrates seamlessly with the Hugging Face Transformers framework.
- ⚡ High throughput: Enables up to 38.8× faster generation compared to CPU offloading alternatives.
- 🧠 Supports longer inputs: Extends maximum context length by up to 13.17× under the same GPU memory budget.
👉 Explore pre-compressed DFloat11 models ready to use on HuggingFace: https://huggingface.co/DFloat11
📂 Official Code Repository: https://github.com/LeanModels/DFloat11
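Where does the ~30% saving come from? In a trained BFloat16 checkpoint the 8 exponent bits are far from uniformly distributed, so they can be entropy-coded into dynamic-length codes while the sign and mantissa bits are kept verbatim. Below is a toy sketch of that observation; a random tensor stands in for real weights, and this is an illustration of the idea, not the DFloat11 encoder itself:

```python
import torch

# Stand-in for trained LLM weights; real checkpoints show an even stronger skew.
w = torch.randn(1_000_000).to(torch.bfloat16)

# Reinterpret the 16-bit patterns: bit 15 = sign, bits 14-7 = exponent, bits 6-0 = mantissa.
bits = w.view(torch.int16).to(torch.int32) & 0xFFFF
exponents = (bits >> 7) & 0xFF

# Empirical entropy of the exponent distribution, in bits per weight.
counts = torch.bincount(exponents, minlength=256).float()
probs = counts[counts > 0] / counts.sum()
entropy = -(probs * probs.log2()).sum().item()

# Sign (1 bit) and mantissa (7 bits) are stored as-is; the exponent shrinks
# from 8 bits to roughly its entropy, which is where the ~11-bit average
# width ("DFloat11") comes from.
print(f"exponent entropy: {entropy:.2f} bits (vs. 8 bits stored)")
print(f"approx. compressed width: {1 + 7 + entropy:.2f} bits per weight")
```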
To run inference with a DFloat11-compressed LLM, use the provided `inference.py` script:
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --df11_name_or_path DFloat11/Llama-3.1-8B-Instruct-DF11 \
    --prompt "Question: What is a binary tree and its applications? Answer:" \
    --num_tokens 512 \
    --batch_size 1
```
💡 Tip: If you specify multiple CUDA devices (e.g., `CUDA_VISIBLE_DEVICES=0,1`), the model will be automatically distributed across them using 🤗 Accelerate's `device_map="auto"`.
- `--model_name_or_path`: Hugging Face name or local path of the original BFloat16 model (e.g., `meta-llama/Llama-3.1-8B-Instruct`)
- `--df11_name_or_path`: Hugging Face name or local path of the corresponding DFloat11 model (e.g., `DFloat11/Llama-3.1-8B-Instruct-DF11`)
- `--use_bf16`: (Optional) Load the original BFloat16 model instead of the compressed one (see the verification sketch after this list)
- `--prompt`: Input prompt string for text generation
- `--num_tokens`: Number of new tokens to generate per sample
- `--batch_size`: Number of prompts to process in parallel
- `--seed`: (Optional) Random seed for reproducible results
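The `--use_bf16` flag makes the lossless claim easy to test: under greedy decoding, the original model and its DFloat11 counterpart should emit identical token IDs. Here is a minimal sketch of that check using the Python API, assuming the class accepts a Hub name for the DFloat11 model (as the `--df11_name_or_path` option does) and that both models fit in GPU memory at once:

```python
# Sketch: check bit-for-bit identical outputs between BF16 and DFloat11.
# Assumes enough GPU memory to hold both models simultaneously.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfloat11 import DFloat11ModelForCausalLM

name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("What is a binary tree?", return_tensors="pt")

bf16 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")
df11 = DFloat11ModelForCausalLM.from_pretrained(
    name, "DFloat11/Llama-3.1-8B-Instruct-DF11", device_map="auto"
)

with torch.no_grad():
    out_bf16 = bf16.generate(**inputs.to(bf16.device), max_new_tokens=64, do_sample=False)
    out_df11 = df11.generate(**inputs.to(df11.device), max_new_tokens=64, do_sample=False)

# Greedy decoding makes the comparison deterministic.
print("identical:", torch.equal(out_bf16.cpu(), out_df11.cpu()))
```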
See the Model Hub section for a list of available DFloat11 models.
The script prints:
- Generated responses
- Total decoding latency
- Tokens per second (throughput)
- GPU memory usage (allocated and peak)
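For reference, the same metrics can be measured in your own code with standard PyTorch counters. A minimal sketch, where `model`, `tokenizer`, and `prompt` are placeholders for objects you have already created:

```python
import time
import torch

# Assumes `model`, `tokenizer`, and `prompt` already exist (placeholders).
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()  # make timings accurate across async CUDA calls
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
torch.cuda.synchronize()
latency = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"decoding latency: {latency:.2f} s")
print(f"throughput: {new_tokens / latency:.1f} tokens/s")
print(f"GPU memory: {torch.cuda.memory_allocated() / 2**30:.2f} GiB allocated, "
      f"{torch.cuda.max_memory_allocated() / 2**30:.2f} GiB peak")
```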
- Download a model using the Hugging Face command line tool:
```bash
# download the DFloat11 model to a local directory
huggingface-cli download DFloat11/DeepSeek-R1-Distill-Qwen-7B-DF11 \
    --local-dir ./DeepSeek-R1-Distill-Qwen-7B-DF11
```
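If you prefer to stay in Python, `huggingface_hub` offers an equivalent download (same repo and target directory as above):

```python
# Python equivalent of the huggingface-cli download above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="DFloat11/DeepSeek-R1-Distill-Qwen-7B-DF11",
    local_dir="./DeepSeek-R1-Distill-Qwen-7B-DF11",
)
```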
- Use the model like a standard Hugging Face model:
```python
from dfloat11 import DFloat11ModelForCausalLM

model = DFloat11ModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # original BFloat16 model name
    "./DeepSeek-R1-Distill-Qwen-7B-DF11",       # local path to the DFloat11 model
    device_map="auto",
)
```
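From here, generation works exactly as with any `transformers` causal LM. A minimal sketch continuing from the `model` loaded above:

```python
import torch
from transformers import AutoTokenizer

# The tokenizer comes from the original model; the DFloat11 weights are
# decompressed on the fly during inference.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
inputs = tokenizer("Question: What is a binary tree? Answer:", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```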
This work is brought to you by the team at Rice University and xMAD.ai.
The GPU kernel was designed and implemented by Tianyi Zhang.
If you find our work useful or interesting, please consider citing our paper:
```bibtex
@misc{zhang2025dfloat11,
  title         = {70\% Size, 100\% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float},
  author        = {Tianyi Zhang and Yang Sui and Shaochen Zhong and Vipin Chaudhary and Xia Hu and Anshumali Shrivastava},
  year          = {2025},
  eprint        = {2504.11651},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2504.11651}
}
```