gguf_offload is a lightweight framework that combines GGUF quantization with asynchronous offloading for efficient LLM inference. By leveraging lazy loading from GGUF files, on-GPU dequantization, and pipelining, it minimizes GPU memory usage, making it possible to run large models with limited GPU resources.
LLM inference with CPU offloading is often bottlenecked by two major factors:
- PCIe Bandwidth: Transferring large amounts of data (such as fully dequantized model weights) between the host and the GPU can saturate the PCIe bus, creating a significant performance bottleneck.
- GPU Memory: Fully loading and storing dequantized model parameters on the GPU may exceed available memory, especially for large language models.
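For a sense of scale: a 7B-parameter model holds roughly 14 GB of fp16 weights, so streaming them over a PCIe 4.0 x16 link (about 25 GB/s in practice) costs on the order of half a second per full pass, while the same weights at ~4-bit quantization occupy under 4 GB. (These figures are illustrative back-of-the-envelope numbers, not measurements from this project.)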
By performing dequantization directly on the GPU, our approach mitigates these bottlenecks. Here's how:
- Reduced Data Transfer: The model is stored in a quantized format that takes up far less space, so only the compact quantized weights cross the PCIe bus.
- On-GPU Processing: Once on the GPU, dequantization runs in parallel, expanding the quantized data into the required full-precision format. Large dequantized weights therefore never need to be transferred between the host and the GPU.
- Improved Throughput: Asynchronous offloading combined with pipelining lets data transfers overlap with GPU computation, further hiding the latency of PCIe transfers (see the sketch below).
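The snippet below sketches this overlap with two CUDA streams in PyTorch: while the current layer is dequantized and computed on the default stream, the next layer's quantized bytes are copied on a separate stream. The names `quant_layers`, `dequantize`, and `layer_forward` are illustrative stand-ins, not gguf_offload's actual API.

```python
import torch

def pipelined_forward(quant_layers, x, dequantize, layer_forward):
    """Double-buffered transfer/compute overlap (illustrative sketch).

    quant_layers: list of pinned-memory uint8 CPU tensors, one per layer,
    holding raw quantized weights. dequantize() and layer_forward() are
    hypothetical hooks standing in for the real kernels.
    """
    copy_stream = torch.cuda.Stream()
    compute = torch.cuda.current_stream()

    # Prefetch the first layer's quantized bytes.
    with torch.cuda.stream(copy_stream):
        cur = quant_layers[0].to("cuda", non_blocking=True)

    for i in range(len(quant_layers)):
        compute.wait_stream(copy_stream)  # bytes for layer i have arrived
        cur.record_stream(compute)        # keep the buffer alive for compute
        w = dequantize(cur)               # on-GPU dequantization
        x = layer_forward(x, w)           # layer compute overlaps the next copy
        if i + 1 < len(quant_layers):
            with torch.cuda.stream(copy_stream):
                # Start copying layer i+1 while layer i computes.
                cur = quant_layers[i + 1].to("cuda", non_blocking=True)
    return x
```

Pinned (page-locked) host memory is what lets the `non_blocking=True` copies actually run asynchronously; with pageable memory the transfers would serialize with compute.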
To run the Qwen2 model with minimal GPU memory usage, simply execute:

```bash
python lazy_deq.py
```
This script demonstrates:
- Lazy Loading: Reads model weights from the GGUF file only as they are needed.
- GPU Dequantization: Converts the quantized weights to full precision on the GPU (see the sketch after this list).
- Pipelining: Overlaps asynchronous transfers with GPU compute to balance memory use and throughput.
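As a rough illustration of the first two points, the sketch below memory-maps a GGUF file with the `gguf` Python package (published alongside llama.cpp), so tensor bytes are only read from disk when touched, and dequantizes Q8_0 data on the GPU. The file path, tensor name, and Q8_0-only handling are assumptions for the example; the actual implementation in lazy_deq.py may differ.

```python
import torch
from gguf import GGUFReader, GGMLQuantizationType

# Memory-mapping means no tensor data is read from disk until accessed,
# which is what makes per-layer lazy loading cheap.
reader = GGUFReader("qwen2-0_5b-instruct-q8_0.gguf")  # hypothetical path

def load_layer_bytes(name: str) -> torch.Tensor:
    """Copy one tensor's raw quantized bytes into pinned CPU memory."""
    t = next(t for t in reader.tensors if t.name == name)
    assert t.tensor_type == GGMLQuantizationType.Q8_0  # sketch handles Q8_0 only
    # torch.tensor copies out of the read-only mmap; pin for async H2D copies.
    return torch.tensor(t.data).flatten().pin_memory()

def dequantize_q8_0(raw_gpu: torch.Tensor) -> torch.Tensor:
    """Dequantize GGUF Q8_0 bytes already resident on the GPU.

    Q8_0 packs blocks of 34 bytes: one float16 scale followed by 32 int8
    weights; each weight is scale * int8 value. The caller reshapes the
    flat result to the tensor's logical shape.
    """
    blocks = raw_gpu.reshape(-1, 34)
    scale = blocks[:, :2].view(torch.float16)              # (n_blocks, 1)
    q = blocks[:, 2:].view(torch.int8).to(torch.float16)   # (n_blocks, 32)
    return (scale * q).flatten()

# Usage: stream the quantized bytes to the GPU, then dequantize there.
raw = load_layer_bytes("blk.0.attn_q.weight")  # hypothetical tensor name
w = dequantize_q8_0(raw.to("cuda", non_blocking=True))
```

Note that only the 34-bytes-per-32-weights quantized blocks cross the PCIe bus; the fp16 expansion happens entirely on the device.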
All benchmarks below use batch_size=1 and seq_len=512, measuring the prefill stage.
For Qwen2-0.5B:
- Full GPU inference: 0.030s (2348 MiB)
- gguf_offload: 0.075s (1086 MiB)
- Sequential offload using `accelerate`: 0.293s
For DeepSeek-R1 671B (1.58-bit quantization):
- gguf_offload: ~10s (7836 MiB)
TODO:
- Reduce surge GPU memory usage
- Support end-to-end generation
Happy Inference!