Home

Multi-TurboQuant Wiki

Unified KV cache compression toolkit for LLM inference.

10 compression methods, 16 presets, GPU-validated, cross-platform. Compress the KV cache 5-80x to run bigger models, longer contexts, and more agents on your GPU.
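To put the 5-80x range in concrete terms, here is a rough sketch of the standard KV cache size arithmetic (one K and one V tensor per layer). The model dimensions (32 layers, 8 KV heads, head dimension 128, 32k-token context, fp16 storage) and the helper name `kv_cache_bytes` are illustrative assumptions, not values or functions from Multi-TurboQuant. The two ratios are simply the endpoints quoted above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Uncompressed KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption for this sketch).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"baseline: {baseline / GIB:.2f} GiB")

# Apply the low and high ends of the advertised compression range.
for ratio in (5, 80):
    print(f"{ratio:>2}x compression: {baseline / ratio / GIB:.2f} GiB")
```

With these assumed dimensions the uncompressed cache is 4 GiB per sequence, so the same memory budget could hold roughly 5 to 80 times as many tokens (or concurrent sequences) once compressed. Actual savings depend on the method and preset chosen.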

Getting Started

Installation — pip install and setup
Quick Start — first compression in 4 lines
Web Dashboard — browser UI, no code needed

Understanding the Methods

What Is Multi-TurboQuant — what it does, what it doesn't
How KV Cache Compression Works — the math explained simply
Compression Methods — all 10 methods compared

Configuration

Configuration — CacheConfig, CacheMethod, all options
Presets — 16 named presets for common use cases
Calibration — which methods need it, how to generate

Planning & Hardware

Capacity Planner — what fits on your GPU
Multi GPU Setup — tensor split, GPU ordering
Multi Agent Deployments — run 4-16 agents simultaneously
Hardware Detection — auto-detect NVIDIA, AMD, Metal
Platform Guide — what works on each platform

Integration

Integration with llama.cpp — flags, forks, cmake
Integration with vLLM — monkeypatch, dtypes
Using as a Library — drop into your own app

Reference

Benchmarking — run your own benchmarks
GPU Benchmark Results — real numbers from RTX 3090
Architecture Reference — package structure
API Reference — all public functions
Troubleshooting — common issues and fixes

Credits

Attribution — upstream research we built on

