Home
rookiemann edited this page Apr 10, 2026 · 2 revisions
Unified KV cache compression toolkit for LLM inference.
10 compression methods, 16 presets, GPU-validated, cross-platform. Compress the KV cache by 5-80x to fit bigger models, longer contexts, and more agents on your GPU.
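To give a rough sense of what a 5-80x ratio means in memory terms, the sketch below estimates the size of an uncompressed KV cache with the standard formula (keys plus values, per layer, per KV head, per token). It does not use the Multi-TurboQuant API. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128, 32K context, fp16) are illustrative assumptions, not values taken from this project.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Estimate uncompressed KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical model shape, chosen only for illustration.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)

GIB = 1024 ** 3
print(f"fp16 baseline: {baseline / GIB:.2f} GiB")

# The two ends of the 5-80x range quoted above.
for ratio in (5, 80):
    print(f"{ratio}x compression: {baseline / ratio / GIB:.3f} GiB")
```

Under these assumptions the fp16 cache takes 4 GiB for a single 32K-token sequence, so the achievable ratio largely determines how many concurrent contexts or agents fit alongside the model weights.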
- Installation — pip install and setup
- Quick Start — first compression in 4 lines
- Web Dashboard — browser UI, no code needed
- What Is Multi-TurboQuant — what it does, what it doesn't
- How KV Cache Compression Works — the math explained simply
- Compression Methods — all 10 methods compared
- Configuration — CacheConfig, CacheMethod, all options
- Presets — 16 named presets for common use cases
- Calibration — which methods need it, how to generate
- Capacity Planner — what fits on your GPU
- Multi GPU Setup — tensor split, GPU ordering
- Multi Agent Deployments — run 4-16 agents simultaneously
- Hardware Detection — auto-detect NVIDIA, AMD, Metal
- Platform Guide — what works on each platform
- Integration with llama.cpp — flags, forks, cmake
- Integration with vLLM — monkeypatch, dtypes
- Using as a Library — drop into your own app
- Benchmarking — run your own benchmarks
- GPU Benchmark Results — real numbers from RTX 3090
- Architecture Reference — package structure
- API Reference — all public functions
- Troubleshooting — common issues and fixes
- Attribution — upstream research we built on