Compression Methods
| Method | Family | Bits | Compression | Calibration | Speed Impact | Best For |
|---|---|---|---|---|---|---|
| turbo2 | TurboQuant | 2.25 | 7.1x | Required | -16% decode | Maximum VRAM savings |
| turbo3 | TurboQuant | 3.25 | 4.9x | Required | -5% decode | Balanced compression |
| turbo4 | TurboQuant | 4.25 | 3.8x | Required | -4% decode | Near-lossless quality |
| turbo2_tcq | TCQ | 2.25 | 7.1x | Required | -16% decode | Max savings + better quality |
| turbo3_tcq | TCQ | 3.25 | 4.9x | Required | -5% decode | Best quality at 5x |
| iso3 | IsoQuant | 3.25 | 4.9x | No | ~0% decode | K-only, zero speed cost |
| iso4 | IsoQuant | 4.25 | 3.8x | No | ~0% decode | Higher quality K-only |
| planar3 | PlanarQuant | 3.25 | 4.9x | No | -1% decode | Simplest, Metal support |
| planar4 | PlanarQuant | 4.25 | 3.8x | No | ~0% decode | Quality K-only |
| triattention | TriAttention | 16 | 10-16x | Required | Varies | Long reasoning, compose with above |
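The Compression column for the quantization methods appears to be the ratio of a 16-bit baseline to the Bits column. The sketch below reproduces that arithmetic for three rows. The FP16 baseline and the helper name `compression_ratio` are assumptions for illustration, not part of the library. The triattention row is left out because it lists 16 bits, so its ratio evidently does not come from bit width.

```python
# Effective bits per cached element, copied from the table above.
BITS_PER_ELEMENT = {
    "turbo2": 2.25,
    "turbo3": 3.25,
    "turbo4": 4.25,
}

FP16_BITS = 16  # assumed baseline: an uncompressed FP16 cache


def compression_ratio(bits: float, baseline_bits: float = FP16_BITS) -> float:
    """Return baseline storage divided by compressed storage per element."""
    return baseline_bits / bits


for name, bits in BITS_PER_ELEMENT.items():
    print(f"{name}: {compression_ratio(bits):.1f}x")
```

The same ratios apply to the other 2.25-, 3.25-, and 4.25-bit rows, since only the bit width enters the calculation.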
If you want zero setup: Use iso3 or planar3. No calibration files needed, no speed penalty in K-only mode.
If you want maximum quality: Use turbo4 symmetric. Near-lossless at 3.8x compression. Requires calibration.
If you want maximum VRAM savings: Use turbo2_tcq symmetric (7.1x) or combine any method with TriAttention for 40-80x total.
If you're on AMD or Mac: Use iso3 or planar3. TurboQuant requires CUDA flash attention kernels.
If you want speed: Use iso3 K-only. Benchmarks showed it can beat FP16 decode speed because the savings in memory traffic outweigh the rotation cost.
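The recommendations above can be written down as a small lookup. This is only a sketch: `recommend_method`, its goal strings, and the `has_cuda` flag are hypothetical names made up here, and the function returns one of the suggested methods where the text offers two.

```python
# Hypothetical goal names mapped to method names from the table above.
RECOMMENDATIONS = {
    "zero_setup": "iso3",         # planar3 is the listed alternative
    "max_quality": "turbo4",      # symmetric, requires calibration
    "max_savings": "turbo2_tcq",  # symmetric, requires calibration
    "speed": "iso3",              # K-only
}


def recommend_method(goal: str, has_cuda: bool = True) -> str:
    """Pick a method name for a goal, following the guidance above."""
    if goal not in RECOMMENDATIONS:
        raise ValueError(f"unknown goal: {goal!r}")
    # TurboQuant needs CUDA flash attention kernels, so fall back to iso3
    # (or planar3) on AMD and Mac.
    if not has_cuda:
        return "iso3"
    return RECOMMENDATIONS[goal]


print(recommend_method("max_quality"))
print(recommend_method("max_quality", has_cuda=False))
```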
You can use different methods for K and V caches. This is useful because:
- K cache benefits more from compression (attention score computation is bandwidth-bound)
- V cache quality matters more for output quality (weighted sum of values)
Common asymmetric configs:
# K compressed, V full precision -- zero speed cost
CacheConfig(k_method=CacheMethod.ISO3, v_method=CacheMethod.FP16)
# K at higher compression, V at lower
CacheConfig(k_method=CacheMethod.TURBO3, v_method=CacheMethod.TURBO4)
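To get a feel for what an asymmetric config saves, the sketch below estimates cache size from the Bits column. It assumes the listed bits per element already include any per-block overhead, and the model shape (32 layers, 8 KV heads, head dimension 128, 32768 tokens) is a made-up example rather than a measured configuration. `kv_cache_bytes` is a hypothetical helper, not a library function.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, k_bits: float, v_bits: float) -> float:
    """Estimate total KV cache size in bytes for given K and V bit widths."""
    # Number of cached elements in K; V holds the same number.
    elements = n_layers * n_kv_heads * head_dim * n_tokens
    return elements * (k_bits + v_bits) / 8


GIB = 1024 ** 3
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32768)

# (K bits, V bits) taken from the Bits column; FP16 is 16 bits.
configs = {
    "FP16 / FP16": (16, 16),
    "ISO3 / FP16": (3.25, 16),
    "TURBO3 / TURBO4": (3.25, 4.25),
}

for label, (k_bits, v_bits) in configs.items():
    size = kv_cache_bytes(**shape, k_bits=k_bits, v_bits=v_bits)
    print(f"{label}: {size / GIB:.2f} GiB")
```

Under these assumptions, compressing only K still removes a large share of the cache, while compressing both K and V is needed to approach the ratios in the table.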