
Compression Methods

rookiemann edited this page Apr 10, 2026 · 1 revision

Method comparison

| Method | Family | Bits | Compression | Calibration | Speed Impact | Best For |
|---|---|---|---|---|---|---|
| turbo2 | TurboQuant | 2.25 | 7.1x | Required | -16% decode | Maximum VRAM savings |
| turbo3 | TurboQuant | 3.25 | 4.9x | Required | -5% decode | Balanced compression |
| turbo4 | TurboQuant | 4.25 | 3.8x | Required | -4% decode | Near-lossless quality |
| turbo2_tcq | TCQ | 2.25 | 7.1x | Required | -16% decode | Max savings + better quality |
| turbo3_tcq | TCQ | 3.25 | 4.9x | Required | -5% decode | Best quality at 5x |
| iso3 | IsoQuant | 3.25 | 4.9x | No | ~0% decode | K-only, zero speed cost |
| iso4 | IsoQuant | 4.25 | 3.8x | No | ~0% decode | Higher quality K-only |
| planar3 | PlanarQuant | 3.25 | 4.9x | No | -1% decode | Simplest, Metal support |
| planar4 | PlanarQuant | 4.25 | 3.8x | No | ~0% decode | Quality K-only |
| triattention | TriAttention | 16 | 10-16x | Required | Varies | Long reasoning, compose with above |
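
The Compression figures for the quantization methods are consistent with dividing a 16-bit FP16 baseline by the Bits value. The sketch below checks that arithmetic for three rows. The 16-bit baseline and the helper name `compression_ratio` are assumptions made for illustration and are not part of the library. TriAttention is left out because it is listed at 16 bits, so its ratio evidently does not come from bit width.

```python
# Assumed baseline: an uncompressed FP16 cache stores 16 bits per value.
FP16_BITS = 16

# Bits per value, copied from the comparison table.
BITS_PER_VALUE = {"turbo2": 2.25, "turbo3": 3.25, "turbo4": 4.25}


def compression_ratio(bits, baseline_bits=FP16_BITS):
    """Return the compression ratio relative to the baseline bit width."""
    return baseline_bits / bits


for name, bits in BITS_PER_VALUE.items():
    print(f"{name}: {compression_ratio(bits):.1f}x")
```

The same division applies to the iso, planar, and TCQ rows, since they share the 2.25, 3.25, and 4.25 bit widths.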

Choosing a method

If you want zero setup: Use iso3 or planar3. Neither needs calibration files, and in K-only mode the decode slowdown is about 1% at most.

If you want maximum quality: Use turbo4 for both K and V (symmetric). It is near-lossless at 3.8x compression but requires calibration.

If you want maximum VRAM savings: Use turbo2_tcq for both K and V (7.1x), or combine any method with TriAttention for 40-80x total compression.
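
As a rough check on the combined figure, the sketch below assumes that a quantization method and TriAttention compress independently, so their ratios multiply. That multiplication rule and the helper `combined_ratio` are assumptions for illustration; the ratios themselves come from the comparison table.

```python
# TriAttention's compression range, from the comparison table.
TRI_RANGE = (10, 16)

# Compression ratios of a few quantization methods, from the same table.
QUANT_RATIOS = {"turbo4": 3.8, "turbo3": 4.9, "turbo2_tcq": 7.1}


def combined_ratio(quant_ratio, tri_ratio):
    """Assumed model: independent compression stages multiply."""
    return quant_ratio * tri_ratio


for name, ratio in QUANT_RATIOS.items():
    low = combined_ratio(ratio, TRI_RANGE[0])
    high = combined_ratio(ratio, TRI_RANGE[1])
    print(f"{name} + triattention: {low:.0f}x to {high:.0f}x")
```

Under this assumption, the 3.8x and 4.9x methods span roughly 38x to 78x, which lines up with the 40-80x range quoted above.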

If you're on AMD or Mac: Use iso3 or planar3. TurboQuant requires CUDA flash attention kernels.

If you want speed: Use iso3 K-only. Our benchmarks showed it can actually beat FP16 decode speed, because the memory bandwidth saved outweighs the cost of the rotation.

Asymmetric configurations

You can use different methods for K and V caches. This is useful because:

  • K cache benefits more from compression (attention score computation is bandwidth-bound)
  • V cache quality matters more for output quality (weighted sum of values)

Common asymmetric configs:

# K compressed, V full precision -- zero speed cost
CacheConfig(k_method=CacheMethod.ISO3, v_method=CacheMethod.FP16)

# K at higher compression, V at lower
CacheConfig(k_method=CacheMethod.TURBO3, v_method=CacheMethod.TURBO4)
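
To see what an asymmetric config saves overall, the sketch below estimates the combined K+V compression ratio. It assumes the K and V caches hold the same number of values and that FP16 is 16 bits per value. The string keys and the helper `overall_ratio` are illustrative stand-ins, not the library's `CacheMethod` enum or API.

```python
# Bits per value: 16 for FP16 (assumed), the rest from the comparison table.
BITS = {"fp16": 16.0, "iso3": 3.25, "turbo3": 3.25, "turbo4": 4.25}


def overall_ratio(k_method, v_method):
    """Combined K+V compression versus an all-FP16 cache.

    Assumes K and V store the same number of values, so the baseline
    costs 2 * 16 bits per K/V pair.
    """
    return 2 * BITS["fp16"] / (BITS[k_method] + BITS[v_method])


print(f"iso3 K, fp16 V:     {overall_ratio('iso3', 'fp16'):.2f}x")
print(f"turbo3 K, turbo4 V: {overall_ratio('turbo3', 'turbo4'):.2f}x")
```

Keeping V at full precision caps the overall saving below 2x no matter how hard K is compressed, which is the trade-off against the zero speed cost of the K-only config.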
