RTX 3090 x2 CUDA report: on our setup, TurboQuant's clearest win was context capacity
We tested the current `feature/turboquant-kv-cache` branch on Windows CUDA with RTX 3090 x2, using build `8ad0f00e9` and Qwen3.5-27B in both Q4_K_M and Q8_0.
The main reason I'm posting is that our results were most convincing on KV memory reduction and larger usable context, rather than on universal speed gains.
Main result
Using `q8_0/turbo3`, KV usage dropped by a consistent 31.6% across the contexts we tested.
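As a rough sanity check, 31.6% is about what you'd expect if only the V cache is requantized while K stays at `q8_0`. A minimal sketch of that arithmetic, with two assumptions on my part: the baseline was `q8_0` for both K and V (~8.5 bits/element, from 32-element blocks plus a 2-byte scale), and `turbo3` stores V at roughly 3.1 bits/element. Neither bit-width is taken from the branch itself.

```python
# Illustrative KV-cache arithmetic only; the turbo3 bit-width below is an
# assumption, not a number documented by the turboquant branch.
BITS_Q8_0 = 34 * 8 / 32   # q8_0: 32 elems -> 32 quant bytes + 2 scale bytes = 8.5 bits/elem
BITS_TURBO3 = 3.125       # hypothetical ~3-bit V quantization

def kv_reduction(bits_k_old, bits_v_old, bits_k_new, bits_v_new):
    """Fractional KV-cache size reduction when switching cache quant types."""
    return 1 - (bits_k_new + bits_v_new) / (bits_k_old + bits_v_old)

# baseline q8_0/q8_0 -> q8_0-K + turbo3-V
r = kv_reduction(BITS_Q8_0, BITS_Q8_0, BITS_Q8_0, BITS_TURBO3)
print(f"{r:.1%}")  # prints 31.6%
```

Under those assumed bit-widths the arithmetic lands on the observed figure, which at least makes the measurement plausible as a pure V-cache saving.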
Single 3090 + Q4_K_M
- baseline: OOM at 192K
- turbo3: still ran at 192K
- practical gain: about +60K usable context

Dual 3090 + Q8_0
- baseline: OOM at 256K
- turbo3: reached the 262K model limit
- practical gain: about +32K usable context
So on our CUDA 3090 setup, TurboQuant was very clearly useful as a capacity / VRAM efficiency feature.
Speed observations
Short-prompt results were mixed rather than universally better:
| Config | baseline (prompt / gen t/s) | turbo3 (prompt / gen t/s) |
|---|---|---|
| Single Q4, 16K | 92.4 / 33.8 | 67.8 / 31.8 |
| Single Q4, 32K | 8.4 / 34.2 (unstable run) | 81.5 / 34.0 |
| Dual Q8, 16K | 88.3 / 22.8 | 102.9 / 22.9 |
So we did see some speed wins in selected cases, but not a blanket decode-speed improvement.
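For anyone wanting to compare against these short-prompt numbers: we drove the tests with `llama-bench`-style runs. The sketch below assumes the branch keeps stock llama.cpp's `llama-bench` flags (`-p` prompt length, `-n` generated tokens, `-ctk`/`-ctv` cache types) and that its build accepts `turbo3` as a V-cache type; model filename is hypothetical.

```shell
# Hypothetical reproduction of the "Single Q4, 16K" row; flag names follow
# stock llama.cpp llama-bench, turbo3 assumed accepted by the branch's build.
llama-bench -m Qwen3.5-27B-Q4_K_M.gguf \
  -p 16384 -n 128 \
  -ctk q8_0 -ctv turbo3
```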
turbo4 validation
We also validated `turbo4` on CUDA, both as `turbo4/turbo4` and `q8_0/turbo4`:
- It worked correctly.
- It preserved the same general capacity benefit.
- But on our setup it did not show a strong practical decode-speed advantage in long-context real-use testing.
Real-use style long-prompt runs:
Single 3090 + Q4 + 128K + `turbo4/turbo4`
- prompt: about 505 t/s
- generation: about 5.4 t/s

Dual 3090 + Q8 + 192K + `turbo4/turbo4`
- prompt: about 341 t/s
- generation: about 3.7 t/s
For our deployment-style use, turbo4 looked more like a capacity / compatibility / quality-validation path than a clear decode-speed winner.
Practical deployment profiles that came out of our testing
- Single 3090 + Q4_K_M: `q8_0`-K + `turbo3`-V at around 96K
- Dual 3090 + Q8_0: `q8_0`-K + `turbo3`-V at around 192K
We ended up exposing those profiles through local API launchers and Open WebUI for manual real-use testing.
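For reference, our launchers boil down to setting per-tensor cache types at startup. The exact flags on the turboquant branch may differ; this sketch assumes it builds on llama.cpp's `llama-server` and keeps the stock `--cache-type-k` / `--cache-type-v` options, adding `turbo3` as an accepted value. Model filenames and context sizes are illustrative.

```shell
# Single 3090 + Q4_K_M profile (~96K context); flag names assume the branch
# reuses stock llama.cpp options, with turbo3 as a new V-cache type.
llama-server -m Qwen3.5-27B-Q4_K_M.gguf \
  --ctx-size 98304 \
  --cache-type-k q8_0 \
  --cache-type-v turbo3

# Dual 3090 + Q8_0 profile (~192K context), split evenly across both GPUs
llama-server -m Qwen3.5-27B-Q8_0.gguf \
  --ctx-size 196608 \
  --tensor-split 1,1 \
  --cache-type-k q8_0 \
  --cache-type-v turbo3
```

Open WebUI then just points at the server's OpenAI-compatible endpoint, so no client-side changes were needed per profile.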
Caveat
These are CUDA RTX 3090 results, so I would not treat them as directly equivalent to every `turboquant_plus` headline number, especially the Apple / Metal ones.
Still, from our side the conclusion was fairly consistent:
On this setup, TurboQuant's strongest and most reproducible value was context capacity / VRAM efficiency, while speed gains depended heavily on context, config, and benchmark method.