RTX 3090 x2 CUDA report: on our setup, TurboQuant's clearest win was context capacity
We tested the current `feature/turboquant-kv-cache` branch on Windows CUDA with RTX 3090 x2, using build `8ad0f00e9` and Qwen3.5-27B in both Q4_K_M and Q8_0.
The main reason I'm posting is that our results were most convincing on KV memory reduction and larger usable context, rather than on universal speed gains.
Main result
Using `q8_0/turbo3`, KV usage dropped by a consistent 31.6% across the contexts we tested.
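As a rough sanity check, 31.6% is about what you'd expect if only the V cache is requantized while K stays at `q8_0`. A minimal sketch of that arithmetic, with two assumptions on my part: the baseline was `q8_0` for both K and V (~8.5 bits/element, from 32-element blocks plus a 2-byte scale), and `turbo3` stores V at roughly 3.1 bits/element. Neither bit-width is taken from the branch itself.

```python
# Illustrative KV-cache arithmetic only; the turbo3 bit-width below is an
# assumption, not a number documented by the turboquant branch.
BITS_Q8_0 = 34 * 8 / 32   # q8_0: 32 elems -> 32 quant bytes + 2 scale bytes = 8.5 bits/elem
BITS_TURBO3 = 3.125       # hypothetical ~3-bit V quantization

def kv_reduction(bits_k_old, bits_v_old, bits_k_new, bits_v_new):
    """Fractional KV-cache size reduction when switching cache quant types."""
    return 1 - (bits_k_new + bits_v_new) / (bits_k_old + bits_v_old)

# baseline q8_0/q8_0 -> q8_0-K + turbo3-V
r = kv_reduction(BITS_Q8_0, BITS_Q8_0, BITS_Q8_0, BITS_TURBO3)
print(f"{r:.1%}")  # prints 31.6%
```

Under those assumed bit-widths the arithmetic lands on the observed figure, which at least makes the measurement plausible as a pure V-cache saving.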
Single 3090 + Q4_K_M
- baseline: OOM at 192K
- turbo3: still ran at 192K
- practical gain: about +60K usable context

Dual 3090 + Q8_0
- baseline: OOM at 256K
- turbo3: reached the 262K model limit
- practical gain: about +32K usable context
So on our CUDA 3090 setup, TurboQuant was very clearly useful as a capacity / VRAM efficiency feature.
Speed observations
Short-prompt results were mixed rather than universally better:
| Config | baseline (prompt / gen t/s) | turbo3 (prompt / gen t/s) |
|---|---|---|
| Single Q4, 16K | 92.4 / 33.8 | 67.8 / 31.8 |
| Single Q4, 32K | 8.4 / 34.2 (unstable run) | 81.5 / 34.0 |
| Dual Q8, 16K | 88.3 / 22.8 | 102.9 / 22.9 |
So we did see some speed wins in selected cases, but not a blanket decode-speed improvement.
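For anyone wanting to compare against these short-prompt numbers: we drove the tests with `llama-bench`-style runs. The sketch below assumes the branch keeps stock llama.cpp's `llama-bench` flags (`-p` prompt length, `-n` generated tokens, `-ctk`/`-ctv` cache types) and that its build accepts `turbo3` as a V-cache type; model filename is hypothetical.

```shell
# Hypothetical reproduction of the "Single Q4, 16K" row; flag names follow
# stock llama.cpp llama-bench, turbo3 assumed accepted by the branch's build.
llama-bench -m Qwen3.5-27B-Q4_K_M.gguf \
  -p 16384 -n 128 \
  -ctk q8_0 -ctv turbo3
```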
turbo4 validation
We also validated `turbo4` on CUDA, both as `turbo4/turbo4` and `q8_0/turbo4`:
- It worked correctly.
- It preserved the same general capacity benefit.
- But on our setup it did not show a strong practical decode-speed advantage in long-context real-use testing.
Real-use style long-prompt runs:
Single 3090 + Q4 + 128K + `turbo4/turbo4`
- prompt: about 505 t/s
- generation: about 5.4 t/s

Dual 3090 + Q8 + 192K + `turbo4/turbo4`
- prompt: about 341 t/s
- generation: about 3.7 t/s
For our deployment-style use, turbo4 looked more like a capacity / compatibility / quality-validation path than a clear decode-speed winner.
Practical deployment profiles that came out of our testing
- Single 3090 + Q4_K_M: `q8_0`-K + `turbo3`-V at around 96K
- Dual 3090 + Q8_0: `q8_0`-K + `turbo3`-V at around 192K
We ended up exposing those profiles through local API launchers and Open WebUI for manual real-use testing.
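For reference, our launchers boil down to setting per-tensor cache types at startup. The exact flags on the turboquant branch may differ; this sketch assumes it builds on llama.cpp's `llama-server` and keeps the stock `--cache-type-k` / `--cache-type-v` options, adding `turbo3` as an accepted value. Model filenames and context sizes are illustrative.

```shell
# Single 3090 + Q4_K_M profile (~96K context); flag names assume the branch
# reuses stock llama.cpp options, with turbo3 as a new V-cache type.
llama-server -m Qwen3.5-27B-Q4_K_M.gguf \
  --ctx-size 98304 \
  --cache-type-k q8_0 \
  --cache-type-v turbo3

# Dual 3090 + Q8_0 profile (~192K context), split evenly across both GPUs
llama-server -m Qwen3.5-27B-Q8_0.gguf \
  --ctx-size 196608 \
  --tensor-split 1,1 \
  --cache-type-k q8_0 \
  --cache-type-v turbo3
```

Open WebUI then just points at the server's OpenAI-compatible endpoint, so no client-side changes were needed per profile.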
Caveat
These are CUDA RTX 3090 results, so I would not treat them as directly equivalent to every `turboquant_plus` headline number, especially the Apple / Metal ones.
Still, from our side the conclusion was fairly consistent:
On this setup, TurboQuant's strongest and most reproducible value was context capacity / VRAM efficiency, while speed gains depended heavily on context, config, and benchmark method.