fix(quantizer): make _pack/_unpack_qjl_signs CUDA-graph-safe by AEON-7 · Pull Request #12 · 0xSero/turboquant

AEON-7 · 2026-04-24T04:21:39Z

Summary

Fix an unpinned CPU → GPU copy in the quantizer hot path that breaks
CUDA graph capture. Both capture_only and hybrid modes currently
crash during vLLM's determine_available_memory() unless the server
is launched with --enforce-eager (which costs 20–30% decode throughput).

Root cause

turboquant/quantizer.py:216-224 —
_pack_qjl_signs and _unpack_qjl_signs allocate the powers-of-two
lookup tensor inside the function:

powers = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128],
                       device=signs.device, dtype=torch.uint8)

torch.tensor([...], device=cuda) creates the tensor on CPU first then
copies to GPU. That's an unpinned host-to-device copy, which PyTorch
refuses inside a CUDA graph capture region:

RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA
graph capture unless the CPU tensor is pinned. Please use
tensor.pin_memory() or allocate the tensor with pin_memory=True.

Repro

# any vllm serve with TQ installed via collective_rpc on the worker:
vllm serve <model> --attention-backend flash_attn \
    --gpu-memory-utilization 0.85 --max-num-seqs 16
# then: executor.collective_rpc(lambda w: install_turboquant_hooks(
#           w.model_runner, mode="capture_only"))

Without this patch the engine crashes during
_warmup_and_capture → _dummy_run. With --enforce-eager, the plugin
works but at a substantial perf cost.

Fix

Cache the powers tensor per-device in a module-level dict. Allocate
once, reuse forever. No hot-path host-device copy.

Validation

Tested on NVIDIA DGX Spark (GB10 / sm_121a) with
ghcr.io/aeon-7/vllm-dflash:latest, Qwen3.5-27B NVFP4 hybrid model
(16 full-attention + 48 linear-attention layers), vLLM 0.19.1rc1.
Both modes now boot cleanly with CUDA graphs enabled.

Throughput vs baseline (no TQ) at full production config

Natural-language prompts, deterministic (temperature=0), steady-state
post-warmup, 16 requests per level.

Concurrency	TQ off (ref)	TQ capture_only	TQ hybrid
c=1 (code)	64.02 tok/s	61.50 tok/s	61.71 tok/s
c=4 (code)	181.47 tok/s	175.71 tok/s	175.79 tok/s
c=8 (code)	262.77 tok/s	255.19 tok/s	252.78 tok/s
c=16 (code)	327.89 tok/s	314.93 tok/s	318.36 tok/s
c=1 (prose)	29.46 tok/s	28.14 tok/s	28.49 tok/s
c=16 (prose)	151.81 tok/s	147.43 tok/s	148.80 tok/s

Measured overhead: roughly 2% to 4.5% per row (about 3% on average) across all modes, concurrencies, and prompt styles, matching the advertised profile.
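
The per-row overhead can be recomputed from the table as (baseline - measured) / baseline. The short script below does this using the throughput figures copied from the table. The names `RESULTS` and `overhead_pct` are illustrative only.

```python
# Throughput in tok/s copied from the table: (TQ off, capture_only, hybrid).
RESULTS = {
    "c=1 (code)": (64.02, 61.50, 61.71),
    "c=4 (code)": (181.47, 175.71, 175.79),
    "c=8 (code)": (262.77, 255.19, 252.78),
    "c=16 (code)": (327.89, 314.93, 318.36),
    "c=1 (prose)": (29.46, 28.14, 28.49),
    "c=16 (prose)": (151.81, 147.43, 148.80),
}


def overhead_pct(baseline: float, measured: float) -> float:
    """Throughput lost relative to the TQ-off baseline, in percent."""
    return (baseline - measured) / baseline * 100.0


overheads = []
for label, (ref, capture_only, hybrid) in RESULTS.items():
    cap = overhead_pct(ref, capture_only)
    hyb = overhead_pct(ref, hybrid)
    overheads.extend([cap, hyb])
    print(f"{label:13s} capture_only {cap:5.2f}%   hybrid {hyb:5.2f}%")

mean_overhead = sum(overheads) / len(overheads)
print(f"mean overhead: {mean_overhead:.2f}%")
```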

Test plan for reviewer

pytest test_modular.py test_turboquant.py validate_paper.py still passes on CUDA
Boot a vLLM server with install_turboquant_hooks(..., mode="hybrid") without --enforce-eager; engine should reach Application startup complete without the "Cannot copy between CPU and CUDA tensors" error
No regression on the existing benchmark.py / proof.py flows

🤖 Generated with Claude Code

_pack_qjl_signs and _unpack_qjl_signs allocate the powers-of-two lookup tensor via `torch.tensor([1, 2, 4, ..., 128], device=..., dtype=uint8)` on every call. That form creates the tensor on CPU first, then copies to GPU: an unpinned host-to-device copy that PyTorch refuses inside a CUDA graph capture region, raising:

RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture unless the CPU tensor is pinned. Please use tensor.pin_memory() or allocate the tensor with pin_memory=True.

The call sites are on the decode hot path (each quantize() call runs them), so the allocation happens every graph-capture warmup iteration and the capture fails.

Repro (without the patch): any model served via `vllm serve` with `install_turboquant_hooks(worker.model_runner, mode="capture_only")` crashes during `determine_available_memory` → `profile_cudagraph_memory` → `_warmup_and_capture`. vLLM's default launch path uses CUDA graphs, so the plugin is unusable unless the server is started with `--enforce-eager`, which costs 20-30% decode throughput.

Fix: cache the powers tensor once per device in a module-level dict and return the cached tensor on subsequent calls. No more host-device copy in the hot path.

Validated on NVIDIA DGX Spark (GB10 / sm_121a) with ghcr.io/aeon-7/vllm-dflash, Qwen3.5-27B-NVFP4 hybrid model. Both `capture_only` and `hybrid` modes boot cleanly with CUDA graphs enabled; end-to-end decode overhead vs TQ-off is ~3% (matching the advertised profile).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README is rewritten end-to-end for simplicity and accuracy:

- Single "Quick Start" with 3 copy-paste steps (was 5+ per-model walkthroughs)
- Performance section updated with natural-prompt numbers (superseded random-token)
- New "TurboQuant (optional)" section with the 3-mode overhead matrix and long-context decode data
- Troubleshooting collapsibles covering the 5 most common startup issues
- Dropped legacy sections: duplicated "Usage Patterns", "Why Dense Over MoE" prose sprawl, "Optimized for DGX Spark GB10" (duplicated tuning recap)
- 629 → 397 lines, same substantive information

New turboquant/ subfolder: optional Docker build that layers 0xSero/turboquant on top of this image via a clean site-packages .pth bootstrap (no vLLM source patches). Pins to AEON-7/turboquant@fix/cuda-graph-safe-qjl-powers which carries 0xSero/turboquant#12, a 20-line fix making the quantizer's QJL-sign helpers CUDA-graph-safe (they were allocating `torch.tensor([1,2,4,...], device=cuda, dtype=uint8)` every forward, which breaks vLLM's graph capture unless the server is launched with `--enforce-eager`).

BENCHMARKS.md gains three TurboQuant-specific tables (code/prose concurrency, long-context decode) matching what the README summarises.

All TurboQuant numbers measured on DGX Spark GB10 with ghcr.io/aeon-7/vllm-dflash:latest + the extension container, Qwen3.5-27B NVFP4 hybrid model, DFlash k=15, tuned config (MAX_NUM_SEQS=16, MAX_NUM_BATCHED_TOKENS=32768, GPU_MEMORY_UTILIZATION=0.85).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AEON-7 wants to merge 1 commit into 0xSero:main from AEON-7:fix/cuda-graph-safe-qjl-powers
