fix(quantizer): make _pack/_unpack_qjl_signs CUDA-graph-safe #12

Open
AEON-7 wants to merge 1 commit into 0xSero:main from AEON-7:fix/cuda-graph-safe-qjl-powers

AEON-7 commented Apr 24, 2026

Summary

Fix an unpinned CPU → GPU copy in the quantizer hot path that breaks
CUDA graph capture. Both capture_only and hybrid modes currently
crash during vLLM's determine_available_memory() unless the server
is launched with --enforce-eager (which costs 20–30% decode throughput).

Root cause

turboquant/quantizer.py:216-224
_pack_qjl_signs and _unpack_qjl_signs allocate the powers-of-two
lookup tensor inside the function:

powers = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128],
                       device=signs.device, dtype=torch.uint8)

torch.tensor([...], device=cuda) creates the tensor on the CPU first,
then copies it to the GPU. That is an unpinned host-to-device copy,
which PyTorch refuses inside a CUDA graph capture region:

RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA
graph capture unless the CPU tensor is pinned. Please use
tensor.pin_memory() or allocate the tensor with pin_memory=True.

Repro

# any vllm serve with TQ installed via collective_rpc on the worker:
vllm serve <model> --attention-backend flash_attn \
    --gpu-memory-utilization 0.85 --max-num-seqs 16
# then: executor.collective_rpc(lambda w: install_turboquant_hooks(
#           w.model_runner, mode="capture_only"))

Without this patch the engine crashes during
_warmup_and_capture → _dummy_run. With --enforce-eager, the plugin
works but at a substantial perf cost.

Fix

Cache the powers tensor per-device in a module-level dict. Allocate
once, reuse forever. No hot-path host-device copy.

Validation

Tested on NVIDIA DGX Spark (GB10 / sm_121a) with
ghcr.io/aeon-7/vllm-dflash:latest, Qwen3.5-27B NVFP4 hybrid model
(16 full-attention + 48 linear-attention layers), vLLM 0.19.1rc1.
Both modes now boot cleanly with CUDA graphs enabled.

Throughput vs baseline (no TQ) at full production config

Natural-language prompts, deterministic (temperature=0), steady-state
post-warmup, 16 requests per level.

| Concurrency | TQ off (ref), tok/s | TQ capture_only, tok/s | TQ hybrid, tok/s |
|---|---|---|---|
| c=1 (code) | 64.02 | 61.50 | 61.71 |
| c=4 (code) | 181.47 | 175.71 | 175.79 |
| c=8 (code) | 262.77 | 255.19 | 252.78 |
| c=16 (code) | 327.89 | 314.93 | 318.36 |
| c=1 (prose) | 29.46 | 28.14 | 28.49 |
| c=16 (prose) | 151.81 | 147.43 | 148.80 |

Measured overhead: roughly 2.0-4.5% (about 3% on average) across all
modes, concurrencies, and prompt styles, matching the advertised profile.
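
The overhead figure can be recomputed from the throughput table above. The sketch below copies the tok/s values from that table and computes the relative slowdown of each TQ mode against the TQ-off reference; the names `THROUGHPUT` and `overhead_pct` are illustrative, not part of the PR.

```python
# Throughput in tok/s from the table above: (TQ off, capture_only, hybrid).
THROUGHPUT = {
    "c=1 (code)": (64.02, 61.50, 61.71),
    "c=4 (code)": (181.47, 175.71, 175.79),
    "c=8 (code)": (262.77, 255.19, 252.78),
    "c=16 (code)": (327.89, 314.93, 318.36),
    "c=1 (prose)": (29.46, 28.14, 28.49),
    "c=16 (prose)": (151.81, 147.43, 148.80),
}


def overhead_pct(baseline: float, measured: float) -> float:
    """Relative slowdown of `measured` versus `baseline`, in percent."""
    return 100.0 * (baseline - measured) / baseline


overheads = {
    label: (overhead_pct(ref, capture_only), overhead_pct(ref, hybrid))
    for label, (ref, capture_only, hybrid) in THROUGHPUT.items()
}
all_values = [v for pair in overheads.values() for v in pair]
mean_overhead = sum(all_values) / len(all_values)

for label, (co, hy) in overheads.items():
    print(f"{label:>13}: capture_only {co:.1f}%  hybrid {hy:.1f}%")
print(f"min {min(all_values):.1f}%  max {max(all_values):.1f}%  "
      f"mean {mean_overhead:.1f}%")
```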

Test plan for reviewer

  • pytest test_modular.py test_turboquant.py validate_paper.py still pass on CUDA
  • Boot a vLLM server with install_turboquant_hooks(..., mode="hybrid") without --enforce-eager; engine should reach Application startup complete without the "Cannot copy between CPU and CUDA tensors" error
  • No regression on the existing benchmark.py / proof.py flows

🤖 Generated with Claude Code

_pack_qjl_signs and _unpack_qjl_signs allocate the powers-of-two lookup
tensor via `torch.tensor([1, 2, 4, ..., 128], device=..., dtype=uint8)`
on every call.  That form creates the tensor on CPU first, then copies
to GPU — an unpinned host-to-device copy that PyTorch refuses inside a
CUDA graph capture region, raising:

    RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA
    graph capture unless the CPU tensor is pinned. Please use
    tensor.pin_memory() or allocate the tensor with pin_memory=True.

The call sites are on the decode hot path (each quantize() call runs
them), so the allocation happens every graph-capture warmup iteration
and the capture fails.

Repro (without the patch): any model served via `vllm serve` with
`install_turboquant_hooks(worker.model_runner, mode="capture_only")`
crashes during `determine_available_memory` → `profile_cudagraph_memory`
→ `_warmup_and_capture`.  vLLM's default launch path uses CUDA graphs,
so the plugin is unusable unless the server is started with
`--enforce-eager`, which costs 20-30% decode throughput.

Fix: cache the powers tensor once per device in a module-level dict
and return the cached tensor on subsequent calls.  No more host-device
copy in the hot path.

Validated on NVIDIA DGX Spark (GB10 / sm_121a) with
ghcr.io/aeon-7/vllm-dflash, Qwen3.5-27B-NVFP4 hybrid model.  Both
`capture_only` and `hybrid` modes boot cleanly with CUDA graphs
enabled; end-to-end decode overhead vs TQ-off is ~3% (matching the
advertised profile).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AEON-7 added a commit to AEON-7/vllm-dflash that referenced this pull request Apr 24, 2026
README is rewritten end-to-end for simplicity and accuracy:
 - Single "Quick Start" with 3 copy-paste steps (was 5+ per-model walkthroughs)
 - Performance section updated with natural-prompt numbers (superseding the random-token numbers)
 - New "TurboQuant (optional)" section with the 3-mode overhead matrix and
   long-context decode data
 - Troubleshooting collapsibles covering the 5 most common startup issues
 - Dropped legacy sections: duplicated "Usage Patterns", "Why Dense Over MoE"
   prose sprawl, "Optimized for DGX Spark GB10" (duplicated tuning recap)
 - 629 → 397 lines, same substantive information

New turboquant/ subfolder: optional Docker build that layers 0xSero/turboquant
on top of this image via a clean site-packages .pth bootstrap (no vLLM source
patches).  Pins to AEON-7/turboquant@fix/cuda-graph-safe-qjl-powers which
carries 0xSero/turboquant#12 — a 20-line fix making
the quantizer's QJL-sign helpers CUDA-graph-safe (they were allocating
`torch.tensor([1,2,4,...], device=cuda, dtype=uint8)` every forward, which
breaks vLLM's graph capture unless the server is launched with
`--enforce-eager`).

BENCHMARKS.md gains three TurboQuant-specific tables (code/prose concurrency,
long-context decode) matching what the README summarises.

All TurboQuant numbers measured on DGX Spark GB10 with
ghcr.io/aeon-7/vllm-dflash:latest + the extension container, Qwen3.5-27B NVFP4
hybrid model, DFlash k=15, tuned config (MAX_NUM_SEQS=16,
MAX_NUM_BATCHED_TOKENS=32768, GPU_MEMORY_UTILIZATION=0.85).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>