fix(quantizer): make _pack/_unpack_qjl_signs CUDA-graph-safe #12
Open
AEON-7 wants to merge 1 commit into
_pack_qjl_signs and _unpack_qjl_signs allocate the powers-of-two lookup
tensor via `torch.tensor([1, 2, 4, ..., 128], device=..., dtype=uint8)`
on every call. That form creates the tensor on CPU first, then copies
to GPU — an unpinned host-to-device copy that PyTorch refuses inside a
CUDA graph capture region, raising:
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA
graph capture unless the CPU tensor is pinned. Please use
tensor.pin_memory() or allocate the tensor with pin_memory=True.
The call sites are on the decode hot path (each quantize() call runs
them), so the allocation happens every graph-capture warmup iteration
and the capture fails.
Repro (without the patch): any model served via `vllm serve` with
`install_turboquant_hooks(worker.model_runner, mode="capture_only")`
crashes during `determine_available_memory` → `profile_cudagraph_memory`
→ `_warmup_and_capture`. vLLM's default launch path uses CUDA graphs,
so the plugin is unusable unless the server is started with
`--enforce-eager`, which costs 20-30% decode throughput.
Fix: cache the powers tensor once per device in a module-level dict
and return the cached tensor on subsequent calls. No more host-device
copy in the hot path.
Validated on NVIDIA DGX Spark (GB10 / sm_121a) with
ghcr.io/aeon-7/vllm-dflash, Qwen3.5-27B-NVFP4 hybrid model. Both
`capture_only` and `hybrid` modes boot cleanly with CUDA graphs
enabled; end-to-end decode overhead vs TQ-off is ~3% (matching the
advertised profile).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AEON-7 added a commit to AEON-7/vllm-dflash that referenced this pull request on Apr 24, 2026
README is rewritten end-to-end for simplicity and accuracy:
- Single "Quick Start" with 3 copy-paste steps (was 5+ per-model walkthroughs)
- Performance section updated with natural-prompt numbers (superseded random-token)
- New "TurboQuant (optional)" section with the 3-mode overhead matrix and long-context decode data
- Troubleshooting collapsibles covering the 5 most common startup issues
- Dropped legacy sections: duplicated "Usage Patterns", "Why Dense Over MoE" prose sprawl, "Optimized for DGX Spark GB10" (duplicated tuning recap)
- 629 → 397 lines, same substantive information
New turboquant/ subfolder: optional Docker build that layers 0xSero/turboquant on top of this image via a clean site-packages .pth bootstrap (no vLLM source patches). Pins to AEON-7/turboquant@fix/cuda-graph-safe-qjl-powers, which carries 0xSero/turboquant#12, a 20-line fix making the quantizer's QJL-sign helpers CUDA-graph-safe (they were allocating `torch.tensor([1,2,4,...], device=cuda, dtype=uint8)` every forward, which breaks vLLM's graph capture unless the server is launched with `--enforce-eager`).
BENCHMARKS.md gains three TurboQuant-specific tables (code/prose concurrency, long-context decode) matching what the README summarises.
All TurboQuant numbers measured on DGX Spark GB10 with ghcr.io/aeon-7/vllm-dflash:latest + the extension container, Qwen3.5-27B NVFP4 hybrid model, DFlash k=15, tuned config (MAX_NUM_SEQS=16, MAX_NUM_BATCHED_TOKENS=32768, GPU_MEMORY_UTILIZATION=0.85).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fix an unpinned CPU → GPU copy in the quantizer hot path that breaks
CUDA graph capture. Both `capture_only` and `hybrid` modes currently
crash during vLLM's `determine_available_memory()` unless the server
is launched with `--enforce-eager` (which costs 20–30% decode throughput).
Root cause
In `turboquant/quantizer.py:216-224`, `_pack_qjl_signs` and
`_unpack_qjl_signs` allocate the powers-of-two lookup tensor inside the
function on every call.
`torch.tensor([...], device=cuda)` creates the tensor on CPU first, then
copies it to GPU. That's an unpinned host-to-device copy, which PyTorch
refuses inside a CUDA graph capture region, raising
`RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture unless the CPU tensor is pinned.`
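For context, a powers-of-two table like `[1, 2, 4, ..., 128]` is the usual way to pack eight sign bits into a single byte. The following is a minimal pure-Python sketch of that idea, not the code in `turboquant/quantizer.py`: the names `pack_byte` and `unpack_byte` are hypothetical, and the least-significant-bit-first order is an assumption.

```python
# Hypothetical illustration only; the real helpers operate on torch tensors.
POWERS = [1 << i for i in range(8)]  # [1, 2, 4, 8, 16, 32, 64, 128]


def pack_byte(signs):
    """Pack 8 booleans into one integer in 0..255 (assumed LSB-first)."""
    if len(signs) != 8:
        raise ValueError("expected exactly 8 sign bits")
    # Each set bit contributes its power of two to the packed value.
    return sum(power for power, bit in zip(POWERS, signs) if bit)


def unpack_byte(value):
    """Recover the 8 booleans by masking with each power of two."""
    return [bool(value & power) for power in POWERS]
```

The table itself never changes between calls, which is why rebuilding it on every call is pure overhead.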
Repro
Without this patch the engine crashes during
`_warmup_and_capture` → `_dummy_run`. With `--enforce-eager`, the plugin
works but at a substantial perf cost.
Fix
Cache the powers tensor per-device in a module-level dict. Allocate
once, reuse forever. No hot-path host-device copy.
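A minimal sketch of what such a per-device cache could look like, assuming PyTorch. `_POWERS_CACHE`, `get_powers`, and `pack_signs` are hypothetical names and the bit order is assumed; the actual patch may differ. The first call per device still builds the tensor through the host, so this sketch assumes that call happens eagerly, before graph capture starts.

```python
import torch

# Hypothetical module-level cache: one powers-of-two tensor per device.
_POWERS_CACHE: dict = {}


def get_powers(device) -> torch.Tensor:
    """Return the cached [1, 2, 4, ..., 128] uint8 tensor for `device`."""
    device = torch.device(device)
    cached = _POWERS_CACHE.get(device)
    if cached is None:
        # Only this first call per device performs the host-to-device copy.
        cached = torch.tensor(
            [1 << i for i in range(8)], dtype=torch.uint8, device=device
        )
        _POWERS_CACHE[device] = cached
    return cached


def pack_signs(signs: torch.Tensor) -> torch.Tensor:
    """Pack a (..., 8*k) bool tensor into (..., k) uint8 (assumed LSB-first)."""
    powers = get_powers(signs.device)  # cache hit on the hot path
    bits = signs.reshape(*signs.shape[:-1], -1, 8).to(torch.uint8)
    # Integer sums promote to int64, so cast back down; the maximum is 255.
    return (bits * powers).sum(dim=-1).to(torch.uint8)
```

Under these assumptions, `pack_signs(torch.tensor([True] + [False] * 7))` yields a one-element tensor holding `1`, and repeated calls on the same device reuse the same cached tensor object.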
Validation
Tested on NVIDIA DGX Spark (GB10 / sm_121a) with
`ghcr.io/aeon-7/vllm-dflash:latest`, Qwen3.5-27B NVFP4 hybrid model
(16 full-attention + 48 linear-attention layers), vLLM 0.19.1rc1.
Both modes now boot cleanly with CUDA graphs enabled.
Throughput vs baseline (no TQ) at full production config
Natural-language prompts, deterministic (temperature=0), steady-state
post-warmup, 16 requests per level.
Measured overhead: ~3% across all modes, concurrencies, and prompt
styles — matching the advertised profile.
Test plan for reviewer
- `pytest test_modular.py test_turboquant.py validate_paper.py` still passes on CUDA
- `install_turboquant_hooks(..., mode="hybrid")` without `--enforce-eager`; engine should reach `Application startup complete` without the "Cannot copy between CPU and CUDA tensors" error
- `benchmark.py` / `proof.py` flows
🤖 Generated with Claude Code