
fix: RuntimeError in free_kv_cache with vLLM v1 (0.19+) #14

Open: IthaloPS wants to merge 1 commit into 0xSero:main from IthaloPS:fix/free-kv-cache-tensor-bool-vllm-v1


IthaloPS commented May 8, 2026

Bug

When running TurboQuant with vLLM 0.19+ (v1 engine), calling free_kv_cache raises:

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Traceback points to free_kv_cache in both turboquant/integration/vllm.py and turboquant/vllm_attn_backend.py:

kv_list = getattr(attn_module, "kv_cache", None)
if kv_list and len(kv_list) > 0:  # ← crashes when kv_list is a Tensor

In vLLM v1, attn_module.kv_cache is a Tensor, not a Python list. Evaluating a multi-element Tensor as a boolean is ambiguous in PyTorch and raises the error.

Fix

Replace the truthiness check with an explicit None check in all 3 occurrences across both files:

# Before
if kv_list and len(kv_list) > 0:

# After
if kv_list is not None and len(kv_list) > 0:
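The difference between the two checks can be sketched without PyTorch installed. In the minimal example below, `FakeTensor`, `has_kv_cache_unsafe`, and `has_kv_cache` are hypothetical names introduced only for illustration; they are not part of TurboQuant or vLLM. `FakeTensor` is a stand-in that models only the multi-element truthiness behaviour described above, so treat it as an assumption rather than a faithful Tensor replacement.

```python
class FakeTensor:
    """Hypothetical stand-in that mimics multi-element Tensor truthiness."""

    def __init__(self, num_elements):
        self.num_elements = num_elements

    def __len__(self):
        return self.num_elements

    def __bool__(self):
        # Only the multi-element case is modelled: it raises, as PyTorch does.
        if self.num_elements > 1:
            raise RuntimeError(
                "Boolean value of Tensor with more than one value is ambiguous"
            )
        return True


def has_kv_cache_unsafe(kv):
    # Original check: `kv and ...` evaluates bool(kv) before len(kv) is reached.
    return bool(kv and len(kv) > 0)


def has_kv_cache(kv):
    # Patched check: `is not None` is an identity test and never calls __bool__.
    return kv is not None and len(kv) > 0


if __name__ == "__main__":
    for candidate in (None, [], ["layer0"], FakeTensor(4)):
        print(type(candidate).__name__, has_kv_cache(candidate))

    try:
        has_kv_cache_unsafe(FakeTensor(4))
    except RuntimeError as exc:
        print("unsafe check failed:", exc)
```

The patched check still treats `None` and an empty list as "no cache", so call sites that receive a Python list should behave as before.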

Tested on

| Component | Version |
| --- | --- |
| vLLM | 0.19.1 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 |
| GPU | RTX 3090 (24 GB) |
| Model | Qwen3.5-9B (dense, bf16) |
| OS | Ubuntu 22.04 |

Benchmark results (RTX 3090 24GB — Qwen3.5-9B, TP=1, max_len=32768)

After the fix, TurboQuant runs end-to-end including free_kv_cache:

| Metric | Baseline (bf16 KV) | TurboQuant (3b key / 2b val) |
| --- | --- | --- |
| Decode tok/s | 32.0 | 45.3 (+41.6%) |
| KV tensors freed | | 822 MB |
| TQ hooks (layers) | | 8 |
| Output quality | | Identical to baseline |
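For reference, the relative decode gain can be recomputed from the two throughput figures in the table. This is only a sanity check of the arithmetic; the variable names are illustrative.

```python
baseline_tok_s = 32.0    # decode tok/s with the bf16 KV cache (from the table)
turboquant_tok_s = 45.3  # decode tok/s with TurboQuant (from the table)

# Relative gain over the baseline, in percent.
gain_pct = (turboquant_tok_s - baseline_tok_s) / baseline_tok_s * 100
print(f"+{gain_pct:.1f}%")  # prints +41.6%
```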

The throughput gain is higher than the one reported for the 27B model because the 9B model is more memory-bound on a single 24 GB GPU, so compressing the KV cache relieves a larger share of the decode bottleneck.


In vLLM v1, `attn_module.kv_cache` is a Tensor, not a list. Using
`if kv_list` on a multi-element Tensor raises:
  RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Replace `if kv_list` with `if kv_list is not None` in both
`turboquant/integration/vllm.py` and `turboquant/vllm_attn_backend.py`.

Verified on: vLLM 0.19.1, PyTorch 2.10, CUDA 12.8, RTX 3090 (24GB)

Co-authored-by: Cursor <cursoragent@cursor.com>