fix: RuntimeError in free_kv_cache with vLLM v1 (0.19+) by IthaloPS · Pull Request #14 · 0xSero/turboquant

IthaloPS · 2026-05-08T14:25:27Z

Bug

When running TurboQuant with vLLM 0.19+ (v1 engine), calling free_kv_cache raises:

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

The traceback points to free_kv_cache in both turboquant/integration/vllm.py and turboquant/vllm_attn_backend.py:

kv_list = getattr(attn_module, "kv_cache", None)
if kv_list and len(kv_list) > 0:  # ← crashes when kv_list is a Tensor

In vLLM v1, attn_module.kv_cache is a Tensor, not a Python list. Evaluating a multi-element Tensor as a boolean is ambiguous in PyTorch and raises the error.
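The failure can be reproduced without vLLM or a GPU. The sketch below is illustrative only: `FakeTensor` and `old_check` are hypothetical names, and `FakeTensor` is a stand-in that models just one behavior of `torch.Tensor`, namely that converting a tensor with more than one element to a boolean raises `RuntimeError`. With PyTorch installed, a real multi-element tensor such as `torch.zeros(2, 4)` fails the same truthiness test.

```python
class FakeTensor:
    """Hypothetical stand-in that mimics torch.Tensor truthiness for this demo."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def __bool__(self):
        # Only the multi-element case is modeled: like torch.Tensor, refuse
        # to collapse several elements into a single True/False.
        if len(self.values) > 1:
            raise RuntimeError(
                "Boolean value of Tensor with more than one value is ambiguous"
            )
        return bool(self.values[0]) if self.values else False


def old_check(kv_list):
    """The pre-fix condition: a truthiness test followed by a length test."""
    return bool(kv_list and len(kv_list) > 0)


# A Python list works: bool(list) only looks at the list's length.
list_result = old_check([FakeTensor([0.0, 1.0])])

# A tensor-like object fails before len() is ever reached.
try:
    old_check(FakeTensor([0.0, 1.0]))
    tensor_error = None
except RuntimeError as exc:
    tensor_error = str(exc)

print(list_result)
print(tensor_error)
```

The list case passes because the truthiness of a list never inspects its elements, which is why the original check worked while `kv_cache` was still a list.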

Fix

Replace the truthiness check with an explicit None check in all 3 occurrences across both files:

# Before
if kv_list and len(kv_list) > 0:

# After
if kv_list is not None and len(kv_list) > 0:
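As a rough illustration of why the explicit None check is safe for both container types, the sketch below wraps the new condition in a helper. `has_kv_cache` and `AmbiguousBool` are hypothetical names introduced here, not part of TurboQuant or vLLM; `AmbiguousBool` merely stands in for an object that, like a multi-element Tensor, supports `len()` but raises when used as a boolean.

```python
class AmbiguousBool:
    """Hypothetical stand-in: has a length, but refuses truthiness tests."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __bool__(self):
        raise RuntimeError(
            "Boolean value of Tensor with more than one value is ambiguous"
        )


def has_kv_cache(kv_list):
    """The post-fix condition: identity check against None, then length."""
    # `is not None` compares identity and never calls __bool__ on kv_list,
    # so it behaves the same for lists and tensor-like objects.
    return kv_list is not None and len(kv_list) > 0


results = {
    "none": has_kv_cache(None),                     # attribute missing
    "empty_list": has_kv_cache([]),                 # nothing to free
    "list": has_kv_cache([object()]),               # list-style kv_cache
    "tensor_like": has_kv_cache(AmbiguousBool(8)),  # tensor-style kv_cache
}
print(results)
```

The helper keeps the old semantics for None and for empty or populated lists, and only changes what happens when the attribute is tensor-like.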

Tested on

Component	Version
vLLM	0.19.1
PyTorch	2.10.0+cu128
CUDA	12.8
GPU	RTX 3090 (24 GB)
Model	Qwen3.5-9B (dense, bf16)
OS	Ubuntu 22.04

Benchmark results (RTX 3090 24GB — Qwen3.5-9B, TP=1, max_len=32768)

After the fix, TurboQuant runs end-to-end including free_kv_cache:

Metric	Baseline (bf16 KV)	TurboQuant (3b key / 2b val)
Decode tok/s	32.0	45.3 (+41.6%)
KV tensors freed	—	822 MB
TQ hooks (layers)	—	8
Output quality	—	Identical to baseline

The throughput gain is higher than reported for the 27B model because the 9B model is more memory-bound on a single 24 GB GPU, so compressing the KV cache relieves a larger share of the decode bottleneck.
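For reference, the relative gain quoted in the table follows directly from the two decode throughput figures. The variable names below are illustrative only:

```python
# Decode throughput figures taken from the benchmark table above.
baseline_tok_s = 32.0
turboquant_tok_s = 45.3

# Relative gain of TurboQuant over the bf16 KV baseline, in percent.
gain_pct = (turboquant_tok_s / baseline_tok_s - 1.0) * 100.0
print(f"+{gain_pct:.1f}%")
```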


In vLLM v1, `attn_module.kv_cache` is a Tensor, not a list. Using `if kv_list` on a multi-element Tensor raises:

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Replace `if kv_list` with `if kv_list is not None` in both `turboquant/integration/vllm.py` and `turboquant/vllm_attn_backend.py`.

Verified on: vLLM 0.19.1, PyTorch 2.10, CUDA 12.8, RTX 3090 (24GB)

Co-authored-by: Cursor <cursoragent@cursor.com>

IthaloPS wants to merge 1 commit into 0xSero:main from IthaloPS:fix/free-kv-cache-tensor-bool-vllm-v1
