fix: strip x-anthropic-billing-header to fix vLLM prefix caching #101

Open
mtparet wants to merge 1 commit into MadAppGang:main from blackfuel-ai:fix/strip-billing-header-for-prefix-caching
mtparet (Contributor) commented Apr 7, 2026

Summary

  • Strip x-anthropic-billing-header from system prompts in filterIdentity to fix vLLM prefix caching
  • Add unit tests for the identity filter

Problem

Claude Code injects x-anthropic-billing-header: cc_version=...; cch=<hash> into the system prompt on every turn. The cch (conversation context hash) changes per turn, making the system prompt tokens differ between requests. This breaks vLLM's prefix cache for the entire ~31k token prompt — reducing cache hits from 99% to 0.15% (only 48 tokens cached).
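To make the failure mode concrete, here is a minimal sketch of block-granular prefix matching. It is an illustration only, not vLLM's actual implementation: the `cachedPrefixTokens` helper, the 16-token block size, the numeric token IDs, and the position of the changed token are all assumptions chosen for the example. The point it shows is that a single changed token near the start caps the reusable prefix at the last complete block before it, no matter how long the identical remainder is.

```typescript
// Hypothetical illustration, not vLLM code: count how many leading tokens of
// `next` could be reused from `prev` when caching works on whole blocks.
function cachedPrefixTokens(prev: number[], next: number[], blockSize = 16): number {
  let shared = 0;
  const limit = Math.min(prev.length, next.length);
  // Walk forward until the first token that differs.
  while (shared < limit && prev[shared] === next[shared]) shared++;
  // Only complete blocks are reusable, so round down to a block boundary.
  return Math.floor(shared / blockSize) * blockSize;
}

// Two 31,749-token prompts that differ only at token index 50
// (an assumed position standing in for the changing cch value).
const turnA = Array.from({ length: 31749 }, (_, i) => i);
const turnB = turnA.slice();
turnB[50] = -1;

console.log(cachedPrefixTokens(turnA, turnB)); // 48
```

Under these assumptions the result matches the 48 cached tokens observed per request, which would be consistent with the changing value sitting inside the fourth 16-token block of the prompt.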

Root Cause Analysis

  1. Sent 4 identical "test" messages via claudish → MiniMax M2.5 on vLLM
  2. VictoriaMetrics showed prefix cache metrics were healthy (~50% cluster-wide)
  3. But per-request cached_tokens was stuck at 48 out of ~31k
  4. Dumped full request payloads and diffed between turns
  5. Found the only difference: the cch value in the system prompt, which changed on each turn (cch=8ae40, cch=4e20a, cch=ad80e)
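The payload diff in the steps above can be sketched roughly as follows. This is not the tooling actually used for the investigation: `firstDifferingLine` is a hypothetical helper, and the two prompts, including the cc_version and cch values, are invented placeholders.

```typescript
// Hypothetical helper: report the first line at which two dumped system
// prompts differ, or null when they are identical.
function firstDifferingLine(
  a: string,
  b: string,
): { line: number; left: string; right: string } | null {
  const linesA = a.split("\n");
  const linesB = b.split("\n");
  const n = Math.max(linesA.length, linesB.length);
  for (let i = 0; i < n; i++) {
    if (linesA[i] !== linesB[i]) {
      // Line numbers are 1-based; a missing line is reported as "".
      return { line: i + 1, left: linesA[i] ?? "", right: linesB[i] ?? "" };
    }
  }
  return null;
}

// Placeholder prompts; every value here is invented for the example.
const dumpTurn1 = "x-anthropic-billing-header: cc_version=0.0.0; cch=aaaaa\nYou are a coding assistant.";
const dumpTurn2 = "x-anthropic-billing-header: cc_version=0.0.0; cch=bbbbb\nYou are a coding assistant.";

console.log(firstDifferingLine(dumpTurn1, dumpTurn2));
```

With these placeholder inputs the helper points at line 1, the billing header line, and nothing else differs.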

Fix

Added .replace(/x-anthropic-billing-header:[^\n]*\n?/g, "") as the first filter in filterIdentity(). This is safe because filterIdentity only runs for non-Anthropic providers (OpenAI-compat, Gemini, LiteLLM, OpenRouter, local) — never for native Anthropic passthrough where the header is meaningful.
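As a minimal sketch of the effect of this filter in isolation: the regular expression below is the one quoted above, while the `stripBillingHeader` wrapper and the sample prompts are hypothetical and not taken from the claudish source.

```typescript
// The pattern quoted in the PR description. The wrapper name is hypothetical;
// in the PR the replace call lives inside filterIdentity().
const BILLING_HEADER = /x-anthropic-billing-header:[^\n]*\n?/g;

function stripBillingHeader(systemPrompt: string): string {
  return systemPrompt.replace(BILLING_HEADER, "");
}

// Invented sample prompts that differ only in the cch value.
const promptA = "x-anthropic-billing-header: cc_version=0.0.0; cch=aaaaa\nYou are a coding assistant.";
const promptB = "x-anthropic-billing-header: cc_version=0.0.0; cch=bbbbb\nYou are a coding assistant.";

console.log(stripBillingHeader(promptA) === stripBillingHeader(promptB)); // true
console.log(stripBillingHeader(promptA)); // "You are a coding assistant."
```

The `[^\n]*\n?` tail consumes the rest of the header line plus its trailing newline, so no empty line is left behind where the header used to be.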

Validation

Before fix: cached_tokens = 48 / 31k on every turn (0.15%)
After fix: cached_tokens = 31,728 / 31,749 by 4th turn (99.9%)

| Turn | Before Fix | After Fix |
|------|------------|-----------|
| 1st | 48 (0.15%) | 32 (cold) |
| 2nd | 48 (0.15%) | 24,864 (79%) |
| 3rd | 48 (0.15%) | 31,296 (99%) |
| 4th | 48 (0.15%) | 31,728 (99.9%) |

Test plan

  • bun test identity-filter.test.ts — 9 tests pass
  • bun run build — compiles clean
  • Manual: run claudish with --debug, send "test" twice, verify cached_tokens grows

🤖 Generated with Claude Code

Claude Code injects `x-anthropic-billing-header: ...cch=<hash>` into
the system prompt, where the cch value changes every turn. This breaks
vLLM prefix caching since the system prompt tokens differ between
requests, reducing cache hits from 99% to ~0.15% (48 tokens out of
~31k).

Strip this header in filterIdentity (which only runs for non-Anthropic
providers) to keep system prompts stable across turns. Validated to
restore prefix cache hit rates to 99.9%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mtparet (Contributor, Author) commented Apr 7, 2026

cc @erudenko: is the CI failure expected?
