fix: strip x-anthropic-billing-header to fix vLLM prefix caching #101
Open
mtparet wants to merge 1 commit into
Claude Code injects `x-anthropic-billing-header: ...cch=<hash>` into the system prompt, where the `cch` value changes every turn. This breaks vLLM prefix caching because the system prompt tokens differ between requests, reducing cache hits from 99% to ~0.15% (48 tokens out of ~31k).

Strip this header in `filterIdentity` (which only runs for non-Anthropic providers) to keep system prompts stable across turns. Validated to restore prefix cache hit rates to 99.9%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cc @erudenko, is the CI failure expected?
Summary
Strip `x-anthropic-billing-header` from system prompts in `filterIdentity` to fix vLLM prefix caching.

Problem
Claude Code injects `x-anthropic-billing-header: cc_version=...; cch=<hash>` into the system prompt on every turn. The `cch` (conversation context hash) value changes per turn, so the system prompt tokens differ between requests. This breaks vLLM's prefix cache for the entire ~31k-token prompt, reducing cache hits from 99% to 0.15% (only 48 tokens cached).

Root Cause Analysis
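Some background on why one changing value costs almost the whole cache: a prefix cache can only reuse KV entries up to the first token that differs, because every later token's KV state depends on all earlier tokens. The sketch below is a simplified model, not vLLM's implementation. It assumes reuse happens in whole blocks of 16 tokens (vLLM's automatic prefix caching works on fixed-size KV blocks; 16 is an assumed block size here), and it uses synthetic token IDs with a made-up position for the hash token.

```typescript
// Simplified model of block-level prefix caching (an illustration, not vLLM's code).
// Assumption: the KV cache is reused in whole blocks of BLOCK_SIZE tokens, and a
// block counts as a hit only if it and every block before it match exactly.
const BLOCK_SIZE = 16;

function cachedPrefixTokens(prev: number[], next: number[]): number {
  const limit = Math.min(prev.length, next.length);
  let cached = 0;
  // Walk full blocks from the start; a trailing partial block is never counted.
  while (cached + BLOCK_SIZE <= limit) {
    for (let i = cached; i < cached + BLOCK_SIZE; i++) {
      // First mismatch ends reuse: every later block depends on this prefix.
      if (prev[i] !== next[i]) return cached;
    }
    cached += BLOCK_SIZE;
  }
  return cached;
}

// Synthetic token IDs: 50 leading tokens, then one token standing in for the
// per-turn cch hash, then 31,000 otherwise identical tokens.
// The hash offset (50) is made up for illustration.
function syntheticPrompt(cchToken: number): number[] {
  const before = Array.from({ length: 50 }, (_, i) => i);
  const after = Array.from({ length: 31000 }, (_, i) => 10000 + i);
  return [...before, cchToken, ...after];
}

const turn1 = syntheticPrompt(900001);
const turn2 = syntheticPrompt(900002);

console.log(cachedPrefixTokens(turn1, turn2)); // 48: only 3 full blocks precede the hash
console.log(cachedPrefixTokens(turn1, syntheticPrompt(900001))); // 31040: hash stable, nearly all reused
```

In this model, a single differing token near the start caps reuse at the blocks before it, no matter how long the identical remainder is. Removing the token (or keeping it identical across turns) restores reuse of every full block.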
- `cached_tokens` was stuck at 48 out of ~31k
- The `cch` value in the system prompt changed every turn: `cch=8ae40` → `cch=4e20a` → `cch=ad80e`

Fix
Added `.replace(/x-anthropic-billing-header:[^\n]*\n?/g, "")` as the first filter in `filterIdentity()`. This is safe because `filterIdentity` only runs for non-Anthropic providers (OpenAI-compat, Gemini, LiteLLM, OpenRouter, local), never for native Anthropic passthrough, where the header is meaningful.

Validation
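A unit-level view of what the regex does, ahead of the end-to-end numbers. This is a minimal sketch, assuming the system prompt reaches the filter as a plain string. `stripBillingHeader` is a hypothetical stand-in for only this first step of `filterIdentity()`, and the sample prompt text (including `cc_version=1.0`) is made up.

```typescript
// Hypothetical stand-in for the first step of filterIdentity(); the real
// function applies further identity filters after this one.
const BILLING_HEADER_RE = /x-anthropic-billing-header:[^\n]*\n?/g;

function stripBillingHeader(systemPrompt: string): string {
  // Remove the header through the end of its line, plus the newline if present,
  // so per-turn values such as cch=<hash> never reach the upstream model.
  return systemPrompt.replace(BILLING_HEADER_RE, "");
}

// Two turns whose prompts differ only in the cch value (made-up sample text).
const turn1 = "x-anthropic-billing-header: cc_version=1.0; cch=8ae40\nYou are a helpful assistant.";
const turn2 = "x-anthropic-billing-header: cc_version=1.0; cch=4e20a\nYou are a helpful assistant.";

console.log(stripBillingHeader(turn1) === stripBillingHeader(turn2)); // true
console.log(JSON.stringify(stripBillingHeader(turn1))); // "You are a helpful assistant."
```

The optional `\n?` means the header line disappears without leaving a blank line, and it still matches when the header is the last line with no trailing newline. Reusing a global regex with `String.prototype.replace` is safe across calls, because `replace` resets `lastIndex` before matching.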
- Before fix: `cached_tokens` = 48 / 31k on every turn (0.15%)
- After fix: `cached_tokens` = 31,728 / 31,749 by the 4th turn (99.9%)

Test plan
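For the manual `--debug` check in this plan, the hit rates quoted under Validation are `cached_tokens / prompt_tokens`. A minimal sketch of that calculation, assuming the upstream reports usage in the OpenAI Chat Completions shape (`usage.prompt_tokens` and `usage.prompt_tokens_details.cached_tokens`); the "before" prompt size uses 31,000 as a stand-in for "~31k".

```typescript
// Assumed OpenAI-style usage block; the upstream is assumed to mirror it.
interface Usage {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Fraction of prompt tokens served from the prefix cache (0 when unreported).
function cacheHitRate(usage: Usage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens === 0 ? 0 : cached / usage.prompt_tokens;
}

// 31000 is a stand-in for "~31k"; 31749 and 31728 are the reported figures.
const before: Usage = { prompt_tokens: 31000, prompt_tokens_details: { cached_tokens: 48 } };
const after: Usage = { prompt_tokens: 31749, prompt_tokens_details: { cached_tokens: 31728 } };

console.log((cacheHitRate(before) * 100).toFixed(2) + "%"); // 0.15%
console.log((cacheHitRate(after) * 100).toFixed(1) + "%"); // 99.9%
```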
- `bun test identity-filter.test.ts`: 9 tests pass
- `bun run build`: compiles clean
- With `--debug`, send "test" twice and verify `cached_tokens` grows

🤖 Generated with Claude Code