fix(rate-limit): gracefully handle GLM/Z.AI in-stream burst limits (HTTP 200 + [1302])#138
Closed
jsboige wants to merge 23 commits into
…creds - Proxy auth middleware validates authorization/x-api-key/x-proxy-key headers against proxyKey from config (no custom header needed) - NativeHandler falls back to stored ANTHROPIC_API_KEY when client doesn't provide auth (cluster scenario) - Custom endpoints registered at runtime pass credential checks - loadConfig() preserves proxyKey field - Proxy binds to configurable hostname (0.0.0.0 for Docker) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OAuth tokens (sk-ant-oat01-) are rejected by Anthropic when sent as x-api-key. Detect the prefix and use authorization: Bearer instead. Also pass proxy key to NativeHandler so it can distinguish proxy auth from genuine client tokens. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
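As a rough illustration of the prefix check this commit describes, a minimal TypeScript sketch follows. The helper name `buildAnthropicAuthHeaders` and its return shape are hypothetical and are not taken from the fork's code; only the `sk-ant-oat01-` prefix and the two header names come from the commit message.

```typescript
// Hypothetical helper: name and return shape are illustrative only.
const OAUTH_TOKEN_PREFIX = "sk-ant-oat01-";

function buildAnthropicAuthHeaders(token: string): Record<string, string> {
  if (token.startsWith(OAUTH_TOKEN_PREFIX)) {
    // OAuth tokens are rejected when sent as x-api-key, so send a Bearer header.
    return { authorization: `Bearer ${token}` };
  }
  // Regular API keys keep using x-api-key.
  return { "x-api-key": token };
}
```

The sketch deliberately ignores the proxy-key comparison mentioned above; in the real handler that check would presumably run before this one.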
…tion Forward ALL incoming headers to api.anthropic.com instead of selectively forwarding only auth/beta headers. Claude Code sends internal headers that make Max subscription work for Opus/Sonnet — the proxy must not drop them. Proxy key override still works: when proxyKey is configured and matches client auth, it's replaced with the stored Anthropic key. When no proxyKey is set (pass-through mode), everything flows through unmodified. Same change applied to the count_tokens endpoint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The standalone proxy now reads the default profile from ~/.claudish/config.json and passes its model mapping to createProxyServer(). This means: - opus/sonnet/haiku role requests get remapped to the profile's models - Profile "default" maps: opus→glm-5.1, sonnet→glm-4.7, haiku→glm-4.7-flash - Without a profile mapping, all models pass through unchanged Startup logs now show the active profile name and role mappings. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…elMap - Add /v1/models endpoint for Claude Code gateway discovery (CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1) - Returns all models from config.json routing rules + custom endpoints - Standalone proxy passes no modelMap — every model routes via config rules - No role remapping: opus→claude-opus-4-7, sonnet→glm-5.1, haiku→qwen3.6 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ropic providers Claude Code injects `x-anthropic-billing-header: cc_version=...; cch=XXXXX;` into the system prompt body. The cch= token changes every request, which invalidates vLLM prefix caching (strict block hash). Strip this line from the prompt for all non-Anthropic handlers — only Anthropic uses it. Expected impact: vLLM TTFT from 30-67s → 1-3s on conversation turns 2+. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
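A minimal sketch of the strip described in this commit, assuming the billing header occupies its own line in the system prompt text. The function name `stripBillingHeaderLine` is hypothetical; the real handler may match the line differently.

```typescript
// Hypothetical helper: removes the per-request billing line so the prompt
// prefix stays byte-identical across turns (which prefix caching relies on).
const BILLING_HEADER_PREFIX = "x-anthropic-billing-header:";

function stripBillingHeaderLine(systemPrompt: string): string {
  return systemPrompt
    .split("\n")
    .filter((line) => !line.trimStart().startsWith(BILLING_HEADER_PREFIX))
    .join("\n");
}
```

Two prompts that differ only in their `cch=` token come out identical after this strip, which is the property a strict block-hash cache needs.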
- Dockerfile: Bun alpine, standalone proxy, healthcheck - docker-compose.yml: volume mount for config, port 3000 - .dockerignore: exclude traces, node_modules, dist - start-claudish-proxy.ps1: Windows service wrapper with logging Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Direct connections (no IIS ARR) had no x-forwarded-for/x-real-ip headers, so all LAN machines appeared as "direct" in logs. Capture remote address via Bun's server context and include in request logging. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…utocompact All non-Anthropic stream parsers (openai-sse, gemini-sse, ollama-jsonl, openai-responses-sse) hardcoded input_tokens: 100 in message_start and omitted input_tokens from message_delta. This broke Claude Code's autocompact because tokenCountWithEstimation() accumulated usage from the SSE stream — seeing input_tokens=100 forever meant the context appeared nearly empty, so condensation never triggered. Changes: - message_start: input_tokens: 100 → 0 (placeholder, replaced by delta) - message_delta: include actual input_tokens from provider response - openai-sse fallback: remove hardcoded 100 in onTokenUpdate Affected models: GLM-5.1, Qwen, Gemini, Ollama, Codex — all non-Anthropic providers using these parsers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract all fork-specific code (proxy auth, model discovery, billing header strip, request logger, hostname binding, standalone proxy) into dedicated modules under packages/cli/src/fork/. proxy-server.ts drops from 715 to 597 lines, with fork diff now ~30 lines (import + registration call). This makes future upstream syncs straightforward — fork code is visible in one directory and easily rebased. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e errors When thinking blocks are suppressed, subsequent content block indices can jump ahead of what the client expects. Clamp indices to highestSeen+1 and log the adjustment for debugging. Also pass CLAUDISH_PROXY_KEY via docker-compose environment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Some Anthropic-compat providers (z.ai, GLM) send message_delta with stop_reason but never emit the message_stop event. Claude Code requires message_stop as the terminal event — without it, the client reports "API returned an empty or malformed response (HTTP 200)". Track sawMessageStop separately from stopReason so the synthetic finalization fires even when stopReason was received but message_stop was not. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
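The idea of tracking `message_stop` independently of `stop_reason` can be sketched as follows. This is an array-based simplification for illustration (the actual parser is a streaming state machine), and the `SseEvent` type and `appendMissingMessageStop` name are assumptions.

```typescript
// Hypothetical, simplified event shape for illustration.
type SseEvent = { event: string; data: Record<string, unknown> };

function appendMissingMessageStop(events: SseEvent[]): SseEvent[] {
  // Decided only by whether message_stop was seen, not by whether a
  // message_delta already carried a stop_reason.
  const sawMessageStop = events.some((e) => e.event === "message_stop");
  if (sawMessageStop) return events;
  // Synthesize the terminal event the client requires.
  return [...events, { event: "message_stop", data: { type: "message_stop" } }];
}
```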
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the "web search unsupported" warning with actual execution via a local SearXNG instance. Two interception paths: 1. Structured tool_call: WebSearch/WebFetch tool calls are suppressed from the stream and replaced with SearXNG results as a text block. 2. GLM <searchWeb> tags: GLM models emit <searchWeb><query>...</query> </searchWeb> in text — these are intercepted at finalize and replaced with SearXNG results. Also intercepts sub-agent web search/fetch requests at the proxy level (Claude Code sends "Perform a web search for the query: X" as a single user message). Additional improvements: - ToolState.suppressed flag for clean tool call suppression - Empty-response error handling: providers returning no content now emit a structured api_error event instead of a malformed empty message - Finalize logging with model, text length, tool count - Proxy logs tool names in incoming requests for debug Requires SEARXNG_URL env var. Graceful fallback when unavailable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…responses Z.AI/GLM providers emit server_tool_use content blocks that the Anthropic SDK does not understand (it expects only text, tool_use, thinking). These blocks are now silently suppressed from the stream with index tracking. When the stream ends without a message_start (empty response from provider), emit a synthetic message sequence (message_start → text block with error message → message_stop) so the client SDK receives a well-formed response instead of a malformed empty one. Also includes minor whitespace reformatting for consistency. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nsport providers Claude Code v2.1.153+ injects role:"system" messages inline in the messages array (e.g. system-reminders). Anthropic-compatible providers (Z.AI, MiniMax, Kimi) reject role:"system" in messages — only "user"/"assistant" are accepted. The fix strips inline system messages from the payload and merges their content into the top-level system prompt field. Applied at two levels: 1. ComposedHandler: strips from requestPayload.messages for anthropic-sse transport 2. openai-messages converter: handles the OpenAI format path by merging into the system message at index 0 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
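A rough sketch of the strip-and-merge step for the Anthropic-transport path, under simplifying assumptions: the top-level `system` field is treated as a plain string, and message content is either a string or a list of text blocks. The type names and `hoistInlineSystemMessages` are hypothetical.

```typescript
// Hypothetical, simplified payload types for illustration.
type TextBlock = { type: "text"; text: string };
type InlineMessage = {
  role: "user" | "assistant" | "system";
  content: string | TextBlock[];
};
type MessagesPayload = { system?: string; messages: InlineMessage[] };

function contentToText(content: string | TextBlock[]): string {
  return typeof content === "string" ? content : content.map((b) => b.text).join("\n");
}

function hoistInlineSystemMessages(payload: MessagesPayload): MessagesPayload {
  const inline = payload.messages
    .filter((m) => m.role === "system")
    .map((m) => contentToText(m.content));
  if (inline.length === 0) return payload; // nothing to strip
  // Merge the stripped text into the top-level system prompt.
  const system = [payload.system, ...inline]
    .filter((s): s is string => typeof s === "string" && s.length > 0)
    .join("\n\n");
  return {
    ...payload,
    system,
    messages: payload.messages.filter((m) => m.role !== "system"),
  };
}
```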
Add CLAUDISH_CAPTURE_DIR env-gated diagnostic capture that writes the
full request body to JSON files. Disabled by default (no-op when env
var is unset), so zero overhead in normal production use.
Purpose: when a hang or malformed response is reported, enable the env
var to capture the exact request payload for offline replay against
different bun versions or configurations.
Files are written to the capture dir with pattern:
req-{pid}-{seq}-{timestamp}-{source}.json
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rtifacts - Bump Dockerfile base image from oven/bun:1.2-alpine to 1.3-alpine (pending Docker Hub egress fix for rebuild validation) - Switch SEARXNG_URL from IP:port to search.myia.io DNS name - Add .gitignore entries for diagnostic capture/, trace files, deploy notes, and monitoring scripts Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…andling, and diagnostic capture Document three new subsystems: - Web Search Interception (v7.1+): SearXNG backend, two interception paths (tool_call suppression + GLM searchWeb tags), sub-agent handling - Inline System Message Handling: strip role:system for Anthropic-transport providers (Claude Code v2.1.153+ compatibility) - Diagnostic Body Capture: CLAUDISH_CAPTURE_DIR env-gated full request body capture for offline hang reproduction Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…de capture The finalize() function in openai-sse.ts referenced opts.modelName which does not exist in that scope (the parameter is target). This caused a ReferenceError thrown AFTER state.finalized=true but BEFORE controller.close(). The catch block re-invoked finalize() which returned immediately (already finalized), swallowing the error and leaving the ReadableStream permanently open - the client HTTP connection hung forever. Fix: replace opts.modelName with target (lines 160-161, 350). Also adds response-side SSE capture infrastructure (gated by CLAUDISH_CAPTURE_DIR env var, no-op when unset): - response-capture.ts: new diagnostic helper - Wired into openai-sse, anthropic-sse, and native-handler - request-logger: add machine= tag from x-claudish-machine header Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… non-native providers ComposedHandler is never used for native Anthropic (that's NativeHandler), so every request here goes to a provider that doesn't understand Anthropic thinking signatures. Previously, Opus thinking blocks (with signatures) flowed through to Z.AI/GLM/MiniMax/etc. unmodified, causing either silent corruption or API rejection on subsequent turns. The strip removes thinking blocks (type: 'thinking') from assistant messages in the request payload before forwarding to the provider. Native Anthropic requests via NativeHandler are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
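The strip described here might look roughly like the sketch below. The `ContentBlock` and `ChatMessage` types and the function name `stripThinkingBlocks` are assumptions for illustration, not the fork's actual definitions.

```typescript
// Hypothetical, simplified message types for illustration.
type ContentBlock = { type: string; [key: string]: unknown };
type ChatMessage = { role: "user" | "assistant"; content: string | ContentBlock[] };

function stripThinkingBlocks(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m) => {
    // Only assistant messages with block content can carry thinking blocks.
    if (m.role !== "assistant" || typeof m.content === "string") return m;
    return { ...m, content: m.content.filter((b) => b.type !== "thinking") };
  });
}
```

One case the sketch does not handle: an assistant message that contained only thinking blocks ends up with an empty content array, which a real implementation would likely need to drop or pad.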
…TTP 200 + [1302])
Root cause of the recurring GLM dropouts ("empty or malformed response (HTTP 200)"
crashes that required manual "continue"/"reprends" recovery): Z.AI delivers its
instantaneous burst / RPM limit as **HTTP 200** followed by an in-stream SSE error
([1302]) carrying no message envelope. That escapes both `response.ok` and
FallbackHandler (both key off HTTP status), so it reached Claude Code as a bare
error and crashed the turn. Confirmed via CLAUDISH_CAPTURE_DIR body captures: the
errors are instant (0-4ms), not timeouts — large context correlates with crashes
via request bursting/concurrency, not duration.
Three proportionate, fail-open layers (no shared adaptive pacer — see ROADMAP):
1. Parser safety-net (anthropic-sse.ts): finalizeWithError() always emits a clean
terminal message (synthetic message_start if needed + message_delta end_turn +
message_stop) and suppresses bare `event: error`, so a stream error can never
again surface as a raw crash. Rate-limit errors get a friendly retry notice.
2. Transport backoff (anthropic-compat.ts): AnthropicProviderTransport.enqueueRequest
retries genuine HTTP 429 with jittered exponential backoff (2/4/8/16/30s + 0-1s).
3. In-stream peek + temporize (stream-peek.ts + composed-handler.ts): peek the first
bytes of anthropic-sse responses (errors arrive in 0-4ms, so a bounded ~2s window
catches them without delaying slow-but-healthy big-context responses), and on a
[1302] retry the SAME provider with jittered backoff (1/2/4s + 0-750ms) up to 3x
— the sustained quota has headroom, only the burst limit is hit. On persistence,
return 429 so FallbackHandler switches provider (Z.AI -> GLM Coding, separate
quota). Uses ReadableStream.tee() so no bytes are lost downstream; fails open on
any unexpected condition.
Jitter throughout de-synchronizes the 6-machine cluster (one shared key per provider)
so retries don't re-collide on the burst limit.
Also enables CLAUDISH_CAPTURE_DIR in docker-compose for ongoing production diagnosis.
Tests: stream-peek.test.ts (5, full-byte passthrough + classification + fail-open),
anthropic-compat 429-retry (2), MadAppGang#106 regression rewritten to the no-crash contract +
start-of-stream rate-limit fixture/test. Build + tsc clean.
ROADMAP: adaptive/shared AIMD pacer deferred with explicit trigger conditions
(per-request jittered backoff judged sufficient for current load).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
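The two backoff schedules quoted in this commit (2/4/8/16/30s + 0-1s for the transport, 1/2/4s + 0-750ms for the in-stream retry) are consistent with a capped exponential plus uniform jitter. A minimal sketch under that assumption; the function name and parameters are hypothetical, and the `Retry-After` handling is omitted.

```typescript
// Hypothetical helper: capped exponential delay plus uniform jitter.
// `random` is injectable so the schedule can be checked deterministically.
function jitteredBackoffMs(
  attempt: number, // 0-based retry attempt
  baseMs: number,
  capMs: number,
  maxJitterMs: number,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(baseMs * 2 ** attempt, capMs);
  return exponential + Math.floor(random() * maxJitterMs);
}

// Base delays with jitter forced to zero:
const transportBase = [0, 1, 2, 3, 4].map((a) => jitteredBackoffMs(a, 2000, 30000, 1000, () => 0));
// → [2000, 4000, 8000, 16000, 30000]
const inStreamBase = [0, 1, 2].map((a) => jitteredBackoffMs(a, 1000, 4000, 750, () => 0));
// → [1000, 2000, 4000]
```

With the default `Math.random`, each machine in the cluster draws a different offset, which is what keeps simultaneous retries from landing on the burst limit together.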
The in-stream rate-limit log calls (ComposedHandler peek+retry, and the anthropic-sse safety-net finalizer) used plain log() with no forceConsole flag. In the Docker deployment the always-on log path is a read-only mount and --debug is off, so these operational events were invisible to `docker logs`. Add forceConsole=true and a greppable `[RateLimit]` prefix to all four call sites so burst events can be observed live. Purely additive logging — no behavior change. 14/14 rate-limit tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Recurring GLM dropouts: Claude Code crashes with "empty or malformed response (HTTP 200)" mid-turn, requiring a manual `continue` / `reprends stp` to recover. Every one of those manual recoveries follows the same client-side crash.
Root cause (confirmed via `CLAUDISH_CAPTURE_DIR` captures)
Z.AI delivers its instantaneous burst / RPM limit as HTTP 200 followed by an in-stream SSE error (`[1302]`) that carries no message envelope. Because both `response.ok` and `FallbackHandler` key off HTTP status, a 200-with-in-stream-error is invisible to them: it flowed straight to Claude Code as a bare error and crashed the turn.
Key finding: the errors are instant (0–4 ms), not timeouts. The user's intuition that large context → more crashes is real but indirect: big context drives request bursting/concurrency (and cache misses), which trips the burst limit. It is not a duration/TTFB problem. Condensing helps by reducing request volume.
The 5h sustained quota has headroom (≤50% even at peak rush); only the instantaneous burst limit is hit, occasionally.
Fix — three proportionate, fail-open layers
1. Parser safety-net (`anthropic-sse.ts`): `finalizeWithError()` always emits a clean terminal message (synthetic `message_start` if none was seen, + `message_delta` `end_turn` + `message_stop`) and suppresses bare `event: error`. A stream error can no longer surface as a raw client crash; rate-limit errors get a friendly retry notice.
2. Transport backoff (`anthropic-compat.ts`): `AnthropicProviderTransport.enqueueRequest` retries genuine HTTP 429 with jittered exponential backoff (2/4/8/16/30s + 0–1s, honoring `Retry-After`).
3. In-stream peek + temporize (`stream-peek.ts` + `composed-handler.ts`): peek the first bytes of `anthropic-sse` responses via `ReadableStream.tee()` (no bytes lost downstream; a bounded ~2s window catches the 0–4ms error without delaying slow-but-healthy big-context responses). On `[1302]`, temporize with jittered backoff (1/2/4s + 0–750ms) and retry the same provider up to 3× (the sustained quota has headroom). On persistence, return 429 so `FallbackHandler` switches provider (Z.AI → GLM Coding, a separate quota). Fails open on any unexpected condition.
Jitter throughout de-synchronizes the 6-machine cluster (one shared key per provider) so retries don't re-collide on the burst limit.
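To make the classification step of the peek layer concrete, here is a small TypeScript sketch that works on the already-decoded prefix text. It assumes the burst-limit error arrives as an `event: error` SSE frame whose body contains the literal `[1302]`; the exact payload shape, the `PeekVerdict` type, and the function name are assumptions, and the `tee()`/timeout plumbing is left out.

```typescript
// Hypothetical verdicts; "undecided" means no complete non-ping frame yet.
type PeekVerdict = "healthy" | "rate-limit" | "other-error" | "undecided";

function classifyPeekedPrefix(prefix: string): PeekVerdict {
  // SSE frames are separated by a blank line; the last split element is
  // either empty or a partial frame, so it is never classified.
  const completeFrames = prefix.split(/\r?\n\r?\n/).slice(0, -1);
  for (const frame of completeFrames) {
    const eventLine = frame.split(/\r?\n/).find((line) => line.startsWith("event:"));
    if (eventLine === undefined) continue;
    const name = eventLine.slice("event:".length).trim();
    if (name === "ping") continue; // leading pings carry no verdict
    if (name === "error") {
      return frame.includes("[1302]") ? "rate-limit" : "other-error";
    }
    return "healthy"; // any other first event means the stream started normally
  }
  return "undecided";
}
```

A caller would presumably treat `undecided` at the end of the peek window the same as `healthy`, in keeping with the fail-open behavior described above.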
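Also enables `CLAUDISH_CAPTURE_DIR` in `docker-compose` for ongoing production diagnosis.
On "shared/adaptive backoff"
The design is per-request jittered exponential backoff, intentionally not a shared AIMD pacer. More 429s → more requests back off → offered load drops (implicit aggregate self-throttling), which is sufficient for the stated load. A coordinated pacer that makes new arrivals probe more cautiously during a storm is the correct next step only if the headroom premise stops holding; it is deferred to `ROADMAP.md` with explicit trigger conditions to avoid over-engineering.
Tests
- `stream-peek.test.ts` (5): full-byte passthrough on healthy, multi-read past leading ping, `[1302]` → rate-limit, non-RL → other-error (still streamed), no-body → fail-open.
- `anthropic-compat.test.ts` (+2): 429 retry-then-success; non-429 no-retry.
- `format-translation.test.ts`: #106 regression ("[bug] error handling: undefined is not an object") rewritten to the no-crash contract + new start-of-stream rate-limit fixture/test.
- `tsc --noEmit` clean. (One unrelated pre-existing `MiniMaxModelDialect` context-window test failure from the Firebase catalog migration, not touched here.)
🤖 Generated with Claude Code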
stream-peek.test.ts(5): full-byte passthrough on healthy, multi-read past leading ping,[1302]→rate-limit, non-RL→other-error (still streamed), no-body→fail-open.anthropic-compat.test.ts(+2): 429 retry-then-success; non-429 no-retry.format-translation.test.ts: [bug] error handling: undefined is not an object #106 regression rewritten to the no-crash contract + new start-of-stream rate-limit fixture/test.tsc --noEmitclean. (One unrelated pre-existingMiniMaxModelDialectcontext-window test failure from the Firebase catalog migration — not touched here.)🤖 Generated with Claude Code