
fix(rate-limit): gracefully handle GLM/Z.AI in-stream burst limits (HTTP 200 + [1302])#138

Closed
jsboige wants to merge 23 commits into MadAppGang:main from jsboige:fix/glm-instream-rate-limit

Conversation

jsboige commented Jun 2, 2026

Problem

Recurring GLM dropouts: Claude Code crashes mid-turn with "empty or malformed response (HTTP 200)", and the session only recovers after a manual "continue" / "reprends stp" (French for "resume, please"). Every one of those manual recoveries follows the same client-side crash.

Root cause (confirmed via CLAUDISH_CAPTURE_DIR captures)

Z.AI delivers its instantaneous burst / RPM limit as HTTP 200 followed by an in-stream SSE error ([1302]) that carries no message envelope. Because both response.ok and FallbackHandler key off HTTP status, a 200-with-in-stream-error is invisible to them: it flows straight to Claude Code as a bare error and crashes the turn.

Key finding: the errors are instant (0–4 ms), not timeouts. The user's intuition that large context → more crashes is real but indirect — big context drives request bursting/concurrency (and cache misses), which trips the burst limit. It is not a duration/TTFB problem. Condensing helps by reducing request volume.

The 5h sustained quota has headroom (≤50% even at peak rush); only the instantaneous burst limit is hit, occasionally.

Fix — three proportionate, fail-open layers

  1. Parser safety-net (anthropic-sse.ts): finalizeWithError() always emits a clean terminal message (synthetic message_start if none was seen, + message_delta end_turn + message_stop) and suppresses bare event: error. A stream error can no longer surface as a raw client crash; rate-limit errors get a friendly retry notice.

  2. Transport backoff (anthropic-compat.ts): AnthropicProviderTransport.enqueueRequest retries genuine HTTP 429 with jittered exponential backoff (2/4/8/16/30s + 0–1s, honoring Retry-After).

  3. In-stream peek + temporize (stream-peek.ts + composed-handler.ts): peek the first bytes of anthropic-sse responses via ReadableStream.tee() (no bytes lost downstream; bounded ~2s window catches the 0–4ms error without delaying slow-but-healthy big-context responses). On [1302], temporize with jittered backoff (1/2/4s + 0–750ms) and retry the same provider up to 3× (the sustained quota has headroom). On persistence, return 429 so FallbackHandler switches provider (Z.AI → GLM Coding — a separate quota). Fails open on any unexpected condition.
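
The peek step in layer 3 can be sketched roughly as follows. This is a minimal illustration, not the code in stream-peek.ts: the names `peekStream`, `classifyEvent`, and `firstDecisiveEvent` are hypothetical, and treating a non-ping `event: error` block containing `[1302]` as the burst-limit signal is an assumption drawn from the description above.

```typescript
type PeekVerdict = "healthy" | "rate-limit" | "other-error";

// Classify one complete SSE event block (assumed shape: "event: ...\ndata: ...").
function classifyEvent(block: string): PeekVerdict {
  if (!/^event:\s*error\s*$/m.test(block)) return "healthy";
  return block.includes("[1302]") ? "rate-limit" : "other-error";
}

// Return the first complete event that is not a ping, or null if none has arrived yet.
function firstDecisiveEvent(buffer: string): string | null {
  const blocks = buffer.split(/\r?\n\r?\n/);
  blocks.pop(); // the trailing piece may still be incomplete
  for (const block of blocks) {
    if (!/^event:\s*ping\s*$/m.test(block)) return block;
  }
  return null;
}

// Peek at the start of a stream without consuming it for the downstream reader.
async function peekStream(
  body: ReadableStream<Uint8Array>,
  windowMs = 2000, // bounded peek window (assumed default)
): Promise<{ verdict: PeekVerdict; stream: ReadableStream<Uint8Array> }> {
  const [peekBranch, passthrough] = body.tee(); // passthrough keeps every byte
  const reader = peekBranch.getReader();
  const decoder = new TextDecoder();
  let seen = "";
  let ended = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), windowMs);
  });
  try {
    while (firstDecisiveEvent(seen) === null) {
      const next = await Promise.race([reader.read(), timeout]);
      if (next === "timeout") break; // slow but possibly healthy: fail open
      if (next.done) {
        ended = true;
        break;
      }
      seen += decoder.decode(next.value, { stream: true });
    }
  } finally {
    clearTimeout(timer);
    reader.cancel().catch(() => {}); // stop buffering on the peek branch only
  }
  // If the stream ended, treat whatever remains as a complete event.
  const decisive = firstDecisiveEvent(ended ? seen + "\n\n" : seen);
  return { verdict: decisive === null ? "healthy" : classifyEvent(decisive), stream: passthrough };
}
```

Because only the peek branch is cancelled, the returned `stream` still yields the response from its first byte, so a healthy response can be handed to the parser unchanged.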

Jitter throughout de-synchronizes the 6-machine cluster (one shared key per provider) so retries don't re-collide on the burst limit.
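
The two backoff schedules named above can be expressed with one small helper. This is a sketch under stated assumptions: `backoffDelayMs` and the two option constants are hypothetical names, and "honoring Retry-After" is interpreted here as never waiting less than the header value, which may differ from what the PR actually does.

```typescript
interface BackoffOptions {
  scheduleMs: number[]; // base delay per attempt; the last entry is the cap
  jitterMs: number; // upper bound (exclusive) of the random jitter
}

// Schedules taken from the PR description.
const TRANSPORT_429: BackoffOptions = { scheduleMs: [2000, 4000, 8000, 16000, 30000], jitterMs: 1000 };
const IN_STREAM_1302: BackoffOptions = { scheduleMs: [1000, 2000, 4000], jitterMs: 750 };

// Delay before retry number `attempt` (0-based).
// `random` is injectable so the jitter can be pinned in tests.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  retryAfterSeconds?: number,
  random: () => number = Math.random,
): number {
  const idx = Math.min(Math.max(attempt, 0), opts.scheduleMs.length - 1);
  const base = opts.scheduleMs[idx] ?? 0;
  // Assumption: a Retry-After header acts as a lower bound on the base delay.
  const floor = retryAfterSeconds !== undefined ? Math.max(base, retryAfterSeconds * 1000) : base;
  return floor + Math.floor(random() * opts.jitterMs);
}
```

Each machine draws its own jitter, so six clients that hit the limit in the same millisecond are unlikely to retry in the same millisecond.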

Also enables CLAUDISH_CAPTURE_DIR in docker-compose for ongoing production diagnosis.

On "shared/adaptive backoff"

The design is per-request jittered exponential backoff, intentionally not a shared AIMD pacer. More 429s → more requests back off → offered load drops (implicit aggregate self-throttling), which is sufficient for the stated load. A coordinated pacer that makes new arrivals probe more cautiously during a storm is the correct next step only if the headroom premise stops holding — deferred to ROADMAP.md with explicit trigger conditions to avoid over-engineering.

Tests

  • stream-peek.test.ts (5): full-byte passthrough on healthy, multi-read past leading ping, [1302]→rate-limit, non-RL→other-error (still streamed), no-body→fail-open.
  • anthropic-compat.test.ts (+2): 429 retry-then-success; non-429 no-retry.
  • format-translation.test.ts: the regression test for #106 ("[bug] error handling: undefined is not an object") is rewritten to the no-crash contract, plus a new start-of-stream rate-limit fixture/test.
  • Build + tsc --noEmit clean. (One unrelated pre-existing MiniMaxModelDialect context-window test failure from the Firebase catalog migration — not touched here.)

🤖 Generated with Claude Code

jsboigeEpita and others added 23 commits May 14, 2026 01:09
…creds

- Proxy auth middleware validates authorization/x-api-key/x-proxy-key
  headers against proxyKey from config (no custom header needed)
- NativeHandler falls back to stored ANTHROPIC_API_KEY when client
  doesn't provide auth (cluster scenario)
- Custom endpoints registered at runtime pass credential checks
- loadConfig() preserves proxyKey field
- Proxy binds to configurable hostname (0.0.0.0 for Docker)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OAuth tokens (sk-ant-oat01-) are rejected by Anthropic when sent as
x-api-key. Detect the prefix and use authorization: Bearer instead.
Also pass proxy key to NativeHandler so it can distinguish proxy auth
from genuine client tokens.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
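
The prefix check described in the commit above might look like this. It is only a sketch: `anthropicAuthHeaders` is a hypothetical name, and the real handler presumably also deals with proxy-key substitution, which is omitted here.

```typescript
const OAUTH_PREFIX = "sk-ant-oat01-";

// Choose the auth header for an upstream Anthropic request based on the token type.
function anthropicAuthHeaders(token: string): Record<string, string> {
  return token.startsWith(OAUTH_PREFIX)
    ? { authorization: `Bearer ${token}` } // OAuth tokens are rejected as x-api-key
    : { "x-api-key": token }; // regular API keys
}
```
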
…tion

Forward ALL incoming headers to api.anthropic.com instead of selectively
forwarding only auth/beta headers. Claude Code sends internal headers that
make Max subscription work for Opus/Sonnet — the proxy must not drop them.

Proxy key override still works: when proxyKey is configured and matches
client auth, it's replaced with the stored Anthropic key. When no proxyKey
is set (pass-through mode), everything flows through unmodified.

Same change applied to the count_tokens endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The standalone proxy now reads the default profile from ~/.claudish/config.json
and passes its model mapping to createProxyServer(). This means:
- opus/sonnet/haiku role requests get remapped to the profile's models
- Profile "default" maps: opus→glm-5.1, sonnet→glm-4.7, haiku→glm-4.7-flash
- Without a profile mapping, all models pass through unchanged

Startup logs now show the active profile name and role mappings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…elMap

- Add /v1/models endpoint for Claude Code gateway discovery
  (CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1)
- Returns all models from config.json routing rules + custom endpoints
- Standalone proxy passes no modelMap — every model routes via config rules
- No role remapping: opus→claude-opus-4-7, sonnet→glm-5.1, haiku→qwen3.6

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ropic providers

Claude Code injects `x-anthropic-billing-header: cc_version=...; cch=XXXXX;`
into the system prompt body. The cch= token changes every request, which
invalidates vLLM prefix caching (strict block hash). Strip this line from
the prompt for all non-Anthropic handlers — only Anthropic uses it.

Expected impact: vLLM TTFT from 30-67s → 1-3s on conversation turns 2+.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
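
A minimal sketch of the strip described in the commit above, assuming the billing header occupies its own line of the system prompt text. `stripBillingHeader` is a hypothetical name, not necessarily the function in the fork.

```typescript
const BILLING_PREFIX = "x-anthropic-billing-header:";

// Drop the per-request billing line so the prompt prefix is byte-stable across turns.
function stripBillingHeader(systemPrompt: string): string {
  return systemPrompt
    .split("\n")
    .filter((line) => !line.trimStart().startsWith(BILLING_PREFIX))
    .join("\n");
}
```

With the `cch=` token gone, two consecutive requests that share a system prompt produce identical prefixes, which is what a block-hash prefix cache needs in order to hit.
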
- Dockerfile: Bun alpine, standalone proxy, healthcheck
- docker-compose.yml: volume mount for config, port 3000
- .dockerignore: exclude traces, node_modules, dist
- start-claudish-proxy.ps1: Windows service wrapper with logging

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Direct connections (no IIS ARR) had no x-forwarded-for/x-real-ip
headers, so all LAN machines appeared as "direct" in logs. Capture
remote address via Bun's server context and include in request logging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…utocompact

All non-Anthropic stream parsers (openai-sse, gemini-sse, ollama-jsonl,
openai-responses-sse) hardcoded input_tokens: 100 in message_start and
omitted input_tokens from message_delta. This broke Claude Code's
autocompact because tokenCountWithEstimation() accumulated usage from
the SSE stream — seeing input_tokens=100 forever meant the context
appeared nearly empty, so condensation never triggered.

Changes:
- message_start: input_tokens: 100 → 0 (placeholder, replaced by delta)
- message_delta: include actual input_tokens from provider response
- openai-sse fallback: remove hardcoded 100 in onTokenUpdate

Affected models: GLM-5.1, Qwen, Gemini, Ollama, Codex — all non-Anthropic
providers using these parsers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract all fork-specific code (proxy auth, model discovery, billing
header strip, request logger, hostname binding, standalone proxy) into
dedicated modules under packages/cli/src/fork/. proxy-server.ts drops
from 715 to 597 lines, with fork diff now ~30 lines (import + registration
call). This makes future upstream syncs straightforward — fork code is
visible in one directory and easily rebased.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e errors

When thinking blocks are suppressed, subsequent content block indices can
jump ahead of what the client expects. Clamp indices to highestSeen+1 and
log the adjustment for debugging. Also pass CLAUDISH_PROXY_KEY via
docker-compose environment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
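
The clamping rule in the commit above can be illustrated as follows. This is a sketch, not the fork's code: `IndexClamper` is a hypothetical name, it assumes upstream block indices only increase, and the map that keeps later deltas of the same block on the same client index is an added assumption.

```typescript
// Remap upstream content-block indices so the client never sees a gap
// left by suppressed blocks (for example, hidden thinking blocks).
class IndexClamper {
  private highestSeen = -1;
  private readonly assigned = new Map<number, number>();

  remap(upstreamIndex: number): number {
    const known = this.assigned.get(upstreamIndex);
    if (known !== undefined) return known; // same block, same client index
    const clamped = Math.min(upstreamIndex, this.highestSeen + 1);
    this.highestSeen = Math.max(this.highestSeen, clamped);
    this.assigned.set(upstreamIndex, clamped);
    return clamped;
  }
}
```
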
Some Anthropic-compat providers (z.ai, GLM) send message_delta with
stop_reason but never emit the message_stop event. Claude Code requires
message_stop as the terminal event — without it, the client reports
"API returned an empty or malformed response (HTTP 200)".

Track sawMessageStop separately from stopReason so the synthetic
finalization fires even when stopReason was received but message_stop
was not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
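
The finalization rule in the commit above (emit whichever terminal events the provider never sent) could be sketched like this. `syntheticTail`, `StreamState`, and the `msg_synthetic` id are hypothetical; the event payloads follow the general shape of Anthropic streaming events but are simplified.

```typescript
interface StreamState {
  sawMessageStart: boolean;
  stopReason: string | null; // set when a message_delta carried stop_reason
  sawMessageStop: boolean; // tracked separately from stopReason
}

function sse(event: string, data: object): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Return the events still needed to give the client a well-formed ending.
function syntheticTail(state: StreamState, model: string): string[] {
  const out: string[] = [];
  if (!state.sawMessageStart) {
    out.push(sse("message_start", {
      type: "message_start",
      message: {
        id: "msg_synthetic", type: "message", role: "assistant", content: [], model,
        stop_reason: null, stop_sequence: null,
        usage: { input_tokens: 0, output_tokens: 0 },
      },
    }));
  }
  if (state.stopReason === null) {
    out.push(sse("message_delta", {
      type: "message_delta",
      delta: { stop_reason: "end_turn", stop_sequence: null },
      usage: { output_tokens: 0 },
    }));
  }
  if (!state.sawMessageStop) out.push(sse("message_stop", { type: "message_stop" }));
  return out;
}
```

Keying the last step on `sawMessageStop` rather than `stopReason` is what covers the case where the provider sent a stop reason but never sent `message_stop`.
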
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the "web search unsupported" warning with actual execution via
a local SearXNG instance. Two interception paths:

1. Structured tool_call: WebSearch/WebFetch tool calls are suppressed
   from the stream and replaced with SearXNG results as a text block.
2. GLM <searchWeb> tags: GLM models emit <searchWeb><query>...</query>
   </searchWeb> in text — these are intercepted at finalize and replaced
   with SearXNG results.

Also intercepts sub-agent web search/fetch requests at the proxy level
(Claude Code sends "Perform a web search for the query: X" as a single
user message).

Additional improvements:
- ToolState.suppressed flag for clean tool call suppression
- Empty-response error handling: providers returning no content now emit
  a structured api_error event instead of a malformed empty message
- Finalize logging with model, text length, tool count
- Proxy logs tool names in incoming requests for debug

Requires SEARXNG_URL env var. Graceful fallback when unavailable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
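
For the second interception path in the commit above (GLM `<searchWeb>` tags in text), the extraction step might be sketched as below. `extractSearchQueries` is a hypothetical name; the actual SearXNG call and the substitution of results are left out, and the tolerance for whitespace between tags is an assumption.

```typescript
const SEARCH_TAG = /<searchWeb>\s*<query>([\s\S]*?)<\/query>\s*<\/searchWeb>/g;

// Pull the queries out of the model text and return the text without the tags.
function extractSearchQueries(text: string): { queries: string[]; stripped: string } {
  const queries: string[] = [];
  const stripped = text.replace(SEARCH_TAG, (_match: string, query: string) => {
    queries.push(query.trim());
    return "";
  });
  return { queries, stripped };
}
```
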
…responses

Z.AI/GLM providers emit server_tool_use content blocks that the Anthropic
SDK does not understand (it expects only text, tool_use, thinking). These
blocks are now silently suppressed from the stream with index tracking.

When the stream ends without a message_start (empty response from provider),
emit a synthetic message sequence (message_start → text block with error
message → message_stop) so the client SDK receives a well-formed response
instead of a malformed empty one.

Also includes minor whitespace reformatting for consistency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nsport providers

Claude Code v2.1.153+ injects role:"system" messages inline in the
messages array (e.g. system-reminders). Anthropic-compatible providers
(Z.AI, MiniMax, Kimi) reject role:"system" in messages — only
"user"/"assistant" are accepted.

The fix strips inline system messages from the payload and merges their
content into the top-level system prompt field. Applied at two levels:

1. ComposedHandler: strips from requestPayload.messages for
   anthropic-sse transport
2. openai-messages converter: handles the OpenAI format path by merging
   into the system message at index 0

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
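
The merge described in the commit above can be sketched as follows for the anthropic-sse path. The types are simplified (message content is assumed to be a plain string, whereas real payloads may carry block arrays), and `hoistInlineSystem` is a hypothetical name.

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

interface RequestPayload {
  system?: string;
  messages: ChatMessage[];
}

// Move inline role:"system" messages into the top-level system field.
function hoistInlineSystem(payload: RequestPayload): RequestPayload {
  const inline = payload.messages.filter((m) => m.role === "system").map((m) => m.content);
  if (inline.length === 0) return payload; // nothing to do
  const system = [payload.system, ...inline]
    .filter((s): s is string => Boolean(s))
    .join("\n\n");
  return { ...payload, system, messages: payload.messages.filter((m) => m.role !== "system") };
}
```
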
Add CLAUDISH_CAPTURE_DIR env-gated diagnostic capture that writes the
full request body to JSON files. Disabled by default (no-op when env
var is unset), so zero overhead in normal production use.

Purpose: when a hang or malformed response is reported, enable the env
var to capture the exact request payload for offline replay against
different bun versions or configurations.

Files are written to the capture dir with pattern:
  req-{pid}-{seq}-{timestamp}-{source}.json

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rtifacts

- Bump Dockerfile base image from oven/bun:1.2-alpine to 1.3-alpine
  (pending Docker Hub egress fix for rebuild validation)
- Switch SEARXNG_URL from IP:port to search.myia.io DNS name
- Add .gitignore entries for diagnostic capture/, trace files,
  deploy notes, and monitoring scripts

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…andling, and diagnostic capture

Document three new subsystems:
- Web Search Interception (v7.1+): SearXNG backend, two interception
  paths (tool_call suppression + GLM searchWeb tags), sub-agent handling
- Inline System Message Handling: strip role:system for Anthropic-transport
  providers (Claude Code v2.1.153+ compatibility)
- Diagnostic Body Capture: CLAUDISH_CAPTURE_DIR env-gated full request
  body capture for offline hang reproduction

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…de capture

The finalize() function in openai-sse.ts referenced opts.modelName which
does not exist in that scope (the parameter is target). This caused a
ReferenceError thrown AFTER state.finalized=true but BEFORE
controller.close(). The catch block re-invoked finalize() which returned
immediately (already finalized), swallowing the error and leaving the
ReadableStream permanently open - the client HTTP connection hung forever.

Fix: replace opts.modelName with target (lines 160-161, 350).

Also adds response-side SSE capture infrastructure (gated by
CLAUDISH_CAPTURE_DIR env var, no-op when unset):
- response-capture.ts: new diagnostic helper
- Wired into openai-sse, anthropic-sse, and native-handler
- request-logger: add machine= tag from x-claudish-machine header

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… non-native providers

ComposedHandler is never used for native Anthropic (that's NativeHandler),
so every request here goes to a provider that doesn't understand Anthropic
thinking signatures. Previously, Opus thinking blocks (with signatures)
flowed through to Z.AI/GLM/MiniMax/etc. unmodified, causing either silent
corruption or API rejection on subsequent turns.

The strip removes thinking blocks (type: 'thinking') from assistant messages
in the request payload before forwarding to the provider. Native Anthropic
requests via NativeHandler are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
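
A minimal sketch of the strip described in the commit above, under simplified types. `stripThinkingBlocks` is a hypothetical name, and the block shape is reduced to a `type` field plus arbitrary extra keys.

```typescript
type ContentBlock = { type: string; [key: string]: unknown };

interface Message {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

// Remove thinking blocks from assistant turns before forwarding to a
// provider that does not understand Anthropic thinking signatures.
function stripThinkingBlocks(messages: Message[]): Message[] {
  return messages.map((m) =>
    m.role === "assistant" && Array.isArray(m.content)
      ? { ...m, content: m.content.filter((block) => block.type !== "thinking") }
      : m,
  );
}
```
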
…TTP 200 + [1302])

Root cause of the recurring GLM dropouts ("empty or malformed response (HTTP 200)"
crashes that required manual "continue"/"reprends" recovery): Z.AI delivers its
instantaneous burst / RPM limit as **HTTP 200** followed by an in-stream SSE error
([1302]) carrying no message envelope. That escapes both `response.ok` and
FallbackHandler (both key off HTTP status), so it reached Claude Code as a bare
error and crashed the turn. Confirmed via CLAUDISH_CAPTURE_DIR body captures: the
errors are instant (0-4ms), not timeouts — large context correlates with crashes
via request bursting/concurrency, not duration.

Three proportionate, fail-open layers (no shared adaptive pacer — see ROADMAP):

1. Parser safety-net (anthropic-sse.ts): finalizeWithError() always emits a clean
   terminal message (synthetic message_start if needed + message_delta end_turn +
   message_stop) and suppresses bare `event: error`, so a stream error can never
   again surface as a raw crash. Rate-limit errors get a friendly retry notice.

2. Transport backoff (anthropic-compat.ts): AnthropicProviderTransport.enqueueRequest
   retries genuine HTTP 429 with jittered exponential backoff (2/4/8/16/30s + 0-1s).

3. In-stream peek + temporize (stream-peek.ts + composed-handler.ts): peek the first
   bytes of anthropic-sse responses (errors arrive in 0-4ms, so a bounded ~2s window
   catches them without delaying slow-but-healthy big-context responses), and on a
   [1302] retry the SAME provider with jittered backoff (1/2/4s + 0-750ms) up to 3x
   — the sustained quota has headroom, only the burst limit is hit. On persistence,
   return 429 so FallbackHandler switches provider (Z.AI -> GLM Coding, separate
   quota). Uses ReadableStream.tee() so no bytes are lost downstream; fails open on
   any unexpected condition.

Jitter throughout de-synchronizes the 6-machine cluster (one shared key per provider)
so retries don't re-collide on the burst limit.

Also enables CLAUDISH_CAPTURE_DIR in docker-compose for ongoing production diagnosis.

Tests: stream-peek.test.ts (5, full-byte passthrough + classification + fail-open),
anthropic-compat 429-retry (2), MadAppGang#106 regression rewritten to the no-crash contract +
start-of-stream rate-limit fixture/test. Build + tsc clean.

ROADMAP: adaptive/shared AIMD pacer deferred with explicit trigger conditions
(per-request jittered backoff judged sufficient for current load).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The in-stream rate-limit log calls (ComposedHandler peek+retry, and the
anthropic-sse safety-net finalizer) used plain log() with no forceConsole
flag. In the Docker deployment the always-on log path is a read-only mount
and --debug is off, so these operational events were invisible to
`docker logs`. Add forceConsole=true and a greppable `[RateLimit]` prefix
to all four call sites so burst events can be observed live.

Purely additive logging — no behavior change. 14/14 rate-limit tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jsboige jsboige closed this Jun 8, 2026
@jsboige jsboige deleted the fix/glm-instream-rate-limit branch June 8, 2026 13:13

2 participants