Reject concurrent chat runs on the same conversation with 409 #51
Merged

mgoldsborough merged 2 commits into main on Apr 20, 2026
Conversation
When a user sends a new message while an agent turn on the same conversation is still in flight, the runtime currently starts a second run on overlapping state. The Anthropic API then rejects the call with "model does not support assistant message prefill", because the serialized history ends with in-progress assistant content. Observed in production on conv_ce7995f740af44d4 as five back-to-back prefill failures while one long-running turn (27 minutes of image/Typst thrash) held the conversation.

This adds an in-memory Set of conversation IDs with an in-flight chat() call to Runtime. A second concurrent call on the same conversation throws RunInProgressError; /v1/chat and /v1/chat/stream surface it as HTTP 409 run_in_progress. The web client drops its optimistic placeholders and shows a clear "assistant is still working" banner, so the user's draft is preserved for retry.

Out of scope: queueing deferred messages, canceling the in-flight run, or any scheduler changes. Those land in a follow-up; this is the smallest change that stops the bleeding.
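The guard described above can be sketched roughly as below. This is a minimal illustration, not the PR's actual diff — the `chat()` signature and the `activeConversations` field name are assumptions based on the description:

```typescript
// Thrown when a second chat() call races an in-flight run on the same
// conversation. The HTTP layer maps this to 409 run_in_progress.
export class RunInProgressError extends Error {
  constructor(public readonly conversationId: string) {
    super(`run already in progress for conversation ${conversationId}`);
    this.name = "RunInProgressError";
  }
}

export class Runtime {
  // Single-replica assumption: all traffic for a conversation lands on this
  // pod, so an in-memory Set is a sufficient lock. Revisit if replicas > 1.
  private activeConversations = new Set<string>();

  async chat(conversationId: string, run: () => Promise<void>): Promise<void> {
    if (this.activeConversations.has(conversationId)) {
      throw new RunInProgressError(conversationId);
    }
    this.activeConversations.add(conversationId);
    try {
      await run();
    } finally {
      // Release even if the run throws, so the conversation never wedges.
      this.activeConversations.delete(conversationId);
    }
  }
}
```

The `finally` release is the important part: without it, a crashed run would leave the conversation permanently locked.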
- Defer the user.message broadcast on the stream path until the engine emits chat.start. This prevents a phantom broadcast to other participants of a shared conversation when the request loses a concurrency race and is rejected before the run actually starts.
- Capture web.chat_run_in_progress telemetry so we can measure whether the fix stops the bleeding in production.
- Add HTTP-level tests for /v1/chat/stream covering both rejection paths: the pre-check 409 (held open deterministically via a gated mock model) and the concurrent-requests invariant (≥1 winner, every rejected request carries the stable run_in_progress code).
- Document the single-replica assumption on the activeConversations lock; if a tenant is ever scaled past 1 replica, the invariant needs to move to a shared store.
Summary
Runtime.chat(): a second concurrent call on the same conversationId rejects with RunInProgressError instead of starting a duplicate run on in-flight state. POST /v1/chat and POST /v1/chat/stream surface the rejection as HTTP 409 {error: "run_in_progress"}. The stream path also pre-checks, so the client gets an HTTP status code, not an SSE error mid-stream. For shared conversations, the user.message broadcast to other participants is deferred until the engine emits chat.start, so a rejected request never broadcasts a phantom message with no assistant reply. The web client maps the 409 through formatSendError and captures a web.chat_run_in_progress telemetry event so we can measure whether the fix actually stops the bleeding.

Why
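The stream-path pre-check might look something like the sketch below, assuming a fetch-style Request/Response server (Bun's native shape). The helper name and surrounding wiring are assumptions, not the actual diff:

```typescript
// Pre-check on /v1/chat/stream: reject before the SSE stream is opened, so
// the client sees a plain HTTP 409 instead of an error event mid-stream.
// `active` is the runtime's in-memory set of conversations with a run in flight.
function rejectIfRunning(
  active: Set<string>,
  conversationId: string,
): Response | null {
  if (!active.has(conversationId)) return null; // no run in flight: proceed
  return new Response(JSON.stringify({ error: "run_in_progress" }), {
    status: 409,
    headers: { "content-type": "application/json" },
  });
}
```

The stable `run_in_progress` code in the body is what lets the web client branch on this case and render the "assistant is still working" banner instead of a generic failure.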
A single conversation in production burned 9.3M input tokens in one day. Root cause wasn't an agent loop — it was the user sending follow-up messages while a 27-minute agent turn was still in flight. Each follow-up started a new run on a conversation with in-progress assistant state, which the Anthropic API rejected with:

"model does not support assistant message prefill"

Five back-to-back prefill errors on conv_ce7995f740af44d4 between 17:33 and 17:50, while one long-running turn held the conversation. From the user's seat this looks like the app is frozen. The runtime had no concept of "one active turn per conversation," so retries silently corrupted state instead of being rejected cleanly.

This is the smallest change that stops the bleeding. It does not queue deferred messages or cancel the in-flight run — those are follow-ups. It does give the client a typed HTTP signal it can render correctly.
Scope & assumptions
The lock is an in-memory Set<string> on the Runtime instance. This is correct today because each tenant runs with platform.replicas: 1, so all traffic for a conversation lands on the same pod. The assumption is documented on the field and flagged to be revisited if a tenant is ever scaled past one replica (the conversation JSONL on the shared PVC has the same single-writer assumption, so they would move together).

Test plan
- bun run verify — 1843 unit + 103 web + 358 integration + 16 smoke tests pass
- test/integration/runtime/concurrent-chat.test.ts (runtime-level): a second concurrent chat() call on the same conversation rejects with RunInProgressError
- test/integration/chat-stream-concurrent.test.ts (HTTP-level): the /v1/chat/stream pre-check returns 409 (held open deterministically via a gated mock model); concurrent /v1/chat/stream requests: ≥1 winner, all rejections carry the run_in_progress code — covers both the pre-check 409 path and the SSE-error-on-race path
- test/integration/api-integration.test.ts concurrent-requests test updated to assert the new contract: ≥1 request returns 200, the rest return 409 with error: "run_in_progress"
- Post-deploy: watch conv_ce7995f740af44d4 and the web.chat_run_in_progress event rate

Out of scope (tracked for follow-ups)