fix(#64): stop Claude Code disconnects via lazy progress + opt-out#88
fix(#64): stop Claude Code disconnects via lazy progress + opt-out#88jamubc wants to merge 3 commits into
Conversation
Three problems caused intermittent disconnects: 1. The 25s keepalive interval only fired notifications when the MCP client sent a progressToken. Claude Code does not always send one, so the stdio pipe went silent for the full duration of every tool call, hitting the client's ~30s inactivity timeout. 2. Shared module-level globals (isProcessing, currentOperationName, latestOutput) meant that concurrent tool calls corrupted each other's state. The first call to finish set isProcessing=false, stopping the keepalive timer for any still-running sibling call. 3. stopProgressUpdates fired the final progress notification without awaiting it, sending it after the tool response was already on the wire. Fixes: - Replace shared globals with a per-call ProgressContext object. - Send notifications/message (logging) on every interval tick regardless of progressToken — any write to stdout resets the client silence timer. - On connection error inside the interval, clear the timer and mark the context done rather than swallowing the failure silently. - Reduce interval from 25s to 15s (two beats before a 30s timeout). - Make stopProgressUpdates async and await it at both call sites so the final completion notification is flushed before the response frame.
Self-review against the installed MCP SDK (shared/protocol.js) corrected two mistaken assumptions from the first commit: 1. notifications/message does NOT keep a request alive. The SDK only resets a request's timeout in _onprogress, and only for a notifications/progress that carries the client's progressToken (when resetTimeoutOnProgress is set). DEFAULT_REQUEST_TIMEOUT_MSEC is 60000. So the interval now sends a real progress notification (token required) instead of a logging notification that the SDK ignores for timeout purposes. Without a token the server has no lever on the client's timeout, so it sends nothing rather than pretending to. 2. The likely cause of "disconnects after a few successful calls" is the process dying on a stray async/stream error (e.g. EPIPE on stdout after the client went away), since there were no process-level guards. Added stdout/stderr error handlers plus unhandledRejection/uncaughtException handlers that log to stderr and keep the long-lived server up. Startup failures still exit. The interval also catches a failed progress write now and stops itself, so a dead pipe cannot leak a timer or keep throwing.
Root cause of #64 is anthropics/claude-code#53617: some Claude Code versions disconnect a stdio MCP server after a successful tools/call when the server emits progress notifications. The #64 repro is `ping` — an instant call — so this is not a timeout; it is a fast call whose response gets bracketed by progress notifications. Fixes: - Lazy progress: drop the immediate progress:0 ack and gate the final progress:100 behind a per-call emittedProgress flag. A tool that finishes before the first keepalive tick now emits ZERO progress notifications, so quick calls never trip #53617. Long operations still get keepalive progress every interval and survive the 60s request timeout. - GEMINI_MCP_DISABLE_PROGRESS=1|true env opt-out suppresses all progress notifications for clients still affected by #53617 (absorbed from #80). Progress stays on by default. - Troubleshooting docs: Known Issues entry for #53617 + the env var. Keeps the earlier robustness work on this branch (per-call ProgressContext; process-level stdout/stderr + unhandledRejection/uncaughtException guards; spec-correct progress-token keepalive).
There was a problem hiding this comment.
Code Review
This pull request addresses MCP server disconnect issues by implementing lazy progress notifications, managing progress state per-call to support concurrent tool executions, and adding process-level error handlers to prevent stray async or stream errors from crashing the server. The feedback recommends terminating the process on uncaughtException rather than keeping it alive to avoid unpredictable state corruption, and suggests optimizing the progress update logic by not scheduling a setInterval timer when no progress token is provided.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| process.stdout.on("error", (err) => Logger.error("stdout stream error (ignored):", err)); | ||
| process.stderr.on("error", () => { /* nowhere safe left to log */ }); | ||
| process.on("unhandledRejection", (reason) => Logger.error("Unhandled rejection (server kept alive):", reason)); | ||
| process.on("uncaughtException", (error) => Logger.error("Uncaught exception (server kept alive):", error)); |
There was a problem hiding this comment.
Keeping the process alive after an uncaughtException is highly discouraged in Node.js because the application state becomes unpredictable and corrupted. Since you have already added a stream error handler on process.stdout to catch EPIPE errors (which prevents them from becoming uncaught exceptions), any other uncaught exception is likely a critical error where the process should be terminated gracefully to avoid memory leaks or zombie processes.
process.on("uncaughtException", (error) => {
Logger.error("Uncaught exception (shutting down):", error);
process.exit(1);
});| function startProgressUpdates(operationName: string, progressToken?: string | number) { | ||
| const ctx: ProgressContext = { | ||
| isProcessing: true, | ||
| operationName, | ||
| latestOutput: "", | ||
| messageIndex: 0, | ||
| progress: 0, | ||
| emittedProgress: false, | ||
| }; | ||
|
|
||
| const progressMessages = [ | ||
| `🧠 ${operationName} - Gemini is analyzing your request...`, | ||
| `📊 ${operationName} - Processing files and generating insights...`, | ||
| `✨ ${operationName} - Creating structured response for your review...`, | ||
| `⏱️ ${operationName} - Large analysis in progress (this is normal for big requests)...`, | ||
| `🔍 ${operationName} - Still working... Gemini takes time for quality results...`, | ||
| ]; | ||
|
|
||
| let messageIndex = 0; | ||
| let progress = 0; | ||
|
|
||
| // Send immediate acknowledgment if progress requested | ||
| if (progressToken) { | ||
| sendProgressNotification( | ||
| progressToken, | ||
| 0, | ||
| undefined, // No total - indeterminate progress | ||
| `🔍 Starting ${operationName}` | ||
| ); | ||
| } | ||
|
|
||
| // Keep client alive with periodic updates | ||
|
|
||
| // Lazy progress: we deliberately do NOT send an immediate progress:0 ack. A fast | ||
| // tool (e.g. ping) finishes before the first interval tick and so emits no progress | ||
| // notifications at all — which is what avoids the #53617 client disconnect on quick | ||
| // calls. Progress only starts once a call actually runs past KEEPALIVE_INTERVAL, | ||
| // i.e. exactly the long operations that need the keepalive. | ||
| const progressInterval = setInterval(async () => { | ||
| if (isProcessing && progressToken) { | ||
| // Simply increment progress value | ||
| progress += 1; | ||
|
|
||
| // Include latest output if available | ||
| const baseMessage = progressMessages[messageIndex % progressMessages.length]; | ||
| const outputPreview = latestOutput.slice(-150).trim(); // Last 150 chars | ||
| const message = outputPreview | ||
| ? `${baseMessage}\n📝 Output: ...${outputPreview}` | ||
| : baseMessage; | ||
|
|
||
| await sendProgressNotification( | ||
| progressToken, | ||
| progress, | ||
| undefined, // No total - indeterminate progress | ||
| message | ||
| ); | ||
| messageIndex++; | ||
| } else if (!isProcessing) { | ||
| if (!ctx.isProcessing) { | ||
| clearInterval(progressInterval); | ||
| return; | ||
| } | ||
| }, PROTOCOL.KEEPALIVE_INTERVAL); // Every 25 seconds | ||
|
|
||
| return { interval: progressInterval, progressToken }; | ||
|
|
||
| // The only server-side lever on a client's in-flight request timeout is a | ||
| // notifications/progress carrying the client's progressToken: the MCP SDK | ||
| // resets the per-request timer in _onprogress for that token (when the client | ||
| // set resetTimeoutOnProgress). Logging notifications do not touch the timer, | ||
| // so without a token there is nothing useful to send here. This is precisely | ||
| // why a slow (e.g. 15-minute) changeMode survives only when the client opted | ||
| // into progress — see PROTOCOL.KEEPALIVE_INTERVAL and the docs on long ops. | ||
| if (!progressToken) return; | ||
|
|
||
| const baseMessage = progressMessages[ctx.messageIndex % progressMessages.length]; | ||
| const outputPreview = ctx.latestOutput.slice(-150).trim(); | ||
| const message = outputPreview ? `${baseMessage}\n📝 Output: ...${outputPreview}` : baseMessage; | ||
| ctx.messageIndex++; | ||
| ctx.progress += 1; | ||
|
|
||
| try { | ||
| await server.notification({ | ||
| method: PROTOCOL.NOTIFICATIONS.PROGRESS, | ||
| params: { progressToken, progress: ctx.progress, message }, | ||
| }); | ||
| ctx.emittedProgress = true; | ||
| } catch (error) { | ||
| // Transport gone (e.g. EPIPE after the client went away): stop ticking so | ||
| // we neither leak this timer nor keep throwing on a dead pipe. | ||
| Logger.error("Keepalive progress notification failed; stopping updates:", error); | ||
| ctx.isProcessing = false; | ||
| clearInterval(progressInterval); | ||
| } | ||
| }, PROTOCOL.KEEPALIVE_INTERVAL); | ||
|
|
||
| return { interval: progressInterval, progressToken, ctx }; | ||
| } |
There was a problem hiding this comment.
If progressToken is undefined (which happens when progress is disabled or not requested by the client), we still set up a setInterval timer that ticks every 15 seconds. Although it is cleared when the tool execution finishes, during execution it schedules unnecessary ticks on the event loop that do nothing because of the if (!progressToken) return; guard. We can optimize this by returning early and not scheduling the interval when progressToken is not provided.
function startProgressUpdates(operationName: string, progressToken?: string | number) {
const ctx: ProgressContext = {
isProcessing: true,
operationName,
latestOutput: "",
messageIndex: 0,
progress: 0,
emittedProgress: false,
};
if (!progressToken) {
return { interval: undefined, progressToken, ctx };
}
const progressMessages = [
"🧠 " + operationName + " - Gemini is analyzing your request...",
"📊 " + operationName + " - Processing files and generating insights...",
"✨ " + operationName + " - Creating structured response for your review...",
"⏱️ " + operationName + " - Large analysis in progress (this is normal for big requests)...",
"🔍 " + operationName + " - Still working... Gemini takes time for quality results..."
];
const progressInterval = setInterval(async () => {
if (!ctx.isProcessing) {
clearInterval(progressInterval);
return;
}
const baseMessage = progressMessages[ctx.messageIndex % progressMessages.length];
const outputPreview = ctx.latestOutput.slice(-150).trim();
const message = outputPreview ? baseMessage + "\n📝 Output: ..." + outputPreview : baseMessage;
ctx.messageIndex++;
ctx.progress += 1;
try {
await server.notification({
method: PROTOCOL.NOTIFICATIONS.PROGRESS,
params: { progressToken, progress: ctx.progress, message },
});
ctx.emittedProgress = true;
} catch (error) {
Logger.error("Keepalive progress notification failed; stopping updates:", error);
ctx.isProcessing = false;
clearInterval(progressInterval);
}
}, PROTOCOL.KEEPALIVE_INTERVAL);
return { interval: progressInterval, progressToken, ctx };
}| async function stopProgressUpdates( | ||
| progressData: { interval: NodeJS.Timeout; progressToken?: string | number; ctx: ProgressContext }, | ||
| success: boolean = true | ||
| ) { | ||
| const operationName = currentOperationName; // Store before clearing | ||
| isProcessing = false; | ||
| currentOperationName = ""; | ||
| const operationName = progressData.ctx.operationName; | ||
| progressData.ctx.isProcessing = false; | ||
| clearInterval(progressData.interval); |
There was a problem hiding this comment.
Update the signature of stopProgressUpdates to allow interval to be undefined (since we avoided scheduling the interval when progressToken is not provided), and conditionally clear it.
| async function stopProgressUpdates( | |
| progressData: { interval: NodeJS.Timeout; progressToken?: string | number; ctx: ProgressContext }, | |
| success: boolean = true | |
| ) { | |
| const operationName = currentOperationName; // Store before clearing | |
| isProcessing = false; | |
| currentOperationName = ""; | |
| const operationName = progressData.ctx.operationName; | |
| progressData.ctx.isProcessing = false; | |
| clearInterval(progressData.interval); | |
| async function stopProgressUpdates( | |
| progressData: { interval?: NodeJS.Timeout; progressToken?: string | number; ctx: ProgressContext }, | |
| success: boolean = true | |
| ) { | |
| const operationName = progressData.ctx.operationName; | |
| progressData.ctx.isProcessing = false; | |
| if (progressData.interval) { | |
| clearInterval(progressData.interval); | |
| } |
Summary
Fixes the MCP server disconnects reported in #64, folded into the unpublished 1.1.8 release. The reproduction in #64 is a plain
ping— an instant call that disconnects right after a successful response — so this is not a timeout. The root cause is the upstream Claude Code behavior tracked in anthropics/claude-code#53617: on some client versions, emitting progress notifications around a successfultools/callresponse disconnects the stdio server.The 1.1.7 stdin/EPIPE hardening was a candidate fix but did not resolve it, which pointed away from a crash-only cause and toward the progress-notification trigger.
What changed
Lazy progress (the core fix). The server no longer sends an immediate
progress:0ack, and the finalprogress:100is gated behind a per-callemittedProgressflag. A tool that finishes before the first keepalive tick now emits zero progress notifications, so quick calls likepingnever bracket their response with progress — removing the #53617 trigger for the common case. Long operations still receive keepalive progress on each interval and survive the SDK's 60s request timeout.GEMINI_MCP_DISABLE_PROGRESSopt-out (absorbed from #80). SettingGEMINI_MCP_DISABLE_PROGRESS=1(or=true) suppresses all progress notifications, for clients still affected by #53617 even on long calls. Progress stays on by default.Robustness, independent of the above:
ProgressContextreplaces shared module globals, so a finishing call can no longer clear the keepalive timer of another call in flight.stdout/stderrerror handlers +unhandledRejection/uncaughtException) keep the long-lived server alive through a stray async error or anEPIPEafter the client goes away. Startup failures still exit.notifications/progresscarrying the client'sprogressToken(the only notification the MCP SDK uses to reset a request's timeout — verified inshared/protocol.js,_onprogress), and stops its own timer if a write fails on a dead pipe.Docs. Added a "Known Issues" entry in
docs/resources/troubleshooting.mdcovering #53617 and the opt-out env var.Why not "more heartbeats"
A drive-by comment on #64 suggested adding more frequent heartbeats. That is the wrong direction for this symptom: #64's failing case is an instant call, and on a #53617-affected client more progress notifications mean more disconnect triggers. Two of that comment's three hypotheses (undici
fetch failedin@google/genai,p-retrybehavior) do not apply here at all — this project spawns thegeminiCLI viachild_processand pulls in none of@google/genai,undici, orp-retry(confirmed withnpm ls). The 429 case is already handled by the existing flash fallback.Test plan
npm run build— cleannpm test— 57/57 unit + integration passpingrepeatedly in Claude Code v2.1.109 and confirm no disconnectchangeModestill streams progress and does not time outCloses #64.