Skip to content

fix(#64): stop Claude Code disconnects via lazy progress + opt-out#88

Open
jamubc wants to merge 3 commits into
mainfrom
fix/issue-64-keepalive
Open

fix(#64): stop Claude Code disconnects via lazy progress + opt-out#88
jamubc wants to merge 3 commits into
mainfrom
fix/issue-64-keepalive

Conversation

@jamubc

@jamubc jamubc commented Jun 2, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes the MCP server disconnects reported in #64, folded into the unpublished 1.1.8 release. The reproduction in #64 is a plain ping — an instant call that disconnects right after a successful response — so this is not a timeout. The root cause is the upstream Claude Code behavior tracked in anthropics/claude-code#53617: on some client versions, emitting progress notifications around a successful tools/call response disconnects the stdio server.

The 1.1.7 stdin/EPIPE hardening was a candidate fix but did not resolve it, which pointed away from a crash-only cause and toward the progress-notification trigger.

What changed

Lazy progress (the core fix). The server no longer sends an immediate progress:0 ack, and the final progress:100 is gated behind a per-call emittedProgress flag. A tool that finishes before the first keepalive tick now emits zero progress notifications, so quick calls like ping never bracket their response with progress — removing the #53617 trigger for the common case. Long operations still receive keepalive progress on each interval and survive the SDK's 60s request timeout.

GEMINI_MCP_DISABLE_PROGRESS opt-out (absorbed from #80). Setting GEMINI_MCP_DISABLE_PROGRESS=1 (or =true) suppresses all progress notifications, for clients still affected by #53617 even on long calls. Progress stays on by default.

Robustness, independent of the above:

  • Per-call ProgressContext replaces shared module globals, so a finishing call can no longer clear the keepalive timer of another call in flight.
  • Process-level guards (stdout/stderr error handlers + unhandledRejection/uncaughtException) keep the long-lived server alive through a stray async error or an EPIPE after the client goes away. Startup failures still exit.
  • The keepalive now sends a spec-correct notifications/progress carrying the client's progressToken (the only notification the MCP SDK uses to reset a request's timeout — verified in shared/protocol.js, _onprogress), and stops its own timer if a write fails on a dead pipe.

Docs. Added a "Known Issues" entry in docs/resources/troubleshooting.md covering #53617 and the opt-out env var.

Why not "more heartbeats"

A drive-by comment on #64 suggested adding more frequent heartbeats. That is the wrong direction for this symptom: #64's failing case is an instant call, and on a #53617-affected client more progress notifications mean more disconnect triggers. Two of that comment's three hypotheses (undici fetch failed in @google/genai, p-retry behavior) do not apply here at all — this project spawns the gemini CLI via child_process and pulls in none of @google/genai, undici, or p-retry (confirmed with npm ls). The 429 case is already handled by the existing flash fallback.

Test plan

  • npm run build — clean
  • npm test — 57/57 unit + integration pass
  • Manual: ping repeatedly in Claude Code v2.1.109 and confirm no disconnect
  • Manual: a >15s changeMode still streams progress and does not time out

Closes #64.

jamubc added 3 commits June 1, 2026 06:15
Three problems caused intermittent disconnects:

1. The 25s keepalive interval only fired notifications when the MCP client
   sent a progressToken. Claude Code does not always send one, so the stdio
   pipe went silent for the full duration of every tool call, hitting the
   client's ~30s inactivity timeout.

2. Shared module-level globals (isProcessing, currentOperationName,
   latestOutput) meant that concurrent tool calls corrupted each other's
   state. The first call to finish set isProcessing=false, stopping the
   keepalive timer for any still-running sibling call.

3. stopProgressUpdates fired the final progress notification without
   awaiting it, sending it after the tool response was already on the wire.

Fixes:
- Replace shared globals with a per-call ProgressContext object.
- Send notifications/message (logging) on every interval tick regardless of
  progressToken — any write to stdout resets the client silence timer.
- On connection error inside the interval, clear the timer and mark the
  context done rather than swallowing the failure silently.
- Reduce interval from 25s to 15s (two beats before a 30s timeout).
- Make stopProgressUpdates async and await it at both call sites so the
  final completion notification is flushed before the response frame.
Self-review against the installed MCP SDK (shared/protocol.js) corrected two
mistaken assumptions from the first commit:

1. notifications/message does NOT keep a request alive. The SDK only resets a
   request's timeout in _onprogress, and only for a notifications/progress that
   carries the client's progressToken (when resetTimeoutOnProgress is set).
   DEFAULT_REQUEST_TIMEOUT_MSEC is 60000. So the interval now sends a real
   progress notification (token required) instead of a logging notification
   that the SDK ignores for timeout purposes. Without a token the server has no
   lever on the client's timeout, so it sends nothing rather than pretending to.

2. The likely cause of "disconnects after a few successful calls" is the process
   dying on a stray async/stream error (e.g. EPIPE on stdout after the client
   went away), since there were no process-level guards. Added stdout/stderr
   error handlers plus unhandledRejection/uncaughtException handlers that log to
   stderr and keep the long-lived server up. Startup failures still exit.

The interval also catches a failed progress write now and stops itself, so a
dead pipe cannot leak a timer or keep throwing.
Root cause of #64 is anthropics/claude-code#53617: some Claude Code
versions disconnect a stdio MCP server after a successful tools/call when
the server emits progress notifications. The #64 repro is `ping` — an
instant call — so this is not a timeout; it is a fast call whose response
gets bracketed by progress notifications.

Fixes:
- Lazy progress: drop the immediate progress:0 ack and gate the final
  progress:100 behind a per-call emittedProgress flag. A tool that finishes
  before the first keepalive tick now emits ZERO progress notifications, so
  quick calls never trip #53617. Long operations still get keepalive
  progress every interval and survive the 60s request timeout.
- GEMINI_MCP_DISABLE_PROGRESS=1|true env opt-out suppresses all progress
  notifications for clients still affected by #53617 (absorbed from #80).
  Progress stays on by default.
- Troubleshooting docs: Known Issues entry for #53617 + the env var.

Keeps the earlier robustness work on this branch (per-call ProgressContext;
process-level stdout/stderr + unhandledRejection/uncaughtException guards;
spec-correct progress-token keepalive).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses MCP server disconnect issues by implementing lazy progress notifications, managing progress state per-call to support concurrent tool executions, and adding process-level error handlers to prevent stray async or stream errors from crashing the server. The feedback recommends terminating the process on uncaughtException rather than keeping it alive to avoid unpredictable state corruption, and suggests optimizing the progress update logic by not scheduling a setInterval timer when no progress token is provided.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/index.ts
process.stdout.on("error", (err) => Logger.error("stdout stream error (ignored):", err));
process.stderr.on("error", () => { /* nowhere safe left to log */ });
process.on("unhandledRejection", (reason) => Logger.error("Unhandled rejection (server kept alive):", reason));
process.on("uncaughtException", (error) => Logger.error("Uncaught exception (server kept alive):", error));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Keeping the process alive after an uncaughtException is highly discouraged in Node.js because the application state becomes unpredictable and corrupted. Since you have already added a stream error handler on process.stdout to catch EPIPE errors (which prevents them from becoming uncaught exceptions), any other uncaught exception is likely a critical error where the process should be terminated gracefully to avoid memory leaks or zombie processes.

process.on("uncaughtException", (error) => {
  Logger.error("Uncaught exception (shutting down):", error);
  process.exit(1);
});

Comment thread src/index.ts
Comment on lines +77 to 137
function startProgressUpdates(operationName: string, progressToken?: string | number) {
const ctx: ProgressContext = {
isProcessing: true,
operationName,
latestOutput: "",
messageIndex: 0,
progress: 0,
emittedProgress: false,
};

const progressMessages = [
`🧠 ${operationName} - Gemini is analyzing your request...`,
`📊 ${operationName} - Processing files and generating insights...`,
`✨ ${operationName} - Creating structured response for your review...`,
`⏱️ ${operationName} - Large analysis in progress (this is normal for big requests)...`,
`🔍 ${operationName} - Still working... Gemini takes time for quality results...`,
];

let messageIndex = 0;
let progress = 0;

// Send immediate acknowledgment if progress requested
if (progressToken) {
sendProgressNotification(
progressToken,
0,
undefined, // No total - indeterminate progress
`🔍 Starting ${operationName}`
);
}

// Keep client alive with periodic updates

// Lazy progress: we deliberately do NOT send an immediate progress:0 ack. A fast
// tool (e.g. ping) finishes before the first interval tick and so emits no progress
// notifications at all — which is what avoids the #53617 client disconnect on quick
// calls. Progress only starts once a call actually runs past KEEPALIVE_INTERVAL,
// i.e. exactly the long operations that need the keepalive.
const progressInterval = setInterval(async () => {
if (isProcessing && progressToken) {
// Simply increment progress value
progress += 1;

// Include latest output if available
const baseMessage = progressMessages[messageIndex % progressMessages.length];
const outputPreview = latestOutput.slice(-150).trim(); // Last 150 chars
const message = outputPreview
? `${baseMessage}\n📝 Output: ...${outputPreview}`
: baseMessage;

await sendProgressNotification(
progressToken,
progress,
undefined, // No total - indeterminate progress
message
);
messageIndex++;
} else if (!isProcessing) {
if (!ctx.isProcessing) {
clearInterval(progressInterval);
return;
}
}, PROTOCOL.KEEPALIVE_INTERVAL); // Every 25 seconds

return { interval: progressInterval, progressToken };

// The only server-side lever on a client's in-flight request timeout is a
// notifications/progress carrying the client's progressToken: the MCP SDK
// resets the per-request timer in _onprogress for that token (when the client
// set resetTimeoutOnProgress). Logging notifications do not touch the timer,
// so without a token there is nothing useful to send here. This is precisely
// why a slow (e.g. 15-minute) changeMode survives only when the client opted
// into progress — see PROTOCOL.KEEPALIVE_INTERVAL and the docs on long ops.
if (!progressToken) return;

const baseMessage = progressMessages[ctx.messageIndex % progressMessages.length];
const outputPreview = ctx.latestOutput.slice(-150).trim();
const message = outputPreview ? `${baseMessage}\n📝 Output: ...${outputPreview}` : baseMessage;
ctx.messageIndex++;
ctx.progress += 1;

try {
await server.notification({
method: PROTOCOL.NOTIFICATIONS.PROGRESS,
params: { progressToken, progress: ctx.progress, message },
});
ctx.emittedProgress = true;
} catch (error) {
// Transport gone (e.g. EPIPE after the client went away): stop ticking so
// we neither leak this timer nor keep throwing on a dead pipe.
Logger.error("Keepalive progress notification failed; stopping updates:", error);
ctx.isProcessing = false;
clearInterval(progressInterval);
}
}, PROTOCOL.KEEPALIVE_INTERVAL);

return { interval: progressInterval, progressToken, ctx };
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If progressToken is undefined (which happens when progress is disabled or not requested by the client), we still set up a setInterval timer that ticks every 15 seconds. Although it is cleared when the tool execution finishes, during execution it schedules unnecessary ticks on the event loop that do nothing because of the if (!progressToken) return; guard. We can optimize this by returning early and not scheduling the interval when progressToken is not provided.

function startProgressUpdates(operationName: string, progressToken?: string | number) {
  const ctx: ProgressContext = {
    isProcessing: true,
    operationName,
    latestOutput: "",
    messageIndex: 0,
    progress: 0,
    emittedProgress: false,
  };

  if (!progressToken) {
    return { interval: undefined, progressToken, ctx };
  }

  const progressMessages = [
    "🧠 " + operationName + " - Gemini is analyzing your request...",
    "📊 " + operationName + " - Processing files and generating insights...",
    "✨ " + operationName + " - Creating structured response for your review...",
    "⏱️ " + operationName + " - Large analysis in progress (this is normal for big requests)...",
    "🔍 " + operationName + " - Still working... Gemini takes time for quality results..."
  ];

  const progressInterval = setInterval(async () => {
    if (!ctx.isProcessing) {
      clearInterval(progressInterval);
      return;
    }

    const baseMessage = progressMessages[ctx.messageIndex % progressMessages.length];
    const outputPreview = ctx.latestOutput.slice(-150).trim();
    const message = outputPreview ? baseMessage + "\n📝 Output: ..." + outputPreview : baseMessage;
    ctx.messageIndex++;
    ctx.progress += 1;

    try {
      await server.notification({
        method: PROTOCOL.NOTIFICATIONS.PROGRESS,
        params: { progressToken, progress: ctx.progress, message },
      });
      ctx.emittedProgress = true;
    } catch (error) {
      Logger.error("Keepalive progress notification failed; stopping updates:", error);
      ctx.isProcessing = false;
      clearInterval(progressInterval);
    }
  }, PROTOCOL.KEEPALIVE_INTERVAL);

  return { interval: progressInterval, progressToken, ctx };
}

Comment thread src/index.ts
Comment on lines +139 to 145
async function stopProgressUpdates(
progressData: { interval: NodeJS.Timeout; progressToken?: string | number; ctx: ProgressContext },
success: boolean = true
) {
const operationName = currentOperationName; // Store before clearing
isProcessing = false;
currentOperationName = "";
const operationName = progressData.ctx.operationName;
progressData.ctx.isProcessing = false;
clearInterval(progressData.interval);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the signature of stopProgressUpdates to allow interval to be undefined (since we avoided scheduling the interval when progressToken is not provided), and conditionally clear it.

Suggested change
async function stopProgressUpdates(
progressData: { interval: NodeJS.Timeout; progressToken?: string | number; ctx: ProgressContext },
success: boolean = true
) {
const operationName = currentOperationName; // Store before clearing
isProcessing = false;
currentOperationName = "";
const operationName = progressData.ctx.operationName;
progressData.ctx.isProcessing = false;
clearInterval(progressData.interval);
async function stopProgressUpdates(
progressData: { interval?: NodeJS.Timeout; progressToken?: string | number; ctx: ProgressContext },
success: boolean = true
) {
const operationName = progressData.ctx.operationName;
progressData.ctx.isProcessing = false;
if (progressData.interval) {
clearInterval(progressData.interval);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Keeps disconnecting after every few successful calls in Claude Code

1 participant