Skip to content

fix: mitigate MCP process accumulation#83

Merged
jcfischer merged 3 commits intomainfrom
fix/mcp-process-accumulation
Mar 20, 2026
Merged

fix: mitigate MCP process accumulation#83
jcfischer merged 3 commits intomainfrom
fix/mcp-process-accumulation

Conversation

@jcfischer
Copy link
Copy Markdown
Owner

Summary

  • Idle auto-exit: MCP server self-terminates after 30 min without tool calls, preventing zombie process accumulation across Claude Code sessions (configurable via SUPERTAG_MCP_IDLE_TIMEOUT, set 0 to disable)
  • Connection health check: DeltaSyncService.ensureHealthyConnection() runs SELECT 1 before each sync and auto-reconnects if the SQLite connection has gone stale ("database or disk is full")
  • Periodic health check: Poller runs connection health check every 10th tick as a proactive safeguard

Reported by Ryan Baum — Claude Code spawns one supertag-mcp per session and never cleans up idle processes, leading to 14+ zombie processes with stale SQLite connections.

Test plan

  • TypeScript typecheck passes
  • 3450+ fast tests pass (5 pre-existing flaky failures unrelated)
  • Verify idle timeout fires after 30 min of MCP inactivity
  • Verify connection reconnect recovers from stale SQLite state

🤖 Generated with Claude Code

…tion health checks

Claude Code spawns one supertag-mcp per session and never cleans up idle
processes. Over days this leads to 14+ zombie processes holding stale SQLite
connections, causing "database or disk is full" errors.

Three mitigations:
- Idle auto-exit: MCP server self-terminates after 30min without tool calls
  (configurable via SUPERTAG_MCP_IDLE_TIMEOUT, set 0 to disable)
- Connection health check: DeltaSyncService tests connection before each sync
  and auto-reconnects if stale
- Periodic health check in poller every 10th tick

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@jcfischer
Copy link
Copy Markdown
Owner Author

AI Review: CHANGES REQUESTED

Code Review: PR #83 - MCP Process Accumulation Mitigation

Summary

This PR addresses zombie process accumulation in Claude Code by adding idle auto-exit and SQLite connection health checks. The implementation is clean and focused, but requires test coverage before approval.


✅ Strengths

  1. Surgical fix - Three focused changes without over-engineering
  2. Configurable behavior - Environment variable for timeout duration, can disable with 0
  3. Clean timer management - Proper .unref() to avoid keeping process alive
  4. Defensive reconnection - Catches and recovers from stale SQLite connections
  5. Follows existing patterns - Uses established logging and error handling

🔴 Blocking Issues

1. Missing Test Coverage (CRITICAL)

None of the three new features have tests:

  • ✗ No tests for ensureHealthyConnection() method
  • ✗ No tests for idle timeout behavior
  • ✗ No tests for periodic health check (every 10th tick)

Required coverage:

// tests/unit/delta-sync-service.test.ts
describe('ensureHealthyConnection', () => {
  it('should reconnect when SELECT 1 fails');
  it('should not reconnect when connection is healthy');
  it('should handle close() failures gracefully');
  it('should log reconnection events');
});

// tests/unit/mcp-idle-timeout.test.ts
describe('idle auto-exit', () => {
  it('should exit after timeout with no tool calls');
  it('should reset timer on each tool call');
  it('should be disabled when SUPERTAG_MCP_IDLE_TIMEOUT=0');
  it('should cleanup activePoller on timeout');
});

// tests/unit/delta-sync-poller.test.ts (add to existing)
describe('periodic health check', () => {
  it('should call ensureHealthyConnection every 10 ticks');
  it('should not interfere with normal sync cycle');
});

2. No Integration Test for the Full Fix

The PR describes a real bug reported by Ryan Baum but doesn't demonstrate the fix works end-to-end. Need a test that:

  1. Simulates stale SQLite connection
  2. Verifies reconnection succeeds
  3. Confirms sync continues without data loss

⚠️ Medium Priority Issues

3. Idle Timer Could Fire During Active Sync

Issue: If a sync operation takes >30 minutes, the idle timer could fire mid-operation.

// Line 81-85 in index.ts
idleTimer = setTimeout(() => {
  logger.info('Idle timeout reached, shutting down', { timeoutMs: IDLE_TIMEOUT_MS });
  activePoller?.stop();  // What if actively syncing?
  process.exit(0);
}, IDLE_TIMEOUT_MS);

Recommendation: Check activePoller?.isSyncing() before exiting, or reset timer on sync start.

4. tickCount Not Reset on Stop/Restart

Issue: If poller is stopped and restarted, tickCount continues from previous value.

// src/mcp/delta-sync-poller.ts:82
private tickCount = 0;  // Never reset in stop()

Recommendation: Reset tickCount = 0 in stop() method for predictable behavior.

5. Swallowed Close Error in ensureHealthyConnection

// Line 59-62 in delta-sync.ts
try {
  this.db.close();
} catch {
  // Already broken — ignore
}

Recommendation: Log the ignored error at debug level for troubleshooting.


💡 Low Priority Suggestions

6. Magic Number: Why Every 10th Tick?

// src/mcp/delta-sync-poller.ts:187
if (this.tickCount % 10 === 0) {
  this.service.ensureHealthyConnection();
}

Suggestion: Extract to constant with rationale comment:

/** Check connection health every N ticks to catch stale connections early */
const HEALTH_CHECK_INTERVAL_TICKS = 10;

7. Idle Timeout Default Could Be Configurable Per Environment

30 minutes is reasonable for desktop use, but cloud deployments might want different values. Consider documenting this in README/SKILL.md.


📋 Required Actions Before Approval

  1. ✅ Add unit tests for ensureHealthyConnection()
  2. ✅ Add unit tests for idle timeout behavior
  3. ✅ Add test coverage for periodic health check
  4. ✅ Add integration test demonstrating stale connection recovery
  5. ⚠️ Address idle timer vs active sync race condition
  6. ⚠️ Reset tickCount on poller stop

📊 Review Scorecard

Dimension Score Notes
Spec Compliance N/A No formal spec for this fix
Plan Adherence N/A No formal plan
Security No security concerns
Code Quality ⚠️ Clean code, missing tests
Architecture Respects boundaries
Edge Cases ⚠️ Timer race condition, tickCount reset
Code Duplication No duplication

Verdict: CHANGES REQUESTED - Add test coverage and address timer/tickCount issues.

@jcfischer
Copy link
Copy Markdown
Owner Author

Additional Review Findings (not covered above)

🔴 CRITICAL: NaN timeout causes immediate exit on startup

// src/mcp/index.ts:75
const IDLE_TIMEOUT_MS = Number(process.env.SUPERTAG_MCP_IDLE_TIMEOUT ?? 30 * 60 * 1000);

If a user sets SUPERTAG_MCP_IDLE_TIMEOUT=abc (typo, misconfigured env), Number("abc") returns NaN.

The guard NaN <= 0 evaluates to false (IEEE 754), so the timer IS created. setTimeout(fn, NaN) treats NaN as 0 in most runtimes — the MCP server exits immediately on startup.

Fix:

const rawTimeout = Number(process.env.SUPERTAG_MCP_IDLE_TIMEOUT ?? 30 * 60 * 1000);
const IDLE_TIMEOUT_MS = Number.isNaN(rawTimeout) ? 30 * 60 * 1000 : rawTimeout;

⚠️ MEDIUM: Redundant ensureHealthyConnection() — tick() check is dead code

ensureHealthyConnection() is called in two places:

  1. delta-sync-poller.ts:187 — every 10th tick
  2. delta-sync.ts:255 — at the top of sync(), every time

Since tick() always calls sync() immediately after the health check, the tick-level check never adds value:

  • On the 10th tick: runs twice back-to-back (tick check → sync check)
  • On all other ticks: sync() does it anyway

Recommendation: Remove the tick-level check entirely. The sync() check already covers every call path including triggerNow().


💡 LOW: Reconnected DB connection missing busy_timeout

// src/services/delta-sync.ts:64
this.db = new Database(this.dbPath);

The reconnected connection doesn't set PRAGMA busy_timeout = 5000. WAL mode persists at the file level (fine), but busy_timeout is per-connection and won't carry over.

Not a regression — the original constructor at line 38 doesn't configure it either. But worth fixing in the reconnect path (and the constructor) as a follow-up, since the whole point of this PR is resilience under concurrent access.

jcfischer and others added 2 commits March 19, 2026 17:03
Critical:
- NaN guard: extract DEFAULT_IDLE_TIMEOUT_MS constant, fall back on NaN
- Idle timer race: check activePoller.isSyncing() before exit, defer if busy

Medium:
- busy_timeout: configure PRAGMA busy_timeout=5000 in constructor and reconnect
- Simplify unref() guard (Bun always supports .unref() on timers)

Tests:
- Add delta-sync-health-check.test.ts (6 tests: healthy/unhealthy/reconnect/busy_timeout)
- Rewrite mcp-idle-timeout.test.ts (14 tests: parsing, timer behavior, sync guard)
- Add tickCount reset test to delta-sync-poller.test.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@jcfischer jcfischer merged commit f17ec43 into main Mar 20, 2026
1 check passed
@jcfischer jcfischer deleted the fix/mcp-process-accumulation branch March 20, 2026 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant