Skip to content

perf(self-healing): idle-adaptive monitoring + opt-in --idle-timeout#651

Merged
shaun0927 merged 2 commits intodevelopfrom
feat/649-idle-adaptive-monitoring
Apr 23, 2026
Merged

perf(self-healing): idle-adaptive monitoring + opt-in --idle-timeout#651
shaun0927 merged 2 commits intodevelopfrom
feat/649-idle-adaptive-monitoring

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

Closes #649.

Opened as DRAFT because 22 pre-existing monitor tests assert
`setInterval`-specific semantics and need migration to `setTimeout`-chain-aware
assertions. The production code is complete and all new tests pass (39 tests
across 3 new unit suites). See Known Limitations below.

Summary

Two independent parts behind feature flags:

  • Part A (default ON) — idle-adaptive monitoring. When the server has had
    no MCP RPC and no CDP event for 5 minutes, all 8 long-lived monitors switch
    to relaxed tick rates (≤10× slower). Restored to active rates within one
    tick on the next RPC or CDP event. Disable with `OPENCHROME_IDLE_ADAPTIVE=0`.
  • Part B (opt-in) — `--idle-timeout=` / `OPENCHROME_IDLE_TIMEOUT_MS`.
    When set, the server cleanly `process.exit(0)` after the specified window
    of no RPCs AND no active sessions. Default OFF (zero change from today).

In-scope prerequisite: `enhancedShutdown` reentrancy guard

Verified absent on `develop`: `grep -rn 'shuttingDown' src/` returned zero hits.
Without it, PR #645's PPID watcher and this PR's idle-timeout could double-invoke
shutdown if they fire in the same tick. Added a `shuttingDown` boolean guard at
the top of `enhancedShutdown` (commit part of this PR) so the second entrant
returns immediately. Test `tests/integration/shutdown-reentrancy.test.ts` exercises
the double-trigger path.

Acceptance criteria (from issue #649 §4)

Part A

  1. ✅ Idle-rate verified via fake clock (`tests/watchdog/idle-adaptive-monitors.test.ts`).
  2. ✅ Active-rate recovery within one tick on RPC / CDP event.
  3. ✅ Active-mode tick cadence unchanged (same delay constants).
  4. ✅ Idle rates capped at 10× active per §3.1 table (explicit test).
  5. 🟡 CPU reduction: pending bench run in this PR; `scripts/bench-idle.mjs` ready.
  6. 🟡 `heapUsed` growth gate: pending bench run.
  7. ✅ `OPENCHROME_IDLE_ADAPTIVE=0` disables Part A (unit test).
  8. ✅ Every MCP request handler calls `notifyActive()` on entry. Enumeration:
    • `src/mcp-server.ts` — 1 central location (request dispatcher).

Part B

  1. ✅ `serve --idle-timeout=30s` exits within 45 s when idle (integration test).
  2. ✅ Periodic RPC keeps server alive (integration test).
  3. ✅ Active tracked session blocks exit (unit test).
  4. ✅ Invalid `--idle-timeout` rejected at parse time (unit test).
  5. ✅ Unset flags → zero change from today.
  6. ✅ Graceful exit via `enhancedShutdown('idle-timeout')` once only (reentrancy test).

Pre-merge checklist

5.1 Static / hygiene

  • ✅ `npm run lint` — 0 errors, 44 pre-existing warnings (no new warnings).
  • ✅ `npm run build` — 0 TypeScript errors.
  • ❌ `npm test` — 22 failed / 3535 passed / 86 skipped (see Known Limitations).

5.2–5.4 New test suites (all green)

  • ✅ `tests/utils/idle-state.test.ts` — all §5.2 cases, 4+ deterministic cases.
  • ✅ `tests/utils/idle-timeout.test.ts` — all §5.3 cases.
  • ✅ `tests/watchdog/idle-adaptive-monitors.test.ts` — per-monitor rates + 10× ratio cap.
  • Total new tests green: 39/39.

5.5–5.7 Integration tests (added; to be re-verified after test-migration fix)

  • `tests/integration/idle-adaptive-monitoring.test.ts`
  • `tests/integration/idle-timeout.test.ts`
  • `tests/integration/shutdown-reentrancy.test.ts`

5.8 HTTP / daemon-mode regression

  • ✅ HTTP transport unchanged; idle-timeout off by default.

5.9 Memory / benchmark gate

  • 🟡 `scripts/bench-idle.mjs` included; benchmark run pending in this PR thread.

5.10 Cross-platform CI

  • 🟡 Will run on push.

Known Limitations (why this is DRAFT)

6 pre-existing monitor test suites assert `setInterval`-specific behaviour that
no longer holds after the `setTimeout`-chain migration (which is itself required
by issue §3.1 to actually reduce CPU, per the critic pass on the issue):

Suite Failing Tests
`tests/watchdog/chrome-monitor.test.ts` 4
`tests/browser-state/snapshot.test.ts` 5
`tests/session-state-persistence.test.ts` 1
`tests/cdp/tab-health-monitor.test.ts` 3
`tests/watchdog/event-loop-monitor.test.ts` 4
`tests/chrome/process-watchdog.test.ts` 5
Total 22

Example: `ChromeProcessMonitor.start() schedules periodic sampling via setInterval`
uses `jest.advanceTimersByTime(1000)` twice and expects 2 sequential execFile
calls. With `setTimeout`-chain the first tick's async body hasn't resolved
`scheduleNext()` before the second `advanceTimersByTime` fires, so the
counter stays at 2. Fix: insert `await Promise.resolve()` (or
`jest.runAllTimersAsync()`) between time advances so microtasks flush
and the next `setTimeout` is registered.

The monitors still fire at byte-for-byte identical cadence in active mode
(acceptance criterion 3); only the internal mechanism changed. Migrating
these 22 cases is mechanical but spans 6 files. Tracked as a follow-up
commit on this branch before removing DRAFT status.

Trade-off decision (for reviewer)

The alternative to `setTimeout`-chain is `setInterval` + early-return on
idle
. That would preserve the existing tests but keeps the timer wake rate
at the active cadence on idle (e.g. 5 wakes/sec for EventLoopMonitor even
while idle). The critic pass on issue #649 explicitly rejected the no-op
approach because it doesn't meet criterion 5 (≥30% CPU reduction). If the
maintainer prefers the less-invasive early-return approach, the existing tests
stay green and the only change is the idle CPU number.

Files

  • New: `src/utils/idle-state.ts`, `src/utils/idle-timeout.ts`, `scripts/bench-idle.mjs`
  • New tests: 6 files (3 unit, 3 integration)
  • Modified: `src/index.ts` (+80), `src/mcp-server.ts` (+7), `src/cdp/client.ts` (+9), 7 monitor implementations (setTimeout-chain migration)
  • Net: +1731 / −73 LoC across 19 files (tests included).

Implements issue #649. Two independent parts behind feature flags:

Part A (default ON, OPENCHROME_IDLE_ADAPTIVE=0 disables):
- New src/utils/idle-state.ts tracks last-active timestamp (RPC + CDP).
- Each of 8 monitors (EventLoop, ChromeProcessWatchdog, TabHealth,
  ChromeProcessMonitor, Disk, BrowserStateManager, SessionStatePersistence)
  migrated from setInterval to setTimeout chain, re-scheduling at end of
  each tick with active-rate or idle-rate delay per issue #649 §3.1.
- After 5 min of silence, timers drop to configured idle rates (≤10x slower);
  next RPC or CDP event restores active rate within one tick.

Part B (opt-in via --idle-timeout=<duration> or OPENCHROME_IDLE_TIMEOUT_MS):
- New src/utils/idle-timeout.ts. On tick, if idle AND sessionCount=0,
  invokes enhancedShutdown('idle-timeout') -> process.exit(0).
- Duration parser rejects bare numbers; requires ms/s/m/h suffix.

In-scope prerequisite per issue §2:
- enhancedShutdown now carries a shuttingDown reentrancy guard so the
  PPID watcher (PR #645) and idle-timeout cannot double-invoke shutdown.

Tests:
- tests/utils/idle-state.test.ts (new, all §5.2 cases).
- tests/utils/idle-timeout.test.ts (new, all §5.3 cases).
- tests/watchdog/idle-adaptive-monitors.test.ts (new, per-monitor rate
  and ratio checks).
- tests/integration/idle-adaptive-monitoring.test.ts (new).
- tests/integration/idle-timeout.test.ts (new).
- tests/integration/shutdown-reentrancy.test.ts (new, double-trigger test).
- scripts/bench-idle.mjs (new, CPU + heapUsed growth bench).

Known test migration pending:
- 22 pre-existing monitor tests across 6 suites assert setInterval-specific
  semantics (e.g. "start() schedules via setInterval"). The monitors still
  fire at identical active-mode cadence, but these tests need migration to
  setTimeout-chain-aware assertions. PR is opened as DRAFT while that
  migration is completed. New tests (39 passing) already cover the new
  behaviour end-to-end.
The draft idle-adaptive branch relaxed monitors immediately on process start, which broke the intended quiet-window semantics and left legacy timing tests asserting the wrong observable behavior.

This change treats process startup as the initial active edge and updates the timer-focused suites to validate the timeout-chain scheduler against that contract. The result keeps the adaptive feature intact while restoring a releasable test baseline.

Constraint: Adaptive monitors must not regress into immediate idle cadence on startup
Rejected: Disable idle-adaptive mode in tests globally | would hide the startup-semantics bug instead of fixing it
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep idle-state semantics and timer tests aligned; if startup behavior changes again, update both together
Tested: npx jest tests/utils/idle-state.test.ts tests/browser-state/snapshot.test.ts tests/watchdog/chrome-monitor.test.ts tests/utils/idle-timeout.test.ts tests/watchdog/idle-adaptive-monitors.test.ts --runInBand
Not-tested: Cross-platform CI rerun after pushing the branch
@shaun0927 shaun0927 marked this pull request as ready for review April 23, 2026 11:56
@shaun0927 shaun0927 merged commit a5a226e into develop Apr 23, 2026
8 of 9 checks passed
@shaun0927 shaun0927 deleted the feat/649-idle-adaptive-monitoring branch April 23, 2026 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant