Skip to content

perf(serve): gate HealthEndpoint off by default in stdio mode#652

Merged
shaun0927 merged 3 commits intodevelopfrom
feat/648-health-endpoint-stdio-gating
Apr 23, 2026
Merged

perf(serve): gate HealthEndpoint off by default in stdio mode#652
shaun0927 merged 3 commits intodevelopfrom
feat/648-health-endpoint-stdio-gating

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

Closes #648.

Summary

HealthEndpoint's HTTP listener served every stdio MCP instance with no consumer. This PR gates it off by default in stdio mode (and preserves daemon-mode behaviour) behind a pure resolver with an explicit opt-in via `OPENCHROME_HEALTH_ENDPOINT`.

Acceptance criteria (from issue #648 §4)

  1. ✅ Default stdio → health port NOT bound (integration test, `ECONNREFUSED` within 500 ms).
  2. ✅ `--transport http` / `--transport both` → health port bound (integration test, `GET /health` returns 200 JSON).
  3. ✅ `OPENCHROME_HEALTH_ENDPOINT=1 --transport stdio` → bound (integration test).
  4. ✅ `OPENCHROME_HEALTH_ENDPOINT=0 --transport http` → not bound (integration test).
  5. ✅ Exactly one startup log line declares the state (grep of stderr in each scenario).
  6. ✅ Graceful shutdown with `healthEndpoint=null` produces no `TypeError` / `Cannot read properties of null` / unhandled rejection. Explicit audit in integration test: SIGTERM each of 7 scenarios, assert clean exit and clean stderr.
  7. ✅ Invalid values (`garbage`, `2`, `yes`, empty) fall through to transport-mode default — no startup error (unit + integration tests).
  8. ✅ FD count drops by exactly 1 per stdio instance (bench: 31 → 30, deterministic).

Pre-merge checklist (from issue #648 §5)

5.1 Static / hygiene

  • ✅ `npm run lint` — 0 errors, 44 pre-existing warnings (none in new files).
  • ✅ `npm run build` — 0 TypeScript errors.
  • ✅ `npm test` — 3526 passed / 0 failed (+15 new tests from this PR, zero regressions vs baseline).
  • ✅ `grep -n 'healthEndpoint\.' src/index.ts` — references: line 549 inside `if (healthEndpoint) { ... }` block and line 621 optional-chained (`healthEndpoint?.stop()`). No unguarded references.
  • ✅ No new `console.log` introduced in the diff (audited by `git diff upstream/develop | grep -E '^\+.*console\.log'` — empty).

5.2 Unit: pure resolver (8/8 green)

  • ✅ All 8 cases from §5.2 pass: stdio unset → false; http unset → true; both unset → true; stdio+'1' → true; stdio+'true' → true; http+'0' → false; http+'false' → false; stdio+'garbage' → false.

5.3 Integration: real port binding (7/7 green)

  • ✅ stdio default refused.
  • ✅ `--transport http` bound.
  • ✅ `OPENCHROME_HEALTH_ENDPOINT=1 + stdio` bound.
  • ✅ `OPENCHROME_HEALTH_ENDPOINT=0 + http` refused.
  • ✅ `garbage + stdio` falls through (refused).
  • ✅ `garbage + http` falls through (bound).
  • ✅ `--transport both` bound with mode=both log line.
  • ✅ Graceful shutdown audit — all 7 scenarios exit 0 with no `TypeError` in stderr.

5.4 Memory measurement gate

  • ✅ Bench run: n=20 spawns per scenario.
  • RSS median stdio-patched: 121.995 MB
  • RSS median stdio-baseline: 122.028 MB (delta: ~33 KB, within noise as expected)
  • FD median stdio-patched: 30
  • FD median stdio-baseline: 31 (delta: -1 FD per instance, hard evidence)
  • `heapUsed` not externally observable; the bench script documents this limitation and relies on RSS + FD as gates. RSS within noise, FD delta deterministic. Note: the issue §4 criterion 8 specified a `heapUsed` gate — this PR substitutes FD count (cleanly deterministic) plus RSS observation as the substantive evidence.

5.5 Cross-platform CI

  • 🟡 Will run on PR open (9 jobs: ubuntu/macOS/windows × Node 18/20/22).

5.6 Operator UX regression

  • ✅ Daemon mode (`--transport http`, `--transport both`) byte-for-byte unchanged.
  • ✅ Log line at startup makes current state observable.

Benchmark results (from issue §5.4)

Scenario RSS median FD median
stdio-patched (endpoint disabled) 121,995,264 B 30
stdio-baseline (OPENCHROME_HEALTH_ENDPOINT=1) 122,028,032 B 31
http-patched 122,953,728 B 32

Per-stdio-instance FD saving: -1 FD. With 80 concurrent stdio instances this reclaims 80 FDs that were bound listeners with no consumer.

Files

  • `src/utils/health-endpoint-gating.ts` (new, ~20 LoC, pure function)
  • `src/index.ts` (modified: gated constructor + log line + optional-chained teardown)
  • `tests/utils/health-endpoint-gating.test.ts` (new, 8 unit tests)
  • `tests/integration/health-endpoint-gating.test.ts` (new, 7 integration tests)
  • `scripts/bench-health-endpoint.mjs` (new, n=20 benchmark)
  • `README.md` (env-var table entry)
  • `docs/releases/v1.10.3.md` (new, staging notes bullet)

Rollback

Immediate operator-side: `OPENCHROME_HEALTH_ENDPOINT=1` in environment, no restart needed on next spawn. No config migration, no data to undo. Full revert is a single `git revert`.

Implements issue #648. Per-instance cost of the HealthEndpoint HTTP
listener is unnecessary for stdio MCP sessions (1 FD, ~200 KB heap)
but remains useful in daemon modes (http / both). This change:

- Adds src/utils/health-endpoint-gating.ts with a pure resolver:
  OPENCHROME_HEALTH_ENDPOINT=1/true -> on, =0/false -> off, other values
  fall through to the transport-mode default (http/both -> on, stdio -> off).
- Wires the resolver at HealthEndpoint construction in src/index.ts.
  When disabled, the instance is null and teardown is optional-chained.
- Preserves daemon-mode behaviour byte-for-byte (existing integrators
  continue to see /health on the same port).
- Emits one stderr log line per startup declaring enabled/disabled state.

Tests:
- tests/utils/health-endpoint-gating.test.ts — 8 resolver cases.
- tests/integration/health-endpoint-gating.test.ts — 7 scenarios
  including graceful-shutdown audit (no TypeError with null endpoint).

Documentation:
- README.md env-var table entry for OPENCHROME_HEALTH_ENDPOINT.
- docs/releases/v1.10.3.md staging notes bullet.

Benchmark (n=20 spawns, scripts/bench-health-endpoint.mjs):
- stdio baseline (endpoint=1): 31 FDs per instance, RSS median 122 MB.
- stdio patched (endpoint=0 default): 30 FDs per instance, RSS median 122 MB.
- FD delta: -1 per stdio instance (deterministic, matches expectation).
- http-patched: 32 FDs (unchanged from baseline).

Rollback: OPENCHROME_HEALTH_ENDPOINT=1 restores the stdio listener.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 903e913c9b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/index.ts
Comment on lines +548 to +550
console.error(`[SelfHealing] HealthEndpoint: enabled (port=${healthPort}, bind=${healthBind}, mode=${transportMode})`);
healthEndpoint.start().catch((err: unknown) => {
console.error('[SelfHealing] HealthEndpoint start failed:', err);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Log enabled state only after listener startup succeeds

The code emits HealthEndpoint: enabled before healthEndpoint.start() has actually bound the socket. If startup fails (for example, EADDRINUSE on the default port when another instance is running), this line still claims the endpoint is enabled even though no listener is active. That makes the new startup state log unreliable in a realistic daemon deployment scenario and can mislead health/ops checks that consume this signal.

Useful? React with 👍 / 👎.

The stdio-gating change correctly stopped binding the health listener in unnecessary cases, but the disabled log always claimed stdio mode even when operators had explicitly forced the endpoint off in HTTP mode.

This narrows the message to the actual reason so stderr remains trustworthy during rollout and debugging.

Constraint: Startup logs must stay on stderr and preserve the existing gating behavior
Rejected: Leave the misleading message in place | would make forced-off HTTP cases look like transport-default disablement
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If more health-endpoint gating states are added, update the operator log strings and integration assertions together
Tested: npx jest tests/integration/health-endpoint-gating.test.ts --runInBand
Not-tested: Fresh GitHub Actions matrix after branch push
The stdio health-endpoint fix does not need to stage a future release note in the feature branch, and carrying a v1.10.3 draft here collides with the other open performance PRs.

Dropping the staging note keeps the branch mergeable and leaves final release documentation to the actual release-prep step.

Constraint: Open performance PRs must merge cleanly into develop without competing release-note drafts
Rejected: Keep the v1.10.3 staging file here | creates needless conflicts with sibling PRs touching the same placeholder doc
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Do not stage shared future release-note files in issue-specific PR branches unless the release owner explicitly asks for it
Tested: git diff --name-only upstream/develop...HEAD
Not-tested: Re-run GitHub CI after push
@shaun0927 shaun0927 merged commit 2ff1d02 into develop Apr 23, 2026
6 of 9 checks passed
@shaun0927 shaun0927 deleted the feat/648-health-endpoint-stdio-gating branch April 23, 2026 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant