Skip to content

v0.31.3 fix: stdio MCP graceful cleanup + engine-aware auth/admin SQL (closes #413, #446)#801

Merged
garrytan merged 5 commits intomasterfrom
garrytan/washington-v1
May 10, 2026
Merged

v0.31.3 fix: stdio MCP graceful cleanup + engine-aware auth/admin SQL (closes #413, #446)#801
garrytan merged 5 commits intomasterfrom
garrytan/washington-v1

Conversation

@garrytan
Copy link
Copy Markdown
Owner

@garrytan garrytan commented May 10, 2026

Summary

Two community PRs landed as one focused PR with two atomic bisect-friendly commits:

stdio MCP graceful cleanup (#676 successor): stdio EOF, SIGTERM, SIGINT, SIGHUP, parent-process death (every reparent case — PID 1, launchd subreaper, systemd, tmux, parent shell with PR_SET_CHILD_SUBREAPER), and an optional --stdio-idle-timeout all funnel into one idempotent shutdown that releases the engine and the PGLite write lock within 5 seconds. ps-unavailable stripped containers log loudly and skip the watchdog rather than silently fall back to process.ppid (Bun caches it). Closes #413, #446. Supersedes #591. Credit @Aragorn2046, @seungsu-kr.

Engine-aware auth/admin SQL (#681 successor): gbrain auth, OAuth 2.1 (register-client / token mint / revoke), the admin SPA, file uploads, and the legacy bearer-auth HTTP transport all run through the active engine via a deliberately-narrow SqlQuery adapter (src/core/sql-query.ts) calling engine.executeRaw. JSONB writes route through a separate executeRawJsonb(engine, sql, scalarParams, jsonbParams) helper that emits explicit $N::jsonb casts. Migration v46 normalizes pre-v0.31.3 string-shaped mcp_request_log.params rows up to real JSONB objects. Credit codex-bot.

Decision log: ~/.claude/plans/system-instruction-you-are-working-peppy-moore.md (5 eng-review decisions + 8 codex outside-voice decisions; 3 of the eng-review decisions reversed by codex including the SqlQuery.json() extension which was rejected in favor of the separate executeRawJsonb helper).

Test Coverage

COMMIT 1 — stdio cleanup (src/commands/serve.ts +270 / -4)
[+] installStdioLifecycle()
  ├── [★★★] SIGTERM/SIGINT/SIGHUP — 3 cases
  ├── [★★★] stdin 'end' / 'close' — 2 cases
  ├── [★★★] TTY stdin skips watcher — 1 case
  ├── [★★★] idempotent shutdown under race — 1 case
  ├── [★★★] 5s cleanup deadline when disconnect() rejects — 1 case
  ├── [★★★] watchdog reparent-to-1 (init) — 1 case
  ├── [★★★] watchdog reparent-to-PID-N (subreaper) — 1 case (NEW per codex #3)
  ├── [★★★] watchdog NOT installed when initial ppid = 1 — 1 case
  ├── [★★★] ps-unavailable startup probe disables watchdog + loud log — 1 case (NEW per codex #4)
  ├── [★★★] watchdog tick with unchanged ppid is no-op — 1 case
  └── [★★★] --stdio-idle-timeout (9 cases incl. strict parse rejects)

COMMIT 2 — SqlQuery routing
[+] src/core/sql-query.ts (NEW)
  ├── [★★★] scalar binds round-trip — 1 case
  ├── [★★★] rejects array/promise/object — 1 case
  ├── [★★★] executeRawJsonb({k:v}) round-trip with jsonb_typeof = 'object' — 1 case
  ├── [★★★] takes-holders v0.12.0 double-encode regression guard — 1 case
  ├── [★★★] null JSONB handling — 1 case
  ├── [★★★] mixed scalar+jsonb call shape — 1 case
  └── [★★★] non-scalar in scalarParams rejected — 1 case
[+] test/e2e/auth-takes-holders-pglite.test.ts (NEW, in-memory PGLite, no DATABASE_URL gate)
  ├── [★★★] access_tokens.permissions create + read returns object — 1 case
  ├── [★★★] access_tokens.permissions UPDATE preserves shape — 1 case
  ├── [★★★] mcp_request_log.params object writes round-trip — 1 case
  ├── [★★★] mcp_request_log.params NULL writes — 1 case
  ├── [★★★] migration v46 normalizer lifts string-shaped rows — 1 case
  └── [★★★] migration v46 idempotent (second run no-op) — 1 case
[+] test/config-env.test.ts (NEW, withEnv() helper per CLAUDE.md R1)
  └── [★★★] DATABASE_URL/GBRAIN_DATABASE_URL precedence — 5 cases
[+] test/http-transport.test.ts (mock updated to intercept engine.executeRaw)
  └── [★★★] all 24 cases pass post-migration

COVERAGE: 64 tested paths across 5 test files (PGLite + Postgres parity)
QUALITY: ★★★ 64/64 — all behavioral, none smoke
GAPS: 0

Tests: 4786 pass / 1 pre-existing fail (brain-registry.serial, reproduces on master pre-branch — not introduced by this PR).

Pre-Landing Review

Already ran via /plan-eng-review (5 issues found, all addressed) + /codex outside-voice consult-mode plan review (10 findings, 9 closed by re-design + 1 wording fix). Both reviews logged CLEAR before commits landed. Implementation matches the reviewed plan; no new findings during the merge.

Plan Completion

11/11 plan tasks DONE — see commit messages and ~/.claude/plans/system-instruction-you-are-working-peppy-moore.md. Decisions D1, D2, D5 were reversed by Codex's outside-voice review (recorded in the plan's decision log) before implementation started, so the shipped code matches the post-review consensus, not the original plan.

TODOS

No TODOS.md items completed by this PR. The PR closes GitHub issues #413 and #446 (which aren't tracked in TODOS.md).

Documentation

CLAUDE.md updated with v0.31.3 annotations in the Key Files section: new entries for src/core/sql-query.ts and src/commands/serve.ts, amended notes on auth.ts, serve-http.ts, and migrate.ts (v46). README, CHANGELOG, TODOS, and VERSION required no changes (CHANGELOG already shipped via the merge commit).

Test plan

  • bun run typecheck — clean
  • bash scripts/check-jsonb-pattern.sh — clean
  • bash scripts/check-test-isolation.sh — clean (299 non-serial unit files)
  • bun test test/serve-stdio-lifecycle.test.ts — 22/22 pass
  • bun test test/sql-query.test.ts — 7/7 pass
  • bun test test/config-env.test.ts — 5/5 pass
  • bun test test/http-transport.test.ts — 24/24 pass (mock updated to intercept engine.executeRaw)
  • bun test test/e2e/auth-takes-holders-pglite.test.ts — 6/6 pass
  • bun run test (full unit fast loop) — 4786 pass / 1 pre-existing fail
  • bun run verify — typecheck + 4 pre-checks all clean
  • Bisect smoke: git bisect start HEAD~3; git bisect bad HEAD; git bisect good HEAD~3 — verified before merge that commit 1 (stdio) typechecks + tests pass independently
  • Real-Postgres E2E (bun run test:e2e) — not run locally; CI will run on PR

🤖 Generated with Claude Code


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

garrytan and others added 5 commits May 9, 2026 21:34
The PGLite write lock leaked indefinitely when the parent of `gbrain serve`
disconnected. Three root causes: serve.ts never called engine.disconnect()
after startMcpServer() resolved; cli.ts short-circuited with a "serve doesn't
disconnect" comment; and the MCP SDK's StdioServerTransport only listens for
'data'/'error' on stdin, never 'end'/'close', so even a clean stdin EOF never
reached the SDK.

Net effect: the next `gbrain serve` waited for the in-process 5-minute stale-
lock check or hung indefinitely.

stdio path now installs a unified lifecycle:
- SIGTERM/SIGINT/SIGHUP all funnel into one idempotent shutdown path
  (SIGHUP coverage matters for Claude Desktop on macOS / MCP gateway
  restarts; SIGINT for Ctrl-C; SIGTERM for daemon shutdown).
- stdin 'end' (clean EOF) and 'close' (parent SIGKILL with pipe still
  open) both trigger the same graceful path. TTY stdin skips the watchers
  so interactive `gbrain serve` is unaffected.
- Parent-process watchdog polls the live kernel parent PID via spawnSync
  ('ps','-o','ppid=','-p',PID) every 5s. process.ppid is cached at process
  creation by Bun (and Node) and never refreshes on re-parent — empirical
  evidence on macOS shows ps reports the new parent within one tick while
  process.ppid stays at the original PID indefinitely (oven-sh/bun#30305).
- Watchdog fires on `getParentPid() !== initialParentPid` (any reparent),
  not just `=== 1`. Catches launchd / systemd / tmux / parent-shell-with-
  PR_SET_CHILD_SUBREAPER cases where the kernel re-anchors us to a non-1
  subreaper PID. Codex review caught the original `=== 1` was incomplete.
- One-shot startup probe verifies `spawnSync('ps')` actually works on this
  host. If the probe fails (stripped containers / busybox without procps),
  we skip installing the watchdog interval entirely AND emit a loud stderr
  line — the operator sees "watchdog disabled" instead of an installed-
  but-never-fires phantom that silently falls back to cached process.ppid.
- 5-second cleanup deadline: if engine.disconnect() wedges (PGLite WASM
  stall, etc.), the process still calls process.exit(0). The abandoned
  lock dir is reclaimed on the next start by the existing stale-lock
  check in pglite-lock.ts.
- Optional `--stdio-idle-timeout <sec>`: default OFF safety net for
  parents that leak the pipe but never close it. Strict parsing rejects
  `abc` / `30junk` / `-1` / `1.5` / blank values explicitly so a typo
  doesn't silently disable the safety net (closes #446).

Test seam: ServeOptions { stdin, signals, exit, log, startMcpServer,
getParentPid, setInterval, clearInterval, probeWatchdog } lets the
lifecycle be unit-tested deterministically without spawning a real Bun
child or booting the MCP SDK.

22 test cases covering signals, stdin EOF, TTY skip, watchdog reparent
(both PID-1 and subreaper-PID-N cases), ps-unavailable degraded mode,
idle timeout, idempotent shutdown, and cleanup-deadline behavior.

Closes #413, #446. Supersedes #591.

Co-Authored-By: Aragorn2046 <noreply@github.com>
Co-Authored-By: seungsu-kr <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`gbrain auth` and `gbrain serve --http` previously routed every SQL
through the postgres.js singleton in src/core/db.ts, which silently fell
back to a file-backed PGLite when DATABASE_URL was set but the config
file disagreed. The HTTP transport's verbatim use of the singleton also
made `gbrain serve --http` Postgres-only, even though the
`access_tokens` and `mcp_request_log` tables exist in both engine
schemas.

Auth, OAuth, admin, file uploads, and HTTP-transport SQL now run through
`engine.executeRaw` via a deliberately narrow tagged-template adapter
(`src/core/sql-query.ts`). The contract is scalar-binds-only — adding
JSONB or fragment composition would invite the adapter to drift into a
partial postgres.js clone. JSONB writes use a separate
`executeRawJsonb(engine, sql, scalarParams, jsonbParams)` helper that
composes positional `$N::jsonb` casts and passes objects through
`engine.executeRaw`. The CI guard at `scripts/check-jsonb-pattern.sh`
doesn't fire because the helper is a method call, not the banned
`${JSON.stringify(x)}::jsonb` template-literal interpolation, and the
v0.12.0 double-encode bug class doesn't apply to positional binding via
`postgres.js`'s `unsafe()` (verified by
`test/e2e/auth-permissions.test.ts:67` on Postgres and the new
`test/sql-query.test.ts` on PGLite).

Migrated call sites:
  - src/commands/auth.ts: takes-holders writes (lines 52, 86) →
    executeRawJsonb. List, revoke, register-client, revoke-client →
    SqlQuery via withConfiguredSql() helper that opens an engine, runs
    the callback, disconnects.
  - src/commands/serve-http.ts: ~25 call sites including the four
    mcp_request_log.params INSERTs (now write real JSONB objects, not
    JSON-encoded strings — the read side `params->>'op'` returns the
    operation name, closing CLAUDE.md's outstanding "JSON-string-into-
    JSONB" note as a side effect). The /admin/api/requests dynamic
    filter pattern (postgres.js fragment composition) is rewritten as
    parametrized SQL string + params array.
  - src/mcp/http-transport.ts: legacy bearer-auth path. The
    Postgres-only fail-fast at startup is removed because both schemas
    now carry access_tokens + mcp_request_log.
  - src/core/oauth-provider.ts: SqlQuery / SqlValue types relocated
    from here to sql-query.ts as the canonical home (Codex finding #8).
  - src/commands/files.ts: all 5 db.getConnection() sites (lines 104,
    139, 252, 326, 355). The line-256 INSERT into files.metadata uses
    executeRawJsonb; the other four are scalar-only SqlQuery (Codex
    finding #6 — scope was bigger than the plan's "lone INSERT" framing).
  - src/core/config.ts: env-var DATABASE_URL inference. When dbUrl is
    set, infer Postgres engine and clear the stale database_path.

Engine-internal sql.json() sites in src/core/postgres-engine.ts (5
sites: lines 520, 1689, 1728, 1790, 2313) STAY UNCHANGED. They live
inside PostgresEngine itself, where the postgres.js template-tag
sql.json() pattern is correct — those methods are only loaded when
Postgres is the active engine, so there's no PGLite-routing concern.

Migration v45 (mcp_request_log_params_jsonb_normalize): one-shot UPDATE
that lifts pre-v0.31 string-shaped JSONB rows to objects so the
/admin/api/requests endpoint at serve-http.ts:605 returns one
consistent shape to the admin SPA. Idempotent (subsequent runs find no
rows where jsonb_typeof = 'string'). Closes the mixed-shape window
that would otherwise have made post-deploy admin reads break.

Tests:
  - test/sql-query.test.ts: 7 cases covering scalar binds, the
    .json() rejection (defense in depth — SqlQuery is scalar-only),
    JSONB round-trip with `jsonb_typeof = 'object'` and `->>`
    semantics, the v0.12.0 double-encode regression guard, null
    JSONB handling, and the scalars-then-jsonb call shape.
  - test/config-env.test.ts: migrated from PR's manual `restoreEnv()`
    in afterEach to the canonical `withEnv()` helper at
    test/helpers/with-env.ts (CLAUDE.md R1 / codex finding D3).
    Five cases covering DATABASE_URL precedence, GBRAIN_DATABASE_URL
    operator override, file-only config, env-only config, and the
    no-config null path.
  - test/e2e/auth-takes-holders-pglite.test.ts: 6 cases against
    in-memory PGLite (no DATABASE_URL gate). Covers create / update /
    read of access_tokens.permissions, mcp_request_log.params object
    + null writes, and the migration v45 normalizer (seed
    string-shaped row, run UPDATE, assert object shape; second-run
    no-op for idempotency).
  - test/http-transport.test.ts: mock updated to intercept
    engine.executeRaw (the new code path) instead of the postgres.js
    template tag. 24 cases pass.

Plan reference: ~/.claude/plans/system-instruction-you-are-working-peppy-moore.md.
Codex outside-voice review applied: D-codex-1, D-codex-2, D-codex-5,
D-codex-8, D-codex-9, D-codex-10 (and D1, D5 reversed by codex).

Closes the architectural intent of #681. Supersedes its branch.

Co-Authored-By: codex-bot <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings master's v0.31.0 facts hot memory + v0.31.1 thin-client + v0.31.1.1
fix-wave + v0.31.2 sync-strategy-code wave into the branch. Two real merge
conflicts resolved:

- src/commands/serve-http.ts: master refactored the inlined op.handler call
  into dispatchToolCall() with metaHook for hot-memory `_meta` injection.
  Branch's executeRawJsonb migration for the four mcp_request_log.params
  INSERT sites now applies on top of master's new dispatch flow. The
  catch-throw, toolResult.isError, and success paths all route through
  executeRawJsonb so JSONB writes preserve the object shape (D1 wave).

- src/core/migrate.ts: master's v45 (facts_hot_memory_v0_31) keeps slot
  45; branch's mcp_request_log_params_jsonb_normalize moves to v46 to
  avoid the version collision. The normalizer SQL is unchanged — runs
  after the facts table lands so /admin/api/requests sees one consistent
  shape post-deploy.

VERSION 0.31.2 -> 0.31.3 (next available patch slot; v0.31.4 and v0.31.6
already claimed by open PRs #795 #796).
package.json synced.
CHANGELOG entry added describing both bundled fixes (#676 stdio cleanup,
#681 engine-aware auth SQL).

Targeted tests: 64 cases across 5 files all pass post-merge:
  bun test test/serve-stdio-lifecycle.test.ts test/sql-query.test.ts \
    test/config-env.test.ts test/http-transport.test.ts \
    test/e2e/auth-takes-holders-pglite.test.ts

Verify gate: typecheck clean, all pre-checks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Annotate the v0.31.3 changes in the canonical Key Files section:
new src/core/sql-query.ts adapter (#681), src/commands/serve.ts stdio
cleanup (#676), v0.31.3 amendments to auth.ts / serve-http.ts /
oauth-provider.ts surfaces, and migration v46 normalizer in migrate.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI's build-llms test asserts the committed llms.txt + llms-full.txt
match what scripts/build-llms.ts produces from current source state.
CLAUDE.md was amended by /document-release post-merge (new entries for
src/core/sql-query.ts and src/commands/serve.ts; amended notes on
auth.ts / serve-http.ts / migrate.ts), so the inlined-bundle fell out
of sync. Regenerated via `bun run build:llms`.

llms.txt unchanged (curated index — no new web URLs added).
llms-full.txt updated to inline the new CLAUDE.md content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 9c60b3a into master May 10, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP serve: stdin=/dev/null causes lock contention after gateway restart

1 participant