Skip to content

Add OAuth client for outbound remote MCP sources#68

Open
mgoldsborough wants to merge 4 commits intomainfrom
feat/oauth-remote-mcp
Open

Add OAuth client for outbound remote MCP sources#68
mgoldsborough wants to merge 4 commits intomainfrom
feat/oauth-remote-mcp

Conversation

@mgoldsborough
Copy link
Copy Markdown
Contributor

Summary

  • Closes the gap where workspace bundles with { url } refs against OAuth-gated remote MCP servers returned 401 invalid_token and failed to start
  • Adds WorkspaceOAuthProvider (file-backed per (workspace, serverName)), /v1/mcp-auth/callback route, and a one-shot retry in McpSource.start() — the MCP TS SDK's OAuthClientProvider path is now wired end-to-end
  • Handles headless flows (Reboot Anonymous dev provider, client-credentials-style auth) by probing the authorize chain up to 10 hops and resolving in-process when it lands at our callback. Interactive browser OAuth (Granola, Claude.ai hosted) fails fast with a clear InteractiveOAuthNotSupportedError — staged for a follow-up iteration so the extension point is visible, not hidden behind TODOs

See the commit message for the full per-file breakdown.

Test plan

  • bun run verify passes on this branch alone (verified locally; 1881 unit + 123 web + 370 integration + 16 smoke, all green)
  • New unit tests cover: DCR roundtrip, tokens roundtrip, multi-hop headless detection, interactive-reject, missing-verifier, scope invalidation (10 tests for the provider + 4 for the flow registry)
  • Existing remote-integration.test.ts and remote-lifecycle.test.ts still pass — covers bundles with static transport.auth (which now bypasses the provider) and url-only bundles
  • Manual: added a Reboot-framework app to a workspace as { url: "http://localhost:9991/mcp" } with no static auth, restart NB — bundle registers, tokens persist under workspaces/<ws>/credentials/mcp-oauth/reboot-hello/, tools callable from the agent

Closes the gap where workspace bundles with `{ url }` refs against
OAuth-gated remote MCP servers (Reboot under `rbt dev`, ChatGPT Apps,
hosted Claude.ai MCP endpoints) returned `401 invalid_token` and
failed to start. The MCP TS SDK already supports the OAuth client
flow via `OAuthClientProvider`; we just weren't passing one.

- WorkspaceOAuthProvider: file-backed `OAuthClientProvider` scoped per
  `(workspace, serverName)`. Persists DCR client info, tokens, and
  PKCE verifier under `.nimblebrain/workspaces/<wsId>/credentials/
  mcp-oauth/<serverName>/` (mode 0o700). Follows the authorize-redirect
  chain up to 10 hops and detects when the chain lands at our own
  callback — Reboot's `Anonymous` dev flow completes headlessly this
  way. Non-self-target flows throw `InteractiveOAuthNotSupportedError`,
  staging real browser OAuth for a follow-up.
- oauth-flow-registry: process-local `Map<state, PendingFlow>` bridging
  the provider to the HTTP callback route. Extension point for the
  interactive flow.
- /v1/mcp-auth/callback route: `GET` handler that resolves pending
  flows by state. Unauthenticated by design (state param guards
  against unsolicited codes).
- createRemoteTransport: optional `authProvider` parameter; static
  `transport.auth` takes precedence.
- McpSource.start(): retry exactly once on `UnauthorizedError` after
  awaiting the provider's pending flow and calling
  `transport.finishAuth(code)`. Extracted helpers (connectWithTimeout,
  cleanupOnStartFailure) keep the retry readable.
- startBundleSource: instantiate WorkspaceOAuthProvider for url-ref
  bundles without static auth; threads `opts.wsId` + `opts.workDir`
  with sensible defaults.

Tests: 10 unit tests for the provider (persistence roundtrips,
headless self-target detection via multi-hop redirect, interactive
rejection) plus 4 for the flow registry.
Critical fixes:

- Cross-tenant OAuth credential leak closed. `lifecycle.ts::installRemote`
  now threads `wsId` + `workDir` into `startBundleSource` (previously
  dropped), and `startup.ts`'s URL-branch hard-errors on missing `wsId`
  when no static auth is configured. The old `?? "ws_default"` fallback
  would have silently pooled OAuth tokens across every workspace under
  the default id; named bundles already threw, URL bundles silently
  diverged. Matches the named-bundle precedent at the credential
  boundary.
- SSRF vector on the OAuth authorize chain closed. `validateBundleUrl`
  previously ran only on `ref.url`; a compromised remote MCP server
  could hand us an authorize URL pointing at AWS IMDS / RFC1918 / our
  own loopback, and we'd probe it directly. Every hop through
  `redirectToAuthorization`'s redirect loop now passes through
  `validateBundleUrl`; blocks surface as explicit
  `[workspace-oauth-provider] SSRF block …` errors so operators see the
  real cause instead of the generic "interactive not supported"
  fallthrough. `allowInsecureRemotes` is propagated into the provider
  so local-dev (Reboot on localhost:9991) still works when the flag
  is explicitly set.
- Callback self-match normalized. Comparing `next.origin + next.pathname`
  as a raw string was flaky: configured `callbackUrl` with a trailing
  slash, or a server echoing an explicit default port / uppercase
  hostname, would fail the equality check and silently fall through to
  InteractiveOAuthNotSupportedError. Provider now precomputes a
  canonical form (lowercased origin, pathname stripped of trailing `/`)
  at constructor time and compares against that.

Also: discovered and fixed a latent retry-path bug while writing the
integration test QA requested (W5). `StreamableHTTPClientTransport`
rejects a second `start()` on the same instance, so my original
"connect → catch 401 → finishAuth → connect on same transport" would
have failed whenever the retry actually fired. Matches the SDK's own
`simpleOAuthClient` pattern of new-transport-per-attempt: retry now
tears down the first transport+client, rebuilds via
`createRemoteTransport`, and reconnects on the fresh instance. The
real Reboot flow happened to work because tokens landed on disk via a
prior partial run and the first connect succeeded with cached tokens;
the test surfaced the gap.

Warnings applied:

- Bun.serve-based cases moved from `test/unit/workspace-oauth-provider`
  to `test/integration/workspace-oauth-provider`, per CLAUDE.md's
  classification rule.
- `test/integration/mcp-source-oauth-retry.test.ts` added with an
  end-to-end Bun.serve mock of the OAuth + MCP stack (discovery, DCR,
  authorize, token, /mcp 401-then-200). Directly exercises the retry
  path the PR introduces.
- `connectWithTimeout` timer leak fixed. The `setTimeout` handle is
  now captured and cleared in a finally; previously every successful
  connect leaked a 15–30s timer, doubled under the retry path.
- `rename` moved to the top static import in `workspace-oauth-provider`
  — dynamic `await import("node:fs/promises")` on every token write
  was out of step with the rest of the file.
- `mcp-auth.ts` callback route adds `Cache-Control: no-store` +
  `Pragma: no-cache`. Defense-in-depth against intermediate caches
  storing the success/failure HTML with `?code=` in the URL.

Warning #8 (token_endpoint_auth_method): acknowledged, deferred.
Switching between "none" and "client_secret_basic" based on DCR
response is not a one-liner — it requires implementing
`OAuthClientProvider.addClientAuthentication` to inject the credential
on token requests post-hoc. Noted as a TODO in a follow-up ticket;
shipping "none" unconditionally is correct for Reboot Anonymous and
works for any AS that issues refresh tokens to public clients (the
ext-apps examples do).

Suggestions applied:

- oauth-flow-registry gains a 15-minute TTL per registration (also the
  boundary documented in INTERACTIVE_OAUTH_UI.md). Timer handles are
  cleared on resolve/reject/_clearAll so late fires can't double-settle
  a promise. `timeout.unref?.()` so a stuck flow's timer doesn't keep
  CLI invocations alive.
- `mcp-source` cast `this.transport as StreamableHTTPClientTransport`
  replaced with a narrower `Transport & { finishAuth?: (...) }` —
  SSE also has finishAuth, the old cast was a narrowing lie.
- Redundant `this.dead = false` in retry path dropped — happy path
  leaves it unset, matching the default, one code path for "started."
- `NB_API_URL` startup warning: if unset when a URL-ref bundle is
  being wired, log a one-time warning that localhost:27247 is dev-only
  and prod deployments behind a proxy need the env var set.

Suggestion #11 (replace hand-rolled escape with Hono `html` template)
deliberately skipped — the escape helper is minimal, auditable, and
doesn't depend on Hono's tagging API; swap isn't worth the churn.

Full verify green: 1874 unit + 117 web + 378 integration + 16 smoke.
@mgoldsborough
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review. Fix-up landed in db8dfbf. PR #69 rebased onto the updated base.

Adjudication

Critical #1 (wsId drop) — Confirmed and fixed. installRemote now threads wsId + workDir into startBundleSource; the URL-branch hard-errors on missing wsId when no static auth is set (matching the named-bundle precedent). No more ws_default silent fallback. Saved a memory note on "don't ?? 'ws_default' workspace ids on credential boundaries" so future sessions don't step on the same mine.

Critical #2 (SSRF) — Confirmed and fixed. redirectToAuthorization now validates every hop through validateBundleUrl — initial URL, every redirect Location. Blocks surface as explicit [workspace-oauth-provider] SSRF block … errors so operators see the real cause, not the generic "interactive not supported." allowInsecureRemotes propagated into WorkspaceOAuthProviderOptions so local-dev (Reboot on localhost:9991) still works when the flag is explicitly on.

Critical #3 (callback match) — Confirmed and fixed. Precomputed canonical form at constructor time (lowercased origin, pathname stripped of trailing /). Regression test in integration covers trailing-slash + uppercase-host round-trip.

Bonus bug found writing W5's test: my original retry-path was broken — StreamableHTTPClientTransport rejects a second start() on the same instance, so the retry would have failed whenever it actually fired. Reboot happened to work because tokens landed on disk via a prior partial run, so the first connect succeeded with cached tokens. Fix mirrors the SDK's own simpleOAuthClient pattern: retry tears down the first transport+client, rebuilds via createRemoteTransport, reconnects on the fresh instance. Integration test at test/integration/mcp-source-oauth-retry.test.ts exercises the full 401→OAuth→200 flow.

Warnings 4, 5, 6, 7, 9 — all applied:

  • W4: Bun.serve cases moved from unit → integration per AGENTS.md classification
  • W5: new mcp-source-oauth-retry.test.ts — full mock of discovery + DCR + authorize + token + MCP endpoints, drives the SDK through the retry end-to-end
  • W6: connectWithTimeout timer captured + cleared in finally (was leaking 15–30s timer per start, doubled under retry)
  • W7: rename moved to top static import
  • W9: Cache-Control: no-store + Pragma: no-cache on callback responses

W8 (token_endpoint_auth_method) — acknowledged, deferred with TODO. "One-line check after DCR" understates it — switching between "none" and "client_secret_basic" post-registration requires implementing OAuthClientProvider.addClientAuthentication to inject the credential on token requests. Ship "none" unconditionally for now (works for Reboot Anonymous and AS's that issue refresh tokens to public clients, which the ext-apps examples do); follow-up ticket for confidential-client switching.

Suggestions 10, 12, 13, 14 — all applied:

  • S10: 15-min TTL per registration in flow-registry with clearTimeout on resolve/reject/_clearAll (prevents intra-process leaks); timeout.unref?.() so stuck flows don't keep CLIs alive
  • S12: dropped redundant this.dead = false in retry path (matches happy path)
  • S13: narrower Transport & { finishAuth?: (...) } cast — SSE has finishAuth too, old cast was a narrowing lie
  • S14: startup warning when NB_API_URL is unset and a URL-ref bundle is being wired

S11 (Hono html template) — deliberately skipped. Hand-rolled escape is minimal and auditable; the swap adds Hono API surface coupling for cosmetic gain.

Verify

1874 unit + 117 web + 378 integration + 16 smoke. Clean across the stack.

Process note

Your review caught two landable bugs my PR didn't — cross-tenant leak (ws_default fallback) and the SSRF. Both were specifically introduced by adding the new OAuth plumbing; pre-existing code didn't have these vectors. The lesson I'm writing down for future me: when adding a new code path behind a credential / identity / isolation boundary, the "sensible default" instinct for missing inputs is an anti-pattern. Hard error, not fallback.

@mgoldsborough mgoldsborough added the qa-reviewed QA review completed with no critical issues label Apr 23, 2026
Two-part follow-up to the QA fix-up's credential-boundary guard:

- Positive + negative integration tests for `startBundleSource`'s URL
  branch. (a) missing wsId + no static auth throws
  `requires opts.wsId`; (b) missing wsId WITH static auth starts cleanly
  — confirms the wsId requirement is scoped exactly to the path that
  would otherwise construct an OAuth provider. A future refactor that
  weakens the check to a default now fails CI.
- `products/nimblebrain/code/CLAUDE.md`'s "Workspace Isolation" section
  gains a one-line rule: hard-error on missing `wsId` at credential
  boundaries, don't silently default. Sits next to the existing
  `requireWorkspaceId()` guidance — same rule family.

The rule is enforceable by the test; the doc is a cross-reference for
humans reading the section. Not saved to session memory — it's a
codebase convention, versioned with the code.
Companion to 8666a75 — that commit caught the test but not the docs
update because `git add CLAUDE.md` resolved the symlink to AGENTS.md
and I didn't notice the status-line divergence. One-liner in the
existing "Workspace Isolation" section pointing at the rule.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qa-reviewed QA review completed with no critical issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant