
v0.33.1.0 feat: eval-gated whoknows — expertise + relationship-proximity routing#881

Open
garrytan wants to merge 16 commits into master from garrytan/vilnius-v1

Conversation

@garrytan
Owner

Summary

v0.33 wedge release: gbrain whoknows <topic> (CLI) + the find_experts MCP op route expertise queries against person/company pages in your brain. Eval-gated by design — naive ranking ships first; substrate (community detection, formal relationship_score table) is queued for v0.34, contingent on what your real-brain eval shows.

Headline capability: Ask the question you actually ask — gbrain whoknows "lab automation" returns top-5 person/company candidates ranked by expertise depth (sub-linear match), recency (6-month half-life, floored at 0.1), and salience (centered at 0.5). --explain flag shows the factor breakdown so you can audit the ranking.
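For orientation, here is a minimal sketch of the locked ENG-D1 scoring formula (the exact expression is quoted in the commit notes further down); the function and field names are illustrative, not the shipped exports of src/commands/whoknows.ts.

```typescript
// Minimal sketch of the ENG-D1 ranking formula; names are illustrative.
interface CandidateSignals {
  rawMatch: number; // hybridSearch RRF + source-boost-adjusted score
  daysOld: number;  // days since the page's effective date
  salience: number; // page salience in [0, 1]
}

function scoreCandidate({ rawMatch, daysOld, salience }: CandidateSignals): number {
  const depth = Math.log(1 + rawMatch);                    // sub-linear expertise depth
  const recency = Math.max(0.1, Math.exp(-daysOld / 180)); // 180-day decay ("6-month half-life"), floored at 0.1
  const weight = 0.5 + 0.5 * salience;                     // salience centered at 0.5
  return depth * recency * weight;
}
```

The --explain breakdown reported by the CLI corresponds to these three factors, so a ranking can be audited factor by factor.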

Plan provenance: Plan went through /office-hours/plan-ceo-review (SCOPE_EXPANSION mode, 4 cherry-picks added) → Codex outside-voice review (8 findings, 5 strategic re-litigations + 3 substrate defects accepted) → /plan-eng-review (6 implementation specs locked). Net result: scope reduced ~75% from the cathedral version while shipping the actual wedge.

Commits (9, bisect-friendly atomic chunks):

searchHybrid extension:

  • caa5e041 — Add SearchOpts.types: PageType[] multi-type filter to searchHybrid. Pushes type filter to SQL via AND p.type = ANY($N::text[]) in both engines (PGLite + Postgres) across searchKeyword, searchVector, searchKeywordChunks.

Whoknows core:

  • e505bacc — src/commands/whoknows.ts with the locked ENG-D1 ranking formula + 16 unit tests covering 10 ENG-D3 shadow paths + composite-key (Codex F1) + numerical pin.
  • 994513dc — find_experts MCP op registered (scope: read, localOnly: false) + gbrain whoknows CLI dispatch + thin-client routing (v0.31.1 pattern).

Eval gate:

  • 79195e06 — gbrain eval whoknows two-layer gate (ENG-D2): hand-labeled quality (>=80% top-3) + eval_candidates replay regression (>=0.4 Jaccard@3) with sparseness fallback. 23 unit cases pin the math.

Doctor + fixture + E2E:

  • 3fde98a9 — whoknows_health doctor check (fixture presence + row count). 5 unit cases.
  • 14a0b023 — synthetic 10-query fixture at test/fixtures/whoknows-eval.jsonl + E2E test against seeded PGLite brain. Asserts >=80% top-3 hit rate against the synthetic fixture.

Docs:

  • 133d1d38 — CHANGELOG release-summary + CLAUDE.md Key Files annotations for new paths + llms.txt regen.

Gap-fill tests:

  • ed2f3837 — Engine-level typeFilter coverage (9 cases) + find_experts op contract tests (11 cases).

Version:

  • a425022f — v0.33.1.0 bump (queue-collision with v0.33.0).

Merge commit 9ac52ced from master picked up v0.32.0 (5 new embedding recipes + discoverability pass).

Test Coverage

All new code paths covered:

  • test/whoknows.test.ts — 16 cases pinning the locked ranking formula + 10 ENG-D3 shadow paths + composite-key + factor decomposition.
  • test/eval-whoknows.test.ts — 23 cases on jaccardAtK, topKHit, fixture parsing, threshold constants.
  • test/whoknows-doctor.test.ts — 5 cases on missing/empty/undersized/ok fixture states.
  • test/search-types-filter.test.ts — 9 engine-level cases for SearchOpts.types SQL emission.
  • test/find-experts-op.test.ts — 11 cases on MCP op contract (scope, params, handler, empty-topic rejection).
  • test/e2e/whoknows.test.ts — 5 E2E cases against seeded PGLite brain: quality gate hit rate, typeFilter excludes concept decoys, empty-result safety, --explain shape, limit honoring.

Tests: 69 new across 6 files. All passing.

Pre-Landing Review

Eng review CLEAR (logged from prior turn). 6 issues found, all locked into plan via ENG-D1..D6. 0 critical gaps. No new findings from this version-bump commit.

Eval Results

No prompt-related files changed — evals skipped.

Plan Completion

Plan at ~/.claude/plans/system-instruction-you-are-working-quirky-papert.md. All week-1 items DONE:

  • 20-query eval set: ships as 10-query synthetic placeholder; end users replace with their real queries before grading their brain.
  • gbrain whoknows CLI + find_experts MCP op + --explain flag: shipped.
  • gbrain eval whoknows two-layer gate: shipped.
  • whoknows_health doctor check: shipped.
  • searchHybrid typeFilter: shipped.
  • 10-case unit test list: shipped (16 cases — exceeds the ENG-D3 lock).
  • Person-person projection spec (ENG-D4): documented in plan for v0.34 substrate work; not shipped this release per ENG-D11 (substrate eval-gated).

Substrate work (community detection, formal relationships table) deferred to v0.34 contingent on real-brain eval showing naive whoknows fails the gate.

Verification Results

E2E suite passes against real Postgres (5435 container): test/e2e/whoknows.test.ts, search-quality.test.ts, search-swamp.test.ts, search-exclude.test.ts, graph-quality.test.ts, engine-parity.test.ts, schema-drift.test.ts, integrity-batch.test.ts, postgres-engine.test.ts — all clean.

Unit suite: 5502 pass / 1 fail. The single failure is eval-longmemeval > warm-create speed gate — a pre-existing environmental perf flake under parallel-shard contention on this dev machine. Verified pre-existing: same failure reproduces on origin/master at v0.32.0 under the same load. Passes in isolation (28ms p50, well under 500ms gate). Not a regression from this PR.

TODOS

No TODO items completed in this PR. New v0.34 candidates documented in the plan file:

  • Formal relationships table (composite-keyed) — substrate eval-gated
  • page_communities table + Jaccard-stable community alignment — substrate eval-gated
  • Louvain via graphology-communities-louvain — substrate eval-gated
  • gbrain prep / gbrain stale as OpenClaw skills (per Codex F8 layer separation)
  • Proactive nudges, intro suggestions, conversation continuity

Test plan

  • All unit tests pass (5502 / 5503 — see Verification Results re: the 1 pre-existing flake)
  • All new whoknows + eval-whoknows + doctor + types-filter + find_experts unit tests pass (69 / 69)
  • All E2E tests pass on real Postgres (gbrain-test-pg container)
  • Typecheck clean (bunx tsc --noEmit --skipLibCheck)
  • User runs gbrain eval whoknows test/fixtures/whoknows-eval.jsonl against their own brain after replacing placeholder queries with real ones

🤖 Generated with Claude Code

garrytan and others added 16 commits May 10, 2026 21:45
Push the page-type filter into SQL via AND p.type = ANY($N::text[]) in
both engines' searchKeyword + searchVector + searchKeywordChunks paths.
Primary consumer is the upcoming gbrain whoknows command (filters to
['person','company']); the limit budget then goes to typed candidates
instead of being eaten by note/transcript/article pages. Future
entity-only search in v0.34+ reuses the parameter for free.

AND-applies alongside the existing single-value type filter (callers can
use either or both). HybridSearchOpts threads opts.types into the
underlying searchOpts so hybridSearch callers get the SQL-level filter
without any post-filter waste.
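A rough sketch of how that pushdown composes, assuming a parameterized query-building style; the helper name, table alias, and base query here are placeholders, not the engine's actual code.

```typescript
// Sketch only: append a multi-type filter to a parameterized WHERE clause.
function appendTypeFilter(
  sql: string,
  params: unknown[],
  types?: string[],
): { sql: string; params: unknown[] } {
  if (!types || types.length === 0) return { sql, params };
  const next = [...params, types];
  // $N is the next positional parameter; ::text[] keeps the array cast explicit.
  return { sql: `${sql} AND p.type = ANY($${next.length}::text[])`, params: next };
}

// whoknows-style usage: restrict candidates to person/company pages.
const filtered = appendTypeFilter(
  "SELECT p.* FROM pages p WHERE 1 = 1",
  [],
  ["person", "company"],
);
// filtered.sql: ... WHERE 1 = 1 AND p.type = ANY($1::text[])
// filtered.params: [["person", "company"]]
```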

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements ENG-D1's locked spec: score = log(1 + raw_match) ×
max(0.1, exp(-days/180)) × (0.5 + 0.5 × salience). raw_match comes
from hybridSearch's RRF + source-boost-adjusted score; salience and
recency boosts in hybridSearch are intentionally disabled so the
formula applies on a clean signal.

rankCandidates() is the pure function the eval grades against;
findExperts() is the public entrypoint that wires hybrid search +
batch salience/effective_date fetches; runWhoknows() is the CLI.

test/whoknows.test.ts covers the 10 ENG-D3 cases (zero results,
negative recency floor, NaN salience neutral default, NaN match
zeros gracefully, type preservation, --explain factor breakdown,
top-K limit clamping, recency-floor extreme-days safety, alphabetical
tie-break determinism, public-surface contract). Plus four sanity
asserts (higher-match outranks, more-recent outranks, higher-salience
outranks, all-zero candidate appears with score 0). Plus one factor
decomposition assertion that pins the exact formula numerically.
Plus a composite-key safety case (Codex F1).

22 expect calls across 16 tests. All passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires both surfaces per ENG-D5: MCP op = find_experts (matches
find_anomalies naming convention; agent-facing); CLI command =
gbrain whoknows (memorable, user-facing). One findExperts() core
function backs both paths.

The op is scope:'read', localOnly:false — accessible over HTTP MCP
to read-scoped OAuth clients like the salience/anomalies family.
Op handler validates non-empty topic and dispatches to the same
findExperts() pure function the CLI uses.

CLI dispatch in src/cli.ts:case 'whoknows' calls runWhoknows; thin-
client routing happens inside runWhoknows via isThinClient(cfg) —
remote MCP installs route through the v0.31.1 routing seam to
callRemoteTool('find_experts', ...).

FIND_EXPERTS_DESCRIPTION in operations-descriptions.ts mirrors the
v0.29 redirect-hint style: leads with what the tool does, lists
explicit user-intent triggers ("who should I talk to about X",
"who knows about Y"), notes the type-filter behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the locked spec: Layer 1 hand-labeled fixture (>=80% top-3
hit rate) is the primary ship-blocking gate; Layer 2 eval_candidates
replay (>=0.4 mean set-Jaccard@3) is the regression gate that
auto-skips when < 20 replay-eligible rows exist (CONTRIBUTOR_MODE
sparseness fallback).
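The two gate metrics, sketched in TypeScript; the names mirror the unit tests (jaccardAtK, topKHit) but the bodies are reconstructions from the description above, not the shipped implementations, and the empty-vs-empty result is an assumption.

```typescript
// Reconstruction of the gate metrics; not the shipped eval-whoknows code.
function jaccardAtK(actual: string[], expected: string[], k: number): number {
  const a = new Set(actual.slice(0, k));
  const b = new Set(expected.slice(0, k));
  if (a.size === 0 && b.size === 0) return 1; // "empty-empty vacuous-stable" case; 1 is assumed
  const inter = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return inter / union;
}

function topKHit(expected: string[], actual: string[], k: number): boolean {
  const top = new Set(actual.slice(0, k));
  return expected.some((slug) => top.has(slug));
}
```

Layer 1 passes when the mean topKHit at k=3 across the fixture is >= 0.8; Layer 2 passes when the mean jaccardAtK at k=3 over replay-eligible rows is >= 0.4.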

Dispatch lands as `gbrain eval whoknows <fixture.jsonl>` sub-subcommand
in src/commands/eval.ts (mirrors v0.25.0 export/prune/replay and
v0.27.x cross-modal pattern). Exits 0/1/2 for pass/fail/usage so CI
gates can consume.

JSON output (--json) ships schema_version: 1 for stable consumer
contract (mirrors v0.25.0 eval-replay.ts). Human output groups by
layer + emits a per-miss diagnostic table so failures are
self-debugging.

Unit tests pin:
- jaccardAtK math (7 cases — identical, disjoint, partial, k cutoff,
  empty-empty vacuous-stable, empty-vs-non-empty, Set dedup)
- topKHit (7 cases — position 1, 3, 4, miss, multi-expected, empty
  actual, empty expected)
- readFixture (6 cases — well-formed, comments/blanks, missing file,
  malformed JSON, missing required fields, non-string filter)
- Locked thresholds (HIT_RATE=0.8, REGRESSION=0.4, MIN_REPLAY_ROWS=20)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CEO-D7 (substrate-conditional v0.33 doctor check, but the
fixture-presence sub-check ships in week 1 regardless — it's the
"did you do the assignment?" signal). When the eval fixture is
missing, empty, or undersized (< 5 rows), doctor warns with the
exact path the user should populate.

The check is intentionally lightweight: it does NOT run the eval
itself or measure hit-rate regression. That's the job of `gbrain
eval whoknows`, called from CI/ship time. This check is the cheap
always-runs signal that surfaces in `gbrain doctor` and on the
ship review dashboard.

5 unit cases pin the four-status behavior (missing/empty/undersized/
ok) plus the comment-and-blank-line filtering so users can comment
out queries during iteration without breaking the row count.
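A hypothetical sketch of that four-status classification; the function name and the comment prefix are guesses, while the < 5 row threshold and the blank/comment filtering come from the description above.

```typescript
// Hypothetical sketch of the whoknows_health status decision.
type FixtureStatus = "missing" | "empty" | "undersized" | "ok";

function classifyFixture(exists: boolean, lines: string[]): FixtureStatus {
  if (!exists) return "missing";
  // Blank lines and commented-out queries don't count toward the row budget.
  const rows = lines.filter((l) => l.trim() !== "" && !l.trim().startsWith("#"));
  if (rows.length === 0) return "empty";
  if (rows.length < 5) return "undersized"; // < 5 rows warns, per the check's spec
  return "ok";
}
```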

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test/fixtures/whoknows-eval.jsonl ships as a 10-query placeholder
demonstrating the schema. Comments document the assignment for end
users: they replace these with their own real queries before
shipping their gbrain install. The placeholder uses obviously-
example slugs (wiki/people/example-alice, etc.) so nobody mistakes
it for production data.
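An assumed row shape for the fixture, reverse-read from the readFixture test cases; the field names are a guess, not a documented schema, so check the shipped fixture for the real keys.

```typescript
// Assumed shape of a whoknows-eval.jsonl row (field names hypothetical).
interface WhoknowsEvalRow {
  query: string;      // the expertise question, e.g. "lab automation"
  expected: string[]; // person/company page slugs, e.g. ["wiki/people/example-alice"]
  filter?: string;    // optional; readFixture rejects non-string values
}
```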

test/e2e/whoknows.test.ts seeds a synthetic PGLite brain that
matches the placeholder fixture, then runs findExperts on every
fixture query and asserts >=80% top-3 hit rate per ENG-D2 quality
gate. Also exercises the typeFilter (concept-decoy pages filtered
out), empty-result graceful return, --explain factor breakdown, and
top-K limit honoring.

Basis-vector embeddings (no API key) follow the existing pattern from
test/e2e/search-quality.test.ts.

5 test cases, 23 expect calls, all passing against PGLite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps VERSION 0.31.11 → 0.33.0 and package.json to match. CHANGELOG
entry leads with the headline use ("ask gbrain who knows about X")
and the locked ENG-D1 ranking formula. "Numbers that matter" replaced
with a "what ships on which eval outcome" table — honest about the
eval-gated trajectory rather than fabricating benchmarks before the
release has been graded against a real brain.

CLAUDE.md Key Files annotations added for src/commands/whoknows.ts,
src/commands/eval-whoknows.ts, and test/fixtures/whoknows-eval.jsonl.
src/core/search/hybrid.ts entry extended with the new types parameter
documentation (push the type filter to SQL, no post-filter waste,
AND-applies alongside the existing single-value type field).

A bun run build:llms chaser regenerated llms.txt + llms-full.txt to
match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Two new files filling the gaps Garry called out:

test/search-types-filter.test.ts — engine-level coverage on PGLite for
the new SearchOpts.types filter. Asserts the SQL-clause behavior
directly so a regression in the AND p.type = ANY(...) emission gets
caught here with a tight assertion rather than as part of a longer
findExperts pipeline. 9 cases across searchKeyword + searchVector +
chunk-grain documentation. Documents the pre-existing PGLite parity
gap (single-value `type` field is Postgres-only; `types` is the v0.33
multi-type filter that BOTH engines honor).

test/find-experts-op.test.ts — MCP-op contract test for find_experts.
Pins:
- Registered in the operations array + operationsByName
- scope: 'read', localOnly false (HTTP-MCP accessible per ENG-D5)
- Documented params (topic / limit / explain) with correct types
- cliHints.name === 'whoknows' (CLI surface bridge)
- Non-trivial description that references the use case
- Handler rejects empty / whitespace / missing topic with invalid_params
- Handler returns array shape on valid topic
- Handler honors limit param

11 op-contract cases + 9 engine-clause cases. All passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry asked for v0.33.1 instead of v0.33.0 (queue collision with
unrelated 0.33.0 work). 4-digit format: 0.33.1.0. CHANGELOG header
and "To take advantage of" block updated. llms.txt regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opic>

Without `cliHints.positional: ['topic']`, the op-dispatch path in
src/cli.ts couldn't parse `gbrain whoknows "ai agents"` and threw
`invalid_params: topic is required`. Found while testing the v0.33.1.0
build against a real brain. The op handler validates topic; the CLI
just needed to know the positional shape so the dispatcher could
hand it through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Replaces the synthetic 10-row placeholder with 10 real expertise-routing
queries mined from Garry's actual brain via thin-client connection to
Wintermute (v0.32.2). Source: reference/vc-intro-network ("Who Takes
Intros from Garry") + adjacent routing context. All 15 unique expected
person slugs verified against ~/git/brain/people/<slug>.md source
markdown:

  people/amit-kumar          Accel partner, 102 YC deals
  people/diana-hu            YC GP
  people/elad-gil            Angel, top-rated
  people/eric-vishria        Benchmark, healthtech
  people/gokul-rajaram       Angel, 57 YC deals
  people/joff-redfern        Menlo Ventures, ex-CPO Atlassian
  people/jon-xu              YC GP
  people/kristina-shen       Chemistry, healthtech
  people/lachy-groom         Angel, 43 YC deals
  people/lee-edwards         Quiet Capital, 52 YC deals
  people/nick-shalek         Ribbit Capital, fintech
  people/nina-achadian       Index Ventures, 69 YC deals (note: slug
                              uses 'achadian' not 'achadjian')
  people/parul-singh         645 Ventures
  people/rebecca-kaden       USV
  people/trae-stephens       Founders Fund, defense/deep-tech

Eval cannot run yet against Wintermute thin-client: server is v0.32.2,
find_experts MCP op was added in v0.33. Once Wintermute upgrades the
eval will run end-to-end via the v0.31.1 thin-client routing seam.
Local eval works once the brain is indexed with find_experts available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`gbrain eval whoknows` now works against a thin-client install. When
isThinClient(cfg), each fixture query routes through the remote
find_experts MCP op via callRemoteTool — same v0.31.1 routing seam
runWhoknows already uses. Local mode unchanged: findExperts(engine, ...)
called directly.

Server prerequisite: the brain must be v0.33+ for find_experts to be
registered. Wintermute (currently v0.32.2) gets it on next upgrade and
then the eval runs end-to-end with zero client-side changes.

Mechanics:
- `WhoknowsFn` callable abstraction so the gates are impl-agnostic
- runEvalWhoknows(engine: BrainEngine | null, args) — null engine
  allowed in thin-client mode
- Regression gate auto-skips in thin-client mode (no DB access to
  eval_candidates; quality gate alone gates ship)
- cli.ts adds a thin-client bypass before connectEngine for
  `gbrain eval whoknows`, matching the longmemeval/cross-modal no-DB
  pattern

E2E test updated to use an inline synthetic fixture (the shipped
fixture is now real-brain data and no longer matches the seeded test
brain). A separate case sanity-checks that the shipped fixture parses
cleanly.

Tests: 25 unit cases (+2 for null-engine signature contract) + 6 E2E
cases. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
#	src/commands/eval.ts
#	src/core/operations.ts
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json