
v0.33.1.0 feat: eval-gated whoknows — expertise + relationship-proximity routing#881

Open
garrytan wants to merge 16 commits into master from garrytan/vilnius-v1

Conversation

@garrytan
Owner

Summary

v0.33 wedge release: gbrain whoknows <topic> (CLI) + the find_experts MCP op route expertise queries against person/company pages in your brain. Eval-gated by design — naive ranking ships first; substrate (community detection, formal relationship_score table) is queued for v0.34, contingent on what your real-brain eval shows.

Headline capability: Ask the question you actually ask — gbrain whoknows "lab automation" returns top-5 person/company candidates ranked by expertise depth (sub-linear match), recency (6-month half-life, floored at 0.1), and salience (centered at 0.5). --explain flag shows the factor breakdown so you can audit the ranking.
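For orientation, here is a minimal sketch of the locked ENG-D1 scoring formula (the exact expression is quoted in the commit notes further down); the function and field names are illustrative, not the shipped exports of src/commands/whoknows.ts.

```typescript
// Minimal sketch of the ENG-D1 ranking formula; names are illustrative.
interface CandidateSignals {
  rawMatch: number; // hybridSearch RRF + source-boost-adjusted score
  daysOld: number;  // days since the page's effective date
  salience: number; // page salience in [0, 1]
}

function scoreCandidate({ rawMatch, daysOld, salience }: CandidateSignals): number {
  const depth = Math.log(1 + rawMatch);                    // sub-linear expertise depth
  const recency = Math.max(0.1, Math.exp(-daysOld / 180)); // 180-day decay ("6-month half-life"), floored at 0.1
  const weight = 0.5 + 0.5 * salience;                     // salience centered at 0.5
  return depth * recency * weight;
}
```

The --explain breakdown reported by the CLI corresponds to these three factors, so a ranking can be audited factor by factor.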

Plan provenance: Plan went through /office-hours/plan-ceo-review (SCOPE_EXPANSION mode, 4 cherry-picks added) → Codex outside-voice review (8 findings, 5 strategic re-litigations + 3 substrate defects accepted) → /plan-eng-review (6 implementation specs locked). Net result: scope reduced ~75% from the cathedral version while shipping the actual wedge.

Commits (9, bisect-friendly atomic chunks):

searchHybrid extension:

  • caa5e041 — Add SearchOpts.types: PageType[] multi-type filter to searchHybrid. Pushes type filter to SQL via AND p.type = ANY($N::text[]) in both engines (PGLite + Postgres) across searchKeyword, searchVector, searchKeywordChunks.

Whoknows core:

  • e505bacc — src/commands/whoknows.ts with the locked ENG-D1 ranking formula + 16 unit tests covering 10 ENG-D3 shadow paths + composite-key (Codex F1) + numerical pin.
  • 994513dc — find_experts MCP op registered (scope: read, localOnly: false) + gbrain whoknows CLI dispatch + thin-client routing (v0.31.1 pattern).

Eval gate:

  • 79195e06 — gbrain eval whoknows two-layer gate (ENG-D2): hand-labeled quality (>=80% top-3) + eval_candidates replay regression (>=0.4 Jaccard@3) with sparseness fallback. 23 unit cases pin the math.

Doctor + fixture + E2E:

  • 3fde98a9 — whoknows_health doctor check (fixture presence + row count). 5 unit cases.
  • 14a0b023 — synthetic 10-query fixture at test/fixtures/whoknows-eval.jsonl + E2E test against seeded PGLite brain. Asserts >=80% top-3 hit rate against the synthetic fixture.

Docs:

  • 133d1d38 — CHANGELOG release-summary + CLAUDE.md Key Files annotations for new paths + llms.txt regen.

Gap-fill tests:

  • ed2f3837 — Engine-level typeFilter coverage (9 cases) + find_experts op contract tests (11 cases).

Version:

  • a425022f — v0.33.1.0 bump (queue-collision with v0.33.0).

Merge commit 9ac52ced from master picked up v0.32.0 (5 new embedding recipes + discoverability pass).

Test Coverage

All new code paths covered:

  • test/whoknows.test.ts — 16 cases pinning the locked ranking formula + 10 ENG-D3 shadow paths + composite-key + factor decomposition.
  • test/eval-whoknows.test.ts — 23 cases on jaccardAtK, topKHit, fixture parsing, threshold constants.
  • test/whoknows-doctor.test.ts — 5 cases on missing/empty/undersized/ok fixture states.
  • test/search-types-filter.test.ts — 9 engine-level cases for SearchOpts.types SQL emission.
  • test/find-experts-op.test.ts — 11 cases on MCP op contract (scope, params, handler, empty-topic rejection).
  • test/e2e/whoknows.test.ts — 5 E2E cases against seeded PGLite brain: quality gate hit rate, typeFilter excludes concept decoys, empty-result safety, --explain shape, limit honoring.

Tests: 69 new across 6 files. All passing.

Pre-Landing Review

Eng review CLEAR (logged from prior turn). 6 issues found, all locked into plan via ENG-D1..D6. 0 critical gaps. No new findings from this version-bump commit.

Eval Results

No prompt-related files changed — evals skipped.

Plan Completion

Plan at ~/.claude/plans/system-instruction-you-are-working-quirky-papert.md. All week-1 items DONE:

  • 20-query eval set: ships as 10-query synthetic placeholder; end users replace with their real queries before grading their brain.
  • gbrain whoknows CLI + find_experts MCP op + --explain flag: shipped.
  • gbrain eval whoknows two-layer gate: shipped.
  • whoknows_health doctor check: shipped.
  • searchHybrid typeFilter: shipped.
  • 10-case unit test list: shipped (16 cases — exceeds the ENG-D3 lock).
  • Person-person projection spec (ENG-D4): documented in plan for v0.34 substrate work; not shipped this release per ENG-D11 (substrate eval-gated).

Substrate work (community detection, formal relationships table) deferred to v0.34 contingent on real-brain eval showing naive whoknows fails the gate.

Verification Results

E2E suite passes against real Postgres (5435 container): test/e2e/whoknows.test.ts, search-quality.test.ts, search-swamp.test.ts, search-exclude.test.ts, graph-quality.test.ts, engine-parity.test.ts, schema-drift.test.ts, integrity-batch.test.ts, postgres-engine.test.ts — all clean.

Unit suite: 5502 pass / 1 fail. The single failure is eval-longmemeval > warm-create speed gate — a pre-existing environmental perf flake under parallel-shard contention on this dev machine. Verified pre-existing: same failure reproduces on origin/master at v0.32.0 under the same load. Passes in isolation (28ms p50, well under 500ms gate). Not a regression from this PR.

TODOS

No TODO items completed in this PR. New v0.34 candidates documented in the plan file:

  • Formal relationships table (composite-keyed) — substrate eval-gated
  • page_communities table + Jaccard-stable community alignment — substrate eval-gated
  • Louvain via graphology-communities-louvain — substrate eval-gated
  • gbrain prep / gbrain stale as OpenClaw skills (per Codex F8 layer separation)
  • Proactive nudges, intro suggestions, conversation continuity

Test plan

  • All unit tests pass (5502 / 5503 — see Verification Results re: the 1 pre-existing flake)
  • All new whoknows + eval-whoknows + doctor + types-filter + find_experts unit tests pass (69 / 69)
  • All E2E tests pass on real Postgres (gbrain-test-pg container)
  • Typecheck clean (bunx tsc --noEmit --skipLibCheck)
  • User runs gbrain eval whoknows test/fixtures/whoknows-eval.jsonl against their own brain after replacing placeholder queries with real ones

🤖 Generated with Claude Code

garrytan and others added 16 commits May 10, 2026 21:45
Push the page-type filter into SQL via AND p.type = ANY($N::text[]) in
both engines' searchKeyword + searchVector + searchKeywordChunks paths.
Primary consumer is the upcoming gbrain whoknows command (filters to
['person','company']); the limit budget then goes to typed candidates
instead of being eaten by note/transcript/article pages. Future
entity-only search in v0.34+ reuses the parameter for free.

AND-applies alongside the existing single-value type filter (callers can
use either or both). HybridSearchOpts threads opts.types into the
underlying searchOpts so hybridSearch callers get the SQL-level filter
without any post-filter waste.
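A rough sketch of how that pushdown composes, assuming a parameterized query-building style; the helper name, table alias, and base query here are placeholders, not the engine's actual code.

```typescript
// Sketch only: append a multi-type filter to a parameterized WHERE clause.
function appendTypeFilter(
  sql: string,
  params: unknown[],
  types?: string[],
): { sql: string; params: unknown[] } {
  if (!types || types.length === 0) return { sql, params };
  const next = [...params, types];
  // $N is the next positional parameter; ::text[] keeps the array cast explicit.
  return { sql: `${sql} AND p.type = ANY($${next.length}::text[])`, params: next };
}

// whoknows-style usage: restrict candidates to person/company pages.
const filtered = appendTypeFilter(
  "SELECT p.* FROM pages p WHERE 1 = 1",
  [],
  ["person", "company"],
);
// filtered.sql: ... WHERE 1 = 1 AND p.type = ANY($1::text[])
// filtered.params: [["person", "company"]]
```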

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements ENG-D1's locked spec: score = log(1 + raw_match) ×
max(0.1, exp(-days/180)) × (0.5 + 0.5 × salience). raw_match comes
from hybridSearch's RRF + source-boost-adjusted score; salience and
recency boosts in hybridSearch are intentionally disabled so the
formula applies on a clean signal.

rankCandidates() is the pure function the eval grades against;
findExperts() is the public entrypoint that wires hybrid search +
batch salience/effective_date fetches; runWhoknows() is the CLI.

test/whoknows.test.ts covers the 10 ENG-D3 cases (zero results,
negative recency floor, NaN salience neutral default, NaN match
zeros gracefully, type preservation, --explain factor breakdown,
top-K limit clamping, recency-floor extreme-days safety, alphabetical
tie-break determinism, public-surface contract). Plus four sanity
asserts (higher-match outranks, more-recent outranks, higher-salience
outranks, all-zero candidate appears with score 0). Plus one factor
decomposition assertion that pins the exact formula numerically.
Plus a composite-key safety case (Codex F1).

22 expect calls across 16 tests. All passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires both surfaces per ENG-D5: MCP op = find_experts (matches
find_anomalies naming convention; agent-facing); CLI command =
gbrain whoknows (memorable, user-facing). One findExperts() core
function backs both paths.

The op is scope:'read', localOnly:false — accessible over HTTP MCP
to read-scoped OAuth clients like the salience/anomalies family.
Op handler validates non-empty topic and dispatches to the same
findExperts() pure function the CLI uses.

CLI dispatch in src/cli.ts:case 'whoknows' calls runWhoknows; thin-
client routing happens inside runWhoknows via isThinClient(cfg) —
remote MCP installs route through the v0.31.1 routing seam to
callRemoteTool('find_experts', ...).

FIND_EXPERTS_DESCRIPTION in operations-descriptions.ts mirrors the
v0.29 redirect-hint style: leads with what the tool does, lists
explicit user-intent triggers ("who should I talk to about X",
"who knows about Y"), notes the type-filter behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the locked spec: Layer 1 hand-labeled fixture (>=80% top-3
hit rate) is the primary ship-blocking gate; Layer 2 eval_candidates
replay (>=0.4 mean set-Jaccard@3) is the regression gate that
auto-skips when < 20 replay-eligible rows exist (CONTRIBUTOR_MODE
sparseness fallback).
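The two gate metrics, sketched in TypeScript; the names mirror the unit tests (jaccardAtK, topKHit) but the bodies are reconstructions from the description above, not the shipped implementations, and the empty-vs-empty result is an assumption.

```typescript
// Reconstruction of the gate metrics; not the shipped eval-whoknows code.
function jaccardAtK(actual: string[], expected: string[], k: number): number {
  const a = new Set(actual.slice(0, k));
  const b = new Set(expected.slice(0, k));
  if (a.size === 0 && b.size === 0) return 1; // "empty-empty vacuous-stable" case; 1 is assumed
  const inter = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return inter / union;
}

function topKHit(expected: string[], actual: string[], k: number): boolean {
  const top = new Set(actual.slice(0, k));
  return expected.some((slug) => top.has(slug));
}
```

Layer 1 passes when the mean topKHit at k=3 across the fixture is >= 0.8; Layer 2 passes when the mean jaccardAtK at k=3 over replay-eligible rows is >= 0.4.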

Dispatch lands as `gbrain eval whoknows <fixture.jsonl>` sub-subcommand
in src/commands/eval.ts (mirrors v0.25.0 export/prune/replay and
v0.27.x cross-modal pattern). Exits 0/1/2 for pass/fail/usage so CI
gates can consume.

JSON output (--json) ships schema_version: 1 for stable consumer
contract (mirrors v0.25.0 eval-replay.ts). Human output groups by
layer + emits a per-miss diagnostic table so failures are
self-debugging.

Unit tests pin:
- jaccardAtK math (7 cases — identical, disjoint, partial, k cutoff,
  empty-empty vacuous-stable, empty-vs-non-empty, Set dedup)
- topKHit (7 cases — position 1, 3, 4, miss, multi-expected, empty
  actual, empty expected)
- readFixture (6 cases — well-formed, comments/blanks, missing file,
  malformed JSON, missing required fields, non-string filter)
- Locked thresholds (HIT_RATE=0.8, REGRESSION=0.4, MIN_REPLAY_ROWS=20)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CEO-D7 (substrate-conditional v0.33 doctor check, but the
fixture-presence sub-check ships in week 1 regardless — it's the
"did you do the assignment?" signal). When the eval fixture is
missing, empty, or undersized (< 5 rows), doctor warns with the
exact path the user should populate.

The check is intentionally lightweight: it does NOT run the eval
itself or measure hit-rate regression. That's the job of `gbrain
eval whoknows`, called from CI/ship time. This check is the cheap
always-runs signal that surfaces in `gbrain doctor` and on the
ship review dashboard.

5 unit cases pin the four-status behavior (missing/empty/undersized/
ok) plus the comment-and-blank-line filtering so users can comment
out queries during iteration without breaking the row count.
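A hypothetical sketch of that four-status classification; the function name and the comment prefix are guesses, while the < 5 row threshold and the blank/comment filtering come from the description above.

```typescript
// Hypothetical sketch of the whoknows_health status decision.
type FixtureStatus = "missing" | "empty" | "undersized" | "ok";

function classifyFixture(exists: boolean, lines: string[]): FixtureStatus {
  if (!exists) return "missing";
  // Blank lines and commented-out queries don't count toward the row budget.
  const rows = lines.filter((l) => l.trim() !== "" && !l.trim().startsWith("#"));
  if (rows.length === 0) return "empty";
  if (rows.length < 5) return "undersized"; // < 5 rows warns, per the check's spec
  return "ok";
}
```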

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test/fixtures/whoknows-eval.jsonl ships as a 10-query placeholder
demonstrating the schema. Comments document the assignment for end
users: they replace these with their own real queries before
shipping their gbrain install. The placeholder uses obviously-
example slugs (wiki/people/example-alice, etc.) so nobody mistakes
it for production data.
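An assumed row shape for the fixture, reverse-read from the readFixture test cases; the field names are a guess, not a documented schema, so check the shipped fixture for the real keys.

```typescript
// Assumed shape of a whoknows-eval.jsonl row (field names hypothetical).
interface WhoknowsEvalRow {
  query: string;      // the expertise question, e.g. "lab automation"
  expected: string[]; // person/company page slugs, e.g. ["wiki/people/example-alice"]
  filter?: string;    // optional; readFixture rejects non-string values
}
```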

test/e2e/whoknows.test.ts seeds a synthetic PGLite brain that
matches the placeholder fixture, then runs findExperts on every
fixture query and asserts >=80% top-3 hit rate per ENG-D2 quality
gate. Also exercises the typeFilter (concept-decoy pages filtered
out), empty-result graceful return, --explain factor breakdown, and
top-K limit honoring.

Basis-vector embeddings (no API key) follow the existing pattern from
test/e2e/search-quality.test.ts.

5 test cases, 23 expect calls, all passing against PGLite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps VERSION 0.31.11 → 0.33.0 and package.json to match. CHANGELOG
entry leads with the headline use ("ask gbrain who knows about X")
and the locked ENG-D1 ranking formula. "Numbers that matter" replaced
with a "what ships on which eval outcome" table — honest about the
eval-gated trajectory rather than fabricating benchmarks before the
release has been graded against a real brain.

CLAUDE.md Key Files annotations added for src/commands/whoknows.ts,
src/commands/eval-whoknows.ts, and test/fixtures/whoknows-eval.jsonl.
src/core/search/hybrid.ts entry extended with the new types parameter
documentation (push the type filter to SQL, no post-filter waste,
AND-applies alongside the existing single-value type field).

A bun run build:llms chaser regenerated llms.txt + llms-full.txt to
match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Two new files filling the gaps Garry called out:

test/search-types-filter.test.ts — engine-level coverage on PGLite for
the new SearchOpts.types filter. Asserts the SQL-clause behavior
directly so a regression in the AND p.type = ANY(...) emission gets
caught here with a tight assertion rather than as part of a longer
findExperts pipeline. 9 cases across searchKeyword + searchVector +
chunk-grain documentation. Documents the pre-existing PGLite parity
gap (single-value `type` field is Postgres-only; `types` is the v0.33
multi-type filter that BOTH engines honor).

test/find-experts-op.test.ts — MCP-op contract test for find_experts.
Pins:
- Registered in the operations array + operationsByName
- scope: 'read', localOnly false (HTTP-MCP accessible per ENG-D5)
- Documented params (topic / limit / explain) with correct types
- cliHints.name === 'whoknows' (CLI surface bridge)
- Non-trivial description that references the use case
- Handler rejects empty / whitespace / missing topic with invalid_params
- Handler returns array shape on valid topic
- Handler honors limit param

11 op-contract cases + 9 engine-clause cases. All passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry asked for v0.33.1 instead of v0.33.0 (queue collision with
unrelated 0.33.0 work). 4-digit format: 0.33.1.0. CHANGELOG header
and "To take advantage of" block updated. llms.txt regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opic>

Without `cliHints.positional: ['topic']`, the op-dispatch path in
src/cli.ts couldn't parse `gbrain whoknows "ai agents"` and threw
`invalid_params: topic is required`. Found while testing the v0.33.1.0
build against a real brain. The op handler validates topic; the CLI
just needed to know the positional shape so the dispatcher could
hand it through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Replaces the synthetic 10-row placeholder with 10 real expertise-routing
queries mined from Garry's actual brain via thin-client connection to
Wintermute (v0.32.2). Source: reference/vc-intro-network ("Who Takes
Intros from Garry") + adjacent routing context. All 15 unique expected
person slugs verified against ~/git/brain/people/<slug>.md source
markdown:

  people/amit-kumar          Accel partner, 102 YC deals
  people/diana-hu            YC GP
  people/elad-gil            Angel, top-rated
  people/eric-vishria        Benchmark, healthtech
  people/gokul-rajaram       Angel, 57 YC deals
  people/joff-redfern        Menlo Ventures, ex-CPO Atlassian
  people/jon-xu              YC GP
  people/kristina-shen       Chemistry, healthtech
  people/lachy-groom         Angel, 43 YC deals
  people/lee-edwards         Quiet Capital, 52 YC deals
  people/nick-shalek         Ribbit Capital, fintech
  people/nina-achadian       Index Ventures, 69 YC deals (note: slug
                              uses 'achadian' not 'achadjian')
  people/parul-singh         645 Ventures
  people/rebecca-kaden       USV
  people/trae-stephens       Founders Fund, defense/deep-tech

Eval cannot run yet against Wintermute thin-client: server is v0.32.2,
find_experts MCP op was added in v0.33. Once Wintermute upgrades the
eval will run end-to-end via the v0.31.1 thin-client routing seam.
Local eval works once the brain is indexed with find_experts available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`gbrain eval whoknows` now works against a thin-client install. When
isThinClient(cfg), each fixture query routes through the remote
find_experts MCP op via callRemoteTool — same v0.31.1 routing seam
runWhoknows already uses. Local mode unchanged: findExperts(engine, ...)
called directly.

Server prerequisite: the brain must be v0.33+ for find_experts to be
registered. Wintermute (currently v0.32.2) gets it on next upgrade and
then the eval runs end-to-end with zero client-side changes.

Mechanics:
- `WhoknowsFn` callable abstraction so the gates are impl-agnostic
- runEvalWhoknows(engine: BrainEngine | null, args) — null engine
  allowed in thin-client mode
- Regression gate auto-skips in thin-client mode (no DB access to
  eval_candidates; quality gate alone gates ship)
- cli.ts adds a thin-client bypass before connectEngine for
  `gbrain eval whoknows`, matching the longmemeval/cross-modal no-DB
  pattern

E2E test updated to use an inline synthetic fixture (the shipped
fixture is now real-brain data and no longer matches the seeded test
brain). A separate case sanity-checks that the shipped fixture parses
cleanly.

Tests: 25 unit cases (+2 for null-engine signature contract) + 6 E2E
cases. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
#	src/commands/eval.ts
#	src/core/operations.ts
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json