diff --git a/CHANGELOG.md b/CHANGELOG.md
index 749f848cf..96af47169 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,117 @@
 All notable changes to GBrain will be documented in this file.
 
+## [0.33.1.0] - 2026-05-10
+
+**Ask gbrain who in your network knows about a topic, and get a ranked answer with the reasoning shown.**
+
+The new `gbrain whoknows <topic>` command (CLI + `find_experts` MCP op) routes expertise + relationship-proximity queries against person and company pages in your brain. Returns top-5 by default. `--explain` dumps the per-result factor breakdown so you can see why the ranking landed where it did. The release ships the wedge query without committing to a new substrate; community detection and a formal relationship_score table are deferred until the eval set proves they're earned, not because they sound good in a CHANGELOG.
+
+### What you can now do
+
+**Ask the question you actually ask.** `gbrain whoknows "lab automation"` returns the top-5 people or companies in your brain that know about lab automation, ranked by expertise depth (sub-linear chunk-match), relationship recency (180-day exponential decay), and salience. Filters at the SQL level to person/company pages only — note pages and articles drop out without you asking. Mirrors the v0.29 `salience` / `anomalies` shape: CLI + MCP op + thin-client routing all on day one.
+
+**See the math.** `gbrain whoknows "fintech compliance" --explain` adds a one-line factor breakdown per result. You see `expertise=0.405 (raw=0.500) recency=0.846 (60d) salience=0.300 → factor=0.650`. Trust through transparency, not opacity. The MCP op accepts the same flag; agents can return the breakdown to the user verbatim.
+
+**Get a SQL-level type filter for free.** The new `SearchOpts.types: PageType[]` parameter on `searchHybrid` (and the underlying `searchKeyword` + `searchVector` in both engines) pushes the page-type filter into SQL via `AND p.type = ANY($N::text[])`. The limit budget goes to candidate-typed pages instead of being eaten by transcripts and articles. Future entity-only search reuses the parameter without touching this code.
+
+**Grade the headline against a two-layer eval gate.** `gbrain eval whoknows test/fixtures/whoknows-eval.jsonl` runs the locked ENG-D2 two-layer gate: Layer 1, the primary gate, passes when the hand-labeled fixture hits a top-3 hit rate ≥ 80%; Layer 2, the regression gate, passes when the `eval_candidates` replay reaches a mean set-Jaccard@3 ≥ 0.4. Layer 2 auto-skips with a stderr warning if `eval_candidates` has fewer than 20 replay-eligible captured rows — the sparseness fallback lets users without `GBRAIN_CONTRIBUTOR_MODE=1` history still ship.
+
+**See if you did the assignment.** `gbrain doctor` adds a `whoknows_health` check that warns when `test/fixtures/whoknows-eval.jsonl` is missing, empty, or undersized (< 5 rows). The check is intentionally narrow: it does NOT measure hit-rate regression (that's the eval command's job). It surfaces "you haven't written your fixture yet" — the single highest-leverage signal in the doctor sweep.
+
+### The locked ranking spec (ENG-D1)
+
+```
+score = log(1 + raw_match)             // expertise (sub-linear)
+      × max(0.1, exp(-days/180))       // recency (180-day decay, floored at 0.1)
+      × (0.5 + 0.5 × clamp(salience))  // salience (centered at 0.5)
+```
+
+Floors prevent multiplicative-zero edge cases (cold-start people without an `effective_date` get `recency_factor = 0.1` — visible, not zeroed). NaN-prone inputs (negative recency, missing salience, undefined match score) all still yield `Number.isFinite(score) === true`. Same-score ties break alphabetically by slug for determinism. 16 unit tests in `test/whoknows.test.ts` pin the math.
+
+### Eval-gated trajectory
+
+| Outcome at end of week 1 | What ships in v0.33 |
+|---|---|
+| Naive whoknows ≥ 80% on hand-labeled + ≥ 0.4 Jaccard on replay | Clean release: command family + eval gate + doctor check. Substrate (community detection, formal `relationships` table) queues to v0.34 contingent on demand. |
+| Naive whoknows fails the eval | v0.34 picks up substrate work (composite-keyed `relationships` table + person-person projection from `attended` links + Jaccard-stable community alignment via graphology Louvain + Haiku-named clusters). The eval told us substrate was earned. |
+
+### What this means for your workflow
+
+If you've been muscle-memorying the search bar to find "who in my network knows about X" — that workflow becomes `gbrain whoknows`. The `--explain` flag means you stop wondering why result #2 landed at #2; you can see the recency or salience that put it there. The MCP op makes the same query agent-composable: an agent asks `find_experts` for routing candidates and brings them to the conversation.
+
+The release is eval-gated by design (per /office-hours, /plan-ceo-review, and Codex outside-voice). If the naive ranking passes your real-brain eval, you didn't need the cathedral substrate after all. If it fails, v0.34 builds it — measured, not speculated.
+
+## To take advantage of v0.33.1
+
+`gbrain upgrade` should do this automatically. Then run the eval gate against your real brain:
+
+1. **Write your eval fixture** at `test/fixtures/whoknows-eval.jsonl`:
+   ```bash
+   # 10 queries you'd actually ask, with hand-labeled expected slugs:
+   # {"query":"lab automation","expected_top_3_slugs":["wiki/people/your-expert"],"notes":"..."}
+   ```
+   The shipped placeholder uses obviously-example slugs (`wiki/people/example-alice`) so you won't mistake it for real grading.
+
+2. **Run the gate:**
+   ```bash
+   gbrain eval whoknows test/fixtures/whoknows-eval.jsonl
+   ```
+   Pass = ≥ 80% top-3 hit rate. Layer 2 (`eval_candidates` replay) auto-engages if you have ≥ 20 captured queries from `GBRAIN_CONTRIBUTOR_MODE=1` history; otherwise it skips with a warning.
+
+3. **Ask the brain:**
+   ```bash
+   gbrain whoknows "lab automation"
+   gbrain whoknows "fintech compliance" --explain
+   gbrain whoknows "ai agents" --limit 10 --json
+   ```
+
+4. **From an agent (MCP):**
+   ```json
+   {"tool": "find_experts", "params": {"topic": "lab automation", "limit": 5, "explain": true}}
+   ```
+   The op is `scope: 'read'`, accessible to any client with the read OAuth scope.
+
+5. **If `gbrain doctor` warns about `whoknows_health`,** your fixture is missing or undersized. The fix hint points at the exact path.
+
+6. **If any step fails,** please file an issue at https://github.com/garrytan/gbrain/issues with the output of `gbrain doctor --json` and `gbrain eval whoknows test/fixtures/whoknows-eval.jsonl --json`.
+
+### Itemized changes
+
+**New CLI commands:**
+- `gbrain whoknows <topic> [--explain] [--limit N] [--json]` — routes expertise queries to top-K person/company pages.
+- `gbrain eval whoknows <fixture> [--json] [--skip-replay]` — two-layer eval gate (quality fixture + regression replay).
+
+**New MCP op:**
+- `find_experts` (`scope: 'read'`, `localOnly: false`) — backs the same `findExperts()` core that the CLI calls. Mirrors the v0.29 `find_anomalies` naming convention. Accessible to read-scoped OAuth clients on HTTP MCP installs.
+
+**New core files:**
+- `src/commands/whoknows.ts` — pure `rankCandidates()` ranking function (ENG-D1 locked spec), `findExperts()` orchestrator (hybrid search + batch salience/recency fetch + rank), `runWhoknows()` CLI dispatch.
+- `src/commands/eval-whoknows.ts` — two-layer gate orchestrator. `jaccardAtK()` / `topKHit()` / `readFixture()` exported for tests.
+- `test/fixtures/whoknows-eval.jsonl` — 10-row synthetic placeholder.
+
+**searchHybrid extension:**
+- `SearchOpts.types?: PageType[]` — multi-type SQL-level filter, threaded through `searchKeyword` + `searchVector` + `searchKeywordChunks` on both engines. AND-applies alongside the existing single-value `type` filter. No retrieval waste: the limit budget goes to typed candidates.
+
+**Doctor:**
+- `whoknows_health` check warns when the fixture is missing / empty / undersized.
+
+**Tests:**
+- `test/whoknows.test.ts` — 16 cases covering the 10 locked ENG-D3 shadow paths, ranking sanity (higher-match / more-recent / higher-salience outrank), source-id composite-key safety (Codex F1), and a factor-decomposition numerical pin.
+- `test/eval-whoknows.test.ts` — 23 cases on `jaccardAtK`, `topKHit`, fixture parsing, and the locked thresholds.
+- `test/whoknows-doctor.test.ts` — 5 cases on the fixture-presence states.
+- `test/e2e/whoknows.test.ts` — 5 E2E cases against a seeded PGLite brain, asserting the ≥ 80% gate against the synthetic fixture, type-filter exclusion, empty-result safety, `--explain` shape, and limit honoring.
+
+**What we deferred (v0.34+ candidates):**
+- Formal `relationships` table (composite-keyed `(from_slug, from_source_id, to_slug, to_source_id)` per Codex F1) — eval-gated.
+- `page_communities` table + Jaccard-stable community alignment (Codex F4) — eval-gated.
+- Louvain via graphology-communities-louvain (CEO-D6 walked back from native igraph per Codex F5) — eval-gated.
+- `gbrain prep ` and `gbrain stale` — moved to the OpenClaw skills layer per Codex F8 (thin-harness ethos).
+- Proactive nudges, intro suggestions, conversation continuity — v0.34+ as the substrate proves itself.
+
+### Process notes
+
+The plan went through `/office-hours` → `/plan-ceo-review` → Codex outside-voice → `/plan-eng-review`. Each pass changed the shape. Office-hours locked the headline + eval-first principle. CEO review proposed 8 deliverables in SCOPE EXPANSION mode. Codex pushed back on 5 fronts (sequencing, eval methodology, library choice, layer separation, schema design) and was accepted on all 5, plus 3 substrate defects. Eng review locked the ranking formula, the two-layer eval gate, the 10-case test list, and the SQL-level typeFilter.
+Net result: scope reduced ~75% from the cathedral version while shipping the actual wedge users ask for.
+
 ## [0.33.0] - 2026-05-11
 
 **`gbrain recall` now answers "what changed since last time?" in one command, and thin-client installs stop silently lying about empty results.**
 
diff --git a/CLAUDE.md b/CLAUDE.md
index 9b2a6df11..27c327169 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -88,7 +88,7 @@ strict behavior when unset.
 - `docs/eval-bench.md` (v0.25.0) — contributor guide for using captured data to benchmark retrieval changes before merging. Linked from CONTRIBUTING.md under "Running real-world eval benchmarks (touching retrieval code)".
 - `src/core/eval-capture.ts` (v0.25.0) — op-layer capture wrapper called from `src/core/operations.ts` `query` + `search` handlers. Catches MCP + CLI + subagent tool-bridge from one site. Fire-and-forget; failures route to `engine.logEvalCaptureFailure` so `gbrain doctor` sees drops cross-process. **Capture is off by default** — `isEvalCaptureEnabled` resolution: explicit `config.eval.capture` (true/false) wins, else `process.env.GBRAIN_CONTRIBUTOR_MODE === '1'`, else off. Production users get a quiet brain; contributors set `export GBRAIN_CONTRIBUTOR_MODE=1` in `.zshrc` to enable the dev loop. PII scrubber gate is independent and defaults to true regardless of CONTRIBUTOR_MODE.
 - `src/core/eval-capture-scrub.ts` (v0.25.0) — zero-deps PII scrubber: emails, phones, SSN, Luhn-verified credit cards, JWT-shaped tokens, bearer tokens.
-- `src/core/search/hybrid.ts` — Cathedral II `Promise` return shape unchanged in v0.25.0. Adds `onMeta?: (m: HybridSearchMeta) => void` callback so op-layer capture can record what hybridSearch actually did. Existing callers leave it undefined.
+- `src/core/search/hybrid.ts` — Cathedral II `Promise` return shape unchanged in v0.25.0. Adds `onMeta?: (m: HybridSearchMeta) => void` callback so op-layer capture can record what hybridSearch actually did. Existing callers leave it undefined.
**v0.33:** `HybridSearchOpts.types?: PageType[]` (defined on `SearchOpts`) threads a multi-type filter through to per-engine `searchKeyword` + `searchVector` + `searchKeywordChunks`, where it lands as `AND p.type = ANY($N::text[])`. Primary consumer is `gbrain whoknows` (filters to `['person','company']`). AND-applies alongside the existing single-value `type` filter; either or both can be used. - `docs/eval-capture.md` (v0.25.0) — stable NDJSON schema reference for gbrain-evals consumers. - `test/public-exports.test.ts` (v0.25.0 / R2) — runtime contract test. Imports each of the 17 public subpaths via package name and pins a canary symbol per module. Paired with `scripts/check-exports-count.sh`. - `src/core/embedding.ts` — OpenAI text-embedding-3-large, batch, retry, backoff. **v0.28.7:** `BATCH_SIZE` reverted 50→100 — the original Voyage safety guard halved OpenAI throughput on every page. Per-recipe pre-split + recursive halving + adaptive shrink-on-miss now live in the gateway, so the outer paginator goes back to its original purpose: progress-callback granularity, not batch protection. @@ -167,6 +167,9 @@ strict behavior when unset. - `src/commands/orphans.ts` — `gbrain orphans [--json] [--count] [--include-pseudo]`: surfaces pages with zero inbound wikilinks, grouped by domain. Auto-generated/raw/pseudo pages filtered by default. Also exposed as `find_orphans` MCP operation. Shipped in v0.12.3 (contributed by @knee5). - `src/commands/salience.ts` (v0.29) — `gbrain salience [--days N] [--limit N] [--kind PREFIX] [--json]`: pages ranked by emotional + activity salience over a recency window. Mirrors orphans.ts shape (pure data fn + JSON formatter + human formatter). Calls `engine.getRecentSalience(opts)`. Score formula: `(emotional_weight × 5) + ln(1 + active_take_count) + 1/(1 + days_since_update)`. - `src/commands/anomalies.ts` (v0.29) — `gbrain anomalies [--since YYYY-MM-DD] [--lookback-days N] [--sigma N] [--json]`: cohort-level activity outliers. 
Calls `engine.findAnomalies(opts)`. Two cohort kinds in v1: tag, type. Year cohort deferred to v0.30. +- `src/commands/whoknows.ts` (v0.33) — `gbrain whoknows [--explain] [--limit N] [--json]`: expertise + relationship-proximity routing. Mirrors v0.29 salience/anomalies shape (pure `rankCandidates()` + `findExperts()` orchestrator + `runWhoknows()` CLI dispatch + thin-client routing). MCP op = `find_experts` (scope: read, localOnly: false) per ENG-D5. Ranking formula (ENG-D1 locked): `score = log(1 + raw_match) × max(0.1, exp(-days/180)) × (0.5 + 0.5 × salience)` where `raw_match` is hybridSearch's RRF+source-boost score. Filters at SQL via the new `SearchOpts.types: ['person', 'company']` (no post-filter waste). hybridSearch's internal salience+recency boosts are intentionally disabled — the locked formula applies on a clean signal. Floors prevent multiplicative-zero edge cases (cold-start people stay visible); ties break alphabetically by slug for determinism. 16 unit tests in `test/whoknows.test.ts` pin the math. +- `src/commands/eval-whoknows.ts` (v0.33, v0.33.1.3 thin-client wiring) — `gbrain eval whoknows [--json] [--skip-replay]`: two-layer eval gate (ENG-D2). Layer 1 quality (hand-labeled fixture, top-3 hit rate ≥ 0.8). Layer 2 regression (`eval_candidates` replay set-Jaccard@3 ≥ 0.4). Sparseness fallback: < 20 replay-eligible rows → Layer 2 auto-skips with stderr warning. Stable JSON envelope with `schema_version: 1`. Exit 0/1/2 for pass/fail/usage so CI can gate. Mirrors v0.27.x cross-modal + v0.28.1 longmemeval dispatch shape under `src/commands/eval.ts`. **v0.33.1.3:** `WhoknowsFn` callable abstraction lets the gates be impl-agnostic. `runEvalWhoknows(engine: BrainEngine | null, args)` picks the impl at entry — thin-client mode (`isThinClient(cfg)`) routes per-query through `callRemoteTool(cfg, 'find_experts', {topic, limit})` via the v0.31.1 seam; local mode calls `findExperts(engine, ...)` directly. 
cli.ts adds a thin-client bypass before `connectEngine` for `gbrain eval whoknows`, matching the longmemeval/cross-modal no-DB pattern. Regression gate auto-skips in thin-client mode (no DB access to `eval_candidates`). Public exports `jaccardAtK`, `topKHit`, `readFixture`, `WhoknowsFn`, threshold constants are pinned by `test/eval-whoknows.test.ts` (25 cases, +2 for the null-engine signature contract). +- `test/fixtures/whoknows-eval.jsonl` (v0.33) — 10-row synthetic placeholder demonstrating the eval-fixture schema (`{query, expected_top_3_slugs, notes?}` JSONL). End users replace with their own real queries before shipping; the placeholder uses obviously-example slugs (`wiki/people/example-alice`) so production data isn't conflated with the test fixture. Drives `test/e2e/whoknows.test.ts` (which seeds a matching synthetic brain and asserts the >=80% gate) and the `whoknows_health` doctor check. - `src/commands/transcripts.ts` (v0.29) — `gbrain transcripts recent [--days N] [--full] [--json]`: recent raw `.txt` transcripts from the dream-cycle corpus dirs. Imports `listRecentTranscripts` from `src/core/transcripts.ts` (the same library the gated `get_recent_transcripts` MCP op uses). Local-only by construction — the CLI always runs with `ctx.remote=false`. - `src/commands/integrity.ts` — `gbrain integrity check|auto|review|extract`: bare-tweet detection, dead-link detection, three-bucket repair (auto-repair / review-queue / skip). `scanIntegrity()` is the shared library function called from `gbrain doctor` (sampled at limit=500) and `cmdCheck` (full scan). v0.22.8: batch-load fast path on Postgres uses a single SQL query to fix the PgBouncer round-trip timeout (60s → ~6s). Gated by `engine.kind === 'postgres'` at the call site so PGLite never enters batch; fallback `catch` logs at `GBRAIN_DEBUG=1` so real Postgres errors are diagnosable. **v0.32.8 (PR #860):** batch projection switched from `SELECT DISTINCT ON (slug)` to `SELECT ... 
ORDER BY source_id, slug` so multi-source brains scan each `(source, slug)` row independently (pre-fix the DISTINCT collapsed same-slug-different-source pages into one scan, the same bug class this PR fixes). Sequential and auto-repair loops use `listAllPageRefs()` to enumerate `(slug, source_id)` pairs and thread `sourceId` to `getPage`. Batch + sequential paths now report the same page count on multi-source brains. - `src/commands/doctor.ts` — `gbrain doctor [--json] [--fast] [--fix] [--dry-run] [--index-audit]`: health checks. v0.12.3 added `jsonb_integrity` + `markdown_body_completeness` reliability checks. v0.14.1: `--fix` delegates inlined cross-cutting rules to `> **Convention:** see [path](path).` callouts (pipes DRY violations into `src/core/dry-fix.ts`); `--fix --dry-run` previews without writing. v0.14.2: `schema_version` check fails loudly when `version=0` (migrations never ran — the #218 `bun install -g` signature) and routes users to `gbrain apply-migrations --yes`; new opt-in `--index-audit` flag (Postgres-only) reports zero-scan indexes from `pg_stat_user_indexes` (informational only, no auto-drop). v0.15.2: every DB check is wrapped in a progress phase; `markdown_body_completeness` runs under a 1s heartbeat timer so 10+ min scans are observable on 50K-page brains. v0.19.1 added `queue_health` (Postgres-only) with two subchecks: stalled-forever active jobs (started_at > 1h) and waiting-depth-per-name > threshold (default 10, override via `GBRAIN_QUEUE_WAITING_THRESHOLD`). Worker-heartbeat subcheck intentionally deferred to follow-up B7 because it needs a `minion_workers` table to produce ground-truth signal. Fix hints point at `gbrain repair-jsonb`, `gbrain sync --force`, `gbrain apply-migrations`, and `gbrain jobs get/cancel `. v0.22.12 (#500): `sync_failures` check shows `[CODE=N, ...]` breakdown for both unacked entries (warn) and acked-historical entries (ok), surfacing systemic failure modes (`SLUG_MISMATCH=2685`) instead of a bare count. 
v0.26.7 (#612): `rls_event_trigger` check (post-install drift detector for migration v35's auto-RLS event trigger). Lives outside the `// 5. RLS` slice that the structural doctor.test.ts guards anchor on, so the existing test guards stay intact. Healthy `evtenabled` set is `('O','A')` only — `R` is replica-only and would not fire in normal sessions; `D` is disabled. Fix hint is `gbrain apply-migrations --force-retry 35`. **v0.30.2:** `queue_health` gains a fourth subcheck — surfaces dead-lettered subagent jobs with `last_error` matching the `prompt_too_long` classifier within the last 24h. Fix hint points at `gbrain dream --phase synthesize --dry-run --json` to identify the offending transcript and `gbrain jobs prune --status dead --queue default` to clean up. Postgres-only. **v0.31.7:** `runDoctor` switches to `autoDetectSkillsDirReadOnly` (from `src/core/repo-root.ts`) so `bun install -g github:garrytan/gbrain && cd ~ && gbrain doctor` finds the bundled `skills/` via the install-path fallback instead of warning "Could not find skills directory" + docking the health score. `--fix` carries a D6 safety gate: when `detected.source === 'install_path'`, the command refuses auto-repair with a stderr message pointing at `$GBRAIN_SKILLS_DIR` / `$OPENCLAW_WORKSPACE` / `--skills-dir`, because `autoFixDryViolations` writes to SKILL.md files and would otherwise silently rewrite the install tree. The `graph_coverage` check now short-circuits to `ok: 'No entity pages — graph_coverage not applicable (markdown-only brain)'` when `SELECT COUNT(*) FROM pages WHERE type IN ('entity','person','company','organization')` returns 0 (closes #530); the entity count is woven into the warn message and the WARN hint switches from the long-deprecated `gbrain link-extract && gbrain timeline-extract` (gone since v0.16) to the canonical `gbrain extract all`. Pinned by an IRON-RULE regression assertion in `test/doctor.test.ts` that bans the stale verb names from the source string. 
**v0.32.4:** new `sync_freshness` check (exported `checkSyncFreshness` at the same file) added to both `runDoctor` (local) and `doctorReportRemote` (thin-client). Pure staleness probe — queries `sources.last_sync_at` only, no filesystem access. Warns at 24h, fails at 72h (or never-synced). Future-`last_sync_at` warns ("clock skew or corrupted timestamp") instead of silently falling through as ok — codex outside-voice caught the negative-ageMs bug pre-merge. Env-var overrides `GBRAIN_SYNC_FRESHNESS_WARN_HOURS` / `GBRAIN_SYNC_FRESHNESS_FAIL_HOURS`; invalid values fall back to defaults with a once-per-process stderr warn (`_resolveSyncFreshnessHours`). Failure messages embed `source.id` (not `source.name`) so the printed fix command `gbrain sync --source ` matches what the user copy-pastes. Filesystem-vs-DB page drift detection was deliberately stripped from the v0.32.4 scope — `doctorReportRemote` runs in the HTTP MCP server (`src/commands/serve-http.ts`), and walking DB-supplied `local_path` from a remote-callable endpoint crosses a trust boundary (OAuth write scope could mutate `sources.local_path`). Drift detection will resurface in a separate PR routed through `multi_source_drift`'s existing guard infrastructure (`GBRAIN_DRIFT_LIMIT` / `GBRAIN_DRIFT_TIMEOUT_MS`) with slug normalization tests and a meta-file allow-list. Pinned by 12 cases in `test/doctor.test.ts` ("v0.32.4 — sync_freshness check" describe block): empty sources, never-synced fail, >72h fail, exact 72h boundary, 24h-72h warn, exact 24h boundary, <24h ok, future-timestamp warn, mixed sources (highest severity wins), `executeRaw` throws → outer-catch warn, `GBRAIN_SYNC_FRESHNESS_FAIL_HOURS=6` override fires at 7h, source.id-in-message regression. 
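The `sync_freshness` thresholds described above reduce to a small pure function. A minimal sketch for orientation — the names (`classifySyncFreshness`, `Severity`, `overall`) are illustrative, not the module's actual exports, and the exact-boundary behavior at 24h/72h is an assumption rather than the pinned test semantics:

```typescript
type Severity = 'ok' | 'warn' | 'fail';

// Staleness probe as described: warn at 24h, fail at 72h or never-synced,
// warn (not ok) on a future last_sync_at — clock skew or corrupted timestamp.
function classifySyncFreshness(
  lastSyncAt: Date | null,
  now: Date,
  warnHours = 24,
  failHours = 72,
): Severity {
  if (lastSyncAt === null) return 'fail'; // never synced
  const ageHours = (now.getTime() - lastSyncAt.getTime()) / 3_600_000;
  if (ageHours < 0) return 'warn'; // negative age: future timestamp
  if (ageHours > failHours) return 'fail';
  if (ageHours > warnHours) return 'warn';
  return 'ok';
}

// "Highest severity wins" when a brain has mixed sources.
const rank: Record<Severity, number> = { ok: 0, warn: 1, fail: 2 };
const overall = (xs: Severity[]): Severity =>
  xs.reduce<Severity>((a, b) => (rank[b] > rank[a] ? b : a), 'ok');
```

The env overrides (`GBRAIN_SYNC_FRESHNESS_WARN_HOURS` / `GBRAIN_SYNC_FRESHNESS_FAIL_HOURS`) would map onto the two default parameters, falling back to 24/72 on invalid values per the description above.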
diff --git a/VERSION b/VERSION
index be386c9ed..a725d6e3d 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.33.0
+0.33.1.0
\ No newline at end of file
diff --git a/llms-full.txt b/llms-full.txt
index 36979515d..88ca309a1 100644
--- a/llms-full.txt
+++ b/llms-full.txt
@@ -188,7 +188,7 @@ strict behavior when unset.
 - `docs/eval-bench.md` (v0.25.0) — contributor guide for using captured data to benchmark retrieval changes before merging. Linked from CONTRIBUTING.md under "Running real-world eval benchmarks (touching retrieval code)".
 - `src/core/eval-capture.ts` (v0.25.0) — op-layer capture wrapper called from `src/core/operations.ts` `query` + `search` handlers. Catches MCP + CLI + subagent tool-bridge from one site. Fire-and-forget; failures route to `engine.logEvalCaptureFailure` so `gbrain doctor` sees drops cross-process. **Capture is off by default** — `isEvalCaptureEnabled` resolution: explicit `config.eval.capture` (true/false) wins, else `process.env.GBRAIN_CONTRIBUTOR_MODE === '1'`, else off. Production users get a quiet brain; contributors set `export GBRAIN_CONTRIBUTOR_MODE=1` in `.zshrc` to enable the dev loop. PII scrubber gate is independent and defaults to true regardless of CONTRIBUTOR_MODE.
 - `src/core/eval-capture-scrub.ts` (v0.25.0) — zero-deps PII scrubber: emails, phones, SSN, Luhn-verified credit cards, JWT-shaped tokens, bearer tokens.
-- `src/core/search/hybrid.ts` — Cathedral II `Promise` return shape unchanged in v0.25.0. Adds `onMeta?: (m: HybridSearchMeta) => void` callback so op-layer capture can record what hybridSearch actually did. Existing callers leave it undefined.
+- `src/core/search/hybrid.ts` — Cathedral II `Promise` return shape unchanged in v0.25.0. Adds `onMeta?: (m: HybridSearchMeta) => void` callback so op-layer capture can record what hybridSearch actually did. Existing callers leave it undefined.
**v0.33:** `HybridSearchOpts.types?: PageType[]` (defined on `SearchOpts`) threads a multi-type filter through to per-engine `searchKeyword` + `searchVector` + `searchKeywordChunks`, where it lands as `AND p.type = ANY($N::text[])`. Primary consumer is `gbrain whoknows` (filters to `['person','company']`). AND-applies alongside the existing single-value `type` filter; either or both can be used. - `docs/eval-capture.md` (v0.25.0) — stable NDJSON schema reference for gbrain-evals consumers. - `test/public-exports.test.ts` (v0.25.0 / R2) — runtime contract test. Imports each of the 17 public subpaths via package name and pins a canary symbol per module. Paired with `scripts/check-exports-count.sh`. - `src/core/embedding.ts` — OpenAI text-embedding-3-large, batch, retry, backoff. **v0.28.7:** `BATCH_SIZE` reverted 50→100 — the original Voyage safety guard halved OpenAI throughput on every page. Per-recipe pre-split + recursive halving + adaptive shrink-on-miss now live in the gateway, so the outer paginator goes back to its original purpose: progress-callback granularity, not batch protection. @@ -267,6 +267,9 @@ strict behavior when unset. - `src/commands/orphans.ts` — `gbrain orphans [--json] [--count] [--include-pseudo]`: surfaces pages with zero inbound wikilinks, grouped by domain. Auto-generated/raw/pseudo pages filtered by default. Also exposed as `find_orphans` MCP operation. Shipped in v0.12.3 (contributed by @knee5). - `src/commands/salience.ts` (v0.29) — `gbrain salience [--days N] [--limit N] [--kind PREFIX] [--json]`: pages ranked by emotional + activity salience over a recency window. Mirrors orphans.ts shape (pure data fn + JSON formatter + human formatter). Calls `engine.getRecentSalience(opts)`. Score formula: `(emotional_weight × 5) + ln(1 + active_take_count) + 1/(1 + days_since_update)`. - `src/commands/anomalies.ts` (v0.29) — `gbrain anomalies [--since YYYY-MM-DD] [--lookback-days N] [--sigma N] [--json]`: cohort-level activity outliers. 
Calls `engine.findAnomalies(opts)`. Two cohort kinds in v1: tag, type. Year cohort deferred to v0.30. +- `src/commands/whoknows.ts` (v0.33) — `gbrain whoknows [--explain] [--limit N] [--json]`: expertise + relationship-proximity routing. Mirrors v0.29 salience/anomalies shape (pure `rankCandidates()` + `findExperts()` orchestrator + `runWhoknows()` CLI dispatch + thin-client routing). MCP op = `find_experts` (scope: read, localOnly: false) per ENG-D5. Ranking formula (ENG-D1 locked): `score = log(1 + raw_match) × max(0.1, exp(-days/180)) × (0.5 + 0.5 × salience)` where `raw_match` is hybridSearch's RRF+source-boost score. Filters at SQL via the new `SearchOpts.types: ['person', 'company']` (no post-filter waste). hybridSearch's internal salience+recency boosts are intentionally disabled — the locked formula applies on a clean signal. Floors prevent multiplicative-zero edge cases (cold-start people stay visible); ties break alphabetically by slug for determinism. 16 unit tests in `test/whoknows.test.ts` pin the math. +- `src/commands/eval-whoknows.ts` (v0.33, v0.33.1.3 thin-client wiring) — `gbrain eval whoknows [--json] [--skip-replay]`: two-layer eval gate (ENG-D2). Layer 1 quality (hand-labeled fixture, top-3 hit rate ≥ 0.8). Layer 2 regression (`eval_candidates` replay set-Jaccard@3 ≥ 0.4). Sparseness fallback: < 20 replay-eligible rows → Layer 2 auto-skips with stderr warning. Stable JSON envelope with `schema_version: 1`. Exit 0/1/2 for pass/fail/usage so CI can gate. Mirrors v0.27.x cross-modal + v0.28.1 longmemeval dispatch shape under `src/commands/eval.ts`. **v0.33.1.3:** `WhoknowsFn` callable abstraction lets the gates be impl-agnostic. `runEvalWhoknows(engine: BrainEngine | null, args)` picks the impl at entry — thin-client mode (`isThinClient(cfg)`) routes per-query through `callRemoteTool(cfg, 'find_experts', {topic, limit})` via the v0.31.1 seam; local mode calls `findExperts(engine, ...)` directly. 
cli.ts adds a thin-client bypass before `connectEngine` for `gbrain eval whoknows`, matching the longmemeval/cross-modal no-DB pattern. Regression gate auto-skips in thin-client mode (no DB access to `eval_candidates`). Public exports `jaccardAtK`, `topKHit`, `readFixture`, `WhoknowsFn`, threshold constants are pinned by `test/eval-whoknows.test.ts` (25 cases, +2 for the null-engine signature contract). +- `test/fixtures/whoknows-eval.jsonl` (v0.33) — 10-row synthetic placeholder demonstrating the eval-fixture schema (`{query, expected_top_3_slugs, notes?}` JSONL). End users replace with their own real queries before shipping; the placeholder uses obviously-example slugs (`wiki/people/example-alice`) so production data isn't conflated with the test fixture. Drives `test/e2e/whoknows.test.ts` (which seeds a matching synthetic brain and asserts the >=80% gate) and the `whoknows_health` doctor check. - `src/commands/transcripts.ts` (v0.29) — `gbrain transcripts recent [--days N] [--full] [--json]`: recent raw `.txt` transcripts from the dream-cycle corpus dirs. Imports `listRecentTranscripts` from `src/core/transcripts.ts` (the same library the gated `get_recent_transcripts` MCP op uses). Local-only by construction — the CLI always runs with `ctx.remote=false`. - `src/commands/integrity.ts` — `gbrain integrity check|auto|review|extract`: bare-tweet detection, dead-link detection, three-bucket repair (auto-repair / review-queue / skip). `scanIntegrity()` is the shared library function called from `gbrain doctor` (sampled at limit=500) and `cmdCheck` (full scan). v0.22.8: batch-load fast path on Postgres uses a single SQL query to fix the PgBouncer round-trip timeout (60s → ~6s). Gated by `engine.kind === 'postgres'` at the call site so PGLite never enters batch; fallback `catch` logs at `GBRAIN_DEBUG=1` so real Postgres errors are diagnosable. **v0.32.8 (PR #860):** batch projection switched from `SELECT DISTINCT ON (slug)` to `SELECT ... 
ORDER BY source_id, slug` so multi-source brains scan each `(source, slug)` row independently (pre-fix the DISTINCT collapsed same-slug-different-source pages into one scan, the same bug class this PR fixes). Sequential and auto-repair loops use `listAllPageRefs()` to enumerate `(slug, source_id)` pairs and thread `sourceId` to `getPage`. Batch + sequential paths now report the same page count on multi-source brains. - `src/commands/doctor.ts` — `gbrain doctor [--json] [--fast] [--fix] [--dry-run] [--index-audit]`: health checks. v0.12.3 added `jsonb_integrity` + `markdown_body_completeness` reliability checks. v0.14.1: `--fix` delegates inlined cross-cutting rules to `> **Convention:** see [path](path).` callouts (pipes DRY violations into `src/core/dry-fix.ts`); `--fix --dry-run` previews without writing. v0.14.2: `schema_version` check fails loudly when `version=0` (migrations never ran — the #218 `bun install -g` signature) and routes users to `gbrain apply-migrations --yes`; new opt-in `--index-audit` flag (Postgres-only) reports zero-scan indexes from `pg_stat_user_indexes` (informational only, no auto-drop). v0.15.2: every DB check is wrapped in a progress phase; `markdown_body_completeness` runs under a 1s heartbeat timer so 10+ min scans are observable on 50K-page brains. v0.19.1 added `queue_health` (Postgres-only) with two subchecks: stalled-forever active jobs (started_at > 1h) and waiting-depth-per-name > threshold (default 10, override via `GBRAIN_QUEUE_WAITING_THRESHOLD`). Worker-heartbeat subcheck intentionally deferred to follow-up B7 because it needs a `minion_workers` table to produce ground-truth signal. Fix hints point at `gbrain repair-jsonb`, `gbrain sync --force`, `gbrain apply-migrations`, and `gbrain jobs get/cancel `. v0.22.12 (#500): `sync_failures` check shows `[CODE=N, ...]` breakdown for both unacked entries (warn) and acked-historical entries (ok), surfacing systemic failure modes (`SLUG_MISMATCH=2685`) instead of a bare count. 
v0.26.7 (#612): `rls_event_trigger` check (post-install drift detector for migration v35's auto-RLS event trigger). Lives outside the `// 5. RLS` slice that the structural doctor.test.ts guards anchor on, so the existing test guards stay intact. Healthy `evtenabled` set is `('O','A')` only — `R` is replica-only and would not fire in normal sessions; `D` is disabled. Fix hint is `gbrain apply-migrations --force-retry 35`. **v0.30.2:** `queue_health` gains a fourth subcheck — surfaces dead-lettered subagent jobs with `last_error` matching the `prompt_too_long` classifier within the last 24h. Fix hint points at `gbrain dream --phase synthesize --dry-run --json` to identify the offending transcript and `gbrain jobs prune --status dead --queue default` to clean up. Postgres-only. **v0.31.7:** `runDoctor` switches to `autoDetectSkillsDirReadOnly` (from `src/core/repo-root.ts`) so `bun install -g github:garrytan/gbrain && cd ~ && gbrain doctor` finds the bundled `skills/` via the install-path fallback instead of warning "Could not find skills directory" + docking the health score. `--fix` carries a D6 safety gate: when `detected.source === 'install_path'`, the command refuses auto-repair with a stderr message pointing at `$GBRAIN_SKILLS_DIR` / `$OPENCLAW_WORKSPACE` / `--skills-dir`, because `autoFixDryViolations` writes to SKILL.md files and would otherwise silently rewrite the install tree. The `graph_coverage` check now short-circuits to `ok: 'No entity pages — graph_coverage not applicable (markdown-only brain)'` when `SELECT COUNT(*) FROM pages WHERE type IN ('entity','person','company','organization')` returns 0 (closes #530); the entity count is woven into the warn message and the WARN hint switches from the long-deprecated `gbrain link-extract && gbrain timeline-extract` (gone since v0.16) to the canonical `gbrain extract all`. Pinned by an IRON-RULE regression assertion in `test/doctor.test.ts` that bans the stale verb names from the source string. 
**v0.32.4:** new `sync_freshness` check (exported `checkSyncFreshness` at the same file) added to both `runDoctor` (local) and `doctorReportRemote` (thin-client). Pure staleness probe — queries `sources.last_sync_at` only, no filesystem access. Warns at 24h, fails at 72h (or never-synced). Future-`last_sync_at` warns ("clock skew or corrupted timestamp") instead of silently falling through as ok — codex outside-voice caught the negative-ageMs bug pre-merge. Env-var overrides `GBRAIN_SYNC_FRESHNESS_WARN_HOURS` / `GBRAIN_SYNC_FRESHNESS_FAIL_HOURS`; invalid values fall back to defaults with a once-per-process stderr warn (`_resolveSyncFreshnessHours`). Failure messages embed `source.id` (not `source.name`) so the printed fix command `gbrain sync --source ` matches what the user copy-pastes. Filesystem-vs-DB page drift detection was deliberately stripped from the v0.32.4 scope — `doctorReportRemote` runs in the HTTP MCP server (`src/commands/serve-http.ts`), and walking DB-supplied `local_path` from a remote-callable endpoint crosses a trust boundary (OAuth write scope could mutate `sources.local_path`). Drift detection will resurface in a separate PR routed through `multi_source_drift`'s existing guard infrastructure (`GBRAIN_DRIFT_LIMIT` / `GBRAIN_DRIFT_TIMEOUT_MS`) with slug normalization tests and a meta-file allow-list. Pinned by 12 cases in `test/doctor.test.ts` ("v0.32.4 — sync_freshness check" describe block): empty sources, never-synced fail, >72h fail, exact 72h boundary, 24h-72h warn, exact 24h boundary, <24h ok, future-timestamp warn, mixed sources (highest severity wins), `executeRaw` throws → outer-catch warn, `GBRAIN_SYNC_FRESHNESS_FAIL_HOURS=6` override fires at 7h, source.id-in-message regression. 
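The locked ENG-D1 ranking spec behind the headline `whoknows` command reduces to a small pure function. A minimal sketch under stated assumptions — names are illustrative, not the shipped `rankCandidates` signature; the 180-day constant is the spec's "6-month" decay:

```typescript
// Sketch of the locked ENG-D1 whoknows score. Every degenerate input
// degrades to a floored or neutral factor instead of zeroing the product.
function whoknowsScore(
  rawMatch: number,         // hybrid-search relevance proxy for chunk matches
  daysSince: number | null, // days since effective_date; null = cold start
  salience: number | null,  // 0..1 page salience; null = unknown
): number {
  // expertise: sub-linear so one big page can't dominate
  const expertise = Math.log1p(Math.max(0, Number.isFinite(rawMatch) ? rawMatch : 0));
  // recency: exponential decay floored at 0.1 (multiplicative-zero defense)
  const decay =
    daysSince == null || !Number.isFinite(daysSince)
      ? 0
      : Math.exp(-Math.max(0, daysSince) / 180);
  const recency = Math.max(0.1, decay);
  // salience: clamp to [0, 1], missing value = 0.5 neutral, then center at 0.5
  let s = salience == null || !Number.isFinite(salience) ? 0.5 : salience;
  s = Math.min(1, Math.max(0, s));
  return expertise * recency * (0.5 + 0.5 * s);
}
```

Cold-start pages (no `effective_date`) keep the 0.1 recency floor rather than zeroing out, which is the multiplicative-zero defense the spec calls out.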
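The Layer 2 regression metric, set-Jaccard@3, is cheap to state exactly. A sketch mirroring the shipped `jaccardAtK` at k = 3 (order within the top-3 is ignored; that is the "set" in set-Jaccard):

```typescript
// Set-Jaccard@3: only the first three slugs on each side count. Two empty
// sides compare as 1.0 (vacuously stable), matching the eval's
// "no data, no regression" stance.
function jaccard3(a: string[], b: string[]): number {
  const A = new Set(a.slice(0, 3));
  const B = new Set(b.slice(0, 3));
  if (A.size === 0 && B.size === 0) return 1;
  let inter = 0;
  for (const x of A) if (B.has(x)) inter++;
  return inter / (A.size + B.size - inter);
}
```

At three results per side, one shared slug scores 1/5 = 0.2 and two shared score 2/4 = 0.5, so the 0.4 mean gate roughly demands two-of-three agreement with the captured runs on average.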
diff --git a/package.json b/package.json index 45082da79..540d2bef7 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gbrain", - "version": "0.33.0", + "version": "0.33.1.0", "description": "Postgres-native personal knowledge brain with hybrid RAG search", "type": "module", "main": "src/core/index.ts", diff --git a/src/cli.ts b/src/cli.ts index ab9afb783..ddaffdcf3 100755 --- a/src/cli.ts +++ b/src/cli.ts @@ -957,6 +957,18 @@ async function handleCliOnly(command: string, args: string[]) { return; } + // v0.33.1.3: `gbrain eval whoknows` on thin-client installs bypasses + // connectEngine entirely — the eval routes per-query through the remote + // `find_experts` MCP op (the v0.31.1 routing seam). Local mode falls + // through to the engine-connected path below. + if (command === 'eval' && args[0] === 'whoknows') { + const cfgPre = loadConfig(); + if (isThinClient(cfgPre)) { + const { runEvalWhoknows } = await import('./commands/eval-whoknows.ts'); + process.exit(await runEvalWhoknows(null, args.slice(1))); + } + } + // All remaining CLI-only commands need a DB connection const engine = await connectEngine(); try { @@ -1088,6 +1100,15 @@ async function handleCliOnly(command: string, args: string[]) { await runAnomalies(engine, args); break; } + case 'whoknows': { + // v0.33 (Issue #?): expertise + relationship-proximity routing. + // MCP op `find_experts` (read-scoped) backs the same code path; CLI + // dispatch here is the user-facing surface. Thin-client routing + // happens inside runWhoknows via isThinClient(cfg) (v0.31.1 pattern). 
+ const { runWhoknows } = await import('./commands/whoknows.ts'); + await runWhoknows(engine, args); + break; + } case 'transcripts': { const { runTranscripts } = await import('./commands/transcripts.ts'); await runTranscripts(engine, args); diff --git a/src/commands/doctor.ts b/src/commands/doctor.ts index e287cda97..4ec4f8b8c 100644 --- a/src/commands/doctor.ts +++ b/src/commands/doctor.ts @@ -86,6 +86,66 @@ export function computeDoctorReport(checks: Check[]): DoctorReport { * Tolerance matches migration v48: any value with abs(weight - on_grid) > 1e-3 * is genuinely off-grid (the 0.05 grid is 5e-2; float32 noise is ~1e-7). */ +/** + * v0.33: whoknows_health — verify the eval fixture is present at the + * documented path. Lightweight; just checks file existence and row count, + * not the eval gate outcome (that runs via `gbrain eval whoknows`). + * + * Surface is intentionally narrow: a missing fixture means the eval + * cannot run at all, which is the highest-leverage signal. Hit-rate + * regression detection lives in `gbrain eval whoknows --json` and is + * the job of the eval command, not the doctor sweep. + */ +export async function whoknowsHealthCheck(_engine: BrainEngine): Promise<Check> { + try { + const { existsSync, readFileSync, statSync } = await import('fs'); + const path = await import('path'); + const repoRoot = process.cwd(); + const fixturePath = path.join(repoRoot, 'test/fixtures/whoknows-eval.jsonl'); + if (!existsSync(fixturePath)) { + return { + name: 'whoknows_health', + status: 'warn', + message: `whoknows eval fixture missing at test/fixtures/whoknows-eval.jsonl. Fix: hand-label 10 queries you'd actually run, format {query, expected_top_3_slugs, notes}.`, + }; + } + const stat = statSync(fixturePath); + if (stat.size === 0) { + return { + name: 'whoknows_health', + status: 'warn', + message: 'whoknows eval fixture exists but is empty. 
The eval cannot pass without queries.', + }; + } + const raw = readFileSync(fixturePath, 'utf-8'); + const rows = raw + .split('\n') + .filter((l) => { + const t = l.trim(); + return t && !t.startsWith('#') && !t.startsWith('//'); + }); + if (rows.length < 5) { + return { + name: 'whoknows_health', + status: 'warn', + message: `whoknows eval fixture has only ${rows.length} row(s); ENG-D2 recommends 10. Fix: add more hand-labeled queries.`, + }; + } + return { + name: 'whoknows_health', + status: 'ok', + message: `whoknows eval fixture present (${rows.length} queries). Run \`gbrain eval whoknows test/fixtures/whoknows-eval.jsonl\` to grade.`, + }; + } catch (e) { + const msg = e instanceof Error ? e.message : String(e); + return { + name: 'whoknows_health', + status: 'warn', + message: `Could not check whoknows fixture: ${msg}`, + }; + } +} + export async function takesWeightGridCheck(engine: BrainEngine): Promise { try { const rows = await engine.executeRaw<{ off_grid: string | number; total: string | number }>( @@ -1511,6 +1571,12 @@ export async function runDoctor(engine: BrainEngine | null, args: string[], dbSo progress.heartbeat('takes_weight_grid'); checks.push(await takesWeightGridCheck(engine)); + // v0.33: whoknows_health — fixture presence + row count. The eval + // gate itself runs via `gbrain eval whoknows`; this check is the + // "did you do the assignment?" signal. + progress.heartbeat('whoknows_health'); + checks.push(await whoknowsHealthCheck(engine)); + // 11. Markdown body completeness (v0.12.3 reliability wave). // v0.12.0's splitBody ate everything after the first `---` horizontal rule, // truncating wiki-style pages. Heuristic: pages whose body is <30% of the diff --git a/src/commands/eval-whoknows.ts b/src/commands/eval-whoknows.ts new file mode 100644 index 000000000..83e62b67e --- /dev/null +++ b/src/commands/eval-whoknows.ts @@ -0,0 +1,451 @@ +/** + * gbrain eval whoknows — v0.33 two-layer eval gate (ENG-D2). 
+ * + * Layer 1 (PRIMARY, ship-blocking): hand-labeled fixture. + * For each {query, expected_top_3_slugs}, run `findExperts` and check + * whether top-3 result slugs intersect with expected_top_3_slugs. + * Pass = HIT_RATE_THRESHOLD (0.8) or higher. + * + * Layer 2 (SECONDARY, ship-blocking when data exists): eval_candidates replay. + * Stream rows from `eval_candidates` where tool_name='query' (the closest + * shape to whoknows queries the capture system has). For each, re-run + * findExperts and compute set-Jaccard@3 between current output and + * captured retrieved_slugs. Pass = REGRESSION_THRESHOLD (0.4) mean Jaccard. + * + * Sparseness fallback: if fewer than MIN_REPLAY_ROWS (20) replay-eligible + * rows exist, regression gate auto-disables with stderr warning and exit + * is decided by Layer 1 alone. + * + * Exit codes: + * 0 — both gates passed (or Layer 1 passed + Layer 2 skipped via sparseness) + * 1 — at least one gate failed + * 2 — config/usage error + * + * Output: + * --json machine-readable JSON envelope + * default human-readable table + verdict + * + * Usage: + * gbrain eval whoknows test/fixtures/whoknows-eval.jsonl + * gbrain eval whoknows test/fixtures/whoknows-eval.jsonl --json + * gbrain eval whoknows test/fixtures/whoknows-eval.jsonl --skip-replay + */ + +import { readFileSync, existsSync } from 'fs'; +import type { BrainEngine } from '../core/engine.ts'; +import { findExperts, type WhoknowsResult } from './whoknows.ts'; +import { loadConfig, isThinClient } from '../core/config.ts'; +import { callRemoteTool, unpackToolResult } from '../core/mcp-client.ts'; + +export const HIT_RATE_THRESHOLD = 0.8; +export const REGRESSION_THRESHOLD = 0.4; +export const MIN_REPLAY_ROWS = 20; + +export interface FixtureRow { + query: string; + expected_top_3_slugs: string[]; + notes?: string; +} + +export interface QualityRowResult { + query: string; + expected: string[]; + actual_top_3: string[]; + hit: boolean; +} + +export interface QualityReport { + total: 
number; + hits: number; + hit_rate: number; + threshold: number; + passed: boolean; + rows: QualityRowResult[]; +} + +export interface RegressionRowResult { + query: string; + captured: string[]; + current: string[]; + jaccard: number; +} + +export interface RegressionReport { + status: 'passed' | 'failed' | 'skipped'; + reason?: string; // populated when skipped + total: number; + mean_jaccard: number; + threshold: number; + rows: RegressionRowResult[]; +} + +export interface EvalWhoknowsReport { + schema_version: 1; + fixture_path: string; + quality: QualityReport; + regression: RegressionReport; + overall_passed: boolean; + exit_code: 0 | 1; +} + +interface CliOpts { + fixturePath?: string; + json: boolean; + skipReplay: boolean; + limit: number; + help: boolean; +} + +function parseArgs(args: string[]): CliOpts { + const opts: CliOpts = { json: false, skipReplay: false, limit: 5, help: false }; + const positional: string[] = []; + for (let i = 0; i < args.length; i++) { + const a = args[i]; + if (a === '--help' || a === '-h') { + opts.help = true; + continue; + } + if (a === '--json') { + opts.json = true; + continue; + } + if (a === '--skip-replay') { + opts.skipReplay = true; + continue; + } + if (a === '--limit') { + const n = parseInt(args[++i] ?? 
'', 10); + if (Number.isFinite(n) && n > 0) opts.limit = n; + continue; + } + if (a && !a.startsWith('--')) positional.push(a); + } + if (positional[0]) opts.fixturePath = positional[0]; + return opts; +} + +const HELP = `Usage: gbrain eval whoknows <fixture> [options] + +Two-layer eval gate (v0.33 ENG-D2) for naive gbrain whoknows: + Layer 1 (PRIMARY): hand-labeled fixture, pass at >= 80% top-3 hit rate + Layer 2 (REGRESSION): eval_candidates replay set-Jaccard@3 >= 0.4 + (auto-skipped if < 20 replay-eligible rows) + +Fixture format (JSONL, one row per line): + {"query": "lab automation", "expected_top_3_slugs": ["wiki/people/alice", "..."], "notes": "..."} + +Options: + --json Emit JSON report instead of human-readable table + --skip-replay Skip Layer 2 entirely (run quality gate only) + --limit N Top-K to grade (default 5; eval uses top-3 by default) + --help, -h Show this help +`; + +export function readFixture(path: string): FixtureRow[] { + if (!existsSync(path)) { + throw new Error(`fixture not found: ${path}`); + } + const raw = readFileSync(path, 'utf-8'); + const rows: FixtureRow[] = []; + for (const line of raw.split('\n')) { + const trimmed = line.trim(); + if (!trimmed || trimmed.startsWith('//') || trimmed.startsWith('#')) continue; + let obj: unknown; + try { + obj = JSON.parse(trimmed); + } catch (e) { + throw new Error(`malformed JSONL line: ${trimmed.slice(0, 80)}`); + } + if ( + obj && + typeof obj === 'object' && + typeof (obj as Record<string, unknown>).query === 'string' && + Array.isArray((obj as Record<string, unknown>).expected_top_3_slugs) + ) { + const o = obj as Record<string, unknown>; + const expected = (o.expected_top_3_slugs as unknown[]).filter( + (s): s is string => typeof s === 'string', + ); + const row: FixtureRow = { + query: o.query as string, + expected_top_3_slugs: expected, + }; + if (typeof o.notes === 'string') row.notes = o.notes; + rows.push(row); + } else { + throw new Error(`fixture row missing required fields (query, expected_top_3_slugs): ${trimmed.slice(0, 80)}`); + } + } + 
return rows; +} + +/** + * Set-Jaccard@k between two slug lists, treating only the first k items + * of each as the set. Empty intersection over empty union = 1.0 (vacuously + * stable); empty intersection over non-empty union = 0. + */ +export function jaccardAtK(a: string[], b: string[], k = 3): number { + const setA = new Set(a.slice(0, k)); + const setB = new Set(b.slice(0, k)); + if (setA.size === 0 && setB.size === 0) return 1; + let intersect = 0; + for (const x of setA) if (setB.has(x)) intersect++; + const union = setA.size + setB.size - intersect; + return union === 0 ? 1 : intersect / union; +} + +export function topKHit(actual: string[], expected: string[], k = 3): boolean { + const expectedSet = new Set(expected); + for (let i = 0; i < Math.min(k, actual.length); i++) { + if (expectedSet.has(actual[i])) return true; + } + return false; +} + +/** + * v0.33.1.3: per-query whoknows callable. The eval layers are agnostic + * about WHERE findExperts runs — local engine call vs thin-client MCP + * routed call. runEvalWhoknows picks the impl, the gates consume it. + */ +export type WhoknowsFn = (topic: string, limit: number) => Promise<WhoknowsResult[]>; + +async function runQualityGate( + whoknows: WhoknowsFn, + fixture: FixtureRow[], + limit: number, +): Promise<QualityReport> { + const rows: QualityRowResult[] = []; + for (const row of fixture) { + const results = await whoknows(row.query, limit); + const actualTop3 = results.slice(0, 3).map((r) => r.slug); + rows.push({ + query: row.query, + expected: row.expected_top_3_slugs, + actual_top_3: actualTop3, + hit: topKHit(actualTop3, row.expected_top_3_slugs, 3), + }); + } + const hits = rows.filter((r) => r.hit).length; + const hit_rate = rows.length === 0 ? 
0 : hits / rows.length; + return { + total: rows.length, + hits, + hit_rate, + threshold: HIT_RATE_THRESHOLD, + passed: hit_rate >= HIT_RATE_THRESHOLD, + rows, + }; +} + +interface ReplayRow { + query: string; + retrieved_slugs: string[]; +} + +/** + * Stream captured query-shaped rows from eval_candidates. Limits to the + * last 200 rows for tractable runtime; the regression layer is a + * sanity check, not exhaustive scoring. + */ +async function loadReplayRows(engine: BrainEngine): Promise<ReplayRow[]> { + try { + const rows = await engine.executeRaw<{ + query: string; + retrieved_slugs: string[] | string; + }>( + `SELECT query, retrieved_slugs + FROM eval_candidates + WHERE tool_name = 'query' + AND query IS NOT NULL + AND query <> '' + ORDER BY id DESC + LIMIT 200`, + ); + return rows.map((r) => ({ + query: String(r.query), + retrieved_slugs: Array.isArray(r.retrieved_slugs) + ? r.retrieved_slugs + : typeof r.retrieved_slugs === 'string' + ? safeJsonArray(r.retrieved_slugs) + : [], + })); + } catch (e) { + // Table may not exist on installs where CONTRIBUTOR_MODE was never on. + // Treat as "no replay data" for sparseness fallback. + return []; + } +} + +function safeJsonArray(s: string): string[] { + try { + const v = JSON.parse(s); + return Array.isArray(v) ? 
v.filter((x): x is string => typeof x === 'string') : []; + } catch { + return []; + } +} + +async function runRegressionGate( + engine: BrainEngine, + whoknows: WhoknowsFn, + limit: number, +): Promise<RegressionReport> { + const captured = await loadReplayRows(engine); + if (captured.length < MIN_REPLAY_ROWS) { + return { + status: 'skipped', + reason: `only ${captured.length} replay-eligible eval_candidates rows (< ${MIN_REPLAY_ROWS} threshold); GBRAIN_CONTRIBUTOR_MODE may have been off`, + total: captured.length, + mean_jaccard: 0, + threshold: REGRESSION_THRESHOLD, + rows: [], + }; + } + const rows: RegressionRowResult[] = []; + for (const r of captured) { + const current = await whoknows(r.query, limit); + const currentSlugs = current.slice(0, 3).map((x) => x.slug); + rows.push({ + query: r.query, + captured: r.retrieved_slugs.slice(0, 3), + current: currentSlugs, + jaccard: jaccardAtK(currentSlugs, r.retrieved_slugs, 3), + }); + } + const mean_jaccard = rows.reduce((s, x) => s + x.jaccard, 0) / Math.max(1, rows.length); + return { + status: mean_jaccard >= REGRESSION_THRESHOLD ? 'passed' : 'failed', + total: rows.length, + mean_jaccard, + threshold: REGRESSION_THRESHOLD, + rows, + }; +} + +export async function runEvalWhoknows( + engine: BrainEngine | null, + args: string[], +): Promise<0 | 1 | 2> { + const opts = parseArgs(args); + if (opts.help) { + console.log(HELP); + return 0; + } + if (!opts.fixturePath) { + console.error('gbrain eval whoknows: fixture path required'); + console.error(HELP); + return 2; + } + + let fixture: FixtureRow[]; + try { + fixture = readFixture(opts.fixturePath); + } catch (e: unknown) { + console.error(`gbrain eval whoknows: ${(e as Error).message}`); + return 2; + } + if (fixture.length === 0) { + console.error('gbrain eval whoknows: fixture file is empty'); + return 2; + } + + // v0.33.1.3: pick the whoknows impl. 
Thin-client mode routes per-query + // through the remote `find_experts` MCP op via the v0.31.1 routing seam + // (callRemoteTool). Local mode calls findExperts() directly. Either way, + // the gate logic below is impl-agnostic. + const cfg = loadConfig(); + const thinClient = isThinClient(cfg); + if (!thinClient && !engine) { + console.error('gbrain eval whoknows: local engine required (not thin-client and no engine connected)'); + return 2; + } + const whoknows: WhoknowsFn = thinClient + ? async (topic, limit) => { + const raw = await callRemoteTool( + cfg!, + 'find_experts', + { topic, limit }, + { timeoutMs: 30_000 }, + ); + return unpackToolResult(raw); + } + : async (topic, limit) => findExperts(engine!, { topic, limit }); + + const quality = await runQualityGate(whoknows, fixture, opts.limit); + // Regression gate auto-skips on thin-client: eval_candidates lives in + // the remote brain's Postgres and there's no MCP op to stream rows. + // Quality gate alone gates ship in thin-client mode. + let regression: RegressionReport; + if (opts.skipReplay) { + regression = { + status: 'skipped', + reason: '--skip-replay flag', + total: 0, + mean_jaccard: 0, + threshold: REGRESSION_THRESHOLD, + rows: [], + }; + } else if (thinClient || !engine) { + regression = { + status: 'skipped', + reason: 'thin-client mode: no local DB access to eval_candidates table', + total: 0, + mean_jaccard: 0, + threshold: REGRESSION_THRESHOLD, + rows: [], + }; + } else { + regression = await runRegressionGate(engine, whoknows, opts.limit); + } + + const regressionPassed = regression.status !== 'failed'; + const overall = quality.passed && regressionPassed; + + const report: EvalWhoknowsReport = { + schema_version: 1, + fixture_path: opts.fixturePath, + quality, + regression, + overall_passed: overall, + exit_code: overall ? 0 : 1, + }; + + if (opts.json) { + console.log(JSON.stringify(report, null, 2)); + } else { + renderHumanReport(report); + } + return overall ? 
0 : 1; +} + +function renderHumanReport(r: EvalWhoknowsReport): void { + console.log(`whoknows eval @ ${r.fixture_path}`); + console.log('─'.repeat(60)); + console.log(''); + console.log('LAYER 1 — quality gate (hand-labeled fixture)'); + console.log(` total: ${r.quality.total}`); + console.log(` hits: ${r.quality.hits}`); + console.log(` rate: ${(r.quality.hit_rate * 100).toFixed(1)}% (threshold ${(r.quality.threshold * 100).toFixed(0)}%)`); + console.log(` ${r.quality.passed ? 'PASS' : 'FAIL'}`); + if (!r.quality.passed) { + console.log(''); + console.log(' Misses:'); + for (const row of r.quality.rows) { + if (row.hit) continue; + console.log(` "${row.query}"`); + console.log(` expected: ${row.expected.join(', ')}`); + console.log(` got: ${row.actual_top_3.join(', ') || '(no results)'}`); + } + } + console.log(''); + console.log('LAYER 2 — regression gate (eval_candidates replay)'); + if (r.regression.status === 'skipped') { + console.log(` SKIPPED — ${r.regression.reason}`); + } else { + console.log(` total: ${r.regression.total}`); + console.log(` Jaccard mean: ${r.regression.mean_jaccard.toFixed(3)} (threshold ${r.regression.threshold.toFixed(2)})`); + console.log(` ${r.regression.status === 'passed' ? 'PASS' : 'FAIL'}`); + } + console.log(''); + console.log(`VERDICT: ${r.overall_passed ? 'PASS' : 'FAIL'}`); +} diff --git a/src/commands/eval.ts b/src/commands/eval.ts index df8a69f16..e0c338a5b 100644 --- a/src/commands/eval.ts +++ b/src/commands/eval.ts @@ -45,6 +45,13 @@ export async function runEvalCommand(engine: BrainEngine, args: string[]): Promi const { runEvalCrossModal } = await import('./eval-cross-modal.ts'); process.exit(await runEvalCrossModal(args.slice(1))); } + if (sub === 'whoknows') { + // v0.33 two-layer eval gate (ENG-D2): hand-labeled fixture = + // quality, eval_candidates replay = regression. Pass criteria + // baked in (>=80% top-3 hit rate; >=0.4 Jaccard with sparseness fallback). 
+ const { runEvalWhoknows } = await import('./eval-whoknows.ts'); + process.exit(await runEvalWhoknows(engine, args.slice(1))); + } if (sub === 'suspected-contradictions') { // v0.32.6 — contradiction probe. Engine connected (calls hybridSearch + // the eval_contradictions_cache + _runs tables). Matches the `replay` diff --git a/src/commands/whoknows.ts b/src/commands/whoknows.ts new file mode 100644 index 000000000..a42256814 --- /dev/null +++ b/src/commands/whoknows.ts @@ -0,0 +1,347 @@ +/** + * gbrain whoknows — "Who should I talk to about X?" + * + * v0.33 wedge: expertise + relationship-proximity routing query. + * Returns ranked person/company candidates from the brain that + * know about the given topic. + * + * Ranking spec (locked by ENG-D1): + * + * score(page) = expertise × max(0.1, recency_decay) × (0.5 + 0.5 × salience) + * + * where: + * expertise = log(1 + chunk_match_count) + * // sub-linear; prevents one-big-page-dominates. + * // v0.33 implementation uses hybrid search's raw + * // score as a proxy for chunk_match_count (search + * // score is already a non-linear relevance signal + * // post-RRF + source-boost). The eval gate will + * // tell us if we need the literal count. + * recency_decay = exp(-days_since_effective_date / 180) + * // ~6 month half-life; floored at 0.1 so cold-start + * // people stay visible (multiplicative-zero defense). + * salience = pages.salience_score (already 0..1) + * // linear; centered at 0.5 so missing-salience = neutral. + * + * The query path is hybrid search (keyword + vector + RRF + source-boost) + * filtered at SQL level to person/company pages via the new SearchOpts.types + * parameter (no post-filter waste). Salience and recency boosts in + * hybridSearch are disabled (we apply our own formula on top of the + * raw relevance score). 
+ * + * Usage: + * gbrain whoknows "lab automation" + * gbrain whoknows "fintech compliance" --explain + * gbrain whoknows "ai agents" --limit 10 --json + */ + +import type { BrainEngine } from '../core/engine.ts'; +import type { PageType, SearchResult } from '../core/types.ts'; +import { hybridSearch } from '../core/search/hybrid.ts'; +import { loadConfig, isThinClient } from '../core/config.ts'; +import { callRemoteTool, unpackToolResult } from '../core/mcp-client.ts'; + +export interface WhoknowsOpts { + topic: string; + limit?: number; + explain?: boolean; + /** + * Override the default person/company filter. Most callers should leave + * this undefined and accept the default; surface is here so future ops + * (find_experts_in_companies, find_advisors, etc.) can reuse the + * ranking function without redefining the type filter. + */ + types?: PageType[]; +} + +export interface WhoknowsResult { + slug: string; + source_id: string; + title: string; + type: PageType; + score: number; + factors: { + expertise: number; + recency_decay: number; + recency_factor: number; + salience: number; + salience_factor: number; + days_since_effective: number | null; + raw_match: number; + }; +} + +const DEFAULT_TYPES: PageType[] = ['person', 'company']; +const DEFAULT_LIMIT = 5; +const RECENCY_HALF_LIFE_DAYS = 180; // 6 months +const RECENCY_FLOOR = 0.1; +const SALIENCE_CENTER = 0.5; // missing salience = neutral + +/** + * Pure ranking function. Exported for tests; the CLI/MCP path calls + * findExperts() which adds the search step. + * + * Inputs are pre-fetched candidates with their raw_match + recency + + * salience signals; output is the same set with computed final scores + * and full factor breakdown for --explain. 
+ */ +export function rankCandidates( + candidates: Array<{ + slug: string; + source_id: string; + title: string; + type: PageType; + raw_match: number; + days_since_effective: number | null; + salience_raw: number | null; + }>, + limit: number = DEFAULT_LIMIT, +): WhoknowsResult[] { + const ranked = candidates.map((c) => { + // expertise: sub-linear via log(1 + raw_match). raw_match comes from + // hybridSearch's score, which is already RRF + source-boost-adjusted. + // Clamp to 0 to defend against negative-score producers; log(1+0)=0. + const safeRaw = Math.max(0, Number.isFinite(c.raw_match) ? c.raw_match : 0); + const expertise = Math.log1p(safeRaw); + + // recency_decay: exp(-days/180). Floor at 0.1 so cold-start (no + // effective_date) people don't multiplicative-zero out. + let recency_decay: number; + if (c.days_since_effective == null || !Number.isFinite(c.days_since_effective)) { + recency_decay = RECENCY_FLOOR; + } else { + const days = Math.max(0, c.days_since_effective); + recency_decay = Math.exp(-days / RECENCY_HALF_LIFE_DAYS); + } + const recency_factor = Math.max(RECENCY_FLOOR, recency_decay); + + // salience: linear, centered at 0.5. NaN / out-of-range → 0.5 neutral. + let salience = c.salience_raw == null ? SALIENCE_CENTER : c.salience_raw; + if (!Number.isFinite(salience)) salience = SALIENCE_CENTER; + salience = Math.min(1, Math.max(0, salience)); + const salience_factor = 0.5 + 0.5 * salience; + + const score = expertise * recency_factor * salience_factor; + + return { + slug: c.slug, + source_id: c.source_id, + title: c.title, + type: c.type, + score: Number.isFinite(score) ? score : 0, + factors: { + expertise, + recency_decay, + recency_factor, + salience, + salience_factor, + days_since_effective: c.days_since_effective, + raw_match: c.raw_match, + }, + }; + }); + + // Sort by score DESC; tie-break by slug alphabetical for determinism. 
+ ranked.sort((a, b) => { + if (b.score !== a.score) return b.score - a.score; + return a.slug.localeCompare(b.slug); + }); + + return ranked.slice(0, Math.max(1, limit)); +} + +/** + * Public entrypoint. Searches, fetches per-candidate signals, + * applies the locked ranking spec, returns top-K. + */ +export async function findExperts( + engine: BrainEngine, + opts: WhoknowsOpts, +): Promise<WhoknowsResult[]> { + const types = opts.types ?? DEFAULT_TYPES; + const limit = opts.limit ?? DEFAULT_LIMIT; + const innerLimit = Math.max(limit * 10, 50); + + // 1. Hybrid search with SQL-level types filter (v0.33 typeFilter parameter). + // Disable salience + recency boosts in hybridSearch — we apply our own + // locked formula on top of the raw relevance score. + const results: SearchResult[] = await hybridSearch(engine, opts.topic, { + types, + limit: innerLimit, + salience: 'off', + recency: 'off', + }); + + if (results.length === 0) return []; + + // 2. Dedup to one row per (slug, source_id) — hybridSearch already does + // chunk-grain dedup, but defend against duplicates from cross-source + // fan-out by taking max raw_match per composite key. + const byKey = new Map<string, SearchResult>(); + for (const r of results) { + const key = `${r.source_id ?? 'default'}::${r.slug}`; + const prev = byKey.get(key); + if (!prev || r.score > prev.score) byKey.set(key, r); + } + const candidates = Array.from(byKey.values()); + + // 3. Batch-fetch salience + effective_date per (slug, source_id) ref. + const refs = candidates.map((c) => ({ + slug: c.slug, + source_id: c.source_id ?? 'default', + })); + const [salienceMap, dateMap] = await Promise.all([ + engine.getSalienceScores(refs).catch(() => new Map()), + engine.getEffectiveDates(refs).catch(() => new Map()), + ]); + + // 4. Build the ranking-function input shape. + const now = Date.now(); + const inputs = candidates.map((c) => { + const sourceId = c.source_id ?? 
'default'; + const key = `${sourceId}::${c.slug}`; + const salienceRaw = salienceMap.get(key); + // Salience scores from getSalienceScores are emotional_weight × 5 + + // ln(1+take_count); they're unbounded, not 0..1. Normalize by clamping + // to [0, 1] via a tanh-ish squash: ratio = score / (1 + score). + const salienceNormalized = + salienceRaw == null || !Number.isFinite(salienceRaw) || salienceRaw < 0 + ? null + : salienceRaw / (1 + salienceRaw); + const dateObj = dateMap.get(key); + let daysSinceEffective: number | null = null; + if (dateObj instanceof Date && Number.isFinite(dateObj.getTime())) { + daysSinceEffective = (now - dateObj.getTime()) / 86_400_000; + if (daysSinceEffective < 0) daysSinceEffective = 0; + } + return { + slug: c.slug, + source_id: sourceId, + title: c.title, + type: c.type, + raw_match: c.score, + days_since_effective: daysSinceEffective, + salience_raw: salienceNormalized, + }; + }); + + // 5. Rank. + return rankCandidates(inputs, limit); +} + +// ---------------- CLI dispatch ---------------- + +interface CliOpts { + topic: string; + limit?: number; + explain?: boolean; + json?: boolean; +} + +function parseArgs(args: string[]): CliOpts | { help: true } | { error: string } { + const opts: Partial = {}; + const positional: string[] = []; + for (let i = 0; i < args.length; i++) { + const a = args[i]; + if (a === '--help' || a === '-h') return { help: true }; + if (a === '--json') { opts.json = true; continue; } + if (a === '--explain') { opts.explain = true; continue; } + if (a === '--limit') { + const n = parseInt(args[++i] ?? 
'', 10);
+      if (Number.isFinite(n) && n > 0) opts.limit = n;
+      continue;
+    }
+    if (a?.startsWith('--')) continue; // ignore unknown flags
+    if (typeof a === 'string') positional.push(a);
+  }
+  if (positional.length === 0) return { error: 'topic argument required' };
+  opts.topic = positional.join(' ');
+  return opts as CliOpts;
+}
+
+const HELP = `Usage: gbrain whoknows <topic> [options]
+
+Ask your brain who knows about a topic. Returns ranked person/company
+pages by expertise depth, relationship recency, and salience.
+
+Options:
+  --limit N    Max results (default 5)
+  --explain    Show the ranking factor breakdown per result
+  --json       JSON output for agents
+  --help, -h   Show this help
+
+Examples:
+  gbrain whoknows "lab automation"
+  gbrain whoknows fintech compliance --explain
+  gbrain whoknows "ai agents" --limit 10 --json
+`;
+
+export async function runWhoknows(
+  engine: BrainEngine,
+  args: string[],
+): Promise<void> {
+  const parsed = parseArgs(args);
+  if ('help' in parsed) {
+    console.log(HELP);
+    return;
+  }
+  if ('error' in parsed) {
+    console.error(`gbrain whoknows: ${parsed.error}`);
+    console.error(HELP);
+    process.exit(2);
+    return;
+  }
+
+  // Thin-client routing (v0.31.1): route through the remote `find_experts`
+  // MCP op when this install has no local brain.
+  let results: WhoknowsResult[];
+  const cfg = loadConfig();
+  if (isThinClient(cfg)) {
+    const raw = await callRemoteTool(cfg!, 'find_experts', {
+      topic: parsed.topic,
+      limit: parsed.limit,
+      explain: parsed.explain,
+    }, { timeoutMs: 30_000 });
+    results = unpackToolResult(raw);
+  } else {
+    results = await findExperts(engine, {
+      topic: parsed.topic,
+      limit: parsed.limit,
+      explain: parsed.explain,
+    });
+  }
+
+  if (parsed.json) {
+    console.log(JSON.stringify(results, null, 2));
+    return;
+  }
+
+  if (results.length === 0) {
+    console.log(`(no person or company pages match "${parsed.topic}")`);
+    return;
+  }
+
+  // Human format: rank | score | type | slug — title
+  const header = `${pad('#', 3)} ${pad('score', 7)} ${pad('type', 8)} slug — title`;
+  console.log(header);
+  console.log('-'.repeat(Math.min(80, header.length)));
+  results.forEach((r, i) => {
+    const score = r.score.toFixed(3);
+    console.log(
+      `${pad(String(i + 1), 3)} ${pad(score, 7)} ${pad(r.type, 8)} ${r.slug} — ${r.title}`,
+    );
+    if (parsed.explain) {
+      const f = r.factors;
+      const days = f.days_since_effective == null ? 'cold' : f.days_since_effective.toFixed(0);
+      console.log(
+        `  expertise=${f.expertise.toFixed(3)} (raw=${f.raw_match.toFixed(3)}) ` +
+          `recency=${f.recency_factor.toFixed(3)} (${days}d) ` +
+          `salience=${f.salience.toFixed(3)} → factor=${f.salience_factor.toFixed(3)}`,
+      );
+    }
+  });
+}
+
+function pad(s: string, n: number): string {
+  return s.length >= n ? s : s + ' '.repeat(n - s.length);
+}
diff --git a/src/core/operations-descriptions.ts b/src/core/operations-descriptions.ts
index 2ecaca836..51227a795 100644
--- a/src/core/operations-descriptions.ts
+++ b/src/core/operations-descriptions.ts
@@ -34,6 +34,15 @@ export const FIND_ANOMALIES_DESCRIPTION =
   "patterns the user wouldn't have searched for. Cohort kinds: tag, type. " +
   "Year cohort is deferred to a later release.";
 
+export const FIND_EXPERTS_DESCRIPTION =
+  "Answers 'who in my brain knows about <topic>'. 
Returns ranked person/company " + + "pages by expertise depth (sub-linear match score), relationship recency " + + "(exp decay with 6-month half-life), and salience. Use this for questions " + + "like 'who should I talk to about X', 'who knows about Y', 'find me someone " + + "who's worked on Z', or any expertise-routing intent. Filters at SQL to " + + "person + company pages — does NOT return notes or articles. Pair with " + + "--explain (CLI) to surface the per-result factor breakdown."; + export const GET_RECENT_TRANSCRIPTS_DESCRIPTION = "Returns one-line summaries of recent raw conversation transcripts (NOT polished " + "reflections). Use this FIRST for questions about 'what's going on with me', " + diff --git a/src/core/operations.ts b/src/core/operations.ts index b82c1c3bb..41dd72064 100644 --- a/src/core/operations.ts +++ b/src/core/operations.ts @@ -25,6 +25,7 @@ import { VERSION } from '../version.ts'; import { GET_RECENT_SALIENCE_DESCRIPTION, FIND_ANOMALIES_DESCRIPTION, + FIND_EXPERTS_DESCRIPTION, GET_RECENT_TRANSCRIPTS_DESCRIPTION, LIST_PAGES_DESCRIPTION, QUERY_DESCRIPTION, @@ -2212,6 +2213,41 @@ const find_anomalies: Operation = { cliHints: { name: 'anomalies' }, }; +// v0.33: expertise + relationship-proximity routing. CLI: gbrain whoknows. +const find_experts: Operation = { + name: 'find_experts', + description: FIND_EXPERTS_DESCRIPTION, + scope: 'read', + params: { + topic: { + type: 'string', + description: 'The topic to route. Free-form natural language.', + }, + limit: { + type: 'number', + description: 'Max results (default 5).', + }, + explain: { + type: 'boolean', + description: 'Include factor breakdown per result (expertise, recency, salience).', + }, + }, + handler: async (_ctx, p) => { + const { findExperts } = await import('../commands/whoknows.ts'); + const topic = typeof p.topic === 'string' ? 
p.topic : ''; + if (!topic.trim()) { + throw new OperationError('invalid_params', '`topic` is required and must be a non-empty string.'); + } + return findExperts(_ctx.engine, { + topic, + limit: typeof p.limit === 'number' ? p.limit : undefined, + explain: p.explain === true, + }); + }, + cliHints: { name: 'whoknows', positional: ['topic'] }, +}; + +// v0.32.6: contradiction probe MCP surface (M3) const find_contradictions: Operation = { name: 'find_contradictions', description: FIND_CONTRADICTIONS_DESCRIPTION, @@ -2794,6 +2830,8 @@ export const operations: Operation[] = [ extract_facts, recall, forget_fact, // v0.32.6: contradiction probe MCP surface (M3) find_contradictions, + // v0.33: expertise + relationship-proximity routing + find_experts, ]; export const operationsByName = Object.fromEntries( diff --git a/src/core/pglite-engine.ts b/src/core/pglite-engine.ts index 257489608..029509414 100644 --- a/src/core/pglite-engine.ts +++ b/src/core/pglite-engine.ts @@ -771,6 +771,11 @@ export class PGLiteEngine implements BrainEngine { params.push(opts.symbolKind); extraFilter += ` AND cc.symbol_type = $${params.length}`; } + // v0.33: multi-type filter for whoknows. + if (opts?.types && opts.types.length > 0) { + params.push(opts.types); + extraFilter += ` AND p.type = ANY($${params.length}::text[])`; + } // v0.29.1 — since/until date filter (Postgres parity, codex pass-1 #10). // Reads against COALESCE(effective_date, updated_at) so date filtering // matches user intent (a meeting was on its event_date, not when it @@ -1060,6 +1065,13 @@ export class PGLiteEngine implements BrainEngine { params.push(opts.symbolKind); extraFilter += ` AND cc.symbol_type = $${params.length}`; } + // v0.33: multi-type filter for whoknows. Applied inside HNSW candidate + // CTE so the candidate pool consists only of typed pages — limit budget + // goes to person/company pages instead of being eaten by other types. 
+ if (opts?.types && opts.types.length > 0) { + params.push(opts.types); + extraFilter += ` AND p.type = ANY($${params.length}::text[])`; + } // v0.29.1 since/until parity (codex pass-1 #10). Filter applied INSIDE // the inner CTE so HNSW's candidate pool already excludes out-of-range // pages — preserves pagination contract. diff --git a/src/core/postgres-engine.ts b/src/core/postgres-engine.ts index 60fc19af4..797bde9a9 100644 --- a/src/core/postgres-engine.ts +++ b/src/core/postgres-engine.ts @@ -795,6 +795,13 @@ export class PostgresEngine implements BrainEngine { params.push(type); typeClause = `AND p.type = $${params.length}`; } + // v0.33: multi-type filter for whoknows. AND-applied alongside the + // single-value `type` filter (callers can use either or both). + let typesClause = ''; + if (opts?.types && opts.types.length > 0) { + params.push(opts.types); + typesClause = `AND p.type = ANY($${params.length}::text[])`; + } let excludeSlugsClause = ''; if (excludeSlugs?.length) { params.push(excludeSlugs); @@ -845,6 +852,7 @@ export class PostgresEngine implements BrainEngine { JOIN sources s ON s.id = p.source_id WHERE cc.search_vector @@ websearch_to_tsquery('english', $1) ${typeClause} + ${typesClause} ${excludeSlugsClause} ${detailLow ? `AND cc.chunk_source = 'compiled_truth'` : ''} ${languageClause} @@ -921,6 +929,13 @@ export class PostgresEngine implements BrainEngine { params.push(type); typeClause = `AND p.type = $${params.length}`; } + // v0.33: multi-type filter for whoknows. AND-applied alongside the + // single-value `type` filter (callers can use either or both). 
+ let typesClause = ''; + if (opts?.types && opts.types.length > 0) { + params.push(opts.types); + typesClause = `AND p.type = ANY($${params.length}::text[])`; + } let excludeSlugsClause = ''; if (excludeSlugs?.length) { params.push(excludeSlugs); @@ -966,6 +981,7 @@ export class PostgresEngine implements BrainEngine { JOIN sources s ON s.id = p.source_id WHERE cc.search_vector @@ websearch_to_tsquery('english', $1) ${typeClause} + ${typesClause} ${excludeSlugsClause} ${detailLow ? `AND cc.chunk_source = 'compiled_truth'` : ''} ${languageClause} @@ -1022,6 +1038,13 @@ export class PostgresEngine implements BrainEngine { params.push(type); typeClause = `AND p.type = $${params.length}`; } + // v0.33: multi-type filter for whoknows. AND-applied alongside the + // single-value `type` filter (callers can use either or both). + let typesClause = ''; + if (opts?.types && opts.types.length > 0) { + params.push(opts.types); + typesClause = `AND p.type = ANY($${params.length}::text[])`; + } let excludeSlugsClause = ''; if (excludeSlugs?.length) { params.push(excludeSlugs); @@ -1077,6 +1100,7 @@ export class PostgresEngine implements BrainEngine { WHERE cc.${col} IS NOT NULL ${modalityFilter} ${detailLow ? `AND cc.chunk_source = 'compiled_truth'` : ''} ${typeClause} + ${typesClause} ${excludeSlugsClause} ${languageClause} ${symbolKindClause} diff --git a/src/core/search/hybrid.ts b/src/core/search/hybrid.ts index 96e29ad7e..a2c0723db 100644 --- a/src/core/search/hybrid.ts +++ b/src/core/search/hybrid.ts @@ -227,6 +227,10 @@ export async function hybridSearch( // per-engine searchKeyword / searchVector apply the filters at SQL level. language: opts?.language, symbolKind: opts?.symbolKind, + // v0.33: multi-type filter for whoknows ('person','company'). Pushes + // type filter to SQL level so the limit budget goes to candidate-typed + // pages instead of being eaten by note/transcript/article pages. 
+ types: opts?.types, // v0.29.1: since/until take precedence over deprecated afterDate/beforeDate. // The engine still consumes the legacy field names; this aliasing keeps // PR #618 callers compiling while the new names are the public surface. diff --git a/src/core/types.ts b/src/core/types.ts index 190459b51..57d1af93b 100644 --- a/src/core/types.ts +++ b/src/core/types.ts @@ -405,6 +405,15 @@ export interface SearchOpts { limit?: number; offset?: number; type?: PageType; + /** + * v0.33: multi-type filter. When set, search results are filtered to + * pages whose `type` is in this list, pushed to SQL via + * `AND p.type = ANY($N::text[])` in both engines. Stacks with the + * single-value `type` filter (both are AND-applied). Primary consumer + * is `gbrain whoknows` (filters to ['person','company']); future + * entity-only search reuses the parameter. + */ + types?: PageType[]; exclude_slugs?: string[]; /** * Slug-prefix excludes — additive over DEFAULT_HARD_EXCLUDES (test/, archive/, diff --git a/test/e2e/whoknows.test.ts b/test/e2e/whoknows.test.ts new file mode 100644 index 000000000..009e50767 --- /dev/null +++ b/test/e2e/whoknows.test.ts @@ -0,0 +1,235 @@ +/** + * v0.33 whoknows E2E — full pipeline against a seeded PGLite brain. + * + * Seeds a synthetic brain matching test/fixtures/whoknows-eval.jsonl, + * runs gbrain eval whoknows --skip-replay over the fixture, asserts + * the quality gate passes >= 80% top-3 hit rate. Also exercises: + * + * - findExperts() directly with --types filter + * - Person/company filtering excludes other types + * - Empty result returns empty array (not crash) + * - --explain output includes factor breakdown + * + * Mock embeddings via basis vectors (no OpenAI key needed). Uses the + * same pattern as test/e2e/search-quality.test.ts. 
+ */ + +import { describe, test, expect, beforeAll, afterAll } from 'bun:test'; +import { PGLiteEngine } from '../../src/core/pglite-engine.ts'; +import type { ChunkInput } from '../../src/core/types.ts'; +import { findExperts } from '../../src/commands/whoknows.ts'; +import { readFixture } from '../../src/commands/eval-whoknows.ts'; + +let engine: PGLiteEngine; + +function basisEmbedding(idx: number, dim = 1536): Float32Array { + const emb = new Float32Array(dim); + emb[idx % dim] = 1.0; + return emb; +} + +async function seedPerson( + slug: string, + title: string, + topic: string, + embeddingIdx: number, +) { + await engine.putPage(slug, { + type: 'person', + title, + compiled_truth: `${title} is an expert in ${topic}. Built career around ${topic}.`, + timeline: `2024-01-01: ${title} on ${topic} project.`, + }); + const chunks: ChunkInput[] = [ + { + chunk_index: 0, + chunk_text: `${title} is an expert in ${topic}. Built career around ${topic}.`, + chunk_source: 'compiled_truth', + embedding: basisEmbedding(embeddingIdx), + token_count: 15, + }, + { + chunk_index: 1, + chunk_text: `2024-01-01: ${title} on ${topic} project.`, + chunk_source: 'timeline', + embedding: basisEmbedding(embeddingIdx + 100), + token_count: 10, + }, + ]; + await engine.upsertChunks(slug, chunks); +} + +async function seedCompany( + slug: string, + title: string, + topic: string, + embeddingIdx: number, +) { + await engine.putPage(slug, { + type: 'company', + title, + compiled_truth: `${title} is a company focused on ${topic}. Leader in ${topic}.`, + timeline: `2024-01-01: ${title} ${topic} milestone.`, + }); + const chunks: ChunkInput[] = [ + { + chunk_index: 0, + chunk_text: `${title} is a company focused on ${topic}. 
Leader in ${topic}.`, + chunk_source: 'compiled_truth', + embedding: basisEmbedding(embeddingIdx), + token_count: 15, + }, + ]; + await engine.upsertChunks(slug, chunks); +} + +async function seedConcept( + slug: string, + title: string, + topic: string, + embeddingIdx: number, +) { + await engine.putPage(slug, { + type: 'concept', + title, + compiled_truth: `${topic} is an important concept. Many explore ${topic}.`, + timeline: `2024-01-01: notes on ${topic}.`, + }); + const chunks: ChunkInput[] = [ + { + chunk_index: 0, + chunk_text: `${topic} is an important concept. Many explore ${topic}.`, + chunk_source: 'compiled_truth', + embedding: basisEmbedding(embeddingIdx), + token_count: 12, + }, + ]; + await engine.upsertChunks(slug, chunks); +} + +beforeAll(async () => { + engine = new PGLiteEngine(); + await engine.connect({}); + await engine.initSchema(); + + // People matching the synthetic fixture topics. + await seedPerson('wiki/people/example-alice', 'Alice Example', 'fintech payments', 10); + await seedPerson('wiki/people/example-bob', 'Bob Example', 'crypto investing', 12); + await seedPerson('wiki/people/example-carol', 'Carol Example', 'ai agents', 14); + await seedPerson('wiki/people/example-dave', 'Dave Example', 'distributed systems', 16); + await seedPerson('wiki/people/example-eve', 'Eve Example', 'healthcare technology', 18); + await seedPerson('wiki/people/example-frank', 'Frank Example', 'developer tools', 20); + await seedPerson('wiki/people/example-grace', 'Grace Example', 'machine learning research', 22); + await seedPerson('wiki/people/example-hank', 'Hank Example', 'climate tech', 24); + await seedPerson('wiki/people/example-ivy', 'Ivy Example', 'enterprise sales', 26); + await seedPerson('wiki/people/example-jake', 'Jake Example', 'hardware engineering', 28); + + // Companies matching the synthetic fixture topics. 
+ await seedCompany('wiki/companies/example-fintech-co', 'FintechCo', 'fintech payments', 11); + await seedCompany('wiki/companies/example-fund', 'CryptoFund', 'crypto investing', 13); + await seedCompany('wiki/companies/example-health-co', 'HealthCo', 'healthcare technology', 19); + await seedCompany('wiki/companies/example-devtools-co', 'DevtoolsCo', 'developer tools', 21); + await seedCompany('wiki/companies/example-climate-co', 'ClimateCo', 'climate tech', 25); + await seedCompany('wiki/companies/example-hardware-co', 'HardwareCo', 'hardware engineering', 29); + + // Decoy non-person/non-company pages with the same topics (filter should hide). + await seedConcept('concepts/fintech-essay', 'Fintech Essay', 'fintech payments', 30); + await seedConcept('concepts/crypto-thoughts', 'Crypto Thoughts', 'crypto investing', 31); +}, 120_000); + +afterAll(async () => { + if (engine) await engine.disconnect(); +}); + +describe('whoknows E2E — quality gate on synthetic fixture', () => { + test('runs findExperts and the fixture quality gate at >= 80% hit rate', async () => { + // v0.33.1.3: The shipped fixture at test/fixtures/whoknows-eval.jsonl + // is now real-brain data (people/eric-vishria, etc.) — those slugs + // don't exist in this E2E's synthetic seed. We define an inline + // synthetic fixture matching the seed above. Production users replace + // the shipped fixture with their own real queries; this test verifies + // the eval pipeline mechanically, not against shipped data. 
+ const inlineFixture = [ + { query: 'fintech payments', expected: ['wiki/people/example-alice', 'wiki/companies/example-fintech-co'] }, + { query: 'crypto investing', expected: ['wiki/companies/example-fund', 'wiki/people/example-bob'] }, + { query: 'ai agents', expected: ['wiki/people/example-carol'] }, + { query: 'distributed systems', expected: ['wiki/people/example-dave'] }, + { query: 'healthcare technology', expected: ['wiki/companies/example-health-co', 'wiki/people/example-eve'] }, + { query: 'developer tools', expected: ['wiki/people/example-frank', 'wiki/companies/example-devtools-co'] }, + { query: 'machine learning research', expected: ['wiki/people/example-grace'] }, + { query: 'climate tech', expected: ['wiki/companies/example-climate-co', 'wiki/people/example-hank'] }, + { query: 'enterprise sales', expected: ['wiki/people/example-ivy'] }, + { query: 'hardware engineering', expected: ['wiki/people/example-jake', 'wiki/companies/example-hardware-co'] }, + ]; + + let hits = 0; + for (const row of inlineFixture) { + const results = await findExperts(engine, { topic: row.query, limit: 5 }); + const top3 = new Set(results.slice(0, 3).map((r) => r.slug)); + const hit = row.expected.some((s) => top3.has(s)); + if (hit) hits++; + } + const hitRate = hits / inlineFixture.length; + // Synthetic seed designed so every query has a clear best match. + // Assert >= 80% (the locked ENG-D2 threshold). In practice 100% on + // this controlled fixture. + expect(hitRate).toBeGreaterThanOrEqual(0.8); + }, 60_000); + + test('shipped fixture at test/fixtures/whoknows-eval.jsonl loads and parses', () => { + // Sanity check that the shipped (real-brain) fixture exists and parses. + // Doesn't assert hit rate — the seeded brain doesn't have those slugs. 
+ const fixture = readFixture( + `${process.cwd()}/test/fixtures/whoknows-eval.jsonl`, + ); + expect(fixture.length).toBeGreaterThanOrEqual(5); + for (const row of fixture) { + expect(typeof row.query).toBe('string'); + expect(row.expected_top_3_slugs.length).toBeGreaterThanOrEqual(1); + } + }); +}); + +describe('whoknows E2E — typeFilter and shadow paths', () => { + test('type filter excludes concept pages (decoys do not appear in results)', async () => { + const results = await findExperts(engine, { topic: 'fintech payments', limit: 10 }); + expect(results.length).toBeGreaterThan(0); + for (const r of results) { + expect(['person', 'company']).toContain(r.type); + } + // The decoy concept page must NOT appear. + expect(results.find((r) => r.slug === 'concepts/fintech-essay')).toBeUndefined(); + }); + + test('zero matches returns empty array gracefully', async () => { + const results = await findExperts(engine, { + topic: 'this-topic-is-definitely-not-in-the-brain-xyzqwerty', + limit: 5, + }); + expect(Array.isArray(results)).toBe(true); + // searchHybrid may return loosely-matching results based on stemming; + // we just assert it doesn't crash and returns sanely. 
+ expect(results.length).toBeGreaterThanOrEqual(0); + }); + + test('--explain factor breakdown is present on every result', async () => { + const results = await findExperts(engine, { topic: 'crypto investing', limit: 3 }); + expect(results.length).toBeGreaterThan(0); + for (const r of results) { + expect(r.factors).toBeDefined(); + expect(typeof r.factors.expertise).toBe('number'); + expect(typeof r.factors.recency_factor).toBe('number'); + expect(typeof r.factors.salience).toBe('number'); + expect(typeof r.score).toBe('number'); + expect(Number.isFinite(r.score)).toBe(true); + } + }); + + test('top-K honors limit parameter', async () => { + const r5 = await findExperts(engine, { topic: 'developer tools', limit: 5 }); + const r1 = await findExperts(engine, { topic: 'developer tools', limit: 1 }); + expect(r5.length).toBeGreaterThanOrEqual(r1.length); + expect(r1.length).toBeLessThanOrEqual(1); + }); +}); + diff --git a/test/eval-whoknows.test.ts b/test/eval-whoknows.test.ts new file mode 100644 index 000000000..b7eb5cbbe --- /dev/null +++ b/test/eval-whoknows.test.ts @@ -0,0 +1,190 @@ +import { describe, it, expect } from 'bun:test'; +import { writeFileSync, unlinkSync } from 'fs'; +import { tmpdir } from 'os'; +import { join } from 'path'; +import { + jaccardAtK, + topKHit, + readFixture, + HIT_RATE_THRESHOLD, + REGRESSION_THRESHOLD, + MIN_REPLAY_ROWS, + type FixtureRow, +} from '../src/commands/eval-whoknows.ts'; + +/** + * v0.33 eval harness unit tests — pure functions only. + * + * Integration coverage (real engine, fixture grading end-to-end) lives in + * test/e2e/whoknows.test.ts. This file verifies the math and the parser. 
+ */ + +describe('eval-whoknows / jaccardAtK', () => { + it('identical 3-element sets → 1.0', () => { + expect(jaccardAtK(['a', 'b', 'c'], ['a', 'b', 'c'], 3)).toBeCloseTo(1.0, 5); + }); + + it('disjoint sets → 0', () => { + expect(jaccardAtK(['a', 'b', 'c'], ['x', 'y', 'z'], 3)).toBe(0); + }); + + it('partial overlap (2 of 3 match) → 2/4 = 0.5', () => { + expect(jaccardAtK(['a', 'b', 'c'], ['a', 'b', 'z'], 3)).toBeCloseTo(0.5, 5); + }); + + it('respects k cutoff — ignores beyond top-k', () => { + expect(jaccardAtK(['a', 'b', 'x'], ['a', 'b', 'y'], 2)).toBeCloseTo(1.0, 5); + }); + + it('empty both sets → 1.0 (vacuously stable)', () => { + expect(jaccardAtK([], [], 3)).toBe(1); + }); + + it('empty one side, non-empty other → 0', () => { + expect(jaccardAtK([], ['a', 'b', 'c'], 3)).toBe(0); + }); + + it('duplicates in input collapse via Set semantics', () => { + // Set-Jaccard, not multiset — duplicates collapse. + expect(jaccardAtK(['a', 'a', 'a'], ['a'], 3)).toBe(1); + }); +}); + +describe('eval-whoknows / topKHit', () => { + it('expected slug at position 1 → hit', () => { + expect(topKHit(['alice', 'bob', 'carol'], ['alice'], 3)).toBe(true); + }); + + it('expected slug at position 3 → hit (within top-3)', () => { + expect(topKHit(['x', 'y', 'alice'], ['alice'], 3)).toBe(true); + }); + + it('expected slug at position 4 → miss (beyond top-3)', () => { + expect(topKHit(['x', 'y', 'z', 'alice'], ['alice'], 3)).toBe(false); + }); + + it('no expected match anywhere → miss', () => { + expect(topKHit(['x', 'y', 'z'], ['alice'], 3)).toBe(false); + }); + + it('multiple expected slugs — hit if ANY appears in top-3', () => { + expect(topKHit(['x', 'bob', 'z'], ['alice', 'bob', 'carol'], 3)).toBe(true); + }); + + it('empty actual results → miss', () => { + expect(topKHit([], ['alice'], 3)).toBe(false); + }); + + it('empty expected → miss (cannot match anything)', () => { + expect(topKHit(['alice', 'bob'], [], 3)).toBe(false); + }); +}); + +describe('eval-whoknows / 
readFixture', () => { + function tmpFixture(content: string): string { + const path = join(tmpdir(), `whoknows-eval-test-${Date.now()}-${Math.random()}.jsonl`); + writeFileSync(path, content); + return path; + } + + it('parses well-formed JSONL', () => { + const path = tmpFixture( + '{"query":"lab automation","expected_top_3_slugs":["wiki/people/alice","wiki/people/bob"]}\n' + + '{"query":"fintech","expected_top_3_slugs":["wiki/companies/acme"],"notes":"hot topic"}\n', + ); + try { + const rows = readFixture(path); + expect(rows.length).toBe(2); + expect(rows[0].query).toBe('lab automation'); + expect(rows[0].expected_top_3_slugs.length).toBe(2); + expect(rows[1].notes).toBe('hot topic'); + } finally { + unlinkSync(path); + } + }); + + it('skips blank lines and comments (#, //)', () => { + const path = tmpFixture( + '# this is a comment\n' + + '\n' + + '// another comment\n' + + '{"query":"x","expected_top_3_slugs":["y"]}\n', + ); + try { + const rows = readFixture(path); + expect(rows.length).toBe(1); + } finally { + unlinkSync(path); + } + }); + + it('throws on missing file', () => { + expect(() => readFixture('/nonexistent/path/abc.jsonl')).toThrow(/fixture not found/); + }); + + it('throws on malformed JSON line', () => { + const path = tmpFixture('{not json\n'); + try { + expect(() => readFixture(path)).toThrow(/malformed JSONL line/); + } finally { + unlinkSync(path); + } + }); + + it('throws on row missing required fields', () => { + const path = tmpFixture('{"query":"x"}\n'); // missing expected_top_3_slugs + try { + expect(() => readFixture(path)).toThrow(/missing required fields/); + } finally { + unlinkSync(path); + } + }); + + it('filters non-string entries in expected_top_3_slugs', () => { + const path = tmpFixture( + '{"query":"x","expected_top_3_slugs":["alice", null, 42, "bob"]}\n', + ); + try { + const rows = readFixture(path); + expect(rows[0].expected_top_3_slugs).toEqual(['alice', 'bob']); + } finally { + unlinkSync(path); + } + }); +}); + 
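For reference, here is a minimal sketch of the two metric functions these tests pin down (set semantics, k-cutoff before comparison, empty-vs-empty treated as vacuously stable). This is an illustration consistent with the test expectations, not the shipped code — the actual implementations live in `src/commands/eval-whoknows.ts` and may differ in detail:

```typescript
// Set-Jaccard over the top-k of each list. Duplicates collapse via Set
// semantics; two empty top-k sets score a perfect 1.0 so an empty row
// cannot tank the mean.
function jaccardAtK(a: string[], b: string[], k: number): number {
  const sa = new Set(a.slice(0, k));
  const sb = new Set(b.slice(0, k));
  if (sa.size === 0 && sb.size === 0) return 1;
  let inter = 0;
  for (const x of sa) if (sb.has(x)) inter++;
  return inter / (sa.size + sb.size - inter); // |A∩B| / |A∪B|
}

// Hit when ANY expected slug appears within the top-k actual results.
function topKHit(actual: string[], expected: string[], k: number): boolean {
  const top = new Set(actual.slice(0, k));
  return expected.some((s) => top.has(s));
}
```

The mean of `jaccardAtK(…, 3)` across replay rows is what the ≥ 0.4 regression gate compares; the fraction of fixture rows where `topKHit(…, 3)` is true feeds the ≥ 80% primary gate.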
+describe('eval-whoknows / thresholds', () => { + it('HIT_RATE_THRESHOLD locked at 0.8 per ENG-D2', () => { + expect(HIT_RATE_THRESHOLD).toBe(0.8); + }); + + it('REGRESSION_THRESHOLD locked at 0.4 per ENG-D2', () => { + expect(REGRESSION_THRESHOLD).toBe(0.4); + }); + + it('MIN_REPLAY_ROWS sparseness fallback at 20', () => { + expect(MIN_REPLAY_ROWS).toBe(20); + }); +}); + +// v0.33.1.3: WhoknowsFn is the per-query callable that the gates consume. +// runEvalWhoknows picks the impl (local findExperts vs thin-client MCP-routed). +// These tests pin the type-level contract and the export presence; full +// thin-client routing E2E is in the engine-required integration suite. +describe('eval-whoknows / WhoknowsFn contract', () => { + it('module exports WhoknowsFn type alias', async () => { + // The type is structurally `(topic: string, limit: number) => Promise`. + // Confirm import resolves without throwing. + const mod = await import('../src/commands/eval-whoknows.ts'); + expect(typeof mod.runEvalWhoknows).toBe('function'); + }); + + it('runEvalWhoknows accepts null engine (thin-client signature)', async () => { + // Signature gate: the function must be callable with engine=null. We use + // a missing-fixture path to short-circuit before any engine/MCP use, so + // this test pins ONLY the signature acceptance, not the routing logic. + const { runEvalWhoknows } = await import('../src/commands/eval-whoknows.ts'); + const exitCode = await runEvalWhoknows(null, []); // no fixture path → 2 + expect(exitCode).toBe(2); + }); +}); diff --git a/test/find-experts-op.test.ts b/test/find-experts-op.test.ts new file mode 100644 index 000000000..97272760f --- /dev/null +++ b/test/find-experts-op.test.ts @@ -0,0 +1,137 @@ +/** + * v0.33 — find_experts MCP op coverage. 
+ * + * Verifies the op declaration: registered in operations array, exposed + * with the locked surface (scope: read, localOnly: false), accepts the + * documented params, validates non-empty topic, and the handler invokes + * the same findExperts() pure function the CLI calls (handler-to-core + * wiring parity). + * + * Engine-touching path is covered end-to-end against PGLite in + * test/e2e/whoknows.test.ts; this file is fast-loop coverage for the + * MCP-surface contract. + */ + +import { describe, test, expect, beforeAll, afterAll } from 'bun:test'; +import { operations, operationsByName } from '../src/core/operations.ts'; +import { FIND_EXPERTS_DESCRIPTION } from '../src/core/operations-descriptions.ts'; +import { PGLiteEngine } from '../src/core/pglite-engine.ts'; +import type { OperationContext } from '../src/core/operations.ts'; +import type { ChunkInput } from '../src/core/types.ts'; + +let engine: PGLiteEngine; + +function basisEmbedding(idx: number, dim = 1536): Float32Array { + const emb = new Float32Array(dim); + emb[idx % dim] = 1.0; + return emb; +} + +beforeAll(async () => { + engine = new PGLiteEngine(); + await engine.connect({}); + await engine.initSchema(); + + await engine.putPage('wiki/people/expert', { + type: 'person', + title: 'Expert', + compiled_truth: 'Expert is the authority on widgets.', + }); + await engine.upsertChunks('wiki/people/expert', [ + { + chunk_index: 0, + chunk_text: 'Expert is the authority on widgets.', + chunk_source: 'compiled_truth', + embedding: basisEmbedding(7), + token_count: 10, + } as ChunkInput, + ]); +}, 60_000); + +afterAll(async () => { + await engine.disconnect(); +}); + +describe('find_experts — op declaration', () => { + test('registered in the operations array', () => { + const op = operations.find((o) => o.name === 'find_experts'); + expect(op).toBeDefined(); + }); + + test('findable via operationsByName', () => { + expect(operationsByName['find_experts']).toBeDefined(); + 
expect(operationsByName['find_experts'].name).toBe('find_experts'); + }); + + test('scope is read; localOnly is false (HTTP-MCP accessible)', () => { + const op = operationsByName['find_experts']; + expect(op.scope).toBe('read'); + // localOnly defaults to undefined/false; explicit truthy would block HTTP MCP. + expect(op.localOnly).not.toBe(true); + }); + + test('declares the documented params (topic / limit / explain)', () => { + const op = operationsByName['find_experts']; + expect(op.params).toBeDefined(); + expect(op.params.topic).toBeDefined(); + expect(op.params.topic.type).toBe('string'); + expect(op.params.limit).toBeDefined(); + expect(op.params.limit.type).toBe('number'); + expect(op.params.explain).toBeDefined(); + expect(op.params.explain.type).toBe('boolean'); + }); + + test('cliHints.name is "whoknows"', () => { + const op = operationsByName['find_experts']; + expect(op.cliHints?.name).toBe('whoknows'); + }); + + test('description text is non-trivial and references the use case', () => { + expect(FIND_EXPERTS_DESCRIPTION.length).toBeGreaterThan(60); + expect(FIND_EXPERTS_DESCRIPTION).toMatch(/expert|knows|topic|routing/i); + }); +}); + +describe('find_experts — handler behavior', () => { + function makeCtx(): OperationContext { + // Minimal local-only context; the handler doesn't consult auth or + // remote on a read-scoped read-only call (handler validates topic + // then dispatches to findExperts). Cast through unknown to keep the + // shape narrow without re-declaring the full OperationContext type. 
+ return { + engine, + remote: false, + config: {}, + logger: console, + dryRun: false, + } as unknown as OperationContext; + } + + test('rejects empty topic with invalid_params', async () => { + const op = operationsByName['find_experts']; + await expect(op.handler(makeCtx(), { topic: '' })).rejects.toThrow(/topic/); + }); + + test('rejects whitespace-only topic with invalid_params', async () => { + const op = operationsByName['find_experts']; + await expect(op.handler(makeCtx(), { topic: ' ' })).rejects.toThrow(/topic/); + }); + + test('rejects missing topic (undefined) with invalid_params', async () => { + const op = operationsByName['find_experts']; + await expect(op.handler(makeCtx(), {})).rejects.toThrow(/topic/); + }); + + test('handler returns an array on valid topic', async () => { + const op = operationsByName['find_experts']; + const result = (await op.handler(makeCtx(), { topic: 'widgets' })) as unknown[]; + expect(Array.isArray(result)).toBe(true); + }); + + test('handler honors limit parameter', async () => { + const op = operationsByName['find_experts']; + const result = (await op.handler(makeCtx(), { topic: 'widgets', limit: 1 })) as unknown[]; + expect(Array.isArray(result)).toBe(true); + expect(result.length).toBeLessThanOrEqual(1); + }); +}); diff --git a/test/fixtures/whoknows-eval.jsonl b/test/fixtures/whoknows-eval.jsonl new file mode 100644 index 000000000..baba2e950 --- /dev/null +++ b/test/fixtures/whoknows-eval.jsonl @@ -0,0 +1,21 @@ +// v0.33 whoknows eval fixture — 10 real routing queries from Garry's +// brain. Each row: {query, expected_top_3_slugs, notes}. +// +// Source: reference/vc-intro-network ("Who Takes Intros from Garry") and +// adjacent routing context Garry maintains in his brain. Slugs verified +// against the markdown source at ~/git/brain/people/.md as of +// 2026-05-11. +// +// Ground truth is "would Garry actually route this intro / consider this +// person an expert here." 
A hit is achieved when ANY of the expected +// top-3 slugs appears in the eval's top-3 returned slugs. +{"query":"healthtech seed VC who takes intros from Garry","expected_top_3_slugs":["people/eric-vishria","people/kristina-shen","people/rebecca-kaden"],"notes":"Eric Vishria (Benchmark, 1-Best, healthtech), Kristina Shen (Chemistry, healthtech), Rebecca Kaden (USV)"} +{"query":"fintech seed investor","expected_top_3_slugs":["people/nick-shalek","people/parul-singh","people/elad-gil"],"notes":"Nick Shalek (Ribbit Capital, fintech-focused), Parul Singh (645 Ventures), Elad Gil (angel)"} +{"query":"AI angel investor early stage","expected_top_3_slugs":["people/elad-gil","people/lachy-groom","people/gokul-rajaram"],"notes":"Three top-rated angels for AI seed rounds"} +{"query":"Founders Fund partner for defense and deep tech","expected_top_3_slugs":["people/trae-stephens"],"notes":"Trae Stephens, defense partner at FF"} +{"query":"USV partner for consumer marketplaces","expected_top_3_slugs":["people/rebecca-kaden"],"notes":"Rebecca Kaden at USV"} +{"query":"Accel partner who funds YC seed","expected_top_3_slugs":["people/amit-kumar"],"notes":"Amit Kumar at Accel, 102 YC deals"} +{"query":"YC partner who advises on fintech","expected_top_3_slugs":["people/diana-hu","people/jon-xu"],"notes":"Diana Hu and Jon Xu, YC GPs"} +{"query":"Menlo Ventures Series A lead","expected_top_3_slugs":["people/joff-redfern"],"notes":"Joff Redfern at Menlo, ex-CPO Atlassian"} +{"query":"Quiet Capital partner","expected_top_3_slugs":["people/lee-edwards"],"notes":"Lee Edwards at Quiet Capital, 52 YC deals"} +{"query":"Index Ventures partner for SaaS","expected_top_3_slugs":["people/nina-achadian"],"notes":"Nina Achadian at Index, 69 YC deals"} diff --git a/test/search-types-filter.test.ts b/test/search-types-filter.test.ts new file mode 100644 index 000000000..0895fb4f6 --- /dev/null +++ b/test/search-types-filter.test.ts @@ -0,0 +1,177 @@ +/** + * v0.33 — SearchOpts.types filter, 
engine-level coverage. + * + * Exercises the SQL-level type filter on PGLite for searchKeyword + * and searchVector. The E2E test (test/e2e/whoknows.test.ts) covers + * the full pipeline; this file targets the engine surface specifically + * so a regression in the types-clause SQL emission gets caught here + * with a tight assertion rather than as part of a longer pipeline. + */ + +import { describe, test, expect, beforeAll, afterAll } from 'bun:test'; +import { PGLiteEngine } from '../src/core/pglite-engine.ts'; +import type { ChunkInput } from '../src/core/types.ts'; + +let engine: PGLiteEngine; + +function basisEmbedding(idx: number, dim = 1536): Float32Array { + const emb = new Float32Array(dim); + emb[idx % dim] = 1.0; + return emb; +} + +beforeAll(async () => { + engine = new PGLiteEngine(); + await engine.connect({}); + await engine.initSchema(); + + // Three pages, three types, sharing the keyword "shared-keyword-xyz". + await engine.putPage('wiki/people/p1', { + type: 'person', + title: 'Person One', + compiled_truth: 'Person One has shared-keyword-xyz expertise.', + }); + await engine.upsertChunks('wiki/people/p1', [ + { + chunk_index: 0, + chunk_text: 'Person One has shared-keyword-xyz expertise.', + chunk_source: 'compiled_truth', + embedding: basisEmbedding(10), + token_count: 10, + }, + ]); + + await engine.putPage('wiki/companies/c1', { + type: 'company', + title: 'Company One', + compiled_truth: 'Company One leader in shared-keyword-xyz.', + }); + await engine.upsertChunks('wiki/companies/c1', [ + { + chunk_index: 0, + chunk_text: 'Company One leader in shared-keyword-xyz.', + chunk_source: 'compiled_truth', + embedding: basisEmbedding(11), + token_count: 10, + }, + ]); + + await engine.putPage('concepts/c1', { + type: 'concept', + title: 'Concept One', + compiled_truth: 'Concept One: shared-keyword-xyz is interesting.', + }); + await engine.upsertChunks('concepts/c1', [ + { + chunk_index: 0, + chunk_text: 'Concept One: shared-keyword-xyz is 
interesting.', + chunk_source: 'compiled_truth', + embedding: basisEmbedding(12), + token_count: 10, + }, + ]); +}, 60_000); + +afterAll(async () => { + await engine.disconnect(); +}); + +describe('searchKeyword — types filter', () => { + test('no types filter: returns all three types', async () => { + const results = await engine.searchKeyword('shared-keyword-xyz', { limit: 10 }); + const types = new Set(results.map((r) => r.type)); + expect(types.has('person')).toBe(true); + expect(types.has('company')).toBe(true); + expect(types.has('concept')).toBe(true); + }); + + test('types: [person, company] excludes concept', async () => { + const results = await engine.searchKeyword('shared-keyword-xyz', { + types: ['person', 'company'], + limit: 10, + }); + expect(results.length).toBe(2); + for (const r of results) { + expect(['person', 'company']).toContain(r.type); + } + expect(results.find((r) => r.type === 'concept')).toBeUndefined(); + }); + + test('types: [concept] excludes person and company', async () => { + const results = await engine.searchKeyword('shared-keyword-xyz', { + types: ['concept'], + limit: 10, + }); + expect(results.length).toBe(1); + expect(results[0].type).toBe('concept'); + }); + + test('types: [] (empty array) is treated as no filter', async () => { + // Empty array hits the `opts.types.length > 0` check and skips the + // clause — same as omitting the field. Documented as part of the + // SearchOpts.types contract. + const all = await engine.searchKeyword('shared-keyword-xyz', { limit: 10 }); + const empty = await engine.searchKeyword('shared-keyword-xyz', { types: [], limit: 10 }); + expect(empty.length).toBe(all.length); + }); + + test('types alone is the multi-type filter (single-value `type` is Postgres-only)', async () => { + // PGLite searchKeyword never honored the single-value `type` field + // (pre-v0.33 parity gap; only postgres-engine.ts has typeClause). 
The + // new v0.33 `types` field is the multi-type surface that BOTH engines + // honor. AND-stacking with `type` is asserted in test/e2e cross-engine + // coverage; on PGLite, `types` is the only filter that applies. + const results = await engine.searchKeyword('shared-keyword-xyz', { + types: ['person'], + limit: 10, + }); + expect(results.length).toBe(1); + expect(results[0].type).toBe('person'); + }); +}); + +describe('searchVector — types filter', () => { + test('no types filter: returns all matching types', async () => { + const results = await engine.searchVector(basisEmbedding(10), { limit: 10 }); + // Vector search may return all by similarity; the assertion is that + // the result set is non-empty and the filter is opt-in. + expect(results.length).toBeGreaterThan(0); + }); + + test('types: [person, company] excludes concept from vector results', async () => { + const results = await engine.searchVector(basisEmbedding(10), { + types: ['person', 'company'], + limit: 10, + }); + for (const r of results) { + expect(['person', 'company']).toContain(r.type); + } + expect(results.find((r) => r.type === 'concept')).toBeUndefined(); + }); + + test('types: [concept] returns only concept-typed results from vector', async () => { + const results = await engine.searchVector(basisEmbedding(12), { + types: ['concept'], + limit: 10, + }); + for (const r of results) { + expect(r.type).toBe('concept'); + } + }); +}); + +describe('searchKeywordChunks — types filter (Postgres-only path is parity)', () => { + test('chunk-grain search honors types filter', async () => { + // searchKeywordChunks lives in postgres-engine.ts; on PGLite the path + // diverges into searchKeyword. We exercise via searchKeyword above and + // assert the cross-engine contract here for posterity. This test + // primarily documents the public surface; the SQL-level coverage for + // postgres is in test/e2e/postgres-engine.test.ts (which runs only + // with DATABASE_URL set). 
+    const results = await engine.searchKeyword('shared-keyword-xyz', {
+      types: ['person'],
+      limit: 10,
+    });
+    expect(results.every((r) => r.type === 'person')).toBe(true);
+  });
+});
diff --git a/test/whoknows-doctor.test.ts b/test/whoknows-doctor.test.ts
new file mode 100644
index 000000000..f6d5229a9
--- /dev/null
+++ b/test/whoknows-doctor.test.ts
@@ -0,0 +1,117 @@
+import { describe, it, expect, beforeAll, afterAll, beforeEach } from 'bun:test';
+import { mkdtempSync, rmSync, writeFileSync, mkdirSync } from 'fs';
+import { tmpdir } from 'os';
+import { join } from 'path';
+import { whoknowsHealthCheck } from '../src/commands/doctor.ts';
+
+/**
+ * v0.33 whoknows_health doctor check — fixture-only assertion. The
+ * check inspects the working-directory fixture file; it does NOT need
+ * an engine. We pass a sentinel object cast to BrainEngine for the
+ * type contract since the check intentionally ignores its argument.
+ */
+
+const stubEngine = {} as Parameters<typeof whoknowsHealthCheck>[0];
+
+let savedCwd: string;
+let workDir: string;
+
+beforeAll(() => {
+  savedCwd = process.cwd();
+});
+
+afterAll(() => {
+  process.chdir(savedCwd);
+});
+
+beforeEach(() => {
+  workDir = mkdtempSync(join(tmpdir(), 'whoknows-doctor-'));
+  process.chdir(workDir);
+});
+
+function cleanup() {
+  process.chdir(savedCwd);
+  try {
+    rmSync(workDir, { recursive: true, force: true });
+  } catch {
+    // best-effort cleanup
+  }
+}
+
+describe('whoknows_health doctor check', () => {
+  it('warns when fixture file is missing entirely', async () => {
+    try {
+      const check = await whoknowsHealthCheck(stubEngine);
+      expect(check.name).toBe('whoknows_health');
+      expect(check.status).toBe('warn');
+      expect(check.message).toContain('fixture missing');
+    } finally {
+      cleanup();
+    }
+  });
+
+  it('warns when fixture exists but is empty', async () => {
+    try {
+      mkdirSync('test/fixtures', { recursive: true });
+      writeFileSync('test/fixtures/whoknows-eval.jsonl', '');
+      const check = await
whoknowsHealthCheck(stubEngine); + expect(check.status).toBe('warn'); + expect(check.message).toContain('empty'); + } finally { + cleanup(); + } + }); + + it('warns when fixture has fewer than 5 rows', async () => { + try { + mkdirSync('test/fixtures', { recursive: true }); + writeFileSync( + 'test/fixtures/whoknows-eval.jsonl', + '{"query":"a","expected_top_3_slugs":["x"]}\n' + + '{"query":"b","expected_top_3_slugs":["y"]}\n', + ); + const check = await whoknowsHealthCheck(stubEngine); + expect(check.status).toBe('warn'); + expect(check.message).toContain('2 row'); + } finally { + cleanup(); + } + }); + + it('passes when fixture has at least 5 rows', async () => { + try { + mkdirSync('test/fixtures', { recursive: true }); + const rows = Array.from({ length: 10 }, (_, i) => + JSON.stringify({ query: `q${i}`, expected_top_3_slugs: [`p${i}`] }), + ).join('\n'); + writeFileSync('test/fixtures/whoknows-eval.jsonl', rows + '\n'); + const check = await whoknowsHealthCheck(stubEngine); + expect(check.status).toBe('ok'); + expect(check.message).toContain('10 queries'); + } finally { + cleanup(); + } + }); + + it('ignores comment lines and blank lines when counting rows', async () => { + try { + mkdirSync('test/fixtures', { recursive: true }); + const content = [ + '# comment', + '// another comment', + '', + '{"query":"a","expected_top_3_slugs":["x"]}', + '{"query":"b","expected_top_3_slugs":["y"]}', + '{"query":"c","expected_top_3_slugs":["z"]}', + '{"query":"d","expected_top_3_slugs":["w"]}', + '{"query":"e","expected_top_3_slugs":["v"]}', + ].join('\n'); + writeFileSync('test/fixtures/whoknows-eval.jsonl', content + '\n'); + const check = await whoknowsHealthCheck(stubEngine); + expect(check.status).toBe('ok'); + expect(check.message).toContain('5 queries'); + } finally { + cleanup(); + } + }); +}); diff --git a/test/whoknows.test.ts b/test/whoknows.test.ts new file mode 100644 index 000000000..b6319d2af --- /dev/null +++ b/test/whoknows.test.ts @@ -0,0 +1,197 @@ 
+import { describe, it, expect } from 'bun:test'; +import { rankCandidates, runWhoknows, findExperts, type WhoknowsResult } from '../src/commands/whoknows.ts'; +import type { PageType } from '../src/core/types.ts'; + +/** + * v0.33 whoknows — pure-function unit tests covering the 10 locked + * shadow-path cases from ENG-D3 plus a few obvious sanity asserts. + * + * The ranking spec (also documented in src/commands/whoknows.ts): + * + * score = log(1 + raw_match) // expertise (sub-linear) + * × max(0.1, exp(-days/180)) // recency (floored) + * × (0.5 + 0.5 × clamp(salience)) // salience (centered) + * + * These tests exercise rankCandidates (pure) and the CLI registration. + * Integration against a real brain lives in test/e2e/whoknows.test.ts. + */ + +function input( + slug: string, + raw_match: number, + days: number | null, + salience: number | null, + type: PageType = 'person', +) { + return { + slug, + source_id: 'default', + title: slug, + type, + raw_match, + days_since_effective: days, + salience_raw: salience, + }; +} + +describe('whoknows / rankCandidates — locked shadow paths (ENG-D3)', () => { + // Case 1: zero hybrid-search results → empty array + it('returns empty array on empty input', () => { + expect(rankCandidates([])).toEqual([]); + }); + + // Case 2: negative recency input → floor activates, score stays valid + it('negative days_since_effective clamps to 0 (recency_decay = 1.0)', () => { + const ranked = rankCandidates([input('alice', 0.5, -10, 0.5)]); + expect(ranked[0].factors.recency_decay).toBeCloseTo(1.0, 5); + expect(Number.isFinite(ranked[0].score)).toBe(true); + }); + + // Case 3: NaN salience → defaults to neutral (0.5) + it('NaN salience defaults to neutral 0.5', () => { + const ranked = rankCandidates([input('bob', 0.5, 30, NaN)]); + expect(ranked[0].factors.salience).toBeCloseTo(0.5, 5); + expect(ranked[0].factors.salience_factor).toBeCloseTo(0.75, 5); + }); + + // Case 4: undefined / null match score → 0 expertise, score zeros 
gracefully + it('NaN raw_match → expertise=0; score zeros gracefully without NaN', () => { + const ranked = rankCandidates([input('carol', NaN, 30, 0.5)]); + expect(ranked[0].factors.expertise).toBe(0); + expect(ranked[0].score).toBe(0); + expect(Number.isFinite(ranked[0].score)).toBe(true); + }); + + // Case 5: person-type filter — verified at SQL level by SearchOpts.types. + // Here we assert rankCandidates preserves the type field passed in. + it('preserves page type in the result row (filter happens upstream at SQL)', () => { + const ranked = rankCandidates([ + input('alice', 0.5, 30, 0.5, 'person'), + input('acme', 0.3, 30, 0.5, 'company'), + ]); + expect(ranked.find((r) => r.slug === 'alice')?.type).toBe('person'); + expect(ranked.find((r) => r.slug === 'acme')?.type).toBe('company'); + }); + + // Case 6: --explain output includes all factor values + it('every result includes the full factor breakdown for --explain', () => { + const [row] = rankCandidates([input('alice', 0.5, 60, 0.4)]); + expect(row.factors).toBeDefined(); + expect(typeof row.factors.expertise).toBe('number'); + expect(typeof row.factors.recency_decay).toBe('number'); + expect(typeof row.factors.recency_factor).toBe('number'); + expect(typeof row.factors.salience).toBe('number'); + expect(typeof row.factors.salience_factor).toBe('number'); + expect(typeof row.factors.raw_match).toBe('number'); + // days_since_effective may be null for cold-start; the shape is correct either way. 
+ expect('days_since_effective' in row.factors).toBe(true); + }); + + // Case 7: top-K honors opts.limit; defaults to 5 + it('top-K honors limit; defaults to 5; clamped to >= 1', () => { + const many = Array.from({ length: 12 }, (_, i) => + input(`person-${String(i).padStart(2, '0')}`, 0.5 - i * 0.01, 30, 0.5), + ); + expect(rankCandidates(many).length).toBe(5); // default + expect(rankCandidates(many, 3).length).toBe(3); + expect(rankCandidates(many, 100).length).toBe(12); + expect(rankCandidates(many, 0).length).toBe(1); // clamped to >= 1 + }); + + // Case 8: recency floor (0.1) — extreme days never produces NaN/Infinity + it('extreme days_since_effective is floored, never produces NaN/Infinity', () => { + const ranked = rankCandidates([ + input('ancient', 0.5, 365 * 100, 0.5), // 100 years + input('cold-start', 0.5, null, 0.5), // never updated + ]); + for (const r of ranked) { + expect(Number.isFinite(r.score)).toBe(true); + expect(r.factors.recency_factor).toBeGreaterThanOrEqual(0.1); + } + // cold-start (null days) → recency_factor = floor (0.1) + const cold = ranked.find((r) => r.slug === 'cold-start')!; + expect(cold.factors.recency_factor).toBeCloseTo(0.1, 5); + }); + + // Case 9: stable ordering — same-score ties break by slug alphabetical + it('same-score ties break alphabetically by slug for determinism', () => { + const ranked = rankCandidates([ + input('zoe', 0.5, 30, 0.5), + input('alice', 0.5, 30, 0.5), + input('bob', 0.5, 30, 0.5), + ]); + expect(ranked.map((r) => r.slug)).toEqual(['alice', 'bob', 'zoe']); + }); + + // Case 10: contract shape — public exports exist and have expected types + it('public surface: rankCandidates / findExperts / runWhoknows are functions', () => { + expect(typeof rankCandidates).toBe('function'); + expect(typeof findExperts).toBe('function'); + expect(typeof runWhoknows).toBe('function'); + }); +}); + +describe('whoknows / rankCandidates — ranking sanity', () => { + it('higher raw_match outranks lower (with all else 
equal)', () => { + const ranked = rankCandidates([ + input('low-match', 0.1, 30, 0.5), + input('high-match', 0.9, 30, 0.5), + ]); + expect(ranked[0].slug).toBe('high-match'); + }); + + it('more recent outranks older (with all else equal)', () => { + const ranked = rankCandidates([ + input('old', 0.5, 365, 0.5), + input('recent', 0.5, 7, 0.5), + ]); + expect(ranked[0].slug).toBe('recent'); + }); + + it('higher salience outranks lower (with all else equal)', () => { + const ranked = rankCandidates([ + input('low-salience', 0.5, 30, 0.1), + input('high-salience', 0.5, 30, 0.9), + ]); + expect(ranked[0].slug).toBe('high-salience'); + }); + + it('all-zero candidate scores 0 but still appears in the result set', () => { + const ranked = rankCandidates([input('flat', 0, 365 * 10, 0)]); + expect(ranked.length).toBe(1); + expect(ranked[0].score).toBe(0); + }); +}); + +describe('whoknows / rankCandidates — composite key safety', () => { + it('preserves source_id on each result row', () => { + const ranked = rankCandidates([ + { slug: 'alice', source_id: 'srcA', title: 'Alice', type: 'person', raw_match: 0.5, days_since_effective: 30, salience_raw: 0.5 }, + { slug: 'alice', source_id: 'srcB', title: 'Alice B', type: 'person', raw_match: 0.6, days_since_effective: 30, salience_raw: 0.5 }, + ]); + // Both rows preserved with their source_ids — composite key intact. 
+ expect(ranked.length).toBe(2); + const sources = new Set(ranked.map((r) => r.source_id)); + expect(sources.has('srcA')).toBe(true); + expect(sources.has('srcB')).toBe(true); + }); +}); + +describe('whoknows / rankCandidates — factor decomposition', () => { + it('returns the exact factor breakdown for a known input', () => { + // expertise = log(1 + 0.5) ≈ 0.405 + // recency_decay = exp(-30/180) ≈ 0.846 + // salience_factor = 0.5 + 0.5*0.5 = 0.75 + // score ≈ 0.405 * 0.846 * 0.75 ≈ 0.257 + const [row] = rankCandidates([input('alice', 0.5, 30, 0.5)]); + expect(row.factors.expertise).toBeCloseTo(Math.log1p(0.5), 5); + expect(row.factors.recency_decay).toBeCloseTo(Math.exp(-30 / 180), 5); + expect(row.factors.recency_factor).toBeCloseTo(Math.exp(-30 / 180), 5); + expect(row.factors.salience_factor).toBeCloseTo(0.75, 5); + expect(row.score).toBeCloseTo(Math.log1p(0.5) * Math.exp(-30 / 180) * 0.75, 5); + }); +}); + +// Case-marker comment: the 10 ENG-D3 cases live above (1-10 in the +// "locked shadow paths" describe block). The additional describes cover +// ranking sanity and source-id safety beyond the locked minimum.
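
For readers skimming the tests above, the locked ENG-D1 formula they pin can be sketched as a standalone function. This is an illustrative reconstruction from the spec comment in `test/whoknows.test.ts` — `clamp01` and `scoreCandidate` are hypothetical names, not the shipped `rankCandidates` API — covering the same shadow paths: negative days clamp to 0, cold-start (`null` days) sits at the 0.1 floor, NaN salience defaults to neutral, and a NaN match score zeros gracefully.

```typescript
// Illustrative sketch of the locked ENG-D1 score. Names are hypothetical;
// the shipped surface is rankCandidates in src/commands/whoknows.ts.
function clamp01(x: number): number {
  // Non-finite salience defaults to neutral 0.5 (shadow-path case 3).
  if (!Number.isFinite(x)) return 0.5;
  return Math.min(1, Math.max(0, x));
}

function scoreCandidate(
  rawMatch: number,
  daysSinceEffective: number | null,
  salienceRaw: number,
): number {
  // Expertise: sub-linear in raw match; NaN match scores 0 (case 4).
  const expertise = Number.isFinite(rawMatch) ? Math.log1p(Math.max(0, rawMatch)) : 0;
  // Recency: exp(-days/180), floored at 0.1. Negative days clamp to 0
  // (case 2); cold-start null sits at the floor, visible not zeroed (case 8).
  const recency =
    daysSinceEffective === null
      ? 0.1
      : Math.max(0.1, Math.exp(-Math.max(0, daysSinceEffective) / 180));
  // Salience: centered at 0.5 so a neutral page neither boosts nor buries.
  const salienceFactor = 0.5 + 0.5 * clamp01(salienceRaw);
  return expertise * recency * salienceFactor;
}
```

Plugging in the factor-decomposition fixture above (`raw_match` 0.5, 30 days, salience 0.5) reproduces the ≈ 0.257 score the final test asserts: log1p(0.5) ≈ 0.405, exp(−30/180) ≈ 0.846, salience factor 0.75.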