v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting #897

Open

garrytan-agents wants to merge 14 commits into garrytan:master from garrytan-agents:feat/search-lite-mode

Conversation

@garrytan-agents (Contributor) commented May 12, 2026

What

Three additive search features inspired by brain-kit. All backward-compatible: existing callers see identical behavior unless they opt in to the new options.

1. Token Budget Enforcement (src/core/search/token-budget.ts)

Cap the cumulative token cost of returned results so search payloads fit downstream context windows. Greedy top-down walk; preserves caller ordering; no re-rank. Token counting uses a char/4 heuristic (no tokenizer dependency — keeps the bun --compile bundle small).

  • SearchOpts.tokenBudget — numeric cap. Default undefined = no-op.
  • HybridSearchMeta.token_budget = { budget, used, kept, dropped }
  • HTTP query op: pass token_budget param.
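The budget walk described above can be sketched as follows. This is a minimal illustration assuming a take-while-style greedy walk; `enforceTokenBudget`, `estimateTokens`, and the result shape are illustrative names, not the PR's actual exports:

```typescript
// Sketch of the greedy top-down budget walk: keep results in caller
// order until the next one would overflow the budget. char/4 heuristic,
// matching the no-tokenizer design note. Names are assumptions.
interface BudgetResult { text: string }

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // char/4 heuristic, no tokenizer dep
}

function enforceTokenBudget<T extends BudgetResult>(
  results: T[],
  budget?: number,
): { kept: T[]; meta: { budget?: number; used: number; kept: number; dropped: number } } {
  if (budget === undefined) {
    // Default undefined = no-op: existing callers see identical behavior.
    const used = results.reduce((n, r) => n + estimateTokens(r.text), 0);
    return { kept: results, meta: { budget, used, kept: results.length, dropped: 0 } };
  }
  const kept: T[] = [];
  let used = 0;
  for (const r of results) { // preserves caller ordering; no re-rank
    const cost = estimateTokens(r.text);
    if (used + cost > budget) break; // greedy: stop at first overflow
    kept.push(r);
    used += cost;
  }
  return { kept, meta: { budget, used, kept: kept.length, dropped: results.length - kept.length } };
}
```

The `meta` object mirrors the `HybridSearchMeta.token_budget` shape above.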

2. Semantic Query Cache (src/core/search/query-cache.ts + migration v52)

Cache search results keyed by query embedding similarity. HNSW lookup: embedding <=> $1 < 0.08 (cosine similarity ≥ 0.92). Per-source isolation. Per-row TTL (default 3600s). Best-effort writes; all errors swallowed so the cache never breaks the search hot path.

  • Migration v52 creates query_cache table. HALFVEC where pgvector ≥ 0.7; VECTOR otherwise. Embedding dim resolved from config.embedding_dimensions so non-OpenAI brains work.
  • New gbrain cache CLI: stats / clear --yes / prune.
  • Config keys: search.cache.enabled / search.cache.similarity_threshold / search.cache.ttl_seconds.
  • HybridSearchMeta.cache = { status, similarity?, age_seconds? }
  • Routed through new hybridSearchCached(engine, query, opts) wrapper. The query op in operations.ts now uses this wrapper so MCP/CLI calls benefit automatically. Skipped for two-pass walks + non-default embedding columns where cache semantics don't hold.
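The cache-hit decision can be sketched like this. It assumes `pgvector`'s `<=>` operator returns cosine distance (so distance < 0.08 corresponds to similarity ≥ 0.92); the row shape and function name are illustrative, not the PR's API:

```typescript
// Sketch of the lookup decision: a cached row qualifies only if its
// embedding is within the distance threshold AND its per-row TTL has
// not elapsed. Shapes/names are assumptions for illustration.
interface CacheRow { distance: number; ageSeconds: number; ttlSeconds: number }

function cacheStatus(
  row: CacheRow | null,
  similarityThreshold = 0.92, // search.cache.similarity_threshold default
): "hit" | "miss" | "expired" {
  if (!row) return "miss";
  if (row.ageSeconds > row.ttlSeconds) return "expired"; // per-row TTL, default 3600s
  // cosine distance < (1 - threshold)  ⇔  cosine similarity ≥ threshold
  return row.distance < 1 - similarityThreshold ? "hit" : "miss";
}
```

The returned status maps onto the `HybridSearchMeta.cache.status` field.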

3. Zero-LLM Intent Weighting (src/core/search/intent-weights.ts)

Builds on the existing 4-intent classifier (src/core/search/query-intent.ts). New weight-adjustment layer applies subtle per-intent nudges:

| Intent   | Adjustment                                             |
|----------|--------------------------------------------------------|
| entity   | boost keyword RRF + exact slug/title match boost       |
| temporal | default recency='on' when caller left it unset         |
| event    | boost keyword RRF (rare named entities) + soft recency |
| general  | no-op (1.0 multipliers)                                |

All adjustments are subtle (max 1.25×). Caller-explicit options always win — intent weighting never silently overrides explicit recency / salience. Default ON; opt out via opts.intentWeighting = false. The LLM query expansion path is still available and opt-in via opts.expansion = true — it just isn't the default anymore.

HybridSearchMeta.intent now surfaces classifier output for debugging.
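The nudge layer can be sketched as below. The exact multipliers and option names are assumptions (the PR only guarantees a 1.25× cap and caller-wins semantics); this is not the real `intent-weights.ts` code:

```typescript
// Sketch of per-intent weight nudges: multiplicative adjustments capped
// at 1.25×, and caller-explicit options always win over intent defaults.
// Multiplier values and shapes here are illustrative assumptions.
type Intent = "entity" | "temporal" | "event" | "general";

interface Weights { keywordRrf: number; recency: "on" | "off" }
interface CallerOpts { recency?: "on" | "off" }

const MAX_NUDGE = 1.25; // hard cap on any adjustment

function applyIntentWeights(intent: Intent, base: Weights, opts: CallerOpts): Weights {
  const clamp = (m: number) => Math.min(m, MAX_NUDGE);
  const out = { ...base };
  if (intent === "entity" || intent === "event") {
    out.keywordRrf = base.keywordRrf * clamp(1.25); // boost keyword RRF
  }
  if ((intent === "temporal" || intent === "event") && opts.recency === undefined) {
    out.recency = "on"; // default only when the caller left it unset
  }
  if (opts.recency !== undefined) out.recency = opts.recency; // caller always wins
  return out; // general intent: no-op, 1.0 multipliers
}
```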

Files changed (13)

 src/cli.ts                                |  +13 -1
 src/commands/cache.ts                     | +106 NEW
 src/core/migrate.ts                       | +104 NEW (migration v52)
 src/core/operations.ts                    |  +11 -2
 src/core/search/hybrid.ts                 | +281 -7
 src/core/search/intent-weights.ts         | +132 NEW
 src/core/search/query-cache.ts            | +321 NEW
 src/core/search/token-budget.ts           | +110 NEW
 src/core/types.ts                         |  +57 NEW
 test/hybrid-search-lite.serial.test.ts    | +163 NEW
 test/intent-weights.test.ts               | +123 NEW
 test/query-cache.test.ts                  | +247 NEW
 test/token-budget.test.ts                 | +125 NEW

Test results

  • 44 new tests pass across the 4 new test files
    • test/token-budget.test.ts — 10 tests (pure module)
    • test/intent-weights.test.ts — 13 tests (pure module)
    • test/query-cache.test.ts — 12 tests (PGLite, including migration v52 schema verification + TTL expiration + source isolation)
    • test/hybrid-search-lite.serial.test.ts — 9 tests (PGLite e2e through hybridSearchCached)
  • 105 existing search tests still pass (test/search.test.ts, test/query-intent*.test.ts, test/hybrid-meta.test.ts, test/search-limit.test.ts, test/search-lang-symbol-kind.test.ts, test/search-image-column.test.ts)
  • 162 tests total pass when run together (no cross-file isolation issues)
  • bun run verify clean (privacy + jsonb + progress + test-isolation + wasm + admin-build + admin-scope-drift + cli-exec + system-of-record + tsc --noEmit)
  • bun run src/cli.ts doctor --fast runs successfully — pre-existing resolver_health / minions_migration failures on master are unrelated to this PR.

Design notes

  • Backward compat is the contract. Token budget defaults to off. Cache defaults on but cache miss = normal search. Intent weighting defaults on but adjusts weights at most 1.25× and never overrides caller-explicit options.
  • No new dependencies. Token counting uses char/4 (js-tiktoken would balloon the bun --compile bundle). Migration uses the same HALFVEC/VECTOR pgvector-version probe pattern as v45 (facts) and v40, so it works on both Postgres and PGLite without extra config.
  • System-of-record clean. The query_cache table is a derived cache, not authoritative state. gbrain cache clear is the documented invalidation surface; the system-of-record check passes.

Co-authored-by: Wintermute <[email protected]>
@garrytan changed the title twice on May 12, 2026, adding a version prefix: first v0.33.1.0, then v0.33.2.0.
garrytan added 13 commits May 11, 2026 23:25
Resolved a single conflict in src/core/migrate.ts: master claimed v52/53/54
(eval_contradictions_cache + eval_contradictions_runs + cjk_wave), so PR garrytan#897's
v52 query_cache migration was renumbered to v55 to sit after them.

Typecheck clean.
…ybridSearch

Three named modes (conservative / balanced / tokenmax) that bundle the
search-lite knobs from PR garrytan#897 into a single config key. Mode resolution
lives in bare hybridSearch (NOT just the cached wrapper) so eval-replay
and eval-longmemeval — which call bare hybridSearch — test the same
mode-affected behavior as production. See [CDX-5+6] in the plan.

The mode bundle supplies DEFAULTS for intentWeighting, tokenBudget,
expansion, and searchLimit when the caller leaves those undefined.
Per-call SearchOpts and per-key config overrides still win (matches the
v0.31.12 model-tier resolution chain at model-config.ts:resolveModel).

knobsHash() exposes a stable SHA-256 of the resolved knob set; the cache
contamination hotfix (next commit) consumes it to prevent a tokenmax
write from being served to a conservative read.
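A stable knobs hash can be sketched as follows, assuming the real `knobsHash()` also canonicalizes key order before hashing (the canonicalization details here are an assumption):

```typescript
// Sketch of a deterministic knobs hash: sort keys before serializing so
// the same resolved knob set always yields the same SHA-256, regardless
// of insertion order. Canonicalization strategy is an assumption.
import { createHash } from "node:crypto";

function knobsHash(knobs: Record<string, unknown>): string {
  const sorted = Object.fromEntries(
    Object.entries(knobs).sort(([a], [b]) => a.localeCompare(b)),
  );
  return createHash("sha256").update(JSON.stringify(sorted)).digest("hex");
}
```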

Three new fields on HybridSearchMeta:
  - mode (resolved mode name)
  - existing token_budget meta now fires from bare hybridSearch too

Bare hybridSearch now applies tokenBudget at all three return paths
(no-embedding-provider, keyword-only-fallback, main). Previously only
hybridSearchCached enforced budget; eval commands missed it.

Tests: 37 unit cases pin the 3x7 bundle table cell-by-cell, the
resolution chain semantics, knobs hash determinism + cross-mode
separation, and the config-table parser. All 72 search-lite tests pass.

Bisect-friendly: this commit ONLY adds mode resolution. The cache-key
contamination hotfix [CDX-4] is a separate atomic commit (next).
PR garrytan#897's query_cache keyed rows on sha256(source_id::query_text) only.
A tokenmax search (expansion=on, limit=50) populated a row that a
subsequent conservative call (no expansion, limit=10) read back, serving
the wrong-shape results. This is a real bug in PR garrytan#897 today, regardless
of the v0.32.3 mode picker work — Codex caught it in plan review.

Fix:
- Migration v56 adds query_cache.knobs_hash TEXT column + composite
  (source_id, knobs_hash, created_at) index. Existing rows have NULL
  knobs_hash and are excluded from lookups (silently re-populated with
  the right hash on first hit — no orphan data, no destructive migration).
- cacheRowId(query, source, knobsHash) — knobsHash now part of the PK so
  a tokenmax write and a conservative write for the same (query, source)
  land in distinct rows.
- SemanticQueryCache.lookup({knobsHash}) filters WHERE knobs_hash = $.
- SemanticQueryCache.store({knobsHash}) writes the resolved hash.
- hybridSearchCached threads knobsHash from resolveSearchMode through
  every cache call. Cache config (enabled/threshold/TTL) now reads from
  the resolved mode bundle, not directly from the config table.
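The widened cache key can be sketched like this. The delimiter and exact composition are assumptions; the point is that knobsHash joins (source, query) in the row id:

```typescript
// Sketch of cacheRowId with knobsHash in the key: a tokenmax write and a
// conservative write for the same (query, source) land in distinct rows.
// Delimiter choice is an illustrative assumption.
import { createHash } from "node:crypto";

function cacheRowId(query: string, sourceId: string, knobsHash: string): string {
  return createHash("sha256")
    .update(`${sourceId}\u0000${query}\u0000${knobsHash}`) // NUL-delimited to avoid concat collisions
    .digest("hex");
}
```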

Tests (test/query-cache-knobs-hash.test.ts, 11 cases):
- cacheRowId bifurcates by knobsHash
- Tokenmax write does NOT contaminate conservative lookup
- Three modes coexist as distinct rows for same query
- Legacy NULL-knobs_hash rows are excluded from lookup
- Same-mode write updates in place (no duplicate rows)

All 58 cache + mode tests pass. Migration v56 applies cleanly on a fresh
PGLite brain.

Bisect-friendly: this commit is the cache-key hotfix alone. Mode
resolution wiring lives in the previous commit.
…able

Migration v57 creates search_telemetry (date, mode, intent, count,
sum_results, sum_tokens, sum_budget_dropped, cache_hit, cache_miss,
first_seen, last_seen). PK (date, mode, intent) caps growth at ~4380
rows/year. Sums + counts only — averages derive at read time so
concurrent ON CONFLICT writes from multiple gbrain processes accumulate
correctly [CDX-17].

In-memory bucket flushed periodically (60s OR 100 calls) + on process
beforeExit/SIGINT/SIGTERM with a 2-second cap. The search hot path NEVER
waits on this write [D2, CDX-19].

Date-bucketed cache_hit / cache_miss columns make hit rate over --days N
derivable [CDX-18]. query_cache.hit_count is a lifetime counter and
can't be sliced by window.

Wired into bare hybridSearch via emitMeta: every search call sync-bumps
a bucket. flush() drains atomically by swapping the map before SQL writes
so a record() during flush lands in the new map.
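The swap-before-write flush can be sketched as below. The SQL write is stubbed as a callback and the bucket shape is simplified; only the swap pattern itself is taken from the description above:

```typescript
// Sketch of the atomic drain: flush() swaps in a fresh map BEFORE doing
// the (slow) writes, so a record() arriving mid-flush lands in the new
// map and is never lost. Shapes are illustrative.
type BucketKey = string; // e.g. `${date}|${mode}|${intent}`
interface Bucket { count: number; sumTokens: number }

class TelemetrySink {
  private buckets = new Map<BucketKey, Bucket>();

  record(key: BucketKey, tokens: number): void {
    const b = this.buckets.get(key) ?? { count: 0, sumTokens: 0 };
    b.count += 1;
    b.sumTokens += tokens;
    this.buckets.set(key, b); // sync bump; the search hot path never awaits
  }

  flush(write: (drained: Map<BucketKey, Bucket>) => void): void {
    const drained = this.buckets;
    this.buckets = new Map(); // swap first: concurrent record() hits the new map
    if (drained.size > 0) write(drained); // then do the ON CONFLICT accumulation
  }
}
```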

readSearchStats(engine, {days}) returns the StatsWindow shape that
gbrain search stats consumes (next commit).

Tests: 16 unit cases pin record/flush/read semantics including
ON-CONFLICT-adds-raw-values, concurrent-flush coalescing, cache hit-rate
math, missing-table graceful degradation, and window clamping.

53 migrations apply on a fresh PGLite brain.
…+8+9]

CDX-8: gbrain config has no unset path today. Required before
`gbrain search modes --reset` can clear search.* overrides.

  - BrainEngine.unsetConfig(key) → returns rows deleted (0|1)
  - BrainEngine.listConfigKeys(prefix) → exact-literal prefix match
    with LIKE-escape on user-supplied % / _ / \ characters
  - PGLiteEngine + PostgresEngine implementations
  - `gbrain config unset <key>` and `gbrain config unset --pattern <prefix>`
    sub-subcommands
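The exact-literal prefix matcher can be sketched as follows; the helper name is illustrative, but the escaping requirement (user-supplied `%`, `_`, `\` must match literally in SQL LIKE) is as described above:

```typescript
// Sketch of LIKE-safe prefix matching: escape backslash first, then the
// LIKE metacharacters, so only the trailing % acts as a wildcard.
// Helper name is an assumption.
function likePrefixPattern(prefix: string): string {
  const escaped = prefix
    .replace(/\\/g, "\\\\")          // literal backslashes
    .replace(/[%_]/g, (m) => `\\${m}`); // literal % and _
  return `${escaped}%`; // trailing % is the only real wildcard
}
```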

CDX-9: readLine has no EOF detection or timeout. Mode-picker plan calls
out "TTY closes mid-prompt → defaults to balanced" but the raw helper
hangs forever. New readLineSafe(prompt, defaultValue, timeoutMs=60s):

  - Returns defaultValue on stdin 'end' event
  - Returns defaultValue on timeout
  - Returns defaultValue on empty Enter
  - Non-TTY stdin returns defaultValue immediately (e2e safe)
  - Returns trimmed user input otherwise

Exported so install picker (next task) can use it.

Tests: 9 cases pin unset semantics + prefix matcher edge cases
(glob-wildcard escape, sort order, idempotent loop, search.* sweep).
All 53 migrations apply on a fresh PGLite brain.
Install picker (src/commands/init-mode-picker.ts):
  - Runs as a phase inside `gbrain init` AFTER engine.initSchema() so DB
    config writes work [CDX-7].
  - Idempotent: skipped on re-init if search.mode is already set.
  - Smart auto-suggestion via recommendModeFor() reads
    models.tier.subagent / models.default / OPENAI_API_KEY:
      * Opus default/subagent → tokenmax (quality ceiling)
      * Haiku subagent → conservative (4K budget keeps cost down)
      * No OpenAI key → conservative (no LLM expansion possible)
      * Sonnet / unknown → balanced (safe default)
  - TTY shows menu via readLineSafe (60s timeout, defaults on EOF/empty).
  - Non-TTY auto-selects + emits operator hint:
      [gbrain] search mode: X (auto-selected — reason)
      [gbrain] To change: gbrain config set search.mode <...>
  - --json mode emits structured `{phase: 'search_mode_picker', ...}` event.
  - Wired into both initPGLite and initPostgres flows.
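The auto-suggestion heuristic above can be sketched like this. The input shape and tier-string matching are assumptions about the real `recommendModeFor()`; the branch order follows the list above:

```typescript
// Sketch of the mode recommendation heuristic: Opus → tokenmax,
// Haiku subagent → conservative, no OpenAI key → conservative,
// otherwise balanced. Input shape is an illustrative assumption.
type Mode = "conservative" | "balanced" | "tokenmax";

interface BrainHints { subagentTier?: string; defaultTier?: string; hasOpenAiKey: boolean }

function recommendModeFor(h: BrainHints): { mode: Mode; reason: string } {
  const tier = (h.subagentTier ?? h.defaultTier ?? "").toLowerCase();
  if (tier.includes("opus")) return { mode: "tokenmax", reason: "Opus default/subagent: quality ceiling" };
  if ((h.subagentTier ?? "").toLowerCase().includes("haiku"))
    return { mode: "conservative", reason: "Haiku subagent: 4K budget keeps cost down" };
  if (!h.hasOpenAiKey) return { mode: "conservative", reason: "no OpenAI key: no LLM expansion possible" };
  return { mode: "balanced", reason: "Sonnet / unknown: safe default" };
}
```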

Upgrade banner (src/commands/upgrade.ts):
  - One-shot stderr banner in runPostUpgrade.
  - State persisted via config key `search.mode_upgrade_notice_shown=true`
    — fires at most once per install.
  - Copy corrected per [CDX-1+2+3]: production query op STILL defaults
    expand=true and limit=20. The banner reframes from "behavior is
    regressing" to "named modes available + here's how to preserve
    exact current shape."

Tests (test/init-mode-picker.test.ts, 16 cases):
  - recommendModeFor heuristic for all 4 input shapes
  - parseModeInput accepts numeric/named/case-insensitive, rejects garbage
  - runModePicker non-TTY auto-selects + writes config
  - Idempotent + --force re-prompt + JSON output
  - Opus → tokenmax, Haiku → conservative real wiring through engine
Three sub-subcommands mirroring the gbrain models (v0.31.12) shape:

  gbrain search modes [--json]
    Read-only routing dashboard. Shows the three mode bundles, the active
    mode, and the source of every resolved knob:
      cache_enabled = true   [override: search.cache.enabled]
      tokenBudget   = 4000   [mode: conservative]
    Plus knob descriptions for legibility.

  gbrain search modes --reset [--source <mode>]
    Clears every search.* override (NOT search.mode itself). Preserves
    the upgrade-notice state key. --source <mode> is a dry-run that
    lists what --reset would change without writing — the paved path
    [CDX-8] flagged as missing.

  gbrain search stats [--days N] [--json]
    Observability. Reads the search_telemetry rollup over the window
    (clamps to [1, 365]). Prints cache hit rate, mode mix, intent mix,
    budget drops, avg results/tokens. JSON output includes
    _meta.metric_glossary block per [CDX-25].

  gbrain search tune [--apply] [--json]
    Recommendation engine. 5 rules cover the bug class:
      - Insufficient data → "no_recommendations" status
      - Conservative + high budget-drop rate → suggest balanced
      - High cache hit rate (>85%) → suggest similarity threshold bump
      - Tokenmax + Haiku subagent → suggest balanced (cost mismatch)
      - Cache disabled but stats show usage → suggest re-enabling
    --apply mutates config via setConfig / unsetConfig with a paste-ready
    revert command printed at the end.

Registered in src/cli.ts dispatch table. 17 unit cases pin:
  - Dashboard report shape + per-knob source attribution
  - --reset preserves search.mode + notice key
  - --source dry-run never writes
  - stats reads telemetry rollup; --days clamps
  - tune recommendation rules fire on real telemetry data
  - --apply mutates config
  - --help + unknown subcommand exit codes
… guard

Single source of truth at src/core/eval/metric-glossary.ts. Every entry
carries 3 fields:
  - industry_term (canonical IR/NLP literature name, preserved verbatim)
  - eli10 (plain-English a 16-year-old can follow)
  - range (numeric range + interpretation)

Covers 4 metric families:
  - Retrieval: P@k, R@k, MRR, nDCG@k
  - Stability: Jaccard@k, top-1 stability
  - Statistical: p-value (paired bootstrap + Bonferroni), 95% CI
  - Operational: cache hit rate, avg results/tokens, cost per query, p99 latency

Public surface:
  - getMetricGloss(metric) → full entry or null
  - eli10For(metric) → plain-English string or null
  - buildMetricGlossaryMeta(metrics[]) → {metric → eli10} record for
    JSON `_meta.metric_glossary` blocks per [CDX-25]. ONE block per
    response, NOT sibling `_gloss` fields on every metric.
  - renderMetricGlossaryMarkdown() → deterministic Markdown for the doc

Auto-generation:
  scripts/generate-metric-glossary.ts emits docs/eval/METRIC_GLOSSARY.md.
  Deterministic (same input → same bytes) so the CI guard can diff.

CI guard:
  scripts/check-eval-glossary-fresh.sh regenerates into a temp file and
  diffs against the committed doc. Out-of-date doc fails the build.
  Wired into `bun run verify` (and therefore `bun run test:full`).

Tests (test/metric-glossary.test.ts, 18 cases):
  - Every documented metric is present
  - Every entry has all 3 required fields
  - Accessors return null on unknown metrics (no throw)
  - buildMetricGlossaryMeta silently drops unknown metrics
  - renderer output is deterministic across calls
  - Renderer groups metrics into 4 sections

docs/eval/METRIC_GLOSSARY.md: 5491 bytes, 124 lines, fresh.
src/core/eval/drift-watch.ts — curated retrieval watch-list [CDX-6].
Five patterns covering the surface that actually affects retrieval quality:
  - src/core/search/      (search pipeline)
  - src/core/embedding.ts (embedding shape)
  - src/core/chunkers/    (chunk granularity)
  - src/core/ai/recipes/anthropic.ts + openai.ts (expansion + embed routing)
  - src/core/operations.ts (the query op definition)

Adding to the list is a deliberate act — requires a CHANGELOG line so
coverage grows on purpose, not by accident. Pure functions:
  - matchesWatchPattern(path) — trailing-slash = prefix, bare = equality
  - filesDriftedSince(repoRoot, sha?) — git diff --name-only wrapper
  - watchedFilesDrifted(repoRoot, sha?) — composite
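The matcher semantics (trailing-slash = prefix, bare = equality) can be sketched directly; the pattern list mirrors the five entries above, and the function shape is an assumption about the real module:

```typescript
// Sketch of the watch-list matcher: a pattern ending in "/" matches any
// path under that directory; a bare pattern must match the path exactly.
const WATCH_PATTERNS = [
  "src/core/search/",
  "src/core/embedding.ts",
  "src/core/chunkers/",
  "src/core/ai/recipes/anthropic.ts",
  "src/core/ai/recipes/openai.ts",
  "src/core/operations.ts",
];

function matchesWatchPattern(path: string): boolean {
  return WATCH_PATTERNS.some((p) =>
    p.endsWith("/") ? path.startsWith(p) : path === p, // trailing-slash = prefix, bare = equality
  );
}
```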

src/commands/doctor.ts — two new checks.

checkSearchMode [CDX-20]: status stays 'ok' (never warns, never docks
health score). Hint in message field. Three branches:
  - unset → "search.mode is unset (using balanced fallback). Run
    `gbrain search modes` to see what is running and pick a mode."
  - mode + no overrides → "Mode: X (no per-key overrides — mode bundle
    is canonical)."
  - mode + overrides → "Mode: X with N per-key override(s) (k1, k2, …).
    To consolidate to the pure mode bundle: gbrain search modes --reset"
Upgrade-notice state key (search.mode_upgrade_notice_shown) is excluded
from the override roster — it's not a knob.

checkEvalDrift [CDX-6]: surfaces uncommitted changes to retrieval-watched
files. Always 'ok'; operator-facing reminder. Names up to 3 drifted files
in the message + paste-ready re-eval command.

Both helpers exported (was: file-private) so tests can pin behavior
without walking the full runDoctor pipeline.

Tests: 12 drift-watch cases + 7 doctor-check cases. Pin watch-list shape,
prefix-vs-equality matcher semantics, missing-repo graceful failure, and
all three search_mode branches.
Per-mode --mode flag plumbed into:
  - gbrain eval longmemeval --mode <conservative|balanced|tokenmax>
    Sets search.mode in the benchmark brain's config table; config is
    in PRESERVE_TABLES so resetTables doesn't wipe it between questions.
    Mode surfaces in the per-question NDJSON row.
  - gbrain eval replay --mode <m> + --compare-limit N
    --compare-limit forces a constant K across modes [CDX-13]; without
    it, Jaccard@k against the captured baseline measures K-drift, not
    quality. Mode is set once before the replay loop.
  - NOT cross-modal per [CDX-11]: cross-modal scores OUTPUT against
    TASK; it doesn't retrieve. Adding --mode there is theater.

New: gbrain eval run-all orchestrator (src/commands/eval-run-all.ts):
  - Sweeps every requested mode × suite combination
  - Sequential default per D9; --parallel N opt-in (clamped to mode count)
  - Cost guard with split caps [CDX-15+16]:
      --budget-usd-retrieval N (default $5)
      --budget-usd-answer N (default $20)
    Non-TTY refuses with exit 2 unless --yes AND explicit --budget-usd-*
    flags pass. TTY refuses without --yes (defense against agent loops).
  - estimateRunCost computes per-(suite,mode) breakdown including the
    expansion-Haiku surcharge for tokenmax.
  - Audit trail: appends to <repo>/.gbrain-evals/eval-results.jsonl
    [CDX-23]. Personal brain (~/.gbrain) NEVER touched.
  - v0.32.3 ships orchestrator + argv + guard + persist hook.
    In-process per-suite invocation is a v0.32.4 follow-up (operator
    runs the per-suite CLIs with the documented --mode flag for now;
    each completion calls persistRunRecord to log).
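The guard's four (over/under × tty/non-tty) shapes can be sketched as a decision function. Exit codes and the exact refusal shape are assumptions beyond the "non-TTY exit 2" case stated above:

```typescript
// Sketch of the cost-guard decision: over-budget always refuses;
// non-TTY additionally requires --yes AND explicit --budget-usd-* flags
// (defense against agent loops); TTY requires --yes. Exit codes for the
// TTY branches are assumptions.
interface GuardInput {
  isTty: boolean;
  yes: boolean;             // --yes
  explicitBudgets: boolean; // both --budget-usd-* flags passed
  estimatedUsd: number;
  capUsd: number;
}

function guardDecision(g: GuardInput): { allow: boolean; exitCode: number } {
  if (g.estimatedUsd > g.capUsd) return { allow: false, exitCode: 2 }; // over cap: refuse
  if (!g.isTty) {
    const ok = g.yes && g.explicitBudgets;
    return ok ? { allow: true, exitCode: 0 } : { allow: false, exitCode: 2 };
  }
  return g.yes ? { allow: true, exitCode: 0 } : { allow: false, exitCode: 2 };
}
```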

New: gbrain eval compare report (src/commands/eval-compare.ts):
  - Reads eval-results.jsonl, groups by (suite, mode), renders MD or JSON
  - Most-recent (suite, mode, commit) wins when duplicates exist
  - JSON output has schema_version=2 + _meta.metric_glossary block per
    [CDX-25] (ONE block per response, not sibling _gloss fields)
  - _meta.methodology field names the paired-bootstrap + Bonferroni
    discipline per [CDX-14] so haters can reproduce
  - Missing file → friendly hint pointing at `gbrain eval run-all`

Wired into eval dispatch table in src/commands/eval.ts.

Metric glossary fuzzy fallback: `recall@10` → `recall@k` lookup
(the glossary documents the family; report rows carry specific K
values). Routes through getMetricGloss for every call site.
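The fuzzy fallback can be sketched as below. The glossary contents are placeholders; only the `@N` → `@k` family rewrite is taken from the description above:

```typescript
// Sketch of the fuzzy glossary lookup: try the metric verbatim, then
// rewrite a trailing @<number> to @k to hit the documented family entry.
// Glossary text here is illustrative, not the real doc.
const GLOSSARY = new Map<string, string>([
  ["recall@k", "Of all relevant docs, the fraction that showed up in the top k."],
  ["mrr", "How high the first relevant result ranks, averaged over queries."],
]);

function getMetricGloss(metric: string): string | null {
  const key = metric.toLowerCase();
  const direct = GLOSSARY.get(key);
  if (direct) return direct;
  const family = key.replace(/@\d+$/, "@k"); // recall@10 → recall@k
  return GLOSSARY.get(family) ?? null;
}
```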

Tests (42 cases total — all green):
  - eval-run-all.test.ts (19): argv parser, cost estimate, guard
    semantics for all 4 (over/under × tty/non-tty) shapes, persist hook
    NDJSON shape.
  - eval-compare.test.ts (5): JSON + MD output shapes, glossary
    integration, missing-file graceful, mode filter, most-recent-wins.
  - metric-glossary.test.ts (18): unchanged but updated assertions to
    cover the fuzzy `@N` → `@k` fallback.

Pre-existing eval-replay / eval-longmemeval / eval-export / eval-prune
tests (42 cases) still pass — --mode + --compare-limit are additive.
docs/eval/SEARCH_MODE_METHODOLOGY.md — haters-immune 8-section template.
Documents what the eval measures + does NOT measure, datasets + sizes
(LongMemEval n=500, Replay n=200, BrainBench n=1240 docs / 350 qrels),
random seed 42, run procedure verbatim, threats to validity (LongMemEval
English+technical skew, char/4 heuristic ~5-10% off, expansion ~97.6%
relative lift on this corpus), per-question raw outputs, pre-registered
expectations (tokenmax wins R@10 by 5-15pp, conservative wins cost by
5-15x, balanced lands within 3pp), re-run cadence anchored to the
src/core/eval/drift-watch.ts watch-list.

Statistical-significance section pins paired bootstrap with 10,000
resamples + Bonferroni correction across 3 modes × 4 metrics [CDX-14].

CLAUDE.md gets two new sections: ## Search Mode (3-mode table + resolution
chain + [CDX-4] cache contamination fix note + CLI commands) and ## Eval
discipline (single-source-of-truth glossary, methodology doc, eval_results
in repo NOT personal brain per [CDX-23]).

README.md Quick Start gets a paragraph naming the install picker, mode
heuristic, and the methodology link.

skills/conventions/search-modes.md NEW — convention file consumed by
brain-ops + query + signal-detector skills via the existing
`> **Convention:**` callout pattern. Routes "what mode" / "tune
retrieval" / "compare modes" queries to the right CLI surface.

skills/RESOLVER.md gets two new trigger rows pointing at
gbrain search * and gbrain eval compare.
bun run build:llms — picks up the new CLAUDE.md sections (Search Mode +
Eval discipline) and the docs/eval/SEARCH_MODE_METHODOLOGY.md addition.
build-llms.test.ts gate now passes.
… flow

The v0.32.3 search_mode + eval_drift helpers were inserted into the
DB-checks sub-helper at runDbChecks (line 345-355), but runDoctor itself
maintains its own check list and only calls the helpers' subset. Push
the two checks into the main runDoctor path (after the existing
sync_freshness check at line 2347) so they actually appear in
`gbrain doctor --json` output.

Both checks gated on engine !== null. Progress reporter heartbeat fires
for each. Both still return status 'ok' per [CDX-20] so health score is
preserved.

Verified end-to-end on a real Postgres brain: gbrain doctor --json now
includes 'search_mode' and 'eval_drift' in the checks array.