Audience: gbrain maintainers and contributors. If you're touching retrieval (search, ranking, embeddings, intent classification, query expansion, source boost, hybrid fusion), this is the doc.
For the NDJSON wire format consumed by gbrain-evals, see
eval-capture.md. This doc is the human dev loop
that lives on top of that format.
Capture is off by default for production users (privacy-positive — no surprise data accumulation). Contributors flip it on with one line:
```
# In ~/.zshrc or ~/.bashrc:
export GBRAIN_CONTRIBUTOR_MODE=1
```

Verify:

```
gbrain query "anything" >/dev/null
psql $DATABASE_URL -c 'SELECT count(*) FROM eval_candidates'  # should be > 0
```

To override (force on/off regardless of the env var), edit `~/.gbrain/config.json`:

```
{"eval": {"capture": true}}   // force on
{"eval": {"capture": false}}  // force off
```

Explicit config beats the env var in both directions.
```
# ① Capture: writes to eval_candidates whenever CONTRIBUTOR_MODE is set.
#   Inspect what's been collected:
gbrain doctor                                                  # surfaces capture failures
psql $DATABASE_URL -c 'SELECT count(*) FROM eval_candidates'

# ② Snapshot: freeze a baseline before your code change.
gbrain eval export --since 7d > baseline.ndjson

# ③ Code change: do whatever you want — tune RRF_K, swap embed model, edit
#   hybrid.ts, add a new boost source, change the intent classifier.

# ④ Replay: re-run every captured query against the current build.
gbrain eval replay --against baseline.ndjson
```

Output:

```
Replaying 247 captured queries…
...25/247
...50/247
...
Replayed 247 of 247 captured queries (0 skipped, 0 errored)
Mean Jaccard@k: 0.927
Top-1 stability: 91.5%
Mean latency Δ: +14ms (current vs captured)

Top 5 regression(s):
  jaccard=0.20  captured=12  current=3  "find every reference to widget-co"
  jaccard=0.43  captured=14  current=8  "show me everything tagged for review"
  jaccard=0.50  captured=8   current=4  "what did alice say about the spec"
  ...
```
Three numbers tell you whether the change is safe to land:
| Metric | What it means | Healthy range |
|---|---|---|
| Mean Jaccard@k | Average overlap between captured retrieved slugs and current run's slugs. 1.0 = identical sets. | ≥0.85 for "neutral" changes. <0.7 means major retrieval shift. |
| Top-1 stability | Fraction of queries whose #1 result didn't change. | ≥85% for tuning passes. <70% means top-of-funnel broke. |
| Mean latency Δ | Current minus captured. Positive = slower now. | Within ±50ms of captured. >2× anywhere = regression alarm. |
`gbrain eval replay` reads your NDJSON snapshot and, for each row (sketched below):

- Re-executes the same op (`searchKeyword` for `tool_name='search'`, `hybridSearch` for `tool_name='query'`) with the captured `detail` and `expand_enabled` values threaded back in.
- Captures the current `retrieved_slugs` (deduped, in result order).
- Computes set-Jaccard between the captured and current slug sets.
- Records top-1 match (was the #1 result the same slug?).
- Records the latency delta vs captured `latency_ms`.
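For concreteness, here is a minimal sketch of the per-row math. Helper names are mine, not the actual replay implementation:

```ts
// Set-Jaccard over slug sets: |A ∩ B| / |A ∪ B|. 1.0 means identical sets.
function jaccard(a: string[], b: string[]): number {
  const sa = new Set(a);
  const sb = new Set(b);
  if (sa.size === 0 && sb.size === 0) return 1; // two empty result sets count as identical
  let overlap = 0;
  for (const slug of sa) if (sb.has(slug)) overlap++;
  return overlap / (sa.size + sb.size - overlap);
}

// Per-row comparison as described above: set overlap, top-1 match, latency delta.
function compareRow(
  captured: { retrieved_slugs: string[]; latency_ms: number },
  currentSlugs: string[],
  currentLatencyMs: number,
) {
  return {
    jaccard: jaccard(captured.retrieved_slugs, currentSlugs),
    top1Match: captured.retrieved_slugs[0] === currentSlugs[0],
    latencyDeltaMs: currentLatencyMs - captured.latency_ms, // positive = slower now
  };
}
```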
It does NOT compute MRR or nDCG — those need ground-truth relevance labels,
not a baseline comparison. For metric-against-truth eval, use
gbrain eval --qrels <path> (the legacy IR-eval path, still supported). The
replay tool answers a different question: "did my code change move
retrieval, and which queries did it move most?"
For a third evaluation axis — public benchmark, ground-truth labels, full question-answer pipeline (not just retrieval) — `gbrain eval longmemeval <dataset.jsonl>` (v0.28.8) runs the LongMemEval benchmark against gbrain's hybrid retrieval. Each question gets a clean in-memory PGLite, its haystack imported, the question asked, the hypothesis emitted as JSONL — exactly the shape LongMemEval's `evaluate_qa.py` consumes. Your `~/.gbrain` brain is never opened. See "Public benchmarks: LongMemEval" below.
Replay is not pure. Three things can drift between capture and replay:

- Brain state — your brain probably has more pages now than when the snapshot was taken. Unless you explicitly seed a fixed corpus, mean Jaccard will drop simply because new pages are eligible.
- Embedding source — if you changed `OPENAI_API_KEY` between capture and replay (or the embedding model rotated), vector-path results drift even with identical code.
- Capture cap — captured `retrieved_slugs` is a deduped set; it doesn't preserve internal ranking metadata. Two tools can return the same slug set with different scores — Jaccard will say 1.0, but a downstream consumer that orders by score may behave differently (concrete example below).
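To make the last caveat concrete: a reorder below position 1 is invisible to both metrics. This reuses the `jaccard` sketch above; the slugs are the hand-authored examples from this doc, not real data:

```ts
const captured = ["people/alice", "deals/acme-seed", "companies/acme"];
const current  = ["people/alice", "companies/acme", "deals/acme-seed"]; // same set, tail reordered

jaccard(captured, current);     // 1.0: identical sets
captured[0] === current[0];     // true: top-1 stable
// Yet a downstream consumer that re-orders results by score could still behave differently.
```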
The metrics are regression alarms on real queries, not a hash check. Pair them with manual inspection of the top regressions.
Every query row in the snapshot embeds the query string via OpenAI to run the vector half of `hybridSearch`. Cost is identical to a normal `gbrain query` invocation — `text-embedding-3-large` at OpenAI list price, batched inside a single replay row.
If you're iterating locally and don't want to pay per change, use
--limit 50 to cap rows replayed. The 50 most recent rows are usually
enough to catch direction; expand for the final pre-merge run.
```
# Iteration mode — 50 most recent queries
gbrain eval replay --against baseline.ndjson --limit 50

# Pre-merge — full snapshot
gbrain eval replay --against baseline.ndjson --top-regressions 20
```

To gate a CI job on the result, emit JSON and assert on the summary:

```
gbrain eval replay --against baseline.ndjson --json > replay.json
jq -e '.summary.mean_jaccard >= 0.85' replay.json || exit 1
jq -e '.summary.top1_stability_rate >= 0.85' replay.json || exit 1
```

Stable JSON shape (`schema_version: 1`):
```
{
  "schema_version": 1,
  "summary": {
    "rows_total": 247,
    "rows_replayed": 247,
    "rows_skipped": 0,
    "rows_errored": 0,
    "mean_jaccard": 0.927,
    "top1_stability_rate": 0.915,
    "mean_latency_delta_ms": 14,
    "rows_over_2x_latency": 0
  }
}
```

`--verbose` adds a `results: [...]` array with one entry per replayed row (useful for piping into `jq` or a notebook for deeper analysis).
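The per-entry shape isn't pinned down in this doc, so treat the following as an assumption — field names inferred from the human-readable output and the `error_message` mention in Troubleshooting. Verify against actual `--verbose` output before depending on it:

```ts
// ASSUMED shape of one --verbose `results` entry; inferred, not a schema guarantee.
interface ReplayResultEntry {
  query: string;            // the captured query string
  jaccard: number;          // set overlap for this row, 0..1
  top1_match: boolean;      // did the #1 slug survive the change?
  latency_delta_ms: number; // current minus captured
  error_message?: string;   // present when the row errored
}
```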
Before merging anything that touches:
- `src/core/search/hybrid.ts` (RRF, fusion, dedup, two-pass retrieval)
- `src/core/search/source-boost.ts` / `sql-ranking.ts` (per-source ranking)
- `src/core/search/intent.ts` (auto-detail classification)
- `src/core/search/expansion.ts` (Haiku query expansion)
- `src/core/search/dedup.ts` (cross-page result collapse)
- `src/core/embedding.ts` or any embedding model swap
- `src/core/operations.ts` `query` or `search` op handlers (capture surface)
- `src/core/postgres-engine.ts` / `pglite-engine.ts` `searchKeyword` / `searchVector` SQL
Skip for: schema-only migrations, doc changes, tests-only PRs, CLI ergonomics that don't touch retrieval.
If you don't have captured traffic yet (fresh install, can't dogfood for a week before merging), you can hand-author an NDJSON file:
{"schema_version":1,"id":1,"tool_name":"query","query":"who is alice","retrieved_slugs":["people/alice","people/alice-bio"],"expand_enabled":false,"detail":null,"latency_ms":0,"remote":false}
{"schema_version":1,"id":2,"tool_name":"search","query":"acme deal","retrieved_slugs":["deals/acme-seed","companies/acme"],"latency_ms":0,"remote":false}Then run gbrain eval replay --against handcrafted.ndjson to confirm the
authoritative slugs come back. This is the seam between the BrainBench-Real
pipeline (replay against live captures) and the BrainBench fixed-fixture
pipeline (gbrain eval --qrels with the sibling
gbrain-evals corpus).
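If you're hand-authoring more than a couple of rows, a throwaway script keeps them consistent. A sketch, assuming the fields in the two example rows above are the full surface — the `EvalRow` type and output filename are mine:

```ts
import { writeFileSync } from "node:fs";

// Row fields mirror the hand-authored NDJSON examples above; latency_ms=0
// marks a synthetic row with no real captured latency.
interface EvalRow {
  schema_version: 1;
  id: number;
  tool_name: "query" | "search";
  query: string;
  retrieved_slugs: string[]; // the slugs you assert should come back
  expand_enabled?: boolean;
  detail?: string | null;
  latency_ms: number;
  remote: boolean;
}

const rows: EvalRow[] = [
  {
    schema_version: 1, id: 1, tool_name: "query", query: "who is alice",
    retrieved_slugs: ["people/alice", "people/alice-bio"],
    expand_enabled: false, detail: null, latency_ms: 0, remote: false,
  },
];

writeFileSync("handcrafted.ndjson", rows.map((r) => JSON.stringify(r)).join("\n") + "\n");
```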
Two ways to disable capture:
```
unset GBRAIN_CONTRIBUTOR_MODE   # easy: just unset the env var
```

Or force off regardless of the env var via `~/.gbrain/config.json`:

```
{"eval": {"capture": false}}
```

Existing `eval_candidates` rows stay until you run `gbrain eval prune --older-than 0d` (or just drop the table).
| What you see | What it means |
|---|---|
| `Mean Jaccard@k: 0.4`, top regressions all in one source dir | Source boost or hard-exclude regression on that prefix |
| `Top-1 stability: 30%`, mean Jaccard still high | RRF tuning shifted the rank order without changing the set — re-tune `rrfK` |
| `Mean latency Δ: +500ms`, jaccard high | Vector path got slower; check embedding API or HNSW probes |
| `rows_errored > 0` | One or more queries threw. Inspect the first 3 in human output, or use `--json` to see all `error_message` fields |
| Many `skipped: empty query` | Capture ran on rows where someone passed an empty query — check why those were captured |
gbrain eval longmemeval runs the public LongMemEval
benchmark directly against gbrain's hybrid retrieval. Different evaluation
axis from eval replay: public dataset with ground-truth labels, end-to-end
question-answer pipeline, hermetic per-question brains.
```
# Download the dataset (visit the HF page in a browser; gated/manual download).
# Place longmemeval_oracle.json (or _s.json) somewhere local.

# Retrieval-only (no LLM answer-gen, fastest path, no Anthropic key needed):
gbrain eval longmemeval ./longmemeval_oracle.json --limit 50 --retrieval-only \
  > /tmp/hypothesis.jsonl

# Full pipeline (Anthropic key required for answer-gen):
gbrain eval longmemeval ./longmemeval_oracle.json --limit 50 \
  > /tmp/hypothesis.jsonl

# Score with LongMemEval's published evaluate_qa.py (not bundled — needs
# OpenAI gpt-4o per their spec):
python evaluate_qa.py /tmp/hypothesis.jsonl
```

- One in-memory PGLite per benchmark run via `createBenchmarkBrain` + `withBenchmarkBrain`. Your `~/.gbrain` is never opened.
- Between questions: `TRUNCATE` over runtime-enumerated `pg_tables`, NOT a hardcoded list, so schema migrations don't silently leak data across questions (see the sketch after this list). Infrastructure tables (`sources`, `config`, `gbrain_cycle_locks`, `subagent_rate_leases`) are preserved across resets.
- Sanitization parity: re-uses `INJECTION_PATTERNS` from `src/core/think/sanitize.ts`, so adding a new injection pattern automatically covers takes AND benchmarks. One source of truth.
- Retrieved chat content is wrapped in `<chat_session id="..." date="...">` framing; the answer-gen system prompt declares the content UNTRUSTED. Same posture as `<take>` framing.
- LLM injection seam: `runEvalLongMemEval(args, {client?: ThinkLLMClient})`. Tests stub the client so the full pipeline runs hermetically without any API key.
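A minimal sketch of that reset, assuming a PGLite-style handle with an async `query` method. The preserved-table list comes from the bullet above; function and variable names are illustrative, not the actual gbrain source:

```ts
// Infrastructure tables preserved across per-question resets (per the list above).
const PRESERVED = new Set(["sources", "config", "gbrain_cycle_locks", "subagent_rate_leases"]);

// Enumerate the live schema instead of hardcoding table names, so a table added
// by a future migration can't silently leak rows across benchmark questions.
async function resetBetweenQuestions(
  db: { query: (sql: string) => Promise<{ rows: Array<Record<string, unknown>> }> },
): Promise<void> {
  const { rows } = await db.query(
    "SELECT tablename FROM pg_tables WHERE schemaname = 'public'"
  );
  const targets = rows
    .map((r) => String(r.tablename))
    .filter((t) => !PRESERVED.has(t));
  if (targets.length > 0) {
    // One statement; CASCADE clears FK-dependent rows; quote identifiers defensively.
    await db.query(`TRUNCATE ${targets.map((t) => `"${t}"`).join(", ")} CASCADE`);
  }
}
```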
| Flag | Default | Purpose |
|---|---|---|
| `--limit N` | run all | Cap question count (iterate fast) |
| `--retrieval-only` | off | Emit retrieved chunks; no LLM answer-gen |
| `--keyword-only` | off | Disable vector path (debug retrieval issues) |
| `--expansion` | off | Multi-query expansion. Off by default for determinism (no per-query Haiku call). Pass to opt in. |
| `--top-k K` | 10 | Retrieval depth |
| `--model M` | resolved | Default resolves through the `resolveModel()` 6-tier chain (`models.eval.longmemeval` config key) |
| `--output FILE` | stdout | Write hypothesis JSONL to a file instead of stdout |
p50 25.9ms / p99 30.3ms for warm reset+import+search on Apple Silicon (per the `test/eval-longmemeval.test.ts` perf gate). Per-question cost is well under the 500ms speed gate; at ~26ms each, 500 questions add roughly 13s of harness overhead on top of your retrieval and LLM latency.