
Running real-world eval benchmarks against your gbrain changes

Audience: gbrain maintainers and contributors. If you're touching retrieval (search, ranking, embeddings, intent classification, query expansion, source boost, hybrid fusion), this is the doc.

For the NDJSON wire format consumed by gbrain-evals, see eval-capture.md. This doc is the human dev loop that lives on top of that format.

Prerequisite: turn on contributor mode

Capture is off by default for production users (privacy-positive — no surprise data accumulation). Contributors flip it on with one line:

# In ~/.zshrc or ~/.bashrc:
export GBRAIN_CONTRIBUTOR_MODE=1

Verify:

gbrain query "anything" >/dev/null
psql $DATABASE_URL -c 'SELECT count(*) FROM eval_candidates'   # should be > 0

To override (force on/off regardless of env var), edit ~/.gbrain/config.json:

{"eval": {"capture": true}}    // force on
{"eval": {"capture": false}}   // force off

Explicit config beats the env var in both directions.
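The precedence is easy to express as code. A minimal sketch of the resolution order (the helper and field names are illustrative, not gbrain's actual internals, and the exact env-var truthiness check may differ):

```ts
// Sketch: explicit config wins; otherwise fall back to the contributor-mode env var.
interface GbrainConfig {
  eval?: { capture?: boolean };
}

function captureEnabled(config: GbrainConfig): boolean {
  if (config.eval?.capture !== undefined) {
    return config.eval.capture; // force on or force off, regardless of env
  }
  return Boolean(process.env.GBRAIN_CONTRIBUTOR_MODE); // contributor-mode default
}
```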

The 4-command loop

# ① Capture: writes to eval_candidates whenever CONTRIBUTOR_MODE is set.
#   Inspect what's been collected:
gbrain doctor                                     # surfaces capture failures
psql $DATABASE_URL -c 'SELECT count(*) FROM eval_candidates'

# ② Snapshot: freeze a baseline before your code change.
gbrain eval export --since 7d > baseline.ndjson

# ③ Code change: do whatever you want — tune RRF_K, swap embed model, edit
#    hybrid.ts, add a new boost source, change the intent classifier.

# ④ Replay: re-run every captured query against the current build.
gbrain eval replay --against baseline.ndjson

Output:

Replaying 247 captured queries…
  ...25/247
  ...50/247
  ...
Replayed 247 of 247 captured queries (0 skipped, 0 errored)
Mean Jaccard@k:    0.927
Top-1 stability:   91.5%
Mean latency Δ:    +14ms (current vs captured)

Top 5 regression(s):
  jaccard=0.20  captured=12  current=3   "find every reference to widget-co"
  jaccard=0.43  captured=14  current=8   "show me everything tagged for review"
  jaccard=0.50  captured=8   current=4   "what did alice say about the spec"
  ...

Three numbers tell you whether the change is safe to land:

| Metric | What it means | Healthy range |
| --- | --- | --- |
| Mean Jaccard@k | Average overlap between captured retrieved slugs and current run's slugs. 1.0 = identical sets. | ≥0.85 for "neutral" changes. <0.7 means major retrieval shift. |
| Top-1 stability | Fraction of queries whose #1 result didn't change. | ≥85% for tuning passes. <70% means top-of-funnel broke. |
| Mean latency Δ | Current minus captured. Positive = slower now. | Within ±50ms of captured. >2× anywhere = regression alarm. |
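To make the first row concrete: per query, Jaccard is |captured ∩ current| / |captured ∪ current| over the slug sets. A captured set of {a, b, c, d} replayed as {a, b, e} shares 2 slugs out of 5 distinct ones, so that query scores 0.4 and lands near the top of the regressions list.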

What it actually does

gbrain eval replay reads your NDJSON snapshot and, for each row (see the sketch after this list):

  1. Re-executes the same op (searchKeyword for tool_name='search', hybridSearch for tool_name='query') with the captured detail and expand_enabled values threaded back in.
  2. Captures the current retrieved_slugs (deduped, in result order).
  3. Computes set-Jaccard between captured and current slug sets.
  4. Records top-1 match (was the #1 result the same slug?).
  5. Records latency delta vs captured latency_ms.
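Steps 2 through 5 are plain set arithmetic per row. A minimal sketch (types and field names are illustrative, mirroring the NDJSON capture format, not the actual replay code):

```ts
// One captured row from the baseline snapshot (subset of the NDJSON fields).
interface CapturedRow {
  retrieved_slugs: string[]; // deduped, in result order
  latency_ms: number;
}

interface RowScore {
  jaccard: number;        // set overlap between captured and current slugs
  top1Match: boolean;     // did the #1 result stay the same slug?
  latencyDeltaMs: number; // positive = replay was slower than capture
}

function scoreRow(captured: CapturedRow, currentSlugs: string[], currentLatencyMs: number): RowScore {
  const a = new Set(captured.retrieved_slugs);
  const b = new Set(currentSlugs);
  const intersection = [...a].filter((slug) => b.has(slug)).length;
  const union = new Set([...a, ...b]).size;
  return {
    jaccard: union === 0 ? 1 : intersection / union,
    top1Match: captured.retrieved_slugs[0] === currentSlugs[0],
    latencyDeltaMs: currentLatencyMs - captured.latency_ms,
  };
}
```

The summary numbers in the report are just the means of these per-row scores (and, for top-1 stability, the fraction of rows where top1Match is true).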

It does NOT compute MRR or nDCG — those need ground-truth relevance labels, not a baseline comparison. For metric-against-truth eval, use gbrain eval --qrels <path> (the legacy IR-eval path, still supported). The replay tool answers a different question: "did my code change move retrieval, and which queries did it move most?"

For a third evaluation axis — public benchmark, ground-truth labels, full question-answer pipeline (not just retrieval) — gbrain eval longmemeval <dataset.jsonl> (v0.28.8) runs the LongMemEval benchmark against gbrain's hybrid retrieval. Each question gets a clean in-memory PGLite, its haystack imported, the question asked, the hypothesis emitted as JSONL — exactly the shape LongMemEval's evaluate_qa.py consumes. Your ~/.gbrain brain is never opened. See ## Public benchmarks: LongMemEval below.

Best-effort by design

Replay is not pure. Three things can drift between capture and replay:

  1. Brain state — your brain probably has more pages now than when the snapshot was taken. Unless you explicitly seed a fixed corpus, mean Jaccard will drop simply because new pages are eligible.
  2. Embedding source — if you changed OPENAI_API_KEY between capture and replay (or the embedding model rotated), vector-path results drift even with identical code.
  3. Capture granularity — captured retrieved_slugs is a deduped set; it doesn't preserve internal ranking metadata. Two runs of the same tool can return the same slug set with different internal scores — Jaccard will say 1.0, but a downstream consumer that orders by score may behave differently.

The metrics are regression alarms on real queries, not a hash check. Pair them with manual inspection of the top regressions.

Cost

Every query row in the snapshot embeds the query string via OpenAI to run the vector half of hybridSearch. Cost is identical to a normal gbrain query invocation — text-embedding-3-large at OpenAI list price, batched inside a single replay row.

If you're iterating locally and don't want to pay per change, use --limit 50 to cap rows replayed. The 50 most recent rows are usually enough to catch direction; expand for the final pre-merge run.

# Iteration mode — 50 most recent queries
gbrain eval replay --against baseline.ndjson --limit 50

# Pre-merge — full snapshot
gbrain eval replay --against baseline.ndjson --top-regressions 20

CI integration

gbrain eval replay --against baseline.ndjson --json > replay.json
jq -e '.summary.mean_jaccard >= 0.85' replay.json || exit 1
jq -e '.summary.top1_stability_rate >= 0.85' replay.json || exit 1

Stable JSON shape (schema_version: 1):

{
  "schema_version": 1,
  "summary": {
    "rows_total": 247,
    "rows_replayed": 247,
    "rows_skipped": 0,
    "rows_errored": 0,
    "mean_jaccard": 0.927,
    "top1_stability_rate": 0.915,
    "mean_latency_delta_ms": 14,
    "rows_over_2x_latency": 0
  }
}
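If jq isn't available in the CI image, the same gates can be applied with a small script over this shape. A sketch (thresholds are the healthy ranges from the table above, not a mandated policy; run it with tsx or ts-node):

```ts
// ci-gate.ts: fail the build when replay metrics fall outside the gates.
// Expects: gbrain eval replay --against baseline.ndjson --json > replay.json
import { readFileSync } from "node:fs";

const { summary } = JSON.parse(readFileSync("replay.json", "utf8"));

const failures: string[] = [];
if (summary.mean_jaccard < 0.85) failures.push(`mean_jaccard ${summary.mean_jaccard} < 0.85`);
if (summary.top1_stability_rate < 0.85) failures.push(`top1_stability_rate ${summary.top1_stability_rate} < 0.85`);
if (summary.rows_errored > 0) failures.push(`${summary.rows_errored} row(s) errored during replay`);

if (failures.length > 0) {
  console.error("Replay gate failed:\n  " + failures.join("\n  "));
  process.exit(1);
}
console.log("Replay gate passed");
```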

--verbose adds a results: [...] array with one entry per replayed row (useful for piping into jq or a notebook for deeper analysis).

When to run this

Before merging anything that touches:

  • src/core/search/hybrid.ts (RRF, fusion, dedup, two-pass retrieval)
  • src/core/search/source-boost.ts / sql-ranking.ts (per-source ranking)
  • src/core/search/intent.ts (auto-detail classification)
  • src/core/search/expansion.ts (Haiku query expansion)
  • src/core/search/dedup.ts (cross-page result collapse)
  • src/core/embedding.ts or any embedding model swap
  • src/core/operations.ts query or search op handlers (capture surface)
  • src/core/postgres-engine.ts / pglite-engine.ts searchKeyword / searchVector SQL

Skip for: schema-only migrations, doc changes, tests-only PRs, CLI ergonomics that don't touch retrieval.

Building your own corpus

If you don't have captured traffic yet (fresh install, can't dogfood for a week before merging), you can hand-author an NDJSON file:

{"schema_version":1,"id":1,"tool_name":"query","query":"who is alice","retrieved_slugs":["people/alice","people/alice-bio"],"expand_enabled":false,"detail":null,"latency_ms":0,"remote":false}
{"schema_version":1,"id":2,"tool_name":"search","query":"acme deal","retrieved_slugs":["deals/acme-seed","companies/acme"],"latency_ms":0,"remote":false}

Then run gbrain eval replay --against handcrafted.ndjson to confirm the authoritative slugs come back. This is the seam between the BrainBench-Real pipeline (replay against live captures) and the BrainBench fixed-fixture pipeline (gbrain eval --qrels with the sibling gbrain-evals corpus).
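If you'd rather generate the corpus than hand-type it, a short script can emit the same rows. A sketch (the queries and expected slugs are placeholders; the field layout mirrors the two example rows above):

```ts
// make-corpus.ts: write a hand-authored NDJSON replay corpus.
import { writeFileSync } from "node:fs";

// Each case lists a query plus the slugs you consider authoritative for it.
const cases = [
  { tool_name: "query", query: "who is alice", retrieved_slugs: ["people/alice", "people/alice-bio"], expand_enabled: false, detail: null },
  { tool_name: "search", query: "acme deal", retrieved_slugs: ["deals/acme-seed", "companies/acme"] },
];

const ndjson = cases
  .map((c, i) => JSON.stringify({ schema_version: 1, id: i + 1, latency_ms: 0, remote: false, ...c }))
  .join("\n");

writeFileSync("handcrafted.ndjson", ndjson + "\n");
```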

Off-switch

Two ways to disable capture:

unset GBRAIN_CONTRIBUTOR_MODE             # easy: just unset the env var

Or force off regardless of the env var via ~/.gbrain/config.json:

{"eval": {"capture": false}}

Existing eval_candidates rows stay until you gbrain eval prune --older-than 0d (or just drop the table).

Failure modes

| What you see | What it means |
| --- | --- |
| Mean Jaccard@k: 0.4, top regressions all in one source dir | Source boost or hard-exclude regression on that prefix |
| Top-1 stability: 30%, mean Jaccard still high | RRF tuning shifted the rank order without changing the set — re-tune rrfK |
| Mean latency Δ: +500ms, jaccard high | Vector path got slower; check embedding API or HNSW probes |
| rows_errored > 0 | One or more queries threw. Inspect first 3 in human output, or --json to see all error_message fields |
| Many skipped: empty query | Capture ran on rows where someone passed empty query — check why those were captured |

Public benchmarks: LongMemEval (v0.28.8)

gbrain eval longmemeval runs the public LongMemEval benchmark directly against gbrain's hybrid retrieval. Different evaluation axis from eval replay: public dataset with ground-truth labels, end-to-end question-answer pipeline, hermetic per-question brains.

# Download the dataset (visit the HF page in a browser; gated/manual download).
# Place longmemeval_oracle.json (or _s.json) somewhere local.

# Retrieval-only (no LLM answer-gen, fastest path, no Anthropic key needed):
gbrain eval longmemeval ./longmemeval_oracle.json --limit 50 --retrieval-only \
  > /tmp/hypothesis.jsonl

# Full pipeline (Anthropic key required for answer-gen):
gbrain eval longmemeval ./longmemeval_oracle.json --limit 50 \
  > /tmp/hypothesis.jsonl

# Score with LongMemEval's published evaluate_qa.py (not bundled — needs
# OpenAI gpt-4o per their spec):
python evaluate_qa.py /tmp/hypothesis.jsonl

Architecture (read this if you're touching the harness)

  • One in-memory PGLite per benchmark run via createBenchmarkBrain + withBenchmarkBrain. Your ~/.gbrain is never opened.
  • Between questions: TRUNCATE over runtime-enumerated pg_tables, NOT a hardcoded list — schema migrations don't silently leak data across questions (see the sketch after this list). Infrastructure tables (sources, config, gbrain_cycle_locks, subagent_rate_leases) are preserved across resets.
  • Sanitization parity: re-uses INJECTION_PATTERNS from src/core/think/sanitize.ts so adding a new injection pattern automatically covers takes AND benchmarks. One source of truth.
  • Retrieved chat content is wrapped in <chat_session id="..." date="..."> framing; the answer-gen system prompt declares the content UNTRUSTED. Same posture as <take> framing.
  • LLM injection seam: runEvalLongMemEval(args, {client?: ThinkLLMClient}). Tests stub the client so the full pipeline runs hermetically without any API key.
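A rough sketch of that reset step (assumes a node-postgres-style query interface; the constant and function names are illustrative, not the harness's actual code):

```ts
// Infrastructure tables that must survive per-question resets.
const PRESERVED_TABLES = new Set(["sources", "config", "gbrain_cycle_locks", "subagent_rate_leases"]);

// Enumerate user tables at runtime instead of hardcoding a list, so tables
// added by future schema migrations get truncated between questions too.
async function resetBetweenQuestions(db: { query: (sql: string) => Promise<{ rows: Array<{ tablename: string }> }> }): Promise<void> {
  const { rows } = await db.query("SELECT tablename FROM pg_tables WHERE schemaname = 'public'");
  const toTruncate = rows.map((r) => r.tablename).filter((t) => !PRESERVED_TABLES.has(t));
  if (toTruncate.length > 0) {
    await db.query(`TRUNCATE ${toTruncate.map((t) => `"${t}"`).join(", ")} CASCADE`);
  }
}
```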

Flags

| Flag | Default | Purpose |
| --- | --- | --- |
| --limit N | run all | Cap question count (iterate fast) |
| --retrieval-only | off | Emit retrieved chunks; no LLM answer-gen |
| --keyword-only | off | Disable vector path (debug retrieval issues) |
| --expansion | off | Multi-query expansion. Off by default for determinism (no per-query Haiku call). Pass to opt in. |
| --top-k K | 10 | Retrieval depth |
| --model M | resolved | Default resolves through resolveModel() 6-tier chain (models.eval.longmemeval config key) |
| --output FILE | stdout | Write hypothesis JSONL to file instead of stdout |

Numbers

p50 25.9ms / p99 30.3ms warm reset+import+search on Apple Silicon (per the test/eval-longmemeval.test.ts perf gate). Per-question cost well under the 500ms speed gate. 500 questions = ~13s of overhead plus your retrieval and LLM latency.