fix(bench): stop silent-failure patterns from biasing LoCoMo scores #33
Closed
jaylfc wants to merge 1 commit into feat/locomo-prompt-opt from
Conversation
CodeRabbit flagged four benchmark-integrity issues on #30. All real — they let infra failures and adapter artefacts masquerade as real negative results, biasing the published numbers and hiding problems.

1. `_judge()` returned 0.0 on an Ollama timeout or network error — identical to a genuine "NO" grade. Now returns `None`; `_summary` filters `None` out of the Judge average so transport flakiness doesn't depress the score. The row still records 0.0 vs `None` distinctly.
2. Generation errors stored a synthetic `"[generation_error: ...]"` prediction and then computed F1/BLEU/Judge on it. That folded infra failures into the benchmark averages, and `failed_qa` stayed zero, so the run reported "complete" despite missing answers. Now: on generation failure, `predicted=""`; `f1`/`bleu`/`judge` are `None`; the row carries an `error` field; `_guarded` increments `failed_qa` for those rows; `_summary` excludes `None` metrics.
3. mem0_locomo_runner.py hardcoded `evidence_hits=0` in every row because mem0 2.x doesn't round-trip per-turn `dia_id`s. `_summary` then published `retrieval_recall=0.0` for every mem0 run — a fake miss. Now sets `evidence_hits` and `evidence_total` to `None` (metric unavailable, not zero), and `_summary` skips `None` rows from recall.
4. The mem0 runner inherited both the 0.0-on-failure judge and the error-folding pattern. Both paths are fixed to the same None-on-failure convention.

`_summary` now also emits `judge_scored` and `recall_scored` alongside `count` so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487 scored of 1540 total"), making infra flakiness inspectable rather than invisible.

No schema-breaking changes: existing `.rescored.json` outputs that have `evidence_hits=0` or `judge=0.0` remain readable — the new `_summary` treats them as real zeros, which is how they were when written. Only forward runs produce `None` for "metric unavailable".
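The None-on-failure convention described above can be sketched as follows; the function names, row fields, and summary keys here are illustrative, not the runner's actual code:

```python
def judge(call_model, prompt):
    """Return 1.0/0.0 for a real grade, or None when the judge call fails.

    None means "metric unavailable" and must never be conflated with a
    genuine NO grade (0.0)."""
    try:
        verdict = call_model(prompt)
    except Exception:
        return None  # timeout / network error: not a real 0.0
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0


def summarize(rows):
    """Average only rows that actually carry a judge score, and report
    the scored count next to the total so the denominator is honest."""
    scored = [r["judge"] for r in rows if r["judge"] is not None]
    return {
        "judge_avg": sum(scored) / len(scored) if scored else None,
        "judge_scored": len(scored),
        "count": len(rows),
    }
```

With this shape, a timed-out judge call lowers `judge_scored` instead of silently dragging the average toward zero.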
Code Review Summary
Status: No Issues Found | Recommendation: Merge | Files reviewed: 2
Reviewed by seed-2-0-pro-260328 · 119,581 tokens
jaylfc added a commit that referenced this pull request on Apr 19, 2026
Single source of truth for every LoCoMo number we've produced so they don't live only in chat transcripts. Captures:

- Self-judge scorecards for taosmd-e2b, taosmd-e4b, taosmd-e2b+prompt-opt, mem0-e2b (all runs 2026-04-17 to 2026-04-19)
- External qwen3:4b rescore numbers for the three taosmd variants (100% coverage, 0 errors); mem0 rescore queued
- Per-category tables, not just headlines — Temporal 0.29 vs 0.02 (14.5x) is the most dramatic architecture signal
- Known artefacts: mem0 R@K=0.0 is an adapter limitation (no dia_id pass-through), patched in PR #33
- Methodology disclosures: same generator (gemma4:e2b), same prompt, same dataset, same top-K=10, same judge (qwen3:4b), commit SHAs for every input
- Follow-up: mem0 external rescore in flight, MemPalace adapter queued — will add scorecards to this doc as they complete
jaylfc added a commit that referenced this pull request on Apr 19, 2026
Third memory architecture in the comparison harness. Same generator (gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same 1540 QAs as the taosmd and mem0 runners. Only the retrieval layer changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5 per their 2026-03 results). Our run adds the novel measurement so the three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts and generation errors return None (not 0.0) so _summary excludes them from averages. evidence_hits/evidence_total reported as None since MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace
jaylfc added a commit that referenced this pull request on Apr 19, 2026
* feat(bench): LoCoMo runner + prompt-opt + mem0 adapter (validated)

Comprehensive LoCoMo benchmark bundle rebased onto current master. Supersedes the closed #25.

## What lands

**Benchmark infrastructure** (~2000 LOC):

- benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama answer/judge, F1/BLEU/Judge/R@K per category, flags for --concurrency, --per-conv-limit, --timeout, --model)
- benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports
- benchmarks/mem0_locomo_runner.py — apples-to-apples mem0 adapter routing retrieval through mem0.search() with the same generator/prompt/judge as the taosmd runner
- pyproject.toml — adds mem0ai + chroma optional deps for the adapter

**Prompt tweaks** (validated +0.03 Overall Judge under external qwen3:4b):

- Absolute-date instruction + softened IDK in ANSWER_PROMPT
- Per-category: Temporal 0.36→0.41 (+0.05), Multi-hop 0.21→0.24 (+0.03)
- Targeted improvements landed on targeted categories; Single-hop + Open-dom unchanged (clean intervention signal)

**Methodology docs** (4 specs under docs/specs/)

## Review-feedback fixes applied

- Kilo CRITICAL: all hardcoded /home/jay/... paths in locomo_runner.py (--dataset, --onnx-path), longmemeval_runner.py, and mem0_locomo_runner.py are now repo-relative defaults derived from __file__, with LOCOMO_DATASET + TAOSMD_ONNX_PATH env-var overrides
- CodeRabbit major: eval/librarian_eval.py was reading axis_c.get("n") but eval_axis_c emits "n_sessions" — fixed with a get("n_sessions", get("n", 0)) compat shim; the undercounted n_queries no longer inflates tokens_per_query

## Test plan

- [x] All three models benchmarked against the 1540-QA LoCoMo set with 100% external-judge (qwen3:4b) coverage, 0 errors
- [x] Prompt-opt delta localized to targeted categories as designed
- [x] Python syntax check on all four modified benchmark files
- [x] No /home/jay/... paths remain in benchmarks/ or eval/

* feat(bench): mempalace_locomo_runner — apples-to-apples vs taosmd & mem0

Third memory architecture in the comparison harness. Same generator (gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same 1540 QAs as the taosmd and mem0 runners. Only the retrieval layer changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5 per their 2026-03 results). Our run adds the novel measurement so the three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts and generation errors return None (not 0.0) so _summary excludes them from averages. evidence_hits/evidence_total reported as None since MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace
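The apples-to-apples design (fixed generator, prompts, and top-K; only the retrieval layer swapped) can be sketched like this; the harness shape is hypothetical, and the commented adapter wirings assume the APIs named in the commit messages:

```python
from typing import Callable, List

# A retriever maps (question, top_k) to a list of retrieved context strings.
Retriever = Callable[[str, int], List[str]]


def run_qa(question: str, retrieve: Retriever, generate, top_k: int = 10) -> str:
    """Shared harness: generator, prompts, and top-K stay fixed across
    architectures; only the `retrieve` callable differs per runner."""
    contexts = retrieve(question, top_k)
    return generate(question, contexts)


# Hypothetical per-architecture wiring (illustrative, not verified signatures):
#   taosmd:    lambda q, k: taosmd_search(q, top_k=k)
#   mem0:      lambda q, k: [m["memory"] for m in mem0.search(q, limit=k)]
#   mempalace: lambda q, k: mempalace.searcher.search_memories(q, top_k=k)
```

Keeping everything but `retrieve` constant is what makes the three-way Judge comparison attributable to the memory architecture alone.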
Addresses four CodeRabbit MAJOR findings on #30 (posted 2026-04-19 18:15 UTC after force-push). All are real benchmark-integrity issues — they let infra flakiness and adapter artefacts masquerade as legitimate negative results, biasing the published numbers and hiding problems.
Findings addressed
- locomo_runner.py:149: `_judge` returns `None` on exception; `_summary` filters `None` from the Judge average
- locomo_runner.py:245: `predicted=""`, metrics set to `None`, row carries an `error` field, `failed_qa` incremented
- mem0_locomo_runner.py:247, mem0_locomo_runner.py:325: `evidence_hits=None`, `evidence_total=None` (metric unavailable, not zero); `_summary` skips `None` rows from recall aggregation

Bonus improvement

`_summary` now emits `judge_scored` and `recall_scored` alongside `count` so the JSON shows the denominator honestly. E.g. a run with 5 judge timeouts would show "Judge 0.41 over 1535 scored of 1540 total" — infra flakiness becomes inspectable rather than invisible.

Base = feat/locomo-prompt-opt (not master)

Targets the PR #30 branch so these fixes land together with the runner. Once both merge, master has a clean runner + clean metric aggregation.
Test plan
- [x] No `/home/jay/...` paths present (sanity on top of earlier fix)
- [x] Re-ran `_summary` logic via the existing rescore tool, so the headline number reflects the corrected methodology