fix(bench): stop silent-failure patterns from biasing LoCoMo scores #33
Closed
jaylfc wants to merge 1 commit into feat/locomo-prompt-opt from
Conversation
CodeRabbit flagged four benchmark-integrity issues on #30. All real — they let infra failures and adapter artefacts masquerade as real negative results, biasing the published numbers and hiding problems.

1. `_judge()` returned 0.0 on an Ollama timeout or network error — identical to a genuine "NO" grade. Now returns `None`; `_summary` filters `None` out of the Judge average so transport flakiness doesn't depress the score. The row still records 0.0 vs `None` distinctly.
2. Generation errors stored a synthetic `"[generation_error: ...]"` prediction and then computed F1/BLEU/Judge on it. That folded infra failures into the benchmark averages, and `failed_qa` stayed zero, so the run reported "complete" despite missing answers. Now: on generation failure, `predicted=""`; `f1`/`bleu`/`judge` are `None`; the row carries an `error` field; `_guarded` increments `failed_qa` for those rows; `_summary` excludes `None` metrics.
3. mem0_locomo_runner.py hardcoded `evidence_hits=0` in every row because mem0 2.x doesn't round-trip per-turn `dia_id`s. `_summary` then published `retrieval_recall=0.0` for every mem0 run — a fake miss. Now sets `evidence_hits` and `evidence_total` to `None` (metric unavailable, not zero), and `_summary` skips `None` rows from recall.
4. The mem0 runner inherited both the 0.0-on-failure judge and the error-folding pattern. Both paths are fixed to the same None-on-failure convention.

`_summary` now also emits `judge_scored` and `recall_scored` alongside `count` so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487 scored of 1540 total"), making infra flakiness inspectable rather than invisible.

No schema-breaking changes: existing `.rescored.json` outputs that have `evidence_hits=0` or `judge=0.0` remain readable — the new `_summary` treats them as real zeros, which is how they were when written. Only forward runs produce `None` for "metric unavailable".
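The None-on-failure convention described above can be sketched as follows; the function names, row fields, and summary keys here are illustrative, not the runner's actual code:

```python
def judge(call_model, prompt):
    """Return 1.0/0.0 for a real grade, or None when the judge call fails.

    None means "metric unavailable" and must never be conflated with a
    genuine NO grade (0.0)."""
    try:
        verdict = call_model(prompt)
    except Exception:
        return None  # timeout / network error: not a real 0.0
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0


def summarize(rows):
    """Average only rows that actually carry a judge score, and report
    the scored count next to the total so the denominator is honest."""
    scored = [r["judge"] for r in rows if r["judge"] is not None]
    return {
        "judge_avg": sum(scored) / len(scored) if scored else None,
        "judge_scored": len(scored),
        "count": len(rows),
    }
```

With this shape, a timed-out judge call lowers `judge_scored` instead of silently dragging the average toward zero.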
Code Review Summary
Status: No Issues Found | Recommendation: Merge | Files reviewed: 2
Reviewed by seed-2-0-pro-260328 · 119,581 tokens
jaylfc added a commit that referenced this pull request on Apr 19, 2026
Single source of truth for every LoCoMo number we've produced so they don't live only in chat transcripts. Captures:

- Self-judge scorecards for taosmd-e2b, taosmd-e4b, taosmd-e2b+prompt-opt, mem0-e2b (all runs 2026-04-17 to 2026-04-19)
- External qwen3:4b rescore numbers for the three taosmd variants (100% coverage, 0 errors); mem0 rescore queued
- Per-category tables, not just headlines — Temporal 0.29 vs 0.02 (14.5x) is the most dramatic architecture signal
- Known artefacts: mem0 R@K=0.0 is an adapter limitation (no dia_id pass-through), patched in PR #33
- Methodology disclosures: same generator (gemma4:e2b), same prompt, same dataset, same top-K=10, same judge (qwen3:4b), commit SHAs for every input
- Follow-up: mem0 external rescore in flight, MemPalace adapter queued — will add scorecards to this doc as they complete
jaylfc added a commit that referenced this pull request on Apr 19, 2026
Third memory architecture in the comparison harness. Same generator (gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same 1540 QAs as the taosmd and mem0 runners. Only the retrieval layer changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5 per their 2026-03 results). Our run adds the novel measurement so the three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts and generation errors return None (not 0.0) so _summary excludes them from averages. evidence_hits/evidence_total reported as None since MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace
jaylfc added a commit that referenced this pull request on Apr 19, 2026
* feat(bench): LoCoMo runner + prompt-opt + mem0 adapter (validated)

Comprehensive LoCoMo benchmark bundle rebased onto current master. Supersedes the closed #25.

## What lands

**Benchmark infrastructure** (~2000 LOC):

- benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama answer/judge, F1/BLEU/Judge/R@K per category, flags for --concurrency, --per-conv-limit, --timeout, --model)
- benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports
- benchmarks/mem0_locomo_runner.py — apples-to-apples mem0 adapter routing retrieval through mem0.search() with the same generator/prompt/judge as the taosmd runner
- pyproject.toml — adds mem0ai + chroma optional deps for the adapter

**Prompt tweaks** (validated +0.03 Overall Judge under external qwen3:4b):

- Absolute-date instruction + softened IDK in ANSWER_PROMPT
- Per-category: Temporal 0.36→0.41 (+0.05), Multi-hop 0.21→0.24 (+0.03)
- Targeted improvements landed on targeted categories; Single-hop + Open-dom unchanged (clean intervention signal)

**Methodology docs** (4 specs under docs/specs/)

## Review-feedback fixes applied

- Kilo CRITICAL: all hardcoded /home/jay/... paths in locomo_runner.py (--dataset, --onnx-path), longmemeval_runner.py, and mem0_locomo_runner.py are now repo-relative defaults derived from __file__, with LOCOMO_DATASET + TAOSMD_ONNX_PATH env-var overrides
- CodeRabbit major: eval/librarian_eval.py was reading axis_c.get("n") but eval_axis_c emits "n_sessions" — fixed with a get("n_sessions", get("n", 0)) compat shim; the undercounted n_queries no longer inflates tokens_per_query

## Test plan

- [x] All three models benchmarked against the 1540-QA LoCoMo set with 100% external-judge (qwen3:4b) coverage, 0 errors
- [x] Prompt-opt delta localized to targeted categories as designed
- [x] Python syntax check on all four modified benchmark files
- [x] No /home/jay/... paths remain in benchmarks/ or eval/

* feat(bench): mempalace_locomo_runner — apples-to-apples vs taosmd & mem0

Third memory architecture in the comparison harness. Same generator (gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same 1540 QAs as the taosmd and mem0 runners. Only the retrieval layer changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5 per their 2026-03 results). Our run adds the novel measurement so the three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts and generation errors return None (not 0.0) so _summary excludes them from averages. evidence_hits/evidence_total reported as None since MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace
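The apples-to-apples design (fixed generator, prompts, and top-K; only the retrieval layer swapped) can be sketched like this; the harness shape is hypothetical, and the commented adapter wirings assume the APIs named in the commit messages:

```python
from typing import Callable, List

# A retriever maps (question, top_k) to a list of retrieved context strings.
Retriever = Callable[[str, int], List[str]]


def run_qa(question: str, retrieve: Retriever, generate, top_k: int = 10) -> str:
    """Shared harness: generator, prompts, and top-K stay fixed across
    architectures; only the `retrieve` callable differs per runner."""
    contexts = retrieve(question, top_k)
    return generate(question, contexts)


# Hypothetical per-architecture wiring (illustrative, not verified signatures):
#   taosmd:    lambda q, k: taosmd_search(q, top_k=k)
#   mem0:      lambda q, k: [m["memory"] for m in mem0.search(q, limit=k)]
#   mempalace: lambda q, k: mempalace.searcher.search_memories(q, top_k=k)
```

Keeping everything but `retrieve` constant is what makes the three-way Judge comparison attributable to the memory architecture alone.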
Addresses four CodeRabbit MAJOR findings on #30 (posted 2026-04-19 18:15 UTC after force-push). All are real benchmark-integrity issues — they let infra flakiness and adapter artefacts masquerade as legitimate negative results, biasing the published numbers and hiding problems.
Findings addressed
- locomo_runner.py:149: `_judge` returns `None` on exception; `_summary` filters `None` from the Judge average
- locomo_runner.py:245: `predicted=""`, metrics set to `None`, row carries an `error` field, `failed_qa` incremented
- mem0_locomo_runner.py:247, mem0_locomo_runner.py:325: `evidence_hits=None`, `evidence_total=None` (metric unavailable, not zero); `_summary` skips `None` rows from recall aggregation

Bonus improvement

`_summary` now emits `judge_scored` and `recall_scored` alongside `count` so the JSON shows the denominator honestly. E.g. a run with 5 judge timeouts would show "Judge 0.41 over 1535 scored of 1540 total" — infra flakiness becomes inspectable rather than invisible.

Base = feat/locomo-prompt-opt (not master)

Targets the PR #30 branch so these fixes land together with the runner. Once both merge, master has a clean runner + clean metric aggregation.
Test plan
- [x] No `/home/jay/...` paths present (sanity on top of earlier fix)
- [x] Re-ran `_summary` logic via the existing rescore tool, so the headline number reflects the corrected methodology