
fix(bench): stop silent-failure patterns from biasing LoCoMo scores#33

Closed
jaylfc wants to merge 1 commit into feat/locomo-prompt-opt from fix/locomo-silent-failures

Conversation

@jaylfc jaylfc commented Apr 19, 2026

Addresses four CodeRabbit MAJOR findings on #30 (posted 2026-04-19 18:15 UTC after force-push). All are real benchmark-integrity issues — they let infra flakiness and adapter artefacts masquerade as legitimate negative results, biasing the published numbers and hiding problems.

Findings addressed

| Finding | File:Line | Fix |
| --- | --- | --- |
| Judge transport failures scored as NO | locomo_runner.py:149 | `_judge` returns None on exception; `_summary` filters None from the Judge average |
| Generation errors folded into scores | locomo_runner.py:245 | On gen failure: `predicted=""`, metrics None, row carries an `error` field, `failed_qa` incremented |
| Same gen-error pattern in mem0 adapter | mem0_locomo_runner.py:247 | Same fix — gen failure → None metrics + `error` field |
| mem0 R@K always 0.0 | mem0_locomo_runner.py:325 | Set `evidence_hits=None`, `evidence_total=None` (metric unavailable, not zero); `_summary` skips None rows from recall aggregation |

Bonus improvement

_summary now emits judge_scored and recall_scored alongside count so the JSON shows the denominator honestly. E.g. a run with 5 judge timeouts would show "Judge 0.41 over 1535 scored of 1540 total" — infra flakiness becomes inspectable rather than invisible.
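The honest-denominator output described above can be sketched as follows (field names match the description; the helper itself is illustrative, not the repo's `_summary`):

```python
def summarize_judge(rows):
    """Emit the Judge average plus explicit denominators for the JSON summary."""
    scored = [r["judge"] for r in rows if r.get("judge") is not None]
    return {
        "count": len(rows),              # every QA row in the run
        "judge_scored": len(scored),     # rows the judge actually graded
        "judge": round(sum(scored) / len(scored), 2) if scored else None,
    }
```

Readers of the JSON can then see at a glance how many rows backed the headline number.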

Base = feat/locomo-prompt-opt (not master)

Targets the PR #30 branch so these fixes land together with the runner. Once both merge, master has a clean runner + clean metric aggregation.

Test plan

  • Python syntax check on both patched files
  • No /home/jay/... paths present (sanity on top of earlier fix)
  • Post-merge: the mem0 benchmark currently running on Fedora (1000/1540 QAs at commit time) will complete against the OLD adapter; we'll rescore its output with the fixed _summary logic via the existing rescore tool so the headline number reflects the corrected methodology

CodeRabbit flagged four benchmark-integrity issues on #30. All real —
they let infra failures and adapter artefacts masquerade as real
negative results, biasing the published numbers and hiding problems.

1. _judge() returned 0.0 on Ollama timeout / network error — identical
   to a genuine "NO" grade. Now returns None; _summary filters None
   out of the Judge average so transport flakiness doesn't depress
   the score. The row still records 0.0 vs None distinctly.

2. Generation errors stored a synthetic "[generation_error: ...]"
   prediction and then computed F1/BLEU/Judge on it. That folded
   infra failures into the benchmark averages, and failed_qa stayed
   zero so the run reported "complete" despite missing answers.
   Now: on gen failure, predicted=""; f1/bleu/judge=None; row carries
   an `error` field; _guarded increments failed_qa for those rows;
   _summary excludes None metrics.

3. mem0_locomo_runner.py hardcoded evidence_hits=0 in every row
   because mem0 2.x doesn't round-trip per-turn dia_ids. _summary
   then published retrieval_recall=0.0 for every mem0 run — a fake
   miss. Now sets evidence_hits and evidence_total to None (metric
   unavailable, not zero), and _summary skips None rows from recall.

4. mem0 runner inherited both the 0.0-on-failure judge and the error-
   folding pattern. Both paths fixed to the same None-on-failure
   convention.
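Items 2 and 4 above share one pattern: a generation failure produces an empty prediction, None metrics, and an explicit `error` field, and failed rows are counted rather than hidden. A minimal sketch of that row-building convention (names are illustrative, not the runners' exact code):

```python
def run_qa(qa, generate_fn, score_fn):
    """Build one result row. On generation failure: empty prediction,
    None metrics, and an explicit error field instead of a synthetic answer."""
    try:
        predicted = generate_fn(qa["question"])
        return {"predicted": predicted,
                "f1": score_fn(predicted, qa["answer"]),
                "error": None}
    except Exception as exc:
        return {"predicted": "", "f1": None,
                "error": f"generation_error: {exc}"}


def failed_qa_count(rows):
    """Rows carrying an error count as failed, so a run can't report
    'complete' while silently missing answers."""
    return sum(1 for r in rows if r["error"] is not None)
```

Because failed rows carry None metrics, the same None-filtering aggregation used for judge timeouts keeps them out of F1/BLEU/Judge averages.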

_summary now also emits judge_scored + recall_scored alongside count
so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487
scored of 1540 total"), making infra flakiness inspectable rather
than invisible.

No schema-breaking changes: existing .rescored.json outputs that have
evidence_hits=0 or judge=0.0 remain readable — the new _summary treats
them as real zeros, which is how they were when written. Only forward
runs produce None for "metric unavailable".

coderabbitai bot commented Apr 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ccd1c336-b00d-4579-a2d3-e8f1ee0a1779



kilo-code-bot bot commented Apr 19, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (2 files)
  • benchmarks/locomo_runner.py
  • benchmarks/mem0_locomo_runner.py

Reviewed by seed-2-0-pro-260328 · 119,581 tokens

jaylfc added a commit that referenced this pull request Apr 19, 2026
Single source of truth for every LoCoMo number we've produced so they
don't live only in chat transcripts. Captures:

- Self-judge scorecards for taosmd-e2b, taosmd-e4b, taosmd-e2b+prompt-opt,
  mem0-e2b (all runs 2026-04-17 to 2026-04-19)
- External qwen3:4b rescore numbers for the three taosmd variants
  (100% coverage, 0 errors). mem0 rescore queued.
- Per-category tables, not just headlines — Temporal 0.29 vs 0.02
  (14.5x) is the most dramatic architecture signal
- Known artefacts: mem0 R@K=0.0 is an adapter limitation (no dia_id
  pass-through), patched in PR #33
- Methodology disclosures: same generator (gemma4:e2b), same prompt,
  same dataset, same top-K=10, same judge (qwen3:4b), commit SHAs for
  every input
- Follow-up: mem0 external rescore in flight, MemPalace adapter queued
  — will add scorecards to this doc as they complete
jaylfc added a commit that referenced this pull request Apr 19, 2026
Third memory architecture in the comparison harness. Same generator
(gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same
1540 QAs as the taosmd and mem0 runners. Only the retrieval layer
changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own
benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5
per their 2026-03 results). Our run adds the novel measurement so the
three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts
and generation errors return None (not 0.0) so _summary excludes them
from averages. evidence_hits/evidence_total reported as None since
MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace
jaylfc added a commit that referenced this pull request Apr 19, 2026
* feat(bench): LoCoMo runner + prompt-opt + mem0 adapter (validated)

Comprehensive LoCoMo benchmark bundle rebased onto current master.
Supersedes the closed #25.

## What lands

**Benchmark infrastructure** (~2000 LOC):
- benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama
  answer/judge, F1/BLEU/Judge/R@K per category, flags for --concurrency,
  --per-conv-limit, --timeout, --model)
- benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports
- benchmarks/mem0_locomo_runner.py — apples-to-apples mem0 adapter
  routing retrieval through mem0.search() with same generator/prompt/
  judge as the taosmd runner
- pyproject.toml — adds mem0ai + chroma optional deps for the adapter

**Prompt tweaks** (validated +0.03 Overall Judge under external qwen3:4b):
- Absolute-date instruction + softened IDK in ANSWER_PROMPT
- Per-category: Temporal 0.36→0.41 (+0.05), Multi-hop 0.21→0.24 (+0.03)
- Targeted improvements landed on targeted categories; Single-hop +
  Open-dom unchanged (clean intervention signal)

**Methodology docs** (4 specs under docs/specs/)

## Review-feedback fixes applied

- Kilo CRITICAL: all hardcoded /home/jay/... paths in
  locomo_runner.py (--dataset, --onnx-path), longmemeval_runner.py,
  mem0_locomo_runner.py are now repo-relative defaults derived from
  __file__, with LOCOMO_DATASET + TAOSMD_ONNX_PATH env-var overrides
- CodeRabbit major: eval/librarian_eval.py was reading axis_c.get("n")
  but eval_axis_c emits "n_sessions" — fixed with get("n_sessions",
  get("n", 0)) compat shim, undercounted n_queries no longer
  inflates tokens_per_query

## Test plan
- [x] All three models benchmarked against 1540-QA LoCoMo set with
  100% external-judge (qwen3:4b) coverage, 0 errors
- [x] Prompt-opt delta localized to targeted categories as designed
- [x] Python syntax check on all four modified benchmark files
- [x] No /home/jay/... paths remain in benchmarks/ or eval/

* feat(bench): mempalace_locomo_runner — apples-to-apples vs taosmd & mem0

Third memory architecture in the comparison harness. Same generator
(gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same
1540 QAs as the taosmd and mem0 runners. Only the retrieval layer
changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own
benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5
per their 2026-03 results). Our run adds the novel measurement so the
three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts
and generation errors return None (not 0.0) so _summary excludes them
from averages. evidence_hits/evidence_total reported as None since
MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace
@jaylfc jaylfc deleted the branch feat/locomo-prompt-opt April 19, 2026 20:48
@jaylfc jaylfc closed this Apr 19, 2026
