
fix(bench): stop silent-failure patterns from biasing LoCoMo scores#35

Open
jaylfc wants to merge 1 commit into master from fix/locomo-silent-failures-v2

Conversation

@jaylfc
Owner

@jaylfc jaylfc commented Apr 19, 2026

Re-open of #33 after its original base branch (feat/locomo-prompt-opt) was deleted by PR #30's merge. Same content, retargeted at master.

Addresses four CodeRabbit MAJOR findings that were posted on #30 (now merged) — they flagged silent-failure patterns in the runner + mem0 adapter that let infra flakiness and adapter artefacts masquerade as legitimate negative results.

Findings addressed

| File:Line | Issue | Fix |
|---|---|---|
| `locomo_runner.py:_judge` | Ollama timeout scored as "NO" | Returns None; `_summary` excludes it from the Judge average |
| `locomo_runner.py` (gen error) | Synthetic `[generation_error: ...]` folded into F1/BLEU/Judge; `failed_qa` stayed 0 | On failure: `predicted=""`, metrics None, row carries an `error` field, `failed_qa` incremented |
| `mem0_locomo_runner.py` | Same gen-error folding pattern | Same fix |
| `mem0_locomo_runner.py` (R@K=0.0) | `evidence_hits=0`, `evidence_total=len(evidence)` hardcoded because mem0 doesn't round-trip dia_id → published a fake 0.0 | Set both to None (metric unavailable); `_summary` skips them from recall aggregation |

Bonus

_summary now emits judge_scored + recall_scored alongside count so the denominator is visible (e.g. "Judge 0.41 over 1535 scored of 1540 total").
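The skip-None-and-report-the-denominator behaviour can be sketched roughly like this (the row dict shape, key names, and the `summarize` function itself are assumptions for illustration, not the PR's actual `_summary`):

```python
def summarize(rows):
    """Aggregate benchmark rows, treating None as 'metric unavailable'.

    None values (judge transport failures, missing evidence metadata) are
    excluded from the averages, and the counts of rows actually scored are
    reported alongside the total so the denominator is visible.
    """
    judge_vals = [r["judge"] for r in rows if r.get("judge") is not None]
    recall_rows = [r for r in rows if r.get("evidence_total") is not None]

    summary = {
        "count": len(rows),
        "judge_scored": len(judge_vals),    # denominator for the Judge average
        "recall_scored": len(recall_rows),  # denominator for retrieval recall
    }
    if judge_vals:
        summary["judge"] = round(sum(judge_vals) / len(judge_vals), 4)
    if recall_rows:
        hits = sum(r["evidence_hits"] for r in recall_rows)
        total = sum(r["evidence_total"] for r in recall_rows)
        summary["retrieval_recall"] = round(hits / total, 4) if total else None
    return summary
```

With this shape, a run where 5 of 1540 judge calls timed out reports `judge_scored: 1535` next to `count: 1540` instead of silently averaging zeros in.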

Test plan

  • Python syntax check on both patched files
  • Existing tests still pass (pytest tests/ -q --ignore=tests/integration)
  • Current mem0 rescore output can be re-aggregated with fixed _summary — no need to re-run end-to-end

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved error handling to distinguish between answer-generation failures and evaluation failures.
    • Transport/HTTP failures now properly tracked and reported.
  • Refactor

    • Updated metric aggregation to exclude failed evaluations from calculations for more accurate reporting.
    • Enhanced error reporting with exception type and message details in benchmark results.
    • Added tracking fields to indicate count of successfully scored metrics.

CodeRabbit flagged four benchmark-integrity issues on #30. All real —
they let infra failures and adapter artefacts masquerade as real
negative results, biasing the published numbers and hiding problems.

1. _judge() returned 0.0 on Ollama timeout / network error — identical
   to a genuine "NO" grade. Now returns None; _summary filters None
   out of the Judge average so transport flakiness doesn't depress
   the score. The row still records 0.0 vs None distinctly.

2. Generation errors stored a synthetic "[generation_error: ...]"
   prediction and then computed F1/BLEU/Judge on it. That folded
   infra failures into the benchmark averages, and failed_qa stayed
   zero so the run reported "complete" despite missing answers.
   Now: on gen failure, predicted=""; f1/bleu/judge=None; row carries
   an `error` field; _guarded increments failed_qa for those rows;
   _summary excludes None metrics.

3. mem0_locomo_runner.py hardcoded evidence_hits=0 in every row
   because mem0 2.x doesn't round-trip per-turn dia_ids. _summary
   then published retrieval_recall=0.0 for every mem0 run — a fake
   miss. Now sets evidence_hits and evidence_total to None (metric
   unavailable, not zero), and _summary skips None rows from recall.

4. mem0 runner inherited both the 0.0-on-failure judge and the error-
   folding pattern. Both paths fixed to the same None-on-failure
   convention.
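The None-on-failure judge convention from points 1 and 4 can be sketched as follows (the wrapper shape, the injected `call_judge` callable, and the grade parsing are assumptions for illustration, not the actual `_judge` implementation):

```python
def judge_with_none_on_failure(call_judge, question, reference, predicted):
    """Return 1.0/0.0 from the LLM judge, or None when the call itself fails.

    `call_judge` is any callable that performs the HTTP round-trip and
    returns the raw grade string. A transport failure must not masquerade
    as a genuine "NO" grade, so it yields None (metric unavailable)
    instead of 0.0.
    """
    try:
        raw = call_judge(question, reference, predicted)
    except Exception:  # timeout, connection refused, HTTP 5xx, ...
        return None    # excluded from the Judge average by the summary step
    return 1.0 if raw.strip().upper().startswith("YES") else 0.0
```

The row written downstream then carries either a real 0.0 (the judge answered "NO") or None (the judge was unreachable), and the two are never conflated.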

_summary now also emits judge_scored + recall_scored alongside count
so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487
scored of 1540 total"), making infra flakiness inspectable rather
than invisible.

No schema-breaking changes: existing .rescored.json outputs that have
evidence_hits=0 or judge=0.0 remain readable — the new _summary treats
them as real zeros, which is how they were when written. Only forward
runs produce None for "metric unavailable".
@coderabbitai

coderabbitai bot commented Apr 19, 2026

📝 Walkthrough


Both benchmark runners updated to handle errors more explicitly: _judge now returns None for failures instead of 0.0, generation failures record None metrics with error details, and summary aggregation excludes None values while tracking non-None result counts.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Locomo Runner Error Handling (`benchmarks/locomo_runner.py`) | `_judge` return type changed to `float \| None` to propagate failures; `_process_qa` now sets metrics to None on generation errors and adds an optional "error" field; `_summary` excludes None values from aggregation and adds `judge_scored` and `recall_scored` counters; `_guarded` increments `failed_qa` when errors occur. |
| Mem0 Locomo Runner Error Handling (`benchmarks/mem0_locomo_runner.py`) | Added `EVIDENCE_UNAVAILABLE` constant; changed failed retrieval/generation metrics from 0.0 to None; refactored control flow to defer row construction on generation errors; returned rows now include an optional "error" field and always apply evidence unavailability. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 When errors crept in wearing zeros' disguise,
We whispered None to catch their true cries—
No more masked failures in metrics' dark night,
Each benchmark now glows with transparent light! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: fixing silent-failure patterns in LoCoMo benchmark scoring that were causing bias in results. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@kilo-code-bot

kilo-code-bot bot commented Apr 19, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
|---|---|
| WARNING | 3 |

Issue details

| File | Line | Issue |
|---|---|---|
| benchmarks/mem0_locomo_runner.py | 137 | `_judge` return type is still annotated as `float`, not `float \| None` |
| benchmarks/mem0_locomo_runner.py | 145 | `_judge` exception handler still returns 0.0 instead of None - will bias judge scores downward on infra failures |
| benchmarks/mem0_locomo_runner.py | 424 | mem0 runner does not increment the `failed_qa` counter for rows with an `error` field - failed count will be incorrect in the output JSON |
Files Reviewed (2 files)
  • benchmarks/locomo_runner.py - no issues
  • benchmarks/mem0_locomo_runner.py - 3 issues



Reviewed by seed-2-0-pro-260328 · 207,083 tokens


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/mem0_locomo_runner.py (1)

318-330: ⚠️ Potential issue | 🟠 Major

failed_qa is still undercounted for returned error rows.

These paths now annotate the row with error, but _guarded() only increments failed_qa when _process_qa_mem0() throws. Retrieval/generation failures that return a row will still show up as successes in meta.failed_qa.

Also applies to: 371-372

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 318 - 330, The returned error
rows (e.g., the dict built in _process_qa_mem0() that contains an "error" key)
are not causing meta.failed_qa to increment because _guarded() only counts
failures when _process_qa_mem0() raises; update _guarded() to treat a non-empty
"error" field in the returned result as a failure: after calling result =
_process_qa_mem0(...) check if isinstance(result, dict) and result.get("error")
is truthy, and if so increment meta.failed_qa and append the result to
meta.failed_examples (same behavior as the exception path); apply the same
change to the other analogous call sites noted around lines 371-372 so
returned-error rows are counted consistently.
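A minimal sketch of the counting fix this prompt describes, assuming a simplified synchronous `_guarded` wrapper and a `meta` dict (the names come from the review text; the actual code is async and richer):

```python
def guarded(process_qa, meta, *args):
    """Run one QA item, counting both raised and returned failures."""
    try:
        row = process_qa(*args)
    except Exception as exc:
        # Exceptions were already counted; keep that path.
        meta["failed_qa"] += 1
        return {"error": f"{type(exc).__name__}: {exc}"}
    # A row that returned normally but carries an error annotation is a
    # failure too; counting only exceptions undercounts failed_qa.
    if isinstance(row, dict) and row.get("error"):
        meta["failed_qa"] += 1
    return row
```

Without the second check, retrieval/generation failures that return an annotated row would still show up as successes in `meta.failed_qa`, which is exactly the undercount the comment flags.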

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3096a247-78cf-4301-8f70-e2f20efe1a2a

📥 Commits

Reviewing files that changed from the base of the PR and between 7d2c780 and bc0a773.

📒 Files selected for processing (2)
  • benchmarks/locomo_runner.py
  • benchmarks/mem0_locomo_runner.py

Comment on lines +303 to +309
# mem0 stored facts don't carry original LoCoMo dia_id metadata, so we
# cannot match against the gold evidence list. Report both hits and total
# as None — the metric is _unavailable_, not zero. _summary in the taosmd
# runner (and any downstream scorecard builder) skips None-valued rows
# from retrieval_recall so this artefact doesn't publish a fake 0.0.
# TODO: wire dia_id pass-through if mem0 adds metadata preservation.
EVIDENCE_UNAVAILABLE = {"evidence_hits": None, "evidence_total": None}


⚠️ Potential issue | 🔴 Critical

These None sentinels currently break _summary() in this file.

Every returned row now carries evidence_total=None, but the local _summary() still does r.get("evidence_total", 0) > 0, so aggregation will raise TypeError on the first mem0 result. The new f1/bleu1/judge=None paths also conflict with its raw sum(...) / n averages. Please port the benchmarks/locomo_runner.py::_summary filtering/counting logic here before merging.

Also applies to: 318-330, 348-369

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 303 - 309, The _summary
aggregation in this file is failing because rows now contain None sentinels
(EVIDENCE_UNAVAILABLE) for evidence fields and None for f1/bleu1/judge, so
change the _summary implementation to mirror the safe filtering/counting used in
benchmarks/locomo_runner.py: skip rows where evidence_total is None or where
metric values are None before summing/averaging, treat None as "unavailable" not
zero, and compute counts using explicit filters (e.g., only include rows with
r.get("evidence_total") and r.get("f1") is not None) so sums/averages use
numeric values only; update all places referencing evidence_total/evidence_hits
and f1/bleu1/judge in _summary to use this guarded logic and preserve
EVIDENCE_UNAVAILABLE behavior.

Comment on lines +348 to +352
if generation_error is None:
    judge = await _judge(client, ollama_url, model, question, reference, predicted)
    f1_val = round(_f1(predicted, reference), 4)
    bleu_val = round(_bleu1(predicted, reference), 4)
    judge_val = round(judge, 4) if judge is not None else None


⚠️ Potential issue | 🟠 Major

Judge transport failures are still treated as wrong answers here.

This branch expects _judge() to return None, but the local implementation still returns 0.0 on exceptions. As written, mem0 judge outages will continue to depress the Judge average instead of being excluded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 348 - 352, The code treats
judge outages as valid zeros because _judge currently returns 0.0 on exceptions;
update the implementation so transport/remote failures return None (not 0.0),
and keep this call site logic (the variable judge and the rounding into
judge_val) intact; specifically modify the _judge(...) function to catch
transport/HTTP errors and return None on those failure paths (rather than 0.0)
so judge_val becomes None and outages are excluded from averages.

jaylfc added a commit that referenced this pull request Apr 19, 2026
Captures every model actually used during the benchmark (generator
variants, external judge, embedders, cross-encoder, fact extractor) with
params, quant, VRAM footprint, and backend. Adds the runtime/host row so
anyone reproducing knows the Ollama parallel limit and rescore timeout.

Derives hardware-tier recommendations from what we measured:
- Orange Pi (RK3588 NPU, 16 GB): qwen3:4b gen on rkllama, external judge,
  MiniLM ONNX embed, taosmd arch
- Fedora 3060 (12 GB VRAM): gemma4:e2b gen, qwen3:4b judge co-resident,
  prompt-opt on by default
- Laptop / Mac Mini: qwen3:4b gen via Ollama, external judge
- High-end (≥24 GB): qwen3.5:9b gen viable; e2b still competitive

Documents the seven lessons that drive the defaults: bigger-gen-≠-better
at small scale, qwen for structured output, NUM_PARALLEL is the real
ceiling, nomic context forces batching, architecture dominates
generator choice, self-judge inflates, R@K needs dia_id round-trip.

Also corrects the Commits row: superseded SHAs (ca0ccb7571d8af for
mempalace) and references the right open PRs (#34, #35, #36).
