
fix(bench): stop silent-failure patterns from biasing LoCoMo scores#35

Open
jaylfc wants to merge 1 commit into master from fix/locomo-silent-failures-v2

Conversation

@jaylfc
Owner

@jaylfc jaylfc commented Apr 19, 2026

Re-open of #33 after its original base branch (feat/locomo-prompt-opt) was deleted by PR #30's merge. Same content, retargeted at master.

Addresses four CodeRabbit MAJOR findings that were posted on #30 (now merged) — they flagged silent-failure patterns in the runner + mem0 adapter that let infra flakiness and adapter artefacts masquerade as legitimate negative results.

Findings addressed

| File:Line | Issue | Fix |
|---|---|---|
| `locomo_runner.py:_judge` | Ollama timeout scored as "NO" | Returns None; `_summary` excludes it from the Judge average |
| `locomo_runner.py` (gen error) | Synthetic `[generation_error: ...]` folded into F1/BLEU/Judge; `failed_qa` stayed 0 | On failure: `predicted=""`, metrics None, row carries an `error` field, `failed_qa` incremented |
| `mem0_locomo_runner.py` | Same gen-error folding pattern | Same fix |
| `mem0_locomo_runner.py` (R@K=0.0) | `evidence_hits=0`, `evidence_total=len(evidence)` hardcoded because mem0 doesn't round-trip dia_id → published a fake 0.0 | Set both to None (metric unavailable); `_summary` skips them from recall aggregation |

Bonus

_summary now emits judge_scored + recall_scored alongside count so the denominator is visible (e.g. "Judge 0.41 over 1535 scored of 1540 total").
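The skip-None-and-report-the-denominator behaviour can be sketched roughly like this (the row dict shape, key names, and the `summarize` function itself are assumptions for illustration, not the PR's actual `_summary`):

```python
def summarize(rows):
    """Aggregate benchmark rows, treating None as 'metric unavailable'.

    None values (judge transport failures, missing evidence metadata) are
    excluded from the averages, and the counts of rows actually scored are
    reported alongside the total so the denominator is visible.
    """
    judge_vals = [r["judge"] for r in rows if r.get("judge") is not None]
    recall_rows = [r for r in rows if r.get("evidence_total") is not None]

    summary = {
        "count": len(rows),
        "judge_scored": len(judge_vals),    # denominator for the Judge average
        "recall_scored": len(recall_rows),  # denominator for retrieval recall
    }
    if judge_vals:
        summary["judge"] = round(sum(judge_vals) / len(judge_vals), 4)
    if recall_rows:
        hits = sum(r["evidence_hits"] for r in recall_rows)
        total = sum(r["evidence_total"] for r in recall_rows)
        summary["retrieval_recall"] = round(hits / total, 4) if total else None
    return summary
```

With this shape, a run where 5 of 1540 judge calls timed out reports `judge_scored: 1535` next to `count: 1540` instead of silently averaging zeros in.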

Test plan

  • Python syntax check on both patched files
  • Existing tests still pass (pytest tests/ -q --ignore=tests/integration)
  • Current mem0 rescore output can be re-aggregated with fixed _summary — no need to re-run end-to-end

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved error handling to distinguish between answer-generation failures and evaluation failures.
    • Transport/HTTP failures now properly tracked and reported.
  • Refactor

    • Updated metric aggregation to exclude failed evaluations from calculations for more accurate reporting.
    • Enhanced error reporting with exception type and message details in benchmark results.
    • Added tracking fields to indicate count of successfully scored metrics.

CodeRabbit flagged four benchmark-integrity issues on #30. All real —
they let infra failures and adapter artefacts masquerade as real
negative results, biasing the published numbers and hiding problems.

1. _judge() returned 0.0 on Ollama timeout / network error — identical
   to a genuine "NO" grade. Now returns None; _summary filters None
   out of the Judge average so transport flakiness doesn't depress
   the score. The row still records 0.0 vs None distinctly.

2. Generation errors stored a synthetic "[generation_error: ...]"
   prediction and then computed F1/BLEU/Judge on it. That folded
   infra failures into the benchmark averages, and failed_qa stayed
   zero so the run reported "complete" despite missing answers.
   Now: on gen failure, predicted=""; f1/bleu/judge=None; row carries
   an `error` field; _guarded increments failed_qa for those rows;
   _summary excludes None metrics.

3. mem0_locomo_runner.py hardcoded evidence_hits=0 in every row
   because mem0 2.x doesn't round-trip per-turn dia_ids. _summary
   then published retrieval_recall=0.0 for every mem0 run — a fake
   miss. Now sets evidence_hits and evidence_total to None (metric
   unavailable, not zero), and _summary skips None rows from recall.

4. mem0 runner inherited both the 0.0-on-failure judge and the error-
   folding pattern. Both paths fixed to the same None-on-failure
   convention.
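The None-on-failure judge convention from points 1 and 4 can be sketched as follows (the wrapper shape, the injected `call_judge` callable, and the grade parsing are assumptions for illustration, not the actual `_judge` implementation):

```python
def judge_with_none_on_failure(call_judge, question, reference, predicted):
    """Return 1.0/0.0 from the LLM judge, or None when the call itself fails.

    `call_judge` is any callable that performs the HTTP round-trip and
    returns the raw grade string. A transport failure must not masquerade
    as a genuine "NO" grade, so it yields None (metric unavailable)
    instead of 0.0.
    """
    try:
        raw = call_judge(question, reference, predicted)
    except Exception:  # timeout, connection refused, HTTP 5xx, ...
        return None    # excluded from the Judge average by the summary step
    return 1.0 if raw.strip().upper().startswith("YES") else 0.0
```

The row written downstream then carries either a real 0.0 (the judge answered "NO") or None (the judge was unreachable), and the two are never conflated.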

_summary now also emits judge_scored + recall_scored alongside count
so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487
scored of 1540 total"), making infra flakiness inspectable rather
than invisible.

No schema-breaking changes: existing .rescored.json outputs that have
evidence_hits=0 or judge=0.0 remain readable — the new _summary treats
them as real zeros, which is how they were when written. Only forward
runs produce None for "metric unavailable".
@coderabbitai

coderabbitai bot commented Apr 19, 2026

📝 Walkthrough


Both benchmark runners updated to handle errors more explicitly: _judge now returns None for failures instead of 0.0, generation failures record None metrics with error details, and summary aggregation excludes None values while tracking non-None result counts.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Locomo Runner Error Handling (`benchmarks/locomo_runner.py`) | `_judge` return type changed to `float \| None` to propagate failures; `_process_qa` now sets metrics to None on generation errors and adds an optional "error" field; `_summary` excludes None values from aggregation and adds `judge_scored` and `recall_scored` counters; `_guarded` increments `failed_qa` when errors occur. |
| Mem0 Locomo Runner Error Handling (`benchmarks/mem0_locomo_runner.py`) | Added `EVIDENCE_UNAVAILABLE` constant; changed failed retrieval/generation metrics from 0.0 to None; refactored control flow to defer row construction on generation errors; returned rows now include an optional "error" field and always apply evidence unavailability. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 When errors crept in wearing zeros' disguise,
We whispered None to catch their true cries—
No more masked failures in metrics' dark night,
Each benchmark now glows with transparent light! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: fixing silent-failure patterns in LoCoMo benchmark scoring that were causing bias in results. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@kilo-code-bot

kilo-code-bot bot commented Apr 19, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
|---|---|
| WARNING | 3 |

Issue details

| File | Line | Issue |
|---|---|---|
| benchmarks/mem0_locomo_runner.py | 137 | `_judge` return type is still annotated as `float`, not `float \| None` |
| benchmarks/mem0_locomo_runner.py | 145 | `_judge` exception handler still returns 0.0 instead of None - will bias judge scores downward on infra failures |
| benchmarks/mem0_locomo_runner.py | 424 | mem0 runner does not increment the `failed_qa` counter for rows with an `error` field - failed count will be incorrect in the output JSON |
Files Reviewed (2 files)
  • benchmarks/locomo_runner.py - no issues
  • benchmarks/mem0_locomo_runner.py - 3 issues



Reviewed by seed-2-0-pro-260328 · 207,083 tokens


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/mem0_locomo_runner.py (1)

318-330: ⚠️ Potential issue | 🟠 Major

failed_qa is still undercounted for returned error rows.

These paths now annotate the row with error, but _guarded() only increments failed_qa when _process_qa_mem0() throws. Retrieval/generation failures that return a row will still show up as successes in meta.failed_qa.

Also applies to: 371-372

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 318 - 330, The returned error
rows (e.g., the dict built in _process_qa_mem0() that contains an "error" key)
are not causing meta.failed_qa to increment because _guarded() only counts
failures when _process_qa_mem0() raises; update _guarded() to treat a non-empty
"error" field in the returned result as a failure: after calling result =
_process_qa_mem0(...) check if isinstance(result, dict) and result.get("error")
is truthy, and if so increment meta.failed_qa and append the result to
meta.failed_examples (same behavior as the exception path); apply the same
change to the other analogous call sites noted around lines 371-372 so
returned-error rows are counted consistently.
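A minimal sketch of the counting fix this prompt describes, assuming a simplified synchronous `_guarded` wrapper and a `meta` dict (the names come from the review text; the actual code is async and richer):

```python
def guarded(process_qa, meta, *args):
    """Run one QA item, counting both raised and returned failures."""
    try:
        row = process_qa(*args)
    except Exception as exc:
        # Exceptions were already counted; keep that path.
        meta["failed_qa"] += 1
        return {"error": f"{type(exc).__name__}: {exc}"}
    # A row that returned normally but carries an error annotation is a
    # failure too; counting only exceptions undercounts failed_qa.
    if isinstance(row, dict) and row.get("error"):
        meta["failed_qa"] += 1
    return row
```

Without the second check, retrieval/generation failures that return an annotated row would still show up as successes in `meta.failed_qa`, which is exactly the undercount the comment flags.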

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3096a247-78cf-4301-8f70-e2f20efe1a2a

📥 Commits

Reviewing files that changed from the base of the PR and between 7d2c780 and bc0a773.

📒 Files selected for processing (2)
  • benchmarks/locomo_runner.py
  • benchmarks/mem0_locomo_runner.py

Comment on lines +303 to +309
# mem0 stored facts don't carry original LoCoMo dia_id metadata, so we
# cannot match against the gold evidence list. Report both hits and total
# as None — the metric is _unavailable_, not zero. _summary in the taosmd
# runner (and any downstream scorecard builder) skips None-valued rows
# from retrieval_recall so this artefact doesn't publish a fake 0.0.
# TODO: wire dia_id pass-through if mem0 adds metadata preservation.
EVIDENCE_UNAVAILABLE = {"evidence_hits": None, "evidence_total": None}


⚠️ Potential issue | 🔴 Critical

These None sentinels currently break _summary() in this file.

Every returned row now carries evidence_total=None, but the local _summary() still does r.get("evidence_total", 0) > 0, so aggregation will raise TypeError on the first mem0 result. The new f1/bleu1/judge=None paths also conflict with its raw sum(...) / n averages. Please port the benchmarks/locomo_runner.py::_summary filtering/counting logic here before merging.

Also applies to: 318-330, 348-369

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 303 - 309, The _summary
aggregation in this file is failing because rows now contain None sentinels
(EVIDENCE_UNAVAILABLE) for evidence fields and None for f1/bleu1/judge, so
change the _summary implementation to mirror the safe filtering/counting used in
benchmarks/locomo_runner.py: skip rows where evidence_total is None or where
metric values are None before summing/averaging, treat None as "unavailable" not
zero, and compute counts using explicit filters (e.g., only include rows with
r.get("evidence_total") and r.get("f1") is not None) so sums/averages use
numeric values only; update all places referencing evidence_total/evidence_hits
and f1/bleu1/judge in _summary to use this guarded logic and preserve
EVIDENCE_UNAVAILABLE behavior.

Comment on lines +348 to +352
if generation_error is None:
    judge = await _judge(client, ollama_url, model, question, reference, predicted)
    f1_val = round(_f1(predicted, reference), 4)
    bleu_val = round(_bleu1(predicted, reference), 4)
    judge_val = round(judge, 4) if judge is not None else None


⚠️ Potential issue | 🟠 Major

Judge transport failures are still treated as wrong answers here.

This branch expects _judge() to return None, but the local implementation still returns 0.0 on exceptions. As written, mem0 judge outages will continue to depress the Judge average instead of being excluded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 348 - 352, The code treats
judge outages as valid zeros because _judge currently returns 0.0 on exceptions;
update the implementation so transport/remote failures return None (not 0.0),
and keep this call site logic (the variable judge and the rounding into
judge_val) intact; specifically modify the _judge(...) function to catch
transport/HTTP errors and return None on those failure paths (rather than 0.0)
so judge_val becomes None and outages are excluded from averages.

jaylfc added a commit that referenced this pull request Apr 19, 2026
Captures every model actually used during the benchmark (generator
variants, external judge, embedders, cross-encoder, fact extractor) with
params, quant, VRAM footprint, and backend. Adds the runtime/host row so
anyone reproducing knows the Ollama parallel limit and rescore timeout.

Derives hardware-tier recommendations from what we measured:
- Orange Pi (RK3588 NPU, 16 GB): qwen3:4b gen on rkllama, external judge,
  MiniLM ONNX embed, taosmd arch
- Fedora 3060 (12 GB VRAM): gemma4:e2b gen, qwen3:4b judge co-resident,
  prompt-opt on by default
- Laptop / Mac Mini: qwen3:4b gen via Ollama, external judge
- High-end (≥24 GB): qwen3.5:9b gen viable; e2b still competitive

Documents the seven lessons that drive the defaults: bigger-gen-≠-better
at small scale, qwen for structured output, NUM_PARALLEL is the real
ceiling, nomic context forces batching, architecture dominates
generator choice, self-judge inflates, R@K needs dia_id round-trip.

Also corrects the Commits row: superseded SHAs (ca0ccb7571d8af for
mempalace) and references the right open PRs (#34, #35, #36).
