fix(bench): stop silent-failure patterns from biasing LoCoMo scores #35
Conversation
CodeRabbit flagged four benchmark-integrity issues on #30. All real — they let infra failures and adapter artefacts masquerade as genuine negative results, biasing the published numbers and hiding problems.

1. `_judge()` returned 0.0 on Ollama timeout / network error — identical to a genuine "NO" grade. Now returns `None`; `_summary` filters `None` out of the Judge average so transport flakiness doesn't depress the score. The row still records 0.0 vs `None` distinctly.
2. Generation errors stored a synthetic `[generation_error: ...]` prediction and then computed F1/BLEU/Judge on it. That folded infra failures into the benchmark averages, and `failed_qa` stayed zero, so the run reported "complete" despite missing answers. Now: on gen failure, `predicted=""`; f1/bleu/judge=`None`; the row carries an `error` field; `_guarded` increments `failed_qa` for those rows; `_summary` excludes `None` metrics.
3. mem0_locomo_runner.py hardcoded `evidence_hits=0` in every row because mem0 2.x doesn't round-trip per-turn dia_ids. `_summary` then published `retrieval_recall=0.0` for every mem0 run — a fake miss. Now sets `evidence_hits` and `evidence_total` to `None` (metric unavailable, not zero), and `_summary` skips `None` rows from recall.
4. The mem0 runner inherited both the 0.0-on-failure judge and the error-folding pattern. Both paths are fixed to the same `None`-on-failure convention.

`_summary` now also emits `judge_scored` + `recall_scored` alongside `count` so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487 scored of 1540 total"), making infra flakiness inspectable rather than invisible.

No schema-breaking changes: existing `.rescored.json` outputs that have `evidence_hits=0` or `judge=0.0` remain readable — the new `_summary` treats them as real zeros, which is how they were when written. Only forward runs produce `None` for "metric unavailable".
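The None-aware aggregation described above can be sketched as follows — a minimal illustration over a flat list of row dicts. The function name `summarize` and the exact field names are assumptions; the real `_summary` in benchmarks/locomo_runner.py may differ in detail.

```python
def summarize(rows):
    """Average metrics over scored rows only; None means 'unavailable', not zero."""
    # Rows where the judge actually returned a grade (transport failures are None).
    judged = [r["judge"] for r in rows if r.get("judge") is not None]
    # Rows where evidence matching was possible (mem0 rows carry None sentinels).
    recall_rows = [
        r for r in rows
        if r.get("evidence_total") is not None and r.get("evidence_hits") is not None
    ]
    total_evidence = sum(r["evidence_total"] for r in recall_rows)
    return {
        "count": len(rows),
        "judge": round(sum(judged) / len(judged), 4) if judged else None,
        "judge_scored": len(judged),          # denominator made explicit
        "retrieval_recall": (
            round(sum(r["evidence_hits"] for r in recall_rows) / total_evidence, 4)
            if total_evidence else None
        ),
        "recall_scored": len(recall_rows),
    }
```

With one fully scored row, one judge outage, and one mem0 row, only the available values enter each average, and `judge_scored` / `recall_scored` expose the true denominators.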
Kilo Cloud code review — Status: 3 Issues Found | Recommendation: Address before merge. Files reviewed: 2. Reviewed by seed-2-0-pro-260328 · 207,083 tokens.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/mem0_locomo_runner.py (1)
Lines 318-330: ⚠️ Potential issue | 🟠 Major

`failed_qa` is still undercounted for returned error rows. These paths now annotate the row with `error`, but `_guarded()` only increments `failed_qa` when `_process_qa_mem0()` throws. Retrieval/generation failures that return a row will still show up as successes in `meta.failed_qa`. Also applies to: 371-372
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/mem0_locomo_runner.py` around lines 318 - 330, The returned error rows (e.g., the dict built in _process_qa_mem0() that contains an "error" key) are not causing meta.failed_qa to increment because _guarded() only counts failures when _process_qa_mem0() raises; update _guarded() to treat a non-empty "error" field in the returned result as a failure: after calling result = _process_qa_mem0(...) check if isinstance(result, dict) and result.get("error") is truthy, and if so increment meta.failed_qa and append the result to meta.failed_examples (same behavior as the exception path); apply the same change to the other analogous call sites noted around lines 371-372 so returned-error rows are counted consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 3096a247-78cf-4301-8f70-e2f20efe1a2a
📒 Files selected for processing (2)
benchmarks/locomo_runner.py
benchmarks/mem0_locomo_runner.py
```python
# mem0 stored facts don't carry original LoCoMo dia_id metadata, so we
# cannot match against the gold evidence list. Report both hits and total
# as None — the metric is _unavailable_, not zero. _summary in the taosmd
# runner (and any downstream scorecard builder) skips None-valued rows
# from retrieval_recall so this artefact doesn't publish a fake 0.0.
# TODO: wire dia_id pass-through if mem0 adds metadata preservation.
EVIDENCE_UNAVAILABLE = {"evidence_hits": None, "evidence_total": None}
```
These `None` sentinels currently break `_summary()` in this file.
Every returned row now carries `evidence_total=None`, but the local `_summary()` still does `r.get("evidence_total", 0) > 0`, so aggregation will raise `TypeError` on the first mem0 result. The new `f1/bleu1/judge=None` paths also conflict with its raw `sum(...) / n` averages. Please port the `benchmarks/locomo_runner.py::_summary` filtering/counting logic here before merging.
Also applies to: 318-330, 348-369
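The failure mode is easy to reproduce in isolation: `dict.get(key, default)` only substitutes the default when the key is absent, so a stored `None` wins over the `0` fallback and the comparison evaluates `None > 0`, a `TypeError` in Python 3. A self-contained demonstration (names are illustrative, not the runner's):

```python
row = {"evidence_hits": None, "evidence_total": None}  # an EVIDENCE_UNAVAILABLE row

def old_check(r):
    # Falls through to None > 0 because the key IS present, with value None.
    return r.get("evidence_total", 0) > 0

def guarded_check(r):
    # Treat None as "metric unavailable" and exclude the row, not as zero.
    total = r.get("evidence_total")
    return total is not None and total > 0
```

The guarded form is what the locomo_runner.py `_summary` filtering amounts to: skip `None` rows entirely rather than folding them into the denominator.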
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/mem0_locomo_runner.py` around lines 303 - 309, The _summary
aggregation in this file is failing because rows now contain None sentinels
(EVIDENCE_UNAVAILABLE) for evidence fields and None for f1/bleu1/judge, so
change the _summary implementation to mirror the safe filtering/counting used in
benchmarks/locomo_runner.py: skip rows where evidence_total is None or where
metric values are None before summing/averaging, treat None as "unavailable" not
zero, and compute counts using explicit filters (e.g., only include rows with
r.get("evidence_total") and r.get("f1") is not None) so sums/averages use
numeric values only; update all places referencing evidence_total/evidence_hits
and f1/bleu1/judge in _summary to use this guarded logic and preserve
EVIDENCE_UNAVAILABLE behavior.
```python
if generation_error is None:
    judge = await _judge(client, ollama_url, model, question, reference, predicted)
    f1_val = round(_f1(predicted, reference), 4)
    bleu_val = round(_bleu1(predicted, reference), 4)
    judge_val = round(judge, 4) if judge is not None else None
```
Judge transport failures are still treated as wrong answers here.
This branch expects `_judge()` to return `None`, but the local implementation still returns 0.0 on exceptions. As written, mem0 judge outages will continue to depress the Judge average instead of being excluded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/mem0_locomo_runner.py` around lines 348 - 352, The code treats
judge outages as valid zeros because _judge currently returns 0.0 on exceptions;
update the implementation so transport/remote failures return None (not 0.0),
and keep this call site logic (the variable judge and the rounding into
judge_val) intact; specifically modify the _judge(...) function to catch
transport/HTTP errors and return None on those failure paths (rather than 0.0)
so judge_val becomes None and outages are excluded from averages.
Captures every model actually used during the benchmark (generator variants, external judge, embedders, cross-encoder, fact extractor) with params, quant, VRAM footprint, and backend. Adds the runtime/host row so anyone reproducing knows the Ollama parallel limit and rescore timeout.

Derives hardware-tier recommendations from what we measured:
- Orange Pi (RK3588 NPU, 16 GB): qwen3:4b gen on rkllama, external judge, MiniLM ONNX embed, taosmd arch
- Fedora 3060 (12 GB VRAM): gemma4:e2b gen, qwen3:4b judge co-resident, prompt-opt on by default
- Laptop / Mac Mini: qwen3:4b gen via Ollama, external judge
- High-end (≥24 GB): qwen3.5:9b gen viable; e2b still competitive

Documents the seven lessons that drive the defaults: bigger-gen-≠-better at small scale, qwen for structured output, NUM_PARALLEL is the real ceiling, nomic context forces batching, architecture dominates generator choice, self-judge inflates, R@K needs dia_id round-trip.

Also corrects the Commits row: superseded SHAs (ca0ccb7 → 571d8af for mempalace) and references the right open PRs (#34, #35, #36).
Re-open of #33 after its original base branch (`feat/locomo-prompt-opt`) was deleted by PR #30's merge. Same content, retargeted at master.

Addresses four CodeRabbit MAJOR findings that were posted on #30 (now merged) — they flagged silent-failure patterns in the runner + mem0 adapter that let infra flakiness and adapter artefacts masquerade as legitimate negative results.
Findings addressed

| File | Finding | Fix |
| --- | --- | --- |
| locomo_runner.py | `_judge` returned 0.0 on timeout/network error | returns `None`; `_summary` excludes it from the Judge average |
| locomo_runner.py | gen error: `[generation_error: ...]` folded into F1/BLEU/Judge; `failed_qa` stayed 0 | `predicted=""`, metrics=`None`, row carries `error` field, `failed_qa` incremented |
| mem0_locomo_runner.py | inherited both the 0.0-on-failure judge and the error-folding pattern | same `None`-on-failure convention |
| mem0_locomo_runner.py | R@K=0.0: `evidence_hits=0, evidence_total=len(evidence)` hardcoded because mem0 doesn't round-trip dia_id → published fake 0.0 | `None` (metric unavailable); `_summary` skips it from recall aggregation |

Bonus

`_summary` now emits `judge_scored` + `recall_scored` alongside `count` so the denominator is visible (e.g. "Judge 0.41 over 1535 scored of 1540 total").

Test plan

- Unit tests pass (`pytest tests/ -q --ignore=tests/integration`)
- Existing `.rescored.json` outputs re-summarized through the new `_summary` — no need to re-run end-to-end

Summary by CodeRabbit
Release Notes
Bug Fixes
Refactor