
feat(bench): LoCoMo runner + prompt-opt + mem0 adapter (validated)#30

Open
jaylfc wants to merge 2 commits into master from feat/locomo-prompt-opt

Conversation

@jaylfc
Owner

@jaylfc jaylfc commented Apr 19, 2026

Summary

Comprehensive LoCoMo benchmark PR — lands all the work that produced our published LongMemEval 97% and the new LoCoMo numbers. Supersedes the closed #25 (content subsumed into #26's unintentional bundle).

What lands

Benchmark infrastructure (~2000 LOC):

  • benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama answer/judge, F1/BLEU/Judge/R@K per category, --concurrency, --per-conv-limit, --timeout, --model flags, live progress markers)
  • benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports, now uses TAOSMD_OLLAMA_URL/TAOSMD_OLLAMA_MODEL env vars (no hardcoded IPs)
  • benchmarks/mem0_locomo_runner.py — apples-to-apples adapter routing retrieval through mem0.search() with the same generator/prompt/judge (for the upcoming memory-architecture comparison)
  • benchmarks/locomo_rescore.py — already merged via feat(bench): locomo_rescore + resume tools (external judge with better diagnostics) #29, included here only by branch history
  • eval/librarian_eval.py — fixed divide-by-zero token ratio (replaced with absolute per-query budget)
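
The environment-variable endpoint resolution described above can be sketched as follows (the variable names come from this PR; the fallback values shown are illustrative assumptions, not the repo's actual defaults):

```python
import os

# TAOSMD_OLLAMA_URL / TAOSMD_OLLAMA_MODEL are the env vars named in the PR.
# The fallback values below are assumptions for illustration only.
OLLAMA_URL = os.environ.get("TAOSMD_OLLAMA_URL", "http://localhost:11434")
OLLAMA_MODEL = os.environ.get("TAOSMD_OLLAMA_MODEL", "qwen3:4b")
```

This keeps hardcoded IPs out of the runners while letting a benchmark host override both values per run.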

Prompt tweaks (the +0.03 Overall Judge improvement, validated Apr 18–19):

  • benchmarks/locomo_runner.py ANSWER_PROMPT — adds absolute-date instruction + softened IDK instruction
  • Measured impact under external qwen3:4b judge:
    • Temporal: 0.36 → 0.41 (+0.05)
    • Multi-hop: 0.21 → 0.24 (+0.03)
    • Single-hop + Open-dom: unchanged (prompt targeted temporal/multi-hop specifically — clean signal)

Methodology docs:

  • docs/specs/2026-04-16-taosmd-next-steps.md — four-workstream roadmap
  • docs/specs/2026-04-16-supersede-benchmark-design.md
  • docs/specs/2026-04-16-routing-benchmark-design.md
  • docs/specs/2026-04-24-ama-question-draft.md — Imran Ahmad AMA on r/LLMeng
  • docs/specs/2026-04-methodology-post-draft.md — publishable protocol-transparency post

README framing (updated in prior commits on this branch):

  • Reframed Librarian section around "query-time RAG breaks on implicit context" critique
  • Added Security & Integrity section on memory poisoning

Validated results

LoCoMo full scorecard under qwen3:4b external judge (100% coverage, 0 errors):

Category     e2b    e4b    e2b+prompt-opt
Single-hop   0.17   0.15   0.16
Temporal     0.36   0.31   0.41
Multi-hop    0.21   0.14   0.24
Open-dom     0.51   0.51   0.51
Overall      0.40   0.38   0.41
F1           0.162  0.152  0.175
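
The F1 row above is token-level overlap between predicted and reference answers, and BLEU here is BLEU-1 (unigram precision). A minimal sketch of both under whitespace tokenization (the runners' exact tokenization and normalization may differ):

```python
from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1: harmonic mean of precision/recall over token overlap."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def bleu1(pred: str, ref: str) -> float:
    """BLEU-1 without brevity penalty: unigram precision vs. the reference."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p:
        return 0.0
    return sum((Counter(p) & Counter(r)).values()) / len(p)
```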

Test plan

  • All three configurations (e2b, e4b, e2b+prompt-opt) benchmarked against the full 1540-QA LoCoMo set
  • External judge (qwen3:4b) re-judged all three runs — 100% coverage, 0 errors
  • Prompt-opt improvement localized to targeted categories as designed
  • Existing test suite green (pytest tests/ -q --ignore=tests/integration)
  • Post-merge: delete feat/locomo-benchmark and feat/locomo-prompt-opt
  • Post-merge: Fedora git fetch --prune && git checkout master before next benchmark run

Summary by CodeRabbit

Release Notes

  • New Features

    • Added LoCoMo benchmark runners for evaluating retrieval and QA generation performance
    • Made benchmark LLM and embedding configuration configurable via environment variables
  • Documentation

    • Added benchmark design specifications for routing and temporal reasoning evaluation scenarios
  • Chores

    • Updated evaluation metrics to use absolute token-per-query budget instead of relative comparison

@coderabbitai

coderabbitai bot commented Apr 19, 2026

Warning

Rate limit exceeded

@jaylfc has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 12 minutes and 45 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 12 minutes and 45 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: bf2c2afe-47ef-430e-a3e7-d25e6053cf42

📥 Commits

Reviewing files that changed from the base of the PR and between fd27d2c and ca0ccb7.

📒 Files selected for processing (2)
  • benchmarks/mempalace_locomo_runner.py
  • pyproject.toml
📝 Walkthrough

Walkthrough

This PR introduces two new LoCoMo benchmark runners (taosmd and mem0 variants), updates the existing LongMemEval runner to use taosmd modules with environment-configurable LLM endpoints, modifies the Librarian evaluation token-budget logic from multiplier-based to absolute tokens-per-query, and adds multiple specification documents for upcoming benchmarking initiatives and roadmap planning.

Changes

  • LoCoMo Benchmark Runners (benchmarks/locomo_runner.py, benchmarks/mem0_locomo_runner.py): Two new benchmark executables implementing full LoCoMo workflows: locomo_runner.py (465 lines) ingests conversations into VectorMemory and runs QA retrieval/generation/judging with Ollama, computing F1/BLEU-1/judge metrics per category; mem0_locomo_runner.py (549 lines) performs equivalent operations using mem0 as the vector store backend. Both handle per-conversation temp storage, concurrent QA processing, result aggregation, and JSON/table output.
  • Benchmark Integration (benchmarks/longmemeval_runner.py): Updated imports from deprecated tinyagentos.* to taosmd.* modules; made the Ollama endpoint and model configurable via environment variables (TAOSMD_OLLAMA_URL, TAOSMD_OLLAMA_MODEL) with sensible defaults; extended VectorMemory construction to accept embedding mode and ONNX path from environment variables (TAOSMD_EMBED_MODE, TAOSMD_ONNX_PATH).
  • Evaluation Logic (eval/librarian_eval.py): Changed the token-budget criterion in _print_summary from a relative multiplier (token_mult <= 3.0) to an absolute per-query budget: derives query count from librarian report axes, computes tokens_per_query, applies a fixed TOKEN_BUDGET = 2000 threshold, and updates output messaging.
  • Dependencies (pyproject.toml): Added an optional benchmarks dependency group with mem0ai>=0.1.0; removed the unused [tool.setuptools.package-data] section for docs.
  • Specification Documents (docs/specs/2026-04-16-routing-benchmark-design.md, docs/specs/2026-04-16-supersede-benchmark-design.md, docs/specs/2026-04-16-taosmd-next-steps.md, docs/specs/2026-04-24-ama-question-draft.md, docs/specs/2026-04-methodology-post-draft.md): Five new spec/plan documents: routing and supersede benchmark designs (measurement questions, configurations, prerequisites, metrics), the taosmd execution roadmap (LoCoMo publication, Librarian validations, README reframing, ecosystem visibility), an AMA question draft for community engagement, and a methodology transparency post addressing benchmark reproducibility and reporting standards.
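
The token-budget change described above replaces a relative multiplier (which divides by a baseline and can hit divide-by-zero) with an absolute per-query threshold. A minimal sketch of the absolute check (function name and return shape are illustrative, not the repo's API; the TOKEN_BUDGET value comes from this PR's summary):

```python
TOKEN_BUDGET = 2000  # absolute tokens-per-query threshold (value from the PR)

def token_cost_pass(total_tokens: int, n_queries: int) -> tuple[float, bool]:
    """Return (tokens_per_query, within_budget).

    An absolute budget avoids the divide-by-zero a relative multiplier
    hits when the baseline token count is zero. Illustrative helper only.
    """
    if n_queries <= 0:
        return 0.0, True  # no queries: trivially within budget
    per_query = total_tokens / n_queries
    return per_query, per_query <= TOKEN_BUDGET
```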

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Runner
    participant Dataset
    participant VectorMemory
    participant Ollama
    participant Judge

    User->>Runner: run(dataset, model, concurrency, top_k)
    Runner->>Dataset: load LoCoMo JSON
    Dataset-->>Runner: conversations & QA items

    loop For each conversation
        Runner->>VectorMemory: create temp instance
        Runner->>Runner: ingest session turns
        VectorMemory-->>Runner: indexed (dialog_id, speaker, datetime)

        loop For each QA (concurrent, semaphore-limited)
            Runner->>VectorMemory: search(query, top_k)
            VectorMemory-->>Runner: ranked hits
            Runner->>Runner: build context (hits + datetime)
            Runner->>Ollama: /api/generate(context + question)
            Ollama-->>Runner: predicted_answer
            Runner->>Judge: /api/generate(judge_prompt)
            Judge-->>Runner: YES/NO score (0.0 or 1.0)
            Runner->>Runner: compute F1, BLEU-1 vs reference
            Runner->>Runner: store metrics & latencies
        end
    end

    Runner->>Runner: aggregate by category
    Runner->>Runner: compute retrieval_recall
    Runner-->>User: JSON results + formatted summary table

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

The two new benchmark runners introduce substantial, multi-component async workflows (VectorMemory/mem0 ingestion, Ollama integration, concurrent QA processing, metric computation) across 1,000+ lines. The logic is moderately dense with helper functions, error handling, and aggregation logic. The variety of changes (two new benchmarks, existing benchmark refactor, eval logic modification, dependency updates, and documentation) adds heterogeneity. The evaluation logic change requires careful reasoning about the token-budget semantics. Documentation specs are straightforward but numerous.
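
The semaphore-limited concurrent QA processing mentioned above follows a common asyncio fan-out pattern; a self-contained sketch (helper name and signature are illustrative, not the repo's actual functions):

```python
import asyncio

async def run_concurrent(items, worker, concurrency: int = 8):
    """Semaphore-limited fan-out over items, dropping failed tasks.

    Mirrors the per-QA concurrency pattern described in the walkthrough;
    illustrative helper, not the repo's implementation.
    """
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with sem:  # at most `concurrency` workers run at once
            return await worker(item)

    tasks = [asyncio.create_task(guarded(i)) for i in items]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]
```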

Possibly related PRs

  • PR #29: Both implement Ollama-based judge logic and parse YES/NO judge replies; locomo_runner.py introduces the foundational _judge() and _ollama_generate() functions that may be reused or compared with similar patterns in #29.

Poem

🐰 Benchmark runners hop and bound,
LoCoMo logic newly found,
Mem0 searches, Ollama thinks,
Metrics flow through careful links,
Roadmap charted, specs align—
More transparency by design! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

  • Linked Issues check (⚠️ Warning): The linked issue #25 requires prompt migration and JSON parsing changes in session_catalog.py, but no changes to that file appear in this PR's changeset. Resolution: Review whether #25's refactoring of session_catalog.py and prompts module integration was intended to be included or if it belongs in a separate PR.
  • Out of Scope Changes check (⚠️ Warning): The PR adds extensive LoCoMo benchmark code, mem0 adapter, evaluation tweaks, and methodology docs that align with the PR objectives but diverge from the linked issue #25 requirements. Resolution: Clarify the relationship between #25 (session catalog refactoring) and the current PR scope; consider separating them if they represent different feature tracks.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.57%, below the required threshold of 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title directly references the three main deliverables: LoCoMo runner, prompt optimization, and mem0 adapter, matching the primary changes in the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@kilo-code-bot

kilo-code-bot bot commented Apr 19, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Overview

Severity    Count
CRITICAL    0
WARNING     0
SUGGESTION  0

Files Reviewed (2 files)
  • benchmarks/mempalace_locomo_runner.py - New runner for MemPalace benchmark, properly implements consistent benchmark interface, follows same patterns as other runners, handles None values correctly, uses relative paths for dataset
  • pyproject.toml - Added mempalace to benchmarks optional dependencies

Reviewed by seed-2-0-pro-260328 · 262,431 tokens


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (8)
.gitignore (1)

13-13: Keep a versioned CLAUDE.md template if local CLAUDE.md is ignored.

Ignoring CLAUDE.md is fine for local machine-specific config, but it can hide required shared agent/memory instructions from the repo. Consider committing CLAUDE.md.example (or equivalent) and documenting copy/setup steps.

Suggested `.gitignore` adjustment
 CLAUDE.md
+!CLAUDE.md.example

Based on learnings: “Applies to {AGENTS,CLAUDE}.md: Do not overwrite existing AGENTS.md or CLAUDE.md files without first reading them and checking for existing memory system integrations (Mem0, Zep, MemGPT/Letta, OpenClaw memory, LangChain modules)”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 13, Make sure a versioned template for CLAUDE.md is
committed and .gitignore only excludes the local file: add a committed
CLAUDE.md.example (or CLAUDE.md.template) to the repo containing the shared
agent/memory instructions, update .gitignore to ignore CLAUDE.md but not
CLAUDE.md.example, and add a brief note in the README or a new CONTRIBUTING
section describing the copy/setup steps (copy CLAUDE.md.example -> CLAUDE.md and
merge any existing local CLAUDE.md to avoid overwriting memory integrations).
benchmarks/locomo_runner.py (2)

387-387: datetime.utcnow() is deprecated in Python 3.12+.

Consider using datetime.now(timezone.utc) for forward compatibility.

♻️ Proposed fix
+from datetime import datetime, timezone
 ...
-    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
+    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` at line 387, Replace the deprecated
datetime.utcnow() usage when building the timestamp: change the assignment
timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S") to use timezone-aware
time, e.g. timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"), and
add/update the import to include timezone (from datetime import datetime,
timezone) so the timestamp is produced in a forward-compatible, UTC-aware way.

198-200: Clarify the double-metadata unwrapping logic.

The expression h.get("metadata", {}).get("metadata", h.get("metadata", {})) suggests the retrieve API sometimes returns metadata nested inside another metadata key. This defensive handling works, but may indicate an API inconsistency worth investigating.

Consider adding a comment explaining when the double-nesting occurs, or normalizing the API response in retrieve().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 198 - 200, The list comprehension
unpacks nested metadata defensively; update the code to normalize the API
response in the retrieve() function so hits always have a flat dict at
h["metadata"] (or explicitly document when nested metadata occurs) and then
simplify the comprehension to use h.get("metadata", {}) directly; specifically,
modify retrieve() to detect and unwrap cases where response returns {"metadata":
{"metadata": ...}} into {"metadata": ...}, and add a short inline comment near
the return (or in retrieve()) explaining the normalization so the expression
h.get("metadata", {}) is sufficient and the double-metadata defensive logic is
no longer needed.
benchmarks/locomo_rescore.py (1)

257-268: Closure captures client before it's defined.

The nested _judge_one function references client at line 260, but client is only assigned at line 267. This works due to Python's late binding of closures (the name is looked up when _judge_one executes, not when it's defined), but it's fragile and potentially confusing.

Consider passing client as a parameter to make the dependency explicit:

♻️ Proposed fix
-    async def _judge_one(idx: int, rec: dict):
+    async def _judge_one(idx: int, rec: dict, client: httpx.AsyncClient):
         nonlocal unparseable
         async with sem:
             val = await _call_ollama_judge(client, ollama_url, model, rec)
         ...

     limits = httpx.Limits(max_connections=concurrency + 2, max_keepalive_connections=concurrency)
     async with httpx.AsyncClient(limits=limits) as client:
-        tasks = [asyncio.create_task(_judge_one(i, r)) for i, r in enumerate(records)]
+        tasks = [asyncio.create_task(_judge_one(i, r, client)) for i, r in enumerate(records)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_rescore.py` around lines 257 - 268, The nested function
_judge_one currently captures client from the outer scope after it is created,
which is fragile; make the dependency explicit by adding a client parameter to
_judge_one (e.g., async def _judge_one(idx: int, rec: dict, client):) and update
its internal call to await _call_ollama_judge(client, ollama_url, model, rec).
Then, when creating tasks inside the async with httpx.AsyncClient(...) as client
block, pass the client into each task (e.g., asyncio.create_task(_judge_one(i,
r, client))) while leaving uses of sem, results, and records unchanged.
benchmarks/mem0_locomo_runner.py (3)

467-467: datetime.utcnow() is deprecated in Python 3.12+.

Consider using datetime.now(timezone.utc) for forward compatibility.

♻️ Proposed fix
+from datetime import datetime, timezone
 ...
-    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
+    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` at line 467, Replace the deprecated
datetime.utcnow() call used when constructing timestamp (the line assigning
timestamp) with a timezone-aware call: use
datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") and ensure timezone is
imported (from datetime import datetime, timezone) so the timestamp is
forward-compatible with Python 3.12+.

27-65: Consider extracting shared constants to a common module.

The prompts and utilities are intentionally duplicated for self-containment, but this creates a maintenance burden. If the prompts change in locomo_runner.py, they must be manually updated here. Consider extracting ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES, and shared utilities (_tokenize, _f1, _bleu1, _judge) to a benchmarks/_common.py module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 27 - 65, The duplicated
prompts and utilities (ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES and helper
functions like _tokenize, _f1, _bleu1, _judge) should be extracted into a single
shared module (e.g., _common) and imported from both runners so they stay in
sync; create the new module containing those constants and utility
implementations, update mem0_locomo_runner.py and locomo_runner.py to import
ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES and the utility functions instead of
redefining them, and ensure any references to those symbols remain unchanged so
behavior is preserved.

240-246: Consider using tempfile.mkdtemp for safer temporary directory creation.

The hardcoded /tmp/mem0_{run_id}_chroma path is predictable. While acceptable for benchmark tooling, using tempfile.mkdtemp (as done in locomo_runner.py line 351) would be more robust against symlink attacks on shared systems.

♻️ Proposed fix
+import tempfile
+
 def _build_mem0_config(args: argparse.Namespace) -> dict:
+    tmp_dir = tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_")
     return {
         ...
         "vector_store": {
             "provider": "chroma",
             "config": {
                 "collection_name": f"locomo_{args.run_id}",
-                "path": f"/tmp/mem0_{args.run_id}_chroma",
+                "path": tmp_dir,
             },
         },
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 240 - 246, The vector_store
config currently uses a predictable hardcoded path
f"/tmp/mem0_{args.run_id}_chroma"; replace this with a securely-created
temporary directory using tempfile.mkdtemp to avoid predictable symlink attacks:
import tempfile, call tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_chroma_") to
get a tempdir, and assign that tempdir to the "path" value while keeping
"collection_name" as f"locomo_{args.run_id}"; update any teardown/cleanup logic
if needed to remove the tempdir after the benchmark.
docs/specs/2026-04-16-taosmd-next-steps.md (1)

33-33: Task may already be completed.

The relevant code snippet from eval/librarian_eval.py:630-642 shows that the token cost metric already uses tokens_per_query <= TOKEN_BUDGET (absolute threshold) rather than an infx ratio. Consider updating this task to reflect its completion status or verifying if additional work remains.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-16-taosmd-next-steps.md` at line 33, The spec task refers
to changing the token cost metric but the code in eval/librarian_eval.py already
enforces the absolute threshold via the tokens_per_query <= TOKEN_BUDGET check;
update the docs/specs/2026-04-16-taosmd-next-steps.md entry 5 to mark this task
as completed (or note any remaining subtasks), and if you want extra
verification, open eval/librarian_eval.py and confirm the evaluation uses
tokens_per_query and TOKEN_BUDGET rather than an infx ratio in the token-cost
calculation logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/locomo_runner.py`:
- Around line 421-432: The defaults for the argparse options are using hardcoded
user-specific paths; update the p.add_argument calls for "--dataset" and
"--onnx-path" to derive sensible defaults from environment variables or relative
locations instead of "/home/jay/...". For example, read env vars like
DATASET_PATH and ONNX_PATH (os.environ.get("DATASET_PATH")) or build a
project-relative default via pathlib.Path(__file__).resolve().parents[...] and
fall back to None or a clearly documented relative path; update the "--dataset"
and "--onnx-path" argument default values accordingly and update help text to
mention the env vars/relative default and that users can override via CLI.
- Around line 351-382: The tmp directory created with tempfile.mkdtemp(...) is
never removed; change the code to ensure cleanup by using a
tempfile.TemporaryDirectory() context manager around the lifecycle of tmp and
the VectorMemory instance (replace tempfile.mkdtemp(...) with
TemporaryDirectory() as tmp_dir and pass os.path.join(tmp_dir, "vmem.db")), or
if you must keep mkdtemp, call shutil.rmtree(tmp) after await vmem.close();
update references to tmp (used when constructing VectorMemory and when closing
vmem) so the directory is always removed even if exceptions occur (use
try/finally or context manager around vmem.init()/await
_ingest_conversation/.../await vmem.close()).

In `@benchmarks/longmemeval_runner.py`:
- Around line 152-156: The default onnx_path is currently hardcoded to a
machine-specific path; update the VectorMemory construction to avoid a hardcoded
default by using a neutral fallback (e.g., None or a package/resource-relative
path) and rely on the TAOSMD_ONNX_PATH env var or a config to provide the real
model location; modify the call that creates VectorMemory (parameters: db_path,
embed_mode, onnx_path) so onnx_path uses os.environ.get("TAOSMD_ONNX_PATH") with
no hardcoded user path and add handling in VectorMemory (or the caller) to raise
a clear error or load a bundled model when onnx_path is None.

In `@benchmarks/mem0_locomo_runner.py`:
- Line 412: Replace the deprecated call to asyncio.get_event_loop() inside the
async function run() with asyncio.get_running_loop(); locate the assignment
"loop = asyncio.get_event_loop()" and change it to use
asyncio.get_running_loop() so the coroutine retrieves the currently running
event loop instead of the deprecated global accessor.
- Line 511: The default dataset argument in p.add_argument("--dataset", ...)
uses a hardcoded user-specific path; change it to use a portable default or an
environment variable: remove the absolute /home/jay/... path and either set a
relative default like "data/locomo10.json" or set
default=os.environ.get("LOCOMO_DATASET", "data/locomo10.json") and import os;
alternatively make the argument required (no default) so callers must supply a
path. Update the p.add_argument("--dataset", ...) call accordingly and ensure
any README or run scripts reference the chosen default or env var.

In `@docs/specs/2026-04-16-routing-benchmark-design.md`:
- Around line 54-69: The fenced code block showing the scorecard currently uses
an untyped fence (```) which triggers MD040; change the opening fence to a typed
fence by replacing the leading ``` with ```text so the block becomes ```text ...
``` and leave the closing backticks unchanged; this affects the fenced block
that starts with the line "Category       vector-only ..." and ends with the
final "open-domain → crystals       [N/M correct]" closing fence.

In `@eval/librarian_eval.py`:
- Around line 628-633: The n_queries calculation in eval/librarian_eval.py uses
a non-existent key "n" for axis_c (lib.axis_c.get("n", 0)) while eval_axis_c
emits "n_sessions", causing undercounting; update the expression that computes
n_queries (the max(...) block) to read lib.axis_c.get("n_sessions", 0) (or sum n
and n_sessions as appropriate) so axis_c's sessions are included consistently
with eval_axis_c and tokens/query remains correct.

In `@README.md`:
- Around line 7-9: The README currently uses "97.0%" ambiguously; the top claims
"97.0% end-to-end Judge accuracy on LongMemEval-S" while later sections refer to
"97.0%" as Recall@5 — fix by unifying the framing: decide whether 97.0% refers
to end-to-end Judge accuracy or Recall@5, then update every occurrence of the
phrase "97.0%" and its surrounding label (search for the strings "97.0%",
"LongMemEval-S", and "Recall@5") so each mention explicitly states the metric
(e.g., "97.0% end-to-end Judge accuracy" or "97.0% Recall@5"); alternatively,
present both numbers distinctly (e.g., "97.0% end-to-end Judge accuracy;
Recall@5 = X") and update the sentences around those symbols to remove
ambiguity.

---

Nitpick comments:
In @.gitignore:
- Line 13: Make sure a versioned template for CLAUDE.md is committed and
.gitignore only excludes the local file: add a committed CLAUDE.md.example (or
CLAUDE.md.template) to the repo containing the shared agent/memory instructions,
update .gitignore to ignore CLAUDE.md but not CLAUDE.md.example, and add a brief
note in the README or a new CONTRIBUTING section describing the copy/setup steps
(copy CLAUDE.md.example -> CLAUDE.md and merge any existing local CLAUDE.md to
avoid overwriting memory integrations).

In `@benchmarks/locomo_rescore.py`:
- Around line 257-268: The nested function _judge_one currently captures client
from the outer scope after it is created, which is fragile; make the dependency
explicit by adding a client parameter to _judge_one (e.g., async def
_judge_one(idx: int, rec: dict, client):) and update its internal call to await
_call_ollama_judge(client, ollama_url, model, rec). Then, when creating tasks
inside the async with httpx.AsyncClient(...) as client block, pass the client
into each task (e.g., asyncio.create_task(_judge_one(i, r, client))) while
leaving uses of sem, results, and records unchanged.

In `@benchmarks/locomo_runner.py`:
- Line 387: Replace the deprecated datetime.utcnow() usage when building the
timestamp: change the assignment timestamp =
datetime.utcnow().strftime("%Y%m%d_%H%M%S") to use timezone-aware time, e.g.
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"), and add/update
the import to include timezone (from datetime import datetime, timezone) so the
timestamp is produced in a forward-compatible, UTC-aware way.
- Around line 198-200: The list comprehension unpacks nested metadata
defensively; update the code to normalize the API response in the retrieve()
function so hits always have a flat dict at h["metadata"] (or explicitly
document when nested metadata occurs) and then simplify the comprehension to use
h.get("metadata", {}) directly; specifically, modify retrieve() to detect and
unwrap cases where response returns {"metadata": {"metadata": ...}} into
{"metadata": ...}, and add a short inline comment near the return (or in
retrieve()) explaining the normalization so the expression h.get("metadata", {})
is sufficient and the double-metadata defensive logic is no longer needed.

In `@benchmarks/mem0_locomo_runner.py`:
- Line 467: Replace the deprecated datetime.utcnow() call used when constructing
timestamp (the line assigning timestamp) with a timezone-aware call: use
datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") and ensure timezone is
imported (from datetime import datetime, timezone) so the timestamp is
forward-compatible with Python 3.12+.
- Around line 27-65: The duplicated prompts and utilities (ANSWER_PROMPT,
JUDGE_PROMPT, CATEGORY_NAMES and helper functions like _tokenize, _f1, _bleu1,
_judge) should be extracted into a single shared module (e.g., _common) and
imported from both runners so they stay in sync; create the new module
containing those constants and utility implementations, update
mem0_locomo_runner.py and locomo_runner.py to import ANSWER_PROMPT,
JUDGE_PROMPT, CATEGORY_NAMES and the utility functions instead of redefining
them, and ensure any references to those symbols remain unchanged so behavior is
preserved.
- Around line 240-246: The vector_store config currently uses a predictable
hardcoded path f"/tmp/mem0_{args.run_id}_chroma"; replace this with a
securely-created temporary directory using tempfile.mkdtemp to avoid predictable
symlink attacks: import tempfile, call
tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_chroma_") to get a tempdir, and
assign that tempdir to the "path" value while keeping "collection_name" as
f"locomo_{args.run_id}"; update any teardown/cleanup logic if needed to remove
the tempdir after the benchmark.

In `@docs/specs/2026-04-16-taosmd-next-steps.md`:
- Line 33: The spec task refers to changing the token cost metric but the code
in eval/librarian_eval.py already enforces the absolute threshold via the
tokens_per_query <= TOKEN_BUDGET check; update the
docs/specs/2026-04-16-taosmd-next-steps.md entry 5 to mark this task as
completed (or note any remaining subtasks), and if you want extra verification,
open eval/librarian_eval.py and confirm the evaluation uses tokens_per_query and
TOKEN_BUDGET rather than an infx ratio in the token-cost calculation logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 607d827f-0e5a-4d00-8eb3-51299f9cdc95

📥 Commits

Reviewing files that changed from the base of the PR and between c360faa and d2f1a1e.

📒 Files selected for processing (13)
  • .gitignore
  • README.md
  • benchmarks/locomo_rescore.py
  • benchmarks/locomo_runner.py
  • benchmarks/longmemeval_runner.py
  • benchmarks/mem0_locomo_runner.py
  • docs/specs/2026-04-16-routing-benchmark-design.md
  • docs/specs/2026-04-16-supersede-benchmark-design.md
  • docs/specs/2026-04-16-taosmd-next-steps.md
  • docs/specs/2026-04-24-ama-question-draft.md
  • docs/specs/2026-04-methodology-post-draft.md
  • eval/librarian_eval.py
  • pyproject.toml

Comment on lines +351 to +382
            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
            vmem = VectorMemory(
                db_path=os.path.join(tmp, "vmem.db"),
                qmd_url=args.qmd_url,
                embed_mode=args.embed_mode,
                onnx_path=args.onnx_path,
            )
            await vmem.init(http_client=client)
            added, ingest_s = await _ingest_conversation(vmem, conv)
            print(f"[{conv_id}] ingested {added} turns in {ingest_s:.1f}s", flush=True)

            # Pick eligible QAs up-front. --per-conv-limit caps QAs per
            # conversation (for balanced sampling); --limit is the global cap.
            eligible: list[dict] = []
            for qa in conv.get("qa", []) or []:
                cat = int(qa.get("category", 0))
                if cat not in include_cats:
                    continue
                if args.per_conv_limit and len(eligible) >= args.per_conv_limit:
                    break
                if args.limit and (total_seen + len(eligible)) >= args.limit:
                    break
                eligible.append(qa)

            if eligible:
                tasks = [_guarded(qa, conv_id, vmem, client) for qa in eligible]
                gathered = await asyncio.gather(*tasks, return_exceptions=True)
                for outcome in gathered:
                    if isinstance(outcome, Exception) or outcome is None:
                        continue
                    results.append(outcome)
            await vmem.close()
⚠️ Potential issue | 🟡 Minor

Temporary directories are not cleaned up after use.

tempfile.mkdtemp() creates directories that persist after the script exits. For benchmark runs processing multiple conversations, this could accumulate significant disk usage in /tmp.

🐛 Proposed fix using context manager
+import shutil
 ...
-            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
-            vmem = VectorMemory(
-                db_path=os.path.join(tmp, "vmem.db"),
-                ...
-            )
-            ...
-            await vmem.close()
+            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
+            try:
+                vmem = VectorMemory(
+                    db_path=os.path.join(tmp, "vmem.db"),
+                    ...
+                )
+                ...
+                await vmem.close()
+            finally:
+                shutil.rmtree(tmp, ignore_errors=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 351 - 382, The tmp directory
created with tempfile.mkdtemp(...) is never removed; change the code to ensure
cleanup by using a tempfile.TemporaryDirectory() context manager around the
lifecycle of tmp and the VectorMemory instance (replace tempfile.mkdtemp(...)
with TemporaryDirectory() as tmp_dir and pass os.path.join(tmp_dir, "vmem.db")),
or if you must keep mkdtemp, call shutil.rmtree(tmp) after await vmem.close();
update references to tmp (used when constructing VectorMemory and when closing
vmem) so the directory is always removed even if exceptions occur (use
try/finally or context manager around vmem.init()/await
_ingest_conversation/.../await vmem.close()).
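The context-manager variant the prompt describes can be sketched as follows. `FakeVectorMemory` is a stand-in stub (the real `VectorMemory` and its signature live in the repo); the point is only the lifecycle: `close()` always runs, and the temp directory is removed even on exceptions.

```python
import asyncio
import os
import tempfile

class FakeVectorMemory:
    """Stand-in for the repo's VectorMemory, only to show init/close ordering."""
    def __init__(self, db_path: str):
        self.db_path = db_path
        self.closed = False

    async def init(self) -> None:
        open(self.db_path, "w").close()  # pretend to create the SQLite file

    async def close(self) -> None:
        self.closed = True

async def process_conversation(conv_id: str) -> "FakeVectorMemory":
    # TemporaryDirectory removes the directory even if ingest or retrieval raises.
    with tempfile.TemporaryDirectory(prefix=f"locomo_{conv_id}_") as tmp:
        vmem = FakeVectorMemory(db_path=os.path.join(tmp, "vmem.db"))
        try:
            await vmem.init()
            # ... ingest turns and answer QAs here ...
        finally:
            await vmem.close()  # always runs, as the prompt asks
    return vmem  # the temp dir (and its vmem.db) is gone by now

vmem = asyncio.run(process_conversation("conv-1"))
```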

    concurrency = max(1, int(args.concurrency))
    sem = asyncio.Semaphore(concurrency)
    progress_lock = asyncio.Lock()
    loop = asyncio.get_event_loop()
⚠️ Potential issue | 🟡 Minor

Use asyncio.get_running_loop() instead of asyncio.get_event_loop().

asyncio.get_event_loop() is deprecated in Python 3.10+ when called from a coroutine. Since run() is an async function, use asyncio.get_running_loop() instead.

🐛 Proposed fix
-    loop = asyncio.get_event_loop()
+    loop = asyncio.get_running_loop()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
loop = asyncio.get_event_loop()
loop = asyncio.get_running_loop()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` at line 412, Replace the deprecated call to
asyncio.get_event_loop() inside the async function run() with
asyncio.get_running_loop(); locate the assignment "loop =
asyncio.get_event_loop()" and change it to use asyncio.get_running_loop() so the
coroutine retrieves the currently running event loop instead of the deprecated
global accessor.
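A minimal illustration of why the swap is safe here: `get_running_loop()` only works inside a running event loop, which is exactly the situation inside an `async def run()`.

```python
import asyncio

async def run() -> bool:
    # Safe here: we are inside a coroutine, so a loop is guaranteed to be running.
    loop = asyncio.get_running_loop()
    return loop.is_running()

ok = asyncio.run(run())
```

Calling `asyncio.get_running_loop()` outside a coroutine raises `RuntimeError` instead of silently creating a loop, which is the version-dependent behavior the deprecated accessor had.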

Comment on lines +54 to +69
```
Category vector-only flat-thorough router-directed
F1 Judge tokens F1 Judge tokens F1 Judge tokens
single-hop 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
temporal 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
multi-hop 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
open-domain 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
------
overall 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N

Router accuracy (vs expected layer for each category):
single-hop → vector [N/M correct]
temporal → catalog+kg [N/M correct]
multi-hop → kg [N/M correct]
open-domain → crystals [N/M correct]
```
⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced block to satisfy markdown lint.

The scorecard block should use a typed fence (e.g., text) to avoid MD040 failures.

Suggested fix
-```
+```text
 Category       vector-only              flat-thorough            router-directed
                F1    Judge  tokens      F1    Judge  tokens      F1    Judge  tokens
 single-hop     0.xx  0.xx   N           0.xx  0.xx   N           0.xx  0.xx   N
@@
   open-domain → crystals       [N/M correct]
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 54-54: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-16-routing-benchmark-design.md` around lines 54 - 69, The
fenced code block showing the scorecard currently uses an untyped fence (```)
which triggers MD040; change the opening fence to a typed fence by replacing the
leading ``` with ```text so the block becomes ```text ... ``` and leave the
closing backticks unchanged; this affects the fenced block that starts with the
line "Category       vector-only ..." and ends with the final "open-domain →
crystals       [N/M correct]" closing fence.

Comprehensive LoCoMo benchmark bundle rebased onto current master.
Supersedes the closed #25.

## What lands

**Benchmark infrastructure** (~2000 LOC):
- benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama
  answer/judge, F1/BLEU/Judge/R@K per category, flags for --concurrency,
  --per-conv-limit, --timeout, --model)
- benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports
- benchmarks/mem0_locomo_runner.py — apples-to-apples mem0 adapter
  routing retrieval through mem0.search() with same generator/prompt/
  judge as the taosmd runner
- pyproject.toml — adds mem0ai + chroma optional deps for the adapter

**Prompt tweaks** (validated +0.03 Overall Judge under external qwen3:4b):
- Absolute-date instruction + softened IDK in ANSWER_PROMPT
- Per-category: Temporal 0.36→0.41 (+0.05), Multi-hop 0.21→0.24 (+0.03)
- Targeted improvements landed on targeted categories; Single-hop +
  Open-dom unchanged (clean intervention signal)

**Methodology docs** (4 specs under docs/specs/)

## Review-feedback fixes applied

- Kilo CRITICAL: all hardcoded /home/jay/... paths in
  locomo_runner.py (--dataset, --onnx-path), longmemeval_runner.py,
  mem0_locomo_runner.py are now repo-relative defaults derived from
  __file__, with LOCOMO_DATASET + TAOSMD_ONNX_PATH env-var overrides
- CodeRabbit major: eval/librarian_eval.py was reading axis_c.get("n")
  but eval_axis_c emits "n_sessions" — fixed with get("n_sessions",
  get("n", 0)) compat shim, undercounted n_queries no longer
  inflates tokens_per_query

## Test plan
- [x] All three models benchmarked against 1540-QA LoCoMo set with
  100% external-judge (qwen3:4b) coverage, 0 errors
- [x] Prompt-opt delta localized to targeted categories as designed
- [x] Python syntax check on all four modified benchmark files
- [x] No /home/jay/... paths remain in benchmarks/ or eval/
@jaylfc jaylfc force-pushed the feat/locomo-prompt-opt branch from d2f1a1e to fd27d2c Compare April 19, 2026 18:06
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 4

♻️ Duplicate comments (3)
docs/specs/2026-04-16-routing-benchmark-design.md (1)

54-69: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the scorecard fenced block.

The opening fence at Line 54 is untyped and triggers MD040; use a typed fence (e.g., text).

Suggested fix
-```
+```text
 Category       vector-only              flat-thorough            router-directed
                F1    Judge  tokens      F1    Judge  tokens      F1    Judge  tokens
 single-hop     0.xx  0.xx   N           0.xx  0.xx   N           0.xx  0.xx   N
@@
   open-domain → crystals       [N/M correct]

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/specs/2026-04-16-routing-benchmark-design.md around lines 54 - 69, The
fenced code block that contains the scorecard currently uses an untyped opening
fence; update the opening fence to include a language tag (for example `text`) so the block starts with ```text and keep the closing fence
unchanged; this addresses the MD040 lint warning for the scorecard fenced block.


</details>

<details>
<summary>benchmarks/locomo_runner.py (1)</summary><blockquote>

`351-382`: _⚠️ Potential issue_ | _🟡 Minor_

**Per-conversation temp directories still are not cleaned up.**

`mkdtemp()` is called for every conversation, but the directory is never removed, and `vmem.close()` is not protected by `finally`. Long benchmark sweeps will leak SQLite files under `/tmp`, especially on ingest or retrieval failures.  
 

<details>
<summary>🐛 Proposed fix</summary>

```diff
+import shutil
...
-            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
-            vmem = VectorMemory(
-                db_path=os.path.join(tmp, "vmem.db"),
-                qmd_url=args.qmd_url,
-                embed_mode=args.embed_mode,
-                onnx_path=args.onnx_path,
-            )
-            await vmem.init(http_client=client)
-            added, ingest_s = await _ingest_conversation(vmem, conv)
-            print(f"[{conv_id}] ingested {added} turns in {ingest_s:.1f}s", flush=True)
+            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
+            vmem = None
+            try:
+                vmem = VectorMemory(
+                    db_path=os.path.join(tmp, "vmem.db"),
+                    qmd_url=args.qmd_url,
+                    embed_mode=args.embed_mode,
+                    onnx_path=args.onnx_path,
+                )
+                await vmem.init(http_client=client)
+                added, ingest_s = await _ingest_conversation(vmem, conv)
+                print(f"[{conv_id}] ingested {added} turns in {ingest_s:.1f}s", flush=True)
+                ...
+            finally:
+                if vmem is not None:
+                    await vmem.close()
+                shutil.rmtree(tmp, ignore_errors=True)
...
-            await vmem.close()
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 351 - 382, The per-conversation
temporary directory created by tempfile.mkdtemp (used to construct VectorMemory
with db_path) is never removed and vmem.close() is not executed in a finally
block; wrap the lifecycle of VectorMemory and its temp dir in a try/finally so
that you always await vmem.close() (or call sync close if appropriate) and then
remove the tmp directory with shutil.rmtree(tmp, ignore_errors=True) after
closing the DB; keep the existing calls to await vmem.init(...) and
_ingest_conversation(...) inside the try and ensure exception paths (e.g.,
around the asyncio.gather of _guarded tasks) still flow to the finally that
closes vmem and deletes the temp dir.
```

</details>

</blockquote></details>
<details>
<summary>benchmarks/mem0_locomo_runner.py (1)</summary><blockquote>

`409-412`: _⚠️ Potential issue_ | _🟡 Minor_

**Use `get_running_loop()` inside `run()`.**

`run()` is already executing inside an event loop, so `asyncio.get_event_loop()` is the deprecated accessor here. `get_running_loop()` avoids version-dependent behavior.
  

<details>
<summary>🐛 Proposed fix</summary>

```diff
-    loop = asyncio.get_event_loop()
+    loop = asyncio.get_running_loop()
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 409 - 412, The code is calling
asyncio.get_event_loop() (assigning to loop) while already running inside run();
replace that call with asyncio.get_running_loop() to avoid deprecated/ambiguous
behavior—update the assignment where loop is created (near variables
concurrency, sem, progress_lock) so loop = asyncio.get_running_loop() is used
inside the run() function or coroutine.
```

</details>

</blockquote></details>


<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @benchmarks/locomo_runner.py:

  • Around line 236-245: The try/except around _ollama_generate currently converts
    generation exceptions into a synthetic predicted string and still runs
    metric/judge computation, which folds generation failures into scores and
    doesn't increment failed_qa; change the error path so that when _ollama_generate
    raises you (a) record the failure by incrementing failed_qa (the counter used
    for failed generations), (b) do not set a synthetic prediction for scoring (use
    None or a flag), and (c) skip calling _judge and any F1/BLEU metric computation
    for that item (only compute gen_ms as now). Update the flow around
    _ollama_generate, predicted, gen_ms, and the call to _judge so that _judge is
    only invoked when prediction succeeded and failed_qa is incremented on
    exceptions.
  • Around line 138-149: The judge currently masks transport/timeouts by catching
    all exceptions in _judge and returning 0.0; instead, detect and surface
    transport failures so they aren't counted as a "NO". Modify _judge (which calls
    _ollama_generate with JUDGE_PROMPT) to not convert exceptions into a 0.0 score —
    either re-raise the exception or return a sentinel (e.g., None or math.nan) and
    log the error with details — and then update the caller(s) that aggregate scores
    to skip/handle the sentinel (so only real negative verdicts produce 0.0 and
    infra failures are excluded from the average).

In @benchmarks/mem0_locomo_runner.py:

  • Around line 223-247: The mem0 config currently uses args.run_id directly so
    when omitted it becomes "None" (collection_name f"locomo_{args.run_id}" and path
    f"/tmp/mem0_{args.run_id}_chroma") causing persistent reuse across runs; fix by
    normalizing/assigning a unique default run id inside _build_mem0_config (or
    before calling it) when args.run_id is falsy — e.g., set run_id = args.run_id or
    generate a stable unique fallback (timestamp or uuid4) and use that run_id for
    collection_name and path so each run without an explicit --run-id gets an
    isolated backing store.

Duplicate comments:
In @benchmarks/locomo_runner.py:

  • Around line 351-382: The per-conversation temporary directory created by
    tempfile.mkdtemp (used to construct VectorMemory with db_path) is never removed
    and vmem.close() is not executed in a finally block; wrap the lifecycle of
    VectorMemory and its temp dir in a try/finally so that you always await
    vmem.close() (or call sync close if appropriate) and then remove the tmp
    directory with shutil.rmtree(tmp, ignore_errors=True) after closing the DB; keep
    the existing calls to await vmem.init(...) and _ingest_conversation(...) inside
    the try and ensure exception paths (e.g., around the asyncio.gather of _guarded
    tasks) still flow to the finally that closes vmem and deletes the temp dir.

In @benchmarks/mem0_locomo_runner.py:

  • Around line 409-412: The code is calling asyncio.get_event_loop() (assigning
    to loop) while already running inside run(); replace that call with
    asyncio.get_running_loop() to avoid deprecated/ambiguous behavior—update the
    assignment where loop is created (near variables concurrency, sem,
    progress_lock) so loop = asyncio.get_running_loop() is used inside the run()
    function or coroutine.

In @docs/specs/2026-04-16-routing-benchmark-design.md:

  • Around line 54-69: The fenced code block that contains the scorecard currently
    uses an untyped opening fence; update the opening fence to include a language
    tag (for example `text`) so the block starts with ```text and keep the
    closing fence unchanged; this addresses the MD040 lint
    warning for the scorecard fenced block.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `edb764c1-6d97-40e6-9184-8e4def7dbfd7`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between d2f1a1e632e54b06271a0cddd8da9811404ec99b and fd27d2ce574fa3f646e7e64f5ba7fe9e1d8d4de7.

</details>

<details>
<summary>📒 Files selected for processing (10)</summary>

* `benchmarks/locomo_runner.py`
* `benchmarks/longmemeval_runner.py`
* `benchmarks/mem0_locomo_runner.py`
* `docs/specs/2026-04-16-routing-benchmark-design.md`
* `docs/specs/2026-04-16-supersede-benchmark-design.md`
* `docs/specs/2026-04-16-taosmd-next-steps.md`
* `docs/specs/2026-04-24-ama-question-draft.md`
* `docs/specs/2026-04-methodology-post-draft.md`
* `eval/librarian_eval.py`
* `pyproject.toml`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (3)</summary>

* benchmarks/longmemeval_runner.py
* docs/specs/2026-04-methodology-post-draft.md
* docs/specs/2026-04-16-taosmd-next-steps.md

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (2)</summary>

* pyproject.toml
* eval/librarian_eval.py

</details>

</details>


Comment on lines +138 to +149
async def _judge(client: httpx.AsyncClient, url: str, model: str,
                 question: str, reference: str, predicted: str) -> float:
    try:
        reply = await _ollama_generate(
            client, url, model,
            JUDGE_PROMPT.format(question=question, reference=reference, predicted=predicted),
            temperature=0.0,
        )
    except Exception:
        return 0.0
    head = reply.strip().upper().split()
    return 1.0 if head and head[0].startswith("YES") else 0.0
⚠️ Potential issue | 🟠 Major

Judge transport failures are being scored as “NO”.

_judge() returns 0.0 on any exception, so an Ollama timeout/network failure becomes indistinguishable from a real negative grade. That biases the judge average downward and hides infra issues in the report.

🧰 Tools
🪛 Ruff (0.15.10)

[warning] 146-146: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 138 - 149, The judge currently
masks transport/timeouts by catching all exceptions in _judge and returning 0.0;
instead, detect and surface transport failures so they aren't counted as a "NO".
Modify _judge (which calls _ollama_generate with JUDGE_PROMPT) to not convert
exceptions into a 0.0 score — either re-raise the exception or return a sentinel
(e.g., None or math.nan) and log the error with details — and then update the
caller(s) that aggregate scores to skip/handle the sentinel (so only real
negative verdicts produce 0.0 and infra failures are excluded from the average).
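One way the sentinel convention could look (a sketch, not the repo's actual `_judge`/`_summary`: the return type widens to `Optional[float]`, and the aggregation step drops `None` rows):

```python
import logging
from typing import Optional

logger = logging.getLogger("locomo")

def parse_judge_reply(reply: Optional[str]) -> Optional[float]:
    """None = judge unavailable (transport failure); 0.0 = a real NO verdict."""
    if reply is None:  # caller caught a transport error and passed the sentinel
        logger.warning("judge call failed; excluding row from the average")
        return None
    head = reply.strip().upper().split()
    return 1.0 if head and head[0].startswith("YES") else 0.0

def judge_average(scores: list) -> float:
    scored = [s for s in scores if s is not None]  # drop infra failures
    return sum(scored) / len(scored) if scored else 0.0
```

With this split, `judge_average([1.0, 0.0, None, 1.0])` averages over the three scored rows only, so a flaky Ollama endpoint no longer reads as a string of NO verdicts.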

Comment on lines +236 to +245
    try:
        predicted = await _ollama_generate(
            client, ollama_url, model,
            ANSWER_PROMPT.format(context=context, question=question),
        )
    except Exception as exc:
        predicted = f"[generation_error: {exc}]"
    gen_ms = (time.time() - t1) * 1000.0

    judge = await _judge(client, ollama_url, model, question, reference, predicted)
⚠️ Potential issue | 🟠 Major

Generation errors are folded into the benchmark scores.

When answer generation fails, the code stores a synthetic [generation_error: ...] prediction and still computes F1/BLEU/Judge for that row. Those rows count toward total_qa, but failed_qa stays unchanged, so the JSON can report a “complete” run even though some QAs never produced an answer.

Also applies to: 247-260

🧰 Tools
🪛 Ruff (0.15.10)

[warning] 241-241: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 236 - 245, The try/except around
_ollama_generate currently converts generation exceptions into a synthetic
predicted string and still runs metric/judge computation, which folds generation
failures into scores and doesn't increment failed_qa; change the error path so
that when _ollama_generate raises you (a) record the failure by incrementing
failed_qa (the counter used for failed generations), (b) do not set a synthetic
prediction for scoring (use None or a flag), and (c) skip calling _judge and any
F1/BLEU metric computation for that item (only compute gen_ms as now). Update
the flow around _ollama_generate, predicted, gen_ms, and the call to _judge so
that _judge is only invoked when prediction succeeded and failed_qa is
incremented on exceptions.
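A sketch of that error path. The `generate`/`judge` callables stand in for `_ollama_generate`/`_judge`, and the `stats` dict stands in for the runner's `failed_qa` counter; the actual wiring in the repo may differ.

```python
import asyncio
import time

async def answer_one(generate, judge, question, reference, stats):
    """generate/judge are async callables standing in for _ollama_generate/_judge."""
    t1 = time.time()
    try:
        predicted = await generate(question)
    except Exception as exc:
        stats["failed_qa"] += 1                        # count the failure
        return {"predicted": None, "error": str(exc),  # no synthetic prediction
                "f1": None, "judge": None,             # metrics unavailable, not 0
                "gen_ms": (time.time() - t1) * 1000.0}
    gen_ms = (time.time() - t1) * 1000.0
    score = await judge(question, reference, predicted)  # judge only on success
    return {"predicted": predicted, "judge": score, "gen_ms": gen_ms}

async def _demo():
    async def gen_ok(q): return "blue"
    async def gen_boom(q): raise RuntimeError("ollama down")
    async def judge(q, r, p): return 1.0
    stats = {"failed_qa": 0}
    fail_row = await answer_one(gen_boom, judge, "q", "ref", stats)
    ok_row = await answer_one(gen_ok, judge, "q", "ref", stats)
    return stats, fail_row, ok_row

stats, fail_row, ok_row = asyncio.run(_demo())
```

The failed row carries an `error` field and `None` metrics, so a downstream summary can both report the failure count honestly and exclude the row from F1/BLEU/Judge averages.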

Comment on lines +223 to +247
def _build_mem0_config(args: argparse.Namespace) -> dict:
    return {
        "llm": {
            "provider": "ollama",
            "config": {
                "model": args.model,
                "ollama_base_url": args.ollama_url,
                "temperature": 0.2,
            },
        },
        "embedder": {
            "provider": "ollama",
            "config": {
                "model": "nomic-embed-text",
                "ollama_base_url": args.ollama_url,
            },
        },
        "vector_store": {
            "provider": "chroma",
            "config": {
                "collection_name": f"locomo_{args.run_id}",
                "path": f"/tmp/mem0_{args.run_id}_chroma",
            },
        },
    }
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and read the file to understand the context
find . -name "mem0_locomo_runner.py" -type f

Repository: jaylfc/taosmd

Length of output: 91


🏁 Script executed:

# Search for the file and examine its content
fd "mem0_locomo_runner.py" --type f

Repository: jaylfc/taosmd

Length of output: 89


🏁 Script executed:

# Read the entire file to understand the context
cat -n benchmarks/mem0_locomo_runner.py

Repository: jaylfc/taosmd

Length of output: 23880


The default mem0 backing store is reused across runs without an explicit --run-id.

With --run-id omitted, collection_name defaults to locomo_None and the Chroma vector store persists to /tmp/mem0_None_chroma. Re-running the benchmark reuses the same persisted collection instead of starting clean, allowing old memories to be retrieved again and making runs irreproducible. Always pass --run-id to isolate benchmark runs.

🧰 Tools
🪛 Ruff (0.15.10)

[error] 244-244: Probable insecure usage of temporary file or directory: "/tmp/mem0_"

(S108)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 223 - 247, The mem0 config
currently uses args.run_id directly so when omitted it becomes "None"
(collection_name f"locomo_{args.run_id}" and path
f"/tmp/mem0_{args.run_id}_chroma") causing persistent reuse across runs; fix by
normalizing/assigning a unique default run id inside _build_mem0_config (or
before calling it) when args.run_id is falsy — e.g., set run_id = args.run_id or
generate a stable unique fallback (timestamp or uuid4) and use that run_id for
collection_name and path so each run without an explicit --run-id gets an
isolated backing store.
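The normalization the prompt describes could be as small as this (the uuid fallback is one of the two options it names; the helper name is illustrative):

```python
import uuid

def normalize_run_id(run_id):
    """Return the explicit --run-id, or a unique fallback so each run
    without one gets an isolated Chroma collection and path."""
    return run_id or f"auto-{uuid.uuid4().hex[:12]}"

rid_explicit = normalize_run_id("exp7")  # explicit id passes through
rid_a = normalize_run_id(None)           # omitted id gets a fresh fallback
rid_b = normalize_run_id(None)
collection = f"locomo_{rid_a}"           # never "locomo_None"
```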

Comment on lines +321 to +325
        # evidence_hits/evidence_total not computable without dia_id metadata from
        # mem0's stored facts (mem0 doesn't preserve per-turn dia_ids). Set to 0.
        # TODO: if mem0 exposes metadata round-trip, wire dia_id here.
        "evidence_hits": 0,
        "evidence_total": len(evidence),
⚠️ Potential issue | 🟠 Major

R@K is guaranteed to report 0.0 here.

These rows keep evidence_total=len(evidence) while hardcoding evidence_hits=0 because mem0 does not round-trip dia_id. _summary() treats that as a real miss, so every mem0 run publishes retrieval_recall=0.0 even when retrieval succeeded. Mark the metric unavailable instead of zeroing it out, or plumb evidence IDs through mem0 first.

🐛 Proposed fix
-            "evidence_hits": 0,
-            "evidence_total": len(evidence),
+            "evidence_hits": 0,
+            "evidence_total": 0,  # exclude unavailable metric from retrieval_recall
+            "retrieval_recall_unavailable": True,
             "error": str(exc),
         }
...
-        "evidence_hits": 0,
-        "evidence_total": len(evidence),
+        "evidence_hits": 0,
+        "evidence_total": 0,
+        "retrieval_recall_unavailable": True,
     }

Also applies to: 371-375
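How the "mark the metric unavailable" option could play out in the summary step (a sketch; the repo's `_summary` and its row schema may differ):

```python
def retrieval_recall(rows):
    """Average R@K over rows where the metric is computable.

    Rows flagged retrieval_recall_unavailable (or with a 0/None
    evidence_total) are excluded rather than counted as misses; returns
    None when nothing was scorable, so callers can print "n/a" instead
    of a fake 0.0.
    """
    scored = [r for r in rows
              if not r.get("retrieval_recall_unavailable")
              and r.get("evidence_total")]
    if not scored:
        return None
    return sum(r["evidence_hits"] / r["evidence_total"] for r in scored) / len(scored)

rows = [
    {"evidence_hits": 2, "evidence_total": 4},      # normal taosmd-style row
    {"evidence_hits": 0, "evidence_total": 0,       # mem0 row: metric unavailable
     "retrieval_recall_unavailable": True},
]
recall = retrieval_recall(rows)
```

Here the mem0 row is simply out of the denominator, so a mem0 run reports "recall n/a" rather than a guaranteed 0.0.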

jaylfc added a commit that referenced this pull request Apr 19, 2026
CodeRabbit flagged four benchmark-integrity issues on #30. All real —
they let infra failures and adapter artefacts masquerade as real
negative results, biasing the published numbers and hiding problems.

1. _judge() returned 0.0 on Ollama timeout / network error — identical
   to a genuine "NO" grade. Now returns None; _summary filters None
   out of the Judge average so transport flakiness doesn't depress
   the score. The row still records 0.0 vs None distinctly.

2. Generation errors stored a synthetic "[generation_error: ...]"
   prediction and then computed F1/BLEU/Judge on it. That folded
   infra failures into the benchmark averages, and failed_qa stayed
   zero so the run reported "complete" despite missing answers.
   Now: on gen failure, predicted=""; f1/bleu/judge=None; row carries
   an `error` field; _guarded increments failed_qa for those rows;
   _summary excludes None metrics.

3. mem0_locomo_runner.py hardcoded evidence_hits=0 in every row
   because mem0 2.x doesn't round-trip per-turn dia_ids. _summary
   then published retrieval_recall=0.0 for every mem0 run — a fake
   miss. Now sets evidence_hits and evidence_total to None (metric
   unavailable, not zero), and _summary skips None rows from recall.

4. mem0 runner inherited both the 0.0-on-failure judge and the error-
   folding pattern. Both paths fixed to the same None-on-failure
   convention.

_summary now also emits judge_scored + recall_scored alongside count
so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487
scored of 1540 total"), making infra flakiness inspectable rather
than invisible.

No schema-breaking changes: existing .rescored.json outputs that have
evidence_hits=0 or judge=0.0 remain readable — the new _summary treats
them as real zeros, which is how they were when written. Only forward
runs produce None for "metric unavailable".
Third memory architecture in the comparison harness. Same generator
(gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same
1540 QAs as the taosmd and mem0 runners. Only the retrieval layer
changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own
benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5
per their 2026-03 results). Our run adds the novel measurement so the
three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts
and generation errors return None (not 0.0) so _summary excludes them
from averages. evidence_hits/evidence_total reported as None since
MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace