
feat(bench): LoCoMo runner + prompt-opt + mem0 adapter (validated)#30

Open
jaylfc wants to merge 2 commits into master from feat/locomo-prompt-opt

Conversation

@jaylfc
Owner

@jaylfc jaylfc commented Apr 19, 2026

Summary

Comprehensive LoCoMo benchmark PR — lands all the work that produced our published LongMemEval 97% and the new LoCoMo numbers. Supersedes the closed #25 (content subsumed into #26's unintentional bundle).

What lands

Benchmark infrastructure (~2000 LOC):

  • benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama answer/judge, F1/BLEU/Judge/R@K per category, --concurrency, --per-conv-limit, --timeout, --model flags, live progress markers)
  • benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports, now uses TAOSMD_OLLAMA_URL/TAOSMD_OLLAMA_MODEL env vars (no hardcoded IPs)
  • benchmarks/mem0_locomo_runner.py — apples-to-apples adapter routing retrieval through mem0.search() with the same generator/prompt/judge (for the upcoming memory-architecture comparison)
  • benchmarks/locomo_rescore.py — already merged via feat(bench): locomo_rescore + resume tools (external judge with better diagnostics) #29, included here only by branch history
  • eval/librarian_eval.py — fixed divide-by-zero token ratio (replaced with absolute per-query budget)
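
The environment-variable endpoint resolution described above can be sketched as follows (the variable names come from this PR; the fallback values shown are illustrative assumptions, not the repo's actual defaults):

```python
import os

# TAOSMD_OLLAMA_URL / TAOSMD_OLLAMA_MODEL are the env vars named in the PR.
# The fallback values below are assumptions for illustration only.
OLLAMA_URL = os.environ.get("TAOSMD_OLLAMA_URL", "http://localhost:11434")
OLLAMA_MODEL = os.environ.get("TAOSMD_OLLAMA_MODEL", "qwen3:4b")
```

This keeps hardcoded IPs out of the runners while letting a benchmark host override both values per run.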

Prompt tweaks (the +0.03 Overall Judge improvement, validated Apr 18–19):

  • benchmarks/locomo_runner.py ANSWER_PROMPT — adds absolute-date instruction + softened IDK instruction
  • Measured impact under external qwen3:4b judge:
    • Temporal: 0.36 → 0.41 (+0.05)
    • Multi-hop: 0.21 → 0.24 (+0.03)
    • Single-hop + Open-dom: unchanged (prompt targeted temporal/multi-hop specifically — clean signal)

Methodology docs:

  • docs/specs/2026-04-16-taosmd-next-steps.md — four-workstream roadmap
  • docs/specs/2026-04-16-supersede-benchmark-design.md
  • docs/specs/2026-04-16-routing-benchmark-design.md
  • docs/specs/2026-04-24-ama-question-draft.md — Imran Ahmad AMA on r/LLMeng
  • docs/specs/2026-04-methodology-post-draft.md — publishable protocol-transparency post

README framing (updated in prior commits on this branch):

  • Reframed Librarian section around "query-time RAG breaks on implicit context" critique
  • Added Security & Integrity section on memory poisoning

Validated results

LoCoMo full scorecard under qwen3:4b external judge (100% coverage, 0 errors):

Category     e2b    e4b    e2b+prompt-opt
Single-hop   0.17   0.15   0.16
Temporal     0.36   0.31   0.41
Multi-hop    0.21   0.14   0.24
Open-dom     0.51   0.51   0.51
Overall      0.40   0.38   0.41
F1           0.162  0.152  0.175
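
The F1 row above is token-level overlap between predicted and reference answers, and BLEU here is BLEU-1 (unigram precision). A minimal sketch of both under whitespace tokenization (the runners' exact tokenization and normalization may differ):

```python
from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1: harmonic mean of precision/recall over token overlap."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def bleu1(pred: str, ref: str) -> float:
    """BLEU-1 without brevity penalty: unigram precision vs. the reference."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p:
        return 0.0
    return sum((Counter(p) & Counter(r)).values()) / len(p)
```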

Test plan

  • All three configurations (e2b, e4b, e2b+prompt-opt) benchmarked against the full 1540-QA LoCoMo set
  • External judge (qwen3:4b) re-judged all three runs — 100% coverage, 0 errors
  • Prompt-opt improvement localized to targeted categories as designed
  • Existing test suite green (pytest tests/ -q --ignore=tests/integration)
  • Post-merge: delete feat/locomo-benchmark and feat/locomo-prompt-opt
  • Post-merge: Fedora git fetch --prune && git checkout master before next benchmark run

Summary by CodeRabbit

Release Notes

  • New Features

    • Added LoCoMo benchmark runners for evaluating retrieval and QA generation performance
    • Made benchmark LLM and embedding configuration configurable via environment variables
  • Documentation

    • Added benchmark design specifications for routing and temporal reasoning evaluation scenarios
  • Chores

    • Updated evaluation metrics to use absolute token-per-query budget instead of relative comparison

@coderabbitai

coderabbitai bot commented Apr 19, 2026

Warning

Rate limit exceeded

@jaylfc has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 12 minutes and 45 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 12 minutes and 45 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: bf2c2afe-47ef-430e-a3e7-d25e6053cf42

📥 Commits

Reviewing files that changed from the base of the PR and between fd27d2c and ca0ccb7.

📒 Files selected for processing (2)
  • benchmarks/mempalace_locomo_runner.py
  • pyproject.toml
📝 Walkthrough

Walkthrough

This PR introduces two new LoCoMo benchmark runners (taosmd and mem0 variants), updates the existing LongMemEval runner to use taosmd modules with environment-configurable LLM endpoints, modifies the Librarian evaluation token-budget logic from multiplier-based to absolute tokens-per-query, and adds multiple specification documents for upcoming benchmarking initiatives and roadmap planning.

Changes

  • LoCoMo Benchmark Runners (benchmarks/locomo_runner.py, benchmarks/mem0_locomo_runner.py): Two new benchmark executables implementing full LoCoMo workflows: locomo_runner.py (465 lines) ingests conversations into VectorMemory and runs QA retrieval/generation/judging with Ollama, computing F1/BLEU-1/judge metrics per category; mem0_locomo_runner.py (549 lines) performs equivalent operations using mem0 as the vector store backend. Both handle per-conversation temp storage, concurrent QA processing, result aggregation, and JSON/table output.
  • Benchmark Integration (benchmarks/longmemeval_runner.py): Updated imports from deprecated tinyagentos.* to taosmd.* modules; made the Ollama endpoint and model configurable via environment variables (TAOSMD_OLLAMA_URL, TAOSMD_OLLAMA_MODEL) with sensible defaults; extended VectorMemory construction to accept embedding mode and ONNX path from environment variables (TAOSMD_EMBED_MODE, TAOSMD_ONNX_PATH).
  • Evaluation Logic (eval/librarian_eval.py): Changed the token-budget criterion in _print_summary from a relative multiplier (token_mult <= 3.0) to an absolute per-query budget: derives query count from librarian report axes, computes tokens_per_query, applies a fixed TOKEN_BUDGET = 2000 threshold, and updates output messaging.
  • Dependencies (pyproject.toml): Added an optional benchmarks dependency group with mem0ai>=0.1.0; removed the unused [tool.setuptools.package-data] section for docs.
  • Specification Documents (docs/specs/2026-04-16-routing-benchmark-design.md, docs/specs/2026-04-16-supersede-benchmark-design.md, docs/specs/2026-04-16-taosmd-next-steps.md, docs/specs/2026-04-24-ama-question-draft.md, docs/specs/2026-04-methodology-post-draft.md): Five new spec/plan documents: routing and supersede benchmark designs (measurement questions, configurations, prerequisites, metrics), the taosmd execution roadmap (LoCoMo publication, Librarian validations, README reframing, ecosystem visibility), an AMA question draft for community engagement, and a methodology transparency post addressing benchmark reproducibility and reporting standards.
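
The token-budget change described above replaces a relative multiplier (which divides by a baseline and can hit divide-by-zero) with an absolute per-query threshold. A minimal sketch of the absolute check (function name and return shape are illustrative, not the repo's API; the TOKEN_BUDGET value comes from this PR's summary):

```python
TOKEN_BUDGET = 2000  # absolute tokens-per-query threshold (value from the PR)

def token_cost_pass(total_tokens: int, n_queries: int) -> tuple[float, bool]:
    """Return (tokens_per_query, within_budget).

    An absolute budget avoids the divide-by-zero a relative multiplier
    hits when the baseline token count is zero. Illustrative helper only.
    """
    if n_queries <= 0:
        return 0.0, True  # no queries: trivially within budget
    per_query = total_tokens / n_queries
    return per_query, per_query <= TOKEN_BUDGET
```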

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Runner
    participant Dataset
    participant VectorMemory
    participant Ollama
    participant Judge

    User->>Runner: run(dataset, model, concurrency, top_k)
    Runner->>Dataset: load LoCoMo JSON
    Dataset-->>Runner: conversations & QA items

    loop For each conversation
        Runner->>VectorMemory: create temp instance
        Runner->>Runner: ingest session turns
        VectorMemory-->>Runner: indexed (dialog_id, speaker, datetime)

        loop For each QA (concurrent, semaphore-limited)
            Runner->>VectorMemory: search(query, top_k)
            VectorMemory-->>Runner: ranked hits
            Runner->>Runner: build context (hits + datetime)
            Runner->>Ollama: /api/generate(context + question)
            Ollama-->>Runner: predicted_answer
            Runner->>Judge: /api/generate(judge_prompt)
            Judge-->>Runner: YES/NO score (0.0 or 1.0)
            Runner->>Runner: compute F1, BLEU-1 vs reference
            Runner->>Runner: store metrics & latencies
        end
    end

    Runner->>Runner: aggregate by category
    Runner->>Runner: compute retrieval_recall
    Runner-->>User: JSON results + formatted summary table

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

The two new benchmark runners introduce substantial, multi-component async workflows (VectorMemory/mem0 ingestion, Ollama integration, concurrent QA processing, metric computation) across 1,000+ lines. The logic is moderately dense with helper functions, error handling, and aggregation logic. The variety of changes (two new benchmarks, existing benchmark refactor, eval logic modification, dependency updates, and documentation) adds heterogeneity. The evaluation logic change requires careful reasoning about the token-budget semantics. Documentation specs are straightforward but numerous.
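
The semaphore-limited concurrent QA processing mentioned above follows a common asyncio fan-out pattern; a self-contained sketch (helper name and signature are illustrative, not the repo's actual functions):

```python
import asyncio

async def run_concurrent(items, worker, concurrency: int = 8):
    """Semaphore-limited fan-out over items, dropping failed tasks.

    Mirrors the per-QA concurrency pattern described in the walkthrough;
    illustrative helper, not the repo's implementation.
    """
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with sem:  # at most `concurrency` workers run at once
            return await worker(item)

    tasks = [asyncio.create_task(guarded(i)) for i in items]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]
```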

Possibly related PRs

  • PR #29: Both implement Ollama-based judge logic and parse YES/NO judge replies; locomo_runner.py introduces the foundational _judge() and _ollama_generate() functions that may be reused or compared with similar patterns in #29.

Poem

🐰 Benchmark runners hop and bound,
LoCoMo logic newly found,
Mem0 searches, Ollama thinks,
Metrics flow through careful links,
Roadmap charted, specs align—
More transparency by design! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

  • Linked Issues check (⚠️ Warning): The linked issue #25 requires prompt migration and JSON parsing changes in session_catalog.py, but no changes to that file appear in this PR's changeset. Resolution: Review whether #25's refactoring of session_catalog.py and prompts module integration was intended to be included or if it belongs in a separate PR.
  • Out of Scope Changes check (⚠️ Warning): The PR adds extensive LoCoMo benchmark code, mem0 adapter, evaluation tweaks, and methodology docs that align with the PR objectives but diverge from the linked issue #25 requirements. Resolution: Clarify the relationship between #25 (session catalog refactoring) and the current PR scope; consider separating them if they represent different feature tracks.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.57%, below the required threshold of 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title directly references the three main deliverables: LoCoMo runner, prompt optimization, and mem0 adapter, matching the primary changes in the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@kilo-code-bot

kilo-code-bot bot commented Apr 19, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Overview

Severity    Count
CRITICAL    0
WARNING     0
SUGGESTION  0

Files Reviewed (2 files)
  • benchmarks/mempalace_locomo_runner.py - New runner for MemPalace benchmark, properly implements consistent benchmark interface, follows same patterns as other runners, handles None values correctly, uses relative paths for dataset
  • pyproject.toml - Added mempalace to benchmarks optional dependencies

Reviewed by seed-2-0-pro-260328 · 262,431 tokens


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (8)
.gitignore (1)

13-13: Keep a versioned CLAUDE.md template if local CLAUDE.md is ignored.

Ignoring CLAUDE.md is fine for local machine-specific config, but it can hide required shared agent/memory instructions from the repo. Consider committing CLAUDE.md.example (or equivalent) and documenting copy/setup steps.

Suggested `.gitignore` adjustment
 CLAUDE.md
+!CLAUDE.md.example

Based on learnings: “Applies to {AGENTS,CLAUDE}.md: Do not overwrite existing AGENTS.md or CLAUDE.md files without first reading them and checking for existing memory system integrations (Mem0, Zep, MemGPT/Letta, OpenClaw memory, LangChain modules)”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 13, Make sure a versioned template for CLAUDE.md is
committed and .gitignore only excludes the local file: add a committed
CLAUDE.md.example (or CLAUDE.md.template) to the repo containing the shared
agent/memory instructions, update .gitignore to ignore CLAUDE.md but not
CLAUDE.md.example, and add a brief note in the README or a new CONTRIBUTING
section describing the copy/setup steps (copy CLAUDE.md.example -> CLAUDE.md and
merge any existing local CLAUDE.md to avoid overwriting memory integrations).
benchmarks/locomo_runner.py (2)

387-387: datetime.utcnow() is deprecated in Python 3.12+.

Consider using datetime.now(timezone.utc) for forward compatibility.

♻️ Proposed fix
+from datetime import datetime, timezone
 ...
-    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
+    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` at line 387, Replace the deprecated
datetime.utcnow() usage when building the timestamp: change the assignment
timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S") to use timezone-aware
time, e.g. timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"), and
add/update the import to include timezone (from datetime import datetime,
timezone) so the timestamp is produced in a forward-compatible, UTC-aware way.

198-200: Clarify the double-metadata unwrapping logic.

The expression h.get("metadata", {}).get("metadata", h.get("metadata", {})) suggests the retrieve API sometimes returns metadata nested inside another metadata key. This defensive handling works, but may indicate an API inconsistency worth investigating.

Consider adding a comment explaining when the double-nesting occurs, or normalizing the API response in retrieve().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 198 - 200, The list comprehension
unpacks nested metadata defensively; update the code to normalize the API
response in the retrieve() function so hits always have a flat dict at
h["metadata"] (or explicitly document when nested metadata occurs) and then
simplify the comprehension to use h.get("metadata", {}) directly; specifically,
modify retrieve() to detect and unwrap cases where response returns {"metadata":
{"metadata": ...}} into {"metadata": ...}, and add a short inline comment near
the return (or in retrieve()) explaining the normalization so the expression
h.get("metadata", {}) is sufficient and the double-metadata defensive logic is
no longer needed.
benchmarks/locomo_rescore.py (1)

257-268: Closure captures client before it's defined.

The nested _judge_one function references client at line 260, but client is only assigned at line 267. This works due to Python's late binding of closures (the name is looked up when _judge_one executes, not when it's defined), but it's fragile and potentially confusing.

Consider passing client as a parameter to make the dependency explicit:

♻️ Proposed fix
-    async def _judge_one(idx: int, rec: dict):
+    async def _judge_one(idx: int, rec: dict, client: httpx.AsyncClient):
         nonlocal unparseable
         async with sem:
             val = await _call_ollama_judge(client, ollama_url, model, rec)
         ...

     limits = httpx.Limits(max_connections=concurrency + 2, max_keepalive_connections=concurrency)
     async with httpx.AsyncClient(limits=limits) as client:
-        tasks = [asyncio.create_task(_judge_one(i, r)) for i, r in enumerate(records)]
+        tasks = [asyncio.create_task(_judge_one(i, r, client)) for i, r in enumerate(records)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_rescore.py` around lines 257 - 268, The nested function
_judge_one currently captures client from the outer scope after it is created,
which is fragile; make the dependency explicit by adding a client parameter to
_judge_one (e.g., async def _judge_one(idx: int, rec: dict, client):) and update
its internal call to await _call_ollama_judge(client, ollama_url, model, rec).
Then, when creating tasks inside the async with httpx.AsyncClient(...) as client
block, pass the client into each task (e.g., asyncio.create_task(_judge_one(i,
r, client))) while leaving uses of sem, results, and records unchanged.
benchmarks/mem0_locomo_runner.py (3)

467-467: datetime.utcnow() is deprecated in Python 3.12+.

Consider using datetime.now(timezone.utc) for forward compatibility.

♻️ Proposed fix
+from datetime import datetime, timezone
 ...
-    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
+    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` at line 467, Replace the deprecated
datetime.utcnow() call used when constructing timestamp (the line assigning
timestamp) with a timezone-aware call: use
datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") and ensure timezone is
imported (from datetime import datetime, timezone) so the timestamp is
forward-compatible with Python 3.12+.

27-65: Consider extracting shared constants to a common module.

The prompts and utilities are intentionally duplicated for self-containment, but this creates a maintenance burden. If the prompts change in locomo_runner.py, they must be manually updated here. Consider extracting ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES, and shared utilities (_tokenize, _f1, _bleu1, _judge) to a benchmarks/_common.py module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 27 - 65, The duplicated
prompts and utilities (ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES and helper
functions like _tokenize, _f1, _bleu1, _judge) should be extracted into a single
shared module (e.g., _common) and imported from both runners so they stay in
sync; create the new module containing those constants and utility
implementations, update mem0_locomo_runner.py and locomo_runner.py to import
ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES and the utility functions instead of
redefining them, and ensure any references to those symbols remain unchanged so
behavior is preserved.

240-246: Consider using tempfile.mkdtemp for safer temporary directory creation.

The hardcoded /tmp/mem0_{run_id}_chroma path is predictable. While acceptable for benchmark tooling, using tempfile.mkdtemp (as done in locomo_runner.py line 351) would be more robust against symlink attacks on shared systems.

♻️ Proposed fix
+import tempfile
+
 def _build_mem0_config(args: argparse.Namespace) -> dict:
+    tmp_dir = tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_")
     return {
         ...
         "vector_store": {
             "provider": "chroma",
             "config": {
                 "collection_name": f"locomo_{args.run_id}",
-                "path": f"/tmp/mem0_{args.run_id}_chroma",
+                "path": tmp_dir,
             },
         },
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 240 - 246, The vector_store
config currently uses a predictable hardcoded path
f"/tmp/mem0_{args.run_id}_chroma"; replace this with a securely-created
temporary directory using tempfile.mkdtemp to avoid predictable symlink attacks:
import tempfile, call tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_chroma_") to
get a tempdir, and assign that tempdir to the "path" value while keeping
"collection_name" as f"locomo_{args.run_id}"; update any teardown/cleanup logic
if needed to remove the tempdir after the benchmark.
docs/specs/2026-04-16-taosmd-next-steps.md (1)

33-33: Task may already be completed.

The relevant code snippet from eval/librarian_eval.py:630-642 shows that the token cost metric already uses tokens_per_query <= TOKEN_BUDGET (absolute threshold) rather than an infx ratio. Consider updating this task to reflect its completion status or verifying if additional work remains.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-16-taosmd-next-steps.md` at line 33, The spec task refers
to changing the token cost metric but the code in eval/librarian_eval.py already
enforces the absolute threshold via the tokens_per_query <= TOKEN_BUDGET check;
update the docs/specs/2026-04-16-taosmd-next-steps.md entry 5 to mark this task
as completed (or note any remaining subtasks), and if you want extra
verification, open eval/librarian_eval.py and confirm the evaluation uses
tokens_per_query and TOKEN_BUDGET rather than an infx ratio in the token-cost
calculation logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/locomo_runner.py`:
- Around line 421-432: The defaults for the argparse options are using hardcoded
user-specific paths; update the p.add_argument calls for "--dataset" and
"--onnx-path" to derive sensible defaults from environment variables or relative
locations instead of "/home/jay/...". For example, read env vars like
DATASET_PATH and ONNX_PATH (os.environ.get("DATASET_PATH")) or build a
project-relative default via pathlib.Path(__file__).resolve().parents[...] and
fall back to None or a clearly documented relative path; update the "--dataset"
and "--onnx-path" argument default values accordingly and update help text to
mention the env vars/relative default and that users can override via CLI.
- Around line 351-382: The tmp directory created with tempfile.mkdtemp(...) is
never removed; change the code to ensure cleanup by using a
tempfile.TemporaryDirectory() context manager around the lifecycle of tmp and
the VectorMemory instance (replace tempfile.mkdtemp(...) with
TemporaryDirectory() as tmp_dir and pass os.path.join(tmp_dir, "vmem.db")), or
if you must keep mkdtemp, call shutil.rmtree(tmp) after await vmem.close();
update references to tmp (used when constructing VectorMemory and when closing
vmem) so the directory is always removed even if exceptions occur (use
try/finally or context manager around vmem.init()/await
_ingest_conversation/.../await vmem.close()).

In `@benchmarks/longmemeval_runner.py`:
- Around line 152-156: The default onnx_path is currently hardcoded to a
machine-specific path; update the VectorMemory construction to avoid a hardcoded
default by using a neutral fallback (e.g., None or a package/resource-relative
path) and rely on the TAOSMD_ONNX_PATH env var or a config to provide the real
model location; modify the call that creates VectorMemory (parameters: db_path,
embed_mode, onnx_path) so onnx_path uses os.environ.get("TAOSMD_ONNX_PATH") with
no hardcoded user path and add handling in VectorMemory (or the caller) to raise
a clear error or load a bundled model when onnx_path is None.

In `@benchmarks/mem0_locomo_runner.py`:
- Line 412: Replace the deprecated call to asyncio.get_event_loop() inside the
async function run() with asyncio.get_running_loop(); locate the assignment
"loop = asyncio.get_event_loop()" and change it to use
asyncio.get_running_loop() so the coroutine retrieves the currently running
event loop instead of the deprecated global accessor.
- Line 511: The default dataset argument in p.add_argument("--dataset", ...)
uses a hardcoded user-specific path; change it to use a portable default or an
environment variable: remove the absolute /home/jay/... path and either set a
relative default like "data/locomo10.json" or set
default=os.environ.get("LOCOMO_DATASET", "data/locomo10.json") and import os;
alternatively make the argument required (no default) so callers must supply a
path. Update the p.add_argument("--dataset", ...) call accordingly and ensure
any README or run scripts reference the chosen default or env var.

In `@docs/specs/2026-04-16-routing-benchmark-design.md`:
- Around line 54-69: The fenced code block showing the scorecard currently uses
an untyped fence (```) which triggers MD040; change the opening fence to a typed
fence by replacing the leading ``` with ```text so the block becomes ```text ...
``` and leave the closing backticks unchanged; this affects the fenced block
that starts with the line "Category       vector-only ..." and ends with the
final "open-domain → crystals       [N/M correct]" closing fence.

In `@eval/librarian_eval.py`:
- Around line 628-633: The n_queries calculation in eval/librarian_eval.py uses
a non-existent key "n" for axis_c (lib.axis_c.get("n", 0)) while eval_axis_c
emits "n_sessions", causing undercounting; update the expression that computes
n_queries (the max(...) block) to read lib.axis_c.get("n_sessions", 0) (or sum n
and n_sessions as appropriate) so axis_c's sessions are included consistently
with eval_axis_c and tokens/query remains correct.

In `@README.md`:
- Around line 7-9: The README currently uses "97.0%" ambiguously; the top claims
"97.0% end-to-end Judge accuracy on LongMemEval-S" while later sections refer to
"97.0%" as Recall@5 — fix by unifying the framing: decide whether 97.0% refers
to end-to-end Judge accuracy or Recall@5, then update every occurrence of the
phrase "97.0%" and its surrounding label (search for the strings "97.0%",
"LongMemEval-S", and "Recall@5") so each mention explicitly states the metric
(e.g., "97.0% end-to-end Judge accuracy" or "97.0% Recall@5"); alternatively,
present both numbers distinctly (e.g., "97.0% end-to-end Judge accuracy;
Recall@5 = X") and update the sentences around those symbols to remove
ambiguity.

---

Nitpick comments:
In @.gitignore:
- Line 13: Make sure a versioned template for CLAUDE.md is committed and
.gitignore only excludes the local file: add a committed CLAUDE.md.example (or
CLAUDE.md.template) to the repo containing the shared agent/memory instructions,
update .gitignore to ignore CLAUDE.md but not CLAUDE.md.example, and add a brief
note in the README or a new CONTRIBUTING section describing the copy/setup steps
(copy CLAUDE.md.example -> CLAUDE.md and merge any existing local CLAUDE.md to
avoid overwriting memory integrations).

In `@benchmarks/locomo_rescore.py`:
- Around line 257-268: The nested function _judge_one currently captures client
from the outer scope after it is created, which is fragile; make the dependency
explicit by adding a client parameter to _judge_one (e.g., async def
_judge_one(idx: int, rec: dict, client):) and update its internal call to await
_call_ollama_judge(client, ollama_url, model, rec). Then, when creating tasks
inside the async with httpx.AsyncClient(...) as client block, pass the client
into each task (e.g., asyncio.create_task(_judge_one(i, r, client))) while
leaving uses of sem, results, and records unchanged.

In `@benchmarks/locomo_runner.py`:
- Line 387: Replace the deprecated datetime.utcnow() usage when building the
timestamp: change the assignment timestamp =
datetime.utcnow().strftime("%Y%m%d_%H%M%S") to use timezone-aware time, e.g.
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"), and add/update
the import to include timezone (from datetime import datetime, timezone) so the
timestamp is produced in a forward-compatible, UTC-aware way.
- Around line 198-200: The list comprehension unpacks nested metadata
defensively; update the code to normalize the API response in the retrieve()
function so hits always have a flat dict at h["metadata"] (or explicitly
document when nested metadata occurs) and then simplify the comprehension to use
h.get("metadata", {}) directly; specifically, modify retrieve() to detect and
unwrap cases where response returns {"metadata": {"metadata": ...}} into
{"metadata": ...}, and add a short inline comment near the return (or in
retrieve()) explaining the normalization so the expression h.get("metadata", {})
is sufficient and the double-metadata defensive logic is no longer needed.

In `@benchmarks/mem0_locomo_runner.py`:
- Line 467: Replace the deprecated datetime.utcnow() call used when constructing
timestamp (the line assigning timestamp) with a timezone-aware call: use
datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") and ensure timezone is
imported (from datetime import datetime, timezone) so the timestamp is
forward-compatible with Python 3.12+.
- Around line 27-65: The duplicated prompts and utilities (ANSWER_PROMPT,
JUDGE_PROMPT, CATEGORY_NAMES and helper functions like _tokenize, _f1, _bleu1,
_judge) should be extracted into a single shared module (e.g., _common) and
imported from both runners so they stay in sync; create the new module
containing those constants and utility implementations, update
mem0_locomo_runner.py and locomo_runner.py to import ANSWER_PROMPT,
JUDGE_PROMPT, CATEGORY_NAMES and the utility functions instead of redefining
them, and ensure any references to those symbols remain unchanged so behavior is
preserved.
- Around line 240-246: The vector_store config currently uses a predictable
hardcoded path f"/tmp/mem0_{args.run_id}_chroma"; replace this with a
securely-created temporary directory using tempfile.mkdtemp to avoid predictable
symlink attacks: import tempfile, call
tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_chroma_") to get a tempdir, and
assign that tempdir to the "path" value while keeping "collection_name" as
f"locomo_{args.run_id}"; update any teardown/cleanup logic if needed to remove
the tempdir after the benchmark.

In `@docs/specs/2026-04-16-taosmd-next-steps.md`:
- Line 33: The spec task refers to changing the token cost metric but the code
in eval/librarian_eval.py already enforces the absolute threshold via the
tokens_per_query <= TOKEN_BUDGET check; update the
docs/specs/2026-04-16-taosmd-next-steps.md entry 5 to mark this task as
completed (or note any remaining subtasks), and if you want extra verification,
open eval/librarian_eval.py and confirm the evaluation uses tokens_per_query and
TOKEN_BUDGET rather than an infx ratio in the token-cost calculation logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 607d827f-0e5a-4d00-8eb3-51299f9cdc95

📥 Commits

Reviewing files that changed from the base of the PR and between c360faa and d2f1a1e.

📒 Files selected for processing (13)
  • .gitignore
  • README.md
  • benchmarks/locomo_rescore.py
  • benchmarks/locomo_runner.py
  • benchmarks/longmemeval_runner.py
  • benchmarks/mem0_locomo_runner.py
  • docs/specs/2026-04-16-routing-benchmark-design.md
  • docs/specs/2026-04-16-supersede-benchmark-design.md
  • docs/specs/2026-04-16-taosmd-next-steps.md
  • docs/specs/2026-04-24-ama-question-draft.md
  • docs/specs/2026-04-methodology-post-draft.md
  • eval/librarian_eval.py
  • pyproject.toml

Comment on lines +351 to +382
            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
            vmem = VectorMemory(
                db_path=os.path.join(tmp, "vmem.db"),
                qmd_url=args.qmd_url,
                embed_mode=args.embed_mode,
                onnx_path=args.onnx_path,
            )
            await vmem.init(http_client=client)
            added, ingest_s = await _ingest_conversation(vmem, conv)
            print(f"[{conv_id}] ingested {added} turns in {ingest_s:.1f}s", flush=True)

            # Pick eligible QAs up-front. --per-conv-limit caps QAs per
            # conversation (for balanced sampling); --limit is the global cap.
            eligible: list[dict] = []
            for qa in conv.get("qa", []) or []:
                cat = int(qa.get("category", 0))
                if cat not in include_cats:
                    continue
                if args.per_conv_limit and len(eligible) >= args.per_conv_limit:
                    break
                if args.limit and (total_seen + len(eligible)) >= args.limit:
                    break
                eligible.append(qa)

            if eligible:
                tasks = [_guarded(qa, conv_id, vmem, client) for qa in eligible]
                gathered = await asyncio.gather(*tasks, return_exceptions=True)
                for outcome in gathered:
                    if isinstance(outcome, Exception) or outcome is None:
                        continue
                    results.append(outcome)
            await vmem.close()
⚠️ Potential issue | 🟡 Minor

Temporary directories are not cleaned up after use.

tempfile.mkdtemp() creates directories that persist after the script exits. For benchmark runs processing multiple conversations, this could accumulate significant disk usage in /tmp.

🐛 Proposed fix using context manager
+import shutil
 ...
-            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
-            vmem = VectorMemory(
-                db_path=os.path.join(tmp, "vmem.db"),
-                ...
-            )
-            ...
-            await vmem.close()
+            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
+            try:
+                vmem = VectorMemory(
+                    db_path=os.path.join(tmp, "vmem.db"),
+                    ...
+                )
+                ...
+                await vmem.close()
+            finally:
+                shutil.rmtree(tmp, ignore_errors=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 351 - 382, The tmp directory
created with tempfile.mkdtemp(...) is never removed; change the code to ensure
cleanup by using a tempfile.TemporaryDirectory() context manager around the
lifecycle of tmp and the VectorMemory instance (replace tempfile.mkdtemp(...)
with TemporaryDirectory() as tmp_dir and pass os.path.join(tmp_dir, "vmem.db")),
or if you must keep mkdtemp, call shutil.rmtree(tmp) after await vmem.close();
update references to tmp (used when constructing VectorMemory and when closing
vmem) so the directory is always removed even if exceptions occur (use
try/finally or context manager around vmem.init()/await
_ingest_conversation/.../await vmem.close()).
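The context-manager variant the prompt describes can be sketched as follows. `FakeVectorMemory` is a stand-in stub (the real `VectorMemory` and its signature live in the repo); the point is only the lifecycle: `close()` always runs, and the temp directory is removed even on exceptions.

```python
import asyncio
import os
import tempfile

class FakeVectorMemory:
    """Stand-in for the repo's VectorMemory, only to show init/close ordering."""
    def __init__(self, db_path: str):
        self.db_path = db_path
        self.closed = False

    async def init(self) -> None:
        open(self.db_path, "w").close()  # pretend to create the SQLite file

    async def close(self) -> None:
        self.closed = True

async def process_conversation(conv_id: str) -> "FakeVectorMemory":
    # TemporaryDirectory removes the directory even if ingest or retrieval raises.
    with tempfile.TemporaryDirectory(prefix=f"locomo_{conv_id}_") as tmp:
        vmem = FakeVectorMemory(db_path=os.path.join(tmp, "vmem.db"))
        try:
            await vmem.init()
            # ... ingest turns and answer QAs here ...
        finally:
            await vmem.close()  # always runs, as the prompt asks
    return vmem  # the temp dir (and its vmem.db) is gone by now

vmem = asyncio.run(process_conversation("conv-1"))
```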

    concurrency = max(1, int(args.concurrency))
    sem = asyncio.Semaphore(concurrency)
    progress_lock = asyncio.Lock()
    loop = asyncio.get_event_loop()
⚠️ Potential issue | 🟡 Minor

Use asyncio.get_running_loop() instead of asyncio.get_event_loop().

asyncio.get_event_loop() is deprecated in Python 3.10+ when called from a coroutine. Since run() is an async function, use asyncio.get_running_loop() instead.

🐛 Proposed fix
-    loop = asyncio.get_event_loop()
+    loop = asyncio.get_running_loop()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
loop = asyncio.get_event_loop()
loop = asyncio.get_running_loop()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` at line 412, Replace the deprecated call to
asyncio.get_event_loop() inside the async function run() with
asyncio.get_running_loop(); locate the assignment "loop =
asyncio.get_event_loop()" and change it to use asyncio.get_running_loop() so the
coroutine retrieves the currently running event loop instead of the deprecated
global accessor.
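A minimal illustration of why the swap is safe here: `get_running_loop()` only works inside a running event loop, which is exactly the situation inside an `async def run()`.

```python
import asyncio

async def run() -> bool:
    # Safe here: we are inside a coroutine, so a loop is guaranteed to be running.
    loop = asyncio.get_running_loop()
    return loop.is_running()

ok = asyncio.run(run())
```

Calling `asyncio.get_running_loop()` outside a coroutine raises `RuntimeError` instead of silently creating a loop, which is the version-dependent behavior the deprecated accessor had.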

Comment on lines +54 to +69
```
Category vector-only flat-thorough router-directed
F1 Judge tokens F1 Judge tokens F1 Judge tokens
single-hop 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
temporal 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
multi-hop 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
open-domain 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N
------
overall 0.xx 0.xx N 0.xx 0.xx N 0.xx 0.xx N

Router accuracy (vs expected layer for each category):
single-hop → vector [N/M correct]
temporal → catalog+kg [N/M correct]
multi-hop → kg [N/M correct]
open-domain → crystals [N/M correct]
```
⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced block to satisfy markdown lint.

The scorecard block should use a typed fence (e.g., text) to avoid MD040 failures.

Suggested fix
-```
+```text
 Category       vector-only              flat-thorough            router-directed
                F1    Judge  tokens      F1    Judge  tokens      F1    Judge  tokens
 single-hop     0.xx  0.xx   N           0.xx  0.xx   N           0.xx  0.xx   N
@@
   open-domain → crystals       [N/M correct]
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 54-54: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-16-routing-benchmark-design.md` around lines 54 - 69, The
fenced code block showing the scorecard currently uses an untyped fence (```)
which triggers MD040; change the opening fence to a typed fence by replacing the
leading ``` with ```text so the block becomes ```text ... ``` and leave the
closing backticks unchanged; this affects the fenced block that starts with the
line "Category       vector-only ..." and ends with the final "open-domain →
crystals       [N/M correct]" closing fence.

Comprehensive LoCoMo benchmark bundle rebased onto current master.
Supersedes the closed #25.

## What lands

**Benchmark infrastructure** (~2000 LOC):
- benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama
  answer/judge, F1/BLEU/Judge/R@K per category, flags for --concurrency,
  --per-conv-limit, --timeout, --model)
- benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports
- benchmarks/mem0_locomo_runner.py — apples-to-apples mem0 adapter
  routing retrieval through mem0.search() with same generator/prompt/
  judge as the taosmd runner
- pyproject.toml — adds mem0ai + chroma optional deps for the adapter

**Prompt tweaks** (validated +0.03 Overall Judge under external qwen3:4b):
- Absolute-date instruction + softened IDK in ANSWER_PROMPT
- Per-category: Temporal 0.36→0.41 (+0.05), Multi-hop 0.21→0.24 (+0.03)
- Targeted improvements landed on targeted categories; Single-hop +
  Open-dom unchanged (clean intervention signal)

**Methodology docs** (4 specs under docs/specs/)

## Review-feedback fixes applied

- Kilo CRITICAL: all hardcoded /home/jay/... paths in
  locomo_runner.py (--dataset, --onnx-path), longmemeval_runner.py,
  mem0_locomo_runner.py are now repo-relative defaults derived from
  __file__, with LOCOMO_DATASET + TAOSMD_ONNX_PATH env-var overrides
- CodeRabbit major: eval/librarian_eval.py was reading axis_c.get("n")
  but eval_axis_c emits "n_sessions" — fixed with get("n_sessions",
  get("n", 0)) compat shim, undercounted n_queries no longer
  inflates tokens_per_query

## Test plan
- [x] All three models benchmarked against 1540-QA LoCoMo set with
  100% external-judge (qwen3:4b) coverage, 0 errors
- [x] Prompt-opt delta localized to targeted categories as designed
- [x] Python syntax check on all four modified benchmark files
- [x] No /home/jay/... paths remain in benchmarks/ or eval/
@jaylfc jaylfc force-pushed the feat/locomo-prompt-opt branch from d2f1a1e to fd27d2c Compare April 19, 2026 18:06
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 4

♻️ Duplicate comments (3)
docs/specs/2026-04-16-routing-benchmark-design.md (1)

54-69: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the scorecard fenced block.

The opening fence at Line 54 is untyped and triggers MD040; use a typed fence (e.g., text).

Suggested fix
-```
+```text
 Category       vector-only              flat-thorough            router-directed
                F1    Judge  tokens      F1    Judge  tokens      F1    Judge  tokens
 single-hop     0.xx  0.xx   N           0.xx  0.xx   N           0.xx  0.xx   N
@@
   open-domain → crystals       [N/M correct]

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/specs/2026-04-16-routing-benchmark-design.md around lines 54 - 69, The
fenced code block that contains the scorecard currently uses an untyped opening
fence; update the opening fence to include a language tag (for example `text`) so the block starts with ```text and keep the closing fence
unchanged; this addresses the MD040 lint warning for the scorecard fenced block.


</details>

<details>
<summary>benchmarks/locomo_runner.py (1)</summary><blockquote>

`351-382`: _⚠️ Potential issue_ | _🟡 Minor_

**Per-conversation temp directories still are not cleaned up.**

`mkdtemp()` is called for every conversation, but the directory is never removed, and `vmem.close()` is not protected by `finally`. Long benchmark sweeps will leak SQLite files under `/tmp`, especially on ingest or retrieval failures.  
 

<details>
<summary>🐛 Proposed fix</summary>

```diff
+import shutil
...
-            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
-            vmem = VectorMemory(
-                db_path=os.path.join(tmp, "vmem.db"),
-                qmd_url=args.qmd_url,
-                embed_mode=args.embed_mode,
-                onnx_path=args.onnx_path,
-            )
-            await vmem.init(http_client=client)
-            added, ingest_s = await _ingest_conversation(vmem, conv)
-            print(f"[{conv_id}] ingested {added} turns in {ingest_s:.1f}s", flush=True)
+            tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
+            vmem = None
+            try:
+                vmem = VectorMemory(
+                    db_path=os.path.join(tmp, "vmem.db"),
+                    qmd_url=args.qmd_url,
+                    embed_mode=args.embed_mode,
+                    onnx_path=args.onnx_path,
+                )
+                await vmem.init(http_client=client)
+                added, ingest_s = await _ingest_conversation(vmem, conv)
+                print(f"[{conv_id}] ingested {added} turns in {ingest_s:.1f}s", flush=True)
+                ...
+            finally:
+                if vmem is not None:
+                    await vmem.close()
+                shutil.rmtree(tmp, ignore_errors=True)
...
-            await vmem.close()
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 351 - 382, The per-conversation
temporary directory created by tempfile.mkdtemp (used to construct VectorMemory
with db_path) is never removed and vmem.close() is not executed in a finally
block; wrap the lifecycle of VectorMemory and its temp dir in a try/finally so
that you always await vmem.close() (or call sync close if appropriate) and then
remove the tmp directory with shutil.rmtree(tmp, ignore_errors=True) after
closing the DB; keep the existing calls to await vmem.init(...) and
_ingest_conversation(...) inside the try and ensure exception paths (e.g.,
around the asyncio.gather of _guarded tasks) still flow to the finally that
closes vmem and deletes the temp dir.
```

</details>

</blockquote></details>
<details>
<summary>benchmarks/mem0_locomo_runner.py (1)</summary><blockquote>

`409-412`: _⚠️ Potential issue_ | _🟡 Minor_

**Use `get_running_loop()` inside `run()`.**

`run()` is already executing inside an event loop, so `asyncio.get_event_loop()` is the deprecated accessor here. `get_running_loop()` avoids version-dependent behavior.
  

<details>
<summary>🐛 Proposed fix</summary>

```diff
-    loop = asyncio.get_event_loop()
+    loop = asyncio.get_running_loop()
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 409 - 412, The code is calling
asyncio.get_event_loop() (assigning to loop) while already running inside run();
replace that call with asyncio.get_running_loop() to avoid deprecated/ambiguous
behavior—update the assignment where loop is created (near variables
concurrency, sem, progress_lock) so loop = asyncio.get_running_loop() is used
inside the run() function or coroutine.
```

</details>

</blockquote></details>


<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @benchmarks/locomo_runner.py:

  • Around line 236-245: The try/except around _ollama_generate currently converts
    generation exceptions into a synthetic predicted string and still runs
    metric/judge computation, which folds generation failures into scores and
    doesn't increment failed_qa; change the error path so that when _ollama_generate
    raises you (a) record the failure by incrementing failed_qa (the counter used
    for failed generations), (b) do not set a synthetic prediction for scoring (use
    None or a flag), and (c) skip calling _judge and any F1/BLEU metric computation
    for that item (only compute gen_ms as now). Update the flow around
    _ollama_generate, predicted, gen_ms, and the call to _judge so that _judge is
    only invoked when prediction succeeded and failed_qa is incremented on
    exceptions.
  • Around line 138-149: The judge currently masks transport/timeouts by catching
    all exceptions in _judge and returning 0.0; instead, detect and surface
    transport failures so they aren't counted as a "NO". Modify _judge (which calls
    _ollama_generate with JUDGE_PROMPT) to not convert exceptions into a 0.0 score —
    either re-raise the exception or return a sentinel (e.g., None or math.nan) and
    log the error with details — and then update the caller(s) that aggregate scores
    to skip/handle the sentinel (so only real negative verdicts produce 0.0 and
    infra failures are excluded from the average).

In @benchmarks/mem0_locomo_runner.py:

  • Around line 223-247: The mem0 config currently uses args.run_id directly so
    when omitted it becomes "None" (collection_name f"locomo_{args.run_id}" and path
    f"/tmp/mem0_{args.run_id}_chroma") causing persistent reuse across runs; fix by
    normalizing/assigning a unique default run id inside _build_mem0_config (or
    before calling it) when args.run_id is falsy — e.g., set run_id = args.run_id or
    generate a stable unique fallback (timestamp or uuid4) and use that run_id for
    collection_name and path so each run without an explicit --run-id gets an
    isolated backing store.

Duplicate comments:
In @benchmarks/locomo_runner.py:

  • Around line 351-382: The per-conversation temporary directory created by
    tempfile.mkdtemp (used to construct VectorMemory with db_path) is never removed
    and vmem.close() is not executed in a finally block; wrap the lifecycle of
    VectorMemory and its temp dir in a try/finally so that you always await
    vmem.close() (or call sync close if appropriate) and then remove the tmp
    directory with shutil.rmtree(tmp, ignore_errors=True) after closing the DB; keep
    the existing calls to await vmem.init(...) and _ingest_conversation(...) inside
    the try and ensure exception paths (e.g., around the asyncio.gather of _guarded
    tasks) still flow to the finally that closes vmem and deletes the temp dir.

In @benchmarks/mem0_locomo_runner.py:

  • Around line 409-412: The code is calling asyncio.get_event_loop() (assigning
    to loop) while already running inside run(); replace that call with
    asyncio.get_running_loop() to avoid deprecated/ambiguous behavior—update the
    assignment where loop is created (near variables concurrency, sem,
    progress_lock) so loop = asyncio.get_running_loop() is used inside the run()
    function or coroutine.

In @docs/specs/2026-04-16-routing-benchmark-design.md:

  • Around line 54-69: The fenced code block that contains the scorecard currently
    uses an untyped opening fence; update the opening fence to include a language
    tag (for example `text`) so the block starts with ```text and keep the
    closing fence unchanged; this addresses the MD040 lint
    warning for the scorecard fenced block.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `edb764c1-6d97-40e6-9184-8e4def7dbfd7`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between d2f1a1e632e54b06271a0cddd8da9811404ec99b and fd27d2ce574fa3f646e7e64f5ba7fe9e1d8d4de7.

</details>

<details>
<summary>📒 Files selected for processing (10)</summary>

* `benchmarks/locomo_runner.py`
* `benchmarks/longmemeval_runner.py`
* `benchmarks/mem0_locomo_runner.py`
* `docs/specs/2026-04-16-routing-benchmark-design.md`
* `docs/specs/2026-04-16-supersede-benchmark-design.md`
* `docs/specs/2026-04-16-taosmd-next-steps.md`
* `docs/specs/2026-04-24-ama-question-draft.md`
* `docs/specs/2026-04-methodology-post-draft.md`
* `eval/librarian_eval.py`
* `pyproject.toml`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (3)</summary>

* benchmarks/longmemeval_runner.py
* docs/specs/2026-04-methodology-post-draft.md
* docs/specs/2026-04-16-taosmd-next-steps.md

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (2)</summary>

* pyproject.toml
* eval/librarian_eval.py

</details>

</details>


Comment on lines +138 to +149
async def _judge(client: httpx.AsyncClient, url: str, model: str,
                 question: str, reference: str, predicted: str) -> float:
    try:
        reply = await _ollama_generate(
            client, url, model,
            JUDGE_PROMPT.format(question=question, reference=reference, predicted=predicted),
            temperature=0.0,
        )
    except Exception:
        return 0.0
    head = reply.strip().upper().split()
    return 1.0 if head and head[0].startswith("YES") else 0.0
⚠️ Potential issue | 🟠 Major

Judge transport failures are being scored as “NO”.

_judge() returns 0.0 on any exception, so an Ollama timeout/network failure becomes indistinguishable from a real negative grade. That biases the judge average downward and hides infra issues in the report.

🧰 Tools
🪛 Ruff (0.15.10)

[warning] 146-146: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 138 - 149, The judge currently
masks transport/timeouts by catching all exceptions in _judge and returning 0.0;
instead, detect and surface transport failures so they aren't counted as a "NO".
Modify _judge (which calls _ollama_generate with JUDGE_PROMPT) to not convert
exceptions into a 0.0 score — either re-raise the exception or return a sentinel
(e.g., None or math.nan) and log the error with details — and then update the
caller(s) that aggregate scores to skip/handle the sentinel (so only real
negative verdicts produce 0.0 and infra failures are excluded from the average).
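One way the sentinel convention could look (a sketch, not the repo's actual `_judge`/`_summary`: the return type widens to `Optional[float]`, and the aggregation step drops `None` rows):

```python
import logging
from typing import Optional

logger = logging.getLogger("locomo")

def parse_judge_reply(reply: Optional[str]) -> Optional[float]:
    """None = judge unavailable (transport failure); 0.0 = a real NO verdict."""
    if reply is None:  # caller caught a transport error and passed the sentinel
        logger.warning("judge call failed; excluding row from the average")
        return None
    head = reply.strip().upper().split()
    return 1.0 if head and head[0].startswith("YES") else 0.0

def judge_average(scores: list) -> float:
    scored = [s for s in scores if s is not None]  # drop infra failures
    return sum(scored) / len(scored) if scored else 0.0
```

With this split, `judge_average([1.0, 0.0, None, 1.0])` averages over the three scored rows only, so a flaky Ollama endpoint no longer reads as a string of NO verdicts.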

Comment on lines +236 to +245
    try:
        predicted = await _ollama_generate(
            client, ollama_url, model,
            ANSWER_PROMPT.format(context=context, question=question),
        )
    except Exception as exc:
        predicted = f"[generation_error: {exc}]"
    gen_ms = (time.time() - t1) * 1000.0

    judge = await _judge(client, ollama_url, model, question, reference, predicted)
⚠️ Potential issue | 🟠 Major

Generation errors are folded into the benchmark scores.

When answer generation fails, the code stores a synthetic [generation_error: ...] prediction and still computes F1/BLEU/Judge for that row. Those rows count toward total_qa, but failed_qa stays unchanged, so the JSON can report a “complete” run even though some QAs never produced an answer.

Also applies to: 247-260

🧰 Tools
🪛 Ruff (0.15.10)

[warning] 241-241: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/locomo_runner.py` around lines 236 - 245, The try/except around
_ollama_generate currently converts generation exceptions into a synthetic
predicted string and still runs metric/judge computation, which folds generation
failures into scores and doesn't increment failed_qa; change the error path so
that when _ollama_generate raises you (a) record the failure by incrementing
failed_qa (the counter used for failed generations), (b) do not set a synthetic
prediction for scoring (use None or a flag), and (c) skip calling _judge and any
F1/BLEU metric computation for that item (only compute gen_ms as now). Update
the flow around _ollama_generate, predicted, gen_ms, and the call to _judge so
that _judge is only invoked when prediction succeeded and failed_qa is
incremented on exceptions.
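A sketch of that error path. The `generate`/`judge` callables stand in for `_ollama_generate`/`_judge`, and the `stats` dict stands in for the runner's `failed_qa` counter; the actual wiring in the repo may differ.

```python
import asyncio
import time

async def answer_one(generate, judge, question, reference, stats):
    """generate/judge are async callables standing in for _ollama_generate/_judge."""
    t1 = time.time()
    try:
        predicted = await generate(question)
    except Exception as exc:
        stats["failed_qa"] += 1                        # count the failure
        return {"predicted": None, "error": str(exc),  # no synthetic prediction
                "f1": None, "judge": None,             # metrics unavailable, not 0
                "gen_ms": (time.time() - t1) * 1000.0}
    gen_ms = (time.time() - t1) * 1000.0
    score = await judge(question, reference, predicted)  # judge only on success
    return {"predicted": predicted, "judge": score, "gen_ms": gen_ms}

async def _demo():
    async def gen_ok(q): return "blue"
    async def gen_boom(q): raise RuntimeError("ollama down")
    async def judge(q, r, p): return 1.0
    stats = {"failed_qa": 0}
    fail_row = await answer_one(gen_boom, judge, "q", "ref", stats)
    ok_row = await answer_one(gen_ok, judge, "q", "ref", stats)
    return stats, fail_row, ok_row

stats, fail_row, ok_row = asyncio.run(_demo())
```

The failed row carries an `error` field and `None` metrics, so a downstream summary can both report the failure count honestly and exclude the row from F1/BLEU/Judge averages.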

Comment on lines +223 to +247
def _build_mem0_config(args: argparse.Namespace) -> dict:
    return {
        "llm": {
            "provider": "ollama",
            "config": {
                "model": args.model,
                "ollama_base_url": args.ollama_url,
                "temperature": 0.2,
            },
        },
        "embedder": {
            "provider": "ollama",
            "config": {
                "model": "nomic-embed-text",
                "ollama_base_url": args.ollama_url,
            },
        },
        "vector_store": {
            "provider": "chroma",
            "config": {
                "collection_name": f"locomo_{args.run_id}",
                "path": f"/tmp/mem0_{args.run_id}_chroma",
            },
        },
    }
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and read the file to understand the context
find . -name "mem0_locomo_runner.py" -type f

Repository: jaylfc/taosmd

Length of output: 91


🏁 Script executed:

# Search for the file and examine its content
fd "mem0_locomo_runner.py" --type f

Repository: jaylfc/taosmd

Length of output: 89


🏁 Script executed:

# Read the entire file to understand the context
cat -n benchmarks/mem0_locomo_runner.py

Repository: jaylfc/taosmd

Length of output: 23880


The default mem0 backing store is reused across runs without an explicit --run-id.

With --run-id omitted, collection_name defaults to locomo_None and the Chroma vector store persists to /tmp/mem0_None_chroma. Re-running the benchmark reuses the same persisted collection instead of starting clean, allowing old memories to be retrieved again and making runs irreproducible. Always pass --run-id to isolate benchmark runs.

🧰 Tools
🪛 Ruff (0.15.10)

[error] 244-244: Probable insecure usage of temporary file or directory: "/tmp/mem0_"

(S108)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mem0_locomo_runner.py` around lines 223 - 247, The mem0 config
currently uses args.run_id directly so when omitted it becomes "None"
(collection_name f"locomo_{args.run_id}" and path
f"/tmp/mem0_{args.run_id}_chroma") causing persistent reuse across runs; fix by
normalizing/assigning a unique default run id inside _build_mem0_config (or
before calling it) when args.run_id is falsy — e.g., set run_id = args.run_id or
generate a stable unique fallback (timestamp or uuid4) and use that run_id for
collection_name and path so each run without an explicit --run-id gets an
isolated backing store.
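The normalization the prompt describes could be as small as this (the uuid fallback is one of the two options it names; the helper name is illustrative):

```python
import uuid

def normalize_run_id(run_id):
    """Return the explicit --run-id, or a unique fallback so each run
    without one gets an isolated Chroma collection and path."""
    return run_id or f"auto-{uuid.uuid4().hex[:12]}"

rid_explicit = normalize_run_id("exp7")  # explicit id passes through
rid_a = normalize_run_id(None)           # omitted id gets a fresh fallback
rid_b = normalize_run_id(None)
collection = f"locomo_{rid_a}"           # never "locomo_None"
```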

Comment on lines +321 to +325
        # evidence_hits/evidence_total not computable without dia_id metadata from
        # mem0's stored facts (mem0 doesn't preserve per-turn dia_ids). Set to 0.
        # TODO: if mem0 exposes metadata round-trip, wire dia_id here.
        "evidence_hits": 0,
        "evidence_total": len(evidence),
⚠️ Potential issue | 🟠 Major

R@K is guaranteed to report 0.0 here.

These rows keep evidence_total=len(evidence) while hardcoding evidence_hits=0 because mem0 does not round-trip dia_id. _summary() treats that as a real miss, so every mem0 run publishes retrieval_recall=0.0 even when retrieval succeeded. Mark the metric unavailable instead of zeroing it out, or plumb evidence IDs through mem0 first.

🐛 Proposed fix
-            "evidence_hits": 0,
-            "evidence_total": len(evidence),
+            "evidence_hits": 0,
+            "evidence_total": 0,  # exclude unavailable metric from retrieval_recall
+            "retrieval_recall_unavailable": True,
             "error": str(exc),
         }
...
-        "evidence_hits": 0,
-        "evidence_total": len(evidence),
+        "evidence_hits": 0,
+        "evidence_total": 0,
+        "retrieval_recall_unavailable": True,
     }

Also applies to: 371-375
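How the "mark the metric unavailable" option could play out in the summary step (a sketch; the repo's `_summary` and its row schema may differ):

```python
def retrieval_recall(rows):
    """Average R@K over rows where the metric is computable.

    Rows flagged retrieval_recall_unavailable (or with a 0/None
    evidence_total) are excluded rather than counted as misses; returns
    None when nothing was scorable, so callers can print "n/a" instead
    of a fake 0.0.
    """
    scored = [r for r in rows
              if not r.get("retrieval_recall_unavailable")
              and r.get("evidence_total")]
    if not scored:
        return None
    return sum(r["evidence_hits"] / r["evidence_total"] for r in scored) / len(scored)

rows = [
    {"evidence_hits": 2, "evidence_total": 4},      # normal taosmd-style row
    {"evidence_hits": 0, "evidence_total": 0,       # mem0 row: metric unavailable
     "retrieval_recall_unavailable": True},
]
recall = retrieval_recall(rows)
```

Here the mem0 row is simply out of the denominator, so a mem0 run reports "recall n/a" rather than a guaranteed 0.0.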

jaylfc added a commit that referenced this pull request Apr 19, 2026
CodeRabbit flagged four benchmark-integrity issues on #30. All real —
they let infra failures and adapter artefacts masquerade as real
negative results, biasing the published numbers and hiding problems.

1. _judge() returned 0.0 on Ollama timeout / network error — identical
   to a genuine "NO" grade. Now returns None; _summary filters None
   out of the Judge average so transport flakiness doesn't depress
   the score. The row still records 0.0 vs None distinctly.

2. Generation errors stored a synthetic "[generation_error: ...]"
   prediction and then computed F1/BLEU/Judge on it. That folded
   infra failures into the benchmark averages, and failed_qa stayed
   zero so the run reported "complete" despite missing answers.
   Now: on gen failure, predicted=""; f1/bleu/judge=None; row carries
   an `error` field; _guarded increments failed_qa for those rows;
   _summary excludes None metrics.

3. mem0_locomo_runner.py hardcoded evidence_hits=0 in every row
   because mem0 2.x doesn't round-trip per-turn dia_ids. _summary
   then published retrieval_recall=0.0 for every mem0 run — a fake
   miss. Now sets evidence_hits and evidence_total to None (metric
   unavailable, not zero), and _summary skips None rows from recall.

4. mem0 runner inherited both the 0.0-on-failure judge and the error-
   folding pattern. Both paths fixed to the same None-on-failure
   convention.

_summary now also emits judge_scored + recall_scored alongside count
so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487
scored of 1540 total"), making infra flakiness inspectable rather
than invisible.

No schema-breaking changes: existing .rescored.json outputs that have
evidence_hits=0 or judge=0.0 remain readable — the new _summary treats
them as real zeros, which is how they were when written. Only forward
runs produce None for "metric unavailable".
Third memory architecture in the comparison harness. Same generator
(gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same
1540 QAs as the taosmd and mem0 runners. Only the retrieval layer
changes — routes search through mempalace.searcher.search_memories().

MemPalace has never published end-to-end Judge on LoCoMo — their own
benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5
per their 2026-03 results). Our run adds the novel measurement so the
three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts
and generation errors return None (not 0.0) so _summary excludes them
from averages. evidence_hits/evidence_total reported as None since
MemPalace doesn't round-trip LoCoMo dia_id.

Requires: pip install mempalace