feat(bench): LoCoMo runner + prompt-opt + mem0 adapter (validated) #30
Conversation
⚠️ Warning: Rate limit exceeded

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 12 minutes and 45 seconds. ⌛

After the wait time has elapsed, a review can be triggered using the … We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.

ℹ️ Review info — Configuration used: defaults · Review profile: CHILL · Plan: Pro Plus · Run ID:

📒 Files selected for processing (2)
📝 Walkthrough

This PR introduces two new LoCoMo benchmark runners (taosmd and mem0 variants), updates the existing LongMemEval runner to use taosmd modules with environment-configurable LLM endpoints, modifies the Librarian evaluation token-budget logic from multiplier-based to absolute tokens-per-query, and adds multiple specification documents for upcoming benchmarking initiatives and roadmap planning.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Runner
    participant Dataset
    participant VectorMemory
    participant Ollama
    participant Judge
    User->>Runner: run(dataset, model, concurrency, top_k)
    Runner->>Dataset: load LoCoMo JSON
    Dataset-->>Runner: conversations & QA items
    loop For each conversation
        Runner->>VectorMemory: create temp instance
        Runner->>Runner: ingest session turns
        VectorMemory-->>Runner: indexed (dialog_id, speaker, datetime)
        loop For each QA (concurrent, semaphore-limited)
            Runner->>VectorMemory: search(query, top_k)
            VectorMemory-->>Runner: ranked hits
            Runner->>Runner: build context (hits + datetime)
            Runner->>Ollama: /api/generate(context + question)
            Ollama-->>Runner: predicted_answer
            Runner->>Judge: /api/generate(judge_prompt)
            Judge-->>Runner: YES/NO score (0.0 or 1.0)
            Runner->>Runner: compute F1, BLEU-1 vs reference
            Runner->>Runner: store metrics & latencies
        end
    end
    Runner->>Runner: aggregate by category
    Runner->>Runner: compute retrieval_recall
    Runner-->>User: JSON results + formatted summary table
```
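The "compute F1, BLEU-1" step above is plain token-overlap scoring. As a hedged sketch of what such scoring typically looks like (the runner's real helpers are named `_f1` and `_bleu1`; this tokenization and the omission of BLEU's brevity penalty are simplifying assumptions, not the repo's exact code):

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase word-character tokenization (assumed; the runner's _tokenize may differ).
    return re.findall(r"\w+", text.lower())

def f1(pred: str, ref: str) -> float:
    """Unigram F1: harmonic mean of token precision and recall."""
    p, r = Counter(tokenize(pred)), Counter(tokenize(ref))
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def bleu1(pred: str, ref: str) -> float:
    """BLEU-1 as clipped unigram precision (brevity penalty omitted here)."""
    p, r = Counter(tokenize(pred)), Counter(tokenize(ref))
    if not p:
        return 0.0
    return sum((p & r).values()) / sum(p.values())
```

An exact-match prediction scores 1.0 on both metrics; the YES/NO Judge score in the diagram is produced separately by the LLM judge.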
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

The two new benchmark runners introduce substantial, multi-component async workflows (VectorMemory/mem0 ingestion, Ollama integration, concurrent QA processing, metric computation) across 1,000+ lines. The logic is moderately dense with helper functions, error handling, and aggregation logic. The variety of changes (two new benchmarks, existing benchmark refactor, eval logic modification, dependency updates, and documentation) adds heterogeneity. The evaluation logic change requires careful reasoning about the token-budget semantics. Documentation specs are straightforward but numerous.

Possibly related PRs
🚥 Pre-merge checks: ✅ 2 passed | ❌ 3 failed (3 warnings)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Code Review Summary

Status: No Issues Found | Recommendation: Merge

Issue Details: CRITICAL — none. Other Observations (not in diff): issues found in unchanged code that cannot receive inline comments. Files Reviewed (2 files)

Reviewed by seed-2-0-pro-260328 · 262,431 tokens
Actionable comments posted: 8
🧹 Nitpick comments (8)
.gitignore (1)
13-13: Keep a versioned `CLAUDE.md` template if local `CLAUDE.md` is ignored.

Ignoring `CLAUDE.md` is fine for local machine-specific config, but it can hide required shared agent/memory instructions from the repo. Consider committing `CLAUDE.md.example` (or equivalent) and documenting copy/setup steps.

Suggested `.gitignore` adjustment

```diff
 CLAUDE.md
+!CLAUDE.md.example
```

Based on learnings: "Applies to {AGENTS,CLAUDE}.md: Do not overwrite existing `AGENTS.md` or `CLAUDE.md` files without first reading them and checking for existing memory system integrations (Mem0, Zep, MemGPT/Letta, OpenClaw memory, LangChain modules)".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `.gitignore` at line 13, make sure a versioned template for CLAUDE.md is committed and .gitignore only excludes the local file: add a committed CLAUDE.md.example (or CLAUDE.md.template) to the repo containing the shared agent/memory instructions, update .gitignore to ignore CLAUDE.md but not CLAUDE.md.example, and add a brief note in the README or a new CONTRIBUTING section describing the copy/setup steps (copy CLAUDE.md.example -> CLAUDE.md and merge any existing local CLAUDE.md to avoid overwriting memory integrations).

benchmarks/locomo_runner.py (2)
387-387: `datetime.utcnow()` is deprecated in Python 3.12+.

Consider using `datetime.now(timezone.utc)` for forward compatibility.

♻️ Proposed fix

```diff
+from datetime import datetime, timezone
 ...
-    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
+    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/locomo_runner.py` at line 387, replace the deprecated datetime.utcnow() usage when building the timestamp: change the assignment timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S") to use timezone-aware time, e.g. timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"), and add/update the import to include timezone (from datetime import datetime, timezone) so the timestamp is produced in a forward-compatible, UTC-aware way.
198-200: Clarify the double-metadata unwrapping logic.

The expression `h.get("metadata", {}).get("metadata", h.get("metadata", {}))` suggests the retrieve API sometimes returns metadata nested inside another metadata key. This defensive handling works, but may indicate an API inconsistency worth investigating. Consider adding a comment explaining when the double-nesting occurs, or normalizing the API response in `retrieve()`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/locomo_runner.py` around lines 198-200, the list comprehension unpacks nested metadata defensively; update the code to normalize the API response in the retrieve() function so hits always have a flat dict at h["metadata"] (or explicitly document when nested metadata occurs) and then simplify the comprehension to use h.get("metadata", {}) directly; specifically, modify retrieve() to detect and unwrap cases where the response returns {"metadata": {"metadata": ...}} into {"metadata": ...}, and add a short inline comment near the return (or in retrieve()) explaining the normalization so the expression h.get("metadata", {}) is sufficient and the double-metadata defensive logic is no longer needed.

benchmarks/locomo_rescore.py (1)
257-268: Closure captures `client` before it's defined.

The nested `_judge_one` function references `client` at line 260, but `client` is only assigned at line 267. This works due to Python's late binding of closures (the name is looked up when `_judge_one` executes, not when it's defined), but it's fragile and potentially confusing. Consider passing `client` as a parameter to make the dependency explicit:

♻️ Proposed fix

```diff
-    async def _judge_one(idx: int, rec: dict):
+    async def _judge_one(idx: int, rec: dict, client: httpx.AsyncClient):
         nonlocal unparseable
         async with sem:
             val = await _call_ollama_judge(client, ollama_url, model, rec)
 ...
     limits = httpx.Limits(max_connections=concurrency + 2, max_keepalive_connections=concurrency)
     async with httpx.AsyncClient(limits=limits) as client:
-        tasks = [asyncio.create_task(_judge_one(i, r)) for i, r in enumerate(records)]
+        tasks = [asyncio.create_task(_judge_one(i, r, client)) for i, r in enumerate(records)]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/locomo_rescore.py` around lines 257-268, the nested function _judge_one currently captures client from the outer scope after it is created, which is fragile; make the dependency explicit by adding a client parameter to _judge_one (e.g., async def _judge_one(idx: int, rec: dict, client):) and update its internal call to await _call_ollama_judge(client, ollama_url, model, rec). Then, when creating tasks inside the async with httpx.AsyncClient(...) as client block, pass the client into each task (e.g., asyncio.create_task(_judge_one(i, r, client))) while leaving uses of sem, results, and records unchanged.

benchmarks/mem0_locomo_runner.py (3)
467-467: `datetime.utcnow()` is deprecated in Python 3.12+.

Consider using `datetime.now(timezone.utc)` for forward compatibility.

♻️ Proposed fix

```diff
+from datetime import datetime, timezone
 ...
-    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
+    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/mem0_locomo_runner.py` at line 467, replace the deprecated datetime.utcnow() call used when constructing timestamp (the line assigning timestamp) with a timezone-aware call: use datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") and ensure timezone is imported (from datetime import datetime, timezone) so the timestamp is forward-compatible with Python 3.12+.
27-65: Consider extracting shared constants to a common module.

The prompts and utilities are intentionally duplicated for self-containment, but this creates a maintenance burden. If the prompts change in `locomo_runner.py`, they must be manually updated here. Consider extracting `ANSWER_PROMPT`, `JUDGE_PROMPT`, `CATEGORY_NAMES`, and shared utilities (`_tokenize`, `_f1`, `_bleu1`, `_judge`) to a `benchmarks/_common.py` module.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/mem0_locomo_runner.py` around lines 27-65, the duplicated prompts and utilities (ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES and helper functions like _tokenize, _f1, _bleu1, _judge) should be extracted into a single shared module (e.g., _common) and imported from both runners so they stay in sync; create the new module containing those constants and utility implementations, update mem0_locomo_runner.py and locomo_runner.py to import ANSWER_PROMPT, JUDGE_PROMPT, CATEGORY_NAMES and the utility functions instead of redefining them, and ensure any references to those symbols remain unchanged so behavior is preserved.
240-246: Consider using `tempfile.mkdtemp` for safer temporary directory creation.

The hardcoded `/tmp/mem0_{run_id}_chroma` path is predictable. While acceptable for benchmark tooling, using `tempfile.mkdtemp` (as done in `locomo_runner.py` line 351) would be more robust against symlink attacks on shared systems.

♻️ Proposed fix

```diff
+import tempfile
+
 def _build_mem0_config(args: argparse.Namespace) -> dict:
+    tmp_dir = tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_")
     return {
         ...
         "vector_store": {
             "provider": "chroma",
             "config": {
                 "collection_name": f"locomo_{args.run_id}",
-                "path": f"/tmp/mem0_{args.run_id}_chroma",
+                "path": tmp_dir,
             },
         },
     }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/mem0_locomo_runner.py` around lines 240-246, the vector_store config currently uses a predictable hardcoded path f"/tmp/mem0_{args.run_id}_chroma"; replace this with a securely-created temporary directory using tempfile.mkdtemp to avoid predictable symlink attacks: import tempfile, call tempfile.mkdtemp(prefix=f"mem0_{args.run_id}_chroma_") to get a tempdir, and assign that tempdir to the "path" value while keeping "collection_name" as f"locomo_{args.run_id}"; update any teardown/cleanup logic if needed to remove the tempdir after the benchmark.

docs/specs/2026-04-16-taosmd-next-steps.md (1)
33-33: Task may already be completed.

The relevant code snippet from `eval/librarian_eval.py:630-642` shows that the token cost metric already uses `tokens_per_query <= TOKEN_BUDGET` (absolute threshold) rather than an `infx` ratio. Consider updating this task to reflect its completion status or verifying whether additional work remains.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `docs/specs/2026-04-16-taosmd-next-steps.md` at line 33, the spec task refers to changing the token cost metric but the code in eval/librarian_eval.py already enforces the absolute threshold via the tokens_per_query <= TOKEN_BUDGET check; update the docs/specs/2026-04-16-taosmd-next-steps.md entry 5 to mark this task as completed (or note any remaining subtasks), and if you want extra verification, open eval/librarian_eval.py and confirm the evaluation uses tokens_per_query and TOKEN_BUDGET rather than an infx ratio in the token-cost calculation logic.
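For context on the token-budget semantics discussed here, a minimal sketch of an absolute tokens-per-query check; the `TOKEN_BUDGET` value, the axis dict shapes, and the `n_sessions`/`n` compat lookup are assumptions drawn from the review text, not the real `eval/librarian_eval.py`:

```python
TOKEN_BUDGET = 2000  # assumed absolute tokens-per-query threshold

def tokens_per_query(total_tokens: int, axis_counts: list[dict]) -> float:
    """Average prompt tokens per query across eval axes.

    Each axis dict may report its query count as "n" or "n_sessions"
    (the review notes axis_c emits "n_sessions"), so read both keys.
    """
    n_queries = max(1, sum(a.get("n_sessions", a.get("n", 0)) for a in axis_counts))
    return total_tokens / n_queries

def within_budget(total_tokens: int, axis_counts: list[dict]) -> bool:
    # Absolute threshold, as opposed to the old multiplier-based check.
    return tokens_per_query(total_tokens, axis_counts) <= TOKEN_BUDGET
```

Reading only `"n"` would silently skip axis_c's sessions, shrinking the denominator and inflating the reported tokens-per-query, which is exactly the undercounting bug the review flags.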
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/locomo_runner.py`:
- Around line 421-432: The defaults for the argparse options are using hardcoded
user-specific paths; update the p.add_argument calls for "--dataset" and
"--onnx-path" to derive sensible defaults from environment variables or relative
locations instead of "/home/jay/...". For example, read env vars like
DATASET_PATH and ONNX_PATH (os.environ.get("DATASET_PATH")) or build a
project-relative default via pathlib.Path(__file__).resolve().parents[...] and
fall back to None or a clearly documented relative path; update the "--dataset"
and "--onnx-path" argument default values accordingly and update help text to
mention the env vars/relative default and that users can override via CLI.
- Around line 351-382: The tmp directory created with tempfile.mkdtemp(...) is
never removed; change the code to ensure cleanup by using a
tempfile.TemporaryDirectory() context manager around the lifecycle of tmp and
the VectorMemory instance (replace tempfile.mkdtemp(...) with
TemporaryDirectory() as tmp_dir and pass os.path.join(tmp_dir, "vmem.db")), or
if you must keep mkdtemp, call shutil.rmtree(tmp) after await vmem.close();
update references to tmp (used when constructing VectorMemory and when closing
vmem) so the directory is always removed even if exceptions occur (use
try/finally or context manager around vmem.init()/await
_ingest_conversation/.../await vmem.close()).
In `@benchmarks/longmemeval_runner.py`:
- Around line 152-156: The default onnx_path is currently hardcoded to a
machine-specific path; update the VectorMemory construction to avoid a hardcoded
default by using a neutral fallback (e.g., None or a package/resource-relative
path) and rely on the TAOSMD_ONNX_PATH env var or a config to provide the real
model location; modify the call that creates VectorMemory (parameters: db_path,
embed_mode, onnx_path) so onnx_path uses os.environ.get("TAOSMD_ONNX_PATH") with
no hardcoded user path and add handling in VectorMemory (or the caller) to raise
a clear error or load a bundled model when onnx_path is None.
In `@benchmarks/mem0_locomo_runner.py`:
- Line 412: Replace the deprecated call to asyncio.get_event_loop() inside the
async function run() with asyncio.get_running_loop(); locate the assignment
"loop = asyncio.get_event_loop()" and change it to use
asyncio.get_running_loop() so the coroutine retrieves the currently running
event loop instead of the deprecated global accessor.
- Line 511: The default dataset argument in p.add_argument("--dataset", ...)
uses a hardcoded user-specific path; change it to use a portable default or an
environment variable: remove the absolute /home/jay/... path and either set a
relative default like "data/locomo10.json" or set
default=os.environ.get("LOCOMO_DATASET", "data/locomo10.json") and import os;
alternatively make the argument required (no default) so callers must supply a
path. Update the p.add_argument("--dataset", ...) call accordingly and ensure
any README or run scripts reference the chosen default or env var.
In `@docs/specs/2026-04-16-routing-benchmark-design.md`:
- Around line 54-69: The fenced code block showing the scorecard currently uses
an untyped fence (```) which triggers MD040; change the opening fence to a typed
fence by replacing the leading ``` with ```text so the block becomes ```text ...
``` and leave the closing backticks unchanged; this affects the fenced block
that starts with the line "Category vector-only ..." and ends with the
final "open-domain → crystals [N/M correct]" closing fence.
In `@eval/librarian_eval.py`:
- Around line 628-633: The n_queries calculation in eval/librarian_eval.py uses
a non-existent key "n" for axis_c (lib.axis_c.get("n", 0)) while eval_axis_c
emits "n_sessions", causing undercounting; update the expression that computes
n_queries (the max(...) block) to read lib.axis_c.get("n_sessions", 0) (or sum n
and n_sessions as appropriate) so axis_c's sessions are included consistently
with eval_axis_c and tokens/query remains correct.
In `@README.md`:
- Around line 7-9: The README currently uses "97.0%" ambiguously; the top claims
"97.0% end-to-end Judge accuracy on LongMemEval-S" while later sections refer to
"97.0%" as Recall@5 — fix by unifying the framing: decide whether 97.0% refers
to end-to-end Judge accuracy or Recall@5, then update every occurrence of the
phrase "97.0%" and its surrounding label (search for the strings "97.0%",
"LongMemEval-S", and "Recall@5") so each mention explicitly states the metric
(e.g., "97.0% end-to-end Judge accuracy" or "97.0% Recall@5"); alternatively,
present both numbers distinctly (e.g., "97.0% end-to-end Judge accuracy;
Recall@5 = X") and update the sentences around those symbols to remove
ambiguity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 607d827f-0e5a-4d00-8eb3-51299f9cdc95
📒 Files selected for processing (13)
.gitignore, README.md, benchmarks/locomo_rescore.py, benchmarks/locomo_runner.py, benchmarks/longmemeval_runner.py, benchmarks/mem0_locomo_runner.py, docs/specs/2026-04-16-routing-benchmark-design.md, docs/specs/2026-04-16-supersede-benchmark-design.md, docs/specs/2026-04-16-taosmd-next-steps.md, docs/specs/2026-04-24-ama-question-draft.md, docs/specs/2026-04-methodology-post-draft.md, eval/librarian_eval.py, pyproject.toml
```python
tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
vmem = VectorMemory(
    db_path=os.path.join(tmp, "vmem.db"),
    qmd_url=args.qmd_url,
    embed_mode=args.embed_mode,
    onnx_path=args.onnx_path,
)
await vmem.init(http_client=client)
added, ingest_s = await _ingest_conversation(vmem, conv)
print(f"[{conv_id}] ingested {added} turns in {ingest_s:.1f}s", flush=True)

# Pick eligible QAs up-front. --per-conv-limit caps QAs per
# conversation (for balanced sampling); --limit is the global cap.
eligible: list[dict] = []
for qa in conv.get("qa", []) or []:
    cat = int(qa.get("category", 0))
    if cat not in include_cats:
        continue
    if args.per_conv_limit and len(eligible) >= args.per_conv_limit:
        break
    if args.limit and (total_seen + len(eligible)) >= args.limit:
        break
    eligible.append(qa)

if eligible:
    tasks = [_guarded(qa, conv_id, vmem, client) for qa in eligible]
    gathered = await asyncio.gather(*tasks, return_exceptions=True)
    for outcome in gathered:
        if isinstance(outcome, Exception) or outcome is None:
            continue
        results.append(outcome)
await vmem.close()
```
Temporary directories are not cleaned up after use.
tempfile.mkdtemp() creates directories that persist after the script exits. For benchmark runs processing multiple conversations, this could accumulate significant disk usage in /tmp.
🐛 Proposed fix using context manager

```diff
+import shutil
 ...
-    tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
-    vmem = VectorMemory(
-        db_path=os.path.join(tmp, "vmem.db"),
-        ...
-    )
-    ...
-    await vmem.close()
+    tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
+    try:
+        vmem = VectorMemory(
+            db_path=os.path.join(tmp, "vmem.db"),
+            ...
+        )
+        ...
+        await vmem.close()
+    finally:
+        shutil.rmtree(tmp, ignore_errors=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/locomo_runner.py` around lines 351 - 382, The tmp directory
created with tempfile.mkdtemp(...) is never removed; change the code to ensure
cleanup by using a tempfile.TemporaryDirectory() context manager around the
lifecycle of tmp and the VectorMemory instance (replace tempfile.mkdtemp(...)
with TemporaryDirectory() as tmp_dir and pass os.path.join(tmp_dir, "vmem.db")),
or if you must keep mkdtemp, call shutil.rmtree(tmp) after await vmem.close();
update references to tmp (used when constructing VectorMemory and when closing
vmem) so the directory is always removed even if exceptions occur (use
try/finally or context manager around vmem.init()/await
_ingest_conversation/.../await vmem.close()).
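The `tempfile.TemporaryDirectory()` variant mentioned in the prompt can be sketched as follows; `run_one_conversation` and the stand-in file write are illustrative only (the real code would construct a `VectorMemory`, ingest, and close it inside the `with` block):

```python
import asyncio
import os
import tempfile

async def run_one_conversation(conv_id: str) -> str:
    # TemporaryDirectory removes the directory (and the SQLite file inside it)
    # even if ingestion or retrieval raises inside the block.
    with tempfile.TemporaryDirectory(prefix=f"locomo_{conv_id}_") as tmp_dir:
        db_path = os.path.join(tmp_dir, "vmem.db")
        # ... real code: vmem = VectorMemory(db_path=db_path); await vmem.init(...);
        # ingest turns, answer QAs, then await vmem.close() ...
        open(db_path, "w").close()  # stand-in for the database file
        assert os.path.exists(db_path)
        return tmp_dir  # returned only so callers can verify cleanup

tmp = asyncio.run(run_one_conversation("demo"))
assert not os.path.exists(tmp)  # directory is gone once the block exits
```

Compared with `mkdtemp` plus `try/finally`, the context manager keeps the cleanup adjacent to the allocation and cannot be skipped by an early `return`.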
```python
concurrency = max(1, int(args.concurrency))
sem = asyncio.Semaphore(concurrency)
progress_lock = asyncio.Lock()
loop = asyncio.get_event_loop()
```
Use asyncio.get_running_loop() instead of asyncio.get_event_loop().
asyncio.get_event_loop() is deprecated in Python 3.10+ when called from a coroutine. Since run() is an async function, use asyncio.get_running_loop() instead.
🐛 Proposed fix

```diff
-    loop = asyncio.get_event_loop()
+    loop = asyncio.get_running_loop()
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
loop = asyncio.get_running_loop()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/mem0_locomo_runner.py` at line 412, Replace the deprecated call to
asyncio.get_event_loop() inside the async function run() with
asyncio.get_running_loop(); locate the assignment "loop =
asyncio.get_event_loop()" and change it to use asyncio.get_running_loop() so the
coroutine retrieves the currently running event loop instead of the deprecated
global accessor.
```
Category        vector-only            flat-thorough          router-directed
                F1    Judge  tokens    F1    Judge  tokens    F1    Judge  tokens
single-hop      0.xx  0.xx   N         0.xx  0.xx   N         0.xx  0.xx   N
temporal        0.xx  0.xx   N         0.xx  0.xx   N         0.xx  0.xx   N
multi-hop       0.xx  0.xx   N         0.xx  0.xx   N         0.xx  0.xx   N
open-domain     0.xx  0.xx   N         0.xx  0.xx   N         0.xx  0.xx   N
------
overall         0.xx  0.xx   N         0.xx  0.xx   N         0.xx  0.xx   N

Router accuracy (vs expected layer for each category):
  single-hop  → vector      [N/M correct]
  temporal    → catalog+kg  [N/M correct]
  multi-hop   → kg          [N/M correct]
  open-domain → crystals    [N/M correct]
```
Add a language tag to the fenced block to satisfy markdown lint.
The scorecard block should use a typed fence (e.g., `text`) to avoid MD040 failures.
Suggested fix

````diff
-```
+```text
 Category        vector-only            flat-thorough          router-directed
                 F1    Judge  tokens    F1    Judge  tokens    F1    Judge  tokens
 single-hop      0.xx  0.xx   N         0.xx  0.xx   N         0.xx  0.xx   N
@@
 open-domain → crystals [N/M correct]
 ```
````

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 54-54: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specs/2026-04-16-routing-benchmark-design.md` around lines 54 - 69, The
fenced code block showing the scorecard currently uses an untyped fence (```)
which triggers MD040; change the opening fence to a typed fence by replacing the
leading ``` with ```text so the block becomes ```text ... ``` and leave the
closing backticks unchanged; this affects the fenced block that starts with the
line "Category vector-only ..." and ends with the final "open-domain →
crystals [N/M correct]" closing fence.
Comprehensive LoCoMo benchmark bundle rebased onto current master. Supersedes the closed #25.

## What lands

**Benchmark infrastructure** (~2000 LOC):

- benchmarks/locomo_runner.py — full LoCoMo runner (ONNX embeds + Ollama answer/judge, F1/BLEU/Judge/R@K per category, flags for `--concurrency`, `--per-conv-limit`, `--timeout`, `--model`)
- benchmarks/longmemeval_runner.py — ported off dead tinyagentos imports
- benchmarks/mem0_locomo_runner.py — apples-to-apples mem0 adapter routing retrieval through `mem0.search()` with the same generator/prompt/judge as the taosmd runner
- pyproject.toml — adds mem0ai + chroma optional deps for the adapter

**Prompt tweaks** (validated +0.03 Overall Judge under external qwen3:4b):

- Absolute-date instruction + softened IDK in ANSWER_PROMPT
- Per-category: Temporal 0.36→0.41 (+0.05), Multi-hop 0.21→0.24 (+0.03)
- Targeted improvements landed on targeted categories; Single-hop + Open-dom unchanged (clean intervention signal)

**Methodology docs** (4 specs under docs/specs/)

## Review-feedback fixes applied

- Kilo CRITICAL: all hardcoded /home/jay/... paths in locomo_runner.py (`--dataset`, `--onnx-path`), longmemeval_runner.py, mem0_locomo_runner.py are now repo-relative defaults derived from `__file__`, with LOCOMO_DATASET + TAOSMD_ONNX_PATH env-var overrides
- CodeRabbit major: eval/librarian_eval.py was reading `axis_c.get("n")` but eval_axis_c emits `"n_sessions"` — fixed with a `get("n_sessions", get("n", 0))` compat shim, so the undercounted n_queries no longer inflates tokens_per_query

## Test plan

- [x] All three models benchmarked against the 1540-QA LoCoMo set with 100% external-judge (qwen3:4b) coverage, 0 errors
- [x] Prompt-opt delta localized to targeted categories as designed
- [x] Python syntax check on all four modified benchmark files
- [x] No /home/jay/... paths remain in benchmarks/ or eval/
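The repo-relative-defaults-with-env-override pattern described above can be sketched like this; the flag and env-var names come from the PR text, while the `data/locomo10.json` layout and the `parents[1]` depth are assumptions about the repo structure:

```python
import argparse
import os
from pathlib import Path

# Assumed layout: this file lives in benchmarks/, so parents[1] is the repo root.
REPO_ROOT = Path(__file__).resolve().parents[1]

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser()
    p.add_argument(
        "--dataset",
        default=os.environ.get("LOCOMO_DATASET", str(REPO_ROOT / "data" / "locomo10.json")),
        help="LoCoMo JSON dataset (override via LOCOMO_DATASET or this flag)",
    )
    p.add_argument(
        "--onnx-path",
        default=os.environ.get("TAOSMD_ONNX_PATH"),  # None -> caller must resolve or error
        help="Embedding ONNX model (override via TAOSMD_ONNX_PATH or this flag)",
    )
    return p
```

Defaults are computed when the parser is built, so CLI flags beat env vars, which beat the repo-relative fallback, and no user-specific absolute path ever lands in the source.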
Force-pushed the branch from d2f1a1e to fd27d2c
Actionable comments posted: 4

♻️ Duplicate comments (3)

docs/specs/2026-04-16-routing-benchmark-design.md (1)

54-69: ⚠️ Potential issue | 🟡 Minor: Add a language tag to the scorecard fenced block.

The opening fence at line 54 is untyped and triggers MD040; use a typed fence (e.g., `text`).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `docs/specs/2026-04-16-routing-benchmark-design.md` around lines 54-69, the fenced code block that contains the scorecard currently uses an untyped opening fence; update the opening fence to include a language tag (for example, change the untyped fence to a `text`-typed fence) and keep the closing fence unchanged; this addresses the MD040 lint warning for the scorecard fenced block.

benchmarks/locomo_runner.py (1)

351-382: ⚠️ Potential issue | 🟡 Minor: Per-conversation temp directories still are not cleaned up.

`mkdtemp()` is called for every conversation, but the directory is never removed, and `vmem.close()` is not protected by `finally`. Long benchmark sweeps will leak SQLite files under `/tmp`, especially on ingest or retrieval failures.

🐛 Proposed fix

```diff
+import shutil
 ...
-    tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
-    vmem = VectorMemory(
-        db_path=os.path.join(tmp, "vmem.db"),
-        ...
-    )
-    ...
-    await vmem.close()
+    tmp = tempfile.mkdtemp(prefix=f"locomo_{conv_id}_")
+    vmem = None
+    try:
+        vmem = VectorMemory(
+            db_path=os.path.join(tmp, "vmem.db"),
+            ...
+        )
+        ...
+    finally:
+        if vmem is not None:
+            await vmem.close()
+        shutil.rmtree(tmp, ignore_errors=True)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/locomo_runner.py` around lines 351-382, the per-conversation temporary directory created by tempfile.mkdtemp (used to construct VectorMemory with db_path) is never removed and vmem.close() is not executed in a finally block; wrap the lifecycle of VectorMemory and its temp dir in a try/finally so that you always await vmem.close() and then remove the tmp directory with shutil.rmtree(tmp, ignore_errors=True) after closing the DB; keep the existing calls to await vmem.init(...) and _ingest_conversation(...) inside the try and ensure exception paths (e.g., around the asyncio.gather of _guarded tasks) still flow to the finally that closes vmem and deletes the temp dir.

benchmarks/mem0_locomo_runner.py (1)

409-412: ⚠️ Potential issue | 🟡 Minor: Use `get_running_loop()` inside `run()`.

`run()` is already executing inside an event loop, so `asyncio.get_event_loop()` is the deprecated accessor here. `get_running_loop()` avoids version-dependent behavior.

🐛 Proposed fix

```diff
-    loop = asyncio.get_event_loop()
+    loop = asyncio.get_running_loop()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `benchmarks/mem0_locomo_runner.py` around lines 409-412, the code is calling asyncio.get_event_loop() (assigning to loop) while already running inside run(); replace that call with asyncio.get_running_loop() so the assignment (near the concurrency, sem, and progress_lock variables) retrieves the currently running event loop instead of the deprecated global accessor.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:

In `@benchmarks/locomo_runner.py`:

- Around lines 236-245: The try/except around `_ollama_generate` currently converts generation exceptions into a synthetic predicted string and still runs metric/judge computation, which folds generation failures into scores and doesn't increment `failed_qa`; change the error path so that when `_ollama_generate` raises you (a) record the failure by incrementing `failed_qa` (the counter used for failed generations), (b) do not set a synthetic prediction for scoring (use None or a flag), and (c) skip calling `_judge` and any F1/BLEU metric computation for that item (only compute `gen_ms` as now). Update the flow around `_ollama_generate`, `predicted`, `gen_ms`, and the call to `_judge` so that `_judge` is only invoked when prediction succeeded and `failed_qa` is incremented on exceptions.
- Around lines 138-149: The judge currently masks transport/timeouts by catching all exceptions in `_judge` and returning 0.0; instead, detect and surface transport failures so they aren't counted as a "NO". Modify `_judge` (which calls `_ollama_generate` with `JUDGE_PROMPT`) to not convert exceptions into a 0.0 score — either re-raise the exception or return a sentinel (e.g., None or math.nan) and log the error with details — and then update the caller(s) that aggregate scores to skip/handle the sentinel (so only real negative verdicts produce 0.0 and infra failures are excluded from the average).

In `@benchmarks/mem0_locomo_runner.py`:

- Around lines 223-247: The mem0 config currently uses `args.run_id` directly, so when omitted it becomes "None" (`collection_name` f"locomo_{args.run_id}" and path f"/tmp/mem0_{args.run_id}_chroma"), causing persistent reuse across runs; fix by normalizing/assigning a unique default run id inside `_build_mem0_config` (or before calling it) when `args.run_id` is falsy — e.g., set `run_id = args.run_id or` a stable unique fallback (timestamp or uuid4) and use that run_id for `collection_name` and path so each run without an explicit `--run-id` gets an isolated backing store.

Duplicate comments:

In `@benchmarks/locomo_runner.py`:

- Around lines 351-382: The per-conversation temporary directory created by `tempfile.mkdtemp` (used to construct VectorMemory with `db_path`) is never removed and `vmem.close()` is not executed in a finally block; wrap the lifecycle of VectorMemory and its temp dir in a try/finally so that you always `await vmem.close()` (or call sync close if appropriate) and then remove the tmp directory with `shutil.rmtree(tmp, ignore_errors=True)` after closing the DB; keep the existing calls to `await vmem.init(...)` and `_ingest_conversation(...)` inside the try and ensure exception paths (e.g., around the `asyncio.gather` of `_guarded` tasks) still flow to the finally that closes vmem and deletes the temp dir.

In `@benchmarks/mem0_locomo_runner.py`:

- Around lines 409-412: The code is calling `asyncio.get_event_loop()` (assigning to `loop`) while already running inside `run()`; replace that call with `asyncio.get_running_loop()` to avoid deprecated/ambiguous behavior — update the assignment where `loop` is created (near variables `concurrency`, `sem`, `progress_lock`) so the running-loop accessor is used inside the `run()` coroutine.

In `@docs/specs/2026-04-16-routing-benchmark-design.md`:

- Around lines 54-69: The fenced code block that contains the scorecard currently uses an untyped opening fence; update the opening fence to include a language tag (`text`) and keep the closing fence unchanged; this addresses the MD040 lint warning for the scorecard fenced block.
</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes
</details>

---

<details>
<summary>ℹ️ Review info</summary>

**Configuration used**: defaults
**Review profile**: CHILL
**Plan**: Pro Plus
**Run ID**: `edb764c1-6d97-40e6-9184-8e4def7dbfd7`

**Commits**: reviewing files that changed from the base of the PR and between d2f1a1e632e54b06271a0cddd8da9811404ec99b and fd27d2ce574fa3f646e7e64f5ba7fe9e1d8d4de7.

**Files selected for processing (10)**:

* `benchmarks/locomo_runner.py`
* `benchmarks/longmemeval_runner.py`
* `benchmarks/mem0_locomo_runner.py`
* `docs/specs/2026-04-16-routing-benchmark-design.md`
* `docs/specs/2026-04-16-supersede-benchmark-design.md`
* `docs/specs/2026-04-16-taosmd-next-steps.md`
* `docs/specs/2026-04-24-ama-question-draft.md`
* `docs/specs/2026-04-methodology-post-draft.md`
* `eval/librarian_eval.py`
* `pyproject.toml`

**Files skipped from review due to trivial changes (3)**:

* benchmarks/longmemeval_runner.py
* docs/specs/2026-04-methodology-post-draft.md
* docs/specs/2026-04-16-taosmd-next-steps.md

**Files skipped from review as they are similar to previous changes (2)**:

* pyproject.toml
* eval/librarian_eval.py
</details>
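The loop-accessor fix flagged above is easy to verify in isolation. This standalone sketch (not the runner's code; `blocking_search` is a hypothetical stand-in for a synchronous client call) shows `get_running_loop()` used where a coroutine needs the loop object:

```python
import asyncio


def blocking_search(query: str) -> str:
    # Stand-in for a synchronous client call offloaded to a thread.
    return f"results for {query!r}"


async def run() -> str:
    # Inside a coroutine a loop is guaranteed to be running, so
    # get_running_loop() is the unambiguous accessor; get_event_loop()
    # is deprecated in this position since Python 3.10.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_search, "locomo")


print(asyncio.run(run()))  # → results for 'locomo'
```

Outside a running loop, `get_running_loop()` raises `RuntimeError` instead of silently creating a new loop, which is exactly the failure mode you want in a benchmark runner.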
```python
async def _judge(client: httpx.AsyncClient, url: str, model: str,
                 question: str, reference: str, predicted: str) -> float:
    try:
        reply = await _ollama_generate(
            client, url, model,
            JUDGE_PROMPT.format(question=question, reference=reference, predicted=predicted),
            temperature=0.0,
        )
    except Exception:
        return 0.0
    head = reply.strip().upper().split()
    return 1.0 if head and head[0].startswith("YES") else 0.0
```
Judge transport failures are being scored as “NO”.
_judge() returns 0.0 on any exception, so an Ollama timeout/network failure becomes indistinguishable from a real negative grade. That biases the judge average downward and hides infra issues in the report.
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 146-146: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/locomo_runner.py` around lines 138 - 149, The judge currently
masks transport/timeouts by catching all exceptions in _judge and returning 0.0;
instead, detect and surface transport failures so they aren't counted as a "NO".
Modify _judge (which calls _ollama_generate with JUDGE_PROMPT) to not convert
exceptions into a 0.0 score — either re-raise the exception or return a sentinel
(e.g., None or math.nan) and log the error with details — and then update the
caller(s) that aggregate scores to skip/handle the sentinel (so only real
negative verdicts produce 0.0 and infra failures are excluded from the average).
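The sentinel convention the prompt asks for can be sketched in a few lines. `judge_once` and `judge_average` are illustrative names, not the runner's actual helpers, assuming a judge reply that starts with YES/NO:

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def judge_once(call_judge: Callable[[], Awaitable[str]]) -> Optional[float]:
    """Return 1.0/0.0 for a real verdict, None when the judge call fails."""
    try:
        reply = await call_judge()
    except Exception:
        return None  # infra failure: excluded from the average, not a "NO"
    head = reply.strip().upper().split()
    return 1.0 if head and head[0].startswith("YES") else 0.0


def judge_average(scores: list) -> tuple:
    """Average only real verdicts and report the scored denominator."""
    scored = [s for s in scores if s is not None]
    avg = sum(scored) / len(scored) if scored else 0.0
    return avg, len(scored), len(scores)


async def _demo() -> list:
    async def yes() -> str:
        return "YES, prediction matches reference"

    async def timeout() -> str:
        raise TimeoutError("judge endpoint unreachable")

    return [await judge_once(yes), await judge_once(timeout)]


scores = asyncio.run(_demo())
print(judge_average(scores))  # → (1.0, 1, 2): one verdict, one excluded failure
```

The key property: a timeout produces `None` and shrinks the denominator, while a genuine "NO" still contributes a real 0.0.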
```python
        try:
            predicted = await _ollama_generate(
                client, ollama_url, model,
                ANSWER_PROMPT.format(context=context, question=question),
            )
        except Exception as exc:
            predicted = f"[generation_error: {exc}]"
        gen_ms = (time.time() - t1) * 1000.0
        ...
        judge = await _judge(client, ollama_url, model, question, reference, predicted)
```
Generation errors are folded into the benchmark scores.
When answer generation fails, the code stores a synthetic [generation_error: ...] prediction and still computes F1/BLEU/Judge for that row. Those rows count toward total_qa, but failed_qa stays unchanged, so the JSON can report a “complete” run even though some QAs never produced an answer.
Also applies to: 247-260
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 241-241: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/locomo_runner.py` around lines 236 - 245, The try/except around
_ollama_generate currently converts generation exceptions into a synthetic
predicted string and still runs metric/judge computation, which folds generation
failures into scores and doesn't increment failed_qa; change the error path so
that when _ollama_generate raises you (a) record the failure by incrementing
failed_qa (the counter used for failed generations), (b) do not set a synthetic
prediction for scoring (use None or a flag), and (c) skip calling _judge and any
F1/BLEU metric computation for that item (only compute gen_ms as now). Update
the flow around _ollama_generate, predicted, gen_ms, and the call to _judge so
that _judge is only invoked when prediction succeeded and failed_qa is
incremented on exceptions.
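The requested error path can be sketched like this; `answer_one` and its callbacks are hypothetical names, not the runner's, but the flow (record the error, leave metrics None, skip the judge) matches the prompt:

```python
import asyncio
import time


async def answer_one(generate, judge, reference: str) -> dict:
    """On generation failure: record the error, leave metrics None, skip scoring."""
    row = {"predicted": None, "judge": None, "f1": None, "error": None}
    t1 = time.time()
    try:
        row["predicted"] = await generate()
    except Exception as exc:
        row["error"] = str(exc)                # caller counts this toward failed_qa
        row["gen_ms"] = (time.time() - t1) * 1000.0
        return row                             # no synthetic prediction, no scoring
    row["gen_ms"] = (time.time() - t1) * 1000.0
    row["judge"] = await judge(row["predicted"], reference)
    return row


async def _demo() -> list:
    async def ok():
        return "Paris"

    async def boom():
        raise ConnectionError("ollama down")

    async def judge(pred, ref):
        return 1.0 if pred == ref else 0.0

    return [await answer_one(ok, judge, "Paris"),
            await answer_one(boom, judge, "Paris")]


rows = asyncio.run(_demo())
failed_qa = sum(1 for r in rows if r["error"] is not None)
print(failed_qa)  # → 1
```

A run summary built from these rows can then report `failed_qa` honestly instead of folding the failure into F1/BLEU/Judge averages.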
```python
def _build_mem0_config(args: argparse.Namespace) -> dict:
    return {
        "llm": {
            "provider": "ollama",
            "config": {
                "model": args.model,
                "ollama_base_url": args.ollama_url,
                "temperature": 0.2,
            },
        },
        "embedder": {
            "provider": "ollama",
            "config": {
                "model": "nomic-embed-text",
                "ollama_base_url": args.ollama_url,
            },
        },
        "vector_store": {
            "provider": "chroma",
            "config": {
                "collection_name": f"locomo_{args.run_id}",
                "path": f"/tmp/mem0_{args.run_id}_chroma",
            },
        },
    }
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, find and read the file to understand the context
find . -name "mem0_locomo_runner.py" -type f
```

Repository: jaylfc/taosmd
Length of output: 91

🏁 Script executed:

```shell
# Search for the file and examine its content
fd "mem0_locomo_runner.py" --type f
```

Repository: jaylfc/taosmd
Length of output: 89

🏁 Script executed:

```shell
# Read the entire file to understand the context
cat -n benchmarks/mem0_locomo_runner.py
```

Repository: jaylfc/taosmd
Length of output: 23880
The default mem0 backing store is reused across runs without an explicit --run-id.
With --run-id omitted, collection_name defaults to locomo_None and the Chroma vector store persists to /tmp/mem0_None_chroma. Re-running the benchmark reuses the same persisted collection instead of starting clean, allowing old memories to be retrieved again and making runs irreproducible. Always pass --run-id to isolate benchmark runs.
🧰 Tools
🪛 Ruff (0.15.10)
[error] 244-244: Probable insecure usage of temporary file or directory: "/tmp/mem0_"
(S108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/mem0_locomo_runner.py` around lines 223 - 247, The mem0 config
currently uses args.run_id directly so when omitted it becomes "None"
(collection_name f"locomo_{args.run_id}" and path
f"/tmp/mem0_{args.run_id}_chroma") causing persistent reuse across runs; fix by
normalizing/assigning a unique default run id inside _build_mem0_config (or
before calling it) when args.run_id is falsy — e.g., set run_id = args.run_id or
generate a stable unique fallback (timestamp or uuid4) and use that run_id for
collection_name and path so each run without an explicit --run-id gets an
isolated backing store.
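A minimal sketch of that fallback, assuming a `normalize_run_id` helper added near `_build_mem0_config` (the helper name is illustrative, not in the current code):

```python
import time
import uuid


def normalize_run_id(run_id) -> str:
    """Fall back to a unique id when --run-id is omitted, so each run gets an
    isolated Chroma collection/path instead of reusing 'locomo_None'."""
    if run_id:  # explicit id: resumable, caller-controlled store
        return str(run_id)
    return f"{time.strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"


rid = normalize_run_id(None)
collection_name = f"locomo_{rid}"
store_path = f"/tmp/mem0_{rid}_chroma"
print(collection_name != "locomo_None")  # → True
```

The timestamp prefix keeps artifacts sortable on disk; the uuid suffix guards against two runs starting in the same second.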
```python
            # evidence_hits/evidence_total not computable without dia_id metadata from
            # mem0's stored facts (mem0 doesn't preserve per-turn dia_ids). Set to 0.
            # TODO: if mem0 exposes metadata round-trip, wire dia_id here.
            "evidence_hits": 0,
            "evidence_total": len(evidence),
```
R@K is guaranteed to report 0.0 here.
These rows keep evidence_total=len(evidence) while hardcoding evidence_hits=0 because mem0 does not round-trip dia_id. _summary() treats that as a real miss, so every mem0 run publishes retrieval_recall=0.0 even when retrieval succeeded. Mark the metric unavailable instead of zeroing it out, or plumb evidence IDs through mem0 first.
🐛 Proposed fix

```diff
-    "evidence_hits": 0,
-    "evidence_total": len(evidence),
+    "evidence_hits": 0,
+    "evidence_total": 0,  # exclude unavailable metric from retrieval_recall
+    "retrieval_recall_unavailable": True,
     "error": str(exc),
 }
 ...
-    "evidence_hits": 0,
-    "evidence_total": len(evidence),
+    "evidence_hits": 0,
+    "evidence_total": 0,
+    "retrieval_recall_unavailable": True,
 }
```

Also applies to: 371-375
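The distinction between "zero recall" and "recall unavailable" can be made concrete with a small aggregator sketch (assumed names, not the runner's `_summary` verbatim):

```python
def retrieval_recall(rows: list) -> tuple:
    """Aggregate R@K over rows where the metric is available.
    Rows whose evidence_total is None (the adapter cannot round-trip
    dia_ids) are excluded from the denominator, not counted as misses."""
    hits = total = scored = 0
    for r in rows:
        if r.get("evidence_total") is None:
            continue  # metric unavailable, not a zero
        scored += 1
        hits += r.get("evidence_hits") or 0
        total += r["evidence_total"]
    recall = hits / total if total else None
    return recall, scored, len(rows)


rows = [
    {"evidence_hits": 1, "evidence_total": 2},        # real partial hit
    {"evidence_hits": None, "evidence_total": None},  # mem0: unavailable
]
print(retrieval_recall(rows))  # → (0.5, 1, 2)
```

With the hardcoded-zero scheme the same two rows would report recall 0.25 and no hint that half the denominator was an adapter artefact.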
CodeRabbit flagged four benchmark-integrity issues on #30. All real — they let infra failures and adapter artefacts masquerade as real negative results, biasing the published numbers and hiding problems.

1. `_judge()` returned 0.0 on an Ollama timeout / network error — identical to a genuine "NO" grade. Now it returns None; `_summary` filters None out of the Judge average so transport flakiness doesn't depress the score. The row still records 0.0 vs None distinctly.

2. Generation errors stored a synthetic "[generation_error: ...]" prediction and then computed F1/BLEU/Judge on it. That folded infra failures into the benchmark averages, and `failed_qa` stayed zero, so the run reported "complete" despite missing answers. Now: on generation failure, predicted=""; f1/bleu/judge=None; the row carries an `error` field; `_guarded` increments `failed_qa` for those rows; `_summary` excludes None metrics.

3. `mem0_locomo_runner.py` hardcoded evidence_hits=0 in every row because mem0 2.x doesn't round-trip per-turn dia_ids. `_summary` then published retrieval_recall=0.0 for every mem0 run — a fake miss. Now it sets evidence_hits and evidence_total to None (metric unavailable, not zero), and `_summary` skips None rows when computing recall.

4. The mem0 runner inherited both the 0.0-on-failure judge and the error-folding pattern. Both paths are fixed to the same None-on-failure convention.

`_summary` now also emits judge_scored + recall_scored alongside count, so the JSON shows the denominator honestly (e.g. "Judge 0.41 over 1487 scored of 1540 total"), making infra flakiness inspectable rather than invisible.

No schema-breaking changes: existing .rescored.json outputs that have evidence_hits=0 or judge=0.0 remain readable — the new `_summary` treats them as real zeros, which is how they were when written. Only forward runs produce None for "metric unavailable".
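The denominator-honest summary described in the commit message can be sketched as follows; the keys `judge_scored` and `failed_qa` mirror the message, but the exact `_summary` shape in the runner may differ:

```python
def summarize(rows: list) -> dict:
    """Emit the Judge average plus both denominators, so the JSON can
    distinguish 'Judge 0.5 over 2 scored' from 'over 3 total'."""
    judges = [r["judge"] for r in rows if r.get("judge") is not None]
    return {
        "count": len(rows),
        "judge": sum(judges) / len(judges) if judges else None,
        "judge_scored": len(judges),
        "failed_qa": sum(1 for r in rows if r.get("error")),
    }


rows = [
    {"judge": 1.0},
    {"judge": None, "error": "judge timeout"},  # excluded, surfaced in failed_qa
    {"judge": 0.0},                             # a real "NO" still counts
]
print(summarize(rows))
# → {'count': 3, 'judge': 0.5, 'judge_scored': 2, 'failed_qa': 1}
```

Note the backward-compatibility property from the commit message: a legacy row with `judge=0.0` is treated as a real zero, while only forward runs produce `None` for "metric unavailable".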
Third memory architecture in the comparison harness. Same generator (gemma4:e2b), same ANSWER_PROMPT, same JUDGE_PROMPT, same top-K, same 1540 QAs as the taosmd and mem0 runners. Only the retrieval layer changes — it routes search through `mempalace.searcher.search_memories()`.

MemPalace has never published end-to-end Judge on LoCoMo — their own benchmarks/BENCHMARKS.md reports R@10 only (60.3% raw, 88.9% hybrid v5 per their 2026-03 results). Our run adds the novel measurement so the three-way comparison is done under identical conditions.

Inherits the silent-failure conventions from PR #33: judge timeouts and generation errors return None (not 0.0) so `_summary` excludes them from averages. evidence_hits/evidence_total are reported as None since MemPalace doesn't round-trip LoCoMo dia_id.

Requires: `pip install mempalace`
Summary
Comprehensive LoCoMo benchmark PR — lands all the work that produced our published LongMemEval 97% and the new LoCoMo numbers. Supersedes the closed #25 (content subsumed into #26's unintentional bundle).
What lands
Benchmark infrastructure (~2000 LOC):
- `benchmarks/locomo_runner.py` — full LoCoMo runner (ONNX embeds + Ollama answer/judge, F1/BLEU/Judge/R@K per category, `--concurrency`, `--per-conv-limit`, `--timeout`, `--model` flags, live progress markers)
- `benchmarks/longmemeval_runner.py` — ported off dead tinyagentos imports, now uses `TAOSMD_OLLAMA_URL` / `TAOSMD_OLLAMA_MODEL` env vars (no hardcoded IPs)
- `benchmarks/mem0_locomo_runner.py` — apples-to-apples adapter routing retrieval through `mem0.search()` with the same generator/prompt/judge (for the upcoming memory-architecture comparison)
- `benchmarks/locomo_rescore.py` — already merged via "feat(bench): locomo_rescore + resume tools (external judge with better diagnostics)" #29, included here only by branch history
- `eval/librarian_eval.py` — fixed divide-by-zero token ratio (replaced with absolute per-query budget)

Prompt tweaks (the +0.03 Overall Judge improvement, validated Apr 18–19):
- `benchmarks/locomo_runner.py` `ANSWER_PROMPT` — adds absolute-date instruction + softened IDK instruction

Methodology docs:
- `docs/specs/2026-04-16-taosmd-next-steps.md` — four-workstream roadmap
- `docs/specs/2026-04-16-supersede-benchmark-design.md`
- `docs/specs/2026-04-16-routing-benchmark-design.md`
- `docs/specs/2026-04-24-ama-question-draft.md` — Imran Ahmad AMA on r/LLMeng
- `docs/specs/2026-04-methodology-post-draft.md` — publishable protocol-transparency post

README framing (updated on prior commits on this branch):
Validated results
LoCoMo full scorecard under `qwen3:4b` external judge (100% coverage, 0 errors).

Test plan
- (`pytest tests/ -q --ignore=tests/integration`)
- `feat/locomo-benchmark` and `feat/locomo-prompt-opt`
- `git fetch --prune && git checkout master` before next benchmark run

Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores