docs(specs): LoCoMo scorecard log — taosmd variants + mem0, ongoing #34
Conversation
Single source of truth for every LoCoMo number we've produced so they don't live only in chat transcripts. Captures:

- Self-judge scorecards for taosmd-e2b, taosmd-e4b, taosmd-e2b+prompt-opt, mem0-e2b (all runs 2026-04-17 to 2026-04-19)
- External qwen3:4b rescore numbers for the three taosmd variants (100% coverage, 0 errors). mem0 rescore queued.
- Per-category tables, not just headlines — Temporal 0.29 vs 0.02 (14.5x) is the most dramatic architecture signal
- Known artefacts: mem0 R@K=0.0 is an adapter limitation (no dia_id pass-through), patched in PR #33
- Methodology disclosures: same generator (gemma4:e2b), same prompt, same dataset, same top-K=10, same judge (qwen3:4b), commit SHAs for every input
- Follow-up: mem0 external rescore in flight, MemPalace adapter queued — will add scorecards to this doc as they complete
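The "Overall F1" rows mentioned above are token-overlap scores. A minimal sketch of how such a score is commonly computed, assuming SQuAD-style token F1 — this is an illustration, not the benchmark harness's actual code, which may normalise answers differently:

```python
# Hypothetical sketch of a token-overlap F1 as used in LoCoMo-style scoring.
# Assumption: SQuAD-style bag-of-tokens F1; the real harness may differ.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Per-run "Overall F1" would then be the mean of this score over all QA items.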
> **Note: Reviews paused.** This branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds a new LoCoMo benchmark scorecard documenting LoCoMo‑10 runs across taosmd variants, mem0, and MemPalace: dataset/category scheme, host/runtime/generator/judge configs, self-judge and external rescore tables with JSON links, methodology/commit disclosures, metric notes, and run status.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Code Review Summary

Status: No Issues Found | Recommendation: Merge | Files reviewed: 1

Reviewed by seed-2-0-pro-260328 · 215,955 tokens
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/specs/2026-04-19-locomo-scorecards.md (1)
49-61: Inconsistent data mixing in external judge table.

Line 60 shows `0.05 (self-graded)` for mem0's Overall F1 within the "External qwen3:4b judge" table. This mixes methodologies — the self-graded value is already presented in the self-judge table (line 41). For consistency, consider using `—` or `pending` for mem0's F1 in this table, or move the parenthetical note outside the table to avoid confusion.

📝 Suggested table consistency fix

```diff
 | Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b |
 |---|---|---|---|---|
 | Single-hop (282) | 0.08 | 0.18 | 0.16 | — |
 | Temporal (321) | 0.13 | 0.14 | 0.21 | — |
 | Multi-hop (96) | 0.05 | 0.00 | 0.00 | — |
 | Open-dom (841) | 0.43 | 0.30 | 0.46 | — |
 | **Overall Judge** | **0.27** | **0.22** | **0.34** | **pending** |
-| Overall F1 | 0.162 | 0.152 | 0.175 | 0.05 (self-graded) |
+| Overall F1 | 0.162 | 0.152 | 0.175 | — |
```

Then optionally add a note below the table:

> mem0 F1 from self-judge: 0.05 (see self-judge table above).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 49 - 61, Replace the inconsistent "0.05 (self-graded)" cell under the "Overall F1" row in the "External qwen3:4b judge" table (the mem0 column) with a neutral placeholder such as "—" or "pending" to avoid mixing self-graded results into the external-judge table, and add a short footnote below the table like: "mem0 F1 from self-judge: 0.05 (see self-judge table above)" to preserve the information without contaminating the external judge matrix; update the table cell for mem0 and insert the single-line note immediately under the table.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/specs/2026-04-19-locomo-scorecards.md`:
- Line 125: The three commit SHAs referenced (`40403cc`, `86c4c19`, `3c5c6c2`)
in the methodology table row "Commits | taosmd runs: `40403cc` (runner) +
`86c4c19` (rescore); prompt-opt: `3c5c6c2`" are not in the repo history; update
that table cell by replacing each invalid SHA with the correct existing commit
SHAs that correspond to the runner, rescore, and prompt-opt changes, or if those
commits are not applicable remove the SHA references entirely and leave a clear
statement (e.g., "not available" or omit commit-level detail) so the
reproducibility claim is accurate.
- Line 47: The documentation references a non-existent result filename
`locomo_20260419_185944_full_mem0_e2b_noinfer_full_mem0_e2b.json` (and three
other similarly missing result files); either remove these incorrect
`benchmarks/results/...` entries from the spec or replace them with the actual
existing result filenames (e.g., the `matrix_*.json` files that live in
benchmarks/results) so the listed result files match files in the repository;
update the list in the specs document accordingly to ensure all referenced
result filenames are present and correct.
🪄 Autofix (Beta): fix all unresolved CodeRabbit comments on this PR by pushing a commit to this branch (recommended) or creating a new PR with the fixes.

ℹ️ Review info: configuration defaults · profile CHILL · plan Pro Plus · Run ID ead23815-c57e-4d7d-987a-5875f983a9ae · files selected for processing (1): docs/specs/2026-04-19-locomo-scorecards.md
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/specs/2026-04-19-locomo-scorecards.md (1)
71-72: Avoid hard ETA timestamps in a living scorecard log.

These ETAs will go stale quickly and can mislead later readers. Prefer a durable status format like "as of <timestamp>: in progress/completed" plus a follow-up commit reference when finalized.

Also applies to: 131-133
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 71 - 72, Replace the hard ETA text in the scorecard entry "mem0 rescore output pending (in flight, ETA ~23:00 BST 2026-04-19)" with a durable status format — e.g. "mem0 rescore output: in progress (as of 2026-04-19T23:00:00Z)" or "in progress (as of <timestamp>)" and when finalizing add a follow-up commit/reference; apply the same change to the other occurrences referenced in the comment (the similar entries around lines 131-133) so no static ETA timestamps remain.
🪄 Autofix (Beta): fix all unresolved CodeRabbit comments on this PR by pushing a commit to this branch (recommended) or creating a new PR with the fixes.

ℹ️ Review info: configuration defaults · profile CHILL · plan Pro Plus · Run ID f4c1ddb3-3e66-4571-8190-271f7e29a468 · files selected for processing (1): docs/specs/2026-04-19-locomo-scorecards.md
> - **mem0 R@K always reports 0.0** because mem0 doesn't round-trip per-turn
>   `dia_id` metadata, so the adapter hardcodes `evidence_hits=0`. PR #33
>   patches this to `None` (metric unavailable) so the summary drops it from
>   recall aggregation rather than publishing a fake 0.0. Existing mem0
Fix minor wording/grammar to keep the spec polished
Two small text issues:
- Line 103: extra spacing in “Existing mem0  result JSON” (double space between “mem0” and “result”).
- Lines 147-149: “commit an update the same PR/branch” is missing “in” → “commit an update in the same PR/branch.”
Also applies to: 147-149
🧰 Tools
🪛 LanguageTool
[grammar] ~103-~103: Ensure spelling is correct
Context: ...er than publishing a fake 0.0. Existing mem0 result JSON should be rescored via the fixed `...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specs/2026-04-19-locomo-scorecards.md` at line 103, Fix the two wording
issues in the spec: remove the extra space in the phrase "Existing mem0  result
JSON" so it reads "Existing mem0 result JSON", and update the phrase "commit an
update the same PR/branch" to "commit an update in the same PR/branch". Make
these edits in the doc content where those exact phrases appear (look for
"Existing mem0 result JSON" and "commit an update the same PR/branch").
Two corrections in one:

1. The external qwen3:4b scorecards table had wrong numbers (0.27 / 0.22 / 0.34 for taosmd variants). Those were the earlier qwen3.5:9b biased-sample numbers that got superseded but I left in the table by mistake. Now corrected to the actual qwen3:4b 100%-coverage numbers (0.40 / 0.38 / 0.41) directly from the streaming rescore log. Per-category rows also restated from source.

2. mem0 rescore completed in 116.9 min, 100% coverage, 0 errors:
   - Single-hop 0.04 / Temporal 0.02 / Multi-hop 0.10 / Open-dom 0.07
   - Overall Judge 0.06

Added to the same table. Biggest architecture gap is Temporal (taosmd-e2b+prompt-opt 0.41 vs mem0 0.02 = 20.5x). Overall gap ~7x under identical external judge, same generator.

Also refreshed the "In flight / queued" section: mem0 rescore done, MemPalace adapter already built as `ca0ccb7` (landed in PR #30, ready to run — just needs `pip install mempalace` on the Fedora host).

The earlier stale numbers are kept in the caveat block so anyone comparing against chat history or the push notifications knows why they shifted.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/specs/2026-04-19-locomo-scorecards.md (1)
157-158: ⚠️ Potential issue | 🟡 Minor

Fix grammatical error in update guidance.

The phrase "commit an update the same PR/branch" is missing a preposition. Change to either "commit an update to the same PR/branch" or "commit an update in the same PR/branch".

📝 Proposed fix

```diff
-After every new run completes (or its rescore completes) — commit an update
-the same PR/branch that ships the numbers. Don't let scorecards live only
+After every new run completes (or its rescore completes) — commit an update in
+the same PR/branch that ships the numbers. Don't let scorecards live only
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 157 - 158, Replace the ungrammatical phrase "commit an update the same PR/branch" with a corrected version—e.g., "commit an update to the same PR/branch"—so the sentence reads "After every new run completes (or its rescore completes) — commit an update to the same PR/branch that ships the numbers." Locate the exact phrase in the docs/specs text (the string "commit an update the same PR/branch") and insert "to" (or alternatively "in") between "update" and "the" to fix the grammar.
🪄 Autofix (Beta): fix all unresolved CodeRabbit comments on this PR by pushing a commit to this branch (recommended) or creating a new PR with the fixes.

ℹ️ Review info: configuration defaults · profile CHILL · plan Pro Plus · Run ID 0905c2da-3058-4a1c-8247-7d6fdf0b0a90 · files selected for processing (1): docs/specs/2026-04-19-locomo-scorecards.md
> recall aggregation rather than publishing a fake 0.0. Existing mem0
> result JSON should be rescored via the fixed `_summary` logic once #33
Remove extra spacing between "mem0" and "result".

Static analysis detects extra spaces in "Existing mem0  result JSON". Please reduce to a single space: "Existing mem0 result JSON".
🧰 Tools
🪛 LanguageTool
[grammar] ~111-~111: Ensure spelling is correct
Context: ...er than publishing a fake 0.0. Existing mem0 result JSON should be rescored via the fixed `...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 111 - 112, Update
the phrase that currently contains multiple spaces ("Existing mem0  result
JSON") to use a single space ("Existing mem0 result JSON") in the sentence
mentioning rescoring via the fixed `_summary` logic and reference to issue `#33`
so the text reads: "Existing mem0 result JSON should be rescored via the fixed
`_summary` logic once `#33`".
Captures every model actually used during the benchmark (generator variants, external judge, embedders, cross-encoder, fact extractor) with params, quant, VRAM footprint, and backend. Adds the runtime/host row so anyone reproducing knows the Ollama parallel limit and rescore timeout.

Derives hardware-tier recommendations from what we measured:

- Orange Pi (RK3588 NPU, 16 GB): qwen3:4b gen on rkllama, external judge, MiniLM ONNX embed, taosmd arch
- Fedora 3060 (12 GB VRAM): gemma4:e2b gen, qwen3:4b judge co-resident, prompt-opt on by default
- Laptop / Mac Mini: qwen3:4b gen via Ollama, external judge
- High-end (≥24 GB): qwen3.5:9b gen viable; e2b still competitive

Documents the seven lessons that drive the defaults: bigger-gen-≠-better at small scale, qwen for structured output, NUM_PARALLEL is the real ceiling, nomic context forces batching, architecture dominates generator choice, self-judge inflates, R@K needs dia_id round-trip.

Also corrects the Commits row: superseded SHAs (ca0ccb7 → 571d8af for mempalace) and references the right open PRs (#34, #35, #36).
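The hardware-tier defaults above can be pictured as a config sketch. Illustrative only: the key names and structure are hypothetical (not the repo's actual config format); the model tags mirror the commit message.

```python
# Illustrative config sketch of the hardware-tier defaults described above.
# Structure and key names are hypothetical; model tags come from the text.
TIER_DEFAULTS = {
    "orange_pi_rk3588_16gb": {
        "generator": "qwen3:4b",     # served via rkllama on the NPU
        "judge": "qwen3:4b",         # external judge
        "embedder": "MiniLM (ONNX)",
        "prompt_opt": False,
    },
    "fedora_rtx3060_12gb": {
        "generator": "gemma4:e2b",
        "judge": "qwen3:4b",         # co-resident on the same 12 GB GPU
        "prompt_opt": True,          # on by default at this tier
    },
    "laptop_or_mac_mini": {
        "generator": "qwen3:4b",     # via Ollama
        "judge": "qwen3:4b",         # external judge
    },
    "high_end_24gb_plus": {
        "generator": "qwen3.5:9b",   # viable here; e2b still competitive
    },
}
```

The point of the sketch is the shape of the decision, not the exact keys: generator choice tracks VRAM, and prompt-opt is only defaulted on where the judge can co-reside.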
…ry split

MemPalace-e2b full run completed. Self-judge Overall 0.42 — much closer to taosmd (0.48) than to mem0 (0.09). Per-category:

- MemPalace beats baseline taosmd on Temporal (0.33 vs 0.29) + Multi-hop (0.24 vs 0.22)
- taosmd pulls ahead on Open-dom (0.64 vs 0.51) + Single-hop (0.34 vs 0.29)
- prompt-opt variant still the Overall leader at 0.51
- mem0 a distant fourth on every category

Story shifts from "taosmd wins by 7x over competitors" to "taosmd and MemPalace are in the same tier, mem0 is much further behind — and raw verbatim-store + a sensible default embedder is a strong baseline on its own."

Also added ingest-timing comparison: MemPalace fastest at ~100s for all 10 convs (simpler architecture = less processing per turn).

External rescore for MemPalace is running now on Fedora, ETA ~01:55 BST.
Actionable comments posted: 3
♻️ Duplicate comments (2)
docs/specs/2026-04-19-locomo-scorecards.md (2)
225-226: ⚠️ Potential issue | 🟡 Minor

Fix grammar: missing preposition.

Lines 225-226 read "commit an update the same PR/branch" but should include "in" or "to": "commit an update in the same PR/branch".

📝 Proposed fix

```diff
-After every new run completes (or its rescore completes) — commit an update
-the same PR/branch that ships the numbers. Don't let scorecards live only
+After every new run completes (or its rescore completes) — commit an update in
+the same PR/branch that ships the numbers. Don't let scorecards live only
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 225 - 226, Fix the grammar in the sentence fragment "commit an update the same PR/branch" by inserting the missing preposition: change it to "commit an update in the same PR/branch" (or "commit an update to the same PR/branch") so the clause reads correctly; update the occurrence of that phrase in the document section containing "After every new run completes (or its rescore completes) — commit an update the same PR/branch that ships the numbers." to include "in" (or "to") after "update".
122-123: ⚠️ Potential issue | 🟡 Minor

Remove extra spacing between "mem0" and "result".

Static analysis detects multiple spaces in "Existing mem0  result JSON". Reduce to a single space: "Existing mem0 result JSON".

📝 Proposed fix

```diff
-  result JSON should be rescored via the fixed `_summary` logic once `#33`
+ result JSON should be rescored via the fixed `_summary` logic once `#33`
```

Or, if the extra spaces are on the previous line (line 122), adjust there to ensure a single space between words.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 122 - 123, The phrase "Existing mem0  result JSON" contains multiple consecutive spaces; update the spec text to use a single space so it reads "Existing mem0 result JSON" (also check the prior line if the extra spaces originate there), leaving references like mem0 and the `_summary` logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/specs/2026-04-19-locomo-scorecards.md`:
- Around line 52-56: The doc references five missing result files in
docs/specs/2026-04-19-locomo-scorecards.md; either add the exact JSON files
(benchmarks/results/locomo_20260417_140810_full_gemma_e2b.json,
locomo_20260418_035232_full_gemma_e4b.json,
locomo_20260418_130212_full_gemma_e2b_opt.json,
locomo_20260419_185944_full_mem0_e2b_noinfer_full_mem0_e2b.json,
locomo_20260419_225441_full_mempalace_e2b_full_mempalace_e2b.json) to
benchmarks/results/ and commit them, or update the markdown in
locomo-scorecards.md to remove or replace those filenames with the
correct/available result files and adjust any reproducibility instructions or
links accordingly.
- Around line 84-88: The docs file references four non-existent rescore output
files (locomo_20260417_140810_full_gemma_e2b.rescored_v2.json,
locomo_20260418_035232_full_gemma_e4b.rescored_v2.json,
locomo_20260418_130212_full_gemma_e2b_opt.rescored_v2.json,
locomo_20260419_185944_full_mem0_e2b_noinfer.rescored_v2.json) in the "Rescore
output files" list; either add those files to the repo or remove/replace those
entries in docs/specs/2026-04-19-locomo-scorecards.md (the "Rescore output files
(include per-item `judge_rejudged`):" section) with the correct existing
filenames or a note explaining they are unavailable. Ensure the listed filenames
in that section match actual files in benchmarks/results/ or update the text to
not reference missing artifacts.
- Line 144: The methodology table contains incorrect/unverifiable commit SHAs:
replace or verify the commit for PR `#30` (currently listed as fd27d2c) by looking
up the actual merged commit SHA and update the table entry (or mark PR `#30` as
not merged if appropriate); for the pending PRs SHAs (bc0a773, 571d8af, ca0ccb7)
either remove those SHA references until they exist or change their notes to
explicitly state “pending/not yet merged” (and if 571d8af supersedes ca0ccb7,
keep only the active PR note); ensure the table cells referencing these SHAs
(the “Commits” row) reflect verified SHAs or clear pending status so every
listed commit can be validated.
---
🪄 Autofix (Beta): fix all unresolved CodeRabbit comments on this PR by pushing a commit to this branch (recommended) or creating a new PR with the fixes.

ℹ️ Review info: configuration defaults · profile CHILL · plan Pro Plus · Run ID 8bb1fc9a-418e-4699-9c2c-42eeda82d074 · files selected for processing (1): docs/specs/2026-04-19-locomo-scorecards.md
MemPalace-e2b external qwen3:4b rescore: Overall Judge 0.34 (180.5 min, 100% coverage, 0 errors). All three architectures now have the same treatment: same generator, same prompt, same external judge, same 1540 QAs. Only the memory layer varies.

Final headline numbers (external Judge, gemma4:e2b generator):

- taosmd-e2b+prompt-opt 0.41
- taosmd-e2b 0.40
- taosmd-e4b 0.38
- MemPalace-e2b 0.34
- mem0-e2b (infer=False) 0.06

Per-category reveals a more nuanced story than the Overall numbers:

- Single-hop: three-way tie at ~0.16-0.17 — solved at this tier by any competent semantic-retrieval system
- Temporal: taosmd (0.36) and MemPalace (0.35) nearly tied; only prompt-opt breaks away at 0.41
- Multi-hop: taosmd-opt leads at 0.24; KG + query expansion help on synthesis questions
- Open-dom: taosmd's clearest architectural win (0.51 vs MemPalace 0.41, +24% relative)
- mem0 distant fourth everywhere

Reframes the positioning: taosmd's architecture edge concentrates on harder question types that benefit from rerank + synthesis (Open-dom, Multi-hop); on simpler retrieval (Single-hop, Temporal) MemPalace's verbatim-store + default embedder is nearly as good. Cleaner story than "we dominate" and more useful for positioning against the target audiences documented in project_taosmd_positioning.md.

Next: README rewrite aligned with that positioning memory and these nuanced numbers — lead with target audiences (SBC, taOS clusters, offline/compliance, long-horizon agents), frame benchmark numbers as "at the compute tier we target," highlight architectural edge on the categories where it actually shows.
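The per-category and Overall numbers above are related by the category counts from the scorecard tables (282 + 321 + 96 + 841 = 1540 QAs). As a sanity-check sketch — assuming, which the harness code is not quoted here to confirm, that "Overall Judge" is the count-weighted mean of the per-category judge scores:

```python
# Assumption: Overall Judge = count-weighted mean of per-category scores.
# Category counts come from the scorecard tables in this PR.
COUNTS = {"single_hop": 282, "temporal": 321, "multi_hop": 96, "open_dom": 841}

def overall_judge(per_category: dict[str, float]) -> float:
    """Weight each category's score by its QA count, then average."""
    total = sum(COUNTS.values())  # 1540 QAs
    return sum(COUNTS[c] * per_category[c] for c in COUNTS) / total
```

Under this assumption, Open-dom (841 of 1540 items) dominates the headline number, which is consistent with taosmd's Open-dom lead translating into its Overall lead.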
Actionable comments posted: 4
♻️ Duplicate comments (2)
docs/specs/2026-04-19-locomo-scorecards.md (2)
234-235: ⚠️ Potential issue | 🟡 Minor

Fix grammatical error: missing preposition.

Lines 234-235 read "commit an update the same PR/branch" but should be "commit an update in the same PR/branch" or "commit an update to the same PR/branch."

✏️ Proposed fix

```diff
-After every new run completes (or its rescore completes) — commit an update
-the same PR/branch that ships the numbers. Don't let scorecards live only
+After every new run completes (or its rescore completes) — commit an update in
+the same PR/branch that ships the numbers. Don't let scorecards live only
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 234 - 235, The sentence "commit an update the same PR/branch that ships the numbers." is missing a preposition; update the phrasing in the docs/specs text (around the sentence near "After every new run completes...") to read either "commit an update in the same PR/branch that ships the numbers." or "commit an update to the same PR/branch that ships the numbers." so the grammar is correct and the intended meaning is preserved.
130-131: ⚠️ Potential issue | 🟡 Minor

Remove extra spacing between "mem0" and "result".

Static analysis detects multiple spaces in "Existing mem0  result JSON" at line 130. This was flagged in previous reviews; please reduce to a single space.

✏️ Proposed fix

```diff
- recall aggregation rather than publishing a fake 0.0. Existing mem0
-  result JSON should be rescored via the fixed `_summary` logic once `#33`
+ recall aggregation rather than publishing a fake 0.0. Existing mem0
+ result JSON should be rescored via the fixed `_summary` logic once `#33`
```

Note: ensure there is only a single space between "mem0" and "result" when viewing the raw file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 130 - 131, The phrase "Existing mem0  result JSON" contains multiple consecutive spaces; update the doc text (the string containing "Existing mem0  result JSON") to use a single space so it reads "Existing mem0 result JSON", ensuring the raw file has exactly one space between "mem0" and "result" (verify by opening the source and removing the extra space).
🧹 Nitpick comments (2)
docs/specs/2026-04-19-locomo-scorecards.md (2)
23-24: Clarify "three taosmd runs" to avoid confusion with total run count.

Line 24 states "100% coverage / 0 errors on the three taosmd runs," but the external judge table (lines 64-71) includes five systems total (three taosmd variants + mem0 + MemPalace). While technically accurate, this phrasing may confuse readers. Consider: "100% coverage / 0 errors on all five systems (three taosmd variants, mem0, and MemPalace)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 23 - 24, Update the ambiguous sentence that reads "100% coverage / 0 errors on the three taosmd runs" to explicitly state the full set of systems evaluated so readers aren’t confused by the external judge table; locate the exact phrase and replace it with something like "100% coverage / 0 errors on all five systems (three taosmd variants, mem0, and MemPalace)" so it references the three taosmd variants plus mem0 and MemPalace.
64-64: Make column headers consistent between self-judge and external judge tables.

The self-judge table (line 34) labels the mem0 column as "mem0-e2b (infer=False)" while the external judge table (line 64) uses just "mem0-e2b". For consistency and clarity, use the same label in both tables.

📝 Suggested fix

```diff
-| Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b | MemPalace-e2b |
+| Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b (infer=False) | MemPalace-e2b |
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specs/2026-04-19-locomo-scorecards.md` at line 64, The external judge table header is inconsistent with the self-judge table: update the header label in the external judge table (the column currently named "mem0-e2b") to match the self-judge table's label "mem0-e2b (infer=False)" so both tables use the exact same column name; locate the external judge table header row containing "| Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b | MemPalace-e2b |" and replace the "mem0-e2b" token with "mem0-e2b (infer=False)".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/specs/2026-04-19-locomo-scorecards.md`:
- Around line 114-117: The overall judge improvement claim is incorrect: update
the sentence referencing "taosmd-e2b-opt" (and the comparison between
"taosmd-e2b" and "taosmd-e2b+prompt-opt") to state a +0.01 Overall Judge gain
(0.40 → 0.41) instead of +0.07, and while editing confirm the per-category
deltas for Temporal (+0.05) and Multi-hop (+0.03) match the external judge table
values before committing the change.
- Line 152: Update the table row that lists commit SHAs to remove or correct the
four non-existent references (fd27d2c, bc0a773, 571d8af, ca0ccb7) so only valid
commits remain; either replace each invalid SHA with the correct commit hash or
delete the corresponding mention (e.g., remove the "Runner + adapter +
prompt-opt: `fd27d2c`", "Silent-failure fixes: `bc0a773`", "MemPalace adapter
(chromadb path): `571d8af`" and the superseded `ca0ccb7`), leaving the valid
entries (`c360faa`, `116edab`) intact and ensuring the surrounding text/PR
references still read sensibly.
- Around line 91-96: The docs/specs file references five rescored result files
that don't exist; update the spec to either remove those five filenames or
commit the missing JSON files. Specifically, in
docs/specs/2026-04-19-locomo-scorecards.md remove or replace the list entries
for benchmarks/results/locomo_20260417_140810_full_gemma_e2b.rescored_v2.json,
benchmarks/results/locomo_20260418_035232_full_gemma_e4b.rescored_v2.json,
benchmarks/results/locomo_20260418_130212_full_gemma_e2b_opt.rescored_v2.json,
benchmarks/results/locomo_20260419_185944_full_mem0_e2b_noinfer.rescored_v2.json,
and
benchmarks/results/locomo_20260419_225441_full_mempalace_e2b.rescored_v2.json;
if you choose to add the files, commit them under benchmarks/results with the
exact names used in the spec so the documentation links resolve.
- Around line 52-56: The documentation references five non-existent result files
in locomo-scorecards.md; either add the missing JSONs to benchmarks/results with
the exact filenames listed (locomo_20260417_140810_full_gemma_e2b.json,
locomo_20260418_035232_full_gemma_e4b.json,
locomo_20260418_130212_full_gemma_e2b_opt.json,
locomo_20260419_185944_full_mem0_e2b_noinfer_full_mem0_e2b.json,
locomo_20260419_225441_full_mempalace_e2b_full_mempalace_e2b.json) ensuring they
are valid result JSONs, or remove the five lines referencing those filenames
from locomo-scorecards.md so the doc no longer points to missing files.

📒 Files selected for processing (1)
docs/specs/2026-04-19-locomo-scorecards.md
| Self-judge | gemma4:e2b (i.e. the same generator) |
| External judge | qwen3:4b via Ollama, temperature 0.0, `benchmarks/locomo_rescore_streaming.py` |
| Timeout (rescore) | 240s per call, concurrency 3 |
| Commits | Runner + adapter + prompt-opt: `fd27d2c` (PR #30, merged). Rescore tool: `c360faa` (PR #29, merged). README Judge framing: `116edab`/`PR #31` + `fix/readme-judge-accuracy-consistency` (PR #32, merged). Silent-failure fixes: `bc0a773` (PR #35, pending). MemPalace adapter (chromadb path): `571d8af` (PR #36, pending — supersedes earlier `ca0ccb7` palace/mine attempt). |
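The external-judge configuration in the table above (qwen3:4b via Ollama, temperature 0.0) implies judge requests along the following lines. This is a minimal sketch, assuming the standard Ollama `/api/generate` endpoint; the prompt text is illustrative, not the rescore tool's actual template.

```shell
# Sketch of one judge request body; model and temperature mirror the table above.
# The prompt text is an illustrative placeholder, not the real judging template.
payload='{"model":"qwen3:4b","stream":false,"options":{"temperature":0.0},"prompt":"Question: ...\nGold: ...\nCandidate: ...\nReply CORRECT or WRONG."}'

# A real call would POST this to a local Ollama server, e.g.:
#   curl -s http://localhost:11434/api/generate -d "$payload"

# Sanity-check that the payload parses as JSON before sending:
echo "$payload" | python3 -c 'import json,sys; d=json.load(sys.stdin); print(d["model"], d["options"]["temperature"])'
```

Setting `temperature` to 0.0 and `stream` to false keeps each judge verdict deterministic and easy to log, which matches the methodology disclosure in the scorecard.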
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify all referenced commit SHAs exist in repository history
commits=("fd27d2c" "c360faa" "116edab" "bc0a773" "571d8af" "ca0ccb7")
echo "Checking commit SHAs..."
for commit in "${commits[@]}"; do
  if git cat-file -e "$commit^{commit}" 2>/dev/null; then
    echo "✓ Commit $commit exists"
    git log -1 --oneline "$commit"
  else
    echo "✗ Commit $commit NOT FOUND"
  fi
done

Repository: jaylfc/taosmd
Length of output: 423
Remove or correct the four non-existent commit references.
The documentation table on line 152 references six commit SHAs, but four of them do not exist in the repository:
- ✗ `fd27d2c`: NOT FOUND
- ✗ `bc0a773`: NOT FOUND
- ✗ `571d8af`: NOT FOUND
- ✗ `ca0ccb7`: NOT FOUND
- ✓ `c360faa`: EXISTS
- ✓ `116edab`: EXISTS
Either correct these SHAs with valid commit hashes or remove the references if they are no longer relevant.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specs/2026-04-19-locomo-scorecards.md` at line 152, Update the table row
that lists commit SHAs to remove or correct the four non-existent references
(fd27d2c, bc0a773, 571d8af, ca0ccb7) so only valid commits remain; either
replace each invalid SHA with the correct commit hash or delete the
corresponding mention (e.g., remove the "Runner + adapter + prompt-opt:
`fd27d2c`", "Silent-failure fixes: `bc0a773`", "MemPalace adapter (chromadb
path): `571d8af`" and the superseded `ca0ccb7`), leaving the valid entries
(`c360faa`, `116edab`) intact and ensuring the surrounding text/PR references
still read sensibly.
… flight
- Add Parametric retrieval matrix (C1-C6) scorecard: C3 adjacent_turns is the biggest single-lever win at 0.465; C6 multihop_decompose regresses to 0.317; C5 bge_reranker deferred pending refactor.
- Add lessons #8 (multihop decomposition regresses at small-LLM scale) and #9 (context stitching beats retrieval width).
- Reorganise 'In flight / queued' section into Complete / In flight / Queued sub-headings. Log the c_stack run currently mid-bench and the three queued follow-ups (qwen9b dense, Qwen3.6 HLWQ via vLLM, Qwen3.6 MoE via Ollama).
…act yesterday's claim)
Five new results logged (2026-04-21 evening + 2026-04-22):
- c_stack final 0.482 — stacking IS additive (+0.017 over adj=1). Yesterday's 'stacking didn't stack' read was from a 62% partial rescore.
- adj_sweep_adj2 0.499 — new leader, +0.089 vs baseline-opt.
- adj_sweep_adj3 0.487 — regresses from adj=2, sweet spot is 2.
- adj1_k20 0.479 — k=20 adds +0.014 on adj=1.
- adj1_llm partial 0.464 — llm-exp flat on adj=1.

Clean stack decomposition:
- adj=1 alone = 0.465
- adj=1 + k=20 = 0.479 (+0.014 from k=20)
- adj=1 + llm-exp = 0.464 (+0.00 from llm-exp)
- adj=1 + k=20 + llm = 0.482 (+0.003 from llm-exp on top of k=20)

Next queued: adj2_k20 (predicted ~0.513), then qwen3.5:9b block, then Qwen3.6 MoE (HLWQ via vLLM + GGUF via Ollama).
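The additive deltas quoted in the commit message above can be recomputed with a one-line awk helper. A sketch: `delta` is a hypothetical name, and the scores are copied from the commit message.

```shell
# delta A B: print B - A with an explicit sign, three decimals.
delta() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.3f\n", b - a }'; }

delta 0.465 0.479   # contribution of k=20 on top of adj=1
delta 0.465 0.464   # contribution of llm-exp on top of adj=1
delta 0.479 0.482   # contribution of llm-exp on top of adj=1 + k=20
delta 0.465 0.482   # full stack (adj=1 + k=20 + llm-exp) over adj=1 alone
```

Recomputing deltas this way makes it easy to catch rounding or transcription errors when the scorecard doc is updated after each run.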
Living results doc under `docs/specs/2026-04-19-locomo-scorecards.md`.
Purpose: single source of truth for every LoCoMo number we produce. Keeps scorecards off chat transcripts and tied to commit SHAs, judge model, dataset version, and methodology disclosures.
Captured so far:
Convention: update this doc alongside every new run or rescore so we never lose numbers.
Summary by CodeRabbit