
docs(specs): LoCoMo scorecard log — taosmd variants + mem0, ongoing #34

Open
jaylfc wants to merge 8 commits into master from docs/locomo-scorecards

Conversation


@jaylfc jaylfc commented Apr 19, 2026

Living results doc under docs/specs/2026-04-19-locomo-scorecards.md.

Purpose: single source of truth for every LoCoMo number we produce. Keeps scorecards off chat transcripts and tied to commit SHAs, judge model, dataset version, and methodology disclosures.

Captured so far:

  • Self-judge scorecards for taosmd-e2b / taosmd-e4b / taosmd-e2b+prompt-opt / mem0-e2b
  • External qwen3:4b rescore (100% coverage, 0 errors) for the three taosmd variants
  • Per-category tables — Temporal shows the clearest architecture split (taosmd 0.29 / mem0 0.02 = 14.5×)
  • Known artefacts + methodology disclosures + pending work (mem0 rescore in flight, MemPalace adapter queued)

Convention: update this doc alongside every new run or rescore so we never lose numbers.

Summary by CodeRabbit

  • Documentation
    • Added a live, append-updated LoCoMo benchmark scorecard covering LoCoMo-10 categories, execution environment, generator and judging configurations.
    • Includes self-judge and external re-score tables with per-category/overall metrics, F1, and links to result artifacts; notes some earlier partial results are superseded.
    • Documents known metric artifacts, interpretation guidance, run status, and when to update the scorecard.

Single source of truth for every LoCoMo number we've produced so they
don't live only in chat transcripts. Captures:

- Self-judge scorecards for taosmd-e2b, taosmd-e4b, taosmd-e2b+prompt-opt,
  mem0-e2b (all runs 2026-04-17 to 2026-04-19)
- External qwen3:4b rescore numbers for the three taosmd variants
  (100% coverage, 0 errors). mem0 rescore queued.
- Per-category tables, not just headlines — Temporal 0.29 vs 0.02
  (14.5x) is the most dramatic architecture signal
- Known artefacts: mem0 R@K=0.0 is an adapter limitation (no dia_id
  pass-through), patched in PR #33
- Methodology disclosures: same generator (gemma4:e2b), same prompt,
  same dataset, same top-K=10, same judge (qwen3:4b), commit SHAs for
  every input
- Follow-up: mem0 external rescore in flight, MemPalace adapter queued
  — will add scorecards to this doc as they complete
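The R@K artefact called out above (mem0's adapter hardcoding `evidence_hits=0` when `dia_id` can't round-trip) can be sketched as follows. This is a minimal illustration of the pattern, not the actual adapter code from PR #33; the function names, fields, and aggregation rule are assumptions.

```python
def evidence_hits(retrieved_ids, gold_ids):
    """Count retrieved turns matching gold evidence; None if IDs unavailable."""
    if not retrieved_ids:          # adapter couldn't pass dia_id through
        return None                # metric unavailable, not zero
    return len(set(retrieved_ids) & set(gold_ids))

def recall_at_k(per_item_hits, per_item_gold_counts):
    """Aggregate R@K over items, skipping items where hits is None."""
    scored = [
        (hits, gold)
        for hits, gold in zip(per_item_hits, per_item_gold_counts)
        if hits is not None and gold > 0
    ]
    if not scored:
        return None                # drop R@K from the summary entirely
    return sum(h / g for h, g in scored) / len(scored)
```

With this shape, a system that never surfaces `dia_id` yields `recall_at_k(...) is None`, so the summary omits R@K instead of publishing a misleading 0.0.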

coderabbitai Bot commented Apr 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new LoCoMo benchmark scorecard documenting LoCoMo‑10 runs across taosmd variants, mem0, and MemPalace: dataset/category scheme, host/runtime/generator/judge configs, self-judge and external rescore tables with JSON links, methodology/commit disclosures, metric notes, and run status.

Changes

Cohort / File(s): docs/specs/2026-04-19-locomo-scorecards.md (LoCoMo Scorecards Documentation)
Summary: New comprehensive scorecard for LoCoMo‑10: dataset and category definitions (Single‑hop/Temporal/Multi‑hop/Open‑dom; adversarial excluded), benchmarking host/runtime and generator setup (gemma4:e2b via Ollama, shared ANSWER_PROMPT), judging flow (self‑judge vs external rescore with qwen3:4b), per‑system per‑category and overall judge/F1 tables with links to raw JSON and rescored outputs, rescoring checkpointing/timeouts, observed metric/data artefacts, commit SHAs and config log, and run status (completed/queued) guidance.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I nibble logs and tally every line,
Scores hop in rows, each metric by design.
Generators hum, judges ponder deep,
I hide a carrot where the numbers sleep—
A small glad hop for data, soft and fine 🥕

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: addition of a LoCoMo scorecard documentation file with results for taosmd variants and mem0. It is concise, specific, and directly reflects the file added and its purpose.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.




kilo-code-bot Bot commented Apr 19, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (1 file)
  • docs/specs/2026-04-19-locomo-scorecards.md

Reviewed by seed-2-0-pro-260328 · 215,955 tokens


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/specs/2026-04-19-locomo-scorecards.md (1)

49-61: Inconsistent data mixing in external judge table.

Line 60 shows 0.05 (self-graded) for mem0's Overall F1 within the "External qwen3:4b judge" table. This mixes methodologies—the self-graded value is already presented in the self-judge table (line 41).

For consistency, consider using `—` or `pending` for mem0's F1 in this table, or move the parenthetical note outside the table to avoid confusion.

📝 Suggested table consistency fix
 | Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b |
 |---|---|---|---|---|
 | Single-hop (282) | 0.08 | 0.18 | 0.16 | — |
 | Temporal (321)   | 0.13 | 0.14 | 0.21 | — |
 | Multi-hop (96)   | 0.05 | 0.00 | 0.00 | — |
 | Open-dom (841)   | 0.43 | 0.30 | 0.46 | — |
 | **Overall Judge** | **0.27** | **0.22** | **0.34** | **pending** |
-| Overall F1 | 0.162 | 0.152 | 0.175 | 0.05 (self-graded) |
+| Overall F1 | 0.162 | 0.152 | 0.175 | — |

Then optionally add a note below the table:

> mem0 F1 from self-judge: 0.05 (see self-judge table above).
Inline comments:

  • Line 125: The three commit SHAs referenced (`40403cc`, `86c4c19`, `3c5c6c2`) in the methodology table row "Commits | taosmd runs: `40403cc` (runner) + `86c4c19` (rescore); prompt-opt: `3c5c6c2`" are not in the repo history. Replace each invalid SHA with the existing commits that correspond to the runner, rescore, and prompt-opt changes, or remove the SHA references and state clearly that they are not available, so the reproducibility claim stays accurate.
  • Line 47: The documentation references a non-existent result filename `locomo_20260419_185944_full_mem0_e2b_noinfer_full_mem0_e2b.json` (and three other similarly missing result files). Either remove these `benchmarks/results/...` entries from the spec or replace them with the actual existing result filenames (e.g., the `matrix_*.json` files in benchmarks/results) so every listed result file is present in the repository.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ead23815-c57e-4d7d-987a-5875f983a9ae

📥 Commits

Reviewing files that changed from the base of the PR and between 7b7d044 and 714a592.

📒 Files selected for processing (1)
  • docs/specs/2026-04-19-locomo-scorecards.md

Comment thread: docs/specs/2026-04-19-locomo-scorecards.md
Comment thread: docs/specs/2026-04-19-locomo-scorecards.md (Outdated)
CodeRabbit CRITICAL on #34 caught that 40403cc / 86c4c19 / 3c5c6c2 are
no longer reachable — rewritten out of history by PR #30's rebase to a
single commit. Replaced with the reachable SHAs and noted that the old
ones were intentionally rewritten so anyone reading git log won't be
confused.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/specs/2026-04-19-locomo-scorecards.md (1)

71-72: Avoid hard ETA timestamps in a living scorecard log

These ETAs will stale quickly and can mislead later readers. Prefer a durable status format like “as of <timestamp>: in progress/completed” plus a follow-up commit reference when finalized.

Also applies to: 131-133

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f4c1ddb3-3e66-4571-8190-271f7e29a468

📥 Commits

Reviewing files that changed from the base of the PR and between 714a592 and 681547c.

📒 Files selected for processing (1)
  • docs/specs/2026-04-19-locomo-scorecards.md

- **mem0 R@K always reports 0.0** because mem0 doesn't round-trip per-turn
`dia_id` metadata, so the adapter hardcodes `evidence_hits=0`. PR #33
patches this to `None` (metric unavailable) so the summary drops it from
recall aggregation rather than publishing a fake 0.0. Existing mem0

⚠️ Potential issue | 🟡 Minor

Fix minor wording/grammar to keep the spec polished

Two small text issues:

  • Line 103: extra spacing in “Existing mem0 result JSON”.
  • Lines 147-149: “commit an update the same PR/branch” is missing “in” → “commit an update in the same PR/branch.”

Also applies to: 147-149



Two corrections in one:

1. The external qwen3:4b scorecards table had wrong numbers (0.27 / 0.22 /
   0.34 for taosmd variants). Those were the earlier qwen3.5:9b biased-
   sample numbers that got superseded but I left in the table by
   mistake. Now corrected to the actual qwen3:4b 100%-coverage numbers
   (0.40 / 0.38 / 0.41) directly from the streaming rescore log.
   Per-category rows also restated from source.

2. mem0 rescore completed in 116.9 min, 100% coverage, 0 errors:
   - Single-hop 0.04 / Temporal 0.02 / Multi-hop 0.10 / Open-dom 0.07
   - Overall Judge 0.06
   Added to the same table. Biggest architecture gap is Temporal
   (taosmd-e2b+prompt-opt 0.41 vs mem0 0.02 = 20.5x). Overall gap ~7x
   under identical external judge, same generator.

Also refreshed the "In flight / queued" section: mem0 rescore done,
MemPalace adapter already built as `ca0ccb7` (landed in PR #30, ready
to run — just needs `pip install mempalace` on the Fedora host).

The earlier stale numbers are kept in the caveat block so anyone
comparing against chat history or the push notifications knows why
they shifted.
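The multi-hour rescores reported above (116.9 min for mem0, 100% coverage, 0 errors) imply a resumable loop of some kind. A minimal sketch of a checkpointed external-rescore loop follows; the file layout, judge call, and checkpoint format are all assumptions for illustration, not the actual rescore harness.

```python
import json
import os

def rescore(items, judge_fn, checkpoint_path):
    """Re-judge every item, resuming from a JSON checkpoint on restart."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)                 # item_id -> verdict
    for item in items:
        if item["id"] in done:
            continue                            # already rescored on a prior run
        done[item["id"]] = judge_fn(item)       # external judge call (e.g. qwen3:4b)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)                  # checkpoint after each item
    return done
```

Checkpointing after each item is what lets a run interrupted mid-way (judge timeout, host reboot) resume with coverage intact rather than restarting from zero.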

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/specs/2026-04-19-locomo-scorecards.md (1)

157-158: ⚠️ Potential issue | 🟡 Minor

Fix grammatical error in update guidance.

The phrase "commit an update the same PR/branch" is missing a preposition. Change to either "commit an update to the same PR/branch" or "commit an update in the same PR/branch".

📝 Proposed fix
-After every new run completes (or its rescore completes) — commit an update
-the same PR/branch that ships the numbers. Don't let scorecards live only
+After every new run completes (or its rescore completes) — commit an update in
+the same PR/branch that ships the numbers. Don't let scorecards live only
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0905c2da-3058-4a1c-8247-7d6fdf0b0a90

📥 Commits

Reviewing files that changed from the base of the PR and between 681547c and 7c1ce61.

📒 Files selected for processing (1)
  • docs/specs/2026-04-19-locomo-scorecards.md

Comment on lines +111 to +112
recall aggregation rather than publishing a fake 0.0. Existing mem0
result JSON should be rescored via the fixed `_summary` logic once #33


⚠️ Potential issue | 🟡 Minor

Remove extra spacing between "mem0" and "result".

Static analysis detects extra spaces in "Existing mem0 result JSON". Please reduce to single space: "Existing mem0 result JSON".



jaylfc added 2 commits April 19, 2026 22:10
Captures every model actually used during the benchmark (generator
variants, external judge, embedders, cross-encoder, fact extractor) with
params, quant, VRAM footprint, and backend. Adds the runtime/host row so
anyone reproducing knows the Ollama parallel limit and rescore timeout.

Derives hardware-tier recommendations from what we measured:
- Orange Pi (RK3588 NPU, 16 GB): qwen3:4b gen on rkllama, external judge,
  MiniLM ONNX embed, taosmd arch
- Fedora 3060 (12 GB VRAM): gemma4:e2b gen, qwen3:4b judge co-resident,
  prompt-opt on by default
- Laptop / Mac Mini: qwen3:4b gen via Ollama, external judge
- High-end (≥24 GB): qwen3.5:9b gen viable; e2b still competitive

Documents the seven lessons that drive the defaults: bigger-gen-≠-better
at small scale, qwen for structured output, NUM_PARALLEL is the real
ceiling, nomic context forces batching, architecture dominates
generator choice, self-judge inflates, R@K needs dia_id round-trip.

Also corrects the Commits row: superseded SHAs (ca0ccb7571d8af for
mempalace) and references the right open PRs (#34, #35, #36).
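The hardware-tier recommendations in this commit can be expressed as a small config table. The key names, the choice of a plain dict, and the fallback rule are assumptions for illustration; the actual benchmark runner's config format may differ.

```python
# Illustrative per-tier defaults, derived from the measured runs above.
TIER_DEFAULTS = {
    "orange-pi-rk3588-16gb": {
        "generator": "qwen3:4b",        # via rkllama on the NPU
        "judge": "qwen3:4b",            # external judge
        "embedder": "minilm-onnx",
        "architecture": "taosmd",
    },
    "fedora-3060-12gb": {
        "generator": "gemma4:e2b",
        "judge": "qwen3:4b",            # co-resident on the same GPU
        "prompt_opt": True,             # on by default at this tier
    },
    "laptop-or-mac-mini": {
        "generator": "qwen3:4b",        # via Ollama
        "judge": "qwen3:4b",            # external judge
    },
    "high-end-24gb-plus": {
        "generator": "qwen3.5:9b",      # viable here; e2b still competitive
    },
}

def defaults_for(tier: str) -> dict:
    """Look up tier defaults, falling back to the laptop profile."""
    return TIER_DEFAULTS.get(tier, TIER_DEFAULTS["laptop-or-mac-mini"])
```

Encoding the tiers as data rather than prose makes the "lessons" above (e.g. prompt-opt on by default on the 3060) checkable by the runner instead of relying on operator memory.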
…ry split

MemPalace-e2b full run completed. Self-judge Overall 0.42 — much closer
to taosmd (0.48) than to mem0 (0.09). Per-category:
- MemPalace beats baseline taosmd on Temporal (0.33 vs 0.29) + Multi-hop
  (0.24 vs 0.22)
- taosmd pulls ahead on Open-dom (0.64 vs 0.51) + Single-hop (0.34 vs 0.29)
- prompt-opt variant still the Overall leader at 0.51
- mem0 a distant fourth on every category

Story shifts from "taosmd wins by 7x over competitors" to "taosmd and
MemPalace are in the same tier, mem0 is much further behind — and raw
verbatim-store + a sensible default embedder is a strong baseline on
its own."

Also added ingest-timing comparison: MemPalace fastest at ~100s for
all 10 convs (simpler architecture = less processing per turn).

External rescore for MemPalace is running now on Fedora, ETA ~01:55 BST.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
docs/specs/2026-04-19-locomo-scorecards.md (2)

225-226: ⚠️ Potential issue | 🟡 Minor

Fix grammar: missing preposition.

Line 225-226 reads "commit an update the same PR/branch" but should include "in" or "to": "commit an update in the same PR/branch".

📝 Proposed fix
-After every new run completes (or its rescore completes) — commit an update
-the same PR/branch that ships the numbers. Don't let scorecards live only
+After every new run completes (or its rescore completes) — commit an update in
+the same PR/branch that ships the numbers. Don't let scorecards live only

122-123: ⚠️ Potential issue | 🟡 Minor

Remove extra spacing between "mem0" and "result".

Static analysis detects multiple spaces in "Existing mem0 result JSON". Reduce to single space: "Existing mem0 result JSON".

📝 Proposed fix
- result JSON should be rescored via the fixed `_summary` logic once `#33`
+  result JSON should be rescored via the fixed `_summary` logic once `#33`

Or if the extra spaces are on the previous line (line 122), adjust there to ensure single space between words.

Inline comments:

  • Lines 52-56: The doc references five missing result files (`benchmarks/results/locomo_20260417_140810_full_gemma_e2b.json`, `locomo_20260418_035232_full_gemma_e4b.json`, `locomo_20260418_130212_full_gemma_e2b_opt.json`, `locomo_20260419_185944_full_mem0_e2b_noinfer_full_mem0_e2b.json`, `locomo_20260419_225441_full_mempalace_e2b_full_mempalace_e2b.json`). Either commit those JSON files to benchmarks/results/, or update the markdown to reference the result files that actually exist and adjust any reproducibility instructions or links accordingly.
  • Lines 84-88: The "Rescore output files (include per-item `judge_rejudged`)" list references four non-existent rescore outputs (`locomo_20260417_140810_full_gemma_e2b.rescored_v2.json`, `locomo_20260418_035232_full_gemma_e4b.rescored_v2.json`, `locomo_20260418_130212_full_gemma_e2b_opt.rescored_v2.json`, `locomo_20260419_185944_full_mem0_e2b_noinfer.rescored_v2.json`). Add those files to the repo, or replace the entries with existing filenames or a note explaining they are unavailable.
  • Line 144: The methodology table contains incorrect or unverifiable commit SHAs. Verify the merged commit for PR #30 (currently listed as fd27d2c) and update the table (or mark PR #30 as not merged if appropriate); for the pending-PR SHAs (bc0a773, 571d8af, ca0ccb7), either remove the references until they exist or mark them explicitly as pending/not yet merged (if 571d8af supersedes ca0ccb7, keep only the active note), so every listed commit can be validated.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8bb1fc9a-418e-4699-9c2c-42eeda82d074

📥 Commits

Reviewing files that changed from the base of the PR and between 7c1ce61 and b3790d9.

📒 Files selected for processing (1)
  • docs/specs/2026-04-19-locomo-scorecards.md

Comment thread docs/specs/2026-04-19-locomo-scorecards.md
MemPalace-e2b external qwen3:4b rescore: Overall Judge 0.34 (180.5 min,
100% coverage, 0 errors). All three architectures now have the same
treatment: same generator, same prompt, same external judge, same 1540
QAs. Only the memory layer varies.

Final headline numbers (external Judge, gemma4:e2b generator):
- taosmd-e2b+prompt-opt  0.41
- taosmd-e2b             0.40
- taosmd-e4b             0.38
- MemPalace-e2b          0.34
- mem0-e2b (infer=False) 0.06

Per-category reveals a more nuanced story than the Overall numbers:
- Single-hop: three-way tie at ~0.16-0.17 — solved at this tier by any
  competent semantic-retrieval system
- Temporal: taosmd (0.36) and MemPalace (0.35) nearly tied; only
  prompt-opt breaks away at 0.41
- Multi-hop: taosmd-opt leads at 0.24; KG + query expansion help on
  synthesis questions
- Open-dom: taosmd's clearest architectural win (0.51 vs MemPalace 0.41,
  +24% relative)
- mem0 distant fourth everywhere

Reframes the positioning: taosmd's architecture edge concentrates on
harder question types that benefit from rerank + synthesis (Open-dom,
Multi-hop); on simpler retrieval (Single-hop, Temporal) MemPalace's
verbatim-store + default embedder is nearly as good. Cleaner story
than "we dominate" and more useful for positioning against the
target audiences documented in project_taosmd_positioning.md.

Next: README rewrite aligned with that positioning memory and these
nuanced numbers — lead with target audiences (SBC, taOS clusters,
offline/compliance, long-horizon agents), frame benchmark numbers as
"at the compute tier we target," highlight architectural edge on the
categories where it actually shows.
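The relationship between the per-category rows and the Overall Judge column can be sketched as a question-count-weighted mean over the 1540 QAs. The category counts below come from the scorecard tables; the aggregation rule itself is an assumption about how the harness computes "Overall", not confirmed from its source.

```python
# Category question counts from the scorecard (sums to 1540 QAs).
CATEGORY_COUNTS = {
    "single_hop": 282,
    "temporal": 321,
    "multi_hop": 96,
    "open_dom": 841,
}

def overall_judge(per_category: dict) -> float:
    """Count-weighted mean of per-category judge scores."""
    total = sum(CATEGORY_COUNTS.values())
    return sum(per_category[cat] * n for cat, n in CATEGORY_COUNTS.items()) / total
```

Under this rule, Open-dom (841 of 1540 questions) dominates the headline number, which is consistent with the note above that Open-dom is where taosmd's overall lead mostly comes from.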

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (2)
docs/specs/2026-04-19-locomo-scorecards.md (2)

234-235: ⚠️ Potential issue | 🟡 Minor

Fix grammatical error: missing preposition.

Line 234-235 reads "commit an update the same PR/branch" but should be "commit an update in the same PR/branch" or "commit an update to the same PR/branch."

✏️ Proposed fix
-After every new run completes (or its rescore completes) — commit an update
-the same PR/branch that ships the numbers. Don't let scorecards live only
+After every new run completes (or its rescore completes) — commit an update in
+the same PR/branch that ships the numbers. Don't let scorecards live only
130-131: ⚠️ Potential issue | 🟡 Minor

Remove extra spacing between "mem0" and "result".

Static analysis detects multiple spaces in "Existing mem0 result JSON" at line 130. This was flagged in previous reviews; please reduce to single space.

✏️ Proposed fix
-  recall aggregation rather than publishing a fake 0.0. Existing mem0
-  result JSON should be rescored via the fixed `_summary` logic once `#33`
+  recall aggregation rather than publishing a fake 0.0. Existing mem0
+  result JSON should be rescored via the fixed `_summary` logic once `#33`

Note: Ensure there is only a single space between "mem0" and "result" when viewing the raw file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 130 - 131, The
phrase "Existing mem0   result JSON" contains multiple consecutive spaces;
update the doc text (the string containing "Existing mem0   result JSON") to use
a single space so it reads "Existing mem0 result JSON", ensuring the raw file
has exactly one space between "mem0" and "result" (verify by opening the source
and removing the extra space).
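Whitespace nits like this one can be caught mechanically. A minimal sketch (path taken from the review comment above; note that legitimate markdown table-alignment padding would also match, so inspect the grep hits before applying the collapsed copy):

```shell
#!/bin/bash
# Sketch: list runs of two-or-more spaces between non-space characters in the
# scorecard doc, then write a copy with each run collapsed to a single space.
# Table-alignment padding also matches this pattern, so review hits first.
doc=docs/specs/2026-04-19-locomo-scorecards.md
grep -nE '[^ ]  +[^ ]' "$doc"                               # report offenders
sed -E 's/([^ ])  +([^ ])/\1 \2/g' "$doc" > "${doc}.fixed"  # collapsed copy
```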
🧹 Nitpick comments (2)
docs/specs/2026-04-19-locomo-scorecards.md (2)

23-24: Clarify "three taosmd runs" to avoid confusion with total run count.

Line 24 states "100% coverage / 0 errors on the three taosmd runs," but the external judge table (lines 64-71) includes five systems total (three taosmd variants + mem0 + MemPalace). While technically accurate, this phrasing may confuse readers. Consider: "100% coverage / 0 errors on all five systems (three taosmd variants, mem0, and MemPalace)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-19-locomo-scorecards.md` around lines 23 - 24, Update the
ambiguous sentence that reads "100% coverage / 0 errors on the three taosmd
runs" to explicitly state the full set of systems evaluated so readers aren’t
confused by the external judge table; locate the exact phrase and replace it
with something like "100% coverage / 0 errors on all five systems (three taosmd
variants, mem0, and MemPalace)" so it references the three taosmd variants plus
mem0 and MemPalace.

64-64: Make column headers consistent between self-judge and external judge tables.

The self-judge table (line 34) labels the mem0 column as "mem0-e2b (infer=False)" while the external judge table (line 64) uses just "mem0-e2b". For consistency and clarity, use the same label in both tables.

📝 Suggested fix
-| Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b | MemPalace-e2b |
+| Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b (infer=False) | MemPalace-e2b |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-19-locomo-scorecards.md` at line 64, The external judge
table header is inconsistent with the self-judge table: update the header label
in the external judge table (the column currently named "mem0-e2b") to match the
self-judge table's label "mem0-e2b (infer=False)" so both tables use the exact
same column name; locate the external judge table header row containing "|
Category (count) | taosmd-e2b | taosmd-e4b | taosmd-e2b+prompt-opt | mem0-e2b |
MemPalace-e2b |" and replace the "mem0-e2b" token with "mem0-e2b (infer=False)".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/specs/2026-04-19-locomo-scorecards.md`:
- Around line 114-117: The overall judge improvement claim is incorrect: update
the sentence referencing "taosmd-e2b-opt" (and the comparison between
"taosmd-e2b" and "taosmd-e2b+prompt-opt") to state a +0.01 Overall Judge gain
(0.40 → 0.41) instead of +0.07, and while editing confirm the per-category
deltas for Temporal (+0.05) and Multi-hop (+0.03) match the external judge table
values before committing the change.
- Line 152: Update the table row that lists commit SHAs to remove or correct the
four non-existent references (fd27d2c, bc0a773, 571d8af, ca0ccb7) so only valid
commits remain; either replace each invalid SHA with the correct commit hash or
delete the corresponding mention (e.g., remove the "Runner + adapter +
prompt-opt: `fd27d2c`", "Silent-failure fixes: `bc0a773`", "MemPalace adapter
(chromadb path): `571d8af`" and the superseded `ca0ccb7`), leaving the valid
entries (`c360faa`, `116edab`) intact and ensuring the surrounding text/PR
references still read sensibly.
- Around line 91-96: The docs/specs file references five rescored result files
that don't exist; update the spec to either remove those five filenames or
commit the missing JSON files. Specifically, in
docs/specs/2026-04-19-locomo-scorecards.md remove or replace the list entries
for benchmarks/results/locomo_20260417_140810_full_gemma_e2b.rescored_v2.json,
benchmarks/results/locomo_20260418_035232_full_gemma_e4b.rescored_v2.json,
benchmarks/results/locomo_20260418_130212_full_gemma_e2b_opt.rescored_v2.json,
benchmarks/results/locomo_20260419_185944_full_mem0_e2b_noinfer.rescored_v2.json,
and
benchmarks/results/locomo_20260419_225441_full_mempalace_e2b.rescored_v2.json;
if you choose to add the files, commit them under benchmarks/results with the
exact names used in the spec so the documentation links resolve.
- Around line 52-56: The documentation references five non-existent result files
in locomo-scorecards.md; either add the missing JSONs to benchmarks/results with
the exact filenames listed (locomo_20260417_140810_full_gemma_e2b.json,
locomo_20260418_035232_full_gemma_e4b.json,
locomo_20260418_130212_full_gemma_e2b_opt.json,
locomo_20260419_185944_full_mem0_e2b_noinfer_full_mem0_e2b.json,
locomo_20260419_225441_full_mempalace_e2b_full_mempalace_e2b.json) ensuring they
are valid result JSONs, or remove the five lines referencing those filenames
from locomo-scorecards.md so the doc no longer points to missing files.
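The two missing-file findings above lend themselves to a scripted check. A sketch, assuming every artefact path referenced in the doc matches the `benchmarks/results/*.json` shape used in the review comments:

```shell
#!/bin/bash
# Sketch: pull every benchmarks/results/*.json path referenced in the
# scorecard doc and report the ones absent from the working tree.
doc=docs/specs/2026-04-19-locomo-scorecards.md
grep -oE 'benchmarks/results/[A-Za-z0-9_.]+\.json' "$doc" | sort -u |
while read -r f; do
  [ -f "$f" ] || echo "MISSING: $f"
done
```

Run from the repo root; an empty output means every referenced artefact resolves.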

---

Duplicate comments:
In `@docs/specs/2026-04-19-locomo-scorecards.md`:
- Around line 234-235: The sentence "commit an update the same PR/branch that
ships the numbers." is missing a preposition; update the phrasing in the
docs/specs text (around the sentence near "After every new run completes...") to
read either "commit an update in the same PR/branch that ships the numbers." or
"commit an update to the same PR/branch that ships the numbers." so the grammar
is correct and the intended meaning is preserved.
- Around line 130-131: The phrase "Existing mem0   result JSON" contains
multiple consecutive spaces; update the doc text (the string containing
"Existing mem0   result JSON") to use a single space so it reads "Existing mem0
result JSON", ensuring the raw file has exactly one space between "mem0" and
"result" (verify by opening the source and removing the extra space).

---

Nitpick comments:
In `@docs/specs/2026-04-19-locomo-scorecards.md`:
- Around line 23-24: Update the ambiguous sentence that reads "100% coverage / 0
errors on the three taosmd runs" to explicitly state the full set of systems
evaluated so readers aren’t confused by the external judge table; locate the
exact phrase and replace it with something like "100% coverage / 0 errors on all
five systems (three taosmd variants, mem0, and MemPalace)" so it references the
three taosmd variants plus mem0 and MemPalace.
- Line 64: The external judge table header is inconsistent with the self-judge
table: update the header label in the external judge table (the column currently
named "mem0-e2b") to match the self-judge table's label "mem0-e2b (infer=False)"
so both tables use the exact same column name; locate the external judge table
header row containing "| Category (count) | taosmd-e2b | taosmd-e4b |
taosmd-e2b+prompt-opt | mem0-e2b | MemPalace-e2b |" and replace the "mem0-e2b"
token with "mem0-e2b (infer=False)".
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0cccf7f9-7849-481e-832c-7c054367b523

📥 Commits

Reviewing files that changed from the base of the PR and between b3790d9 and ef15cd7.

📒 Files selected for processing (1)
  • docs/specs/2026-04-19-locomo-scorecards.md

Comment thread docs/specs/2026-04-19-locomo-scorecards.md
| Self-judge | gemma4:e2b (i.e. the same generator) |
| External judge | qwen3:4b via Ollama, temperature 0.0, `benchmarks/locomo_rescore_streaming.py` |
| Timeout (rescore) | 240s per call, concurrency 3 |
| Commits | Runner + adapter + prompt-opt: `fd27d2c` (PR #30, merged). Rescore tool: `c360faa` (PR #29, merged). README Judge framing: `116edab`/`PR #31` + `fix/readme-judge-accuracy-consistency` (PR #32, merged). Silent-failure fixes: `bc0a773` (PR #35, pending). MemPalace adapter (chromadb path): `571d8af` (PR #36, pending — supersedes earlier `ca0ccb7` palace/mine attempt). |


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify all referenced commit SHAs exist in repository history

commits=("fd27d2c" "c360faa" "116edab" "bc0a773" "571d8af" "ca0ccb7")

echo "Checking commit SHAs..."
for commit in "${commits[@]}"; do
  if git cat-file -e "$commit^{commit}" 2>/dev/null; then
    echo "✓ Commit $commit exists"
    git log -1 --oneline "$commit"
  else
    echo "✗ Commit $commit NOT FOUND"
  fi
done

Repository: jaylfc/taosmd

Length of output: 423


Remove or correct the four non-existent commit references.

The documentation table on line 152 references six commit SHAs, but four of them do not exist in the repository:

  • fd27d2c — NOT FOUND
  • bc0a773 — NOT FOUND
  • 571d8af — NOT FOUND
  • ca0ccb7 — NOT FOUND
  • c360faa — EXISTS
  • 116edab — EXISTS

Either correct these SHAs with valid commit hashes or remove the references if they are no longer relevant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-04-19-locomo-scorecards.md` at line 152, Update the table row
that lists commit SHAs to remove or correct the four non-existent references
(fd27d2c, bc0a773, 571d8af, ca0ccb7) so only valid commits remain; either
replace each invalid SHA with the correct commit hash or delete the
corresponding mention (e.g., remove the "Runner + adapter + prompt-opt:
`fd27d2c`", "Silent-failure fixes: `bc0a773`", "MemPalace adapter (chromadb
path): `571d8af`" and the superseded `ca0ccb7`), leaving the valid entries
(`c360faa`, `116edab`) intact and ensuring the surrounding text/PR references
still read sensibly.

jaylfc added 2 commits April 21, 2026 18:44
… flight

- Add Parametric retrieval matrix (C1-C6) scorecard: C3 adjacent_turns is the
  biggest single-lever win at 0.465; C6 multihop_decompose regresses to 0.317;
  C5 bge_reranker deferred pending refactor.
- Add lessons #8 (multihop decomposition regresses at small-LLM scale) and #9
  (context stitching beats retrieval width).
- Reorganise 'In flight / queued' section into Complete / In flight / Queued
  sub-headings. Log the c_stack run currently mid-bench and the three queued
  follow-ups (qwen9b dense, Qwen3.6 HLWQ via vLLM, Qwen3.6 MoE via Ollama).
…act yesterday's claim)

Five new results logged (2026-04-21 evening + 2026-04-22):
- c_stack final 0.482 — stacking IS additive (+0.017 over adj=1).
  Yesterday's 'stacking didn't stack' read was from a 62% partial rescore.
- adj_sweep_adj2 0.499 — new leader, +0.089 vs baseline-opt.
- adj_sweep_adj3 0.487 — regresses from adj=2, sweet spot is 2.
- adj1_k20 0.479 — k=20 adds +0.014 on adj=1.
- adj1_llm partial 0.464 — llm-exp flat on adj=1.

Clean stack decomposition:
  adj=1 alone        = 0.465
  adj=1 + k=20       = 0.479  (+0.014 from k=20)
  adj=1 + llm-exp    = 0.464  (+0.00 from llm-exp)
  adj=1 + k=20 + llm = 0.482  (+0.003 from llm-exp on top of k=20)

Next queued: adj2_k20 (predicted ~0.513), then qwen3.5:9b block,
then Qwen3.6 MoE (HLWQ via vLLM + GGUF via Ollama).
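The stack decomposition above can be sanity-checked arithmetically (scores copied from the log; note 0.464 against the 0.465 adj=1 base is −0.001, which the log rounds to flat):

```shell
#!/bin/bash
# Recompute the per-lever deltas in the stack decomposition from the
# logged final scores. Values are the ones quoted in the commit message.
awk 'BEGIN {
  base = 0.465; k20 = 0.479; llm = 0.464; full = 0.482
  printf "k=20 on adj=1:          %+.3f\n", k20 - base
  printf "llm-exp on adj=1:       %+.3f\n", llm - base
  printf "llm-exp on top of k=20: %+.3f\n", full - k20
}'
```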