From a893cf1f6b25a81b7eb8601cf4b71df375ff1207 Mon Sep 17 00:00:00 2001
From: jaylfc
Date: Sun, 19 Apr 2026 19:12:17 +0100
Subject: [PATCH] docs(readme): propagate Judge accuracy framing across all
 LongMemEval sections
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR #31 clarified in the headline (lines 7-9) that our 97.0% is end-to-end
Judge accuracy, not Recall@5. But three downstream references still
labelled it as Recall@5:

1. Benchmark Results table (line 158) — column header was "Recall@5" even
   though our 97% is Judge accuracy and the competitors' numbers are the
   different, looser Recall@5 metric. Split into "Score" + "Metric"
   columns so each row is honestly labelled; added a clarifying paragraph
   below the table pointing at both benchmark scripts.

2. Fusion Strategy Comparison (line 181) — column said "Recall@5" but all
   four strategies were measured with the same Judge harness. Renamed the
   header to "Judge accuracy" and softened the "MemPalace-equivalent"
   label to "same algorithm as MemPalace" so it describes the retrieval
   approach, not the metric.

3. Key Features list (line 263) — "97.0% Recall@5" → "97.0% end-to-end
   Judge accuracy".

Remaining Recall@5 references are all intentional: competitors' published
numbers, the narrative paragraph explaining the metric difference, and the
code-block comment for `longmemeval_recall.py`, which is the Recall@5
reproduction script.
---
 README.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 6ef36c2..05ad7dd 100644
--- a/README.md
+++ b/README.md
@@ -155,14 +155,14 @@ If you're using Claude Code, OpenClaw, Cursor, or any AI coding agent, paste thi
 
 ## Benchmark Results
 
-| System | Recall@5 | Method | Cloud |
-|--------|----------|--------|-------|
-| **taOSmd** | **97.0%** | Hybrid + query expansion | None |
-| MemPalace | 96.6% | Raw semantic (ChromaDB) | None |
-| agentmemory | 95.2% | BM25 + vector | None |
-| SuperMemory | 81.6% | Cloud embeddings | Yes |
+| System | Score | Metric | Method | Cloud |
+|--------|-------|--------|--------|-------|
+| **taOSmd** | **97.0%** | end-to-end Judge accuracy | Hybrid + query expansion | None |
+| MemPalace | 96.6% | Recall@5 | Raw semantic (ChromaDB) | None |
+| agentmemory | 95.2% | Recall@5 | BM25 + vector | None |
+| SuperMemory | 81.6% | Recall@5 | Cloud embeddings | Yes |
 
-All systems tested on the same benchmark (LongMemEval-S, 500 questions) with the same embedding model (all-MiniLM-L6-v2, 384-dim).
+All systems tested on the same benchmark (LongMemEval-S, 500 questions) with the same embedding model (all-MiniLM-L6-v2, 384-dim). **Our 97.0% is end-to-end Judge accuracy** (retrieve → generate → LLM-judge against the reference answer) — the stricter metric. MemPalace, agentmemory, and SuperMemory publish Recall@5 (retrieval-only, whether the correct session appears in the top-5 retrieved). Direct comparison isn't apples-to-apples until they re-run end-to-end; see `benchmarks/longmemeval_runner.py` for our Judge harness and `benchmarks/longmemeval_recall.py` for the Recall@5 variant used to reproduce MemPalace's methodology.
 
 ### Per-Category Breakdown
 
@@ -178,9 +178,9 @@
 
 ### Fusion Strategy Comparison
 
-| Strategy | Recall@5 | Delta |
-|----------|----------|-------|
-| Raw cosine (MemPalace-equivalent) | 95.0% | — |
+| Strategy | Judge accuracy | Delta |
+|----------|---------------|-------|
+| Raw cosine (same algorithm as MemPalace) | 95.0% | — |
 | Additive keyword boost | 96.6% | +1.6 |
 | **Hybrid + query expansion (default)** | **97.0%** | **+2.0** |
 | All-turns hybrid (harder test) | 93.2% | -1.8 |
@@ -260,7 +260,7 @@ events = await archive.search_fts("hello")
 
 ## Key Features
 
-- **97.0% Recall@5** on LongMemEval-S benchmark (SOTA)
+- **97.0% end-to-end Judge accuracy** on LongMemEval-S benchmark (SOTA)
 - **Zero cloud dependencies** — runs entirely on local hardware
 - **Framework-agnostic** — HTTP API works with any agent framework
 - **Hybrid search** — semantic similarity + keyword overlap boosting