From a893cf1f6b25a81b7eb8601cf4b71df375ff1207 Mon Sep 17 00:00:00 2001
From: jaylfc
Date: Sun, 19 Apr 2026 19:12:17 +0100
Subject: [PATCH] docs(readme): propagate Judge accuracy framing across all
 LongMemEval sections
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR #31 clarified in the headline (lines 7-9) that our 97.0% is end-to-end
Judge accuracy, not Recall@5. But three downstream references still
labelled it as Recall@5:

1. Benchmark Results table (line 158) — column header was "Recall@5" even
   though our 97% is Judge accuracy and the competitors' numbers are the
   different, looser Recall@5 metric. Split into "Score" + "Metric"
   columns so each row is honestly labelled; added a clarifying paragraph
   below the table pointing at both benchmark scripts.

2. Fusion Strategy Comparison (line 181) — column said "Recall@5" but all
   four strategies were measured with the same Judge harness. Renamed the
   header to "Judge accuracy" and softened the "MemPalace-equivalent"
   label to "same algorithm as MemPalace" so it describes the retrieval
   approach, not the metric.

3. Key Features list (line 263) — "97.0% Recall@5" → "97.0% end-to-end
   Judge accuracy".

Remaining Recall@5 references are all intentional: competitors' published
numbers, the narrative paragraph explaining the metric difference, and the
code-block comment for `longmemeval_recall.py`, which is the Recall@5
reproduction script.
---
 README.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 6ef36c2..05ad7dd 100644
--- a/README.md
+++ b/README.md
@@ -155,14 +155,14 @@ If you're using Claude Code, OpenClaw, Cursor, or any AI coding agent, paste thi
 
 ## Benchmark Results
 
-| System | Recall@5 | Method | Cloud |
-|--------|----------|--------|-------|
-| **taOSmd** | **97.0%** | Hybrid + query expansion | None |
-| MemPalace | 96.6% | Raw semantic (ChromaDB) | None |
-| agentmemory | 95.2% | BM25 + vector | None |
-| SuperMemory | 81.6% | Cloud embeddings | Yes |
+| System | Score | Metric | Method | Cloud |
+|--------|-------|--------|--------|-------|
+| **taOSmd** | **97.0%** | end-to-end Judge accuracy | Hybrid + query expansion | None |
+| MemPalace | 96.6% | Recall@5 | Raw semantic (ChromaDB) | None |
+| agentmemory | 95.2% | Recall@5 | BM25 + vector | None |
+| SuperMemory | 81.6% | Recall@5 | Cloud embeddings | Yes |
 
-All systems tested on the same benchmark (LongMemEval-S, 500 questions) with the same embedding model (all-MiniLM-L6-v2, 384-dim).
+All systems tested on the same benchmark (LongMemEval-S, 500 questions) with the same embedding model (all-MiniLM-L6-v2, 384-dim). **Our 97.0% is end-to-end Judge accuracy** (retrieve → generate → LLM-judge against the reference answer) — the stricter metric. MemPalace, agentmemory, and SuperMemory publish Recall@5 (retrieval-only, whether the correct session appears in the top-5 retrieved). Direct comparison isn't apples-to-apples until they re-run end-to-end; see `benchmarks/longmemeval_runner.py` for our Judge harness and `benchmarks/longmemeval_recall.py` for the Recall@5 variant used to reproduce MemPalace's methodology.
 
 ### Per-Category Breakdown
 
@@ -178,9 +178,9 @@
 
 ### Fusion Strategy Comparison
 
-| Strategy | Recall@5 | Delta |
-|----------|----------|-------|
-| Raw cosine (MemPalace-equivalent) | 95.0% | — |
+| Strategy | Judge accuracy | Delta |
+|----------|---------------|-------|
+| Raw cosine (same algorithm as MemPalace) | 95.0% | — |
 | Additive keyword boost | 96.6% | +1.6 |
 | **Hybrid + query expansion (default)** | **97.0%** | **+2.0** |
 | All-turns hybrid (harder test) | 93.2% | -1.8 |
@@ -260,7 +260,7 @@ events = await archive.search_fts("hello")
 
 ## Key Features
 
-- **97.0% Recall@5** on LongMemEval-S benchmark (SOTA)
+- **97.0% end-to-end Judge accuracy** on LongMemEval-S benchmark (SOTA)
 - **Zero cloud dependencies** — runs entirely on local hardware
 - **Framework-agnostic** — HTTP API works with any agent framework
 - **Hybrid search** — semantic similarity + keyword overlap boosting