feat(v3.0): Phase 53 Benchmark Suite (closes v3.0) by RichardHightower · Pull Request #30 · SpillwaveSolutions/agent-memory

RichardHightower · 2026-05-14T21:38:06Z

Summary

Phase 53 of v3.0 — the final phase of the Competitive Parity & Benchmarks milestone. Adds a benchmark harness for measuring retrieval quality, including a LOCOMO adapter that produces publishable scores.

What ships

A new memory-bench crate (binary name memory-bench) with:

| Subcommand | Purpose |
| --- | --- |
| `temporal` | Run the temporal-reasoning fixture suite |
| `multisession` | Run the multi-session continuity suite |
| `compression` | Run the summarization/compression fixture suite |
| `all` | Run every TOML fixture in the dataset |
| `locomo` | Run against the LOCOMO dataset (path provided at runtime; dataset itself gitignored) |

Module breakdown

fixture.rs — TOML fixture loader with schema validation
runner.rs — drives a memory-cli binary against the fixture's setup/queries
scorer.rs — produces per-query scores (precision/recall/MRR depending on suite)
report.rs — markdown + JSON output, baseline-comparison diffs
baseline.rs — store and load reference scores for regression detection
locomo.rs — LOCOMO dataset adapter (parses the public eval format)
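
To make the retrieval metrics named for scorer.rs concrete, here is a minimal sketch of recall@k and mean reciprocal rank. The function names and signatures are assumptions for illustration and are not taken from the crate.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of expected ids found in the top-k results.
/// Returns 0.0 when nothing is expected (an assumption; a real scorer
/// might skip such cases instead).
pub fn recall_at_k(results: &[&str], expected: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let top_k: HashSet<&str> = results.iter().take(k).copied().collect();
    let hits = expected.iter().filter(|id| top_k.contains(*id)).count();
    hits as f64 / expected.len() as f64
}

/// Reciprocal rank: 1 / (1-based rank of the first relevant result),
/// or 0.0 when no relevant result was returned.
pub fn reciprocal_rank(results: &[&str], expected: &[&str]) -> f64 {
    results
        .iter()
        .position(|id| expected.contains(id))
        .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
}

/// Mean reciprocal rank over a set of (results, expected) query runs.
pub fn mean_reciprocal_rank(runs: &[(Vec<&str>, Vec<&str>)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs.iter().map(|(r, e)| reciprocal_rank(r, e)).sum();
    total / runs.len() as f64
}
```

Both metrics take ranked result ids, so the same runner output can feed either one depending on the suite.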

Design decisions

Black-box harness: the bench binary invokes memory (Phase 52's CLI), which reaches the daemon over gRPC, rather than calling the daemon directly. This benchmarks the actual user-facing surface.
TOML fixtures: each suite is a directory of TOML files defining setup events, queries, and expected answers. New cases can be added without touching Rust.
LOCOMO dataset never committed: it is gitignored, and the adapter loads it from --dataset-path. This avoids licensing and size issues.
CI smoke test only: the workflow runs a minimal memory-bench all --dry-run to ensure the harness compiles and starts; it does NOT run the full LOCOMO scoring, which would take minutes and require external data.
Baseline comparison built in: --baseline path/to/baseline.json makes the runner emit a diff report, ready for regression CI later.
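
The black-box approach can be sketched with `std::process::Command`. This is only an illustration: the `search` subcommand and `--limit` flag below are hypothetical placeholders, not the documented interface of the memory CLI, and the helper names are invented for this example.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Build (but do not run) the command for one search query.
/// ASSUMPTION: `search` and `--limit` are placeholder arguments.
pub fn build_search_command(memory_bin: &Path, query: &str, limit: usize) -> Command {
    let mut cmd = Command::new(memory_bin);
    cmd.arg("search")
        .arg(query)
        .arg("--limit")
        .arg(limit.to_string());
    cmd
}

/// Run a prepared command and return its stdout split into lines.
/// A non-zero exit status is reported as an error so the harness can
/// tell a failed invocation apart from an empty result set.
pub fn run_and_collect(mut cmd: Command) -> io::Result<Vec<String>> {
    let output = cmd.output()?;
    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("memory exited with {}", output.status),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout)
        .lines()
        .map(str::to_owned)
        .collect())
}
```

Separating command construction from execution keeps the argument wiring unit-testable without a daemon running.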

Files added

crates/memory-bench/ — new binary crate, 9 source files
.planning/phases/53-benchmark-suite/ — 3 plans + 3 summaries + context/research/validation/verification

Tests

All workspace tests pass (task test): 58 test blocks, 0 failures (up from 55 pre-Phase-53 — picked up 3 new memory-bench test blocks)
Clippy clean (task clippy): zero warnings on --workspace --all-targets --all-features -- -D warnings
Docs clean (task doc): zero rustdoc warnings under RUSTDOCFLAGS=-D warnings
task fmt passes

Rebase notes (2026-05-14)

Rebased from gsd/phase-53-benchmark-suite (local stack, accumulated through stacked phases) onto post-Phase-53.5 origin/main. Used git rebase --onto origin/main 2c7e1fe 2733233 to drop all Phase 51 + 52 commits already on main. 13 commits in this PR are Phase 53's own work plus one final clean planning-doc update.

What this closes

v3.0 Competitive Parity & Benchmarks ships on merge — Phases 51, 51.5, 52, 53.5 already on main; Phase 53 is the final piece. After merge, the milestone is shipped (5 phases, 8 plans counting decimals).

- 53-03: fix wave 2->3, add 53-02 to depends_on (dependency_correctness blocker)
- 53-02: add compute_compression_ratio + estimate_raw_tokens to scorer.rs task (task_completeness warning)
- 53-03: add .github/workflows/ci.yml benchmark-smoke job with continue-on-error: true (requirement_coverage BENCH-07)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

- Add memory-bench crate to workspace with clap, serde, toml deps
- Create minimal binary entrypoint and lib.rs with fixture module
- Add benchmarks/baselines.toml with MemMachine and Mem0 scores
- Create download-locomo.sh script for LOCOMO dataset
- Add 7 stub JSONL session files for benchmark fixtures
- Gitignore locomo-data/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Define Fixture and TestCase structs with serde deserialization
- Implement Fixture::load with id/query validation
- Implement Fixture::load_dir to collect test cases from directory
- Add 4 unit tests: parse valid, empty id, empty query, load_dir
- Create 3 TOML fixture files: temporal, multisession, compression

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
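
The id/query validation described in this commit could look roughly like the sketch below. It uses only the standard library and leaves out the serde/TOML parsing step. The `expected` field, the `FixtureError` type, and the `validate` function are assumptions for illustration; only the empty-id and empty-query checks come from the commit message.

```rust
/// One benchmark case. ASSUMPTION: the `expected` field is illustrative;
/// the real TestCase struct may carry different fields.
#[derive(Debug, Clone, PartialEq)]
pub struct TestCase {
    pub id: String,
    pub query: String,
    pub expected: Vec<String>,
}

/// Validation failures (hypothetical error type for this sketch).
#[derive(Debug, PartialEq)]
pub enum FixtureError {
    /// The case at this position has a blank id.
    EmptyId { index: usize },
    /// The case with this id has a blank query.
    EmptyQuery { id: String },
}

/// Reject the first case whose id or query is empty or whitespace-only.
pub fn validate(cases: &[TestCase]) -> Result<(), FixtureError> {
    for (index, case) in cases.iter().enumerate() {
        if case.id.trim().is_empty() {
            return Err(FixtureError::EmptyId { index });
        }
        if case.query.trim().is_empty() {
            return Err(FixtureError::EmptyQuery {
                id: case.id.clone(),
            });
        }
    }
    Ok(())
}
```

Failing at load time means a malformed fixture is reported before any query reaches the memory binary.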

- Add 53-01-SUMMARY.md with execution results
- Update STATE.md position to plan 2 of 3
- Update ROADMAP.md progress for phase 53
- Mark BENCH-01 and BENCH-06 requirements complete

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Scorer: accuracy, recall@k, percentile, compression ratio, token estimation
- Runner: shells out to memory binary for search queries and session ingestion
- Report: JSON and markdown table output with optional competitor comparison
- Baseline: TOML loader for competitor scores
- 18 tests passing, clippy clean

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
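
As a rough illustration of the scorer helpers listed above, the sketch below shows one way to estimate tokens, compute a compression ratio, and take a percentile. The function names mirror those in the planning notes, but the signatures, the 4-characters-per-token heuristic, the raw/compressed ratio direction, and the nearest-rank percentile method are all assumptions.

```rust
/// Rough token estimate. ASSUMPTION: about 4 characters per token,
/// rounded up; the real scorer may use a different heuristic.
pub fn estimate_raw_tokens(text: &str) -> usize {
    text.chars().count().div_ceil(4)
}

/// Ratio of raw tokens to compressed tokens (higher means more compression).
/// Returns None when the compressed size is zero, avoiding a division by zero.
pub fn compute_compression_ratio(raw_tokens: usize, compressed_tokens: usize) -> Option<f64> {
    if compressed_tokens == 0 {
        None
    } else {
        Some(raw_tokens as f64 / compressed_tokens as f64)
    }
}

/// Nearest-rank percentile for p in 0..=100.
/// Returns None for an empty sample set or an out-of-range (or NaN) p.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank is 1-based; p = 0 maps to the smallest sample.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.saturating_sub(1)])
}
```

Returning `Option` from the ratio and percentile helpers pushes the empty-input decision to the report layer instead of hiding it behind a sentinel value.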

- CLI: temporal, multisession, compression, all, locomo subcommands
- Global --memory-bin flag for binary path override
- All subcommand: --compare flag for baseline side-by-side table
- Full pipeline: load fixtures, ingest sessions, run queries, score, report

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
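
A side-by-side table like the one `--compare` produces might be rendered as in the sketch below. The column layout, the `render_comparison` name, and the choice to express the delta in percentage points are assumptions; baseline names and scores are supplied by the caller, and none are hard-coded here.

```rust
/// Render a markdown table comparing our accuracy with baseline systems.
/// Scores are fractions in 0.0..=1.0; the delta column is in percentage
/// points (ours minus baseline). Hypothetical layout for illustration.
pub fn render_comparison(ours: f64, baselines: &[(&str, f64)]) -> String {
    let mut out = String::from("| System | Accuracy | Delta vs ours |\n| --- | --- | --- |\n");
    out.push_str(&format!("| ours | {:.1}% | - |\n", ours * 100.0));
    for (name, score) in baselines {
        let delta = (ours - score) * 100.0;
        out.push_str(&format!(
            "| {name} | {:.1}% | {delta:+.1} |\n",
            score * 100.0
        ));
    }
    out
}
```

A positive delta means our score is higher than the baseline's on that row.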

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Add locomo.rs with LocomoConversation, Turn, Question structs
- Implement load_dataset to read JSON files from directory
- Implement score_conversation with case-insensitive substring matching
- Implement aggregate_results with per-type breakdown
- Add 6 unit tests covering parsing, scoring, aggregation, and loading
- Register pub mod locomo in lib.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
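
The scoring rule named in this commit (case-insensitive substring matching with a per-type breakdown) can be sketched as follows. The `QuestionResult` and `Tally` types and the function names are hypothetical stand-ins, not the actual locomo.rs definitions.

```rust
use std::collections::BTreeMap;

/// One answered question. ASSUMPTION: field names are illustrative.
pub struct QuestionResult {
    pub question_type: String,
    pub expected: String,
    pub retrieved: String,
}

/// Correct/total counts for one question type.
#[derive(Debug, Default, PartialEq)]
pub struct Tally {
    pub correct: usize,
    pub total: usize,
}

/// A question counts as correct when the expected answer appears in the
/// retrieved text, ignoring case. An empty expected answer never matches.
pub fn is_correct(retrieved: &str, expected: &str) -> bool {
    let needle = expected.trim().to_lowercase();
    !needle.is_empty() && retrieved.to_lowercase().contains(&needle)
}

/// Group results by question type and count correct answers per group.
pub fn aggregate(results: &[QuestionResult]) -> BTreeMap<String, Tally> {
    let mut by_type: BTreeMap<String, Tally> = BTreeMap::new();
    for r in results {
        let tally = by_type.entry(r.question_type.clone()).or_default();
        tally.total += 1;
        if is_correct(&r.retrieved, &r.expected) {
            tally.correct += 1;
        }
    }
    by_type
}
```

Substring matching is lenient: it accepts any retrieved text that mentions the answer, so scores from this rule are not directly comparable with judged or exact-match evaluations.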

…l QA

- Wire Commands::Locomo in main.rs with dataset ingestion and scoring
- Move tempfile from dev-dependencies to dependencies for runtime use
- Add benchmark-smoke job to CI with continue-on-error: true
- Format all files to pass cargo fmt check
- Full pr-precheck passes (fmt, clippy, test, doc)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Create 53-03-SUMMARY.md with execution results
- Update STATE.md: phase 53 complete, 148 plans total
- Update ROADMAP.md: phase 53 progress to complete
- Mark BENCH-04 and BENCH-07 requirements complete

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…er Rust

RichardHightower and others added 14 commits May 14, 2026 16:27

docs(53): generate context from PRD

aeb2c68

docs(53): create phase plan for benchmark suite

30d82ed

docs(53-02): complete runner/scorer/report/CLI plan

606531a

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

docs(phase-53): complete phase execution

96d55d5

docs(53): final planning update for Phase 53 PR

5267602

RichardHightower temporarily deployed to e2e-cli May 14, 2026 21:38 — with GitHub Actions Inactive

RichardHightower had a problem deploying to e2e-cli May 14, 2026 21:38 — with GitHub Actions Failure

fix(53): use checked_div to satisfy clippy::manual_checked_ops on new…

058d94a

…er Rust

RichardHightower temporarily deployed to e2e-cli May 14, 2026 21:44 — with GitHub Actions Inactive

RichardHightower merged commit 68ab122 into main May 14, 2026
21 checks passed

RichardHightower deleted the feature/v3.0-phase-53-benchmark-suite branch May 14, 2026 21:49

RichardHightower merged 15 commits into main from feature/v3.0-phase-53-benchmark-suite
