Skip to content

feat(v3.0): Phase 53 Benchmark Suite (closes v3.0)#30

Merged
RichardHightower merged 15 commits into
mainfrom
feature/v3.0-phase-53-benchmark-suite
May 14, 2026
Merged

feat(v3.0): Phase 53 Benchmark Suite (closes v3.0)#30
RichardHightower merged 15 commits into
mainfrom
feature/v3.0-phase-53-benchmark-suite

Conversation

@RichardHightower
Copy link
Copy Markdown
Contributor

Summary

Phase 53 of v3.0 — the final phase of the Competitive Parity & Benchmarks milestone. Adds a benchmark harness for measuring retrieval quality, including a LOCOMO adapter that produces publishable scores.

What ships

A new memory-bench crate (binary name memory-bench) with:

Subcommand Purpose
temporal Run the temporal-reasoning fixture suite
multisession Run the multi-session continuity suite
compression Run the summarization/compression fixture suite
all Run every TOML fixture in the dataset
locomo Run against the LOCOMO dataset (path provided at runtime; dataset itself gitignored)

Module breakdown

  • fixture.rs — TOML fixture loader with schema validation
  • runner.rs — drives a memory-cli binary against the fixture's setup/queries
  • scorer.rs — produces per-query scores (precision/recall/MRR depending on suite)
  • report.rs — markdown + JSON output, baseline-comparison diffs
  • baseline.rs — store and load reference scores for regression detection
  • locomo.rs — LOCOMO dataset adapter (parses the public eval format)

Design decisions

  • Black-box harness — the bench binary invokes memory (Phase 52's CLI) over the daemon's gRPC, not the daemon directly. Lets us benchmark the actual user-facing surface.
  • TOML fixtures — each suite is a directory of TOML files defining setup events + queries + expected answers. Easy to add new cases without touching Rust.
  • LOCOMO dataset never committed — gitignored. Adapter loads from --dataset-path. Avoids licensing/size issues.
  • CI smoke test only — the workflow runs a minimal memory-bench all --dry-run to ensure the harness compiles and starts; it does NOT run the full LOCOMO scoring (would take minutes and require external data).
  • Baseline-comparison built-in--baseline path/to/baseline.json makes the runner emit a diff report, ready for regression CI later.

Files added

  • crates/memory-bench/ — new binary crate, 9 source files
  • .planning/phases/53-benchmark-suite/ — 3 plans + 3 summaries + context/research/validation/verification

Tests

  • All workspace tests pass (task test): 58 test blocks, 0 failures (up from 55 pre-Phase-53 — picked up 3 new memory-bench test blocks)
  • Clippy clean (task clippy): zero warnings on --workspace --all-targets --all-features -- -D warnings
  • Docs clean (task doc): zero rustdoc warnings under RUSTDOCFLAGS=-D warnings
  • task fmt passes

Rebase notes (2026-05-14)

Rebased from gsd/phase-53-benchmark-suite (local stack, accumulated through stacked phases) onto post-Phase-53.5 origin/main. Used git rebase --onto origin/main 2c7e1fe 2733233 to drop all Phase 51 + 52 commits already on main. 13 commits in this PR are Phase 53's own work plus one final clean planning-doc update.

What this closes

v3.0 Competitive Parity & Benchmarks ships on merge — Phases 51, 51.5, 52, 53.5 already on main; Phase 53 is the final piece. After merge, the milestone is shipped (5 phases, 8 plans counting decimals).

Next

v3.1 work is on gsd/phase-{54..56} locally (export/backup/import). v3.2 work is on gsd/phase-{57..58} (runtime registration metadata). Both stacks are backed up to origin; landing strategy still TBD.

RichardHightower and others added 14 commits May 14, 2026 16:27
- 53-03: fix wave 2->3, add 53-02 to depends_on (dependency_correctness blocker)
- 53-02: add compute_compression_ratio + estimate_raw_tokens to scorer.rs task (task_completeness warning)
- 53-03: add .github/workflows/ci.yml benchmark-smoke job with continue-on-error: true (requirement_coverage BENCH-07)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Add memory-bench crate to workspace with clap, serde, toml deps
- Create minimal binary entrypoint and lib.rs with fixture module
- Add benchmarks/baselines.toml with MemMachine and Mem0 scores
- Create download-locomo.sh script for LOCOMO dataset
- Add 7 stub JSONL session files for benchmark fixtures
- Gitignore locomo-data/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Define Fixture and TestCase structs with serde deserialization
- Implement Fixture::load with id/query validation
- Implement Fixture::load_dir to collect test cases from directory
- Add 4 unit tests: parse valid, empty id, empty query, load_dir
- Create 3 TOML fixture files: temporal, multisession, compression

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add 53-01-SUMMARY.md with execution results
- Update STATE.md position to plan 2 of 3
- Update ROADMAP.md progress for phase 53
- Mark BENCH-01 and BENCH-06 requirements complete

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Scorer: accuracy, recall@k, percentile, compression ratio, token estimation
- Runner: shells out to memory binary for search queries and session ingestion
- Report: JSON and markdown table output with optional competitor comparison
- Baseline: TOML loader for competitor scores
- 18 tests passing, clippy clean

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- CLI: temporal, multisession, compression, all, locomo subcommands
- Global --memory-bin flag for binary path override
- All subcommand: --compare flag for baseline side-by-side table
- Full pipeline: load fixtures, ingest sessions, run queries, score, report

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add locomo.rs with LocomoConversation, Turn, Question structs
- Implement load_dataset to read JSON files from directory
- Implement score_conversation with case-insensitive substring matching
- Implement aggregate_results with per-type breakdown
- Add 6 unit tests covering parsing, scoring, aggregation, and loading
- Register pub mod locomo in lib.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…l QA

- Wire Commands::Locomo in main.rs with dataset ingestion and scoring
- Move tempfile from dev-dependencies to dependencies for runtime use
- Add benchmark-smoke job to CI with continue-on-error: true
- Format all files to pass cargo fmt check
- Full pr-precheck passes (fmt, clippy, test, doc)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Create 53-03-SUMMARY.md with execution results
- Update STATE.md: phase 53 complete, 148 plans total
- Update ROADMAP.md: phase 53 progress to complete
- Mark BENCH-04 and BENCH-07 requirements complete

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@RichardHightower RichardHightower merged commit 68ab122 into main May 14, 2026
21 checks passed
@RichardHightower RichardHightower deleted the feature/v3.0-phase-53-benchmark-suite branch May 14, 2026 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant