Merged
30 changes: 30 additions & 0 deletions .github/workflows/ci.yml
@@ -180,6 +180,36 @@ jobs:
          fi
          echo "E2E tests passed!"

  benchmark-smoke:
    name: Benchmark Suite Smoke Test
    runs-on: ubuntu-24.04
    continue-on-error: true
    needs: [test]
    steps:
      - uses: actions/checkout@v4

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y protobuf-compiler libclang-dev

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Cache cargo registry
        uses: Swatinem/rust-cache@v2
        with:
          shared-key: "bench-smoke"

      - name: Build memory-bench
        run: cargo build -p memory-bench

      - name: Smoke test (help only — no daemon required)
        run: |
          cargo run -p memory-bench -- --help
          cargo run -p memory-bench -- all --help
          cargo run -p memory-bench -- locomo --help

  # Summary job that depends on all other jobs
  ci-success:
    name: CI Success
3 changes: 3 additions & 0 deletions .gitignore
@@ -58,3 +58,6 @@ coverage/

# Local Cargo configuration (platform-specific)
.cargo/

# LOCOMO benchmark dataset — download separately via benchmarks/scripts/download-locomo.sh
locomo-data/
32 changes: 16 additions & 16 deletions .planning/REQUIREMENTS.md
@@ -33,14 +33,14 @@ Requirements for the Competitive Parity & Benchmarks milestone. Each maps to roa

### Benchmark Suite (BENCH)

- [ ] **BENCH-01**: Custom benchmark harness with TOML fixture files (temporal, multisession, compression)
- [ ] **BENCH-02**: `memory benchmark temporal|multisession|compression|all` subcommands
- [ ] **BENCH-03**: Benchmark reports accuracy, recall@5, token_usage, latency_p50/p95, compression ratio
- [ ] **BENCH-04**: LOCOMO adapter ingests Snap Research dataset and produces `results.json` with aggregate score
- [ ] **BENCH-05**: `--compare` flag reads `benchmarks/baselines.toml` and prints side-by-side competitor table
- [ ] **BENCH-06**: `locomo-data/` in `.gitignore` — dataset never committed
- [ ] **BENCH-07**: CI runs benchmark suite (non-blocking, skips LOCOMO without `--dataset` flag)
- [ ] **BENCH-08**: JSON + markdown report output for all benchmark types
- [x] **BENCH-01**: Custom benchmark harness with TOML fixture files (temporal, multisession, compression)
- [x] **BENCH-02**: `memory benchmark temporal|multisession|compression|all` subcommands
- [x] **BENCH-03**: Benchmark reports accuracy, recall@5, token_usage, latency_p50/p95, compression ratio
- [x] **BENCH-04**: LOCOMO adapter ingests Snap Research dataset and produces `results.json` with aggregate score
- [x] **BENCH-05**: `--compare` flag reads `benchmarks/baselines.toml` and prints side-by-side competitor table
- [x] **BENCH-06**: `locomo-data/` in `.gitignore` — dataset never committed
- [x] **BENCH-07**: CI runs benchmark suite (non-blocking, skips LOCOMO without `--dataset` flag)
- [x] **BENCH-08**: JSON + markdown report output for all benchmark types
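
Reviewer note on BENCH-03: the requirement names recall@5, latency_p50/p95, and compression ratio but does not define them. The sketch below shows one common reading of each metric. It is a minimal illustration under stated assumptions, not the `memory-bench` scorer: the function names are hypothetical, the percentile uses the nearest-rank method, and compression ratio is assumed to mean original tokens divided by compressed tokens.

```rust
use std::collections::HashSet;

/// recall@k: fraction of the relevant ids that appear among the first `k`
/// retrieved ids. Assumes `relevant` holds unique ids. Returns 0.0 when
/// `relevant` is empty (a convention chosen for this sketch).
fn recall_at_k(retrieved: &[&str], relevant: &[&str], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let top_k: HashSet<&str> = retrieved.iter().take(k).copied().collect();
    let hits = relevant.iter().filter(|id| top_k.contains(*id)).count();
    hits as f64 / relevant.len() as f64
}

/// Nearest-rank percentile over latency samples (for example, milliseconds).
/// `p` is expected in (0, 100]. Returns None for an empty slice.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank is 1-based: ceil(p/100 * n), clamped into [1, n].
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}

/// Compression ratio, assumed here to be original / compressed token counts.
/// Returns None when the compressed count is zero.
fn compression_ratio(original_tokens: usize, compressed_tokens: usize) -> Option<f64> {
    if compressed_tokens == 0 {
        return None;
    }
    Some(original_tokens as f64 / compressed_tokens as f64)
}
```

Under these definitions, `percentile(&latencies, 50.0)` and `percentile(&latencies, 95.0)` would give the p50 and p95 figures. If the harness interpolates between samples or defines the ratio the other way round, the reported numbers will differ, so the report should state which convention it uses.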

## Future Requirements (v3.1+)

@@ -81,14 +81,14 @@ Requirements for the Competitive Parity & Benchmarks milestone. Each maps to roa
| CLI-08 | Phase 52 | Complete |
| CLI-09 | Phase 52 | Complete |
| CLI-10 | Phase 52 | Complete |
| BENCH-01 | Phase 53 | Pending |
| BENCH-02 | Phase 53 | Pending |
| BENCH-03 | Phase 53 | Pending |
| BENCH-04 | Phase 53 | Pending |
| BENCH-05 | Phase 53 | Pending |
| BENCH-06 | Phase 53 | Pending |
| BENCH-07 | Phase 53 | Pending |
| BENCH-08 | Phase 53 | Pending |
| BENCH-01 | Phase 53 | Complete |
| BENCH-02 | Phase 53 | Complete |
| BENCH-03 | Phase 53 | Complete |
| BENCH-04 | Phase 53 | Complete |
| BENCH-05 | Phase 53 | Complete |
| BENCH-06 | Phase 53 | Complete |
| BENCH-07 | Phase 53 | Complete |
| BENCH-08 | Phase 53 | Complete |

**Coverage:**
- v3.0 requirements: 26 total
18 changes: 10 additions & 8 deletions .planning/ROADMAP.md
@@ -143,8 +143,9 @@ See: `.planning/milestones/v2.7-ROADMAP.md`

- [x] **Phase 51: Retrieval Orchestrator** - Query expansion, RRF fusion, LLM reranking, and context building as a new crate wrapping RetrievalExecutor (merged 2026-04-28 via PR #28)
- [x] **Phase 51.5: API Summarizer Wiring** - Wire `ApiSummarizer` from config (out-of-band; merged 2026-04-28 via PR #27)
- [x] **Phase 52: Simple CLI API** - New `memory` binary with search, context, recall, add, timeline, summary subcommands (PR in review 2026-05-12)
- [ ] **Phase 53: Benchmark Suite** - Custom TOML-fixture harness with LOCOMO adapter and publishable scoring
- [x] **Phase 52: Simple CLI API** - New `memory` binary with search, context, recall, add, timeline, summary subcommands (merged 2026-05-14 via PR #29)
- [x] **Phase 53.5: Cross-Project Federation** - Federated query across multiple project stores (out-of-band; merged 2026-05-14 via PR #25)
- [x] **Phase 53: Benchmark Suite** - Custom TOML-fixture harness with LOCOMO adapter and publishable scoring (PR in review 2026-05-14)
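
Reviewer note on the Phase 51 entry: the roadmap names "RRF fusion" without spelling it out. For readers unfamiliar with the term, Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over every ranked list that contains it. The sketch below is a generic illustration of that formula, not the orchestrator crate's code. The function name `rrf_fuse` is hypothetical, and `k = 60` is only a commonly used default, not a value taken from this repository.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked id lists.
/// score(d) = sum over lists of 1 / (k + rank(d)), where rank is 1-based.
/// Returns (id, score) pairs sorted by descending score, ties broken by id.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            // `i` is 0-based, so the 1-based rank is i + 1.
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Deterministic order: highest score first, then lexicographic id.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because only ranks enter the formula, lists produced by different retrievers can be fused without normalizing their raw scores.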

## Phase Details

@@ -202,11 +203,12 @@ Plans:
3. Running `memory benchmark --compare` reads `benchmarks/baselines.toml` and prints a side-by-side competitor comparison table
4. Benchmark output is available in both JSON and Markdown report formats
5. CI runs the benchmark suite without blocking (LOCOMO skipped when `--dataset` flag is absent); `locomo-data/` is gitignored
**Plans**: TBD
**Plans**: 3 plans

Plans:
- [ ] 53-01: TBD
- [ ] 53-02: TBD
- [ ] 53-01-PLAN.md — Scaffold crate, fixture format, TOML loader, and benchmark data files
- [ ] 53-02-PLAN.md — Runner, scorer, report, baseline comparison, and CLI wiring
- [ ] 53-03-PLAN.md — LOCOMO adapter and full QA verification

## Progress

@@ -224,13 +226,13 @@ Phases execute in numeric order: 51 -> 51.5 (merged out-of-band) -> 52 -> 53
| v2.5 Semantic Dedup | 35-38 | 11/11 | Complete | 2026-03-10 |
| v2.6 Cognitive Retrieval | 39-44 | 13/13 | Complete | 2026-03-16 |
| v2.7 Multi-Runtime Portability | 45-50 | 11/11 | Complete | 2026-03-22 |
| v3.0 Competitive Parity | 51-53 + 51.5, 53.5 | 5/TBD | In progress | Phase 51 + 51.5 + 52 merged; Phase 53.5 (cross-project) in PR review |
| v3.0 Competitive Parity | 51-53 + 51.5, 53.5 | 6/TBD | In progress | Phase 51 + 51.5 + 52 + 53.5 merged; Phase 53 (Benchmark Suite) in PR review |

---

## v3.0 Cross-Project Federation (out-of-band)

> Branch: `feature/v3.0-cross-project-memory` (PR #25)
> Merged via PR #25 (2026-05-14)

### Phase 53.5: Cross-Project Federation Core (1/1 plan) — COMPLETE 2026-04-10

@@ -249,4 +251,4 @@ Out-of-band insertion (mirrors Phase 51.5 pattern). Originally planned as Phase

---

*Updated: 2026-05-14 — Phase 52 merged via PR #29; Phase 53.5 (cross-project federation) under review via PR #25*
*Updated: 2026-05-14 — Phase 53 (Benchmark Suite) opening PR to close v3.0*
52 changes: 30 additions & 22 deletions .planning/STATE.md
@@ -4,13 +4,13 @@ milestone_name: Competitive Parity & Benchmarks
status: in_progress
stopped_at: null
last_updated: "2026-05-14T00:00:00.000Z"
last_activity: 2026-05-14 — Phase 52 merged via PR #29; Phase 53.5 (cross-project) re-rebased for merge
last_activity: 2026-05-14 — Phase 53 (Benchmark Suite) rebased onto main; opening PR closes v3.0
progress:
total_phases: 4
completed_phases: 3
total_plans: 7
completed_plans: 7
percent: 75
total_phases: 5
completed_phases: 4
total_plans: 8
completed_plans: 8
percent: 100
---

# Project State
@@ -20,30 +20,35 @@ progress:
See: .planning/PROJECT.md (updated 2026-03-22)

**Core value:** Agent can answer "what were we talking about last week?" without scanning everything
**Current focus:** v3.0 Phase 52: Simple CLI API (PR review)
**Current focus:** v3.0 Phase 53: Benchmark Suite (PR review; closes v3.0)

## Current Position

Phase: 53.5 of 53 (cross-project federation, out-of-band) — landing via PR #25
Plan: 1 of 1 complete (53.5-01 cross-project federated query)
Status: Phase 51 + 51.5 + 52 merged; Phase 53.5 (cross-project) merging next; Phase 53 (Benchmark Suite) still pending
Last activity: 2026-05-14 — Phase 52 merged via PR #29; PR #25 re-rebased onto post-Phase-52 main
Phase: 53 of 53 (Benchmark Suite) — opening PR
Plan: 3 of 3 complete (53-01 foundation, 53-02 runner/scorer/CLI, 53-03 LOCOMO adapter)
Status: Phase 51 + 51.5 + 52 + 53.5 merged; Phase 53 (Benchmark Suite) PR opens; v3.0 fully shipped on merge
Last activity: 2026-05-14 — Rebased gsd/phase-53-benchmark-suite onto post-Phase-53.5 main; opening PR

Progress: [████████░░] 75% (3 of 4 phases)
Progress: [██████████] 100% (4 of 4 phases; Phase 53 PR pending)

## Out-of-band Work

### Open PRs

| PR | Branch | Status | Reviewed | Notes |
|---|---|---|---|---|
| #25 | `feature/v3.0-cross-project-memory` | Open, CI green | Not yet | Recorded as Phase 53.5 (decimal-phase pattern, mirrors 51.5); rebased onto main 2026-05-08 |
| #27 | merged 2026-04-28 as `3a73582` | Merged | — | Recorded as Phase 51.5; supersedes closed PR #26 |
| #28 | merged 2026-04-28 as `85f3303` | Merged | — | Phase 51 Retrieval Orchestrator |
(none — Phase 53 PR opening shortly)

### Local-only Branches (still stacked, pending PRs)
### Recently Merged

- `gsd/phase-{53..58}` — 6-phase stack of GSD work covering remaining v3.0 (Phase 53 Benchmark Suite), v3.1 (Phases 54-56), and v3.2 (Phases 57-58). Each branch backed up to origin 2026-05-12 (no PRs). Pending strategic decision: per-milestone PRs vs. per-phase. **Note:** the planning files on these branches describe v3.0/v3.1 as "shipped" — that reflects local execution intent, not origin/main reality.
| PR | What | Merged |
|---|---|---|
| #25 | Phase 53.5: cross-project federated query | 2026-05-14 |
| #29 | Phase 52: Simple CLI API | 2026-05-14 |
| #28 | Phase 51: Retrieval Orchestrator | 2026-04-28 |
| #27 | Phase 51.5: API summarizer wiring | 2026-04-27 |

### Local-only Branches (still stacked)

- `gsd/phase-{54..58}` — 5-phase stack of GSD work covering v3.1 (Phases 54-56: export/backup/import) and v3.2 (Phases 57-58: runtime registration). Each branch backed up to origin 2026-05-12 (no PRs). Pending strategic decision: per-milestone PRs vs. per-phase. **Note:** the planning files on these branches describe v3.0/v3.1 as "shipped" — that reflects local execution intent, not origin/main reality.

## Performance Metrics

@@ -77,6 +82,9 @@ See .planning/MILESTONES.md
- [Phase 53.5]: Project attribution stored in `metadata["project"]` — same convention as `metadata["agent"]` from v2.1
- [Phase 53.5]: `federated_query` is a pure function — matches existing `enrich_with_salience` pattern
- [Phase 53.5]: `open_read_only` uses `DB::open_cf_for_read_only` from rocksdb 0.22 with `create_if_missing(false)`
- [Phase 53]: New `memory-bench` crate with TOML fixture loader, runner/scorer/report/baseline modules, and LOCOMO adapter
- [Phase 53]: Benchmark dataset (LOCOMO) gitignored — adapter loads from local path; never committed
- [Phase 53]: CI benchmark smoke test added to verify the harness runs (not the full LOCOMO score)
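
Reviewer note on the Phase 53.5 decisions above: two of them (attribution in `metadata["project"]`, and `federated_query` being a pure function) can be illustrated together. The sketch below is hypothetical and is not the code in `memory-service`. The `Hit` struct, the `federated_merge` name, and the score-based ordering are all assumptions made for illustration; only the `metadata["project"]` key comes from the decision log.

```rust
use std::collections::HashMap;

/// Hypothetical search hit; the real result type in the repository may differ.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    id: String,
    score: f64,
    metadata: HashMap<String, String>,
}

/// Pure merge of per-project results: no I/O and no mutation of the inputs.
/// Each returned hit is a copy tagged with `metadata["project"]`, and the
/// merged list is sorted by descending score, then truncated to `limit`.
fn federated_merge(per_project: &[(&str, Vec<Hit>)], limit: usize) -> Vec<Hit> {
    let mut merged: Vec<Hit> = Vec::new();
    for (project, hits) in per_project {
        for hit in hits {
            let mut tagged = hit.clone();
            tagged
                .metadata
                .insert("project".to_string(), project.to_string());
            merged.push(tagged);
        }
    }
    // Highest score first; ties broken by id so the result is deterministic.
    merged.sort_by(|a, b| b.score.total_cmp(&a.score).then_with(|| a.id.cmp(&b.id)));
    merged.truncate(limit);
    merged
}
```

Keeping the merge free of side effects means it can be unit-tested with in-memory hits, without opening any project store.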

## Blockers

@@ -105,9 +113,9 @@ See: .planning/MILESTONES.md for complete history

## Cumulative Stats

- ~58,400 LOC Rust across 16 crates (memory-orchestrator from Phase 51, memory-cli from Phase 52) + federated module in memory-service (Phase 53.5)
- 52 phases (Phase 1-52 + 53.5), 154 plans across 9 milestones
- 50+ E2E tests + 144 bats CLI tests + orchestrator + memory-cli + 9 federated unit tests + 4 cross-project e2e tests
- ~60,000 LOC Rust across 17 crates (memory-orchestrator, memory-cli, memory-bench all new in v3.0)
- 53 phases (Phase 1-53 + 51.5 + 53.5), 157 plans across 9 milestones
- 50+ E2E tests + 144 bats CLI tests + orchestrator + memory-cli + memory-bench tests + 9 federated unit tests + 4 cross-project e2e tests + CI benchmark smoke test

## Session Continuity
