docs(sonar): S0 reframe — append-only belt federation brief#15

Draft · zkSoju wants to merge 3 commits into main from arch/belt-federation-brief

@zkSoju zkSoju commented May 22, 2026

What

Architecture brief reframing the sonar-belt-factory cycle after the S0 calibration spike, plus the generalized belt re-init runbook.

This PR exists for Bridgebuilder review of the brief. Planning docs are normally gitignored (grimoires/loa/context/*) per Loa convention; the brief was force-added so BB has a diff to review. Flatline has reviewed the same brief in parallel (local-doc path).

Why (the S0 finding)

The original cycle premise — 12 pure-product physical belts for blast-radius isolation — is budget-infeasible:

  • Operator ceiling: < $100/mo (the bar to justify leaving Envio's ~$300/mo hosted)
  • 12 belts ≈ $280–450/mo (each belt = a separate memory-resident indexer process; cost = memory × process-count)
  • SCALE.md already proved the real bottleneck is the 8-hour full reindex on any source addition, not steady-state isolation (D4: per-chain split, not per-product)
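
The budget claim above can be checked with a few lines of arithmetic. This is a rough sketch, assuming cost scales linearly with process count (as the "memory × process-count" note suggests) and ignoring shared infra such as eRPC or Hasura; the function names are illustrative only.

```typescript
// Figures quoted in this PR description.
const TWELVE_BELT_RANGE_USD = { low: 280, high: 450 };
const CEILING_USD = 100;

// Implied monthly cost of one belt, assuming cost is linear in belt count.
function perBeltCost(totalUsd: number, beltCount: number): number {
  return totalUsd / beltCount;
}

// Largest belt count whose total stays strictly under the ceiling.
function maxBeltsUnder(ceilingUsd: number, perBeltUsd: number): number {
  return Math.ceil(ceilingUsd / perBeltUsd) - 1;
}

const cheapBelt = perBeltCost(TWELVE_BELT_RANGE_USD.low, 12); // about 23.33
const priceyBelt = perBeltCost(TWELVE_BELT_RANGE_USD.high, 12); // 37.5

const bestCaseBelts = maxBeltsUnder(CEILING_USD, cheapBelt); // 4
const worstCaseBelts = maxBeltsUnder(CEILING_USD, priceyBelt); // 2
```

Under these assumptions the ceiling leaves room for roughly two to four always-on belts, which is why a 12-belt layout does not fit.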

The reframe — append-only belt federation

  • Indexer = one freeside module: index fast + serve one federated API. NOT the durable analytics store (that's score-api, with cron capture + fallbacks).
  • 1 consolidated corpus belt (never --restart) + on-demand isolated belts for new sources (parallel scoped backfill) + a federation gateway (query-time, no DB merge).
  • Load-bearing bet: federate, never merge.

Review focus

§6 (load-bearing bets B1–B4) and §7 (open questions Q1–Q6) — especially Q4 (is cold-sync eRPC-throughput-bound or CPU-bound? determines whether parallel backfill even helps).

Files

  • grimoires/loa/context/arch-brief-belt-federation.md — the brief
  • grimoires/loa/runbooks/belt-reinit.md — generalized KF-013 re-init runbook (S0-T3)

🤖 Generated with Claude Code

… runbook

S0 calibration spike found the 12-belt premise budget-infeasible (~$280-450/mo
vs the <$100/mo ceiling). Reframes the cycle to append-only belt federation
(indexer serves a federated API / score-api captures + owns durability).
Includes the generalized KF-013 belt re-init runbook.

For Bridgebuilder review (force-added; planning docs are normally gitignored).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zkSoju added the bridgebuilder:self-review label (Admit framework/grimoires artifacts into Bridgebuilder review) on May 22, 2026

zkSoju commented May 22, 2026

🏗️ How it works — architecture walkthrough

Answering the review questions: what is a belt responsible for — raw events or composed objects? what about logic composed across handlers? what happens when you add a contract to an existing belt?

1. A belt produces composed, queryable objects — not raw events

A belt is a full Envio indexer scoped to a subset of contracts. Its handlers compose freely. The paddle belt is the model — for a single Mint event it writes three kinds of entity:

| Kind | Entity | How |
|---|---|---|
| per-event record | PaddleSupply | context.PaddleSupply.set(...) |
| running aggregate | PaddleSupplier | get() → update totals → set() |
| cross-cutting normalized | Action | recordAction(context, …) |
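
To make the three writes concrete, here is a minimal sketch of what such a handler could look like. It is not the real paddle handler: the entity fields, the in-memory `Store`, the `MintEvent` shape, and the `recordAction` signature are all assumptions for illustration (amounts are plain numbers here for simplicity), and the actual Envio context API is async and generated from the schema.

```typescript
// Hypothetical entity shapes; the real ones come from the belt's schema subset.
type PaddleSupply = { id: string; supplier: string; amount: number };
type PaddleSupplier = { id: string; totalSupplied: number; supplyCount: number };
type Action = { id: string; actionType: string; actor: string; timestamp: number };

// In-memory stand-in for an entity table (not Envio's real context API).
class Store<T extends { id: string }> {
  private rows: Record<string, T> = {};
  get(id: string): T | undefined {
    return this.rows[id];
  }
  set(row: T): void {
    this.rows[row.id] = row;
  }
}

type Context = {
  PaddleSupply: Store<PaddleSupply>;
  PaddleSupplier: Store<PaddleSupplier>;
  Action: Store<Action>;
};

type MintEvent = {
  txHash: string;
  logIndex: number;
  supplier: string;
  amount: number;
  timestamp: number;
};

// Stand-in for the shared helper in lib/actions.ts; its real signature is not shown in this PR.
function recordAction(context: Context, action: Action): void {
  context.Action.set(action);
}

function handleMint(context: Context, event: MintEvent): void {
  const id = `${event.txHash}_${event.logIndex}`;

  // 1. Per-event record: one row per Mint log.
  context.PaddleSupply.set({ id, supplier: event.supplier, amount: event.amount });

  // 2. Running aggregate: get() -> update totals -> set().
  const previous = context.PaddleSupplier.get(event.supplier) ?? {
    id: event.supplier,
    totalSupplied: 0,
    supplyCount: 0,
  };
  context.PaddleSupplier.set({
    id: previous.id,
    totalSupplied: previous.totalSupplied + event.amount,
    supplyCount: previous.supplyCount + 1,
  });

  // 3. Cross-cutting normalized entity, written through the shared helper.
  recordAction(context, {
    id,
    actionType: "paddle_mint",
    actor: event.supplier,
    timestamp: event.timestamp,
  });
}

function makeContext(): Context {
  return {
    PaddleSupply: new Store<PaddleSupply>(),
    PaddleSupplier: new Store<PaddleSupplier>(),
    Action: new Store<Action>(),
  };
}
```

All three writes land in the same belt's store, which is the property the next section relies on.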

So the indexer still surfaces rich objects + rollups, exactly like today. The reframe does not reduce belts to raw log surfacing.

2. The boundary: compose inside a belt, federate across belts

Two different things get "composed," and they follow different rules:

Shared handler LOGIC (code), i.e. src/handlers/* + src/lib/*, stays shared. Belts import the same handler functions; each belt runs only the ones for its own contracts (e.g. recordAction lives in lib/actions.ts and is imported by 21 handlers today). Sharing happens at compile time and execution is scoped per belt, so nothing changes here.

DATA (entities) is belt-scoped. A handler may read/write any entity in its own belt's schema subset — that is where aggregates, rollups and cross-cutting shapes live (S0 proved the per-belt subset compiles). But a handler in belt A cannot touch an entity in belt B's Postgres.

So:

  • Inside a belt: full composition power (per-event + aggregate + derived + cross-cutting).
  • Across belts: composition moves to query time (federation gateway) or analytics time (score-api) — never index time.
 belt: paddle          belt: mibera          belt: berachain-core
   Action(paddle)        Action(mibera)        Action(bgt, …)
   PaddleSupplier        MiberaLoan            BgtBoostEvent
        │                     │                      │
        └────── federation gateway: UNION `Action` across belts ──────┐
                                                                       ▼
                          one logical `Action` stream → score-api → ClickHouse

The cross-cutting entities (Action, Mint, Holder, Token) are the interesting case — 21 handlers write Action today. In the federated model each belt writes its own slice; the gateway unions them into one logical stream at read time. (Same pattern the SDD already names: cross-cutting shapes written per-belt-that-writes, merged at federation.)
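
As a rough illustration of "union at read time", the sketch below merges per-belt `Action` slices that are each already sorted by timestamp into one ordered stream. The `Action` shape, the tie-break order, and the function names are assumptions; this covers only the simplest single-page case and does not address cursor pagination, dedup, or belts at different sync heights.

```typescript
// Hypothetical minimal Action shape as seen by the gateway.
type BeltAction = { id: string; belt: string; timestamp: number };

// Total order: timestamp first, then belt name, then id, so ties resolve deterministically.
function compareActions(a: BeltAction, b: BeltAction): number {
  if (a.timestamp !== b.timestamp) return a.timestamp - b.timestamp;
  if (a.belt !== b.belt) return a.belt < b.belt ? -1 : 1;
  if (a.id === b.id) return 0;
  return a.id < b.id ? -1 : 1;
}

// k-way merge of per-belt slices (each slice must already be sorted by compareActions).
// Only `limit` rows are materialized, so the gateway never holds every belt's full result.
function unionActions(slices: BeltAction[][], limit: number): BeltAction[] {
  const cursors: number[] = slices.map(() => 0);
  const merged: BeltAction[] = [];

  while (merged.length < limit) {
    let bestIndex = -1;
    let bestRow: BeltAction | undefined = undefined;

    for (let i = 0; i < slices.length; i++) {
      const candidate: BeltAction | undefined = slices[i]?.[cursors[i] ?? 0];
      if (candidate === undefined) continue; // this belt's slice is exhausted
      if (bestRow === undefined || compareActions(candidate, bestRow) < 0) {
        bestRow = candidate;
        bestIndex = i;
      }
    }

    if (bestRow === undefined) break; // every slice is exhausted
    merged.push(bestRow);
    cursors[bestIndex] = (cursors[bestIndex] ?? 0) + 1;
  }
  return merged;
}
```

Each belt still has to return its slice sorted and bounded for this to stay cheap. Anything beyond that (global pagination, entity identity across belts) is not covered here.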

3. Adding a new contract — the key question

The unit of re-sync is the belt. What happens depends on where the new contract's history sits relative to the belt's current sync head (the S0 D6 finding, code-confirmed in belt-reinit.md):

| Case | Behavior | Cost |
|---|---|---|
| New contract deploys after the belt's current head | plain redeploy (resume) picks it up going forward | free — no re-sync |
| New contract has history before head (you need its past events) | resume skips the history; you must --restart the belt | re-backfill that whole belt |

Why: Envio's isInitialized() checks table existence, not config — so resume continues each chain from its DB progressBlockNumber and never re-scans for a newly-added contract's past.
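
A toy model of that resume behavior, to show why history goes missing. This is not Envio code: `resumeOutcome` and its fields are hypothetical, and whether resume restarts at the head block or the block after it is an assumption here.

```typescript
type ResumeOutcome = {
  firstIndexedBlock: number; // first block at which the new contract's events are seen
  skippedBlocks: number; // blocks of the contract's history that resume never scans
};

// Models a plain redeploy (resume): the chain continues from its stored
// progressBlockNumber regardless of what the config now contains.
function resumeOutcome(progressBlockNumber: number, contractStartBlock: number): ResumeOutcome {
  const resumeFrom = progressBlockNumber + 1; // assumption: resume continues after the head
  const firstIndexedBlock = Math.max(resumeFrom, contractStartBlock);
  const skippedBlocks = Math.max(0, firstIndexedBlock - contractStartBlock);
  return { firstIndexedBlock, skippedBlocks };
}

// Contract deployed at block 400, belt head at 1000: blocks 400..1000 are never scanned.
const withHistory = resumeOutcome(1000, 400);

// Contract deploys at block 1500, after the head: nothing is skipped.
const forwardOnly = resumeOutcome(1000, 1500);
```

The first case is the one that forces a --restart if the past events matter; the second is the free path.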

This is the whole reason for the federation model. To add a contract that needs history you have two choices:

  • (a) Add it into an existing belt → if history is needed, you pay a --restart = re-backfill of everything in that belt (the 8-hour problem, scoped to one belt).
  • (b) Add it as its own new belt → backfill only the new contract (parallel, fast), corpus untouched, the gateway federates it in. ← the append-only path.

Decision rule: new contract on an already-synced chain that needs history → own belt. New contract that's forward-only, or a belt cheap to re-sync → fold into an existing belt.
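
The decision rule can be written down as a small function. The input fields and return labels are made up for illustration, and the case the rule does not cover (history needed on a chain that is not yet synced, with a belt that is expensive to re-sync) is deliberately returned as undecided rather than guessed.

```typescript
type Placement = "own-belt" | "fold-into-existing-belt" | "undecided";

type NewContract = {
  needsHistory: boolean; // do we need events from before the belt's current head?
  chainAlreadySynced: boolean; // is the target chain already synced in an existing belt?
  beltCheapToResync: boolean; // would a --restart of the target belt be cheap?
};

function placeNewContract(contract: NewContract): Placement {
  // Forward-only: a plain redeploy (resume) picks it up, no re-sync needed.
  if (!contract.needsHistory) return "fold-into-existing-belt";

  // History needed, but the belt is cheap to re-backfill: pay the --restart.
  if (contract.beltCheapToResync) return "fold-into-existing-belt";

  // History needed on an already-synced chain: backfill it in isolation.
  if (contract.chainAlreadySynced) return "own-belt";

  // Not covered by the stated rule (e.g. partially-synced chains).
  return "undecided";
}
```

For example, a forward-only contract always folds, while a contract with years of history on a chain the corpus already tracks gets its own belt.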

4. The honest caveat (from the Flatline review)

Flatline (3 models, run on this brief) flagged the real tension in SKP-003 (HIGH): the eventual "fold" — periodically --restart-ing the corpus belt to absorb mature siblings so belt-count stays bounded — is the same --restart re-sync the design exists to avoid. You don't eliminate the re-sync cost; you choose when to pay it (a deliberate, operator-gated maintenance window vs. on every change). The federation buys control over timing, not elimination.

And SKP-001 (CRITICAL): the "indexer is disposable because score-api backstops it" assumption is asserted, not proven — it needs a real recovery contract (gap detection, replay, idempotency, max-lag) before we lean on it. Both are tracked as open questions in the brief (§7).


Multi-model review status: Flatline ✅ (3 voices, $0 metered via CLI subscription) · Bridgebuilder ✅ (runs on CLI subscription; findings posting separately).


zkSoju commented May 22, 2026

🔬 Flatline review (3-model, CLI subscription · $0 metered)

claude-headless + codex-headless + gemini-headless · phase=spec · confidence: full · not degraded. The skeptic converged hard on the two premises the whole design rests on — both currently unproven, possibly false.

🚨 Blockers (deduped from 13 findings)

| Sev | Concern | Maps to |
|---|---|---|
| CRITICAL | Cross-belt query correctness — pagination, global ordering, dedup, entity identity at query time have no stated approach. all Actions by wallet X sorted by timestamp across N belts → gateway OOM / huge latency. This is the federation's hardest unsolved problem. | brief §6 B4 |
| CRITICAL | Parallel-backfill ceiling — sibling belts likely bottleneck on the shared eRPC throughput, so "spin up a sibling to backfill fast" may not actually be faster. SCALE.md already showed RPC-layer optimization gave negligible gains. The design proceeds as if parallelism helps. | brief §7 Q4 ("most important") |
| CRITICAL | Durability backstop unproven — "indexer is disposable because score-api backstops it" is asserted, not proven for dropped / delayed / duplicated / partial events. No recovery contract. | brief §6 B2 |
| HIGH | Fold = the 8-hour problem — periodic --restart of the corpus to absorb siblings is the re-sync the design exists to avoid. No safety threshold or runbook. | brief §4, §7 Q1 |
| HIGH | S0 scope — S0 proved codegen+tsc compile; it did not prove BeaconV3 federation fans out across N live belts. | brief §9 |
| HIGH | Cost extrapolation — corpus belt (~$40-55/mo) is extrapolated from a subset test belt; the real corpus indexes everything → likely higher. Need a calibration deploy + hard sibling cap + cost alarm. | brief §5 |

✅ High-consensus next actions (before the SDD)

  1. IMP-001 — bounded spike: does a sibling belt actually backfill faster in parallel? (resolves Q4; the #1 risk)
  2. IMP-002 — narrow the federation tech options (Apollo / Hasura remote schemas / custom gateway) before SDD
  3. IMP-003 — define the cross-belt consistency contract (ordering/dedup/pagination)
  4. IMP-005 — verify score-api can actually reconstruct missed events (the durability assumption)
  5. IMP-004 — define the consolidation trigger rule (operator-gated, never autonomous)
  6. IMP-006 — resolve corpus belt count for the budget claim
  7. IMP-007 — join-corpus vs own-belt rule, incl. partially-synced chains

Takeaway

The reframe is directionally sound, but two load-bearing premises (parallel backfill helps · query-time federation is feasible) are unproven and the review thinks at least one may be false. Recommend resolving IMP-001 (backfill spike) + IMP-002/003 (federation correctness) before writing the SDD — i.e., a short DIG/spike phase, not straight to implementation.


@zkSoju zkSoju left a comment

Summary

Analytical review of #15. Enrichment pass was unavailable; findings are unenriched.

Findings

{
  "schema_version": 1,
  "findings": [
    {
      "id": "F-001",
      "title": "B3 bet on parallel backfill may be invalidated by existing SCALE.md evidence",
      "severity": "HIGH",
      "category": "architecture",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "Section 6 B3 asserts sibling belts backfill concurrently without contending on the shared eRPC ceiling, but the brief itself notes (same section) that SCALE.md already showed optimization didn't help and raises 'is RPC the ceiling?' as an unresolved question. If eRPC throughput is the bottleneck, multiple sibling belts hitting the same eRPC instance simultaneously degrades all belts rather than isolating the new-source backfill. The current architecture places eRPC in the shared-infra tier (§5 cost model), meaning the isolation boundary does not extend to the RPC layer.",
      "suggestion": "Before committing to the sibling-belt model, resolve Q4 empirically: run a single sibling backfill while the corpus belt is at steady-state and measure eRPC request rates vs. throughput ceiling. If RPC is the bottleneck, the design needs either per-belt eRPC allocation (cost impact) or a HyperSync break-glass path before the model is viable.",
      "confidence": 0.85
    },
    {
      "id": "F-002",
      "title": "Consolidation cadence is undefined, leaving belt-count growth unbounded in practice",
      "severity": "HIGH",
      "category": "architecture",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "§4 describes periodic consolidation (fold sibling into corpus via deliberate --restart during a maintenance window) as the cost-control valve, but neither §4 nor §7 Q1 defines a trigger condition or cadence. Without a concrete rule (e.g., 'fold when belt has been at chain-head for N days' or 'fold when steady-state sibling cost > $X'), consolidation becomes a manual judgment call that is easy to defer. Each deferred consolidation adds fan-out width to the federation gateway and increases steady-state cost, potentially pushing the total above the $100/mo ceiling.",
      "suggestion": "Define an explicit consolidation trigger in the brief before the SDD stage — either a time-based rule, a cost threshold, or a belt-count ceiling. The trigger should be automatable (observable metric → runbook step) rather than purely operator-discretion.",
      "confidence": 0.8
    },
    {
      "id": "F-003",
      "title": "Cross-belt consistency contract is unspecified but consumers may depend on it",
      "severity": "HIGH",
      "category": "correctness",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "§7 Q2 acknowledges that federated reads hit belts at different sync heights during backfill, producing inconsistent cross-belt state. The brief defers resolution to 'acceptable because score-api capture tolerates it?' without establishing the actual contract. If any consumer (score-api cron or a direct GraphQL client) issues a cross-belt query that joins or correlates entities across belts (e.g., Holder counts against Action events from a sibling belt still mid-backfill), it will silently receive incomplete data with no signal that the result is partial.",
      "suggestion": "Define the consistency contract explicitly: either (a) expose a per-belt sync-height field in the federated schema so consumers can detect partial state, or (b) document that cross-belt queries are undefined during sibling backfill and enforce this at the gateway layer (e.g., route cross-belt queries to corpus-only until sibling reaches chain head).",
      "confidence": 0.75
    },
    {
      "id": "F-004",
      "title": "Federation technology choice deferred with no evaluation criteria",
      "severity": "MEDIUM",
      "category": "architecture",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "§7 Q6 lists four federation options (Apollo Federation, thin custom gateway, Hasura remote schemas, one Hasura over multiple Postgres sources) with no selection criteria or elimination reasoning. The choice has significant cost, operational, and capability implications: Hasura-per-belt vs. Apollo Federation vs. custom proxy differ in cross-belt join support, pagination semantics, and failure-mode behavior. Leaving this open at the architecture-brief stage means the SDD cannot specify the BeaconV3 contract shape.",
      "suggestion": "Add a §6-style bet or a pre-SDD spike to evaluate at minimum: (a) whether cross-belt entity deduplication/ordering is required (if yes, eliminates pure proxy options), and (b) operational cost of Hasura-per-belt vs. shared Hasura with multiple Postgres sources (if Hasura is the direction). This can be a half-day spike before the SDD is written.",
      "confidence": 0.8
    },
    {
      "id": "F-005",
      "title": "--restart all-or-nothing semantics within a belt creates hidden blast radius for multi-chain corpus",
      "severity": "MEDIUM",
      "category": "correctness",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "The D6 table correctly documents that --restart is all-or-nothing across a belt's chains. For the corpus belt (which consolidates all chains at steady-state), adding a single new historical-event source requires a full corpus reindex — reproducing the original 8-hour sync problem that motivated the belt split. The runbook notes this under 'Adding a NEW chain to an existing belt' but does not surface it as a constraint on the corpus belt specifically.",
      "suggestion": "Add an explicit callout in the runbook (and in the arch brief §4 or §3) that the corpus belt's --restart cost scales with the number of consolidated chains. This is the key constraint that determines when a new source *must* get its own sibling belt (Q3 decision rule): if the source has historical events and the corpus belt has N consolidated chains, corpus --restart cost is proportional to N, not to the new source alone.",
      "confidence": 0.9
    },
    {
      "id": "F-006",
      "title": "ENVIO_RESTART env-var removal step has no verification gate before resume deploy",
      "severity": "MEDIUM",
      "category": "operational",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "Procedure A step 1 says to verify that chain_metadata has a row per config chain after the expected 28P01 crash, then step 2 removes ENVIO_RESTART and redeploys. The runbook does not specify what to do if the chain_metadata verification in step 1 fails (e.g., the JS layer crashed before seeding, or only a subset of chains were seeded). Proceeding to step 2 with incomplete seeding silently leaves unseeded chains skipped on resume — the same silent-skip behavior documented in the D6 table.",
      "suggestion": "Add an explicit verification query to step 1 with a pass/fail gate: the operator should confirm `SELECT COUNT(*) FROM chain_metadata` equals the expected chain count before removing ENVIO_RESTART. If the count is wrong, the step is to redeploy with ENVIO_RESTART=1 again rather than proceeding to resume.",
      "confidence": 0.85
    },
    {
      "id": "F-007",
      "title": "Cost model lower bound depends on Hasura sharing which is architecturally deferred",
      "severity": "MEDIUM",
      "category": "architecture",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "§5 notes 'Sharing Hasura (1 → N belt Postgres, multi-source)' as one of the two levers to stay under $100/mo. But §7 Q6 leaves the federation technology entirely open, and §4 explicitly rejects shared Postgres. The cost model's lower bound assumes Hasura sharing is viable, while the architecture section hasn't committed to it. If the federation choice lands on per-belt Hasura (the simpler isolation path), the cost model needs to be recalculated.",
      "suggestion": "Resolve the Hasura-sharing question before finalizing the cost model. At minimum, note the cost model dependency explicitly: 'this bound assumes shared Hasura; per-belt Hasura adds ~$X per belt.'",
      "confidence": 0.75
    },
    {
      "id": "F-008",
      "title": "PRAISE: B2 (score-api durability backstop) enables disposable indexer design",
      "severity": "PRAISE",
      "category": "architecture",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "The explicit lambda-architecture framing in §3 — indexer as hot serving tier, score-api/ClickHouse as warm/cold analytics — is well-grounded. Naming score-api's fallbacks as the mechanism that makes the indexer disposable is precise: it correctly identifies which system owns durability and which owns serving, rather than requiring both from the same process. This is load-bearing for the cost argument and the re-init tolerance.",
      "confidence": 0.9
    },
    {
      "id": "F-009",
      "title": "PRAISE: D6 decision table grounds reset semantics in code, not inference",
      "severity": "PRAISE",
      "category": "operational",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "The D6 table traces each mutation type through isInitialized (table-existence check) and makeFromDbState (resume-from-progressBlockNumber) with source file and line references. Grounding the behavioral table in actual code paths rather than observed behavior alone means the table remains valid even if observed behavior changes — a reviewer can verify the table by reading the source rather than re-running experiments.",
      "confidence": 0.95
    },
    {
      "id": "F-010",
      "title": "PRAISE: KF-013 DO-NOT section records misdiagnosis explicitly, preventing recurrence",
      "severity": "PRAISE",
      "category": "operational",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "The DO-NOT section names the specific misdiagnosis (ENVIO_PG_SSL_MODE=false as fix for 28P01) with a dated entry, and separately documents why ENVIO_PG_SSL_MODE=disable fails for a different reason. Recording failed fix attempts with their failure mode — not just the working solution — is operationally valuable: it prevents future operators from spending time re-attempting known dead ends under different conditions.",
      "confidence": 0.95
    },
    {
      "id": "F-011",
      "title": "Partition-by-chain axis (D4) assumed valid without addressing cross-chain query patterns",
      "severity": "LOW",
      "category": "architecture",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "§4 states 'Partition axis = reindex-cost (chain), per D4' as settled. This is the right axis for minimizing reindex blast radius, but it may conflict with the federation gateway's cross-chain query requirements. If consumers frequently issue queries that span multiple chains (e.g., all Mints for a given address across all chains), the federation layer must fan out to all corpus/sibling belts and merge results — adding latency proportional to belt count and chain count.",
      "suggestion": "Note in §6 or §4 whether cross-chain queries are a first-class use case. If they are, the federation gateway must handle cross-chain merge, ordering, and pagination — which may be a constraint on Q6's technology choice.",
      "confidence": 0.6
    }
  ]
}

Callouts

Enrichment unavailable for this review.
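
For reference, the pass/fail gate suggested in F-006 reduces to a single comparison. A minimal sketch, assuming the operator has already obtained the seeded count from `SELECT COUNT(*) FROM chain_metadata`; the function name and decision labels are hypothetical and not part of the runbook.

```typescript
type GateDecision = "proceed-to-step-2" | "redeploy-with-ENVIO_RESTART";

// seededChainCount: result of SELECT COUNT(*) FROM chain_metadata after step 1.
// expectedChainCount: number of chains declared in the belt's config.
function verificationGate(seededChainCount: number, expectedChainCount: number): GateDecision {
  // Any mismatch blocks the resume deploy: unseeded chains would be silently skipped.
  return seededChainCount === expectedChainCount
    ? "proceed-to-step-2"
    : "redeploy-with-ENVIO_RESTART";
}
```

A short count (say 2 of 3 chains seeded) keeps the operator in step 1 instead of moving on to the resume deploy.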

Retires r1 federation framing (siblings + query-time gateway) after review
(Flatline SKP-002/003 CRITICAL, BB F-001/F-003): one consolidated belt behind a
stable alias, changes shipped by blue-green promotion (green backfills in
background, atomic alias flip, zero consumer downtime). Removes cross-belt query
correctness + fold-downtime risk by construction. Adds belt-reinit.md F-006
verification gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@zkSoju zkSoju left a comment

Summary

Analytical review of #15. Enrichment pass was unavailable; findings are unenriched.

Findings

{
  "schema_version": 1,
  "findings": [
    {
      "id": "F-001",
      "title": "Alias mechanism (B3/Q1) is unresolved — architecture is unimplementable without it",
      "severity": "HIGH",
      "category": "architectural-gap",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "The entire value proposition of r2 rests on 'atomic alias flip blue→green' and 'consumers hit a fixed endpoint; the swap is invisible.' But Q1 explicitly admits the alias mechanism is unknown — Railway custom domain reassign, gateway holding current URL, DNS/proxy layer, or something else. §4 calls this 'SCALE.md Guardrail 5 — the core deliverable' yet it has no implementation sketch, no atomicity guarantee, and no definition of what 'single source of truth' means in Railway terms. Without this, B3 and B4 cannot be evaluated. The brief cannot graduate from candidate status until Q1 has an answer.",
      "suggestion": "Promote Q1 from open question to architectural prerequisite. Before sprint planning, resolve: (a) the specific Railway mechanism that provides a stable endpoint, (b) whether the flip is truly atomic from the consumer's perspective (DNS TTL? load-balancer drain?), and (c) how 'all consumers' are confirmed to have transitioned before blue is retired. Add a §10 stub with the proposed mechanism so reviewers can attack it specifically.",
      "confidence": 0.95
    },
    {
      "id": "F-002",
      "title": "`--restart` wipe destructiveness is under-emphasized for the 'add a single new chain' case",
      "severity": "HIGH",
      "category": "operational-safety",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "Procedure A and D6 both note that `--restart` re-seeds ALL of the belt's chains, not just the new one being added. But the warning ('⚠️ `--restart` re-seeds all of that belt's chains (full belt reindex)') appears as a parenthetical after the usage instruction, not before it. In a blue-green world, the operator's mental model for 'add a chain' should be: stand up a green, because `--restart` on a live blue wipes it. The current runbook structure presents `--restart` as the normal tool for adding a chain without making it clear that this operation is only safe to run on a non-serving green — never on the production blue.",
      "suggestion": "Add an explicit callout box at the top of Procedure A: '`--restart` on the production (blue) belt wipes and reindexes ALL chains — full downtime. For the blue-green model described in arch-brief r2, `--restart` must only be run on a green (non-serving) deployment. Never run Procedure A on the live belt.' Cross-reference §4 of the arch brief.",
      "confidence": 0.9
    },
    {
      "id": "F-003",
      "title": "Behavioral assertions sourced from `3.0.0-alpha.17` internals are fragile",
      "severity": "MEDIUM",
      "category": "dependency-risk",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "The runbook's core correctness depends on specific alpha-build behavior: `isInitialized` checks table existence (not config hash), `makeFromDbState` iterates `chain_metadata` rows (not config chains), and resume never runs the Rust-CLI `persisted_state` upsert. These are cited with line references from `node_modules/` source. Alpha builds (pre-1.0) make no stability guarantees. If Envio changes `isInitialized` to be config-aware, or changes the `persisted_state` upsert path, the runbook's 2-step dance becomes incorrect — silently or with different error modes. The DO-NOT section would become misleading.",
      "suggestion": "Add a 'Version lock' warning: 'This runbook is verified against Envio `3.0.0-alpha.17`. Re-verify the `isInitialized` and `makeFromDbState` behavior on each alpha version bump before using. Key check: does resume still skip chains absent from `chain_metadata`?' Consider adding the version as frontmatter so it's visually prominent at the top.",
      "confidence": 0.85
    },
    {
      "id": "F-004",
      "title": "Verification gate retry loop has no escalation path",
      "severity": "MEDIUM",
      "category": "operational-safety",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "Procedure A step 1 says: 'On a short count, do NOT proceed to step 2 — redeploy with `ENVIO_RESTART=1` again until the count matches.' There is no guidance for: how many retries before escalating, what to look at if JS crashes before seeding all chains consistently, or how to distinguish a transient crash (safe to retry) from a structural issue (retrying will not fix it, and each retry wipes the schema again). An operator in a degraded state following this instruction could loop indefinitely.",
      "suggestion": "Add a retry bound: 'If the count does not match after 2 `ENVIO_RESTART=1` deploys, examine the Railway deploy logs for the JS-layer crash point before the schema seed completed. Check Envio issues for alpha-specific seeding regressions. Do not continue retrying blindly — each `--restart` wipes the schema.' Reference `grimoires/loa/known-failures.md` as the escalation surface.",
      "confidence": 0.8
    },
    {
      "id": "F-005",
      "title": "Cost model does not quantify the promotion window overhead",
      "severity": "MEDIUM",
      "category": "cost-model",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "§5 says the transient during-promotion cost is '~2× the belt for the catch-up window (hours), then back to 1×' and calls it 'negligible monthly.' But no number is given. At $50–60/mo per belt, an 8-hour backfill window costs approximately $0.60–0.70 per promotion. A 48-hour backfill (plausible for a large corpus) costs ~$3.50. Neither is budget-threatening, but stating the number removes the vagueness. More importantly, Q4 says backfill wall-time is 'worth a one-shot confirmation' — the cost model's 'negligible' claim depends on that confirmation.",
      "suggestion": "Add a row to the cost table: 'Per promotion event (green backfill, ~N hours at 2× belt rate): ~$X.' Fill N and X once Q4 is answered. This grounds the 'negligible' claim and makes it auditable.",
      "confidence": 0.75
    },
    {
      "id": "F-006",
      "title": "D6 'handler read-side bug fix' row overstates forward-applicability",
      "severity": "LOW",
      "category": "documentation-accuracy",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "D6 states that a 'handler read-side bug fix (no aggregate-shape change)' requires only a redeploy because 'new code applies forward.' This is accurate for events not yet indexed, but any events already indexed under the buggy handler are left in the DB with incorrect data. The row's framing ('no gap') could lead an operator to skip a reindex when historical correctness is actually required. The distinction between aggregate-shape change and handler logic change is not always clean — a handler can produce wrong aggregate values without changing the entity schema.",
      "suggestion": "Amend the row to: 'New code applies forward only; already-indexed events retain old handler output. If historical correctness is required (e.g., wrong aggregate values computed from past events), treat as a schema/entity change and use blue-green or `--restart`.' This prevents misclassification.",
      "confidence": 0.8
    },
    {
      "id": "F-007",
      "title": "B2 backfill speed is an unverified operator assertion for the full corpus",
      "severity": "LOW",
      "category": "unverified-bet",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "B2 states 'operator asserts RPC backfill is fast enough' and Q4 acknowledges a confirmation is 'worth doing.' The brief correctly lowers the stakes ('no longer gates downtime') but the promotion cadence and SLA for 'time-to-promote' cannot be estimated without a measured full-corpus backfill time. If a chain's historical depth is large and eRPC throughput is rate-limited, a promotion could take days — not hours — which affects how batching (Q5) should be planned.",
      "suggestion": "Q4 is already the right call. Elevate it slightly: make the full-corpus backfill measurement a named S1 prerequisite for setting the promotion SLA, not just 'worth a one-shot confirmation.' Until measured, promotion cadence estimates are ungrounded.",
      "confidence": 0.7
    },
    {
      "id": "F-008",
      "title": "PRAISE — DO-NOT section grounds anti-patterns in specific incidents and misdiagnoses",
      "severity": "PRAISE",
      "category": "operational-quality",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "The three DO-NOT entries are each anchored to a specific failure mode with a date and a named misdiagnosis: 'chasing the password' on 28P01, the 2026-05-20 sslmode=false misdiagnosis, and the `sslmode=disable` crash path. This structure prevents recurrence more effectively than a generic warning because it gives an operator in a degraded state specific things to rule out, with the reasoning already done. The KF-013 cross-reference provides audit depth.",
      "suggestion": "No change needed. This pattern should be replicated in future runbooks: anchor DO-NOT entries to specific past attempts, dates, and named failure modes rather than generic 'don't do X.'",
      "confidence": 1.0
    },
    {
      "id": "F-009",
      "title": "PRAISE — Verification gate in Procedure A surfaces the D6 silent-skip hazard at the decision point",
      "severity": "PRAISE",
      "category": "operational-safety",
      "file": "grimoires/loa/runbooks/belt-reinit.md",
      "description": "The BB F-006 verification gate ('SELECT COUNT(*) FROM chain_metadata MUST equal the number of chains in config') is placed exactly where an operator would be about to start step 2, and explicitly blocks progression on a short count. The 'do not skip' instruction and the consequence ('silently skipped on resume') are co-located. This is the right structure: the hazard (D6 silent-skip behavior) is surfaced at the point where it can be caught, not buried in a separate reference section.",
      "suggestion": "No change needed.",
      "confidence": 1.0
    },
    {
      "id": "F-010",
      "title": "REFRAME — Q1 framing as 'open question' understates it as an implementation blocker",
      "severity": "REFRAME",
      "category": "framing",
      "file": "grimoires/loa/context/arch-brief-belt-federation.md",
      "description": "§7 lists Q1 (alias mechanism) alongside Q3 (promotion trigger) and Q4 (backfill speed) as peer open questions. But Q1 is categorically different: Q3 and Q4 are refinements to a mechanism that exists; Q1 is the mechanism itself. The blue-green architecture as described is not a refinement of an existing system — it depends on something that does not yet exist and whose implementation shape is unknown. Placing it in the same list as 'what gate confirms green is safe to promote' obscures this asymmetry.",
      "suggestion": "Add a '§10 — Prerequisite: alias mechanism spec' section (even as a one-paragraph stub) that is explicitly gated: 'Sprint planning for S1 is blocked until this section is filled in.' This makes the dependency visible in the document's structure, not just in a bulleted Q list.",
      "confidence": 0.85
    }
  ]
}

Callouts

Enrichment unavailable for this review.

@zkSoju
Author

zkSoju commented May 22, 2026

🔁 r2 review synthesis — both models, CLI subscription ($0 metered)

Re-ran Flatline (3-model) + Bridgebuilder on brief r2 (344c07e). Headline: r2 retired both r1 architectural CRITICALs — and the two reviews converge hard on a single remaining blocker.

✅ Cleared by the r2 reframe (one belt + blue-green)

| r1 finding | r2 status |
| --- | --- |
| SKP-002 CRIT: cross-belt query correctness | Gone: one belt, no cross-belt UNION/dedup/ordering |
| SKP-003 HIGH: fold = 8h downtime | Gone: green builds in background, blue serves, atomic flip |
| F-002/SKP-005: sibling sprawl / cost creep | Gone: no permanent siblings |

BB even flagged 2× PRAISE: the incident-grounded DO-NOT section, and the new F-006 verification gate surfacing the D6 silent-skip hazard at the decision point.

🚨 The one true blocker (unanimous)

The stable alias/promotion mechanism is unspecified — and it's load-bearing, not an "open question."

  • BB F-001 (HIGH): "architecture is unimplementable without it" · BB F-010 REFRAME: Q1 understates it — it's an implementation blocker.
  • Flatline SKP-001 (CRITICAL ×3): the alias is the only thing that makes zero-downtime work; a DNS-based alias is NOT atomic → needs a real router/proxy with single-source-of-truth swap. (This is SCALE.md Guardrail 5, already flagged "not built.")

📋 SDD spec checklist (the rest — all "specify the mechanism," not "design is wrong")

| Item | Source |
| --- | --- |
| Promotion gate = reconciliation (entity counts), not just latest_block ≥ blue | FL SKP-002 HIGH |
| Rollback path after a bad alias flip (keep blue until green verified) | FL SKP-003 HIGH |
| Green DB isolation during re-init: green's --restart must not touch blue's data | FL SKP-002 CRIT |
| Breaking (non-additive) schema change path (B1 only covers additive) | FL SKP-003 CRIT |
| Quantify promotion-window cost (transient blue+green vs already-89% mem) | FL SKP-004 + BB F-005 |
| Measure B2 backfill convergence (currently operator assertion) | FL SKP-003 + BB F-007 |
| Verification-gate retry needs an escalation path (no infinite loop) | BB F-004 |
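
The last checklist item (BB F-004) could take the shape of a bounded retry: re-check the gate a fixed number of times, then escalate instead of looping. The sketch below is a minimal Python illustration, not the runbook's procedure. `count_chain_metadata` is a hypothetical callable wrapping the runbook's `SELECT COUNT(*) FROM chain_metadata` query, and the attempt limit, delay, and `GateEscalation` name are assumptions chosen for the example.

```python
import time
from typing import Callable


class GateEscalation(Exception):
    """Raised when the verification gate still fails after all attempts."""


def verify_chain_metadata(
    count_chain_metadata: Callable[[], int],  # hypothetical: runs the COUNT(*) query
    expected_chains: int,                     # number of chains in the belt config
    max_attempts: int = 5,                    # assumed limit, tune per belt
    delay_seconds: float = 30.0,              # assumed pause between checks
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the attempt number on which the gate passed, or escalate."""
    last_count = -1
    for attempt in range(1, max_attempts + 1):
        last_count = count_chain_metadata()
        if last_count == expected_chains:
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    # Bounded by construction: the loop cannot spin forever on a short count.
    raise GateEscalation(
        f"chain_metadata has {last_count} rows, expected {expected_chains} "
        f"after {max_attempts} attempts; stop and escalate to an operator"
    )
```

Raising a dedicated exception keeps "gate failed" distinct from a query error, so whatever orchestrates the re-init can stop and page an operator rather than proceed to step 2.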

Verdict

r1 criticals were "this might not work." r2's are "this works — now specify the alias, promotion gate, rollback, and breaking-change policy." That's the brief→SDD boundary. The architecture is validated; the SDD's #1 job is the alias/promotion mechanism (likely a short spike to pick the tech: Railway domain reassignment vs. a thin proxy/router vs. a gateway), then the spec checklist above.

(Tooling note: BB's Pass-2 enrichment was unavailable on claude-headless in all runs — findings are Pass-1 convergence only, no FAANG-parallel depth. Repeatable; logging as a known-failure.)

…t-gateway

Resolves the reviews' #1 blocker (Q1/SKP-001/F-001): the stable alias is already
built + verified (Caddy belt-gateway, BELT_UPSTREAM swap, reversible rollback;
proxy not DNS = atomic + single-source). Green DB isolation is free (separate
service+Postgres). Remaining: swap atomicity (redeploy blip vs caddy reload) = SDD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zkSoju
Author

zkSoju commented May 22, 2026

🔑 Alias-mechanism spike — the #1 blocker is largely already built

Both r2 reviews flagged the stable alias as the load-bearing unspecified blocker (BB F-001 "unimplementable without it" + F-010 REFRAME; Flatline SKP-001 CRITICAL ×3, "DNS isn't atomic"). The spike found it already exists — the reviews just didn't have the repo's Dockerfile.gateway/Caddyfile in their diff.

The stable alias = the Caddy belt-gateway (shipped S3-T1/T2):

:{$PORT}  →  reverse_proxy {$BELT_UPSTREAM}
  • Proxy, not DNS → atomic by construction, not propagation-bound (resolves SKP-001's core objection).
  • Single-source-of-truth → one Caddy config + one env var, not per-consumer config (resolves Guardrail 5's split-brain).
  • Swap built + verified (NOTES.md): railway variables -s belt-gateway --set 'BELT_UPSTREAM=<green>' → "bad upstream→502, revert→live data."
  • Rollback = revert BELT_UPSTREAM — already proven reversible (SKP-003 HIGH "no rollback path" → addressed).
  • Green↔blue DB isolation (SKP-002 CRIT) is free — green is a separate Railway service + own Postgres, structurally unable to touch blue's data.
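
The swap-and-revert flow in the bullets above can be sketched as follows. This is a hedged Python illustration, not the shipped tooling. `set_upstream` stands in for whatever sets `BELT_UPSTREAM` on the belt-gateway service (the `railway variables` command quoted from NOTES.md), and `is_healthy` is a hypothetical probe made through the gateway. Both are injected, so the sketch assumes nothing about how they are implemented.

```python
from typing import Callable


def swap_upstream(
    set_upstream: Callable[[str], None],  # hypothetical wrapper that sets BELT_UPSTREAM
    is_healthy: Callable[[], bool],       # hypothetical probe through the gateway
    blue: str,
    green: str,
) -> str:
    """Point the gateway at green; revert to blue if the probe fails.

    Returns the upstream that is serving when the function exits.
    """
    set_upstream(green)
    if is_healthy():
        return green
    # Rollback is the same operation with the previous value.
    set_upstream(blue)
    return blue
```

The revert only works while blue is still running, which is why the checklist asks to keep blue until green is verified.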

What this does to the review findings

| Finding | New status |
| --- | --- |
| Q1/SKP-001/F-001/F-010: alias unspecified/unimplementable | RESOLVED: exists, verified, proxy-not-DNS |
| SKP-003: no rollback after bad flip | RESOLVED: revert BELT_UPSTREAM (proven) |
| SKP-002: green DB isolation | RESOLVED: separate service + Postgres |
| SKP-002: promotion gate (block-height insufficient) | still SDD: add reconciliation gate before swap |
| SKP-003/B1: breaking (non-additive) schema path | still SDD |
| swap atomicity (redeploy blip vs caddy reload) | bounded SDD choice: admin off today; ~sec blip, score-api covers it |

Net

The hardest "this might not be implementable" blocker collapses to "it's built; pick the swap-atomicity refinement." What genuinely remains for the SDD is small and concrete: (1) promotion gate = reconciliation + all-chain latest_block ≥ blue, (2) breaking-change schema path, (3) swap atomicity (blip vs graceful reload), (4) green-build orchestration around the existing belt-reinit.md. The SDD is unblocked.
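
Item (1) can be sketched as a pure function over two snapshots. This is a Python illustration under stated assumptions, not the SDD's definition. The `BeltSnapshot` shape, its field names, and the rule that green must hold at least as many rows per entity as blue are all hypothetical; the actual reconciliation rule is for the SDD to define.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BeltSnapshot:
    """Hypothetical point-in-time view of one belt."""
    latest_block: Dict[str, int]   # chain id -> latest indexed block
    entity_counts: Dict[str, int]  # entity name -> row count


def promotion_blockers(blue: BeltSnapshot, green: BeltSnapshot) -> List[str]:
    """Return reasons green must not be promoted; an empty list means pass."""
    blockers: List[str] = []
    # Every chain blue serves must exist on green and be at least as far along.
    for chain, blue_block in sorted(blue.latest_block.items()):
        green_block = green.latest_block.get(chain)
        if green_block is None:
            blockers.append(f"chain {chain}: missing on green")
        elif green_block < blue_block:
            blockers.append(f"chain {chain}: green {green_block} < blue {blue_block}")
    # Assumed reconciliation rule: green holds at least as many rows per entity.
    for entity, blue_count in sorted(blue.entity_counts.items()):
        green_count = green.entity_counts.get(entity, 0)
        if green_count < blue_count:
            blockers.append(f"entity {entity}: green {green_count} < blue {blue_count}")
    return blockers
```

Returning the list of blockers rather than a boolean gives the operator the reason a promotion was refused. The per-chain loop also catches a chain that is missing on green entirely, which a single aggregate block-height comparison would not.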

Labels

bridgebuilder:self-review Admit framework/grimoires artifacts into Bridgebuilder review
