sonar-belt-factory: 6-chain consolidated belt + blue→green expansion gate + zero-downtime swap #16
Opens the sonar-belt-factory cycle (D4 per-belt split + BeaconV3 federation + Effect serving layer). Supersedes the merged indexer-belt-rebuild (Mibera) cycle, now archived. - PRD: full monolith→belt migration; blast-radius/BeaconV3/uptime gates - SDD r6: 12-belt pure-product partition (operator-confirmed taxonomy); Flatline-remediated (3-model, 9 blockers → §17 R-A..R-F) - Sprint plan: S0 calibration spike (gates all) → S1-S4 (global 172-176) - Ledger: Mibera cycle archived, sonar-belt-factory active Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Zero-dep, fail-closed promotion gate (SDD §6 / FR-4): block-height parity + 3-mode entity-count reconciliation (A at-block / B Action timestamp-proxy / C converged-exact, per Task 1.0) + schema-superset (additive-only incl. nullability). Pure over snapshots; 15/15 tests pass (test/promotion-gate.test.ts). Live fetch (cutoff / raw-L1 / content-sample) wired for S2. SR-1 closed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssion-4 kickoff belt-reinit.md (KF-013 generalized re-init + F-006 seed-count gate); known-failures KF-014 (BB Pass-2 enrichment fails on headless — accept Pass-1); SCALE.md D6 closed; session-4 kickoff handoff (specs + tracks) for /run sprint-plan → /run-bridge 3; NOTES.md S1 live-data AC deferrals. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supersedes the r6/12-belt plan (79ef8df). PRD r2 (one consolidated belt + blue-green promotion + stable alias), SDD r7 (§17 R-A..R-G Flatline-remediated, OQ-1=Caddy reload), sprint v2.0 (SR-1..SR-7), ledger sync (S0 done, S1-S4 g173-176). All reviewed on PR #15 ($0 CLI-subscription). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… review) Review (cross-model codex-headless, $0) found that checkSchemaSuperset's parseSchema ignored `enum` declarations, so an enum value-set contraction (green drops a value that blue has) passed the gate, violating FR-7 additive-only / AC-7 IMP-005 "nullability AND enum dimensions" (reviewer.md had over-claimed AC-7 as Met). Adds zero-dep parseEnums() + green-⊇-blue per-enum value-set assertion; value removal OR whole-enum removal -> FAIL, addition allowed. 3 tests added (18 passed). schema.graphql has 0 enums today, so this closes a latent gap before the first enum lands. Non-blocking concerns (parseSchema nested-brace fragility; both-wrong needs S2 Part-4 raw-L1) recorded for S2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
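The enum check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/promotion-gate.js`: the function names follow the commit message, but the bodies, the `checkEnumSuperset` helper name, and the regex are assumptions, and the parser assumes simple `enum Name { A B C }` declarations with no directives on individual values.

```javascript
// Hypothetical sketch: collect enum name -> Set of values from GraphQL SDL text.
// Assumes flat enum bodies and no directives attached to individual values.
function parseEnums(sdl) {
  const enums = new Map();
  const re = /\benum\s+(\w+)[^{]*\{([^}]*)\}/g;
  let m;
  while ((m = re.exec(sdl)) !== null) {
    const values = m[2]
      .replace(/#[^\n]*/g, "") // drop line comments inside the enum body
      .split(/[\s,]+/)
      .filter(Boolean);
    enums.set(m[1], new Set(values));
  }
  return enums;
}

// Green must keep every enum, and every enum value, that blue declares.
// Additions on the green side are allowed.
function checkEnumSuperset(blueSdl, greenSdl) {
  const blue = parseEnums(blueSdl);
  const green = parseEnums(greenSdl);
  const failures = [];
  for (const [name, blueValues] of blue) {
    const greenValues = green.get(name);
    if (!greenValues) {
      failures.push(`enum ${name} removed`);
      continue;
    }
    for (const value of blueValues) {
      if (!greenValues.has(value)) failures.push(`enum ${name} lost value ${value}`);
    }
  }
  return { pass: failures.length === 0, failures };
}
```

A value added in green passes, while a removed value or a removed enum produces a failure entry, which matches the "removal -> FAIL, addition allowed" rule stated above.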
S1 reconciliation gate: review APPROVED (DISS-001 enum fix landed), security audit APPROVED (0 CRIT/HIGH/MED, adversarial audit 0 findings). Epic bd-z7d + all 9 tasks closed. 3 live-data ACs + DISS-002 (makeFetchSnapshot) + 2 LOW hygiene notes accepted-deferred->S2 per NOTES.md Decision Log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…undwork S2 reframed (operator decision) from parity dry-run to the real consolidation EXPANSION: green = config.yaml (6-chain consolidated, +Arbitrum +Zora, 41 contracts) vs live blue = config.mibera.yaml (4-chain mibera belt). Dockerfile.belt now takes a BELT_CONFIG build-arg (default config.mibera.yaml so blue's build is UNCHANGED; green sets config.yaml). Decision + implications recorded in NOTES Decision Log; amended S2 tasks bd-umw.6/.7/.8 (gate expansion-mode, Part-4 raw-L1 ground-truth now load-bearing for new chains, Dockerfile param). config.yaml codegen verified clean locally (rescript 24/24). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lt-indexer-green seeded) Green = 6-chain consolidated belt standing up in freeside-sonar/production. Postgres-vRR1 (isolated green DB) + belt-indexer-green created; isolation verified (ENVIO_PG_HOST=postgres-vrr1.railway.internal, not blue's postgres-3vic). Seed completed: schema + 6 chain_metadata rows (1/10/8453/42161/80094/7777777 incl new Arbitrum+Zora) — BB-F006 COUNT==6 gate PASS. ENVIO_RESTART=0 set → resume/backfill triggered. Pending: belt-hasura-green, gate expansion-mode + Part-4, certify, swap. Token at ~/.railway-green.tok (revoke after cycle). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ansion-mode + swap Build doc + session track for resuming sonar-belt-factory after green's cold-sync converges: belt-hasura-green, gate expansion-mode (bd-umw.6), Part-4 raw-L1 ground-truth (bd-umw.7), certify, swap (bd-c09.*), retire blue, S4 + run-bridge 3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…E_OPTIONS 6-chain green belt-indexer crash-looped (FATAL: JS heap out of memory) — Node's ~2GB default heap vs the 24GB container under 6 concurrent chain fetchers; Arbitrum + Zora froze on dense HoneyJar mint regions while lighter chains advanced. Fixed with NODE_OPTIONS=--max-old-space-size=12288 on belt-indexer-green (green-only; blue 4-chain fine at default). Confirmed: Zora converged to head, Arbitrum +58.5M unstuck. KF-015 logged + indexed; NOTES continuity + session-5 build doc note the heap requirement must persist post-swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
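Because the heap setting has to survive the swap, a startup guard along these lines could catch a deployment that silently lost it. This is only a sketch: `maxOldSpaceMb` and `assertHeapConfigured` are hypothetical helpers, not part of the repository, and the guard only inspects the `NODE_OPTIONS` string rather than the live V8 limit.

```javascript
// Hypothetical helper: extract the --max-old-space-size value (in MB) from a
// NODE_OPTIONS string, or return null when the flag is absent.
function maxOldSpaceMb(nodeOptions) {
  const m = /--max-old-space-size=(\d+)/.exec(nodeOptions || "");
  return m ? Number(m[1]) : null;
}

// Fail closed when the heap flag is missing or below the required size.
function assertHeapConfigured(env, requiredMb) {
  const mb = maxOldSpaceMb(env.NODE_OPTIONS);
  if (mb === null || mb < requiredMb) {
    const seen = mb === null ? "unset" : `${mb}MB`;
    throw new Error(`NODE_OPTIONS heap is ${seen}; need >= ${requiredMb}MB`);
  }
  return mb;
}
```

For the green indexer described above, the call would be `assertHeapConfigured(process.env, 12288)`.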
…hboard Local one-command dashboard (node scripts/sync-dashboard.cjs -> localhost:8787): per-chain progress bars, %, blocks remaining, events, live rate + ETA from chain_metadata in Postgres-vRR1. Zero new deps (bundled postgres.js); read-only; DB URL via ~/.railway-green.tok or GREEN_DB_URL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ns read as active A chain can fetch far ahead of its committed latest_processed_block (envio commits in batches) — e.g. Arbitrum fetched 94M blocks ahead while processed sat frozen, which reads as 'stuck' but is healthy. Dashboard now shows a ghost bar (fetched edge) + 'fetched +N ahead' + derives the live rate from leading-edge movement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
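The leading-edge rate described here can be illustrated with a small pure function. The sample shape (`at`, `processed`, `fetched`) and the function name `progress` are assumptions made for this sketch; the actual dashboard reads `chain_metadata` and may compute remaining blocks differently.

```javascript
// Hypothetical sketch. Each sample: { at: ms timestamp, processed: block, fetched: block }.
// The "leading edge" is whichever of processed/fetched is further along, so a chain
// whose committed block is frozen between batch commits still shows movement.
function progress(prev, curr, head) {
  const seconds = (curr.at - prev.at) / 1000;
  const leadPrev = Math.max(prev.processed, prev.fetched);
  const leadCurr = Math.max(curr.processed, curr.fetched);
  const rate = seconds > 0 ? (leadCurr - leadPrev) / seconds : 0; // blocks per second
  const remaining = Math.max(0, head - leadCurr);
  const etaSeconds = rate > 0 ? remaining / rate : null; // null means "unknown"
  const fetchedAhead = Math.max(0, curr.fetched - curr.processed);
  return { rate, remaining, fetchedAhead, etaSeconds };
}
```

With this shape, a chain whose `processed` value is unchanged between two samples but whose `fetched` value advanced reports a positive rate instead of reading as stuck.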
…api RFC #163 Green 6-chain consolidated belt fully backfilled — all chains at head, remaining=0. KF-015 heap fix held through the dense Base/Berachain regions. Score raw-data direction filed as score-api#163 (watermark ETL; S3 promote.sh must publish a current-belt DATABASE_URL for raw consumers). Ready for Session 5: certify + swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…zero-downtime) Gate (bd-umw.6/.7): promotion-gate.js EXPANSION-mode (green ≥ blue non-lossy for shared entities/chains) + Part-4 raw-L1 eth_getLogs ground-truth (golden-tx identity, empty-200=GAP per KF-012) + live makeFetchSnapshot; independent EXPECTED_CHAINS completeness (no self-attestation); redactUrl credential-leak guard. 54/54 tests. Cross-model review+audit caught+fixed 2 fail-closed blockers (circular requiredChains; non-emptiness-only Part-4) + 1 secrets HIGH (URLs in the committed report) — all proven fail-closed live. Swap (bd-c09.1/.4): scripts/promote.sh — sole BELT_UPSTREAM writer (R-D), runs the gate as a NON-SKIPPABLE precondition (fail-closed), --rollback (no gate, R-A), --dry-run, Score #163 signal (configurable). Caddyfile admin localhost-only (Option B §7.4, committed/undeployed). bats 6/6. Verified live: green certified (expansion PASS), flipped blue→green at ZERO downtime (0 5xx across swap+rollback+re-promote, 439 polls), rollback exercised. Green serves all 6 chains via the alias; blue kept hot for the soak window. G1+G3+G4+R-A proven. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
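A credential guard of the kind named `redactUrl` above might look like the following. The body is an assumption, not the repository's implementation: it takes the conservative view that userinfo, path, and query can all carry secrets (RPC providers often put API keys in the path) and keeps only the scheme and host.

```javascript
// Hypothetical sketch of a URL redactor for committed reports.
// Keeps scheme + host; drops userinfo, path, and query, any of which may hold a secret.
function redactUrl(raw) {
  try {
    const u = new URL(raw);
    return `${u.protocol}//${u.host}/[redacted]`;
  } catch {
    // Never echo an unparseable string back: it could still contain a credential.
    return "[unparseable url]";
  }
}
```

Returning a fixed marker for unparseable input keeps the guard fail-closed: a malformed value is never copied into the report.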
if (process.env.GREEN_DB_URL) return process.env.GREEN_DB_URL;
if (!existsSync(TOKEN_FILE)) throw new Error(`set GREEN_DB_URL, or put a Railway project token at ${TOKEN_FILE}`);
const tok = readFileSync(TOKEN_FILE, "utf8").trim();
const out = execSync(`railway variables --service ${PG_SERVICE} --json`, {
res.end(JSON.stringify(data));
} catch (e) {
res.writeHead(500, { "content-type": "application/json" });
res.end(JSON.stringify({ error: String(e.message || e) }));
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0c6274aad
try { schema = readFileSync(schemaPath, "utf8"); } catch { schema = ""; }
return { chainMeta, counts, schema };
Fetch schema per deployment before superset check
makeFetchSnapshot() ignores the GraphQL endpoint for schema data and always reads schema.graphql from disk, so Part 3 compares the same local schema for both blue and green. That means the gate can report PASS even if the deployed green schema removed/changed fields relative to blue, which defeats the additive-safety check and can allow a breaking swap.
const greenUrl = env.GREEN_GRAPHQL_URL || blueUrl; // self-parity when green unset
const mode = env.PROMOTION_MODE === "expansion" ? "expansion" : "parity";
Require GREEN_GRAPHQL_URL for real promotion runs
Defaulting greenUrl to blueUrl creates a fail-open path where a misconfigured run validates blue against itself and can still pass, even though green was never checked. Because promote.sh treats this gate as the non-skippable precondition before flipping BELT_UPSTREAM, this can promote an unvalidated green deployment in any context where the extra expansion guards are not forcing failure.
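One way to close this fail-open path is to make self-parity an explicit opt-in instead of a default. The sketch below is illustrative only: `resolveEndpoints` is a hypothetical helper, and the `SELF_PARITY` variable is an assumed name for a test-only switch, not something the gate is known to define.

```javascript
// Hypothetical sketch: resolve gate endpoints without a silent blue-vs-blue fallback.
function resolveEndpoints(env) {
  const blueUrl = env.BLUE_GRAPHQL_URL;
  if (!blueUrl) throw new Error("BLUE_GRAPHQL_URL is required");

  // Self-parity is only reachable through an explicit, clearly named switch (assumed name).
  if (env.SELF_PARITY === "1") {
    return { blueUrl, greenUrl: blueUrl, selfParity: true };
  }

  const greenUrl = env.GREEN_GRAPHQL_URL;
  if (!greenUrl) throw new Error("GREEN_GRAPHQL_URL is required for a real promotion run");
  if (greenUrl === blueUrl) {
    throw new Error("GREEN_GRAPHQL_URL must differ from BLUE_GRAPHQL_URL");
  }
  return { blueUrl, greenUrl, selfParity: false };
}
```

With this shape, a run that forgets to set the green endpoint throws before any comparison happens, so it cannot report PASS.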
zkSoju left a comment
Summary
Verdict: REQUEST_CHANGES. The PR shows strong operational intent around zero-downtime promotion, but two gate checks currently pass by construction: blue-vs-blue endpoint parity and local-vs-local schema comparison. For a belt promotion system, the decisive question is not whether the script is elegant; it is whether it can refuse a beautiful lie.
Promotion gates are production boundary objects. Netflix Hystrix taught the same lesson at service-call scale: the guardrail matters most when the happy path is persuasive. In Loa terms, this is kaironic time: knowing when to stop the swap is part of the craft.
Findings
{
"schema_version": 1,
"findings": [
{
"id": "F1",
"title": "Promotion can pass without a green endpoint",
"severity": "HIGH",
"category": "safety",
"file": "scripts/promotion-gate.js:478",
"description": "GREEN_GRAPHQL_URL defaults to BLUE_GRAPHQL_URL, so a real promote can accidentally run blue-vs-blue self-parity and pass without validating the green deployment. scripts/promote.sh does not require GREEN_GRAPHQL_URL before flipping BELT_UPSTREAM to green.",
"suggestion": "In expansion mode or when invoked from promote.sh, require GREEN_GRAPHQL_URL to be explicitly set and different from BLUE_GRAPHQL_URL. Keep self-parity behind an explicit SELF_PARITY=1 or test-only mode.",
"confidence": 0.95,
"faang_parallel": "Google SRE launch checklists require validating the actual candidate, not an alias of production.",
"metaphor": "This is checking the same passport twice and declaring two travelers cleared.",
"teachable_moment": "Fail-closed gates must make the dangerous default impossible. Test conveniences need explicit names so they cannot leak into live operations.",
"connection": "This is the bridge loop at the deployment boundary: blue and green must meet as distinct witnesses before the hounfour can bless the swap."
},
{
"id": "F2",
"title": "Live schema gate compares the same local schema for blue and green",
"severity": "HIGH",
"category": "correctness",
"file": "scripts/promotion-gate.js:386",
"description": "makeFetchSnapshot reads schema.graphql from disk for both blue and green snapshots, so Part 3 passes by construction during live runs and cannot detect that the deployed green Hasura schema removed or changed fields relative to blue.",
"suggestion": "Fetch each deployment's actual GraphQL schema via introspection or accept separate BLUE_SCHEMA_PATH and GREEN_SCHEMA_PATH inputs. Fail closed if either schema cannot be retrieved.",
"confidence": 0.9,
"faang_parallel": "Amazon service teams rely on live contract tests because repository schemas can diverge from deployed reality.",
"metaphor": "It is comparing two printed menus while the kitchen may be serving something else.",
"teachable_moment": "A compatibility gate must observe the deployed artifact, not the intended artifact. Otherwise it certifies documentation, not behavior.",
"connection": "This is schema-is-not-the-contract in practice: the contract includes the live endpoint and its invariants."
},
{
"id": "F3",
"title": "Schema parser misses common GraphQL directives with arguments",
"severity": "MEDIUM",
"category": "correctness",
"file": "scripts/promotion-gate.js:175",
"description": "parseSchema only allows bare directives between the type name and body. A declaration such as `type Action @entity(name: \"actions\") { ... }` is not parsed, which can omit blue types from the superset comparison and hide breaking removals.",
"suggestion": "Use a GraphQL parser package already available in the project, or update the parser to handle directives with arguments. Add tests with type and enum directives containing arguments.",
"confidence": 0.82,
"faang_parallel": "Facebook's GraphQL ecosystem standardized AST parsing because regex-level schema handling misses valid language constructs.",
"metaphor": "This is a customs form that only recognizes middle names if they have no punctuation.",
"teachable_moment": "When a domain has a grammar, use the grammar. Parsers buy correctness across edge cases future maintainers will not remember.",
"connection": "The gate should behave like a cheval: the vessel persists even as schema syntax arrives in different valid forms."
},
{
"id": "F4",
"title": "Expansion mode does not fail on unexpected green-only chains",
"severity": "MEDIUM",
"category": "safety",
"file": "scripts/promotion-gate.js:104",
"description": "checkBlockHeights records every green-only chain as deferred in expansion mode, even when that chain is not in EXPECTED_CHAINS. This can allow a misconfigured green deployment to include an unintended extra chain without failing the promotion gate.",
"suggestion": "When expectedChains is provided, fail if green contains any chain not present in blue and not listed in expectedChains.",
"confidence": 0.72,
"faang_parallel": "Kubernetes admission control commonly distinguishes allowed drift from unknown drift; unspecified resources are rejected.",
"metaphor": "Expansion mode should open named doors, not leave the whole building unlocked.",
"teachable_moment": "Allowlist semantics are strongest when unexpected additions fail loudly. Deferred should mean planned, not merely unrecognized.",
"connection": "This is Loa room discipline: allowed inputs must be explicit, and forbidden context should not sneak in under exploration."
},
{
"id": "F5",
"title": "Shell command uses unescaped service name",
"severity": "LOW",
"category": "security",
"file": "scripts/sync-dashboard.cjs:29",
"description": "GREEN_PG_SERVICE is interpolated into an execSync shell command. Although this is a local dashboard script, a malformed environment value can execute additional shell syntax.",
"suggestion": "Use execFileSync with argument arrays, for example `execFileSync(\"railway\", [\"variables\", \"--service\", PG_SERVICE, \"--json\"], ...)`.",
"confidence": 0.78,
"faang_parallel": "Chrome and Android build tooling avoid shell interpolation for user-controlled arguments for the same reason.",
"metaphor": "A shell string is a shared microphone; argument arrays give each word its own channel.",
"teachable_moment": "Even local scripts become production habits. Prefer APIs that make injection structurally impossible.",
"connection": "Operational tooling is part of the hounfour too; every helper script should preserve the same boundary discipline."
},
{
"id": "F6",
"title": "Promotion path has focused fail-closed test coverage",
"severity": "PRAISE",
"category": "testing",
"file": "test/promote.bats:38",
"description": "The promote.sh tests explicitly verify that a failing gate produces no Railway writes and that rollback does not run the gate. These are the right invariants for the highest-risk operational path.",
"suggestion": "Keep these tests as required checks for any future promotion script changes.",
"confidence": 0.95,
"faang_parallel": "Google SRE postmortem practice turns outage lessons into regression checks on the exact control plane path.",
"metaphor": "This is testing that the emergency brake stops the train before testing the upholstery.",
"teachable_moment": "The most valuable tests assert irreversible side effects do not happen under failure.",
"connection": "This is kaironic time encoded as test coverage: the system knows when not to move."
},
{
"id": "F7",
"title": "Raw-L1 check treats empty logs as a hard failure",
"severity": "PRAISE",
"category": "correctness",
"file": "scripts/promotion-gate.js:251",
"description": "checkRawL1 correctly distinguishes an empty eth_getLogs response from a successful proof and fails closed, which directly addresses silent log-loss risk on new chains.",
"suggestion": "Retain this behavior and require configured golden samples for every expansion chain.",
"confidence": 0.95,
"faang_parallel": "Netflix Hystrix treated absence of signal as a first-class failure mode, not a quiet success.",
"metaphor": "No smoke from the signal fire is not proof the mountain is safe.",
"teachable_moment": "Empty responses need domain meaning. In promotion gates, silence is usually evidence to stop.",
"connection": "This is strong bridge-loop behavior: the new chain must speak before it is trusted."
}
]
}
Callouts
[Operational] The promotion tests are pointed at the right invariants: no Railway writes after a failed gate, and rollback remaining independent from promotion certification. That is mature control-plane thinking.
[Correctness] The Raw-L1 empty-log handling is excellent because it refuses ambiguity. Keep that posture across the schema and endpoint checks: every gate should prove the candidate, not merely avoid crashing.
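The posture praised here, where an empty `eth_getLogs` result is a gap and not a pass, can be sketched as a small classifier. This is an illustration under assumptions: `classifyRawL1` is a hypothetical name, the status strings are invented for the example, and the input is taken to be an already-parsed JSON-RPC response body.

```javascript
// Hypothetical sketch: classify an eth_getLogs JSON-RPC response against a golden tx hash.
// An empty result is treated as a GAP (no evidence), never as success.
function classifyRawL1(rpcResponse, goldenTxHash) {
  if (!rpcResponse || rpcResponse.error) {
    return { status: "FAIL", reason: "rpc error" };
  }
  const logs = rpcResponse.result;
  if (!Array.isArray(logs) || logs.length === 0) {
    return { status: "GAP", reason: "empty logs" };
  }
  // Non-emptiness alone is not proof: the known golden transaction must be present.
  const wanted = goldenTxHash.toLowerCase();
  const found = logs.some((log) => String(log.transactionHash).toLowerCase() === wanted);
  return found
    ? { status: "PASS" }
    : { status: "FAIL", reason: "golden tx not in logs" };
}
```

Only the last branch can return PASS, so every other outcome (transport error, empty array, or logs that miss the golden transaction) stops the promotion.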
Reviewed with: unknown v0.0.0
…fail-closed gaps BB review (codex-headless) on PR #16 surfaced 2 HIGH + 2 MED. Fixed the actionable ones (all strictly ADD fail-closed conditions — can't reduce safety): - F1 (HIGH, safety): a real promotion could run blue-vs-blue self-parity and pass — GREEN_GRAPHQL_URL defaulted to BLUE. Now expansion mode + promote.sh REFUSE unless GREEN_GRAPHQL_URL is set AND != BLUE_GRAPHQL_URL. (promotion-gate.js main + promote.sh) - F3 (MED, correctness): parseSchema/parseEnums now handle directives WITH arguments (`@entity(name:"x")`) so such a type isn't silently dropped from the superset check. - F4 (MED, safety): expansion now FAILS a green-only chain not in EXPECTED_CHAINS (unplanned drift), instead of silently deferring it. Deferred (documented): F2 (HIGH) live-schema introspection — both belts deploy the identical schema.graphql + the counts query is the live entity-presence check; zero-dep introspection→SDL is the real fix, low-risk in the identical-schema reality. F5 (LOW) sync-dashboard.cjs execSync → execFileSync (tool script, env-trusted, not the promotion path). Tests: 59 vitest + 9 bats green. Hardened gate re-verified: live green still EXPANSION PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
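The F4 rule (a green-only chain must be listed in `EXPECTED_CHAINS` or the gate fails) amounts to an allowlist check over chain IDs. The following is a rough sketch, not the gate's code: `checkChainSet` is a hypothetical name and the failure strings are invented for the example.

```javascript
// Hypothetical sketch of the chain-set rule in expansion mode.
// - every blue chain must still exist in green (non-lossy)
// - every expected chain must exist in green (completeness, no self-attestation)
// - any green chain that is neither in blue nor expected is unplanned drift -> fail
function checkChainSet(blueChains, greenChains, expectedChains) {
  const blue = new Set(blueChains);
  const green = new Set(greenChains);
  const expected = new Set(expectedChains);
  const failures = [];
  for (const id of blue) {
    if (!green.has(id)) failures.push(`chain ${id} missing from green`);
  }
  for (const id of expected) {
    if (!green.has(id) && !blue.has(id)) failures.push(`expected chain ${id} missing from green`);
  }
  for (const id of green) {
    if (!blue.has(id) && !expected.has(id)) failures.push(`unplanned chain ${id} in green`);
  }
  return { pass: failures.length === 0, failures };
}
```

Using the chain IDs from this PR, a blue set of `[1, 10, 8453, 80094]` with green and expected sets that add `42161` and `7777777` passes, while any further chain appearing only in green fails.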
Bridgebuilder findings addressed (commit
All fixes only add fail-closed conditions. Tests: 59 vitest + 9 bats green; hardened gate re-verified, and live green still reports EXPANSION PASS. Thanks for the sharp review; F1 was a real hole. 🙏
What
Consolidates the sonar belt to a 6-chain footprint and ships the blue→green promotion machinery that put it live at zero downtime.
- `config.yaml`: ETH 1 · OP 10 · Base 8453 · Arbitrum 42161 [new] · Berachain 80094 · Zora 7777777 [new], 41 contracts.
- `Dockerfile.belt`: `BELT_CONFIG` build-arg (blue unchanged).
- `scripts/promotion-gate.js` (54/54 tests): EXPANSION-mode. Shared entities must satisfy `green ≥ blue − floor` (non-lossy, green MAY exceed); new chains have no blue baseline, so they are verified by Part-4 raw-L1 `eth_getLogs` ground-truth (golden-tx identity; empty-200 = GAP per KF-012). Independent `EXPECTED_CHAINS` completeness (green can't self-attest). Fail-closed throughout. `redactUrl` credential guard.
- `scripts/promote.sh` (bats 6/6): the only swap path. Runs the gate as a non-skippable precondition (fail-closed), sole `BELT_UPSTREAM` writer, `--rollback` (no gate, blue stays hot), `--dry-run`.
- `Caddyfile`: admin localhost-only (Option B §7.4).

Verified live (2026-05-22)
Green now serves all 6 chains via the stable alias — 0 consumer config changes (G3). Gate-gated (expansion PASS), zero-downtime (G1), rollback exercised (G4/R-A). Cross-model review+audit caught and fixed 2 fail-closed blockers + 1 secrets HIGH before going live.
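The expansion-mode count rule from the description (`green ≥ blue − floor` for shared entities) can be written as a short check. This is a sketch under assumptions: `checkExpansionCounts` is a hypothetical name, counts are taken as plain objects keyed by entity name, and `floor` is treated as a single tolerance value, which may not match how the gate actually derives it.

```javascript
// Hypothetical sketch: non-lossy count check for entities that blue already serves.
// Green may exceed blue; it may fall short only by at most `floor`.
function checkExpansionCounts(blueCounts, greenCounts, floor = 0) {
  const failures = [];
  for (const [entity, blueN] of Object.entries(blueCounts)) {
    const greenN = greenCounts[entity];
    if (typeof greenN !== "number") {
      failures.push(`${entity}: missing in green`); // fail closed on absent data
      continue;
    }
    if (greenN < blueN - floor) {
      failures.push(`${entity}: green ${greenN} < blue ${blueN} - floor ${floor}`);
    }
  }
  return { pass: failures.length === 0, failures };
}
```

Entities that exist only in green are ignored by this check, since they have no blue baseline; per the description, those are covered by the raw-L1 ground-truth step instead.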
Deferred (post-merge, soak-gated)
- `promote.sh` follow-up: stop default-publishing the Score #163 signal to the gateway (cred over-exposure); wire per #163 (score-side).

🤖 Generated with Claude Code