Skip to content

fix(migrations): cold-start robustness in apply-migrations + v0.29.1 orchestrator#760

Open
knee5 wants to merge 2 commits intogarrytan:masterfrom
knee5:clevin/migration-cold-start-fixes
Open

fix(migrations): cold-start robustness in apply-migrations + v0.29.1 orchestrator#760
knee5 wants to merge 2 commits intogarrytan:masterfrom
knee5:clevin/migration-cold-start-fixes

Conversation

@knee5
Copy link
Copy Markdown
Contributor

@knee5 knee5 commented May 9, 2026

Summary

Two bugs found during clevin's gbrain v0.22 → v0.30.1 migration on 2026-05-08 that together cause gbrain apply-migrations --yes to fail from any cold-start state (pre-v0.28 or pre-v0.30 schema). Both fixes make the command "just work" in one shot without flags or workarounds, completing the operational hardening intent stated in v0.30.1's release notes.


Bug 1 — v0.29.1 phaseBBackfill / phaseCVerify missing engine.connect()

File: src/commands/migrations/v0_29_1.ts

Symptom: During gbrain apply-migrations --yes on a schema at v29 or below, Phase B fails immediately with: "No database connection". Phase C has the same bug.

Root cause: Both phases called createEngine(config) to instantiate the engine but never called await engine.connect(engineConfig). createEngine returns an unconnected instance; connect is the separate lifecycle call that establishes the pool. Compare the v0.28.0 orchestrator (the established pattern) — it calls await engine.connect(engineConfig) in its shared setup block right after createEngine.

Repro:

  1. Schema at v29 (or any pre-v30 version)
  2. gbrain apply-migrations --yes
  3. Phase A succeeds (DDL via gbrain init --migrate-only); Phase B (phaseBBackfill) throws "No database connection"

Workaround used (2026-05-08): gbrain reindex-frontmatter directly (examined=2286, updated=2286, fallback=1616, 467 s), then manually appended a complete row to ~/.gbrain/migrations/completed.jsonl.

Patch: Added await engine.connect(engineConfig) immediately after createEngine(engineConfig) in both phaseBBackfill and phaseCVerify, plus try/finally + engine.disconnect() for resource cleanup — matching the v0.28.0 pattern exactly.


Bug 2 — v0.28.0 Phase A aborts instead of applying schema on cold-start

File: src/commands/migrations/v0_28_0.ts

Symptom: gbrain apply-migrations --yes on a brain upgrading from v0.22 (or any pre-v0.28 version) fails with: "expected schema version >= 38 (takes + access_tokens.permissions); got <N>". The command exits 1 before doing any work.

Root cause: v0.28.0 Phase A was a pure verify: it checked schema_version >= 38 and returned status: 'failed' if behind. The design comment says "the schema runner does the actual DDL during gbrain upgrade/initSchema" — but on a direct install upgrade (not via gbrain upgrade), the DDL hasn't run yet.

Diagnosis: The --force-schema flag in apply-migrations.ts calls runMigrations(eng) in-process to resolve exactly this drift. The two-step workaround (--force-schema --yes then --yes) works because --force-schema brings the DDL to LATEST_VERSION before the orchestrator chain starts.

Patch: When Phase A detects schema < 38, call await engine.initSchema() inline (the engine is already connected via the v0.28.0 shared orchestrator setup block) before the post-condition table check. Re-reads the version after apply to surface any partial DDL failure clearly. This makes apply-migrations --yes succeed from any starting schema version in one shot.

Workaround used (2026-05-08):

gbrain apply-migrations --force-schema --yes
gbrain apply-migrations --yes

Testing

  • Added test/migration-orchestrator-v0_29_1.test.ts — registry, phase shape, and dry-run assertions following the migration-orchestrator-v0_21_0.test.ts style.
  • All existing unit tests pass: bun test test/migration-orchestrator-v0_29_1.test.ts test/migrations-registry.test.ts test/apply-migrations.test.ts → 32 pass, 0 fail.
  • Build note: bun build --compile fails on missing optional packages (heic-decode, @jsquash/png) in the checkout — pre-existing, not introduced by this PR.

Discovered during

clevin's gbrain v0.22.2 → v0.30.1 migration on 2026-05-08. These fixes complete the intent of v0.30.1's "operational hardening — make upgrades just work on Supabase" release note: with both patches, gbrain apply-migrations --yes succeeds from any pre-v0.30 starting state without flags or manual workarounds.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

knee5 and others added 2 commits May 8, 2026 20:14
…Verify

Both phases created an engine via createEngine() but never called
engine.connect(), causing every executeRaw / backfillEffectiveDate call
to fail with "No database connection" on cold-start (v29 or earlier
schema upgrading to v30+).

Pattern fix: follow the v0.28.0 orchestrator convention — capture the
EngineConfig first, call createEngine(engineConfig), then immediately
await engine.connect(engineConfig) before any engine method. Added
try/finally + engine.disconnect() in both phases to mirror v0.28.0's
resource-cleanup shape.

Repro:
  1. Schema at v29 (or pre-v30)
  2. gbrain apply-migrations --yes
  3. Phase B fails: "No database connection"

Workaround used during clevin's v0.22→v0.30.1 migration on 2026-05-08:
  gbrain reindex-frontmatter (examined=2286, updated=2286, fallback=1616,
  467 s), then manually appended a `complete` row to
  ~/.gbrain/migrations/completed.jsonl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase A previously asserted schema version >= 38 but returned status
'failed' if the schema was behind. On a direct upgrade from v0.22 (or
any pre-v0.28 version) without first running `gbrain upgrade`, the schema
version is below 38 and the entire orchestrator chain aborts, requiring
the user to run `gbrain apply-migrations --force-schema --yes` first and
then re-run `gbrain apply-migrations --yes`.

Fix: when Phase A detects schema < 38, call `engine.initSchema()` inline
(the engine is already connected at this point via the v0.28.0 shared
orchestrator setup) before proceeding to the post-condition table check.
Re-reads the version after apply to give a clear actionable error if DDL
fails partway through. This makes `gbrain apply-migrations --yes` succeed
from any starting schema version in one shot.

Also adds test/migration-orchestrator-v0_29_1.test.ts covering the
v0.29.1 orchestrator registration, phase shape, and dry-run behavior —
matches the style of migration-orchestrator-v0_21_0.test.ts.

Repro for Bug 2:
  1. Fresh install of gbrain v0.30.1 on a brain previously at v0.22
  2. gbrain apply-migrations --yes
  3. v0.28.0 Phase A fails: "expected schema version >= 38; got <N>"

Workaround used during clevin's v0.22→v0.30.1 migration on 2026-05-08:
  gbrain apply-migrations --force-schema --yes
  gbrain apply-migrations --yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant