v1.2.0: within-stage crash-resume + graceful budget halt #1
Merged
RichardHightower merged 12 commits into main on Apr 21, 2026
Writes to <path>.tmp then os.replace() — POSIX-atomic rename so a SIGKILL between the two steps never leaves a truncated file on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
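The write-then-rename pattern described in this commit can be sketched as follows. This is a minimal illustration, not the project's actual `atomic_write`: the `.tmp` suffix handling and the `fsync` call are assumptions added for the example.

```python
import os
from pathlib import Path


def atomic_write(path: Path, content: str) -> None:
    """Write content to a sibling .tmp file, then rename it over path."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(content)
        fh.flush()
        # Assumption: flushing to disk before the rename; the real helper
        # may or may not do this.
        os.fsync(fh.fileno())
    # os.replace is an atomic rename on POSIX when both paths are on the
    # same filesystem, so readers see either the old file or the new one.
    os.replace(tmp, path)
```

If the process is killed before `os.replace()`, only the `.tmp` sibling is incomplete and the target file keeps its previous contents.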
Shape changes dict[str, str] -> dict[str, dict[str, str]]. Old state files migrate in-memory with empty input_hash, which forces reprocess (same as pre-v1.2 behavior). Also introduces module-level state_lock for concurrent save safety in later tasks. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
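A rough sketch of the in-memory migration follows. It assumes the old `dict[str, str]` shape mapped artifact id to output path, which the commit message does not state explicitly; `migrate_artifact_index` and `is_checkpointed` are hypothetical names used only for illustration.

```python
from typing import Any


def migrate_artifact_index(raw: dict[str, Any]) -> dict[str, dict[str, str]]:
    """Upgrade an old-shape index to the v1.2 shape without touching disk."""
    migrated: dict[str, dict[str, str]] = {}
    for artifact_id, value in raw.items():
        if isinstance(value, str):
            # Old shape (assumed id -> path): no input hash was recorded.
            migrated[artifact_id] = {"path": value, "input_hash": ""}
        else:
            # Already in the new shape; copy it unchanged.
            migrated[artifact_id] = dict(value)
    return migrated


def is_checkpointed(
    index: dict[str, dict[str, str]], artifact_id: str, current_hash: str
) -> bool:
    """True only when a recorded hash equals the current input hash."""
    entry = index.get(artifact_id)
    return entry is not None and entry["input_hash"] == current_hash
```

Because an empty `input_hash` can never equal a real digest, every migrated entry is reprocessed once, which matches the pre-v1.2 behavior the commit describes.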
SIGKILL mid-save no longer leaves truncated .designdoc-state.json. The module-level state_lock (added in prior commit) serializes the 50-way concurrent-save test to a deterministic final shape. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
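To illustrate why the lock matters, here is a small sketch of 50 concurrent savers sharing one state file. The function names are hypothetical, and `asyncio.sleep(0)` stands in for any await point inside the real save path. The PR uses a module-level lock; this sketch creates the lock inside the running loop so it also works on Python versions that bind locks to a loop at construction time.

```python
import asyncio
import json
import os
from pathlib import Path


async def record_artifact(
    lock: asyncio.Lock, index: dict, state_path: Path, artifact_id: str
) -> None:
    await asyncio.sleep(0)  # stand-in for the LLM call
    async with lock:
        # Update and persist as one critical section so no save is lost.
        index[artifact_id] = {"path": f"{artifact_id}.md", "input_hash": "h"}
        tmp = state_path.with_name(state_path.name + ".tmp")
        tmp.write_text(json.dumps(index, sort_keys=True), encoding="utf-8")
        await asyncio.sleep(0)  # simulated await between write and rename
        os.replace(tmp, state_path)


async def run_concurrent_saves(state_path: Path, n: int = 50) -> dict:
    lock = asyncio.Lock()
    index: dict = {}
    await asyncio.gather(
        *(record_artifact(lock, index, state_path, f"file:{i}") for i in range(n))
    )
    return json.loads(state_path.read_text(encoding="utf-8"))
```

Without the lock, two tasks could write the shared `.tmp` path and race on the rename; with it, the last save always contains every completed artifact.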
Each file summary updates artifact_index + atomically rewrites stage2_summaries.json under state_lock. Mid-stage crash + rerun hits LLM only for files not already checkpointed. Also prunes deleted files from the aggregated JSON and unconditionally persists at stage end so a zero-LLM-call path still reflects the current signature set. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Each class doc updates artifact_index with its composite input_hash (source SHA + canonical signature JSON) and writes the markdown via atomic_write. Mid-stage crash + rerun skips every class whose inputs still match the checkpoint. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
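One plausible way to build such a composite hash is sketched below. The use of SHA-1 for both parts, the newline separator, and the name `class_input_hash` are assumptions; the commit only says the hash combines the source SHA with canonical signature JSON.

```python
import hashlib
import json


def class_input_hash(source_text: str, signature: dict) -> str:
    """Combine a source digest with a canonical JSON form of the signature."""
    source_sha = hashlib.sha1(source_text.encode("utf-8")).hexdigest()
    # sort_keys plus fixed separators make the JSON independent of dict
    # insertion order and whitespace, so equal signatures hash equally.
    canonical = json.dumps(signature, sort_keys=True, separators=(",", ":"))
    combined = f"{source_sha}\n{canonical}".encode("utf-8")
    return hashlib.sha1(combined).hexdigest()
```

Canonicalizing the signature is what keeps the checkpoint stable across runs: a reordered dict yields the same hash, while any change to the source or the signature yields a different one.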
Each package README is now written via atomic_write and checkpointed to artifact_index["package:<pkg>"] with its composite input_hash (SHA1 of sorted class-doc contents). A mid-stage crash + rerun skips any package whose class-doc inputs still match the checkpoint. The v1.1 rollup_hashes cross-run skip coexists unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Each class doc's mermaid diagram is now written via atomic_write and checkpointed to artifact_index["mermaid:<rel-path>"] with its input_hash (SHA1 of the class doc body minus any existing Diagram section). A mid-stage crash + rerun skips any class doc whose input_hash still matches the checkpoint. The v1.1 rollup_hashes cross-run skip coexists unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
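Hashing the class doc body without its Diagram section presumably keeps the stage from invalidating its own checkpoint when it writes the diagram back into the doc. A minimal sketch, assuming the section starts with a `## Diagram` heading and runs to the next `## ` heading or the end of the file (the actual heading format is not given in the commit):

```python
import hashlib
import re

# Assumed layout: a "## Diagram" heading followed by its content, ending at
# the next level-2 heading or at the end of the document.
DIAGRAM_SECTION = re.compile(
    r"^## Diagram\n.*?(?=^## |\Z)", re.DOTALL | re.MULTILINE
)


def mermaid_input_hash(doc_body: str) -> str:
    """SHA-1 of the class doc with any existing Diagram section removed."""
    stripped = DIAGRAM_SECTION.sub("", doc_body)
    return hashlib.sha1(stripped.encode("utf-8")).hexdigest()
```

With this shape, a doc hashes the same before and after its diagram is appended, so a rerun sees a matching checkpoint.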
Each dep's research result is now checkpointed to artifact_index["dep:<name>"] with a per-dep input_hash (SHA1 of name + pinned + source) and the serialised row JSON. On resume, deps whose entry matches the current hash AND whose TECH_DEBT.md still exists on disk are skipped with zero LLM calls. The partial ledger is rewritten atomically after every completed dep so a mid-stage crash leaves a consistent (partial) TECH_DEBT.md. The v1.1 whole-stage rollup_hashes skip coexists unchanged. Updated test_stage6_incremental to reflect v1.2 per-dep skip semantics (unchanged deps are reused even when the overall manifest hash changes). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
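The two-part skip condition (hash match AND ledger present on disk) could look roughly like this. The dep dictionary keys, the JSON-list encoding of the hash input, and the function names are assumptions made for the sketch.

```python
import hashlib
import json
from pathlib import Path


def dep_input_hash(name: str, pinned: str, source: str) -> str:
    """SHA-1 over name + pinned + source (encoding is an assumption)."""
    payload = json.dumps([name, pinned, source])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def can_skip_dep(
    index: dict[str, dict[str, str]], dep: dict[str, str], ledger_path: Path
) -> bool:
    """Skip only if the checkpoint matches and TECH_DEBT.md still exists."""
    entry = index.get(f"dep:{dep['name']}")
    if entry is None:
        return False
    current = dep_input_hash(dep["name"], dep["pinned"], dep["source"])
    return entry["input_hash"] == current and ledger_path.exists()
```

Checking for the ledger file guards against a user deleting `TECH_DEBT.md` between runs: the index alone would claim the work is done while the output is gone.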
Budget exceeded mid-stage now sets state.halted_on_budget, marks the stage FAILED, and orchestrator.run() returns cleanly (does not re-raise). CLI checks the flag after anyio.run returns, prints 'budget exhausted at $X / cap $Y. Run `designdoc resume --budget <new-cap>` to continue', and exits 0 per v1.2 spec — the pipeline is resumable, not crashed. CLI also clears halted_on_budget when the user reruns with --budget so a successful rerun doesn't leave a stale flag. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
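The control flow described here can be sketched in a simplified, synchronous form. `BudgetExceededError` and `halted_on_budget` come from the commit; the other fields, the `run_pipeline` and `halt_message` helpers, and the two-decimal dollar formatting are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class BudgetExceededError(Exception):
    def __init__(self, spent: float, cap: float) -> None:
        super().__init__(f"spent {spent} exceeds cap {cap}")
        self.spent = spent
        self.cap = cap


@dataclass
class PipelineState:
    halted_on_budget: bool = False
    spent: float = 0.0
    cap: float = 0.0
    stage_status: dict[str, str] = field(default_factory=dict)


def run_pipeline(
    state: PipelineState, stages: list[tuple[str, Callable[[], None]]]
) -> None:
    """Run stages in order; on budget exhaustion record it and return."""
    for name, stage in stages:
        try:
            stage()
        except BudgetExceededError as exc:
            state.halted_on_budget = True
            state.stage_status[name] = "FAILED"
            state.spent, state.cap = exc.spent, exc.cap
            return  # return cleanly instead of re-raising
        state.stage_status[name] = "DONE"


def halt_message(state: PipelineState) -> Optional[str]:
    """Message the CLI would print before exiting 0, or None if not halted."""
    if not state.halted_on_budget:
        return None
    return (
        f"budget exhausted at ${state.spent:.2f} / cap ${state.cap:.2f}. "
        "Run `designdoc resume --budget <new-cap>` to continue"
    )
```

Keeping the message formatting in the CLI layer means the orchestrator only records facts in state, and a later rerun with `--budget` can clear the flag before starting.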
The starting-log line for stages that have checkpointed artifacts now surfaces the resume context:

[N/9] stage class_docs: 40 artifacts checkpointed

Stages with no prior checkpoints keep the legacy "starting" message. _id_belongs_to_stage maps stage names to the actual artifact_id conventions (file:, package:, mermaid:, dep:, plus the Stage 3 "<path>::<Class>" no-prefix form).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
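A hedged sketch of how the id-to-stage mapping and the log line could fit together. Only `file_analysis` and `class_docs` are stage names that appear in this PR; the other stage names, the assignment of prefixes to stages, and the exact wording of the legacy "starting" line are assumptions.

```python
# Hypothetical stage names for every entry except file_analysis.
_PREFIX_BY_STAGE = {
    "file_analysis": "file:",
    "package_readmes": "package:",
    "mermaid": "mermaid:",
    "tech_debt": "dep:",
}


def id_belongs_to_stage(artifact_id: str, stage: str) -> bool:
    """Decide stage ownership from the artifact id's naming convention."""
    if stage == "class_docs":
        # Stage 3 ids have no prefix and look like "<path>::<Class>".
        has_known_prefix = artifact_id.startswith(tuple(_PREFIX_BY_STAGE.values()))
        return "::" in artifact_id and not has_known_prefix
    prefix = _PREFIX_BY_STAGE.get(stage)
    return prefix is not None and artifact_id.startswith(prefix)


def starting_line(n: int, total: int, stage: str, artifact_index: dict) -> str:
    """Build the stage-start log line, surfacing checkpoints when present."""
    count = sum(1 for aid in artifact_index if id_belongs_to_stage(aid, stage))
    if count:
        return f"[{n}/{total}] stage {stage}: {count} artifacts checkpointed"
    return f"[{n}/{total}] stage {stage}: starting"
```

The prefix-less Stage 3 form is the reason a plain `startswith` check is not enough: class-doc ids must be recognized by the `::` separator and the absence of any other stage's prefix.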
Full-pipeline test that halts mid-Stage-3 via a tight budget cap, resumes with a higher cap, and verifies the load-bearing invariant of v1.2: checkpointed artifacts are reused on resume, not regenerated.

The load-bearing assertion counts Stage-3 doer calls in the resume phase: exactly 2 (class 2 + class 3) proves the class 1 checkpoint was reused. If the resume path ever regresses and re-runs checkpointed work, this count would jump to 3 and the test fails loudly.

Supporting checks:
- state.halted_on_budget True after the first run (Task 9 integration)
- Stage statuses: file_analysis DONE, class_docs FAILED mid-flight
- All 9 stages DONE after resume
- Final tree hash matches a clean cold run byte-for-byte

Plan deviation: the original plan proposed a requires_api subprocess test comparing byte-identical output to a live-API clean run. That's unreliable because LLM output is non-deterministic: two fresh runs won't produce byte-identical trees even with the same inputs. The deterministic fake SDK (pattern from test_resume.py) lets us make sharper assertions about resume behavior AND runs in CI without API costs.

Uses parallelism=1 so the budget-raise landing is predictable; with default parallelism=3 multiple doers can complete before the first raise propagates through asyncio.gather, which fuzzes the expected call counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Within-stage crash-resume + graceful budget halt. See CHANGELOG for the full list. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
Within-stage crash-resume. A pipeline killed mid-Stage-N no longer loses that stage's partial progress: the rerun skips any artifact whose input hash still matches the checkpoint. Budget-cap halts now exit gracefully and print a resume message.
Added
- Stages update `state.artifact_index` after every completed artifact, under an `asyncio.Lock`. Mid-stage crash + rerun never re-calls the LLM for already-produced artifacts.
- `designdoc.io_utils.atomic_write(path, content)` writes to a sibling `.tmp` then `os.replace()` (POSIX-atomic). Every artifact and `.designdoc-state.json` itself use it.
- `BudgetExceededError` mid-stage now sets `state.halted_on_budget=True`, marks the stage FAILED, and saves state. CLI prints "budget exhausted at $X / cap $Y. Run `designdoc resume --budget <new-cap>` to continue" with exit 0 (was exit 4).
- Stage starting-log lines surface the resume context (e.g. `[3/9] stage class_docs: 40 artifacts checkpointed`).
Changed
- `PipelineState.artifact_index` is now `dict[str, dict[str, str]]` carrying `{"path": ..., "input_hash": ...}` per artifact. Old-shape state files are migrated in-memory with empty `input_hash`, a safe fallback that forces reprocessing.
- `Orchestrator.run()` no longer re-raises `BudgetExceededError`. CLI reads `state.halted_on_budget` after the run and formats the resume message itself.
Fixed
- `state.save()` under `asyncio.gather` is now serialized by a module-level `asyncio.Lock`, preventing lost writes at `parallelism > 1`.
Test plan
- `task ci`: 95 tests passing (lint + format + unit + integration)
- `tests/unit/test_io_utils.py`: atomic_write helper
- `tests/unit/test_state_backcompat.py`: v1.1 → v1.2 artifact_index migration
- `tests/integration/test_state_concurrent_save.py`: gather-children save lock
- `tests/integration/test_stage{2,3,4,5,6}_resume.py`: per-stage checkpoint resume
- `tests/integration/test_budget_halt.py`: graceful budget halt
- `tests/integration/test_orchestrator_checkpoint_logs.py`: log observability
- `tests/integration/test_within_stage_resume_e2e.py`: full-pipeline halt+resume with deterministic fake SDK
Plan deviations
- Stage 6 artifact ids use the `dep:` prefix (plan had stale `topic:`); test seed keys use the realistic `<path>::<Class>` form.
Post-merge
After this PR is merged to main, tag `v1.2.0` on main and cut the GitHub release using the CHANGELOG v1.2.0 section as the notes body.
🤖 Generated with Claude Code