v1.2.0 — within-stage crash-resume + graceful budget halt by RichardHightower · Pull Request #1 · SpillwaveSolutions/docgen

RichardHightower · 2026-04-21T05:42:52Z

Summary

Within-stage crash-resume. A pipeline killed mid-Stage-N no longer loses that stage's partial progress: the rerun skips any artifact whose input hash still matches the checkpoint. Budget-cap halts now exit gracefully and print a message explaining how to resume.

Added

Within-stage checkpointing. Stages 2 (file summaries), 3 (class docs), 4 (package rollups), 5 (mermaid), and 6 (tech-debt topics) write to state.artifact_index after every completed artifact, under an asyncio.Lock. Mid-stage crash + rerun never re-calls the LLM for already-produced artifacts.
Atomic artifact writes. New designdoc.io_utils.atomic_write(path, content) writes to a sibling .tmp then os.replace() (POSIX-atomic). Every artifact and .designdoc-state.json itself use it.
Graceful budget halt. A BudgetExceededError raised mid-stage now sets state.halted_on_budget=True, marks the stage FAILED, and saves state. The CLI prints "budget exhausted at $X / cap $Y. Run `designdoc resume --budget <new-cap>` to continue" and exits 0 (previously exit 4).
Observability. Orchestrator stage-start log lines include the count of already-checkpointed artifacts (e.g. [3/9] stage class_docs: 40 artifacts checkpointed).
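
The checkpoint-and-skip behavior described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in (`state_lock`, `artifact_index`, `fake_llm`, `process`, `run_stage`, the `out/` paths, and the use of SHA1 for every stage); the real designdoc code may be organized differently.

```python
import asyncio
import hashlib

# Hypothetical stand-ins; the real designdoc names and signatures may differ.
state_lock = asyncio.Lock()
artifact_index: dict[str, dict[str, str]] = {}
llm_calls: list[str] = []


def input_hash(source: str) -> str:
    """Hash the artifact's input (SHA1 here is an assumption)."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


async def fake_llm(artifact_id: str, source: str) -> str:
    """Stand-in for the LLM call; records each invocation."""
    llm_calls.append(artifact_id)
    return f"summary of {artifact_id}"


async def process(artifact_id: str, source: str) -> None:
    digest = input_hash(source)
    entry = artifact_index.get(artifact_id)
    if entry is not None and entry["input_hash"] == digest:
        return  # checkpoint still matches: skip the LLM call
    await fake_llm(artifact_id, source)
    # Checkpoint after every completed artifact, serialized by the lock.
    async with state_lock:
        artifact_index[artifact_id] = {
            "path": f"out/{artifact_id}.md",
            "input_hash": digest,
        }


async def run_stage(inputs: dict[str, str]) -> None:
    await asyncio.gather(*(process(k, v) for k, v in inputs.items()))
```

Running `run_stage` twice with the same inputs calls `fake_llm` only on the first pass; changing one input causes only that artifact to be regenerated.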

Changed

PipelineState.artifact_index is now dict[str, dict[str, str]] carrying {"path": ..., "input_hash": ...} per artifact. Old-shape state files are migrated in-memory with empty input_hash — safe fallback that forces reprocessing.
Orchestrator.run() no longer re-raises BudgetExceededError. CLI reads state.halted_on_budget after the run and formats the resume message itself.

Fixed

Concurrent state.save() under asyncio.gather is now serialized by a module-level asyncio.Lock, preventing lost writes at parallelism > 1.

Test plan

Plan deviations

Task 10: Helper mapping uses real dep: prefix (plan had stale topic:); test seed keys use realistic <path>::<Class> form.
Task 11: Replaced the plan's live-API subprocess test with an integration-level test using a deterministic fake SDK. The original byte-identical-against-live-API approach was doomed by LLM non-determinism; the fake-SDK version directly asserts checkpointed artifacts are reused (counts Stage-3 doer calls in the resume phase).

Post-merge

After this PR is merged to main, tag v1.2.0 on main and cut the GitHub release using the CHANGELOG v1.2.0 section as the notes body.

🤖 Generated with Claude Code

Writes to <path>.tmp then os.replace() — POSIX-atomic rename so a SIGKILL between the two steps never leaves a truncated file on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
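
A minimal sketch of the write-then-rename pattern this commit describes. The signature `atomic_write(path, content)` comes from the PR description; the body below is an assumption, including the `fsync` call, which the source does not mention.

```python
import os
from pathlib import Path
from typing import Union


def atomic_write(path: Union[str, os.PathLike], content: str) -> None:
    """Write content to a sibling .tmp file, then rename it over the target."""
    target = Path(path)
    tmp = target.with_name(target.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(content)
        fh.flush()
        # Assumption: flush to disk before the rename; the PR does not say
        # whether the real helper does this.
        os.fsync(fh.fileno())
    # os.replace is an atomic rename on POSIX when both paths are on the
    # same filesystem, so readers see either the old or the new file.
    os.replace(tmp, target)
```

Because the temporary file is a sibling of the target, both paths sit in the same directory, which keeps the rename on one filesystem.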

Shape changes dict[str, str] -> dict[str, dict[str, str]]. Old state files migrate in-memory with empty input_hash, which forces reprocess (same as pre-v1.2 behavior). Also introduces module-level state_lock for concurrent save safety in later tasks. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
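
The in-memory migration might look roughly like the sketch below. The function name `migrate_artifact_index` is hypothetical, and it assumes the old `dict[str, str]` values were artifact paths, which the commit message implies but does not state.

```python
def migrate_artifact_index(raw: dict) -> dict[str, dict[str, str]]:
    """Upgrade an old-shape index to {"path": ..., "input_hash": ...} entries."""
    migrated: dict[str, dict[str, str]] = {}
    for artifact_id, value in raw.items():
        if isinstance(value, str):
            # Old shape (assumed to be a bare path): an empty input_hash can
            # never match a real hash, so the artifact is reprocessed.
            migrated[artifact_id] = {"path": value, "input_hash": ""}
        else:
            migrated[artifact_id] = {
                "path": value["path"],
                "input_hash": value.get("input_hash", ""),
            }
    return migrated
```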

SIGKILL mid-save no longer leaves truncated .designdoc-state.json. The module-level state_lock (added in prior commit) serializes the 50-way concurrent-save test to a deterministic final shape. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Each file summary updates artifact_index + atomically rewrites stage2_summaries.json under state_lock. Mid-stage crash + rerun hits LLM only for files not already checkpointed. Also prunes deleted files from the aggregated JSON and unconditionally persists at stage end so a zero-LLM-call path still reflects the current signature set. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Each class doc updates artifact_index with its composite input_hash (source SHA + canonical signature JSON) and writes the markdown via atomic_write. Mid-stage crash + rerun skips every class whose inputs still match the checkpoint. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
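
One way to build such a composite hash is sketched below. The function name `class_doc_input_hash`, the choice of SHA1, and the way the two parts are joined are all assumptions; the commit only says the hash combines the source SHA with canonical signature JSON.

```python
import hashlib
import json


def class_doc_input_hash(source_text: str, signature: dict) -> str:
    """Combine a source digest with a canonical JSON form of the signature."""
    source_sha = hashlib.sha1(source_text.encode("utf-8")).hexdigest()
    # Sorted keys and fixed separators make the JSON independent of dict
    # insertion order, so equal signatures always hash the same way.
    canonical = json.dumps(signature, sort_keys=True, separators=(",", ":"))
    combined = source_sha + "\n" + canonical
    return hashlib.sha1(combined.encode("utf-8")).hexdigest()
```

With this shape, a change to either the source text or the signature invalidates the checkpoint, while reordering signature keys does not.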

Each package README is now written via atomic_write and checkpointed to artifact_index["package:<pkg>"] with its composite input_hash (SHA1 of sorted class-doc contents). A mid-stage crash + rerun skips any package whose class-doc inputs still match the checkpoint. The v1.1 rollup_hashes cross-run skip coexists unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Each class doc's mermaid diagram is now written via atomic_write and checkpointed to artifact_index["mermaid:<rel-path>"] with its input_hash (SHA1 of the class doc body minus any existing Diagram section). A mid-stage crash + rerun skips any class doc whose input_hash still matches the checkpoint. The v1.1 rollup_hashes cross-run skip coexists unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Each dep's research result is now checkpointed to artifact_index["dep:<name>"] with a per-dep input_hash (SHA1 of name + pinned + source) and the serialised row JSON. On resume, deps whose entry matches the current hash AND whose TECH_DEBT.md still exists on disk are skipped with zero LLM calls. The partial ledger is rewritten atomically after every completed dep so a mid-stage crash leaves a consistent (partial) TECH_DEBT.md. The v1.1 whole-stage rollup_hashes skip coexists unchanged. Updated test_stage6_incremental to reflect v1.2 per-dep skip semantics (unchanged deps are reused even when the overall manifest hash changes). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
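
The two-part skip condition (matching hash AND ledger still on disk) can be sketched as follows. The helper names `dep_input_hash` and `should_skip_dep` are hypothetical, and joining the three fields with a NUL separator is an assumption; the commit only says the hash is a SHA1 of name + pinned + source.

```python
import hashlib
import os


def dep_input_hash(name: str, pinned: str, source: str) -> str:
    """SHA1 over name, pinned version, and source (separator is assumed)."""
    joined = "\0".join((name, pinned, source))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def should_skip_dep(
    artifact_index: dict,
    name: str,
    pinned: str,
    source: str,
    ledger_path: str,
) -> bool:
    """Skip only if the checkpoint matches and the ledger file still exists."""
    entry = artifact_index.get(f"dep:{name}")
    if entry is None:
        return False
    if entry["input_hash"] != dep_input_hash(name, pinned, source):
        return False
    return os.path.exists(ledger_path)
```

Checking for the ledger file guards against the case where the index survives but TECH_DEBT.md was deleted between runs.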

Budget exceeded mid-stage now sets state.halted_on_budget, marks the stage FAILED, and orchestrator.run() returns cleanly (does not re-raise). CLI checks the flag after anyio.run returns, prints 'budget exhausted at $X / cap $Y. Run `designdoc resume --budget <new-cap>` to continue', and exits 0 per v1.2 spec — the pipeline is resumable, not crashed. CLI also clears halted_on_budget when the user reruns with --budget so a successful rerun doesn't leave a stale flag. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
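
The control flow described here can be sketched in simplified, synchronous form. The real orchestrator is async and runs under anyio.run; the `State` dataclass, `run_pipeline`, `cli_result`, the `spent`/`cap` attributes, and the two-decimal dollar formatting are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


class BudgetExceededError(Exception):
    def __init__(self, spent: float, cap: float) -> None:
        super().__init__(f"spent {spent} exceeds cap {cap}")
        self.spent = spent
        self.cap = cap


@dataclass
class State:
    halted_on_budget: bool = False
    spent: float = 0.0
    cap: float = 0.0
    stage_status: dict = field(default_factory=dict)


def run_pipeline(state: State, stages: list) -> None:
    """Run (name, callable) stages; return cleanly on a budget halt."""
    for name, stage in stages:
        try:
            stage()
        except BudgetExceededError as exc:
            state.halted_on_budget = True
            state.stage_status[name] = "FAILED"
            state.spent, state.cap = exc.spent, exc.cap
            return  # do not re-raise: the pipeline is resumable
        state.stage_status[name] = "DONE"


def cli_result(state: State) -> tuple:
    """Return (exit_code, message) after the run has finished."""
    if state.halted_on_budget:
        message = (
            f"budget exhausted at ${state.spent:.2f} / cap ${state.cap:.2f}. "
            "Run `designdoc resume --budget <new-cap>` to continue"
        )
        return 0, message
    return 0, "done"
```

Keeping the message formatting in the CLI layer means the orchestrator only records what happened, which matches the split described in the commit.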

The starting log line for stages that have checkpointed artifacts now surfaces the resume context:

[N/9] stage class_docs: 40 artifacts checkpointed

Stages with no prior checkpoints keep the legacy "starting" message. _id_belongs_to_stage maps stage names to the actual artifact_id conventions (file:, package:, mermaid:, dep:, plus the Stage 3 "<path>::<Class>" no-prefix form).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
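
A rough sketch of how the stage-to-artifact mapping and the log line could fit together. Only the prefixes, the `class_docs` and `file_analysis` stage names, and the checkpointed log format come from this PR; the other stage names, the pairing of `file_analysis` with the `file:` prefix, and the exact wording of the legacy line are hypothetical placeholders.

```python
# Stage names other than file_analysis and class_docs are hypothetical,
# as is the assignment of prefixes to stage names.
PREFIX_BY_STAGE = {
    "file_analysis": "file:",
    "package_rollups": "package:",
    "mermaid": "mermaid:",
    "tech_debt": "dep:",
}
KNOWN_PREFIXES = tuple(PREFIX_BY_STAGE.values())


def id_belongs_to_stage(artifact_id: str, stage: str) -> bool:
    if stage == "class_docs":
        # Stage 3 ids use the no-prefix "<path>::<Class>" form.
        return "::" in artifact_id and not artifact_id.startswith(KNOWN_PREFIXES)
    prefix = PREFIX_BY_STAGE.get(stage)
    return prefix is not None and artifact_id.startswith(prefix)


def stage_start_line(position: int, total: int, stage: str, artifact_index: dict) -> str:
    count = sum(1 for aid in artifact_index if id_belongs_to_stage(aid, stage))
    if count:
        return f"[{position}/{total}] stage {stage}: {count} artifacts checkpointed"
    # Assumed wording for the legacy message.
    return f"[{position}/{total}] stage {stage}: starting"
```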

Full-pipeline test that halts mid-Stage-3 via a tight budget cap, resumes with a higher cap, and verifies the load-bearing invariant of v1.2: checkpointed artifacts are reused on resume, not regenerated.

The load-bearing assertion counts Stage-3 doer calls in the resume phase. Exactly 2 calls (class 2 + class 3) proves the class 1 checkpoint was reused. If the resume path ever regresses and re-runs checkpointed work, this count would jump to 3 and the test fails loudly.

Supporting checks:
- state.halted_on_budget True after the first run (Task 9 integration)
- Stage statuses: file_analysis DONE, class_docs FAILED mid-flight
- All 9 stages DONE after resume
- Final tree hash matches a clean cold run byte-for-byte

Plan deviation: the original plan proposed a requires_api subprocess test comparing byte-identical output to a live-API clean run. That's unreliable because LLM output is non-deterministic: two fresh runs won't produce byte-identical trees even with the same inputs. The deterministic fake SDK (pattern from test_resume.py) lets us make sharper assertions about resume behavior AND runs in CI without API costs.

Uses parallelism=1 so the budget-raise landing is predictable; with default parallelism=3 multiple doers can complete before the first raise propagates through asyncio.gather, which fuzzes the expected call counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Within-stage crash-resume + graceful budget halt. See CHANGELOG for the full list. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

RichardHightower and others added 12 commits April 20, 2026 13:10

feat(io): atomic_write helper for crash-safe artifact writes

ea1d723


chore(release): v1.2.0

25f57ce


RichardHightower merged commit 8ecb273 into main Apr 21, 2026
2 checks passed

RichardHightower deleted the feat/v1.2-within-stage-resume branch April 21, 2026 05:50

RichardHightower merged 12 commits into main from feat/v1.2-within-stage-resume