v1.2.0 — within-stage crash-resume + graceful budget halt#1

Merged: RichardHightower merged 12 commits into main from feat/v1.2-within-stage-resume on Apr 21, 2026

@RichardHightower

Summary

Within-stage crash-resume. A pipeline killed mid-Stage-N no longer loses that stage's partial progress: the rerun skips any artifact whose input hash still matches the checkpoint. A budget-cap halt is now a graceful exit (code 0) that prints a message explaining how to resume.

Added

  • Within-stage checkpointing. Stages 2 (file summaries), 3 (class docs), 4 (package rollups), 5 (mermaid), and 6 (tech-debt topics) write to state.artifact_index after every completed artifact, under an asyncio.Lock. Mid-stage crash + rerun never re-calls the LLM for already-produced artifacts.
  • Atomic artifact writes. New designdoc.io_utils.atomic_write(path, content) writes to a sibling .tmp file, then calls os.replace() to rename it over the target (an atomic operation on POSIX). Every artifact, and .designdoc-state.json itself, is written through it.
  • Graceful budget halt. A BudgetExceededError raised mid-stage now sets state.halted_on_budget=True, marks the stage FAILED, and saves state. The CLI prints "budget exhausted at $X / cap $Y. Run designdoc resume --budget <new-cap> to continue" and exits with code 0 (previously exit code 4).
  • Observability. Orchestrator stage-start log lines include the count of already-checkpointed artifacts (e.g. [3/9] stage class_docs: 40 artifacts checkpointed).
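
The checkpoint-and-skip behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: `input_hash`, `fake_llm`, `process_one`, and `run_stage` are hypothetical names, the index is a plain dict, and SHA-1 over the raw input is assumed. Only the `{"path", "input_hash"}` entry shape and the use of an `asyncio.Lock` come from this PR.

```python
import asyncio
import hashlib


def input_hash(source: str) -> str:
    """Hash an artifact's input; SHA-1 is an assumption made for this sketch."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


async def fake_llm(artifact_id: str, calls: list[str]) -> str:
    """Hypothetical stand-in for the real LLM call; records that it ran."""
    calls.append(artifact_id)
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return f"summary of {artifact_id}"


async def process_one(artifact_id, source, index, lock, calls) -> None:
    current = input_hash(source)
    entry = index.get(artifact_id)
    if entry is not None and entry["input_hash"] == current:
        return  # checkpoint still matches: skip without calling the LLM
    await fake_llm(artifact_id, calls)
    async with lock:  # checkpoint right after this artifact completes
        index[artifact_id] = {"path": f"out/{artifact_id}.md", "input_hash": current}


async def run_stage(inputs: dict[str, str], index: dict, calls: list[str]) -> None:
    lock = asyncio.Lock()  # created inside the running loop
    await asyncio.gather(
        *(process_one(aid, src, index, lock, calls) for aid, src in inputs.items())
    )
```

Running `run_stage` twice against the same `index` calls `fake_llm` only for inputs whose hash changed between runs. An entry with an empty `input_hash` never matches, so it is always reprocessed.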

Changed

  • PipelineState.artifact_index is now dict[str, dict[str, str]] carrying {"path": ..., "input_hash": ...} per artifact. Old-shape state files are migrated in-memory with empty input_hash — safe fallback that forces reprocessing.
  • Orchestrator.run() no longer re-raises BudgetExceededError. CLI reads state.halted_on_budget after the run and formats the resume message itself.
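
As an illustration of the artifact_index shape change, an in-memory migration might look like the sketch below. `migrate_artifact_index` is a hypothetical name, and the assumption that an old-shape value is a bare path string is inferred from the `dict[str, str]` type rather than confirmed by this PR.

```python
from typing import Any


def migrate_artifact_index(raw: dict[str, Any]) -> dict[str, dict[str, str]]:
    """Return a v1.2-shaped index; old entries get an empty input_hash."""
    migrated: dict[str, dict[str, str]] = {}
    for artifact_id, value in raw.items():
        if isinstance(value, str):
            # Old shape (assumed to be a bare path). An empty hash can never
            # match a real one, so the artifact is reprocessed on the next run.
            migrated[artifact_id] = {"path": value, "input_hash": ""}
        else:
            migrated[artifact_id] = {
                "path": value["path"],
                "input_hash": value.get("input_hash", ""),
            }
    return migrated
```

Because the migration is in-memory only, an old state file stays readable by this sketch until the next save rewrites it in the new shape.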

Fixed

  • Concurrent state.save() under asyncio.gather is now serialized by a module-level asyncio.Lock, preventing lost writes at parallelism > 1.
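
A rough sketch of why the lock matters: when the file write yields control (here via a worker thread), unserialized saves can finish out of order and leave an older snapshot on disk. The names `save_state`, `worker`, `run_workers`, and `demo` are hypothetical, and the lock is passed explicitly here rather than held at module level as in the PR.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def save_state(path: Path, state: dict, lock: asyncio.Lock) -> None:
    async with lock:  # only one snapshot + write at a time
        payload = json.dumps(state, sort_keys=True)
        # The write runs in a thread, so other coroutines keep running meanwhile.
        await asyncio.to_thread(path.write_text, payload, encoding="utf-8")


async def worker(i: int, path: Path, state: dict, lock: asyncio.Lock) -> None:
    state[f"file:{i}.py"] = {"path": f"out/{i}.md", "input_hash": str(i)}
    await save_state(path, state, lock)


async def run_workers(path: Path, count: int) -> dict:
    state: dict = {}
    lock = asyncio.Lock()  # created inside the running loop
    await asyncio.gather(*(worker(i, path, state, lock) for i in range(count)))
    return json.loads(path.read_text(encoding="utf-8"))


def demo(count: int = 50) -> int:
    """Run `count` concurrent savers and return how many entries reached disk."""
    with tempfile.TemporaryDirectory() as tmp:
        final = asyncio.run(run_workers(Path(tmp) / "state.json", count))
    return len(final)
```

With the lock in place, the last writer snapshots the state after every worker has added its entry, so the file on disk ends up with all `count` entries.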

Test plan

  • task ci — 95 tests passing (lint + format + unit + integration)
  • New unit + integration coverage:
    • tests/unit/test_io_utils.py — atomic_write helper
    • tests/unit/test_state_backcompat.py — v1.1 → v1.2 artifact_index migration
    • tests/integration/test_state_concurrent_save.py — gather-children save lock
    • tests/integration/test_stage{2,3,4,5,6}_resume.py — per-stage checkpoint resume
    • tests/integration/test_budget_halt.py — graceful budget halt
    • tests/integration/test_orchestrator_checkpoint_logs.py — log observability
    • tests/integration/test_within_stage_resume_e2e.py — full-pipeline halt+resume with deterministic fake SDK
  • Byte-identical tree hash + exact resume-phase doer call count asserted in the e2e integration test (stronger than the plan's original live-API byte-compare, which LLM non-determinism would have broken)

Plan deviations

  • Task 10: Helper mapping uses real dep: prefix (plan had stale topic:); test seed keys use realistic <path>::<Class> form.
  • Task 11: Replaced the plan's live-API subprocess test with an integration-level test using a deterministic fake SDK. The original byte-identical-against-live-API approach was doomed by LLM non-determinism; the fake-SDK version directly asserts checkpointed artifacts are reused (counts Stage-3 doer calls in the resume phase).

Post-merge

After this PR is merged to main, tag v1.2.0 on main and cut the GitHub release using the CHANGELOG v1.2.0 section as the notes body.

🤖 Generated with Claude Code

RichardHightower and others added 12 commits April 20, 2026 13:10
Writes to <path>.tmp then os.replace() — POSIX-atomic rename so a
SIGKILL between the two steps never leaves a truncated file on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
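
For readers unfamiliar with the write-then-rename pattern, a minimal sketch is shown below. It is not the actual `designdoc.io_utils` source: the `fsync` call and the `demo` helper are assumptions added for illustration, while the `atomic_write(path, content)` signature, the `.tmp` sibling, and `os.replace()` come from this PR.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path, content: str) -> None:
    target = Path(path)
    tmp = target.with_name(target.name + ".tmp")  # sibling file, same directory
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(content)
        fh.flush()
        os.fsync(fh.fileno())  # assumption: push bytes to disk before renaming
    # Atomic on POSIX when both paths are on the same filesystem: readers see
    # either the old file or the new one, never a truncated mix.
    os.replace(tmp, target)


def demo() -> tuple[str, list[str]]:
    """Write twice and return the final content plus the directory listing."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / ".designdoc-state.json"
        atomic_write(target, '{"v": 1}')
        atomic_write(target, '{"v": 2}')  # overwrites go through the same path
        return target.read_text(encoding="utf-8"), sorted(os.listdir(tmp_dir))
```

A kill between the write and the rename leaves at most a stray `.tmp` file next to an intact previous version of the target.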
Shape changes dict[str, str] -> dict[str, dict[str, str]]. Old state
files migrate in-memory with empty input_hash, which forces reprocess
(same as pre-v1.2 behavior). Also introduces module-level state_lock
for concurrent save safety in later tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
SIGKILL mid-save no longer leaves truncated .designdoc-state.json. The
module-level state_lock (added in prior commit) serializes the 50-way
concurrent-save test to a deterministic final shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Each file summary updates artifact_index + atomically rewrites
stage2_summaries.json under state_lock. Mid-stage crash + rerun hits
LLM only for files not already checkpointed. Also prunes deleted files
from the aggregated JSON and unconditionally persists at stage end
so a zero-LLM-call path still reflects the current signature set.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Each class doc updates artifact_index with its composite input_hash
(source SHA + canonical signature JSON) and writes the markdown via
atomic_write. Mid-stage crash + rerun skips every class whose inputs
still match the checkpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
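
One plausible way to build such a composite hash is sketched here. The commit only states the two ingredients (source SHA and canonical signature JSON), so the function name `composite_input_hash`, the use of SHA-1 for both steps, and the newline used to join the parts are all assumptions.

```python
import hashlib
import json


def composite_input_hash(source: str, signature: dict) -> str:
    """Combine a source digest with a canonical JSON form of the signature."""
    source_sha = hashlib.sha1(source.encode("utf-8")).hexdigest()
    # Sorted keys and fixed separators make the JSON independent of dict
    # ordering and whitespace, so equal signatures always serialize the same.
    canonical = json.dumps(signature, sort_keys=True, separators=(",", ":"))
    combined = f"{source_sha}\n{canonical}"
    return hashlib.sha1(combined.encode("utf-8")).hexdigest()
```

Canonicalizing the JSON is what keeps the checkpoint stable across runs: a signature rebuilt with its keys in a different order still yields the same hash, while any change to the source or the signature produces a new one.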
Each package README is now written via atomic_write and checkpointed to
artifact_index["package:<pkg>"] with its composite input_hash (SHA1 of
sorted class-doc contents). A mid-stage crash + rerun skips any package
whose class-doc inputs still match the checkpoint. The v1.1 rollup_hashes
cross-run skip coexists unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Each class doc's mermaid diagram is now written via atomic_write and
checkpointed to artifact_index["mermaid:<rel-path>"] with its input_hash
(SHA1 of the class doc body minus any existing Diagram section). A mid-stage
crash + rerun skips any class doc whose input_hash still matches the
checkpoint. The v1.1 rollup_hashes cross-run skip coexists unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
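
Hashing the class doc body without its Diagram section keeps the hash stable after the diagram has been written back into the doc. A sketch of that idea follows, assuming the section starts at a `## Diagram` heading and runs to the next `## ` heading or the end of the file. That heading format and the name `mermaid_input_hash` are assumptions, not taken from the PR.

```python
import hashlib
import re

# Assumed layout: "## Diagram" heading, then everything up to the next
# second-level heading or the end of the document.
_DIAGRAM_SECTION = re.compile(
    r"^## Diagram\n.*?(?=^## |\Z)", re.MULTILINE | re.DOTALL
)


def mermaid_input_hash(doc_body: str) -> str:
    """SHA-1 of the class doc with any existing Diagram section removed."""
    stripped = _DIAGRAM_SECTION.sub("", doc_body)
    return hashlib.sha1(stripped.encode("utf-8")).hexdigest()
```

Under these assumptions a doc hashes the same before and after its diagram is added, so a rerun does not mistake the stage's own output for a changed input.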
Each dep's research result is now checkpointed to
artifact_index["dep:<name>"] with a per-dep input_hash (SHA1 of name +
pinned + source) and the serialised row JSON. On resume, deps whose
entry matches the current hash AND whose TECH_DEBT.md still exists on
disk are skipped with zero LLM calls. The partial ledger is rewritten
atomically after every completed dep so a mid-stage crash leaves a
consistent (partial) TECH_DEBT.md. The v1.1 whole-stage rollup_hashes
skip coexists unchanged. Updated test_stage6_incremental to reflect
v1.2 per-dep skip semantics (unchanged deps are reused even when the
overall manifest hash changes).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
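
The two-part skip condition for a dep (hash matches AND the ledger file still exists) could be expressed along these lines. The function names, treating `pinned` and `source` as plain strings, the newline separator, and the sample values in `demo` are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def dep_input_hash(name: str, pinned: str, source: str) -> str:
    joined = "\n".join((name, pinned, source))  # separator is an assumption
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def can_skip_dep(index: dict, name: str, pinned: str, source: str, ledger: Path) -> bool:
    """Skip only if the checkpoint matches AND the ledger file is still on disk."""
    entry = index.get(f"dep:{name}")
    if entry is None:
        return False
    return entry["input_hash"] == dep_input_hash(name, pinned, source) and ledger.exists()


def demo() -> list[bool]:
    with tempfile.TemporaryDirectory() as tmp_dir:
        ledger = Path(tmp_dir) / "TECH_DEBT.md"
        checkpoint = dep_input_hash("requests", "2.31.0", "pyproject.toml")
        index = {"dep:requests": {"path": str(ledger), "input_hash": checkpoint}}
        no_ledger = can_skip_dep(index, "requests", "2.31.0", "pyproject.toml", ledger)
        ledger.write_text("# Tech debt\n", encoding="utf-8")
        unchanged = can_skip_dep(index, "requests", "2.31.0", "pyproject.toml", ledger)
        bumped = can_skip_dep(index, "requests", "2.32.0", "pyproject.toml", ledger)
        return [no_ledger, unchanged, bumped]
```

The file-existence check guards against a checkpoint that outlived its output: if TECH_DEBT.md was deleted, the dep is researched again even though its hash still matches.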
Budget exceeded mid-stage now sets state.halted_on_budget, marks the
stage FAILED, and orchestrator.run() returns cleanly (does not re-raise).
CLI checks the flag after anyio.run returns, prints 'budget exhausted
at $X / cap $Y. Run `designdoc resume --budget <new-cap>` to continue',
and exits 0 per v1.2 spec — the pipeline is resumable, not crashed.
CLI also clears halted_on_budget when the user reruns with --budget so
a successful rerun doesn't leave a stale flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
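
The control flow described in this commit can be sketched with a toy model. The `State` dataclass, `run_stage`, `run_pipeline`, and `cli_run` below are simplified, hypothetical stand-ins for the real PipelineState, orchestrator, and CLI. The dollar formatting is an assumption, and the sketch clears the flag on every run rather than only when `--budget` is passed.

```python
from dataclasses import dataclass, field


class BudgetExceededError(Exception):
    """Raised by a stage when its next call would exceed the cap."""


@dataclass
class State:  # hypothetical, far smaller than the real PipelineState
    spent: float = 0.0
    halted_on_budget: bool = False
    stage_status: dict[str, str] = field(default_factory=dict)


def run_stage(state: State, cost: float, cap: float) -> None:
    if state.spent + cost > cap:
        raise BudgetExceededError
    state.spent += cost


def run_pipeline(state: State, stages: list[tuple[str, float]], cap: float) -> None:
    for name, cost in stages:
        if state.stage_status.get(name) == "DONE":
            continue  # finished in an earlier run
        try:
            run_stage(state, cost, cap)
        except BudgetExceededError:
            state.halted_on_budget = True
            state.stage_status[name] = "FAILED"
            return  # return cleanly instead of re-raising
        state.stage_status[name] = "DONE"


def cli_run(state: State, stages: list[tuple[str, float]], cap: float) -> int:
    state.halted_on_budget = False  # clear any stale flag from an earlier halt
    run_pipeline(state, stages, cap)
    if state.halted_on_budget:
        print(
            f"budget exhausted at ${state.spent:.2f} / cap ${cap:.2f}. "
            "Run `designdoc resume --budget <new-cap>` to continue"
        )
    return 0  # a budget halt is resumable, so it is not reported as a failure
```

Keeping the message formatting in the CLI layer, with the orchestrator only setting a flag, means a halted run and a completed run follow the same return path.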
Starting-log line for stages that have checkpointed artifacts now
surfaces the resume context:
  [N/9] stage class_docs: 40 artifacts checkpointed
Stages with no prior checkpoints keep the legacy "starting" message.
_id_belongs_to_stage maps stage names to the actual artifact_id
conventions (file:, package:, mermaid:, dep:, plus the Stage 3
"<path>::<Class>" no-prefix form).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
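
A guess at how the stage-to-artifact mapping and the log line fit together is sketched below. Only the prefixes, the `<path>::<Class>` form, the stage names `file_analysis` and `class_docs`, and the checkpointed-log format appear in this PR. The other stage names, the pairing of prefixes with stages, and the exact wording of the legacy "starting" line are assumptions.

```python
# Assumed pairing of stage names with artifact_id prefixes; "package_rollups",
# "mermaid", and "tech_debt" are hypothetical stage names.
_STAGE_PREFIXES = {
    "file_analysis": "file:",
    "package_rollups": "package:",
    "mermaid": "mermaid:",
    "tech_debt": "dep:",
}
_ALL_PREFIXES = tuple(_STAGE_PREFIXES.values())


def id_belongs_to_stage(artifact_id: str, stage: str) -> bool:
    if stage == "class_docs":
        # Stage 3 ids have no prefix and look like "<path>::<Class>".
        return "::" in artifact_id and not artifact_id.startswith(_ALL_PREFIXES)
    prefix = _STAGE_PREFIXES.get(stage)
    return prefix is not None and artifact_id.startswith(prefix)


def stage_start_message(position: int, total: int, stage: str, index: dict) -> str:
    count = sum(1 for artifact_id in index if id_belongs_to_stage(artifact_id, stage))
    if count:
        return f"[{position}/{total}] stage {stage}: {count} artifacts checkpointed"
    return f"[{position}/{total}] stage {stage}: starting"  # assumed legacy wording
```

For example, an index holding two `<path>::<Class>` keys would yield `[3/9] stage class_docs: 2 artifacts checkpointed` for the third of nine stages.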
Full-pipeline test that halts mid-Stage-3 via a tight budget cap, resumes
with a higher cap, and verifies the load-bearing invariant of v1.2:
checkpointed artifacts are reused on resume, not regenerated.

The load-bearing assertion counts Stage-3 doer calls in the resume
phase — exactly 2 (class 2 + class 3) proves the class 1 checkpoint
was reused. If the resume path ever regresses and re-runs checkpointed
work, this count would jump to 3 and the test fails loudly.

Supporting checks:
- state.halted_on_budget True after the first run (Task 9 integration)
- Stage statuses: file_analysis DONE, class_docs FAILED mid-flight
- All 9 stages DONE after resume
- Final tree hash matches a clean cold run byte-for-byte

Plan deviation: the original plan proposed a requires_api subprocess test
comparing byte-identical output to a live-API clean run. That's unreliable
because LLM output is non-deterministic — two fresh runs won't produce
byte-identical trees even with the same inputs. The deterministic fake SDK
(pattern from test_resume.py) lets us make sharper assertions about
resume behavior AND runs in CI without API costs.

Uses parallelism=1 so the budget-raise landing is predictable; with
default parallelism=3 multiple doers can complete before the first raise
propagates through asyncio.gather, which fuzzes the expected call counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Within-stage crash-resume + graceful budget halt. See CHANGELOG for
the full list.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@RichardHightower RichardHightower merged commit 8ecb273 into main Apr 21, 2026
2 checks passed
@RichardHightower RichardHightower deleted the feat/v1.2-within-stage-resume branch April 21, 2026 05:50