[TEST PR] Add resumable training functionality by leesharkey · Pull Request #286 · goodfire-ai/spd

leesharkey · 2025-12-03T20:32:14Z

Description

This PR adds checkpoint-based resume functionality to SPD training jobs, enabling seamless continuation of interrupted
training runs. This is critical for cluster environments where jobs may be killed and rescheduled.

Key Features

Full state preservation: Saves and restores model weights, optimizer state (momentum buffers), RNG states (PyTorch,
NumPy, Python, CUDA), and dataloader position
Hybrid resume modes:
- Auto-resume: Automatically detects and resumes from the latest checkpoint
- Explicit resume: Resume from a specific checkpoint path
Config validation: Errors on breaking changes (architecture differences), warns on non-critical changes
(hyperparameters)
WandB integration: Continues the same WandB run when resuming
Distributed training support: Works with multi-GPU and multi-node setups (DDP)
Deterministic resume: Achieves bit-exact reproduction of training trajectories

Implementation Details

New module: spd/checkpoint.py

save_checkpoint(): Saves full training state including all RNG states
load_checkpoint(): Loads and validates checkpoints with config compatibility checks
find_latest_checkpoint(): Auto-detects the most recent checkpoint by step number

Modified: spd/configs.py

Added auto_resume: Enable automatic checkpoint detection
Added resume_from_checkpoint: Path to explicit checkpoint file
Added wandb_run_id: For continuing existing WandB runs

Modified: spd/run_spd.py

Resume detection and checkpoint loading logic
Conditional faithfulness warmup skip on resume
Dataloader fast-forward with RNG preservation to maintain exact training sequence
Fixed dataloader position calculation to account for alive_tracker batch and inclusive step counting

Related Issue

N/A - Feature requested for upcoming cluster deployment

Motivation and Context

We're moving to a cluster setup where training jobs may be killed and rescheduled. Without resume functionality, this would
mean:

Lost training progress
Wasted compute resources
Inability to run long training jobs reliably

This PR enables resilient training that can survive interruptions while maintaining deterministic behavior.

How Has This Been Tested?

Test 1: Constant LR Schedule (TMS)

Phase 1: Trained steps 0→50, saved checkpoint at step 25
Phase 2: Resumed from step 25→50
Result: Loss differences < 0.1% at all checkpoints

Test 2: Cosine LR Schedule (TMS)

Phase 1: Trained steps 0→100 with cosine schedule + 10% warmup, saved checkpoint at step 50
Phase 2: Resumed from step 50→100
Result:
- All LR values match exactly (bit-perfect)
- Loss values are identical (0.00% difference)
- Confirms LR schedule continues correctly after resume

Test Configuration

100 training steps, checkpoint at step 50
Cosine schedule with 10% warmup
Verified at steps 60, 80, 100

Step   | Phase 1 LR   | Phase 2 LR   | LR Match | Phase 1 Loss   | Phase 2 Loss   | Loss Diff %
------------------------------------------------------------------------------------------------------
60     | 0.000635000  | 0.000635000  | ✓        | 0.00959301740  | 0.00959301740  | 0.00%
80     | 0.000329000  | 0.000329000  | ✓        | 0.01291703433  | 0.01291703433  | 0.00%
100    | -0.000018000 | -0.000018000 | ✓        | 0.01030338556  | 0.01030338556  | 0.00%

Bug Fixes During Development

Off-by-one error in dataloader position calculation
RNG contamination from dataloader fast-forward
RNG contamination from alive_tracker batch consumption
Missing numpy import
Type checking issues with CUDA RNG state storage

All bugs were identified through testing and fixed before commit.

Does this PR introduce a breaking change?

No breaking changes. This is a purely additive feature:

All new config fields have default values (auto_resume=False, resume_from_checkpoint=None)
Existing training runs continue to work exactly as before
Resume functionality is opt-in via config

Usage Examples

Auto-resume from latest checkpoint

auto_resume: true
save_freq: 1000  # Save every 1000 steps

Resume from specific checkpoint

resume_from_checkpoint: /path/to/model_5000.pth

Continue WandB run

resume_from_checkpoint: /path/to/checkpoint.pth
wandb_run_id: abc123xyz  # ID from previous run

…tested

… than eval

Implements full checkpoint save/load with deterministic resume support for cluster environments where jobs may be killed and rescheduled. Features: - Hybrid resume: auto-detect latest checkpoint OR explicit path override - Full state preservation: model, optimizer, RNG states, dataloader position - Config compatibility validation (errors on breaking changes) - Skips faithfulness warmup on resume - WandB run continuation support - Multi-GPU/DDP compatible Implementation: - Created spd/checkpoint.py with save/load/validation functions - Added config fields: auto_resume, resume_from_checkpoint, wandb_run_id - Integrated resume logic into spd/run_spd.py Key fixes for determinism: - Correct dataloader position accounting (alive_tracker batch + inclusive step counting) - RNG state preservation during fast-forward (save before, restore after) - Alive_tracker batch consumed before RNG restoration to avoid contamination Testing: - Integration tests show <0.1% loss difference between continuous and resumed training - Tested with TMS experiments, verified deterministic behavior 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

PROBLEM: Resume functionality was producing losses that differed by ~0.5-1% from continuous training, instead of matching at floating-point precision. ROOT CAUSE: The code was saving RNG state before dataloader fast-forward, then restoring it after. This caused the dataloader to produce WRONG batches: 1. Save RNG state (e.g., state at batch 0) 2. Fast-forward dataloader (consumes RNG, advances to batch N) 3. Restore RNG state (rolls back to batch 0 state) 4. Next batch fetch produces batch 0 data instead of batch N! This meant resumed training was seeing different data than continuous training, causing loss divergence. SOLUTION: Remove the RNG save/restore around dataloader fast-forward. The fast-forward should naturally advance the RNG state, which is exactly what we want for deterministic resume. TESTING: Created diagnostic tests that confirmed: - Before fix: Batches after fast-forward differed by ~100% - After fix: Batches match at exact floating-point precision End-to-end training test (TMS, steps 2-4): - All losses match to 15 decimal places between continuous and resumed - Step 2: 0.115864656865597 (both runs) - Step 3: 0.118010997772217 (both runs) - Step 4: 0.028962565585971 (both runs) IMPACT: Resume functionality now achieves true deterministic training with floating-point precision. Resumed training produces bit-identical results to continuous training. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

… for GPU support Added init_distributed() calls to tms_decomposition.py and resid_mlp_decomposition.py to enable proper GPU training with torchrun. Without this, GPU training would fail with assertion errors when trying to get the device. The lm_decomposition.py already had this initialization. Also fixed type annotations in run_spd.py resume logic with pyright ignore comments for RNG state handling (dict[str, Any] to specific types). Tested resume functionality on GPU for all experiment types (TMS, ResidMLP1, LM) - all pass with perfect determinism. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

leesharkey and others added 30 commits September 16, 2025 18:07

Geometric similarity comparison made consistent with other evals and …

b93b9d6

…tested

Replaced mean max cosine sim with mean max ABS cosine sim

cd5fda2

Configs for geom comparison runs

61d3408

Merge remote-tracking branch 'origin/main' into feature/geom_sim_compar

63c85f0

Minor modifications to make PR-ready

770a5c5

Merge remote-tracking branch 'origin/main' into feature/geom_sim_compar

49ba925

Update seed to be consistent with other configs again

364198e

Cleaned up some comments and other bits

57c2c76

Major update of PR following review: Now implemented as script rather…

2e7752d

… than eval

Merge remote-tracking branch 'origin/main' into feature/geom_sim_compar

4fbf807

Updated registry to delete old obselete experiments

98a6620

Merge branch 'main' into feature/geom_sim_compar

bede346

Merge branch 'main' into feature/geom_sim_compar

acc04f1

Reorganized compare_models into subdirectory and cleaned up config code

62bd77e

Merging

b84814a

Updated README.md

5173a6a

Added some example models to the config

181cac8

Getting rid of newline

8db7559

Minor changes to make the PR mergeable

0d05f0a

Merge branch 'main' of https://github.com/goodfire-ai/spd

8767194

Merge branch 'main' of https://github.com/goodfire-ai/spd

019eb2d

Merge branch 'main' of https://github.com/goodfire-ai/spd

b935b4c

Merge branch 'main' of github.com:goodfire-ai/spd

3d1edeb

Merge branch 'main' of github.com:goodfire-ai/spd

1dd738d

Merge branch 'main' of github.com:goodfire-ai/spd

956f3d4

Merge branch 'main' of github.com:goodfire-ai/spd

f7ad411

Merge branch 'main' of github.com:goodfire-ai/spd

ade1377

Merge branch 'main' of github.com:goodfire-ai/spd

08875a9

Merge branch 'main' of github.com:goodfire-ai/spd

7ca7037

Merge branch 'main' of github.com:goodfire-ai/spd

cbbdb61

leesharkey and others added 11 commits October 28, 2025 14:00

Merge branch 'main' of github.com:goodfire-ai/spd

267deb6

Merge branch 'main' of github.com:goodfire-ai/spd

f49e9e0

Merge branch 'main' of github.com:goodfire-ai/spd

22f7cfc

Merge branch 'main' of github.com:goodfire-ai/spd

ab5346d

Merge branch 'main' of github.com:goodfire-ai/spd

7cb528f

Merge branch 'main' of github.com:goodfire-ai/spd

01d1b6b

Merge branch 'main' of github.com:goodfire-ai/spd

a78fdc5

Merge branch 'main' of github.com:goodfire-ai/spd

296a8d2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TEST PR] Add resumable training functionality#286

[TEST PR] Add resumable training functionality#286
leesharkey wants to merge 41 commits intomainfrom
feature/resumable

leesharkey commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leesharkey commented Dec 3, 2025

Description

Key Features

Implementation Details

Related Issue

Motivation and Context

How Has This Been Tested?

Test 1: Constant LR Schedule (TMS)

Test 2: Cosine LR Schedule (TMS)

Test Configuration

Bug Fixes During Development

Does this PR introduce a breaking change?

Usage Examples

Auto-resume from latest checkpoint

Resume from specific checkpoint

Continue WandB run

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant