
Restructure and handle pile training #4

Merged
danbraunai-goodfire merged 24 commits into dev from dan-pile-training
Feb 3, 2026

Conversation

@danbraunai-goodfire

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Does this PR introduce a breaking change?

leesharkey and others added 24 commits December 11, 2025 18:40
- Add gpt2_simple-pile, gpt2_simple-pile-1L, gpt2_simple-pile-2L model configs
- Use GPT-2 tokenizer vocab (50259 = 50257 + [UNK] + [EOS])
- Same architecture as SimpleStories versions (4L/1L/2L, 4 heads, 128 embd)
- Add train_config_pile.yaml for training on monology/pile-uncopyrighted
- Uses NeelNanda/pile-10k for validation (smaller, faster)
- Streaming enabled for large Pile dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Use underscore notation for num_iterations (100_000)
- Consolidate model config comment to single line

When streaming datasets are loaded (e.g., Pile), the IterableDataset.features
attribute is None, causing a TypeError when trying to iterate over it.

This fix skips the column filtering for streaming datasets without features,
since the map() function with remove_columns parameter will handle column
management properly during tokenization.

Also updated type hints to accept Dataset | IterableDataset.
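The guard described above can be sketched roughly as follows (function name and shape are illustrative, not the repository's actual code; the dataset object is duck-typed so the sketch runs without the `datasets` library):

```python
def keep_single_column(dataset, column_name):
    """Keep only `column_name`, tolerating streaming datasets.

    Illustrative sketch: a streaming IterableDataset can report
    features=None, so iterating over dataset.features would raise a
    TypeError. In that case, skip the explicit filtering and let
    map(..., remove_columns=...) drop the extra columns during
    tokenization.
    """
    if getattr(dataset, "features", None) is None:
        return dataset  # streaming case: defer column handling to map()
    extra = [col for col in dataset.features if col != column_name]
    return dataset.remove_columns(extra)
```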

The tokenize_and_concatenate function was not receiving the column_name
parameter from the dataset config, causing it to use the default value
'story' instead of the actual column name (e.g., 'text' for Pile dataset).

This caused _keep_single_column to remove all columns since it was looking
for 'story' but the actual column was 'text'.
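A toy stand-in (names hypothetical, not the real `tokenize_and_concatenate`) shows why the stale default silently drops everything: rows are filtered by the configured column, so looking up `story` in Pile rows that only have `text` yields nothing.

```python
def concatenate_column(rows, column_name="story"):
    # Toy stand-in for the column handling: keep only the configured
    # column from each row and join the surviving text.
    return " ".join(row[column_name] for row in rows if column_name in row)

pile_rows = [{"text": "first doc"}, {"text": "second doc"}]

# With the stale "story" default nothing survives; with the configured
# column name the text is preserved.
assert concatenate_column(pile_rows) == ""
assert concatenate_column(pile_rows, column_name="text") == "first doc second doc"
```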

The Pile dataset uses zstd compression, which requires the zstandard
Python package. Without it, dataset loading fails with:
ValueError: Compression type zstd not supported

- Add scripts/slurm_train.py for generating and submitting SLURM jobs
- Support single-node (1-8 GPUs) and multi-node (>8 GPUs) DDP training
- Add SLURM documentation to README
- Add slurm_scripts/ and slurm_logs/ to .gitignore
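The generator might look roughly like this: a sketch of the single-node/multi-node split described above, where the function name and the particular `#SBATCH` flags are assumptions, not the actual `scripts/slurm_train.py`.

```python
def make_sbatch_script(job_name, n_gpus, train_cmd, gpus_per_node=8):
    # Hypothetical sketch: up to 8 GPUs fit on one node; beyond that,
    # spread the job across ceil(n_gpus / gpus_per_node) nodes for
    # multi-node DDP.
    n_nodes = -(-n_gpus // gpus_per_node)  # ceiling division
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={n_nodes}",
        f"#SBATCH --gpus-per-node={min(n_gpus, gpus_per_node)}",
        f"srun {train_cmd}",
    ]
    return "\n".join(lines)
```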

Use hyphen (gpt2_simple-pile) not underscore (gpt2_simple_pile)

The validation loss was different between 1-GPU and 8-GPU runs because
the validation dataset was being split across ranks, but val_loss was
only computed on rank 0 without all-reduce. This meant 8-GPU runs were
evaluating on 1/8th of the validation data (different samples than 1-GPU).

Now validation data is not split, so all ranks see the same validation
samples, making val_loss comparable across different GPU configurations.
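The behavior change can be sketched as follows (names hypothetical): training batches stay striped across ranks, but validation samples are no longer split, so every rank scores the same set.

```python
def samples_for_rank(samples, rank, world_size, shard=True):
    # Training data (shard=True) is striped across ranks as before;
    # validation data (shard=False) is returned whole, so val_loss is
    # computed on identical samples regardless of world_size.
    if not shard:
        return list(samples)
    return [s for i, s in enumerate(samples) if i % world_size == rank]

val = list(range(16))
# Every rank of an 8-GPU run now sees the full validation set, matching
# what a 1-GPU run sees.
assert all(samples_for_rank(val, r, 8, shard=False) == val for r in range(8))
```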

# Conflicts:
#	simple_stories_train/models/model_configs.py
@danbraunai-goodfire merged commit 647337d into dev on Feb 3, 2026
1 check passed


3 participants