
Restructure and handle pile training #4

Merged
danbraunai-goodfire merged 24 commits into dev from dan-pile-training
Feb 3, 2026

Conversation

@danbraunai-goodfire

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Does this PR introduce a breaking change?

leesharkey and others added 24 commits December 11, 2025 18:40
- Add gpt2_simple-pile, gpt2_simple-pile-1L, gpt2_simple-pile-2L model configs
- Use GPT-2 tokenizer vocab (50259 = 50257 + [UNK] + [EOS])
- Same architecture as SimpleStories versions (4L/1L/2L, 4 heads, 128 embd)
- Add train_config_pile.yaml for training on monology/pile-uncopyrighted
- Uses NeelNanda/pile-10k for validation (smaller, faster)
- Streaming enabled for large Pile dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Use underscore notation for num_iterations (100_000)
- Consolidate model config comment to single line

When streaming datasets are loaded (e.g., Pile), the IterableDataset.features
attribute is None, causing a TypeError when trying to iterate over it.

This fix skips the column filtering for streaming datasets without features,
since the map() function with remove_columns parameter will handle column
management properly during tokenization.

Also updated type hints to accept Dataset | IterableDataset.
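The guard described above can be sketched roughly as follows (function name and shape are illustrative, not the repository's actual code; the dataset object is duck-typed so the sketch runs without the `datasets` library):

```python
def keep_single_column(dataset, column_name):
    """Keep only `column_name`, tolerating streaming datasets.

    Illustrative sketch: a streaming IterableDataset can report
    features=None, so iterating over dataset.features would raise a
    TypeError. In that case, skip the explicit filtering and let
    map(..., remove_columns=...) drop the extra columns during
    tokenization.
    """
    if getattr(dataset, "features", None) is None:
        return dataset  # streaming case: defer column handling to map()
    extra = [col for col in dataset.features if col != column_name]
    return dataset.remove_columns(extra)
```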

The tokenize_and_concatenate function was not receiving the column_name
parameter from the dataset config, causing it to use the default value
'story' instead of the actual column name (e.g., 'text' for Pile dataset).

This caused _keep_single_column to remove all columns since it was looking
for 'story' but the actual column was 'text'.
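A toy stand-in (names hypothetical, not the real `tokenize_and_concatenate`) shows why the stale default silently drops everything: rows are filtered by the configured column, so looking up `story` in Pile rows that only have `text` yields nothing.

```python
def concatenate_column(rows, column_name="story"):
    # Toy stand-in for the column handling: keep only the configured
    # column from each row and join the surviving text.
    return " ".join(row[column_name] for row in rows if column_name in row)

pile_rows = [{"text": "first doc"}, {"text": "second doc"}]

# With the stale "story" default nothing survives; with the configured
# column name the text is preserved.
assert concatenate_column(pile_rows) == ""
assert concatenate_column(pile_rows, column_name="text") == "first doc second doc"
```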

The Pile dataset uses zstd compression, which requires the zstandard
Python package. Without it, dataset loading fails with:
ValueError: Compression type zstd not supported

- Add scripts/slurm_train.py for generating and submitting SLURM jobs
- Support single-node (1-8 GPUs) and multi-node (>8 GPUs) DDP training
- Add SLURM documentation to README
- Add slurm_scripts/ and slurm_logs/ to .gitignore
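The generator might look roughly like this: a sketch of the single-node/multi-node split described above, where the function name and the particular `#SBATCH` flags are assumptions, not the actual `scripts/slurm_train.py`.

```python
def make_sbatch_script(job_name, n_gpus, train_cmd, gpus_per_node=8):
    # Hypothetical sketch: up to 8 GPUs fit on one node; beyond that,
    # spread the job across ceil(n_gpus / gpus_per_node) nodes for
    # multi-node DDP.
    n_nodes = -(-n_gpus // gpus_per_node)  # ceiling division
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={n_nodes}",
        f"#SBATCH --gpus-per-node={min(n_gpus, gpus_per_node)}",
        f"srun {train_cmd}",
    ]
    return "\n".join(lines)
```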

Use hyphen (gpt2_simple-pile) not underscore (gpt2_simple_pile)

The validation loss was different between 1-GPU and 8-GPU runs because
the validation dataset was being split across ranks, but val_loss was
only computed on rank 0 without all-reduce. This meant 8-GPU runs were
evaluating on 1/8th of the validation data (different samples than 1-GPU).

Now validation data is not split, so all ranks see the same validation
samples, making val_loss comparable across different GPU configurations.
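The behavior change can be sketched as follows (names hypothetical): training batches stay striped across ranks, but validation samples are no longer split, so every rank scores the same set.

```python
def samples_for_rank(samples, rank, world_size, shard=True):
    # Training data (shard=True) is striped across ranks as before;
    # validation data (shard=False) is returned whole, so val_loss is
    # computed on identical samples regardless of world_size.
    if not shard:
        return list(samples)
    return [s for i, s in enumerate(samples) if i % world_size == rank]

val = list(range(16))
# Every rank of an 8-GPU run now sees the full validation set, matching
# what a 1-GPU run sees.
assert all(samples_for_rank(val, r, 8, shard=False) == val for r in range(8))
```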

# Conflicts:
#	simple_stories_train/models/model_configs.py
@danbraunai-goodfire merged commit 647337d into dev on Feb 3, 2026
1 check passed


3 participants