Restructure and handle pile training #4
Merged

danbraunai-goodfire merged 24 commits into dev on Feb 3, 2026
Conversation
- Add gpt2_simple-pile, gpt2_simple-pile-1L, gpt2_simple-pile-2L model configs
- Use GPT-2 tokenizer vocab (50259 = 50257 + [UNK] + [EOS])
- Same architecture as the SimpleStories versions (4L/1L/2L, 4 heads, 128 embd)
- Add train_config_pile.yaml for training on monology/pile-uncopyrighted
- Use NeelNanda/pile-10k for validation (smaller, faster)
- Enable streaming for the large Pile dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
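A minimal sketch of what these config entries might look like. The dataclass fields and `MODEL_CONFIGS` layout are assumptions for illustration; the real definitions live in simple_stories_train/models/model_configs.py and may differ.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical field names; the repo's actual config class may differ.
    n_layer: int
    n_head: int
    n_embd: int
    vocab_size: int


# GPT-2 tokenizer vocab extended with two special tokens:
# 50257 (base GPT-2) + [UNK] + [EOS] = 50259
GPT2_PILE_VOCAB_SIZE = 50257 + 2

MODEL_CONFIGS = {
    "gpt2_simple-pile": ModelConfig(n_layer=4, n_head=4, n_embd=128, vocab_size=GPT2_PILE_VOCAB_SIZE),
    "gpt2_simple-pile-1L": ModelConfig(n_layer=1, n_head=4, n_embd=128, vocab_size=GPT2_PILE_VOCAB_SIZE),
    "gpt2_simple-pile-2L": ModelConfig(n_layer=2, n_head=4, n_embd=128, vocab_size=GPT2_PILE_VOCAB_SIZE),
}
```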
- Use underscore notation for num_iterations (100_000)
- Consolidate model config comment to a single line
When streaming datasets are loaded (e.g., the Pile), the IterableDataset.features attribute is None, causing a TypeError when iterating over it. This fix skips column filtering for streaming datasets without features, since map() with the remove_columns parameter handles column management during tokenization. Also updates the type hints to accept Dataset | IterableDataset.
The tokenize_and_concatenate function was not receiving the column_name parameter from the dataset config, so it fell back to the default 'story' instead of the actual column name (e.g., 'text' for the Pile). As a result, _keep_single_column removed every column: it was looking for 'story' while the actual column was 'text'.
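The bug and fix can be illustrated with a toy stand-in for the real helper (the actual function also tokenizes and concatenates, which is omitted here):

```python
def tokenize_and_concatenate(rows, column_name="story"):
    """Toy stand-in: keep only `column_name` from each row.

    Illustrates the default-argument bug described above; not the real
    implementation from the repo.
    """
    return [row[column_name] for row in rows if column_name in row]


# Pile-style rows store their text under "text", not "story".
rows = [{"text": "Pile doc", "meta": "..."}]

# Bug: the default column_name matches nothing, so everything is dropped.
broken = tokenize_and_concatenate(rows)

# Fix: forward the column name from the dataset config.
dataset_config = {"column_name": "text"}
fixed = tokenize_and_concatenate(rows, column_name=dataset_config["column_name"])
```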
The Pile dataset uses zstd compression, which requires the zstandard Python package. Without it, dataset loading fails with: ValueError: Compression type zstd not supported
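A small pre-flight check along these lines can surface the missing dependency before `load_dataset` fails; this helper is a suggestion, not code from the PR.

```python
import importlib.util


def check_zstd_available() -> bool:
    # datasets needs the zstandard package to decompress the Pile's
    # zstd-compressed shards; check for it without importing it.
    return importlib.util.find_spec("zstandard") is not None


if not check_zstd_available():
    print("pip install zstandard  # required for zstd-compressed datasets like the Pile")
```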
- Add scripts/slurm_train.py for generating and submitting SLURM jobs
- Support single-node (1-8 GPUs) and multi-node (>8 GPUs) DDP training
- Add SLURM documentation to the README
- Add slurm_scripts/ and slurm_logs/ to .gitignore
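A hypothetical sketch of how a slurm_train.py-style helper might build an sbatch script, choosing single-node vs. multi-node from the GPU count. The flags, paths, and `train.py` entry point are assumptions; the real scripts/slurm_train.py may differ.

```python
def make_sbatch_script(num_gpus: int, gpus_per_node: int = 8) -> str:
    # Single node for <= gpus_per_node GPUs; multi-node DDP beyond that.
    nodes = max(1, -(-num_gpus // gpus_per_node))  # ceiling division
    gpus_on_node = min(num_gpus, gpus_per_node)
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_on_node}",
        "#SBATCH --output=slurm_logs/%j.out",
        # torchrun launches one process per GPU on each node
        f"srun torchrun --nnodes={nodes} --nproc_per_node={gpus_on_node} train.py",
    ])
```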
Use a hyphen (gpt2_simple-pile), not an underscore (gpt2_simple_pile).
The validation loss differed between 1-GPU and 8-GPU runs because the validation dataset was split across ranks, but val_loss was computed only on rank 0 without an all-reduce. As a result, 8-GPU runs evaluated on 1/8th of the validation data (different samples than the 1-GPU run). Validation data is now not split, so all ranks see the same validation samples, making val_loss comparable across GPU configurations.
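The sharding change can be sketched as follows; the function name and signature are hypothetical, but the rank-striding pattern for training data is the standard DDP approach.

```python
def get_sample_indices(dataset_len: int, rank: int, world_size: int, split: str) -> list[int]:
    # Training data is sharded across ranks (each rank strides through the
    # dataset); validation data is NOT sharded, so every rank evaluates the
    # same samples and val_loss is comparable across 1-GPU and 8-GPU runs.
    if split == "train":
        return list(range(rank, dataset_len, world_size))
    return list(range(dataset_len))
```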
# Conflicts:
#   simple_stories_train/models/model_configs.py