
Misc changes since paper #2

Merged
danbraunai-goodfire merged 69 commits into main from dev, Feb 3, 2026
Conversation

@danbraunai

No description provided.

chandanms and others added 30 commits July 27, 2025 11:48
Add pruning functionality to tokenizer creation to remove zero-frequency tokens
* Add rms_norm_eps argument to config with default 1e-6

* Fix type issues

* Add convert_to_hf.py script and tests

* Re-enable tests in CI

* Use torch>=2.6 in pyproject.toml

* Improve naming

* Fix type errors

* Added conversion scripts and corresponding tests

* Fixed pyright issues

* Marked a test as slow since it downloads all models from HF

* Revert "Marked a test as slow since it downloads all models from HF"

This reverts commit 2c9aedb.

Wrong commit with pytest!

* Marked a test as slow since it downloads all models from HF

* Corrected the docstring of a test case, making it more verbose to mention backward compatibility

---------

Co-authored-by: chandanms <[email protected]>
* Reset train dataloader when depleted

* Fix pyright errors

* Cast instead of isinstance

* Update pinned torch version

* Factor out gpt2 and make general train.py

* Prefix wandb run name with model_id

* Create gpt2 hf converters

* Create push_to_hf

* Upload tokenizer to hf too

* Refactor gpt conversions
…ens; Made the data to tokenizer training iterable.
…_test_cases

Tokenizer test cases and reformatting of tokenizer training file
Fix for accidentally adding the updated README to the wrong folder
Added a note that mentions where the implementation deviates from the original paper
* Play around with no ln

* Fix weights_only=True

* Add ability to finetune without ln

* Revert to old llama from_pretrained

* Minor fixes for ln ablation finetune

* Fix pyright error
leesharkey and others added 28 commits December 11, 2025 18:40
- Add gpt2_simple-pile, gpt2_simple-pile-1L, gpt2_simple-pile-2L model configs
- Use GPT-2 tokenizer vocab (50259 = 50257 + [UNK] + [EOS])
- Same architecture as SimpleStories versions (4L/1L/2L, 4 heads, 128 embd)
- Add train_config_pile.yaml for training on monology/pile-uncopyrighted
- Uses NeelNanda/pile-10k for validation (smaller, faster)
- Streaming enabled for large Pile dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
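The vocabulary arithmetic in the bullet above can be sanity-checked; the variable names below are illustrative, not taken from the repo:

```python
# GPT-2's base vocabulary plus the two special tokens named in the commit.
GPT2_BASE_VOCAB = 50257
special_tokens = ["[UNK]", "[EOS]"]  # added on top of the GPT-2 vocab

vocab_size = GPT2_BASE_VOCAB + len(special_tokens)
assert vocab_size == 50259
```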
- Use underscore notation for num_iterations (100_000)
- Consolidate model config comment to single line

When streaming datasets are loaded (e.g., Pile), the IterableDataset.features
attribute is None, causing a TypeError when trying to iterate over it.

This fix skips the column filtering for streaming datasets without features,
since the map() function with remove_columns parameter will handle column
management properly during tokenization.

Also updated type hints to accept Dataset | IterableDataset.

The tokenize_and_concatenate function was not receiving the column_name
parameter from the dataset config, causing it to use the default value
'story' instead of the actual column name (e.g., 'text' for Pile dataset).

This caused _keep_single_column to remove all columns since it was looking
for 'story' but the actual column was 'text'.

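A hypothetical mirror of the bug: if the config's column name is not threaded through, the default `'story'` silently drops every column of a dataset whose text lives under `'text'`. The helpers below are illustrative, not the repo's code:

```python
def select_text_column(row: dict, column_name: str) -> dict:
    """Keep only the text column we intend to tokenize (drops everything else)."""
    return {column_name: row[column_name]} if column_name in row else {}


def extract_texts(rows: list[dict], column_name: str = "story") -> list[str]:
    """Mimics the pre-fix default: column_name falls back to 'story'."""
    kept = [select_text_column(r, column_name) for r in rows]
    return [r[column_name] for r in kept if r]


pile_rows = [{"text": "hello world", "meta": {}}]

# With the default 'story', everything is dropped (the bug's symptom).
assert extract_texts(pile_rows) == []

# Passing the dataset config's column name through fixes it.
assert extract_texts(pile_rows, column_name="text") == ["hello world"]
```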
The Pile dataset uses zstd compression, which requires the zstandard
Python package. Without it, dataset loading fails with:
ValueError: Compression type zstd not supported

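One way to make the dependency explicit, assuming the project declares dependencies in `pyproject.toml` (a `pyproject.toml` is mentioned in an earlier commit; the exact dependency list shown here is illustrative):

```toml
[project]
dependencies = [
    "zstandard",  # required to decompress the Pile's zstd-compressed shards
]
```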
- Add scripts/slurm_train.py for generating and submitting SLURM jobs
- Support single-node (1-8 GPUs) and multi-node (>8 GPUs) DDP training
- Add SLURM documentation to README
- Add slurm_scripts/ and slurm_logs/ to .gitignore

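The single- vs multi-node decision described above can be sketched as below. `sbatch_header` is a hypothetical helper, not the actual `scripts/slurm_train.py`; it only illustrates the "1-8 GPUs on one node, more than 8 goes multi-node" rule:

```python
def sbatch_header(n_gpus: int, gpus_per_node: int = 8) -> str:
    """Build SLURM directives for a DDP job (illustrative sketch).

    Up to gpus_per_node GPUs fit on a single node; beyond that we spread
    the job across ceil(n_gpus / gpus_per_node) nodes.
    """
    n_nodes = -(-n_gpus // gpus_per_node)  # ceiling division
    return (
        "#!/bin/bash\n"
        f"#SBATCH --nodes={n_nodes}\n"
        f"#SBATCH --gpus-per-node={min(n_gpus, gpus_per_node)}\n"
    )


assert "--nodes=1" in sbatch_header(4)    # single-node: 4 GPUs
assert "--nodes=2" in sbatch_header(16)   # multi-node: 16 GPUs over 2 nodes
```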
Use a hyphen (gpt2_simple-pile), not an underscore (gpt2_simple_pile)

The validation loss was different between 1-GPU and 8-GPU runs because
the validation dataset was being split across ranks, but val_loss was
only computed on rank 0 without all-reduce. This meant 8-GPU runs were
evaluating on 1/8th of the validation data (different samples than 1-GPU).

Now validation data is not split, so all ranks see the same validation
samples, making val_loss comparable across different GPU configurations.

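The before/after behaviour can be sketched with a toy index-sharding function. This is illustrative, not the repo's sampler: training indices are still split across ranks, while validation indices are replicated so every rank (at any world size) evaluates the same samples:

```python
def shard_indices(n: int, rank: int, world_size: int, split: bool) -> list[int]:
    """Return the sample indices a given rank sees (toy sketch)."""
    if not split:
        # Replicated: every rank sees all n samples (post-fix validation).
        return list(range(n))
    # Sharded: each rank sees a disjoint 1/world_size slice (training).
    return list(range(rank, n, world_size))


# Validation: all 8 ranks see identical samples, so val_loss is
# comparable between 1-GPU and 8-GPU runs.
val = [shard_indices(16, r, 8, split=False) for r in range(8)]
assert all(v == list(range(16)) for v in val)

# Training is still split: ranks jointly cover the data exactly once.
train = [shard_indices(16, r, 8, split=True) for r in range(8)]
assert sorted(i for s in train for i in s) == list(range(16))
```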
# Conflicts:
#	simple_stories_train/models/model_configs.py
Restructure and handle pile training
@danbraunai-goodfire danbraunai-goodfire marked this pull request as ready for review February 3, 2026 14:27
@danbraunai-goodfire danbraunai-goodfire merged commit 83bf97c into main Feb 3, 2026
1 check passed