
Misc changes since paper #2

Merged
danbraunai-goodfire merged 69 commits into main from dev, Feb 3, 2026
Conversation

@danbraunai

No description provided.

chandanms and others added 30 commits July 27, 2025 11:48
Add pruning functionality to tokenizer creation to remove zero-frequency tokens
* Add rms_norm_eps argument to config with default 1e-6

* Fix type issues

* Add convert_to_hf.py script and tests

* Re-enable tests in CI

* Use torch>=2.6 in pyproject.toml

* Improve naming

* Fix type errors

* Added conversion scripts and corresponding tests

* Fixed pyright issues

* Marked a test as slow since it downloads all models from HF

* Revert "Marked a test as slow since it downloads all models from HF"

This reverts commit 2c9aedb.

Wrong commit with pytest!

* Marked a test as slow since it downloads all models from HF

* Corrected the docstring of a test case, making it more verbose to mention backward compatibility

---------

Co-authored-by: chandanms <[email protected]>
* Reset train dataloader when depleted

* Fix pyright errors

* Cast instead of isinstance

* Update pinned torch version

* Factor out gpt2 and make general train.py

* Prefix wandb run name with model_id

* Create gpt2 hf converters

* Create push_to_hf

* Upload tokenizer to hf too

* Refactor gpt conversions
…ens; Made the data to tokenizer training iterable.
…_test_cases

Tokenizer test cases and reformatting of tokenizer training file
Fix for accidentally adding the updated README to the wrong folder
Added a note that mentions where the implementation deviates from the original paper
* Play around with no ln

* Fix weights_only=True

* Add ability to finetune without ln

* Revert to old llama from_pretrained

* Minor fixes for ln ablation finetune

* Fix pyright error
leesharkey and others added 28 commits December 11, 2025 18:40
- Add gpt2_simple-pile, gpt2_simple-pile-1L, gpt2_simple-pile-2L model configs
- Use GPT-2 tokenizer vocab (50259 = 50257 + [UNK] + [EOS])
- Same architecture as SimpleStories versions (4L/1L/2L, 4 heads, 128 embd)
- Add train_config_pile.yaml for training on monology/pile-uncopyrighted
- Uses NeelNanda/pile-10k for validation (smaller, faster)
- Streaming enabled for large Pile dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
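The vocabulary arithmetic in the bullet above can be sanity-checked; the variable names below are illustrative, not taken from the repo:

```python
# GPT-2's base vocabulary plus the two special tokens named in the commit.
GPT2_BASE_VOCAB = 50257
special_tokens = ["[UNK]", "[EOS]"]  # added on top of the GPT-2 vocab

vocab_size = GPT2_BASE_VOCAB + len(special_tokens)
assert vocab_size == 50259
```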
- Use underscore notation for num_iterations (100_000)
- Consolidate model config comment to single line

When streaming datasets are loaded (e.g., Pile), the IterableDataset.features
attribute is None, causing a TypeError when trying to iterate over it.

This fix skips the column filtering for streaming datasets without features,
since the map() function with remove_columns parameter will handle column
management properly during tokenization.

Also updated type hints to accept Dataset | IterableDataset.

The tokenize_and_concatenate function was not receiving the column_name
parameter from the dataset config, causing it to use the default value
'story' instead of the actual column name (e.g., 'text' for Pile dataset).

This caused _keep_single_column to remove all columns since it was looking
for 'story' but the actual column was 'text'.

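A hypothetical mirror of the bug: if the config's column name is not threaded through, the default `'story'` silently drops every column of a dataset whose text lives under `'text'`. The helpers below are illustrative, not the repo's code:

```python
def select_text_column(row: dict, column_name: str) -> dict:
    """Keep only the text column we intend to tokenize (drops everything else)."""
    return {column_name: row[column_name]} if column_name in row else {}


def extract_texts(rows: list[dict], column_name: str = "story") -> list[str]:
    """Mimics the pre-fix default: column_name falls back to 'story'."""
    kept = [select_text_column(r, column_name) for r in rows]
    return [r[column_name] for r in kept if r]


pile_rows = [{"text": "hello world", "meta": {}}]

# With the default 'story', everything is dropped (the bug's symptom).
assert extract_texts(pile_rows) == []

# Passing the dataset config's column name through fixes it.
assert extract_texts(pile_rows, column_name="text") == ["hello world"]
```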
The Pile dataset uses zstd compression, which requires the zstandard
Python package. Without it, dataset loading fails with:
ValueError: Compression type zstd not supported

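One way to make the dependency explicit, assuming the project declares dependencies in `pyproject.toml` (a `pyproject.toml` is mentioned in an earlier commit; the exact dependency list shown here is illustrative):

```toml
[project]
dependencies = [
    "zstandard",  # required to decompress the Pile's zstd-compressed shards
]
```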
- Add scripts/slurm_train.py for generating and submitting SLURM jobs
- Support single-node (1-8 GPUs) and multi-node (>8 GPUs) DDP training
- Add SLURM documentation to README
- Add slurm_scripts/ and slurm_logs/ to .gitignore

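The single- vs multi-node decision described above can be sketched as below. `sbatch_header` is a hypothetical helper, not the actual `scripts/slurm_train.py`; it only illustrates the "1-8 GPUs on one node, more than 8 goes multi-node" rule:

```python
def sbatch_header(n_gpus: int, gpus_per_node: int = 8) -> str:
    """Build SLURM directives for a DDP job (illustrative sketch).

    Up to gpus_per_node GPUs fit on a single node; beyond that we spread
    the job across ceil(n_gpus / gpus_per_node) nodes.
    """
    n_nodes = -(-n_gpus // gpus_per_node)  # ceiling division
    return (
        "#!/bin/bash\n"
        f"#SBATCH --nodes={n_nodes}\n"
        f"#SBATCH --gpus-per-node={min(n_gpus, gpus_per_node)}\n"
    )


assert "--nodes=1" in sbatch_header(4)    # single-node: 4 GPUs
assert "--nodes=2" in sbatch_header(16)   # multi-node: 16 GPUs over 2 nodes
```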
Use a hyphen (gpt2_simple-pile), not an underscore (gpt2_simple_pile)

The validation loss was different between 1-GPU and 8-GPU runs because
the validation dataset was being split across ranks, but val_loss was
only computed on rank 0 without all-reduce. This meant 8-GPU runs were
evaluating on 1/8th of the validation data (different samples than 1-GPU).

Now validation data is not split, so all ranks see the same validation
samples, making val_loss comparable across different GPU configurations.

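The before/after behaviour can be sketched with a toy index-sharding function. This is illustrative, not the repo's sampler: training indices are still split across ranks, while validation indices are replicated so every rank (at any world size) evaluates the same samples:

```python
def shard_indices(n: int, rank: int, world_size: int, split: bool) -> list[int]:
    """Return the sample indices a given rank sees (toy sketch)."""
    if not split:
        # Replicated: every rank sees all n samples (post-fix validation).
        return list(range(n))
    # Sharded: each rank sees a disjoint 1/world_size slice (training).
    return list(range(rank, n, world_size))


# Validation: all 8 ranks see identical samples, so val_loss is
# comparable between 1-GPU and 8-GPU runs.
val = [shard_indices(16, r, 8, split=False) for r in range(8)]
assert all(v == list(range(16)) for v in val)

# Training is still split: ranks jointly cover the data exactly once.
train = [shard_indices(16, r, 8, split=True) for r in range(8)]
assert sorted(i for s in train for i in s) == list(range(16))
```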
# Conflicts:
#	simple_stories_train/models/model_configs.py
Restructure and handle pile training
@danbraunai-goodfire danbraunai-goodfire marked this pull request as ready for review February 3, 2026 14:27
@danbraunai-goodfire danbraunai-goodfire merged commit 83bf97c into main Feb 3, 2026
1 check passed