Move from manual sharding to HF dataset builder. #391
Draft
tom-pollak wants to merge 2 commits into decoderesearch:main from
Conversation
Previously, `CacheActivationConfig` had an inconsistent config file for some interoperability with `LanguageModelSAERunnerConfig`. It was unclear which parameters were necessary vs. which were redundant.
Simplified to the required arguments:
- `hf_dataset_path`: Tokenized or untokenized dataset
- `total_training_tokens`
- `model_name`
- `model_batch_size`
- `hook_name`
- `final_hook_layer`
- `d_in`
I think this scheme captures everything you need when attempting to
cache activations and makes it a lot easier to reason about.
Optional:
```
activation_save_path # defaults to "activations/{dataset}/{model}/{hook_name}"
shuffle=True
prepend_bos=True
streaming=True
seqpos_slice
buffer_size_gb=2 # Size of each buffer. Affects memory usage and saving freq
device="cuda" or "cpu"
dtype="float32"
autocast_lm=False
compile_llm=True
hf_repo_id # Push to hf
model_kwargs # `run_with_cache`
model_from_pretrained_kwargs
```
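The simplified scheme above can be sketched as a plain dataclass. This is a hypothetical mock-up to show the shape of the config (the real `CacheActivationConfig` in SAELens may differ in field names and defaults); the `__post_init__` derives the default save path described above.

```python
# Hypothetical sketch of the simplified config; names and defaults mirror the
# lists above, not the actual SAELens implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheActivationConfigSketch:
    # required
    hf_dataset_path: str
    total_training_tokens: int
    model_name: str
    model_batch_size: int
    hook_name: str
    final_hook_layer: int
    d_in: int
    # optional, with the defaults listed above
    shuffle: bool = True
    prepend_bos: bool = True
    streaming: bool = True
    buffer_size_gb: float = 2.0  # affects memory usage and saving frequency
    device: str = "cuda"
    dtype: str = "float32"
    autocast_lm: bool = False
    hf_repo_id: Optional[str] = None  # push to HF if set
    activation_save_path: Optional[str] = None

    def __post_init__(self) -> None:
        # default: "activations/{dataset}/{model}/{hook_name}"
        if self.activation_save_path is None:
            self.activation_save_path = (
                f"activations/{self.hf_dataset_path}"
                f"/{self.model_name}/{self.hook_name}"
            )


cfg = CacheActivationConfigSketch(
    hf_dataset_path="ds",
    total_training_tokens=1_000_000,
    model_name="gpt2",
    model_batch_size=8,
    hook_name="blocks.0.hook_resid_post",
    final_hook_layer=0,
    d_in=768,
)
```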
Contributor (Author)
Oops, previous commit wasn't merged yet; this is only for "Move from manual sharding to HF dataset builder."
fc9a460 to a1da04c
Contributor (Author)
Old vs. new benchmarks: not a huge amount of difference, slightly faster.
a1da04c to 28ee687
Description
Depends on #389.
Inspired by:
https://opensourcemechanistic.slack.com/archives/C07EHMK3XC7/p1732413633220709
Instead of manually writing the single Arrow shards, we can create a dataset builder that can do this more efficiently. This speeds up saving quite a lot: the old method spent some time calculating the fingerprint of each shard, which was unnecessary and would have required a hack to get around.
> Along with this change, I also switched to a 1D activation scheme.

- Previously the dataset was stored as a `(seq_len, d_in)` array.
- Now it is stored as a flat `d_in`.

The primary reason for this change is shuffling activations. I found that by storing activations per sequence, the activations are not properly shuffled. This is a problem with `ActivationCache` too, but there's not a great solution for it there. You can observe this in the loss of the SAE when using small buffer sizes with either the cache or `ActivationStore`.

Fixes # (issue)
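The shuffling point above can be made concrete with a small stdlib sketch: tokens are labelled by `(sequence, position)` so we can see where they end up. Shuffling `(seq_len, d_in)` rows only permutes whole sequences, so tokens within a sequence stay adjacent; flattening to `d_in` rows lets every token move independently.

```python
# Sketch: sequence-level vs. token-level shuffling of cached activations.
import random

seq_len, n_seqs = 4, 3
# Label each "activation" by (sequence, position) to track where it lands.
sequences = [[(s, p) for p in range(seq_len)] for s in range(n_seqs)]

rng = random.Random(0)

# (seq_len, d_in) scheme: shuffle whole sequences. Within-sequence order
# survives, so neighbouring tokens stay correlated in the buffer.
seq_shuffled = sequences[:]
rng.shuffle(seq_shuffled)

# Flat d_in scheme: shuffle individual tokens, fully mixing positions.
flat = [tok for seq in sequences for tok in seq]
rng.shuffle(flat)

# After the sequence-level shuffle, every original sequence is still a
# contiguous, ordered block; only their order changed.
assert all(seq in sequences for seq in seq_shuffled)
```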
Type of change
Please delete options that are not relevant.
`(d_in)` activations vs `(seq_len, d_in)`

Checklist:
You have tested formatting, typing and unit tests (acceptance tests not currently in use)
`make check-ci` to check format and linting. (You can run `make format` to format code if needed.)