Conversation
@claude no full review pls but what do you think of the 513-length pattern? my intuition is that it's more typical to just serve 512 batches and use ignore indices for the offset problem. pls reference the torch CE loss functionality too (how it handles offsets)
Claude finished @ocg-goodfire's task - Analyzing 513-length pattern vs typical approaches
You're absolutely right that the typical pattern is to use 512-length sequences with ignore indices. Weighing the trade-offs between the current 513-length approach and the typical 512 + ignore_index approach:

Recommendation: the 513-length pattern is actually quite reasonable here. The typical 512 + ignore approach would make sense if you were using standard datasets, but since you're tokenizing specifically for this use case, the 513 approach is cleaner.
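For concreteness, here is a minimal sketch (my own, not from the PR) of the 512 + ignore_index alternative: keep fixed-length batches, and mark the final position with -100 (the default `ignore_index` of `F.cross_entropy`) since it has no in-sequence target. Shapes are small stand-ins for [batch, 512].

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the 512 + ignore_index alternative discussed above.
# With fixed-length batches, the last input position has no in-sequence
# next-token target, so its label is set to -100 (the default ignore_index).
batch, seq, vocab = 2, 8, 11  # small stand-ins for [batch, 512, vocab]
tokens = torch.randint(0, vocab, (batch, seq))

x = tokens                        # [batch, seq] - full sequence as input
y = torch.full_like(tokens, -100)
y[:, :-1] = tokens[:, 1:]         # shift targets left; final position ignored

logits = torch.randn(batch, seq, vocab)  # stand-in for model(x)
loss = F.cross_entropy(logits.view(-1, vocab), y.view(-1), ignore_index=-100)

# Positions labelled -100 contribute nothing to the mean loss.
n_counted = (y != -100).sum().item()
print(n_counted)  # 14 = batch * (seq - 1)
```

The mean loss here is identical to slicing the last position off explicitly, which is why the two approaches are interchangeable in the training loss itself.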
@claude ok great. just are you sure this'll work with torch CE? I thought the torch CE loss had some weird gross thing where it implicitly handles inputs being non-correctly offset. Basically I thought it worked when x = [logits(a), logits(b), logits(c)] and y = [a, b, c], which would make this case fail, right?
Claude finished @ocg-goodfire's task - Analyzing PyTorch CrossEntropyLoss behavior with 513-length pattern
You're absolutely correct to question this! Let me clarify the PyTorch CrossEntropyLoss behavior.

The key insight: PyTorch CE loss does NOT have any built-in autoregressive offset handling. It simply compares the logits at each position against the target at the same position.

Current 513-length implementation:

```python
# Training (train.py:452-456)
x = bat[:, :-1]  # [batch, 512] - tokens 0-511 (input)
y = bat[:, 1:]   # [batch, 512] - tokens 1-512 (targets)
logits, loss = model(x, y)

# Model forward (llama_simple_mlp.py)
logits = self.lm_head(x)  # [batch, 512, vocab] - predictions for each position
loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
```

The alignment IS correct: logits[:, i] is the prediction made after seeing tokens 0..i of x, and y[:, i] is the token that actually follows, because the one-token shift was already done by the slicing. Your concern about x = [logits(a), logits(b), logits(c)] and y = [a, b, c] would apply if y were the unshifted sequence, but here y = bat[:, 1:] is pre-shifted, so each logits position lines up with its next-token target.
So the 513-length approach works perfectly - it's the clean way to get proper autoregressive alignment without needing ignore_index masking.
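To make the "no implicit offset" claim checkable, here is a small self-contained demonstration (not from the PR) that `F.cross_entropy` compares position i of the logits directly against position i of the targets, with no shifting of its own:

```python
import torch
import torch.nn.functional as F

# My own sketch: F.cross_entropy does no autoregressive shifting.
vocab = 5
targets = torch.tensor([3, 1, 4])

# Logits that put essentially all mass on exactly the target class at each
# position...
aligned = F.one_hot(targets, vocab).float() * 100.0
# ...versus the same logits rolled by one position.
shifted = torch.roll(aligned, shifts=1, dims=0)

print(F.cross_entropy(aligned, targets).item())  # ~0: positions line up
print(F.cross_entropy(shifted, targets).item())  # large: no implicit offset
```

If CE silently corrected offsets, the second loss would also be near zero; it is not, which is exactly why the explicit `bat[:, :-1]` / `bat[:, 1:]` slicing is needed.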
Note that even if we did the 512 length with ignore_index for CE losses, we'd have to manually handle this for regular SPD things when we load in a 512 context and e.g. calculate the L0. We don't want to calculate on the final token because that token wasn't trained on a label (or at least, we'd get slightly weird results there). Same for whatever analysis we want to do on the full 512 context.
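A sketch of that analysis-time masking (all names and shapes here are illustrative, not from the repo): when computing a per-position metric like L0 over a 512-token context, drop the final position, since it was never trained against a label.

```python
import torch

# Illustrative stand-in for sparse activations over a context:
# [batch, seq, d_features], with nonzero entries marked 1.0.
batch, seq, d = 2, 8, 4
acts = (torch.randn(batch, seq, d) > 0).float()

# L0 per position = number of active features; average over positions.
l0_all = acts.sum(dim=-1).mean()              # includes the untrained last token
l0_trained = acts[:, :-1].sum(dim=-1).mean()  # drops the final position

print(l0_all.item(), l0_trained.item())
```

The same `[:, :-1]` slice applies to any other metric computed over the full context.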
@claude gotcha, I think I'm getting confused with hf transformers trainers, which I believe do this. Anyway, thanks. @danbraunai-goodfire I'm sceptical that the models would generalise to 1-token-too-long sequences, but definitely not willing to bet on that intuition so fair point that this is safer
Claude finished @ocg-goodfire's task - Analysis complete

You're exactly right about the HF transformers confusion! HuggingFace transformers do indeed handle the autoregressive offset automatically in their training loops, which can make the manual slicing approach here seem unusual at first glance.

The implementation looks solid - clean autoregressive alignment without the complexity of ignore masking, and explicit handling of sequence boundaries for both training and analysis phases.
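For reference, the shifting that HuggingFace causal-LM forward passes do internally when `labels` are passed looks roughly like this (a paraphrase of the common upstream pattern, not the exact source), which is why the HF Trainer needs no manual offset:

```python
import torch
from torch.nn import CrossEntropyLoss

# Paraphrase of the internal label shifting in HF causal-LM models when you
# pass labels=input_ids. Small stand-in shapes.
batch, seq, vocab = 2, 8, 11
logits = torch.randn(batch, seq, vocab)
labels = torch.randint(0, vocab, (batch, seq))

shift_logits = logits[..., :-1, :].contiguous()  # drop prediction at last pos
shift_labels = labels[..., 1:].contiguous()      # drop label at first pos
loss = CrossEntropyLoss()(
    shift_logits.view(-1, vocab), shift_labels.view(-1)
)
print(loss.item())
```

This is the same one-token shift as `bat[:, :-1]` / `bat[:, 1:]` above, just done inside the model's forward rather than in the training loop.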
```python
if dataset_config.n_ctx < tokenized_len:
    col = dataset_config.column_name
    n_ctx = dataset_config.n_ctx
    torch_dataset = dataset.map(lambda x: {col: x[col][:n_ctx]}).with_format("torch")
```
we're just throwing out a bunch of data in this case, right? seems like instead maybe we should just assert dataset_config.n_ctx == tokenized_len. Do we imagine ever not doing that? and if we do then cropping examples doesn't seem the best way, or at least a potential footgun
Yeah in SPD we use max_seq_len=512, so we just throw out one token. We could show that it's within 1 of the tokenized_len. But it also might be useful for random sampling of the tokenized dataset.
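A sketch of the "within 1" guard suggested above (the function name is hypothetical, not from the repo): crop only when the configured context is at most one token shorter than the stored sequence length, so a misconfigured `n_ctx` can't silently discard large amounts of data.

```python
# Hypothetical guard for the cropping branch: allow throwing away at most
# one token per sequence (the 513 -> 512 case), otherwise fail loudly.
def crop_len_ok(n_ctx: int, tokenized_len: int) -> bool:
    return 0 <= tokenized_len - n_ctx <= 1

assert crop_len_ok(512, 513)       # SPD case: drop exactly one token
assert crop_len_ok(512, 512)       # no cropping needed
assert not crop_len_ok(512, 1024)  # would silently discard half of each row
```

In the snippet above this could sit as an assert just before the `dataset.map` call.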
The changes here don't rely on the model generalising. Maybe you mean "that the models wouldn't generalise"? In which case, yeah I feel like they probably would, but I don't want to take a risk.
Description
Created tokenized pile. See here (and here). NOTE: I want to update these scripts at some point to not import spd.
Use these tokenized datasets in pretraining and in the spd pile config.
Process:
Related Issue
N/A
Motivation and Context
If using the regular monology/pile-uncopyrighted, it's difficult to split into train/val in a way that's reproducible and that allows for streaming=True (which non-cluster users are going to need).
How Has This Been Tested?
No new tests.
Does this PR introduce a breaking change?
No