generated from ApolloResearch/sample
Labels: bug (Something isn't working)
Description
We currently create sequences of length n_ctx. But when we train, we take x = batch[:-1] and y = batch[1:]. This means our model is only ever trained on contexts of length n_ctx - 1, rather than n_ctx.
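The off-by-one is easy to see with a toy batch. A minimal sketch (`n_ctx` and `batch` here are stand-ins, not names from the codebase):

```python
n_ctx = 8
batch = list(range(n_ctx))  # a "sequence of length n_ctx"

x = batch[:-1]  # inputs:  positions 0 .. n_ctx - 2
y = batch[1:]   # targets: positions 1 .. n_ctx - 1

# The model only ever receives inputs of length n_ctx - 1:
assert len(x) == n_ctx - 1
assert len(y) == n_ctx - 1
```

A common fix is to sample sequences of length n_ctx + 1 so that the sliced inputs have length n_ctx.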
For relative positional embeddings like those LLaMA uses, this may generalise just fine. For absolute positional embeddings, like those used in GPT-2, it could be a problem because the final position embedding never gets trained. So when people run downstream tasks at the "same" context length the model was trained on, it might produce garbage at that position.
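For the absolute-embedding case, the untrained final position can be demonstrated directly: with inputs of length n_ctx - 1, the last row of the position-embedding table receives zero gradient. A minimal PyTorch sketch (all names here are illustrative, not from the codebase):

```python
import torch

n_ctx, d_model, vocab = 8, 4, 10
wpe = torch.nn.Embedding(n_ctx, d_model)  # absolute position embeddings (GPT-2 style)
wte = torch.nn.Embedding(vocab, d_model)  # token embeddings

tokens = torch.randint(0, vocab, (1, n_ctx))
x = tokens[:, :-1]                 # inputs of length n_ctx - 1, as in the training loop
pos = torch.arange(x.shape[1])     # positions 0 .. n_ctx - 2

h = wte(x) + wpe(pos)
h.sum().backward()                 # stand-in for a real loss

# Row n_ctx - 1 of wpe never appears in the forward pass,
# so it gets zero gradient and stays at its random init:
assert wpe.weight.grad[-1].abs().sum().item() == 0
assert wpe.weight.grad[0].abs().sum().item() > 0
```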
h/t Lucius + Oli for spotting this potential issue in our SPD codebase, where we do indeed try to use n_ctx for our decompositions.