generated from ApolloResearch/sample
Labels: bug (Something isn't working)
Description
We currently create sequences of length n_ctx. But when we train, we take x = batch[:-1] and y = batch[1:]. This means our model is only ever trained on contexts of length n_ctx - 1, rather than n_ctx.
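The off-by-one is easy to see with a toy batch. A minimal sketch (`n_ctx` and `batch` here are stand-ins, not names from the codebase):

```python
n_ctx = 8
batch = list(range(n_ctx))  # a "sequence of length n_ctx"

x = batch[:-1]  # inputs:  positions 0 .. n_ctx - 2
y = batch[1:]   # targets: positions 1 .. n_ctx - 1

# The model only ever receives inputs of length n_ctx - 1:
assert len(x) == n_ctx - 1
assert len(y) == n_ctx - 1
```

A common fix is to sample sequences of length n_ctx + 1 so that the sliced inputs have length n_ctx.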
For relative positional embeddings like those LLaMA uses, this may generalise just fine. For absolute positional embeddings, like those used in GPT-2, it could be a problem because the final position embedding never gets trained. So when people run downstream tasks at the "same" context length the model was trained on, it might produce garbage at that position.
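For the absolute-embedding case, the untrained final position can be demonstrated directly: with inputs of length n_ctx - 1, the last row of the position-embedding table receives zero gradient. A minimal PyTorch sketch (all names here are illustrative, not from the codebase):

```python
import torch

n_ctx, d_model, vocab = 8, 4, 10
wpe = torch.nn.Embedding(n_ctx, d_model)  # absolute position embeddings (GPT-2 style)
wte = torch.nn.Embedding(vocab, d_model)  # token embeddings

tokens = torch.randint(0, vocab, (1, n_ctx))
x = tokens[:, :-1]                 # inputs of length n_ctx - 1, as in the training loop
pos = torch.arange(x.shape[1])     # positions 0 .. n_ctx - 2

h = wte(x) + wpe(pos)
h.sum().backward()                 # stand-in for a real loss

# Row n_ctx - 1 of wpe never appears in the forward pass,
# so it gets zero gradient and stays at its random init:
assert wpe.weight.grad[-1].abs().sum().item() == 0
assert wpe.weight.grad[0].abs().sum().item() > 0
```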
h/t Lucius + Oli for spotting this potential issue in our SPD codebase, where we do indeed try to use n_ctx for our decompositions.