Models are not trained on n_ctx-length sequences #46

@danbraunai

Description

We currently create sequences of length n_ctx, but when we train we take x = batch[:-1] and y = batch[1:]. This means the model is actually only trained on contexts of length n_ctx - 1 rather than n_ctx.

For relative positional embeddings like those LLaMA uses, this may generalise just fine. For absolute positional embeddings, like those used in GPT-2, it could be a problem because the final position embedding never gets trained. So when people run downstream tasks at the "same" context length the model was trained on, it might produce garbage at that position.
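A minimal sketch of the off-by-one, assuming the slicing described above (n_ctx and the batch contents here are hypothetical values for illustration):

```python
import numpy as np

n_ctx = 8  # hypothetical training context length
batch = np.arange(n_ctx)  # one tokenized sequence of length n_ctx

# Next-token training pairs as described in the issue:
x = batch[:-1]  # model inputs, length n_ctx - 1
y = batch[1:]   # targets, length n_ctx - 1

# With absolute positional embeddings, only positions 0 .. n_ctx - 2 are
# ever fed to the model, so the embedding at index n_ctx - 1 receives no
# gradient during training.
trained_positions = set(range(len(x)))
untrained_positions = set(range(n_ctx)) - trained_positions
```

Running this shows `untrained_positions == {n_ctx - 1}`: exactly one absolute position embedding is never exercised, which is the position that could produce garbage downstream.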

h/t Lucius + Oli for spotting this potential issue in our SPD codebase, where we do indeed try to use n_ctx for our decompositions.

Metadata

Labels: bug (Something isn't working)
