This repository contains the code and experimental framework for "Explorations of Self-Repair in Language Models", demonstrating how transformer models compensate for ablated attention heads through distributed repair mechanisms.
Code Organization: The most important code lives in FINAL_figure_files; miscellaneous remaining files are stored in extra. Fair warning: the code was not written with readability in mind and is messy. For any questions about it, contact Cody Rushing at [email protected]
Prior interpretability research studying narrow distributions has preliminarily identified self-repair: a phenomenon in which, when components of large language models are ablated, later components change their behavior to compensate.
Our work builds on this literature, demonstrating that self-repair occurs across a variety of model families and sizes when individual attention heads are ablated on the full training distribution (see the sketch after this list). We further show that, on the full training distribution, self-repair is:
- Imperfect: the original direct effect of the head is not fully restored
- Noisy: the degree of self-repair varies significantly across different prompts (sometimes overcorrecting beyond the original effect)
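As a concrete illustration of the setup, here is a minimal sketch of ablating one attention head and comparing the head's direct effect to the logit drop the model actually suffers. This is not the repository's code: it assumes TransformerLens, and the model, head indices, prompt, and answer token are illustrative placeholders (zero-ablation is used here for brevity; other ablation schemes slot into the same hook).

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
LAYER, HEAD = 9, 6  # hypothetical head to ablate

prompt = "The Eiffel Tower is located in the city of"
tokens = model.to_tokens(prompt)
answer = model.to_single_token(" Paris")

# Clean run: record logits and all intermediate activations.
clean_logits, cache = model.run_with_cache(tokens)

# Direct effect: the head's output, scaled by the final LayerNorm and
# projected onto the answer token's unembedding direction.
head_out = cache["z", LAYER][0, -1, HEAD] @ model.W_O[LAYER, HEAD]
scale = cache["ln_final.hook_scale"][0, -1]
direct_effect = (head_out / scale) @ model.W_U[:, answer]

def zero_head(z, hook):
    # z has shape [batch, pos, head_index, d_head]; zero out one head.
    z[:, :, HEAD, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)]
)

# Total effect: how much the answer logit actually drops under ablation.
total_effect = clean_logits[0, -1, answer] - ablated_logits[0, -1, answer]
print(f"direct effect: {direct_effect.item():.3f}, "
      f"actual logit drop: {total_effect.item():.3f}")
# Self-repair shows up as the actual drop being smaller than the direct
# effect: downstream components restore part of the head's contribution.
```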
We highlight two different mechanisms that contribute to self-repair:
- Changes in the final LayerNorm scaling factor, which can repair up to 30% of the direct effect (see the sketch after this list)
- Sparse sets of neurons implementing Anti-Erasure
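To give a sense of how the LayerNorm mechanism can be isolated, here is a hedged sketch that reuses `model`, `tokens`, `cache`, `answer`, `zero_head`, and `ablated_logits` from the snippet above: it re-runs the ablation while pinning the final LayerNorm's scale to its clean-run value, so the remaining logit gap reflects the scale change's share of the repair. Again, this is an illustrative sketch under the TransformerLens assumption, not the repository's code.

```python
clean_scale = cache["ln_final.hook_scale"]

def freeze_scale(scale, hook):
    # Pin the final LayerNorm scale to its clean-run value, so any
    # remaining repair must come from other components (e.g. neurons).
    return clean_scale

frozen_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[
        (f"blocks.{LAYER}.attn.hook_z", zero_head),
        ("ln_final.hook_scale", freeze_scale),
    ],
)

# With the scale frozen, the LN repair channel is disabled; the gap on
# the answer logit is the portion of self-repair attributable to it.
ln_contrib = ablated_logits[0, -1, answer] - frozen_logits[0, -1, answer]
print(f"repair attributable to LN scale: {ln_contrib.item():.3f}")
```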
We additionally discuss the implications of these results for interpretability practitioners, and close with a more speculative discussion of why self-repair occurs in these models at all, highlighting evidence for the Iterative Inference hypothesis in language models, a framework that predicts self-repair.