Goodfire/Llama-3.1-8B-Instruct-SAE-l19

Model Information

The Goodfire SAE (Sparse Autoencoder) for meta-llama/Llama-3.1-8B-Instruct is an interpreter model designed to analyze and understand that model's internal representations. The SAE is trained on activations from layer 19 of Llama-3.1-8B-Instruct and achieves an L0 count of 91, decomposing complex neural activations into interpretable features. It is optimized for interpretability tasks and model steering applications, giving researchers and developers insight into the model's internal processing and behavior patterns. As an open-source tool, it serves as a foundation for advancing interpretability research and improving control over large language model behavior.
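To make the L0 figure concrete: L0 is conventionally the number of non-zero feature activations for a given input, so an L0 of 91 would mean that roughly 91 features fire on average. The sketch below is a toy illustration only. It assumes a plain ReLU autoencoder with made-up dimensions and random weights; `ToySAE` and `mean_l0` are hypothetical names, and nothing here reflects the released SAE's actual architecture, weights, or loading code.

```python
import random


def relu(x):
    """Rectified linear unit: negative pre-activations become exactly zero."""
    return x if x > 0.0 else 0.0


class ToySAE:
    """Toy sparse autoencoder (assumed ReLU architecture, random weights).

    features = ReLU(W_enc @ x + b_enc)
    x_hat    = W_dec @ features + b_dec
    """

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 1) for _ in range(d_model)]
                      for _ in range(n_features)]
        # A negative encoder bias pushes more pre-activations below zero.
        self.b_enc = [-1.0] * n_features
        self.w_dec = [[rng.gauss(0, 1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map one activation vector to non-negative feature activations."""
        return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, features):
        """Reconstruct an activation vector from feature activations."""
        return [sum(w * fi for w, fi in zip(row, features)) + b
                for row, b in zip(self.w_dec, self.b_dec)]


def mean_l0(feature_batch):
    """Average count of non-zero features per input vector."""
    counts = [sum(1 for v in feats if v > 0.0) for feats in feature_batch]
    return sum(counts) / len(counts)


# Toy sizes; the real model's hidden size and feature count are far larger.
sae = ToySAE(d_model=8, n_features=32)
data_rng = random.Random(1)
batch = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
features = [sae.encode(x) for x in batch]
reconstruction = sae.decode(features[0])
print("mean L0 over the toy batch:", mean_l0(features))
```

In a trained SAE, sparsity comes from the training objective rather than from a hand-set bias, and the mean L0 would be measured over real model activations instead of random vectors.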

Model Creator: Goodfire, built to work with Meta's Llama models

By using Goodfire/Llama-3.1-8B-Instruct-SAE-l19, you agree to the LLAMA 3.1 COMMUNITY LICENSE AGREEMENT.

Intended Use

By open-sourcing SAEs for leading open models, especially large-scale models like Llama 3.1 8B, we aim to accelerate progress in interpretability research.

Our initial work with these SAEs has revealed promising applications in model steering, enhancing jailbreaking safeguards, and interpretable classification methods. We look forward to seeing how the research community builds upon these foundations and uncovers new applications.

Feature labels

To explore the feature labels, check out the Goodfire Ember SDK, the first hosted mechanistic interpretability API. The SDK provides an intuitive interface for interacting with these features, allowing you to investigate how Llama processes information and even steer its behavior. You can explore the SDK documentation at docs.goodfire.ai.

How to use

View the notebook guide below to get started.

Training

We trained our SAE on activations harvested from Llama-3.1-8B-Instruct on the LMSYS-Chat-1M dataset.

Responsibility & Safety

Safety is at the core of everything we do at Goodfire. As a public benefit corporation, we’re dedicated to understanding AI models to enable safer, more reliable generative AI. You can read more about our comprehensive approach to safety and responsible development in our detailed safety overview.

Toxic features were removed prior to the release of this SAE. If you are a safety researcher who would like access to the features we’ve removed, you can reach out at [email protected].