A suite of interpretability tasks for evaluating agents that use Scribe for notebook tool usage.
## Environment Setup

- Set up a virtual environment: `uv venv`
- Activate the environment: `source .venv/bin/activate`
- Install dependencies: `uv sync`
- Install the Scribe repo: `uv pip install -e /path/to/scribe/repo`
- If running on macOS, you may need to run: `brew install coreutils`
## Running Benchmarks
The task suite uses Scribe's MCP (Model Context Protocol) server for automated session management. Each benchmark runs isolated Jupyter environments with automatic coordination between concurrent sessions.
### Quick Start - Multi-Provider Script
Use the main benchmark script to run tests across multiple agent providers:
```bash
# Run all benchmark types with all providers (claude, gemini, codex)
./benchmark_all_providers.sh all

# Run specific benchmarks with specific providers
./benchmark_all_providers.sh circuits --providers claude,codex
./benchmark_all_providers.sh neurons --providers claude --concurrent 1
./benchmark_all_providers.sh probes --providers gemini,codex
```

Available benchmark types: `circuits`, `neurons`, `probes`, `all`
### Individual Runner Scripts
For granular control, use the individual scripts in `scripts/runners/`:

- `run_circuits_copilot.sh` - Circuit discovery and analysis tasks
- `run_neurons_copilot.sh` - Universal neuron identification tasks
- `run_probes_copilot.sh` - Linear probe training and evaluation
All scripts support multi-provider execution:
```bash
# Single provider
./scripts/runners/run_circuits_copilot.sh --providers claude

# Multiple providers with concurrency control
./scripts/runners/run_probes_copilot.sh --providers claude,gemini --concurrent 2
```

(See References for our sources.)
## Circuit Discovery
Identify important model components for specific behaviors:
- circuit_1: IOI (Indirect Object Identification) task
- circuit_2: Antonym task
- circuit_4: Simple attention pattern
- circuit_5: MLP-only task
Metrics: IoU (Intersection over Union), Precision, Recall
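These metrics compare the agent's proposed component set against a ground-truth circuit as plain set overlap. A minimal sketch (the helper name and the attention-head labels are illustrative, not the suite's actual evaluator):

```python
def circuit_metrics(predicted, ground_truth):
    """Score a predicted set of circuit components against the gold set."""
    pred, gold = set(predicted), set(ground_truth)
    tp = len(pred & gold)  # components found in both sets
    return {
        "iou": tp / len(pred | gold) if pred | gold else 0.0,
        "precision": tp / len(pred) if pred else 0.0,
        "recall": tp / len(gold) if gold else 0.0,
    }

# e.g. the agent proposes three heads, two of which are in the gold circuit
scores = circuit_metrics({"L0H1", "L5H3", "L9H6"},
                         {"L0H1", "L5H3", "L7H9", "L10H0"})
# precision 2/3, recall 2/4, IoU 2/5
```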
## Universal Neurons
Find neurons with specific functional roles:
- Position neurons: Detect token positions
- Alphabet neurons: Respond to alphabetic characters
- Syntax neurons: Activate on syntactic patterns
- Unigram neurons: Fire for specific tokens
Metrics: Success rate (binary pass/fail)
## Linear Probes

Train classifiers on model representations:
- Grammar probes: Detect grammatical structures
- Comparison probes: Greater-than, smaller-than relationships
- Geographic probes: City location knowledge
- PCA analysis: Dimensionality reduction tasks
Metrics: Classification accuracy (0-1 scale)
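To make the accuracy metric concrete, here is a minimal linear-probe sketch on synthetic data — the stand-in "activations" and the hand-rolled training loop are illustrative only, not the benchmark's own pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "model activations": two separable Gaussian clusters, 8-dim
X = np.vstack([rng.normal(-1, 0.5, (50, 8)), rng.normal(1, 0.5, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# Fit a linear probe by logistic regression with plain gradient descent
w, b = np.zeros(8), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w + b)))       # predicted class probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)        # gradient step on weights
    b -= 0.1 * np.mean(p - y)                # gradient step on bias

acc = float(np.mean(((X @ w + b) > 0) == y))  # accuracy on the 0-1 scale
```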
## Results

Task performance for Claude and Codex as of 9/23/2025:
### Circuits Tasks
| Task | Claude IoU | Codex IoU | Claude Precision | Codex Precision | Claude Recall | Codex Recall |
|---|---|---|---|---|---|---|
| circuit_1 | 0.053 | 0.042 | 0.118 | 0.5 | 0.087 | 0.043 |
| circuit_2 | 0.062 | 0.067 | 0.2 | 0.25 | 0.083 | 0.083 |
| circuit_4 | 0.235 | 0.467 | 0.4 | 0.636 | 0.364 | 0.636 |
| circuit_5 | 0.083 | 0.8 | 0.083 | 0.8 | 1 | 1 |
### Neurons Tasks
| Task | Claude Success Rate | Codex Success Rate |
|---|---|---|
| position | 0 | 0 |
| alphabet | 1 | 1 |
| position_1 | 1 | 1 |
| syntax | 1 | 1 |
| unigram | 0 | 1 |
### Probes Tasks
| Task | Claude Accuracy | Codex Accuracy |
|---|---|---|
| greater_than | 0.732 | 0.964 |
| smaller_than | 0.728 | 0.582 |
| cities | 0.506 | 0.508 |
| regularization | 0.033 | 0.033 |
| smaller_than_pca | 1 | 1 |
| greater_than_pca | 0 | 0 |
## Results Structure

```
runs/
├── {task}_{provider}_{timestamp}/
│   ├── notebooks/   # Generated Jupyter notebooks
│   ├── results/     # Evaluation results (JSON)
│   └── outputs/     # Task-specific outputs
```
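Run directories can be located programmatically by parsing the `{task}_{provider}_{timestamp}` naming scheme. A small sketch — it assumes the provider is one of the three used here; task names may themselves contain underscores, so it splits on the provider token rather than counting separators:

```python
PROVIDERS = {"claude", "gemini", "codex"}

def parse_run_dir(name: str):
    """Split '{task}_{provider}_{timestamp}' into its three parts.

    Locates the provider from the known set, so underscored task names
    like 'greater_than_pca' parse correctly. (Sketch only.)
    """
    parts = name.split("_")
    idx = max(i for i, p in enumerate(parts) if p in PROVIDERS)
    task = "_".join(parts[:idx])
    timestamp = "_".join(parts[idx + 1:])
    return task, parts[idx], timestamp
```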
### Evaluation Results

- Circuits: `circuit_evaluation_results.json` with IoU, precision, and recall per task
- Neurons: `neuron_evaluation_results.json` with success rates per neuron type
- Probes: `probe_evaluation_results.json` with accuracy scores per probe
## Comparison Analysis
Generate comparison tables between providers:
```bash
python comparison_table_generator.py
```

## Repository Structure

```
├── benchmark_all_providers.sh     # Main multi-provider script
├── scripts/
│   ├── runners/                   # Individual benchmark runners
│   └── evaluators/                # Result evaluation scripts
├── prompts/                       # Task prompts by category
├── runs/                          # Output directory (auto-created)
├── evaluation_data/               # Ground truth data
└── comparison_table_generator.py  # Results comparison tool
```
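`comparison_table_generator.py` emits provider-comparison tables in the markdown layout shown in the Results section. A toy sketch of that layout — the function name and the inline scores are illustrative; the real tool reads the JSON result files rather than taking scores directly:

```python
def comparison_table(metric_name, claude_scores, codex_scores):
    """Render a two-provider markdown comparison table from score dicts."""
    lines = [
        f"| Task | Claude {metric_name} | Codex {metric_name} |",
        "|---|---|---|",
    ]
    for task in sorted(claude_scores):
        # 'n/a' marks tasks one provider has no result for
        lines.append(f"| {task} | {claude_scores[task]} | {codex_scores.get(task, 'n/a')} |")
    return "\n".join(lines)

table = comparison_table("Accuracy", {"cities": 0.506}, {"cities": 0.508})
```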
## References

- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques — R. Gupta et al. (2024)
- MIB: A Mechanistic Interpretability Benchmark — Aaron Mueller et al. (2025)
- Universal Neurons in GPT2 Language Models — Wes Gurnee et al. (2024)
- Steering Llama 2 via Contrastive Activation Addition — Nina Rimsky et al. (2023)
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control — Aleksandar Makelov et al. (2024)
- Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders — Dong Shu et al. (2025)