A suite of interpretability tasks for evaluating agents that use Scribe for notebook tool usage.
## Environment Setup

- Set up a virtual environment: `uv venv`
- Activate the environment: `source .venv/bin/activate`
- Install dependencies: `uv sync`
- Install the Scribe repo: `uv pip install -e /path/to/scribe/repo`
- If running on macOS, you may need to run: `brew install coreutils`
## Running Benchmarks
The task suite uses Scribe's MCP (Model Context Protocol) server for automated session management. Each benchmark runs isolated Jupyter environments with automatic coordination between concurrent sessions.
### Quick Start - Multi-Provider Script
Use the main benchmark script to run tests across multiple agent providers:
```bash
# Run all benchmark types with all providers (claude, gemini, codex)
./benchmark_all_providers.sh all

# Run specific benchmarks with specific providers
./benchmark_all_providers.sh circuits --providers claude,codex
./benchmark_all_providers.sh neurons --providers claude --concurrent 1
./benchmark_all_providers.sh probes --providers gemini,codex
```

Available benchmark types: `circuits`, `neurons`, `probes`, `all`
### Individual Runner Scripts
For granular control, use the individual scripts in `scripts/runners/`:

- `run_circuits_copilot.sh` - Circuit discovery and analysis tasks
- `run_neurons_copilot.sh` - Universal neuron identification tasks
- `run_probes_copilot.sh` - Linear probe training and evaluation
All scripts support multi-provider execution:
```bash
# Single provider
./scripts/runners/run_circuits_copilot.sh --providers claude

# Multiple providers with concurrency control
./scripts/runners/run_probes_copilot.sh --providers claude,gemini --concurrent 2
```

(See References for our sources.)
## Circuit Discovery
Identify important model components for specific behaviors:
- circuit_1: IOI (Indirect Object Identification) task
- circuit_2: Antonym task
- circuit_4: Simple attention pattern
- circuit_5: MLP-only task
Metrics: IoU (Intersection over Union), Precision, Recall
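These metrics compare the agent's proposed component set against a ground-truth circuit as plain set overlap. A minimal sketch (the helper name and the attention-head labels are illustrative, not the suite's actual evaluator):

```python
def circuit_metrics(predicted, ground_truth):
    """Score a predicted set of circuit components against the gold set."""
    pred, gold = set(predicted), set(ground_truth)
    tp = len(pred & gold)  # components found in both sets
    return {
        "iou": tp / len(pred | gold) if pred | gold else 0.0,
        "precision": tp / len(pred) if pred else 0.0,
        "recall": tp / len(gold) if gold else 0.0,
    }

# e.g. the agent proposes three heads, two of which are in the gold circuit
scores = circuit_metrics({"L0H1", "L5H3", "L9H6"},
                         {"L0H1", "L5H3", "L7H9", "L10H0"})
# precision 2/3, recall 2/4, IoU 2/5
```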
## Universal Neurons
Find neurons with specific functional roles:
- Position neurons: Detect token positions
- Alphabet neurons: Respond to alphabetic characters
- Syntax neurons: Activate on syntactic patterns
- Unigram neurons: Fire for specific tokens
Metrics: Success rate (binary pass/fail)
## Linear Probes

Train classifiers on model representations:
- Grammar probes: Detect grammatical structures
- Comparison probes: Greater-than, smaller-than relationships
- Geographic probes: City location knowledge
- PCA analysis: Dimensionality reduction tasks
Metrics: Classification accuracy (0-1 scale)
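To make the accuracy metric concrete, here is a minimal linear-probe sketch on synthetic data — the stand-in "activations" and the hand-rolled training loop are illustrative only, not the benchmark's own pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "model activations": two separable Gaussian clusters, 8-dim
X = np.vstack([rng.normal(-1, 0.5, (50, 8)), rng.normal(1, 0.5, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# Fit a linear probe by logistic regression with plain gradient descent
w, b = np.zeros(8), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w + b)))       # predicted class probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)        # gradient step on weights
    b -= 0.1 * np.mean(p - y)                # gradient step on bias

acc = float(np.mean(((X @ w + b) > 0) == y))  # accuracy on the 0-1 scale
```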
## Results

Task performance for Claude and Codex as of 9/23/2025:
### Circuits Tasks
| Task | Claude IoU | Codex IoU | Claude Precision | Codex Precision | Claude Recall | Codex Recall |
|---|---|---|---|---|---|---|
| circuit_1 | 0.053 | 0.042 | 0.118 | 0.5 | 0.087 | 0.043 |
| circuit_2 | 0.062 | 0.067 | 0.2 | 0.25 | 0.083 | 0.083 |
| circuit_4 | 0.235 | 0.467 | 0.4 | 0.636 | 0.364 | 0.636 |
| circuit_5 | 0.083 | 0.8 | 0.083 | 0.8 | 1 | 1 |
### Neurons Tasks
| Task | Claude Success Rate | Codex Success Rate |
|---|---|---|
| position | 0 | 0 |
| alphabet | 1 | 1 |
| position_1 | 1 | 1 |
| syntax | 1 | 1 |
| unigram | 0 | 1 |
### Probes Tasks
| Task | Claude Accuracy | Codex Accuracy |
|---|---|---|
| greater_than | 0.732 | 0.964 |
| smaller_than | 0.728 | 0.582 |
| cities | 0.506 | 0.508 |
| regularization | 0.033 | 0.033 |
| smaller_than_pca | 1 | 1 |
| greater_than_pca | 0 | 0 |
## Results Structure

```
runs/
├── {task}_{provider}_{timestamp}/
│   ├── notebooks/   # Generated Jupyter notebooks
│   ├── results/     # Evaluation results (JSON)
│   └── outputs/     # Task-specific outputs
```
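Run directories can be located programmatically by parsing the `{task}_{provider}_{timestamp}` naming scheme. A small sketch — it assumes the provider is one of the three used here; task names may themselves contain underscores, so it splits on the provider token rather than counting separators:

```python
PROVIDERS = {"claude", "gemini", "codex"}

def parse_run_dir(name: str):
    """Split '{task}_{provider}_{timestamp}' into its three parts.

    Locates the provider from the known set, so underscored task names
    like 'greater_than_pca' parse correctly. (Sketch only.)
    """
    parts = name.split("_")
    idx = max(i for i, p in enumerate(parts) if p in PROVIDERS)
    task = "_".join(parts[:idx])
    timestamp = "_".join(parts[idx + 1:])
    return task, parts[idx], timestamp
```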
### Evaluation Results

- Circuits: `circuit_evaluation_results.json` with IoU, precision, and recall per task
- Neurons: `neuron_evaluation_results.json` with success rates per neuron type
- Probes: `probe_evaluation_results.json` with accuracy scores per probe
## Comparison Analysis
Generate comparison tables between providers:
```bash
python comparison_table_generator.py
```

## Repository Structure

```
├── benchmark_all_providers.sh     # Main multi-provider script
├── scripts/
│   ├── runners/                   # Individual benchmark runners
│   └── evaluators/                # Result evaluation scripts
├── prompts/                       # Task prompts by category
├── runs/                          # Output directory (auto-created)
├── evaluation_data/               # Ground truth data
└── comparison_table_generator.py  # Results comparison tool
```
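`comparison_table_generator.py` emits provider-comparison tables in the markdown layout shown in the Results section. A toy sketch of that layout — the function name and the inline scores are illustrative; the real tool reads the JSON result files rather than taking scores directly:

```python
def comparison_table(metric_name, claude_scores, codex_scores):
    """Render a two-provider markdown comparison table from score dicts."""
    lines = [
        f"| Task | Claude {metric_name} | Codex {metric_name} |",
        "|---|---|---|",
    ]
    for task in sorted(claude_scores):
        # 'n/a' marks tasks one provider has no result for
        lines.append(f"| {task} | {claude_scores[task]} | {codex_scores.get(task, 'n/a')} |")
    return "\n".join(lines)

table = comparison_table("Accuracy", {"cities": 0.506}, {"cities": 0.508})
```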
## References

- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques — R. Gupta et al. (2024)
- MIB: A Mechanistic Interpretability Benchmark — Aaron Mueller et al. (2025)
- Universal Neurons in GPT2 Language Models — Wes Gurnee et al. (2024)
- Steering Llama 2 via Contrastive Activation Addition — Nina Rimsky et al. (2023)
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control — Aleksandar Makelov et al. (2024)
- Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders — Dong Shu et al. (2025)