goodfire-ai/scribe-task-suite

Scribe Experimenter Agent - Interp Task Suite

A suite of interpretability tasks to evaluate agents using Scribe for notebook tool usage.

Running with Scribe (agents like Claude Code, Codex, and Gemini CLI with Jupyter notebook access)

Environment Setup

  1. Setup virtual environment: uv venv
  2. Activate environment: source .venv/bin/activate
  3. Install dependencies: uv sync
  4. Install the Scribe repo in editable mode: uv pip install -e /path/to/scribe/repo
  5. On macOS you may also need to run: brew install coreutils

Running Benchmarks
The task suite uses Scribe's MCP (Model Context Protocol) server for automated session management. Each benchmark runs isolated Jupyter environments with automatic coordination between concurrent sessions.

Quick Start - Multi-Provider Script
Use the main benchmark script to run tests across multiple agent providers:

# Run all benchmark types with all providers (claude, gemini, codex)
./benchmark_all_providers.sh all

# Run specific benchmark with specific providers
./benchmark_all_providers.sh circuits --providers claude,codex
./benchmark_all_providers.sh neurons --providers claude --concurrent 1
./benchmark_all_providers.sh probes --providers gemini,codex

Available benchmark types: circuits, neurons, probes, all

Individual Runner Scripts
For granular control, use individual scripts in scripts/runners/:

  • run_circuits_copilot.sh - Circuit discovery and analysis tasks
  • run_neurons_copilot.sh - Universal neuron identification tasks
  • run_probes_copilot.sh - Linear probe training and evaluation

All scripts support multi-provider execution:

# Single provider
./scripts/runners/run_circuits_copilot.sh --providers claude

# Multiple providers with concurrency control
./scripts/runners/run_probes_copilot.sh --providers claude,gemini --concurrent 2

Benchmark Tasks

(See References for our sources)

Circuit Discovery
Identify important model components for specific behaviors:

  • circuit_1: IOI (Indirect Object Identification) task
  • circuit_2: Antonym task
  • circuit_4: Simple attention pattern
  • circuit_5: MLP-only task

Metrics: IoU (Intersection over Union), Precision, Recall
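As a sketch, these metrics treat the agent's answer and the ground-truth circuit as sets of model components and compare their overlap; the function name and the component labels below (e.g. "L9H6" for layer-9 head 6) are illustrative, not the suite's actual identifiers.

```python
# Circuit-recovery metrics over component sets (illustrative labels).
def circuit_metrics(predicted: set, ground_truth: set) -> dict:
    overlap = len(predicted & ground_truth)
    iou = overlap / len(predicted | ground_truth) if predicted | ground_truth else 0.0
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(ground_truth) if ground_truth else 0.0
    return {"iou": iou, "precision": precision, "recall": recall}

metrics = circuit_metrics(
    {"L9H6", "L9H9", "L10H0"},               # agent's predicted circuit
    {"L9H6", "L10H0", "L10H7", "L11H10"},    # ground-truth circuit
)
print(metrics)  # {'iou': 0.4, 'precision': 0.666..., 'recall': 0.5}
```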

Universal Neurons
Find neurons with specific functional roles:

  • Position neurons: Detect token positions
  • Alphabet neurons: Respond to alphabetic characters
  • Syntax neurons: Activate on syntactic patterns
  • Unigram neurons: Fire for specific tokens

Metrics: Success rate (binary pass/fail)

Linear Probes
Train classifiers on model representations:
  • Grammar probes: Detect grammatical structures
  • Comparison probes: Greater-than, smaller-than relationships
  • Geographic probes: City location knowledge
  • PCA analysis: Dimensionality reduction tasks

Metrics: Classification accuracy (0-1 scale)
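In its simplest form, a linear probe is a logistic-regression classifier trained on activation vectors. The sketch below uses synthetic data in place of real model representations and plain gradient descent in place of whatever training setup the tasks actually use; it only illustrates the technique being scored.

```python
import numpy as np

# Minimal linear probe: logistic regression on synthetic "activations".
rng = np.random.default_rng(0)
d = 16
w_true = rng.normal(size=d)
X = rng.normal(size=(500, d))          # stand-in for model activation vectors
y = (X @ w_true > 0).astype(float)     # binary labels (e.g. greater-than vs not)

w = np.zeros(d)
for _ in range(200):                   # gradient descent on logistic loss
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

acc = ((X @ w > 0) == y).mean()        # accuracy on the 0-1 scale reported above
print(round(acc, 3))
```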

Latest Results (Sep 23, 2025)

Task performance for Claude and Codex as of September 23, 2025:

Circuits Tasks

| Task | Claude IoU | Codex IoU | Claude Precision | Codex Precision | Claude Recall | Codex Recall |
|---|---|---|---|---|---|---|
| circuit_1 | 0.053 | 0.042 | 0.118 | 0.5 | 0.087 | 0.043 |
| circuit_2 | 0.062 | 0.067 | 0.2 | 0.25 | 0.083 | 0.083 |
| circuit_4 | 0.235 | 0.467 | 0.4 | 0.636 | 0.364 | 0.636 |
| circuit_5 | 0.083 | 0.8 | 0.083 | 0.8 | 1 | 1 |

Neurons Tasks

| Task | Claude Success Rate | Codex Success Rate |
|---|---|---|
| position | 0 | 0 |
| alphabet | 1 | 1 |
| position_1 | 1 | 1 |
| syntax | 1 | 1 |
| unigram | 0 | 1 |

Probes Tasks

| Task | Claude Accuracy | Codex Accuracy |
|---|---|---|
| greater_than | 0.732 | 0.964 |
| smaller_than | 0.728 | 0.582 |
| cities | 0.506 | 0.508 |
| regularization | 0.033 | 0.033 |
| smaller_than_pca | 1 | 1 |
| greater_than_pca | 0 | 0 |

Output Organization

Results Structure

runs/
├── {task}_{provider}_{timestamp}/
│   ├── notebooks/           # Generated Jupyter notebooks
│   ├── results/            # Evaluation results (JSON)
│   └── outputs/            # Task-specific outputs

Evaluation Results

  • Circuits: circuit_evaluation_results.json with IoU, precision, recall per task
  • Neurons: neuron_evaluation_results.json with success rates per neuron type
  • Probes: probe_evaluation_results.json with accuracy scores per probe
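The exact schema of these JSON files is an assumption; the sketch below constructs a sample file with plausible per-task fields (iou, precision, recall) and shows how a results file under runs/ might be read and summarized.

```python
import json
import pathlib
import tempfile

# Hypothetical schema for circuit_evaluation_results.json: one entry per task
# with the three circuit metrics. The sample data is illustrative only.
sample = {"circuit_1": {"iou": 0.053, "precision": 0.118, "recall": 0.087}}
path = pathlib.Path(tempfile.mkdtemp()) / "circuit_evaluation_results.json"
path.write_text(json.dumps(sample))

# Load and print a one-line summary per task.
results = json.loads(path.read_text())
for task, m in results.items():
    print(f"{task}: IoU={m['iou']:.3f} P={m['precision']:.3f} R={m['recall']:.3f}")
```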

Comparison Analysis
Generate comparison tables between providers:

python comparison_table_generator.py

File Structure

├── benchmark_all_providers.sh          # Main multi-provider script
├── scripts/
│   ├── runners/                        # Individual benchmark runners
│   └── evaluators/                     # Result evaluation scripts
├── prompts/                            # Task prompts by category
├── runs/                               # Output directory (auto-created)
├── evaluation_data/                    # Ground truth data
└── comparison_table_generator.py       # Results comparison tool

References

Our tasks are sourced from:

  • InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques — R. Gupta et al. (2024)
  • MIB: A Mechanistic Interpretability Benchmark — Aaron Mueller et al. (2025)
  • Universal Neurons in GPT2 Language Models — Wes Gurnee et al. (2024)
  • Steering Llama 2 via Contrastive Activation Addition — Nina Rimsky et al. (2023)
  • Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control — Aleksandar Makelov et al. (2024)
  • Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders — Dong Shu et al. (2025)
