
Add attention head characterization scripts and ablation experiments#403

Open
lee-goodfire wants to merge 81 commits into dev from feature/attn_plots

Conversation


lee-goodfire commented Feb 19, 2026

Description

Add a suite of attention head detection scripts, component-level induction analysis, new QK attention contribution plot types, and a position-specific ablation experiment with previous-token redundancy testing.

New detection scripts (each in spd/scripts/detect_*/):

  • Previous-token heads (+ random-token control variant)
  • Induction heads (synthetic repeated sequences)
  • Duplicate-token heads
  • Positional heads (offset profiles + BOS attention)
  • Delimiter heads
  • Successor heads (ordinal sequences with random-word controls)
  • S-inhibition heads (IOI prompts, OV copy scores)
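The previous-token detection described above reduces to reading each head's attention to the subdiagonal. A minimal sketch (not the PR's actual implementation; tensor layout is assumed):

```python
import torch

def prev_token_score(attn: torch.Tensor) -> torch.Tensor:
    """Mean attention each head pays to the immediately preceding position.

    attn: [batch, n_heads, seq, seq] post-softmax attention weights.
    Returns a [n_heads] tensor; previous-token heads score near 1.0.
    """
    # Attention from query position i to key position i-1 lies on the
    # subdiagonal of each head's attention matrix.
    sub = torch.diagonal(attn, offset=-1, dim1=-2, dim2=-1)  # [batch, n_heads, seq-1]
    # Query position 0 has no i-1 and is excluded by the subdiagonal itself.
    return sub.mean(dim=(0, 2))
```

The random-token control variant would run the same score on shuffled token sequences, distinguishing purely positional attention from content-driven attention.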

Component-level analysis (spd/scripts/characterize_induction_components/):

  • Weight concentration, ablation, cross-head interaction, and "why not perfect" analysis for L2H4

Attention ablation experiment (spd/scripts/attention_ablation_experiment/):

  • Position-specific ablation: ablate heads or SPD components at a single randomly chosen position per sample, measure effect on attention outputs via normalized inner product (NIP) and cosine similarity
  • Head ablation: zero a head's attention output at one position
  • Component ablation: zero q/k component masks at position-specific locations (q at t, k at t-1)
  • Previous-token redundancy test (--prev_token_test): tests whether ablating a head/component makes value ablation at t-1 redundant, quantifying how much of the prev-token information channel the ablation captures
    • Six forward passes per sample: baseline, A, B(all), B(specific), A+B(all), A+B(specific)
    • Seven pairwise comparisons including B-alone controls for interaction analysis
    • Key finding: SPD components (q:279, k:177) capture ~83% of prev-token value flow vs ~35% for full head ablation, providing evidence that SPD finds cross-head structure
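The two similarity metrics used above can be sketched as follows. The PR does not spell out the exact NIP definition, so this assumes the inner product normalized by the baseline's squared norm (i.e., the projection coefficient of the ablated output onto the baseline); treat it as illustrative:

```python
import torch
import torch.nn.functional as F

def ablation_similarity(ablated: torch.Tensor, baseline: torch.Tensor):
    """Compare ablated vs. baseline attention outputs at one position.

    Both tensors: [..., d_model]. Returns (nip, cos) per position.
    NOTE: this NIP definition is an assumption, not taken from the PR.
    """
    # Projection of ablated onto baseline; 1.0 means no change along baseline.
    nip = (ablated * baseline).sum(-1) / baseline.pow(2).sum(-1).clamp_min(1e-12)
    # Direction-only agreement, insensitive to magnitude.
    cos = F.cosine_similarity(ablated, baseline, dim=-1)
    return nip, cos
```

Using both metrics separates magnitude effects (NIP drops, cosine stays high) from directional effects (both drop).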

Plot enhancements (plot_qk_c_attention_contributions):

  • Per-head heatmaps across offsets
  • Head-vs-sum scatter plots
  • Pair contribution line plots (summed and per-head)

Shared utility (spd/scripts/collect_attention_patterns.py):

  • Extracted duplicated _collect_attention_patterns from all 7 detect scripts into a shared module

Bug fix:

  • Fixed GQA bug in detect_s_inhibition_heads: OV copy scores indexed W_V with Q-head indices instead of KV-head indices
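The index mapping behind this fix: under grouped-query attention, consecutive groups of query heads share one K/V head, so W_V must be indexed by the group, not the query head. A sketch of the correct mapping (helper name is illustrative):

```python
def kv_head_index(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query-head index to the KV head it shares under GQA.

    With n_q_heads=8 and n_kv_heads=4, query heads (0,1) share KV head 0,
    (2,3) share KV head 1, and so on. Indexing W_V directly with q_head
    (the bug fixed here) only coincides with this when n_q_heads == n_kv_heads.
    """
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```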

Harvest schema migration:

  • Updated all 10 plotting scripts for new harvest schema

Motivation and Context

Characterize attention head behavior in decomposed LlamaSimpleMLP models to understand component-level circuits. The ablation experiments provide quantitative evidence that SPD components correspond to specific attention mechanisms (previous-token behavior) in a way that cuts across individual attention heads.

How Has This Been Tested?

  • All scripts run against wandb:goodfire/spd/runs/s-275c8f21
  • Ablation experiments verified with 1024 samples for both head (L1H1) and component (q:279, k:177) modes
  • Previous-token redundancy test verified with B-alone controls and interaction analysis
  • All pass basedpyright and ruff checks (make check)

Does this PR introduce a breaking change?

No. All changes are additive (new scripts and plot functions). The harvest schema update aligns with the migration already merged on dev.

🤖 Generated with Claude Code

claude-spd1 and others added 30 commits February 19, 2026 13:57
Scatter plots of mean CI values per component, arranged in a grid by module.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Per-head Frobenius norm heatmaps for each attention projection (q/k/v/o),
showing how each component's weight is distributed across heads.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Bar charts of head-spread entropy per component for each attention projection,
showing whether components are concentrated on one head or spread across many.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Target vs reconstructed weight matrices per attention projection, with
individual component weight visualizations in paginated 4x4 grids.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Per-head activation magnitude heatmaps combining U-norm head structure
with actual activation magnitudes from harvest data.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Weight-only q·k attention contribution heatmaps between q and k subcomponents.
Single grid per layer with summed (all heads) and per-head breakdowns. Uses
V-norm-scaled U dot products to account for unnormalized magnitude split.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
K/V component co-activation heatmaps using pre-computed harvest co-occurrence
data. Three metrics per layer: CI co-occurrence counts, phi coefficient
(binary correlation), and Jaccard similarity of firing sets.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
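The three co-activation metrics named in the commit above can be computed from a binary firing matrix as follows. This is a simplified sketch (the script reads pre-computed harvest co-occurrence data rather than raw firings):

```python
import numpy as np

def coactivation_metrics(fires: np.ndarray):
    """Pairwise co-activation metrics from a binary firing matrix.

    fires: [n_samples, n_components] 0/1 array (component fired or not).
    Returns (counts, phi, jaccard), each [n_components, n_components].
    """
    f = fires.astype(np.float64)
    n = f.shape[0]
    both = f.T @ f                       # co-occurrence counts |A ∩ B|
    p = f.mean(axis=0)                   # marginal firing densities
    # Phi coefficient: Pearson correlation of two binary variables.
    denom = np.sqrt(p * (1 - p))
    phi = (both / n - np.outer(p, p)) / np.outer(denom, denom).clip(min=1e-12)
    # Jaccard similarity of firing sets: |A ∩ B| / |A ∪ B|.
    either = n - (1 - f).T @ (1 - f)
    jaccard = both / np.clip(either, 1e-12, None)
    return both, phi, jaccard
```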
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Generates per-layer PDF reports and companion markdown files tracing
attention component interactions: Q->K weight-only attention contributions
and K->V CI co-occurrence associations, with autointerp labels/reasoning.

Also excludes detect_* scripts from basedpyright (pre-existing type errors).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The loader was hardcoding is_tokenized=False, causing failures on
pre-tokenized datasets like danbraunai/pile-uncopyrighted-tok.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Non-streaming was impractical for large pre-tokenized datasets like
pile-uncopyrighted-tok (491M rows shuffled/mapped just for 512 samples).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The ~40k component model OOMs at batch_size=512 on 140GB H200.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add four new plot types: per-head heatmaps across offsets, head-vs-sum
scatter, and pair contribution line plots (summed and per-head). Also
add top_n_pairs parameter and trim default offsets.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Compute mean attention to position i-1 for each head across real text
data. Includes a random-tokens control variant to distinguish positional
from content-driven attention patterns.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use synthetic repeated token sequences [A B C | A B C] to measure
induction attention (from second occurrence to token following first
occurrence) for each head.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
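On a sequence whose second half repeats the first, an induction head at query position i attends to position i - half_len + 1: the token following the first occurrence of the current token. A sketch of that score (tensor layout assumed):

```python
import torch

def induction_score(attn: torch.Tensor, half_len: int) -> torch.Tensor:
    """Induction score on a repeated sequence [A B C ... | A B C ...].

    attn: [batch, n_heads, 2*half_len, 2*half_len] attention weights.
    Returns a [n_heads] tensor; induction heads score near 1.0.
    """
    seq = 2 * half_len
    queries = torch.arange(half_len, seq)   # second-half positions
    keys = queries - half_len + 1           # token after first occurrence
    score = attn[:, :, queries, keys]       # [batch, n_heads, half_len]
    return score.mean(dim=(0, 2))
```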
Measure mean attention weight landing on prior positions holding the
same token as the current query position, on real text data.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Compute positional offset profiles (attention vs relative offset) and
BOS attention scores for each head on real text data. Produces three
plots: max-offset heatmap, BOS heatmap, and per-head profile lines.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Measure fraction of each head's attention landing on delimiter tokens
(periods, commas, etc.) on real text, compared to baseline delimiter
frequency.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use ordinal sequences (digits, letters, days, months) with random-word
controls to isolate successor-specific attention patterns for each head.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use IOI prompts to measure S2-position attention and OV copy scores,
identifying heads that attend to repeated subject names and inhibit
copying via negative OV contributions.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Deep-dive into L2H4 at the SPD component level: weight concentration
analysis, per-component ablation effects on induction score, cross-head
interactions, and analysis of what prevents perfect induction behavior.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summarize findings from all 7 detection scripts and the component-level
analysis: multi-functional early heads, L1H1->L2H4 induction circuit,
layer-2 BOS sink pattern, and cross-cutting observations.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Document the mean_ci -> firing_density + mean_activations schema change,
the workaround for incompatible harvest sub-runs, and the list of files
affected by the migration.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
claude-spd1 and others added 21 commits February 23, 2026 00:31
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Subtracts a component's contribution from specific head rows only,
leaving other heads unaffected. This controls for whether the
component's effect is cross-head or concentrated in one head.

- Add ComponentHeadAblation dataclass and extraction helper
- Subtract component contribution from specific head q/k rows in
  patched forward (after q_proj/k_proj, before reshape/RoPE)
- New --restrict_to_heads CLI param for prev_token_test
- Results confirm per-head effect sits between head and full
  component ablation, validating cross-head interpretation

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sweeps value ablation across offsets 1..N from the ablated position,
producing a profile showing which offsets' values are captured by the
ablation. For a prev-token head, the curve peaks at offset 1.

Head ablation shows clear offset-1 specificity. Component ablation
is flat across offsets (all value info already captured).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Compares per-head attention distributions at query position t across
four conditions: target baseline, SPD baseline, full component ablation,
and per-head component ablation. Produces raw distribution and
difference plots, averaged over n samples with consistent y-axes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Integer ticks at every offset, clearer label "Offset from query position".

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previously mask_infos=None was passed on non-ablated generation steps,
causing the SPD model to fall through to the target model instead of
using component reconstruction. Now builds all-ones baseline masks
and passes them consistently.

Also adds value_pos_ablations and value_head_pos_ablations params to
_generate_greedy.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replaces ad-hoc inline scripts with a proper CLI tool that generates
HTML comparison tables for multiple ablation conditions:
- Head ablation, value ablation (t-1, t-1+t-2, all L1, all layers)
- Multiple component sets with full and per-head variants
- Multi-layer component support via JSON comp_sets arg
- Crafted prompt examples built-in
- SPD model uses baseline masks on non-ablated steps
- Persistent ablation mode for value conditions

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Fire auto-parses JSON strings into dicts before passing to the function.
Handle both str and dict types for comp_sets.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Replace _run_all_conditions dict + _make_condition_order with
  _build_conditions that returns ordered (name, tokens) pairs
- Value ablation layer derived from parsed_heads, not hardcoded to 1
- Condition names include actual head labels (e.g. "L1H1") not literals
- _generate_greedy uses keyword-only args after prompt_ids/gen_len
- Separate sections: generation, condition definitions, HTML rendering
- Remove fragile **kwargs pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add gen() helper in _build_conditions to reduce repetition
- Add assert t >= 2 at top of _build_conditions
- Extract run_sample() in main loop to deduplicate dataset/crafted paths
- Add docstring to _build_conditions listing all condition groups
- Document why value ablation conditions depend on parsed_heads
  (layer is inferred from the head spec)
- Remove t >= 1, t >= 2 conditionals since t >= 2 is now asserted

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _generate_greedy's ablate_first_only boolean with two explicit
functions:
- _generate_greedy_one_shot: ablation on first step only, then clean
  generation (for position-specific interventions)
- _generate_greedy_persistent: ablation re-applied every step, since
  no KV cache means the model recomputes from scratch each step
  (for "model modification" conditions like zeroing all values)

Both share _forward_once for the actual forward pass logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
All ablations now apply on the first generated token only. Deleted
_generate_greedy_persistent since we never want ablations to persist
across generation steps.

Renamed "[persist]" conditions to "Vals @ALL prev" — these zero values
at all prompt positions but only for one prediction, same as other
value ablation conditions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _generate_greedy_one_shot loop with _predict_next_token that
does exactly one forward pass and returns one token ID. Remove gen_len
parameter entirely. HTML tables now have one column.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each ablation condition now shows:
- Top-k tokens whose logits increased most vs baseline
- Top-k tokens whose logits decreased most vs baseline
- Change in the baseline's predicted token's logit

Baseline is target model for non-SPD conditions, SPD baseline for
component conditions.

Also: Prediction class replaces bare int return, ConditionResult
tuple tracks which baseline each condition uses.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
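The per-condition logit comparison described above amounts to ranking logit deltas against the baseline. A minimal sketch (function name and return shape are illustrative, not the PR's Prediction/ConditionResult types):

```python
import torch

def logit_movers(baseline_logits: torch.Tensor,
                 ablated_logits: torch.Tensor,
                 top_k: int = 5):
    """Tokens whose logits moved most under an ablation condition.

    Both inputs are [vocab] logit vectors for the same next-token position.
    Returns (increased, decreased) as (indices, deltas) pairs, plus the
    delta for the baseline's own argmax prediction.
    """
    delta = ablated_logits - baseline_logits
    inc = torch.topk(delta, top_k)      # largest increases
    dec = torch.topk(-delta, top_k)     # largest decreases (negated back below)
    pred = baseline_logits.argmax()
    return (inc.indices, inc.values), (dec.indices, -dec.values), delta[pred]
```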
Token-per-line layout with flexbox alignment, color-coded values
(green=increase, red=decrease), wider page, proper vertical alignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Inline token(value) format with nbsp separators instead of one-per-line.
Each row is now one line tall. Values still color-coded.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
One cell per token instead of all crammed into one cell. Header shows
inc 1..5 and dec 1..5 columns. Each cell has token + colored value.
Compact single-line rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
claude-spd1 and others added 8 commits February 25, 2026 16:57
Also relax t >= 2 assertion to t >= 1, skip t-1,t-2 condition when t < 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Short prompts are better for studying previous-token effects since the
ablated position is close to the prediction. The t >= 1 assertion and
t >= 2 guard on the t-1,t-2 condition handle 2-token prompts correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each logit cell shows the token on the first line and the colored
change value on the second. Increased default top_k from 5 to 20.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Expanded crafted prompts to 40 with focus on prev-token behavior:
bigrams, fixed expressions, sequences, code syntax, repetition.
Change values in logit cells now 9px for less visual noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace separate prompt div and token info line with inline prompt
text in a highlighted code box directly in the h2 heading.
Before: "Crafted: HTML | ablation at t=4" + separate prompt div
After:  "Crafted: HTML | <html><body> | t=4"

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Tokens separated by | delimiters. Blue = ablated position (t),
yellow = previous position (t-1), grey = other tokens. Makes
tokenization boundaries and ablation target immediately visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>