
Add attention head characterization scripts and ablation experiments#403

Open
lee-goodfire wants to merge 81 commits into dev from feature/attn_plots

Conversation


lee-goodfire commented Feb 19, 2026

Description

Add a suite of attention head detection scripts, component-level induction analysis, new QK attention contribution plot types, and a position-specific ablation experiment with previous-token redundancy testing.

New detection scripts (each in spd/scripts/detect_*/):

  • Previous-token heads (+ random-token control variant)
  • Induction heads (synthetic repeated sequences)
  • Duplicate-token heads
  • Positional heads (offset profiles + BOS attention)
  • Delimiter heads
  • Successor heads (ordinal sequences with random-word controls)
  • S-inhibition heads (IOI prompts, OV copy scores)
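The previous-token detection described above reduces to reading each head's attention to the subdiagonal. A minimal sketch (not the PR's actual implementation; tensor layout is assumed):

```python
import torch

def prev_token_score(attn: torch.Tensor) -> torch.Tensor:
    """Mean attention each head pays to the immediately preceding position.

    attn: [batch, n_heads, seq, seq] post-softmax attention weights.
    Returns a [n_heads] tensor; previous-token heads score near 1.0.
    """
    # Attention from query position i to key position i-1 lies on the
    # subdiagonal of each head's attention matrix.
    sub = torch.diagonal(attn, offset=-1, dim1=-2, dim2=-1)  # [batch, n_heads, seq-1]
    # Query position 0 has no i-1 and is excluded by the subdiagonal itself.
    return sub.mean(dim=(0, 2))
```

The random-token control variant would run the same score on shuffled token sequences, distinguishing purely positional attention from content-driven attention.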

Component-level analysis (spd/scripts/characterize_induction_components/):

  • Weight concentration, ablation, cross-head interaction, and "why not perfect" analysis for L2H4

Attention ablation experiment (spd/scripts/attention_ablation_experiment/):

  • Position-specific ablation: ablate heads or SPD components at a single randomly chosen position per sample, measure effect on attention outputs via normalized inner product (NIP) and cosine similarity
  • Head ablation: zero a head's attention output at one position
  • Component ablation: zero q/k component masks at position-specific locations (q at t, k at t-1)
  • Previous-token redundancy test (--prev_token_test): tests whether ablating a head/component makes value ablation at t-1 redundant, quantifying how much of the prev-token information channel the ablation captures
    • Six forward passes per sample: baseline, A, B(all), B(specific), A+B(all), A+B(specific)
    • Seven pairwise comparisons including B-alone controls for interaction analysis
    • Key finding: SPD components (q:279, k:177) capture ~83% of prev-token value flow vs ~35% for full head ablation, providing evidence that SPD finds cross-head structure
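The two similarity metrics used above can be sketched as follows. The PR does not spell out the exact NIP definition, so this assumes the inner product normalized by the baseline's squared norm (i.e., the projection coefficient of the ablated output onto the baseline); treat it as illustrative:

```python
import torch
import torch.nn.functional as F

def ablation_similarity(ablated: torch.Tensor, baseline: torch.Tensor):
    """Compare ablated vs. baseline attention outputs at one position.

    Both tensors: [..., d_model]. Returns (nip, cos) per position.
    NOTE: this NIP definition is an assumption, not taken from the PR.
    """
    # Projection of ablated onto baseline; 1.0 means no change along baseline.
    nip = (ablated * baseline).sum(-1) / baseline.pow(2).sum(-1).clamp_min(1e-12)
    # Direction-only agreement, insensitive to magnitude.
    cos = F.cosine_similarity(ablated, baseline, dim=-1)
    return nip, cos
```

Using both metrics separates magnitude effects (NIP drops, cosine stays high) from directional effects (both drop).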

Plot enhancements (plot_qk_c_attention_contributions):

  • Per-head heatmaps across offsets
  • Head-vs-sum scatter plots
  • Pair contribution line plots (summed and per-head)

Shared utility (spd/scripts/collect_attention_patterns.py):

  • Extracted duplicated _collect_attention_patterns from all 7 detect scripts into a shared module

Bug fix:

  • Fixed GQA bug in detect_s_inhibition_heads: OV copy scores indexed W_V with Q-head indices instead of KV-head indices
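The index mapping behind this fix: under grouped-query attention, consecutive groups of query heads share one K/V head, so W_V must be indexed by the group, not the query head. A sketch of the correct mapping (helper name is illustrative):

```python
def kv_head_index(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query-head index to the KV head it shares under GQA.

    With n_q_heads=8 and n_kv_heads=4, query heads (0,1) share KV head 0,
    (2,3) share KV head 1, and so on. Indexing W_V directly with q_head
    (the bug fixed here) only coincides with this when n_q_heads == n_kv_heads.
    """
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```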

Harvest schema migration:

  • Updated all 10 plotting scripts for new harvest schema

Motivation and Context

Characterize attention head behavior in decomposed LlamaSimpleMLP models to understand component-level circuits. The ablation experiments provide quantitative evidence that SPD components correspond to specific attention mechanisms (previous-token behavior) in a way that cuts across individual attention heads.

How Has This Been Tested?

  • All scripts run against wandb:goodfire/spd/runs/s-275c8f21
  • Ablation experiments verified with 1024 samples for both head (L1H1) and component (q:279, k:177) modes
  • Previous-token redundancy test verified with B-alone controls and interaction analysis
  • All pass basedpyright and ruff checks (make check)

Does this PR introduce a breaking change?

No. All changes are additive (new scripts and plot functions). The harvest schema update aligns with the migration already merged on dev.

🤖 Generated with Claude Code

claude-spd1 and others added 30 commits February 19, 2026 13:57
Scatter plots of mean CI values per component, arranged in a grid by module.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Per-head Frobenius norm heatmaps for each attention projection (q/k/v/o),
showing how each component's weight is distributed across heads.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Bar charts of head-spread entropy per component for each attention projection,
showing whether components are concentrated on one head or spread across many.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Target vs reconstructed weight matrices per attention projection, with
individual component weight visualizations in paginated 4x4 grids.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Per-head activation magnitude heatmaps combining U-norm head structure
with actual activation magnitudes from harvest data.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Weight-only q·k attention contribution heatmaps between q and k subcomponents.
Single grid per layer with summed (all heads) and per-head breakdowns. Uses
V-norm-scaled U dot products to account for unnormalized magnitude split.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
K/V component co-activation heatmaps using pre-computed harvest co-occurrence
data. Three metrics per layer: CI co-occurrence counts, phi coefficient
(binary correlation), and Jaccard similarity of firing sets.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
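The three co-activation metrics named in the commit above can be computed from a binary firing matrix as follows. This is a simplified sketch (the script reads pre-computed harvest co-occurrence data rather than raw firings):

```python
import numpy as np

def coactivation_metrics(fires: np.ndarray):
    """Pairwise co-activation metrics from a binary firing matrix.

    fires: [n_samples, n_components] 0/1 array (component fired or not).
    Returns (counts, phi, jaccard), each [n_components, n_components].
    """
    f = fires.astype(np.float64)
    n = f.shape[0]
    both = f.T @ f                       # co-occurrence counts |A ∩ B|
    p = f.mean(axis=0)                   # marginal firing densities
    # Phi coefficient: Pearson correlation of two binary variables.
    denom = np.sqrt(p * (1 - p))
    phi = (both / n - np.outer(p, p)) / np.outer(denom, denom).clip(min=1e-12)
    # Jaccard similarity of firing sets: |A ∩ B| / |A ∪ B|.
    either = n - (1 - f).T @ (1 - f)
    jaccard = both / np.clip(either, 1e-12, None)
    return both, phi, jaccard
```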
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Generates per-layer PDF reports and companion markdown files tracing
attention component interactions: Q->K weight-only attention contributions
and K->V CI co-occurrence associations, with autointerp labels/reasoning.

Also excludes detect_* scripts from basedpyright (pre-existing type errors).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The loader was hardcoding is_tokenized=False, causing failures on
pre-tokenized datasets like danbraunai/pile-uncopyrighted-tok.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Non-streaming was impractical for large pre-tokenized datasets like
pile-uncopyrighted-tok (491M rows shuffled/mapped just for 512 samples).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The ~40k component model OOMs at batch_size=512 on 140GB H200.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add four new plot types: per-head heatmaps across offsets, head-vs-sum
scatter, and pair contribution line plots (summed and per-head). Also
add top_n_pairs parameter and trim default offsets.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Compute mean attention to position i-1 for each head across real text
data. Includes a random-tokens control variant to distinguish positional
from content-driven attention patterns.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use synthetic repeated token sequences [A B C | A B C] to measure
induction attention (from second occurrence to token following first
occurrence) for each head.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
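On a sequence whose second half repeats the first, an induction head at query position i attends to position i - half_len + 1: the token following the first occurrence of the current token. A sketch of that score (tensor layout assumed):

```python
import torch

def induction_score(attn: torch.Tensor, half_len: int) -> torch.Tensor:
    """Induction score on a repeated sequence [A B C ... | A B C ...].

    attn: [batch, n_heads, 2*half_len, 2*half_len] attention weights.
    Returns a [n_heads] tensor; induction heads score near 1.0.
    """
    seq = 2 * half_len
    queries = torch.arange(half_len, seq)   # second-half positions
    keys = queries - half_len + 1           # token after first occurrence
    score = attn[:, :, queries, keys]       # [batch, n_heads, half_len]
    return score.mean(dim=(0, 2))
```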
Measure mean attention weight landing on prior positions holding the
same token as the current query position, on real text data.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Compute positional offset profiles (attention vs relative offset) and
BOS attention scores for each head on real text data. Produces three
plots: max-offset heatmap, BOS heatmap, and per-head profile lines.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Measure fraction of each head's attention landing on delimiter tokens
(periods, commas, etc.) on real text, compared to baseline delimiter
frequency.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use ordinal sequences (digits, letters, days, months) with random-word
controls to isolate successor-specific attention patterns for each head.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use IOI prompts to measure S2-position attention and OV copy scores,
identifying heads that attend to repeated subject names and inhibit
copying via negative OV contributions.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Deep-dive into L2H4 at the SPD component level: weight concentration
analysis, per-component ablation effects on induction score, cross-head
interactions, and analysis of what prevents perfect induction behavior.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summarize findings from all 7 detection scripts and the component-level
analysis: multi-functional early heads, L1H1->L2H4 induction circuit,
layer-2 BOS sink pattern, and cross-cutting observations.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Document the mean_ci -> firing_density + mean_activations schema change,
the workaround for incompatible harvest sub-runs, and the list of files
affected by the migration.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
claude-spd1 and others added 21 commits February 23, 2026 00:31
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Subtracts a component's contribution from specific head rows only,
leaving other heads unaffected. This controls for whether the
component's effect is cross-head or concentrated in one head.

- Add ComponentHeadAblation dataclass and extraction helper
- Subtract component contribution from specific head q/k rows in
  patched forward (after q_proj/k_proj, before reshape/RoPE)
- New --restrict_to_heads CLI param for prev_token_test
- Results confirm per-head effect sits between head and full
  component ablation, validating cross-head interpretation

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sweeps value ablation across offsets 1..N from the ablated position,
producing a profile showing which offsets' values are captured by the
ablation. For a prev-token head, the curve peaks at offset 1.

Head ablation shows clear offset-1 specificity. Component ablation
is flat across offsets (all value info already captured).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Compares per-head attention distributions at query position t across
four conditions: target baseline, SPD baseline, full component ablation,
and per-head component ablation. Produces raw distribution and
difference plots, averaged over n samples with consistent y-axes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Integer ticks at every offset, clearer label "Offset from query position".

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previously mask_infos=None was passed on non-ablated generation steps,
causing the SPD model to fall through to the target model instead of
using component reconstruction. Now builds all-ones baseline masks
and passes them consistently.

Also adds value_pos_ablations and value_head_pos_ablations params to
_generate_greedy.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replaces ad-hoc inline scripts with a proper CLI tool that generates
HTML comparison tables for multiple ablation conditions:
- Head ablation, value ablation (t-1, t-1+t-2, all L1, all layers)
- Multiple component sets with full and per-head variants
- Multi-layer component support via JSON comp_sets arg
- Crafted prompt examples built-in
- SPD model uses baseline masks on non-ablated steps
- Persistent ablation mode for value conditions

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Fire auto-parses JSON strings into dicts before passing to the function.
Handle both str and dict types for comp_sets.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Replace _run_all_conditions dict + _make_condition_order with
  _build_conditions that returns ordered (name, tokens) pairs
- Value ablation layer derived from parsed_heads, not hardcoded to 1
- Condition names include actual head labels (e.g. "L1H1") not literals
- _generate_greedy uses keyword-only args after prompt_ids/gen_len
- Separate sections: generation, condition definitions, HTML rendering
- Remove fragile **kwargs pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add gen() helper in _build_conditions to reduce repetition
- Add assert t >= 2 at top of _build_conditions
- Extract run_sample() in main loop to deduplicate dataset/crafted paths
- Add docstring to _build_conditions listing all condition groups
- Document why value ablation conditions depend on parsed_heads
  (layer is inferred from the head spec)
- Remove t >= 1, t >= 2 conditionals since t >= 2 is now asserted

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _generate_greedy's ablate_first_only boolean with two explicit
functions:
- _generate_greedy_one_shot: ablation on first step only, then clean
  generation (for position-specific interventions)
- _generate_greedy_persistent: ablation re-applied every step, since
  no KV cache means the model recomputes from scratch each step
  (for "model modification" conditions like zeroing all values)

Both share _forward_once for the actual forward pass logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
All ablations now apply on the first generated token only. Deleted
_generate_greedy_persistent since we never want ablations to persist
across generation steps.

Renamed "[persist]" conditions to "Vals @ALL prev" — these zero values
at all prompt positions but only for one prediction, same as other
value ablation conditions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _generate_greedy_one_shot loop with _predict_next_token that
does exactly one forward pass and returns one token ID. Remove gen_len
parameter entirely. HTML tables now have one column.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each ablation condition now shows:
- Top-k tokens whose logits increased most vs baseline
- Top-k tokens whose logits decreased most vs baseline
- Change in the baseline's predicted token's logit

Baseline is target model for non-SPD conditions, SPD baseline for
component conditions.

Also: Prediction class replaces bare int return, ConditionResult
tuple tracks which baseline each condition uses.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
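The per-condition logit comparison described above amounts to ranking logit deltas against the baseline. A minimal sketch (function name and return shape are illustrative, not the PR's Prediction/ConditionResult types):

```python
import torch

def logit_movers(baseline_logits: torch.Tensor,
                 ablated_logits: torch.Tensor,
                 top_k: int = 5):
    """Tokens whose logits moved most under an ablation condition.

    Both inputs are [vocab] logit vectors for the same next-token position.
    Returns (increased, decreased) as (indices, deltas) pairs, plus the
    delta for the baseline's own argmax prediction.
    """
    delta = ablated_logits - baseline_logits
    inc = torch.topk(delta, top_k)      # largest increases
    dec = torch.topk(-delta, top_k)     # largest decreases (negated back below)
    pred = baseline_logits.argmax()
    return (inc.indices, inc.values), (dec.indices, -dec.values), delta[pred]
```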
Token-per-line layout with flexbox alignment, color-coded values
(green=increase, red=decrease), wider page, proper vertical alignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Inline token(value) format with nbsp separators instead of one-per-line.
Each row is now one line tall. Values still color-coded.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
One cell per token instead of all crammed into one cell. Header shows
inc 1..5 and dec 1..5 columns. Each cell has token + colored value.
Compact single-line rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
claude-spd1 and others added 8 commits February 25, 2026 16:57
Also relax t >= 2 assertion to t >= 1, skip t-1,t-2 condition when t < 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Short prompts are better for studying previous-token effects since the
ablated position is close to the prediction. The t >= 1 assertion and
t >= 2 guard on the t-1,t-2 condition handle 2-token prompts correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each logit cell shows the token on the first line and the colored
change value on the second. Increased default top_k from 5 to 20.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Expanded crafted prompts to 40 with focus on prev-token behavior:
bigrams, fixed expressions, sequences, code syntax, repetition.
Change values in logit cells now 9px for less visual noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace separate prompt div and token info line with inline prompt
text in a highlighted code box directly in the h2 heading.
Before: "Crafted: HTML | ablation at t=4" + separate prompt div
After:  "Crafted: HTML | <html><body> | t=4"

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Tokens separated by | delimiters. Blue = ablated position (t),
yellow = previous position (t-1), grey = other tokens. Makes
tokenization boundaries and ablation target immediately visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>