Add attention head characterization scripts and ablation experiments #403
lee-goodfire wants to merge 81 commits into dev from feature/attn_plots
Conversation
Scatter plots of mean CI values per component, arranged in a grid by module. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Per-head Frobenius norm heatmaps for each attention projection (q/k/v/o), showing how each component's weight is distributed across heads. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Bar charts of head-spread entropy per component for each attention projection, showing whether components are concentrated on one head or spread across many. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Target vs reconstructed weight matrices per attention projection, with individual component weight visualizations in paginated 4x4 grids. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Per-head activation magnitude heatmaps combining U-norm head structure with actual activation magnitudes from harvest data. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Weight-only q·k attention contribution heatmaps between q and k subcomponents. Single grid per layer with summed (all heads) and per-head breakdowns. Uses V-norm-scaled U dot products to account for unnormalized magnitude split. Co-Authored-By: Claude Opus 4.6 <[email protected]>
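As a rough illustration of the weight-only contribution, here is a minimal NumPy sketch. It assumes (hypothetically) that each SPD component is a rank-1 factor pair `W_c = outer(u_c, v_c)`, with the head structure living in the output dimension of `u`; the function name and layouts are illustrative, not the script's actual API:

```python
import numpy as np

def qk_contribution(
    u_q: np.ndarray, v_q: np.ndarray,   # rank-1 factors of a q_proj component
    u_k: np.ndarray, v_k: np.ndarray,   # rank-1 factors of a k_proj component
    head: int, head_dim: int,
) -> float:
    """Weight-only q.k attention contribution for one head.

    Assumes each component contributes W_c = outer(u_c, v_c) to its
    projection, so magnitude can sit arbitrarily in either factor;
    scaling u by ||v|| makes the dot product invariant to that split
    (the "V-norm-scaled U dot products" described above).
    """
    sl = slice(head * head_dim, (head + 1) * head_dim)
    uq = u_q[sl] * np.linalg.norm(v_q)
    uk = u_k[sl] * np.linalg.norm(v_k)
    return float(uq @ uk)
```

Summing this quantity over heads would give the "all heads" grid; per-head values give the breakdown.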
K/V component co-activation heatmaps using pre-computed harvest co-occurrence data. Three metrics per layer: CI co-occurrence counts, phi coefficient (binary correlation), and Jaccard similarity of firing sets. Co-Authored-By: Claude Opus 4.6 <[email protected]>
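The two association metrics named above can be computed directly from binary firing masks. A minimal sketch (the function name and array layout are hypothetical; the actual script reads pre-computed harvest co-occurrence counts rather than raw masks):

```python
import numpy as np

def cooccurrence_metrics(fire_a: np.ndarray, fire_b: np.ndarray) -> dict[str, float]:
    """Binary co-activation metrics for two components' firing masks.

    fire_a, fire_b: boolean arrays over the same token positions.
    """
    a = fire_a.astype(bool)
    b = fire_b.astype(bool)
    n11 = float(np.sum(a & b))    # both fire
    n10 = float(np.sum(a & ~b))   # only a fires
    n01 = float(np.sum(~a & b))   # only b fires
    n00 = float(np.sum(~a & ~b))  # neither fires
    # Phi coefficient: Pearson correlation specialized to two binary variables.
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    phi = (n11 * n00 - n10 * n01) / denom if denom > 0 else 0.0
    # Jaccard similarity of the two firing sets.
    union = n11 + n10 + n01
    jaccard = n11 / union if union > 0 else 0.0
    return {"count": n11, "phi": phi, "jaccard": jaccard}
```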
Generates per-layer PDF reports and companion markdown files tracing attention component interactions: Q->K weight-only attention contributions and K->V CI co-occurrence associations, with autointerp labels/reasoning. Also excludes detect_* scripts from basedpyright (pre-existing type errors). Co-Authored-By: Claude Opus 4.6 <[email protected]>
The loader was hardcoding is_tokenized=False, causing failures on pre-tokenized datasets like danbraunai/pile-uncopyrighted-tok. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Non-streaming was impractical for large pre-tokenized datasets like pile-uncopyrighted-tok (491M rows shuffled/mapped just for 512 samples). Co-Authored-By: Claude Opus 4.6 <[email protected]>
The ~40k component model OOMs at batch_size=512 on 140GB H200. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add four new plot types: per-head heatmaps across offsets, head-vs-sum scatter, and pair contribution line plots (summed and per-head). Also add top_n_pairs parameter and trim default offsets. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Compute mean attention to position i-1 for each head across real text data. Includes a random-tokens control variant to distinguish positional from content-driven attention patterns. Co-Authored-By: Claude Opus 4.6 <[email protected]>
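A sketch of the previous-token score, assuming attention weights in a `[batch, n_heads, seq, seq]` layout (hypothetical; the actual collection code may differ):

```python
import numpy as np

def prev_token_scores(attn: np.ndarray) -> np.ndarray:
    """Mean attention each head places on the immediately preceding position.

    attn: [batch, n_heads, seq, seq] post-softmax attention weights.
    Returns a [n_heads] array averaged over the batch and over query
    positions 1..seq-1 (position 0 has no i-1 to attend to).
    """
    seq = attn.shape[-1]
    q = np.arange(1, seq)
    to_prev = attn[:, :, q, q - 1]   # attention from position i to i-1
    return to_prev.mean(axis=(0, 2))
```

Running the same measurement on shuffled random tokens gives the control: a purely positional head keeps its score, while a content-driven head's score drops.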
Use synthetic repeated token sequences [A B C | A B C] to measure induction attention (from second occurrence to token following first occurrence) for each head. Co-Authored-By: Claude Opus 4.6 <[email protected]>
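The induction measurement can be sketched as follows, again with a hypothetical `[batch, n_heads, 2L, 2L]` attention layout:

```python
import numpy as np

def induction_scores(attn: np.ndarray, half_len: int) -> np.ndarray:
    """Per-head induction score on a repeated sequence [A B C | A B C].

    attn: [batch, n_heads, 2L, 2L] attention weights, where L = half_len.
    For a query at position t in the second half, the induction target is
    t - L + 1: the position just after this token's first occurrence.
    Returns [n_heads] mean attention mass on that target.
    """
    L = half_len
    q = np.arange(L, 2 * L)               # query positions in the second half
    return attn[:, :, q, q - L + 1].mean(axis=(0, 2))
```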
Measure mean attention weight landing on prior positions holding the same token as the current query position, on real text data. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Compute positional offset profiles (attention vs relative offset) and BOS attention scores for each head on real text data. Produces three plots: max-offset heatmap, BOS heatmap, and per-head profile lines. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Measure fraction of each head's attention landing on delimiter tokens (periods, commas, etc.) on real text, compared to baseline delimiter frequency. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use ordinal sequences (digits, letters, days, months) with random-word controls to isolate successor-specific attention patterns for each head. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Use IOI prompts to measure S2-position attention and OV copy scores, identifying heads that attend to repeated subject names and inhibit copying via negative OV contributions. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Deep-dive into L2H4 at the SPD component level: weight concentration analysis, per-component ablation effects on induction score, cross-head interactions, and analysis of what prevents perfect induction behavior. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summarize findings from all 7 detection scripts and the component-level analysis: multi-functional early heads, L1H1->L2H4 induction circuit, layer-2 BOS sink pattern, and cross-cutting observations. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Document the mean_ci -> firing_density + mean_activations schema change, the workaround for incompatible harvest sub-runs, and the list of files affected by the migration. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Subtracts a component's contribution from specific head rows only, leaving other heads unaffected. This controls for whether the component's effect is cross-head or concentrated in one head.
- Add ComponentHeadAblation dataclass and extraction helper
- Subtract component contribution from specific head q/k rows in patched forward (after q_proj/k_proj, before reshape/RoPE)
- New --restrict_to_heads CLI param for prev_token_test
- Results confirm the per-head effect sits between head ablation and full component ablation, validating the cross-head interpretation
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
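The head-restricted subtraction can be illustrated with a small sketch (names and layouts are hypothetical; the real patch operates inside the model's forward pass on the projection output before reshape/RoPE):

```python
import numpy as np

def ablate_component_on_heads(
    proj_out: np.ndarray,       # [seq, n_heads * head_dim] q_proj/k_proj output
    comp_contrib: np.ndarray,   # same shape: this component's contribution
    heads: list[int],
    head_dim: int,
) -> np.ndarray:
    """Subtract a component's contribution from selected heads only.

    Before the reshape, output dimensions [h*head_dim, (h+1)*head_dim)
    belong to head h, so subtracting on that slice leaves every other
    head's rows untouched.
    """
    out = proj_out.copy()
    for h in heads:
        sl = slice(h * head_dim, (h + 1) * head_dim)
        out[:, sl] -= comp_contrib[:, sl]
    return out
```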
Sweeps value ablation across offsets 1..N from the ablated position, producing a profile showing which offsets' values are captured by the ablation. For a prev-token head, the curve peaks at offset 1. Head ablation shows clear offset-1 specificity. Component ablation is flat across offsets (all value info already captured). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Compares per-head attention distributions at query position t across four conditions: target baseline, SPD baseline, full component ablation, and per-head component ablation. Produces raw distribution and difference plots, averaged over n samples with consistent y-axes. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Integer ticks at every offset, clearer label "Offset from query position". Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previously mask_infos=None was passed on non-ablated generation steps, causing the SPD model to fall through to the target model instead of using component reconstruction. Now builds all-ones baseline masks and passes them consistently. Also adds value_pos_ablations and value_head_pos_ablations params to _generate_greedy. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replaces ad-hoc inline scripts with a proper CLI tool that generates HTML comparison tables for multiple ablation conditions:
- Head ablation, value ablation (t-1, t-1+t-2, all L1, all layers)
- Multiple component sets with full and per-head variants
- Multi-layer component support via JSON comp_sets arg
- Crafted prompt examples built-in
- SPD model uses baseline masks on non-ablated steps
- Persistent ablation mode for value conditions
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Fire auto-parses JSON strings into dicts before passing to the function. Handle both str and dict types for comp_sets. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Replace _run_all_conditions dict + _make_condition_order with _build_conditions that returns ordered (name, tokens) pairs
- Value ablation layer derived from parsed_heads, not hardcoded to 1
- Condition names include actual head labels (e.g. "L1H1"), not literals
- _generate_greedy uses keyword-only args after prompt_ids/gen_len
- Separate sections: generation, condition definitions, HTML rendering
- Remove fragile **kwargs pattern
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add gen() helper in _build_conditions to reduce repetition
- Add assert t >= 2 at top of _build_conditions
- Extract run_sample() in main loop to deduplicate dataset/crafted paths
- Add docstring to _build_conditions listing all condition groups
- Document why value ablation conditions depend on parsed_heads (layer is inferred from the head spec)
- Remove t >= 1 and t >= 2 conditionals since t >= 2 is now asserted
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _generate_greedy's ablate_first_only boolean with two explicit functions:
- _generate_greedy_one_shot: ablation on the first step only, then clean generation (for position-specific interventions)
- _generate_greedy_persistent: ablation re-applied every step, since no KV cache means the model recomputes from scratch each step (for "model modification" conditions like zeroing all values)
Both share _forward_once for the actual forward pass logic.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
All ablations now apply on the first generated token only. Deleted _generate_greedy_persistent since we never want ablations to persist across generation steps. Renamed "[persist]" conditions to "Vals @ALL prev" — these zero values at all prompt positions but only for one prediction, same as other value ablation conditions. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _generate_greedy_one_shot loop with _predict_next_token that does exactly one forward pass and returns one token ID. Remove gen_len parameter entirely. HTML tables now have one column. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each ablation condition now shows:
- Top-k tokens whose logits increased most vs baseline
- Top-k tokens whose logits decreased most vs baseline
- Change in the baseline's predicted token's logit
The baseline is the target model for non-SPD conditions and the SPD baseline for component conditions. Also: a Prediction class replaces the bare int return, and a ConditionResult tuple tracks which baseline each condition uses.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
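The per-condition logit-change report can be sketched as follows (hypothetical function; in the real script token IDs would be decoded to strings via the tokenizer):

```python
import numpy as np

def logit_changes(
    base_logits: np.ndarray, abl_logits: np.ndarray, k: int = 5
) -> tuple[list[tuple[int, float]], list[tuple[int, float]], float]:
    """Top-k logit movers under an ablation, vs a chosen baseline.

    Returns the k token IDs whose logits increased most, the k that
    decreased most, and the change in the baseline prediction's logit.
    """
    diff = abl_logits - base_logits
    order = np.argsort(diff)
    top_inc = [(int(i), float(diff[i])) for i in order[::-1][:k]]
    top_dec = [(int(i), float(diff[i])) for i in order[:k]]
    base_pred = int(np.argmax(base_logits))
    return top_inc, top_dec, float(diff[base_pred])
```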
Token-per-line layout with flexbox alignment, color-coded values (green=increase, red=decrease), wider page, proper vertical alignment. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Inline token(value) format with nbsp separators instead of one-per-line. Each row is now one line tall. Values still color-coded. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
One cell per token instead of all crammed into one cell. Header shows inc 1..5 and dec 1..5 columns. Each cell has token + colored value. Compact single-line rows. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Also relax t >= 2 assertion to t >= 1, skip t-1,t-2 condition when t < 2. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Short prompts are better for studying previous-token effects since the ablated position is close to the prediction. The t >= 1 assertion and t >= 2 guard on the t-1,t-2 condition handle 2-token prompts correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each logit cell shows the token on the first line and the colored change value on the second. Increased default top_k from 5 to 20. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Expanded crafted prompts to 40 with focus on prev-token behavior: bigrams, fixed expressions, sequences, code syntax, repetition. Change values in logit cells now 9px for less visual noise. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace separate prompt div and token info line with inline prompt text in a highlighted code box directly in the h2 heading.
Before: "Crafted: HTML | ablation at t=4" + separate prompt div
After: "Crafted: HTML | <html><body> | t=4"
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Tokens separated by | delimiters. Blue = ablated position (t), yellow = previous position (t-1), grey = other tokens. Makes tokenization boundaries and ablation target immediately visible. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Description
Add a suite of attention head detection scripts, component-level induction analysis, new QK attention contribution plot types, and a position-specific ablation experiment with previous-token redundancy testing.
- New detection scripts (each in `spd/scripts/detect_*/`)
- Component-level analysis (`spd/scripts/characterize_induction_components/`)
- Attention ablation experiment (`spd/scripts/attention_ablation_experiment/`): `--prev_token_test` tests whether ablating a head/component makes value ablation at t-1 redundant, quantifying how much of the prev-token information channel the ablation captures
- Plot enhancements (`plot_qk_c_attention_contributions`)
- Shared utility (`spd/scripts/collect_attention_patterns.py`): extracts `_collect_attention_patterns` from all 7 detect scripts into a shared module
- Bug fix: `detect_s_inhibition_heads` OV copy scores indexed `W_V` with Q-head indices instead of KV-head indices
- Harvest schema migration: adopts the mean_ci -> firing_density + mean_activations schema change
Motivation and Context
Characterize attention head behavior in decomposed LlamaSimpleMLP models to understand component-level circuits. The ablation experiments provide quantitative evidence that SPD components correspond to specific attention mechanisms (previous-token behavior) in a way that cuts across individual attention heads.
How Has This Been Tested?
- wandb: `goodfire/spd/runs/s-275c8f21`
- `make check` passes

Does this PR introduce a breaking change?
No. All changes are additive (new scripts and plot functions). The harvest schema update aligns with the migration already merged on dev.

🤖 Generated with Claude Code