81 commits
f3dd309
Add plot_mean_ci script
claude-spd1 Feb 16, 2026
b630906
Add plot_component_head_norms script
claude-spd1 Feb 16, 2026
2dd8a6c
Add plot_head_spread script
claude-spd1 Feb 16, 2026
ed5178f
Add plot_attention_weights script
claude-spd1 Feb 16, 2026
dd07888
Add plot_per_head_component_activations script
claude-spd1 Feb 16, 2026
547b884
Add plot_attention_contributions script
claude-spd1 Feb 16, 2026
9cc2c63
Add plot_kv_coactivation script
claude-spd1 Feb 16, 2026
d61bb8a
Add plot_kv_vt_similarity script
claude-spd1 Feb 16, 2026
90cd0bd
Add attention_stories script for Q->K->V interaction chain reports
claude-spd1 Feb 17, 2026
b05e55c
Rename plot_attention_contributions to plot_qk_c_attention_contributions
claude-spd1 Feb 17, 2026
70e0819
Add dotenv loading to app backend server
claude-spd1 Feb 17, 2026
4ae5dae
Remove readonly from InterpDB in InterpRepo.open
claude-spd1 Feb 17, 2026
523cd92
Add RoPE-aware QK attention computation with multi-offset support
claude-spd1 Feb 17, 2026
4acac7b
Add clustering configs for s-275c8f21
claude-spd1 Feb 18, 2026
c9b54a1
Fix clustering dataset loader to respect is_tokenized from task config
claude-spd1 Feb 19, 2026
a6b506d
Read streaming setting from task config in clustering dataset loader
claude-spd1 Feb 19, 2026
96b33c9
Reduce clustering batch_size to 32 for s-275c8f21
claude-spd1 Feb 19, 2026
e7e2497
Increase clustering iters to 5000 for s-275c8f21
claude-spd1 Feb 19, 2026
78d8658
Add 10k and 25k iter clustering configs for s-275c8f21
claude-spd1 Feb 19, 2026
f8daef0
Add per-head and line plot visualizations to QK attention contributions
claude-spd1 Feb 19, 2026
587aff2
Add previous-token head detection script
claude-spd1 Feb 19, 2026
8397950
Add induction head detection script
claude-spd1 Feb 19, 2026
14bc7da
Add duplicate-token head detection script
claude-spd1 Feb 19, 2026
f591b07
Add positional head detection script
claude-spd1 Feb 19, 2026
2f23d34
Add delimiter head detection script
claude-spd1 Feb 19, 2026
33e7061
Add successor head detection script
claude-spd1 Feb 19, 2026
013791a
Add S-inhibition head detection script
claude-spd1 Feb 19, 2026
b1c39f1
Add SPD component-level induction head characterization script
claude-spd1 Feb 19, 2026
9400a6a
Add attention head characterization report for s-275c8f21
claude-spd1 Feb 19, 2026
b08d71b
Add harvest data schema migration notes
claude-spd1 Feb 19, 2026
dd3ba3d
Extract shared collect_attention_patterns utility and fix GQA bug
claude-spd1 Feb 19, 2026
eccad9f
Update clustering config: batch_size=1024, iters=1000
claude-spd1 Feb 19, 2026
89ea32d
Update clustering batch_size to 512 (1024 OOM'd)
claude-spd1 Feb 19, 2026
afc80f1
Update clustering batch_size to 256 (512 also OOM'd)
claude-spd1 Feb 19, 2026
04f10d0
Update plotting scripts for new harvest schema
claude-spd1 Feb 19, 2026
9dad2b8
Update clustering iters to 20000
claude-spd1 Feb 19, 2026
265a9e5
Add per-head grid plot, caching, and selective plot generation for qk…
claude-spd1 Feb 20, 2026
b299f6e
Add P(V|K) and P(K|V) conditional probability plots to KV coactivatio…
claude-spd1 Feb 20, 2026
872bc6b
Add total attention contribution line to per-head qk pair plots
claude-spd1 Feb 20, 2026
1c1914c
Add single-datapoint attention pattern plots to prev-token head detec…
claude-spd1 Feb 20, 2026
25aea37
Add faint gray background lines to the summed-across-heads lines plot
claude-spd1 Feb 20, 2026
6dd313e
Plot all interactions as gray background lines and adjust alpha to 0.45
claude-spd1 Feb 20, 2026
aba0054
Add v_proj parameter zeroing for head ablation and value vector heatmaps
claude-spd1 Feb 21, 2026
118d218
Add output inner product and cosine similarity analysis to attention …
claude-spd1 Feb 21, 2026
2a01268
Add position-specific ablation and bar chart summaries
claude-spd1 Feb 22, 2026
948d35d
Shorten plot filenames and fix cosine sim metrics
claude-spd1 Feb 22, 2026
913bb44
Use normalized inner product (dot/||baseline||²) for ablation metrics
claude-spd1 Feb 22, 2026
7262a97
Add max_plot_samples to limit per-sample plots while using all for stats
claude-spd1 Feb 22, 2026
14300a5
Add prev-token head redundancy test (--prev_token_test)
claude-spd1 Feb 22, 2026
33ec9b5
Add B-alone controls and Baseline-vs-AB comparisons to prev-token test
claude-spd1 Feb 22, 2026
b953c36
Seed all RNGs for deterministic ablation experiments
claude-spd1 Feb 22, 2026
bbdd505
Merge dev into feature/attn_plots
claude-spd1 Feb 22, 2026
bc2565f
Add clustering configs for s-275c8f21 alpha=30 and alpha=100
claude-spd1 Feb 23, 2026
5bf0713
Add clustering configs for s-275c8f21 alpha=0.5 and alpha=0.1
claude-spd1 Feb 23, 2026
e397eb8
Add clustering configs for s-275c8f21 alpha=1
claude-spd1 Feb 23, 2026
a5bacfc
Add clustering configs for s-275c8f21 alpha=2,3,5,8
claude-spd1 Feb 23, 2026
103c5b3
Merge branch 'feature/attn_plots' of github.com:goodfire-ai/spd into …
claude-spd1 Feb 24, 2026
a374706
Add per-head component ablation (--restrict_to_heads)
claude-spd1 Feb 24, 2026
a0f726e
Add offset sweep for value ablation profile (--offset_sweep N)
claude-spd1 Feb 24, 2026
a69fc40
Add attention pattern diff plotting script
claude-spd1 Feb 24, 2026
115f80c
Fix x-axis ticks and label in attention pattern diff plots
claude-spd1 Feb 24, 2026
3035712
Fix SPD model generation: use baseline masks on non-ablated steps
claude-spd1 Feb 24, 2026
5817a40
Consolidate generation comparison into proper script
claude-spd1 Feb 25, 2026
54375cd
Fix Fire dict parsing for comp_sets argument
claude-spd1 Feb 25, 2026
cedb5bc
Clean up generation script: clearer structure and no hardcoded values
claude-spd1 Feb 25, 2026
106771f
Improve generation script clarity and reduce bug surface
claude-spd1 Feb 25, 2026
5872798
Split generation into one_shot and persistent functions
claude-spd1 Feb 25, 2026
f713294
Remove persistent generation mode, add all-prev value ablation
claude-spd1 Feb 25, 2026
7527b20
Simplify to single-token prediction, remove gen_len
claude-spd1 Feb 25, 2026
b93a879
Add logit analysis to generation comparison HTML
claude-spd1 Feb 25, 2026
158734a
Improve logit analysis HTML formatting
claude-spd1 Feb 25, 2026
c17f3c1
Make logit analysis cells horizontal and compact
claude-spd1 Feb 25, 2026
2e6f687
Use separate table cells for each top-k logit change
claude-spd1 Feb 25, 2026
5c062f3
Replace crafted prompts with 30 short (4-8 token) examples
claude-spd1 Feb 25, 2026
b4a2be1
Restore short crafted prompts (2-8 tokens)
claude-spd1 Feb 25, 2026
d8e8851
Two-line logit cells (token/change) and top_k=20
claude-spd1 Feb 25, 2026
0af9a8d
Left-align logit cells with minimal width
claude-spd1 Feb 25, 2026
edbf826
Add more prev-token crafted prompts (40 total), smaller change text
claude-spd1 Feb 25, 2026
14e005a
Skip non-ASCII dataset samples (non-English text)
claude-spd1 Feb 25, 2026
d572774
Show prompt text inline in sample heading for readability
claude-spd1 Feb 26, 2026
51a4caf
Show tokenized prompt with color-coded positions
claude-spd1 Feb 26, 2026
141 changes: 141 additions & 0 deletions HARVEST_DATA_SCHEMA_MIGRATION_NOTES.md
@@ -0,0 +1,141 @@
# Harvest DB Schema Migration: `mean_ci` to `firing_density` + `mean_activations`

**Date investigated**: 2026-02-18
**Current branch**: `feature/attn_plots`
**Status**: Not yet migrated. Workaround in place for plotting scripts.

## What happened

A colleague is generalizing the harvest pipeline to work across decomposition methods (SPD, MOLT, CLT, SAE) on branches `feature/harvest-generic` and `feature/autointerp-generic`. As part of this, the harvest DB schema changed. The key commit is `70eceb8f` ("Generalize harvest pipeline over decomposition methods") by Claude SPD1, dated 2026-02-16.

A new harvest sub-run for `s-275c8f21` was created on 2026-02-18 using code from `feature/harvest-generic`, producing data with the new schema. This data is incompatible with the code on `dev`, `main`, and `feature/attn_plots`, which still expect the old schema.

## Schema change details

### Old schema (on `dev`, `main`, `feature/attn_plots`)

```sql
CREATE TABLE components (
component_key TEXT PRIMARY KEY,
layer TEXT NOT NULL,
component_idx INTEGER NOT NULL,
mean_ci REAL NOT NULL, -- mean causal importance across all tokens
activation_examples TEXT NOT NULL,
input_token_pmi TEXT NOT NULL,
output_token_pmi TEXT NOT NULL
);
```

`ActivationExample` dataclass fields: `token_ids: list[int]`, `ci_values: list[float]`, `component_acts: list[float]`

### New schema (on `feature/harvest-generic`, commit `70eceb8f`)

```sql
CREATE TABLE components (
component_key TEXT PRIMARY KEY,
layer TEXT NOT NULL,
component_idx INTEGER NOT NULL,
firing_density REAL NOT NULL, -- proportion of tokens where component fired (0-1)
mean_activations TEXT NOT NULL, -- JSON dict, e.g. {"causal_importance": 0.007}
activation_examples TEXT NOT NULL,
input_token_pmi TEXT NOT NULL,
output_token_pmi TEXT NOT NULL
);
```

`ActivationExample` dataclass fields: `token_ids: list[int]`, `firings: list[bool]`, `activations: dict[str, list[float]]`

### Field mapping

| Old field | New field | Notes |
|---|---|---|
| `mean_ci` (float) | `mean_activations["causal_importance"]` (float, inside JSON dict) | Same semantic meaning for SPD runs |
| *(not present)* | `firing_density` (float) | New metric: proportion of tokens where component fired |
| `ci_values` (on ActivationExample) | `activations["causal_importance"]` (on ActivationExample) | Per-token CI values, now keyed by activation type |
| `component_acts` (on ActivationExample) | `activations["component_activation"]` (on ActivationExample) | Per-token component activations |
| *(not present)* | `firings` (on ActivationExample) | Boolean per-token firing indicators |
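
The summary-level part of this mapping can be illustrated with a minimal sketch (the values are the example numbers from this document; the row dicts are hypothetical stand-ins for DB rows):

```python
import json

# Hypothetical rows illustrating the field mapping for one component.
old_row = {"mean_ci": 0.007389060687273741}
new_row = {
    "firing_density": 0.011455078125,  # new metric, no old-schema equivalent
    "mean_activations": json.dumps({"causal_importance": 0.007389060687273741}),
}

# The old scalar now lives inside a JSON dict, keyed by activation type.
mean_ci = json.loads(new_row["mean_activations"])["causal_importance"]
assert mean_ci == old_row["mean_ci"]
```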

### Example new data row

From `s-275c8f21` sub-run `h-20260218_000000`:
```
firing_density: 0.011455078125
mean_activations: {"causal_importance": 0.007389060687273741}
```

## Branches involved

| Branch | Schema version | Status |
|---|---|---|
| `main` | Old (`mean_ci`) | Production |
| `dev` | Old (`mean_ci`) | Development |
| `feature/attn_plots` | Old (`mean_ci`) | Current work branch |
| `feature/harvest-generic` | New (`firing_density` + `mean_activations`) | Colleague's WIP |
| `feature/autointerp-generic` | New | Colleague's WIP |

The schema change commits (`70eceb8f`, `5e66fd49`, `ad68187d`) exist **only** on `feature/harvest-generic` and `feature/autointerp-generic`. They are NOT on `dev` or `main`.

## Current workaround

The broken sub-run was renamed so the `HarvestRepo` skips it:

```
_h-20260218_000000.bak <-- new schema, renamed with _ prefix
h-20260212_150336 <-- old schema, now picked as "latest" by HarvestRepo
```

`HarvestRepo.open()` picks the latest `h-*` directory by lexicographic sort. The `_` prefix makes the renamed directory fail the `d.name.startswith("h-")` check (see `spd/harvest/repo.py:46`), so it is skipped.

## Files that need updating when migrating

### Core harvest module (update schema definitions)

1. **`spd/harvest/schemas.py`** — `ComponentSummary` and `ComponentData` dataclasses: replace `mean_ci: float` with `firing_density: float` + `mean_activations: dict[str, float]`. Also update `ActivationExample`.
2. **`spd/harvest/db.py`** — SQL schema, `_serialize_component()`, `_deserialize_component()`, `get_summary()`, `get_all_components()`.
3. **`spd/harvest/harvester.py`** — `build_results()` yields `ComponentData`; needs to compute `firing_density` and `mean_activations` dict.

Reference implementation: `git show 70eceb8f:spd/harvest/schemas.py` and `git show 70eceb8f:spd/harvest/db.py`.

### App backend (API schemas + endpoints)

4. **`spd/app/backend/schemas.py`** — `SubcomponentMetadata.mean_ci` and `SubcomponentActivationContexts.mean_ci`.
5. **`spd/app/backend/routers/activation_contexts.py`** — Extracts and sorts by `mean_ci` in 3 endpoints.

### Autointerp module

6. **`spd/autointerp/interpret.py`** — Sorts components by `c.mean_ci` (line 116).
7. **`spd/autointerp/strategies/compact_skeptical.py`** — Uses `mean_ci * 100` and `1 / component.mean_ci` for LLM prompt formatting.

### Dataset attributions

8. **`spd/dataset_attributions/harvest.py`** — Filters alive components with `summary[key].mean_ci > ci_threshold` (line 90).

### Plotting scripts (9 files, all follow same pattern)

All have `MIN_MEAN_CI` constant and `_get_alive_indices()` that filters on `s.mean_ci > threshold`:

9. `spd/scripts/plot_qk_c_attention_contributions/plot_qk_c_attention_contributions.py`
10. `spd/scripts/attention_stories/attention_stories.py`
11. `spd/scripts/characterize_induction_components/characterize_induction_components.py`
12. `spd/scripts/plot_kv_vt_similarity/plot_kv_vt_similarity.py`
13. `spd/scripts/plot_attention_weights/plot_attention_weights.py`
14. `spd/scripts/plot_kv_coactivation/plot_kv_coactivation.py`
15. `spd/scripts/plot_per_head_component_activations/plot_per_head_component_activations.py`
16. `spd/scripts/plot_head_spread/plot_head_spread.py`
17. `spd/scripts/plot_component_head_norms/plot_component_head_norms.py`

### Dedicated CI visualization

18. **`spd/scripts/plot_mean_ci/plot_mean_ci.py`** — Entire script dedicated to visualizing `mean_ci` distributions. May need renaming or deprecation.

## Recommended migration approach

Once `feature/harvest-generic` is stable and merged to `dev`:

1. **Port core schema changes** from `feature/harvest-generic` (items 1-3 above). The reference implementation at commit `70eceb8f` has the complete updated `db.py` and `schemas.py`.

2. **Update consumers** (items 4-18). For filtering "alive" components, the equivalent of `mean_ci > threshold` is `mean_activations["causal_importance"] > threshold`. Consider whether `firing_density` would be a better filter (it's a cleaner concept: "does this component fire often enough?").

3. **No legacy fallback needed** — per repo conventions (CLAUDE.md: "Don't add legacy fallbacks or migration code"). Old harvest data should be re-harvested with new code if needed.

4. **Consider extracting `_get_alive_indices`** into a shared utility — it's duplicated across 9 plotting scripts with identical logic.
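
A sketch of what that shared utility could look like under the new schema (name, signature, and input shape are all hypothetical; under the old schema the filter was `s.mean_ci > threshold`):

```python
def get_alive_indices(summaries: dict, threshold: float,
                      key: str = "causal_importance") -> list:
    """Hypothetical shared replacement for the nine per-script copies of
    _get_alive_indices, ported to the new mean_activations schema."""
    return [
        component_key
        for component_key, s in summaries.items()
        if s["mean_activations"].get(key, 0.0) > threshold
    ]

summaries = {
    "c0": {"mean_activations": {"causal_importance": 0.0074}},
    "c1": {"mean_activations": {"causal_importance": 0.0001}},
}
assert get_alive_indices(summaries, threshold=0.001) == ["c0"]
```
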
172 changes: 172 additions & 0 deletions attention_head_report.md
@@ -0,0 +1,172 @@
# Attention Head Characterization: s-275c8f21

Model: 4-layer, 6-head LlamaSimpleMLP (d_model=768, head_dim=128), pretrained model t-32d1bb3b.

All scripts live in `spd/scripts/detect_*/` and output to `spd/scripts/detect_*/out/s-275c8f21/`.

## Analyses

### Previous-Token Heads

**Method**: On real text (eval split, 100 batches of 32), extract the offset-1 diagonal of each head's attention matrix — i.e., `attn[i, i-1]` — and average across positions and batches.
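
A sketch of this measurement for a single head's attention matrix (numpy; the helper name is hypothetical, not the script's actual code):

```python
import numpy as np

def prev_token_score(attn: np.ndarray) -> float:
    """Mean of the offset-1 diagonal attn[i, i-1] over positions i >= 1,
    for one head's [seq, seq] attention pattern."""
    i = np.arange(1, attn.shape[0])
    return float(attn[i, i - 1].mean())

# A perfect previous-token head scores 1.0:
attn = np.eye(8, k=-1)
attn[0, 0] = 1.0  # position 0 has no previous token; attend to itself
assert prev_token_score(attn) == 1.0
```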

**Results**:
| Head | Score |
|------|-------|
| L1H1 | 0.604 |
| L0H5 | 0.308 |

All other heads score below 0.1. L1H1 is a clear previous-token head, spending over 60% of its attention on the immediately preceding token. L0H5 is weaker but still notable.

### Induction Heads

**Method**: Synthetic data — repeated random token sequences `[A B C ... | A B C ...]`. Measures the "offset diagonal" of attention in the second half: at position `L+k`, how much attention goes to position `k+1` (the token that followed the current token's earlier occurrence). This is the textbook induction pattern. 100 batches of 32, half-sequence length 256.
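
The offset-diagonal measurement can be sketched as follows (indexing conventions are assumptions, not the script's exact code):

```python
import numpy as np

def induction_score(attn: np.ndarray, half_len: int) -> float:
    """For a repeated sequence [x_0..x_{L-1} | x_0..x_{L-1}], mean attention
    from each second-half position L+k to position k+1 -- the token that
    followed the current token's earlier occurrence."""
    L = half_len
    q = np.arange(L, 2 * L)  # query positions in the second half
    k = q - L + 1            # key: one past the earlier occurrence
    return float(attn[q, k].mean())

# A perfect induction head puts all attention on this offset diagonal:
L = 4
attn = np.zeros((2 * L, 2 * L))
attn[np.arange(L, 2 * L), np.arange(L, 2 * L) - L + 1] = 1.0
assert induction_score(attn, half_len=L) == 1.0
```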

**Results**:
| Head | Score |
|------|-------|
| L2H4 | 0.629 |

No other head scores above 0.1. L2H4 is a strong, clean induction head.

The L1H1 → L2H4 pairing forms the classic two-layer induction circuit: L1H1 shifts information one position back (previous-token), composing with L2H4's key-query matching to attend to "what came after this token last time."

### Duplicate-Token Heads

**Method**: On real text, build a boolean mask of positions where a prior token has the same ID, then measure mean attention to those same-token positions. Only positions with at least one prior duplicate contribute to the score, and batches are weighted by the number of valid positions.
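
A single-sequence sketch of this score (the script additionally weights across batches; helper and variable names are hypothetical):

```python
import numpy as np

def duplicate_token_score(attn: np.ndarray, token_ids: np.ndarray) -> float:
    """Mean attention mass on strictly earlier positions holding the same
    token ID, averaged over query positions with at least one prior
    duplicate."""
    seq = len(token_ids)
    same = token_ids[:, None] == token_ids[None, :]           # [q, k] same ID
    earlier = np.tril(np.ones((seq, seq), dtype=bool), k=-1)  # k < q only
    mask = same & earlier
    has_dup = mask.any(axis=1)
    if not has_dup.any():
        return 0.0
    per_query = (attn * mask).sum(axis=1)
    return float(per_query[has_dup].mean())

tokens = np.array([7, 3, 7, 3])
attn = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],  # position 2 puts 0.8 on its duplicate (pos 0)
    [0.1, 0.6, 0.1, 0.2],  # position 3 puts 0.6 on its duplicate (pos 1)
])
assert np.isclose(duplicate_token_score(attn, tokens), 0.7)
```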

**Results**:
| Head | Score |
|------|-------|
| L0H4 | 0.323 |
| L0H2 | 0.202 |

All other heads below 0.05. Both heads are in layer 0, suggesting duplicate-token detection happens early.

### Successor Heads

**Method**: Constructs ordinal sequences (digits, letters, number words, days, months) as comma-separated lists and measures attention from each element to its ordinal predecessor (2 positions back, since commas intervene). A control condition uses random words in place of ordinals, with the same positional structure. The "signal" is ordinal score minus control score, isolating semantic successor attention from positional artifacts.

**Results** (signal > 0.05):
| Head | Ordinal | Control | Signal |
|------|---------|---------|--------|
| L0H2 | 0.379 | 0.073 | +0.307 |
| L0H4 | 0.155 | 0.001 | +0.154 |
| L1H0 | 0.098 | 0.041 | +0.058 |
| L1H1 | 0.174 | 0.108 | +0.067 |
| L3H0 | 0.192 | 0.121 | +0.070 |
| L1H2 | 0.497 | 0.443 | +0.054 |

L0H2 is the standout successor head. L0H4 is secondary. Several other heads show modest signals.

Note that L1H2 has high ordinal attention (0.497) but nearly as high control attention (0.443), suggesting it attends strongly to position-2-back regardless of content. The control subtraction properly removes this.

### S-Inhibition Heads

**Method**: Two-pronged analysis using IOI (Indirect Object Identification) prompts of the form "When Alice and Bob went to the store, Bob gave a drink to" → Alice.

1. **Data-driven**: Measures attention from the final position to the second occurrence of the subject name (S2). High S2 attention means the head is "looking at" the repeated name.
2. **Weight-based**: Computes the OV copy score `W_U[t] @ W_O_h @ W_V_h @ W_E[t]` averaged over name tokens. Negative values indicate the head suppresses (rather than promotes) the attended token's logit.

An S-inhibition head should have high S2 attention *and* a negative copy score.
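
The weight-based prong can be sketched directly from the formula above (shape conventions are assumptions; a per-head slice of the projection matrices is implied):

```python
import numpy as np

def ov_copy_score(W_E, W_V, W_O, W_U, name_tokens) -> float:
    """OV copy score W_U[t] @ W_O @ W_V @ W_E[t], averaged over name tokens.
    Assumed shapes: W_E [vocab, d_model], W_V [d_model, d_head],
    W_O [d_head, d_model], W_U [d_model, vocab]. Negative -> attending to
    token t pushes t's own logit down (suppression)."""
    scores = [(W_E[t] @ W_V @ W_O) @ W_U[:, t] for t in name_tokens]
    return float(np.mean(scores))

# With an OV circuit equal to -I, the head is a pure "anti-copier":
d = 4
score = ov_copy_score(np.eye(d), np.eye(d), -np.eye(d), np.eye(d), [0, 2])
assert score == -1.0
```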

**Results** (candidates: attn > 0.1 and copy < 0):
| Head | Attn to S2 | OV Copy | Assessment |
|------|-----------|---------|------------|
| L3H2 | 0.377 | -0.029 | Strongest candidate |
| L2H1 | 0.151 | -0.001 | Weak candidate |

L3H2 is the clearest S-inhibition candidate: it strongly attends to the repeated subject and has a negative copy score (suppression). L2H1 attends to S2 moderately but its copy score is only marginally negative.

Several other heads have high S2 attention but positive copy scores (e.g., L3H0 at attn=0.156, copy=+0.007), suggesting they *copy* the subject rather than inhibit it — a different role in the IOI circuit.

### Delimiter Heads

**Method**: On real text, identifies delimiter token IDs (`.` `,` `;` `:` `!` `?` `\n` and multi-char variants) via the tokenizer, then measures the mean fraction of each head's attention landing on delimiter tokens. Compares to the baseline delimiter frequency in the data (~10.7%). Reports the ratio over baseline.
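
A sketch of the ratio computation (simplified: it ignores the causal-mask correction to the baseline that a careful implementation would apply):

```python
import numpy as np

def delimiter_ratio(attn: np.ndarray, token_ids: np.ndarray,
                    delim_ids: set) -> float:
    """Mean attention mass landing on delimiter tokens, divided by the
    delimiter frequency in the data."""
    is_delim = np.isin(token_ids, sorted(delim_ids))
    mass = attn[:, is_delim].sum(axis=1).mean()  # mass on delimiter keys
    baseline = is_delim.mean()                   # delimiter frequency
    return float(mass / baseline)

tokens = np.array([9, 1, 2, 3])  # token 9 is the only delimiter (25% of data)
attn = np.zeros((4, 4))
attn[:, 0] = 1.0                 # every query attends only to it
assert delimiter_ratio(attn, tokens, {9}) == 4.0
```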

**Results**: No head exceeds 2.0x baseline. Highest ratios:
| Head | Raw Attn | Ratio |
|------|----------|-------|
| L0H5 | 0.187 | 1.74x |
| L1H0 | 0.184 | 1.72x |
| L1H4 | 0.178 | 1.66x |

This model does not appear to have dedicated delimiter heads. Most heads sit in the 1.0-1.7x range — modestly above baseline but not specialized. This could reflect the model's small size, or it could mean delimiter attention is distributed across heads rather than concentrated.

### Positional Heads

**Method**: On real text, builds a mean attention profile by relative offset for each head (offset = query_pos - key_pos). Also measures attention to absolute position 0 (BOS). A "positional head" has high attention concentrated at a specific offset; a "BOS head" attends heavily to position 0.
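
The offset profile can be sketched as follows (hypothetical helper; the script also aggregates over batches):

```python
import numpy as np

def offset_profile(attn: np.ndarray, max_offset: int) -> np.ndarray:
    """Mean attention by relative offset (query_pos - key_pos) for one head.
    prof[d] is the average of attn[i, i-d] over valid positions i."""
    seq = attn.shape[0]
    prof = np.zeros(max_offset + 1)
    for d in range(max_offset + 1):
        i = np.arange(d, seq)
        prof[d] = attn[i, i - d].mean()
    return prof

# A previous-token head peaks at offset 1:
attn = np.eye(8, k=-1)
attn[0, 0] = 1.0
prof = offset_profile(attn, max_offset=3)
assert prof.argmax() == 1
```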

**Results — offset-based**:
| Head | Max Offset Score | Peak Offset |
|------|-----------------|-------------|
| L1H1 | 0.604 | 1 |
| L0H5 | 0.308 | 1 |
| L1H0 | 0.265 | 1 |
| L1H3 | 0.226 | 2 |

L1H1 and L0H5 are the same heads already identified as previous-token heads, confirming the result from a different angle. L1H3 peaks at offset 2 — it preferentially attends two positions back.

**Results — BOS attention**:
| Head | BOS Score |
|------|-----------|
| L2H4 | 0.489 |
| L2H5 | 0.355 |
| L2H3 | 0.337 |
| L2H1 | 0.318 |
| L2H0 | 0.232 |
| L2H2 | 0.217 |
| L3H3 | 0.248 |
| L3H2 | 0.208 |
| L3H4 | 0.206 |

All six heads in layer 2 have substantial BOS attention (0.22–0.49). Layer 3 also shows moderate BOS attention in several heads. Layers 0–1 show negligible BOS attention (< 0.01).

## Cross-Cutting Observations

### Multi-functional early heads

L0H2 and L0H4 both serve as duplicate-token *and* successor heads. These are layer-0 heads operating directly on token embeddings, suggesting the behaviors may share a mechanism: attending to tokens that are "similar" to the current one (exact match for duplicate-token, ordinal neighbor for successor). Whether this reflects a single underlying computation or two coincidentally co-located behaviors is unclear from these analyses alone.

**Hypothesis**: L0H2 and L0H4 may implement a general "embedding similarity" attention pattern that manifests as duplicate-token detection on repeated tokens and successor detection on ordinal sequences. Testing this would require measuring the correlation between these heads' attention weights and embedding cosine similarity.

### The induction circuit

The L1H1 (previous-token) → L2H4 (induction) circuit is clean and well-separated. L1H1 scores 0.604 on previous-token and L2H4 scores 0.629 on induction, with no other head approaching either score. This is the textbook two-layer induction circuit.

### Layer 2 as a BOS sink

The uniform high BOS attention across all of layer 2 is striking. L2H4 — the induction head — has the highest BOS score (0.489) despite also being the strongest induction head. This might seem contradictory, but BOS attention and induction attention operate on different token positions: BOS attention is measured as an average across *all* query positions, while induction attention is measured specifically at positions following repeated sequences. L2H4 likely defaults to BOS when there's no induction pattern to match, using position 0 as an attention sink.

**Hypothesis**: Layer 2's BOS attention may serve as a "no-op" or default state. When a head doesn't have a strong content-based signal, it parks attention on BOS rather than distributing it noisily. This is a known phenomenon in transformer models (sometimes called "attention sinking"), and BOS is a natural sink since it's always available and semantically neutral in context.

### S-inhibition is late and sparse

Only L3H2 shows a convincing S-inhibition signal (layer 3, near the output). This makes architectural sense: S-inhibition requires first identifying the repeated subject (which depends on earlier duplicate-token and induction mechanisms) before suppressing it. The fact that it appears in the final layer is consistent with it being a downstream consumer of earlier head outputs.

### No dedicated delimiter heads

The absence of strong delimiter heads is a genuine null result, not a limitation of the method. The method would have detected them if present (the baseline-ratio approach has no inherent ceiling). This model apparently handles structural boundaries through other means, or distributes delimiter attention diffusely.

### Caveats

- All data-driven scores are averages. A head with a moderate average score might be strongly specialized on a subset of inputs and inactive on others. Per-example distributions would be more informative but are not captured here.
- The IOI template is a single fixed pattern. S-inhibition scores might differ with varied sentence structures.
- The successor head control condition (random words) controls for positional patterns but not for all confounds — e.g., if the tokenizer assigns similar embeddings to ordinal tokens, heads might use embedding similarity rather than "knowing" ordinal structure.
- OV copy scores (used in S-inhibition) are a linear approximation. They measure the direct path through one head and don't account for nonlinear interactions or composition with other heads/layers.

## Summary Table

| Head | Primary Role(s) | Evidence Strength |
|------|----------------|-------------------|
| L0H2 | Successor, duplicate-token | Strong (signal=0.307, dup=0.202) |
| L0H4 | Duplicate-token, successor | Strong (dup=0.323, signal=0.154) |
| L0H5 | Previous-token | Moderate (0.308) |
| L1H1 | Previous-token | Strong (0.604) |
| L1H3 | Offset-2 positional | Moderate (0.226 at offset 2) |
| L2H1 | Weak S-inhibition candidate | Weak (attn=0.151, copy=-0.001) |
| L2H4 | Induction, BOS sink | Strong induction (0.629), strong BOS (0.489) |
| L2H* | BOS sink (all layer 2) | Strong (0.22–0.49 across all heads) |
| L3H2 | S-inhibition | Moderate (attn=0.377, copy=-0.029) |

Heads not listed individually (L0H1, L0H3, L1H0, L1H2, L1H4, L1H5, L3H0–L3H1, L3H3–L3H5) did not show strong specialization in any analysis, though several had modest signals across multiple categories. Layer-2 heads without individual roles (L2H0, L2H2, L2H3, L2H5) are covered by the L2H* BOS sink row.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -101,7 +101,7 @@ known-third-party = ["wandb"]

[tool.pyright]
include = ["spd", "tests"]
exclude = ["**/wandb/**", "spd/utils/linear_sum_assignment.py", "spd/app/frontend"]
exclude = ["**/wandb/**", "spd/utils/linear_sum_assignment.py", "spd/app/frontend", "spd/scripts/detect_*"]
stubPath = "typings" # Having type stubs for transformers shaves 10 seconds off basedpyright calls

strictListInference = true
2 changes: 2 additions & 0 deletions spd/app/backend/server.py
@@ -16,6 +16,7 @@
import fire
import torch
import uvicorn
from dotenv import load_dotenv
from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
@@ -42,6 +43,7 @@
from spd.log import logger
from spd.utils.distributed_utils import get_device

load_dotenv()
DEVICE = get_device()


2 changes: 1 addition & 1 deletion spd/autointerp/repo.py
@@ -49,7 +49,7 @@ def open(cls, run_id: str) -> "InterpRepo | None":
if not db_path.exists():
return None
return cls(
db=InterpDB(db_path, readonly=True),
db=InterpDB(db_path),
subrun_dir=subrun_dir,
run_id=run_id,
)
20 changes: 20 additions & 0 deletions spd/clustering/configs/crc/s-275c8f21-10k.json
@@ -0,0 +1,20 @@
{
"merge_config": {
"activation_threshold": 0.1,
"alpha": 1,
"iters": 10000,
"merge_pair_sampling_method": "range",
"merge_pair_sampling_kwargs": {"threshold": 0.001},
"filter_dead_threshold": 0.1,
"module_name_filter": null
},
"model_path": "wandb:goodfire/spd/runs/s-275c8f21",
"batch_size": 32,
"wandb_project": null,
"logging_intervals": {
"stat": 10,
"tensor": 200,
"plot": 2000,
"artifact": 2000
}
}