Draft
Changes from all commits
85 commits
9143920
Add agent swarm for parallel behavior investigation
claude Jan 30, 2026
498d459
Stream Claude Code output to file in real-time
claude Jan 30, 2026
efe5928
Use stream-json output format and add max_turns limit
claude Jan 30, 2026
ef5b0fd
Fix stream-json output requiring --verbose flag
claude Jan 30, 2026
f40f02e
Add GPU lock to prevent concurrent GPU operations
claude-spd1 Jan 30, 2026
567fb19
Add research_log.md for human-readable agent progress
claude-spd1 Jan 30, 2026
4c4a843
Add full timestamps to research log examples
claude-spd1 Jan 30, 2026
dcb28f4
Merge remote-tracking branch 'origin/dev' into claude/slurm-agent-swa…
claude-spd1 Jan 31, 2026
cb6e6f0
wip: Integrate agent swarm with MCP for Claude Code tool access
claude-spd1 Jan 31, 2026
06cf2e8
Fix MCP JSON-RPC response format violating spec
claude-spd1 Jan 31, 2026
39b5acb
wip: Refactor agent swarm MCP configuration to require all swarm sett…
claude-spd1 Jan 31, 2026
ae88d53
Fix agent swarm hanging at ~80% optimization
claude-spd1 Feb 1, 2026
58129b0
Simplify agent swarm env vars: 4 → 2
claude-spd1 Feb 1, 2026
b47733f
wip: Add graph artifacts to investigation research logs
claude-spd1 Feb 2, 2026
6f957a0
Merge branch 'dev' into claude/slurm-agent-swarm-lIpTu
claude-spd1 Feb 13, 2026
13cf49a
Refactor agent_swarm → investigate: single-agent, researcher-directed
claude-spd1 Feb 13, 2026
1b30e81
Fix investigation wandb_path matching
claude-spd1 Feb 13, 2026
22f9971
UI improvements: run picker arch labels, artifact graph layout, inves…
claude-spd1 Feb 13, 2026
2625f34
Fix MCP canonical/concrete key translation
claude-spd1 Feb 13, 2026
474e2f3
Move app DB from repo-local .data/ to SPD_OUT_DIR/app/
claude-spd1 Feb 13, 2026
2e6dff1
Sandbox investigation agent to MCP-only, revert DB path move
claude-spd1 Feb 13, 2026
26ff2a7
Isolate investigation agent from global Claude Code config
claude-spd1 Feb 13, 2026
edc7c58
Add topological interpretation module
ocg-goodfire Feb 19, 2026
a41cfef
Remove output-label dependency from cofiring neighbors
ocg-goodfire Feb 19, 2026
54d5a7b
Rename neighbor → related component terminology
ocg-goodfire Feb 19, 2026
c7991a9
Skip components missing labels in unification pass
ocg-goodfire Feb 20, 2026
cb18c86
Use in-memory accumulator for scan state, DB is write-only
ocg-goodfire Feb 20, 2026
eb480c4
Editing and autointerp
ocg-goodfire Feb 19, 2026
9ab56d9
Add kaleido dep and __main__ to run_interpret
ocg-goodfire Feb 19, 2026
7f0cb44
Replace editing.py with editing/ package (adds optimize_circuit, prin…
ocg-goodfire Feb 22, 2026
4a3a076
Add token_divergence_edits.yaml config for editing experiments
ocg-goodfire Feb 22, 2026
5c9f344
Add example-heavy README for spd/editing module
ocg-goodfire Feb 23, 2026
b5d3a02
Request 1 GPU for autointerp/eval/intruder SLURM jobs
ocg-goodfire Feb 19, 2026
09c8bd8
Fix YAML configs to use current schema and fix misleading error messa…
danbraunai-goodfire Feb 19, 2026
b95f6bf
Fix SQLite issues on NFS: remove WAL, separate read/write connections
ocg-goodfire Feb 20, 2026
64556ba
Fix attributions SLURM passing full config instead of inner config
ocg-goodfire Feb 20, 2026
9e7d3ef
add worktrees to ignore
ocg-goodfire Feb 23, 2026
e8cd454
Rewrite dataset attribution storage: dict-of-dicts, canonical names, …
ocg-goodfire Feb 23, 2026
a116ddd
Fix alive_targets iteration: use torch.where for indices, not bool to…
ocg-goodfire Feb 23, 2026
5f98d81
Fix KeyError for embed source: CI dict doesn't include embedding layer
ocg-goodfire Feb 23, 2026
01633c5
Fix scatter_add OOB: use embedding num_embeddings instead of tokenize…
ocg-goodfire Feb 23, 2026
9118c1e
Split run.py into run_worker.py and run_merge.py
ocg-goodfire Feb 23, 2026
d0166d0
Correct attr_abs via backprop through |target|, reorganise method sig…
ocg-goodfire Feb 23, 2026
fd42030
Add merge_mem config (default 200G) to prevent merge OOM
ocg-goodfire Feb 23, 2026
223afd4
Add 3-metric selection to dataset attributions in app
ocg-goodfire Feb 23, 2026
4fc7cf1
Allow bare s-prefixed run IDs everywhere (e.g. "s-17805b61")
ocg-goodfire Feb 23, 2026
6a5d0f6
Fix AttributionRepo.open skipping valid subruns due to old-format dirs
ocg-goodfire Feb 23, 2026
08b17c9
Fix 3s lag on attribution metric toggle: O(V) linear scan per pill
ocg-goodfire Feb 23, 2026
b0df7c0
Ship token strings from backend instead of resolving vocab IDs in fro…
ocg-goodfire Feb 23, 2026
2385a82
Hide negative attribution column for non-signed metrics
ocg-goodfire Feb 23, 2026
747991a
Narrow frontend types: SignedAttributions vs UnsignedAttributions
ocg-goodfire Feb 23, 2026
e36e187
Update dataset_attributions CLAUDE.md for new storage format and 3 me…
ocg-goodfire Feb 23, 2026
206dcf0
Integrate new dataset attributions storage, lazy harvest loading, emb…
ocg-goodfire Feb 23, 2026
42fca11
Separate output/input context in prompts, reduce examples, remove err…
ocg-goodfire Feb 23, 2026
c397a2c
Add activation examples to unification prompt
ocg-goodfire Feb 23, 2026
f12aedf
Clean up prompts: human-readable keys, normalized attributions, filte…
ocg-goodfire Feb 23, 2026
98a65ae
Tweak component display, tighten error threshold to 5%
ocg-goodfire Feb 23, 2026
17a25ba
wip.
ocg-goodfire Feb 24, 2026
4781853
wip: Refactor dataset attribution harvester to track abs attributions
ocg-goodfire Feb 24, 2026
a557576
Rewrite dataset attribution storage with explicit edge types
ocg-goodfire Feb 24, 2026
48d318c
Fix embed path not removed from unembed sources in harvester
ocg-goodfire Feb 24, 2026
b44115a
Rename topological_interp → graph_interp and integrate into SPD app
ocg-goodfire Feb 24, 2026
ef4ec4b
Store raw attribution sums, normalize at query time
ocg-goodfire Feb 24, 2026
7298cd7
Fix n_batches removal, detach tensors on save, handle output source q…
ocg-goodfire Feb 24, 2026
3ccc301
Add graph interp badge to components tab, prune model graph to 500 nodes
ocg-goodfire Feb 25, 2026
98dcc55
Merge remote-tracking branch 'origin/dev' into feature/model-editing
ocg-goodfire Feb 25, 2026
07aa9af
Merge remote-tracking branch 'origin/feature/topological-interp' into…
ocg-goodfire Feb 25, 2026
63c544e
Expand graph interp badge with detail, edges, token strings, and auto…
ocg-goodfire Feb 25, 2026
889a89e
Move graph interp detail fetch into useComponentData hooks
ocg-goodfire Feb 25, 2026
e87b274
Merge remote-tracking branch 'origin/feature/topological-interp' into…
ocg-goodfire Feb 25, 2026
e183401
tiny tidy
ocg-goodfire Feb 25, 2026
af7e28a
Add editing experiment docs, circuit export tools, and worktree note
ocg-goodfire Feb 25, 2026
5beaa66
wip: Add embed token count normalization for dataset attributions
ocg-goodfire Feb 26, 2026
52c275e
fold in the investigator work
ocg-goodfire Feb 26, 2026
5a3856b
Merge branch 'feature/topological-interp' into feature/model-editing
ocg-goodfire Feb 26, 2026
0f759d5
wip: Add CI optimization visualization during graph computation
ocg-goodfire Feb 26, 2026
b65f7ae
Add --dependency flag to spd-postprocess and integrate graph-interp
ocg-goodfire Feb 26, 2026
0d9e466
Merge remote-tracking branch 'origin/dev' into feature/model-editing
ocg-goodfire Feb 27, 2026
76963ac
Merge branch 'dev' into feature/model-editing
danbraunai-goodfire Feb 27, 2026
e194ffa
Merge branch 'dev' into feature/model-editing
danbraunai-goodfire Feb 27, 2026
0820460
add new canon run to registry
ocg-goodfire Feb 27, 2026
de787b5
Fix app UI issues: prompt input, interpretation 404s, interventions loop
ocg-goodfire Feb 27, 2026
2eae377
wip.
ocg-goodfire Feb 27, 2026
199f277
Add clustering tab
danbraunai-goodfire Feb 27, 2026
db74e8d
Merge branch 'dev' into feature/model-editing
danbraunai-goodfire Feb 27, 2026
12 changes: 12 additions & 0 deletions .claude/skills/gpudash.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
---
name: gpudash
description: Check GPU availability across the SLURM cluster
user_invocable: true
---

# gpudash

Run the `gpudash` command to show GPU availability across the cluster.

## Steps
1. Run `gpudash` and show the output to the user.
1 change: 1 addition & 0 deletions .claude/worktrees/bold-elm-8kpb
Submodule bold-elm-8kpb added at 356f8c
1 change: 1 addition & 0 deletions .claude/worktrees/bright-fox-a4i0
Submodule bright-fox-a4i0 added at 356f8c
1 change: 1 addition & 0 deletions .claude/worktrees/calm-owl-v4pj
Submodule calm-owl-v4pj added at dbe066
1 change: 1 addition & 0 deletions .claude/worktrees/cozy-frolicking-stream
Submodule cozy-frolicking-stream added at 356f8c
1 change: 1 addition & 0 deletions .claude/worktrees/stateless-dancing-blanket
Submodule stateless-dancing-blanket added at 356f8c
1 change: 1 addition & 0 deletions .claude/worktrees/swift-owl-yep9
Submodule swift-owl-yep9 added at 356f8c
1 change: 1 addition & 0 deletions .claude/worktrees/swift-ray-amfs
Submodule swift-ray-amfs added at 356f8c
1 change: 1 addition & 0 deletions .claude/worktrees/vectorized-wiggling-whisper
Submodule vectorized-wiggling-whisper added at cb18c8
1 change: 1 addition & 0 deletions .claude/worktrees/xenodochial-germain
Submodule xenodochial-germain added at 5c9f34
4 changes: 3 additions & 1 deletion .gitignore
@@ -177,4 +177,6 @@ cython_debug/
#.idea/

**/*.db
**/*.db*

.claude/worktrees
7 changes: 1 addition & 6 deletions .mcp.json
@@ -1,8 +1,3 @@
{
"mcpServers": {
"svelte-llm": {
"type": "http",
"url": "https://svelte-llm.stanislav.garden/mcp/mcp"
}
}
"mcpServers": {}
}
86 changes: 70 additions & 16 deletions CLAUDE.md

Large diffs are not rendered by default.

Empty file added bio.py
Empty file.
6 changes: 6 additions & 0 deletions configs/autointerp_dual_view.yaml
@@ -0,0 +1,6 @@
config:
model: google/gemini-3-flash-preview
reasoning_effort: low
template_strategy:
type: dual_view
evals: null
12 changes: 12 additions & 0 deletions configs/token_divergence_edits.yaml
@@ -0,0 +1,12 @@
Male pronouns:
- h.1.mlp.down_proj:798
- h.1.mlp.c_fc:144
- h.1.attn.o_proj:82

Question marks:
- h.1.mlp.down_proj:534
- h.1.mlp.c_fc:891
- h.1.attn.o_proj:6

Memorized template:
- h.1.mlp.down_proj:136
209 changes: 209 additions & 0 deletions docs/editing_conceptual_notes.md

Large diffs are not rendered by default.

104 changes: 104 additions & 0 deletions docs/editing_process_learnings.md
@@ -0,0 +1,104 @@
# VPD Model Editing: Process Learnings

Notes on what works, what doesn't, and what to watch out for when doing component-level model editing with VPD decompositions. Written after a day of intensive experimentation on `s-892f140b` (2-layer Llama, SimpleStories, 7104 components).

## Component Selection

### Use output PMI, not labels, for finding ablation targets

The single biggest methodological lesson. When you want to suppress token X, search for components that *predict* X (high output PMI), not components that *respond to* X (high input PMI) or components whose label mentions X.

Ablating input-side components destroys the model's ability to *process* X, causing massive collateral damage. Ablating output-side components specifically suppresses *production* of X.

Concrete example — quote suppression:
- Old labels (input/output confused): -86% P("), +57% dialogue PPL
- Output PMI search: -89% P("), +0.5% non-dialogue PPL

That's 5.6x less collateral damage with better suppression.

### Labels are lossy — always inspect before committing

Autointerp labels compress a rich prompt (token correlations, activation examples) into ~5 words. They routinely miss the most important information. Always call `inspect_component()` or look at the full prompt before ablating.

The dual-view autointerp strategy (`dual_view`) separates input and output function in labels, which eliminates the worst failure mode. But labels still lose nuance — components with similar labels can have very different causal roles (see attn_o:82 vs attn_o:208 below).

### Start with 1 component, then add more

Dose-response is non-linear. One component often captures most of the effect. Adding more has diminishing returns with growing collateral. For male pronouns: 1 component gets -84%, 3 gets -92%, 6 gets -96% but with 3x the PPL cost.

### Syntactic > semantic for editability

Syntactic/functional features (pronouns, punctuation, quotes) decompose into dedicated components and edit cleanly. Semantic topics (nature, food) are distributed across many components and resist clean ablation. This likely reflects VPD's masking objective rewarding sharp on/off firing patterns.

## Circuit Analysis

### Component polarity is arbitrary

A rank-1 matrix `u vᵀ` is identical to `(−u)(−v)ᵀ`, so the decomposition has no privileged sign convention. Empirically, 49% of components are "aligned" with their predicted tokens (positive activation → positive logit contribution) and 51% are "anti-aligned" (negative activation × negative cosine → positive logit contribution). Both work identically.

Don't interpret the sign of a cosine or activation in isolation. What matters is the *product* of activation × write-direction alignment, and how that product changes across contexts. A component with negative cosine to "he" that has negative activations in male contexts is *boosting* "he", not suppressing it.
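This sign-invariance is easy to verify directly. A minimal numpy sketch (toy vectors, not the actual decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(8)   # toy write direction
v = rng.standard_normal(8)   # toy read direction

# A rank-1 component u v^T is unchanged by flipping both factors:
W = np.outer(u, v)
W_flipped = np.outer(-u, -v)
assert np.allclose(W, W_flipped)

# What matters causally is the product activation x write-direction alignment.
# A "negative" activation paired with a "negative" cosine still boosts the token:
activation, cosine = -0.7, -0.4
logit_contribution = activation * cosine   # positive -> boosts the token
assert logit_contribution > 0
```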

### Activation sign carries the computation

While the overall polarity convention is arbitrary, the *context-dependent variation* in activation sign is meaningful. Components like `attn_o:82` use activation sign as a conditional switch: negative in male context → boosts male pronouns, positive in female context → boosts female pronouns. One rank-1 matrix implementing two-way conditional behavior via signed activations.

### Three circuit architectures exist

Not all circuits are the same:

1. **Geometric/residual**: Components aligned in weight space, communicate through the residual stream within a single forward pass. The pronoun circuit (attn→MLP chain) is this type. Identified by high cosine between one component's write direction and another's read direction.

2. **Parallel**: Independent components that each contribute to the same output via separate pathways. The quote circuit is this type. Low geometric coupling, each fires on punctuation independently.

3. **Token-mediated handoff**: Components communicate across positions through the discrete token stream. The ? → " circuit is this type: ? predictors produce ?, then " predictors fire on the ? token at the next position. Identified by one group having high output PMI for a token that the other group has high input PMI for.

### Similar labels ≠ similar causal roles

`attn_o:82` and `attn_o:208` are both labeled as "male pronoun" components. But `attn_o:208` is more critical for IOI coreference (6/15 accuracy when ablated) while `attn_o:82` is a general pronoun booster (9/15 when ablated). The label doesn't capture this distinction. Always test causally.

### Geometric alignment percentiles use the empirical population

When we say two components have "100th percentile alignment," that's computed over all ~1M actual component pairs in the same two layers, not against a random baseline. The population mean cosine is ~0.02-0.05 (close to random in high-dim space), so any cosine above ~0.15 is unusual and above ~0.3 is extreme.
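The percentile computation can be sketched as follows (toy sizes and random directions; the real population is all ~1M pairs between the two layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_a, n_b = 64, 200, 200          # toy sizes
A = rng.standard_normal((n_a, d))   # write directions, layer 1
B = rng.standard_normal((n_b, d))   # read directions, layer 2

A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
population = np.abs(A_n @ B_n.T).ravel()   # all pairwise |cosine|s

# Percentile of one candidate pair against the empirical population:
pair_cos = abs(float(A_n[0] @ B_n[3]))
percentile = 100.0 * (population < pair_cos).mean()

# Random cosines concentrate near 0 in high dimensions, which is why
# |cos| above ~0.15 is unusual and above ~0.3 is extreme:
assert population.mean() < 0.15
```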

## Tooling

### The `EditableModel` workflow

```python
em, tok = EditableModel.from_wandb("wandb:goodfire/spd/s-892f140b")

# Search (`harvest` and `interp` are the harvest/interpretation
# artifacts loaded separately for the same run)
matches = search_interpretations(harvest, interp, r"male pronoun")
pmi_hits = search_by_token_pmi(harvest, [he_id], side="output")

# Inspect
inspect_component(harvest, interp, "h.1.mlp.down_proj:798", tok)

# Edit
edit_fn = em.make_edit_fn({"h.1.mlp.down_proj:798": 0.0})
generate(edit_fn, tokens, tok)

# Measure
measure_kl(em, edit_fn, token_seqs)
measure_token_probs(em, edit_fn, token_seqs, {"male": [he_id, him_id]})

# Circuit analysis
em.component_alignment("h.1.attn.o_proj:82", "h.1.mlp.c_fc:144")
em.unembed_alignment("h.1.mlp.down_proj:798", tok)
em.get_component_activations(tokens, "h.1.attn.o_proj:82")

# Permanent edit
edited = em.without_components(["h.1.mlp.down_proj:798"])
```

### Use AppTokenizer, not raw HuggingFace

HF's `tokenizer.encode()` silently appends EOS, making the model treat every prompt as a complete document. `AppTokenizer` uses `add_special_tokens=False` and exposes `eos_token_id` as a typed property.

### Unbatched convention

All `EditableModel` methods and free functions (`generate`, `measure_kl`, `measure_token_probs`) use unbatched tensors: `[seq]` not `[1, seq]`. The batch dimension is handled internally. This eliminates the `[0]` indexing and `[...]` wrapping noise throughout notebook code.
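The convention amounts to a thin wrapper that adds the batch dimension on the way in and strips it on the way out. A hypothetical sketch (`with_batch_dim` and `fake_logits` are illustrative names, not part of the real API):

```python
import numpy as np

def with_batch_dim(fn):
    """Wrap a batched [1, seq] model function so callers pass plain [seq] arrays."""
    def wrapper(tokens, *args, **kwargs):
        out = fn(tokens[None, :], *args, **kwargs)  # [seq] -> [1, seq]
        return out[0]                               # strip batch dim on the way out
    return wrapper

@with_batch_dim
def fake_logits(batched_tokens):
    # Stand-in for a model forward: [1, seq] -> [1, seq, vocab]
    return np.zeros((batched_tokens.shape[0], batched_tokens.shape[1], 10))

logits = fake_logits(np.array([5, 2, 7]))   # caller passes [seq]
assert logits.shape == (3, 10)              # and gets [seq, vocab] back
```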

### Verifying the base model can do the task

Before trying to ablate a behavior, always verify the base model actually exhibits it. This 2-layer model can do IOI at 97% accuracy — surprisingly capable. But don't assume; test first with concrete prompts and measure P(correct) vs P(incorrect).
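A minimal sketch of that sanity check, with a toy stand-in model (`next_token_probs` and the toy lambda are hypothetical, assuming any `[seq] -> [seq, vocab]` forward function):

```python
import numpy as np

def next_token_probs(logits_fn, tokens):
    """Softmax over the final-position logits."""
    logits = logits_fn(tokens)[-1]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy stand-in model over a 5-token vocab that strongly predicts token 3:
toy = lambda toks: np.tile(np.array([0.0, 0.0, 0.0, 5.0, 0.0]), (len(toks), 1))

probs = next_token_probs(toy, np.array([1, 2]))
correct_id, incorrect_id = 3, 1
# Only proceed with ablation experiments if the margin is real:
assert probs[correct_id] > probs[incorrect_id]
```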
107 changes: 107 additions & 0 deletions docs/editing_session_2025-02-25.md
@@ -0,0 +1,107 @@
# Editing Session Notes — 2025-02-25

Run: s-17805b61 (4-layer Llama MLP+Attn, Pile, ~39K components)

## What we tried and what worked

### Token-level ablation (worked)
Searched graph-interp output labels for specific token types, ablated, measured. Semicolons (-88%), colons (-88%), question marks (-91%), open parens (-94%), exclamation (-49%), male pronouns (-47%), contrastive "but" (-49%). All with <3.7% PPL increase. This is the bread-and-butter of SPD editing — reliable, measurable, low cost.

### Circuit optimization (mixed)
`optimize_circuit` on "The king summoned his most trusted knight. He told him that" → " he". Found o_proj:361 as causally necessary with stochastic P=0.988. Traced the attention circuit: v_proj:717 reads person tokens → k_proj selects masculine entities → o_proj:361 outputs gender signal.

But on other prompts (soap→water, dog→tail, cat→is) stochastic performance was terrible (<0.13) despite good CI-masked (>0.97). Only the pronoun circuit held up. Takeaway: always check stochastic, and most predictions don't compress into sparse circuits on this model.

### find_components_by_examples (disappointing)
Finds shared infrastructure (bias components, general machinery), not the differentiating features. Tried contrastively (he-examples minus she-examples) — got a tiny diff of generic components, none gender-specific. The critical masculine pronoun component (o_proj:361) didn't appear because it's only needed in male contexts, not uniformly across all "he" predictions.

### Higher-level semantic ablation (mostly didn't work)
- Negative emotion: 2 components, barely visible effect
- Modal verbs: 11 components, no visible generation change
- Narrative speech verbs: 9 components, messy results
- Second-person "you": 1 component, no effect (model doesn't use "you" much on Pile)
- Lists: 9 components, lists still appeared

Root cause: this is a 4-layer Pile model. It generates degenerate/repetitive text, making qualitative comparison of long generations impossible. High-level semantic features are distributed across too many components for small ablations to have visible effects.

### Evaluative adjectives (marginal)
24 components labeled "evaluative adjective", -62% sum P over 15 evaluative words after copular verbs. But only 1.3x selectivity (copular vs non-copular) — suppresses evaluative words everywhere, not concept-selectively. And when restricted to prompts where evaluative words were actually predicted (not noise-level), effect drops to -42% ± 26%. The story felt mushy.

### Directional adverbs (the best result)
Single component h.3.mlp.down_proj:649. Found via `unembed_alignment` — write direction points at {back, home, down, off, forward, south}. Ablating it: -38% directional adverbs after movement verbs, -1.5% in non-directional contexts. 36pp gap. +1.0% PPL. Concept-selective, semantic category, single component, clean mechanism.

### "that" disambiguation (good)
3 components with input function "verbs of assertion and reporting." Suppresses "that" after formal verbs (-35%) but boosts it after informal verbs (+28%). Same token, opposite direction. The selectivity comes from the components' narrow input functions — they fire specifically in formal attribution contexts.

### Factual knowledge (didn't work)
Tried suppressing "Romeo and Juliet" association. `find_components_by_examples` found 16 generic "proper noun suffix" components. Ablating them: -60 to -98% P(Juliet) but +64% PPL — catastrophic collateral damage. The Juliet knowledge is distributed across generic name-completion machinery, not stored in dedicated components.

## Key findings

### Token-level, not concept-level (mostly)
The "but" and "he" ablations suppress the token uniformly across all contexts — contrastive and non-contrastive "but", gendered and generic "he". Non-contrastive uses are actually suppressed MORE. This is because ablation removes a rank-1 contribution everywhere, regardless of linguistic function.
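Why the suppression is context-blind falls out of the algebra: the ablated term is the same rank-1 matrix for every input. A toy numpy sketch (random factors standing in for the decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, C = 16, 16, 4
# Toy decomposition: weight W is a sum of C rank-1 components u_c v_c^T.
U = rng.standard_normal((C, d_out))
V = rng.standard_normal((C, d_in))
W = sum(np.outer(U[c], V[c]) for c in range(C))

# Ablating component 1 subtracts its rank-1 term from W — in every context:
W_edited = W - np.outer(U[1], V[1])

x = rng.standard_normal(d_in)        # any input, any "linguistic function"
delta = W @ x - W_edited @ x
# The removed contribution is always along u_1, scaled only by v_1 . x:
assert np.allclose(delta, (V[1] @ x) * U[1])
```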

### Exception: concept-selectivity from narrow input functions
The "that" and directional adverb results show concept-selectivity IS possible when the ablated component has a narrow input function. The "that" components fire specifically on "verbs of assertion" (formal register). The directional component fires specifically after movement verbs (via the c_fc:2506 fan-out). Broad input function → token-level edit. Narrow input function → concept-selective edit.

### The MLP "fan-out" — corrected understanding
Initial finding: c_fc:2506 detects movement verbs, fans out to multiple down-projections (directional, manner, degree, temporal, prepositional). But ablation testing revealed this is WRONG as a causal story:

- Ablating c_fc:2506 alone: directional **+2%** (no effect!), manner **-63%**, degree **-40%**
- Ablating ALL 7 non-bias upstream c_fc components: directional **+10%** (still no effect!)
- Ablating just down_proj:649 alone: directional **-38%**

The directional signal doesn't flow through any identifiable c_fc component. It's distributed across the full c_fc layer (3072 components), so removing a few doesn't matter. The concept-selectivity of down_proj:649 comes entirely from its **write direction** — it reads a broad, distributed MLP hidden state and projects onto the directional adverb subspace in vocab space.

Corrected framing: down_proj:649 is a **readout direction**, not a narrow channel in a pipeline. The "fan-out" structure (graph-interp edges from c_fc:2506 to multiple down_proj) describes attribution flow but not causal necessity. Manner and degree branches DO depend on c_fc:2506, but the directional branch doesn't.

Lesson: graph-interp edges show attribution (correlation in gradient flow), not causal necessity. Always validate with ablation.

### Bias components
14/39K components have mean CI > 0.5 and fire on >60% of tokens. They're structural biases necessary for everything. Show up in every circuit. Filter them out when searching for editing targets.

### Measurement matters
- Single-position P(token) can overstate the effect. Colons: -88% at single position, -18% in generation.
- Generation-level counts are more honest. Pronouns: -47% single position but -73% in generation (compounds). Question marks: -100% in generation (zero produced in 600 tokens).
- PPL on 15 hand-picked texts vs 25K training tokens: similar but the latter is more defensible.
- Random N-component ablation achieves ~0% target suppression at similar PPL cost. The edits are targeted, not just small enough to be harmless. But this is the expected outcome (N/39K is tiny), not an impressive finding.
- Concentrated damage: targeted edits have 4-13x higher KL on target domain vs unrelated text. Random has ~1x. Targeted doesn't spare unrelated text — it concentrates extra damage in the target domain.
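A generation-level count can be sketched like this (hypothetical helper; `generate_fn` and the edit handles are placeholders for the real generation API):

```python
def generation_suppression(generate_fn, prompts, target_ids, n_tokens=100):
    """Fractional change in target-token count, baseline vs edited generation."""
    def count(edit_fn):
        total = 0
        for p in prompts:
            out = generate_fn(p, edit_fn=edit_fn, max_new_tokens=n_tokens)
            total += sum(1 for t in out if t in target_ids)
        return total
    base, edited = count(None), count("ablate")  # placeholder edit handles
    return (edited - base) / max(base, 1)

# Toy generate_fn: baseline emits token 7 five times, edited emits none.
toy_gen = lambda p, edit_fn, max_new_tokens: [7] * 5 if edit_fn is None else [1] * 5
assert generation_suppression(toy_gen, ["x"], {7}) == -1.0  # -100%, like the "?" result
```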

## Tool effectiveness

### Graph-interp label search
Primary discovery method for 5/7 token-level edits and the directional adverb result. Fast, broad coverage. Most effective when searching for specific output patterns. The separate input/output labels are crucial — output tells you what the component produces, input tells you when it fires.

### unembed_alignment
How we found the directional adverb component — the write direction formed a tight semantic cluster in vocab space. Also useful for understanding the MLP fan-out (applying it to sibling components). Underused tool — should be a standard part of the exploration workflow.
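The core of the tool is a cosine sweep of the write direction against the unembedding rows. A hypothetical numpy sketch (assuming `W_U` has shape `[vocab, d_model]`; toy data, not the real tool):

```python
import numpy as np

def unembed_alignment(write_dir, W_U, k=5):
    """Cosine of a component's write direction against each unembedding row."""
    w = write_dir / np.linalg.norm(write_dir)
    rows = W_U / np.linalg.norm(W_U, axis=1, keepdims=True)
    cos = rows @ w
    top = np.argsort(-cos)[:k]
    return top, cos[top]

# Toy vocab of 6 tokens; token 2's row is built to align with the write direction.
rng = np.random.default_rng(0)
W_U = rng.standard_normal((6, 16))
write = W_U[2] + 0.1 * rng.standard_normal(16)

top, cosines = unembed_alignment(write, W_U, k=3)
assert top[0] == 2   # a tight cluster at the top signals a semantic readout
```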

### Graph-interp edges
How we traced the MLP fan-out: upstream from down_proj:649 to c_fc:2506, then downstream from c_fc:2506 to siblings. Also used for the pronoun circuit analysis. These are dataset-level attributions (aggregated), not prompt-specific.

### optimize_circuit
Good for prompt-specific causal analysis when it works (pronoun circuit, stoch P=0.988). But most behaviors don't compress into sparse circuits on this model (stoch P < 0.13). Expensive (~15s per prompt). Use for validation/mechanistic understanding, not for search.

### find_components_by_examples
Disappointing for editing purposes. Finds shared infrastructure and bias components, not the differentiating features that matter for targeted editing. The contrastive approach (run on A, run on B, diff) produced noise. Might work better with more examples or on a better-decomposed model.

### inspect_component
Underused. Should have looked at activation examples more systematically earlier. The labels are lossy summaries — the actual examples show what the component really does.

### PMI search
Good for rare/specific tokens (pronouns, question marks). Bad for common tokens (periods, commas, "the") where PMI is noisy. Graph-interp labels are better for common tokens.
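The noise issue for common tokens is visible in the PMI formula itself: PMI(token; fired) = log p(token | fired) − log p(token), so rare tokens can move the ratio sharply while common ones barely budge. A toy sketch with made-up counts (real harvests aggregate over the dataset):

```python
import numpy as np

def output_pmi(counts_when_fired, counts_overall):
    """PMI(token; component fires) = log p(tok|fired) - log p(tok)."""
    p_fired = counts_when_fired / counts_when_fired.sum()
    p_all = counts_overall / counts_overall.sum()
    return np.log(p_fired + 1e-12) - np.log(p_all + 1e-12)

# Toy vocab: [".", ",", "he", "?"] — "?" is rare overall but dominant when fired.
overall = np.array([5000, 4000, 900, 100])
fired = np.array([10, 10, 5, 75])
pmi = output_pmi(fired, overall)
assert pmi.argmax() == 3   # rare "?" gets a sharp PMI signal
assert pmi[0] < 0          # common "." stays near noise level
```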

## What I'd do differently next time

1. Start with `unembed_alignment` on random components to find ones with coherent semantic write directions. This found our best result.
2. Use graph-interp edges to trace fan-out patterns in MLPs. The up→down decomposition is a systematic structure, not a one-off.
3. Don't waste time on generation-level evaluation for this model. It generates degenerate text. Stick to P(token) measurements and be honest about their limitations.
4. For concept-selectivity tests, the key is finding components with NARROW input functions (check the graph-interp input label). Broad input → token-level edit. Narrow input → concept-selective.
5. Skip `find_components_by_examples` for contrastive features. It finds the wrong thing.
6. Always measure on prompts where the target token is actually in the baseline top-5. Measuring suppression of noise-level probabilities is meaningless.

## Open questions

- Is the MLP fan-out pattern common? How many MLPs decompose cleanly into semantically distinct up→down pathways?
- Can we find concept-selective attention edits (not just MLP)? The pronoun component is in attention but we didn't test its concept-selectivity properly.
- Would a larger/better model show cleaner high-level edits? The 4-layer Pile model may just be too small for semantic editing.
- Can permanent weight editing (rank-1 subtraction) reproduce all these results? We only validated it for the pronoun case.