
Add topological interpretation module#407

Open
ocg-goodfire wants to merge 65 commits into dev from feature/topological-interp

Conversation

@ocg-goodfire
Collaborator

Description

New spd/topological_interp/ module for context-aware component labeling using network graph structure. Unlike standard autointerp (one-shot per component), this uses dataset attributions to provide neighbor context in each component's prompt.

Three-phase pipeline:

  1. Output pass (late→early): "What does this component DO?" — each prompt includes top-K downstream neighbors by attribution, with their labels
  2. Input pass (early→late): "What TRIGGERS this component?" — each prompt includes top-K upstream neighbors + co-firing components (Jaccard/PMI)
  3. Unification: synthesizes output + input labels into a single unified label

The output and input passes are fully independent of each other (each runs serially across layers, but there is no dependency between the two passes); the unification phase then combines their labels.
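The phase structure above can be sketched as follows. This is a hypothetical outline, not the module's real API: `run_pass`, the label strings, and the per-layer granularity are all illustrative stand-ins (the real passes label individual components and call an LLM).

```python
import asyncio

async def run_pass(layers: list[str], direction: str) -> dict[str, str]:
    """Label serially across layers: each layer may see labels produced
    earlier in the same pass, mirroring the layer-serial constraint."""
    labels: dict[str, str] = {}
    for layer in layers:
        labels[layer] = f"{direction}-label({layer})"  # stand-in for an LLM call
    return labels

async def run_pipeline(layers: list[str]) -> dict[str, str]:
    # Output pass walks late->early, input pass early->late; they share no
    # state, so the two passes can run concurrently.
    output_labels, input_labels = await asyncio.gather(
        run_pass(list(reversed(layers)), "output"),
        run_pass(layers, "input"),
    )
    # Unification combines the two perspectives per key.
    return {
        k: f"unified({output_labels[k]}, {input_labels[k]})" for k in layers
    }

labels = asyncio.run(run_pipeline(["h.0.mlp", "h.1.mlp"]))
```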

Also includes:

  • Extraction of shared prompt helpers from dual_view.py into spd/autointerp/prompt_helpers.py (public API, used by both autointerp and topological interp)
  • Layer ordering via topology module's CanonicalWeight system (correct for all model architectures)
  • spd-topological-interp CLI entry point
  • SQLite DB with WAL mode storing output/input/unified labels + directed prompt edge graph
  • Resume support per-phase via completed key sets
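The per-phase resume mechanism can be sketched like this. Table and column names are assumptions for illustration, not the module's actual schema: each phase reads back the set of already-completed component keys and filters them out of its work list.

```python
import sqlite3

def completed_keys(conn: sqlite3.Connection, phase: str) -> set[str]:
    """Keys already labeled by this phase in a previous (interrupted) run."""
    rows = conn.execute(
        "SELECT component_key FROM labels WHERE phase = ?", (phase,)
    )
    return {key for (key,) in rows}

# Demo fixture: one component already labeled by the output pass.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (phase TEXT, component_key TEXT, label TEXT)")
conn.execute("INSERT INTO labels VALUES ('output', 'h.0.mlp:3', 'adds newline')")

todo = ["h.0.mlp:3", "h.0.mlp:7"]
done = completed_keys(conn, "output")
remaining = [k for k in todo if k not in done]  # only the unlabeled component
```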

Motivation and Context

Current autointerp labels each component independently. This misses the graph structure — components influence each other through the network, and knowing what upstream/downstream components do provides crucial context for interpretation. Inspired by Curt Tigges' topological labeling approach (.circuits-ref/).

How Has This Been Tested?

  • basedpyright: 0 errors across all new + modified files
  • ruff: lint + format clean
  • All 442 existing tests pass (0 regressions from dual_view.py refactor)
  • Ordering logic verified with unit-style assertions across attn/mlp/glu layer types
  • spd-topological-interp --help works
  • All module imports verified end-to-end

Does this PR introduce a breaking change?

No. dual_view.py functions are re-exported from the same module (now importing from prompt_helpers.py). All existing autointerp functionality is preserved.

🤖 Generated with Claude Code

claude and others added 30 commits January 30, 2026 20:40
Implements a SLURM-based system for launching parallel Claude Code agents
that investigate behaviors in SPD model decompositions.

Key components:
- spd-swarm CLI: Submits SLURM array job for N agents
- Each agent starts isolated app backend (unique port, separate database)
- Detailed system prompt guides agents through investigation methodology
- Findings written to append-only JSONL files (events.jsonl, explanations.jsonl)

New files:
- spd/agent_swarm/schemas.py: BehaviorExplanation, SwarmEvent schemas
- spd/agent_swarm/agent_prompt.py: Detailed API and methodology instructions
- spd/agent_swarm/scripts/run_slurm_cli.py: CLI entry point
- spd/agent_swarm/scripts/run_slurm.py: SLURM submission logic
- spd/agent_swarm/scripts/run_agent.py: Worker script for each job

Also adds SPD_APP_DB_PATH env var support for database isolation.

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6
Previously used communicate() which buffers all output until process
completes. Now streams directly to claude_output.txt so you can monitor
agent activity with: tail -f <task_dir>/claude_output.txt

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6
- Switch to --output-format stream-json for structured JSONL output
- Add --max-turns parameter (default 50) to prevent runaway agents
- Output file changed from claude_output.txt to claude_output.jsonl
- Updated monitoring commands in logs to use jq for parsing

Monitor with: tail -f task_*/claude_output.jsonl | jq -r '.result // empty'

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6
Claude Code requires --verbose when using --output-format=stream-json
with --print mode.

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6
When multiple GPU-intensive requests are made concurrently (graph
computation, optimization, intervention), the backend would hang.
This adds a lock that returns HTTP 503 immediately if a GPU operation
is already in progress, allowing clients to retry later.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
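The fail-fast GPU guard described in this commit can be sketched with a non-blocking lock. This is a minimal stand-in, assuming the real backend maps the rejection to an HTTP 503; here a plain exception plays that role.

```python
import threading

class GPUBusy(Exception):
    """Stand-in for an HTTP 503 'retry later' response."""

gpu_lock = threading.Lock()

def with_gpu(fn, *args):
    # acquire(blocking=False) returns immediately instead of queueing,
    # so a second concurrent GPU request is rejected rather than hanging.
    if not gpu_lock.acquire(blocking=False):
        raise GPUBusy("GPU operation already in progress")
    try:
        return fn(*args)
    finally:
        gpu_lock.release()
```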
Agents now create and update a research_log.md file with readable
progress updates. This makes it easy to follow what the agent is
doing and discovering without parsing JSONL files.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Show YYYY-MM-DD HH:MM:SS format and provide tip for getting timestamps.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The JSON-RPC 2.0 spec requires that the "error" field must NOT be present
when there is no error. Our MCPResponse was serializing "error": null in
all success responses, causing Claude Code to reject the MCP connection
with "Failed to connect" status.

Added exclude_none=True to all model_dump() calls so null fields are
omitted from the serialized response.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
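The shape of the fix can be shown without Pydantic: dropping `None`-valued fields before serializing has the same effect as `model_dump(exclude_none=True)`, so a success response carries no `"error"` key at all. The function below is an illustrative stand-in, not the project's `MCPResponse`.

```python
import json

def serialize_response(result=None, error=None, id=None) -> str:
    payload = {"jsonrpc": "2.0", "result": result, "error": error, "id": id}
    # Omit null fields, as JSON-RPC 2.0 forbids "error" on success
    # (and "result" on failure).
    return json.dumps({k: v for k, v in payload.items() if v is not None})
```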
The backend subprocess had stdout=subprocess.PIPE but the pipe was
never drained. When the pipe buffer filled (~64KB), tqdm.write() in
the optimization loop would block forever.

Fix: Write backend logs to task_dir/backend.log instead of piping.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- SPD_SWARM_TASK_DIR: backend derives db_path, events_path from this
- SPD_SWARM_SUGGESTIONS_PATH: global suggestions file

Removed:
- SPD_APP_DB_PATH, SPD_MCP_EVENTS_PATH, SPD_MCP_TASK_DIR (consolidated)
- Unused AgentOutput schema

Co-Authored-By: Claude Opus 4.5 <[email protected]>
# Conflicts:
#	CLAUDE.md
#	pyproject.toml
#	spd/app/CLAUDE.md
#	spd/app/backend/routers/__init__.py
#	spd/app/backend/routers/intervention.py
#	spd/app/backend/server.py
#	spd/app/backend/state.py
#	spd/app/frontend/src/components/RunView.svelte
#	spd/app/frontend/src/lib/api/index.ts
Reshapes the swarm module into a focused investigation tool where a
researcher poses a specific question and a single agent investigates it.

Key changes:
- Rename spd/agent_swarm/ → spd/investigate/, CLI spd-swarm → spd-investigate
- Single SLURM job instead of array, flat output dir structure
- Agent prompt accepts researcher's question + injects model architecture info
- 5 new MCP tools: probe_component, get_component_activation_examples,
  get_component_attributions, get_model_info, get_attribution_strength
- MCP dispatch refactored from if/elif chain to lookup tables
- Investigations scoped to loaded run via DepLoadedRun
- Frontend: refresh button, @file prompt input, launch-from-UI flow
- Graph artifacts expand to natural size, research log flows with page

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Normalize wandb_path to canonical form (entity/project/run_id) when
storing investigation metadata and when filtering. Handles old
investigations that stored the "runs/" form.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…tigations UX

- Run picker: replace hardcoded modelName with fetched arch info
  (e.g. "SS LlamaSimple 4L d512"), add dataset_short to pretrain_info
- Artifact graphs: use shared graphLayout.ts for canonical layer names,
  fixing topological grouping (q/k/v rows, gate/up rows)
- Investigations: add launch-from-UI, @file prompt support, refresh button,
  remove research log scroll trap, scope to loaded run
- Remove layerAliasing.ts — backend now handles concrete→canonical translation
- Drop modelName from registry entries

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- get_component_info: translate canonical → concrete for harvest/interp
  lookups, canonicalize correlated component keys in response
- save_graph_artifact: use 'embed' not 'wte' for pseudo-nodes
- get_component_activation_examples: return canonical keys
- Tool descriptions: update examples to canonical format
- ArtifactGraph: prefetch component data on mount for tooltip cards
- Filter both 'wte' and 'embed' as non-interventable nodes
- Remove unused CSS selector in StagedNodesPanel

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Persists alongside other artifacts instead of being tied to a repo
checkout. Keyed by run, so multiple runs share the DB safely.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add --permission-mode dontAsk and --allowedTools mcp__spd__* to
  Claude Code launch, preventing use of Bash/Read/Write/Edit and
  blocking inheritance from ~/.claude/settings.json
- Revert DB path back to .data/app/prompt_attr.db

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add --setting-sources "" to skip all user/project settings
  (no plugins, no inherited model, no alwaysThinkingEnabled)
- Add --model opus explicitly since global settings are skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Three-phase context-aware component labeling using network graph structure:
1. Output pass (late→early): labels what each component does, with downstream neighbor context
2. Input pass (early→late): labels what triggers each component, with upstream + co-firing context
3. Unification: synthesizes output + input labels into unified label

Output and input passes are independent (both layer-serial, but no cross-dependency).

Also extracts shared prompt helpers from dual_view.py into autointerp/prompt_helpers.py,
and uses the topology module's CanonicalWeight system for correct layer ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
get_cofiring_neighbors no longer reads from the DB — it returns
pure co-firing stats (Jaccard/PMI) with no labels. This ensures
the input and output passes have zero logical coupling.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
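The Jaccard and PMI statistics can be computed from firing-position sets alone, with no label lookup. The input representation here (sets of token positions at which each component fires) is an assumption for illustration; the real module's data format is not shown in this PR.

```python
import math

def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap of firing positions: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pmi(a: set[int], b: set[int], n_positions: int) -> float:
    """Pointwise mutual information of co-firing: log p(a,b) / (p(a) p(b))."""
    p_a = len(a) / n_positions
    p_b = len(b) / n_positions
    p_ab = len(a & b) / n_positions
    if p_ab == 0:
        return float("-inf")  # never co-fire
    return math.log(p_ab / (p_a * p_b))
```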
- neighbors.py → graph_context.py
- NeighborContext → RelatedComponent
- get_downstream_neighbors → get_downstream_components
- get_upstream_neighbors → get_upstream_components
- get_cofiring_neighbors → get_cofiring_components
- top_k_neighbors → top_k_attributed
- DB columns: neighbor_key → related_key, neighbor_label → related_label

"Neighbours" implied same-layer adjacency; "related components" better
conveys the attribution-graph and co-firing relationships.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
If a component failed its output or input pass (e.g. transient API error),
the unification pass now logs a warning and skips it instead of asserting
and silently deadlocking the async pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each directional pass now maintains labels_so_far: dict[str, LabelResult]
as the scan accumulator. Related components look up labels from this dict
instead of querying the DB. On resume the dict is seeded from the DB, and the
DB is written to for durability, but it is never read mid-scan.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
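The scan-accumulator pattern reads as follows in miniature. Everything here is illustrative (`label_fn`, the nested layer lists, the `written` log standing in for DB writes): the point is that `labels_so_far` is the only thing read mid-scan, while the DB is seed-on-resume and write-through only.

```python
def run_directional_pass(layers, label_fn, db_rows, written):
    labels_so_far: dict[str, str] = dict(db_rows)  # seeded from DB on resume
    for layer in layers:
        for key in layer:
            if key in labels_so_far:
                continue  # labeled on a previous run; skip
            # Related-component context comes from the in-memory dict,
            # never from a DB query mid-scan.
            label = label_fn(key, dict(labels_so_far))
            labels_so_far[key] = label
            written.append((key, label))  # durable write-through
    return labels_so_far
```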
ComponentModel loading exceeds 160GB CPU-only allocation. Requesting a
GPU gives 24 CPUs + 240GB RAM via DefCpuPerGPU/DefMemPerCPU defaults.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ge (#408)

- Update 7 YAML files: bare `lr:` → `lr_schedule:` block in PGD optimizer configs
- Update 5 YAML files: deprecated scope names (`batch_invariant` → `repeat_across_batch`,
  `unique_per_batch_per_token` → `per_batch_per_position`, `n_masks` → `n_sources`)
- Remove redundant `coeff: null` from eval_metric_configs across 16 YAML files
- Fix misleading error message in persistent_pgd.py (said "use fewer ranks" but
  fewer ranks makes the problem worse)

Co-authored-by: Claude Opus 4.6 <[email protected]>
- Remove PRAGMA journal_mode=WAL from all 3 DB classes (harvest, autointerp,
  topological_interp). WAL requires POSIX file locking which breaks on NFS.
- Scoring scripts (detection, fuzzing) now accept a separate writable InterpDB
  instead of writing through the readonly InterpRepo.
- Intruder eval opens harvest readonly + separate writable HarvestDB for scores.
- Fix try/except → try/finally in interpret.py for proper connection cleanup.
- Bump autointerp/eval/intruder jobs to 2 GPUs for memory headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
ocg-goodfire and others added 30 commits February 24, 2026 10:22
…r vocab_size

tokenizer.vocab_size (50254) < len(tokenizer) (50277) due to added tokens.
Token IDs >= vocab_size cause scatter_add_ index out of bounds in the embed
accumulator. Use embedding_module.num_embeddings which matches the actual
token ID space.

Also add Path type annotation to test tmp_path params.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Merge doesn't need config_json, worker does. Separate entrypoints avoid the
issue where Fire requires config_json for both paths.

Cherry-picked from feature/faster-dataset-attributions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…natures

attr_abs now computed by backpropping through target_acts.abs() instead of
flipping by source activation sign. Requires 2 backward passes per target
component but is mathematically correct for cross-position (attention) interactions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
3 metrics × dict-of-dicts makes rank files ~15GB each. Merge loads all
in double precision, needs much more than the default 10GB.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Backend: implement storage query methods with AttrMetric parameter, bulk
endpoint returns all 3 metrics (attr, attr_abs, mean_squared_attr), other
endpoints accept optional ?metric= query param.

Frontend: 3-way radio toggle (Signed / Abs Target / RMS) in
DatasetAttributionsSection. All metrics fetched at once, selection is local
state that switches which ComponentAttributions to display.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
parse_wandb_run_path now accepts "s-xxxxxxxx" and expands to goodfire/spd.
Handled in backend so it works for CLI, app, and any other consumer.
Frontend placeholder updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
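The normalization can be sketched as below. The exact set of accepted input forms is an assumption: a bare `s-xxxxxxxx` ID expands to the goodfire/spd project, the legacy `entity/project/runs/run_id` form collapses to the canonical `entity/project/run_id`, and already-canonical paths pass through.

```python
def normalize_wandb_path(path: str) -> str:
    parts = [p for p in path.split("/") if p]
    if len(parts) == 1:  # bare run ID, e.g. "s-17805b61"
        return f"goodfire/spd/{parts[0]}"
    if len(parts) == 4 and parts[2] == "runs":  # legacy "runs/" form
        return f"{parts[0]}/{parts[1]}/{parts[3]}"
    return "/".join(parts)  # already canonical
```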
Old subruns (da-timing-*, da-overnight-*) sort after da-YYYYMMDD_* and have
no dataset_attributions.pt. The old code only checked the last candidate
and returned None. Now iterates in reverse until finding one with the file.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
getTokenText did .find() over the full 50K vocab array for every
embed/output pill on each render. Build a Map<id, string> once via
$derived, making lookups O(1).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ntend

Backend resolves embed/output token strings via tokenizer.decode() and
includes token_str in DatasetAttributionEntry. Frontend uses it directly
instead of scanning a 50K vocab array per pill.

Removes tokens/outputProbs passthrough from EdgeAttributionGrid/List —
token strings now flow through EdgeAttribution.tokenStr from the source.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
attr_abs and mean_squared_attr are non-negative by definition, so the
negative top-k is meaningless. Only show negative column for signed attr.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
mean_squared_attr only has positive_sources/positive_targets.
Hardcoded three paths in DatasetAttributionsSection matching each type.

Slow benchmark result: per-element loops >14x slower than scatter_add_
(>60 min vs 4.3 min per batch on s-17805b61, timed out at 1hr).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…trics

Rewrite docs to reflect dict-of-dicts storage, canonical naming, split
entrypoints, 3 metrics, and updated query method signatures.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ed/output attributions

- Add attr_metric config (default: attr_abs) for selecting attribution metric
- Concrete↔canonical key translation between harvest and attribution storage
- Include embed/output token attributions in prompts (w_unembed support)
- Lazy component loading: use get_summary() upfront, get_component() per-prompt
  (eliminates 54GB harvest DB bulk load at startup)
- Add granular logging for each loading step
- Add export_html.py script for static site data export
- Unify get_downstream/get_upstream into single get_related_components with
  injected GetAttributed callable (from stash recovery)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…oneous metadata

Output pass only sees: output tokens, says examples, downstream components.
Input pass only sees: input tokens, fires-on examples, upstream components.
Remove dataset description and model class from prompts (erroneous).
Reduce max_examples default from 30 to 10.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The unification step now sees fires-on and says examples alongside the
two-perspective labels, giving it grounding to make better unified labels.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…r unlabeled

- Replace raw component keys (h.3.mlp.c_fc:42) with human-readable descriptions
  (layer 3 MLP up-projection, component 42) using human_layer_desc
- Normalize attributions: strongest = 1.0, rest relative (+0.85, -0.42, etc.)
- Filter unlabeled components from related table (API failures), keep token entries
- Remove dead _FORBIDDEN constant and inconsistent "lowercase only" instruction
- max_examples 10 → 20 (user edit)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
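The attribution normalization described above (strongest = 1.0, the rest signed fractions of it) is a simple rescale by the largest magnitude; a sketch, with the two-decimal rounding assumed for display:

```python
def normalize_attributions(values: list[float]) -> list[float]:
    """Scale so the strongest entry (by magnitude) becomes 1.0."""
    strongest = max((abs(v) for v in values), default=0.0)
    if strongest == 0.0:
        return [0.0 for _ in values]  # all-zero attributions stay zero
    return [round(v / strongest, 2) for v in values]
```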
Storage now has four structurally distinct edge types instead of a
uniform dict-of-dicts: regular (component→component), embed
(embed→component), unembed (component→unembed in residual space),
and embed_unembed (embed→unembed). w_unembed is stored alongside
attribution data so consumers never need to provide the projection
matrix. Dropped mean_squared_attr metric and has_source/has_target
methods — query methods return [] for nonexistent components.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Rename spd/topological_interp/ → spd/graph_interp/ with all classes,
  functions, CLI entry points, and references updated
- Remove redundant `direction` field from PromptEdge (determined by pass_name)
- Save prompt edges during interpretation (was defined but never called)
- Add get_all_prompt_edges() to DB and repo
- New backend router /api/graph_interp/ with labels, detail, and graph endpoints
- Add graph_interp to RunState, LoadedRun, and DataSourcesResponse
- Frontend: GraphInterpBadge component (side-by-side with autointerp in component card)
- Frontend: Model Graph tab with SVG DAG visualization, filtering, zoom/pan
- Frontend: Graph interp section in Data Sources tab

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Storage now holds raw (unnormalized) accumulator sums plus normalization
metadata (CI sums, component activation RMS, logit RMS per token).
Normalization happens at query time in get_top_sources/get_top_targets.

This fixes: exact merge (element-wise addition instead of approximate
weighted average), proper output-target normalization via logit RMS,
no NaN from dead components (clamp at query time), and shallow-copy
bug where embed was removed from sources_by_target.

All attribution fields are now private — query methods are the only
public interface.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
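The raw-sums-plus-query-time-normalization design can be sketched as below. This is a simplification: the normalization here is a plain mean with a clamp against dead components, whereas the real storage keeps richer metadata (CI sums, activation RMS, logit RMS per token). The key properties shown are that merging ranks is exact element-wise addition of raw sums, and normalization only happens in the query path.

```python
from collections import defaultdict

class AttrAccumulator:
    def __init__(self) -> None:
        self._sums: dict[str, float] = defaultdict(float)    # raw, unnormalized
        self._counts: dict[str, float] = defaultdict(float)  # normalization metadata

    def add(self, key: str, attr: float, weight: float = 1.0) -> None:
        self._sums[key] += attr
        self._counts[key] += weight

    def merge(self, other: "AttrAccumulator") -> None:
        # Exact merge: element-wise addition of raw sums, not an approximate
        # weighted average of already-normalized values.
        for key, s in other._sums.items():
            self._sums[key] += s
        for key, c in other._counts.items():
            self._counts[key] += c

    def get(self, key: str) -> float:
        # Normalize at query time; the clamp avoids NaN for dead components.
        return self._sums[key] / max(self._counts[key], 1e-12)
```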
…ueries

- Remove n_batches_processed from all consumers (storage, routers, frontend,
  graph-interp, tests) after field was dropped from storage
- Add .detach().cpu() in save to prevent requires_grad leaking to disk
- Return empty list for get_top_targets("output:*") since output can't be a source
- Fix shallow copy bug in harvester: deep copy source lists to prevent
  embed being removed from sources_by_target
- Fix embed accumulator shape: use num_embeddings not d_model
- Restore fire.Fire in run_worker.py (was accidentally replaced with hardcoded call)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…-load env var

- Add /api/graph_interp/detail endpoint returning labels + prompt edges per component
- GraphInterpBadge: lazy-fetch detail on expand, two-column layout (left=input, right=output)
- Resolve embed/output token strings server-side via AppTokenizer
- Extract shared isTokenNode/formatComponentKey utility (componentKeys.ts)
- Fix _concrete_to_canonical_key for embed/output pseudo-layers
- Add SPD_APP_DEFAULT_RUN env var to auto-load a run on app startup

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
GraphInterpBadge was fetching its own data with a plain `fetched` flag
that didn't reset when props changed. This caused stale data when the
component was reused (e.g. clicking through components in Activation
Contexts tab). Now the fetch lives in useComponentData/ExpectCached
alongside all other component data, and GraphInterpBadge is a pure
display component.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
