Feature/model editing by ocg-goodfire · Pull Request #417 · goodfire-ai/spd

ocg-goodfire · 2026-02-25T20:56:44Z

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Does this PR introduce a breaking change?

Implements a SLURM-based system for launching parallel Claude Code agents that investigate behaviors in SPD model decompositions. Key components:
- spd-swarm CLI: submits a SLURM array job for N agents
- Each agent starts an isolated app backend (unique port, separate database)
- A detailed system prompt guides agents through the investigation methodology
- Findings are written to append-only JSONL files (events.jsonl, explanations.jsonl)

New files:
- spd/agent_swarm/schemas.py: BehaviorExplanation, SwarmEvent schemas
- spd/agent_swarm/agent_prompt.py: detailed API and methodology instructions
- spd/agent_swarm/scripts/run_slurm_cli.py: CLI entry point
- spd/agent_swarm/scripts/run_slurm.py: SLURM submission logic
- spd/agent_swarm/scripts/run_agent.py: worker script for each job

Also adds SPD_APP_DB_PATH env var support for database isolation.

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

Previously used communicate(), which buffers all output until the process completes. Now streams directly to claude_output.txt, so you can monitor agent activity with:
tail -f <task_dir>/claude_output.txt

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

- Switch to --output-format stream-json for structured JSONL output
- Add --max-turns parameter (default 50) to prevent runaway agents
- Output file changed from claude_output.txt to claude_output.jsonl
- Updated monitoring commands in logs to use jq for parsing

Monitor with:
tail -f task_*/claude_output.jsonl | jq -r '.result // empty'

https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6
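The jq monitoring filter can also be mirrored in Python, for example to post-process a finished run. This is a sketch: it assumes, as the jq filter implies, that each line of claude_output.jsonl is one JSON object and that only some events carry a `result` key. `extract_results` is a hypothetical helper, not part of the PR.

```python
import json
from typing import Iterable, Iterator


def extract_results(lines: Iterable[str]) -> Iterator[str]:
    """Yield each event's `result`, mirroring jq -r '.result // empty'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # last line may be half-written while the agent is still running
        if not isinstance(event, dict):
            continue
        result = event.get("result")
        # jq's `// empty` drops null and false values
        if result is None or result is False:
            continue
        # -r prints strings raw and everything else as JSON
        yield result if isinstance(result, str) else json.dumps(result)
```

Tolerating `JSONDecodeError` matters when reading a file the agent is still appending to, since the final line may be incomplete.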

Claude Code requires --verbose when using --output-format=stream-json with --print mode. https://claude.ai/code/session_01UMpYFZ3A98vsPkqoq6zvT6

When multiple GPU-intensive requests are made concurrently (graph computation, optimization, intervention), the backend would hang. This adds a lock that returns HTTP 503 immediately if a GPU operation is already in progress, allowing clients to retry later. Co-Authored-By: Claude Opus 4.5 <[email protected]>
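A minimal sketch of the fail-fast lock, assuming a single backend process. `GpuBusyError` and `gpu_operation` are hypothetical names; mapping the error to an HTTP 503 response is left to whichever web framework the backend uses.

```python
import threading
from contextlib import contextmanager
from typing import Iterator

_gpu_lock = threading.Lock()


class GpuBusyError(RuntimeError):
    """Raised when a GPU-intensive request is already running; map to HTTP 503."""

    status_code = 503


@contextmanager
def gpu_operation() -> Iterator[None]:
    # blocking=False returns immediately instead of queueing behind the running job,
    # so a second request fails fast rather than hanging the backend.
    if not _gpu_lock.acquire(blocking=False):
        raise GpuBusyError("GPU operation already in progress; retry later")
    try:
        yield
    finally:
        _gpu_lock.release()
```

Failing fast keeps the retry decision with the client rather than holding open requests on the server.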

Agents now create and update a research_log.md file with readable progress updates. This makes it easy to follow what the agent is doing and discovering without parsing JSONL files. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Show YYYY-MM-DD HH:MM:SS format and provide tip for getting timestamps. Co-Authored-By: Claude Opus 4.5 <[email protected]>

…rm-lIpTu

The JSON-RPC 2.0 spec requires that the "error" field must NOT be present when there is no error. Our MCPResponse was serializing "error": null in all success responses, causing Claude Code to reject the MCP connection with "Failed to connect" status. Added exclude_none=True to all model_dump() calls so null fields are omitted from the serialized response. Co-Authored-By: Claude Opus 4.5 <[email protected]>
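The fix relies on Pydantic's `model_dump(exclude_none=True)`. The stdlib sketch below mirrors that behavior with a hypothetical dataclass stand-in for `MCPResponse` (its fields are assumed), so the JSON-RPC 2.0 rule that a success response has `result` and no `error` member can be checked without Pydantic.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional, Union


@dataclass
class MCPResponse:
    """Stdlib stand-in for the PR's Pydantic MCPResponse (fields assumed)."""

    id: Union[int, str, None]
    result: Optional[Any] = None
    error: Optional[dict[str, Any]] = None
    jsonrpc: str = "2.0"


def serialize(response: MCPResponse) -> str:
    # Like model_dump(exclude_none=True) for top-level fields: omit None-valued keys,
    # so a success response never carries "error": null.
    payload = {k: v for k, v in asdict(response).items() if v is not None}
    return json.dumps(payload)
```

Two caveats: Pydantic's `exclude_none` also applies to nested models, which this sketch does not; and with either approach a success whose result is itself null loses its `result` key, which JSON-RPC 2.0 requires on success.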

…ings together

The backend subprocess had stdout=subprocess.PIPE but the pipe was never drained. When the pipe buffer filled (~64KB), tqdm.write() in the optimization loop would block forever. Fix: Write backend logs to task_dir/backend.log instead of piping. Co-Authored-By: Claude Opus 4.5 <[email protected]>
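A minimal sketch of the fix, using a hypothetical `start_backend` helper (the command it launches is up to the caller). The usage block writes about 200KB from a child process: with an undrained `subprocess.PIPE` that would block, but with a log file it completes.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def start_backend(cmd: list[str], task_dir: Path) -> subprocess.Popen:
    """Launch a long-running child with stdout/stderr going to task_dir/backend.log."""
    # With stdout=subprocess.PIPE and no reader, the child blocks once the OS pipe
    # buffer (~64KB on Linux) fills. A regular file has no such limit.
    with open(task_dir / "backend.log", "wb") as log_file:
        # Popen duplicates the descriptor into the child, so closing ours here is safe.
        return subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        proc = start_backend([sys.executable, "-c", "print('x' * 200_000)"], Path(d))
        print("exit code:", proc.wait(timeout=30))
```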

- SPD_SWARM_TASK_DIR: backend derives db_path, events_path from this
- SPD_SWARM_SUGGESTIONS_PATH: global suggestions file

Removed:
- SPD_APP_DB_PATH, SPD_MCP_EVENTS_PATH, SPD_MCP_TASK_DIR (consolidated)
- Unused AgentOutput schema

Co-Authored-By: Claude Opus 4.5 <[email protected]>

# Conflicts:
# CLAUDE.md
# pyproject.toml
# spd/app/CLAUDE.md
# spd/app/backend/routers/__init__.py
# spd/app/backend/routers/intervention.py
# spd/app/backend/server.py
# spd/app/backend/state.py
# spd/app/frontend/src/components/RunView.svelte
# spd/app/frontend/src/lib/api/index.ts

Reshapes the swarm module into a focused investigation tool where a researcher poses a specific question and a single agent investigates it. Key changes:
- Rename spd/agent_swarm/ → spd/investigate/, CLI spd-swarm → spd-investigate
- Single SLURM job instead of array, flat output dir structure
- Agent prompt accepts researcher's question + injects model architecture info
- 5 new MCP tools: probe_component, get_component_activation_examples, get_component_attributions, get_model_info, get_attribution_strength
- MCP dispatch refactored from if/elif chain to lookup tables
- Investigations scoped to loaded run via DepLoadedRun
- Frontend: refresh button, @file prompt input, launch-from-UI flow
- Graph artifacts expand to natural size, research log flows with page

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
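The move from an if/elif chain to lookup tables can be sketched as a single dict from tool name to handler. Only the tool names `get_model_info` and `probe_component` come from the commit; the handler bodies and argument names (`component_key`, `text`) are placeholders assumed for illustration.

```python
from typing import Any, Callable

ToolHandler = Callable[[dict[str, Any]], dict[str, Any]]


def _get_model_info(args: dict[str, Any]) -> dict[str, Any]:
    return {"tool": "get_model_info"}  # placeholder payload


def _probe_component(args: dict[str, Any]) -> dict[str, Any]:
    # placeholder: echo the (hypothetical) arguments back
    return {"component": args["component_key"], "text": args["text"]}


# Adding a tool is one new table entry rather than another elif branch.
TOOL_HANDLERS: dict[str, ToolHandler] = {
    "get_model_info": _get_model_info,
    "probe_component": _probe_component,
}


def dispatch_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        raise KeyError(f"Unknown tool: {name}")
    return handler(args)
```

A table also makes it easy to list the available tools from the same source of truth that dispatches them.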

Normalize wandb_path to canonical form (entity/project/run_id) when storing investigation metadata and when filtering. Handles old investigations that stored the "runs/" form. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
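A sketch of the normalization, assuming the older stored form is `entity/project/runs/run_id`. `normalize_wandb_path` is a hypothetical name, and the PR may accept additional variants.

```python
def normalize_wandb_path(path: str) -> str:
    """Return the canonical 'entity/project/run_id' form of a W&B run path."""
    parts = [p for p in path.strip().strip("/").split("/") if p]
    # Older investigations stored 'entity/project/runs/run_id'; drop the 'runs' segment.
    if len(parts) == 4 and parts[2] == "runs":
        parts = [parts[0], parts[1], parts[3]]
    if len(parts) != 3:
        raise ValueError(f"Unrecognized wandb path: {path!r}")
    return "/".join(parts)
```

Applying the same function both when storing metadata and when filtering means old and new records compare equal.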

…tigations UX
- Run picker: replace hardcoded modelName with fetched arch info (e.g. "SS LlamaSimple 4L d512"), add dataset_short to pretrain_info
- Artifact graphs: use shared graphLayout.ts for canonical layer names, fixing topological grouping (q/k/v rows, gate/up rows)
- Investigations: add launch-from-UI, @file prompt support, refresh button, remove research log scroll trap, scope to loaded run
- Remove layerAliasing.ts: backend now handles concrete→canonical translation
- Drop modelName from registry entries

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- get_component_info: translate canonical → concrete for harvest/interp lookups, canonicalize correlated component keys in response
- save_graph_artifact: use 'embed' not 'wte' for pseudo-nodes
- get_component_activation_examples: return canonical keys
- Tool descriptions: update examples to canonical format
- ArtifactGraph: prefetch component data on mount for tooltip cards
- Filter both 'wte' and 'embed' as non-interventable nodes
- Remove unused CSS selector in StagedNodesPanel

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Persists alongside other artifacts instead of being tied to a repo checkout. Keyed by run, so multiple runs share the DB safely. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Add --permission-mode dontAsk and --allowedTools mcp__spd__* to Claude Code launch, preventing use of Bash/Read/Write/Edit and blocking inheritance from ~/.claude/settings.json
- Revert DB path back to .data/app/prompt_attr.db

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Add --setting-sources "" to skip all user/project settings (no plugins, no inherited model, no alwaysThinkingEnabled)
- Add --model opus explicitly since global settings are skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
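The Claude Code flags accumulated across these commits can be collected into one argv list. This is a sketch: `build_claude_argv` is hypothetical, and the prompt and MCP server configuration that a real launch also needs are deliberately not shown.

```python
def build_claude_argv(max_turns: int = 50) -> list[str]:
    """Hypothetical helper assembling the Claude Code flags named in these commits."""
    return [
        "claude",
        "--print",
        "--verbose",  # required with --output-format stream-json in --print mode
        "--output-format", "stream-json",
        "--max-turns", str(max_turns),  # guard against runaway agents
        "--permission-mode", "dontAsk",
        "--allowedTools", "mcp__spd__*",  # only the SPD MCP tools
        "--setting-sources", "",  # skip all user/project settings
        "--model", "opus",  # explicit, since global settings are skipped
    ]
```

Passing this list to `subprocess.Popen` without a shell means `mcp__spd__*` is never glob-expanded and `""` reaches Claude Code as a genuine empty argument.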

Three-phase context-aware component labeling using network graph structure:
1. Output pass (late→early): labels what each component does, with downstream neighbor context
2. Input pass (early→late): labels what triggers each component, with upstream + co-firing context
3. Unification: synthesizes output + input labels into a unified label

Output and input passes are independent (both layer-serial, but no cross-dependency). Also extracts shared prompt helpers from dual_view.py into autointerp/prompt_helpers.py, and uses the topology module's CanonicalWeight system for correct layer ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
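The ordering of the three phases can be sketched sequentially. Assumptions: labels are plain strings, each labeling call receives all labels produced so far in its own pass, and the three callables are hypothetical stand-ins for the LLM prompts. The PR's pipeline is async and orders layers with CanonicalWeight; this shows only the dependency structure.

```python
from typing import Callable

LabelFn = Callable[[str, dict[str, str]], str]


def run_three_phase(
    layers: list[str],  # layer order, early -> late
    components_by_layer: dict[str, list[str]],
    label_output: LabelFn,
    label_input: LabelFn,
    unify: Callable[[str, str], str],
) -> dict[str, str]:
    # 1. Output pass, late -> early: downstream components already have output labels.
    output_labels: dict[str, str] = {}
    for layer in reversed(layers):
        for comp in components_by_layer[layer]:
            output_labels[comp] = label_output(comp, output_labels)
    # 2. Input pass, early -> late: upstream components already have input labels.
    #    It never reads output_labels, so the two passes could run concurrently.
    input_labels: dict[str, str] = {}
    for layer in layers:
        for comp in components_by_layer[layer]:
            input_labels[comp] = label_input(comp, input_labels)
    # 3. Unification merges the two independent labels per component.
    return {c: unify(output_labels[c], input_labels[c]) for c in output_labels}
```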

get_cofiring_neighbors no longer reads from the DB — it returns pure co-firing stats (Jaccard/PMI) with no labels. This ensures the input and output passes have zero logical coupling. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
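Both statistics can be computed from firing indicators alone, with no labels involved. The sketch uses the standard definitions, Jaccard = |A∩B| / |A∪B| and PMI = log P(a,b) / (P(a) P(b)); how the PR decides a component "fires", and whether it smooths the counts, is not stated, so `cofiring_stats` is a hypothetical illustration.

```python
import math


def cofiring_stats(fires_a: list[bool], fires_b: list[bool]) -> tuple[float, float]:
    """Return (Jaccard, PMI) for two components' firing indicators over the same tokens."""
    if len(fires_a) != len(fires_b) or not fires_a:
        raise ValueError("need two non-empty indicator lists of equal length")
    n = len(fires_a)
    n_a, n_b = sum(fires_a), sum(fires_b)
    n_ab = sum(a and b for a, b in zip(fires_a, fires_b))
    union = n_a + n_b - n_ab
    jaccard = n_ab / union if union else 0.0
    # PMI = log(P(a,b) / (P(a) P(b))) = log(n_ab * n / (n_a * n_b)); -inf if never co-firing
    pmi = math.log(n_ab * n / (n_a * n_b)) if n_ab else float("-inf")
    return jaccard, pmi
```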

- neighbors.py → graph_context.py
- NeighborContext → RelatedComponent
- get_downstream_neighbors → get_downstream_components
- get_upstream_neighbors → get_upstream_components
- get_cofiring_neighbors → get_cofiring_components
- top_k_neighbors → top_k_attributed
- DB columns: neighbor_key → related_key, neighbor_label → related_label

"Neighbours" implied same-layer adjacency; "related components" better conveys the attribution-graph and co-firing relationships.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

If a component failed its output or input pass (e.g. transient API error), the unification pass now logs a warning and skips it instead of asserting and silently deadlocking the async pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
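A synchronous sketch of the skip-and-warn behavior; `unify_all` is a hypothetical name, labels are simplified to strings, and the PR's version runs inside an async pipeline.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def unify_all(
    output_labels: dict[str, str],
    input_labels: dict[str, str],
    unify: Callable[[str, str], str],
) -> dict[str, str]:
    unified: dict[str, str] = {}
    for key in sorted(output_labels.keys() | input_labels.keys()):
        if key not in output_labels or key not in input_labels:
            # A failed pass (e.g. a transient API error) leaves one side missing:
            # warn and move on instead of asserting and stalling the pipeline.
            logger.warning("Skipping unification for %s: output or input label missing", key)
            continue
        unified[key] = unify(output_labels[key], input_labels[key])
    return unified
```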

Each directional pass now maintains labels_so_far: dict[str, LabelResult] as the scan accumulator. Related components look up labels from this dict instead of querying the DB. labels_so_far is seeded from the DB on resume, and each new label is written to the DB for durability, but the DB is never read mid-scan.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
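A sketch of one directional pass built around the accumulator. Assumptions: `LabelResult` is simplified to a string, the DB is abstracted as a hypothetical `LabelStore` protocol, and `related` returns the keys of a component's related components.

```python
from typing import Callable, Iterable, Protocol


class LabelStore(Protocol):
    """Hypothetical durable store standing in for the DB."""

    def load_all(self, pass_name: str) -> dict[str, str]: ...

    def save(self, pass_name: str, key: str, label: str) -> None: ...


def run_pass(
    pass_name: str,
    ordered_keys: Iterable[str],
    related: Callable[[str], Iterable[str]],
    label_fn: Callable[[str, dict[str, str]], str],
    store: LabelStore,
) -> dict[str, str]:
    # Seed the accumulator from the store once, on resume.
    labels_so_far: dict[str, str] = dict(store.load_all(pass_name))
    for key in ordered_keys:
        if key in labels_so_far:
            continue  # labelled before the resume
        # Context comes from the in-memory accumulator, never from the store.
        context = {r: labels_so_far[r] for r in related(key) if r in labels_so_far}
        labels_so_far[key] = label_fn(key, context)
        store.save(pass_name, key, labels_so_far[key])  # write-only during the scan
    return labels_so_far
```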

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…t_circuit, get_ci, find_components_by_examples) Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

…r unlabeled
- Replace raw component keys (h.3.mlp.c_fc:42) with human-readable descriptions (layer 3 MLP up-projection, component 42) using human_layer_desc
- Normalize attributions: strongest = 1.0, rest relative (+0.85, -0.42, etc.)
- Filter unlabeled components from related table (API failures), keep token entries
- Remove dead _FORBIDDEN constant and inconsistent "lowercase only" instruction
- max_examples 10 → 20 (user edit)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
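A sketch of the relative normalization, assuming "strongest" means largest absolute value so that signs are preserved; `normalize_attributions` is a hypothetical name.

```python
def normalize_attributions(attrs: dict[str, float]) -> dict[str, str]:
    """Scale so the strongest |attribution| is 1.0, formatted with an explicit sign."""
    if not attrs:
        return {}
    max_abs = max(abs(v) for v in attrs.values())
    if max_abs == 0:
        return {k: "+0.00" for k in attrs}
    return {k: f"{v / max_abs:+.2f}" for k, v in attrs.items()}
```

Relative values let the labeling model compare edges within one prompt without having to interpret raw attribution magnitudes.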

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Storage now has four structurally distinct edge types instead of a uniform dict-of-dicts: regular (component→component), embed (embed→component), unembed (component→unembed in residual space), and embed_unembed (embed→unembed). w_unembed is stored alongside attribution data so consumers never need to provide the projection matrix. Dropped mean_squared_attr metric and has_source/has_target methods — query methods return [] for nonexistent components. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
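A sketch of a store with one mapping per edge type. The layout is an assumption for illustration only: token ids as ints, component keys as strings, unembed edges kept as residual-space vectors projected through a `w_unembed` matrix stored alongside them. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttributionEdges:
    """Hypothetical store with one mapping per structurally distinct edge type."""

    regular: dict[tuple[str, str], float] = field(default_factory=dict)  # component -> component
    embed: dict[tuple[int, str], float] = field(default_factory=dict)  # token id -> component
    unembed: dict[str, list[float]] = field(default_factory=dict)  # component -> residual vector
    embed_unembed: dict[tuple[int, int], float] = field(default_factory=dict)  # token -> token
    w_unembed: list[list[float]] = field(default_factory=list)  # [d_model][vocab]

    def component_targets(self, source: str) -> list[tuple[str, float]]:
        # Unknown sources return [] instead of requiring a has_source() check.
        return [(t, v) for (s, t), v in self.regular.items() if s == source]

    def output_targets(self, source: str) -> list[tuple[int, float]]:
        # Project the residual-space vector through the stored w_unembed, so callers
        # never supply the projection matrix themselves.
        vec = self.unembed.get(source)
        if vec is None or not self.w_unembed:
            return []
        vocab = len(self.w_unembed[0])
        return [(t, sum(x * row[t] for x, row in zip(vec, self.w_unembed))) for t in range(vocab)]
```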

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Rename spd/topological_interp/ → spd/graph_interp/ with all classes, functions, CLI entry points, and references updated
- Remove redundant `direction` field from PromptEdge (determined by pass_name)
- Save prompt edges during interpretation (was defined but never called)
- Add get_all_prompt_edges() to DB and repo
- New backend router /api/graph_interp/ with labels, detail, and graph endpoints
- Add graph_interp to RunState, LoadedRun, and DataSourcesResponse
- Frontend: GraphInterpBadge component (side-by-side with autointerp in component card)
- Frontend: Model Graph tab with SVG DAG visualization, filtering, zoom/pan
- Frontend: Graph interp section in Data Sources tab

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Storage now holds raw (unnormalized) accumulator sums plus normalization metadata (CI sums, component activation RMS, logit RMS per token). Normalization happens at query time in get_top_sources/get_top_targets. This fixes: exact merge (element-wise addition instead of approximate weighted average), proper output-target normalization via logit RMS, no NaN from dead components (clamp at query time), and shallow-copy bug where embed was removed from sources_by_target. All attribution fields are now private — query methods are the only public interface. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
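Why raw sums merge exactly: every stored field is a plain sum over disjoint shards, whereas RMS values are not additive (the square root is nonlinear), so averaging them, even weighted by token count, is only approximate. The sketch uses a single RMS denominator and hypothetical field names; the PR's metadata (CI sums, activation RMS, logit RMS per token) is richer, and storing a sum of squares to make RMS mergeable is an assumption here.

```python
import math
from dataclasses import dataclass


@dataclass
class RawAccumulator:
    attr_sum: float = 0.0  # raw attribution sum for one edge
    act_sq_sum: float = 0.0  # sum of squared source activations
    n_tokens: int = 0

    def merge(self, other: "RawAccumulator") -> "RawAccumulator":
        # Exact: element-wise addition of sums over disjoint shards.
        return RawAccumulator(
            self.attr_sum + other.attr_sum,
            self.act_sq_sum + other.act_sq_sum,
            self.n_tokens + other.n_tokens,
        )

    def rms(self) -> float:
        return math.sqrt(self.act_sq_sum / self.n_tokens) if self.n_tokens else 0.0

    def normalized(self, eps: float = 1e-12) -> float:
        # Normalize only at query time; clamp so a dead component gives 0.0, not NaN.
        return self.attr_sum / max(self.rms(), eps)
```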

…ueries
- Remove n_batches_processed from all consumers (storage, routers, frontend, graph-interp, tests) after field was dropped from storage
- Add .detach().cpu() in save to prevent requires_grad leaking to disk
- Return empty list for get_top_targets("output:*") since output can't be a source
- Fix shallow copy bug in harvester: deep copy source lists to prevent embed being removed from sources_by_target
- Fix embed accumulator shape: use num_embeddings not d_model
- Restore fire.Fire in run_worker.py (was accidentally replaced with hardcoded call)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
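The shallow-copy bug class can be shown with plain dicts. `drop_embed_source` and the key strings are hypothetical; only the pattern (removing "embed" from a copied source list mutating `sources_by_target`) comes from the commit.

```python
import copy


def drop_embed_source(
    sources_by_target: dict[str, list[str]], target: str, deep: bool
) -> dict[str, list[str]]:
    """Return a copy of the mapping with 'embed' removed from one target's sources."""
    # dict(...) copies only the outer mapping: the inner lists stay shared, so removing
    # from them also mutates the caller's dict. copy.deepcopy copies the lists too.
    result = copy.deepcopy(sources_by_target) if deep else dict(sources_by_target)
    if "embed" in result.get(target, []):
        result[target].remove("embed")
    return result
```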

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… feature/model-editing
# Conflicts:
# spd/autointerp/strategies/dual_view.py
# spd/experiments/lm/pile_llama_simple_mlp-4L.yaml

…-load env var
- Add /api/graph_interp/detail endpoint returning labels + prompt edges per component
- GraphInterpBadge: lazy-fetch detail on expand, two-column layout (left=input, right=output)
- Resolve embed/output token strings server-side via AppTokenizer
- Extract shared isTokenNode/formatComponentKey utility (componentKeys.ts)
- Fix _concrete_to_canonical_key for embed/output pseudo-layers
- Add SPD_APP_DEFAULT_RUN env var to auto-load a run on app startup

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

GraphInterpBadge was fetching its own data with a plain `fetched` flag that didn't reset when props changed. This caused stale data when the component was reused (e.g. clicking through components in Activation Contexts tab). Now the fetch lives in useComponentData/ExpectCached alongside all other component data, and GraphInterpBadge is a pure display component. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… feature/model-editing

- docs/editing_conceptual_notes.md: conceptual findings from model editing experiments on s-17805b61 (sign convention, concept-selectivity, MLP fan-out, bias components, measurement methodology)
- docs/editing_session_2025-02-25.md: detailed session log with results, tool effectiveness analysis, and open questions
- scripts/export_circuit_json.py: export OptimizedPromptAttributionResult to JSON for standalone circuit graph renderer
- scripts/render_circuit_html.py: embed circuit JSON into self-contained HTML visualization
- CLAUDE.md: add worktree .venv guidance

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Thread dependency_job_id through submit_harvest → postprocess → CLI so postprocessing can be chained after a training job
- Add graph_interp field to PostprocessConfig with validation (requires attributions)
- Switch CLI from fire to argparse (fire silently parses 311644_1 as int 3116441 due to Python numeric literal underscores)
- Add s-55ea3f9b postprocess config

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
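The fire pitfall follows from PEP 515, which allows underscores between digits in integer literals, so a SLURM array-task ID such as `311644_1` evaluates as the integer 3116441. With argparse and `type=str` the ID is kept verbatim. The flag name `--dependency_job_id` is a hypothetical choice for this sketch, not necessarily the PR's CLI.

```python
import argparse
import ast

# PEP 515: underscores may separate digits, so this is a valid int literal.
assert ast.literal_eval("311644_1") == 3116441
assert int("311644_1") == 3116441


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="postprocess launcher (sketch)")
    # Hypothetical flag name; type=str keeps the SLURM job ID exactly as typed.
    parser.add_argument("--dependency_job_id", type=str, default=None)
    return parser.parse_args(argv)
```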

# Conflicts:
# spd/experiments/lm/pile_llama_simple_mlp-4L.yaml

- Change prompt input to textarea so Enter inserts newlines (Cmd+Enter to submit)
- Skip interpretation detail fetch when component has no interpretation headline
- Handle 404 gracefully in getInterpretationDetail (return null)
- Fix infinite $effect loop in InterventionsView (knownRunIds doesn't need $state)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

claude and others added 30 commits January 30, 2026 20:40

Fix stream-json output requiring --verbose flag

ef5b0fd

Add full timestamps to research log examples

4c4a843

Merge remote-tracking branch 'origin/dev' into claude/slurm-agent-swa…

dcb28f4

…rm-lIpTu

wip: Integrate agent swarm with MCP for Claude Code tool access

cb6e6f0

wip: Refactor agent swarm MCP configuration to require all swarm sett…

39b5acb

…ings together

wip: Add graph artifacts to investigation research logs

b47733f

Fix investigation wandb_path matching

1b30e81

Move app DB from repo-local .data/ to SPD_OUT_DIR/app/

474e2f3

Editing and autointerp

eb480c4

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Add kaleido dep and __main__ to run_interpret

9ab56d9

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Replace editing.py with editing/ package (adds optimize_circuit, prin…

7f0cb44

…t_circuit, get_ci, find_components_by_examples) Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

ocg-goodfire and others added 30 commits February 24, 2026 10:23

Tweak component display, tighten error threshold to 5%

98a65ae

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

wip.

17a25ba

wip: Refactor dataset attribution harvester to track abs attributions

4781853

Fix embed path not removed from unembed sources in harvester

48d318c

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Add graph interp badge to components tab, prune model graph to 500 nodes

3ccc301

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Merge remote-tracking branch 'origin/dev' into feature/model-editing

98dcc55

Merge remote-tracking branch 'origin/feature/topological-interp' into…

07aa9af

… feature/model-editing
# Conflicts:
# spd/autointerp/strategies/dual_view.py
# spd/experiments/lm/pile_llama_simple_mlp-4L.yaml

Merge remote-tracking branch 'origin/feature/topological-interp' into…

e87b274

… feature/model-editing

tiny tidy

e183401

wip: Add embed token count normalization for dataset attributions

5beaa66

fold in the investigator work

52c275e

Merge branch 'feature/topological-interp' into feature/model-editing

5a3856b

wip: Add CI optimization visualization during graph computation

0f759d5

Merge remote-tracking branch 'origin/dev' into feature/model-editing

0d9e466

# Conflicts:
# spd/experiments/lm/pile_llama_simple_mlp-4L.yaml

Merge branch 'dev' into feature/model-editing

76963ac

Merge branch 'dev' into feature/model-editing

e194ffa

add new canon run to registry

0820460

wip.

2eae377

Add clustering tab

199f277

Merge branch 'dev' into feature/model-editing

db74e8d

ocg-goodfire wants to merge 85 commits into dev from feature/model-editing

ocg-goodfire commented Feb 25, 2026


4 participants
