
Compare text generations of simplestories models#254

Open
leesharkey wants to merge 33 commits into main from feature/compare_lm_generations

Conversation

@leesharkey
Contributor

Description

This PR adds a new script spd/scripts/compare_generations/compare_generations.py that enables qualitative comparison of text generations from different model configurations. The script generates and displays side-by-side comparisons of:

  • Target model: Original pretrained model generations
  • Unmasked SPD model: SPD model with all component masks fixed to 1
  • Masked SPD model: SPD model using causal importance values directly as masks
  • Stochastic masked SPD model: SPD model using stochastically sampled masks from causal importance (matching training behavior)
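The four configurations above differ only in how component masks are derived from causal importances. A minimal sketch of that logic (hypothetical function and mode names; the actual SPD APIs in the repo may differ):

```python
import random

def make_masks(causal_importances, mode, rng=None):
    """Sketch of the four comparison modes.

    causal_importances: per-component importance values in [0, 1].
    Returns a list of mask values, or None for the unmodified target model.
    """
    if mode == "target":
        return None  # original pretrained model, no masking applied
    if mode == "unmasked":
        return [1.0 for _ in causal_importances]  # all components fully on
    if mode == "masked":
        return list(causal_importances)  # use importances directly as masks
    if mode == "stochastic":
        # Bernoulli-sample each mask with probability equal to its
        # causal importance, matching training behavior.
        rng = rng or random
        return [1.0 if rng.random() < ci else 0.0 for ci in causal_importances]
    raise ValueError(f"unknown mode: {mode}")
```

The "masked" and "stochastic" modes agree in expectation; comparing them side by side is what lets the script contrast inference-time and training-time behavior.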

The script loads models from wandb run IDs (or local paths), generates text autoregressively, and displays results both in the terminal and saves them to JSON for further analysis.

Key features:

  • Configurable generation parameters (temperature, top-k, max tokens, prompt length)
  • Automatic prompt truncation to control output visibility
  • Sequences start at story boundaries (beginning of stories) rather than arbitrary positions
  • Simple, readable implementation prioritizing clarity over efficiency
  • For masked SPD generation, causal importances are recomputed for each new token
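The temperature and top-k parameters combine in the usual way during autoregressive sampling. A self-contained sketch of one sampling step (pure stdlib; the script itself presumably operates on model logit tensors):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=10, rng=random):
    """Sample a token index: keep the top_k highest logits, scale by
    temperature, softmax, then draw from the resulting distribution."""
    indexed = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:top_k]
    scaled = [(i, logit / temperature) for i, logit in indexed]
    # Numerically stable softmax over the surviving logits.
    m = max(logit for _, logit in scaled)
    weights = [math.exp(logit - m) for _, logit in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices([i for i, _ in scaled], weights=probs, k=1)[0]
```

With `top_k=1` this reduces to greedy decoding; a very high temperature (like the 100.0 setting tested below) flattens the distribution toward uniform over the top-k candidates.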

Related Issue

N/A - New feature addition

Motivation and Context

This script addresses the need for qualitative evaluation and comparison of different SPD model configurations. It allows researchers to:

  1. Compare model behaviors: See how target models, unmasked SPD, and masked SPD models differ in their text generation
  2. Validate SPD training: Compare stochastic masked SPD (training behavior) vs deterministic masked SPD (inference behavior)
  3. Debug and analyze: Identify qualitative differences between model configurations that might not be captured by quantitative metrics
  4. Produce reproducible comparisons: Save results to JSON for later analysis and sharing
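For point 4, the saved JSON might be structured along these lines (a hypothetical schema for illustration; the script's actual field names may differ):

```python
import json

# Hypothetical result record: one prompt, one generation per model
# configuration, plus the sampling config for reproducibility.
results = {
    "prompt": "Once upon a time",
    "generations": {
        "target": "...",
        "unmasked_spd": "...",
        "masked_spd": "...",
        "stochastic_masked_spd": "...",
    },
    "config": {"temperature": 1.0, "top_k": 40, "max_new_tokens": 100},
}

serialized = json.dumps(results, indent=2)
```

Keeping the sampling config alongside the generations is what makes a saved comparison reproducible and shareable.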

The script is designed to be simple and maintainable, making it easy to understand and modify for different use cases.

How Has This Been Tested?

The script has been tested with:

  • Wandb run IDs: `wandb:goodfire/spd/runs/9gf5ud48` and `5cdomlyn`
  • Language model task (SimpleStories dataset)
  • Various temperature settings (1.0, 100.0) to verify sampling works correctly
  • Different prompt lengths (10, 50, 100 tokens) to verify truncation
  • All four generation types (target, unmasked SPD, masked SPD, stochastic masked SPD)

Test results:

  • ✅ Successfully loads models from wandb
  • ✅ Generates text with all four model types
  • ✅ Displays results in readable format
  • ✅ Saves results to JSON
  • ✅ Prompts start at story boundaries (not mid-story)
  • ✅ Temperature sampling works correctly
  • ✅ Prompt truncation works as expected

Does this PR introduce a breaking change?

No, this is a new script addition and does not modify any existing functionality. It only adds new files:

  • spd/scripts/compare_generations/compare_generations.py
  • spd/scripts/compare_generations/compare_generations_config.yaml
  • spd/scripts/compare_generations/README.md
  • .vscode/launch.json (debugger configuration)

No existing code paths are affected.

leesharkey and others added 30 commits September 16, 2025 18:07
