
Compare text generations of simplestories models#254

Open
leesharkey wants to merge 33 commits into main from feature/compare_lm_generations

Conversation

@leesharkey
Contributor

Description

This PR adds a new script spd/scripts/compare_generations/compare_generations.py that enables qualitative comparison of text generations from different model configurations. The script generates and displays side-by-side comparisons of:

  • Target model: Original pretrained model generations
  • Unmasked SPD model: SPD model with all component masks fixed to 1
  • Masked SPD model: SPD model using causal importance values directly as masks
  • Stochastic masked SPD model: SPD model using stochastically sampled masks from causal importance (matching training behavior)
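The four configurations above differ only in how component masks are derived from causal importances. A minimal sketch of that logic (hypothetical function and mode names; the actual SPD APIs in the repo may differ):

```python
import random

def make_masks(causal_importances, mode, rng=None):
    """Sketch of the four comparison modes.

    causal_importances: per-component importance values in [0, 1].
    Returns a list of mask values, or None for the unmodified target model.
    """
    if mode == "target":
        return None  # original pretrained model, no masking applied
    if mode == "unmasked":
        return [1.0 for _ in causal_importances]  # all components fully on
    if mode == "masked":
        return list(causal_importances)  # use importances directly as masks
    if mode == "stochastic":
        # Bernoulli-sample each mask with probability equal to its
        # causal importance, matching training behavior.
        rng = rng or random
        return [1.0 if rng.random() < ci else 0.0 for ci in causal_importances]
    raise ValueError(f"unknown mode: {mode}")
```

The "masked" and "stochastic" modes agree in expectation; comparing them side by side is what lets the script contrast inference-time and training-time behavior.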

The script loads models from wandb run IDs (or local paths), generates text autoregressively, and displays results both in the terminal and saves them to JSON for further analysis.

Key features:

  • Configurable generation parameters (temperature, top-k, max tokens, prompt length)
  • Automatic prompt truncation to control output visibility
  • Sequences start at story boundaries (beginning of stories) rather than arbitrary positions
  • Simple, readable implementation prioritizing clarity over efficiency
  • For masked SPD generation, causal importances are recomputed for each new token
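The temperature and top-k parameters combine in the usual way during autoregressive sampling. A self-contained sketch of one sampling step (pure stdlib; the script itself presumably operates on model logit tensors):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=10, rng=random):
    """Sample a token index: keep the top_k highest logits, scale by
    temperature, softmax, then draw from the resulting distribution."""
    indexed = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:top_k]
    scaled = [(i, logit / temperature) for i, logit in indexed]
    # Numerically stable softmax over the surviving logits.
    m = max(logit for _, logit in scaled)
    weights = [math.exp(logit - m) for _, logit in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices([i for i, _ in scaled], weights=probs, k=1)[0]
```

With `top_k=1` this reduces to greedy decoding; a very high temperature (like the 100.0 setting tested below) flattens the distribution toward uniform over the top-k candidates.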

Related Issue

N/A - New feature addition

Motivation and Context

This script addresses the need for qualitative evaluation and comparison of different SPD model configurations. It allows researchers to:

  1. Compare model behaviors: See how target models, unmasked SPD, and masked SPD models differ in their text generation
  2. Validate SPD training: Compare stochastic masked SPD (training behavior) vs deterministic masked SPD (inference behavior)
  3. Debug and analyze: Identify qualitative differences between model configurations that might not be captured by quantitative metrics
  4. Produce reproducible comparisons: Save results to JSON for later analysis and sharing
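For point 4, the saved JSON might be structured along these lines (a hypothetical schema for illustration; the script's actual field names may differ):

```python
import json

# Hypothetical result record: one prompt, one generation per model
# configuration, plus the sampling config for reproducibility.
results = {
    "prompt": "Once upon a time",
    "generations": {
        "target": "...",
        "unmasked_spd": "...",
        "masked_spd": "...",
        "stochastic_masked_spd": "...",
    },
    "config": {"temperature": 1.0, "top_k": 40, "max_new_tokens": 100},
}

serialized = json.dumps(results, indent=2)
```

Keeping the sampling config alongside the generations is what makes a saved comparison reproducible and shareable.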

The script is designed to be simple and maintainable, making it easy to understand and modify for different use cases.

How Has This Been Tested?

The script has been tested with:

  • Wandb run IDs: `wandb:goodfire/spd/runs/9gf5ud48` and `5cdomlyn`
  • Language model task (SimpleStories dataset)
  • Various temperature settings (1.0, 100.0) to verify sampling works correctly
  • Different prompt lengths (10, 50, 100 tokens) to verify truncation
  • All four generation types (target, unmasked SPD, masked SPD, stochastic masked SPD)

Test results:

  • ✅ Successfully loads models from wandb
  • ✅ Generates text with all four model types
  • ✅ Displays results in readable format
  • ✅ Saves results to JSON
  • ✅ Prompts start at story boundaries (not mid-story)
  • ✅ Temperature sampling works correctly
  • ✅ Prompt truncation works as expected

Does this PR introduce a breaking change?

No, this is a new script addition and does not modify any existing functionality. It only adds new files:

  • spd/scripts/compare_generations/compare_generations.py
  • spd/scripts/compare_generations/compare_generations_config.yaml
  • spd/scripts/compare_generations/README.md
  • .vscode/launch.json (debugger configuration)

No existing code paths are affected.

leesharkey and others added 30 commits September 16, 2025 18:07
