1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
assets/
.claude/
_reqs/

# Python
.venv/
2 changes: 2 additions & 0 deletions openspec/changes/universal-alignment-strategy/.openspec.yaml
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-04-16
258 changes: 258 additions & 0 deletions openspec/changes/universal-alignment-strategy/design.md

Large diffs are not rendered by default.

38 changes: 38 additions & 0 deletions openspec/changes/universal-alignment-strategy/proposal.md
@@ -0,0 +1,38 @@
## Why

Audio Refinery is coupled to WhisperX, which bundles transcription and Wav2Vec2 alignment behind a single API. This prevents us from swapping in better ASR backends (Cohere Transcribe, cloud APIs, newer open-source models) without losing the millisecond-level word timestamps that downstream consumers (RAG databases, semantic video editing, diarization merging) depend on. Decoupling the transcription engine from the word-level aligner turns Audio Refinery into a model-agnostic audio processing factory where ASR backends are hot-swappable while alignment quality stays constant.

## What Changes

- Add a new **forced alignment** stage built on `torchaudio.pipelines.MMS_FA` (Wav2Vec2 acoustic model) that accepts a sanitized text string plus audio and emits precise word-level start/end times (see the sketch after this list).
- Add a new **text normalization** step that sanitizes ASR output before alignment: expand numbers and currency ("$100" → "one hundred dollars"), strip punctuation and symbols that crash phoneme models, and lowercase.
- Refactor the **transcription** stage into an ASR-only, model-agnostic interface. The stage produces rough segments `[{start, end, text}]` and a detected language code; it no longer emits word-level timestamps directly.
- Introduce a **chunked alignment** strategy. Because trellis pathfinding memory scales quadratically with audio length, the aligner must chunk audio to ≤30 s windows based on ASR segment boundaries (or VAD) before running MMS-FA.
- Introduce a **hybrid model lifecycle** with two VRAM strategies: **co-resident** (all in-process models loaded simultaneously — fastest, requires sufficient VRAM) and **sequential** (load/unload each stage one at a time — fits any GPU). Each stage exposes `load_model()` / `unload_model()` with `gc.collect()` + `torch.cuda.empty_cache()`. Default strategy is co-resident; the system automatically falls back to sequential if the projected VRAM budget exceeds available capacity.
- Add a **VRAM budget preflight** that runs before the pipeline starts: queries available GPU memory, looks up projected VRAM per model from a new `model_budgets.toml` registry, and validates that the selected strategy fits. Blocks the run with an actionable error if it doesn't, suggesting alternatives (smaller model, different strategy, free GPU memory).
- Add an **interactive run planner** via a new `audio-refinery plan` CLI command. Discovers files, queries GPUs, renders the pipeline configuration with a VRAM budget bar chart, and lets the user adjust stages/models/strategy before committing. Optionally saves validated configurations as reusable **run profiles** (TOML) for non-interactive replay via `pipeline --plan profile.toml`.
- Plumb the ASR-detected `language_code` through to the aligner so future language-specific acoustic models can be swapped in dynamically.
- **BREAKING**: `TranscriptionResult` and the transcription stage contract change shape — word-level timestamps move out of transcription output into a new `AlignmentResult` produced by the forced-alignment stage. Downstream consumers (diarization merge, sentiment, CLI reports) read from the alignment output instead.
- **BREAKING**: WhisperX is removed from the critical path. The `transcriber` module either wraps faster-whisper directly (ASR-only) or exposes a pluggable backend interface; the `whisperx.align` call is deleted.
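
A minimal sketch of the new alignment path, assuming the `torchaudio.pipelines.MMS_FA` bundle API (`get_model()`, `get_tokenizer()`, `get_aligner()`); the input path and the pre-sanitized text are placeholders for what the transcription and normalization stages would supply:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.MMS_FA
device = "cuda" if torch.cuda.is_available() else "cpu"
model = bundle.get_model(with_star=False).to(device)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

# Hypothetical input; real stage code receives audio + sanitized text upstream.
waveform, sr = torchaudio.load("clip.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

words = "hello world".split()  # sanitized, lowercased text
with torch.inference_mode():
    emission, _ = model(waveform.to(device))
    token_spans = aligner(emission[0], tokenizer(words))

# Token spans are in emission frames; convert back to seconds.
seconds_per_frame = waveform.size(1) / emission.size(1) / bundle.sample_rate
for word, spans in zip(words, token_spans):
    start = spans[0].start * seconds_per_frame
    end = spans[-1].end * seconds_per_frame
    print(f"{word}: {start:.3f}-{end:.3f}s")
```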

## Capabilities

### New Capabilities

- `forced-alignment`: Chunked word-level forced alignment of arbitrary text against audio using torchaudio MMS-FA (Wav2Vec2). Accepts a language code, produces per-word start/end/score arrays, and owns its own model load/unload lifecycle.
- `text-normalization`: Deterministic text sanitization that prepares ASR output for phoneme-based alignment (number/currency expansion, symbol stripping, casing); a sketch follows this list.
- `transcription`: Model-agnostic ASR stage that emits rough segments and a detected language code. Establishes the contract that lets transcription backends be swapped without touching alignment.
- `model-lifecycle`: Hybrid VRAM strategy (co-resident vs. sequential) with per-stage `load_model`/`unload_model` hooks, a `model_budgets.toml` VRAM registry, an upfront preflight validation that blocks runs projected to OOM, and an interactive `audio-refinery plan` command for configuring and validating pipeline runs before execution.
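
A sketch of the `text-normalization` step for English; `num2words` and the function name are assumptions, since the proposal doesn't pin a library:

```python
import re

from num2words import num2words  # assumed dependency, not pinned by this proposal


def normalize_for_alignment(text: str) -> str:
    """Sanitize ASR output so the phoneme model sees only voiceable tokens."""
    # Currency first: "$100" -> "one hundred dollars" (USD-only in this sketch)
    text = re.sub(r"\$(\d+)", lambda m: f"{num2words(int(m.group(1)))} dollars", text)
    # Then bare numbers: "42" -> "forty-two"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # Strip symbols that crash phoneme models; keep letters, spaces, apostrophes
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    return " ".join(text.split()).lower()


assert normalize_for_alignment("I paid $100!") == "i paid one hundred dollars"
```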

### Modified Capabilities

<!-- No existing specs in openspec/specs/ — the four capabilities above are all net-new. -->

## Impact

- **Code**: `src/transcriber.py` (refactored into ASR-only wrapper), `src/pipeline.py` (new stage order + hybrid VRAM lifecycle), `src/cli.py` (new `align` and `plan` subcommands, updated `transcribe` and `pipeline` commands), `src/models/transcription.py` (shape change), new `src/aligner.py`, `src/text_normalizer.py`, `src/vram_preflight.py` modules, new `src/models/alignment.py` Pydantic model, new `src/model_budgets.toml` registry.
- **APIs**: `TranscriptionResult.segments[].words` removed; new `AlignmentResult` model with `aligned_words: list[AlignedWord]` becomes the source of truth for word timing. Sentiment and diarization merging read from the alignment output. New `VramBudget` dataclass and `validate_vram_budget()` function for preflight checks (sketched after this list).
- **Dependencies**: `torchaudio>=2.1.2` (already pinned via torch 2.1.2) gains the MMS-FA bundle; HuggingFace model weights for `torchaudio.pipelines.MMS_FA` download on first use. WhisperX drops from the critical path but stays available as an optional legacy backend behind a feature flag for one release cycle to ease migration. `faster-whisper` becomes a direct dependency (previously pulled transitively by whisperx).
- **Tests**: New unit tests for `text_normalizer`, `aligner` (with mocked torchaudio bundle), VRAM preflight validation, and pipeline reordering; new integration test that runs the full decoupled pipeline end-to-end on a short clip with GPU marker.
- **Performance**: Co-resident mode preserves current batch throughput (models loaded once). Sequential mode trades load/unload overhead for reduced VRAM footprint. Preflight validation prevents OOM failures before work begins.
- **Docs**: `README.md`, `CLAUDE.md`, and `docs/DEVELOPMENT.md` updated to describe the new stage order, model-agnostic transcription contract, VRAM strategies, and the `plan` command.
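
A sketch of the preflight contract named above, assuming `model_budgets.toml` maps model names to projected megabytes (strategy names follow the bullets above, but the exact schema is not fixed by this proposal):

```python
import tomllib
from dataclasses import dataclass

import torch


@dataclass
class VramBudget:
    required_mb: int
    available_mb: int


def validate_vram_budget(
    models: list[str],
    strategy: str,  # "co-resident" | "sequential"
    budgets_path: str = "src/model_budgets.toml",
) -> VramBudget:
    """Block the run before any model loads if the projected budget can't fit."""
    with open(budgets_path, "rb") as f:
        budgets = tomllib.load(f)  # e.g. {"mms_fa": 1200, "whisper_large_v3": 4700}
    per_model = [budgets[name] for name in models]
    # Co-resident holds every model at once; sequential peaks at the largest one.
    required = sum(per_model) if strategy == "co-resident" else max(per_model)
    free_bytes, _total = torch.cuda.mem_get_info()
    budget = VramBudget(required_mb=required, available_mb=free_bytes // 2**20)
    if budget.required_mb > budget.available_mb:
        raise RuntimeError(
            f"Projected VRAM {budget.required_mb} MB (strategy={strategy}) exceeds "
            f"free {budget.available_mb} MB; try the sequential strategy, a smaller "
            "model, or freeing GPU memory."
        )
    return budget
```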
@@ -0,0 +1,61 @@
## ADDED Requirements

### Requirement: Forced alignment stage produces word-level timestamps from text and audio

The system SHALL provide a forced-alignment stage that accepts a sanitized text string, an audio waveform, and a language code, and produces a list of aligned words with start time, end time, and a confidence score. The stage SHALL use `torchaudio.pipelines.MMS_FA` as its default acoustic model.

#### Scenario: Aligning a short English utterance

- **WHEN** the stage is called with a ~5 second audio clip, the sanitized text "hello world", and language code "en"
- **THEN** it returns two aligned words whose start/end times fall within the clip bounds, whose ordering matches the text, and whose scores are floats in `[0.0, 1.0]`

#### Scenario: Aligning a non-English utterance

- **WHEN** the stage is called with audio, sanitized text, and a non-English language code that MMS-FA supports
- **THEN** it loads the MMS-FA bundle once, produces aligned words for the input, and does not raise

#### Scenario: Aligner receives an unsupported language code

- **WHEN** the stage is called with a language code MMS-FA does not support
- **THEN** it raises a clear error identifying the unsupported language, and it does not silently fall back to a different language model

### Requirement: Forced alignment chunks audio before running the trellis

The system SHALL chunk input audio into windows of at most 30 seconds before running the forced-alignment trellis, using upstream ASR segment boundaries as the primary chunk delimiter. The system SHALL merge per-chunk results into a single global timeline using each chunk's start offset.

#### Scenario: Ten-minute input with multiple ASR segments

- **WHEN** the stage is called with a 10-minute audio file and 40 ASR segments
- **THEN** it processes the file as multiple chunks each ≤30 seconds long, each chunk's word timings are offset back into the global timeline, and the final output covers the full duration with monotonically increasing start times

#### Scenario: Single ASR segment exceeds the chunk cap

- **WHEN** an ASR segment's audio window is longer than 30 seconds
- **THEN** the stage splits the segment on a silence-based fallback (or reports a clear `fallback_reason` if no silence is found) and continues processing the rest of the file
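
A sketch of the merge this requirement describes, assuming each chunk record carries its global start offset alongside the aligner's chunk-relative word times (field names are illustrative):

```python
def merge_chunk_words(chunks: list[dict]) -> list[dict]:
    """Shift each chunk's word times by its offset into the global timeline."""
    merged = []
    for chunk in chunks:  # e.g. {"offset_s": 30.0, "words": [{"start": 0.4, ...}]}
        for word in chunk["words"]:
            merged.append({
                **word,
                "start": word["start"] + chunk["offset_s"],
                "end": word["end"] + chunk["offset_s"],
            })
    # Chunks are processed in order, so global start times stay monotonic.
    assert all(a["start"] <= b["start"] for a, b in zip(merged, merged[1:]))
    return merged
```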

### Requirement: Forced alignment stage owns its model load and unload lifecycle

The forced-alignment stage SHALL expose explicit `load_model` and `unload_model` entry points. After `unload_model` returns, the MMS-FA acoustic model SHALL no longer hold GPU memory, as observed by `torch.cuda.memory_allocated` returning to its pre-load baseline on the target device.

#### Scenario: Unloading releases VRAM

- **WHEN** the stage loads the MMS-FA model, runs one alignment, and calls `unload_model`
- **THEN** `torch.cuda.memory_allocated(device)` after `unload_model` is within 50 MB of its value before `load_model` was called
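
A sketch of the lifecycle hooks this requirement measures; the class name is illustrative:

```python
import gc

import torch
import torchaudio


class ForcedAligner:
    """Illustrative stage wrapper that owns the MMS-FA model lifecycle."""

    def __init__(self) -> None:
        self.model = None

    def load_model(self, device: str = "cuda") -> None:
        self.model = torchaudio.pipelines.MMS_FA.get_model().to(device)

    def unload_model(self) -> None:
        # Drop the last reference and collect so memory_allocated() falls back
        # to its pre-load baseline; empty_cache() then hands the freed blocks
        # back to the driver for other stages and processes.
        self.model = None
        gc.collect()
        torch.cuda.empty_cache()
```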

### Requirement: Forced alignment output includes provenance

The stage SHALL emit an `AlignmentResult` containing the input audio path, a pointer to the transcription JSON that sourced the text, the language code used, the acoustic model identifier, the device, the number of chunks processed, any `fallback_reason`, processing time, and timestamps.

#### Scenario: Serializing an AlignmentResult

- **WHEN** the stage completes alignment and serializes its result to JSON
- **THEN** the JSON contains `input_file`, `transcription_file`, `language`, `acoustic_model`, `device`, `chunks_processed`, `processing_time_seconds`, `started_at`, `completed_at`, and `aligned_words`
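
The field list above maps directly onto a Pydantic model; a sketch with assumed types where the requirement doesn't state them:

```python
from datetime import datetime

from pydantic import BaseModel


class AlignedWord(BaseModel):
    word: str
    start: float  # seconds, global timeline
    end: float
    score: float  # confidence in [0.0, 1.0]
    segment_index: int  # see the segment-linking requirement below


class AlignmentResult(BaseModel):
    input_file: str
    transcription_file: str
    language: str
    acoustic_model: str  # e.g. "torchaudio.pipelines.MMS_FA"
    device: str
    chunks_processed: int
    fallback_reason: str | None = None
    processing_time_seconds: float
    started_at: datetime
    completed_at: datetime
    aligned_words: list[AlignedWord]
```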

### Requirement: Aligned words link back to transcript segments

Each aligned word SHALL carry an index identifying which transcript segment it belongs to, so downstream consumers can reconstruct segment-to-word membership without relying on timestamp overlap heuristics.

#### Scenario: Reconstructing segment membership

- **WHEN** a consumer reads an `AlignmentResult` and a `TranscriptionResult` from the same pipeline run
- **THEN** every aligned word's `segment_index` value is a valid index into `TranscriptionResult.segments`, and iterating aligned words grouped by `segment_index` reproduces the word order within each segment
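
A sketch of that reconstruction, reusing the `AlignmentResult` model sketched above; it relies on aligned words being emitted in transcript order:

```python
from itertools import groupby


def words_by_segment(result: AlignmentResult) -> dict[int, list[AlignedWord]]:
    # Words arrive in transcript order, so grouping by segment_index reproduces
    # each segment's internal word order without timestamp-overlap heuristics.
    return {
        index: list(words)
        for index, words in groupby(result.aligned_words, key=lambda w: w.segment_index)
    }
```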