1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
assets/
.claude/
_reqs/

# Python
.venv/
2 changes: 2 additions & 0 deletions openspec/changes/universal-alignment-strategy/.openspec.yaml
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-04-16
258 changes: 258 additions & 0 deletions openspec/changes/universal-alignment-strategy/design.md

Large diffs are not rendered by default.

38 changes: 38 additions & 0 deletions openspec/changes/universal-alignment-strategy/proposal.md
@@ -0,0 +1,38 @@
## Why

Audio Refinery is coupled to WhisperX, which bundles transcription and Wav2Vec2 alignment behind a single API. This prevents us from swapping in better ASR backends (Cohere Transcribe, cloud APIs, newer open-source models) without losing the millisecond-level word timestamps that downstream consumers (RAG databases, semantic video editing, diarization merging) depend on. Decoupling the transcription engine from the word-level aligner turns Audio Refinery into a model-agnostic audio processing factory where ASR backends are hot-swappable while alignment quality stays constant.

## What Changes

- Add a new **forced alignment** stage built on `torchaudio.pipelines.MMS_FA` (Wav2Vec2 acoustic model) that accepts a sanitized text string plus audio and emits precise word-level start/end times (see the sketch after this list).
- Add a new **text normalization** step that sanitizes ASR output before alignment: expand numbers and currency ("$100" → "one hundred dollars"), strip punctuation and symbols that crash phoneme models, and lowercase.
- Refactor the **transcription** stage into an ASR-only, model-agnostic interface. The stage produces rough segments `[{start, end, text}]` and a detected language code; it no longer emits word-level timestamps directly.
- Introduce a **chunked alignment** strategy. Because trellis pathfinding memory scales quadratically with audio length, the aligner must chunk audio to ≤30 s windows based on ASR segment boundaries (or VAD) before running MMS-FA.
- Introduce a **hybrid model lifecycle** with two VRAM strategies: **co-resident** (all in-process models loaded simultaneously — fastest, requires sufficient VRAM) and **sequential** (load/unload each stage one at a time — fits any GPU). Each stage exposes `load_model()` / `unload_model()` with `gc.collect()` + `torch.cuda.empty_cache()`. Default strategy is co-resident; the system automatically falls back to sequential if the projected VRAM budget exceeds available capacity.
- Add a **VRAM budget preflight** that runs before the pipeline starts: queries available GPU memory, looks up projected VRAM per model from a new `model_budgets.toml` registry, and validates that the selected strategy fits. Blocks the run with an actionable error if it doesn't, suggesting alternatives (smaller model, different strategy, free GPU memory).
- Add an **interactive run planner** via a new `audio-refinery plan` CLI command. Discovers files, queries GPUs, renders the pipeline configuration with a VRAM budget bar chart, and lets the user adjust stages/models/strategy before committing. Optionally saves validated configurations as reusable **run profiles** (TOML) for non-interactive replay via `pipeline --plan profile.toml`.
- Plumb the ASR-detected `language_code` through to the aligner so future language-specific acoustic models can be swapped in dynamically.
- **BREAKING**: `TranscriptionResult` and the transcription stage contract change shape — word-level timestamps move out of transcription output into a new `AlignmentResult` produced by the forced-alignment stage. Downstream consumers (diarization merge, sentiment, CLI reports) read from the alignment output instead.
- **BREAKING**: WhisperX is removed from the critical path. The `transcriber` module either wraps faster-whisper directly (ASR-only) or exposes a pluggable backend interface; the `whisperx.align` call is deleted.
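
A minimal sketch of the new alignment path, assuming the `torchaudio.pipelines.MMS_FA` bundle API (`get_model()`, `get_tokenizer()`, `get_aligner()`); the input path and the pre-sanitized text are placeholders for what the transcription and normalization stages would supply:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.MMS_FA
device = "cuda" if torch.cuda.is_available() else "cpu"
model = bundle.get_model(with_star=False).to(device)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

# Hypothetical input; real stage code receives audio + sanitized text upstream.
waveform, sr = torchaudio.load("clip.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

words = "hello world".split()  # sanitized, lowercased text
with torch.inference_mode():
    emission, _ = model(waveform.to(device))
    token_spans = aligner(emission[0], tokenizer(words))

# Token spans are in emission frames; convert back to seconds.
seconds_per_frame = waveform.size(1) / emission.size(1) / bundle.sample_rate
for word, spans in zip(words, token_spans):
    start = spans[0].start * seconds_per_frame
    end = spans[-1].end * seconds_per_frame
    print(f"{word}: {start:.3f}-{end:.3f}s")
```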

## Capabilities

### New Capabilities

- `forced-alignment`: Chunked word-level forced alignment of arbitrary text against audio using torchaudio MMS-FA (Wav2Vec2). Accepts a language code, produces per-word start/end/score arrays, and owns its own model load/unload lifecycle.
- `text-normalization`: Deterministic text sanitization that prepares ASR output for phoneme-based alignment (number/currency expansion, symbol stripping, casing); a sketch follows this list.
- `transcription`: Model-agnostic ASR stage that emits rough segments and a detected language code. Establishes the contract that lets transcription backends be swapped without touching alignment.
- `model-lifecycle`: Hybrid VRAM strategy (co-resident vs. sequential) with per-stage `load_model`/`unload_model` hooks, a `model_budgets.toml` VRAM registry, an upfront preflight validation that blocks runs projected to OOM, and an interactive `audio-refinery plan` command for configuring and validating pipeline runs before execution.
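
A sketch of the `text-normalization` step for English; `num2words` and the function name are assumptions, since the proposal doesn't pin a library:

```python
import re

from num2words import num2words  # assumed dependency, not pinned by this proposal


def normalize_for_alignment(text: str) -> str:
    """Sanitize ASR output so the phoneme model sees only voiceable tokens."""
    # Currency first: "$100" -> "one hundred dollars" (USD-only in this sketch)
    text = re.sub(r"\$(\d+)", lambda m: f"{num2words(int(m.group(1)))} dollars", text)
    # Then bare numbers: "42" -> "forty-two"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # Strip symbols that crash phoneme models; keep letters, spaces, apostrophes
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    return " ".join(text.split()).lower()


assert normalize_for_alignment("I paid $100!") == "i paid one hundred dollars"
```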

### Modified Capabilities

<!-- No existing specs in openspec/specs/ — the four capabilities above are all net-new. -->

## Impact

- **Code**: `src/transcriber.py` (refactored into ASR-only wrapper), `src/pipeline.py` (new stage order + hybrid VRAM lifecycle), `src/cli.py` (new `align` and `plan` subcommands, updated `transcribe` and `pipeline` commands), `src/models/transcription.py` (shape change), new `src/aligner.py`, `src/text_normalizer.py`, `src/vram_preflight.py` modules, new `src/models/alignment.py` Pydantic model, new `src/model_budgets.toml` registry.
- **APIs**: `TranscriptionResult.segments[].words` removed; new `AlignmentResult` model with `aligned_words: list[AlignedWord]` becomes the source of truth for word timing. Sentiment and diarization merging read from the alignment output. New `VramBudget` dataclass and `validate_vram_budget()` function for preflight checks (sketched after this list).
- **Dependencies**: `torchaudio>=2.1.2` (already pinned via torch 2.1.2) gains the MMS-FA bundle; HuggingFace model weights for `torchaudio.pipelines.MMS_FA` download on first use. WhisperX drops from the critical path but stays available as an optional legacy backend behind a feature flag for one release cycle to ease migration. `faster-whisper` becomes a direct dependency (previously pulled transitively by whisperx).
- **Tests**: New unit tests for `text_normalizer`, `aligner` (with mocked torchaudio bundle), VRAM preflight validation, and pipeline reordering; new integration test that runs the full decoupled pipeline end-to-end on a short clip with GPU marker.
- **Performance**: Co-resident mode preserves current batch throughput (models loaded once). Sequential mode trades load/unload overhead for reduced VRAM footprint. Preflight validation prevents OOM failures before work begins.
- **Docs**: `README.md`, `CLAUDE.md`, and `docs/DEVELOPMENT.md` updated to describe the new stage order, model-agnostic transcription contract, VRAM strategies, and the `plan` command.
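
A sketch of the preflight contract named above, assuming `model_budgets.toml` maps model names to projected megabytes (strategy names follow the bullets above, but the exact schema is not fixed by this proposal):

```python
import tomllib
from dataclasses import dataclass

import torch


@dataclass
class VramBudget:
    required_mb: int
    available_mb: int


def validate_vram_budget(
    models: list[str],
    strategy: str,  # "co-resident" | "sequential"
    budgets_path: str = "src/model_budgets.toml",
) -> VramBudget:
    """Block the run before any model loads if the projected budget can't fit."""
    with open(budgets_path, "rb") as f:
        budgets = tomllib.load(f)  # e.g. {"mms_fa": 1200, "whisper_large_v3": 4700}
    per_model = [budgets[name] for name in models]
    # Co-resident holds every model at once; sequential peaks at the largest one.
    required = sum(per_model) if strategy == "co-resident" else max(per_model)
    free_bytes, _total = torch.cuda.mem_get_info()
    budget = VramBudget(required_mb=required, available_mb=free_bytes // 2**20)
    if budget.required_mb > budget.available_mb:
        raise RuntimeError(
            f"Projected VRAM {budget.required_mb} MB (strategy={strategy}) exceeds "
            f"free {budget.available_mb} MB; try the sequential strategy, a smaller "
            "model, or freeing GPU memory."
        )
    return budget
```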
@@ -0,0 +1,61 @@
## ADDED Requirements

### Requirement: Forced alignment stage produces word-level timestamps from text and audio

The system SHALL provide a forced-alignment stage that accepts a sanitized text string, an audio waveform, and a language code, and produces a list of aligned words with start time, end time, and a confidence score. The stage SHALL use `torchaudio.pipelines.MMS_FA` as its default acoustic model.

#### Scenario: Aligning a short English utterance

- **WHEN** the stage is called with a ~5 second audio clip, the sanitized text "hello world", and language code "en"
- **THEN** it returns two aligned words whose start/end times fall within the clip bounds, whose ordering matches the text, and whose scores are floats in `[0.0, 1.0]`

#### Scenario: Aligning a non-English utterance

- **WHEN** the stage is called with audio, sanitized text, and a non-English language code that MMS-FA supports
- **THEN** it loads the MMS-FA bundle once, produces aligned words for the input, and does not raise

#### Scenario: Aligner receives an unsupported language code

- **WHEN** the stage is called with a language code MMS-FA does not support
- **THEN** it raises a clear error identifying the unsupported language, and it does not silently fall back to a different language model

### Requirement: Forced alignment chunks audio before running the trellis

The system SHALL chunk input audio into windows of at most 30 seconds before running the forced-alignment trellis, using upstream ASR segment boundaries as the primary chunk delimiter. The system SHALL merge per-chunk results into a single global timeline using each chunk's start offset.

#### Scenario: Ten-minute input with multiple ASR segments

- **WHEN** the stage is called with a 10-minute audio file and 40 ASR segments
- **THEN** it processes the file as multiple chunks each ≤30 seconds long, each chunk's word timings are offset back into the global timeline, and the final output covers the full duration with monotonically increasing start times

#### Scenario: Single ASR segment exceeds the chunk cap

- **WHEN** an ASR segment's audio window is longer than 30 seconds
- **THEN** the stage splits the segment on a silence-based fallback (or reports a clear `fallback_reason` if no silence is found) and continues processing the rest of the file
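
A sketch of the merge this requirement describes, assuming each chunk record carries its global start offset alongside the aligner's chunk-relative word times (field names are illustrative):

```python
def merge_chunk_words(chunks: list[dict]) -> list[dict]:
    """Shift each chunk's word times by its offset into the global timeline."""
    merged = []
    for chunk in chunks:  # e.g. {"offset_s": 30.0, "words": [{"start": 0.4, ...}]}
        for word in chunk["words"]:
            merged.append({
                **word,
                "start": word["start"] + chunk["offset_s"],
                "end": word["end"] + chunk["offset_s"],
            })
    # Chunks are processed in order, so global start times stay monotonic.
    assert all(a["start"] <= b["start"] for a, b in zip(merged, merged[1:]))
    return merged
```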

### Requirement: Forced alignment stage owns its model load and unload lifecycle

The forced-alignment stage SHALL expose explicit `load_model` and `unload_model` entry points. After `unload_model` returns, the MMS-FA acoustic model SHALL no longer hold GPU memory, as observed by `torch.cuda.memory_allocated` returning to its pre-load baseline on the target device.

#### Scenario: Unloading releases VRAM

- **WHEN** the stage loads the MMS-FA model, runs one alignment, and calls `unload_model`
- **THEN** `torch.cuda.memory_allocated(device)` after `unload_model` is within 50 MB of its value before `load_model` was called
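
A sketch of the lifecycle hooks this requirement measures; the class name is illustrative:

```python
import gc

import torch
import torchaudio


class ForcedAligner:
    """Illustrative stage wrapper that owns the MMS-FA model lifecycle."""

    def __init__(self) -> None:
        self.model = None

    def load_model(self, device: str = "cuda") -> None:
        self.model = torchaudio.pipelines.MMS_FA.get_model().to(device)

    def unload_model(self) -> None:
        # Drop the last reference and collect so memory_allocated() falls back
        # to its pre-load baseline; empty_cache() then hands the freed blocks
        # back to the driver for other stages and processes.
        self.model = None
        gc.collect()
        torch.cuda.empty_cache()
```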

### Requirement: Forced alignment output includes provenance

The stage SHALL emit an `AlignmentResult` containing the input audio path, a pointer to the transcription JSON that sourced the text, the language code used, the acoustic model identifier, the device, the number of chunks processed, any `fallback_reason`, processing time, and timestamps.

#### Scenario: Serializing an AlignmentResult

- **WHEN** the stage completes alignment and serializes its result to JSON
- **THEN** the JSON contains `input_file`, `transcription_file`, `language`, `acoustic_model`, `device`, `chunks_processed`, `processing_time_seconds`, `started_at`, `completed_at`, and `aligned_words`
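
The field list above maps directly onto a Pydantic model; a sketch with assumed types where the requirement doesn't state them:

```python
from datetime import datetime

from pydantic import BaseModel


class AlignedWord(BaseModel):
    word: str
    start: float  # seconds, global timeline
    end: float
    score: float  # confidence in [0.0, 1.0]
    segment_index: int  # see the segment-linking requirement below


class AlignmentResult(BaseModel):
    input_file: str
    transcription_file: str
    language: str
    acoustic_model: str  # e.g. "torchaudio.pipelines.MMS_FA"
    device: str
    chunks_processed: int
    fallback_reason: str | None = None
    processing_time_seconds: float
    started_at: datetime
    completed_at: datetime
    aligned_words: list[AlignedWord]
```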

### Requirement: Aligned words link back to transcript segments

Each aligned word SHALL carry an index identifying which transcript segment it belongs to, so downstream consumers can reconstruct segment-to-word membership without relying on timestamp overlap heuristics.

#### Scenario: Reconstructing segment membership

- **WHEN** a consumer reads an `AlignmentResult` and a `TranscriptionResult` from the same pipeline run
- **THEN** every aligned word's `segment_index` value is a valid index into `TranscriptionResult.segments`, and iterating aligned words grouped by `segment_index` reproduces the word order within each segment
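
A sketch of that reconstruction, reusing the `AlignmentResult` model sketched above; it relies on aligned words being emitted in transcript order:

```python
from itertools import groupby


def words_by_segment(result: AlignmentResult) -> dict[int, list[AlignedWord]]:
    # Words arrive in transcript order, so grouping by segment_index reproduces
    # each segment's internal word order without timestamp-overlap heuristics.
    return {
        index: list(words)
        for index, words in groupby(result.aligned_words, key=lambda w: w.segment_index)
    }
```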