diff --git a/.gitignore b/.gitignore index c64f816..5f970cd 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,6 @@ assets/ .claude/ +_reqs/ # Python .venv/ diff --git a/openspec/changes/universal-alignment-strategy/.openspec.yaml b/openspec/changes/universal-alignment-strategy/.openspec.yaml new file mode 100644 index 0000000..3a54a17 --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-04-16 diff --git a/openspec/changes/universal-alignment-strategy/design.md b/openspec/changes/universal-alignment-strategy/design.md new file mode 100644 index 0000000..15178c3 --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/design.md @@ -0,0 +1,258 @@ +## Context + +Audio Refinery's transcription stage is implemented in `src/transcriber.py` as a thin wrapper around WhisperX. A single `transcribe()` call does three things at once: (1) runs Whisper large-v3 via faster-whisper/ctranslate2 to get rough segments, (2) loads a per-language Wav2Vec2 model via `whisperx.load_align_model()` and runs `whisperx.align()` to add word-level timestamps, and (3) optionally calls `whisperx.assign_word_speakers()` to merge pyannote diarization labels onto words. The pipeline pre-loads the WhisperX model once per run in `pipeline.py::_load_whisperx_model()` and passes it through. + +This design is constrained by WhisperX's release cadence and API. The project is pinned to a specific commit (`741ab9a2a8a1076c171e785363b23c55a91ceff1`) because upstream tags either lacked `device_index` or broke against our pinned `torch==2.1.2+cu121`. Upgrading ASR backends (e.g., to Cohere Transcribe or a newer Whisper variant) means giving up Wav2Vec2 alignment, because alignment is bolted onto WhisperX's internal data shapes. The result: the project cannot benefit from ASR improvements without losing downstream word-timestamp precision that RAG indexing, video editing, and word-level diarization merge depend on. + +Separately, VRAM pressure is already real on 24 GB consumer GPUs. `pipeline.py` currently keeps WhisperX + align model resident for the whole loop, and pyannote is loaded per file. Adding another acoustic model (MMS-FA) on top without a strict lifecycle would OOM. + +**Current state snapshot** (as of 2026-04-15): +- `src/transcriber.py::transcribe()` — does ASR + align + speaker-merge in one function. +- `src/pipeline.py::_load_whisperx_model()` — pre-loads WhisperX model for the batch. +- `src/models/transcription.py::TranscriptSegment.words` — where word-level timestamps live today. +- `src/models/transcription.py::TranscriptionResult.alignment_fallback: bool` — tracks when Wav2Vec2 align fails and raw Whisper timestamps are used as a fallback. +- No `openspec/specs/` entries yet; this is a greenfield spec effort. + +## Goals / Non-Goals + +**Goals:** +- Decouple ASR from word-level alignment so transcription backends become hot-swappable. +- Produce word-level timestamps of equal-or-better precision than the current WhisperX path using `torchaudio.pipelines.MMS_FA`. +- Keep runs OOM-free across all GPU sizes by implementing a hybrid VRAM strategy (co-resident or sequential) with upfront preflight validation. +- Preserve the existing downstream contract: diarization-aware transcripts with per-word speakers and per-segment sentiment still come out of the pipeline. +- Make the aligner language-aware via an explicit `language_code` parameter, so future per-language acoustic models are a config swap, not a code change. 
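To make the decoupling goal concrete, a minimal sketch of the per-file call sequence this design targets; module and function names mirror the Decisions below and are illustrative, not final:

```python
# Illustrative only: concrete signatures are settled in the Decisions section below.
from pathlib import Path

from src import aligner, text_normalizer, transcriber  # merger placement is an Open Question

audio = Path("episode.wav")

asr = transcriber.load_model(device="cuda:0")               # ASR only: no align, no speaker merge
transcription = transcriber.transcribe(audio, asr)          # rough segments + detected language

fa = aligner.load_model(device="cuda:0", language_code=transcription.language)
alignment = aligner.align(audio, transcription,
                          text_normalizer.normalize_for_alignment, fa)  # word-level timestamps
# The speaker merge step (aligner.py or merger.py, see Open Questions) then attaches
# per-word speakers using the existing pyannote diarization output.
```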
+ +**Non-Goals:** +- Shipping a Cohere Transcribe backend in this change. This change lands the decoupling and the torchaudio alignment; Cohere is a follow-up that plugs into the new transcription interface. +- Supporting ASR backends that don't emit rough segment timestamps. If a backend only emits a monolithic block of text, the pipeline falls back to VAD-based chunking — that fallback ships later. +- Changing the Demucs or pyannote stages. Their interfaces stay the same. +- Rewriting sentiment analysis. It already consumes `TranscriptSegment` and will continue to; the segment shape changes minimally (see Decisions). +- Removing WhisperX from the repo in this change. It stays installable as a legacy backend flag for one release cycle. + +## Decisions + +### Decision 1: Split `transcriber.py` into an ASR-only wrapper; add a new `aligner.py` stage + +The existing `transcribe()` function becomes ASR-only. It runs faster-whisper directly (WhisperX's underlying engine) via a thin wrapper and returns rough segments plus a detected language code — no `words`, no align, no speaker merge. Alignment moves to a new `src/aligner.py` module that owns `torchaudio.pipelines.MMS_FA`. Speaker merge moves into either `aligner.py` or a small new `src/merger.py` (see Open Questions). + +**Alternatives considered:** +- *Keep WhisperX as the only backend and wrap MMS-FA inside it* — rejected. The whole point is to decouple. Wrapping MMS-FA inside WhisperX reproduces the coupling we're removing. +- *Call faster-whisper via whisperx's `load_model()` and just skip the align step* — rejected. This leaves us married to WhisperX's model-loading lifecycle and its specific transformers/torch pin. Direct faster-whisper gives us a cleaner break. +- *Make the transcription stage pluggable via a `TranscriptionBackend` abstract class from day one* — deferred. The first backend is faster-whisper. The second backend (Cohere or cloud) is what justifies the abstraction; building it before we know the second backend's shape invites wrong abstractions. This change lays the seam; the interface hardens in the Cohere follow-up. + +### Decision 2: Use `torchaudio.pipelines.MMS_FA` for forced alignment + +MMS-FA ships with torchaudio ≥ 2.1, is multilingual out of the box, and its acoustic model weights are small enough to load/unload cheaply. It exposes the trellis directly, which is what we need for chunked alignment. + +**Alternatives considered:** +- *Keep `whisperx.align()` but call it standalone* — possible, but it pulls in WhisperX's dependency chain (pytorch-lightning, lightning, etc.) and still couples us to their language-model mapping table. Wins nothing architecturally. +- *Use `ctc-forced-aligner` from HuggingFace* — viable alternative, but not yet in torchaudio and requires an extra HF download path. Revisit if MMS-FA quality disappoints on non-English content. +- *Skip forced alignment entirely and trust ASR-reported word timings* — rejected. faster-whisper's `word_timestamps=True` path is noisier than Wav2Vec2 alignment, especially near segment boundaries, and that's exactly what downstream consumers care about. + +### Decision 3: Chunked alignment driven by ASR segment boundaries, capped at 30 s + +Trellis memory is O(audio_frames × tokens). For a 10-minute file, that's a multi-GB matrix; for a 1-hour podcast, it's impossible. The aligner must chunk before running. + +**Chunking strategy:** +- Primary: use the ASR stage's rough segment boundaries. 
For each segment (or contiguous group of segments), pack until the audio window reaches ~25 s, then emit a chunk. Cap hard at 30 s. +- Fallback (when a single ASR segment exceeds 30 s — rare, happens on music or silence runs): split on the nearest silence using a lightweight energy-based VAD, or fail the chunk with a clear error and mark it in `AlignmentResult.fallback_reason`. +- Each chunk is aligned independently; results are offset back into the global timeline using the chunk's start time. + +**Alternatives considered:** +- *Fixed 30 s stride, ignoring ASR boundaries* — simpler, but splits mid-word and produces worse alignment at chunk edges. +- *VAD-driven chunking independent of ASR* — adds a second model to the pipeline (silero-vad or similar) just for chunking. Unnecessary when ASR already gives us rough boundaries for free. +- *Stream alignment with a sliding window* — premature complexity. Chunked alignment is good enough for batch processing, which is this tool's core use case. + +### Decision 4: Hybrid VRAM strategy — co-resident by default, sequential by flag or auto-fallback + +Each stage module exposes: + +``` +load_model(device, ...) -> ModelHandle +unload_model(handle) -> None # calls del, gc.collect(), torch.cuda.empty_cache() +``` + +Two VRAM strategies are available via `--vram-strategy {co-resident, sequential}`: + +**Co-resident (default):** All in-process GPU models (pyannote, faster-whisper, MMS-FA) are loaded once before the batch loop and stay resident for the entire run, matching the current pipeline's pre-load pattern. This is the fastest option because model load time is paid once. Demucs already runs as a subprocess (separate process, separate VRAM), and sentiment runs on CPU, so neither competes for in-process GPU memory. + +Typical co-resident VRAM budget: +``` + pyannote diarization ~1,500 MiB + faster-whisper large-v3 ~5,800 MiB + MMS-FA alignment ~400 MiB + ──────────────────────────────────────── + Total co-resident ~7,700 MiB (fits comfortably on 24 GB) +``` + +**Sequential:** Each stage loads its model, processes all files (or the current file), unloads the model, then the next stage loads. Peak VRAM equals the single largest stage. Required when co-resident doesn't fit — e.g., when a future heavyweight stage (emotion/SER, CLAP, or a large cloud-class ASR model) is enabled alongside existing stages. + +Per-file sequential overhead: ~10.5 s of model loading per file (pyannote ~3 s + faster-whisper ~6 s + MMS-FA ~1.5 s). For a 10-file batch, this adds ~105 s vs. ~11 s for co-resident. **In sequential mode, default to stage-batched execution** (all files through one stage before the next) to amortize load cost across files. Per-file interleaved mode stays available behind a flag for small runs. + +**Auto-fallback:** If the user doesn't specify a strategy, the pipeline runs the VRAM preflight check (see Decision 7). If co-resident fits, it's used. If it doesn't, the pipeline automatically falls back to sequential and emits a warning explaining why. + +**Alternatives considered:** +- *Strict sequential only (original proposal)* — rejected after analysis. MMS-FA is only ~400 MiB; co-resident pyannote + faster-whisper + MMS-FA totals ~7.7 GB, well within 24 GB consumer cards. Forcing sequential load/unload on every run adds ~10 s of dead time per file for no VRAM benefit on most hardware. The strict approach was designed for a hypothetical future where all models are large; the hybrid defers that cost until it's actually needed. 
+- *Co-resident only (no sequential option)* — rejected. Users on smaller GPUs (12 GB), users adding future heavyweight stages, and cloud deployments with variable instance types all need the escape hatch. Different transcription models also vary widely in VRAM (small ~1.2 GB vs. large-v3 ~5.8 GB), making a one-size-fits-all assumption untenable. +- *Swap models via CPU offload (`accelerate`-style)* — adds a heavy dependency and doesn't actually free VRAM for the next model; it just moves weights to CPU RAM while the previous model's allocator fragments persist. + +### Decision 5: New `AlignmentResult` model; `TranscriptionResult` loses `words` + +```python +# src/models/transcription.py (changed) +class TranscriptSegment(BaseModel): + text: str + start: float + end: float + # words: REMOVED — now lives on AlignmentResult + speaker: str | None = None + sentiment: SegmentSentiment | None = None + +# src/models/alignment.py (new) +class AlignedWord(BaseModel): + word: str + start: float + end: float + score: float + speaker: str | None = None # populated by the merge step + +class AlignmentResult(BaseModel): + input_file: Path + transcription_file: Path # provenance pointer back to the ASR output + language: str + aligned_words: list[AlignedWord] + segment_index: list[int] # aligned_words[i] belongs to TranscriptSegment[segment_index[i]] + acoustic_model: str # e.g. "torchaudio.pipelines.MMS_FA" + device: str + chunks_processed: int + fallback_reason: str | None = None # filled when a chunk fell back or failed + processing_time_seconds: float + started_at: datetime + completed_at: datetime +``` + +`segment_index` is the explicit back-pointer that lets downstream consumers reconstruct "which words belong to which segment" without relying on timestamp overlap math. Sentiment analysis and the RAG merge step both need this. + +**Alternatives considered:** +- *Inline aligned words back into `TranscriptSegment.words` after alignment* — easier migration (downstream code doesn't change), but hides the decoupling in the data model and re-couples segments to word-level data. Rejected because it defeats the point. +- *Ditch segments entirely and make aligned words the only output* — too aggressive. Segments are useful for RAG chunking and captioning; keeping them as the coarse unit and treating aligned words as the fine unit matches how consumers think. + +### Decision 6: Text normalization is a small, deterministic module — no ML + +`src/text_normalizer.py` exposes one function: `normalize_for_alignment(text: str, language: str) -> str`. It expands numbers (`100` → `"one hundred"`), currency (`$100` → `"one hundred dollars"`), lowercases, strips punctuation and symbols except intra-word apostrophes, and collapses whitespace. Use the `num2words` library (pure Python, multilingual) for number expansion. No tokenizer, no model. + +**Alternatives considered:** +- *Nemo/ESPnet text normalizers* — massive overkill for the symbol set that actually breaks MMS-FA. +- *Let the aligner's tokenizer handle everything* — it doesn't. MMS-FA crashes or produces garbage on `$`, `%`, raw digits, emoji, etc. Cleaning upstream is the only reliable path. +- *Run normalization inside the aligner module* — rejected for testability. A pure text-in/text-out module is trivially unit-testable; bundling it inside the aligner forces the aligner's fixtures to load a torchaudio bundle just to test string transforms. 
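To ground Decision 6, a minimal sketch of `normalize_for_alignment` under the assumptions above: `num2words` for expansion, an English-only currency word, and no k/M/B suffix handling or per-token failure fallback (both covered under Risks):

```python
# Sketch only; the shipped version adds the k/M/B currency pre-regex, more currencies,
# and the graceful per-token fallback described in the Risks section.
import re
from num2words import num2words

_CURRENCY = re.compile(r"\$(\d+(?:\.\d+)?)")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")

def _spell(token: str, language: str) -> str:
    value = float(token) if "." in token else int(token)
    return num2words(value, lang=language)

def normalize_for_alignment(text: str, language: str) -> str:
    # "$100" -> "one hundred dollars" (currency word is English-only in this sketch)
    text = _CURRENCY.sub(lambda m: f"{_spell(m.group(1), language)} dollars", text)
    # bare numbers: "3 apples" -> "three apples"
    text = _NUMBER.sub(lambda m: _spell(m.group(0), language), text)
    text = text.lower()
    # keep letters, whitespace, and intra-word apostrophes; drop everything else
    text = re.sub(r"[^\w\s']", " ", text).replace("_", " ")
    return " ".join(text.split())
```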
+ +### Decision 7: VRAM budget registry (`model_budgets.toml`) and preflight validation + +A new `src/model_budgets.toml` file maps model identifiers to their expected VRAM footprint (weights + typical peak activations, in MiB). This is a separate file from `gpu_tflops.toml` — one maps GPU names to performance, the other maps model names to VRAM cost. + +```toml +# src/model_budgets.toml +[vram_mib] +"pyannote/speaker-diarization-3.1" = 1500 +"faster-whisper/large-v3" = 5800 +"faster-whisper/distil-large-v3" = 3500 +"faster-whisper/medium" = 2900 +"faster-whisper/small" = 1200 +"torchaudio/MMS-FA" = 400 +``` + +A new `src/vram_preflight.py` module exposes `validate_vram_budget()`: + +1. Queries GPU(s) via existing `gpu_utils.query_gpu_info()` to get total VRAM. +2. Queries active processes via `gpu_utils.query_compute_processes()` to determine already-used VRAM. +3. Subtracts a configurable headroom (default 512 MiB) for CUDA context and fragmentation. +4. Looks up the VRAM cost of each enabled stage's model from `model_budgets.toml`. +5. For co-resident mode: sums all enabled stages. For sequential mode: takes the max. +6. Compares projected budget against available VRAM. +7. Returns a structured `VramBudget` result with per-model breakdowns, the verdict (fits / doesn't fit), and actionable suggestions if it fails (switch strategy, use smaller model, free GPU memory, switch GPU). + +For multi-GPU parallel runs, the preflight checks the **smallest** GPU across all workers, since workers run identical configurations. This keeps the validation simple and prevents the weakest link from OOM-ing mid-batch. + +**Alternatives considered:** +- *Extend `gpu_tflops.toml` with VRAM data* — rejected. Separate concerns: TFLOPS drives GPU ranking, VRAM budget drives pipeline planning. Mixing them makes both files harder to maintain. +- *Query VRAM at runtime via `torch.cuda.mem_get_info()`* — used as a supplemental check, but not the primary source. The TOML registry gives budget estimates *before* loading anything, which is the whole point of preflight. Runtime queries confirm the estimate after the first stage loads. +- *Skip preflight and just catch OOM at runtime* — rejected. An OOM 20 minutes into a 50-file batch is a terrible user experience. Catching it before any work begins is the correct UX. + +### Decision 8: Interactive run planner (`audio-refinery plan`) + +A new `plan` command provides an interactive CLI interface for configuring, validating, and optionally saving pipeline runs. This addresses the growing complexity of flag combinations as stages are added (separation, diarization, transcription, alignment, sentiment, and future emotion/events stages). + +**The planner workflow:** +1. Discovers source files and queries available GPUs. +2. Renders a Rich panel showing: file count and total audio duration, GPU specs, enabled pipeline stages with model choices, a VRAM budget bar chart comparing projected usage against available capacity, and an estimated run time based on historical RTF data. +3. Lets the user toggle stages, switch models, and change VRAM strategy. Each change immediately re-renders the budget and estimate. +4. On confirmation, either launches the pipeline directly or saves the configuration as a **run profile** — a small TOML file that `pipeline --plan profile.toml` replays non-interactively. 
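Both the preflight (Decision 7) and the planner's live re-render in step 3 share the same budget arithmetic. A minimal sketch, assuming the `model_budgets.toml` entries above; the `VramBudget` field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class VramBudget:                       # field names illustrative, not final
    per_model_mib: dict[str, int]
    projected_mib: int
    available_mib: int
    fits: bool

def project_budget(models: dict[str, int], strategy: str,
                   total_mib: int, used_mib: int, headroom_mib: int = 512) -> VramBudget:
    # co-resident keeps every enabled model resident; sequential only the single largest
    projected = sum(models.values()) if strategy == "co-resident" else max(models.values())
    available = total_mib - used_mib - headroom_mib
    return VramBudget(models, projected, available, projected <= available)

# e.g. the Decision 4 budget on an idle 24 GB card:
# project_budget({"pyannote": 1500, "large-v3": 5800, "mms-fa": 400},
#                "co-resident", total_mib=24_576, used_mib=0).fits  -> True
```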
+ +**Run profile format:** +```toml +# saved by `audio-refinery plan --save my-batch.toml` +[source] +dir = "/audio/test/extracted" + +[pipeline] +stages = ["separate", "diarize", "transcribe", "align", "sentiment"] +vram_strategy = "co-resident" +mode = "stage-batched" + +[transcription] +model = "large-v3" +compute_type = "float16" +batch_size = 16 +language = "en" + +[alignment] +backend = "mms-fa" + +[devices] +primary = "cuda:0" +``` + +**Why an interactive planner instead of more flags:** +- Flag combinatorics grow multiplicatively. Today with 4 configurable stages × 3 models × 2 VRAM strategies = ~24 meaningful configs. With emotion + CLAP: 6 stages × 3 models × 2 strategies = ~384. No one reasons about this from `--help`. +- The planner shows *interactions* between choices. Picking `--model large-v3` + `--emotion` might OOM on a 12 GB card; the planner visualizes this immediately and suggests alternatives. +- Run profiles make scheduled/CI runs reproducible without memorizing flags. + +**Implementation notes:** +- Built on Rich (already in the stack) — panels, tables, bars. No new dependency for basic interaction (`rich.prompt.Prompt` for simple choices). +- The `plan` command is a standalone Click subcommand alongside `pipeline`, `transcribe`, etc. +- `pipeline --plan ` is syntactic sugar that reads the profile TOML and maps it to the existing pipeline kwargs. No separate execution path. + +**Alternatives considered:** +- *YAML/JSON config file only (no interactive mode)* — misses the core value: immediate feedback on VRAM budget as you adjust choices. A config file without a planner is just another thing to edit blindly. +- *TUI framework (textual, urwid)* — too heavy for what's needed. Rich prompts + panels + re-renders are sufficient. Revisit if the planner grows into a full dashboard (unlikely for a batch processing tool). +- *Web-based config UI* — wrong modality for a CLI tool that runs on GPU servers often accessed via SSH. + +## Risks / Trade-offs + +- **[MMS-FA quality regression on English vs. whisperx's English Wav2Vec2]** → Mitigation: ship an integration test that runs both backends on a known-good clip and compares word boundary deltas. If regression is material, keep WhisperX behind a `--aligner whisperx` flag for one release cycle. +- **[Sequential mode load overhead slows batch runs]** → Mitigation: co-resident is the default. Sequential mode defaults to stage-batched execution to amortize load cost. The preflight check warns users about the time impact before the run starts. Measure wall-clock on a representative batch in both modes before merging and document the trade-off in `docs/DEVELOPMENT.md`. +- **[Chunk boundaries cut words when an ASR segment's audio exceeds 30 s and no silence is present]** → Mitigation: prefer silence-based fallback; if no silence exists in a 30 s window (e.g., continuous music under speech), emit a clear error in `fallback_reason`, use raw ASR word timings for that chunk, and log a warning. Rare in the target domain (podcasts, interviews, dialog). +- **[num2words produces odd output for contextual currencies]** (e.g., "$1.5M" → unclear expansion) → Mitigation: pre-regex common cases (`\$(\d+(?:\.\d+)?)([kKmMbB])?`) into explicit text before calling num2words, and fall back to stripping the symbol if expansion fails. Log failures for corpus analysis. 
+- **[Breaking change to `TranscriptionResult.segments[].words`]** → Mitigation: bump the minor version (0.2.0), add a migration note to `CHANGELOG.md`, and provide a compat helper `load_transcription_with_words(path) -> list[TranscriptSegment]` that reads both the transcription and alignment JSON and re-stitches words onto segments for callers that want the old shape. +- **[Legacy whisperx backend flag adds a maintenance surface]** → Mitigation: scope it as "one release cycle only." Delete in the next change after this one lands and downstream callers confirm migration. +- **[MMS-FA weights download on first use may fail behind corporate proxies]** → Mitigation: document the HuggingFace cache path in `docs/DEVELOPMENT.md` and support `HF_HOME` / `TRANSFORMERS_CACHE` env vars (already honored by torchaudio). +- **[VRAM budget estimates are approximate]** → Mitigation: `model_budgets.toml` values include typical peak activation overhead (not just weight size), measured empirically at default batch sizes. The 512 MiB headroom margin absorbs minor variance. For edge cases, the planner shows both the projected budget and actual available VRAM, so users can judge the margin themselves. Runtime `torch.cuda.mem_get_info()` provides a secondary safety net. +- **[Interactive planner assumes a TTY]** → Mitigation: `audio-refinery plan` detects non-interactive mode (`not sys.stdin.isatty()`) and exits with a message pointing to `--plan profile.toml` for headless use. The preflight validation also runs automatically when `pipeline` starts, so the planner is not required for safety — it's an ergonomic layer on top. + +## Migration Plan + +1. **Land the new modules and models in parallel with existing code** — `aligner.py`, `text_normalizer.py`, `models/alignment.py`, and the refactored `transcriber.py` ship alongside the old code paths. The old `transcriber.transcribe()` continues to work (marked deprecated). +2. **Add new CLI subcommands** — `audio-refinery align` runs the aligner standalone against a transcription JSON. `audio-refinery plan` provides interactive run configuration and validation. `audio-refinery pipeline` gains `--aligner {mms-fa,whisperx,none}`, `--vram-strategy {co-resident,sequential}`, and `--plan ` flags; default aligner stays `whisperx` for one release to avoid surprising users mid-upgrade. +3. **Flip the default** — in the follow-up 0.3.0 release, default `--aligner` becomes `mms-fa` and `whisperx` prints a deprecation warning. +4. **Remove WhisperX** — in the release after that, delete the `whisperx` branch from the pipeline and drop the optional-dependency entry from `pyproject.toml`. + +**Rollback strategy:** Each intermediate release keeps the old path reachable via flag. A user who hits a regression can pin the previous audio-refinery version *or* pass `--aligner whisperx` until the issue is triaged. + +## Open Questions + +1. Should speaker merge live in `aligner.py` or in a new `merger.py` module? Leaning toward `merger.py` because the merge is independent of the alignment acoustic model — it's just "given aligned words and diarization segments, assign speakers." Keeping them separate matches the one-responsibility-per-module convention. +2. ~~Does the stage-batched pipeline mode reuse the existing `pipeline.py` structure or warrant a new `pipeline_staged.py`?~~ **Resolved:** extend `pipeline.py` with a mode flag. Stage-batched is the default for sequential VRAM strategy; per-file interleaved is the default for co-resident. Both paths live in the same module. +3. 
What does the CLI report for alignment metrics? Current transcription report shows words/RTF/VRAM. Alignment will show chunks processed, mean chunk duration, fallback count. Need to confirm the exact fields with the CLI report's existing Rich table layout. +4. Can we ship `faster-whisper` as a direct dependency in `pyproject.toml` under the main `dependencies` (not under the `conflicting` optional), or does it still hit the same torch version conflict that forced whisperx into optional deps? Needs a clean `uv sync` test. +5. Should the `plan` command support editing all pipeline flags (batch_size, compute_type, segment) or only the high-impact ones (stages, model, VRAM strategy, device)? Leaning toward high-impact only to keep the TUI simple; power users use flags directly. +6. How should `model_budgets.toml` handle compute_type variations? `float16` vs. `int8` halves the weight VRAM. Options: (a) separate entries per compute type, (b) a single entry with a multiplier formula, (c) always budget for float16 and treat int8 as extra headroom. Leaning toward (a) for accuracy. diff --git a/openspec/changes/universal-alignment-strategy/proposal.md b/openspec/changes/universal-alignment-strategy/proposal.md new file mode 100644 index 0000000..956f0c7 --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/proposal.md @@ -0,0 +1,38 @@ +## Why + +Audio Refinery is coupled to WhisperX, which bundles transcription and Wav2Vec2 alignment behind a single API. This prevents us from swapping in better ASR backends (Cohere Transcribe, cloud APIs, newer open-source models) without losing the millisecond-level word timestamps that downstream consumers (RAG databases, semantic video editing, diarization merging) depend on. Decoupling the transcription engine from the word-level aligner turns Audio Refinery into a model-agnostic audio processing factory where ASR backends are hot-swappable while alignment quality stays constant. + +## What Changes + +- Add a new **forced alignment** stage built on `torchaudio.pipelines.MMS_FA` (Wav2Vec2 acoustic model) that accepts a sanitized text string plus audio and emits precise word-level start/end times. +- Add a new **text normalization** step that sanitizes ASR output before alignment: expand numbers and currency ("$100" → "one hundred dollars"), strip punctuation and symbols that crash phoneme models, and lowercase. +- Refactor the **transcription** stage into an ASR-only, model-agnostic interface. The stage produces rough segments `[{start, end, text}]` and a detected language code; it no longer emits word-level timestamps directly. +- Introduce a **chunked alignment** strategy. Because trellis pathfinding memory scales quadratically with audio length, the aligner must chunk audio to ≤30 s windows based on ASR segment boundaries (or VAD) before running MMS-FA. +- Introduce a **hybrid model lifecycle** with two VRAM strategies: **co-resident** (all in-process models loaded simultaneously — fastest, requires sufficient VRAM) and **sequential** (load/unload each stage one at a time — fits any GPU). Each stage exposes `load_model()` / `unload_model()` with `gc.collect()` + `torch.cuda.empty_cache()`. Default strategy is co-resident; the system automatically falls back to sequential if the projected VRAM budget exceeds available capacity. +- Add a **VRAM budget preflight** that runs before the pipeline starts: queries available GPU memory, looks up projected VRAM per model from a new `model_budgets.toml` registry, and validates that the selected strategy fits. 
Blocks the run with an actionable error if it doesn't, suggesting alternatives (smaller model, different strategy, free GPU memory). +- Add an **interactive run planner** via a new `audio-refinery plan` CLI command. Discovers files, queries GPUs, renders the pipeline configuration with a VRAM budget bar chart, and lets the user adjust stages/models/strategy before committing. Optionally saves validated configurations as reusable **run profiles** (TOML) for non-interactive replay via `pipeline --plan profile.toml`. +- Plumb the ASR-detected `language_code` through to the aligner so future language-specific acoustic models can be swapped in dynamically. +- **BREAKING**: `TranscriptionResult` and the transcription stage contract change shape — word-level timestamps move out of transcription output into a new `AlignmentResult` produced by the forced-alignment stage. Downstream consumers (diarization merge, sentiment, CLI reports) read from the alignment output instead. +- **BREAKING**: WhisperX is removed from the critical path. The `transcriber` module either wraps faster-whisper directly (ASR-only) or exposes a pluggable backend interface; the `whisperx.align` call is deleted. + +## Capabilities + +### New Capabilities + +- `forced-alignment`: Chunked word-level forced alignment of arbitrary text against audio using torchaudio MMS-FA (Wav2Vec2). Accepts a language code, produces per-word start/end/score arrays, and owns its own model load/unload lifecycle. +- `text-normalization`: Deterministic text sanitization that prepares ASR output for phoneme-based alignment (number/currency expansion, symbol stripping, casing). +- `transcription`: Model-agnostic ASR stage that emits rough segments and a detected language code. Establishes the contract that lets transcription backends be swapped without touching alignment. +- `model-lifecycle`: Hybrid VRAM strategy (co-resident vs. sequential) with per-stage `load_model`/`unload_model` hooks, a `model_budgets.toml` VRAM registry, an upfront preflight validation that blocks runs projected to OOM, and an interactive `audio-refinery plan` command for configuring and validating pipeline runs before execution. + +### Modified Capabilities + + + +## Impact + +- **Code**: `src/transcriber.py` (refactored into ASR-only wrapper), `src/pipeline.py` (new stage order + hybrid VRAM lifecycle), `src/cli.py` (new `align` and `plan` subcommands, updated `transcribe` and `pipeline` commands), `src/models/transcription.py` (shape change), new `src/aligner.py`, `src/text_normalizer.py`, `src/vram_preflight.py` modules, new `src/models/alignment.py` Pydantic model, new `src/model_budgets.toml` registry. +- **APIs**: `TranscriptionResult.segments[].words` removed; new `AlignmentResult` model with `aligned_words: list[AlignedWord]` becomes the source of truth for word timing. Sentiment and diarization merging read from the alignment output. New `VramBudget` dataclass and `validate_vram_budget()` function for preflight checks. +- **Dependencies**: `torchaudio>=2.1.2` (already pinned via torch 2.1.2) gains the MMS-FA bundle; HuggingFace model weights for `torchaudio.pipelines.MMS_FA` download on first use. WhisperX drops from the critical path but stays available as an optional legacy backend behind a feature flag for one release cycle to ease migration. `faster-whisper` becomes a direct dependency (previously pulled transitively by whisperx). 
+- **Tests**: New unit tests for `text_normalizer`, `aligner` (with mocked torchaudio bundle), VRAM preflight validation, and pipeline reordering; new integration test that runs the full decoupled pipeline end-to-end on a short clip with GPU marker. +- **Performance**: Co-resident mode preserves current batch throughput (models loaded once). Sequential mode trades load/unload overhead for reduced VRAM footprint. Preflight validation prevents OOM failures before work begins. +- **Docs**: `README.md`, `CLAUDE.md`, and `docs/DEVELOPMENT.md` updated to describe the new stage order, model-agnostic transcription contract, VRAM strategies, and the `plan` command. diff --git a/openspec/changes/universal-alignment-strategy/specs/forced-alignment/spec.md b/openspec/changes/universal-alignment-strategy/specs/forced-alignment/spec.md new file mode 100644 index 0000000..6e4e71b --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/specs/forced-alignment/spec.md @@ -0,0 +1,61 @@ +## ADDED Requirements + +### Requirement: Forced alignment stage produces word-level timestamps from text and audio + +The system SHALL provide a forced-alignment stage that accepts a sanitized text string, an audio waveform, and a language code, and produces a list of aligned words with start time, end time, and a confidence score. The stage SHALL use `torchaudio.pipelines.MMS_FA` as its default acoustic model. + +#### Scenario: Aligning a short English utterance + +- **WHEN** the stage is called with a ~5 second audio clip, the sanitized text "hello world", and language code "en" +- **THEN** it returns two aligned words whose start/end times fall within the clip bounds, whose ordering matches the text, and whose scores are floats in `[0.0, 1.0]` + +#### Scenario: Aligning a non-English utterance + +- **WHEN** the stage is called with audio, sanitized text, and a non-English language code that MMS-FA supports +- **THEN** it loads the MMS-FA bundle once, produces aligned words for the input, and does not raise + +#### Scenario: Aligner receives an unsupported language code + +- **WHEN** the stage is called with a language code MMS-FA does not support +- **THEN** it raises a clear error identifying the unsupported language, and it does not silently fall back to a different language model + +### Requirement: Forced alignment chunks audio before running the trellis + +The system SHALL chunk input audio into windows of at most 30 seconds before running the forced-alignment trellis, using upstream ASR segment boundaries as the primary chunk delimiter. The system SHALL merge per-chunk results into a single global timeline using each chunk's start offset. + +#### Scenario: Ten-minute input with multiple ASR segments + +- **WHEN** the stage is called with a 10-minute audio file and 40 ASR segments +- **THEN** it processes the file as multiple chunks each ≤30 seconds long, each chunk's word timings are offset back into the global timeline, and the final output covers the full duration with monotonically increasing start times + +#### Scenario: Single ASR segment exceeds the chunk cap + +- **WHEN** an ASR segment's audio window is longer than 30 seconds +- **THEN** the stage splits the segment on a silence-based fallback (or reports a clear `fallback_reason` if no silence is found) and continues processing the rest of the file + +### Requirement: Forced alignment stage owns its model load and unload lifecycle + +The forced-alignment stage SHALL expose explicit `load_model` and `unload_model` entry points. 
After `unload_model` returns, the MMS-FA acoustic model SHALL no longer hold GPU memory, as observed by `torch.cuda.memory_allocated` returning to its pre-load baseline on the target device. + +#### Scenario: Unloading releases VRAM + +- **WHEN** the stage loads the MMS-FA model, runs one alignment, and calls `unload_model` +- **THEN** `torch.cuda.memory_allocated(device)` after `unload_model` is within 50 MB of its value before `load_model` was called + +### Requirement: Forced alignment output includes provenance + +The stage SHALL emit an `AlignmentResult` containing the input audio path, a pointer to the transcription JSON that sourced the text, the language code used, the acoustic model identifier, the device, the number of chunks processed, any `fallback_reason`, processing time, and timestamps. + +#### Scenario: Serializing an AlignmentResult + +- **WHEN** the stage completes alignment and serializes its result to JSON +- **THEN** the JSON contains `input_file`, `transcription_file`, `language`, `acoustic_model`, `device`, `chunks_processed`, `processing_time_seconds`, `started_at`, `completed_at`, and `aligned_words` + +### Requirement: Aligned words link back to transcript segments + +Each aligned word SHALL carry an index identifying which transcript segment it belongs to, so downstream consumers can reconstruct segment-to-word membership without relying on timestamp overlap heuristics. + +#### Scenario: Reconstructing segment membership + +- **WHEN** a consumer reads an `AlignmentResult` and a `TranscriptionResult` from the same pipeline run +- **THEN** every aligned word's `segment_index` value is a valid index into `TranscriptionResult.segments`, and iterating aligned words grouped by `segment_index` reproduces the word order within each segment diff --git a/openspec/changes/universal-alignment-strategy/specs/model-lifecycle/spec.md b/openspec/changes/universal-alignment-strategy/specs/model-lifecycle/spec.md new file mode 100644 index 0000000..5ae3159 --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/specs/model-lifecycle/spec.md @@ -0,0 +1,140 @@ +## ADDED Requirements + +### Requirement: Every GPU-resident stage exposes load and unload entry points + +Each pipeline stage that loads a model onto a GPU (separator, diarizer, transcriber, aligner, sentiment analyzer) SHALL expose explicit `load_model` and `unload_model` entry points. The stage module SHALL NOT rely on Python garbage collection, process exit, or implicit cleanup to free GPU memory. + +#### Scenario: Each stage module exports the lifecycle functions + +- **WHEN** inspecting each stage module's public API +- **THEN** every stage that touches a GPU model has `load_model` and `unload_model` callable entry points, and their signatures are documented in the module docstring + +### Requirement: Unload releases VRAM deterministically + +The system SHALL ensure that calling `unload_model` on any stage releases the stage's GPU memory before returning. Implementations SHALL invoke `gc.collect()` and `torch.cuda.empty_cache()` as part of unload. After unload, `torch.cuda.memory_allocated(device)` SHALL be within 50 MB of the pre-load baseline on the stage's target device. 
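A non-normative sketch of the unload pattern this requirement implies; the handle type is whatever the stage's `load_model` returned, and the caller must drop its own references as well:

```python
import gc
import torch

def unload_model(handle) -> None:
    # Drop the last strong reference to the model's tensors, force collection,
    # then release cached CUDA blocks back to the driver.
    del handle
    gc.collect()
    torch.cuda.empty_cache()
```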
+ +#### Scenario: Unload releases VRAM for each stage + +- **WHEN** each GPU-resident stage is loaded, runs one unit of work, and is unloaded in turn +- **THEN** `torch.cuda.memory_allocated(device)` after each stage's `unload_model` is within 50 MB of the pre-load baseline + +### Requirement: Pipeline supports co-resident and sequential VRAM strategies + +The pipeline SHALL support two VRAM strategies selectable via `--vram-strategy`: + +- **co-resident**: All in-process GPU models (pyannote, ASR, aligner) are loaded before the file loop and remain resident for the entire batch. This is the fastest mode because model load time is paid once. +- **sequential**: Each stage loads its model, processes its workload, and unloads before the next stage loads. Peak VRAM equals the single largest stage. + +The system SHALL default to co-resident when the VRAM preflight check projects that all models fit. The system SHALL automatically fall back to sequential and emit a warning when co-resident is projected to exceed available VRAM. + +#### Scenario: Co-resident mode on a GPU with sufficient VRAM + +- **WHEN** the pipeline runs with `--vram-strategy co-resident` on a GPU where the projected co-resident budget fits within available VRAM +- **THEN** all in-process GPU models are loaded once before the file loop begins, no `unload_model` calls occur between stages during the loop, and total wall-clock model loading time equals the sum of each model's single load time + +#### Scenario: Sequential mode on a constrained GPU + +- **WHEN** the pipeline runs with `--vram-strategy sequential` on any GPU +- **THEN** at most one heavyweight GPU model is resident at any time, and `torch.cuda.max_memory_allocated` during any stage reflects only that stage's model plus working memory + +#### Scenario: Automatic fallback from co-resident to sequential + +- **WHEN** the user does not specify `--vram-strategy` and the VRAM preflight projects that co-resident mode exceeds available VRAM +- **THEN** the pipeline falls back to sequential mode and emits a warning explaining the projected budget, available VRAM, and the fallback decision + +### Requirement: VRAM budget registry provides per-model cost estimates + +The system SHALL maintain a `model_budgets.toml` file in `src/` that maps model identifiers to their projected VRAM cost in MiB (weights plus typical peak activation overhead). The registry SHALL be user-editable and SHALL be the single source of truth for VRAM preflight calculations. + +#### Scenario: Looking up a known model + +- **WHEN** the preflight system looks up `"faster-whisper/large-v3"` in `model_budgets.toml` +- **THEN** it returns a VRAM estimate in MiB that reflects the model's weight size plus typical peak activation overhead at default batch size + +#### Scenario: Looking up an unknown model + +- **WHEN** a model identifier is not found in `model_budgets.toml` +- **THEN** the preflight system logs a warning identifying the unknown model, uses a conservative fallback estimate, and does not crash + +### Requirement: VRAM preflight validates the pipeline configuration before processing + +The pipeline SHALL run a VRAM budget preflight check before any models are loaded or files are processed. The preflight SHALL: + +1. Query each target GPU's total VRAM and currently used VRAM via `gpu_utils`. +2. Subtract a configurable headroom margin (default 512 MiB) for CUDA context and fragmentation. +3. Look up the projected VRAM cost of each enabled stage's model from `model_budgets.toml`. +4. 
Compute the projected budget: sum of all models for co-resident, max of any single model for sequential. +5. Compare projected budget against available VRAM. + +If the budget exceeds available VRAM, the preflight SHALL block the pipeline with an actionable error message that includes: per-model VRAM breakdown, total projected budget, available VRAM on the target GPU(s), and specific suggestions (switch VRAM strategy, use a smaller transcription model, free GPU memory, or switch to a different GPU). + +#### Scenario: Preflight passes for co-resident mode + +- **WHEN** the pipeline starts with co-resident mode and the sum of all model budgets plus headroom is less than available GPU VRAM +- **THEN** the preflight passes, logs the projected budget and available VRAM, and the pipeline proceeds to load models + +#### Scenario: Preflight blocks an over-budget co-resident run + +- **WHEN** the pipeline starts with co-resident mode and the sum of all model budgets plus headroom exceeds available GPU VRAM +- **THEN** the pipeline does not load any models, does not process any files, and prints an error that includes: each model's VRAM cost, the total projected budget, the available VRAM, and at least two actionable suggestions + +#### Scenario: Preflight checks the smallest GPU in multi-GPU parallel runs + +- **WHEN** the pipeline runs in parallel mode across multiple GPUs +- **THEN** the preflight validates the budget against the GPU with the smallest available VRAM, since all workers run identical configurations + +### Requirement: Pipeline reports peak VRAM per stage + +The pipeline's per-stage result SHALL record the peak VRAM observed during that stage's execution in bytes, measured via `torch.cuda.max_memory_allocated` after resetting the peak counter at the start of the stage. Reports and notifications SHALL surface this value. + +#### Scenario: Recording and reporting peak VRAM + +- **WHEN** the pipeline completes a stage on a CUDA device +- **THEN** the stage's result entry contains a non-null `peak_vram_bytes` field, and the CLI/Slack report shows a human-readable value (e.g., "6.3 GB") for that stage + +### Requirement: Interactive run planner validates and configures pipeline runs + +The system SHALL provide an `audio-refinery plan` CLI command that interactively presents the pipeline configuration, validates it against available hardware, and allows the user to adjust settings before committing to a run. + +The planner SHALL display: discovered file count and total audio duration, GPU specs and available VRAM, enabled pipeline stages with their model choices, a VRAM budget visualization comparing projected usage against available capacity, and an estimated run time. + +The planner SHALL allow the user to toggle stages on/off, switch transcription models, change VRAM strategy, and select target device(s). Each change SHALL immediately re-render the budget and time estimate. 
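A non-normative sketch of the adjust-and-re-render loop using only the Rich primitives the design already names; everything here is illustrative, and the real panel shows files, GPU, stages, budget bars, and the time estimate:

```python
from rich.console import Console
from rich.panel import Panel
from rich.prompt import Prompt

console = Console()
strategy = "co-resident"

while True:
    console.clear()
    console.print(Panel(f"strategy: {strategy}\n(file count, GPU, stages, VRAM bars, ETA render here)",
                        title="Run plan"))
    action = Prompt.ask("Adjust", choices=["strategy", "run", "quit"], default="run")
    if action != "strategy":
        break
    strategy = "sequential" if strategy == "co-resident" else "co-resident"
```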
+ +#### Scenario: Planning a run on a single GPU + +- **WHEN** the user runs `audio-refinery plan --source-dir /audio/test` +- **THEN** the planner displays the file count, GPU info, enabled stages with models, a VRAM budget visualization, and run time estimate, and waits for user input before proceeding + +#### Scenario: Adjusting the transcription model in the planner + +- **WHEN** the user selects a different transcription model (e.g., switching from large-v3 to medium) during the planning session +- **THEN** the VRAM budget visualization and run time estimate update immediately to reflect the new model's resource requirements + +#### Scenario: Planner detects a VRAM budget problem + +- **WHEN** the user's selected configuration exceeds available VRAM in co-resident mode +- **THEN** the planner highlights the budget overflow, suggests specific alternatives (switch to sequential, use a smaller model, disable a stage), and does not allow the run to proceed until the configuration fits or the user explicitly switches to sequential mode + +#### Scenario: Non-interactive environment + +- **WHEN** `audio-refinery plan` is invoked without a TTY (e.g., in a CI pipeline) +- **THEN** the command exits with a clear message directing the user to use `pipeline --plan profile.toml` for headless execution + +### Requirement: Run profiles save and replay validated pipeline configurations + +The system SHALL support saving a validated pipeline configuration from the planner as a **run profile** (TOML file). The `pipeline` command SHALL accept a `--plan ` flag that loads and replays the saved configuration without requiring interactive input. + +#### Scenario: Saving a run profile + +- **WHEN** the user confirms a configuration in the planner and chooses to save it +- **THEN** the system writes a TOML file containing all pipeline settings (source dir, stages, models, VRAM strategy, devices, language, compute type, batch size) and the file is valid input for `pipeline --plan` + +#### Scenario: Replaying a run profile + +- **WHEN** the user runs `audio-refinery pipeline --plan my-batch.toml` +- **THEN** the pipeline loads all settings from the TOML, runs the VRAM preflight against current GPU state, and proceeds if the budget fits + +#### Scenario: Replaying a profile on a different GPU + +- **WHEN** a run profile saved on a machine with a 4090 is replayed on a machine with a 3060 +- **THEN** the VRAM preflight re-validates against the current GPU's available VRAM and blocks or auto-falls-back to sequential if the saved co-resident configuration no longer fits diff --git a/openspec/changes/universal-alignment-strategy/specs/text-normalization/spec.md b/openspec/changes/universal-alignment-strategy/specs/text-normalization/spec.md new file mode 100644 index 0000000..d8fea52 --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/specs/text-normalization/spec.md @@ -0,0 +1,57 @@ +## ADDED Requirements + +### Requirement: Text normalization prepares ASR output for phoneme-based alignment + +The system SHALL provide a pure-function text normalizer that transforms an ASR text string into a form safe for MMS-FA alignment. The normalizer SHALL not depend on any ML model and SHALL be deterministic for a given `(text, language)` input. 
+ +#### Scenario: Normalizing a simple English sentence + +- **WHEN** the normalizer is called with `"Hello, world!"` and language `"en"` +- **THEN** it returns `"hello world"` (lowercased, punctuation stripped, whitespace collapsed) + +#### Scenario: Determinism across calls + +- **WHEN** the normalizer is called twice with the same input +- **THEN** both calls return byte-identical output + +### Requirement: Numbers and currency are expanded to spoken form + +The normalizer SHALL expand bare numbers, decimals, and currency amounts into their spoken-word equivalents before stripping symbols. Expansion SHALL use `num2words` or an equivalent multilingual library and SHALL honor the supplied language code. + +#### Scenario: Expanding currency + +- **WHEN** the normalizer is called with `"$100"` and language `"en"` +- **THEN** it returns a string containing `"one hundred dollars"` + +#### Scenario: Expanding bare numbers + +- **WHEN** the normalizer is called with `"I have 3 apples"` and language `"en"` +- **THEN** it returns `"i have three apples"` + +#### Scenario: Expanding a number in a different language + +- **WHEN** the normalizer is called with `"Tengo 3 manzanas"` and language `"es"` +- **THEN** it returns `"tengo tres manzanas"` + +### Requirement: Symbols that crash phoneme models are removed + +The normalizer SHALL strip characters that are known to crash or degrade phoneme-based aligners, including `$`, `%`, `#`, `@`, emoji, and other non-letter, non-space, non-apostrophe characters. Intra-word apostrophes (e.g., `"don't"`) SHALL be preserved. + +#### Scenario: Stripping an emoji + +- **WHEN** the normalizer is called with `"great work 🎉"` and language `"en"` +- **THEN** the output contains only alphabetic characters, spaces, and apostrophes, and does not contain the emoji + +#### Scenario: Preserving intra-word apostrophes + +- **WHEN** the normalizer is called with `"don't stop"` and language `"en"` +- **THEN** it returns `"don't stop"` with the apostrophe intact + +### Requirement: Normalizer failures degrade gracefully + +If number or currency expansion fails for a specific token (e.g., an unusual format `num2words` cannot parse), the normalizer SHALL strip the offending symbols and continue rather than raising, and SHALL log a warning identifying the token. + +#### Scenario: Unparseable currency expression + +- **WHEN** the normalizer is called with a currency token that `num2words` cannot expand +- **THEN** the token's symbols are stripped, the remaining text is returned, and a warning is logged identifying the token that could not be expanded diff --git a/openspec/changes/universal-alignment-strategy/specs/transcription/spec.md b/openspec/changes/universal-alignment-strategy/specs/transcription/spec.md new file mode 100644 index 0000000..abdb5a2 --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/specs/transcription/spec.md @@ -0,0 +1,47 @@ +## ADDED Requirements + +### Requirement: Transcription stage is ASR-only and backend-agnostic + +The transcription stage SHALL accept an audio file and return a `TranscriptionResult` containing rough segments (`text`, `start`, `end`), a detected language code, and processing provenance. The stage SHALL NOT perform forced alignment or word-level timestamp refinement. The stage's public interface SHALL be independent of any particular ASR library so that backends can be swapped without changing the stage's contract. 
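A non-normative sketch of the seam this implies, mirroring the Protocol idea in tasks 4.6; the name and exact return shape are illustrative:

```python
from pathlib import Path
from typing import Protocol

class AsrBackend(Protocol):
    def run(self, audio_path: Path, language: str | None) -> tuple[list[dict], str]:
        """Return rough segments ({"text", "start", "end"} dicts) and the detected language code."""
        ...
```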
+ +#### Scenario: Transcribing a short clip with the default backend + +- **WHEN** the stage is called with a 10-second WAV file on `cuda:0` +- **THEN** it returns a `TranscriptionResult` with one or more segments, a non-empty `language` field, and a non-zero `processing_time_seconds` + +#### Scenario: Transcription result has no per-word timestamps + +- **WHEN** a caller serializes a `TranscriptionResult` to JSON +- **THEN** `TranscriptSegment` entries do not contain a `words` list (that data lives in `AlignmentResult`) + +### Requirement: Transcription result includes detected language code + +The stage SHALL emit a `language` field on `TranscriptionResult` containing either the language code passed in by the caller or the code detected by the ASR backend when the caller requested auto-detection. The language code SHALL be in a form the forced-alignment stage can consume directly (e.g., BCP-47 or the MMS-FA supported set). + +#### Scenario: Auto-detecting language + +- **WHEN** the stage is called with `language="auto"` on an English clip +- **THEN** the resulting `TranscriptionResult.language` is `"en"` + +#### Scenario: Passing through an explicit language + +- **WHEN** the stage is called with `language="es"` +- **THEN** the resulting `TranscriptionResult.language` is `"es"` + +### Requirement: Transcription stage owns its model load and unload lifecycle + +The transcription stage SHALL expose `load_model` and `unload_model` entry points. After `unload_model`, the ASR model SHALL no longer hold GPU memory, matching the VRAM baseline observed before `load_model`. + +#### Scenario: Unload releases VRAM + +- **WHEN** the stage loads the ASR model, runs one transcription, and unloads the model +- **THEN** `torch.cuda.memory_allocated(device)` returns within 50 MB of the pre-load baseline + +### Requirement: Transcription stage is pluggable via a documented backend seam + +The transcription module SHALL expose a single function-level seam (initially wrapping faster-whisper) that future ASR backends can replace. The seam's inputs and outputs SHALL match the stage contract (audio path in, rough segments + language out) so that swapping backends does not require touching the pipeline, aligner, or CLI. + +#### Scenario: Replacing the default backend + +- **WHEN** a second backend is introduced in a follow-up change +- **THEN** the change consists of a new backend module implementing the seam and a configuration flag; no edits to `pipeline.py`, `aligner.py`, or the public `TranscriptionResult` shape are required diff --git a/openspec/changes/universal-alignment-strategy/tasks.md b/openspec/changes/universal-alignment-strategy/tasks.md new file mode 100644 index 0000000..fbb9651 --- /dev/null +++ b/openspec/changes/universal-alignment-strategy/tasks.md @@ -0,0 +1,128 @@ +## 1. 
Scaffolding & Dependencies + +- [ ] 1.1 Add `num2words` to `pyproject.toml` main dependencies and re-sync with `uv` +- [ ] 1.2 Add `faster-whisper` to `pyproject.toml` main dependencies; verify `uv sync` succeeds without reintroducing the WhisperX/torch 2.1.2 conflict (if conflict persists, keep under `[project.optional-dependencies].conflicting` and update `make install-whisperx` accordingly) +- [ ] 1.3 Confirm `torchaudio>=2.1.2` exposes `torchaudio.pipelines.MMS_FA` in the CUDA 12.1 wheel; run a one-liner smoke test from `make install-torch-cuda` +- [ ] 1.4 Create empty module skeletons: `src/text_normalizer.py`, `src/aligner.py`, `src/merger.py`, `src/vram_preflight.py`, `src/models/alignment.py` +- [ ] 1.5 Update `CLAUDE.md` and `docs/DEVELOPMENT.md` project-structure sections to list the new modules + +## 2. Data Models + +- [ ] 2.1 Add `AlignedWord` and `AlignmentResult` Pydantic models in `src/models/alignment.py` with fields from design §Decision 5 +- [ ] 2.2 Remove `words: list[WordSegment]` from `TranscriptSegment` in `src/models/transcription.py` +- [ ] 2.3 Remove `alignment_fallback: bool` from `TranscriptionResult` (fallback now lives on `AlignmentResult.fallback_reason`) +- [ ] 2.4 Export `AlignedWord`, `AlignmentResult` from `src/models/__init__.py` +- [ ] 2.5 Add a compat helper `load_transcription_with_words(transcription_path, alignment_path) -> list[TranscriptSegment]` that reads both JSONs and returns segments with word lists stitched back on (for callers that want the old shape) +- [ ] 2.6 Add unit tests under `tests/models/` for `AlignmentResult` round-trip JSON serialization and `load_transcription_with_words` stitching + +## 3. Text Normalization Stage + +- [ ] 3.1 Implement `normalize_for_alignment(text: str, language: str) -> str` in `src/text_normalizer.py` covering: lowercase, whitespace collapse, punctuation strip (preserving intra-word apostrophes), number expansion via `num2words`, currency pre-regex for `$`/`€`/`£` amounts including `k/M/B` suffixes +- [ ] 3.2 Implement graceful failure: on `num2words` exception for a token, strip symbols, log a warning with the offending token, and continue +- [ ] 3.3 Add `tests/test_text_normalizer.py` covering every scenario in `specs/text-normalization/spec.md` (English simple, currency, bare numbers, Spanish numbers, emoji strip, apostrophe preservation, unparseable currency fallback, determinism) + +## 4. Transcription Stage Refactor + +- [ ] 4.1 Rewrite `src/transcriber.py::transcribe()` to call `faster-whisper` directly (no `whisperx.align`, no `whisperx.assign_word_speakers`). Output rough segments only (`text`, `start`, `end`) plus detected language +- [ ] 4.2 Add `load_model(device, compute_type, batch_size, language, model)` and `unload_model(handle)` entry points. `unload_model` SHALL `del handle`, `gc.collect()`, and `torch.cuda.empty_cache()` +- [ ] 4.3 Replace `_parse_whisperx_device` with a faster-whisper-appropriate device parser (faster-whisper also splits `cuda:N` into `device="cuda", device_index=N`) +- [ ] 4.4 Update `TranscriptionResult` construction in `transcriber.py` to drop `words`, `alignment_fallback` +- [ ] 4.5 Remove the speaker-merge branch (`whisperx.assign_word_speakers`) from `transcriber.py`; speaker assignment moves to `merger.py` +- [ ] 4.6 Define a minimal backend seam: a module-level `Backend` protocol/type-alias or single `_run_asr(audio, model_handle, ...) 
+
+## 4. Transcription Stage Refactor
+
+- [ ] 4.1 Rewrite `src/transcriber.py::transcribe()` to call `faster-whisper` directly (no `whisperx.align`, no `whisperx.assign_word_speakers`). Output rough segments only (`text`, `start`, `end`) plus detected language
+- [ ] 4.2 Add `load_model(device, compute_type, batch_size, language, model)` and `unload_model(handle)` entry points. `unload_model` SHALL `del handle`, `gc.collect()`, and `torch.cuda.empty_cache()`
+- [ ] 4.3 Replace `_parse_whisperx_device` with a faster-whisper-appropriate device parser (faster-whisper also splits `cuda:N` into `device="cuda", device_index=N`)
+- [ ] 4.4 Update `TranscriptionResult` construction in `transcriber.py` to drop `words`, `alignment_fallback`
+- [ ] 4.5 Remove the speaker-merge branch (`whisperx.assign_word_speakers`) from `transcriber.py`; speaker assignment moves to `merger.py`
+- [ ] 4.6 Define a minimal backend seam: a module-level `Backend` protocol/type-alias or single `_run_asr(audio, model_handle, ...) -> (segments, language)` function that future backends can replace without touching the stage's public interface
+- [ ] 4.7 Rewrite `tests/test_transcriber.py` to mock faster-whisper, assert no `words` on output, and cover both auto-detect and explicit-language paths
+- [ ] 4.8 Verify transcription VRAM release: add a test that loads, transcribes, unloads, and asserts `torch.cuda.memory_allocated` is within 50 MB of baseline (skip under `not has_cuda`)
+
+## 5. Forced Alignment Stage
+
+- [ ] 5.1 Implement `load_model(device, language_code) -> AlignerHandle` in `src/aligner.py` that fetches `torchaudio.pipelines.MMS_FA` and prepares its tokenizer for the given language
+- [ ] 5.2 Implement `unload_model(handle)` mirroring the transcription unload pattern
+- [ ] 5.3 Implement `align(audio_path, transcription_result, normalizer, handle) -> AlignmentResult`: iterate ASR segments, pack into ≤30 s chunks, run the trellis per chunk, offset word timings back into the global timeline, populate `segment_index`
+- [ ] 5.4 Implement the silence-based fallback for oversize segments: on an ASR segment exceeding 30 s, split on the lowest-energy window within the segment (simple numpy RMS, no new dependency); if no viable split, populate `fallback_reason` with `"no_silence_in_segment"` and use the ASR's raw start/end for that span
+- [ ] 5.5 Raise a clear `AlignerError` for unsupported language codes (pre-check against `MMS_FA.get_labels()` or equivalent) — do NOT silently fall back to a different language
+- [ ] 5.6 Add `tests/test_aligner.py` covering: mocked MMS-FA bundle returning synthetic trellis output, chunk boundary math (ensure offsets applied), `segment_index` correctness, language-not-supported error, oversize-segment fallback
+- [ ] 5.7 Add an integration test under `tests/test_integration.py` marked `@pytest.mark.integration` that runs the real MMS-FA pipeline on a short recorded clip and asserts monotonic word start times across the file
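+
+*Illustrative sketch (non-normative):* the chunk-packing and offset arithmetic described in 5.3. The real per-chunk MMS-FA trellis call is abstracted behind the hypothetical `align_chunk` callable; only the ≤30 s packing and the mapping back to the global timeline are shown.
+
+```python
+from dataclasses import dataclass
+from typing import Callable
+
+MAX_CHUNK_SECONDS = 30.0
+
+
+@dataclass
+class RoughSegment:  # ASR-only output: text plus rough boundaries
+    index: int
+    text: str
+    start: float
+    end: float
+
+
+@dataclass
+class AlignedWord:  # simplified stand-in for src/models/alignment.py
+    word: str
+    start: float
+    end: float
+    segment_index: int
+
+
+# align_chunk(chunk) returns (word, rel_start, rel_end, segment_index) tuples
+# with timings relative to the chunk's start; it hides the MMS-FA trellis call.
+def align_segments(
+    segments: list[RoughSegment],
+    align_chunk: Callable[[list[RoughSegment]], list[tuple[str, float, float, int]]],
+) -> list[AlignedWord]:
+    words: list[AlignedWord] = []
+    chunk: list[RoughSegment] = []
+    chunk_start = 0.0
+
+    def flush() -> None:
+        if not chunk:
+            return
+        for text, rel_start, rel_end, seg_idx in align_chunk(chunk):
+            # Offset chunk-relative timings back onto the global timeline.
+            words.append(AlignedWord(text, chunk_start + rel_start, chunk_start + rel_end, seg_idx))
+        chunk.clear()
+
+    for seg in segments:
+        if chunk and seg.end - chunk_start > MAX_CHUNK_SECONDS:
+            flush()
+        if not chunk:
+            chunk_start = seg.start
+        chunk.append(seg)  # a single oversize segment still lands alone (the 5.4 fallback handles it)
+    flush()
+    return words
+```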
+
+## 6. Speaker Merge Module
+
+- [ ] 6.1 Implement `merge_speakers(alignment_result, diarization_result) -> AlignmentResult` in `src/merger.py` — iterate aligned words, find the diarization segment whose `[start, end]` contains each word's midpoint, assign `speaker` field; do not require a model
+- [ ] 6.2 Handle edge cases: word falls in a gap between diarization segments (nearest neighbor), word spans two speakers (pick the speaker covering more of the word's duration)
+- [ ] 6.3 Add `tests/test_merger.py` covering the midpoint-inside case, the gap case, and the boundary-spanning case with deterministic fixture data
+
+## 7. VRAM Budget Registry & Preflight Validation
+
+- [ ] 7.1 Create `src/model_budgets.toml` with VRAM estimates (MiB) for: `pyannote/speaker-diarization-3.1`, `faster-whisper/large-v3`, `faster-whisper/distil-large-v3`, `faster-whisper/medium`, `faster-whisper/small`, `torchaudio/MMS-FA`. Values include weight size + typical peak activation overhead at default batch sizes
+- [ ] 7.2 Implement `src/vram_preflight.py` with: `load_model_budgets()` (reads the TOML), `VramBudget` dataclass (per-model breakdown, total projected, available, verdict, suggestions), and `validate_vram_budget(stages, models, device, strategy, headroom_mib=512) -> VramBudget`
+- [ ] 7.3 Preflight logic for co-resident: sum all enabled stage budgets. For sequential: take max of any single stage. Compare against `gpu_utils.query_gpu_info(device).vram_mib - used_mib - headroom`
+- [ ] 7.4 Generate actionable suggestions on failure: switch to sequential, use a smaller model (list alternatives with their budgets), free GPU memory, switch GPU (if multi-GPU detected)
+- [ ] 7.5 Handle unknown models gracefully: log a warning, use a conservative fallback estimate (e.g., 4000 MiB), do not crash
+- [ ] 7.6 For multi-GPU parallel runs, validate against the smallest GPU's available VRAM across all target devices
+- [ ] 7.7 Add `tests/test_vram_preflight.py` covering: budget passes co-resident, budget blocks co-resident, auto-fallback to sequential, unknown model warning, multi-GPU smallest-GPU logic, headroom subtraction
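+
+*Illustrative sketch (non-normative):* the word-to-speaker rule from 6.1/6.2, using simplified tuples in place of the real `AlignmentResult` and diarization models.
+
+```python
+from typing import Optional
+
+
+def assign_speaker(word_start: float, word_end: float,
+                   turns: list[tuple[float, float, str]]) -> Optional[str]:
+    """turns: (start, end, speaker) diarization segments, sorted by start."""
+    if not turns:
+        return None
+    midpoint = (word_start + word_end) / 2
+
+    # 6.1: the diarization segment whose [start, end] contains the word's midpoint.
+    for start, end, speaker in turns:
+        if start <= midpoint <= end:
+            return speaker
+
+    # 6.2: word overlaps segments but no segment contains its midpoint ->
+    # pick the speaker covering the larger share of the word's duration.
+    best, best_overlap = None, 0.0
+    for start, end, speaker in turns:
+        overlap = min(word_end, end) - max(word_start, start)
+        if overlap > best_overlap:
+            best, best_overlap = speaker, overlap
+    if best is not None:
+        return best
+
+    # 6.2: word falls in a gap -> nearest neighbouring segment.
+    nearest = min(turns, key=lambda t: min(abs(midpoint - t[0]), abs(midpoint - t[1])))
+    return nearest[2]
+```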
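+
+*Illustrative sketch (non-normative):* the preflight arithmetic from 7.3/7.6. The budget numbers are placeholders standing in for `model_budgets.toml` entries, and `available_mib` stands in for the `gpu_utils.query_gpu_info()` figure; for multi-GPU parallel runs it would be the smallest GPU's available VRAM.
+
+```python
+def projected_vram_mib(stage_budgets_mib: dict[str, int], strategy: str) -> int:
+    """Co-resident sums every enabled stage; sequential takes the largest single stage."""
+    if strategy == "co-resident":
+        return sum(stage_budgets_mib.values())
+    return max(stage_budgets_mib.values())
+
+
+def preflight_ok(stage_budgets_mib: dict[str, int], available_mib: int,
+                 strategy: str, headroom_mib: int = 512) -> bool:
+    return projected_vram_mib(stage_budgets_mib, strategy) <= available_mib - headroom_mib
+
+
+# Placeholder numbers, not measured budgets.
+budgets = {"pyannote": 3000, "faster-whisper/large-v3": 10000, "MMS-FA": 4000}
+available = 11500  # e.g. a 12 GB card with a little VRAM already in use
+
+preflight_ok(budgets, available, "co-resident")  # 17000 > 10988 -> False, suggest sequential
+preflight_ok(budgets, available, "sequential")   # peak stage 10000 <= 10988 -> True
+```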
+
+## 8. Pipeline Integration
+
+- [ ] 8.1 Update `src/pipeline.py` stage order: separate → diarize → transcribe (ASR only) → normalize → align → merge speakers → sentiment. Unload behavior depends on VRAM strategy
+- [ ] 8.2 Add `--vram-strategy {co-resident, sequential}` parameter to `run_pipeline()`. Co-resident: load all in-process GPU models (pyannote, ASR, aligner) before the file loop. Sequential: load/unload each stage's model around its processing pass
+- [ ] 8.3 Integrate VRAM preflight: call `validate_vram_budget()` before loading any models. On failure with explicit co-resident, block with error. On failure with no explicit strategy, auto-fallback to sequential with a warning
+- [ ] 8.4 Replace `_load_whisperx_model` with new `transcriber.load_model()` call; wire pipeline to use the new lifecycle entry points for all stages
+- [ ] 8.5 Add a `mode` flag to the pipeline: `per_file` (interleaved) and `stage_batched`. Default: `per_file` for co-resident, `stage_batched` for sequential
+- [ ] 8.6 Add `AlignmentResult` to `PipelineResult` as a new `alignment: StageResult` field; update `FileOutcome` with alignment-specific metrics (`chunks_processed`, `fallback_count`)
+- [ ] 8.7 Update Slack notifier (`src/notifier.py`) and CLI Rich reports to render the new alignment stage stats and VRAM strategy used
+- [ ] 8.8 Update `tests/test_pipeline.py` and `tests/test_pipeline_parallel.py` to reflect the new stage order, both VRAM strategies, and both execution modes
+- [ ] 8.9 Add a pipeline test for co-resident mode that asserts all models are loaded before the file loop
+- [ ] 8.10 Add a pipeline test for sequential mode that asserts peak VRAM per stage reflects only one model at a time
+
+## 9. CLI
+
+- [ ] 9.1 Add `audio-refinery align` subcommand: inputs are `--audio`, `--transcription`, `--output`; options include `--device`, `--language` (override auto)
+- [ ] 9.2 Update `audio-refinery transcribe` to drop the `--no-align` flag (now moot) and to stop mentioning alignment in help text
+- [ ] 9.3 Update `audio-refinery pipeline` and `pipeline-parallel` to surface `--aligner {mms-fa,whisperx,none}`, `--vram-strategy {co-resident,sequential}`, `--mode {per-file,stage-batched}`, and `--plan ` flags. Default aligner stays `whisperx` for one release cycle
+- [ ] 9.4 Implement `--plan ` flag: read TOML, map to pipeline kwargs, run VRAM preflight against current GPU state before proceeding
+- [ ] 9.5 Update `tests/test_cli.py` for the new subcommands and flags
+
+## 10. Interactive Run Planner
+
+- [ ] 10.1 Add `audio-refinery plan` Click subcommand with `--source-dir`, `--device`, and `--save ` options
+- [ ] 10.2 Implement the discovery panel: file count, total audio duration (via `probe_audio_file` on each), GPU name/VRAM from `query_gpu_info()`
+- [ ] 10.3 Implement the stage configuration panel: list all pipeline stages with current enable/disable state and model choices; use Rich prompts for toggling stages and selecting models
+- [ ] 10.4 Implement the VRAM budget visualization: Rich bar chart showing per-model projected VRAM vs. available, with color-coded pass/fail. Re-renders on every configuration change
+- [ ] 10.5 Implement run time estimation: use model-specific RTF constants (bootstrapped from known benchmarks, refined from historical runs) × total audio duration to project wall-clock time. Show separately for co-resident vs. sequential
+- [ ] 10.6 Implement the run profile save: serialize confirmed configuration to TOML at the `--save` path (or a default like `.audio-refinery-plan.toml`)
+- [ ] 10.7 Implement non-TTY detection: `not sys.stdin.isatty()` → exit with message directing to `pipeline --plan`
+- [ ] 10.8 Add `tests/test_plan.py` covering: discovery panel output, VRAM budget calculation matches preflight, profile round-trip (save then load), non-TTY exit behavior
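+
+*Illustrative sketch (non-normative):* the wall-clock projection from 10.5. The RTF and load-time constants are placeholders; the real planner would bootstrap them from benchmarks and refine them from historical runs, and stage-batched sequential runs would load each model once rather than per file.
+
+```python
+STAGE_RTF = {"separate": 0.05, "diarize": 0.04, "transcribe": 0.10, "align": 0.03, "sentiment": 0.01}
+MODEL_LOAD_SECONDS = {"separate": 10, "diarize": 15, "transcribe": 25, "align": 8, "sentiment": 5}
+
+
+def estimate_wall_clock(total_audio_seconds: float, n_files: int,
+                        strategy: str, mode: str) -> float:
+    processing = sum(STAGE_RTF.values()) * total_audio_seconds
+    if strategy == "co-resident" or mode == "stage_batched":
+        load_overhead = sum(MODEL_LOAD_SECONDS.values())  # each model loaded once per run
+    else:
+        # sequential + per_file: models reload around every file's pass
+        load_overhead = sum(MODEL_LOAD_SECONDS.values()) * n_files
+    return processing + load_overhead
+
+
+# Ten files, one hour of audio total (placeholder constants above):
+estimate_wall_clock(3600, 10, "co-resident", "per_file")  # ~891 s
+estimate_wall_clock(3600, 10, "sequential", "per_file")   # ~1458 s
+```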
+
+## 11. Legacy WhisperX Shim
+
+- [ ] 11.1 Extract the current `transcriber.py` logic into `src/backends/whisperx_legacy.py` as a single callable that reproduces the old end-to-end behavior (ASR + Wav2Vec2 align + speaker merge) and emits an `AlignmentResult` in the new shape
+- [ ] 11.2 Wire the `--aligner whisperx` pipeline branch to the legacy shim
+- [ ] 11.3 Add a deprecation warning in the legacy branch's startup log; document the removal schedule in `CHANGELOG.md`
+
+## 12. Documentation
+
+- [ ] 12.1 Update `README.md`: pipeline-flow diagram, stage descriptions for the new 6-stage order, VRAM strategy overview, `plan` command usage, updated CLI examples, and new dependency notes (`faster-whisper`, `num2words`, `model_budgets.toml`)
+- [ ] 12.2 Update `CLAUDE.md`: "Project Structure" (new modules: `aligner.py`, `text_normalizer.py`, `merger.py`, `vram_preflight.py`, `model_budgets.toml`, `models/alignment.py`), "Pipeline Stages" (new stage order + alignment stage), "CLI" (new `align` and `plan` commands, new flags), "Critical Dependency Notes" (`faster-whisper` as direct dep, WhisperX deprecation), "Architecture" (hybrid VRAM lifecycle)
+- [ ] 12.3 Update `docs/DEVELOPMENT.md`: VRAM strategy trade-offs (co-resident vs. sequential), per-file vs. stage-batched execution modes, `model_budgets.toml` maintenance guide (how to measure and add entries), HF cache path for MMS-FA weights (`HF_HOME`/`TRANSFORMERS_CACHE`), `audio-refinery plan` usage and run profile format, updated `make` targets if any change
+- [ ] 12.4 Update `docs/ARCHITECTURE.md`: new pipeline stage flow diagram (6 stages), forced alignment stage description, text normalization stage description, speaker merge module, VRAM preflight system, hybrid model lifecycle, `model_budgets.toml` registry role
+- [ ] 12.5 Update `docs/USE_CASES.md`: add alignment-related use cases (RAG-ready word-level timestamps from any ASR backend, model-agnostic transcription), VRAM-constrained GPU use case (sequential mode on 12 GB cards), run planning use case
+- [ ] 12.6 Update `docs/PERFORMANCE.md`: document expected VRAM budgets per model/strategy, co-resident vs. sequential throughput comparison, MMS-FA alignment RTF benchmarks, chunked alignment overhead
+- [ ] 12.7 Update `docs/DEPLOYMENT.md`: document VRAM requirements per strategy and model combination, `model_budgets.toml` customization for cloud GPU instances, run profile usage for scheduled/CI pipelines, MMS-FA model weight pre-download for air-gapped deployments
+- [ ] 12.8 Update `CONTRIBUTING.md` if needed: note that new pipeline stages must expose `load_model`/`unload_model` and add an entry to `model_budgets.toml`
+
+## 13. Release v0.2.0
+
+- [ ] 13.1 Bump version in `pyproject.toml` from `0.1.1` to `0.2.0`
+- [ ] 13.2 Write `CHANGELOG.md` entry for v0.2.0: breaking changes (`TranscriptSegment.words` removed, new `AlignmentResult`), new features (forced alignment via MMS-FA, text normalization, model-agnostic transcription, hybrid VRAM strategy, VRAM preflight validation, interactive `plan` command, run profiles), deprecations (WhisperX backend — removal scheduled for v0.3.0), new dependencies (`faster-whisper`, `num2words`)
+- [ ] 13.3 Run `make all-checks` and `make test-integration` on a GPU host; capture wall-clock and peak-VRAM deltas vs. the v0.1.1 baseline for both VRAM strategies
+- [ ] 13.4 Create `release/v0.2.0` branch, open PR to `main`, merge
+- [ ] 13.5 Create annotated tag `v0.2.0` on merged commit and push — triggers `.github/workflows/release.yml` to build and create the GitHub release
+- [ ] 13.6 Verify GitHub release artifact is published and release notes match CHANGELOG
+
+## 14. Validation Gates
+
+- [ ] 14.1 Every spec requirement in `specs/forced-alignment/spec.md`, `specs/text-normalization/spec.md`, `specs/transcription/spec.md`, `specs/model-lifecycle/spec.md` maps to at least one test (unit or integration) that can fail
+- [ ] 14.2 `make test` passes (unit only, no GPU)
+- [ ] 14.3 `make test-integration` passes on a GPU host for both `--aligner mms-fa` and `--aligner whisperx`, and for both `--vram-strategy co-resident` and `--vram-strategy sequential`
+- [ ] 14.4 A 10-minute English podcast clip aligned with MMS-FA has word boundary deltas within ±150 ms of the WhisperX baseline on a hand-labeled subset
+- [ ] 14.5 Co-resident mode: total model load time for a 10-file batch is within 15% of the current pipeline's load time (models loaded once, not per-file)
+- [ ] 14.6 Sequential mode: peak VRAM on a single-GPU run of a representative file does not exceed the largest single stage's working-set memory
+- [ ] 14.7 VRAM preflight correctly blocks an over-budget co-resident run on a real GPU and the error message includes all required fields (per-model breakdown, projected total, available, suggestions)
+- [ ] 14.8 `audio-refinery plan` renders the VRAM budget visualization correctly and a saved profile round-trips through `pipeline --plan`
diff --git a/openspec/config.yaml b/openspec/config.yaml
new file mode 100644
index 0000000..392946c
--- /dev/null
+++ b/openspec/config.yaml
@@ -0,0 +1,20 @@
+schema: spec-driven
+
+# Project context (optional)
+# This is shown to AI when creating artifacts.
+# Add your tech stack, conventions, style guides, domain knowledge, etc.
+# Example:
+# context: |
+#   Tech stack: TypeScript, React, Node.js
+#   We use conventional commits
+#   Domain: e-commerce platform
+
+# Per-artifact rules (optional)
+# Add custom rules for specific artifacts.
+# Example:
+# rules:
+#   proposal:
+#     - Keep proposals under 500 words
+#     - Always include a "Non-goals" section
+#   tasks:
+#     - Break tasks into chunks of max 2 hours