test(quality): add Qwen3-ASR WER coverage by pasrom · Pull Request #230 · pasrom/meeting-transcriber

pasrom · 2026-05-10T19:35:33Z

Summary

Closes the last gap in the WER quality matrix. Qwen3-ASR (FluidAudio Qwen3 0.6B) was the only production engine without regression coverage. A FluidAudio bump or a Qwen3 model swap could shift the baseline silently.

What's new

Qwen3AsrEngineQualityTests — 2 tests against two_speakers_de and three_speakers_de, feeding through the shared runWERAgainstFixture helper (same one Parakeet + WhisperKit use). Engine pins language = "de".
1-line append to the --filter alternation in quality-and-safety.yml.
runWERAgainstFixture gains an optional audioPathOverride: URL? parameter; defaults to nil so WhisperKit + Parakeet behaviour is unchanged.
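
A minimal sketch of the override rule, assuming the helper simply falls back to the fixture's own URL when no override is given. `FixtureTruth`, `resolveAudioURL`, and the file paths are hypothetical stand-ins for illustration; only `audioPathOverride` and `truth.audioURL` are names from this PR.

```swift
import Foundation

// Hypothetical stand-in for the fixture's ground-truth record.
struct FixtureTruth {
    let audioURL: URL
    let referenceText: String
}

// Sketch of the path-selection rule only: an explicit override wins,
// otherwise the fixture's own audio URL is used.
func resolveAudioURL(for truth: FixtureTruth, audioPathOverride: URL? = nil) -> URL {
    audioPathOverride ?? truth.audioURL
}

// Hypothetical paths, for illustration only.
let truth = FixtureTruth(
    audioURL: URL(fileURLWithPath: "/fixtures/two_speakers_de.wav"),
    referenceText: "guten morgen"
)
let tempCopy = URL(fileURLWithPath: "/tmp/two_speakers_de-roundtrip.wav")

// Existing callers omit the argument and keep reading the fixture directly.
let defaultURL = resolveAudioURL(for: truth)
// A caller that needs a different file passes it explicitly.
let overriddenURL = resolveAudioURL(for: truth, audioPathOverride: tempCopy)
```

Because the parameter defaults to nil, call sites that do not mention it compile and behave as before.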

The Mini-bug round-trip workaround

Initial Mini live-run returned garbage tokens — single Cyrillic б on the two-speaker fixture, two punctuation chars on the three-speaker — even though local M3 Max gave reasonable German (WER 0.214 / 0.509) against the same code, same fixture, same language hint.

Qwen3E2ETests.testTranscribeSegmentsProducesGermanContent runs green on the same Mini. The only material difference between the working E2E call and the failing quality call was that the E2E test round-trips the fixture through AudioMixer.loadAudioFileAsFloat32 → AudioMixer.saveWAV to a temp WAV before handing the path to the engine. Both files report identical ffprobe metadata (16 kHz mono PCM s16le); the round-trip rewrites bytes to Float32 PCM but reports the same shape — yet Qwen3's loader cares about something the round-trip normalises.

The Qwen3 quality test now mirrors that round-trip inline, then passes the temp path through the new audioPathOverride. Inline (vs sharing the resample helper) so the Mini-bug workaround is visible at the point of use; if a future investigation roots out the loader issue and removes the need for the round-trip, the delete is local to one file.
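
The shape of that round-trip can be sketched as follows. The real `AudioMixer.loadAudioFileAsFloat32` and `AudioMixer.saveWAV` signatures are not shown in this PR, so the sketch injects `load` and `save` closures as stand-ins; `roundTripToTemp`, the temp-file naming, and the fake encoder below are all assumptions made for illustration.

```swift
import Foundation

// Sketch: decode a fixture to Float32 samples, re-encode it into a temp file,
// and return the temp URL. `load` and `save` stand in for the project's
// AudioMixer calls, whose real signatures are not shown here.
func roundTripToTemp(
    _ source: URL,
    load: (URL) throws -> [Float],
    save: ([Float], URL) throws -> Void
) throws -> URL {
    let temp = FileManager.default.temporaryDirectory
        .appendingPathComponent("qwen3-quality-\(UUID().uuidString).wav")
    let samples = try load(source)
    try save(samples, temp)
    return temp
}

// Hypothetical fixture path; the closures are fakes so the sketch runs anywhere.
let fixture = URL(fileURLWithPath: "/fixtures/two_speakers_de.wav")
let tempURL = try? roundTripToTemp(
    fixture,
    load: { _ in [0.0, 0.5, -0.5] },   // fake decoder returning three samples
    save: { samples, url in            // fake encoder: dumps raw Float32 bytes
        let data = samples.withUnsafeBufferPointer { Data(buffer: $0) }
        try data.write(to: url)
    }
)
```

The returned URL is what would be handed to `audioPathOverride`; the caller remains responsible for deleting the temp file afterwards.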

Local baselines (M3 Max, 2026-05-10)

Engine	Fixture	WER
`qwen3`	two_speakers_de	0.214
`qwen3`	three_speakers_de	0.509

Soft threshold 0.6 sits about 0.09 WER (roughly 9 percentage points, or ~18 % in relative terms) above the higher baseline of 0.509. Wide enough to absorb run-to-run variance, tight enough to flag catastrophic breakage (model corrupted, language hint ignored, audio not loaded).
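
For reference, the headroom arithmetic behind that threshold, using only the baselines and the 0.6 value from this PR. The strict `>` comparison in `exceedsSoftThreshold` is an assumption; the actual check in the test helper may differ.

```swift
// Baselines and threshold as reported in this PR.
let baselines: [String: Double] = [
    "two_speakers_de": 0.214,
    "three_speakers_de": 0.509,
]
let softThreshold = 0.6

let worstBaseline = baselines.values.max() ?? 0
let absoluteHeadroom = softThreshold - worstBaseline     // in WER points
let relativeHeadroom = absoluteHeadroom / worstBaseline  // fraction of the baseline

// Assumed comparison: a run is flagged only when WER is strictly above the threshold.
func exceedsSoftThreshold(_ wer: Double, threshold: Double) -> Bool {
    wer > threshold
}
```

Both local baselines stay under the threshold, while a run whose transcript is entirely wrong (WER at or near 1.0) would be flagged.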

For comparison after this PR lands, quality-results.json carries:

Engine	two_speakers_de	three_speakers_de
`whisperKit` (large-v3-turbo, language=de)	0.286	0.226
`parakeet` (auto-detect)	0.464	0.453
`qwen3` (language=de)	0.214	0.509

Qwen3 actually beats WhisperKit on the two-speaker fixture (0.21 vs 0.29) but trails on the three-speaker one (0.51 vs 0.23). Proper-noun + tech-jargon substitutions in the longer audio drive the gap.
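
As a reminder of what these numbers measure, here is a minimal WER sketch: word-level edit distance (substitutions, deletions, insertions) divided by the reference word count. The normalisation (lowercasing, splitting on spaces) is an assumption and is likely simpler than what the project's helper does; the sample sentences are made up.

```swift
// Word error rate = (substitutions + deletions + insertions) / reference word count,
// computed as a word-level Levenshtein distance.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }

    // previous[j] = distance between the reference words consumed so far
    // and the first j hypothesis words.
    var previous = Array(0...hyp.count)
    for (i, refWord) in ref.enumerated() {
        var current = [i + 1]
        current.reserveCapacity(hyp.count + 1)
        for (j, hypWord) in hyp.enumerated() {
            let substitution = previous[j] + (refWord == hypWord ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current.append(min(substitution, deletion, insertion))
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// One substituted word out of five reference words.
let oneSubstitution = wordErrorRate(
    reference: "das ist ein kurzer test",
    hypothesis: "das ist ein kurzer text"
)
// A single unrelated token against the same reference.
let garbageOutput = wordErrorRate(
    reference: "das ist ein kurzer test",
    hypothesis: "б"
)
```

A single wrong proper noun counts as one substitution, so longer audio with more names and jargon accumulates errors quickly.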

macOS gate

Class-level @available(macOS 15, *) mirrors Qwen3AsrEngine's gate: CoreML stateful models need macOS 15+. The annotation does not raise the package's deployment target; the file still compiles against the macOS 14 target, and XCTest skips the methods at discovery time on older hosts.
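
A small illustration of the availability gate, using a hypothetical `GatedEngine` type rather than the real engine or test class. It shows only the language-level behaviour: the declaration compiles against a lower deployment target, and use sites are guarded with `#available`.

```swift
// Hypothetical stand-in for a type that needs macOS 15+ APIs.
@available(macOS 15, *)
struct GatedEngine {
    func describe() -> String { "engine available" }
}

func engineStatus() -> String {
    // Code built for an older deployment target must guard the use site.
    if #available(macOS 15, *) {
        return GatedEngine().describe()
    } else {
        return "engine unavailable on this host"
    }
}

let status = engineStatus()
```

On a macOS 14 host the else branch is taken; how XCTest treats an unavailable test class is not demonstrated by this sketch.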

Test plan

RUN_QUALITY_TESTS=1 swift test --filter "WhisperKitQualityTests|ParakeetQualityTests|Qwen3AsrEngineQualityTests|FluidDiarizerQualityTests" → all pass locally
Lint clean
Verified all four engine rows land in quality-results.json
Mini quality job green with the round-trip workaround (initial run without it failed; resume run pending)
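
The `RUN_QUALITY_TESTS=1` prefix in the first step implies an environment gate. A minimal sketch of such a gate, assuming the tests compare the variable against the exact string "1"; `qualityTestsEnabled` is a hypothetical name, and the project's real check is not shown in this PR.

```swift
import Foundation

// Assumed gate: quality tests run only when RUN_QUALITY_TESTS is exactly "1".
// The environment is injectable so the rule can be exercised without
// touching the real process environment.
func qualityTestsEnabled(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> Bool {
    environment["RUN_QUALITY_TESTS"] == "1"
}

let enabledWhenSet = qualityTestsEnabled(environment: ["RUN_QUALITY_TESTS": "1"])
let enabledWhenUnset = qualityTestsEnabled(environment: [:])
```

In an XCTest method this would typically pair with `try XCTSkipUnless(qualityTestsEnabled(), "...")`, so ordinary `swift test` runs skip the slow model-backed tests.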

Qwen3-ASR (FluidAudio Qwen3 0.6B) was the last engine without WER regression coverage. A FluidAudio bump or a Qwen3 model swap could shift the baseline silently, the same blind spot Parakeet had until #229.

Two tests against the same German fixtures (`two_speakers_de`, `three_speakers_de`) feed through the shared `runWERAgainstFixture` helper. Engine pins `language = "de"` (Qwen3 supports an explicit language hint, unlike Parakeet's auto-detect-only contract).

Local baselines (M3 Max, 2026-05-10):
qwen3 two_speakers_de WER ≈ 0.214
qwen3 three_speakers_de WER ≈ 0.509

Soft threshold 0.6 sits about 0.09 WER above the higher baseline.

Qwen3 actually beats WhisperKit on the two-speaker fixture (0.21 vs 0.29) but trails on the three-speaker one (0.51 vs 0.23); proper-noun + tech-jargon substitutions in the longer audio drive the gap. Documented inline so readers don't read the row pair as a uniform improvement vs Whisper.

Class-level `@available(macOS 15, *)` mirrors `Qwen3AsrEngine`'s gate. The annotation does not raise the deployment target: the file compiles fine against the package's macOS 14 deployment target, and XCTest skips the methods at discovery time on macOS 14 hosts.

Append the new Qwen3 quality class to the alternation filter so the WER baselines land in `quality-results.json` alongside the WhisperKit, Parakeet, and diarizer rows. Adds ~10 s of wall-clock time when the model cache is warm. A cold cache adds ~2-3 min for the ~1.75 GB CoreML model download: the first run after this lands will be slow, but subsequent runs hit the persistent FluidAudio model cache on the Mini runner.

pasrom · 2026-05-10T19:50:00Z

Closing — Mini live-run surfaced a real Qwen3-quality bug that we can't paper over. See follow-up plan in repo (not on the PR).

…transcribing

First Mini live-run of `Qwen3AsrEngineQualityTests` returned garbage tokens (a single Cyrillic char "б" on two_speakers_de, two punctuation chars on three_speakers_de) even though local M3 Max runs produced reasonable German (WER 0.214 / 0.509) against the same code, same fixture, same `language="de"` hint.

`Qwen3E2ETests.testTranscribeSegmentsProducesGermanContent` runs green on the same Mini. The only material difference: it round-trips the fixture through `AudioMixer.loadAudioFileAsFloat32` + `saveWAV` to a temp WAV before handing the path to the engine, while the quality test passed `truth.audioURL` directly. Both files report identical ffprobe metadata (16 kHz mono PCM s16le); the round-trip rewrites bytes to Float32 PCM but reports the same shape, so something Qwen3's loader cares about is normalised by the round-trip.

This commit:
- Adds an optional `audioPathOverride: URL?` parameter to `runWERAgainstFixture` (defaults to nil → use truth's audioURL, so WhisperKit + Parakeet behaviour is unchanged).
- Mirrors the working E2E test's `resampleFixtureToTemp` pattern inline in the Qwen3 quality test, then passes the temp path through the override. Inline (vs sharing the resample helper) so the Mini-bug workaround is visible at the point of use; if a future investigation roots out the underlying loader issue and removes the need for the round-trip, the delete is local to one file.

Local M3 Max: WER unchanged (0.214 / 0.509). Mini retest pending.

Bisect step 2 from the Qwen3-on-Mini garbage-output investigation plan. If running Qwen3AsrEngineQualityTests alone produces clean German transcription on Mini, the bug is process-state-leak from co-loaded WhisperKit / FluidDiarizer / Parakeet models. DO NOT MERGE THIS COMMIT. Revert before the PR ships either way.

pasrom added 2 commits May 10, 2026 21:35

github-actions Bot added the chore Maintenance or non-functional changes label May 10, 2026

pasrom closed this May 10, 2026

pasrom deleted the feat/qwen3-quality-tests branch May 10, 2026 19:50

pasrom reopened this May 10, 2026

pasrom wants to merge 4 commits into main from feat/qwen3-quality-tests
