test(quality): add Qwen3-ASR WER coverage #230
Open
pasrom wants to merge 4 commits into
Qwen3-ASR (FluidAudio Qwen3 0.6B) was the last engine without WER regression coverage. A FluidAudio bump or a Qwen3 model swap could shift the baseline silently, the same blind spot Parakeet had until #229.

Two tests against the same German fixtures (`two_speakers_de`, `three_speakers_de`) feed through the shared `runWERAgainstFixture` helper. The engine pins `language = "de"` (Qwen3 supports an explicit language hint, unlike Parakeet's auto-detect-only contract).

Local baselines (M3 Max, 2026-05-10):

    qwen3 two_speakers_de    WER ≈ 0.214
    qwen3 three_speakers_de  WER ≈ 0.509

The soft threshold of 0.6 sits about 0.09 WER (roughly 9 percentage points) above the higher baseline.

Qwen3 actually beats WhisperKit on the two-speaker fixture (0.21 vs 0.29) but trails on the three-speaker one (0.51 vs 0.23); proper-noun and tech-jargon substitutions in the longer audio drive the gap. This is documented inline so readers don't read the row pair as a uniform improvement over Whisper.

Class-level `@available(macOS 15, *)` mirrors `Qwen3AsrEngine`'s gate. The annotation is runtime-only: the file compiles fine against the package's macOS 14 deployment target, and XCTest skips the methods at discovery time on macOS 14 hosts.
Append the new Qwen3 quality class to the alternation filter so the WER baselines land in `quality-results.json` alongside the WhisperKit, Parakeet, and diarizer rows.

Adds ~10 s of wall-clock time when warm. A cold cache adds ~2-3 min for the ~1.75 GB CoreML model download: the first run after this lands will be slow, but subsequent runs hit the persistent FluidAudio model cache on the Mini runner.
Closing: the Mini live-run surfaced a real Qwen3-quality bug that we can't paper over. See follow-up plan in repo (not on the PR).
…transcribing

First Mini live-run of `Qwen3AsrEngineQualityTests` returned garbage tokens (a single Cyrillic char "б" on two_speakers_de, two punctuation chars on three_speakers_de) even though local M3 Max runs produced reasonable German (WER 0.214 / 0.509) against the same code, same fixture, same `language="de"` hint.

`Qwen3E2ETests.testTranscribeSegmentsProducesGermanContent` runs green on the same Mini. The only material difference: it round-trips the fixture through `AudioMixer.loadAudioFileAsFloat32` + `saveWAV` to a temp WAV before handing the path to the engine, while the quality test passed `truth.audioURL` directly. Both files report identical ffprobe metadata (16 kHz mono PCM s16le); the round-trip rewrites bytes to Float32 PCM but reports the same shape, so something Qwen3's loader cares about is normalised by the round-trip.

This commit:

- Adds an optional `audioPathOverride: URL?` parameter to `runWERAgainstFixture` (defaults to nil → use truth's audioURL, so WhisperKit + Parakeet behaviour is unchanged).
- Mirrors the working E2E test's `resampleFixtureToTemp` pattern inline in the Qwen3 quality test, then passes the temp path through the override.

Inline (vs sharing the resample helper) so the Mini-bug workaround is visible at the point of use; if a future investigation roots out the underlying loader issue and removes the need for the round-trip, the delete is local to one file.

Local M3 Max: WER unchanged (0.214 / 0.509). Mini retest pending.
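The optional-override pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the helper's actual code: `FixtureTruth`, `resolveAudioURL`, and the file paths are hypothetical names invented for the example, and the real `runWERAgainstFixture` signature is not shown in this PR.

```swift
import Foundation

// Hypothetical stand-in for the fixture's ground-truth record (assumption:
// the real type exposes at least an `audioURL`, as the commit message implies).
struct FixtureTruth {
    let audioURL: URL
    let referenceText: String
}

/// Picks the audio path handed to the engine.
/// A nil override keeps the existing behaviour: use the fixture's own audio.
func resolveAudioURL(truth: FixtureTruth, audioPathOverride: URL? = nil) -> URL {
    audioPathOverride ?? truth.audioURL
}

let truth = FixtureTruth(
    audioURL: URL(fileURLWithPath: "/fixtures/two_speakers_de.wav"),  // illustrative path
    referenceText: "..."
)

// Illustrative temp path standing in for the round-tripped WAV.
let tempWAV = URL(fileURLWithPath: NSTemporaryDirectory())
    .appendingPathComponent("qwen3-roundtrip.wav")

// Callers that do not pass the override see no change.
let defaultURL = resolveAudioURL(truth: truth)

// The Qwen3 quality test would pass the round-tripped temp file instead.
let overriddenURL = resolveAudioURL(truth: truth, audioPathOverride: tempWAV)
```

Defaulting the parameter to nil means existing call sites compile and behave as before, which is the property the commit relies on for WhisperKit and Parakeet.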
Bisect step 2 from the Qwen3-on-Mini garbage-output investigation plan. If running `Qwen3AsrEngineQualityTests` alone produces clean German transcription on Mini, the bug is a process-state leak from co-loaded WhisperKit / FluidDiarizer / Parakeet models.

DO NOT MERGE THIS COMMIT. Revert before the PR ships either way.
Summary
Closes the last gap in the WER quality matrix. Qwen3-ASR (FluidAudio Qwen3 0.6B) was the only production engine without regression coverage. A FluidAudio bump or a Qwen3 model swap could shift the baseline silently.
What's new
- `Qwen3AsrEngineQualityTests`: 2 tests against `two_speakers_de` and `three_speakers_de`, feeding through the shared `runWERAgainstFixture` helper (same one Parakeet + WhisperKit use). Engine pins `language = "de"`.
- `--filter` alternation in `quality-and-safety.yml`.
- `runWERAgainstFixture` gains an optional `audioPathOverride: URL?` parameter; defaults to nil so WhisperKit + Parakeet behaviour is unchanged.

The Mini-bug round-trip workaround
Initial Mini live-run returned garbage tokens (a single Cyrillic `б` on the two-speaker fixture, two punctuation chars on the three-speaker) even though local M3 Max gave reasonable German (WER 0.214 / 0.509) against the same code, same fixture, same language hint.

`Qwen3E2ETests.testTranscribeSegmentsProducesGermanContent` runs green on the same Mini. The only material difference between the working E2E call and the failing quality call was that the E2E test round-trips the fixture through `AudioMixer.loadAudioFileAsFloat32` → `AudioMixer.saveWAV` to a temp WAV before handing the path to the engine. Both files report identical ffprobe metadata (16 kHz mono PCM s16le); the round-trip rewrites bytes to Float32 PCM but reports the same shape, yet Qwen3's loader cares about something the round-trip normalises.

The Qwen3 quality test now mirrors that round-trip inline, then passes the temp path through the new `audioPathOverride`. Inline (vs sharing the resample helper) so the Mini-bug workaround is visible at the point of use; if a future investigation roots out the loader issue and removes the need for the round-trip, the delete is local to one file.

Local baselines (M3 Max, 2026-05-10)
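| Engine | Fixture | WER |
|---|---|---|
| qwen3 | `two_speakers_de` | ≈ 0.214 |
| qwen3 | `three_speakers_de` | ≈ 0.509 |

The soft threshold of 0.6 sits about 0.09 WER (roughly 9 percentage points) above the higher baseline. Wide enough to absorb run-to-run variance, tight enough to flag catastrophic breakage (model corrupted, language hint ignored, audio not loaded).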
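To make the threshold check concrete, here is a minimal sketch of a word error rate computation and a soft-threshold comparison. It assumes WER is the word-level edit distance divided by the reference word count, with lowercasing and whitespace splitting as the only normalisation; the shared `runWERAgainstFixture` helper may normalise differently. The function name and sample sentences are invented for illustration.

```swift
import Foundation

/// Word error rate: word-level Levenshtein distance / reference word count.
/// Normalisation (lowercase + whitespace split) is an assumption of this sketch.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(whereSeparator: { $0.isWhitespace }).map(String.init)
    let hyp = hypothesis.lowercased().split(whereSeparator: { $0.isWhitespace }).map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    if hyp.isEmpty { return 1 }  // every reference word was deleted

    // Rolling-row dynamic programme over (reference prefix, hypothesis prefix).
    var previous = Array(0...hyp.count)
    for i in 1...ref.count {
        var current = [i] + Array(repeating: 0, count: hyp.count)
        for j in 1...hyp.count {
            let substitution = previous[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1)
            let deletion = previous[j] + 1
            let insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

let softThreshold = 0.6

// Two substitutions out of five reference words.
let plausibleWER = wordErrorRate(reference: "das ist ein kurzer Test",
                                 hypothesis: "das ist kein kurzer Text")

// A single garbage token, similar in shape to the Mini failure.
let garbageWER = wordErrorRate(reference: "das ist ein kurzer Test",
                               hypothesis: "б")

let plausiblePasses = plausibleWER <= softThreshold
let garbagePasses = garbageWER <= softThreshold
```

Under these assumptions a near-empty or garbage transcript scores a WER at or near 1.0, well above the 0.6 threshold, while the recorded baselines (0.214 and 0.509) stay below it.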
For comparison after this PR lands, `quality-results.json` carries:

- `whisperKit` (large-v3-turbo, language=de)
- `parakeet` (auto-detect)
- `qwen3` (language=de)

Qwen3 actually beats WhisperKit on the two-speaker fixture (0.21 vs 0.29) but trails on the three-speaker one (0.51 vs 0.23). Proper-noun + tech-jargon substitutions in the longer audio drive the gap.
macOS gate
Class-level `@available(macOS 15, *)` mirrors `Qwen3AsrEngine`'s gate: CoreML stateful models need macOS 15+. The annotation is runtime-only; the file compiles fine against the package's macOS 14 deployment target, and XCTest just skips the methods at discovery time on older hosts.

Test plan
- `RUN_QUALITY_TESTS=1 swift test --filter "WhisperKitQualityTests|ParakeetQualityTests|Qwen3AsrEngineQualityTests|FluidDiarizerQualityTests"` → all pass locally
- `quality-results.json`