test(quality): add Qwen3-ASR WER coverage#230

Open
pasrom wants to merge 4 commits into main from feat/qwen3-quality-tests

@pasrom pasrom commented May 10, 2026

Summary

Closes the last gap in the WER quality matrix. Qwen3-ASR (FluidAudio Qwen3 0.6B) was the only production engine without regression coverage. A FluidAudio bump or a Qwen3 model swap could shift the baseline silently.

What's new

  • Qwen3AsrEngineQualityTests — 2 tests against two_speakers_de and three_speakers_de, feeding through the shared runWERAgainstFixture helper (same one Parakeet + WhisperKit use). Engine pins language = "de".
  • 1-line append to the --filter alternation in quality-and-safety.yml.
  • runWERAgainstFixture gains an optional audioPathOverride: URL? parameter; defaults to nil so WhisperKit + Parakeet behaviour is unchanged.
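
A minimal sketch of how an optional override with a nil default keeps existing call sites unchanged. The real signature of runWERAgainstFixture is not shown in this PR description, so `FixtureTruth`, its fields, and the `transcribe` closure are hypothetical stand-ins for illustration only:

```swift
import Foundation

/// Hypothetical stand-in for the fixture's ground-truth record.
struct FixtureTruth {
    let audioURL: URL
    let referenceText: String
}

/// Sketch only: the parameter list is an assumption, not the repo's helper.
/// When `audioPathOverride` is nil, the fixture's own audio URL is used.
func runWERAgainstFixture(
    truth: FixtureTruth,
    audioPathOverride: URL? = nil,
    transcribe: (URL) throws -> String
) rethrows -> (audioURL: URL, hypothesis: String) {
    // Fall back to the fixture path so callers that omit the override
    // (WhisperKit, Parakeet) see no behaviour change.
    let audioURL = audioPathOverride ?? truth.audioURL
    let hypothesis = try transcribe(audioURL)
    return (audioURL, hypothesis)
}
```

Because the new parameter defaults to nil, only the Qwen3 test needs to mention it; every other call compiles and behaves as before.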

The Mini-bug round-trip workaround

Initial Mini live-run returned garbage tokens — single Cyrillic б on the two-speaker fixture, two punctuation chars on the three-speaker — even though local M3 Max gave reasonable German (WER 0.214 / 0.509) against the same code, same fixture, same language hint.

Qwen3E2ETests.testTranscribeSegmentsProducesGermanContent runs green on the same Mini. The only material difference between the working E2E call and the failing quality call is that the E2E test round-trips the fixture through AudioMixer.loadAudioFileAsFloat32 and AudioMixer.saveWAV to a temp WAV before handing the path to the engine, while the quality test passed truth.audioURL directly. Both files report identical ffprobe metadata (16 kHz mono PCM s16le); the round-trip rewrites bytes to Float32 PCM but reports the same shape, so the round-trip evidently normalises something that Qwen3's loader depends on.

The Qwen3 quality test now mirrors that round-trip inline, then passes the temp path through the new audioPathOverride. Inline (vs sharing the resample helper) so the Mini-bug workaround is visible at the point of use; if a future investigation roots out the loader issue and removes the need for the round-trip, the delete is local to one file.
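
The shape of that inline round-trip can be sketched as below. The signatures of AudioMixer.loadAudioFileAsFloat32 and AudioMixer.saveWAV are not shown in this PR, so the sketch takes `load` and `save` closures as hypothetical stand-ins; `withRoundTrippedAudio` is likewise an illustrative name, not a helper from the repo:

```swift
import Foundation

/// Sketch of the round-trip workaround: decode the fixture to Float32 samples,
/// re-encode them to a temporary WAV, and hand that temp path to `body`.
/// `load` and `save` are assumptions standing in for the AudioMixer calls.
func withRoundTrippedAudio<T>(
    source: URL,
    load: (URL) throws -> [Float],
    save: ([Float], URL) throws -> Void,
    body: (URL) throws -> T
) throws -> T {
    let tempURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("qwen3-roundtrip-\(UUID().uuidString)")
        .appendingPathExtension("wav")
    // Remove the temp file once the caller is done with it.
    defer { try? FileManager.default.removeItem(at: tempURL) }

    let samples = try load(source)   // decode fixture to Float32
    try save(samples, tempURL)       // rewrite as a fresh WAV
    return try body(tempURL)         // e.g. pass as audioPathOverride
}
```

Keeping the whole workaround in one scoped call matches the stated intent: if the loader issue is later fixed, deleting the wrapper restores the direct fixture path.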

Local baselines (M3 Max, 2026-05-10)

| Engine | Fixture | WER |
|---|---|---|
| qwen3 | two_speakers_de | 0.214 |
| qwen3 | three_speakers_de | 0.509 |

Soft threshold 0.6 sits about 9 percentage points (roughly 18 % relative) above the higher baseline of 0.509. That is wide enough to absorb run-to-run variance and tight enough to flag catastrophic breakage (model corrupted, language hint ignored, audio not loaded).
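
The headroom between the local baselines and the 0.6 soft threshold can be checked with a few lines of arithmetic. The dictionary layout and variable names here are illustrative assumptions; only the three numbers come from this PR:

```swift
// Baselines from the local M3 Max run quoted in this PR.
let baselines = ["two_speakers_de": 0.214, "three_speakers_de": 0.509]
let softThreshold = 0.6

// The threshold has to clear the worst (highest) baseline.
let highestBaseline = baselines.values.max()!
let absoluteHeadroom = softThreshold - highestBaseline     // in WER units
let relativeHeadroom = absoluteHeadroom / highestBaseline  // fraction of baseline

print("absolute headroom:", absoluteHeadroom)
print("relative headroom:", relativeHeadroom)

// A run fails the soft check only if it exceeds the threshold.
let allWithinThreshold = baselines.values.allSatisfy { $0 <= softThreshold }
print("all baselines within threshold:", allWithinThreshold)
```

Measured against the three-speaker baseline, the margin is about 0.09 in absolute WER, or roughly 18 % of that baseline.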

For comparison after this PR lands, quality-results.json carries:

| Engine | two_speakers_de | three_speakers_de |
|---|---|---|
| whisperKit (large-v3-turbo, language=de) | 0.286 | 0.226 |
| parakeet (auto-detect) | 0.464 | 0.453 |
| qwen3 (language=de) | 0.214 | 0.509 |

Qwen3 actually beats WhisperKit on the two-speaker fixture (0.21 vs 0.29) but trails on the three-speaker one (0.51 vs 0.23). Proper-noun + tech-jargon substitutions in the longer audio drive the gap.
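
For readers unfamiliar with how substitutions feed into these numbers, here is a minimal word error rate sketch: word-level edit distance divided by the reference word count. It assumes lowercase-and-split-on-spaces normalisation; the normalisation and scoring used by the repo's own helper are not shown in this PR and may differ:

```swift
/// Minimal WER sketch: (substitutions + deletions + insertions) / reference words.
/// Normalisation (lowercasing, splitting on spaces) is an assumption.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }

    // Rolling-row Levenshtein distance over words.
    var previous = Array(0...hyp.count)
    for (i, refWord) in ref.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: hyp.count)
        for (j, hypWord) in hyp.enumerated() {
            let substitution = previous[j] + (refWord == hypWord ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// One misheard word in a six-word reference.
let oneSubstitution = wordErrorRate(
    reference: "der Build läuft auf dem Mini",
    hypothesis: "der Bild läuft auf dem Mini"
)
// A single garbage token, like the failing Mini run, against the same reference.
let garbageOutput = wordErrorRate(
    reference: "der Build läuft auf dem Mini",
    hypothesis: "б"
)
print(oneSubstitution, garbageOutput)
```

Under these assumptions a single substituted word costs 1/6 of the reference, while the single-token garbage output scores a WER of 1.0, well past the 0.6 soft threshold.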

macOS gate

Class-level @available(macOS 15, *) mirrors Qwen3AsrEngine's gate, since CoreML stateful models need macOS 15+. The annotation is runtime-only: the file compiles fine against the package's macOS 14 deployment target, and XCTest just skips the methods at discovery time on older hosts.
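
A minimal sketch of what such a class-level gate looks like. The class name and test body are hypothetical, and the behaviour on older hosts is as described in this PR rather than something the sketch verifies:

```swift
import XCTest

// Hypothetical test class; only the attribute mirrors the gate described above.
// The declaration still compiles when the package targets macOS 14.
@available(macOS 15, *)
final class GatedEngineQualityTests: XCTestCase {
    func testRunsOnlyWhereTheGateIsSatisfied() {
        // If this body executes, the host already satisfies macOS 15+.
        XCTAssertTrue(true)
    }
}

/// The same check expressed inline, for code that cannot be gated at class level.
func hostSupportsStatefulModels() -> Bool {
    if #available(macOS 15, *) {
        return true
    }
    return false
}
```

Gating the whole class keeps every test method free of per-method availability checks, at the cost of the class being invisible on older hosts.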

Test plan

  • RUN_QUALITY_TESTS=1 swift test --filter "WhisperKitQualityTests|ParakeetQualityTests|Qwen3AsrEngineQualityTests|FluidDiarizerQualityTests" → all pass locally
  • Lint clean
  • Verified all four engine rows land in quality-results.json
  • Mini quality job green with the round-trip workaround (initial run without it failed; resume run pending)
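
How the suite reads RUN_QUALITY_TESTS is not shown in this PR. One common pattern, sketched here as an assumption, is a small predicate over the process environment that each quality test consults before doing any work; `qualityTestsEnabled` is a hypothetical name:

```swift
import Foundation

/// Sketch: quality tests run only when RUN_QUALITY_TESTS is exactly "1".
/// The environment is injectable so the predicate itself is easy to test.
func qualityTestsEnabled(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> Bool {
    environment["RUN_QUALITY_TESTS"] == "1"
}
```

Inside an XCTest method this could be paired with `try XCTSkipUnless(qualityTestsEnabled())`, so a plain `swift test` run reports the quality tests as skipped instead of downloading models.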

pasrom added 2 commits May 10, 2026 21:35
Qwen3-ASR (FluidAudio Qwen3 0.6B) was the last engine without WER
regression coverage. A FluidAudio bump or a Qwen3 model swap could
shift the baseline silently — same blind spot Parakeet had until #229.

Two tests against the same German fixtures (`two_speakers_de`,
`three_speakers_de`) feed through the shared `runWERAgainstFixture`
helper. Engine pins `language = "de"` (Qwen3 supports an explicit
language hint, unlike Parakeet's auto-detect-only contract).

Local baselines (M3 Max, 2026-05-10):

  qwen3  two_speakers_de    WER ≈ 0.214
  qwen3  three_speakers_de  WER ≈ 0.509

Soft threshold 0.6 sits ~10 % above the higher baseline. Qwen3 actually
beats WhisperKit on the two-speaker fixture (0.21 vs 0.29) but trails on
the three-speaker one (0.51 vs 0.23) — proper-noun + tech-jargon
substitutions in the longer audio drive the gap. Documented inline so
readers don't read the row pair as a uniform improvement vs Whisper.

Class-level `@available(macOS 15, *)` mirrors `Qwen3AsrEngine`'s gate.
The annotation is runtime-only — file compiles fine against the
package's macOS 14 deployment target; XCTest skips the methods at
discovery time on macOS 14 hosts.

Append the new Qwen3 quality class to the alternation filter so the WER
baselines land in `quality-results.json` alongside the WhisperKit,
Parakeet, and diarizer rows.

Adds ~10 s wall when warm. Cold-cache adds ~2-3 min for the ~1.75 GB
CoreML model download — first run after this lands will be slow, but
subsequent runs hit the persistent FluidAudio model cache on the Mini
runner.
@github-actions github-actions Bot added the chore Maintenance or non-functional changes label May 10, 2026
pasrom commented May 10, 2026

Closing — Mini live-run surfaced a real Qwen3-quality bug that we can't paper over. See follow-up plan in repo (not on the PR).

@pasrom pasrom closed this May 10, 2026
@pasrom pasrom deleted the feat/qwen3-quality-tests branch May 10, 2026 19:50
…transcribing

First Mini live-run of `Qwen3AsrEngineQualityTests` returned garbage
tokens — single Cyrillic char "б" on two_speakers_de, two punctuation
chars on three_speakers_de — even though local M3 Max runs produced
reasonable German (WER 0.214 / 0.509) against the same code, same
fixture, same `language="de"` hint.

`Qwen3E2ETests.testTranscribeSegmentsProducesGermanContent` runs green
on the same Mini. The only material difference: it round-trips the
fixture through `AudioMixer.loadAudioFileAsFloat32` + `saveWAV` to a
temp WAV before handing the path to the engine, while the quality test
passed `truth.audioURL` directly. Both files report identical ffprobe
metadata (16 kHz mono PCM s16le); the round-trip rewrites bytes to
Float32 PCM but reports the same shape, so something Qwen3's loader
cares about is normalised by the round-trip.

This commit:
- Adds an optional `audioPathOverride: URL?` parameter to
  `runWERAgainstFixture` (defaults to nil → use truth's audioURL, so
  WhisperKit + Parakeet behaviour is unchanged).
- Mirrors the working E2E test's `resampleFixtureToTemp` pattern
  inline in the Qwen3 quality test, then passes the temp path through
  the override.

Inline (vs sharing the resample helper) so the Mini-bug workaround is
visible at the point of use; if a future investigation roots out the
underlying loader issue and removes the need for the round-trip, the
delete is local to one file.

Local M3 Max: WER unchanged (0.214 / 0.509). Mini retest pending.
@pasrom pasrom reopened this May 10, 2026
Bisect step 2 from the Qwen3-on-Mini garbage-output investigation
plan. If running Qwen3AsrEngineQualityTests alone produces clean
German transcription on Mini, the bug is process-state-leak from
co-loaded WhisperKit / FluidDiarizer / Parakeet models.

DO NOT MERGE THIS COMMIT. Revert before the PR ships either way.