feat(transcription): add on-device local transcription via FluidAudio #3

Open
tenequm wants to merge 1 commit into main from feat/local-transcription

tenequm commented Mar 18, 2026

Summary

Add on-device speech recognition and speaker diarization using FluidAudio (Parakeet TDT v3 ASR + offline diarization). Users get speaker-attributed transcripts without any API key, account, or internet connection. Soniox cloud transcription remains available as an alternative.

What changed

New file

  • LocalTranscriptionService.swift - FluidAudio integration: dual-track extraction via AVAssetReader, ASR on both tracks, diarization on both tracks, temporal overlap matching for speaker assignment, merge with sequential integer speaker IDs
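
The "merge with sequential integer speaker IDs" step could look roughly like the sketch below. It is not taken from the PR: `TrackSegment`, `MergedSegment`, and `mergeTracks` are hypothetical names, and the labelling rule (every mic speaker is "You", remote speakers are "Speaker N" with N equal to the merged ID) is an assumption based on the Architecture notes.

```swift
import Foundation

// Hypothetical types for illustration; not the PR's actual definitions.
struct TrackSegment {
    let localSpeaker: Int   // speaker index within one track's diarization
    let start: Double       // seconds from the start of the recording
    let text: String
}

struct MergedSegment: Equatable {
    let speaker: Int        // sequential ID across both tracks
    let start: Double
    let text: String
}

/// Remaps per-track speaker indices to one sequential ID space
/// (mic speakers first, then system/remote speakers) and interleaves
/// the segments by start time.
func mergeTracks(mic: [TrackSegment], system: [TrackSegment])
    -> (segments: [MergedSegment], names: [Int: String]) {
    var names: [Int: String] = [:]
    var nextID = 0

    func remap(_ segments: [TrackSegment], label: (Int) -> String) -> [MergedSegment] {
        var ids: [Int: Int] = [:]   // track-local speaker -> merged ID
        var out: [MergedSegment] = []
        for seg in segments {
            if ids[seg.localSpeaker] == nil {
                ids[seg.localSpeaker] = nextID
                names[nextID] = label(nextID)
                nextID += 1
            }
            out.append(MergedSegment(speaker: ids[seg.localSpeaker]!,
                                     start: seg.start,
                                     text: seg.text))
        }
        return out
    }

    // Assumption: all mic-side speakers are labelled "You".
    let micOut = remap(mic) { _ in "You" }
    let systemOut = remap(system) { id in "Speaker \(id)" }
    let merged = (micOut + systemOut).sorted { $0.start < $1.start }
    return (merged, names)
}
```

Because IDs are assigned in order of first appearance, with the mic track processed first, the local user always receives the lowest IDs whatever order the diarizer reports speakers in.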

Modified files

  • Package.swift - Added FluidAudio v0.12.4 dependency
  • TranscriptionService.swift - TranscriptionProvider enum (.local, .soniox), per-provider sidecarURL/load/save, speakers field on TranscriptDocument, legacy transcript.json migration with fallback
  • MainWindowView.swift - Segmented tab switcher between Local/Soniox, per-provider transcript and status state, split transcription methods, speaker name resolution (transcript.speakers -> metadata.speakers -> default)
  • SettingsView.swift - Default provider picker, model download button with progress
  • AudioMonitor.swift - Auto-transcription hook after recording stops (fires when local provider is default and models are downloaded)
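
The speaker name resolution order noted for MainWindowView.swift (transcript.speakers -> metadata.speakers -> default) amounts to a nil-coalescing chain. A minimal sketch, assuming both maps are keyed by integer speaker ID and that the default label is "Speaker N" (neither detail is confirmed by the PR text); `resolveSpeakerName` is a hypothetical helper name:

```swift
import Foundation

/// Resolves a display name for a speaker, preferring the per-provider
/// transcript names, then recording metadata, then a generated default.
/// The [Int: String] shape and the default format are assumptions.
func resolveSpeakerName(id: Int,
                        transcriptSpeakers: [Int: String]?,
                        metadataSpeakers: [Int: String]?) -> String {
    transcriptSpeakers?[id] ?? metadataSpeakers?[id] ?? "Speaker \(id)"
}
```

With this order, a rename stored in one provider's transcript overrides the shared metadata name only for that provider.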

Architecture

Recording finishes
  -> auto-transcribe (if enabled + models ready)
  -> LocalTranscriptionService:
      1. Extract Track 0 (system) + Track 1 (mic) from M4A
         (prefers audio-processed.m4a when AEC ran)
      2. AVAssetReader per track -> 16kHz mono Float32
      3. ASR each track with AsrManager (Parakeet TDT v3)
      4. Diarize each track with OfflineDiarizerManager
      5. Assign speakers to ASR segments via temporal overlap
      6. Merge: mic speakers first ("You"), then remote ("Speaker N")
      7. Save as transcript-local.json
  -> Single-track fallback if only 1 track exists
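
Step 5 (speaker assignment via temporal overlap) can be illustrated as below. This is a sketch under stated assumptions, not the PR's implementation: `AsrSegment`, `DiarizedTurn`, and `assignSpeaker` are hypothetical and do not mirror FluidAudio's actual result types. Each ASR segment goes to the speaker whose diarization turns share the most time with it.

```swift
import Foundation

// Hypothetical types for illustration; not FluidAudio's API.
struct AsrSegment {
    let text: String
    let start: Double   // seconds
    let end: Double
}

struct DiarizedTurn {
    let speaker: Int
    let start: Double
    let end: Double
}

/// Length in seconds of the intersection of two intervals (0 if disjoint).
func overlap(_ aStart: Double, _ aEnd: Double,
             _ bStart: Double, _ bEnd: Double) -> Double {
    max(0, min(aEnd, bEnd) - max(aStart, bStart))
}

/// Returns the speaker whose turns overlap the segment the most,
/// or nil when no turn overlaps it at all.
func assignSpeaker(to segment: AsrSegment, turns: [DiarizedTurn]) -> Int? {
    var totals: [Int: Double] = [:]
    for turn in turns {
        let shared = overlap(segment.start, segment.end, turn.start, turn.end)
        if shared > 0 { totals[turn.speaker, default: 0] += shared }
    }
    // On equal overlap, prefer the lower speaker ID so results are deterministic.
    return totals.max { a, b in
        a.value != b.value ? a.value < b.value : a.key > b.key
    }?.key
}
```

Summing overlap per speaker, rather than taking the single longest turn, handles a speaker whose turn is split into several short pieces inside one ASR segment.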

What does NOT change

  • Soniox API client (zero modifications)
  • Recording pipeline (AudioRecorder, SCStream, AVAudioEngine)
  • AEC post-processing
  • Playback, waveform, track switcher
  • Transcript renderer (provider-agnostic, takes any TranscriptDocument)
  • Transcript export
  • Onboarding, HUD

Per-provider transcript storage

Each provider writes its own sidecar file:

  • transcript-local.json - local on-device transcription
  • transcript-soniox.json - Soniox cloud transcription
  • Legacy transcript.json is migrated to transcript-soniox.json on first load

Speaker names are stored per-provider in TranscriptDocument.speakers, independent of metadata. Renaming a speaker on one tab does not affect the other.
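
The per-provider sidecar layout and the legacy migration might be sketched as follows. The file names come from the list above; the `sidecarURL(in:)` signature, the `migrateLegacyTranscript` helper, and the fallback behavior (reading the legacy file in place when the move fails) are assumptions, since the PR text only says "migration with fallback".

```swift
import Foundation

enum TranscriptionProvider: String, CaseIterable {
    case local
    case soniox

    /// "transcript-local.json" or "transcript-soniox.json".
    var sidecarFileName: String { "transcript-\(rawValue).json" }

    /// Assumed signature; the PR only names a per-provider `sidecarURL`.
    func sidecarURL(in recordingDirectory: URL) -> URL {
        recordingDirectory.appendingPathComponent(sidecarFileName)
    }
}

/// Hypothetical helper: moves a legacy transcript.json to
/// transcript-soniox.json on first load. Returns the URL to read the
/// Soniox transcript from, or nil if the recording has no transcript.
func migrateLegacyTranscript(in directory: URL,
                             fileManager: FileManager = .default) -> URL? {
    let target = TranscriptionProvider.soniox.sidecarURL(in: directory)
    if fileManager.fileExists(atPath: target.path) { return target }

    let legacy = directory.appendingPathComponent("transcript.json")
    guard fileManager.fileExists(atPath: legacy.path) else { return nil }

    do {
        try fileManager.moveItem(at: legacy, to: target)
        return target
    } catch {
        // Assumed fallback: keep reading the legacy file if the move fails.
        return legacy
    }
}
```

Deriving the file name from the enum's raw value keeps the naming in one place, so a future provider gets its own sidecar by adding a case.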

Smoke test checklist

  • Open app with existing recordings that have transcript.json - verify they appear under Soniox tab
  • Local tab: click Transcribe on a recording - verify model download + diarized transcript
  • Soniox tab: enter API key, Transcribe - verify works exactly as before
  • Switch between Local/Soniox tabs - verify each shows its own transcript
  • Rename a speaker on Local tab - verify Soniox tab speakers are unaffected
  • Settings: change default provider, download models
  • Record a call with local as default + models ready - verify auto-transcription
  • Single-track recording - verify transcription still works
  • Cancel mid-transcription - verify clean state

Add on-device speech recognition and speaker diarization using FluidAudio
(Parakeet TDT v3 ASR + offline diarization). Users get speaker-attributed
transcripts without any API key, account, or internet connection.

Architecture:
- Dual-track independent processing: system audio (Track 0) and mic (Track 1)
  are transcribed and diarized separately, then merged with speaker attribution
  via temporal overlap matching
- Tab switcher UI between Local and Soniox providers (like the track picker)
- Per-provider transcript files (transcript-local.json, transcript-soniox.json)
- Per-provider speaker names stored in TranscriptDocument.speakers field
- Auto-transcription after recording stops (when models are downloaded and
  local is the default provider)
- Legacy transcript.json migrated to transcript-soniox.json on first load

New file:
- LocalTranscriptionService.swift: FluidAudio integration, track extraction,
  ASR, diarization, temporal overlap matching, speaker merge

Modified files:
- Package.swift: FluidAudio dependency
- TranscriptionService.swift: TranscriptionProvider enum, per-provider storage,
  speakers field, legacy migration with fallback
- MainWindowView.swift: tab switcher, per-provider state, split transcription
  methods, speaker name resolution from transcript then metadata
- SettingsView.swift: default provider picker, model download management
- AudioMonitor.swift: auto-transcription hook after recording stops

Existing Soniox transcription is completely untouched (zero changes to API client).
Recording, playback, AEC, and export pipelines are unmodified.