Real-time local audio transcription CLI. Captures live audio from microphone and/or system sources, detects speech via VAD, optionally denoises, and transcribes using Whisper or Parakeet — all locally, no cloud APIs.
- Dual engines — OpenAI Whisper (tiny through large-v3) and NVIDIA Parakeet (parakeet-tdt-0.6b)
- Multi-source capture — microphone, system audio (loopback), or both simultaneously
- Voice Activity Detection — Silero VAD v5 for accurate speech segmentation
- Audio enhancement — high-pass filtering, peak normalization, RNNoise denoising
- Hallucination filtering — detects and drops common Whisper artifacts
- Multiple output formats — console, TXT, SRT subtitles, JSON
- WebSocket relay — stream results to a remote server (optional `relay` feature)
- Auto model download — fetches models from Hugging Face on first use
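To make the hallucination-filtering feature concrete, here is a minimal sketch of the idea: drop segments matching known Whisper artifact phrases and collapse exact consecutive repeats. The artifact list and dedup rule here are illustrative assumptions, not the CLI's actual implementation.

```rust
// Sketch of a hallucination filter: drops segments matching known Whisper
// artifact phrases and collapses consecutive duplicates. The phrase list
// below is illustrative, not the CLI's actual list.
fn filter_hallucinations(segments: Vec<String>) -> Vec<String> {
    const ARTIFACTS: &[&str] = &[
        "Thanks for watching!",
        "Subtitles by the Amara.org community",
    ];
    let mut out: Vec<String> = Vec::new();
    for seg in segments {
        let trimmed = seg.trim();
        if trimmed.is_empty() || ARTIFACTS.iter().any(|a| trimmed.eq_ignore_ascii_case(a)) {
            continue; // drop known artifacts and empty segments
        }
        if out.last().map(String::as_str) == Some(trimmed) {
            continue; // drop exact consecutive repeats (a common hallucination)
        }
        out.push(trimmed.to_string());
    }
    out
}

fn main() {
    let segs = vec![
        "Hello there.".to_string(),
        "Hello there.".to_string(),
        "Thanks for watching!".to_string(),
        "Goodbye.".to_string(),
    ];
    let filtered = filter_hallucinations(segs);
    assert_eq!(filtered, vec!["Hello there.", "Goodbye."]);
}
```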
```
cargo build --release
```

The binary is built as `transcriber`. On macOS, the build script automatically compiles the Swift helper needed for system audio capture.
```
# Transcribe from microphone (default)
transcriber transcribe

# Use a specific model
transcriber transcribe --model large-v3

# Transcribe system audio
transcriber transcribe --mode system

# Transcribe both mic and system audio simultaneously
transcriber transcribe --mode both

# Use Parakeet engine
transcriber transcribe --engine parakeet

# Enable noise reduction
transcriber transcribe --noise-reduce

# Save to file
transcriber transcribe -o output.srt -f srt

# List audio devices
transcriber devices

# List available models
transcriber models
```

| Option | Default | Description |
|---|---|---|
| `--mode` | `mic` | Audio source: `mic`, `system`, `both` |
| `--engine` | `whisper` | Transcription engine: `whisper`, `parakeet` |
| `--model` | `turbo` | Model name (e.g. `tiny`, `base`, `small`, `turbo`, `large-v3`, `parakeet-tdt-0.6b`) |
| `--language` | `auto` | Language code (e.g. `en`) |
| `--device` | system default | Audio device index or name substring |
| `--compute-device` | `auto` | Backend: `auto`, `cpu`, `cuda` |
| `--compute-type` | `int8` | Precision: `int8`, `float16`, `float32` |
| `-o`, `--output` | console | Output file path |
| `-f`, `--format` | `txt` | Output format: `txt`, `srt`, `json` |
| `--vad-threshold` | `0.5` | Speech detection threshold (0.0–1.0) |
| `--noise-reduce` | off | Enable RNNoise denoising |
| `--max-segment` | `3.0` | Max speech duration in seconds before force-emit |
| `--relay` | — | WebSocket relay URL (requires `--session`) |
| `--session` | — | Session code for relay |
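To illustrate how `--vad-threshold` and `--max-segment` interact, here is a minimal sketch of turning a stream of frame-level speech probabilities into segments. The frame size and the exact state handling are assumptions for illustration, not the CLI's actual code.

```rust
// Frames above the threshold accumulate into the current segment, which is
// emitted when speech ends or when it hits the max-segment cap (force-emit).
const FRAME_SECS: f32 = 0.032; // ~32 ms VAD frames at 16 kHz (assumed)

fn segment(probs: &[f32], threshold: f32, max_segment_secs: f32) -> Vec<usize> {
    let mut segments = Vec::new(); // emitted segment lengths, in frames
    let mut current = 0usize;
    let max_frames = (max_segment_secs / FRAME_SECS) as usize;
    for &p in probs {
        if p >= threshold {
            current += 1;
            if current >= max_frames {
                segments.push(current); // force-emit long speech (--max-segment)
                current = 0;
            }
        } else if current > 0 {
            segments.push(current); // speech ended: emit the segment
            current = 0;
        }
    }
    if current > 0 {
        segments.push(current); // flush trailing speech
    }
    segments
}

fn main() {
    // 10 speech frames, 2 silence frames, 4 speech frames
    let probs: Vec<f32> = [vec![0.9; 10], vec![0.1; 2], vec![0.8; 4]].concat();
    assert_eq!(segment(&probs, 0.5, 3.0), vec![10, 4]);
}
```

Raising the threshold trades missed quiet speech for fewer false triggers; the force-emit cap keeps latency bounded during long continuous speech.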
Models are cached in `~/.cache/transcriber/models/` and downloaded automatically on first use.
Whisper models:
| Name | Size | Notes |
|---|---|---|
| `tiny` | 75 MB | Fastest, lowest accuracy |
| `base` | 142 MB | |
| `small` | 466 MB | |
| `turbo` | 809 MB | Default — good speed/accuracy tradeoff |
| `medium` | 1.5 GB | |
| `distil-large-v3` | 756 MB | Distilled, English-optimized |
| `large-v3` | 3.1 GB | Best accuracy |
Parakeet models:
| Name | Size | Notes |
|---|---|---|
| `parakeet-tdt-0.6b` | ~600 MB | English-only, 6.05% WER |
```
Audio Source (mic/system)
  → Resampling to 16 kHz mono
  → High-pass filter (80 Hz) + normalization
  → Silero VAD (speech detection)
  → [Optional] RNNoise denoising
  → Whisper/Parakeet transcription
  → Hallucination filter + dedup/merge
  → Output sinks (console/file/relay)
```
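The enhancement stage can be sketched as a one-pole high-pass at 80 Hz followed by peak normalization. The filter topology and target level are assumptions for illustration; the CLI may use a different design.

```rust
use std::f32::consts::PI;

// One-pole high-pass filter (cutoff in Hz), applied in place. Removes DC
// offset and low-frequency rumble below the cutoff.
fn high_pass(samples: &mut [f32], cutoff_hz: f32, sample_rate: f32) {
    let rc = 1.0 / (2.0 * PI * cutoff_hz);
    let dt = 1.0 / sample_rate;
    let alpha = rc / (rc + dt);
    let (mut prev_in, mut prev_out) = (0.0, 0.0);
    for s in samples.iter_mut() {
        let out = alpha * (prev_out + *s - prev_in);
        prev_in = *s;
        prev_out = out;
        *s = out;
    }
}

// Scale the buffer so its absolute peak hits `target` (e.g. 1.0).
fn normalize_peak(samples: &mut [f32], target: f32) {
    let peak = samples.iter().fold(0.0f32, |m, &s| m.max(s.abs()));
    if peak > 0.0 {
        let gain = target / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    // A pure DC offset should decay to near zero after the high-pass settles.
    let mut buf = vec![0.5f32; 1600]; // 100 ms of DC at 16 kHz
    high_pass(&mut buf, 80.0, 16_000.0);
    normalize_peak(&mut buf, 1.0);
    let tail_peak = buf[800..].iter().fold(0.0f32, |m, &s| m.max(s.abs()));
    assert!(tail_peak < 0.2);
}
```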
In `both` mode, mic and system audio run as independent pipelines in separate threads, with results multiplexed to shared output sinks via crossbeam channels.
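The fan-in pattern looks roughly like this: each source runs its own pipeline thread and sends tagged results into one channel that the output sinks drain. The CLI uses crossbeam channels; this sketch uses `std::sync::mpsc` to stay dependency-free, and the `Segment` type is a stand-in, not the CLI's actual type.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in result type tagged with its originating audio source.
#[derive(Debug)]
struct Segment {
    source: &'static str, // "mic" or "system"
    text: String,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Segment>();

    for source in ["mic", "system"] {
        let tx = tx.clone();
        thread::spawn(move || {
            // Stand-in for capture → VAD → transcription on this source.
            tx.send(Segment { source, text: format!("hello from {source}") }).unwrap();
        });
    }
    drop(tx); // close the channel once all pipeline threads are done

    // Output sink: drains the shared channel until every sender hangs up.
    let mut count = 0;
    for seg in rx {
        println!("[{}] {}", seg.source, seg.text);
        count += 1;
    }
    assert_eq!(count, 2);
}
```

Dropping the original sender is what lets the receive loop terminate: the channel closes only when every clone held by a pipeline thread has been dropped.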
This CLI depends on the `transcribe-rs` library (included locally at `../transcribe-rs/`), which provides the Whisper and Parakeet transcription engines. The CLI handles audio capture, VAD, the processing pipeline, and output — `transcribe-rs` handles model loading and inference.