transcriber-cli

Real-time local audio transcription CLI. Captures live audio from microphone and/or system sources, detects speech via VAD, optionally denoises, and transcribes using Whisper or Parakeet — all locally, no cloud APIs.

Features

  • Dual engines — OpenAI Whisper (tiny through large-v3) and NVIDIA Parakeet (parakeet-tdt-0.6b)
  • Multi-source capture — microphone, system audio (loopback), or both simultaneously
  • Voice Activity Detection — Silero VAD v5 for accurate speech segmentation
  • Audio enhancement — high-pass filtering, peak normalization, RNNoise denoising
  • Hallucination filtering — detects and drops common Whisper artifacts
  • Multiple output formats — console, TXT, SRT subtitles, JSON
  • WebSocket relay — stream results to a remote server (optional relay feature)
  • Auto model download — fetches models from Hugging Face on first use

Installation

cargo build --release

The binary is built as transcriber. On macOS, the build script automatically compiles the Swift helper needed for system audio capture.

Usage

# Transcribe from microphone (default)
transcriber transcribe

# Use a specific model
transcriber transcribe --model large-v3

# Transcribe system audio
transcriber transcribe --mode system

# Transcribe both mic and system audio simultaneously
transcriber transcribe --mode both

# Use Parakeet engine
transcriber transcribe --engine parakeet

# Enable noise reduction
transcriber transcribe --noise-reduce

# Save to file
transcriber transcribe -o output.srt -f srt

# List audio devices
transcriber devices

# List available models
transcriber models

Options

| Option | Default | Description |
| --- | --- | --- |
| `--mode` | `mic` | Audio source: `mic`, `system`, `both` |
| `--engine` | `whisper` | Transcription engine: `whisper`, `parakeet` |
| `--model` | `turbo` | Model name (e.g. `tiny`, `base`, `small`, `turbo`, `large-v3`, `parakeet-tdt-0.6b`) |
| `--language` | `auto` | Language code (e.g. `en`) |
| `--device` | system default | Audio device index or name substring |
| `--compute-device` | `auto` | Backend: `auto`, `cpu`, `cuda` |
| `--compute-type` | `int8` | Precision: `int8`, `float16`, `float32` |
| `-o, --output` | console | Output file path |
| `-f, --format` | `txt` | Output format: `txt`, `srt`, `json` |
| `--vad-threshold` | `0.5` | Speech detection threshold (0.0–1.0) |
| `--noise-reduce` | off | Enable RNNoise denoising |
| `--max-segment` | `3.0` | Max speech duration in seconds before a segment is force-emitted |
| `--relay` | (none) | WebSocket relay URL (requires `--session`) |
| `--session` | (none) | Session code for relay |

Models

Models are cached in ~/.cache/transcriber/models/ and downloaded automatically on first use.

Whisper models:

| Name | Size | Notes |
| --- | --- | --- |
| `tiny` | 75 MB | Fastest, lowest accuracy |
| `base` | 142 MB | |
| `small` | 466 MB | |
| `turbo` | 809 MB | Default; good speed/accuracy tradeoff |
| `medium` | 1.5 GB | |
| `distil-large-v3` | 756 MB | Distilled, English-optimized |
| `large-v3` | 3.1 GB | Best accuracy |

Parakeet models:

| Name | Size | Notes |
| --- | --- | --- |
| `parakeet-tdt-0.6b` | ~600 MB | English-only, 6.05% WER |

Architecture

Audio Source (mic/system)
  → Resampling to 16kHz mono
  → High-pass filter (80Hz) + normalization
  → Silero VAD (speech detection)
  → [Optional] RNNoise denoising
  → Whisper/Parakeet transcription
  → Hallucination filter + dedup/merge
  → Output sinks (console/file/relay)

In both mode, mic and system audio run as independent pipelines in separate threads, with results multiplexed to shared output sinks via crossbeam channels.

Relationship to transcribe-rs

This CLI depends on the transcribe-rs library (included locally at ../transcribe-rs/), which provides the Whisper and Parakeet transcription engines. The CLI handles audio capture, VAD, the processing pipeline, and output; transcribe-rs handles model loading and inference.
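Assuming a standard Cargo path dependency (illustrative; not copied from the actual Cargo.toml), the wiring would look like:

```toml
[dependencies]
transcribe-rs = { path = "../transcribe-rs" }
```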
