Real-time speech recognition with speaker diarization, combining the speed of Moonshine ASR with the accuracy of diart's online speaker diarization.
- Fast ASR: Moonshine ONNX models process audio 5x-15x faster than Whisper
- Real-time Diarization: Identifies and separates different speakers as they talk
- Low Latency: Optimized for live transcription with minimal delay
- Lightweight: Runs efficiently on edge devices without GPU
- Voice Activity Detection: Silero VAD ensures accurate speech detection
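For intuition about the VAD gate listed above, here is a toy energy-threshold detector. This is not Silero VAD (which classifies frames with a small neural network); the frame length and threshold here are arbitrary assumptions for illustration:

```python
import math

def detect_speech_frames(samples, frame_len=512, threshold=0.01):
    """Flag frames whose RMS energy exceeds a threshold.

    Crude energy-based stand-in for a learned VAD such as Silero,
    which uses a neural classifier instead of raw energy.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms > threshold)
    return flags

# Silence followed by a 440 Hz tone at 16 kHz: only the tone frames flag.
silence = [0.0] * 1024
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1024)]
print(detect_speech_frames(silence + tone))  # → [False, False, True, True]
```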
cd live-speech
python -m venv speech
source speech/bin/activate  # On Windows: speech\Scripts\activate
pip install -r requirements.txt
Note: PyTorch 2.2.2 is used, but the code supports PyTorch versions 2.2.2 to 2.6.0.
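To check that an installed PyTorch falls inside the supported 2.2.2-2.6.0 range, a small helper (illustrative only, not shipped with the repo) can compare version tuples:

```python
def version_in_range(version, low="2.2.2", high="2.6.0"):
    """Check a dotted version string against the supported PyTorch range.

    Illustrative helper; local build tags such as '2.5.1+cu121' are
    stripped before comparison.
    """
    def parse(v):
        return tuple(int(p) for p in v.split("+")[0].split(".")[:3])
    return parse(low) <= parse(version) <= parse(high)

print(version_in_range("2.5.1+cu121"))  # → True
print(version_in_range("2.7.0"))        # → False
```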
Apply a required patch to fix compatibility with newer huggingface-hub versions:
sed -i 's/use_auth_token=use_auth_token/token=use_auth_token/g' "$(python -c 'import pyannote.audio.core.model; print(pyannote.audio.core.model.__file__)')"
Diart uses PyAnnote models, which require authentication:
# Get your token from https://huggingface.co/settings/tokens
# Accept the user agreements for:
# - https://huggingface.co/pyannote/segmentation
# - https://huggingface.co/pyannote/embedding
export HF_TOKEN=your_token_here
# Or create a .env file with: HF_TOKEN=your_token_here
python live-speech.py
python live-speech.py --no-diarization
# Use tiny model (faster, less accurate)
python live-speech.py --model moonshine/tiny
# Use base model (slower, more accurate) - default
python live-speech.py --model moonshine/base
# List available devices
python -c "import sounddevice; print(sounddevice.query_devices())"
# Use specific device
python live-speech.py --device 1
[2.34s-5.67s] Speaker 1: Hello, how are you doing today?
[6.12s-8.95s] Speaker 2: I'm doing great, thanks for asking!
[9.23s-12.45s] Speaker 1: That's wonderful to hear.
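Each output line follows the pattern [start s-end s] Speaker N: text, so downstream tools can split it with a regular expression (a sketch, assuming the exact format shown above):

```python
import re

# Matches lines like "[2.34s-5.67s] Speaker 1: Hello, ..."
LINE_RE = re.compile(r"\[(\d+\.\d+)s-(\d+\.\d+)s\] (Speaker \d+): (.*)")

def parse_line(line):
    """Return (start, end, speaker, text), or None if the line doesn't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    start, end, speaker, text = m.groups()
    return float(start), float(end), speaker, text

print(parse_line("[2.34s-5.67s] Speaker 1: Hello, how are you doing today?"))
# → (2.34, 5.67, 'Speaker 1', 'Hello, how are you doing today?')
```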
- Audio Capture: Captures live microphone input at 16kHz
- Voice Activity Detection: Silero VAD detects speech segments
- Speech Recognition: Moonshine transcribes detected speech to text
- Speaker Diarization: Diart identifies which speaker is talking
- Alignment: Matches transcriptions to speakers using timestamp overlap
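The alignment step above can be sketched as interval arithmetic: assign each transcribed segment to the speaker whose diarization turns overlap it most. This is a sketch of the idea, not the exact implementation in live-speech.py:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speaker(seg_start, seg_end, turns):
    """Pick the speaker whose turns overlap the segment the most.

    `turns` is a list of (start, end, speaker) tuples from diarization.
    Returns None when nothing overlaps (e.g. diarization lagging behind).
    """
    totals = {}
    for t_start, t_end, speaker in turns:
        ov = overlap(seg_start, seg_end, t_start, t_end)
        if ov > 0:
            totals[speaker] = totals.get(speaker, 0.0) + ov
    return max(totals, key=totals.get) if totals else None

turns = [(0.0, 5.8, "Speaker 1"), (5.9, 9.0, "Speaker 2")]
print(assign_speaker(2.34, 5.67, turns))  # → Speaker 1
```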
If you see hf_hub_download() got an unexpected keyword argument 'use_auth_token', apply the required patch (see Installation step 3):
sed -i 's/use_auth_token=use_auth_token/token=use_auth_token/g' "$(python -c 'import pyannote.audio.core.model; print(pyannote.audio.core.model.__file__)')"
If you encounter undefined symbol errors with torchaudio, there is a PyTorch/torchaudio version mismatch. Reinstall both to ensure compatibility:
# Check your PyTorch version
python -c "import torch; print(torch.__version__)"
# Reinstall matching versions
pip install torch==2.2.2 torchaudio==2.2.2 --force-reinstall
If transcriptions appear without speaker labels:
- Diarization needs roughly 2-3 seconds of audio before it can identify a speaker
- Early transcriptions may time out before a speaker is assigned (6-second default)
- Ensure your HF_TOKEN is set correctly
# Test your microphone
python -c "import sounddevice as sd; import numpy as np; print(sd.rec(16000, samplerate=16000, channels=1, dtype=np.float32))"
# List all audio devices
python -c "import sounddevice; print(sounddevice.query_devices())"
https://github.com/moonshine-ai/moonshine
English models:
@misc{jeffries2024moonshinespeechrecognitionlive,
title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
year={2024},
eprint={2410.15608},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2410.15608},
}
Non-English variants:
@misc{king2025flavorsmoonshinetinyspecialized,
title={Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices},
author={Evan King and Adam Sabra and Manjunath Kudlur and James Wang and Pete Warden},
year={2025},
eprint={2509.02523},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.02523},
}
https://github.com/juanmc2005/diart
@inproceedings{diart,
author={Coria, Juan M. and Bredin, Hervé and Ghannay, Sahar and Rosset, Sophie},
booktitle={2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
title={Overlap-Aware Low-Latency Online Speaker Diarization Based on End-to-End Local Segmentation},
year={2021},
pages={1139-1146},
doi={10.1109/ASRU51503.2021.9688044}
}