🎙️ live-speech

Real-time speech recognition with speaker diarization, combining the speed of Moonshine ASR with the accuracy of diart speaker diarization.

🌟 Features

  • Fast ASR: Moonshine ONNX models process audio 5x-15x faster than Whisper
  • Real-time Diarization: Identifies and separates different speakers as they talk
  • Low Latency: Optimized for live transcription with minimal delay
  • Lightweight: Runs efficiently on edge devices without a GPU
  • Voice Activity Detection: Silero VAD ensures accurate speech detection

🚀 Installation

1. Clone or download this repository

git clone https://github.com/dwu006/live-speech.git
cd live-speech

2. Install dependencies

python -m venv speech
source speech/bin/activate  # On Windows: speech\Scripts\activate
pip install -r requirements.txt

Note: the pinned requirements install PyTorch 2.2.2, but the code supports any version from 2.2.2 through 2.6.0.
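If you are unsure whether your environment falls in the supported range, a quick stdlib check can verify it (the bounds below mirror the note above; the version parsing is a simple sketch that ignores local build tags like `+cpu`):

```python
# Check whether an installed PyTorch version string falls in the
# supported 2.2.2 - 2.6.0 range. Stdlib-only sketch.
def parse_version(v: str) -> tuple:
    base = v.split("+")[0]  # drop local build tags, e.g. "2.2.2+cpu"
    return tuple(int(p) for p in base.split(".")[:3])

def torch_is_supported(v: str) -> bool:
    return (2, 2, 2) <= parse_version(v) <= (2, 6, 0)

# Usage (pass torch.__version__ from your environment):
print(torch_is_supported("2.2.2+cpu"))  # True
```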

3. Fix Hugging Face compatibility

Apply a required patch to fix compatibility with newer huggingface-hub versions:

sed -i 's/use_auth_token=use_auth_token/token=use_auth_token/g' "$(python -c 'import pyannote.audio.core.model; print(pyannote.audio.core.model.__file__)')"
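Note that `sed -i` behaves differently on macOS (BSD sed expects an argument after `-i`). If the one-liner gives you trouble, the same substitution can be applied with a short, cross-platform Python sketch (it performs exactly the replacement the sed command does; `pyannote.audio` must already be installed for `patch_pyannote` to locate the file):

```python
# Cross-platform alternative to the sed one-liner above: rewrite the
# pyannote.audio model module in place, replacing the deprecated
# `use_auth_token=` keyword with `token=`.
from pathlib import Path

def patch_source(text: str) -> str:
    # The same substitution the sed command performs.
    return text.replace("use_auth_token=use_auth_token", "token=use_auth_token")

def patch_pyannote() -> None:
    import pyannote.audio.core.model as m  # requires pyannote.audio installed
    path = Path(m.__file__)
    path.write_text(patch_source(path.read_text()))
```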

4. Set up Hugging Face token (for speaker diarization)

Diart uses PyAnnote models that require Hugging Face authentication:

# Get your token from https://huggingface.co/settings/tokens
# Accept the user agreements for:
# - https://huggingface.co/pyannote/segmentation
# - https://huggingface.co/pyannote/embedding

export HF_TOKEN=your_token_here
# Or create a .env file with: HF_TOKEN=your_token_here

🎯 Usage

Basic Usage (ASR + Diarization)

python live-speech.py

ASR Only (No Diarization)

python live-speech.py --no-diarization

Choose Moonshine Model

# Use tiny model (faster, less accurate)
python live-speech.py --model moonshine/tiny

# Use base model (slower, more accurate) - default
python live-speech.py --model moonshine/base

Select Audio Device

# List available devices
python -c "import sounddevice; print(sounddevice.query_devices())"

# Use specific device
python live-speech.py --device 1

🎬 Example Output

[2.34s-5.67s] Speaker 1: Hello, how are you doing today?
[6.12s-8.95s] Speaker 2: I'm doing great, thanks for asking!
[9.23s-12.45s] Speaker 1: That's wonderful to hear.
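If you want to post-process a transcript, the output lines follow a fixed `[start-end] Speaker N: text` pattern. A small parser sketch (the exact format is inferred from the example above and may need adjusting if the output changes):

```python
# Parse a transcript line of the form "[2.34s-5.67s] Speaker 1: text".
# The pattern is inferred from the example output above.
import re

LINE_RE = re.compile(
    r"\[(?P<start>[\d.]+)s-(?P<end>[\d.]+)s\] (?P<speaker>[^:]+): (?P<text>.*)"
)

def parse_line(line: str):
    m = LINE_RE.match(line)
    if not m:
        return None
    return {
        "start": float(m["start"]),
        "end": float(m["end"]),
        "speaker": m["speaker"],
        "text": m["text"],
    }
```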

🔧 How It Works

  1. Audio Capture: Captures live microphone input at 16kHz
  2. Voice Activity Detection: Silero VAD detects speech segments
  3. Speech Recognition: Moonshine transcribes detected speech to text
  4. Speaker Diarization: Diart identifies which speaker is talking
  5. Alignment: Matches transcriptions to speakers using timestamp overlap
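Step 5 can be sketched as follows: each transcription span is assigned to the speaker whose diarized segments overlap it the most. The data shapes here (a list of `(start, end, label)` tuples) are assumptions for illustration, not taken from live-speech.py:

```python
# Sketch of the alignment step: pick the speaker label whose diarized
# segments overlap the transcription interval the most.
def overlap(a_start, a_end, b_start, b_end):
    # Length of the intersection of two time intervals (0 if disjoint).
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speaker(t_start, t_end, segments):
    """segments: list of (start, end, label) tuples from diarization."""
    totals = {}
    for s_start, s_end, label in segments:
        totals[label] = totals.get(label, 0.0) + overlap(t_start, t_end, s_start, s_end)
    best = max(totals, key=totals.get, default=None)
    return best if best is not None and totals[best] > 0 else None
```

Summing overlap per label (rather than taking the single longest segment) handles speakers whose turn is split across several short diarization segments.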

🐛 Troubleshooting

Hugging Face Token Error

If you see hf_hub_download() got an unexpected keyword argument 'use_auth_token', apply the required patch (see Installation step 3):

sed -i 's/use_auth_token=use_auth_token/token=use_auth_token/g' "$(python -c 'import pyannote.audio.core.model; print(pyannote.audio.core.model.__file__)')"

PyTorch/torchaudio Version Mismatch

If you encounter undefined symbol errors with torchaudio, there's a version mismatch. Reinstall both to ensure compatibility:

# Check your PyTorch version
python -c "import torch; print(torch.__version__)"

# Reinstall matching versions
pip install torch==2.2.2 torchaudio==2.2.2 --force-reinstall

No Speaker Labels

If transcriptions appear without speaker labels:

  • Diarization needs ~2-3 seconds of audio before it can identify speakers
  • Early transcriptions may time out while waiting for a label (6-second default)
  • Ensure your HF_TOKEN is set correctly

Audio Issues

# Test your microphone
python -c "import sounddevice as sd; import numpy as np; print(sd.rec(16000, samplerate=16000, channels=1, dtype=np.float32))"

# List all audio devices
python -c "import sounddevice; print(sounddevice.query_devices())"

📚 Citations

Moonshine ASR

https://github.com/moonshine-ai/moonshine

English models:

@misc{jeffries2024moonshinespeechrecognitionlive,
      title={Moonshine: Speech Recognition for Live Transcription and Voice Commands}, 
      author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
      year={2024},
      eprint={2410.15608},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.15608}, 
}

Non-English variants:

@misc{king2025flavorsmoonshinetinyspecialized,
      title={Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices}, 
      author={Evan King and Adam Sabra and Manjunath Kudlur and James Wang and Pete Warden},
      year={2025},
      eprint={2509.02523},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.02523}, 
}

Diart Speaker Diarization

https://github.com/juanmc2005/diart

@inproceedings{diart,
  author={Coria, Juan M. and Bredin, Hervé and Ghannay, Sahar and Rosset, Sophie},
  booktitle={2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  title={Overlap-Aware Low-Latency Online Speaker Diarization Based on End-to-End Local Segmentation},
  year={2021},
  pages={1139-1146},
  doi={10.1109/ASRU51503.2021.9688044}
}
