Releases: FoxNoseTech/diarize
v0.1.2
What's Changed
diarize 0.1.2 focuses on diarization quality, reproducible benchmarks, and clearer accuracy documentation.
Improvements
- Reduced short speaker label switching with temporal smoothing during diarization assembly.
- Improved automatic speaker-count selection with silhouette refinement plus a small larger-k prior.
- Added `scripts/benchmark_rttm.py` for reproducible audio+RTTM benchmark runs across VoxConverse, AMI, and similar datasets.
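The temporal smoothing mentioned above can be pictured with a minimal sketch. This is not the library's implementation: the `(start, end, speaker)` tuple layout, the function name, and the `min_dur` and `max_gap` thresholds are all assumptions chosen for illustration. It relabels a short segment that sits between two segments of the same speaker, then merges neighbouring segments that share a label.

```python
from typing import List, Tuple

# (start_sec, end_sec, speaker_label): a hypothetical layout used only in this sketch.
Segment = Tuple[float, float, str]


def smooth_short_switches(
    segments: List[Segment],
    min_dur: float = 0.5,  # assumed threshold: shorter "blips" are candidates for relabeling
    max_gap: float = 0.1,  # assumed threshold: merge same-speaker neighbours closer than this
) -> List[Segment]:
    """Relabel short A-B-A switches to A, then merge adjacent same-speaker segments.

    Expects segments sorted by start time and non-overlapping.
    """
    relabeled = list(segments)
    for i in range(1, len(relabeled) - 1):
        start, end, speaker = relabeled[i]
        prev_speaker = relabeled[i - 1][2]
        next_speaker = relabeled[i + 1][2]
        is_short = (end - start) < min_dur
        if is_short and prev_speaker == next_speaker and speaker != prev_speaker:
            relabeled[i] = (start, end, prev_speaker)

    merged: List[Segment] = []
    for start, end, speaker in relabeled:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged


example = [(0.0, 2.0, "A"), (2.0, 2.2, "B"), (2.2, 5.0, "A"), (5.0, 8.0, "B")]
smoothed = smooth_short_switches(example)
```

In this example the 0.2 s "B" segment is absorbed into the surrounding "A" speech, while the 3 s "B" segment at the end is left alone.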
Benchmarks and Docs
- Updated VoxConverse dev benchmark numbers:
  - Weighted DER: ~4.8%
  - Speaker count: 125/216 exact, 178/216 within ±1
- Added preliminary AMI Mix-Headset test validation:
  - Weighted DER: 14.96%
  - Speaker count: 4/16 exact, 8/16 within ±1
- Documented known limitations around speaker-count errors and speaker label fragmentation.
- Added a Changelog page to the documentation.
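For readers comparing these figures, the sketch below shows one common way to aggregate DER across files. It assumes that "weighted" means pooling by reference speech duration (total error time divided by total reference speech time), which the notes do not spell out. The function name and the per-file numbers are made up for illustration and are not benchmark data.

```python
from typing import Iterable, Tuple


def weighted_der(per_file: Iterable[Tuple[float, float]]) -> float:
    """Duration-weighted DER over several files.

    Each item is (error_seconds, reference_speech_seconds), where error_seconds
    is missed speech + false alarm + speaker confusion for that file.
    """
    total_error = 0.0
    total_reference = 0.0
    for error_seconds, reference_seconds in per_file:
        total_error += error_seconds
        total_reference += reference_seconds
    if total_reference <= 0:
        raise ValueError("total reference speech duration must be positive")
    return total_error / total_reference


# Illustrative values only: a short clean file and a longer, harder one.
files = [(3.0, 100.0), (30.0, 200.0)]
pooled = weighted_der(files)                                # 33 s error over 300 s speech
unweighted = sum(err / ref for err, ref in files) / len(files)  # plain mean of per-file DER
```

Under this definition long recordings dominate the score, so the pooled value can differ noticeably from a plain average of per-file DERs.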
Package
- Synced package metadata and runtime `diarize.__version__` to `0.1.2`.
v0.1.1
This patch release fixes dependency compatibility for audio loading.
Fixed
- Pinned `torch` and `torchaudio` to a compatible range:
  - `torch>=1.13,<2.9`
  - `torchaudio>=0.13,<2.9`
- Prevents failures where newer `torchaudio` requires `torchcodec`.
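When debugging an existing environment, it can help to check whether the installed versions fall inside the pinned ranges. The following is a rough standard-library sketch, not part of diarize: the helper names are hypothetical, and the version parsing is simplified (it compares only the numeric release part and ignores local suffixes such as `+cpu`).

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.8.0+cpu' -> (2, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def in_range(version: str, lower: Tuple[int, ...], upper: Tuple[int, ...]) -> bool:
    """True if lower <= version < upper, comparing numeric release tuples."""
    return lower <= release_tuple(version) < upper


def installed_in_range(
    package: str, lower: Tuple[int, ...], upper: Tuple[int, ...]
) -> Optional[bool]:
    """Check an installed package; returns None if it is not installed."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return in_range(version, lower, upper)


# Ranges taken from the pins listed above.
torch_ok = installed_in_range("torch", (1, 13), (2, 9))
torchaudio_ok = installed_in_range("torchaudio", (0, 13), (2, 9))
```

For anything beyond a quick check, a full PEP 440 parser is a better choice than this simplified comparison.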
Docs
- Clarified that diarize now installs a compatible torch/torchaudio range automatically.
No API changes.
v0.1.0 — Initial Release
diarize v0.1.0
Speaker diarization for Python — answers "who spoke when?" in any audio file. CPU-only, no GPU, no API keys, no account signup.
Highlights
- ~10.8% DER on VoxConverse dev set — lower than pyannote's free models (community-1 and 3.1 legacy, both ~11.2%)
- ~8x faster than real-time on CPU (RTF 0.12 vs pyannote community-1's 0.86)
- Automatic speaker count detection via GMM BIC with silhouette refinement (1–7 speakers)
- Zero setup friction: `pip install diarize` and you're done, no HuggingFace token or account needed
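The speed figures above follow from the usual definition of real-time factor (RTF): processing time divided by audio duration, so the speedup over real time is `1 / RTF`. The short sketch below works through that arithmetic with the two RTF values quoted in the highlights. The function names are illustrative only.

```python
def speedup_over_real_time(rtf: float) -> float:
    """Speedup relative to real time; RTF = processing_time / audio_duration."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated processing time for a recording of the given length."""
    return audio_seconds * rtf


one_hour = 3600.0
fast = speedup_over_real_time(0.12)               # roughly 8.3x faster than real time
slow = speedup_over_real_time(0.86)               # roughly 1.2x faster than real time
fast_seconds = processing_seconds(one_hour, 0.12)  # about 432 s, a little over 7 minutes
slow_seconds = processing_seconds(one_hour, 0.86)  # about 3096 s, a little under 52 minutes
```

Actual timings depend on the CPU and the recording, so these numbers are estimates derived from the reported RTFs rather than guarantees.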
Pipeline
Silero VAD → WeSpeaker ResNet34-LM (ONNX) → GMM BIC → Spectral Clustering
All four stages run on CPU. All components are open-source with permissive licenses.
Usage
```python
from diarize import diarize

result = diarize("meeting.wav")
for seg in result.segments:
    print(f"  [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
```
Known Limitations
- Benchmarked on a single dataset (VoxConverse). Cross-dataset validation is planned.
- Speaker count estimation degrades for 8+ speakers — pass num_speakers explicitly when known.
- Overlapping speech is not modeled — each segment is assigned to one speaker.
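Because each segment carries exactly one speaker label, per-speaker talk time reduces to summing segment durations. The sketch below illustrates this using only the `start`, `end`, and `speaker` attributes shown in the Usage example. The `Segment` dataclass is a stand-in defined here so the snippet runs on its own; it is not the library's type, and the speaker labels are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Segment:
    """Stand-in for a diarization segment (hypothetical type for this sketch)."""
    start: float  # seconds
    end: float    # seconds
    speaker: str


def talk_time(segments: Iterable[Segment]) -> Dict[str, float]:
    """Total speaking time in seconds per speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Placeholder segments; with the library, result.segments would be passed instead.
demo = [
    Segment(0.0, 4.0, "A"),
    Segment(4.0, 6.5, "B"),
    Segment(6.5, 8.0, "A"),
]
totals = talk_time(demo)
```

In regions where people actually talk over each other, only one speaker receives credit, so these totals undercount overlapped speech.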