Releases: FoxNoseTech/diarize

v0.1.2

06 May 10:03
4f25d27

What's Changed

diarize 0.1.2 focuses on diarization quality, reproducible benchmarks, and clearer accuracy documentation.

Improvements

  • Reduced short-lived speaker label switches by applying temporal smoothing during diarization assembly.
  • Improved automatic speaker-count selection with silhouette refinement plus a small prior favoring larger speaker counts (larger k).
  • Added scripts/benchmark_rttm.py for reproducible audio+RTTM benchmark runs across VoxConverse, AMI, and similar datasets.
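
To illustrate what temporal smoothing of speaker labels can look like, here is a minimal sketch. It is not the implementation shipped in diarize: the function name `smooth_segments`, the `min_duration` threshold, and the tuple layout are all assumptions made for this example. It relabels a short segment when both neighbors belong to the same other speaker, then merges adjacent segments that share a speaker.

```python
from typing import List, Tuple

# (start_s, end_s, speaker); hypothetical layout for this sketch only
Segment = Tuple[float, float, str]


def smooth_segments(segments: List[Segment], min_duration: float = 0.5) -> List[Segment]:
    """Relabel short segments flanked by one speaker, then merge neighbors.

    Assumes segments are sorted by start time and contiguous.
    """
    relabeled = list(segments)
    for i in range(1, len(relabeled) - 1):
        start, end, speaker = relabeled[i]
        prev_spk = relabeled[i - 1][2]
        next_spk = relabeled[i + 1][2]
        # A brief switch sandwiched between two segments of the same speaker
        # is treated as a labeling glitch and absorbed by that speaker.
        if end - start < min_duration and prev_spk == next_spk != speaker:
            relabeled[i] = (start, end, prev_spk)

    merged: List[Segment] = []
    for start, end, speaker in relabeled:
        if merged and merged[-1][2] == speaker:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged
```

With this sketch, a 0.2 s "B" segment between two "A" segments is folded into a single "A" segment, while longer turns are left untouched.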

Benchmarks and Docs

  • Updated VoxConverse dev benchmark numbers:
    • Weighted DER: ~4.8%
    • Speaker count: 125/216 exact, 178/216 within ±1
  • Added preliminary AMI Mix-Headset test validation:
    • Weighted DER: 14.96%
    • Speaker count: 4/16 exact, 8/16 within ±1
  • Documented known limitations around speaker-count errors and speaker label fragmentation.
  • Added a Changelog page to the documentation.
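
The notes above do not spell out how "weighted" DER is aggregated. A common convention, assumed here, is to weight each file's DER by its reference speech duration, which is equivalent to dividing total error time by total reference time. The sketch below shows that arithmetic; `weighted_der` is a hypothetical helper and the sample numbers in the usage note are illustrative, not benchmark results.

```python
from typing import Iterable, Tuple


def weighted_der(per_file: Iterable[Tuple[float, float]]) -> float:
    """Aggregate per-file DER, weighting by reference speech duration.

    per_file: pairs of (der_fraction, reference_speech_seconds).
    """
    rows = list(per_file)
    total_ref = sum(ref for _, ref in rows)
    if total_ref <= 0:
        raise ValueError("total reference speech duration must be positive")
    # der * ref is the error time of one file; sum it and normalize.
    return sum(der * ref for der, ref in rows) / total_ref
```

For example, a 100 s file at 10% DER and a 300 s file at 2% DER aggregate to 4%, whereas a plain average of the two per-file values would give 6%.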

Package

  • Synced package metadata and runtime diarize.__version__ to 0.1.2.

v0.1.1

06 Mar 17:20

This patch release fixes dependency compatibility for audio loading.

Fixed

  • Pinned torch and torchaudio to a compatible range:
    • torch>=1.13,<2.9
    • torchaudio>=0.13,<2.9
  • Prevents failures where newer torchaudio requires torchcodec.

Docs

  • Clarified that diarize now installs a compatible torch/torchaudio range automatically.

No API changes.

v0.1.0 — Initial Release

01 Mar 11:30

diarize v0.1.0

Speaker diarization for Python — answers "who spoke when?" in any audio file. CPU-only, no GPU, no API keys, no account signup.

Highlights

  • ~10.8% DER on VoxConverse dev set — lower than pyannote's free models (community-1 and 3.1 legacy, both ~11.2%)
  • ~8x faster than real-time on CPU (RTF 0.12 vs pyannote community-1's 0.86)
  • Automatic speaker count detection via GMM BIC with silhouette refinement (1–7 speakers)
  • Zero setup friction: pip install diarize and you're done, no HuggingFace token or account needed
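
The real-time factor (RTF) is processing time divided by audio duration, so the speedup over real time is its reciprocal. The short sketch below works through the figures quoted above; the helper names are made up for this example.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return how many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock processing time for a recording."""
    return audio_seconds * rtf


# RTF 0.12 is roughly 8.3x faster than real time; RTF 0.86 is roughly 1.2x.
print(f"{speedup_from_rtf(0.12):.1f}x vs {speedup_from_rtf(0.86):.1f}x")
# A one-hour recording: about 432 s at RTF 0.12, about 3096 s at RTF 0.86.
print(processing_seconds(3600, 0.12), processing_seconds(3600, 0.86))
```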

Pipeline

Silero VAD → WeSpeaker ResNet34-LM (ONNX) → GMM BIC → Spectral Clustering

All four stages run on CPU. All components are open-source with permissive licenses.

Usage

from diarize import diarize

result = diarize("meeting.wav")
for seg in result.segments:
    print(f"  [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
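
Diarization benchmarks such as VoxConverse distribute their annotations as RTTM files, so it can be handy to write segments out in that format. The sketch below is one way to do it, assuming only the `start`, `end`, and `speaker` attributes used in the example above. `to_rttm_lines` and the `Seg` stand-in class are hypothetical names for this illustration; whether diarize offers its own export helper is not stated here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Seg:
    """Stand-in for a diarize segment; only start, end, speaker are assumed."""
    start: float
    end: float
    speaker: str


def to_rttm_lines(segments: Iterable, file_id: str) -> List[str]:
    """Format segments as RTTM SPEAKER records (onset and duration in seconds)."""
    lines = []
    for seg in segments:
        duration = seg.end - seg.start
        # RTTM fields: type, file id, channel, onset, duration,
        # orthography, speaker type, speaker name, confidence, lookahead.
        lines.append(
            f"SPEAKER {file_id} 1 {seg.start:.3f} {duration:.3f} "
            f"<NA> <NA> {seg.speaker} <NA> <NA>"
        )
    return lines
```

Because the function only reads three attributes, passing `result.segments` from the usage example should work as long as those attributes are numeric seconds and a speaker label string.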

Known Limitations

  • Benchmarked on a single dataset (VoxConverse). Cross-dataset validation is planned.
  • Speaker count estimation degrades for 8+ speakers — pass num_speakers explicitly when known.
  • Overlapping speech is not modeled — each segment is assigned to one speaker.