This is the official repository of the IEEE SLT 2024 paper Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT.
Set up a Python environment:

```shell
conda create -y -n py310 python=3.10.14 pip=24.0
conda activate py310
pip install -r requirements/requirements.txt
sh scripts/setup.sh
```
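To sanity-check the environment, a quick sketch (assuming requirements.txt installs PyTorch and torchaudio, which the usage example below relies on):

```python
import torch
import torchaudio

print(torch.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())  # the example below calls .cuda()
```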
Encode a waveform into pseudo-syllabic units with a pretrained model:

```python
import torchaudio

from src.s5hubert import S5HubertForSyllableDiscovery

wav_path = "/path/to/wav"

# download a pretrained model from the Hugging Face Hub
model = S5HubertForSyllableDiscovery.from_pretrained("ryota-komatsu/s5-hubert").cuda()

# load a waveform and resample it to 16 kHz
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode the waveform into pseudo-syllabic units
outputs = model(waveform.cuda())

# pseudo-syllabic units
units = outputs["units"]  # [3950, 67, ..., 503]
```
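The same API can be looped over a whole directory. A minimal sketch using only the calls shown above (the input directory, file extension, and output path are placeholders):

```python
import json
from pathlib import Path

import torchaudio

from src.s5hubert import S5HubertForSyllableDiscovery

model = S5HubertForSyllableDiscovery.from_pretrained("ryota-komatsu/s5-hubert").cuda()

units = {}
for wav_path in sorted(Path("/path/to/wav/dir").rglob("*.flac")):  # placeholder directory
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    outputs = model(waveform.cuda())
    units[wav_path.stem] = [int(u) for u in outputs["units"]]  # pseudo-syllabic units

# save the units to disk (placeholder output path)
with open("units.json", "w") as f:
    json.dump(units, f)
```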
A Google Colab demo is available here.
You can download a pretrained model from Hugging Face.
Other models can be downloaded from the old repository.
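If you prefer to fetch a checkpoint explicitly (e.g., for offline use), the huggingface_hub library can download the whole model repo; a sketch using the repo ID from the example above:

```python
from huggingface_hub import snapshot_download

# download the model repo into the local Hugging Face cache and return its path
local_dir = snapshot_download("ryota-komatsu/s5-hubert")
print(local_dir)

# from_pretrained typically also accepts a local path instead of a repo ID:
# model = S5HubertForSyllableDiscovery.from_pretrained(local_dir)
```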
If you already have LibriSpeech, you can use it by editing the config file:
```yaml
dataset:
  root: "/path/to/LibriSpeech/root"  # ${dataset.root}/LibriSpeech/train-clean-100, train-clean-360, ...
```
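For example, a minimal sketch that points the config at an existing copy (assuming the config is plain YAML; the LibriSpeech path is a placeholder):

```python
import yaml

with open("configs/default.yaml") as f:
    config = yaml.safe_load(f)

# point the dataset root at an existing LibriSpeech copy (placeholder path)
config["dataset"]["root"] = "/path/to/LibriSpeech/root"

# note: yaml.safe_dump drops comments; editing the file by hand preserves them
with open("configs/default.yaml", "w") as f:
    yaml.safe_dump(config, f)
```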
Otherwise, you can download a fresh copy under `dataset_root`:
```shell
dataset_root=data  # be consistent with dataset.root in the config file
sh scripts/download_librispeech.sh ${dataset_root}
```
Check the directory structure:

```
${dataset.root}  # dataset.root in the config file
└── LibriSpeech/
    ├── train-clean-100/
    ├── train-clean-360/
    ├── train-other-500/
    ├── dev-clean/
    ├── dev-other/
    ├── test-clean/
    ├── test-other/
    └── SPEAKERS.TXT
```
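A quick sanity check of this layout, as a sketch (adjust the root to match `dataset.root` in your config):

```python
from pathlib import Path

root = Path("/path/to/LibriSpeech/root")  # dataset.root in the config file
expected = [
    "train-clean-100",
    "train-clean-360",
    "train-other-500",
    "dev-clean",
    "dev-other",
    "test-clean",
    "test-other",
]
for split in expected:
    path = root / "LibriSpeech" / split
    print(f"{path}: {'ok' if path.is_dir() else 'missing'}")
```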
Train a model:

```shell
python main.py --config configs/default.yaml
```
If you use this repository, please cite the paper:

```bibtex
@inproceedings{Komatsu_Self-Supervised_Syllable_Discovery_2024,
  author    = {Komatsu, Ryota and Shinozaki, Takahiro},
  title     = {Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT},
  year      = {2024},
  month     = dec,
  booktitle = {IEEE Spoken Language Technology Workshop},
  pages     = {1131--1136},
  doi       = {10.1109/SLT61566.2024.10832325},
}
```