The generation of high-quality, natural-sounding speech with text-to-speech (TTS) models is a major aim in the field of audio synthesis, yet such models often suffer from entangled representations that blur the distinctions between phoneme classes, making precise control of phoneme pronunciation difficult. This paper presents PhonemeCVAE, a novel approach that combines contrastive learning with a Variational Autoencoder (VAE) to produce disentangled latent embeddings for distinct phoneme classes. We demonstrate that by carefully integrating supervised contrastive learning into the VAE paradigm and training a phoneme-conditioned VAE with a Gaussian prior per phoneme class, the latent space achieves much stronger phoneme separability and more compact within-class clustering than training without the contrastive loss. Adding the supervised contrastive loss to our training objective enables the Gaussian priors to learn disentangled phonetic representations that can later be used at inference time to generate gradually interpolated phonemes.
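The objective described above combines a conditional-VAE term, with a per-phoneme-class Gaussian prior, and a supervised contrastive term. The following is a minimal illustrative sketch, not the released code; the function names and the temperature value are our assumptions:

```python
import torch
import torch.nn.functional as F

def kl_to_class_prior(mu, logvar, prior_mu):
    # KL( N(mu, diag(exp(logvar))) || N(prior_mu, I) ), summed over latent dims.
    # Pulls each posterior toward the Gaussian prior of its phoneme class.
    return 0.5 * torch.sum(logvar.exp() + (mu - prior_mu) ** 2 - 1.0 - logvar, dim=-1)

def supervised_contrastive(z, labels, temperature=0.1):
    # SupCon-style loss: latents of the same phoneme class are positives
    # and are pulled together; all other latents act as negatives.
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature                       # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)   # positive-pair mask
    mask.fill_diagonal_(False)                          # exclude self-pairs
    logits = sim - torch.eye(len(z)) * 1e9              # mask self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = mask.sum(1).clamp(min=1)
    return -(log_prob * mask.float()).sum(1).div(pos_counts).mean()
```

In a combined objective, the contrastive term would be added to the reconstruction and KL terms with a weighting hyperparameter.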
```shell
git clone https://github.com/nina-goes/PhonemeCVAE.git
cd PhonemeCVAE
conda env create -f environment.yml
```

Interpolate audio samples between the centroids of two specifically selected phoneme classes. The interpolation factor α controls the weighting between the source and the target class centroid: α = 0.0 samples from the source class and α = 1.0 samples from the target class.
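The interpolation is a linear blend of the two class centroids in latent space; a minimal sketch (the function name is ours):

```python
import numpy as np

def interpolate_centroids(c_source, c_target, alpha):
    """Blend two phoneme-class centroids in latent space.

    alpha = 0.0 returns the source centroid, alpha = 1.0 the target,
    and values outside [0, 1] (e.g. -0.5) extrapolate past a centroid.
    """
    return (1.0 - alpha) * np.asarray(c_source) + alpha * np.asarray(c_target)
```

The interpolated latent vector is then decoded to produce the morphed phoneme audio.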
| Sample | Input | α = -0.5 (150% /s/, -50% /ʃ/) | α = 1.0 (0% /s/, 100% /ʃ/) | α = 0.5 (50% /s/, 50% /ʃ/) |
|---|---|---|---|---|
| 1 | | | | |
| Sample | Input | α = -0.5 (150% /æ/, -50% /ɑ/) | α = 1.0 (0% /æ/, 100% /ɑ/) | α = 0.5 (50% /æ/, 50% /ɑ/) |
|---|---|---|---|---|
| 1 | | | | |
| Sample | Input | α = -0.5 (150% /s/, -50% /ʃ/) | α = 1.0 (0% /s/, 100% /ʃ/) | α = 0.5 (50% /s/, 50% /ʃ/) |
|---|---|---|---|---|
| 1 | | | | |
| Sample | Input | α = -0.5 (150% /ɒ/, -50% /i/) | α = 1.0 (0% /ɒ/, 100% /i/) | α = 0.5 (50% /ɒ/, 50% /i/) |
|---|---|---|---|---|
| 1 | | | | |
```shell
# Prepare training data
python scripts/build_manifests_from_mfa.py \
    --librispeech_root LibriSpeech/ \
    --mfa_align_root train-clean-100/ \
    --out_root librispeech/ \
    --n_mels 80 \
    --split train-clean-100

# Train model
python train.py --config configs/default.yaml
```
```shell
# Perform phoneme editing
python phoneme_editing.py \
    --checkpoint <path/to/checkpoint> \
    --phones_json phones.json \
    --test_manifest test-clean_manifest.jsonl \
    --source_phoneme "s" \
    --target_phoneme "ʃ" \
    --edit_type interpolate \
    --alpha 1.0
```
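To produce the gradual interpolation described above, the editing script can be invoked once per α value. A sketch of such a sweep (the commands are only echoed here, and the checkpoint path is a placeholder):

```shell
# Sweep alpha in steps to morph gradually from /s/ (0.0) to /ʃ/ (1.0).
ALPHAS="0.0 0.25 0.5 0.75 1.0"
for alpha in $ALPHAS; do
  echo "python phoneme_editing.py --checkpoint <path/to/checkpoint> --phones_json phones.json --test_manifest test-clean_manifest.jsonl --source_phoneme s --target_phoneme ʃ --edit_type interpolate --alpha $alpha"
done
```

Dropping the `echo` runs the actual commands once a trained checkpoint is available.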