PhonemeCVAE

Supervised Contrastive Variational Autoencoders for Phoneme-Disentangled Speech Synthesis

Abstract

The generation of high-quality, natural-sounding speech via text-to-speech (TTS) models is a major aim in the field of audio synthesis. Such models often suffer from entangled representations that blur distinctions between phoneme classes, making precise control of phoneme pronunciation difficult. This paper presents PhonemeCVAE, a novel approach that combines contrastive learning with Variational Autoencoders (VAEs) to produce disentangled latent embeddings for distinct phoneme classes. We demonstrate that by carefully integrating supervised contrastive learning into the VAE paradigm and training a phoneme-conditioned VAE with a Gaussian prior per phoneme class, the latent space achieves much stronger phoneme separability and more compact within-class clustering than training without the contrastive loss. Adding the supervised contrastive loss to the training objective enables the Gaussian priors to learn disentangled phonetic representations, which can then be used at inference time to generate gradually interpolated phonemes.
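The two ingredients described above — a supervised contrastive term on the latents and a KL term against a per-phoneme-class Gaussian prior — can be sketched in a few lines. This is a minimal NumPy illustration of the math only; the function names, the temperature value, and the identity-covariance class priors are assumptions, not the repository's actual implementation.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive (SupCon-style) loss on latent vectors z:
    same-phoneme pairs are pulled together, other phonemes pushed apart."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # compare on the unit sphere
    n = z.shape[0]
    sim = z @ z.T / temperature                        # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    eye = np.eye(n, dtype=bool)
    pos = (labels[:, None] == labels[None, :]) & ~eye  # same-class, non-self pairs
    m = sim.max(axis=1, keepdims=True)                 # numerically stable log-softmax
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    counts = pos.sum(axis=1)
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(counts, 1)
    return per_anchor[counts > 0].mean()               # anchors with at least one positive

def kl_to_class_prior(mu, logvar, prior_mu):
    """Per-sample KL( N(mu, diag(exp(logvar))) || N(prior_mu, I) ) for a
    phoneme-class Gaussian prior with (assumed) identity covariance."""
    return 0.5 * ((mu - prior_mu) ** 2 + np.exp(logvar) - logvar - 1).sum(axis=1)
```

In a setup like this, the total objective would be reconstruction loss plus a weighted sum of the two terms above; the weighting between them is a tunable hyperparameter.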

Architecture overview

(Figure: PhonemeCVAE architecture diagram)

Installation

git clone https://github.com/nina-goes/PhonemeCVAE.git
cd PhonemeCVAE
conda env create -f environment.yml

Audio Samples

Audio samples are generated by interpolating between the centroids of two selected phoneme classes. The interpolation factor α controls the weighting between the source and target class centroids: α = 0.0 samples from the source class and α = 1.0 samples from the target class.
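The interpolated latent is an affine combination of the two class centroids, which also extrapolates for α outside [0, 1] (e.g. α = −0.5 below). A minimal sketch of this step, with hypothetical names not taken from the repository:

```python
import numpy as np

def interpolate_centroids(mu_source, mu_target, alpha):
    """Blend two phoneme-class centroids in latent space.

    alpha = 0.0 -> source centroid, alpha = 1.0 -> target centroid;
    alpha = -0.5 extrapolates past the source (150% source, -50% target).
    """
    return (1.0 - alpha) * np.asarray(mu_source) + alpha * np.asarray(mu_target)
```

The resulting latent would then be fed to the decoder to synthesize the edited phoneme.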

LibriSpeech/test

Transforming /s/ to /ʃ/

| Sample | Input | α = −0.5 (150% /s/, −50% /ʃ/) | α = 0.5 (50% /s/, 50% /ʃ/) | α = 1.0 (0% /s/, 100% /ʃ/) |
| --- | --- | --- | --- | --- |
| 1 | (audio) | (audio) | (audio) | (audio) |

Transforming /æ/ to /ɑ/

| Sample | Input | α = −0.5 (150% /æ/, −50% /ɑ/) | α = 0.5 (50% /æ/, 50% /ɑ/) | α = 1.0 (0% /æ/, 100% /ɑ/) |
| --- | --- | --- | --- | --- |
| 1 | (audio) | (audio) | (audio) | (audio) |

LJSpeech

Transforming /s/ to /ʃ/

| Sample | Input | α = −0.5 (150% /s/, −50% /ʃ/) | α = 0.5 (50% /s/, 50% /ʃ/) | α = 1.0 (0% /s/, 100% /ʃ/) |
| --- | --- | --- | --- | --- |
| 1 | (audio) | (audio) | (audio) | (audio) |

Transforming /ɒ/ to /i/

| Sample | Input | α = −0.5 (150% /ɒ/, −50% /i/) | α = 0.5 (50% /ɒ/, 50% /i/) | α = 1.0 (0% /ɒ/, 100% /i/) |
| --- | --- | --- | --- | --- |
| 1 | (audio) | (audio) | (audio) | (audio) |

Training

# Prepare training data
python scripts/build_manifests_from_mfa.py \
  --librispeech_root LibriSpeech/ \
  --mfa_align_root  train-clean-100/ \
  --out_root        librispeech/ \
  --n_mels 80 \
  --split train-clean-100

# Train model
python train.py --config configs/default.yaml

# Perform phoneme editing
python phoneme_editing.py \
    --checkpoint <your_checkpoint> \
    --phones_json phones.json \
    --test_manifest test-clean_manifest.jsonl \
    --source_phoneme "s" \
    --target_phoneme "ʃ" \
    --edit_type interpolate \
    --alpha 1.0
