
AzamRabiee/Continuous-Emotional-TTS


Continuous Emotional TTS

An implementation of continuous emotional speech synthesis in TensorFlow, forked from Keithito's Tacotron implementation

Azam Rabiee, Tae-Ho Kim, and Soo-Young Lee, "Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-Speech Synthesizer," accepted at Interspeech 2019 as a show-and-tell demonstration.

Background

In April 2017, Google published a paper, Tacotron: Towards End-to-End Speech Synthesis, where they present a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs.

Here we add emotion to the TTS!

In the first project, categorical emotions were added; for more information, see "Emotional End-to-End Neural Speech Synthesizer". In the second step, we built the continuous emotional TTS!

Emotion is not limited to discrete categories such as happy, sad, angry, fear, disgust, and surprise. Here, each emotion category is projected onto a set of independent dimensions named Pleasure-Arousal-Dominance (PAD). The value of each dimension varies from -1 to 1, with the neutral emotion at the center (all zeros). You can therefore generate speech with various emotions either by setting an arbitrary PAD vector or by selecting an emotion category.
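The category-to-PAD projection can be sketched as a simple lookup plus clamping. The coordinate values below are illustrative assumptions for demonstration only; the actual projection used by the model may differ.

```python
# Illustrative mapping from emotion categories to (pleasure, arousal, dominance).
# These coordinates are assumptions, not the paper's values.
EMOTION_TO_PAD = {
    "neutral": ( 0.0,  0.0,  0.0),
    "happy":   ( 0.8,  0.5,  0.4),
    "sad":     (-0.6, -0.4, -0.3),
    "angry":   (-0.5,  0.7,  0.3),
    "fear":    (-0.6,  0.6, -0.6),
}

def pad_for(emotion=None, pad=None):
    """Return a PAD triple, clamped to [-1, 1], from either a category
    name or an explicit (p, a, d) tuple."""
    p, a, d = pad if pad is not None else EMOTION_TO_PAD[emotion]
    clamp = lambda x: max(-1.0, min(1.0, x))
    return (clamp(p), clamp(a), clamp(d))
```

Selecting a category is then just a convenience over setting the PAD vector directly, which is what makes arbitrary mixtures possible.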

Demo

For a demo, click here.

Continuous Emotional samples

You can find waves synthesized with continuous emotions in this video.

Quick Start

Installing dependencies

  1. Install Python 3.

  2. Install the latest version of TensorFlow for your platform. For better performance, install with GPU support if it's available. This code works with TensorFlow 1.12 and later.

  3. Install requirements:

    pip install -r requirements.txt
    

Using a pre-trained model

A pretrained model for the Korean language is available here.

  1. Download the model

  2. Run the demo server:

    python3 demo_server_gpu.py --checkpoint <path-to-the-pretrained-model>/model.ckpt-385000
    
  3. Point your browser at localhost:9000

    • Select an emotion category or set arbitrary PAD values
    • Type what you want to synthesize

Training

  1. Download an emotional speech dataset. We have used an internal Korean emotional speech dataset containing 6 emotional categories plus neutral speech. You can use other datasets if you convert them to the right format. See TRAINING_DATA.md for more info.

    Note: In our emotional dataset, the wave files are in the wav_16k folder; each filename contains the emotion label. For example, acriil_hap_m30_1981.wav means that a 30-year-old man uttered sentence ID 1981 with happy emotion. In addition, all scripts are in emoTTS_script.txt, one <sentence-ID> <text> pair per line.
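    The filename convention above can be parsed with a small helper. The field layout here is inferred from the single example in this README, so treat the pattern as an assumption and adjust it for your own corpus.

```python
import re

# Parse filenames like acriil_hap_m30_1981.wav into
# (emotion, sex, age, sentence ID). Layout inferred from the README example.
FILENAME_RE = re.compile(
    r"acriil_(?P<emotion>[a-z]+)_(?P<sex>[mf])(?P<age>\d+)_(?P<sid>\d+)\.wav")

def parse_name(filename):
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError("unexpected filename: %s" % filename)
    return m.group("emotion"), m.group("sex"), int(m.group("age")), m.group("sid")
```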

  2. Preprocess the data. Preprocessing prepares triples of linear spectrogram, Mel spectrogram, and text, as <spec-npy-filename|mel-npy-filename|text>, for training, then saves them in a folder specified by --output. Please note that the current implementation (in datafeeder.py) assumes that the filename contains the emotion label.

    Note 1: Make sure to trim the silence at the beginning and end of the wave files. You can run python3 ./datasets/trimmer.py with your in_dir path pointing to the wave folder. Edit trimmer.py to change its parameters if needed. trimmer.py uses voice activity detection (VAD) to trim the silence.
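    The idea behind trimming can be sketched without a full VAD: drop leading and trailing frames whose energy falls below a threshold. This is only an illustration, assuming 16 kHz mono audio as a float NumPy array; the repo's trimmer.py uses a real VAD instead.

```python
import numpy as np

# Minimal energy-based silence trimmer (illustrative; not the repo's VAD).
def trim_silence(wav, frame_len=400, threshold=1e-3):
    n_frames = len(wav) // frame_len
    energy = np.array([np.mean(wav[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    voiced = np.where(energy > threshold)[0]
    if len(voiced) == 0:
        return wav[:0]                       # all silence
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return wav[start:end]
```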

    Note 2: For the Korean language, make sure to run python3 preprocess_kor_text.py to decompose each syllable into its characters (jamo), for example 안녕 into ㅇㅏㄴㄴㅕㅇ. Modify preprocess_kor_text.py as needed.
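    The decomposition itself follows the standard Unicode arithmetic for composed Hangul syllables (U+AC00..U+D7A3). This standalone sketch shows the operation; the repo's preprocess_kor_text.py may differ in details.

```python
# Hangul syllable-to-jamo decomposition via Unicode block arithmetic.
LEADS  = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS  = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def to_jamo(text):
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                    # composed Hangul syllable
            out.append(LEADS[code // 588])       # 588 = 21 vowels * 28 tails
            out.append(VOWELS[(code % 588) // 28])
            out.append(TAILS[code % 28])         # tail may be empty
        else:
            out.append(ch)                       # spaces, punctuation, etc.
    return "".join(out)
```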

    Run the following command to make mel-*.npy and spec-*.npy files.

    python3 preprocess.py --dataset emotionalDS
    

    Check preprocess.py and emotionalDS.py against your dataset, and replace the dataset name with yours.

    After this step, the training folder contains the mel-*.npy and spec-*.npy files, as well as a metadata text file named train.txt, with one <spec-filename|mel-filename|number-of-frames|text> entry per line. Here is an example:

    spec-neu_f30_0001.npy|mel-neu_f30_0001.npy|495|ㅈㅔㅇㅣㅍㅣㄴㅡㄴ ㅅㅏㅁㄱㅗㅇㄸㅐ ㅈㅔㅊㅓㄹㅅㅗㄹㅡㄹ ㅅㅣㅈㅏㄱㅎㅏㄹ ㄷㅏㅇㅅㅣ ㅊㅗㅇㄹㅣㄷㅗ ㅎㅏㄱㅗㅎㅐ ㅈㅏㅇㅕㄴㅅㅡㄹㅓㅂㄱㅔ ㅈㅏㅈㅜ ㅁㅏㄴㄴㅏㅆㅇㅡㄹ ㅃㅜㄴㅇㅣㅂㄴㅣㄷㅏ.
    
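    A line in that pipe-delimited format can be split back into its four fields as follows; this is only a sketch of the layout described above, not the repo's datafeeder code.

```python
# Parse one train.txt line of the form
# <spec-filename|mel-filename|number-of-frames|text>.
def parse_metadata_line(line):
    spec, mel, n_frames, text = line.rstrip("\n").split("|", 3)
    return spec, mel, int(n_frames), text
```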

    Note 3: Don't forget to shuffle train.txt with python3 shuffle_train.txt.py on your desired path. Also, point the base_dir and input parameters in hparams.py at the shuffled text file.
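    The shuffling step amounts to permuting the lines of the metadata file. This sketch assumes that is all shuffle_train.txt.py does; a fixed seed is used here so the result is reproducible.

```python
import random

# Permute the lines of a metadata file (illustrative version of the
# shuffle step; paths and seed are up to you).
def shuffle_metadata(in_path, out_path, seed=1234):
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)   # deterministic for a given seed
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
```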

  3. Train a model

    python3 train.py --name <your-desired-run-name>
    

    Tunable hyperparameters are found in hparams.py. You can adjust these at the command line using the --hparams flag, for example --hparams="batch_size=16,outputs_per_step=2". Hyperparameters should generally be set to the same values at both training and eval time. The default hyperparameters are recommended. See TRAINING_DATA.md for other languages.
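    In the TF 1.x code, the --hparams string is handled by tf.contrib.training.HParams.parse; a plain-Python sketch of that comma-separated key=value parsing looks like this (illustrative only):

```python
# Parse an override string like "batch_size=16,outputs_per_step=2" into a
# dict, coercing numeric values. A sketch of what HParams.parse does.
def parse_hparams(overrides):
    result = {}
    for pair in overrides.split(","):
        if not pair:
            continue
        key, value = pair.split("=", 1)
        try:
            value = int(value)
        except ValueError:
            try:
                value = float(value)
            except ValueError:
                pass                      # keep non-numeric values as strings
        result[key.strip()] = value
    return result
```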

  4. Monitor with Tensorboard (optional)

    tensorboard --logdir ./logs-<your-desired-run-name>
    

    The trainer dumps audio and alignments every 1000 steps. You can find these in ./logs-<your-desired-run-name>.

  5. Synthesize from a checkpoint

    python3 demo_server_gpu.py --checkpoint ./logs-<your-desired-run-name>/model.ckpt-185000
    

    Replace "185000" with the checkpoint number that you want to use, then open a browser to localhost:9000 and type what you want to speak. Alternately, you can run eval.py at the command line:

    python3 eval.py --checkpoint ./logs-<your-desired-run-name>/model.ckpt-185000
    

    If you set the --hparams flag when training, set the same value here.
