An implementation of continuous emotional speech synthesis in TensorFlow, forked from Keithito's Tacotron implementation.
Azam Rabiee, Tae-Ho Kim, Soo-Young Lee, "Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-speech Synthesizer," accepted at Interspeech 2019 as a Show and Tell demonstration.
In April 2017, Google published a paper, Tacotron: Towards End-to-End Speech Synthesis, where they present a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs.
Here we add emotion to the TTS!
In the first project, categorical emotions were added; for more info, see "Emotional End-to-End Neural Speech Synthesizer". In the second step, we built the continuous emotional TTS!
Emotion is not limited to discrete categories such as happy, sad, angry, fear, disgust, and surprise. Here, each emotion category is projected onto a set of independent dimensions named Pleasure-Arousal-Dominance (PAD). The value of each dimension varies from -1 to 1, such that the neutral emotion lies at the center with all-zero values. You can generate speech with various emotions either by setting an arbitrary PAD vector or by selecting an emotion category.
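As a rough sketch of this idea, each emotion category can be treated as a point in PAD space, while arbitrary PAD vectors are clipped to the valid range. The category-to-PAD values below are illustrative assumptions only, not the values used by the model:

```python
# Hypothetical mapping of emotion categories to Pleasure-Arousal-Dominance
# (PAD) vectors. The exact values depend on the model and training data;
# these numbers are for illustration only.
EMOTION_TO_PAD = {
    "neutral": (0.0, 0.0, 0.0),    # neutral sits at the origin
    "happy":   (0.8, 0.5, 0.4),
    "angry":   (-0.5, 0.8, 0.3),
    "sad":     (-0.6, -0.4, -0.3),
}

def pad_for(emotion=None, pad=None):
    """Return a PAD triple from either a category name or an arbitrary vector."""
    if pad is not None:
        # Any arbitrary PAD, clipped to the valid [-1, 1] range.
        return tuple(max(-1.0, min(1.0, v)) for v in pad)
    return EMOTION_TO_PAD[emotion]
```

For example, `pad_for(pad=(1.5, 0.0, -2.0))` clips the out-of-range values back into [-1, 1].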
For a demo, click here.
You can find waves synthesized with continuous emotions in this video.
- Install Python 3.

- Install the latest version of TensorFlow for your platform. For better performance, install with GPU support if it's available. This code works with TensorFlow 1.12 and later.

- Install the requirements:

  ```
  pip install -r requirements.txt
  ```
- Download the model. A pretrained model for the Korean language is available here.

- Run the demo server:

  ```
  python3 demo_server_gpu.py --checkpoint <path-to-the-pretrained-model>/model.ckpt-385000
  ```

- Point your browser at localhost:9000.
- Select an emotion category or any PAD values.

- Type what you want to synthesize.
- Download an emotional speech dataset. We used an internal Korean emotional speech dataset containing six emotion categories plus neutral speech. You can use other datasets if you convert them to the right format. See TRAINING_DATA.md for more info.

  Note: In our emotional dataset, the wave files are in the `wav_16k` folder, and the filename contains the emotion label. For example, `acriil_hap_m30_1981.wav` means that a 30-year-old man uttered the sentence with ID 1981 with the happy emotion. In addition, all scripts are in `emoTTS_script.txt` as `<sentence-ID> <text>`, one per line.
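A minimal sketch of extracting the metadata fields from such a filename, assuming the field layout shown in the example above (prefix, emotion, gender+age, sentence ID); the actual code in the repository may parse it differently:

```python
# Parse a dataset filename such as "acriil_hap_m30_1981.wav".
# The field layout is assumed from the example in this README.
def parse_wav_name(filename):
    stem = filename.rsplit(".", 1)[0]  # drop the .wav extension
    prefix, emotion, speaker, sent_id = stem.split("_")
    return {
        "emotion": emotion,       # e.g. "hap" for happy
        "gender": speaker[0],     # "m" or "f"
        "age": int(speaker[1:]),  # e.g. 30
        "sentence_id": sent_id,   # e.g. "1981"
    }
```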
- Preprocess the data. Preprocessing prepares triples of the linear spectrogram, the Mel spectrogram, and the text, as `<spec-npy-filename|mel-npy-filename|text>`, for training, then saves them in the folder specified by `--output`. Note that the current implementation (in `datafeeder.py`) assumes that the filename contains the emotion label.

  Note 1: Make sure to trim the silence at the beginning and end of the wave files. You can run

  ```
  python3 ./datasets/trimmer.py
  ```

  with your `in_dir` path pointing to the wave folder. Edit `trimmer.py` for your parameters if needed. `trimmer.py` uses voice activity detection (VAD) to trim the silence.

  Note 2: For the Korean language, make sure to run

  ```
  python3 preprocess_kor_text.py
  ```

  to separate the characters of each syllable, for example from 안녕 to ㅇㅏㄴㄴㅕᆼ. Modify `preprocess_kor_text.py` as needed.

  Run the following command to generate the `mel-*.npy` and `spec-*.npy` files:

  ```
  python3 preprocess.py --dataset emotionalDS
  ```

  Check `preprocess.py` and `emotionalDS.py` for your dataset, and replace the dataset name with yours.
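Hangul syllables can be decomposed into jamo with plain Unicode arithmetic. Below is a minimal sketch of what `preprocess_kor_text.py` is assumed to do; the actual script may differ (for instance, this sketch emits compatibility jamo rather than conjoining jamo):

```python
# Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

def to_jamo(text):
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:            # precomposed Hangul syllable
            idx = code - 0xAC00
            out.append(CHO[idx // 588])         # 588 = 21 vowels * 28 finals
            out.append(JUNG[(idx % 588) // 28])
            out.append(JONG[idx % 28])          # "" when there is no final
        else:
            out.append(ch)                      # pass other characters through
    return "".join(out)

# to_jamo("안녕") -> "ㅇㅏㄴㄴㅕㅇ"
```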
  After this step, the training folder contains the `mel-*.npy` and `spec-*.npy` files, as well as a metadata text file named `train.txt` with the format `<spec-filename|mel-filename|number-of-frames|text>` on each line. Here is an example:

  ```
  spec-neu_f30_0001.npy|mel-neu_f30_0001.npy|495|ㅈㅔㅇㅣㅍㅣㄴㅡㄴ ㅅㅏㅁㄱㅗㅇㄸㅐ ㅈㅔㅊㅓㄹㅅㅗㄹㅡㄹ ㅅㅣㅈㅏㄱㅎㅏㄹ ㄷㅏㅇㅅㅣ ㅊㅗㅇㄹㅣㄷㅗ ㅎㅏㄱㅗㅎㅐ ㅈㅏㅇㅕㄴㅅㅡㄹㅓㅂㄱㅔ ㅈㅏㅈㅜ ㅁㅏㄴㄴㅏㅆㅇㅡㄹ ㅃㅜㄴㅇㅣㅂㄴㅣㄷㅏ.
  ```

  Note 3: Don't forget to shuffle `train.txt` with

  ```
  python3 shuffle_train.txt.py
  ```

  on your desired path. Also point to the shuffled text file in `hparams.py` via the `base_dir` and `input` parameters.
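As a sketch of the metadata handling, assuming the `<spec-filename|mel-filename|number-of-frames|text>` format described above (the helpers below are illustrative, not the repository's actual code):

```python
import random

def parse_meta_line(line):
    # Split one metadata line on the "|" delimiter; frame count is an int.
    spec, mel, n_frames, text = line.strip().split("|")
    return spec, mel, int(n_frames), text

def shuffle_metadata(lines, seed=1234):
    # A deterministic shuffle, similar in spirit to shuffle_train.txt.py.
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled
```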
- Train a model:

  ```
  python3 train.py --name <your-desired-run-name>
  ```

  Tunable hyperparameters are found in `hparams.py`. You can adjust them on the command line using the `--hparams` flag, for example `--hparams="batch_size=16,outputs_per_step=2"`. Hyperparameters should generally be set to the same values at both training and eval time. The default hyperparameters are recommended. See TRAINING_DATA.md for other languages.
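The training code relies on TensorFlow's hyperparameter handling for this; as a standalone illustration of how such an override string can be turned into typed values, here is a minimal sketch (not the repository's actual parser):

```python
# Parse a string like "batch_size=16,outputs_per_step=2" into a dict of
# overrides, coercing values to int or float where possible.
def parse_hparams(spec):
    overrides = {}
    if not spec:
        return overrides
    for pair in spec.split(","):
        key, value = pair.split("=", 1)
        try:
            overrides[key] = int(value)
        except ValueError:
            try:
                overrides[key] = float(value)
            except ValueError:
                overrides[key] = value  # fall back to a string value
    return overrides
```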
- Monitor with TensorBoard (optional):

  ```
  tensorboard --logdir ./logs-<your-desired-run-name>
  ```

  The trainer dumps audio and alignments every 1000 steps. You can find them in `./logs-<your-desired-run-name>`.
- Synthesize from a checkpoint:

  ```
  python3 demo_server_gpu.py --checkpoint ./logs-<your-desired-run-name>/model.ckpt-185000
  ```

  Replace "185000" with the number of the checkpoint you want to use, then open a browser at localhost:9000 and type what you want to speak. Alternatively, you can run `eval.py` from the command line:

  ```
  python3 eval.py --checkpoint ./logs-<your-desired-run-name>/model.ckpt-185000
  ```

  If you set the `--hparams` flag when training, set the same value here.