This is the official repository of the WACV 2025 paper "EmoVOCA: Speech-Driven Emotional 3D Talking Heads" by Federico Nocentini, Claudio Ferrari, Stefano Berretti.
🔥🔥 [2025/01/25] Our code is now publicly available! Feel free to explore, use, and contribute!
The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field is blending speech-related motions with expression dynamics, a difficulty primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. While previous works attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective and propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads and a set of 3D expressive sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator show a superior ability to synthesize convincing animations compared with the best performing methods in the literature.
We introduce EmoVOCA, a novel approach for generating a synthetic 3D Emotional Talking Heads dataset that leverages speech tracks, intensity labels, emotion labels, and actor specifications. The proposed dataset can be used to overcome the lack of 3D expressive speech data and to train more accurate emotional 3D talking head generators compared to methods relying on 2D data as a proxy.
Overview of our framework. Two distinct encoders separately process the talking and expressive 3D head displacements, while a common decoder is trained to reconstruct them. At inference, talking and emotional heads are combined by concatenating their encoded latent vectors, and the decoder outputs a combination of their displacements.
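For a code-level intuition of this design, here is a minimal, self-contained PyTorch sketch of the two-encoder / shared-decoder idea. All class names, layer sizes, and the 5023-vertex assumption (vocaset-style meshes) are illustrative assumptions and do not reproduce the repository's actual implementation.

```python
import torch
import torch.nn as nn

class DualEncoderSharedDecoder(nn.Module):
    """Illustrative sketch only: names and layer sizes are assumptions."""

    def __init__(self, n_vertices=5023, latent_dim=256):
        super().__init__()
        in_dim = n_vertices * 3  # per-frame vertex displacements, flattened
        self.talking_encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.expressive_encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        # The shared decoder consumes the concatenation of the two latent codes.
        self.decoder = nn.Sequential(
            nn.Linear(2 * latent_dim, 512), nn.ReLU(), nn.Linear(512, in_dim))

    def forward(self, talking_disp, expressive_disp):
        # Only the inference-time combination is sketched here: encode each
        # sequence, concatenate the latent vectors, decode mixed displacements.
        z_talk = self.talking_encoder(talking_disp)
        z_expr = self.expressive_encoder(expressive_disp)
        return self.decoder(torch.cat([z_talk, z_expr], dim=-1))
```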
```bibtex
@inproceedings{nocentini2024emovocaspeechdrivenemotional3d,
  title     = {EmoVOCA: Speech-Driven Emotional 3D Talking Heads},
  author    = {Federico Nocentini and Claudio Ferrari and Stefano Berretti},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2025},
}
```
This guide provides step-by-step instructions on how to set up the EmoVOCA environment and install all necessary dependencies. The codebase has been tested on Ubuntu 20.04.2 LTS with Python 3.8.
It is recommended to use a Conda environment for this setup.
- **Create a Conda Environment**

  ```bash
  conda create -n emovoca python=3.8.18
  ```

- **Activate the Environment**

  ```bash
  conda activate emovoca
  ```

- **Clone the MPI-IS Repository**

  ```bash
  git clone https://github.com/MPI-IS/mesh.git
  cd mesh
  ```

- **Modify line 7 of the Makefile to avoid an installation error**

  ```makefile
  @pip install --no-deps --config-settings="--boost-location=$$BOOST_INCLUDE_DIRS" --verbose --no-cache-dir .
  ```

- **Run the Makefile**

  ```bash
  make all
  ```
Ensure you have the correct version of PyTorch and torchvision. If you need a different CUDA version, please refer to the official PyTorch website.
- **Install PyTorch, torchvision, and torchaudio**

  ```bash
  conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
  ```

- **Install Requirements**

  ```bash
  pip install -r requirements.txt
  ```
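After completing the steps above, a quick sanity check such as the following (optional, and not part of the official instructions) can confirm that PyTorch and the MPI-IS mesh package are importable:

```python
# Optional sanity check after installation (not part of the official steps).
import torch
from psbody.mesh import Mesh  # provided by the MPI-IS mesh package built above

print("torch version:", torch.__version__)        # expected: 2.1.0
print("CUDA available:", torch.cuda.is_available())
print("psbody.mesh imported:", Mesh.__name__)
```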
For training and testing the EmoVOCA DE-SD framework, we used two open-source 3D facial datasets: vocaset and the Florence 4D Facial Expression Dataset. Please note that you must obtain authorization to use both datasets.
To generate meshes with EmoVOCA, follow these steps:
- Download the vocaset dataset and place it in the `Dataset` folder located in the main directory.
- The meshes used for conditioning vocaset have already been added to the `EmoVOCA_generator/New_Conditions` folder. For additional data, download the Florence 4D dataset.
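Once the data are in place, a face mesh can be inspected with the MPI-IS mesh library installed earlier. The file path below is a hypothetical placeholder; point it to an actual mesh from your download.

```python
# Minimal sketch of loading and inspecting a 3D face mesh with psbody.mesh.
from psbody.mesh import Mesh

# Hypothetical path: replace with a real template/scan from your Dataset folder.
template = Mesh(filename="Dataset/vocaset/example_template.ply")
print("vertices:", template.v.shape)  # (N, 3) vertex coordinates
print("faces:", template.f.shape)     # (M, 3) triangle indices
```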
Pre-generated EmoVOCAv2 sequences are available here. To use them:
- Download the folder and place it inside the `Dataset` folder in the main directory.
- Extract all the files to ensure proper access.
We are releasing three models:
- `emovoca_generator.tar`: The DE-SD framework used to generate EmoVOCA.
- `es2l.tar`: The ES2L framework trained on EmoVOCA.
- `es2d.tar`: The ES2D framework trained on EmoVOCA.

All models are available for download here. After downloading, place the `saves` folder inside each model's directory to ensure proper setup.
Inside the model folders `ES2L` and `ES2D`, you will find both the model definitions and the training code for each component.
In the `EmoVOCA_generator` folder, you will find the code required to generate any version of EmoVOCA.
Within the main directory, there is a file named `demo.py`, which can be used to render outputs based on an emotion label, an intensity value, an audio file, and a 3D face template. Additionally, example files for generation are provided in the `example` folder located in the main directory.
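If you prefer to script the inputs yourself before calling `demo.py`, the sketch below shows one way the four inputs could be prepared in Python. Every path, label, and the commented `generate()` call are assumptions for illustration, not the script's actual interface.

```python
# Illustrative preparation of the four inputs mentioned above:
# a 3D face template, an audio file, an emotion label, and an intensity value.
import librosa
from psbody.mesh import Mesh

template = Mesh(filename="example/face_template.ply")     # hypothetical path
audio, sr = librosa.load("example/speech.wav", sr=16000)   # hypothetical path
emotion = "Happy"   # emotion label (the exact label set is an assumption)
intensity = 2       # expression intensity (scale shown here is illustrative)

# A real run would hand these to the trained generator, e.g. something like:
# animation = model.generate(template, audio, emotion, intensity)
```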
All material is made available under Creative Commons BY-NC 4.0. You can use, redistribute, and adapt the material for non-commercial purposes, as long as you give appropriate credit by citing our paper and indicate any changes that you've made.