Recent speech-to-speech (S2S) models can generate intelligible speech but often lack natural expressiveness, largely due to the absence of a reliable evaluation metric. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a novel framework that converts human preferences for speech expressiveness into an objective score.
Grounded in phonetics and psychology, DeEAR evaluates speech across three core dimensions: Emotion, Prosody, and Spontaneity. It achieves strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples.
Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. We applied DeEAR to build ExpressiveSpeech, a high-quality expressive-speech dataset, and used that dataset to fine-tune an S2S model, improving its overall expressiveness score from 2.0 to 23.4 (on a 100-point scale).
- Multi-dimensional Objective Scoring: Decomposes speech expressiveness into quantifiable dimensions of Emotion, Prosody, and Spontaneity.
- Strong Alignment with Human Perception: Achieves a Spearman's rank correlation coefficient (SRCC) of 0.86 with human ratings for overall expressiveness (a minimal SRCC computation is sketched after this list).
- Data-Efficient and Scalable: Requires minimal annotated data, making it practical for deployment and scaling.
- Dual Applications:
  - Automated Model Benchmarking: Ranks SOTA models in near-perfect agreement with human rankings (SRCC = 0.96).
  - Evaluation-Driven Data Curation: Efficiently filters and curates high-quality, expressive speech datasets.
- Release of ExpressiveSpeech Dataset: A new large-scale, bilingual (English-Chinese) dataset containing ~14,000 utterances of highly expressive speech.
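For reference, the SRCC figures above are ordinary Spearman rank correlations. A minimal sketch of how such a number is computed, using made-up scores rather than the paper's data:

```python
# Minimal sketch: Spearman rank correlation (SRCC) between automatic
# expressiveness scores and human ratings. All values are made up.
from scipy.stats import spearmanr

deear_scores = [23.4, 41.0, 55.2, 60.1, 78.9]  # hypothetical DeEAR scores
human_ratings = [1.8, 2.6, 3.1, 3.4, 4.5]      # hypothetical MOS-style ratings

srcc, pvalue = spearmanr(deear_scores, human_ratings)
print(f"SRCC = {srcc:.2f} (p = {pvalue:.3g})")
```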
The DeEAR framework follows a four-stage pipeline designed to decompose, model, and ultimately align the abstract concept of expressiveness with human preference.
Figure 1: The DeEAR Framework. (A) The training pipeline involves four stages: decomposition, sub-dimension modeling, learning a fusion function, and distillation. (B) Applications include data filtering and serving as a reward model.
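Stage 3, the fusion step, learns a function that maps the three sub-dimension scores to a single overall score aligned with human ratings. A minimal sketch of that idea, assuming scalar sub-scores and an illustrative linear fusion fit by least squares on synthetic data (the paper's actual fusion function and training data are not reproduced here):

```python
# Illustrative linear fusion of sub-dimension scores (Stage 3 sketch).
# The weights, data, and linear form are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sub-dimension scores for 500 annotated utterances,
# columns: [emotion, prosody, spontaneity], each in [0, 1].
X = rng.random((500, 3))
# Hypothetical human overall-expressiveness ratings on a 100-point scale.
y = 100 * (0.40 * X[:, 0] + 0.35 * X[:, 1] + 0.25 * X[:, 2])

# Fit a linear fusion function (with bias term) by least squares.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def fuse(emotion: float, prosody: float, spontaneity: float) -> float:
    """Map sub-dimension scores to an overall expressiveness score."""
    return float(np.dot(w, [emotion, prosody, spontaneity, 1.0]))

print(f"{fuse(0.9, 0.8, 0.7):.1f}")  # ~81.5 on the synthetic scale
```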
Follow the steps below to get started with DeEAR.
Clone the repository and set up the environment:

```bash
git clone https://github.com/FreedomIntelligence/ExpressiveSpeech.git
cd ExpressiveSpeech
conda create -n DeEAR python=3.10
conda activate DeEAR
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
```

Download the DeEAR_Base model from FreedomIntelligence/DeEAR_Base and place it in the `models/DeEAR_Base/` directory.
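If you use the Hugging Face CLI, one way to fetch the checkpoint (this assumes `huggingface-cli` is available, e.g. via `pip install -U huggingface_hub`):

```bash
# Assumes huggingface-cli is installed; fetches the checkpoint into models/
huggingface-cli download FreedomIntelligence/DeEAR_Base --local-dir models/DeEAR_Base
```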
Run inference to score a folder of audio files:

```bash
python inference.py \
    --model_dir ./models \
    --input_path /path/to/audio_folder \
    --output_file /path/to/save/my_scores.jsonl \
    --batch_size 64
```

- Fine-tune an open-source S2S model using the high-expressiveness dataset curated by DeEAR to validate its impact on improving speech model expressiveness.
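As a concrete illustration of the curation step, the sketch below filters high-scoring clips from the scorer's output. The JSONL field names (`audio_path`, `score`) and the threshold are assumptions; adjust them to the actual schema of `my_scores.jsonl`:

```python
# Hedged sketch of evaluation-driven data curation: keep clips whose
# DeEAR score clears a cutoff. Field names and threshold are assumed.
import json

THRESHOLD = 60.0  # illustrative cutoff on the 100-point scale

with open("my_scores.jsonl") as f:
    records = [json.loads(line) for line in f]

expressive = [r for r in records if r.get("score", 0.0) >= THRESHOLD]
print(f"kept {len(expressive)} / {len(records)} clips")

with open("expressive_subset.jsonl", "w") as f:
    for r in expressive:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
```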
The Supplementary Material folder in this repository provides additional details for the experiments presented in our paper, including the full experimental setup and the data materials used.
If you use our work in your research, please cite the following paper:
```bibtex
@article{lin2025decoding,
  title={Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment},
  author={Lin, Zhiyu and Yang, Jingwen and Zhao, Jiale and Liu, Meng and Li, Sunzhu and Wang, Benyou},
  journal={arXiv preprint arXiv:2510.20513},
  year={2025}
}
```