
Voice Cloning and Fake Audio Detection

Background

The client is a technology company in the cyber security industry. They focus on building systems that help individuals and organizations maintain a safe and secure digital presence by providing cutting-edge technologies to their customers. They create products and services that protect their customers by using data-driven technologies to determine whether audio and video media is authentic or fake.

Our goal in this project is to build algorithms that can synthesize spoken audio by converting one speaker's voice into another speaker's voice, with the end goal of detecting whether any spoken audio is pristine or fake.

Data Description

There are two datasets we can utilize in this project, both of which are publicly available.

TIMIT Dataset:

The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States.

Dataset Link: https://github.com/philipperemy/timit

CommonVoice Dataset:

Common Voice is part of Mozilla's initiative to help teach machines how real people speak. It is a corpus of speech read by users of the Common Voice website (https://commonvoice.mozilla.org/), based upon text from a number of public-domain sources such as user-submitted blog posts, old books, movies, and other public speech corpora. Its primary purpose is to enable the training and testing of automatic speech recognition (ASR) systems.

Dataset Link: https://commonvoice.mozilla.org/en/datasets

Goal(s)

Build a machine learning system to detect whether spoken audio is synthetically generated. To achieve this, first build a voice cloning system that, given a speaker's spoken audio, converts the source speaker's voice into the target speaker's voice. Next, build a machine learning system that detects whether any spoken audio is natural speech or synthetically generated by a machine.

For the voice cloning (VC) system, we can utilize the TIMIT dataset, as it consists of aligned text-audio data from various speakers. For the fake audio detection (FAD) system, we can utilize the CommonVoice dataset: its thousands of naturally spoken audio clips can serve as golden human speech for positive examples, while negative examples can be created by using the voice cloning system as an automatic data/label generator. Since the CommonVoice English dataset is large, we can use a subset of it by sampling, as sketched below.
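
As a minimal sketch of that subsampling step, assuming a local Common Voice English download with its standard validated.tsv metadata file and clips/ directory (all paths and the sample size here are illustrative, not the repository's confirmed setup):

import pandas as pd

# Hypothetical path to a local Common Voice English download; each
# release ships per-split TSV metadata alongside a clips/ directory.
CV_ROOT = "data/cv-corpus-en"

# Load the clip metadata and draw a reproducible random subset.
df = pd.read_csv(f"{CV_ROOT}/validated.tsv", sep="\t")
subset = df.sample(n=5000, random_state=42)  # sample size is illustrative

# Keep only what is needed downstream: clip filename and transcript.
subset[["path", "sentence"]].to_csv(
    f"{CV_ROOT}/subset.tsv", sep="\t", index=False
)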

Success Metrics

Use Word Error Rate (WER) for automatic evaluation of the voice cloning (VC) system's speech generation, and also report speaker classification accuracy to assess whether the generated audio matches the target speaker. For the fake audio detection (FAD) system, evaluate model performance using the F-score, with positive labels coming from the ground-truth dataset and negative labels generated by the VC system.
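
The repository does not specify its exact evaluation tooling, but as an illustration, the jiwer package computes both WER and CER from reference and hypothesis transcripts (the sentence below is one of TIMIT's calibration sentences):

import jiwer

# Ground-truth transcripts and ASR transcripts of the generated audio;
# identical strings here just illustrate the call, yielding zero error.
references = ["she had your dark suit in greasy wash water all year"]
hypotheses = ["she had your dark suit in greasy wash water all year"]

wer = jiwer.wer(references, hypotheses)  # word error rate
cer = jiwer.cer(references, hypotheses)  # character error rate
print(f"WER: {wer:.2f}, CER: {cer:.2f}")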

Results

Voice Cloning (VC) System

  1. Algorithms

    We tested 17 different algorithms and identified two with superior performance (a usage sketch follows this list):

    • vits: An English voice conversion model from the TTS library.
    • speech_generator: A speech generator from the Voice_Cloning package.
  2. Model Performance

    • Word Error Rate (WER) for transcription
      Evaluation of the speech generation algorithms showed a WER of 0.12 for the 'vits' model and 0.34 for 'speech_generator'. Despite its suboptimal WER, 'speech_generator' was still utilized due to its favorable impact on speaker classification accuracy. Several other evaluation measures were also adopted; notably, the 'vits' model achieved a Character Error Rate (CER) of 0.04.

    • Speaker classification accuracy
      A neural network model was constructed to assess speaker classification accuracy, achieving a score of 0.83. Considering the limited number of very short audio files available for each speaker, this accuracy is deemed reasonably good.
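
As a minimal sketch of driving the 'vits' model for cloning, assuming the Coqui TTS package and its tts_with_vc_to_file helper, which synthesizes text and then converts the result toward a reference speaker's voice (the model name and file paths below are assumptions, not the repository's confirmed configuration):

from TTS.api import TTS

# Load an English vits model from the Coqui TTS model zoo; the exact
# model used in this project is an assumption.
tts = TTS(model_name="tts_models/en/ljspeech/vits")

# Synthesize the sentence, then convert the generated speech toward
# the target speaker's voice using a short reference clip.
tts.tts_with_vc_to_file(
    text="She had your dark suit in greasy wash water all year.",
    speaker_wav="timit/target_speaker.wav",  # hypothetical reference clip
    file_path="cloned_output.wav",
)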

Fake Audio Detection (FAD) System

  1. Model Performance
    • F-score for binary classification (authentic vs. synthesized speeches)
      Another neural network model was developed for binary classification between authentic and synthesized speech. The model achieved a perfect F-score on the test data, i.e., it distinguished fake audio from authentic audio with perfect precision and recall (see the sketch below).
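
The repository does not detail the FAD model's architecture. As a minimal sketch under assumed inputs, a small neural classifier over averaged MFCC features can separate authentic clips (positives) from VC-generated clips (negatives); the directory layout and hyperparameters are illustrative:

import glob

import librosa
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def mfcc_features(path, sr=16000, n_mfcc=20):
    """Load a clip and reduce it to a fixed-length mean MFCC vector."""
    audio, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# Hypothetical layout: real/ holds CommonVoice clips (label 1) and
# fake/ holds clips generated by the VC system (label 0).
examples = [(mfcc_features(p), 1) for p in glob.glob("real/*.wav")]
examples += [(mfcc_features(p), 0) for p in glob.glob("fake/*.wav")]
X, y = map(np.array, zip(*examples))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A small feed-forward network as the binary real-vs-fake classifier.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)
clf.fit(X_train, y_train)
print("F-score:", f1_score(y_test, clf.predict(X_test)))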

Future Work

The knowledge and models derived from this project exhibit significant potential for addressing diverse business challenges involving audio data. Future iterations will focus on enhancing system robustness and versatility by incorporating a broader range of source data for system development and model training.

Notebook and Installation

For more details, you may refer directly to the notebook, VoiceCloningAndFakeAudioDetection.ipynb.

To run VoiceCloningAndFakeAudioDetection.ipynb locally, please clone or fork this repo and install the required packages by running the following command:

pip install -r requirements.txt

* Associated with Apziva
