C3Imaging/speech-augmentation

This repository was originally created as a demo and instruction manual for reproducing the data augmentation pipeline discussed in the paper Augmentation Techniques for Adult-Speech to Generate Child-Like Speech Data Samples at Scale. It has since been expanded with additional functionality as research continues in other domains, still centred on audio data generation and augmentation. This repo is actively being updated!

Original paper: Augmentation Techniques for Adult-Speech to Generate Child-Like Speech Data Samples at Scale.

Abstract

Technologies such as Text-To-Speech (TTS) synthesis and Automatic Speech Recognition (ASR) have become important in providing speech-based Artificial Intelligence (AI) solutions in today’s AI-centric technology sector. Most current research work and solutions focus largely on adult speech compared to child speech. The main reason for this disparity can be linked to the limited availability of children’s speech datasets that can be used in training modern speech AI systems. In this paper, we propose and validate a speech augmentation pipeline to transform existing adult speech datasets into synthetic child-like speech. We use a publicly available phase vocoder-based toolbox for manipulating sound files to tune the pitch and duration of the adult speech utterances, making them sound child-like. Both objective and subjective evaluations are performed on the resulting synthetic child utterances. For the objective evaluation, the similarities of the selected top adult speakers' embeddings are compared before and after the augmentation to a mean child speaker embedding. The average adult voice is shown to have a cosine similarity of approximately 0.87 (87%) relative to the mean child voice after augmentation, compared to a similarity of approximately 0.74 (74%) before augmentation. Mean Opinion Score (MOS) tests were also conducted for the subjective evaluation, with average MOS scores of 3.7 for how convincing the samples are as child speech and 4.6 for how intelligible the speech is. Finally, ASR models fine-tuned with the augmented speech are tested against a baseline set of ASR experiments, showing some modest improvements over the baseline model fine-tuned with only adult speech.

Open-source Code

This repository provides the open-source scripts used for the multi-speaker adult audio dataset augmentation pipeline discussed in the original paper, along with scripts for many more experiments from other papers and projects.

Focusing on the original paper, our first experiments augment the Librispeech train-clean-100 dataset, and we use the CMU kids dataset for computing child speaker embeddings.

The main functionalities for the augmentation pipeline can be broken down into the following scripts:

  • audio_augmentation/Compute_librispeech_cmukids_similarities.py: computes the cosine similarity between each adult speaker's embedding (averaged over all of that speaker's utterances) and the average child speaker embedding from the CMU kids multi-speaker audio dataset. The Resemblyzer library is used here (a hedged sketch of the comparison appears after this list).
  • asr/wav2vec2_forced_alignment_libri.py: runs forced alignment between LibriSpeech ground-truth speaker transcripts and the corresponding audio using a wav2vec 2.0 ASR model and trellis-based Viterbi backtracking, to obtain timestamps for each word in each transcript. NOTE: transcript files for the audio data are required. The torchaudio library is used here.
  • audio_augmentation/cleese_audio_augmentation.py: augments the pitch and time duration characteristics of original adult audio data from a multi-speaker dataset (LibriSpeech in this case). The CLEESE library is used here.
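
As a rough illustration of what the similarity script computes, the sketch below builds one averaged embedding per speaker with Resemblyzer and compares an adult speaker to the mean child speaker. The paths, speaker IDs and exact aggregation are assumptions for illustration only; the actual logic lives in Compute_librispeech_cmukids_similarities.py.

# Hedged sketch: average Resemblyzer embeddings per speaker and compare an
# adult speaker to the mean child speaker via cosine similarity.
from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def mean_speaker_embedding(wav_paths):
    """Average the utterance embeddings of one speaker into a single vector."""
    embeds = np.array([encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths])
    return embeds.mean(axis=0)

# Assumed paths, mirroring the dataset layouts described later in this README.
adult_embed = mean_speaker_embedding(Path("train-clean-100/19").rglob("*.flac"))
child_embeds = [mean_speaker_embedding(Path(spk, "signal").glob("*.wav"))
                for spk in Path("cmu_kids").iterdir() if spk.is_dir()]
mean_child_embed = np.mean(child_embeds, axis=0)

cos_sim = np.dot(adult_embed, mean_child_embed) / (
    np.linalg.norm(adult_embed) * np.linalg.norm(mean_child_embed))
print(f"cosine similarity to the mean child voice: {cos_sim:.3f}")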

Installation Requirements (UNIX)

torch, torchaudio and resemblyzer can be installed via pip install.

CLEESE

  • To install CLEESE, download it from https://forum.ircam.fr/projects/detail/cleese/ (you will be prompted to register for a free account first).
  • Once the zip file has been downloaded, unzip it to a convenient /path/to/cleese/
  • Then run the following command:
ln -s /path/to/cleese/cleese-master /usr/local/lib/python3.8/dist-packages/cleese
  • Python usage:
from cleese.cleese import cleeseProcess
  • If you receive an import error, then:
  1. In /path/to/cleese/cleese-master/cleese/__init__.py, change the import statement to:
from .cleeseProcess import *
  2. In /path/to/cleese/cleese-master/cleese/cleeseProcess.py, change the import statements of cleeseBPF and cleeseEngine to:
from .cleeseBPF import *
from .cleeseEngine import *

Steps for Reproducing Our Original Audio Augmentation Experiments

Step 1: Select Suitable Adult Candidate Speakers

Run the audio_augmentation/Compute_librispeech_cmukids_similarities.py script on the LibriSpeech train-clean-100 dataset + the CMU kids dataset. The script creates a new folder containing copies of the LibriSpeech speaker folders whose cosine similarity score is above the specified threshold, with each folder name appended with the speaker's gender tag.
Example call:
python audio_augmentation/Compute_librispeech_cmukids_similarities.py --adults_dir /workspace/datasets/LibriSpeech-train-clean-100/LibriSpeech/train-clean-100 --kids_dir /workspace/datasets/cmu_kids --out_dir /workspace/datasets/LibriSpeech_preaugmentations_test --sim_thresh 0.65

NOTES:

  • If using an adult dataset other than the original LibriSpeech or LibriTTS datasets, manually create a SPEAKERS.txt file and place it in the adult speakers dataset directory. An example of the file is shown below (lines prepended with ';' are comments and are ignored):
;<SPEAKER FOLDER NAME> | <GENDER TAG>
14   | F
16   | F
17   | M
19   | F
20   | F
22   | F
23   | F
25   | M
26   | M
27   | M
28   | F
  • In order for the script to work, the child speakers dataset directory must have the following structure:
    <CHILD_DIRECTORY_NAME>/ -> <SPEAKER_ID_1>/ -> signal/ -> <SPEAKER_ID_1_AUDIO_FILE_NAME_1>.wav
                                                          -> <SPEAKER_ID_1_AUDIO_FILE_NAME_2>.wav
                                                          -> etc.
                            -> <SPEAKER_ID_2>/ -> signal/ -> <SPEAKER_ID_2_AUDIO_FILE_NAME_1>.wav
                                                          -> <SPEAKER_ID_2_AUDIO_FILE_NAME_2>.wav
                                                          -> etc.
                            -> etc.

Step 2: Generate Timestamps for Suitable Adult Speakers' Transcripts

After selecting the suitable adult speakers according to Step 1, we run asr/wav2vec2_forced_alignment_libri.py on the new folder. This will create a new subfolder in the output folder from Step 1 that contains a JSON file with time alignments (timestamps for the words in each transcript) for each audio file.
Example call:
python asr/wav2vec2_forced_alignment_libri.py /workspace/datasets/LibriSpeech_preaugmentations_test --out_dir /workspace/datasets/LibriSpeech_preaugmentations_test/w2v2_torchaudio_forced_align
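
For reference, recent torchaudio releases also expose a built-in CTC forced aligner implementing the same trellis/Viterbi idea used by the script. The snippet below is a simplified stand-in (the file name and transcript are placeholders), not the repository script itself, which additionally writes the JSON alignments file consumed in Step 3.

# Hedged sketch of wav2vec2 forced alignment with torchaudio's bundled model
# and torchaudio.functional.forced_align (available in torchaudio >= 2.1).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()                      # ('-', '|', 'E', 'T', ...)
dictionary = {c: i for i, c in enumerate(labels)}

waveform, sr = torchaudio.load("sample.flac")     # placeholder audio file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

transcript = "HELLO WORLD"                        # placeholder ground-truth text
targets = torch.tensor([[dictionary[c] for c in transcript.replace(" ", "|")]])

with torch.inference_mode():
    emissions, _ = model(waveform)
    log_probs = torch.log_softmax(emissions, dim=-1)

# Frame-level path through the CTC trellis; a frame index converts to seconds
# via: frame * waveform.size(1) / emissions.size(1) / bundle.sample_rate.
aligned_tokens, scores = torchaudio.functional.forced_align(log_probs, targets, blank=0)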

Step 3: Generate Augmented Dataset

Using the new folder created after Step 1 and the time alignments JSON file from Step 2, run audio_augmentation/cleese_audio_augmentation.py to produce a new, identically structured dataset of augmented speakers. The CLEESE configuration used for the augmentation experiments can be found in audio_augmentation/cleeseConfig_all_lj.py.
Example call:
python audio_augmentation/cleese_audio_augmentation.py --in_dir /workspace/datasets/LibriSpeech_preaugmentations_test --alignments_json /workspace/datasets/LibriSpeech_preaugmentations_test/w2v2_torchaudio_forced_align/forced_alignments.json --out_dir /workspace/datasets/LibriSpeech_augmentations_test --female_pitch 50 --male_pitch 0
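
A note on the pitch values: CLEESE specifies pitch transposition in cents (hundredths of a semitone). Whether the --female_pitch / --male_pitch arguments map directly onto cents depends on the script and on cleeseConfig_all_lj.py, so the conversion below is purely illustrative, under that assumption.

# Illustrative only: convert a pitch shift in cents to a multiplicative
# fundamental-frequency ratio (assumes the pitch values are cents).
def cents_to_ratio(cents: float) -> float:
    return 2.0 ** (cents / 1200.0)

print(cents_to_ratio(50))  # ~1.029, i.e. roughly a 3% upward shift in f0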

Other Important Scripts for Your Convenience (Other Projects)

ASR Inference

Current ASR models available:

  • wav2vec2 (fairseq framework): You can run asr/wav2vec2_infer_custom.py to generate hypothesis text transcripts from an unlabelled audio dataset.
    Run python asr/wav2vec2_infer_custom.py --help for a description of the usage.

    NOTE: You must manually specify the ASR pipeline you want to use in the code, under the comment # create model + decoder pair. There are a number of possible combinations of ASR model + decoder + [optional external language model] to choose from. Please see the Wav2Vec2_Decoder_Factory class in Utils/asr/decoding_utils for a list of the implemented pipelines. These are also described in the Time Aligned Predictions and Forced Alignment section under the "Wav2Vec2 inference with time alignment" bullet point below. A hedged sketch of one such model + decoder pairing, together with a quick checkpoint-format check, appears after this list of available models.

    Wav2Vec2 ASR Pipeline Notes:

    1. A wav2vec2 checkpoint file can have its architecture defined in the "args" field or the "cfg" field, depending on the version of the fairseq framework used to train the model. If you get an error using a "get_args_" function from the Wav2Vec2 factory class, the checkpoint is likely a "cfg" one, so try the equivalent "get_cfg_" function instead (and vice versa). A quick check is sketched after this list.
    2. You must also manually edit the file Utils/asr/decoding_utils with paths local to you in the following places, if using those particular decoders:
    • BeamSearchKenLMDecoder_Fairseq -> __init__ -> decoder_args -> kenlm_model (if using a KenLM external language model)
    • BeamSearchKenLMDecoder_Fairseq -> __init__ -> decoder_args -> lexicon (if using a KenLM external language model)
    • TransformerDecoder -> __init__ -> transformerLM_root_folder (if using a TransformerLM external language model)
    • TransformerDecoder -> __init__ -> lexicon (if using a TransformerLM external language model)

    UPDATE: word-level time alignment output information, as well as multiple hypotheses output, is now supported. Please read the Time Aligned Predictions and Forced Alignment section for more details.

  • Whisper (whisper-timestamped framework): You can run asr/whisper_time_alignment.py to generate hypothesis text transcripts from an unlabelled audio dataset with optional word-level time alignment output information as well as multiple hypotheses output.
    Run python asr/whisper_time_alignment.py --help for a description of the usage.
    Please read the Time Aligned Predictions and Forced Alignment section for more details.

  • Conformer (NeMo framework): Please see the NeMo ASR Experiments project for more details.
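
To make Wav2Vec2 ASR Pipeline Note 1 above concrete, a quick way to tell whether a fairseq checkpoint needs a "get_args_" or a "get_cfg_" factory method is to inspect its top-level keys (the checkpoint path is a placeholder):

# Check whether a fairseq wav2vec2 checkpoint stores its architecture
# under "cfg" (newer fairseq) or "args" (older fairseq).
import torch

ckpt = torch.load("/path/to/checkpoint_best.pt", map_location="cpu")
if ckpt.get("cfg") is not None:
    print("use a get_cfg_* factory method")
elif ckpt.get("args") is not None:
    print("use a get_args_* factory method")

And as a hedged illustration of the kind of "model + decoder" pair the factory methods wrap, the sketch below builds one pairing directly from public torchaudio APIs. It is not the repository's Wav2Vec2_Decoder_Factory; the lexicon and KenLM paths are placeholders.

# One possible torchaudio model + lexicon-based beam search (KenLM) pairing.
import torch
import torchaudio
from torchaudio.models.decoder import ctc_decoder

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
acoustic_model = bundle.get_model()
tokens = [t.lower() for t in bundle.get_labels()]  # lexicon entries assumed lower-case

decoder = ctc_decoder(
    lexicon="lexicon.txt",   # placeholder lexicon path
    tokens=tokens,
    lm="lm.bin",             # placeholder KenLM binary path
    nbest=3,                 # request multiple hypotheses
    beam_size=50,
)

waveform, sr = torchaudio.load("audio.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.inference_mode():
    emissions, _ = acoustic_model(waveform)

hypotheses = decoder(emissions.cpu())[0]    # n-best list for this utterance
print(" ".join(hypotheses[0].words))        # best hypothesis text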

Time Aligned Predictions and Forced Alignment

Generating time alignment information for predictions is possible. Forced alignment between paired text and audio data (i.e. aligning a known text transcript with an audio file) can also be performed using ASR or TTS models.

Current available ASR-based approaches:

  • Wav2Vec2 inference with time alignment: You can create word-level time-aligned transcripts using wav2vec2 models either from the torchaudio library or from a custom trained/finetuned checkpoint using the fairseq framework.

    Run python asr/wav2vec2_infer_custom.py --help for a description of the usage.

    There are multiple ASR model + Decoder options available, as defined by class methods of Wav2Vec2_Decoder_Factory:
    • get_torchaudio_greedy: Torchaudio Wav2Vec2 model + Greedy Decoder from torchaudio (multiple hypotheses and word-level timestamps unavailable);
    • get_torchaudio_beamsearch: Torchaudio Wav2Vec2 model + lexicon-free Beam Search Decoder without external language model from torchaudio (multiple hypotheses and word-level timestamps available);
    • get_torchaudio_beamsearchkenlm: Torchaudio Wav2Vec2 model + lexicon-based Beam Search Decoder with KenLM external language model from torchaudio (multiple hypotheses and word-level timestamps available);
    • get_args_viterbi and get_cfg_viterbi: Wav2Vec2 model checkpoint trained using fairseq framework + Viterbi Decoder from fairseq (multiple hypotheses unavailable, word-level timestamps available);
    • get_args_greedy and get_cfg_greedy: Wav2Vec2 model checkpoint trained using fairseq framework + Greedy Decoder from torchaudio (multiple hypotheses and word-level timestamps unavailable);
    • get_args_beamsearch_torch and get_cfg_beamsearch_torch: Wav2Vec2 model checkpoint trained using fairseq framework + lexicon-free Beam Search Decoder without external language model from torchaudio (multiple hypotheses and word-level timestamps available);
    • get_args_beamsearchkenlm_torch and get_cfg_beamsearchkenlm_torch: Wav2Vec2 model checkpoint trained using fairseq framework + lexicon-based Beam Search Decoder with KenLM external language model from torchaudio (multiple hypotheses and word-level timestamps available);
    • get_args_beamsearch_fairseq and get_cfg_beamsearch_fairseq: Wav2Vec2 model checkpoint trained using fairseq framework + lexicon-free Beam Search Decoder without external language model from fairseq (multiple hypotheses and word-level timestamps available);
    • get_args_beamsearchkenlm_fairseq and get_cfg_beamsearchkenlm_fairseq: Wav2Vec2 model checkpoint trained using fairseq framework + lexicon-based Beam Search Decoder with KenLM external language model from fairseq (multiple hypotheses and word-level timestamps available);
    • get_args_beamsearchtransformerlm and get_cfg_beamsearchtransformerlm: Wav2Vec2 model checkpoint trained using fairseq framework + lexicon-based Beam Search Decoder with neural Transformer-based external language model from fairseq (multiple hypotheses and word-level timestamps available).

      NOTE: By changing the source code of fairseq/examples/speech_recognition/w2l_decoder.py to that in the forked fairseq repository, word-level timestamps can be returned from all the decoders in this file (W2lViterbiDecoder, W2lKenLMDecoder, W2lFairseqLMDecoder).

  • Whisper inference with time alignment: You can generate transcripts with Whisper and time-align each generated transcript with its speech file using Dynamic Time Warping (a minimal usage sketch appears after this list).
    Run python asr/whisper_time_alignment.py --help for a description of the usage.

    UPDATE: By changing the source code of openai/whisper/decoding.py, openai/whisper/transcribe.py and linto-ai/whisper-timestamped/transcribe.py to those in the forked openai/whisper and linto-ai/whisper-timestamped repositories, multiple hypotheses can be returned from beam search decoding, instead of only the best hypothesis offered by the original projects.

  • Time alignment-enabled inference with NeMo models project: You can generate transcripts with NeMo-based ASR models, such as Conformer-CTC, Conformer-Transducer, Hybrid FastFormer etc. and generate char and word-level time alignment information for the generated transcripts. This requires installing the NeMo framework. Please see the NeMo ASR Experiments project for more details.

    UPDATE: By using abarcovschi/nemo_asr/transcribe_speech_custom.py, multiple hypotheses can also be returned from beam search decoding, instead of the default best hypothesis offered by the original project's NeMo/examples/asr/transcribe_speech.py.

  • Wav2Vec2 forced alignment: You can generate forced time alignments for paired <audio, ground truth text> datasets whose transcripts are saved in LibriSpeech or LibriTTS format, or you can take the output transcripts/hypotheses (treated as ground truth) from a JSON file produced by an ASR inference script such as asr/wav2vec2_infer_custom.py, asr/whisper_time_alignment.py or abarcovschi/nemo_asr/transcribe_speech_custom.py and force-align the audio to those hypotheses.
    Run python asr/wav2vec2_forced_alignment_libri.py --help for a description of the usage.
    NOTE: models from torchaudio AND custom wav2vec2 model checkpoints trained/finetuned using the fairseq framework are now supported.
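
As a hedged illustration of the Whisper route above (not the repository script, which adds JSON output and multiple-hypothesis handling), whisper-timestamped can be used roughly as follows; the model size and file name are placeholders:

# Minimal whisper-timestamped sketch: transcribe one file and print
# word-level timestamps obtained via Dynamic Time Warping.
import whisper_timestamped as whisper

audio = whisper.load_audio("audio.wav")
model = whisper.load_model("base")
result = whisper.transcribe(model, audio, language="en")

for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["text"]}: {word["start"]:.2f}s - {word["end"]:.2f}s')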

Output Formats

Inference

The format of JSON files containing word-level time-aligned transcripts, outputted by wav2vec2_infer_custom.py and whisper_time_alignment.py, is the following:

{"wav_path": "/path/to/audio1.wav", "id": "unique/audio1/id", "pred_txt": "the predicted transcript sentence for audio one", "timestamps_word": [{"word": "the", "start_time": 0.1, "end_time": 0.2}, {"word": "predicted", "start_time": 0.3, "end_time": 0.4}, {"word": "transcript", "start_time": 0.5, "end_time": 0.6}, {"word": "sentence", "start_time": 0.7, "end_time": 0.8}, {"word": "for", "start_time": 0.9, "end_time": 1.0}, {"word": "audio", "start_time": 1.1, "end_time": 1.2}, {"word": "one", "start_time": 1.3, "end_time": 1.4}]}
{"wav_path": "/path/to/audio2.wav", "id": "unique/audio2/id", "pred_txt": "the predicted transcript sentence for audio two", "timestamps_word": [{"word": "the", "start_time": 0.1, "end_time": 0.2}, {"word": "predicted", "start_time": 0.3, "end_time": 0.4}, {"word": "transcript", "start_time": 0.5, "end_time": 0.6}, {"word": "sentence", "start_time": 0.7, "end_time": 0.8}, {"word": "for", "start_time": 0.9, "end_time": 1.0}, {"word": "audio", "start_time": 1.1, "end_time": 1.2}, {"word": "two", "start_time": 1.3, "end_time": 1.4}]}

etc.
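
Since each output file is a JSON-lines file (one JSON object per audio file), it can be consumed line by line. A minimal reader using the field names shown above (the file name follows NOTE2 below):

# Read an inference output file and print the transcript and word timestamps.
import json

with open("best_hypotheses.json") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["id"], "->", entry["pred_txt"])
        for ts in entry.get("timestamps_word", []):   # present only if --time_aligns was set
            print(f'  {ts["word"]}: {ts["start_time"]:.2f}s - {ts["end_time"]:.2f}s')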

NOTE1: The 'timestamps_word' field is optional: with wav2vec2_infer_custom.py it is output when the --time_aligns flag is set and the chosen decoder supports word-level timestamp creation; with whisper_time_alignment.py it is output when the --time_aligns flag is set.

NOTE2:

  • If --num_hyps is set to a value >1 when using wav2vec2_infer_custom.py, or if --beam_size is set to a value >1 (with --num_hyps <= --beam_size) when using whisper_time_alignment.py, then multiple output files will be created, e.g. if --num_hyps=3:
hypotheses1_of_3.json -> contains the best hypothesis per audio file in the input folder.
hypotheses2_of_3.json -> contains the second best hypothesis per audio file in the input folder.
hypotheses3_of_3.json -> contains the third best hypothesis per audio file in the input folder.

Each file contains a JSON row (dict containing 'wav_path', 'id', 'pred_txt', ['timestamps_word'] fields) per audio file in the input folder.

  • If --num_hyps is set to 1 when using wav2vec2_infer_custom.py (or if using a decoder that does not support returning multiple hypotheses), or if --num_hyps or --beam_size is set to 1 when using whisper_time_alignment.py, then a single best_hypotheses.json file will be created, containing just the best hypothesis per audio file in the input folder.

NOTE3: For the output format produced by NeMo models via the abarcovschi/nemo_asr/transcribe_speech_custom.py script, please see the NeMo ASR Experiments project (the output is similar, but also has a character-level timestamps option).

Forced Alignment

The format of JSON files containing word-level force-aligned transcripts, outputted by wav2vec2_forced_alignment_libri.py, is the following:

{"wav_path": "/path/to/audio1.wav", "id": "unique/audio1/id", "ground_truth_txt": "the ground truth transcript for audio one", "alignments_word": [{"word": "the", "confidence": 0.88, "start_time": 0.1, "end_time": 0.2}, {"word": "ground", "confidence": 0.88, "start_time": 0.3, "end_time": 0.4}, {"word": "truth", "confidence": 0.88, "start_time": 0.5, "end_time": 0.6}, {"word": "transcript", "confidence": 0.88, "start_time": 0.7, "end_time": 0.8}, {"word": "for", "confidence": 0.88, "start_time": 0.9, "end_time": 1.0}, {"word": "audio", "confidence": 0.88, "start_time": 1.1, "end_time": 1.2}, {"word": "one", "confidence": 0.88, "start_time": 1.3, "end_time": 1.4}]}
{"wav_path": "/path/to/audio2.wav", "id": "unique/audio2/id", "ground_truth_txt": "the ground truth transcript for audio two", "alignments_word": [{"word": "the", "confidence": 0.88, "start_time": 0.1, "end_time": 0.2}, {"word": "ground", "confidence": 0.88, "start_time": 0.3, "end_time": 0.4}, {"word": "truth", "confidence": 0.88, "start_time": 0.5, "end_time": 0.6}, {"word": "transcript", "confidence": 0.88, "start_time": 0.7, "end_time": 0.8}, {"word": "for", "confidence": 0.88, "start_time": 0.9, "end_time": 1.0}, {"word": "audio", "confidence": 0.88, "start_time": 1.1, "end_time": 1.2}, {"word": "two", "confidence": 0.88, "start_time": 1.3, "end_time": 1.4}]}

etc.
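
One possible use of this output (a hedged sketch, assuming the forced_alignments.json file produced in Step 2 follows this format) is cutting each aligned word out of its audio file:

# Trim every force-aligned word from the first entry into its own WAV file.
import json
import torchaudio

with open("forced_alignments.json") as f:
    entry = json.loads(f.readline())              # first audio file only

waveform, sr = torchaudio.load(entry["wav_path"])
for i, w in enumerate(entry["alignments_word"]):
    start = int(w["start_time"] * sr)
    end = int(w["end_time"] * sr)
    torchaudio.save(f'{i:03d}_{w["word"]}.wav', waveform[:, start:end], sr)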

Speaker Diarization

Current Speaker Diarization models available:

  • Pyannote
  • Resemblyzer
  • NeMo MSDD

    All are accessed using diarization/main.py (run python diarization/main.py --help for a description of the usage).
    The configuration for a diarization run can be modified in the diarization/diarization_config.yaml file.

    Resemblyzer Diarization Notes:
    1. Resemblyzer allows the user to specify an example speech WAV file for each speaker, from which a speaker embedding is created and compared across each multi-speaker audio file. To enable this functionality, you must create the following folder structure (a hedged sketch of the embedding comparison appears after these notes):

      In the main folder where the multi-speaker audio files are located, create a speaker-samples subfolder. In it, create a subfolder for each audio file, with the same name as the audio file. In each of these subfolders, place a <SPEAKER_ID>.WAV file for each speaker you wish to segment in the corresponding multi-speaker audio file from the main folder.

      NOTE: Resemblyzer requires separate audio files from which to create speaker embeddings, so this step is mandatory.
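
A hedged sketch of the comparison Resemblyzer performs with these reference files (the file names and windowing rate are assumptions; the actual run is configured through diarization/main.py and diarization_config.yaml):

# Compare continuous partial embeddings of a multi-speaker recording against
# one reference embedding per speaker; the per-window argmax gives the speaker.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

speaker_names = ["spk1", "spk2"]   # placeholder speaker IDs
speaker_embeds = [encoder.embed_utterance(preprocess_wav(f"speaker-samples/meeting1/{n}.wav"))
                  for n in speaker_names]

wav = preprocess_wav("meeting1.wav")   # placeholder multi-speaker audio file
_, partial_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)

# Embeddings are L2-normalised, so a dot product is a cosine similarity.
similarity = np.stack([partial_embeds @ emb for emb in speaker_embeds])   # (n_speakers, n_windows)
predicted = np.array(speaker_names)[similarity.argmax(axis=0)]            # speaker per window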

WER Calculation using Sclite

The hypothesis transcripts located in the JSON files outputted by the ASR inference scripts can be compared against corresponding ground truth transcripts using the sclite tool from usnistgov/SCTK.
Sclite compares a hypothesis.txt file against a reference.txt file and calculates the WER along with other statistics detailed here (download the file and open it in a browser to view); a typical invocation is shown after the list below.

  • To create a hypothesis.txt file from an ASR inference output JSON file, run Tools/asr/wer_sclite/asr_json_output_to_hypothesis_file.py.
  • To create a reference.txt file by traversing the directory tree of a parallel <audio, ground truth transcript> dataset, run Tools/asr/wer_sclite/create_reference_file.py.
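
For reference, a typical sclite call (assuming both files are in sclite's "trn" format, i.e. one transcript per line followed by the utterance ID in parentheses) looks like:
sclite -r reference.txt trn -h hypothesis.txt trn -o all stdout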

Multi-ASR Parallel Audio-Text Data Creation Funnel Tool
