This page contains the base files used throughout the project for training and testing the ASR models.
The project in question is the CSE3000 Research Project course (Bachelor Thesis) of the Computer Science and Engineering Bachelor's programme that I graduated from at Delft University of Technology.
- `4files.py`: The code used to augment the 4 files needed for training/testing any model.
- `specaug.py`: The SpecAugment script used for generating perturbed audio files.

Both of the scripts above contain comments explaining the steps.

- `asr0`: The baseline ASR model that was extended and used throughout the project.
- `data`: Contains files describing the data to be used, as well as the extracted data that was used for training/testing.
- `vtlp_aug_scripts`: The VTLP augmentation files (credits to Nikolay Zhlebinkov for creating them).
- Folders `conf`, `local`, `steps` and `utils` are part of the Kaldi toolkit and will not be explained here.
- `jasmin_run.sh`: The main script used to train the model. Inside the file, stage -1 deals with training the language model, and everything after stage 0 deals with training the acoustic model (see the sketch below). More is explained in the "Steps" section.
- `jasmin_run_test.sh`: Made for testing particular subgroups of the test set (such as children's speech or male speech). It is mainly a modification of `jasmin_run.sh`, which can be changed to contain the same logic: `test` needs to be replaced with a variable like `$test_folders` (as in `jasmin_run_test.sh`), a string containing the names of the individual test folders, separated by spaces (see the example below).
- `new_test.sh`: A simple script that outputs the WER for each of the test sets. It can easily be modified to include more test folders if necessary.
- `run_data_prep.sh`: The first script that should be run, in order to generate information about the JASMIN-CGN data available for use. Careful: `wav.scp` will not be generated and you will have to create it yourself; an example of what it looks like is given later on.
- The `srun` scripts are mainly meant for running on an HPC cluster.
- `train` contains the 4 files that were used for training, whereas `test` and its variants contain the testing data. The variants are subsets of the test folder, each focusing on a specific group based on age, gender, or type of speech.
- `local/data` contains information about speakers, as well as some files generated throughout the project only for the Transitional Dutch region.
A more detailed breakdown of these files is provided in `data.md`.
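
For orientation, below is a rough, hypothetical sketch of how the stage gating in `jasmin_run.sh` is organised; this is the usual Kaldi-style pattern, and the actual variable names and stage contents in the script may differ.

```bash
# Hypothetical sketch of the stage layout, not the literal contents of jasmin_run.sh
stage=-1                       # start from the very beginning

if [ $stage -le -1 ]; then
  echo "stage -1: train the language model"
fi

if [ $stage -le 0 ]; then
  echo "stage 0 and later: acoustic model training pipeline"
fi
```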
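
And an illustration of the `$test_folders` variable that `jasmin_run_test.sh` expects: one string with the test folder names separated by spaces. The folder names below are hypothetical; use the ones you actually create under `data`.

```bash
# Hypothetical subgroup folders; replace with the test directories you created
test_folders="test test_children test_male test_read"
```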
- `aug_helpers.py`: Helper functions used by the main VTLP augmentation script.
- `main.py`: The main script that manages the paths and the augmentation. This is the one that needs to be run and where the paths need to be changed (see the example below).
- `vtlp.py`: The function that deals with the VTLP augmentation itself.
- `wav2warp`: A description of the warping factor used for each of the files I augmented and trained on for this project. This file is generated automatically once the files have been successfully augmented, after running `main.py`.
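
A minimal sketch of running the VTLP augmentation, assuming `main.py` takes no command-line arguments (the comments inside the script document the exact paths to edit):

```bash
# 1. Edit the input/output paths inside main.py to point at your audio files.
# 2. Run the script; wav2warp is written automatically once the augmentation succeeds.
python main.py
```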
- Open a terminal in `asr0`.
- Run `./run_data_prep.sh`. It will generate the `data/local/data` folder with information about the speakers. CAREFUL! `jasmin_path` needs to point to the folder that contains the JASMIN corpus information.
- There will be a `data` folder generated now. Go to `data/local/data`.
- Inside it there are multiple files. The `spk2` files contain more specific information about each speaker's gender, age, regional accent, proficiency level (if they are non-natives), etc. Using these files, extract the `segments`, `utt2spk` and `text` files that you need, from each `comp_q` and `comp_p` available inside.
- Merge the obtained files for each `comp_p` and `comp_q` into one, so you end up with `segments`, `utt2spk` and `text` files that each contain the data for all styles of speech (`comp_p` stands for conversational (HMI) speech, `comp_q` means speech read from a script). See the merge sketch after these steps.
- For the `wav.scp` file, the one for the entire cluster is included as `wav_jas_all.scp`, from which the necessary information can be extracted to generate your own `wav.scp` (see the format example after these steps).
- Going back to `asr0/data` (NOT `asr0/data/local/data`), make 2 directories: `train` and `test`.
- (OPTIONAL) Make directories for any specific group of people from your `test` set that you want to evaluate separately.
- Now you should have 4 files generated: `segments`, `text`, `utt2spk` and `wav.scp`. These are the files that contain your extracted speakers and will be used for training/testing. For each of these files, extract the speakers you want to include in `train` and save them in a new file inside `train`; do the same for the rest of the speakers, which go inside `test` (see the split sketch after these steps).
- Go back to `asr0` (the commands for the remaining steps are also collected in a sketch after this list).
- Run `./utils/fix_data_dir.sh data/train` to make sure the training data is in the order the Kaldi scripts expect.
- Do the same for `data/test`.
- Run `jasmin_run.sh`.
- Wait until training is complete.
- Run `new_test.sh` to see the overall WER performance of your system. Ignore any errors about missing folders; as long as the first line displays the WER, everything is fine.
- (OPTIONAL) Run `jasmin_run_test.sh` to test the subgroups of `test` created inside `data`.
- (OPTIONAL) Run `new_test.sh` again to see the individual WERs obtained.
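
A minimal sketch of the merging step, assuming the files extracted for each component were saved in hypothetical `comp_p/` and `comp_q/` subfolders (adjust the paths to wherever you put them):

```bash
# Concatenate the per-component files and keep them sorted, as Kaldi expects
for f in segments utt2spk text; do
  cat comp_p/$f comp_q/$f | sort > $f
done
```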
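
For `wav.scp`, every line is the standard Kaldi pair of a recording ID and the path to (or command producing) its audio. The sketch below pulls the needed recordings out of `wav_jas_all.scp`, assuming it uses the same recording IDs as your `segments` file; the IDs and paths shown are illustrative.

```bash
# A wav.scp line looks like: <recording-id> <path to the wav, or a command producing it>
# e.g. (hypothetical):       N000001 /path/to/jasmin/audio/N000001.wav
cut -d' ' -f2 segments | sort -u > recording_ids   # recording IDs sit in the 2nd column of segments
grep -w -f recording_ids wav_jas_all.scp > wav.scp
```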
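
A sketch of splitting the 4 files by speaker, run from `asr0/data` with the merged files in the current folder. The speaker list `train_speakers.txt` is a hypothetical helper file (one speaker ID per line) that you would create yourself; everything not in it ends up in `test`.

```bash
# utt2spk lines are "<utt-id> <spk-id>"; a line goes to train if its speaker is in the list
awk 'NR==FNR{spk[$1]; next}   $2 in spk'  train_speakers.txt utt2spk > train/utt2spk
awk 'NR==FNR{spk[$1]; next} !($2 in spk)' train_speakers.txt utt2spk > test/utt2spk

# Carry the matching utterances over to the other three files (the utt-id is the 1st column)
for d in train test; do
  cut -d' ' -f1 $d/utt2spk > $d/uttlist
  grep -w -f $d/uttlist segments > $d/segments
  grep -w -f $d/uttlist text     > $d/text
  cut -d' ' -f2 $d/segments | sort -u | grep -w -f - wav.scp > $d/wav.scp
done
# ./utils/fix_data_dir.sh (run in the next steps) fixes sorting and drops any leftover entries
```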
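
Finally, the commands from the remaining steps collected in one place, to be run from inside `asr0`:

```bash
./utils/fix_data_dir.sh data/train   # put the training data in the order Kaldi expects
./utils/fix_data_dir.sh data/test    # same for the test data
./jasmin_run.sh                      # train the model; wait until it completes
./new_test.sh                        # overall WER (only the first line matters if folders are missing)
./jasmin_run_test.sh                 # (optional) evaluate the subgroups of test created inside data
./new_test.sh                        # (optional) individual WERs for those subgroups
```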
If you have any suggestions, feel free to create a PR with your changes.
If you have any questions, open an issue on the page. Disclaimer: the more time passes, the less I will be able to answer questions, as I will most likely have forgotten the details. I apologize for that :)