Skip to content

C3Imaging/whisper_non_native_child_asr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 

Repository files navigation

Experimental analysis on finetuning large scale supervision to improve the Non-Native English Child speech recognition.

Abstract

Modern End-to-End Automatic Speech Recognition (ASR) systems struggle recognizing children's speech due to limited child training data and acoustic variability of child speech, particularly in low-resource languages or with accented speech. This study focuses on improving the performance of ASR on native and non-native English child speech using publicly available datasets. Specifically, we evaluate the performance of the large-scale Whisper models (trained with large amount of adult speech data) on child speech. In addition, finetuning experiments using different child speech datasets are used to study the performance of ASR for non-native English child speech. We report relative WER improvements between 29% and 89% compared to previously reported results on the same datasets by using only a 10% distribution on unseen non-native datasets in finetuning. These results demonstrate the potential of Whisper for improving ASR on low-resource non-native child speech.

Codebase:

We followed the setup from whisper finetuning event for performing the ASR finetuning experiments with child speech. The training and finetuning details for replicating our experiments can be followed from the following link: https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event

We also provide a more comprehensive understanding of the hyperparameter settings and training details for all our experiments on our Hugging Face repository which is accessible at https://huggingface.co/rishabhjain16

Dataset availability and accessibility:

We can only make the basic data cleaning scripts available here as the child audio datasets used in this paper are subject to licensing agreements. For access to respectively cleaner versions of datasets used in this paper, researchers can buy their own license for the original datasets (where required), and on providing proof of that license, can get access to our ‘clean’ versions upon request. All the datasets utilized in this paper are privately accessible through our Hugging Face page. They can be shared if deemed necessary, provided that the terms and conditions governing their use are adhered to.

Table of Results with Checkpoints

Table 3: WER for Whisper Original and Finetuned Models over different child speech datasets used in the paper.

NOTE1: Model IDs are links to the corresponding models on either OpenAI's HuggingFace page (Group A) or our HuggingFace page (Groups B-F). All of the models are openly available.

NOTE2: A Tensorboard page of all the training and evaluation metrics for each model can be found under the "Training metrics" tab after clicking on a model link.

Model ID Whisper Pretraining Model MyST_test PF_br_test CMU_test PF_sw_test PF_ge_test PF_it_test SO_test Dev_clean
Group A: No-Finetuning:
1 Tiny 40.09 159.57 24.62 55.32 103.68 70.57 64.83 10.85
2 Tiny.en 33.02 47.11 16.25 45.23 89.8 47.22 51.28 8.62
3 Base 32.14 100.07 16.65 53.88 126.84 50.29 60.39 8.14
4 Base.en 29.15 45.7 15.01 37.29 93.77 46.84 38.47 7.18
5 Small 26.22 111.75 9.3 60.81 86.72 44.09 36.19 6.43
6 Small.en 26.72 39.0 8.64 32.26 71.04 33.38 30.33 6.06
7 Medium 25.11 80.97 7.48 35.07 105.82 45.65 37.0 5.58
8 Medium.en 28.06 35.25 7.17 27.91 80.4 25.94 25.29 6.20
9 Large 25.24 84.52 7.56 33.09 79.14 51.82 37.25 5.53
10 Large-V2 25.0 73.68 6.86 29.99 77.56 34.97 29.39 5.4
Group B: MyST_train Finetuning:
11 Medium 11.66 19.76 9.43 34.18 62.4 24.53 24.89 5.62
12 Medium.en 11.81 17.83 9.13 23.63 76.84 19.99 25.45 6.48
13 Large-V2 12.28 10.88 9.8 25.56 65.58 23.48 25.05 4.82
Group C: MyST_train + CMU_train Finetuning:
14 Medium 12.14 41.83 4.46 158.75 113.07 125.05 33.24 6.10
15 Medium.en 12.10 31.29 2.27 138.95 125.37 77.38 33.32 6.13
16 Large-V2 12.37 23.62 2.32 184.24 211.01 180.79 48.34 4.81
Group D: MyST_train + PF_br_train Finetuning:
17 Medium 12.22 2.98 16.05 16.52 51.53 14.08 22.80 5.40
18 Medium.en 12.33 3.72 15.08 17.48 59.94 13.95 23.41 4.88
19 Large-V2 13.34 4.17 17.11 26.55 58.37 20.24 24.94 4.97
Group E: MyST_train + CMU_train + PF_br_train Finetuning:
20 Medium 11.72 3.11 2.36 23.94 86.13 16.72 27.88 5.62
21 Medium.en 11.71 3.02 2.23 21.65 68.1 15.87 26.43 5.57
22 Large-V2 12.37 3.1 1.86 43.34 71.18 56.29 32.99 4.75
Group F: MyST_train + PF_br_train + NN_10 Finetuning:
23 Medium 11.73 3.15 9.33 9.12 34.59 5.10 16.02 5.33
24 Medium.en 11.81 3.36 9.58 10.37 35.27 6.22 17.04 4.95
25 Large-V2 12.75 7.05 9.71 8.39 33.48 5.63 16.67 5.09
Group G: MyST_train + PF_br_train + NN_20 Finetuning:
26 Medium 11.96 3.12 8.92 7.74 36.21 4.16 14.40 5.39
27 Medium.en 12.30 3.28 9.53 8.94 34.78 4.42 14.87 5.01
28 Large-V2 11.60 3.09 9.22 7.24 31.46 3.98 13.83 4.47
Group H: MyST_train + CMU_train + PF_br_train + NN_10 Finetuning:
29 Medium 12.75 3.11 1.98 8.99 36.67 5.14 16.09 6.09
30 Medium.en 12.35 3.42 2.06 9.04 35.92 5.84 17.55 5.28
31 Large-V2 11.73 3.13 2.56 9.67 35.05 5.51 15.83 4.69
Group I: MyST_train + CMU_train + PF_br_train + NN_20 Finetuning:
32 Medium 12.55 3.09 1.96 7.66 34.77 4.11 14.31 6.06
33 Medium.en 11.88 3.28 1.98 8.16 34.99 4.65 15.87 5.15
34 Large-V2 11.62 2.84 1.75 8.36 34.26 4.4 14.52 4.53

Training Hyperparameters: (common used for all finetuning experiments)

  • Learning Rate: 1e-05
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate scheduler type: Linear
  • Learning rate scheduler warmup steps: 500
  • Training steps: 4000

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages