The Impact of Dialect Variation on Robust Automatic Speech Recognition for Catalan

CommonVoice 18.0 has the following "macro-dialect" composition (L2 data removed):

	Central	Balear	Nord	Nord-Occidental	Valencià	Unknown
Validated Sample Count	761017	24864	29970	74452	83742	598786
Duration (Hours)	1155.31	38.51	47.04	100.17	120.04	912.81

**The values above are calculated after filtering out repeated recordings of the same sentence from the same macro-dialect. If the same sentence was recorded by speakers from different macro-dialects, all recordings of that sentence from different macro-dialects were left in the data.

Working in a way that tries to use as much data as possible, we aimed to train ASR models that vary in how biased toward the Central dialect they are. Given that the Balear dialect represents the "lower limit" in terms of data, we worked from the assumption that at most, we can have ~30.4 hours of Balear training data (80% of 38 hr), and 3.8 hours each of Balear development and test data (10% each).

We trained four models using the following dialect compositions. Training data ranged from a condition where all the fine-tuning data is in the Central dialect (Model 1) to a condition where the fine-tuning data is perfectly balanced (Model 4).

In terms of hours of data:

Train

	Central	Balear	Nord	Nord-Occidental	Valencià	Total
Model 1 (100% Central)	152	0	0	0	0	152
Model 2 (80% Central)	121.6	7.6	7.6	7.6	7.6	152
Model 3 (50% Central)	76	19	19	19	19	152
Model 4 (20% Central)	30.4	30.4	30.4	30.4	30.4	152

Development

	Central	Balear	Nord	Nord-Occidental	Valencià	Total
Model 1 (100% Central)	19	0	0	0	0	19
Model 2 (80% Central)	15.2	0.95	0.95	0.95	0.95	19
Model 3 (50% Central)	9.5	2.375	2.375	2.375	2.375	19
Model 4 (20% Central)	3.8	3.8	3.8	3.8	3.8	19

Evaluation

	Central	Balear	Nord	Nord-Occidental	Valencià	Total
Model 1 (100% Central)	3.8	3.8	3.8	3.8	3.8	19
Model 2 (80% Central)	3.8	3.8	3.8	3.8	3.8	19
Model 3 (50% Central)	3.8	3.8	3.8	3.8	3.8	19
Model 4 (20% Central)	3.8	3.8	3.8	3.8	3.8	19

To run model training:

After splitting a sample from the common voice corpus, fine-tuning can be initialized with the script ./training/run_training.py

For example:

python3 ./training/run_training.py --experiment_name=central100_53m_v02 --model=XLSR53 --freeze_feature_extractor --prop_central=100

Adversarial Attacks

We train adversarial attacks on the fine-tuned models to study the impact of having multi-dialect fine-tuning data on adversarial robustness. Attacks can be trained using the script ./attack/launch_attack.py:

python3 ./attack/launch_attack.py --experiment_name=central100_53m_v02 --model=XLSR53 --lr=0.01 --regularizing_const=1.0

Name		Name	Last commit message	Last commit date
Latest commit History 100 Commits
attack		attack
data_prep		data_prep
training		training
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Impact of Dialect Variation on Robust Automatic Speech Recognition for Catalan

To run model training:

Adversarial Attacks

About

Releases

Packages

Languages

zhopto3/DialAttack

Folders and files

Latest commit

History

Repository files navigation

The Impact of Dialect Variation on Robust Automatic Speech Recognition for Catalan

To run model training:

Adversarial Attacks

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages