I don't see any reason why training the ASR model/tokenizer on data containing ZWSP would perform any differently from training on data separated by regular spaces, except that dataset curation will be harder since ZWSP isn't always used consistently. But if you train it that way, you get an end-to-end model rather than having to add an extra word-segmentation model to your transcription pipeline. Whisper reports WER for languages where words are separated by spaces and CER for languages not separated by spaces (plus Korean as a special case). As far as I can tell, ZWSPs were probably stripped from the training data; if they were present, it would seem to make sense to treat them just like spaces.
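As a rough illustration of that last point, here is a minimal Python sketch (not Whisper's actual preprocessing) of the two options: stripping ZWSP versus mapping it to a regular space so it acts as a word boundary for tokenization and WER scoring. The `normalize_zwsp` helper is just an illustrative name.

```python
# Minimal sketch (not Whisper's actual pipeline): two ways to handle
# U+200B in transcripts before tokenization / metric computation.

ZWSP = "\u200b"

def normalize_zwsp(text: str, treat_as_space: bool = True) -> str:
    """Either map ZWSP to a regular space or strip it entirely."""
    return text.replace(ZWSP, " " if treat_as_space else "")

# Example with a ZWSP-separated phrase.
raw = f"hello{ZWSP}world"

# Naive whitespace splitting does NOT treat ZWSP as a separator
# (U+200B is a format character, not Unicode whitespace), so
# WER-style word counts would see this as a single "word".
print(raw.split())                                # ['hello\u200bworld']

# Mapping ZWSP to a space makes it behave like an ordinary word boundary.
print(normalize_zwsp(raw).split())                # ['hello', 'world']

# Stripping it instead leaves an unsegmented string (CER-style evaluation).
print(normalize_zwsp(raw, treat_as_space=False))  # 'helloworld'
```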
Issue Description:
In some languages, the Zero Width Space (ZWSP, U+200B) is commonly used for word separation or formatting. However, its presence in training data for Automatic Speech Recognition (ASR) models may introduce challenges in tokenization, data preprocessing, and decoding.
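One way to gauge whether this is actually a problem for a given corpus is to audit how consistently ZWSP appears in the transcripts. The sketch below is hypothetical (the `transcripts` list is illustrative placeholder data, not from any real dataset): it labels each transcript by whether it uses ZWSP, regular spaces, both, or neither.

```python
# Hypothetical audit of ZWSP usage consistency in a transcript corpus.
from collections import Counter

ZWSP = "\u200b"

def classify(text: str) -> str:
    """Label a transcript by which word separators it contains."""
    has_zwsp = ZWSP in text
    has_space = " " in text
    if has_zwsp and has_space:
        return "mixed"
    if has_zwsp:
        return "zwsp_only"
    if has_space:
        return "space_only"
    return "no_separator"

# Illustrative placeholder data; in practice, load the real transcripts.
transcripts = [
    f"foo{ZWSP}bar{ZWSP}baz",
    "foo bar baz",
    f"foo bar{ZWSP}baz",
    "foobarbaz",
]

print(Counter(classify(t) for t in transcripts))
# Counter({'zwsp_only': 1, 'space_only': 1, 'mixed': 1, 'no_separator': 1})
```

A heavily "mixed" or inconsistent distribution would support the curation concern raised in the reply above.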
Key Questions & Concerns:
Proposed Solutions & Discussion: