I don't see any reason why training the ASR model/tokenizer on data containing ZWSP would perform any differently from training on data separated by regular spaces, except that dataset curation will be harder since ZWSP isn't always used consistently. But if you train it that way, you get an end-to-end model rather than having to add an extra word-segmentation model to your transcription pipeline. Whisper reports WER for languages where words are separated by spaces and CER for languages not separated by spaces (plus Korean as a special case). As far as I can tell, ZWSPs were probably stripped from the training data; if they were present, it would seem to make sense to treat them just like spaces.
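As a rough illustration of that last point, here is a minimal Python sketch (not Whisper's actual preprocessing) of the two options: stripping ZWSP versus mapping it to a regular space so it acts as a word boundary for tokenization and WER scoring. The `normalize_zwsp` helper is just an illustrative name.

```python
# Minimal sketch (not Whisper's actual pipeline): two ways to handle
# U+200B in transcripts before tokenization / metric computation.

ZWSP = "\u200b"

def normalize_zwsp(text: str, treat_as_space: bool = True) -> str:
    """Either map ZWSP to a regular space or strip it entirely."""
    return text.replace(ZWSP, " " if treat_as_space else "")

# Example with a ZWSP-separated phrase.
raw = f"hello{ZWSP}world"

# Naive whitespace splitting does NOT treat ZWSP as a separator
# (U+200B is a format character, not Unicode whitespace), so
# WER-style word counts would see this as a single "word".
print(raw.split())                                # ['hello\u200bworld']

# Mapping ZWSP to a space makes it behave like an ordinary word boundary.
print(normalize_zwsp(raw).split())                # ['hello', 'world']

# Stripping it instead leaves an unsegmented string (CER-style evaluation).
print(normalize_zwsp(raw, treat_as_space=False))  # 'helloworld'
```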
Issue Description:
In some languages, the Zero Width Space (ZWSP, U+200B) is commonly used for word separation or formatting. However, its presence in training data for Automatic Speech Recognition (ASR) models may introduce challenges in tokenization, data preprocessing, and decoding.
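One way to gauge whether this is actually a problem for a given corpus is to audit how consistently ZWSP appears in the transcripts. The sketch below is hypothetical (the `transcripts` list is illustrative placeholder data, not from any real dataset): it labels each transcript by whether it uses ZWSP, regular spaces, both, or neither.

```python
# Hypothetical audit of ZWSP usage consistency in a transcript corpus.
from collections import Counter

ZWSP = "\u200b"

def classify(text: str) -> str:
    """Label a transcript by which word separators it contains."""
    has_zwsp = ZWSP in text
    has_space = " " in text
    if has_zwsp and has_space:
        return "mixed"
    if has_zwsp:
        return "zwsp_only"
    if has_space:
        return "space_only"
    return "no_separator"

# Illustrative placeholder data; in practice, load the real transcripts.
transcripts = [
    f"foo{ZWSP}bar{ZWSP}baz",
    "foo bar baz",
    f"foo bar{ZWSP}baz",
    "foobarbaz",
]

print(Counter(classify(t) for t in transcripts))
# Counter({'zwsp_only': 1, 'space_only': 1, 'mixed': 1, 'no_separator': 1})
```

A heavily "mixed" or inconsistent distribution would support the curation concern raised in the reply above.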
Key Questions & Concerns:
Proposed Solutions & Discussion: