[Feature request] Adding support to Chinese (Traditional) and Yue Chinese #110

Open
slgphantom opened this issue Mar 7, 2025 · 3 comments
Assignees
Labels
enhancement New feature or request

Comments

@slgphantom

First of all, thank you for creating this application as a replacement for Google Translate.

The current language list supports only one variant of Chinese: Simplified.

It would be great if more variants of the Chinese language could be added, including Chinese (traditional) and Yue Chinese. By doing so, more Chinese users, especially those from Taiwan and Hong Kong, could benefit from your project.

Both Chinese (Traditional) and Yue Chinese / Chinese (Yue) are supported by NLLB and Whisper.

Thank you once again;)

@slgphantom slgphantom added the enhancement New feature or request label Mar 7, 2025
@niedev
Owner

niedev commented Mar 10, 2025

Thank you! From what I've seen, Whisper doesn't have different codes for Traditional and Simplified Chinese; in fact, it often gets confused between the two. NLLB should support both, but I don't see Yue Chinese in the list of language codes. I'll find out more and see what I can do 👍

@sblair12

Think it may be due to the fact that the Traditional/Simplified distinction only applies to the written language and not the spoken one, with Whisper only being able to designate spoken?

@slgphantom
Author

> Think it may be due to the fact that the Traditional/Simplified distinction only applies to the written language and not the spoken one, with Whisper only being able to designate spoken?

Simply speaking, there are four settings in Chinese:

  1. Mandarin + Simplified Chinese (Mainland China)
  2. Mandarin + Traditional Chinese (Taiwan)
  3. Cantonese (Yue) + Traditional Chinese (Hong Kong)
  4. Cantonese (Yue) + Simplified Chinese (Malaysia)
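
As a rough illustration, the four settings above can be modelled as pairs of spoken language and script. The sketch below is a hypothetical data structure, not part of the project: the class and function names are made up for illustration, `cmn`/`yue` are the ISO 639-3 codes for Mandarin and Yue (Cantonese), and `Hans`/`Hant` are the ISO 15924 codes for Simplified and Traditional script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChineseVariant:
    """One spoken-language/script combination (hypothetical helper type)."""
    spoken: str  # ISO 639-3 code: "cmn" (Mandarin) or "yue" (Cantonese)
    script: str  # ISO 15924 code: "Hans" (Simplified) or "Hant" (Traditional)
    region: str  # example region taken from the list above


# The four settings from the list, in the same order.
VARIANTS = [
    ChineseVariant("cmn", "Hans", "Mainland China"),
    ChineseVariant("cmn", "Hant", "Taiwan"),
    ChineseVariant("yue", "Hant", "Hong Kong"),
    ChineseVariant("yue", "Hans", "Malaysia"),
]


def variants_for_script(script: str) -> list[ChineseVariant]:
    """Return every variant written in the given script."""
    return [v for v in VARIANTS if v.script == script]


if __name__ == "__main__":
    for v in variants_for_script("Hant"):
        print(v.spoken, v.script, v.region)
```

Keeping the spoken language and the script as separate fields reflects the point made earlier in the thread: the Traditional/Simplified distinction belongs to the written form only.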

Meanwhile, although both Taiwan and Hong Kong use Traditional Chinese, the two are not identical: many words and grammatical usages (mainly in spoken language) differ between the two places. You may consider written Chinese a common set of symbols shared across Chinese-speaking regions.

Back to our topic: Whisper can detect Cantonese (Yue); however, the transcribed text would be written in Chinese.
If you check Whisper's tokenizer, you can see both "zh" (Chinese/Mandarin) and "yue" (Cantonese).

I am unsure whether choosing "zh" produces Simplified or Traditional Chinese as output, but choosing "yue" produces Traditional Chinese.
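
To make the code discussion concrete, here is a minimal sketch of how an app might map a spoken-language/script choice to model language codes. The Whisper codes "zh" and "yue" are the ones mentioned above. The NLLB codes (`zho_Hans`, `zho_Hant`, `yue_Hant`) are assumed to follow the FLORES-200 naming and should be verified against the tokenizer of the specific NLLB model in use. The function and dictionary names are hypothetical, and no code is assumed for Cantonese written in Simplified script.

```python
from typing import Optional, Tuple

# Whisper language codes, as named in the comment above.
WHISPER_CODE = {
    "cmn": "zh",   # Mandarin; output script is not selectable via the code
    "yue": "yue",  # Cantonese
}

# ASSUMPTION: FLORES-200 style codes; verify against the NLLB model's tokenizer.
NLLB_CODE = {
    ("cmn", "Hans"): "zho_Hans",
    ("cmn", "Hant"): "zho_Hant",
    ("yue", "Hant"): "yue_Hant",
    # ("yue", "Hans") is left out on purpose: no code is assumed for it.
}


def model_codes(spoken: str, script: str) -> Tuple[str, Optional[str]]:
    """Return (whisper_code, nllb_code); nllb_code is None when unmapped."""
    whisper = WHISPER_CODE[spoken]  # raises KeyError for unknown languages
    nllb = NLLB_CODE.get((spoken, script))
    return whisper, nllb


if __name__ == "__main__":
    print(model_codes("yue", "Hant"))
    print(model_codes("cmn", "Hans"))
```

Returning `None` for an unmapped pair lets the caller decide whether to fall back to another script or hide the option in the language list.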

Thank you both again!


3 participants