[Feature request] CJK tokenizer for char-level tokenized BLEU #171
Comments
The papers you cite say "BLEU scores are character-level." or "charBLEU". I would interpret this as tokenizing all characters, including those in Latin-script names. For this, we already have the `char` tokenizer. Of course, there is a question whether character-level BLEU (limited to character 4-grams, and based on the BLEU algorithm, which focuses on precision with a brevity penalty instead of recall) is suitable enough (i.e. correlates with human evaluation), and why not use e.g. chrF instead, which already operates on character n-grams by default.
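For illustration, here is a minimal sketch of how both options mentioned above (character-level BLEU via the `char` tokenizer, and chrF) can be computed with sacrebleu's Python API; the hypothesis/reference strings are placeholders:

```python
import sacrebleu

# Placeholder system output and reference (illustrative only).
hyps = ["ニューラル機械翻訳のテストです。"]
refs = [["ニューラル機械翻訳のテストです。"]]

# Character-level BLEU: the `char` tokenizer splits every character
# (including Latin-script names) before standard 4-gram BLEU is computed.
char_bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="char")

# chrF: character n-gram F-score; its default weighting favours recall.
chrf = sacrebleu.corpus_chrf(hyps, refs)

print(char_bleu.score, chrf.score)
```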
Hi @martinpopel, thanks for your reply. By "character-level BLEU" for Chinese and Japanese texts, we mean exactly the same treatment that the `zh` tokenizer already applies to Chinese characters. Our issue arises from the fact that the `zh` tokenizer's character ranges do not cover Japanese Katakana and Hiragana. Plus, it does not seem to make sense to use the word-level `ja-mecab` tokenizer when the existing literature reports character-level BLEU. Thus we would like to ask for extending the character ranges (or adding a dedicated tokenizer) so that Japanese text is also split per character. Many thanks! Zaixiang
It sounds good to me to have a separate tokenizer that extends the ranges with the ones you suggested, so as not to change the behaviour of the current `zh` tokenizer.
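For concreteness, here is a rough standalone sketch of what such a separate tokenizer could do. This is only an illustration of the idea, not sacrebleu's actual implementation; the ranges below are a subset chosen for the example, with the kana range requested in this issue included:

```python
# Illustrative sketch, not sacrebleu code: pad every character that falls in
# a CJK range with spaces so that BLEU treats it as its own token, while
# other scripts are left for the usual downstream tokenization.
_CJK_RANGES = (
    ('\u3040', '\u30ff'),  # Hiragana + Katakana (the extension requested here)
    ('\u4e00', '\u9fff'),  # CJK Unified Ideographs
    ('\u3400', '\u4dbf'),  # CJK Unified Ideographs Extension A
)

def _is_cjk_char(ch: str) -> bool:
    return any(lo <= ch <= hi for lo, hi in _CJK_RANGES)

def cjk_char_tokenize(line: str) -> str:
    """Surround every CJK character (ideographs and kana) with spaces."""
    padded = "".join(f" {ch} " if _is_cjk_char(ch) else ch for ch in line)
    return " ".join(padded.split())

print(cjk_char_tokenize("WMT'20 のカタカナとひらがなのテスト"))
```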
Hi,

I have been recently working on the WMT'20 EN-JA dataset, and I am wondering if we can add a character-level tokenizer (instead of `ja-mecab`) to facilitate fair comparison on this task. Existing literature on EN-JA used char-level BLEU on the test set [1, 2, 3], following the rules of the WMT'20 competition.

I have attempted to use the `zh` tokenizer for this purpose. However, the script ignores Katakana and Hiragana characters. Our current solution (suggested by @zhengzx-nlp) is to add `(u'\u3040', u'\u30ff')` to the `_UCODE_RANGES` of `sacrebleu/sacrebleu/tokenizers/tokenizer_zh.py` (line 45 at commit 2787185).
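Spelled out, the suggested edit is a single extra entry in that tuple, and a quick check confirms the range covers both kana blocks (the existing entries are abbreviated in this sketch):

```python
# Sketch of the proposed addition to _UCODE_RANGES in tokenizer_zh.py
# (existing ideograph ranges abbreviated; only the last entry is new).
_UCODE_RANGES = (
    # ... existing CJK ideograph ranges ...
    ('\u3040', '\u30ff'),  # Hiragana (U+3040-U+309F) + Katakana (U+30A0-U+30FF)
)

# Sanity check: representative kana fall inside the new range.
assert '\u3040' <= 'ひ' <= '\u30ff'  # HIRAGANA LETTER HI (U+3072)
assert '\u3040' <= 'カ' <= '\u30ff'  # KATAKANA LETTER KA (U+30AB)
```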
I wonder whether we can add a similar feature to the existing tokenizers. I am thinking we can either add a `cjk` tokenizer or modify the existing `zh` tokenizer. The former could be better for backward-compatibility reasons.
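To make the current gap concrete, the behaviour described above can be checked directly against the `zh` tokenizer, assuming it is importable as below (the module path and class name are inferred from the file referenced earlier and may differ across sacrebleu versions):

```python
from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh

tok = TokenizerZh()

# Kanji fall inside _UCODE_RANGES and are split into single characters,
# while the Hiragana/Katakana portion currently passes through unsplit.
print(tok("日本語のテストです"))
```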