[Feature request] CJK tokenizer for char-level tokenized BLEU #171

Open

chenyangh opened this issue Oct 27, 2021 · 3 comments

chenyangh commented Oct 27, 2021

Hi,

I have recently been working on the WMT'20 EN-JA dataset, and I am wondering if we can add a character-level tokenizer (instead of ja-mecab) to facilitate fair comparison on this task.

Existing literature on EN-JA reports char-level BLEU on the test set [1, 2, 3], following the rules of the WMT'20 competition.

I have attempted to use the zh tokenizer for this purpose. However, that tokenizer ignores Katakana and Hiragana characters.
Our current workaround (suggested by @zhengzx-nlp) is to add
(u'\u3040', u'\u30ff')
to the _UCODE_RANGES in tokenizer_zh.py.
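
Concretely, the local change looks roughly like this (only the added entry is shown; the existing Chinese ranges are elided):

```python
# Sketch of the local workaround in sacrebleu/tokenizers/tokenizer_zh.py
_UCODE_RANGES = (
    # ... existing Chinese ranges kept as-is ...
    (u'\u3040', u'\u30ff'),  # added: Hiragana (U+3040-U+309F) + Katakana (U+30A0-U+30FF)
)
```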

I wonder whether a similar feature could be added to the existing tokenizers. I am thinking we could either add a cjk tokenizer or modify the existing zh tokenizer; the former may be better for backward-compatibility reasons.

@martinpopel (Collaborator)

tokenizer_zh separates Chinese characters and then tokenizes the non-Chinese parts using tokenizer_13a.
So if there is e.g. an English name in a Chinese sentence, each word of the name remains a single token.
This was needed for full compatibility with a legacy Chinese-BLEU evaluation (which is also the reason for listing the Chinese _UCODE_RANGES explicitly instead of using Unicode properties and possibly including all CJK characters).
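
To illustrate, a rough sketch of that behaviour, assuming the TokenizerZh class in sacrebleu.tokenizers.tokenizer_zh is called directly (the exact output spacing may differ):

```python
from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh

tok = TokenizerZh()
# Chinese characters are padded with spaces and become single tokens, while
# the Latin-script name is left to the embedded 13a tokenizer, so "John"
# and "Smith" remain word-level tokens.
print(tok("我见过John Smith。"))
```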

The papers you cite say "BLEU scores are character-level." or "charBLEU". I would interpret this as tokenizing all characters, including those in Latin-script names. For this, we already have sacrebleu --tokenize char (tokenizer_char.py).
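
For example, through the Python API with some toy data (a minimal sketch; it assumes corpus_bleu accepts the same tokenize="char" value as the CLI flag):

```python
import sacrebleu

hyps = ["私は猫が好きです"]
refs = [["私は猫が大好きです"]]

# Every character (including any Latin script) becomes its own token.
bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="char")
print(bleu.score)
```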

Of course, there is the question whether character-level BLEU (limited to character 4-grams, and with BLEU's focus on precision plus a brevity penalty rather than recall) is suitable enough (i.e. correlates well with human evaluation), and why not use e.g. chrF instead (where the default --chrf-char-order is 6).
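
The chrF alternative with the same toy data (again only a sketch; corpus_chrf uses character 6-grams by default):

```python
import sacrebleu

hyps = ["私は猫が好きです"]
refs = [["私は猫が大好きです"]]

# chrF with its default character 6-grams; no tokenizer choice is needed.
chrf = sacrebleu.corpus_chrf(hyps, refs)
print(chrf.score)
```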

zhengzx-nlp commented Oct 27, 2021

Hi @martinpopel,

Thanks for your reply.

In terms of "character-level BLEU" for Chinese and Japanese text, what we mean is exactly what tokenizer_zh does: separating CJK characters and leaving the remaining non-CJK text to 13a tokenization. This makes sense because the underlying purpose is to avoid the ambiguity introduced by different segmentation tools for Chinese and Japanese.

Our issue arises from the fact that _UCODE_RANGES does not cover the commonly used full-width Japanese Katakana and Hiragana (e.g., ひらがな and カタカナ), whereas it does include half-width kana (e.g., ｶﾀｶﾅ; https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_zh.py#L54).

Also, it does not seem to make sense to use tokenizer_char for Japanese/Chinese text, since Latin-script words would also be split into characters.

We would therefore like to ask for _UCODE_RANGES to be extended with an additional range of (u'\u3040', u'\u30ff') [1] to support Katakana and Hiragana.
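
A quick sanity check of the proposed range (plain Python for illustration, not sacrebleu code):

```python
proposed = (u'\u3040', u'\u30ff')  # Hiragana + Katakana blocks

def in_range(ch, lo, hi):
    return lo <= ch <= hi

# Full-width kana fall inside the proposed range...
assert all(in_range(ch, *proposed) for ch in "ひらがなカタカナ")
# ...while half-width katakana do not, and would remain covered only by the
# range already listed in _UCODE_RANGES.
assert not any(in_range(ch, *proposed) for ch in "ｶﾀｶﾅ")
```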

Many thanks!

Zaixiang


Reference:
[1] https://en.wikipedia.org/wiki/Kana#In_Unicode

@ozancaglayan (Collaborator)

It sounds good to me to have a separate tokenizer that extends the ranges with the ones you suggested, so that the current zh tokenizer stays unchanged.
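
Something along these lines, perhaps; this is only a rough sketch, the names TokenizerCJK / 'cjk' are hypothetical rather than existing sacrebleu code, and in practice the full set of zh ranges would be reused instead of the two shown here:

```python
# Hypothetical sacrebleu/tokenizers/tokenizer_cjk.py (sketch only)
from functools import lru_cache

from sacrebleu.tokenizers.tokenizer_base import BaseTokenizer
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a

_CJK_RANGES = (
    (u'\u3040', u'\u30ff'),  # Hiragana + Katakana (the new addition)
    (u'\u4e00', u'\u9fff'),  # CJK Unified Ideographs; the remaining zh ranges would follow
)


class TokenizerCJK(BaseTokenizer):
    """Like the zh tokenizer: isolate CJK characters, 13a-tokenize the rest."""

    def signature(self):
        return 'cjk'

    def __init__(self):
        self._post_tokenizer = Tokenizer13a()

    @staticmethod
    def _is_cjk_char(uchar):
        return any(lo <= uchar <= hi for lo, hi in _CJK_RANGES)

    @lru_cache(maxsize=2 ** 16)
    def __call__(self, line):
        # Pad every CJK character with spaces, then hand the line to 13a.
        line = ''.join(f' {ch} ' if self._is_cjk_char(ch) else ch for ch in line)
        return self._post_tokenizer(line.strip())
```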
