[Feature request] CJK tokenizer for char-level tokenized BLEU #171

Open

chenyangh opened this issue Oct 27, 2021 · 3 comments

chenyangh commented Oct 27, 2021

Hi,

I have recently been working on the WMT'20 EN-JA dataset, and I am wondering if we can add a character-level tokenizer (instead of ja-mecab) to facilitate fair comparison on this task.

Existing literature on EN-JA reports char-level BLEU on the test set [1, 2, 3], following the rules of the WMT'20 competition.

I have attempted to use the zh tokenizer for this purpose. However, that tokenizer ignores Katakana and Hiragana characters.
Our current workaround (suggested by @zhengzx-nlp) is to add
(u'\u3040', u'\u30ff')
to the _UCODE_RANGES in tokenizer_zh.py.
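
Concretely, the local change looks roughly like this (only the added entry is shown; the existing Chinese ranges are elided):

```python
# Sketch of the local workaround in sacrebleu/tokenizers/tokenizer_zh.py
_UCODE_RANGES = (
    # ... existing Chinese ranges kept as-is ...
    (u'\u3040', u'\u30ff'),  # added: Hiragana (U+3040-U+309F) + Katakana (U+30A0-U+30FF)
)
```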

I wonder whether a similar feature could be added to the existing tokenizers. I am thinking we could either add a cjk tokenizer or modify the existing zh tokenizer; the former may be better for backward-compatibility reasons.

@martinpopel (Collaborator)

tokenizer_zh separates Chinese characters and then tokenizes the non-Chinese parts using tokenizer_13a.
So if there is e.g. an English name in a Chinese sentence, each word of the name remains a single token.
This was needed for full compatibility with a legacy Chinese-BLEU evaluation (which is also the reason for listing the Chinese _UCODE_RANGES explicitly instead of using Unicode properties and possibly including all CJK characters).
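
To illustrate, a rough sketch of that behaviour, assuming the TokenizerZh class in sacrebleu.tokenizers.tokenizer_zh is called directly (the exact output spacing may differ):

```python
from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh

tok = TokenizerZh()
# Chinese characters are padded with spaces and become single tokens, while
# the Latin-script name is left to the embedded 13a tokenizer, so "John"
# and "Smith" remain word-level tokens.
print(tok("我见过John Smith。"))
```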

The papers you cite say "BLEU scores are character-level." or "charBLEU". I would interpret this as tokenizing all characters, including those in Latin-script names. For this, we already have sacrebleu --tokenize char (tokenizer_char.py).
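
For example, through the Python API with some toy data (a minimal sketch; it assumes corpus_bleu accepts the same tokenize="char" value as the CLI flag):

```python
import sacrebleu

hyps = ["私は猫が好きです"]
refs = [["私は猫が大好きです"]]

# Every character (including any Latin script) becomes its own token.
bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="char")
print(bleu.score)
```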

Of course, there is the question whether character-level BLEU (limited to character 4-grams, and with BLEU's focus on precision plus a brevity penalty rather than recall) is suitable enough (i.e. correlates well with human evaluation), and why not use e.g. chrF instead (where the default --chrf-char-order is 6).
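
The chrF alternative with the same toy data (again only a sketch; corpus_chrf uses character 6-grams by default):

```python
import sacrebleu

hyps = ["私は猫が好きです"]
refs = [["私は猫が大好きです"]]

# chrF with its default character 6-grams; no tokenizer choice is needed.
chrf = sacrebleu.corpus_chrf(hyps, refs)
print(chrf.score)
```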

zhengzx-nlp commented Oct 27, 2021

Hi @martinpopel,

Thanks for your reply.

In terms of "character-level BLEU" for Chinese and Japanese text, what we mean is exactly what tokenizer_zh does: separating CJK characters and leaving the remaining non-CJK text to 13a tokenization. This makes sense because the underlying purpose is to avoid the ambiguity introduced by different segmentation tools for Chinese and Japanese.

Our issue arises from the fact that _UCODE_RANGES does not cover the commonly used full-width Japanese Katakana and Hiragana (e.g., ひらがな and カタカナ), whereas it does include half-width kana (e.g., ｶﾀｶﾅ; https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_zh.py#L54).

Also, it does not seem to make sense to use tokenizer_char for Japanese/Chinese text, since Latin-script words would also be split into characters.

We would therefore like to ask for _UCODE_RANGES to be extended with an additional range of (u'\u3040', u'\u30ff') [1] to support Katakana and Hiragana.
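
A quick sanity check of the proposed range (plain Python for illustration, not sacrebleu code):

```python
proposed = (u'\u3040', u'\u30ff')  # Hiragana + Katakana blocks

def in_range(ch, lo, hi):
    return lo <= ch <= hi

# Full-width kana fall inside the proposed range...
assert all(in_range(ch, *proposed) for ch in "ひらがなカタカナ")
# ...while half-width katakana do not, and would remain covered only by the
# range already listed in _UCODE_RANGES.
assert not any(in_range(ch, *proposed) for ch in "ｶﾀｶﾅ")
```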

Many thanks!

Zaixiang


Reference:
[1] https://en.wikipedia.org/wiki/Kana#In_Unicode

@ozancaglayan (Collaborator)

It sounds good to me to have a separate tokenizer that extends the ranges with the ones you suggested, so that the current zh tokenizer stays unchanged.
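
Something along these lines, perhaps; this is only a rough sketch, the names TokenizerCJK / 'cjk' are hypothetical rather than existing sacrebleu code, and in practice the full set of zh ranges would be reused instead of the two shown here:

```python
# Hypothetical sacrebleu/tokenizers/tokenizer_cjk.py (sketch only)
from functools import lru_cache

from sacrebleu.tokenizers.tokenizer_base import BaseTokenizer
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a

_CJK_RANGES = (
    (u'\u3040', u'\u30ff'),  # Hiragana + Katakana (the new addition)
    (u'\u4e00', u'\u9fff'),  # CJK Unified Ideographs; the remaining zh ranges would follow
)


class TokenizerCJK(BaseTokenizer):
    """Like the zh tokenizer: isolate CJK characters, 13a-tokenize the rest."""

    def signature(self):
        return 'cjk'

    def __init__(self):
        self._post_tokenizer = Tokenizer13a()

    @staticmethod
    def _is_cjk_char(uchar):
        return any(lo <= uchar <= hi for lo, hi in _CJK_RANGES)

    @lru_cache(maxsize=2 ** 16)
    def __call__(self, line):
        # Pad every CJK character with spaces, then hand the line to 13a.
        line = ''.join(f' {ch} ' if self._is_cjk_char(ch) else ch for ch in line)
        return self._post_tokenizer(line.strip())
```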
