How to get the sentence boundary? #7

wasiahmad · 2020-03-05T08:29:02Z

In the paper, you mention: "We use whitespace tokenization for all of the MLQA languages other than Chinese". Is there a suggested way to get the sentence boundaries, so that we can use additional sentence-level information?

patrick-s-h-lewis · 2020-03-06T12:39:32Z

Hi Wasi,

Multilingual sentence segmentation is challenging. I believe we used Moses' sentence splitter during development: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl. However, this may not cover all the languages. The researcher who did it is on leave; we can ask him how this was done when he returns.

Patrick
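For readers who only need approximate boundaries with character offsets, here is a minimal sketch. It is not the Moses splitter mentioned above and not what was used for MLQA: it assumes a sentence ends at `.`, `!` or `?` followed by whitespace, so it will mis-split abbreviations and will not work for languages with other punctuation conventions. The function names `sentence_spans` and `sentence_index_for_answer` are hypothetical.

```python
import re

# Naive boundary rule (an assumption, not the Moses rules):
# sentence-final punctuation followed by one or more whitespace characters.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentence_spans(text):
    """Return (start, end) character offsets for each sentence in text."""
    spans = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def sentence_index_for_answer(spans, answer_start):
    """Return the index of the sentence containing a character offset, or None."""
    for i, (start, end) in enumerate(spans):
        if start <= answer_start < end:
            return i
    return None


text = "MLQA has seven languages. It is extractive! Is it hard?"
spans = sentence_spans(text)
sentences = [text[s:e] for s, e in spans]
```

Because the spans are character offsets into the original string, a character-level answer offset can be mapped to its sentence without any tokenization.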

wasiahmad · 2020-03-19T04:19:21Z

Can you tell me which whitespace tokenizer was used for tokenization? I tried simple whitespace tokenization, but in many cases I was unable to match the token offsets with the ground truth answer span. The main problem is punctuation symbols.

It would be helpful if you could provide the tokenization script that converts the character offsets into word offsets for the ground truth answer spans.

patrick-s-h-lewis · 2020-03-19T11:50:58Z

Hi Wasi,

Answer spans are highlighted by humans, so there is no tokenization involved. Modelling this correctly is part of the challenge of the dataset.

The evaluation script has the whitespace tokenization that is used to evaluate models.

Patrick
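To illustrate the mismatch described above, here is a minimal sketch of mapping a character-level answer span onto whitespace tokens. It is not a script from the MLQA authors. It assumes the answer is given as a character offset plus the answer text (SQuAD-style), and the function names are hypothetical. The returned `exact` flag shows when the human-highlighted span does not line up with token boundaries, for example when punctuation is attached to a token.

```python
import re


def whitespace_tokens_with_offsets(text):
    """Split on whitespace, keeping each token's (start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(text, answer_start, answer_text):
    """Map a character-level answer span to inclusive whitespace-token indices.

    Returns (first_token, last_token, exact), or None if no token overlaps.
    exact is False when the answer boundaries fall inside a token.
    """
    answer_end = answer_start + len(answer_text)
    tokens = whitespace_tokens_with_offsets(text)
    # A token overlaps the answer if the two character ranges intersect.
    overlapping = [
        i for i, (_, start, end) in enumerate(tokens)
        if start < answer_end and end > answer_start
    ]
    if not overlapping:
        return None
    first, last = overlapping[0], overlapping[-1]
    exact = tokens[first][1] == answer_start and tokens[last][2] == answer_end
    return first, last, exact


context = "The capital of France is Paris, a large city."
paris = char_span_to_token_span(context, context.index("Paris"), "Paris")
france = char_span_to_token_span(context, context.index("France"), "France")
```

Here "Paris" maps to the token "Paris," with `exact` set to False, because the comma is attached to the token, while "France" aligns exactly. How to handle the inexact cases (for example, predicting at the subword or character level) is left to the model design.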

patrick-s-h-lewis closed this as completed Mar 19, 2020
