This repository was archived by the owner on Oct 31, 2023. It is now read-only.

How to get the sentence boundary? #7

Closed
wasiahmad opened this issue Mar 5, 2020 · 3 comments

Comments

@wasiahmad

In the paper, you mention, "We use whitespace tokenization for all of the MLQA languages other than Chinese". Is there a suggested way to get sentence boundaries, so that we can make use of sentence-level information?

@patrick-s-h-lewis
Contributor

Hi Wasi,

Multilingual sentence segmentation is challenging. I believe we used the Moses sentence splitter during development: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl. However, it may not cover all of the languages. The researcher who did this is on leave; we can ask him how it was done when he returns.

Patrick

@wasiahmad
Author

wasiahmad commented Mar 19, 2020

Can you tell me which whitespace tokenizer was used to perform tokenization? I tried simple whitespace tokenization, but in many cases I was unable to match the token offsets with the ground-truth answer span. The main problem is punctuation symbols.

It would be helpful if you could provide the tokenization script that converts the character offsets into word offsets for the ground-truth answer span.
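
To make the mismatch concrete, here is a minimal sketch of one way to map a character-level answer span onto whitespace tokens by overlap. It is not the MLQA authors' script; the function names `whitespace_tokens_with_offsets` and `char_span_to_token_span` and the example sentence are hypothetical, and the overlap rule is an assumption chosen for illustration.

```python
import re


def whitespace_tokens_with_offsets(text):
    """Return (token, start, end) for every whitespace-delimited token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(text, answer_start, answer_text):
    """Map a character span to the (first, last) indices of overlapping tokens.

    Assumption: a token belongs to the answer if it overlaps the character
    span at all, so attached punctuation is pulled into the token span.
    Returns None if no token overlaps the span.
    """
    answer_end = answer_start + len(answer_text)
    tokens = whitespace_tokens_with_offsets(text)
    overlapping = [
        i for i, (_, start, end) in enumerate(tokens)
        if start < answer_end and end > answer_start
    ]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]


# Hypothetical example, not taken from the dataset.
context = "The capital of France is Paris, which lies on the Seine."
answer = "Paris"
span = char_span_to_token_span(context, context.index(answer), answer)
tokens = whitespace_tokens_with_offsets(context)
```

In this example the answer maps to a single token, but that token is `Paris,` rather than `Paris`, which is the punctuation problem described above: the token boundaries do not line up exactly with the human-highlighted character span.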

@patrick-s-h-lewis
Contributor

Hi Wasi,

Answer spans are highlighted by humans, so no tokenization is involved there. Modelling this correctly is part of the challenge of the dataset.

The evaluation script has the whitespace tokenization that is used to evaluate models.
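
The evaluation script in the repository is the authoritative reference. As a rough illustration only, the sketch below shows a SQuAD-style token-level F1 computed over whitespace tokens. The normalization here (lowercasing and stripping ASCII punctuation) is an assumption made for the example, and the real script's normalization may differ, for instance in language-specific handling. The names `normalize` and `token_f1` are hypothetical.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def token_f1(prediction, ground_truth):
    """Token-level F1 between two strings after whitespace tokenization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions against the gold answer "Paris".
score_with_comma = token_f1("Paris,", "Paris")
score_extra_word = token_f1("in Paris", "Paris")
```

Under these assumed rules, a predicted span that carries trailing punctuation is not penalized, whereas an extra word lowers precision and therefore F1.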

Patrick
