This repository was archived by the owner on Oct 31, 2023. It is now read-only.
In the paper, you mentioned, "We use whitespace tokenization for all of the MLQA languages other than Chinese". Is there any suggested way to get the sentence boundaries, so that we can use the additional sentence-level information?
Can you tell us which whitespace tokenizer was used? I tried simple whitespace tokenization, but in many cases I was unable to match the token offsets with the ground-truth answer span. The main problem is punctuation symbols.
It would be helpful if you could provide the tokenization script that converts the character offset into a word offset for the ground-truth answer span.
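In the meantime, here is a minimal sketch of one way to map a character-level answer span onto whitespace-token indices. This is not the authors' actual script; the helper name `char_to_word_offset` is hypothetical, and tokens come from plain whitespace splitting, so punctuation stays attached to words, which is exactly the mismatch described above.

```python
def char_to_word_offset(context, answer_start, answer_text):
    """Map a character-level answer span to whitespace-token indices.

    Illustrative sketch only, not the MLQA authors' tokenization script.
    Punctuation remains glued to adjacent words.
    """
    tokens = []
    char_to_word = []  # for each character, the index of the word it falls in
    prev_is_space = True
    for ch in context:
        if ch.isspace():
            prev_is_space = True
        else:
            if prev_is_space:
                tokens.append(ch)  # start a new token
            else:
                tokens[-1] += ch   # extend the current token
            prev_is_space = False
        char_to_word.append(len(tokens) - 1)

    start_word = char_to_word[answer_start]
    end_word = char_to_word[answer_start + len(answer_text) - 1]
    return tokens, start_word, end_word


context = "MLQA covers seven languages, including Chinese."
tokens, s, e = char_to_word_offset(context, context.index("seven"), "seven languages")
print(tokens[s:e + 1])  # note: the recovered span carries the trailing comma
```

Running this recovers `['seven', 'languages,']`, illustrating why a token-level span can fail to match the character-level answer text exactly when punctuation abuts the answer.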