IndexError processing translate-train vs translate-test datasets #16

gowtham1997 · 2020-12-06T11:22:56Z

I downloaded the translate-train and translate-test datasets from the links in the Translate-Train and Translate-Test Data section of the README.

I am trying to train QA models with translated squad datasets in the translate-train folder.

import transformers
print(transformers.__version__) # outputs '4.0.0'

from transformers.data.processors.squad import SquadV1Processor
processor = SquadV1Processor()

examples = processor.get_train_examples('mlqa-translate-train/', filename='hi_squad-translate-train-train-v1.1.json')

When I run the above code to get train examples from the translate-train JSON files, I get an IndexError (list index out of range), while the same code works for the files in mlqa-translate-test.

Do you happen to know why this is happening?

patrick-s-h-lewis · 2020-12-07T11:36:12Z

Hi, this looks like a problem with transformers, not with MLQA?

gowtham1997 · 2020-12-07T17:59:03Z

Yes, not an issue with MLQA :). This is a general question on the provided translated datasets.

Since the datasets are translated from SQuAD and keep the SQuAD dataset format, I tried the standard SquadV1Processor. It fails on the Translate-Train datasets but works on Translate-Test. The above code also works for other multilingual datasets such as TyDi QA and XQuAD.

I will check if this is related to the dataset or the library.

patrick-s-h-lewis · 2020-12-07T18:10:10Z

Feel free to circle back if there is something up with the data that causes HF to break.
The automatically translated datasets are a bit noisy; it's possible there are some things that are hard for systems to parse and use.
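
One way to check whether the noise is in the data is to scan a SQuAD-format file for entries that a SQuAD-style processor is likely to choke on. The following is a minimal sketch, not something confirmed in this thread: it assumes the failure comes from either an empty `answers` list or an `answer_start` offset that no longer matches the translated context. The function names `find_suspect_entries` and `load_and_check` are hypothetical helpers written for illustration, and only the standard `json` module is used.

```python
import json


def find_suspect_entries(squad_dict):
    """Return (qa_id, reason) pairs for QA entries that look malformed.

    Assumes the usual SQuAD v1.1 layout:
    data -> paragraphs -> (context, qas -> (id, answers -> (text, answer_start))).
    """
    suspects = []
    for article in squad_dict.get("data", []):
        for paragraph in article.get("paragraphs", []):
            context = paragraph.get("context", "")
            for qa in paragraph.get("qas", []):
                answers = qa.get("answers", [])
                if not answers:
                    # Code that reads answers[0] would raise IndexError here.
                    suspects.append((qa.get("id"), "empty answers list"))
                    continue
                for answer in answers:
                    start = answer.get("answer_start", -1)
                    text = answer.get("text", "")
                    if not isinstance(start, int) or not 0 <= start < len(context):
                        suspects.append((qa.get("id"), "answer_start outside context"))
                    elif context[start:start + len(text)] != text:
                        suspects.append((qa.get("id"), "answer text not found at answer_start"))
    return suspects


def load_and_check(path):
    """Load a SQuAD-format JSON file and return its suspect entries."""
    with open(path, encoding="utf-8") as f:
        return find_suspect_entries(json.load(f))
```

Calling `load_and_check('mlqa-translate-train/hi_squad-translate-train-train-v1.1.json')` would list the question ids worth inspecting. If the list comes back empty, the cause is more likely somewhere other than these two assumed failure modes.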

cn-boop · 2021-08-10T13:06:24Z

Have you solved this problem? If so, how? Thanks.

gowtham1997 closed this as completed Dec 7, 2020
