This repository was archived by the owner on Oct 31, 2023. It is now read-only.

IndexError processing translate-train vs translate-test datasets #16

Closed
gowtham1997 opened this issue Dec 6, 2020 · 4 comments

Comments

@gowtham1997

I downloaded the translate-train and translate-test datasets from the links in the Translate-Train and Translate-Test Data section of the README.

I am trying to train QA models on the translated SQuAD datasets in the translate-train folder.

import transformers
print(transformers.__version__) # outputs '4.0.0'

from transformers.data.processors.squad import SquadV1Processor
processor = SquadV1Processor()

examples = processor.get_train_examples('mlqa-translate-train/', filename='hi_squad-translate-train-train-v1.1.json')

When I run the above code to get train examples from the translate-train JSON files, I get an IndexError (list index out of range), while the same code works for the files in mlqa-translate-test.

Do you happen to know why this is happening?

@patrick-s-h-lewis
Contributor

Hi, this looks like a problem with transformers, not with MLQA?

@gowtham1997
Author

Yes, not an issue with MLQA :). This is a general question on the provided translated datasets.

Since the datasets are translated from SQuAD and keep the SQuAD format, I tried the standard SquadV1Processor. It fails on the translate-train datasets but works on translate-test. The same code also works for other multilingual datasets such as TyDi QA and XQuAD.

I will check if this is related to the dataset or the library.
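One way to run that check is to scan the JSON directly, without transformers. The sketch below is only a diagnostic under assumptions: it guesses that the IndexError comes from training questions with an empty `answers` list or an `answer_start` that falls outside the context, which this thread does not confirm. The helper names `find_suspect_qas` and `scan_file` are hypothetical, and the code assumes each answer carries `text` and `answer_start` fields as in SQuAD v1.1.

```python
import json


def find_suspect_qas(squad_data):
    """Return (qas_id, reason) pairs for entries that may break span indexing.

    Assumes the SQuAD v1.1 layout: data -> paragraphs -> qas -> answers.
    """
    suspects = []
    for entry in squad_data["data"]:
        for paragraph in entry["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    continue  # unanswerable questions carry no span to check
                answers = qa.get("answers", [])
                if not answers:
                    suspects.append((qa["id"], "empty answers list"))
                    continue
                first = answers[0]
                start = first["answer_start"]
                if start < 0 or start >= len(context):
                    suspects.append((qa["id"], "answer_start outside context"))
                elif context[start:start + len(first["text"])] != first["text"]:
                    suspects.append((qa["id"], "answer text not found at answer_start"))
    return suspects


def scan_file(path):
    """Load a SQuAD-format JSON file and report its suspect questions."""
    with open(path, encoding="utf-8") as f:
        return find_suspect_qas(json.load(f))
```

Calling `scan_file('mlqa-translate-train/hi_squad-translate-train-train-v1.1.json')` and comparing the result with a translate-test file should show whether the two sets differ in any of these three ways. An empty result would point back toward the library rather than the data.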

@patrick-s-h-lewis
Contributor

Feel free to circle back if there is something up with the data that causes HF to break.
The automatically translated datasets are a bit noisy; it's possible there are some things that are hard for systems to parse and use.
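If the noise turns out to be questions whose answer span cannot be located, one possible workaround is to drop those questions before handing the file to a processor. This is a minimal sketch, not something the MLQA authors or transformers prescribe: `drop_unparseable_qas` is a hypothetical helper, it assumes the SQuAD v1.1 layout, and removing examples changes the training set, so the dropped count is worth reporting.

```python
import copy


def drop_unparseable_qas(squad_data):
    """Return (cleaned_copy, dropped_count) for a SQuAD-format dict.

    A question is kept if it is marked impossible, or if its first answer
    exists and starts inside the paragraph context.
    """
    cleaned = copy.deepcopy(squad_data)  # leave the caller's data untouched
    dropped = 0
    for entry in cleaned["data"]:
        for paragraph in entry["paragraphs"]:
            context = paragraph["context"]
            kept = []
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                usable = bool(answers) and 0 <= answers[0]["answer_start"] < len(context)
                if usable or qa.get("is_impossible", False):
                    kept.append(qa)
                else:
                    dropped += 1
            paragraph["qas"] = kept
    return cleaned, dropped
```

The cleaned dict can be written back out with `json.dump(cleaned, f, ensure_ascii=False)` to a new file, leaving the downloaded original in place for comparison.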

@cn-boop

cn-boop commented Aug 10, 2021

Have you solved this problem? If so, how? Thanks.
