Code and Data for the paper "PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale" at EMNLP 2023 (Findings).
We release the PAXQA datasets on the HuggingFace Hub. The fields are consistent with the MLQA (and therefore SQuAD) fields.
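As a rough illustration of what SQuAD-style fields look like, the sketch below builds one record in the flattened layout commonly used for SQuAD and MLQA (`id`, `context`, `question`, `answers` with `text` and `answer_start`). The record values and the helper `answer_spans_match` are hypothetical, and the exact field names in the released files should be checked against the data itself.

```python
# Hypothetical record in a flattened SQuAD/MLQA-style layout; all values are made up.
example = {
    "id": "demo-0001",
    "context": "The river flows through the old town.",
    "question": "What does the river flow through?",
    "answers": {"text": ["the old town"], "answer_start": [24]},
}


def answer_spans_match(record):
    """Return True if every answer text occurs in the context at its stated offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


if __name__ == "__main__":
    print(answer_spans_match(example))
```

A check like this is a cheap way to catch offset errors after loading, since extractive QA training code typically relies on `answer_start` pointing at the answer text inside `context`.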
The PAXQA test and validation sets are available at this link and consist of 1,788 QA examples in total.
The PAXQA train sets are available at this link and consist of 660K QA examples in total. PAXQA_HWA comprises the 2 *gale*
datasets, while PAXQA_AWA comprises the other 5 datasets.
Table 1 of the paper gives the number of QA examples for each split and each language.
You can verify these numbers against the files you downloaded above (contact the authors if you find any inconsistencies).
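One way to check the counts is sketched below, assuming a downloaded file uses the nested SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas`). That layout is an assumption; if the files are stored differently (for example, one record per line), the counting logic would need to change. The function names are illustrative only.

```python
import json


def count_qa_examples(squad_dict):
    """Count question entries in a nested SQuAD-style dictionary."""
    return sum(
        len(paragraph["qas"])
        for article in squad_dict["data"]
        for paragraph in article["paragraphs"]
    )


def count_qa_examples_in_file(path):
    """Load a SQuAD-style JSON file and count its question entries."""
    with open(path, encoding="utf-8") as f:
        return count_qa_examples(json.load(f))
```

Calling `count_qa_examples_in_file` on each downloaded file and summing the results per split gives totals that can be compared with Table 1.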
This section is forthcoming.
@article{li2023paxqa,
title={\textsc{PaxQA}: Generating Cross-lingual Question Answering Examples at Training Scale},
author={Bryan Li and Chris Callison-Burch},
year={2023},
journal={Findings of the Association for Computational Linguistics: EMNLP}
}