This repository was archived by the owner on Oct 31, 2023. It is now read-only.

XLM Evaluation results #9

Open
nooralahzadeh opened this issue Mar 7, 2020 · 6 comments

@nooralahzadeh

Hi,
I performed some experiments using the Hugging Face implementation of XLM, training on SQuAD v1.1 and evaluating on the MLQA test set. The results are as follows (F1 / EM):
en 68.51/56.13
es 57.59/41.21
ar 47.88/31.41
de 51.99/38.16
zh 38.34/21.39
hi 46.13/31.72
vi 44.09/27.07
I am wondering why there is such a large difference from your results. Did you do anything special apart from early stopping on MLQA-en?

@patrick-s-h-lewis
Contributor

Hi Farad,

Which implementation of XLM are you using?

The HPs for XLM were:
Adam: lr=3e-5,
weight decay 0.005,
clip_norm=5,
epochs=3,
batch size=32,
triangular scheduler: warmup_steps=500, total steps=10000
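For reference, the triangular schedule above (linear warmup to the peak LR over 500 steps, then linear decay over the remaining steps) can be sketched as a plain function. This is my reading of the hyperparameters, not code from the paper; the function name and the decay-to-zero endpoint are assumptions, since the thread does not spell out the decay target:

```python
def triangular_lr(step, peak_lr=3e-5, warmup_steps=500, total_steps=10000):
    """Triangular (linear warmup + linear decay) learning-rate schedule.

    Assumes the LR rises linearly from 0 to peak_lr over warmup_steps,
    then decays linearly back to 0 at total_steps (a common convention;
    the decay target is an assumption, not stated in the thread).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

In PyTorch, this shape can be reproduced by wrapping the optimizer in `torch.optim.lr_scheduler.LambdaLR` with the multiplier `triangular_lr(step) / peak_lr`.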

We used the pytext implementation of XLM. Correct tokenization and preprocessing are very important for good performance. I'm not sure whether the HF version has this correct, as a number of people have struggled to get good results with XLM on HF.
We hope to open-source the code when the colleague who wrote it is back from leave.

@RachelKer

Hello, any updates on the pytext code release? I know the COVID situation may have changed the plans. I am struggling to replicate the one-shot learning results from your paper (that is, training on your MLQA-train Chinese data) with HF XLM-R; inference with a zero-shot model on Chinese works fine. Thank you!

@patrick-s-h-lewis
Contributor

Hi Rachel!

XLM-R wasn't included in our paper, so we can't directly help there.
I'll check internally on the reproducibility code for the MLQA paper.

Patrick

@nooralahzadeh
Author

Hi @RachelKer To achieve similar performance on the zh test set, you just need to add
"final_text = tok_text" after line 497 in squad_metrics.py (only for zh). Because there are no spaces or subwords in Chinese, we don't need to execute the get_final_text() function.

@RachelKer

@nooralahzadeh Thank you, I saw your issue on the HF repo a few days ago, and with this change I managed to get the correct results for BERT and XLM trained on Chinese, but not for XLM-R. Did you manage to train XLM-RoBERTa on Chinese?

@patrick-s-h-lewis Oh indeed, I confused XLM-R and XLM in your paper, I am sorry. I think the training problem I have occurs with XLM-R only. Thanks for checking on the code release anyway, and for your quick answer!

@patrick-s-h-lewis
Contributor

Hey @RachelKer and @nooralahzadeh,

I asked internally about XLM-R (since there is some overlap between the teams). The pytext model is released, but there aren't instructions for how to run it on MLQA, so someone is going to write these instructions up :)

Patrick
