MT evaluation as a multi-class classifier: Macro and Micro averaged F-measures #153
Conversation
gitignore updated sacrebleu CLI epilogue with version and path
Macro/Micro F and BLEU with new API
* Simplify classifier evaluation metrics; remove unwanted approaches and their code * Add license header; simplify clseval.py * Simplification * large smooth_value, micro -> macro
…are included. This code has been used by our work "Macro-Average: Rare Types are Important Too" (to appear at NAACL 2021). I had tried several complicated metrics along the direction of converting BLEU (a micro-averaged metric) into a macro-averaged one. In the end, none of the complicated metrics worked better than a simple macro-averaged F1 measure, so I removed all the research code, cleaned up, and squashed it into a single commit.
Force-pushed from a1a4ea0 to df74a19
This looks very interesting, thanks. I haven't had time to read the paper and review the code yet.
Thank you.
A big +1 for this! 💯
Hi @thammegowda, somehow I missed this PR until now. Are you still interested in merging? That would require building on the latest `master`.
@mjpost Yes, I'd like to merge it! I tried building off the master branch (i.e. v2.0) but I couldn't complete it in one pass. I think we should close this PR and create a new PR by copying the …
Closing this PR in favor of version 2.0 and PR #163.
This code has been used by our work "Macro-Average: Rare Types are Important Too" (to appear at NAACL 2021): https://aclanthology.org/2021.naacl-main.90/
I am still working on the camera-ready version of the paper (haven't uploaded it to arXiv yet); for now, here is a summary: https://isi.edu/~tg/posts/2021/03/macroavg-rare-types-important/
@mjpost and others managing this repo: kindly let me know if these changes can be accepted into this repo, so that I can point to it in the final version of the paper instead of our fork. Thanks!
How to test:
I recommend using `macrof`; `microf` is there for comparison, and also as a placeholder for more complicated averaging strategies, if anyone is interested.
One of my favorite features is `--report`: it helps with a postmortem of model performance at the level of each class/type, i.e. per-type precision, recall, etc.

P.S.
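As a rough illustration of what such a per-type report contains, here is a hypothetical sketch (this is not the PR's `--report` implementation; function and variable names are made up for illustration) that tallies clipped unigram matches per type and derives per-type precision and recall:

```python
# Hypothetical sketch of a per-type report; NOT the PR's --report code.
from collections import Counter

def per_type_report(hyps, refs):
    """Return {type: (matches, hyp_count, ref_count, precision, recall)}."""
    match, hyp_n, ref_n = Counter(), Counter(), Counter()
    for hyp, ref in zip(hyps, refs):
        h, r = Counter(hyp.split()), Counter(ref.split())
        hyp_n += h
        ref_n += r
        match += h & r  # clipped per-segment matches (min of the two counts)
    report = {}
    for t in set(hyp_n) | set(ref_n):
        p = match[t] / hyp_n[t] if hyp_n[t] else 0.0
        rec = match[t] / ref_n[t] if ref_n[t] else 0.0
        report[t] = (match[t], hyp_n[t], ref_n[t], p, rec)
    return report
```

Sorting such a report by frequency or by F1 makes it easy to see which rare types a system is dropping.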
In case you are wondering: I had tried several complicated metrics along the direction of converting BLEU (a micro-averaged metric) into a macro-averaged one. In the end, I couldn't justify the complicated metrics as better than a simple macro-averaged F1 measure (on the current WMT metrics datasets; it looks like the organizers are making some changes, so that's good!). So I cleaned up all the unwanted code and squashed it all into a single commit. For now, these metrics are limited to `max_order=1`, i.e. just unigrams.

EDIT:
Here is a preprint on arXiv: https://arxiv.org/abs/2104.05700
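To make the macro/micro distinction concrete, here is a minimal sketch of both averaging strategies over unigram types (illustrative only; not the `macrof`/`microf` code added by this PR, and the helper names are made up):

```python
# Minimal sketch of macro- vs micro-averaged unigram F1; illustrative only,
# NOT the implementation added by this PR.
from collections import Counter

def _counts(hyps, refs):
    match, hyp_n, ref_n = Counter(), Counter(), Counter()
    for hyp, ref in zip(hyps, refs):
        h, r = Counter(hyp.split()), Counter(ref.split())
        hyp_n += h
        ref_n += r
        match += h & r  # clipped matches per segment
    return match, hyp_n, ref_n

def _f1(m, h, r):
    p = m / h if h else 0.0
    rec = m / r if r else 0.0
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def macro_f1(hyps, refs):
    # Macro: average F1 per type, so rare types weigh as much as frequent ones.
    match, hyp_n, ref_n = _counts(hyps, refs)
    types = set(hyp_n) | set(ref_n)
    return sum(_f1(match[t], hyp_n[t], ref_n[t]) for t in types) / len(types)

def micro_f1(hyps, refs):
    # Micro: pool counts over all types first, so frequent types dominate.
    match, hyp_n, ref_n = _counts(hyps, refs)
    return _f1(sum(match.values()), sum(hyp_n.values()), sum(ref_n.values()))
```

For example, with hypothesis "a b" against reference "a c", the frequent-type pooling of micro gives 0.5, while macro averages the per-type scores (1, 0, 0) to 1/3, penalizing the two mishandled rare types more heavily.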