
MT evaluation as a multi-class classifier: Macro and Micro averaged F-measures #153

Closed
wants to merge 35 commits

Conversation

@thammegowda (Contributor) commented Apr 8, 2021

This code has been used by our work "Macro-Average: Rare Types are Important Too" (to appear at NAACL 2021).
I am still working on the camera-ready version of the paper (I haven't uploaded it to arXiv yet); for now, here is a summary: https://isi.edu/~tg/posts/2021/03/macroavg-rare-types-important/
https://aclanthology.org/2021.naacl-main.90/

@mjpost and others managing this repo: kindly let me know if these changes can be accepted into this repo, so I can point to it in my final version instead of our fork. Thanks!

How to test:

sacrebleu REF.txt -m macrof microf < HYP.detok.txt 

I recommend using macrof; microf is there for comparison, and also as a placeholder for more complicated averaging strategies, should anyone be interested.
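
To make the distinction concrete, here is a rough conceptual sketch of the two averaging schemes (this is not the PR's clseval.py; the real implementation works at the corpus level and applies smoothing):

```python
# Conceptual sketch: macro vs. micro averaged unigram F1.
# Each unigram type is treated as one class; matches are clipped
# counts, as in BLEU's unigram precision.
from collections import Counter

def unigram_f1(hyp_tokens, ref_tokens):
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    types = sorted(set(hyp) | set(ref))
    f1s = []
    match_sum = hyp_sum = ref_sum = 0
    for t in types:
        match = min(hyp[t], ref[t])            # clipped match count
        p = match / hyp[t] if hyp[t] else 0.0  # per-type precision
        r = match / ref[t] if ref[t] else 0.0  # per-type recall
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        match_sum += match
        hyp_sum += hyp[t]
        ref_sum += ref[t]
    # Macro: unweighted mean over types, so rare types count as much
    # as frequent ones.
    macro = sum(f1s) / len(f1s)
    # Micro: pool the counts first, so frequent types dominate.
    mp = match_sum / hyp_sum if hyp_sum else 0.0
    mr = match_sum / ref_sum if ref_sum else 0.0
    micro = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    return macro, micro

print(unigram_f1("the cat sat".split(), "the cat slept".split()))
# -> (0.5, 0.666...): the macro score penalizes the two mismatched
#    rare types equally; the micro score is dominated by the matches.
```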

One of my favorite features is --report: it helps post-mortem model performance down to the level of each class/type, with per-type precision, recall, etc. E.g.

sacrebleu REF.txt -m macrof --report macrof.report.tsv  < HYP.detok.txt 
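
Since the report is a TSV, it is easy to slice programmatically, e.g. with pandas (the column names below, like "f1", are placeholders I made up for illustration; check the file's actual header first):

```python
# Hypothetical post-mortem of the --report output, assuming one row
# per class/type with its per-type statistics.
import pandas as pd

report = pd.read_csv("macrof.report.tsv", sep="\t")
print(report.columns.tolist())  # inspect the real schema first
# e.g. find the worst-scoring (often rare) types, assuming an "f1" column:
# print(report.sort_values("f1").head(20))
```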

P.S.
In case you are wondering: I had tried several more complicated metrics, along the direction of converting BLEU (a micro-averaged metric) into a macro-averaged one. In the end, I couldn't justify any of the complicated metrics over a simple macro-averaged F1 measure (on the current WMT metrics datasets; it looks like the organizers are making some changes, so that's good!). So I cleaned out all the unwanted code and squashed everything into a single commit. For now, these metrics are limited to max_order=1, i.e. just unigrams.

EDIT:
Here is the preprint on arXiv: https://arxiv.org/abs/2104.05700

thammegowda and others added 30 commits March 19, 2020 22:46
gitignore updated
sacrebleu CLI epilogue with version and path
Macro/Micro F and BLEU with new API
thammegowda and others added 4 commits April 7, 2021 16:23
* Simplify classifier evaluation metrics

remove unwanted approaches and their code

* Add license header; simplify clseval.py

* Simplification

* large smooth_value, micro -> macro
…are included

This code has been used by our work "Macro-Average: Rare Types are Important Too" (to appear at NAACL 2021).

I had tried several complicated metrics, along the direction of converting BLEU (a micro-averaged metric) to a macro-average.
In the end, none of the complicated metrics worked better than a simple macro-averaged F1 measure.
So I removed all the research code, cleaned up, and squashed it all into a single commit.
@martinpopel (Collaborator)

This looks very interesting. Thanks.

I have not had time to read the paper and review the code yet.
Nevertheless, we want to merge #152 first and release SacreBLEU 2.0.0, so this PR will have to be rebased afterwards. I am sorry for the complications.

@thammegowda (Contributor, Author) commented May 3, 2021

Thank you.

> Drop Python < 3.6 support and migrate to f-strings.

A big +1 for this! 💯
I can wait until you merge #152. If we get any merge conflicts, please tag me and I will be happy to resolve them.

@mjpost (Owner) commented Sep 8, 2021

Hi @thammegowda, somehow I missed this PR until now. Are you still interested in merging, which would require building on the latest master branch? If so, I will try to take a look at the paper soon.

@thammegowda (Contributor, Author)

@mjpost Yes, I'd like to merge it!

I tried building on the master branch (i.e. v2.0), but I couldn't complete it in one pass.
I am still trying to understand the new changes, e.g. how signatures are created in v2.0.
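
For context, here is my current mental model of the v2.0 metric API that clseval would need to plug into (a sketch based on sacrebleu 2.x as released, not part of this PR; please correct me if the signature handling works differently):

```python
# Sketch of the v2.0 metric/signature usage, as I understand it.
from sacrebleu.metrics import BLEU

hyps = ["the cat sat on the mat"]
refs = [["the cat sat on the mat"]]  # one inner list per reference stream

bleu = BLEU()
score = bleu.corpus_score(hyps, refs)
print(score.score)           # corpus-level score
print(bleu.get_signature())  # signature built from the metric's own config
```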

I think we should close this PR, and create a new PR by copying the sacrebleu/metrics/clseval.py file.

@thammegowda (Contributor, Author)

Closing this PR in favor of version 2.0 and PR #163.
