MT evaluation as a multi-class classifier: Macro and Micro averaged F-measures #153
Conversation
gitignore updated sacrebleu CLI epilogue with version and path
Macro/Micro F and BLEU with new API
* Simplify classifier evaluation metrics; remove unwanted approaches and their code * Add license header; simplify clseval.py * Simplification * large smooth_value, micro -> macro
…are included. This code has been used by our work "Macro-Average: Rare Types are Important Too" (to appear at NAACL 2021). I had tried several complicated metrics along the direction of converting BLEU (a micro-averaged metric) into a macro-averaged one. In the end, none of the complicated metrics worked better than a simple macro-averaged F1 measure, so I removed all the research code, cleaned up, and squashed it into a single commit.
Force-pushed from a1a4ea0 to df74a19
This looks very interesting, thanks. I haven't had time to read the paper and review the code yet.
Thank you.
A big +1 for this! 💯
Hi @thammegowda, somehow I missed this PR until now. Are you still interested in merging? That would require building on the latest `master`.
@mjpost Yes, I'd like to merge it! I tried building off the master branch (i.e. v2.0) but I couldn't complete it in one pass. I think we should close this PR and create a new PR by copying the …
Closing this PR in favor of version 2.0 and PR #163.
This code has been used by our work "Macro-Average: Rare Types are Important Too" (to appear at NAACL 2021): https://aclanthology.org/2021.naacl-main.90/
I am still working on the camera-ready version of the paper (haven't uploaded it to arXiv yet); for now, here is a summary: https://isi.edu/~tg/posts/2021/03/macroavg-rare-types-important/
@mjpost and others managing this repo: kindly let me know if these changes can be accepted into this repo, so that I can point to it in the final version of the paper instead of our fork. Thanks!
How to test:
I recommend using `macrof`; `microf` is there for comparison, and also as a placeholder for more complicated averaging strategies, if anyone is interested.
One of my favorite features is `--report`: it helps with a postmortem of model performance at the level of each class/type, i.e. per-type precision, recall, etc.

P.S.
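As a rough illustration of what such a per-type report contains, here is a hypothetical sketch (this is not the PR's `--report` implementation; function and variable names are made up for illustration) that tallies clipped unigram matches per type and derives per-type precision and recall:

```python
# Hypothetical sketch of a per-type report; NOT the PR's --report code.
from collections import Counter

def per_type_report(hyps, refs):
    """Return {type: (matches, hyp_count, ref_count, precision, recall)}."""
    match, hyp_n, ref_n = Counter(), Counter(), Counter()
    for hyp, ref in zip(hyps, refs):
        h, r = Counter(hyp.split()), Counter(ref.split())
        hyp_n += h
        ref_n += r
        match += h & r  # clipped per-segment matches (min of the two counts)
    report = {}
    for t in set(hyp_n) | set(ref_n):
        p = match[t] / hyp_n[t] if hyp_n[t] else 0.0
        rec = match[t] / ref_n[t] if ref_n[t] else 0.0
        report[t] = (match[t], hyp_n[t], ref_n[t], p, rec)
    return report
```

Sorting such a report by frequency or by F1 makes it easy to see which rare types a system is dropping.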
In case you are wondering: I had tried several complicated metrics along the direction of converting BLEU (a micro-averaged metric) into a macro-averaged one. In the end, I couldn't justify the complicated metrics as better than a simple macro-averaged F1 measure (on the current WMT metrics datasets; it looks like the organizers are making some changes, so that's good!). So I cleaned up all the unwanted code and squashed it all into a single commit. For now, these metrics are limited to `max_order=1`, i.e. just unigrams.

EDIT:
Here is a preprint on arXiv: https://arxiv.org/abs/2104.05700
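To make the macro/micro distinction concrete, here is a minimal sketch of both averaging strategies over unigram types (illustrative only; not the `macrof`/`microf` code added by this PR, and the helper names are made up):

```python
# Minimal sketch of macro- vs micro-averaged unigram F1; illustrative only,
# NOT the implementation added by this PR.
from collections import Counter

def _counts(hyps, refs):
    match, hyp_n, ref_n = Counter(), Counter(), Counter()
    for hyp, ref in zip(hyps, refs):
        h, r = Counter(hyp.split()), Counter(ref.split())
        hyp_n += h
        ref_n += r
        match += h & r  # clipped matches per segment
    return match, hyp_n, ref_n

def _f1(m, h, r):
    p = m / h if h else 0.0
    rec = m / r if r else 0.0
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def macro_f1(hyps, refs):
    # Macro: average F1 per type, so rare types weigh as much as frequent ones.
    match, hyp_n, ref_n = _counts(hyps, refs)
    types = set(hyp_n) | set(ref_n)
    return sum(_f1(match[t], hyp_n[t], ref_n[t]) for t in types) / len(types)

def micro_f1(hyps, refs):
    # Micro: pool counts over all types first, so frequent types dominate.
    match, hyp_n, ref_n = _counts(hyps, refs)
    return _f1(sum(match.values()), sum(hyp_n.values()), sum(ref_n.values()))
```

For example, with hypothesis "a b" against reference "a c", the frequent-type pooling of micro gives 0.5, while macro averages the per-type scores (1, 0, 0) to 1/3, penalizing the two mishandled rare types more heavily.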