SacreBLEU includes an implementation of TER, available via `-m ter`. The computation of HTER is exactly the same; you just need to use "targeted" references for the MT system you plan to evaluate (i.e. human post-edits of that system's output, possibly produced with the help of existing untargeted references).
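For example, a minimal sketch using the Python API (the sentences below are made up; `targeted_refs` stands for the human post-edits of the corresponding MT outputs):

```python
from sacrebleu.metrics import TER

# Raw MT outputs (hypothetical example sentences).
hypotheses = ["the cat sat in the mat", "he go to school yesterday"]

# Targeted references: a single reference stream containing the human
# post-edits of the corresponding MT outputs.
targeted_refs = [["the cat sat on the mat", "he went to school yesterday"]]

ter = TER()
hter = ter.corpus_score(hypotheses, targeted_refs)
print(hter.score)  # edit rate in percent; lower is better
```

The CLI equivalent is `sacrebleu targeted_refs.txt -i hyps.txt -m ter`.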
If you need to strictly follow the original HTER paper, you should also have a set of untargeted references and scale the final score by the ratio of the average length of the targeted references to the average length of the untargeted references.
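A sketch of that length adjustment, assuming whitespace tokenization (the helper below is illustrative, not part of sacrebleu):

```python
def length_adjusted_hter(hter_score, targeted_refs, untargeted_refs):
    """Rescale an HTER score by the ratio of average targeted to average
    untargeted reference length (whitespace-tokenized words)."""
    avg_targeted = sum(len(r.split()) for r in targeted_refs) / len(targeted_refs)
    avg_untargeted = sum(len(r.split()) for r in untargeted_refs) / len(untargeted_refs)
    return hter_score * avg_targeted / avg_untargeted
```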
Note that HTER computation is very costly because you need to create new targeted references for (each version of) each MT system you plan to evaluate. If you want to fairly compare several MT systems, you should create their targeted references at the same time with the same pool of annotators and make sure annotators are assigned randomly.
Note also that HTER was invented before the introduction of modern NMT systems, so it is unclear how well it correlates with human judgements of such systems. Also, it is well known that some systems produce lower-quality translations yet require fewer post-editing edits than other systems, so HTER (like BLEU) would be biased in their favour.
Is there an implementation of the Human-mediated Translation Edit Rate (HTER) algorithm?
Related paper: https://aclanthology.org/2006.amta-papers.25/