TiGEr: a toolkit for text generation evaluation.
Install from PyPI:
pip install tiger-eval
Or install from a local checkout:
pip install .
Features:
- Cross-lingual consistency
- Multiple-choice question evaluation
- Open generation evaluation (currently with llama-2-7b-chat only)
- BLEU score
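As an illustration of the cross-lingual consistency idea, one natural definition is the fraction of parallel questions for which the model gives the same answer in every language. This is a minimal sketch under that assumption; the function name and input layout are hypothetical, not the toolkit's actual API:

```python
from typing import Dict, List

def cross_lingual_consistency(answers_by_lang: Dict[str, List[str]]) -> float:
    """Fraction of questions answered identically in every language.

    answers_by_lang maps a language code to the model's answers for the
    same questions, in the same order across languages.
    """
    answer_lists = list(answers_by_lang.values())
    n = len(answer_lists[0])
    consistent = sum(
        1 for i in range(n)
        if len({answers[i] for answers in answer_lists}) == 1
    )
    return consistent / n

# The model flips its answer on the third question in German,
# so consistency is 2/3.
score = cross_lingual_consistency({
    "en": ["A", "C", "B"],
    "de": ["A", "C", "D"],
})
```

More lenient variants (e.g. majority agreement, or pairwise agreement averaged over language pairs) are easy to substitute for the strict all-languages-agree criterion used here.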
The toolkit should support various metrics:
- N-gram based: ROUGE, BLEU
- Model-based: BERTScore, BLEURT
To do:
- Open generation evaluation needs further improvement
- Multiple-choice questions: explore other answer-matching techniques, e.g., using a model to reformat the answer before matching
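Before reaching for a model to reformat answers, a rule-based matcher already covers many cases: look for a standalone option letter, then fall back to searching for the choice text verbatim. This is a hypothetical sketch of such a baseline (function name and interface are assumptions, not the toolkit's API):

```python
import re
from typing import List, Optional

def match_choice(model_output: str, choices: List[str]) -> Optional[str]:
    """Heuristically map a free-form model answer onto an option letter.

    Tries, in order:
      1. a standalone capital letter among the valid options,
         e.g. "(B)" or "Answer: B";
      2. a verbatim (case-insensitive) occurrence of a choice text.
    Returns the option letter, or None if nothing matches.
    """
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    # 1. Standalone capital letter that is a valid option.
    for m in re.finditer(r"\b([A-Z])\b", model_output):
        if m.group(1) in letters:
            return m.group(1)
    # 2. Choice text appearing verbatim in the output.
    lowered = model_output.lower()
    for letter, choice in zip(letters, choices):
        if choice.lower() in lowered:
            return letter
    return None
```

A model-based reformatting step would slot in as a third fallback when both heuristics return None, keeping the cheap rules on the hot path.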