TerjamaBench: A Cultural Benchmark for English-Darija Machine Translation
Introduction
We introduce TerjamaBench, an evaluation benchmark for English-Darija machine translation. Darija, the Moroccan Arabic dialect, presents unique challenges for machine translation due to its informal nature, regional variations, and scarcity of digital resources. TerjamaBench features meticulously curated parallel texts in English, Arabic-script Darija, and Latin-script Darija (Arabizi), representing a wide range of cultural contexts and regional differences. We assess multiple state-of-the-art models, including proprietary LLMs and open-source models, using several evaluation methods. We also show the limitations of metrics widely used in machine translation (MT) for evaluating Darija translations by analyzing their correlation with human judgment. Our results demonstrate significant gaps in current translation capabilities and provide insights for improving Darija-English translation systems.
Topic | Arabizi | English | Darija |
---|---|---|---|
Religion | lahysmehlina men lwalidin | May God forgive us for any wrongs toward our parents | الله يسمحلينا من الوالدين |
Idioms | zreb t3atal | Rush things and you'll get delayed | زرب تعطل |
Named Entities | sir khod chi carte dyal inwi o dirha f tilifonk | Go get an Inwi SIM card and put it in your phone | سير خود شي كارط ديال إنوي وديرها في تليفونك |
Common Phrases | 3tili tisa3 | Leave me alone | عطيلي التساع |
Humor | chb3na tkrkir | We laughed our heads off | شبعنا تكركير |
Numeric and Date | manl9ach 3ndek chi zer9a | Do you have two hundred dirhams | منلقاش عندك شي زرقة |
Mixed Language | Une fois nwessl l dar n3iyt lik | I'll call you as soon as I get home | انفوا نوصل للدار نعيط ليك |
Examples from the TerjamaBench dataset. The dataset is available at atlasia/TerjamaBench on the Hugging Face Hub.
Benchmark Design
TerjamaBench was built through a careful process, addressing the unique challenges of Darija, which exhibits significant regional and linguistic diversity. The benchmark’s development involved curating varied data, extracting valuable insights, and acknowledging the dataset’s inherent limitations.
Data Curation Process
The dataset was curated manually by 16 annotators and 14 reviewers, all native Moroccans. Each annotator brought regional expertise, ensuring a broad representation of Darija’s variations across Morocco. The goal was to capture both formal and informal expressions, with an emphasis on the spoken nature of the language. We followed a structured approach:
- Clear guidelines for annotation.
- Validation steps to ensure linguistic and cultural authenticity.
- Documentation of regional variations and their frequency.
- Multiple review rounds by native speakers to ensure accuracy.
Key Insights and Statistics
The dataset contains 850 entries, structured into six columns:
- Topic: Broad category of the sentence.
- Subtopic: More specific classification within the topic.
- Arabizi: Latin-script written Darija.
- English: English translation of the Darija text.
- Darija (in Arabic letters): Arabic-script written Darija.
- Annotator Dialect (City): The regional variation spoken by the annotator.
The dataset includes both standard phrases and idiomatic expressions, with a focus on minimizing bias and on capturing the frequent code-switching seen in Darija, where speakers blend Arabic, Tamazight, French, and sometimes English.
The topics span a wide range of categories:
Topic | Description | Number of samples |
---|---|---|
Common Phrases | Everyday expressions like greetings and common sayings. | 136 |
Named Entities | Sentences with proper nouns, place names, cities, etc. | 53 |
Numeric and Date Expressions | Sentences containing numbers, dates, or time expressions. | 62 |
Educational | Sentences from domains like medical, legal, or scientific contexts. | 73 |
Mixed Language Content | Sentences combining Darija with MSA, French, or English. | 50 |
Idioms | Proverbs and sayings unique to Moroccan culture. | 51 |
Humor | Jokes, puns, or humorous expressions. | 50 |
Religion | Sentences containing religious terms or expressions. | 66 |
Single Words | Isolated words to test basic translation capabilities. | 163 |
Long Sentences | Sentences designed to test coherence in lengthy translations. | 50 |
Incorrect Spellings | Sentences with slight spelling errors to evaluate model robustness. | 50 |
Dialectal Variations | Sentences from different Moroccan regions (northern, eastern, southern). | 46 |
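To make the dataset easy to explore, here is a minimal loading sketch using the Hugging Face `datasets` library. It assumes the public repo id `atlasia/TerjamaBench` exposes a single split with a `topic` column; the actual split and column names may differ slightly from this illustration.

```python
# Minimal sketch: load TerjamaBench from the Hugging Face Hub and count samples per topic.
# Assumption: the repo ships a standard "train" split with a "topic" column.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("atlasia/TerjamaBench", split="train")

print(ds.column_names)  # inspect the actual column names (topic, subtopic, English, Darija, ...)
print(len(ds))          # expected: 850 entries

# Reproduce the per-topic counts shown in the table above.
for topic, count in Counter(ds["topic"]).most_common():
    print(f"{topic}: {count}")
```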
Limitations
Despite the thorough curation process, the dataset still has some limitations. First, there is a regional bias: even though we tried to represent a diverse range of dialects, certain regions remain overrepresented. Another challenge is the orthographic variation in written Darija. Since Darija lacks a standardized writing system, inconsistencies in spelling and grammar are common, which complicates the task for machine translation models. The use of Arabizi, an informal and phonetically driven script without formal rules, adds further complexity, making normalization difficult.
Experimental Setup
Dataset
The initial dataset contained 850 entries. After deduplication and removing the "dialect_variation" topic due to its complexity, our final experimental subset contained 788 samples. For human evaluation, we selected a stratified random sample of roughly 30% per topic (241 entries), ensuring proportional representation across all topics.
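The stratified sampling step itself is straightforward; below is a minimal sketch assuming the experimental subset lives in a pandas DataFrame `df` with one row per entry and a `topic` column (the seed is illustrative, not the one used in our runs).

```python
# Hedged sketch: draw a ~30% stratified sample per topic for human evaluation.
# Assumption: `df` is a pandas DataFrame with a "topic" column over the 788-sample subset.
import pandas as pd

def stratified_sample(df: pd.DataFrame, frac: float = 0.30, seed: int = 42) -> pd.DataFrame:
    """Sample `frac` of the rows within each topic, keeping topic proportions."""
    return (
        df.groupby("topic", group_keys=False)
          .sample(frac=frac, random_state=seed)
          .reset_index(drop=True)
    )

# human_eval_df = stratified_sample(df)  # roughly 30% of the rows, proportional across topics
```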
Models
We evaluated a diverse set of both proprietary and open-source models to benchmark translation performance in English-to-Darija:
- gemini-exp-1206, claude-3-5-sonnet-20241022, gpt-4o-2024-08-06: These proprietary models were selected based on their top-tier performance under human judgment on English2Darija translation.
- atlasia/Terjman-Large-v1.2, atlasia/Terjman-Nano: AtlasIA’s MT models, fine-tuned specifically for English2Darija translation.
- MBZUAI-Paris/Atlas-Chat-9B: An open-source Darija LLM.
- facebook/nllb-200-3.3B: Used as a baseline.
Model | Parameters | Type | Base Architecture |
---|---|---|---|
gemini-exp-1206 | * | Proprietary | |
claude-3-5-sonnet-20241022 | * | Proprietary | |
gpt-4o-2024-08-06 | * | Proprietary | |
atlasia/Terjman-Large-v1.2 | 240M | Open Source | Helsinki-NLP/opus-mt-tc-big-en-ar |
atlasia/Terjman-Nano | 77M | Open Source | Helsinki-NLP/opus-mt-en-ar |
MBZUAI-Paris/Atlas-Chat-9B | 9B | Open Source | gemma-2-9b |
facebook/nllb-200-3.3B | 3.3B | Open Source |
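For reference, candidate translations with the open-source baseline can be produced with the `transformers` translation pipeline, as in the hedged sketch below. The FLORES-200 code `ary_Arab` is assumed for Arabic-script Moroccan Arabic, and the generation settings are illustrative rather than the exact ones used in our experiments.

```python
# Hedged sketch: English -> Darija with the NLLB-200 baseline via transformers.
# Assumption: enough memory to host the 3.3B checkpoint; settings are illustrative.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-3.3B",
    src_lang="eng_Latn",  # English, Latin script
    tgt_lang="ary_Arab",  # Moroccan Arabic (Darija), Arabic script (FLORES-200 code)
    max_length=128,       # cap output length; tune for the long_sentences topic
)

outputs = translator(["I'll call you as soon as I get home."])
print(outputs[0]["translation_text"])
```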
Evaluation Approaches
Metric-Based Evaluation
To evaluate model performance, we employed three standard and widely used MT metrics:
- BLEU (BiLingual Evaluation Understudy): measures n-gram overlap between the model's output and reference translations.
- chrF (CHaRacter-level F-score): focuses on character-level n-grams, providing finer-grained insights into similarity, particularly for morphologically rich languages like Darija.
- TER (Translation Error Rate): computes the number of edits required to transform the model's output into the reference translation.
However, we acknowledge their limitations, particularly for a language with high orthographic and linguistic variability like Darija. The next sections highlight why these metrics may fall short in fully capturing translation quality in the context of Moroccan Darija.
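For concreteness, the three metrics can be computed at the corpus level with the `sacrebleu` library, as in the minimal sketch below; tokenization and normalization options are left at their defaults, which may differ from a production evaluation setup.

```python
# Hedged sketch: corpus-level BLEU, chrF, and TER with sacrebleu (default settings).
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["الطقس زوين اليوم"]    # model outputs, one per source sentence
references = [["الجو زوين اليوم"]]   # a single reference stream aligned with the hypotheses

bleu, chrf, ter = BLEU(), CHRF(), TER()
print("BLEU:", bleu.corpus_score(hypotheses, references).score)  # higher is better
print("chrF:", chrf.corpus_score(hypotheses, references).score)  # higher is better
print("TER :", ter.corpus_score(hypotheses, references).score)   # lower is better
```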
LLM-as-a-Judge Evaluation
To complement traditional metrics and provide a more context-sensitive evaluation, we leveraged Claude 3.5 Sonnet (2024-10-22) as an evaluation judge. Using the prompt reproduced in the Appendix, we assessed translations by feeding the model both the reference and the generated output, scoring each sample on a 4-point scale:
- -1: Translation contains repetitive tokens or clear bugs.
- 0: Translation is incorrect, nonsensical, or lacks any Darija words.
- 1: Translation is mostly correct but includes Modern Standard Arabic elements (at least one Darija word) or has minor typos.
- 2: Translation is fully correct and entirely in Darija.
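A minimal sketch of such a judging call with the Anthropic Python SDK is shown below. The 15-sample batching and the full judging instructions are abbreviated (`JUDGE_PROMPT` is a placeholder for the prompt in the Appendix), and the parsing assumes the judge follows the one-JSON-object-per-line format it is asked for.

```python
# Hedged sketch: score one translation with Claude 3.5 Sonnet acting as judge.
# Assumptions: the `anthropic` SDK, an ANTHROPIC_API_KEY in the environment,
# and JUDGE_PROMPT holding the full instructions from the Appendix.
import json

import anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = "..."  # full LLM-as-a-judge prompt (see Appendix)

sample = {
    "English": "the speech was addressed to all the people who were present",
    "Darija": "الهدرة توجهات لكاع الناس اللي كانو حاضرين",
    "machine": "الخطاب توجه لجميع الناس اللي كانو حاضرين",
}

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": JUDGE_PROMPT + "\n" + json.dumps(sample, ensure_ascii=False)}],
)

# The judge is instructed to emit one JSON object per line: {"analysis": ..., "score": ...}.
for line in response.content[0].text.strip().splitlines():
    verdict = json.loads(line)
    print(verdict["score"], verdict["analysis"])
```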
Human Evaluation
To validate the reliability of our evaluation approaches, we conducted a human evaluation using a random subsample of 30% from each topic, resulting in 241 samples. We used the same 4-point scale as the LLM-as-a-judge. The primary goal was to assess whether metrics-based approaches (BLEU, chrF, TER) and LLM-as-a-judge evaluations are correlated with human judgments.
Results and Analysis
Model Performance Comparison
Metric-based
Model | Parameters | BLEU↑ | chrF↑ | TER↓ |
---|---|---|---|---|
Proprietary Models | ||||
gemini-exp-1206 | * | 30.69 | 54.16 | 67.62 |
claude-3-5-sonnet-20241022 | * | 30.51 | 51.8 | 67.42 |
gpt-4o-2024-08-06 | * | 28.3 | 50.13 | 71.77 |
Open Source Models | ||||
atlasia/Terjman-Large-v1.2 | 240M | 16.33 | 37.1 | 89.13 |
MBZUAI-Paris/Atlas-Chat-9B | 9B | 14.8 | 35.26 | 93.95 |
facebook/nllb-200-3.3B | 3.3B | 14.76 | 34.17 | 94.33 |
atlasia/Terjman-Nano | 77M | 9.98 | 26.55 | 106.49 |
Table 1: BLEU, chrF, and TER scores for each model. Higher BLEU and chrF scores indicate better alignment with reference translations, while lower TER scores indicate fewer edits needed.
Proprietary models consistently outperform open-source models, with gemini-exp-1206 and claude-3-5-sonnet-20241022 leading across both metrics-based and topic-specific evaluations (cf Table 2 in Appendix). Among open-source models, atlasia/Terjman-Large performs the best, though it significantly lags behind proprietary counterparts. Topics like "religion" and "single_words" are easier, as evidenced by higher scores across models, while "idioms" and "long_sentences" are notably challenging, particularly for open-source models, highlighting their struggle with context-sensitive and structurally complex translations.
LLM-as-a-Judge
The LLM-as-a-judge evaluation (Appendix - Table 3) maintains the same hierarchy as metrics-based evaluation for proprietary models, but reveals a more nuanced picture for open-source ones. While proprietary models still lead (63.58% high-quality translations for gemini-exp-1206), Atlas-Chat-9B (39.47%) performs competitively with Terjman-Large (32.87%) despite having lower metric scores. The topic breakdown (Appendix - Table 4) shows interesting patterns: models maintain high performance on "common_phrases" and "named_entities" but struggle more with "idioms" and "long_sentences", particularly open-source ones.
Human evaluation
Human judgments (Appendix - Table 5) corroborate the superiority of proprietary models, with gemini-exp-1206 achieving the highest ratings. However, the topic-level analysis (Appendix - Table 6) shows even proprietary models struggle with culturally-loaded topics like "humor" and "idioms", while excelling at more straightforward topics like "religion" and "common_phrases". Overall, the consistently strong performance of proprietary models across all evaluation approaches, particularly on challenging topics, highlights the current limitations of open-source alternatives in handling Darija's linguistic complexity.
Human evaluators also rated translations more favorably overall than the LLM judge did for most models, as shown below.
Model | LLM-as-a-judge | Human evaluation |
---|---|---|
gemini-exp-1206 | 63.07 | 84.23 |
claude_3_5_sonnet | 65.56 | 79.67 |
gpt-4o-2024-08-06 | 56.43 | 67.22 |
MBZUAI-Paris/Atlas-Chat-9B | 36.51 | 50.21 |
atlasia/Terjman-Large-v1.2 | 29.05 | 48.13 |
facebook/nllb-200-3.3B | 21.58 | 32.78 |
atlasia/Terjman-Nano | 11.62 | 21.99 |
Table 7: Percentage of 2-scored samples under LLM-as-a-judge and human evaluation on the human-evaluation subset.
This suggests that current automated evaluation approaches do not fully align with human assessment. This progressive analysis through different evaluation lenses reveals that while metric-based approaches capture broad performance trends, they may underestimate both the absolute quality of translations and the true difficulty gap between simple and complex topics.
Correlation between human evaluation and other approaches
To validate the reliability of our automated evaluation methods, we conducted a comprehensive correlation analysis between human evaluation scores and other evaluation approaches on the same subset described in the Human Evaluation section. Table 8 presents the Spearman correlation coefficients.
 | BLEU | chrF | TER | LLM-as-a-judge |
---|---|---|---|---|
Spearman Correlation | 0.345 | 0.406 | -0.359 | 0.411 |
Table 8: Correlation between human evaluation and other evaluation approaches. All p-values are below 10⁻⁴.
- Metric reliability: among the string-based metrics, chrF shows the strongest correlation with human judgment.
- Error metrics: TER shows a moderate negative correlation with human evaluation, indicating that while it captures some aspects of translation quality, it may not fully align with human perception of dialectal translation adequacy.
- LLM-as-a-judge: this approach achieves a slightly higher correlation than chrF, indicating its potential as a robust evaluation method.
All correlations are statistically significant (p < 0.001), indicating the reliability of these relationships. However, the moderate strength of these correlations suggests that no single automated metric can fully replace human evaluation for assessing Darija translation quality. This finding underscores the importance of using multiple evaluation approaches, as we have done in this study, to get a comprehensive understanding of translation quality. These results also highlight the need for developing more sophisticated evaluation metrics specifically designed for dialectal Arabic translation, potentially incorporating features that better align with human judgment of translation quality in dialectal contexts.
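For reproducibility, the coefficients in Table 8 can be recomputed with `scipy`; the sketch below assumes aligned per-sample arrays of human ratings and automatic scores over the human-evaluation subset (the values shown are illustrative, not our data).

```python
# Hedged sketch: Spearman correlation between human ratings and an automatic metric.
# Assumption: `human_scores` and `chrf_scores` are aligned over the evaluation subset.
from scipy.stats import spearmanr

human_scores = [2, 1, 0, 2, -1, 2]                 # illustrative ratings on the 4-point scale
chrf_scores = [61.2, 44.0, 18.5, 70.3, 5.1, 66.8]  # illustrative per-sample chrF values

rho, p_value = spearmanr(human_scores, chrf_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.2e})")
```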
Qualitative analysis of proprietary LLMs' performance
- Gemini-exp-1206 demonstrates a solid capability in handling Darija, although it sometimes produces unnatural or awkward phrases. Its most frequent mistakes are literal translations from English and occasional use of Standard Arabic. Some of the issues include the following:
  - Awkward constructions
    - Example: "مال هاد الضحك غير على سبة" (What’s with this silly laughter)
  - Verb-subject agreement
    - Example: "ألف درهم راه" (should be "راها")
  - Collocation errors
    - Example: "ديال أيام القراية" (“from school days” sounds unnatural)
  - Missing articles
    - Example: "شكلاط سخون عفاك" (should include an article like "شي" or "واحد ال")
  - Standard Arabic vocabulary
    - Examples: "مضحك", "ما كياخد حتى حاجة محمل الجد"
  - Literal translations from English
    - Examples: "كيفاش كتقدر" (how dare you), "السلام الصاحب" (hello friend)
- GPT-4o is also competent in Darija but struggles with consistent word choices and literal translations that reduce the naturalness of the output. Some of these issues include the following:
  - Incorrect translation/word usage
    - Examples: "شحال فعامك؟" (How old are you), "واش عطيتي لماما العصير" (Have you given your mother juice)
  - Literal translations
    - Examples: "ضعت لي مفاتيح الدار" (I lost my house keys), "غادي نطيح من الضحك" (I’ll faint from laughter)
  - Inconsistent handling of “آ”
    - Examples: "سير دي ولادك لغابة المعمورة صاحبي", "شنو قلت ليك يا هاجر؟"
  - Missing suffixes
    - Example: "شفت الماتش ديال Barça البارح؟" (should be "شفتي")
  - Collocation issues
    - Example: "شنو النهار اليوم؟" (should be "شمن نهار")
  - Occasional Standard Arabic usage
    - Examples: "ماخصش الواحد يزعل بسهولة", "مضحك"
- Claude 3.5 Sonnet handles Moroccan Darija with notable difficulties, especially the use of literal translations and vocabulary or expressions more common in Standard Arabic. Below are some of the issues that were observed:
  - Verb-object agreement
    - Examples: وصلو رسالة, بلي وقع شي مخالفة (should be وصلاتو)
  - Collocation issues
    - Examples: "تحكم عليه فالمحكمة الفيدرالية ف 2008 على رشوة" (should be بالرشوة), "للمعاش ديالو عند التقاعد"
  - Word order issues
    - Example: "Nothing has changed" -> والو ما تبدل (should be ما تبدل والو)
  - Excessive definiteness
    - Example: “human persons have a right to life” -> “البنادم عندو الحق فالحياة” (should be بنادم)
  - Use of Standard Arabic (Fus7a) vocabulary/expressions
    - Examples: المقبلة, بسماحهم, القسط, بنايات, على طول, لرفيقتو
  - Literal translations
    - Example: “cease to be blind” -> “توقفو تكونو عميان”
  - Issues with certain question forms
    - Example: “Which of the following observations about revolutions and gender is best supported by the first passage?” -> “شنو من هاد الملاحظات على الثورات والجنس هي اللي كتدعمها أحسن الفقرة الأولى؟”
Summary and Future Research Directions
TerjamaBench makes several significant contributions to the advancement of English-Darija machine translation:
- A rich, diverse, and culturally specific benchmark dataset designed to reflect authentic Moroccan Darija usage, spanning categories like everyday expressions, technical vocabulary, and regional dialects.
- Comparative study of diverse evaluation approaches, highlighting their consistency with human evaluations.
- A detailed quantitative and qualitative evaluation of various models across different linguistic challenges, such as syntax, semantics, and dialectal variations.
Our findings reveal that while proprietary models show promising results, significant challenges remain in:
- Handling regional dialectal variations
- Translating idiomatic expressions
- Maintaining consistency in long-form translations
- Processing mixed-language content
Future work should focus on:
- Expanding the benchmark to include more regional variations
- Developing Darija-specific evaluation metrics
- Improving open-source models' performance on cultural expressions
The significant gap between proprietary and open-source models highlights the need for more investment in open-source Darija translation capabilities to improve the accessibility of these technologies.
Acknowledgments
Special recognition goes to the contributors: Aissam Outchakoucht, Chaymae Rami, Mahmoud Bidry, Zaid Chiech, Imane Momayiz, Abdelaziz Bounhar, Abir Arsalane, Abdeljalil ElMajjodi, Aymane ElFirdoussi, Nouamane Tazi, Salah-Eddine Iguiliz, Hamza Essamaali, Ihssane Nedjaoui, Anas Amchaar, Yousef Khoubrane, Khaoula Alaoui, Salah-Eddine Alabouch, Adnan Anouzla, Bilal El Hammouchi, Taha Boukhari, Mustapha Ajeghrir, Ikhlas Elhamly, Fouad Aurag, Omar Choukrani, Ali Nirheche, Yanis Bardes, Abdelmonaim Bounite.
Citation
@article{atlasia2024terjamabench,
title={TerjamaBench: A Culturally Specific Dataset for Evaluating Translation Models for Moroccan Darija},
author={Imane Momayiz and Aissam Outchakoucht and Omar Choukrani and Ali Nirheche},
year={2024},
url={https://huggingface.co/datasets/atlasia/TerjamaBench/},
institution={AtlasIA}
}
Appendix
Prompt used to generate translations
You are a Moroccan Arabic (Darija) translator. Your task is to translate text from English to Moroccan Arabic using Arabic script, following these guidelines:
1. Maintain any JSON formatting in the original text
2. For words without common Arabic equivalents, use their French translations as Moroccans would do
3. Preserve all code and technical terms in French/English
4. Adapt any culturally sensitive content to be appropriate for Moroccan audiences
5. For idioms, literature, examples, and questions, provide natural Moroccan Arabic translations
6. Use Moroccan Arabic instead of Modern Standard Arabic whenever possible (VERY IMPORTANT)
Format your response as:
[
{"original": "I love going to the beach", "translation": "كنبغي نمشي للبحر"},
{"original": "The weather is nice today", "translation": "الجو زوين اليوم"}
]
Please translate the following JSON list of texts:
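A minimal sketch of sending this prompt to one of the evaluated proprietary models is shown below, using the OpenAI Python SDK. The batch size and decoding settings are illustrative, `TRANSLATION_PROMPT` stands in for the full prompt above, and the parsing assumes the model honors the requested JSON format.

```python
# Hedged sketch: request English -> Darija translations with gpt-4o using the prompt above.
# Assumptions: the `openai` SDK (v1+) and an OPENAI_API_KEY in the environment.
import json

from openai import OpenAI

client = OpenAI()

TRANSLATION_PROMPT = "..."  # full translation prompt reproduced above
texts = ["I love going to the beach", "The weather is nice today"]

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    temperature=0,  # illustrative; deterministic decoding
    messages=[{"role": "user", "content": TRANSLATION_PROMPT + "\n" + json.dumps(texts)}],
)

# The model is asked to reply with a JSON list of {"original", "translation"} objects.
for item in json.loads(response.choices[0].message.content):
    print(item["original"], "->", item["translation"])
```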
Prompt used for LLM as a judge
You are a native Moroccan Arabic (Darija) speaker and expert linguist. You will evaluate machine translations into Moroccan Arabic.
For each example, you will be given:
1. The original English text
2. The ground truth Moroccan Arabic in Arabic script
3. A machine-generated translation in Arabic script
Please evaluate the machine translation by:
1. Comparing it to the ground truth version (Arabic script). But keep in mind that the ground truth can be in a different dialect.
2. Checking for:
- Accuracy of meaning
- Natural Moroccan dialect usage
- Appropriate colloquial expressions
- Correct grammar and word choice
3. Give a score where:
-1 = Contains repetitive tokens or bugs
0 = Translation is incorrect, makes no sense, or contains no Darija words
1 = Translation is correct but mixed with Modern Standard Arabic (contains at least one Darija word) or has minor typos
2 = Translation is correct and fully in Darija
Format your response in JSON format, with each line containing a JSON object with these fields in this order:
- analysis: Brief explanation of score, highlighting strengths/weaknesses
- score: Integer score (-1, 0, 1, or 2)
# Example evaluation
Input:
{
"English": "the speech was addressed to all the people who were present",
"Darija": "الهدرة توجهات لكاع الناس اللي كانو حاضرين",
"machine": "الخطاب توجه لجميع الناس اللي كانو حاضرين"
}
Output:
{"analysis": "استعملو 'جميع' عوض 'كاع'، وهادي أقرب للعربية الفصحى", "score": 1}
{"analysis": "مسا الخير صحيحة بالدارجة غير هي بلهجة مختلفة", "score": 2}
Here are 15 samples to evaluate:
topic | gemini-exp-1206 | claude_3_5_sonnet | gpt-4o-2024-08-06 | atlasia/Terjman-Large-v1.2 | facebook/nllb-200-3.3B | MBZUAI-Paris/Atlas-Chat-9B | atlasia/Terjman-Nano |
---|---|---|---|---|---|---|---|
common_phrases | 27.27 | 27.99 | 26.86 | 20.75 | 14.87 | 19.26 | 8.71 |
educational | 25.96 | 21.64 | 19.86 | 11.66 | 13.2 | 8.83 | 9.7 |
humor | 21.37 | 17.63 | 15.11 | 15.58 | 7.26 | 10.08 | 8.88 |
idioms | 23.93 | 18.37 | 12.26 | 4.58 | 2.72 | 6.16 | 3.62 |
incorrect_spellings | 18.82 | 17.76 | 15.88 | 13.33 | 9.15 | 11.03 | 7.94 |
long_sentences | 15.57 | 11.23 | 13.5 | 8.28 | 4.86 | 5.82 | 6.89 |
mixed_language | 19.7 | 23.4 | 20.88 | 11.98 | 13.44 | 13.49 | 9.82 |
named_entities | 28.42 | 24.04 | 26.69 | 13.35 | 11.1 | 11.93 | 10.3 |
numeric_and_date | 25.61 | 26.76 | 20.73 | 16.16 | 12.39 | 9.91 | 10.05 |
religion | 53.15 | 51.5 | 48.12 | 25.64 | 26.92 | 21.63 | 18.12 |
single_words | 45.34 | 50.93 | 47.83 | 20.5 | 23.6 | 22.05 | 11.66 |
Table 2: BLEU score per model per topic. (Higher is better)
Model | % High-quality translations (2-scored) |
---|---|
gemini-exp-1206 | 63.58 |
claude-3-5-sonnet-20241022 | 62.69 |
gpt-4o-2024-08-06 | 57.11 |
MBZUAI-Paris/Atlas-Chat-9B | 39.47 |
atlasia/Terjman-Large-v1.2 | 32.87 |
facebook/nllb-200-3.3B | 24.75 |
atlasia/Terjman-Nano | 14.21 |
Table 3: Percentage of translations rated as high quality (score = 2) by LLM-as-a-judge for each model.
topic | gemini-exp-1206 | claude_3_5_sonnet | gpt-4o-2024-08-06 | atlasia/Terjman-Large-v1.2 | MBZUAI-Paris/Atlas-Chat-9B | atlasia/Terjman-Nano | facebook/nllb-200-3.3B |
---|---|---|---|---|---|---|---|
common_phrases | 75.56 | 71.85 | 72.59 | 48.15 | 48.15 | 18.52 | 31.85 |
educational | 47.89 | 53.52 | 43.66 | 30.99 | 40.85 | 11.27 | 25.35 |
humor | 55.1 | 53.06 | 46.94 | 32.65 | 32.65 | 12.24 | 22.45 |
idioms | 68.63 | 62.75 | 56.86 | 13.73 | 23.53 | 11.76 | 13.73 |
incorrect_spellings | 60.87 | 54.35 | 52.17 | 43.48 | 43.48 | 10.87 | 17.39 |
long_sentences | 70 | 84 | 58 | 30 | 36 | 6 | 16 |
mixed_language | 51.06 | 46.81 | 51.06 | 31.91 | 34.04 | 8.51 | 19.15 |
named_entities | 73.58 | 79.25 | 71.7 | 32.08 | 24.53 | 16.98 | 22.64 |
numeric_and_date | 61.02 | 59.32 | 44.07 | 30.51 | 50.85 | 16.95 | 30.51 |
religion | 59.09 | 59.09 | 56.06 | 22.73 | 16.67 | 7.58 | 21.21 |
single_words | 63.35 | 59.63 | 56.52 | 30.43 | 50.31 | 19.25 | 29.19 |
Table 4: Percentage of translations rated as high quality (score = 2) by LLM-as-a-judge for each model per topic.
Model | % High-quality translations (2-scored) |
---|---|
gemini-exp-1206 | 84.23 |
claude-3-5-sonnet-20241022 | 79.67 |
gpt-4o-2024-08-06 | 67.22 |
MBZUAI-Paris/Atlas-Chat-9B | 50.21 |
atlasia/Terjman-Large-v1.2 | 48.13 |
facebook/nllb-200-3.3B | 32.78 |
atlasia/Terjman-Nano | 21.99 |
Table 5: Percentage of high-quality translations (score = 2) rated by human evaluators for each model.
topic | gemini-exp-1206 | claude_3_5_sonnet | gpt-4o-2024-08-06 | atlasia/Terjman-Large-v1.2 | MBZUAI-Paris/Atlas-Chat-9B | atlasia/Terjman-Nano | facebook/nllb-200-3.3B |
---|---|---|---|---|---|---|---|
common_phrases | 90.24 | 85.37 | 78.05 | 51.22 | 73.17 | 31.71 | 39.02 |
educational | 90.91 | 72.73 | 77.27 | 54.55 | 27.27 | 13.64 | 22.73 |
humor | 53.33 | 53.33 | 20 | 46.67 | 33.33 | 6.67 | 26.67 |
idioms | 73.33 | 60 | 66.67 | 6.67 | 6.67 | 6.67 | 0 |
incorrect_spellings | 73.33 | 80 | 66.67 | 66.67 | 60 | 13.33 | 26.67 |
long_sentences | 80 | 53.33 | 33.33 | 13.33 | 33.33 | 0 | 6.67 |
mixed_language | 86.67 | 80 | 73.33 | 53.33 | 60 | 33.33 | 40 |
named_entities | 81.25 | 93.75 | 62.5 | 37.5 | 37.5 | 18.75 | 25 |
numeric_and_date | 84.21 | 84.21 | 36.84 | 63.16 | 47.37 | 10.53 | 31.58 |
religion | 95 | 95 | 85 | 55 | 40 | 25 | 40 |
single_words | 89.58 | 87.5 | 83.33 | 54.17 | 68.75 | 37.5 | 52.08 |
Table 6: Percentage of translations rated as high quality (score = 2) by human evaluators for each model per topic.