TerjamaBench: A Cultural Benchmark for English-Darija Machine Translation

Community Article · Published January 10, 2025

Introduction

We introduce TerjamaBench, an evaluation benchmark for English-Darija machine translation. Darija, the Moroccan Arabic dialect, presents unique challenges for machine translation due to its informal nature, regional variations, and scarcity of digital resources. TerjamaBench features meticulously curated parallel texts in English, Arabic-script Darija, and Latin-script Darija (Arabizi), representing a wide range of cultural contexts and regional differences. We assess multiple state-of-the-art models, including proprietary LLMs and open-source models, using several evaluation methods. We also show the limitations of widely used machine translation (MT) metrics for evaluating Darija translations by analysing their correlation with human judgment. Our results demonstrate significant gaps in current translation capabilities and provide insights for improving Darija-English translation systems.

| Topic | Arabizi | English | Darija |
|---|---|---|---|
| Religion | lahysmehlina men lwalidin | May God forgive us for any wrongs toward our parents | الله يسمحلينا من الوالدين |
| Idioms | zreb t3atal | Rush things and you'll get delayed | زرب تعطل |
| Named Entities | sir khod chi carte dyal inwi o dirha f tilifonk | Go get an Inwi SIM card and put it in your phone | سير خود شي كارط ديال إنوي وديرها في تليفونك |
| Common Phrases | 3tili tisa3 | Leave me alone | عطيلي التساع |
| Humor | chb3na tkrkir | We laughed our heads off | شبعنا تكركير |
| Numeric and Date | manl9ach 3ndek chi zer9a | Do you have two hundred dirhams | منلقاش عندك شي زرقة |
| Mixed Language | Une fois nwessl l dar n3iyt lik | I'll call you as soon as I get home | انفوا نوصل للدار نعيط ليك |

Examples from the TerjamaBench dataset. The dataset is available at atlasia/TerjamaBench.

Benchmark Design

TerjamaBench was built through a careful process, addressing the unique challenges of Darija, which exhibits significant regional and linguistic diversity. The benchmark’s development involved curating varied data, extracting valuable insights, and acknowledging the dataset’s inherent limitations.

Data Curation Process

The dataset was curated manually by 16 annotators and 14 reviewers, all native Moroccans. Each annotator brought regional expertise, ensuring a broad representation of Darija’s variations across Morocco. The goal was to capture both formal and informal expressions, with an emphasis on the spoken nature of the language. We followed a structured approach:

  • Clear guidelines for annotation.
  • Validation steps to ensure linguistic and cultural authenticity.
  • Documentation of regional variations and their frequency.
  • Multiple review rounds by native speakers to ensure accuracy.

Key Insights and Statistics

The dataset contains 850 entries, structured into six columns:

  • Topic: Broad category of the sentence.
  • Subtopic: More specific classification within the topic.
  • Arabizi: Latin-script written Darija.
  • English: English translation of the Darija text.
  • Darija (in Arabic letters): Arabic-script written Darija.
  • Annotator Dialect (City): The regional variation spoken by the annotator.

The dataset includes both standard phrases and idiomatic expressions, with a focus on minimizing bias and capturing the frequent code-switching seen in Darija, where speakers blend Arabic, Tamazight, French, and sometimes English.
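As a quick way to explore these fields, the benchmark can be loaded directly from the Hugging Face Hub. This is a minimal sketch assuming the default split name; the exact column names should be checked against the dataset card.

```python
from datasets import load_dataset

# Load TerjamaBench from the Hugging Face Hub
# (split name assumed; check the dataset card if it differs)
ds = load_dataset("atlasia/TerjamaBench", split="train")

print(ds.column_names)  # should roughly match the six columns described above
print(ds[0])            # one parallel entry: topic, subtopic, Arabizi, English, Arabic-script Darija, dialect
```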

The topics span a wide range of categories:

| Topic | Description | Number of samples |
|---|---|---|
| Common Phrases | Everyday expressions like greetings and common sayings. | 136 |
| Named Entities | Sentences with proper nouns, place names, cities, etc. | 53 |
| Numeric and Date Expressions | Sentences containing numbers, dates, or time expressions. | 62 |
| Educational | Sentences from domains like medical, legal, or scientific contexts. | 73 |
| Mixed Language Content | Sentences combining Darija with MSA, French, or English. | 50 |
| Idioms | Proverbs and sayings unique to Moroccan culture. | 51 |
| Humor | Jokes, puns, or humorous expressions. | 50 |
| Religion | Sentences containing religious terms or expressions. | 66 |
| Single Words | Isolated words to test basic translation capabilities. | 163 |
| Long Sentences | Sentences designed to test coherence in lengthy translations. | 50 |
| Incorrect Spellings | Sentences with slight spelling errors to evaluate model robustness. | 50 |
| Dialectal Variations | Sentences from different Moroccan regions (northern, eastern, southern). | 46 |

Limitations

Despite the thorough curation process, the dataset still has some limitations. First, there is a regional bias: even though we tried to represent a diverse range of dialects, certain regions remain overrepresented. Another challenge is the orthographic variation in written Darija. Since Darija lacks a standardized writing system, inconsistencies in spelling and grammar are common, which complicates the task for machine translation models. The use of Arabizi, an informal and phonetically driven script without formal rules, adds further complexity and makes normalization difficult.

Experimental Setup

Dataset

The initial dataset contained 850 entries. After deduplication and removing the "dialect_variation" topic due to its complexity, our final experimental subset contained 788 samples. For human evaluation, we selected a stratified random sample of 237 entries (30% of 788), ensuring proportional representation across all topics.
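For illustration, such a stratified 30% sample can be drawn with pandas; the sketch below assumes the benchmark sits in a DataFrame with a `topic` column (column name hypothetical).

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, frac: float = 0.3, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction from every topic so the subset keeps the topic proportions."""
    return (
        df.groupby("topic", group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# human_eval_subset = stratified_sample(benchmark_df)  # benchmark_df: the 788 deduplicated samples
```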

Models

We evaluated a diverse set of both proprietary and open-source models to benchmark translation performance in English-to-Darija:

  • gemini-exp-1206, claude-3-5-sonnet-20241022, gpt-4o-2024-08-06: Proprietary models selected for their top-tier performance on English-to-Darija translation according to human judgment.
  • atlasia/Terjman-Large-v1.2, atlasia/Terjman-Nano: AtlasIA’s MT models, fine-tuned specifically for English-to-Darija translation.
  • MBZUAI-Paris/Atlas-Chat-9B: An open-source Darija LLM.
  • facebook/nllb-200-3.3B: Used as a baseline.

| Model | Parameters | Type | Base Architecture |
|---|---|---|---|
| gemini-exp-1206 | * | Proprietary | |
| claude-3-5-sonnet-20241022 | * | Proprietary | |
| gpt-4o-2024-08-06 | * | Proprietary | |
| atlasia/Terjman-Large-v1.2 | 240M | Open Source | Helsinki-NLP/opus-mt-tc-big-en-ar |
| atlasia/Terjman-Nano | 77M | Open Source | Helsinki-NLP/opus-mt-en-ar |
| MBZUAI-Paris/Atlas-Chat-9B | 9B | Open Source | gemma-2-9b |
| facebook/nllb-200-3.3B | 3.3B | Open Source | |

(*) Parameter counts for proprietary models are not publicly disclosed.
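As an example of how translations can be obtained from the open-source models above, the sketch below runs atlasia/Terjman-Large-v1.2 through the standard transformers translation pipeline; the generation settings are assumptions, not the exact configuration used in the benchmark.

```python
from transformers import pipeline

# Terjman-Large-v1.2 is a fine-tune of Helsinki-NLP/opus-mt-tc-big-en-ar (a Marian seq2seq model),
# so it can be served with the generic translation pipeline.
translator = pipeline("translation", model="atlasia/Terjman-Large-v1.2")

result = translator("Go get an Inwi SIM card and put it in your phone", max_length=128)
print(result[0]["translation_text"])
```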

Evaluation Approaches

Metric-Based Evaluation

To evaluate model performance, we employed three standard and widely used MT metrics: BiLingual Evaluation Understudy (BLEU), CHaRacter-level F-score (chrF), and Translation Error Rate (TER).

  • BLEU: Measures n-gram overlap between the model’s output and reference translations.
  • chrF: Focuses on character-level n-grams, providing finer-grained insights into similarity, particularly for morphologically rich languages like Darija.
  • TER: Computes the number of edits required to transform the model’s output into the reference translation.

However, we acknowledge their limitations, particularly for a language with high orthographic and linguistic variability like Darija. The next sections highlight why these metrics may fall short in fully capturing translation quality in the context of Moroccan Darija.
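These scores can be computed with the sacrebleu library; the sketch below uses toy hypothesis/reference pairs and default settings, which may differ from the exact configuration used in our evaluation.

```python
from sacrebleu.metrics import BLEU, CHRF, TER

# Model outputs, one per source sentence
hypotheses = ["الجو زوين اليوم", "كنبغي نمشي للبحر"]
# References: a list of reference streams; with a single reference per sentence,
# this is one list aligned with the hypotheses
references = [["الجو زوين اليوم", "كنبغي نمشي للبحر"]]

print(BLEU().corpus_score(hypotheses, references))  # higher is better
print(CHRF().corpus_score(hypotheses, references))  # higher is better
print(TER().corpus_score(hypotheses, references))   # lower is better
```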

LLM-as-a-Judge Evaluation

To complement traditional metrics and provide a more context-sensitive evaluation, we leveraged Claude 3.5 Sonnet (2024-10-22) as an evaluation judge. Using the prompt in the Appendix, we assessed translations by feeding the model both the reference and the generated output, scoring each sample on a nuanced 4-point scale:

  • -1: Translation contains repetitive tokens or clear bugs.
  • 0: Translation is incorrect, nonsensical, or lacks any Darija words.
  • 1: Translation is mostly correct but includes Modern Standard Arabic elements (at least one Darija word) or has minor typos.
  • 2: Translation is fully correct and entirely in Darija.
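Below is a minimal sketch of how such a judging call could be issued with the anthropic Python SDK; the request parameters (system/user split, max_tokens) are assumptions, and JUDGE_PROMPT stands for the full prompt given in the Appendix.

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = "..."  # the LLM-as-a-judge prompt from the Appendix

sample = {
    "English": "the speech was addressed to all the people who were present",
    "Darija": "الهدرة توجهات لكاع الناس اللي كانو حاضرين",
    "machine": "الخطاب توجه لجميع الناس اللي كانو حاضرين",
}

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=JUDGE_PROMPT,
    messages=[{"role": "user", "content": json.dumps(sample, ensure_ascii=False)}],
)
print(response.content[0].text)  # expected: a JSON object with "analysis" and "score"
```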

Human Evaluation

To validate the reliability of our evaluation approaches, we conducted a human evaluation using a random subsample of 30% from each topic, resulting in 241 samples. We used the same 4-point scale as the LLM-as-a-judge. The primary goal was to assess whether metrics-based approaches (BLEU, chrF, TER) and LLM-as-a-judge evaluations are correlated with human judgments.

Results and Analysis

Model Performance Comparison

Metric-based

| Model | Parameters | BLEU↑ | chrF↑ | TER↓ |
|---|---|---|---|---|
| **Proprietary Models** | | | | |
| gemini-exp-1206 | * | 30.69 | 54.16 | 67.62 |
| claude-3-5-sonnet-20241022 | * | 30.51 | 51.8 | 67.42 |
| gpt-4o-2024-08-06 | * | 28.3 | 50.13 | 71.77 |
| **Open Source Models** | | | | |
| atlasia/Terjman-Large-v1.2 | 240M | 16.33 | 37.1 | 89.13 |
| MBZUAI-Paris/Atlas-Chat-9B | 9B | 14.8 | 35.26 | 93.95 |
| facebook/nllb-200-3.3B | 3.3B | 14.76 | 34.17 | 94.33 |
| atlasia/Terjman-Nano | 77M | 9.98 | 26.55 | 106.49 |

Table 1: BLEU, chrF, and TER scores for each model. Higher BLEU and chrF scores indicate better alignment with reference translations, while lower TER scores indicate fewer edits needed.

Proprietary models consistently outperform open-source models, with gemini-exp-1206 and claude-3-5-sonnet-20241022 leading across both metrics-based and topic-specific evaluations (cf. Table 2 in the Appendix). Among open-source models, atlasia/Terjman-Large performs the best, though it significantly lags behind its proprietary counterparts. Topics like "religion" and "single_words" are easier, as evidenced by higher scores across models, while "idioms" and "long_sentences" are notably challenging, particularly for open-source models, highlighting their struggle with context-sensitive and structurally complex translations.

LLM-as-a-Judge


The LLM-as-a-judge evaluation (Appendix, Table 3) maintains the same hierarchy as the metrics-based evaluation for proprietary models, but reveals a more nuanced picture for open-source ones. While proprietary models still lead (63.58% high-quality translations for gemini-exp-1206), Atlas-Chat-9B (39.47%) performs competitively with Terjman-Large (32.87%) despite having lower metric scores. The topic breakdown (Appendix, Table 4) shows interesting patterns: models maintain high performance on "common_phrases" and "named_entities" but struggle more with "idioms" and "long_sentences", particularly the open-source ones.

Human evaluation


Human judgments (Appendix, Table 5) corroborate the superiority of proprietary models, with gemini-exp-1206 achieving the highest ratings. However, the topic-level analysis (Appendix, Table 6) shows that even proprietary models struggle with culturally loaded topics like "humor" and "idioms", while excelling at more straightforward topics like "religion" and "common_phrases". Overall, the consistently strong performance of proprietary models across all evaluation approaches, particularly on challenging topics, highlights the current limitations of open-source alternatives in handling Darija's linguistic complexity.

Also, human evaluators rated translations more favorably overall compared to the LLM judge for most models as shown below.

| Model | LLM-as-a-judge (%) | Human evaluation (%) |
|---|---|---|
| gemini-exp-1206 | 63.07 | 84.23 |
| claude_3_5_sonnet | 65.56 | 79.67 |
| gpt-4o-2024-08-06 | 56.43 | 67.22 |
| MBZUAI-Paris/Atlas-Chat-9B | 36.51 | 50.21 |
| atlasia/Terjman-Large-v1.2 | 29.05 | 48.13 |
| facebook/nllb-200-3.3B | 21.58 | 32.78 |
| atlasia/Terjman-Nano | 11.62 | 21.99 |

Table 7: Percentage of samples scored 2 by the LLM-as-a-judge and by human evaluators on the evaluation subset.

This suggests that current automated evaluation approaches do not fully align with human assessment. This progressive analysis through different evaluation lenses reveals that while metric-based approaches capture broad performance trends, they may underestimate both the absolute quality of translations and the true difficulty gap between simple and complex topics.
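For reference, the percentages reported in Tables 3, 5, and 7 are simply the share of samples rated 2 per model; a minimal pandas sketch (column names are hypothetical):

```python
import pandas as pd

# Toy per-sample ratings in long format: one row per (model, sample)
scores = pd.DataFrame({
    "model": ["gemini-exp-1206", "gemini-exp-1206", "atlasia/Terjman-Nano", "atlasia/Terjman-Nano"],
    "score": [2, 1, 2, 0],
})

# Percentage of samples rated 2 ("fully correct and entirely in Darija") per model
pct_high_quality = (
    scores["score"].eq(2)
    .groupby(scores["model"])
    .mean()
    .mul(100)
    .round(2)
    .sort_values(ascending=False)
)
print(pct_high_quality)
```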

Correlation between human evaluation and other approaches

To validate the reliability of our automated evaluation methods, we conducted a comprehensive correlation analysis between human evaluation scores and the other evaluation approaches on the same subset described in the Human Evaluation section. Table 8 presents the Spearman correlation coefficients.

| | BLEU | chrF | TER | LLM-as-a-judge |
|---|---|---|---|---|
| Spearman correlation | 0.345 | 0.406 | -0.359 | 0.411 |

Table 8: Correlation between human evaluation and other evaluation approaches. All p-values are below 10⁻⁴.

  • Metric Reliability: Among the string-based metrics, chrF shows the strongest correlation with human judgment.

  • Error Metrics: TER shows a moderate negative correlation with human evaluation, indicating that while it captures some aspects of translation quality, it may not fully align with human perception of dialectal translation adequacy.

The LLM-as-a-judge approach achieves a correlation comparable to, and slightly higher than, chrF, indicating its potential as a robust evaluation metric.

All correlations are statistically significant (p < 0.001), indicating the reliability of these relationships. However, the moderate strength of these correlations suggests that no single automated metric can fully replace human evaluation for assessing Darija translation quality. This finding underscores the importance of using multiple evaluation approaches, as we have done in this study, to get a comprehensive understanding of translation quality. These results also highlight the need for developing more sophisticated evaluation metrics specifically designed for dialectal Arabic translation, potentially incorporating features that better align with human judgment of translation quality in dialectal contexts.
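The Spearman coefficients above can be reproduced from per-sample scores with scipy.stats.spearmanr; a minimal sketch with toy values standing in for the actual human ratings and metric scores:

```python
from scipy.stats import spearmanr

# Per-sample human ratings (-1..2) and an automatic metric score, aligned by sample (toy values)
human_scores = [2, 1, 2, 0, 1, 2, 0, 2]
chrf_scores  = [61.3, 40.2, 75.8, 12.4, 35.0, 68.9, 20.1, 71.2]

rho, p_value = spearmanr(human_scores, chrf_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.2e})")
```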

Qualitative analysis of proprietary LLMs performance

  • Gemini-exp-1206 demonstrates a solid capability in handling Darija, although it sometimes produces unnatural or awkward phrases. Its most frequent mistakes are literal translations from English and occasional use of Standard Arabic. Some of the issues include the following:
    • Awkward constructions
      • Example: "مال هاد الضحك غير على سبة" (What’s with this silly laughter)
    • Verb-subject agreement
      • Example: "ألف درهم راه" (should be "راها")
    • Collocation errors
      • Example: "ديال أيام القراية" (“from school days” sounds unnatural)
    • Missing articles
      • Example: "شكلاط سخون عفاك" (should include an article like "شي" or "واحد ال")
    • Standard Arabic vocabulary
      • Examples: "مضحك", "ما كياخد حتى حاجة محمل الجد"
    • Literal translations from English
      • Examples: "كيفاش كتقدر" (how dare you), "السلام الصاحب" (hello friend)
  • GPT-4o is also competent in Darija but struggles with consistent word choices and literal translations that reduce the naturalness of the output. Some of these issues include the following:
    • Incorrect translation/word usage
      • Examples: "شحال فعامك؟" (How old are you), "واش عطيتي لماما العصير" (Have you given your mother juice)
    • Literal translations
      • Examples: "ضعت لي مفاتيح الدار" (I lost my house keys), "غادي نطيح من الضحك" (I’ll faint from laughter)
    • Inconsistent handling of “آ”
      • Examples: "سير دي ولادك لغابة المعمورة صاحبي", "شنو قلت ليك يا هاجر؟"
    • Missing suffixes
      • Example: "شفت الماتش ديال Barça البارح؟" (should be "شفتي")
    • Collocation issues
      • Example: "شنو النهار اليوم؟" (should be "شمن نهار")
    • Occasional Standard Arabic usage
      • Examples: "ماخصش الواحد يزعل بسهولة", "مضحك"
  • Claude 3.5 Sonnet handles Moroccan Darija with notable difficulties, most notably literal translations and the use of vocabulary or expressions more common in Standard Arabic. Below are some of the issues that were observed:
    • Verb-object agreement
      • Examples: وصلو رسالة, بلي وقع شي مخالفة (should be وصلاتو)
    • Collocation issues
      • Examples: "تحكم عليه فالمحكمة الفيدرالية ف 2008 على رشوة", "للمعاش ديالو عند التقاعد" (should be بالرشوة)
    • Word order issues
      • Example: Nothing has changed -> والو ما تبدل (should be ما تبدل والو)
    • Excessive definiteness
      • Example: “human persons have a right to life” -> “البنادم عندو الحق فالحياة” (should be بنادم)
    • Use of Standard Arabic (Fus7a) vocabulary/expressions
      • Examples: المقبلة, بسماحهم, القسط, بنايات, على طول, لرفيقتو
    • Literal translations
      • Example: “cease to be blind” -> “توقفو تكونو عميان”
    • Issues with certain question forms
      • Example: “Which of the following observations about revolutions and gender is best supported by the first passage?” -> “شنو من هاد الملاحظات على الثورات والجنس هي اللي كتدعمها أحسن الفقرة الأولى؟”

Summary and Future Research Directions

TerjamaBench makes several significant contributions to the advancement of English-Darija machine translation:

  1. A rich, diverse, and culturally specific benchmark dataset designed to reflect authentic Moroccan Darija usage, spanning categories like everyday expressions, technical vocabulary, and regional dialects.
  2. Comparative study of diverse evaluation approaches, highlighting their consistency with human evaluations.
  3. A detailed quantitative and qualitative evaluation of various models across different linguistic challenges, such as syntax, semantics, and dialectal variations.

Our findings reveal that while proprietary models show promising results, significant challenges remain in:

  • Handling regional dialectal variations
  • Translating idiomatic expressions
  • Maintaining consistency in long-form translations
  • Processing mixed-language content

Future work should focus on:

  • Expanding the benchmark to include more regional variations
  • Developing Darija-specific evaluation metrics
  • Improving open-source models' performance on cultural expressions

The significant gap between proprietary and open-source models highlights the need for more investment in open-source Darija translation capabilities to improve accessibility of these technologies.

Acknowledgments

Special recognition goes to the contributors: Aissam Outchakoucht, Chaymae Rami, Mahmoud Bidry, Zaid Chiech, Imane Momayiz, Abdelaziz Bounhar, Abir Arsalane, Abdeljalil ElMajjodi, Aymane ElFirdoussi, Nouamane Tazi, Salah-Eddine Iguiliz, Hamza Essamaali, Ihssane Nedjaoui, Anas Amchaar, Yousef Khoubrane, Khaoula Alaoui, Salah-Eddine Alabouch, Adnan Anouzla, Bilal El Hammouchi, Taha Boukhari, Mustapha Ajeghrir, Ikhlas Elhamly, Fouad Aurag, Omar Choukrani, Ali Nirheche, Yanis Bardes, Abdelmonaim Bounite.

Citation

@article{atlasia2024terjamabench,
  title={TerjamaBench: A Culturally Specific Dataset for Evaluating Translation Models for Moroccan Darija},
  author={Imane Momayiz and Aissam Outchakoucht and Omar Choukrani and Ali Nirheche},
  year={2024},
  url={https://huggingface.co/datasets/atlasia/TerjamaBench/},
  institution={AtlasIA}
}

Appendix

Prompt used to generate translations

You are a Moroccan Arabic (Darija) translator. Your task is to translate text from English to Moroccan Arabic using Arabic script, following these guidelines:

1. Maintain any JSON formatting in the original text
2. For words without common Arabic equivalents, use their French translations as Moroccans would do
3. Preserve all code and technical terms in French/English
4. Adapt any culturally sensitive content to be appropriate for Moroccan audiences
5. For idioms, literature, examples, and questions, provide natural Moroccan Arabic translations
6. Use Moroccan Arabic instead of Modern Standard Arabic whenever possible (VERY IMPORTANT)

Format your response as:
[
{"original": "I love going to the beach", "translation": "كنبغي نمشي للبحر"},
{"original": "The weather is nice today", "translation": "الجو زوين اليوم"}
]

Please translate the following JSON list of texts:

Prompt used for LLM as a judge

You are a native Moroccan Arabic (Darija) speaker and expert linguist. You will evaluate machine translations into Moroccan Arabic.

For each example, you will be given:
1. The original English text
2. The ground truth Moroccan Arabic in Arabic script
3. A machine-generated translation in Arabic script

Please evaluate the machine translation by:
1. Comparing it to the ground truth version (Arabic script). But keep in mind that the ground truth can be in a different dialect.
2. Checking for:
   - Accuracy of meaning
   - Natural Moroccan dialect usage
   - Appropriate colloquial expressions
   - Correct grammar and word choice
3. Give a score where:
   -1 = Contains repetitive tokens or bugs
   0 = Translation is incorrect, makes no sense, or contains no Darija words
   1 = Translation is correct but mixed with Modern Standard Arabic (contains at least one Darija word) or has minor typos
   2 = Translation is correct and fully in Darija

Format your response in JSON format, with each line containing a JSON object with these fields in this order:
- analysis: Brief explanation of score, highlighting strengths/weaknesses
- score: Integer score (-1, 0, 1, or 2)


# Example evaluation
Input:
{
   "English": "the speech was addressed to all the people who were present",
   "Darija": "الهدرة توجهات لكاع الناس اللي كانو حاضرين",
   "machine": "الخطاب توجه لجميع الناس اللي كانو حاضرين"
}


Output:
{"analysis": "استعملو 'جميع' عوض 'كاع'، وهادي أقرب للعربية الفصحى", "score": 1}
{"analysis": "مسا الخير صحيحة بالدارجة غير هي بلهجة مختلفة", "score": 2}



Here are 15 samples to evaluate:

| topic | gemini-exp-1206 | claude_3_5_sonnet | gpt-4o-2024-08-06 | atlasia/Terjman-Large-v1.2 | facebook/nllb-200-3.3B | MBZUAI-Paris/Atlas-Chat-9B | atlasia/Terjman-Nano |
|---|---|---|---|---|---|---|---|
| common_phrases | 27.27 | 27.99 | 26.86 | 20.75 | 14.87 | 19.26 | 8.71 |
| educational | 25.96 | 21.64 | 19.86 | 11.66 | 13.2 | 8.83 | 9.7 |
| humor | 21.37 | 17.63 | 15.11 | 15.58 | 7.26 | 10.08 | 8.88 |
| idioms | 23.93 | 18.37 | 12.26 | 4.58 | 2.72 | 6.16 | 3.62 |
| incorrect_spellings | 18.82 | 17.76 | 15.88 | 13.33 | 9.15 | 11.03 | 7.94 |
| long_sentences | 15.57 | 11.23 | 13.5 | 8.28 | 4.86 | 5.82 | 6.89 |
| mixed_language | 19.7 | 23.4 | 20.88 | 11.98 | 13.44 | 13.49 | 9.82 |
| named_entities | 28.42 | 24.04 | 26.69 | 13.35 | 11.1 | 11.93 | 10.3 |
| numeric_and_date | 25.61 | 26.76 | 20.73 | 16.16 | 12.39 | 9.91 | 10.05 |
| religion | 53.15 | 51.5 | 48.12 | 25.64 | 26.92 | 21.63 | 18.12 |
| single_words | 45.34 | 50.93 | 47.83 | 20.5 | 23.6 | 22.05 | 11.66 |

Table 2: BLEU score per model per topic. (Higher is better)

| Model | % High-quality translations (score = 2) |
|---|---|
| gemini-exp-1206 | 63.58 |
| claude-3-5-sonnet-20241022 | 62.69 |
| gpt-4o-2024-08-06 | 57.11 |
| MBZUAI-Paris/Atlas-Chat-9B | 39.47 |
| atlasia/Terjman-Large-v1.2 | 32.87 |
| facebook/nllb-200-3.3B | 24.75 |
| atlasia/Terjman-Nano | 14.21 |

Table 3: Percentage of translations rated as high quality (score = 2) by LLM-as-a-judge for each model.

| topic | gemini-exp-1206 | claude_3_5_sonnet | gpt-4o-2024-08-06 | atlasia/Terjman-Large-v1.2 | MBZUAI-Paris/Atlas-Chat-9B | atlasia/Terjman-Nano | facebook/nllb-200-3.3B |
|---|---|---|---|---|---|---|---|
| common_phrases | 75.56 | 71.85 | 72.59 | 48.15 | 48.15 | 18.52 | 31.85 |
| educational | 47.89 | 53.52 | 43.66 | 30.99 | 40.85 | 11.27 | 25.35 |
| humor | 55.1 | 53.06 | 46.94 | 32.65 | 32.65 | 12.24 | 22.45 |
| idioms | 68.63 | 62.75 | 56.86 | 13.73 | 23.53 | 11.76 | 13.73 |
| incorrect_spellings | 60.87 | 54.35 | 52.17 | 43.48 | 43.48 | 10.87 | 17.39 |
| long_sentences | 70 | 84 | 58 | 30 | 36 | 6 | 16 |
| mixed_language | 51.06 | 46.81 | 51.06 | 31.91 | 34.04 | 8.51 | 19.15 |
| named_entities | 73.58 | 79.25 | 71.7 | 32.08 | 24.53 | 16.98 | 22.64 |
| numeric_and_date | 61.02 | 59.32 | 44.07 | 30.51 | 50.85 | 16.95 | 30.51 |
| religion | 59.09 | 59.09 | 56.06 | 22.73 | 16.67 | 7.58 | 21.21 |
| single_words | 63.35 | 59.63 | 56.52 | 30.43 | 50.31 | 19.25 | 29.19 |
Table 4: Percentage of translations rated as high quality (score = 2) by LLM-as-a-judge for each model per topic.

| Model | % High-quality translations (score = 2) |
|---|---|
| gemini-exp-1206 | 84.23 |
| claude-3-5-sonnet-20241022 | 79.67 |
| gpt-4o-2024-08-06 | 67.22 |
| MBZUAI-Paris/Atlas-Chat-9B | 50.21 |
| atlasia/Terjman-Large-v1.2 | 48.13 |
| facebook/nllb-200-3.3B | 32.78 |
| atlasia/Terjman-Nano | 21.99 |

Table 5: Percentage of high-quality translations (score = 2) rated by human evaluators for each model.

| topic | gemini-exp-1206 | claude_3_5_sonnet | gpt-4o-2024-08-06 | atlasia/Terjman-Large-v1.2 | MBZUAI-Paris/Atlas-Chat-9B | atlasia/Terjman-Nano | facebook/nllb-200-3.3B |
|---|---|---|---|---|---|---|---|
| common_phrases | 90.24 | 85.37 | 78.05 | 51.22 | 73.17 | 31.71 | 39.02 |
| educational | 90.91 | 72.73 | 77.27 | 54.55 | 27.27 | 13.64 | 22.73 |
| humor | 53.33 | 53.33 | 20 | 46.67 | 33.33 | 6.67 | 26.67 |
| idioms | 73.33 | 60 | 66.67 | 6.67 | 6.67 | 6.67 | 0 |
| incorrect_spellings | 73.33 | 80 | 66.67 | 66.67 | 60 | 13.33 | 26.67 |
| long_sentences | 80 | 53.33 | 33.33 | 13.33 | 33.33 | 0 | 6.67 |
| mixed_language | 86.67 | 80 | 73.33 | 53.33 | 60 | 33.33 | 40 |
| named_entities | 81.25 | 93.75 | 62.5 | 37.5 | 37.5 | 18.75 | 25 |
| numeric_and_date | 84.21 | 84.21 | 36.84 | 63.16 | 47.37 | 10.53 | 31.58 |
| religion | 95 | 95 | 85 | 55 | 40 | 25 | 40 |
| single_words | 89.58 | 87.5 | 83.33 | 54.17 | 68.75 | 37.5 | 52.08 |
Table 6: Percentage of translations rated as high quality (score = 2) by human evaluators for each model per topic.