DarijaBERT is the first open-source BERT model for the Moroccan Arabic dialect, known as “Darija”. It uses the same architecture as BERT-base, but without the Next Sentence Prediction (NSP) objective. The model was trained on ~3 million Darija sequences, representing 691 MB of text or ~100M tokens.
We are releasing the following:
- Pre-processing code
- WordPiece tokenization code
- Pre-trained model in both PyTorch and TensorFlow versions (future plan)
- Example notebook to finetune the model
- MTCD dataset
The model was trained on a dataset drawn from three different sources:
- Stories written in Darija, scraped from a dedicated website
- YouTube comments from 40 different Moroccan channels
- Tweets crawled based on a list of Darija keywords.
Together, these datasets amount to 691 MB of text. The following pre-processing steps were applied:
- Replacing repeated characters with a single occurrence of that character
- Replacing hashtags, user mentions and URLs with the words HASHTAG, USER and URL, respectively
- Keeping only sequences with at least two Arabic words
- Removing Tatweel character '\u0640'
- Removing diacritics
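The cleaning steps listed above can be sketched in a few lines of Python. This is only an illustration, not the released pre-processing code: the regular expressions, the Unicode ranges chosen for Arabic letters and diacritics, the order of the steps, and the function name `clean_sequence` are all assumptions.

```python
import re

# Assumed patterns; the released pre-processing code may define these differently.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
TATWEEL = "\u0640"
DIACRITICS = re.compile(r"[\u064B-\u0652]")    # fathatan through sukun
REPEATED = re.compile(r"(.)\1+")               # any character repeated in a row
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")  # runs of Arabic letters


def clean_sequence(text):
    """Return the cleaned text, or None if it has fewer than two Arabic words."""
    # Replace URLs, mentions and hashtags first, so that collapsing
    # repeated characters cannot break them (e.g. the "//" in a URL).
    text = URL.sub("URL", text)
    text = MENTION.sub("USER", text)
    text = HASHTAG.sub("HASHTAG", text)
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    # Literal reading of the rule: every run of a character becomes one character.
    text = REPEATED.sub(r"\1", text)
    if len(ARABIC_WORD.findall(text)) < 2:
        return None
    return text


if __name__ == "__main__":
    print(clean_sequence("واااااش @ali شفتي http://t.co/abc #المغرب"))
```

Sequences that end up with fewer than two Arabic words return `None`, so a caller can filter them out of the corpus.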
- The same architecture as BERT-base was used, but without the Next Sentence Prediction objective.
- Whole Word Masking (WWM) with a probability of 15% was adopted.
- The sequences were tokenized using the WordPiece tokenizer from the Hugging Face Transformers library. We chose 128 as the maximum input length for the model.
- The vocabulary size is 80,000 WordPiece tokens.
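To illustrate what Whole Word Masking means for WordPiece tokens, here is a minimal sketch in plain Python. It is not the training code used for DarijaBERT: it assumes continuation pieces carry the usual `##` prefix, it masks each word independently with the given probability, and it leaves out details of BERT-style masking such as occasionally keeping or randomly replacing a selected token. The function name `whole_word_mask` and the example tokens are made up for illustration.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def whole_word_mask(tokens, mask_prob=0.15, seed=None):
    """Mask whole words in a list of WordPiece tokens.

    All pieces of a selected word are replaced by [MASK] together,
    so the model never sees part of a masked word.
    """
    rng = random.Random(seed)
    # Group token indices into words: a "##" piece joins the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue  # special tokens are never masked
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked


if __name__ == "__main__":
    # Hypothetical tokenization: "dar" + "##ija" form one word.
    example = ["[CLS]", "dar", "##ija", "zwina", "[SEP]"]
    print(whole_word_mask(example, mask_prob=0.15, seed=0))
```

With token-level masking, `dar` could be masked while `##ija` stays visible, which makes the prediction much easier; masking the pieces together avoids that shortcut.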
The whole training was done on GCP Compute Engine using a free Cloud TPU v3-8 offered by Google's TensorFlow Research Cloud (TRC) program. The 40 epochs of pretraining took 49 hours.
Since DarijaBERT was trained using Whole Word Masking, it can predict a missing word in a sentence.
from transformers import pipeline
unmasker = pipeline('fill-mask', model='Kamel/DarijaBERT')
unmasker(" اشنو [MASK] ليك ")
{'score': 0.02539043314754963,
'sequence': 'اشنو سيفطو ليك',
'token': 25722,
'token_str': 'سيفطو'},
UPCOMING
********* DarijaBERT models were transferred to the SI2M Lab Hugging Face repo: June 20th, 2022 ********
The model can be loaded directly using the Hugging Face Transformers library:
from transformers import AutoTokenizer, AutoModel
DarijaBert_tokenizer = AutoTokenizer.from_pretrained("SI2M-Lab/DarijaBERT")
DarijaBert_model = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
The checkpoint for the PyTorch framework is available for download at the link below:
https://huggingface.co/SI2M-Lab/DarijaBERT
This checkpoint is intended exclusively for research. Any commercial use requires the authors' permission; please contact us by email at [email protected]
If you use our models for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
@article{gaanoun2023darijabert,
title={DarijaBERT: a Step Forward in NLP for the Written Moroccan Dialect},
author={Gaanoun, Kamel and Naira, Abdou Mohamed and Allak, Anass and Benelallam, Imade},
year={2023}
}
We gratefully acknowledge Google’s TensorFlow Research Cloud (TRC) program for providing us with free Cloud TPUs.