Glot500 (base-sized model)

Glot500-m is the Glot500 model, pre-trained on 500+ languages using a masked language modeling (MLM) objective. It was introduced in the paper "Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages" (ACL 2023) and first released in this repository.

Usage

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
>>> unmasker("Hello I'm a <mask> model.")
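For an input with a single `<mask>`, the fill-mask pipeline returns a list of dictionaries, each carrying a `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to the top candidates. The helper name `summarize_predictions` and its `top_k` parameter are illustrative and not part of the model or of transformers.

```python
from typing import Dict, List, Tuple


def summarize_predictions(predictions: List[Dict], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (token_str, score) pairs, highest score first.

    Hypothetical helper: assumes `predictions` is the list of dicts that the
    fill-mask pipeline returns for an input containing exactly one mask token.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:top_k]]


# Usage (downloads the model on first call):
# from transformers import pipeline
# unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
# print(summarize_predictions(unmasker("Hello I'm a <mask> model.")))
```

If the input contains several mask tokens, the pipeline returns one such list per mask, so the helper would need to be applied to each inner list.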

Here is how to load this model and run a forward pass on a given text in PyTorch (the output holds the masked language modeling logits for each token):

>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
>>> model = AutoModelForMaskedLM.from_pretrained('cis-lmu/glot500-base')

>>> # prepare input
>>> text = "Replace me by any text you'd like."
>>> encoded_input = tokenizer(text, return_tensors='pt')

>>> # forward pass
>>> output = model(**encoded_input)
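The forward pass above returns vocabulary logits rather than feature vectors. One common way to get a fixed-size text representation is to request the hidden states and average the last layer over non-padding tokens. The sketch below assumes that approach. Mean pooling is a choice made here for illustration, not a method prescribed by the model authors, and the names `mean_pool` and `embed_text` are hypothetical.

```python
import torch


def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against division by zero
    return summed / counts


def embed_text(text, tokenizer, model) -> torch.Tensor:
    """Hypothetical helper: one pooled vector per input text."""
    encoded_input = tokenizer(text, return_tensors='pt', padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded_input, output_hidden_states=True)
    last_layer = output.hidden_states[-1]
    return mean_pool(last_layer, encoded_input['attention_mask'])


# Usage (downloads the model on first call):
# from transformers import AutoTokenizer, AutoModelForMaskedLM
# tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
# model = AutoModelForMaskedLM.from_pretrained('cis-lmu/glot500-base')
# features = embed_text("Replace me by any text you'd like.", tokenizer, model)
# print(features.shape)
```

Other pooling choices, such as taking the first token's vector or averaging an intermediate layer, may suit a given downstream task better.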

BibTeX entry and citation info

@article{imanigooghari-etal-2023-glot500,
  title={Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
  author={ImaniGooghari, Ayyoob and Lin, Peiqin and Kargaran, Amir Hossein and Severini, Silvia and Jalili Sabet, Masoud and Kassner, Nora and Ma, Chunlan and Schmid, Helmut and Martins, Andr{\'e} and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  journal={arXiv preprint arXiv:2305.12182},
  year={2023}
}

Model size: 395M params (Safetensors; tensor types: I64, F32)