xlm-roberta-base-fintuned-panx-ta-hi

This model is a fine-tuned version of xlm-roberta-base on the PAN-X dataset for Tamil (ta) and Hindi (hi). It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:

  • Loss: 0.2480
  • F1: 0.8347

Model Description

The model is based on XLM-RoBERTa, a multilingual transformer architecture, and is fine-tuned for NER in Tamil and Hindi. It recognizes three entity types: LOC (location), PER (person), and ORG (organization).

Labels use B- and I- prefixes: B- marks the first token of an entity, and I- marks each subsequent token inside the same entity. Tokens outside any entity are tagged O.
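To make the prefix scheme concrete, here is a minimal sketch of how a tag sequence can be grouped into entity spans. The helper name bio_to_spans and the English sample tokens are illustrative assumptions, not part of this model or of the transformers library; the NER pipeline does its own grouping internally.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group B-/I- tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # A span continues only on an I- tag of the same entity type
        continues = tag.startswith("I-") and current == tag[2:]
        if not continues:
            if current is not None:
                spans.append((current, start, i))
            if tag.startswith("B-") or tag.startswith("I-"):
                # A stray I- tag is treated as a new entity (a lenient choice)
                start, current = i, tag[2:]
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical, hand-labelled example (not model output)
tokens = ["Sachin", "Tendulkar", "visited", "BCCI", "in", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Two adjacent B- tags of the same type yield two separate entities, which is the reason the scheme distinguishes B- from I-.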

Intended Uses & Limitations

Intended Uses:

  • Named Entity Recognition (NER) tasks in Tamil and Hindi.

Limitations:

  • Performance may degrade on languages or domains not included in the training data.
  • Not intended for general text classification or other NLP tasks.

How to Use the Model

You can load and use the model for Named Entity Recognition as follows:

Installation

Ensure you have the transformers and torch libraries installed. Install them via pip if necessary:

pip install transformers torch

Code Example

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example texts in Tamil and Hindi, plus one Malayalam sentence (a language not in the fine-tuning data)
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു."  # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]
]

# Perform Named Entity Recognition
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()
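With aggregation_strategy="simple", each pipeline result is a dictionary that includes the entity_group, score, and word keys used above. The following sketch shows one way to post-process such results by dropping low-confidence predictions and grouping the rest by label. The helper group_entities, the 0.5 threshold, and the sample values are assumptions for illustration; the sample list is hand-written and is not real output from this model.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect entity strings per label, keeping only scores >= min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (assumed values, not model predictions)
sample_results = [
    {"entity_group": "PER", "score": 0.98, "word": "Sachin Tendulkar"},
    {"entity_group": "LOC", "score": 0.95, "word": "Mumbai"},
    {"entity_group": "ORG", "score": 0.31, "word": "BCCI"},
]
print(group_entities(sample_results))
```

In practice the threshold would be tuned on held-out data rather than fixed at 0.5.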

Training and Evaluation Data

The model was fine-tuned on the PAN-X dataset, which is part of the XTREME benchmark, specifically for Tamil and Hindi.


Training Procedure

Hyperparameters

  • Learning Rate: 5e-05
  • Batch Size: 24 (both training and evaluation)
  • Epochs: 3
  • Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
  • Learning Rate Scheduler: Linear
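As a rough illustration of the linear scheduler, the sketch below computes the learning rate at a given optimizer step, starting from the 5e-05 listed above and decaying to zero. The total step count of 300 and the zero warmup are assumptions, since the card does not report the number of training steps or any warmup setting.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate under linear warmup (optional) followed by linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length; the real number of steps is not stated in this card
total_steps = 300
for step in (0, 100, 200, 300):
    print(step, linear_lr(step, total_steps))
```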

Results

Epoch   Training Loss   Validation Loss   F1
1.0     0.1886          0.2413            0.8096
2.0     0.1252          0.2415            0.8201
3.0     0.0752          0.2480            0.8347
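The card does not state how F1 was computed. A common choice for NER is entity-level F1, where a prediction counts only if both its type and its exact boundaries match the reference. The sketch below assumes that definition; the function entity_f1 and the sample spans are hypothetical and do not reproduce the scores in the table.

```python
from typing import Set, Tuple

# (entity_type, start_token, end_token) with an exclusive end index
Span = Tuple[str, int, int]


def entity_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1: a predicted span is correct only on an exact match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: the ORG prediction has the wrong boundaries
gold = {("PER", 0, 2), ("ORG", 3, 4), ("LOC", 5, 6)}
predicted = {("PER", 0, 2), ("ORG", 4, 5), ("LOC", 5, 6)}
print(f"{entity_f1(gold, predicted):.4f}")
```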

Framework Versions

  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0