xlm-roberta-base-fintuned-panx-ta-hi

This model is a fine-tuned version of xlm-roberta-base on the PAN-X dataset for Tamil (ta) and Hindi (hi). It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:

Loss: 0.2480
F1: 0.8347

Model Description

The model is based on XLM-RoBERTa, a multilingual transformer architecture, and is fine-tuned for NER in Tamil and Hindi. It recognizes three entity types: LOC (location), PER (person), and ORG (organization).

Labels follow the BIO tagging scheme: a B- prefix marks the first token of an entity, an I- prefix marks each subsequent token of the same entity, and O marks tokens outside any entity.
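To make the B-/I- prefixes concrete, the following minimal sketch groups a sequence of per-token tags into entity spans. The function name `bio_to_spans` and the example tag sequence are hypothetical and are not part of the model or of the transformers library. Treating a stray I- tag as the start of a new span is a lenient choice made for this sketch.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        # An I- tag of the same type extends the span that is already open.
        if prefix == "I" and entity_type == current_type:
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type = None
        # B- opens a new span; a stray I- (no matching open span) is
        # treated as a span start too (assumption made for this sketch).
        if prefix in ("B", "I"):
            current_type, start = entity_type, i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Hypothetical tag sequence for a six-token sentence.
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
print(bio_to_spans(example_tags))
# [('PER', 0, 2), ('LOC', 3, 4), ('ORG', 5, 6)]
```

In the usage example further down, the pipeline's `aggregation_strategy="simple"` option performs a comparable grouping automatically, so this step is only needed when working with raw per-token predictions.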

Intended Uses & Limitations

Intended Uses:

Named Entity Recognition (NER) tasks in Tamil and Hindi.

Limitations:

Performance may degrade on languages or domains not included in the training data.
Not intended for general text classification or other NLP tasks.

How to Use the Model

You can load and use the model for Named Entity Recognition as follows:

Installation

Ensure you have the transformers and torch libraries installed. Install them via pip if necessary:

pip install transformers torch

Code Example

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example texts in Tamil and Hindi, plus one Malayalam sentence that falls outside the fine-tuning languages
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു.",  # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]

# Perform Named Entity Recognition
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()

Training and Evaluation Data

The model was fine-tuned on the Tamil and Hindi subsets of PAN-X, the NER dataset included in the XTREME benchmark.

Training Procedure

Hyperparameters

Learning Rate: 5e-05
Batch Size: 24 (both training and evaluation)
Epochs: 3
Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
Learning Rate Scheduler: Linear
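For readers who want to reproduce a similar setup, the sketch below maps the hyperparameters listed above onto the parameter names of `transformers.TrainingArguments`. It rests on two assumptions that the card does not confirm: that training used the Hugging Face `Trainer`, and that the batch size of 24 is per device. The `output_dir` value and the helper name `build_training_arguments` are hypothetical placeholders.

```python
# Hyperparameters from the list above, keyed by the corresponding
# transformers.TrainingArguments parameter names.
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 24,  # assumption: 24 is a per-device value
    "per_device_eval_batch_size": 24,
    "num_train_epochs": 3,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
}


def build_training_arguments(output_dir: str = "xlmr-panx-ta-hi"):
    """Build TrainingArguments; output_dir is a hypothetical placeholder."""
    # Imported lazily so the dictionary above can be inspected even when
    # transformers is not installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)


if __name__ == "__main__":
    for name, value in training_kwargs.items():
        print(f"{name}: {value}")
```

Settings the card does not mention, such as warmup steps or weight decay, are left at the library defaults here, which may differ from what was actually used.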

Results

Epoch	Training Loss	Validation Loss	F1
1.0	0.1886	0.2413	0.8096
2.0	0.1252	0.2415	0.8201
3.0	0.0752	0.2480	0.8347
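The card does not state how the F1 column was computed. A common convention for NER is entity-level F1, where a predicted entity counts as correct only if both its type and its exact boundaries match the reference. The sketch below illustrates that convention on a single hypothetical sentence; the function name `entity_f1` and the example spans are assumptions made for illustration, not the evaluation code used for this model.

```python
from typing import Set, Tuple

# An entity span: (entity_type, start_token, end_token_exclusive).
Span = Tuple[str, int, int]


def entity_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1: a prediction is correct only on an exact span match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and predicted spans for one sentence.
gold_spans = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 5, 6)}
predicted_spans = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 4, 6)}

# Two of three predictions match exactly, so precision = recall = 2/3.
print(f"{entity_f1(gold_spans, predicted_spans):.4f}")
# 0.6667
```

In a full evaluation the true-positive, predicted, and reference counts would be summed over every sentence in the evaluation set before computing precision and recall.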

Framework Versions

Transformers: 4.47.1
PyTorch: 2.5.1+cu121
Datasets: 3.2.0
Tokenizers: 0.21.0
