This repository contains the code and resources for training a GloVe (Global Vectors for Word Representation) model on Igala, a low-resource language. The project includes scripts for preprocessing the text data, training the GloVe model, evaluating the resulting word embeddings, and visualizing the embeddings using techniques such as t-SNE.
- Data Preprocessing: Tools for cleaning and tokenizing Igala text data.
- GloVe Model Training: Implementation of a GloVe model using PyTorch.
- Evaluation: Scripts to evaluate word embeddings intrinsically (for example, with cosine similarity between word vectors) and on extrinsic tasks.
- Visualization: Interactive visualization of word embeddings using t-SNE and Plotly.
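To make the preprocessing, training, and evaluation steps more concrete, here is a minimal pure-Python sketch of the ideas behind them. It is not the repository's actual code: the function names (`tokenize`, `cooccurrence_counts`, `glove_weight`, `cosine_similarity`), the window size, and the `x_max` and `alpha` defaults are assumptions chosen for illustration, and the sample sentence uses placeholder tokens rather than real Igala text. The `1 / distance` weighting and the weighting function follow the standard GloVe formulation.

```python
import math
import string
import unicodedata
from collections import defaultdict


def tokenize(text):
    """Lowercase, NFC-normalize, split on whitespace, strip ASCII punctuation.

    A deliberate simplification: whitespace splitting keeps tone-marked
    letters intact, but real Igala text may need more careful handling.
    """
    text = unicodedata.normalize("NFC", text.lower())
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences, weighting each pair by 1 / distance."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Placeholder tokens, not real Igala text.
    tokens = tokenize("Alpha beta, alpha GAMMA.")
    counts = cooccurrence_counts(tokens, window=2)
    for pair, count in sorted(counts.items()):
        print(pair, count, round(glove_weight(count), 4))
```

In a full pipeline, the weighted co-occurrence counts would feed the GloVe loss that the PyTorch model minimizes, and `cosine_similarity` would be applied to the learned embedding vectors rather than to hand-written lists.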
Before running the code, ensure you have the following dependencies installed:
- Python 3.x
- PyTorch
- NumPy
- Matplotlib
- Scikit-learn
- Plotly
- Pandas
- GPU (hardware; not installable with pip)
You can install the Python packages using pip:
pip install torch numpy matplotlib scikit-learn plotly pandas
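After installing, a short script along these lines can confirm that the packages are importable and whether PyTorch can see a GPU. This is a sketch, not part of the repository: the helper names `missing_packages` and `gpu_available` are hypothetical, and the GPU check only reports CUDA devices visible to PyTorch. Note that the `scikit-learn` package is imported under the name `sklearn`.

```python
import importlib.util

# Maps each pip package name from the install command to its import name.
PACKAGES = {
    "torch": "torch",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "plotly": "plotly",
    "pandas": "pandas",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, import_name in packages.items()
        if importlib.util.find_spec(import_name) is None
    ]


def gpu_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the script still runs without PyTorch

    return torch.cuda.is_available()


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
    print("GPU visible to PyTorch:", gpu_available())
```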