This repository contains a sentiment analysis project trained on the IMDb 50,000-review dataset (25k positive, 25k negative). The model classifies whether a movie review is positive or negative using deep learning (Keras/TensorFlow). A saved tokenizer (tokenizer.pkl) and trained model (IMDB_model.h5) are used for inference.
- Dataset: IMDb Movie Review Dataset (50,000 reviews)
- Public sources:
- Format: CSV containing `review` and `sentiment` columns
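As a rough illustration of the two-column layout, the CSV can be read with the standard library alone. This is a minimal sketch, not the project's loader: the helper name `read_reviews`, the 1/0 label encoding, and the two sample rows are assumptions made for the example, and it assumes the file starts with a `review,sentiment` header row.

```python
import csv
import io

# Assumed encoding for this sketch: positive -> 1, negative -> 0.
LABELS = {"positive": 1, "negative": 0}


def read_reviews(csv_file):
    """Return (texts, labels) from a CSV with `review` and `sentiment` columns."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row["review"])
        labels.append(LABELS[row["sentiment"].strip().lower()])
    return texts, labels


# Two made-up rows standing in for the real dataset file.
sample = io.StringIO(
    "review,sentiment\n"
    '"A gripping, well-acted drama.",positive\n'
    '"Dull plot, wooden dialogue.",negative\n'
)
texts, labels = read_reviews(sample)
print(labels)  # → [1, 0]
```

For the real file, pass an open file handle instead of `sample`, for example `open(path, newline="", encoding="utf-8")`, with `path` pointing at the CSV placed in `data/`.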
✅ Example row from dataset:
"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked...",positive
project-root/
├── data/ # raw dataset (ignored in Git)
├── models/ # saved model + tokenizer (ignored)
│ ├── IMDB_model.h5
│ └── tokenizer.pkl
├── src/ # preprocessing + training scripts
├── notebooks/ # optional Jupyter notebooks
├── README.md
├── requirements.txt
└── .gitignore
- Load and preprocess text (clean HTML, lowercase, remove punctuation, etc.)
- Tokenize and pad sequences (`tokenizer.pkl`)
- Train a neural network (LSTM/CNN/etc.) on padded sequences
- Save trained model (`IMDB_model.h5`) and tokenizer for later inference
- Load model + tokenizer to classify new reviews
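The cleaning, tokenizing, and padding steps can be sketched in plain Python to show what each one does to a review. This is a stand-in for illustration only: the project presumably uses the Keras tokenizer stored in tokenizer.pkl, and the helper names (`clean_text`, `build_vocab`, `encode_and_pad`) and the `num_words` and `maxlen` defaults below are assumptions, not values taken from the training scripts.

```python
import html
import re
import string
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Decode HTML entities, strip tags, lowercase, and remove punctuation."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(text.split())  # collapse repeated whitespace


def build_vocab(texts, num_words=10000):
    """Map the most frequent words to integer ids; id 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(num_words), start=1)}


def encode_and_pad(text, vocab, maxlen=8):
    """Turn text into ids, keep the last `maxlen` ids, and left-pad with zeros."""
    ids = [vocab[w] for w in text.split() if w in vocab]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


cleaned = clean_text("This movie was <br /><br />GREAT!!! Don&#39;t miss it.")
print(cleaned)  # → this movie was great dont miss it

vocab = build_vocab([cleaned])
print(encode_and_pad(cleaned, vocab, maxlen=10))
```

Left-padding and keeping the tail of long reviews mirrors the default behavior of Keras `pad_sequences`. Whichever scheme is used, the same one has to be applied at training and at inference time.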
Large binary artifacts (datasets, .h5 models, .pkl tokenizers) are not committed to Git because:
- They quickly bloat repo size
- GitHub blocks single files larger than 100 MB
- They update frequently and do not diff well
Instead, they should be stored using:
- Git LFS
- HuggingFace Hub
- Google Drive / S3 / GitHub Releases
- Or a download script (`download_data.py`)
- Python 3.x
- numpy
- pandas
- tensorflow / keras
- scikit-learn
- nltk (or similar for text cleaning)
Install using:
pip install -r requirements.txt
git clone https://github.com/<your-username>/IMDB-Sentiment-Analysis-Model.git
cd IMDB-Sentiment-Analysis-Model
macOS/Linux:
python -m venv .venv
source .venv/bin/activate
Windows:
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
- Place the CSV file into `data/`, or use a download script if provided.
python src/train.py # train model
python src/predict.py # classify new text
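For reference, inference with the saved artifacts might look roughly like the sketch below. It is not necessarily what src/predict.py does. It assumes the model ends in a single sigmoid unit where values near 1 mean positive, that tokenizer.pkl holds a pickled Keras tokenizer compatible with the installed TensorFlow version, and that `MAXLEN` matches the padding length used in training. The value 200 and the function names are placeholders.

```python
import pickle

MODEL_PATH = "models/IMDB_model.h5"
TOKENIZER_PATH = "models/tokenizer.pkl"
MAXLEN = 200  # placeholder: must equal the padding length used during training


def label_from_probability(prob, threshold=0.5):
    """Map a sigmoid output to a label (assumes 1 = positive)."""
    return "positive" if prob >= threshold else "negative"


def predict_sentiment(text):
    """Load the tokenizer and model, then classify one review."""
    import tensorflow as tf  # imported lazily so the helper above works without it

    with open(TOKENIZER_PATH, "rb") as f:
        tokenizer = pickle.load(f)
    model = tf.keras.models.load_model(MODEL_PATH)

    # Apply the same text cleaning used in training before this step.
    sequences = tokenizer.texts_to_sequences([text])
    padded = tf.keras.utils.pad_sequences(sequences, maxlen=MAXLEN)
    prob = float(model.predict(padded, verbose=0)[0][0])
    return label_from_probability(prob), prob
```

Calling `predict_sentiment("An unexpectedly moving film.")` would return a label and the raw probability, provided both files exist under `models/`.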
Add accuracy, loss curves, confusion matrix, or sample predictions here.
- Dataset: “Large Movie Review Dataset” — Andrew Maas et al. (Stanford AI Lab, 2011)
- Code: Created by
- Libraries: TensorFlow, Keras, NumPy, Pandas, Scikit-Learn
Pull requests are welcome — feel free to open issues or suggest improvements.
⭐ If you found this useful, consider giving the repo a star!