This project classifies films into four IMDb score categories (Bad, OK, Good, Excellent) using movie metadata, financial data, and social media metrics. We employ Decision Trees, K-Nearest Neighbors (KNN), and Random Forests to predict critical reception and understand key drivers of success.
Source: Kaggle IMDb 5000 Movie Dataset
Location: movie_metadata.csv
Key Features:
- Target: `imdb_score` (binned into 4 categories)
- Financial: `budget`, `gross`
- Metadata: `duration`, `year`, `country`, `content_rating`
- Social Metrics: `movie_facebook_likes`, `director_facebook_likes`, `actor_1_facebook_likes`, `actor_2_facebook_likes`, `actor_3_facebook_likes`
- Derived: `critic_review_ratio`, `other_actors_facebook_likes`
- Clone the repository

      git clone https://github.com/tanayprabhakar/FDSProj.git
      cd FDSProj

- Install R dependencies

      install.packages(c("dplyr", "ggplot2", "caret", "randomForest", "rpart", "ggrepel", "VIM", "plotly"))

- Project structure

      FDSProj/
      ├── movie_metadata.csv
      ├── data_preprocessing.R
      ├── rf_model.rds
      ├── knn_model.rds
      └── README.md

- Run the full analysis

      # in R console
      source("data_preprocessing.R")
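
The project structure lists two saved model files, `rf_model.rds` and `knn_model.rds`. A minimal sketch of how they might be loaded in a fresh R session follows. The helper `load_model_if_present` is hypothetical (it is not part of the repository), and it is an assumption that the files were written with `saveRDS()`.

```r
# Hypothetical helper (not part of the repository): read a saved model if the
# file exists, otherwise return NULL instead of raising an error.
load_model_if_present <- function(path) {
  if (!file.exists(path)) {
    message("Model file not found: ", path)
    return(NULL)
  }
  readRDS(path)
}

# File names are taken from the project structure above.
rf_model  <- load_model_if_present("rf_model.rds")
knn_model <- load_model_if_present("knn_model.rds")

# Inspect what was loaded, if anything.
if (!is.null(rf_model)) {
  print(class(rf_model))
}
```

To call `predict()` on a loaded model, the package that produced it (for example `randomForest` or `caret`) would most likely need to be attached with `library()` first.
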
- Removed duplicates and handled missing values
- Binned `imdb_score` into four categories: Bad (<=5), OK (5–6.5), Good (6.5–8), Excellent (>8)
- Engineered `critic_review_ratio` (`num_critic_for_reviews / num_user_for_reviews`)
- Combined `actor_2_facebook_likes` and `actor_3_facebook_likes` into `other_actors_facebook_likes`
- Simplified `country` to `USA`/`UK`/`Others`
- Standardized `content_rating` levels
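
The preprocessing steps above can be sketched in base R as follows. This is an illustration on a made-up toy data frame, not the code in `data_preprocessing.R`. Three details are assumptions: the bins are right-closed (so a score of exactly 6.5 falls in OK and exactly 8 in Good), "combined" means summing the two like counts, and the new column name `score_category` is hypothetical.

```r
# Toy stand-in for movie_metadata.csv; all values are invented for illustration.
movies <- data.frame(
  imdb_score             = c(4.2, 5.0, 6.4, 7.1, 8.0, 8.6),
  num_critic_for_reviews = c(20, 50, 120, 200, 310, 400),
  num_user_for_reviews   = c(100, 250, 300, 400, 900, 1600),
  actor_2_facebook_likes = c(100, 0, 500, 900, 1000, 12000),
  actor_3_facebook_likes = c(50, 10, 300, 400, 800, 9000),
  country                = c("USA", "UK", "France", "USA", "India", "UK"),
  stringsAsFactors       = FALSE
)

# Drop exact duplicate rows.
movies <- movies[!duplicated(movies), ]

# Bin the score into four ordered categories.
# right = TRUE gives intervals of the form (lower, upper] (assumed boundary rule).
movies$score_category <- cut(
  movies$imdb_score,
  breaks = c(-Inf, 5, 6.5, 8, Inf),
  labels = c("Bad", "OK", "Good", "Excellent"),
  right  = TRUE
)

# Ratio of critic reviews to user reviews.
movies$critic_review_ratio <-
  movies$num_critic_for_reviews / movies$num_user_for_reviews

# Combine the second and third actors' likes (assumed here to be a sum).
movies$other_actors_facebook_likes <-
  movies$actor_2_facebook_likes + movies$actor_3_facebook_likes

# Collapse every country other than USA and UK into "Others".
movies$country <- ifelse(movies$country %in% c("USA", "UK"),
                         movies$country, "Others")
movies$country <- factor(movies$country, levels = c("USA", "UK", "Others"))

print(movies[, c("imdb_score", "score_category", "country")])
```

On real data, rows where `num_user_for_reviews` is zero or missing would need handling before the division, since the ratio would otherwise be `Inf` or `NA`.
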
- High-budget (> $100M) films seldom achieve "Excellent" ratings
- UK films average 0.3 points higher IMDb scores than US films
- Weak correlation between Facebook likes and IMDb scores (r = 0.18)
| Model | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| Decision Tree | 78.0% | 71.3% | 72.4% |
| KNN (k = 9) | — | 71.4% | 74.6% |
| Random Forest | — | 76.4% | 76.6% |
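
For reference, an accuracy figure like those in the table is the share of films whose predicted category matches the actual one. The sketch below computes it from a confusion matrix in base R. The labels are invented for illustration and do not come from the project's models.

```r
# Category levels shared by actual and predicted labels, so the table is square.
lv <- c("Bad", "OK", "Good", "Excellent")

# Made-up labels for eight films (illustration only).
actual    <- factor(c("Bad", "OK", "Good", "Good", "Excellent", "OK", "Good", "Bad"),
                    levels = lv)
predicted <- factor(c("Bad", "Good", "Good", "Good", "Good", "OK", "OK", "Bad"),
                    levels = lv)

# Rows are actual classes, columns are predicted classes.
conf_mat <- table(Actual = actual, Predicted = predicted)

# Accuracy = correctly classified films (the diagonal) / all films.
acc <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(acc)
```

The `caret` package listed in the dependencies offers `confusionMatrix()`, which reports accuracy along with per-class statistics.
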
Top Features (Random Forest):
- `num_voted_users` (Importance: 42.6)
- `duration` (38.1)
- `critic_review_ratio` (35.8)
- Critic Influence: Films with `critic_review_ratio > 0.5` are 3× more likely to be Excellent.
- Ideal Runtime: 100–120 minutes yields the highest average score (7.2).
- Country Effect: UK productions score on average 0.4 points higher than US films.
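
A statement such as "N× more likely to be Excellent" can be read as a ratio of two proportions: the share of Excellent films above the threshold divided by the share below it. A minimal sketch of that calculation follows, using invented toy data (so the resulting ratio is not the project's figure). The column name `score_category` is an assumption.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  critic_review_ratio = c(0.2, 0.3, 0.4, 0.1, 0.6, 0.7, 0.9, 0.8),
  score_category      = c("Good", "OK", "Excellent", "Bad",
                          "Excellent", "Excellent", "Good", "OK"),
  stringsAsFactors    = FALSE
)

# Split films at the 0.5 threshold mentioned above.
high <- toy$critic_review_ratio > 0.5

# Share of Excellent films in each group.
p_high <- mean(toy$score_category[high]  == "Excellent")
p_low  <- mean(toy$score_category[!high] == "Excellent")

# Relative likelihood of an Excellent rating above vs. below the threshold.
lift <- p_high / p_low
print(c(p_high = p_high, p_low = p_low, lift = lift))
```
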
Contributions are welcome! Please fork the repository and create a pull request for bug fixes or feature requests.
Maintainers: Tanay & Tarang

Last Updated: May 2025