This project implements an XGBoost classifier to predict customer churn for a fictional telecommunications company, identifying customers likely to leave based on their attributes and service-usage patterns.
The Telco customer churn dataset describes a fictional telco that provided home phone and Internet services to 7,043 customers in California during Q3.
The dataset contains 33 features including:
- CustomerID: Unique identifier for each customer
- Gender: Male/Female
- Senior Citizen: Yes/No (65 or older)
- Partner: Yes/No (has a partner)
- Dependents: Yes/No (lives with dependents)
- Country: Customer's primary residence country
- State: Customer's primary residence state
- City: Customer's primary residence city
- Zip Code: Customer's zip code
- Latitude/Longitude: Geographic coordinates
- Tenure Months: Total months with the company
- Phone Service: Yes/No (home phone service)
- Multiple Lines: Yes/No (multiple telephone lines)
- Internet Service: No/DSL/Fiber Optic/Cable
- Online Security: Yes/No (additional security service)
- Online Backup: Yes/No (additional backup service)
- Device Protection: Yes/No (equipment protection plan)
- Tech Support: Yes/No (technical support plan)
- Streaming TV: Yes/No (streaming television)
- Streaming Movies: Yes/No (streaming movies)
- Contract: Month-to-Month/One Year/Two Year
- Paperless Billing: Yes/No
- Payment Method: Bank Withdrawal/Credit Card/Mailed Check
- Monthly Charges: Current monthly charge for all services
- Total Charges: Total charges to end of quarter
- Churn Value: Binary target (1 = left, 0 = remained)
- Churn Label: Yes/No (customer left/remained)
- Churn Score: Predictive score (0-100) from IBM SPSS Modeler
- CLTV: Customer Lifetime Value
- Churn Reason: Specific reason for leaving
Data Cleaning
- Removed columns with exit interview information (Churn Label, Churn Score, CLTV, Churn Reason)
- Eliminated columns with single unique values
- Removed CustomerID (a unique identifier with no predictive value) and Lat Long (redundant with the separate Latitude/Longitude columns)
- Replaced whitespace with underscores in column names and string values
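The cleaning steps above can be sketched in pandas on a toy frame; the column names follow the dataset description, but the exact code and sample values are assumptions for illustration:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
df = pd.DataFrame({
    "CustomerID": ["0001-A", "0002-B"],
    "Count": [1, 1],                              # single unique value
    "Lat Long": ["33.9, -118.2", "34.0, -118.4"],
    "Churn Label": ["Yes", "No"],                 # exit-interview leakage
    "Payment Method": ["Mailed Check", "Credit Card"],
})

# Drop exit-interview columns plus redundant identifiers
df = df.drop(columns=["Churn Label", "CustomerID", "Lat Long"])

# Eliminate columns with a single unique value
df = df.loc[:, df.nunique() > 1]

# Replace whitespace with underscores in column names and string values
df.columns = df.columns.str.replace(" ", "_")
df = df.apply(lambda col: col.str.replace(" ", "_") if col.dtype == object else col)
```

On the real dataset the same drop list would also include Churn Score, CLTV, and Churn Reason.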
Missing Value Handling
- Identified 11 missing values in Total_Charges column
- Filled missing values with 0 (customers with 0 tenure months)
- Converted Total_Charges to float64 datatype
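A minimal sketch of the missing-value treatment, assuming the blanks in Total_Charges were read in as non-numeric strings (the usual symptom for zero-tenure customers in this dataset):

```python
import pandas as pd

# Toy frame mimicking blank Total_Charges entries for 0-tenure customers
df = pd.DataFrame({
    "Tenure_Months": [0, 12, 24],
    "Total_Charges": [" ", "1200.5", "2400.0"],
})

# Coerce blanks to NaN, fill with 0, and cast to float64
df["Total_Charges"] = pd.to_numeric(df["Total_Charges"], errors="coerce")
df["Total_Charges"] = df["Total_Charges"].fillna(0).astype("float64")
```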
Feature Engineering
- Applied one-hot encoding to categorical variables
- Expanded feature set from 23 to 1,178 dimensions
- Maintained numerical features unchanged
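The encoding step can be illustrated with `pd.get_dummies`, which expands each categorical column into one indicator column per category while passing numeric columns through unchanged (column names here are assumptions based on the feature list):

```python
import pandas as pd

# Toy frame: one categorical and one numeric column
df = pd.DataFrame({
    "Contract": ["Month-to-Month", "One_Year", "Two_Year"],
    "Monthly_Charges": [70.35, 56.95, 42.30],
})

# One-hot encode the categorical column; Monthly_Charges is left as-is
X = pd.get_dummies(df, columns=["Contract"])
```

High-cardinality columns such as City and Zip Code explain most of the jump from 23 columns to 1,178 one-hot dimensions.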
Algorithm: XGBoost (eXtreme Gradient Boosting)
- Objective Function: binary:logistic
- Evaluation Metric: AUC (Area Under Curve)
- Random State: 42 (for reproducibility)
Key Hyperparameters:
- max_depth: Maximum tree depth
- learning_rate: Step size shrinkage
- gamma: Minimum loss reduction required to make a further partition
- reg_lambda: L2 regularization parameter
- scale_pos_weight: Positive class weight scaling
- colsample_bytree: Column subsample ratio
- subsample: Training instance subsample ratio
- Train-Validation Split: 80-20 split with stratification
- Stratification: Maintains class proportions (26.54% churn rate)
- Random State: 42 for reproducible results
- Training AUC: 0.9395
- Validation AUC: 0.8500
- Overall Accuracy: 80%
- Class Distribution: 26.54% churn rate
| class        | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.85      | 0.89   | 0.87     | 1035    |
| 1            | 0.64      | 0.55   | 0.59     | 374     |
| accuracy     |           |        | 0.80     | 1409    |
| macro avg    | 0.74      | 0.72   | 0.73     | 1409    |
| weighted avg | 0.79      | 0.80   | 0.79     | 1409    |
- Class Imbalance: Significant imbalance between churned (26.54%) and retained customers (73.46%)
- Model Performance: Good performance on majority class (0), moderate performance on minority class (1)
- Overfitting: Training AUC (0.9395) higher than validation AUC (0.8500) indicates some overfitting
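One common mitigation for the imbalance noted above is the `scale_pos_weight` hyperparameter already listed earlier: weighting the positive class by the negative-to-positive ratio. The counts below are derived from the reported 26.54% churn rate, not taken from the notebook:

```python
# Derive scale_pos_weight from the reported churn rate (assumed counts)
n_total = 7043
churn_rate = 0.2654
n_pos = round(n_total * churn_rate)   # ~1869 churned customers
n_neg = n_total - n_pos               # ~5174 retained customers

# Ratio to pass as XGBClassifier(scale_pos_weight=...)
scale_pos_weight = n_neg / n_pos
```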
The project implements hyperparameter optimization using RandomizedSearchCV:
Search Space:
- max_depth: [3, 4, 5, 6, 7, 8, 9, 10]
- learning_rate: [0.01, 0.1, 0.3]
- n_estimators: [100, 200, 300]
Best Parameters:
- n_estimators: 100
- max_depth: 3
- learning_rate: 0.01
The model provides multiple importance metrics:
- Weight: Number of times a feature is used in branches
- Gain: Average gain across splits
- Cover: Average coverage across splits
- Total Gain: Total gain across all splits
- Total Cover: Total coverage across all splits
- ROC Curves: Training vs. validation performance
- Confusion Matrix: Binary classification results
- AUC-ROC Plots: Model discrimination ability
- Tree Visualization: Optional XGBoost tree structure
pandas
numpy
xgboost
scikit-learn
matplotlib
graphviz (optional, for tree visualization: conda install graphviz python-graphviz)
- Python 3.x
- Jupyter Notebook support
- Random seed set to 42 for reproducibility
- Data Loading: Ensure the dataset file Telco_customer_churn.xlsx is in the working directory
- Notebook Execution: Run all cells sequentially in the Jupyter notebook
- Model Training: The notebook automatically handles data preprocessing and model training
- Results Analysis: View performance metrics, visualizations, and classification reports
project/
├── 22CS30020_P2.ipynb # Main Jupyter notebook
├── README.md # This documentation file
└── Telco_customer_churn.xlsx # Dataset file (not included)
- Dataset overview and feature understanding
- Missing value identification and treatment
- Data type conversion and validation
- Categorical variable encoding
- Whitespace normalization
- Feature selection and dimensionality reduction
- XGBoost classifier implementation
- Hyperparameter optimization
- Cross-validation and early stopping
- Multiple evaluation metrics
- Visualization of results
- Model interpretation and feature importance
- Class Imbalance: 26.54% churn rate requires careful handling
- Feature Engineering: High-dimensional feature space (1,178 features)
- Overfitting: Training performance significantly better than validation
- Data Quality: Missing values and data type inconsistencies
- Dataset Source: Kaggle - Telco Customer Churn
- IBM Documentation: Telco Customer Churn Analysis
- XGBoost Documentation: Official XGBoost Guide
This project demonstrates the implementation of XGBoost for customer churn prediction, showcasing data preprocessing, model development, and performance evaluation in a real-world business context.