# Telecom Customer Churn Prediction: End-to-End ML System
A full-stack machine learning system for predicting customer churn in the telecommunications industry. This project uses an XGBoost classifier, a FastAPI backend, and a Streamlit frontend to deliver real-time churn prediction, customer risk analysis, and batch scoring capabilities.
## Project Overview
Customer churn is one of the biggest revenue drains in the telecom industry. Early identification of at-risk customers enables companies to apply targeted retention strategies, significantly increasing customer lifetime value.
This project prioritizes recall (79.36%) so the model captures as many potential churners as possible, aligning the metric directly with business impact.
## System Architecture

```
+------------------------------------------------+
|                  Streamlit UI                  |
|  - Real-time predictions                       |
|  - Batch scoring                               |
|  - Customer risk analytics                     |
+-----------------------+------------------------+
                        | HTTP/REST
                        v
+-----------------------+------------------------+
|                FastAPI Backend                 |
|  - Prediction endpoint                         |
|  - Preprocessing pipeline                      |
|  - Model + threshold loading                   |
+-----------------------+------------------------+
                        | Load model
                        v
+-----------------------+------------------------+
|                XGBoost ML Model                |
|  - churn_xgb.pkl                               |
|  - Recall = 79.36% (primary metric)            |
|  - F1 score = 0.642                            |
+------------------------------------------------+
```
## Key Features

- **High-Recall ML Model (79.36%)**: captures the maximum number of churners, critical for retention strategy.
- **FastAPI-Powered REST API**: production-ready inference endpoint deployed on Render.
- **Interactive Streamlit Frontend**: simple, intuitive interface for non-technical business users.
- **Real-Time + Batch Predictions**: predict churn for individual customers or entire datasets.
- **Fully Deployed**: backend on Render, frontend on Streamlit Cloud.
## Dataset

- **Source:** Telco Customer Churn Dataset (Kaggle)
- **File:** `WA_Fn-UseC_-Telco-Customer-Churn.csv`
- **Shape:** 7,043 rows × 21 features

### Target Variable

- `Churn`: Yes/No
- Class imbalance: 73.5% non-churn / 26.5% churn
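The class split can be sanity-checked with a small helper (`class_balance` is an illustrative name; the commented lines assume the Kaggle CSV sits in the working directory):

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the total, e.g. {'No': 0.735, 'Yes': 0.265}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()}

# On the full dataset this comes out to roughly 73.5% "No" / 26.5% "Yes":
# import pandas as pd
# df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# print(class_balance(df["Churn"]))
```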
### Feature Categories

**Demographics**

- Gender
- SeniorCitizen
- Partner
- Dependents

**Services**

- PhoneService
- MultipleLines
- InternetService (DSL / Fiber Optic / No)
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- TechSupport
- StreamingTV
- StreamingMovies

**Account Information**

- Tenure
- Contract
- PaperlessBilling
- PaymentMethod
- MonthlyCharges
- TotalCharges
## Data Preprocessing

### Cleaning

- Dropped `customerID`
- Converted `TotalCharges` to numeric
- Filled missing values with the median
- Removed `OnlineBackup` due to data-quality issues
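These cleaning steps can be sketched as follows (a sketch, not the notebook's exact code; `clean` is a hypothetical helper, and `TotalCharges` is assumed to arrive as strings, as it does in the raw CSV):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above."""
    df = df.drop(columns=["customerID", "OnlineBackup"], errors="ignore")
    # TotalCharges is read as a string; blank entries become NaN here
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
    return df
```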
### Encoding

- Label encoding for binary fields
- One-hot encoding for multi-class fields
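A minimal sketch of this encoding split, assuming scikit-learn is installed and that binary vs. multi-class columns can be told apart by their number of unique values (`encode` is an illustrative helper, not the notebook's exact code):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode binary object columns, one-hot encode multi-class ones."""
    df = df.copy()
    obj_cols = df.select_dtypes(include="object").columns
    binary_cols = [c for c in obj_cols if df[c].nunique() == 2]
    multi_cols = [c for c in obj_cols if df[c].nunique() > 2]
    for col in binary_cols:
        df[col] = LabelEncoder().fit_transform(df[col])
    return pd.get_dummies(df, columns=multi_cols)
```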
**Final feature count:** 22 engineered features

## Model Development

### Train/Test Split

- Train: 80%
- Test: 20%
- `random_state = 42`
### Primary Model: XGBoost Classifier

**Hyperparameters**

- `n_estimators=200`
- `learning_rate=0.05`
- `max_depth=5`
- `subsample=0.8`
- `colsample_bytree=0.8`
- `scale_pos_weight=2.7`
**Performance**

| Metric | Score |
|---|---|
| Accuracy | 76.58% |
| F1 Score | 0.642 |
| **Recall (primary)** | **79.36%** |

**Why selected?**
- Highest recall: captures roughly 24 percentage points more churners than logistic regression
- Balanced F1 score
- Handles class imbalance effectively
### Secondary Model: Logistic Regression

| Metric | Score |
|---|---|
| Accuracy | 81.83% |
| Recall | 55.50% |
| F1 Score | 0.618 |
Good for interpretability, but not suitable for high-recall objectives.
## Business Metric Prioritization

1. **Recall**: most important (don't miss churners)
2. **F1 Score**: balanced evaluation
3. **Accuracy**: least important (misleading on imbalanced data)
Example: a naive model that predicts "No Churn" for every customer scores 73.5% accuracy yet catches zero churners.
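The arithmetic behind that baseline:

```python
churn_rate = 0.265                  # positive-class share of the dataset
baseline_accuracy = 1 - churn_rate  # always predicting "No Churn"
baseline_recall = 0.0               # ...which catches zero churners
print(f"accuracy={baseline_accuracy:.1%}, recall={baseline_recall:.0%}")
# accuracy=73.5%, recall=0%
```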
## Threshold Optimization
The default 0.5 threshold is not suitable for imbalanced churn data.
After tuning, the selected threshold is **0.01**. This aggressively trades precision for recall, in line with the business goal of missing as few churners as possible.
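One way such a threshold sweep can look, using only the standard library (the selection criterion and helper names here are illustrative, not necessarily the notebook's exact procedure):

```python
def recall_at(threshold, probs, labels):
    """Recall of the positive class at a given decision threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    pos = sum(labels)
    return tp / pos if pos else 0.0

def pick_threshold(probs, labels, target_recall=0.79):
    """Highest threshold that still achieves the target recall."""
    candidates = [t / 100 for t in range(1, 100)]  # 0.01 .. 0.99
    feasible = [t for t in candidates
                if recall_at(t, probs, labels) >= target_recall]
    return max(feasible) if feasible else min(candidates)
```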
## Saved Artifacts

```
churn_xgb.pkl    # Trained XGBoost model
threshold.pkl    # Optimized threshold = 0.01
```
## FastAPI Backend

Example endpoint (simplified):

```python
from fastapi import FastAPI
import joblib

from preprocessing import preprocess  # feature pipeline in backend/preprocessing.py

app = FastAPI()

model = joblib.load("models/churn_xgb.pkl")
threshold = joblib.load("models/threshold.pkl")

@app.post("/predict")
def predict(data: dict):
    features = preprocess(data)
    # float() keeps the numpy value JSON-serializable
    prob = float(model.predict_proba([features])[0][1])
    pred = int(prob >= threshold)
    return {
        "churn_probability": prob,
        "prediction": pred,
        "risk_level": "High" if pred == 1 else "Low",
    }
```
Deployed at:
https://churn-prediction-2qrp.onrender.com/
## Streamlit Frontend

- Real-time prediction
- Batch CSV upload
- Visual analytics
Deployed at:
https://churn-frontend-g3ku8j45b7mfsg6s4ztjfy.streamlit.app/
## Installation & Setup

**Clone the repository**

```bash
git clone https://github.com/yourusername/telecom-churn-prediction.git
cd telecom-churn-prediction
```
**Backend setup**

```bash
cd backend
pip install -r requirements.txt
uvicorn app:app --reload
```
**Frontend setup**

```bash
cd churn-frontend
pip install -r requirements.txt
streamlit run app.py
```
## API Usage Example

```bash
curl -X POST "https://churn-prediction-2qrp.onrender.com/predict" \
  -H "Content-Type: application/json" \
  -d '{
        "gender": "Female",
        "SeniorCitizen": 0,
        "Partner": "Yes",
        "tenure": 12,
        "MonthlyCharges": 70
      }'
```
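The same request from Python, using only the standard library (`predict_churn` is a hypothetical helper; the commented call requires network access):

```python
import json
from urllib import request

API_URL = "https://churn-prediction-2qrp.onrender.com/predict"

def predict_churn(customer: dict, url: str = API_URL) -> dict:
    """POST a customer record to /predict and return the JSON response."""
    req = request.Request(
        url,
        data=json.dumps(customer).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

customer = {
    "gender": "Female",
    "SeniorCitizen": 0,
    "Partner": "Yes",
    "tenure": 12,
    "MonthlyCharges": 70,
}
# result = predict_churn(customer)  # requires network access
# print(result["churn_probability"], result["risk_level"])
```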
## Business Impact

Assumptions:

- Average revenue per user: $64/month
- Customer lifetime value: ~$1,500
- Retention offer cost: $75 per customer
- Retention success rate: 40%
**With XGBoost (79.36% recall)**

- Correctly identifies 1,483 churners
- Potential revenue saved: $889,800 (1,483 × $1,500 × 40%)
- Campaign cost: $111,225 (1,483 × $75)
- **Net benefit: $778,575**
**With Logistic Regression (55.50% recall)**

- Revenue saved: only ~$623,000

XGBoost prevents roughly $267,000 in additional revenue loss.
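The savings figures follow from the assumptions above; a quick arithmetic check (`net_benefit` is an illustrative helper):

```python
LTV = 1500        # customer lifetime value ($)
OFFER_COST = 75   # retention offer cost per contacted customer ($)
SUCCESS = 0.40    # retention success rate

def net_benefit(churners_caught: int):
    """Revenue saved, campaign cost, and net benefit of a retention campaign."""
    saved = churners_caught * LTV * SUCCESS
    cost = churners_caught * OFFER_COST
    return saved, cost, saved - cost

saved, cost, net = net_benefit(1483)  # XGBoost at 79.36% recall
# saved ~= $889,800, cost = $111,225, net ~= $778,575
```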
## Repository Structure

```
telecom-churn-prediction/
├── backend/
│   ├── app.py
│   ├── preprocessing.py
│   ├── models/
│   │   ├── churn_xgb.pkl
│   │   └── threshold.pkl
│   ├── requirements.txt
│   └── render.yaml
├── churn-frontend/
│   ├── app.py
│   ├── pages/
│   │   ├── 01_prediction.py
│   │   ├── 02_batch.py
│   │   └── 03_analytics.py
│   ├── utils/
│   │   ├── api_client.py
│   │   └── visualizations.py
│   ├── requirements.txt
│   └── .streamlit/
│       └── config.toml
├── notebooks/
│   └── main_analysis.ipynb
├── README.md
└── .gitignore
```
## Future Enhancements

**ML Improvements**

- SHAP explainability

**Feature Engineering Pipeline**

Automated feature extraction from raw customer data, including recency-frequency-monetary (RFM) metrics, engagement-velocity tracking, product usage patterns, and customer lifetime value calculations. This creates a rich set of predictive signals that update dynamically as new data arrives.

**Explainable AI Dashboard**

SHAP- or LIME-based explanations showing which factors drive each customer's churn risk score. This gives the retention team actionable insights like "customer likely to churn due to: declining login frequency (45% impact), support ticket volume (30%), reduced feature usage (25%)" rather than a black-box probability.

**Segmented Retention Strategies**

Automatic customer clustering based on churn drivers and behavioral patterns, with recommended retention tactics per segment. For example, price-sensitive churners get discount offers, while feature-confused users get onboarding calls. Each segment is tracked for intervention effectiveness.

**API and Real-Time Scoring**

An API endpoint that scores individual customers on demand for real-time use cases such as triggered email campaigns, chat interventions during support calls, or dynamic pricing. Includes batch scoring for daily/weekly refreshes of customer-base risk scores.
## Contact

For issues, suggestions, or collaboration: workwithanshuman9468@gmail.com
## Project Status

- Production ready
- Model version: 1.0
- Last updated: Dec 2025