This project focuses on building a Decision Tree Classifier to predict loan approval using a housing finance dataset. It demonstrates a complete end-to-end machine learning workflow including:
- Data preprocessing
- Feature engineering
- Model building
- Model evaluation
- Hyperparameter tuning
The project was implemented using Python in Google Colab and showcases how decision trees can be applied to real-world financial classification problems.
- To predict whether a loan will be approved or not
- To understand Decision Tree concepts like entropy, Gini index, and information gain
- To perform data preprocessing and feature engineering
- To improve model performance using hyperparameter tuning
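The entropy, Gini index, and information gain named in the objectives above can be worked through by hand on a small example. The following is a minimal sketch only: the function names and the toy labels are assumptions made for illustration and are not taken from the notebook.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels (made up): 1 = approved, 0 = not approved.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # rows on one side of a hypothetical threshold
right = [1, 0, 0, 0]   # rows on the other side

print(entropy(parent))                        # 1.0 for a 50/50 class mix
print(gini(parent))                           # 0.5 for a 50/50 class mix
print(information_gain(parent, left, right))  # positive: the split reduces entropy
```

A decision tree picks, at each node, the split that reduces impurity the most; a split that leaves both children as mixed as the parent would have an information gain of zero.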
- Dataset: Housing Finance Dataset
- Total Records: 1570 rows
- Total Features: 22 columns
- Dataset link: https://1drv.ms/f/c/f483042b9735aab9/IgD-H4rA0jvITLwIi8EL5YneARJZcq-SJ7cIEQkmyUe-fFg?e=6TIYTs
- Age
- Income (TotInc)
- Loan Requested (LoanReq)
- Loan-to-Value Ratio (LTV)
- FOIR (Fixed Obligation to Income Ratio)
- Employer Type
- Marital Status
- Accommodation Type
- Decision → (1 = Approved, 0 = Not Approved)
- Data Cleaning & Preprocessing
- Label Encoding & Dummy Variable Creation
- Decision Tree Model Building
- Model Evaluation using Accuracy & Confusion Matrix
- Hyperparameter Tuning using GridSearchCV
- Data Visualization
- Python
- Google Colab
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
The dataset contains both categorical and numerical variables:
- Categorical: Employer_Type, Build_Selfcon, Tier
- Numerical: Age, Income, LoanReq, LTV, etc.
Dummy variables were created for categorical columns, and boolean values were converted into numeric format for model compatibility.
The dataset is stored in Google Drive and accessed within Google Colab. This allows easy file management and integration with the notebook environment.
The dataset is loaded into a Pandas DataFrame using functions like read_csv(). This converts raw data into a structured format for analysis.
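As a rough illustration of the loading step, the sketch below reads a tiny in-memory CSV in place of the real file. The file path used in the notebook is not given here, so the data values are made up, and the column names are assumptions based on the feature list above.

```python
import io
import pandas as pd

# In Colab, Google Drive is typically mounted before reading the file:
#   from google.colab import drive
#   drive.mount("/content/drive")
# The CSV would then be read from a path under /content/drive/.

# Stand-in for the real dataset: a few made-up rows.
csv_text = """Age,TotInc,LoanReq,LTV,FOIR,Employer_Type,Decision
34,52000,1500000,78.5,0.42,Private,1
45,31000,2000000,85.0,0.61,Government,0
29,,900000,70.2,0.35,Private,1
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (rows, columns)
print(df.dtypes)  # inferred type of each column
```

Note that the empty TotInc field in the third row is read as a missing value (NaN), which is what the later missing-value step has to deal with.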
Initial exploration is performed to understand the dataset:
- head() → View first few rows
- describe() → Get statistical summary
- isnull() → Identify missing values
This step helps in understanding data distribution and detecting issues.
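The three exploration calls can be tried on a small DataFrame. This is a sketch with made-up rows; only the column names are borrowed from the feature list above.

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({
    "Age": [34, 45, 29, 51],
    "TotInc": [52000.0, 31000.0, None, 78000.0],
    "LTV": [78.5, 85.0, 70.2, 60.0],
    "Decision": [1, 0, 1, 1],
})

print(df.head())               # first rows (up to five by default)
print(df.describe())           # count, mean, std, min, quartiles, max per numeric column
missing = df.isnull().sum()    # number of missing values in each column
print(missing)
```

Chaining `.sum()` after `isnull()` turns the element-wise True/False table into a per-column count, which is usually easier to read.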
Missing or null values are treated using appropriate methods such as filling with mean/median or removing rows/columns.
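One possible way to apply these treatments is sketched below. The data is made up, and filling a categorical column with its most frequent value is an added assumption; the text above only names mean/median filling and removal.

```python
import pandas as pd

# Made-up rows with one gap in each column.
df = pd.DataFrame({
    "TotInc": [52000.0, 31000.0, None, 78000.0],
    "Employer_Type": ["Private", None, "Private", "Government"],
})

# Numeric column: replace missing values with the column median.
df["TotInc"] = df["TotInc"].fillna(df["TotInc"].median())

# Categorical column (assumption): replace missing values with the most frequent value.
df["Employer_Type"] = df["Employer_Type"].fillna(df["Employer_Type"].mode()[0])

# Alternative: drop incomplete rows instead of filling them.
# df = df.dropna()

print(df)
```

The median is less sensitive to extreme incomes than the mean, which is one reason it is often preferred for skewed financial columns.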
Categorical features (like Employer Type, Marital Status) are converted into numerical format using:
- Label Encoding
- One-Hot Encoding
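A minimal label-encoding sketch follows. The column name `Marital_Status` and its values are assumed for illustration; the notebook may encode different columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for a hypothetical Marital_Status column.
df = pd.DataFrame({"Marital_Status": ["Married", "Single", "Married", "Divorced"]})

encoder = LabelEncoder()
df["Marital_Status_enc"] = encoder.fit_transform(df["Marital_Status"])

print(list(encoder.classes_))  # categories in sorted order; position = integer code
print(df)
```

Label encoding assigns integers in sorted order of the category names, so the codes carry no real ranking. Tree models can still split on them, but one-hot encoding avoids implying an order.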
New features may be created or existing ones modified to improve model performance and capture important patterns.
Categorical variables are transformed into dummy/indicator variables so that machine learning models can process them.
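Dummy variable creation can be sketched with `pandas.get_dummies`. The rows below are made up, and whether the notebook drops one dummy level per column is not stated, so this example keeps all levels.

```python
import pandas as pd

# Made-up rows; Employer_Type is one of the categorical columns listed above.
df = pd.DataFrame({
    "Employer_Type": ["Private", "Government", "Private"],
    "LTV": [78.5, 85.0, 70.2],
})

# One indicator column per category; the original column is replaced.
encoded = pd.get_dummies(df, columns=["Employer_Type"])

# Recent pandas versions produce boolean dummy columns; cast them to 0/1 integers.
bool_cols = encoded.select_dtypes(include="bool").columns
encoded = encoded.astype({col: int for col in bool_cols})

print(encoded)
```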
The dataset is divided into training and testing sets (e.g., 80% train, 20% test).
- Training data → Used to train the model
- Testing data → Used to evaluate performance
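The split can be sketched with scikit-learn's `train_test_split`. The data here is synthetic, and `random_state` and `stratify` are illustrative choices rather than settings confirmed by the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of rows held out for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep the class ratio similar in both sets
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

Applied to the 1570 rows of this dataset, a 20% test split gives 314 test rows, which matches the total of the confusion matrix reported in the results.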
A Decision Tree Classifier is trained using the training dataset.
The model splits data based on feature values to make predictions.
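A minimal training sketch is shown below on synthetic data. The hyperparameter values (`criterion="gini"`, `max_depth=4`) are assumptions chosen for illustration, not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed loan data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative settings: Gini impurity as split criterion, depth capped at 4.
clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(clf.get_depth(), clf.get_n_leaves())  # size of the fitted tree
```

Each internal node of the fitted tree tests one feature against a threshold, and each leaf assigns the majority class of the training rows that reach it.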
Accuracy measures the share of predictions that are correct out of all predictions made.
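As a small worked example, accuracy can be computed with `accuracy_score`. The label lists are made up for illustration.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 1 = approved, 0 = not approved.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
print(acc)  # 6 of the 8 predictions match, so 0.75
```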
The confusion matrix provides a detailed breakdown of predictions:
- True Positives
- True Negatives
- False Positives
- False Negatives
This helps in understanding model performance beyond accuracy.
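The four counts can be read off scikit-learn's `confusion_matrix`. This sketch uses made-up labels; for binary 0/1 labels the matrix rows are actual classes and the columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = approved, 0 = not approved.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

In a loan setting the two error types have different costs: a false positive approves a loan that should have been rejected, while a false negative turns away a good applicant.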
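GridSearchCV is used to search for the best combination of hyperparameters (such as maximum tree depth and split criterion) using cross-validation.
Tuning can improve model performance, and constraining parameters such as tree depth helps reduce overfitting.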
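A grid search over a decision tree might look like the sketch below. The parameter grid, the 5-fold setting, and the synthetic data are all assumptions; the grid actually used in the notebook is not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical grid: 2 x 4 x 3 = 24 candidate combinations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 10, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)           # best combination found
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` holds a tree refitted on all the supplied data with the best parameters, and it can be evaluated on the held-out test set.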
The project follows a complete machine learning workflow:
Data → Preprocessing → Model Building → Evaluation → Optimization
pip install pandas numpy matplotlib seaborn scikit-learn
The complete implementation is available in the notebook:
👉 Decisiontree.ipynb https://colab.research.google.com/drive/1C4JxM-mn9V4HG_EygJRhVR4EuVYfA0Uu?usp=drive_link
- Model Accuracy: ~80.89%
- Confusion Matrix:
[[ 20  52]
 [  8 234]]
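The model predicts approvals well, with 234 true positives and only 8 missed approvals. Assuming the matrix follows scikit-learn's layout (rows = actual, columns = predicted), the 52 false positives show that it is much weaker at identifying applications that should not be approved.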
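The reported accuracy can be checked directly against the confusion matrix above. This sketch assumes scikit-learn's layout (rows = actual class, columns = predicted class, label order 0 then 1); the per-class metrics are derived from the reported counts and are not figures quoted from the notebook.

```python
import numpy as np

# Confusion matrix as reported above.
# Assumed layout: rows = actual, columns = predicted, labels ordered [0, 1].
cm = np.array([[20, 52],
               [8, 234]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # (234 + 20) / 314
recall_approved = tp / (tp + fn)      # share of actual approvals that were caught
precision_approved = tp / (tp + fp)   # share of predicted approvals that were right
recall_rejected = tn / (tn + fp)      # share of actual rejections that were caught

print(f"accuracy           = {accuracy:.4f}")
print(f"recall (approved)  = {recall_approved:.4f}")
print(f"precision (approved) = {precision_approved:.4f}")
print(f"recall (rejected)  = {recall_rejected:.4f}")
```

The accuracy works out to 254 / 314, which agrees with the reported ~80.89%. Under the assumed layout, recall on the "not approved" class is 20 / 72, so accuracy alone overstates how well rejections are identified.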
Visualizations were created using Matplotlib and Seaborn for better interpretation.
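Which plots the notebook contains is not listed here, so the following is only one plausible example: a confusion matrix heatmap drawn with plain Matplotlib from the reported counts. The axis orientation (rows = actual, columns = predicted) is an assumption. Seaborn's `sns.heatmap(cm, annot=True, fmt="d")` would produce a similar figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Reported confusion matrix; assumed rows = actual, columns = predicted.
cm = np.array([[20, 52],
               [8, 234]])
labels = ["Not Approved", "Approved"]

fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(cm, cmap="Blues")

ax.set_xticks([0, 1])
ax.set_xticklabels(labels)
ax.set_yticks([0, 1])
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")

# Write each count in the centre of its cell.
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")

fig.colorbar(im, ax=ax)
fig.tight_layout()
```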
1. Open the notebook in Google Colab
2. Upload the dataset
3. Run all cells step by step
- LTV, FOIR, and Income are key factors influencing loan approval
- Decision Trees provide easy interpretability through rules
- Limiting tree depth helps avoid overfitting
- Hyperparameter tuning significantly improves model performance
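The rule-based interpretability and feature ranking mentioned above can be inspected with `export_text` and `feature_importances_`. This sketch trains on synthetic data; the feature names are reused purely as labels, so the printed rules and importances say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Names borrowed from the feature list for readability; the data itself is synthetic.
feature_names = ["LTV", "FOIR", "TotInc", "Age"]
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# A shallow tree keeps the rule set short and limits overfitting.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Human-readable if/else rules for the fitted tree.
rules = export_text(clf, feature_names=feature_names)
print(rules)

# Relative contribution of each feature to the tree's splits (sums to 1).
importances = dict(zip(feature_names, clf.feature_importances_))
print(importances)
```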
If you found this project helpful, please give it a ⭐ on GitHub and share it with others! 🚀

