Dhathrinarne/Decision-tree

🏠 Loan Approval Prediction using Decision Tree



📌 Overview

This project focuses on building a Decision Tree Classifier to predict loan approval using a housing finance dataset. It demonstrates a complete end-to-end machine learning workflow including:

  • Data preprocessing
  • Feature engineering
  • Model building
  • Model evaluation
  • Hyperparameter tuning

The project was implemented using Python in Google Colab and showcases how decision trees can be applied to real-world financial classification problems.


🎯 Objective

  • To predict whether a loan will be approved or not
  • To understand Decision Tree concepts like entropy, Gini index, and information gain
  • To perform data preprocessing and feature engineering
  • To improve model performance using hyperparameter tuning
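The impurity measures named in the objectives can be written out in a few lines. The sketch below is illustrative only: the function names (`entropy`, `gini`, `information_gain`) and the toy labels are assumptions, not code taken from the notebook.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy example: 4 approved (1) and 4 not approved (0), split into two pure groups.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 1], [0, 0, 0, 0]

print(entropy(parent))                          # 1.0
print(gini(parent))                             # 0.5
print(information_gain(parent, [left, right]))  # 1.0
```

A perfectly balanced node has the highest impurity, and a split that produces pure children recovers all of it as information gain.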

📊 Dataset

Key Features:

  • Age
  • Income (TotInc)
  • Loan Requested (LoanReq)
  • Loan-to-Value Ratio (LTV)
  • FOIR (Fixed Obligation to Income Ratio)
  • Employer Type
  • Marital Status
  • Accommodation Type

Target Variable:

  • Decision → (1 = Approved, 0 = Not Approved)

⚙️ Features

  • Data Cleaning & Preprocessing
  • Label Encoding & Dummy Variable Creation
  • Decision Tree Model Building
  • Model Evaluation using Accuracy & Confusion Matrix
  • Hyperparameter Tuning using GridSearchCV
  • Data Visualization

🛠️ Tools & Technologies

  • Python
  • Google Colab
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-learn

📁 Dataset Details

The dataset contains both categorical and numerical variables:

Categorical Variables:

  • Employer_Type
  • Build_Selfcon
  • Tier

Numerical Variables:

  • Age
  • Income
  • LoanReq
  • LTV
  • etc.

Dummy variables were created for categorical columns, and boolean values were converted into numeric format for model compatibility.
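A minimal sketch of that step, assuming pandas `get_dummies` is used for the dummy variables. The column names come from the lists above, but the category values (`Private`, `T1`, and so on) are made up for illustration.

```python
import pandas as pd

# Toy rows; the category values are hypothetical.
df = pd.DataFrame({
    "Age": [34, 45, 29],
    "Employer_Type": ["Private", "Government", "Private"],
    "Tier": ["T1", "T2", "T1"],
})

# One indicator column per category level.
encoded = pd.get_dummies(df, columns=["Employer_Type", "Tier"])

# Depending on the pandas version, the indicators may be bool; cast them to 0/1.
bool_cols = encoded.select_dtypes(include="bool").columns
encoded = encoded.astype({col: int for col in bool_cols})

print(encoded.columns.tolist())
print(encoded)
```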


🔄 Project Flow

1. Data Collection from Google Drive

The dataset is stored in Google Drive and accessed within Google Colab. This allows easy file management and integration with the notebook environment.


2. Data Loading using Pandas

The dataset is loaded into a Pandas DataFrame using functions like read_csv(). This converts raw data into a structured format for analysis.
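Steps 1 and 2 might look roughly like the sketch below. The helper name `load_loan_data` and the file path are hypothetical; the actual file name and Drive location are not given in this README.

```python
import pandas as pd

# Inside Colab, Drive is usually mounted first; elsewhere this import fails and is skipped.
try:
    from google.colab import drive
    drive.mount("/content/drive")
except ImportError:
    pass


def load_loan_data(csv_source):
    """Read the loan dataset from a CSV path or file-like object into a DataFrame."""
    return pd.read_csv(csv_source)


# Hypothetical path; replace it with the real location of the file in Drive.
# df = load_loan_data("/content/drive/MyDrive/loan_data.csv")
```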


3. Data Exploration

Initial exploration is performed to understand the dataset:

  • head() → View first few rows
  • describe() → Get statistical summary
  • isnull() → Identify missing values

This step helps in understanding data distribution and detecting issues.
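The three calls above can be tried on a small stand-in DataFrame. The column names follow the dataset description; the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data with two deliberately missing values.
df = pd.DataFrame({
    "Age": [34, 45, np.nan, 29],
    "TotInc": [52000, 61000, 38000, np.nan],
    "Decision": [1, 1, 0, 1],
})

print(df.head())       # first rows (up to five)
print(df.describe())   # count, mean, std, min, quartiles, max per numeric column

missing = df.isnull().sum()
print(missing)         # number of missing values per column
```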


4. Data Preprocessing

a. Handling Missing Values

Missing or null values are treated using appropriate methods such as filling with mean/median or removing rows/columns.
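Both options can be sketched on toy data. Which one the notebook applies to which column is not stated here, so treat the choice below as an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [34, np.nan, 29, 45],
    "TotInc": [52000, 61000, np.nan, 38000],
    "Decision": [1, 1, 0, 1],
})

# Option 1: fill numeric gaps with the column median (less sensitive to outliers than the mean).
filled = df.fillna(df.median(numeric_only=True))

# Option 2: drop every row that contains a missing value.
dropped = df.dropna()

print(filled)
print(len(dropped))  # 2
```

Filling keeps all rows at the cost of introducing estimated values; dropping keeps only observed values but shrinks the dataset.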

b. Encoding Categorical Variables

Categorical features (like Employer Type, Marital Status) are converted into numerical format using:

  • Label Encoding
  • One-Hot Encoding
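A short sketch of label encoding with scikit-learn's `LabelEncoder`; the marital status values are made up. Note that scikit-learn documents `LabelEncoder` as intended for target labels, so applying it to an input feature is a convenience rather than the recommended route.

```python
from sklearn.preprocessing import LabelEncoder

marital_status = ["Married", "Single", "Married", "Divorced"]  # hypothetical values

encoder = LabelEncoder()
codes = encoder.fit_transform(marital_status)

print(list(encoder.classes_))  # ['Divorced', 'Married', 'Single']
print(codes.tolist())          # [1, 2, 1, 0]
```

The integer codes follow alphabetical order and carry no real ranking, which is one reason one-hot encoding is often preferred for nominal categories.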

c. Feature Engineering

New features may be created or existing ones modified to improve model performance and capture important patterns.

d. Dummy Variable Creation

Categorical variables are transformed into dummy/indicator variables, which is the practical form of one-hot encoding: each category level becomes its own 0/1 column so that machine learning models can process it.


5. Model Pipeline

a. Train-Test Split

The dataset is divided into training and testing sets (e.g., 80% train, 20% test).

  • Training data → Used to train the model
  • Testing data → Used to evaluate performance
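An 80/20 split could be sketched with `train_test_split` as follows. The data is synthetic, and the `random_state` and `stratify` settings are illustrative choices rather than the notebook's confirmed configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix X and target y (100 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 30 + [1] * 70)

# 80/20 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```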

b. Decision Tree Model Building

A Decision Tree Classifier is trained using the training dataset.
The model splits data based on feature values to make predictions.
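A minimal training sketch on synthetic data. The hyperparameters shown (`criterion="gini"`, `max_depth=4`) are arbitrary illustrative values, not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the loan features and the Decision target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the tree on the training portion only.
model = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Predict labels for the held-out rows.
predictions = model.predict(X_test)
print(model.get_depth(), predictions[:10])
```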


6. Model Evaluation

a. Accuracy Score

Accuracy is the fraction of predictions that are correct: the number of correct predictions divided by the total number of predictions.

b. Confusion Matrix

Provides a detailed breakdown of predictions:

  • True Positives
  • True Negatives
  • False Positives
  • False Negatives

This helps in understanding model performance beyond accuracy.
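Both metrics are available in `sklearn.metrics`. The labels below are made up so that the four cells of the matrix are easy to check by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 1]  # made-up actual labels
y_pred = [1, 1, 1, 1, 0, 1, 0, 0]  # made-up predictions

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

print(acc)             # 0.75
print(tn, fp, fn, tp)  # 2 1 1 4
```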


7. Hyperparameter Tuning

GridSearchCV for Optimal Parameters

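GridSearchCV is used to find the best combination of parameters (such as maximum tree depth and split criterion) by cross-validating every combination in a predefined grid.
Selecting parameters by cross-validated score can improve performance on unseen data, and constraining values such as the maximum depth helps limit overfitting.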
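A grid search over a decision tree might be set up as in the sketch below. The grid values, `cv=5`, and the synthetic data are assumptions for illustration; the notebook's actual grid is not listed in this README.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Illustrative grid: 2 x 4 x 2 = 16 candidate settings.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation for each candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on all the supplied data with the winning parameters.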


✅ Summary

The project follows a complete machine learning workflow:
Data → Preprocessing → Model Building → Evaluation → Optimization

Working through these stages in order gives a reproducible baseline model whose performance can be measured and then tuned.

▶️ How to Run

Install required libraries

pip install pandas numpy matplotlib seaborn scikit-learn

💻 Code

The complete implementation is available in the notebook:

👉 Decisiontree.ipynb https://colab.research.google.com/drive/1C4JxM-mn9V4HG_EygJRhVR4EuVYfA0Uu?usp=drive_link


📈 Result

  • Model Accuracy: ~80.89%

Confusion Matrix:

[[ 20 52]

[ 8 234]]

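Assuming the matrix follows scikit-learn's default layout (rows = actual, columns = predicted, class 0 first), the model identifies approvals well: 234 true positives against only 8 missed approvals. It is much weaker on rejections, correctly flagging 20 of the 72 non-approved applications and predicting the other 52 as approved.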
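The reported figures can be checked with a little arithmetic. This sketch assumes the matrix is laid out as `[[TN, FP], [FN, TP]]`, which is scikit-learn's default for labels 0 and 1; the README does not state the layout explicitly.

```python
import numpy as np

# Reported matrix, assumed to be [[TN, FP], [FN, TP]].
cm = np.array([[20, 52],
               [8, 234]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tp + tn) / total        # 254 / 314
recall_approved = tp / (tp + fn)    # share of actual approvals predicted as approved
recall_rejected = tn / (tn + fp)    # share of actual rejections predicted as rejected
always_approve = (tp + fn) / total  # accuracy of predicting "approved" for everyone

print(f"{accuracy:.4f}")         # 0.8089
print(f"{recall_approved:.4f}")  # 0.9669
print(f"{recall_rejected:.4f}")  # 0.2778
print(f"{always_approve:.4f}")   # 0.7707
```

Under that assumption the accuracy matches the reported ~80.89%, while a baseline that approves every application would already reach about 77%, so the class imbalance is worth keeping in mind when reading the headline number.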


📊 Visualizations

  • Decision Tree Diagram
  • Confusion Matrix Heatmap

Visualizations were created using Matplotlib and Seaborn for better interpretation.
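Plots of this kind could be produced roughly as follows. The tree is fitted on synthetic data purely to have something to draw, the class names are illustrative, and the heatmap reuses the confusion matrix reported under Result with axis labels that assume scikit-learn's row/column layout.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside Colab
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic data and a shallow tree, used only for the drawing.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

fig, (ax_tree, ax_cm) = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: the fitted tree, with nodes colored by majority class.
tree_artists = plot_tree(
    tree, filled=True, class_names=["Not Approved", "Approved"], ax=ax_tree
)
ax_tree.set_title("Decision Tree Diagram")

# Right panel: annotated heatmap of the reported confusion matrix.
cm = np.array([[20, 52], [8, 234]])
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False, ax=ax_cm)
ax_cm.set_xlabel("Predicted")
ax_cm.set_ylabel("Actual")
ax_cm.set_title("Confusion Matrix Heatmap")

fig.tight_layout()  # in a notebook, follow with plt.show()
```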


Steps to run:

  1. Open the notebook in Google Colab
  2. Upload the dataset
  3. Run all cells step-by-step

🔍 Key Insights

  • LTV, FOIR, and Income are key factors influencing loan approval
  • Decision Trees provide easy interpretability through rules
  • Limiting tree depth helps avoid overfitting
  • Hyperparameter tuning significantly improves model performance
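Rules and feature rankings of this kind can be read off a fitted tree with `export_text` and `feature_importances_`. The sketch below uses synthetic data, with feature names borrowed from this README only for readability, so its output says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the names are labels for illustration, not real loan features.
feature_names = ["LTV", "FOIR", "TotInc", "Age"]
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Human-readable if/else rules learned by the tree.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Impurity-based importances; they sum to 1 across features.
ranked = sorted(zip(feature_names, tree.feature_importances_), key=lambda p: -p[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```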

⭐ If you like this project

If you found this project helpful, please give it a ⭐ on GitHub and share it with others! 🚀

About

This project builds a Decision Tree model to predict loan approval using financial and demographic data. It includes data preprocessing, encoding of categorical variables, model training, and evaluation. The model achieved about 81% accuracy (80.89% as reported under Result) and was further optimized using GridSearchCV for hyperparameter tuning.
