Machine learning models for business fraud detection, built with an Artificial Neural Network (ANN), a Random Forest classifier, and an XGBoost classifier.
This repository contains a comprehensive solution for detecting business fraud using machine learning techniques. The goal is to identify potentially fraudulent activities by analyzing various features of business transactions. This project includes data preprocessing, feature engineering, model training, and evaluation processes.
Business fraud detection is a critical task in ensuring the integrity and trustworthiness of financial transactions. By leveraging machine learning, we can build models that learn from historical data to identify patterns associated with fraudulent activities. This project aims to provide an end-to-end solution for detecting such fraud using various machine learning algorithms.
- Exploratory Data Analysis (EDA)
- Data cleaning including handling missing values, outliers, and multi-collinearity
- Feature engineering and selection
- Implementation of multiple machine learning algorithms
- Model evaluation and comparison
- Handling class imbalance
- Hyperparameter tuning
- Visualization of results
The dataset used in this project contains records of business transactions, including various features that describe the nature of each transaction. For confidentiality reasons, the dataset is not included in this repository. However, you can use any publicly available fraud detection dataset or your proprietary data. The required format and preprocessing steps are documented in the data_preprocessing.ipynb notebook.
To get started, clone this repository and install the required dependencies:
git clone https://github.com/yourusername/business-fraud-detection.git
cd business-fraud-detection
pip install -r requirements.txt

- Data Preprocessing: Prepare your data by running the data_preprocessing.ipynb notebook. This will clean the data, handle missing values, outliers, and multi-collinearity, and create the necessary features.
- Model Training: Train machine learning models by running the model_training.ipynb notebook. This notebook includes implementations of various algorithms such as Artificial Neural Networks, Random Forest, and XGBoost.
- Evaluation: Evaluate the performance of the models using the model_evaluation.ipynb notebook. This notebook provides metrics like accuracy, precision, recall, F1-score, and AUC-ROC.
- Data Cleaning: Handle missing values, outliers, and multi-collinearity.
  - Checking for missing data
  - Checking for multi-collinearity
  - Checking for outliers
  - Selecting variables to be included in the model
- Summary and Explanation:
  - oldbalanceOrg and newbalanceOrg are perfectly correlated because these columns represent the balance of the sender's account before and after the transaction.
  - oldbalanceDest and newbalanceDest are perfectly correlated because these columns represent the balance of the recipient's account before and after the transaction.
  - nameOrig and nameDest are categorical variables with a very large number of distinct values that can be key factors in predicting fraudulent customers.
- Key Variables:
  - step
  - type
  - amount
  - oldbalanceOrg
  - oldbalanceDest
  - isFraud
  - isFlaggedFraud
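The multi-collinearity check described above can be sketched as follows. This is a minimal illustration using pandas, not the code from the notebooks: the helper name `highly_correlated_pairs`, the 0.95 threshold, and the sample values are all assumptions made for the example.

```python
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95):
    """Return (col_a, col_b, |r|) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if r >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Tiny made-up sample (NOT the project's data), only to show the call.
sample = pd.DataFrame({
    "amount": [100.0, 250.0, 50.0, 900.0, 10.0],
    "oldbalanceOrg": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "newbalanceOrg": [1010.0, 2020.0, 3030.0, 4040.0, 5050.0],
})
pairs = highly_correlated_pairs(sample)
print(pairs)
```

When a pair is flagged, one column of the pair is usually enough for the model, which is consistent with the key-variable list keeping oldbalanceOrg and oldbalanceDest but not their "new balance" counterparts.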
- Normalization: Normalize amount, oldbalanceOrg, and oldbalanceDest to avoid dominance of significantly larger values.
- Encoding: Apply One Hot Encoding on the type feature.
- Handling Class Imbalance:
  - Why does class imbalance affect model performance?
    - Bias toward the majority class
    - Reduced sensitivity for the minority class
    - Low precision due to a high false positive rate
    - Difficulty learning patterns due to limited samples of the minority class
    - Skewed decision thresholds
  - Proposed Solution:
    - Combine oversampling and undersampling to balance the dataset
    - Use all fraudulent transactions and subsample non-fraudulent transactions to hit the target rate
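The normalization, encoding, and undersampling steps above could look roughly like this in pandas. It is a sketch under stated assumptions rather than the notebook's actual code: the README does not say which normalization is used, so min-max scaling is assumed here; the function names, the 0.5 target fraud rate, and the sample rows (including the `type` values) are hypothetical. Only the undersampling half of the proposed solution is shown.

```python
import pandas as pd

NUMERIC_COLS = ["amount", "oldbalanceOrg", "oldbalanceDest"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale the numeric columns and one-hot encode `type`."""
    out = df.copy()
    for col in NUMERIC_COLS:
        lo, hi = out[col].min(), out[col].max()
        span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
        out[col] = (out[col] - lo) / span
    return pd.get_dummies(out, columns=["type"], dtype=int)


def undersample(df: pd.DataFrame, target_rate: float = 0.5, seed: int = 0) -> pd.DataFrame:
    """Keep every fraud row; sample non-fraud rows so the fraud share is about target_rate."""
    fraud = df[df["isFraud"] == 1]
    legit = df[df["isFraud"] == 0]
    n_legit = min(len(legit), round(len(fraud) * (1 - target_rate) / target_rate))
    sampled = legit.sample(n=n_legit, random_state=seed)
    # Shuffle so the two classes are interleaved.
    return pd.concat([fraud, sampled]).sample(frac=1.0, random_state=seed)


# Made-up rows (NOT the project's data), only to show the calls.
sample = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "PAYMENT", "TRANSFER", "PAYMENT"],
    "amount": [500.0, 900.0, 20.0, 35.0, 100.0, 60.0],
    "oldbalanceOrg": [500.0, 900.0, 1000.0, 400.0, 2500.0, 80.0],
    "oldbalanceDest": [0.0, 0.0, 300.0, 150.0, 700.0, 40.0],
    "isFraud": [1, 1, 0, 0, 0, 0],
})
processed = preprocess(sample)
balanced = undersample(processed, target_rate=0.5)
```

In a real pipeline the scaling bounds would normally be computed on the training split only and then reused for the test split, so that no information leaks from test data into training.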
- Artificial Neural Network (ANN):
  - Architecture: Sequential model with an input layer, one hidden layer, and an output layer.
    - Input Layer: 64 neurons with ReLU activation.
    - Regularization: Dropout layers (30% dropout rate) after the input and hidden layers.
    - Hidden Layer: 32 neurons with ReLU activation.
    - Output Layer: Single neuron with sigmoid activation.
  - Compilation: Adam optimizer, binary cross-entropy loss.
  - Training: 100 epochs, batch size of 512.
  - Evaluation Metrics: True Positives, True Negatives, False Positives, False Negatives, Precision, Recall.
- Random Forest Classifier:
  - Configuration: 100 decision trees, OOB scoring disabled.
  - Training: Trained on the provided training data.
  - Predictions: Predictions on both training and test data.
  - Evaluation Metrics: True Positives, True Negatives, False Positives, False Negatives, Precision, Recall.
- XGBoost Classifier:
  - Model: XGBoost Classifier.
  - Training: Trained using the training data with AUC-PR as the evaluation metric.
  - Predictions: Predictions on both training and test data.
  - Evaluation Metrics: True Positives, True Negatives, False Positives, False Negatives, Precision, Recall.
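The three model configurations listed above might be set up along these lines. This is a hedged sketch, not the contents of model_training.ipynb: the builder function names and `ANN_FIT_KWARGS` are hypothetical, any hyperparameter not stated in the list is left at its library default, and the choice of TensorFlow's Keras API for the ANN is an assumption.

```python
# Training settings for the ANN as stated in the list above.
ANN_FIT_KWARGS = {"epochs": 100, "batch_size": 512}


def build_ann(n_features: int):
    """Sequential network: Dense(64) -> Dropout -> Dense(32) -> Dropout -> Dense(1)."""
    # Imported inside the function so a missing optional dependency
    # only affects the model that needs it.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[
            keras.metrics.TruePositives(name="tp"),
            keras.metrics.TrueNegatives(name="tn"),
            keras.metrics.FalsePositives(name="fp"),
            keras.metrics.FalseNegatives(name="fn"),
            keras.metrics.Precision(name="precision"),
            keras.metrics.Recall(name="recall"),
        ],
    )
    return model


def build_random_forest():
    """Random Forest with 100 trees and out-of-bag scoring disabled."""
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(n_estimators=100, oob_score=False)


def build_xgboost():
    """XGBoost classifier evaluated with area under the precision-recall curve."""
    from xgboost import XGBClassifier

    return XGBClassifier(eval_metric="aucpr")
```

With training arrays `X_train` and `y_train` (hypothetical names), the ANN would be trained with `build_ann(X_train.shape[1]).fit(X_train, y_train, **ANN_FIT_KWARGS)`, and the other two models with their usual `fit(X_train, y_train)` calls.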
Model performance is evaluated using a variety of metrics to ensure robustness. The evaluation process includes:
- Confusion Matrix
- Precision, Recall, and F1-score
- AUC-ROC Curve
- Cross-validation
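To make the relationship between the confusion matrix and the reported scores concrete, here is a small dependency-free sketch. The class and function names are hypothetical and the labels are made up; in practice scikit-learn's `sklearn.metrics` module provides `confusion_matrix`, `precision_score`, `recall_score`, and `f1_score` for the same purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # fraud predicted as fraud
    tn: int  # legitimate predicted as legitimate
    fp: int  # legitimate predicted as fraud
    fn: int  # fraud predicted as legitimate

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def confusion_counts(y_true, y_pred) -> ConfusionCounts:
    """Count the four confusion-matrix cells for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    return ConfusionCounts(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
    )


# Made-up labels, only to show the call.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts, counts.precision, counts.recall, counts.f1)
```

Because fraud is the minority class, precision, recall, and F1 on the fraud class are more informative than plain accuracy, which a model can inflate by predicting "not fraud" almost everywhere.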
- ANN Model:
  - F1-score on the training set: 0.9500
  - F1-score on the test set: 0.9493
- Random Forest:
  - F1-score on the training set: 1.0 (perfect score)
  - F1-score on the test set: 0.9992
- XGBoost:
  - F1-score on the training set: 0.9967
  - F1-score on the test set: 0.9963
The Random Forest model achieved the highest test-set F1-score (0.9992) of the three models, ahead of XGBoost (0.9963) and the ANN (0.9493).