Skip to content

jasonfeiwang/Imbalanced-Classification-U.S.-Income

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Binary Classification Problem with highly imbalanced class-distribution

This project aims to classify the income level (under 50K or above 50K) of a U.S. worker based on the 1995 census data. The minority class is less than 6% of the totol datasets and thus challenges the accuracy of the binary classifier.

I took these three steps:

Step 1: identify the best accuracy metrics for modeling performance testing TPR (True Positive Rate) and TNR (True Negative Rate)

Step 2: run a logistic classifier with original dataset of imbalanced class-distribution.

Step 3: repeat Step 2 with the following re-sampling techniques:

  1. Undersampling the majority class
  2. Oversampleing the minority class
  3. Synthetic Minority Oversampleing Technique (SMOTE)
  4. Random Over Sampling Examples (ROSE)

The conclusion is that the ROSE technique leads to the optimal classification result. With ROSE I increased the TPR from 0.31 to 0.89, while maintaining the same TNR, comparing to the original classifier.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published