Binary Classification Problem with highly imbalanced class-distribution

This project classifies the income level (under 50K or above 50K) of U.S. workers based on 1995 census data. The minority class makes up less than 6% of the total dataset, which challenges the accuracy of a binary classifier.

I took these three steps:

Step 1: identify the best metrics for evaluating model performance: TPR (True Positive Rate) and TNR (True Negative Rate).

Step 2: run a logistic classifier on the original dataset with its imbalanced class distribution.

Step 3: repeat Step 2 with the following re-sampling techniques:

Undersampling the majority class
Oversampling the minority class
Synthetic Minority Over-sampling Technique (SMOTE)
Random Over-Sampling Examples (ROSE)
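
The first two techniques in the list can be sketched in a few lines of plain Python. This is only an illustration on toy data, not the code used in this repository: the helper names `random_undersample` and `random_oversample`, the `majority` argument, and the 94/6 split are all assumptions made for the example. SMOTE and ROSE generate new synthetic rows rather than reusing existing ones and are not covered here.

```python
import random
from collections import Counter


def random_undersample(rows, labels, majority, seed=0):
    """Drop majority-class rows at random until both classes have the same size."""
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y != majority]
    # Keep every minority row and an equally sized random subset of majority rows.
    keep = sorted(rng.sample(major_idx, len(minor_idx)) + minor_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, majority, seed=0):
    """Duplicate minority-class rows (with replacement) until both classes have the same size."""
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y != majority]
    # Draw as many extra minority rows as are needed to match the majority count.
    extra = rng.choices(minor_idx, k=len(major_idx) - len(minor_idx))
    keep = major_idx + minor_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data (hypothetical): 94 majority rows (label 0) and 6 minority rows (label 1).
rows = [[i] for i in range(100)]
labels = [0] * 94 + [1] * 6

_, y_under = random_undersample(rows, labels, majority=0)
_, y_over = random_oversample(rows, labels, majority=0)
print(Counter(y_under))
print(Counter(y_over))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.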

The conclusion is that the ROSE technique gives the best classification result. With ROSE I increased the TPR from 0.31 to 0.89 while maintaining the same TNR as the original classifier.
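
To make the two metrics concrete, here is a minimal sketch of how TPR and TNR can be computed from true and predicted labels, and why plain accuracy is misleading when one class is rare. The function names `tpr_tnr` and `accuracy` and the toy labels are assumptions for illustration; they are not taken from the census data or from the repository code.

```python
def tpr_tnr(y_true, y_pred, positive=1):
    """Return (TPR, TNR); assumes both classes occur at least once in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn)  # share of actual positives that were detected
    tnr = tn / (tn + fp)  # share of actual negatives that were detected
    return tpr, tnr


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels (hypothetical): 6% positives, and a model that always predicts the majority class.
y_true = [1] * 6 + [0] * 94
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
tpr, tnr = tpr_tnr(y_true, y_pred)
print(acc, tpr, tnr)
```

In this toy case the accuracy is high even though no positive case is ever found, which is why TPR and TNR are reported separately.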

jasonfeiwang/Imbalanced-Classification-U.S.-Income

Folders and files

Latest commit

History

Repository files navigation

Binary Classification Problem with highly imbalanced class-distribution

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages