`scutr`: SMOTE and Cluster-Based Undersampling Technique in R

Imbalanced training datasets degrade the performance of many popular classifiers, which tend to favor the majority classes. One way to balance training data is to combine oversampling of minority classes with undersampling of majority classes. This package implements the SCUT (SMOTE and Cluster-based Undersampling Technique) algorithm, which uses model-based clustering and synthetic oversampling to balance multiclass training datasets.

This implementation only works on numeric training data and works best when there are more than two classes. For binary classification problems, other packages may be better suited.

The original SCUT paper uses SMOTE (essentially linear interpolation between a point and one of its nearest neighbors in the same class) for oversampling. For undersampling, it uses expectation-maximization (EM) clustering, which fits a mixture of Gaussian distributions to the data. These are the default methods in `scutr`, but random oversampling and several distance-based undersampling techniques are also available.
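To make the "linear interpolation" description concrete, here is a minimal base-R sketch of SMOTE-style oversampling for a single minority class. The helper name `smote_sketch` and its arguments are hypothetical and chosen for illustration; this is not the code `scutr` uses internally.

```r
# Hypothetical helper illustrating SMOTE-style interpolation (not scutr's implementation).
smote_sketch <- function(x, n_new, k = 5) {
  x <- as.matrix(x)
  stopifnot(nrow(x) >= 2)
  k <- min(k, nrow(x) - 1)               # a point has at most n - 1 neighbors
  d <- as.matrix(dist(x))                # pairwise Euclidean distances
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(x), 1)                      # random minority point
    nbrs <- setdiff(order(d[base, ]), base)[seq_len(k)] # its k nearest neighbors
    nb   <- nbrs[sample.int(length(nbrs), 1)]           # one neighbor at random
    gap  <- runif(1)                                    # position along the segment
    synthetic[i, ] <- x[base, ] + gap * (x[nb, ] - x[base, ])
  }
  synthetic
}

set.seed(1)
minority   <- matrix(rnorm(40), ncol = 2)   # 20 toy minority points in 2D
new_points <- smote_sketch(minority, n_new = 10)
dim(new_points)
```

Each synthetic point lies on the line segment between an existing point and one of its neighbors, so the new points stay inside the region already occupied by the minority class.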

Installation

You can install the released version of scutr from CRAN with:

install.packages("scutr")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("s-kganz/scutr")

Example Usage

We start with an imbalanced dataset that comes with the package.

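library(scutr)
data(imbalance)
# Keep four classes of very different sizes
imbalance <- imbalance[imbalance$class %in% c(2, 3, 19, 20), ]
imbalance$class <- as.numeric(imbalance$class)

# Color by factor: numeric labels 3 and 19 would otherwise recycle to the
# same entry in the default 8-color palette
plot(imbalance$V1, imbalance$V2, col = factor(imbalance$class))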

table(imbalance$class)
#> 
#>   2   3  19  20 
#>  20  30 190 200
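As we read the SCUT paper, each class is resampled toward the mean class size, so classes below that size are oversampled and classes above it are undersampled. The short sketch below works through that arithmetic for the counts in the table above; the variable names are illustrative and the rounding rule is an assumption.

```r
# Class counts copied from the table above
counts <- c("2" = 20, "3" = 30, "19" = 190, "20" = 200)

# Assumed target size: total observations divided by number of classes
target <- round(sum(counts) / length(counts))

# Which resampling direction each class would need
action <- ifelse(counts < target, "oversample",
                 ifelse(counts > target, "undersample", "keep"))

data.frame(class = names(counts), n = as.vector(counts),
           target = target, action = as.vector(action))
```

With these counts the target is 440 / 4 = 110, so classes 2 and 3 are oversampled and classes 19 and 20 are undersampled.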

Then, we apply SCUT with SMOTE oversampling (the default) and, instead of the default EM clustering, k-means clustering with seven clusters for undersampling.

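# undersample_kmeans replaces the default clustering method
scutted <- SCUT(imbalance, "class", undersample = undersample_kmeans,
                usamp_opts = list(k = 7))
# Color by factor so classes 3 and 19 do not share a palette color
plot(scutted$V1, scutted$V2, col = factor(scutted$class))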

table(scutted$class)
#> 
#>   2   3  19  20 
#> 110 110 110 110

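The dataset is now balanced: each class has 110 observations, the mean of the original class sizes (440 / 4), and the resampled data retain the overall distribution of the original.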
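The idea behind cluster-based undersampling can be sketched in base R: cluster a majority class, then draw a roughly equal share of points from every cluster so that no region of the class is dropped entirely. The helper `undersample_kmeans_sketch` below is hypothetical and uses `stats::kmeans`; it illustrates the general technique and may differ from what `undersample_kmeans` in `scutr` actually does.

```r
# Hypothetical helper illustrating cluster-based undersampling (not scutr's implementation).
undersample_kmeans_sketch <- function(x, m, k = 7) {
  x <- as.matrix(x)
  stopifnot(m <= nrow(x))
  cl <- kmeans(x, centers = k)$cluster        # cluster label for each row
  per_cluster <- ceiling(m / k)               # equal share requested per cluster
  keep <- unlist(lapply(split(seq_len(nrow(x)), cl), function(idx) {
    idx[sample.int(length(idx), min(per_cluster, length(idx)))]
  }), use.names = FALSE)
  if (length(keep) < m) {                     # small clusters: top up at random
    rest <- setdiff(seq_len(nrow(x)), keep)
    keep <- c(keep, rest[sample.int(length(rest), m - length(keep))])
  }
  keep <- keep[sample.int(length(keep), m)]   # trim to exactly m rows
  x[keep, , drop = FALSE]
}

set.seed(42)
majority <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
                  matrix(rnorm(200, mean = 5), ncol = 2))  # 200 rows, two blobs
reduced <- undersample_kmeans_sketch(majority, m = 110, k = 7)
nrow(reduced)
```

Sampling per cluster rather than uniformly at random is what helps the reduced class keep the shape of the original.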

