To examine different clustering algorithms, a dedicated framework was created. The framework includes generation of a benchmark dataset and implementations of three clustering algorithms: GMM with several initialization methods, UOFC with a variety of criteria, and hierarchical clustering with different distance metrics. Overall, the dataset contains 219 examples, both Gaussian and non-Gaussian.
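As a rough illustration of two of the algorithm families named above, the sketch below fits a GMM with two initialization methods and runs hierarchical clustering with two distance metrics. It uses scikit-learn's `GaussianMixture` and SciPy's `linkage`/`fcluster` (SciPy is installed as a scikit-learn dependency) rather than the framework's own code. The toy data, the initialization methods, the linkage method, and the metrics are all assumptions chosen for the example. UOFC is not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.mixture import GaussianMixture

# Toy data (assumption): two well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# GMM fitted with two of scikit-learn's initialization methods.
gmm_labels = {}
for init in ("kmeans", "random"):
    gmm = GaussianMixture(n_components=2, init_params=init, random_state=0)
    gmm_labels[init] = gmm.fit_predict(X)

# Hierarchical clustering with two distance metrics, cut into two clusters.
hc_labels = {}
for metric in ("euclidean", "cityblock"):
    Z = linkage(X, method="average", metric=metric)
    hc_labels[metric] = fcluster(Z, t=2, criterion="maxclust")
```

Each entry of `gmm_labels` and `hc_labels` is an array with one cluster label per example, so the results of different settings can be compared on the same data.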
First, clone the Clustering-Framework repository:
git clone https://github.com/NadavSc/Clustering-Framework.git
Then, create an Anaconda environment (Python 3.8):
conda create -n clustering python=3.8
Then, activate your environment:
conda activate clustering
Install the required packages:
- pandas
- scikit-learn
- matplotlib
- seaborn
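One quick way to confirm the environment is ready is to check that Python can find each package. This is a minimal sketch and not part of the framework; the helper name `missing_packages` is hypothetical. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the required packages (scikit-learn is imported as sklearn).
REQUIRED = ["pandas", "sklearn", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", missing if missing else "none")
```

If anything is reported missing, running `pip install pandas scikit-learn matplotlib seaborn` inside the activated environment should install it.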
In the main file:
- Choose algorithms for running:
algorithms = ['gmm', 'uofc', 'hc']
- Choose verbosity level:
  - verbosity 0 - no prints at all.
  - verbosity 1 - print information about the examples.
  - verbosity 2 - plot the results.
- Make sure the path to the saved dataset is correct:
with open('data/dataset.json', 'r') as openfile:
    dataset = json.load(openfile)
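The settings above could be wired together roughly as follows. This is a sketch under assumptions, not the actual contents of main.py: the names `SUPPORTED`, `validate_algorithms`, `load_dataset`, `should_print`, and `should_plot` are hypothetical, and whether verbosity 2 also prints (as assumed here) is not stated in this README.

```python
import json
from pathlib import Path

SUPPORTED = {"gmm", "uofc", "hc"}  # algorithm names listed above

def validate_algorithms(algorithms):
    """Raise early if an unknown algorithm name was requested."""
    unknown = [name for name in algorithms if name not in SUPPORTED]
    if unknown:
        raise ValueError(f"Unknown algorithms: {unknown}")
    return list(algorithms)

def load_dataset(path="data/dataset.json"):
    """Load the dataset JSON, failing with a clear message if the path is wrong."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path.resolve()}")
    with open(path, "r") as openfile:
        return json.load(openfile)

def should_print(verbosity):
    """Levels 1 and 2 print information (assumption: level 2 includes level 1)."""
    return verbosity >= 1

def should_plot(verbosity):
    """Only level 2 plots the results."""
    return verbosity >= 2
```

Checking the path before opening the file turns a wrong working directory into an error message that shows the absolute path that was tried.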
There are several scripts:
- main.py - the script for running the different algorithms and saving the results.
- dataset.py - generates the different datasets (Gaussian/non-Gaussian/unique examples).
- run_unique.py - runs the algorithms on the unique dataset.
The generated data consists of Gaussian, non-Gaussian, and unique examples. The unique examples consist of various shapes. All of them can be generated with the script dataset.py. PCA is used to visualize examples with more than three dimensions.
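The PCA step can be sketched as follows. This is an illustration using scikit-learn's `PCA`, not the framework's own plotting code: the helper name `project_for_plot`, the synthetic 5-dimensional Gaussian example, and the choice of projecting down to two components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_for_plot(X, n_components=2):
    """Project X with PCA only if it has more than three dimensions."""
    X = np.asarray(X, dtype=float)
    if X.shape[1] > 3:
        return PCA(n_components=n_components).fit_transform(X)
    return X

# Synthetic example (assumption): two Gaussian clusters in 5 dimensions.
rng = np.random.default_rng(0)
X_high = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(40, 5)),
    rng.normal(loc=4.0, scale=1.0, size=(40, 5)),
])

X_2d = project_for_plot(X_high)                      # reduced to 2 columns
X_low = project_for_plot(rng.normal(size=(10, 3)))   # returned unchanged
```

The projected array can then be passed to a scatter plot, for example `matplotlib.pyplot.scatter(X_2d[:, 0], X_2d[:, 1])`.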