CS 5542 BigData Lab Report #04

SPARK PROGRAMMING - Machine Learning Tasks: Image Classification (Decision Tree Algorithm)

Import the required libraries & set up access to Spark.
Create a vector for the clusters. Declare the list of image categories.
Extract information from the training data set.
Save the features extracted from the images.
Build the clusters with K-Means training.
Load & parse the feature data.
Cluster the data into 400 clusters using K-Means.
Calculate the Within Set Sum of Squared Errors (WSSSE) to evaluate the clustering.
Create a histogram for each image & reduce the features.
Build a random forest model from the histograms.
Split the data into 70% training and 30% testing.
Train the random forest model (4-10 trees).
Use the test data to find the most accurate model.
Print out the best error and its parameters.
Train the random forest model once again.
Save & load the model.
Test the image classification: use the test images and the histogram size to predict the category of each image.
Run the above process from the client side.
Generate the confusion matrix & print out the accuracy of the model.
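
The clustering and histogram steps above can be sketched in plain Python. This is a minimal illustration and not the Spark code used in the lab: the function names, the 2-D toy descriptors, the two toy centers, and the normalization of the histogram are all assumptions made for the example (the report itself uses 400 clusters).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the cluster center closest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def wssse(points, centers):
    """Within Set Sum of Squared Errors: each point's squared distance
    to its nearest center, summed over all points."""
    return sum(squared_distance(p, centers[nearest_center(p, centers)]) for p in points)

def histogram(descriptors, centers):
    """Count how many descriptors of one image fall into each cluster.
    Normalizing the counts to sum to 1 is an assumption of this sketch."""
    counts = [0] * len(centers)
    for d in descriptors:
        counts[nearest_center(d, centers)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical toy data: 2-D descriptors of a single image and 2 centers.
centers = [(0.0, 0.0), (10.0, 10.0)]
image_descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (10.0, 12.0)]

error = wssse(image_descriptors, centers)
hist = histogram(image_descriptors, centers)
print("WSSSE:", error)
print("Histogram:", hist)
```

A lower WSSSE means the descriptors sit closer to their assigned centers. The histogram has one bin per cluster, so with 400 clusters each image would be reduced to a vector of length 400, which is what the random forest is trained on.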

< DATASET >
Training Data: BubbleTea x 40 | Chocolate x 30 | Coffee x 30 | MochaTea x 40 | ShavedIce x 30
Test Data: 5 for each category.
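
With five categories and five test images each, the confusion matrix is 5 x 5 and every row sums to 5. The sketch below shows how such a matrix and the accuracy could be computed in plain Python. The predictions in it are made up purely for illustration and are not results from this lab; the function names are likewise hypothetical.

```python
CATEGORIES = ["BubbleTea", "Chocolate", "Coffee", "MochaTea", "ShavedIce"]

def confusion_matrix(actual, predicted, labels):
    """Rows are actual categories, columns are predicted categories."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of predictions on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Five test images per category, as in the data set description.
actual = [c for c in CATEGORIES for _ in range(5)]

# Hypothetical predictions (NOT lab results): two images misclassified.
predicted = list(actual)
predicted[0] = "MochaTea"    # one BubbleTea image predicted as MochaTea
predicted[12] = "Chocolate"  # one Coffee image predicted as Chocolate

matrix = confusion_matrix(actual, predicted, CATEGORIES)
acc = accuracy(matrix)

for label, row in zip(CATEGORIES, matrix):
    print(f"{label:>10}: {row}")
print("Accuracy:", acc)
```

Off-diagonal entries show which categories are confused with each other, which the overall accuracy alone does not reveal.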

Determine whether the training data is sufficient.
Check whether the training error is low. (Take out noise in the data.)
Check whether the resulting classifier would be too complicated to build as a model.
Use a relatively small set of data for a large amount of processing & obtain more training along the way for further prediction.