CS 5542 BigData Lab Report #04
Amy Lin edited this page Mar 29, 2017 · 2 revisions
Create your own dataset for an image classification problem. Use the workflow discussed in the Tutorial 4 session with the Decision Tree algorithm. Report the accuracy and the confusion matrix obtained. Include a brief description of your dataset and the purpose behind the image classification problem.
- Import the needed libraries & set up access to Spark.
- Create a vector to hold the cluster data. Declare the list of image categories.
- Extract features from the training data sets.
- Save the features extracted from the images.
- Build the clusters with K-Means training.
- Load & Parse the feature data.
- Cluster the data into 400 classes using K-Means.
- Calculate the Within Set Sum of Squared Errors (WSSSE) to evaluate the clustering.
- Create histograms for all images & reduce features.
- Build a random forest model from the histograms.
- Split the data into 70% training and 30% testing.
- Train the random forest model (4-10 trees).
- Use the test data to select the most accurate model.
- Print out the best error and parameters.
- Train the random forest model once again.
- Save & load the model.
- Test the classifier: build a histogram of the same size for each test image and use it to predict that image's category.
- Run the above process from the client side.
- Generate the confusion matrix & print out the accuracy of the model.
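The clustering and histogram steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark code used in the lab: the helper names (`kmeans`, `wssse`, `histogram`) are hypothetical, the descriptors are toy 2-D points, and `k=2` stands in for the 400 clusters used in the report.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to `point` (ties go to the lower index)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def wssse(points, centers):
    """Within Set Sum of Squared Errors: squared distance of every point
    to its nearest center, summed over all points."""
    return sum(squared_distance(p, centers[nearest_center(p, centers)])
               for p in points)


def histogram(descriptors, centers):
    """Normalized count of one image's descriptors per cluster."""
    counts = [0] * len(centers)
    for d in descriptors:
        counts[nearest_center(d, centers)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy descriptors (assumption: real image features have many more dimensions).
toy_descriptors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers = kmeans(toy_descriptors, k=2)
error = wssse(toy_descriptors, centers)

# Descriptors of one hypothetical image, turned into a fixed-length feature.
image_histogram = histogram([[0.0, 0.5], [0.5, 0.0], [10.5, 10.5]], centers)
print(error, image_histogram)
```

Each image, however many descriptors it has, ends up as a vector with one entry per cluster, which is what lets a random forest treat every image as a fixed-length feature row. A lower WSSSE means the descriptors sit closer to their cluster centers.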
< DATASET >
- Training Data: BubbleTea x 40 | Chocolate x 30 | Coffee x 30 | MochaTea x 40 | ShavedIce x 30 (170 images in total)
- Test Data: 5 images for each category (25 images in total)
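With five test images in each of the five categories, the confusion matrix is a 5 x 5 table whose rows sum to 5. The sketch below shows one way it and the accuracy could be computed. The function names are hypothetical, and the predictions are invented purely to exercise the code; they are not results from this lab.

```python
CATEGORIES = ["BubbleTea", "Chocolate", "Coffee", "MochaTea", "ShavedIce"]


def confusion_matrix(actual, predicted, labels):
    """Rows are actual categories, columns are predicted categories."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of images on the diagonal (predicted == actual)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Five test images per category, as in the dataset description.
actual = [label for label in CATEGORIES for _ in range(5)]

# HYPOTHETICAL predictions, made up for illustration only.
predicted = list(actual)
predicted[0] = "MochaTea"    # one BubbleTea image misread as MochaTea
predicted[12] = "Chocolate"  # one Coffee image misread as Chocolate

matrix = confusion_matrix(actual, predicted, CATEGORIES)
for label, row in zip(CATEGORIES, matrix):
    print(f"{label:>10}: {row}")
print("accuracy:", accuracy(matrix))
```

Off-diagonal entries show which categories get confused with each other, which the single accuracy number hides.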
- Determine if the training data is sufficient.
- Check whether the training error is low (remove noise from the data).
- Check whether the resulting classifier would be too complicated to build as a model.
- Use a relatively small set of data for a large amount of processing & obtain more training along the way.
- Use the model for further prediction.