Good work Michelle! See comments below
Give an overall summary of your performance from HW2 on the dataset
Clear summary. I was about to ask about bias/variance observations, but you had it in the next section, nice.
Do you see any areas for improvement?
Good points, yes definitely try Gaussian NB.
Run k-Means on the dataset and describe your results
Great work with organizing your three datasets to try.
Given that a high silhouette score indicates a good k for k-means, I would try k=3 for your first dataset, and k=3, 4, or 6 for the second and third datasets.
You did find interesting population clusters with k=8, so don't discount that either. But the k's with high silhouette scores should indicate "tighter" clusters and may give better performance.
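To make that comparison concrete, here is a minimal sketch of scanning k and picking the one with the highest silhouette score. `make_blobs` is just stand-in data; swap in your own (scaled) feature matrix for `X`:

```python
# Sketch: choose k for k-means by silhouette score.
# X here is synthetic stand-in data; use your own scaled feature matrix.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher = tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

A k like 8 can still surface interesting subgroups even when its silhouette score is lower; the scan just tells you where the clusters are tightest.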
Run PCA on the dataset and describe your results
Good work, you are using fit and transform correctly, and you successfully figured out how many components you can reduce the dataset to while still capturing over 80% of the variance. You can also use n_components another way: look into what happens when you set n_components = 0.8.
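As a sketch of what that float form does (using sklearn's `load_diabetes` as a stand-in for your dataset): when `n_components` is a fraction between 0 and 1, PCA keeps the smallest number of components whose explained variance reaches that fraction.

```python
# Sketch: n_components as a float = target fraction of explained variance.
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_diabetes().data)  # stand-in data

pca = PCA(n_components=0.8)        # keep enough components for >= 80% variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

This saves you from manually inspecting the cumulative explained-variance curve to pick the cutoff.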
Use the cluster outputs you got from running k-means as a new feature. Rerun your "worst" and "best" model from HW2, including this new feature along with your old features. Describe your results.
- Yes! Runtime is slower since you are adding a feature
- NB might have decreased in performance because NB handles correlated features pretty poorly
- Multinomial NB and the whole count vectorizer process is for converting text into numerical features (like the ham and spam lab). You landed in the right place - for numerical features, use Gaussian NB!
- Careful! You ran scale.fit_transform on both your train AND test data. Remember to call scale.fit_transform only on your training data, and then scale.transform right after on your test data. This ensures that your training and test data are transformed the same way.
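Here is a minimal sketch of that pattern end to end: scaler fit on train only, cluster labels appended as a feature, and Gaussian NB on the numeric result. `load_diabetes` (binarized at the median) is a stand-in; the variable names are illustrative, not from your notebook:

```python
# Sketch: fit the scaler on train only, reuse it on test,
# and append the k-means cluster label as an extra feature.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

data = load_diabetes()                                    # stand-in dataset
y = (data.target > np.median(data.target)).astype(int)    # binarize for classification
X_train, X_test, y_train, y_test = train_test_split(data.data, y, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on TRAIN only
X_test_s = scaler.transform(X_test)         # reuse the same transform on test

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_s)
X_train_f = np.column_stack([X_train_s, km.predict(X_train_s)])
X_test_f = np.column_stack([X_test_s, km.predict(X_test_s)])

acc = GaussianNB().fit(X_train_f, y_train).score(X_test_f, y_test)
print(round(acc, 3))
```

Note the k-means model is also fit on training data only and then applied to test data, for the same leakage reason as the scaler.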
Use the PCA outputs you got from running PCA above as your features. Rerun your "worst" and "best" model from HW2, with the PCA features instead of your old features. Describe your results.
Well done!
- it does seem a bit surprising that logistic regression took longer to run on the PCA features, but remember that you are doing a "penalty=l1" optimization. That means the algorithm keeps going until it finds an "optimal" solution, which might have happened to take longer to find for this dataset.
- Your comment about the better NB performance ("that surprised me, but the assumed feature independence might account for that") is spot-on :) PCA components are mutually uncorrelated by construction, which is a good fit for the feature-independence assumption of NB models.
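You can verify that uncorrelatedness directly: the correlation matrix of the PCA scores on the training data is (numerically) the identity. A sketch, again with `load_diabetes` as a stand-in and illustrative variable names:

```python
# Sketch: PCA features are mutually uncorrelated, which suits Gaussian NB.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

data = load_diabetes()                                    # stand-in dataset
y = (data.target > np.median(data.target)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(data.data, y, random_state=0)

scaler = StandardScaler().fit(X_train)                    # fit on train only
pca = PCA(n_components=0.8).fit(scaler.transform(X_train))

Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

# Off-diagonal correlations between components are ~0 on the training data.
corr = np.corrcoef(Z_train, rowvar=False)
max_off_diag = np.abs(corr - np.eye(corr.shape[0])).max()

acc = GaussianNB().fit(Z_train, y_train).score(Z_test, y_test)
print(round(max_off_diag, 8), round(acc, 3))
```

That near-zero off-diagonal is exactly why NB, which models each feature independently, tends to be comfortable with PCA features.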
Give your conclusions on the use of k-Means and PCA on the Diabetes dataset
Great work!!