Good work Michelle! See comments below
Give an overall summary of your performance from HW2 on the dataset
Clear summary. I was about to ask about bias/variance observations, but you had it in the next section, nice.
Do you see any areas for improvement?
Good points, yes definitely try Gaussian NB.
Run k-Means on the dataset and describe your results
Great work with organizing your three datasets to try.
Given that a high silhouette score indicates a good k for k-means, I would try k=3 for your first dataset, and k=3, 4, or 6 for the second and third datasets.
You did find interesting population clusters with k=8, so don't discount that either. But the k's with high silhouette scores should indicate "tighter" clusters and may give better performance.
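To make that comparison concrete, here is a minimal sketch of scanning k and picking the one with the highest silhouette score. `make_blobs` is just stand-in data; swap in your own (scaled) feature matrix for `X`:

```python
# Sketch: choose k for k-means by silhouette score.
# X here is synthetic stand-in data; use your own scaled feature matrix.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher = tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

A k like 8 can still surface interesting subgroups even when its silhouette score is lower; the scan just tells you where the clusters are tightest.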
Run PCA on the dataset and describe your results
Good work, you are using fit and transform correctly, and you successfully figured out how many components you can reduce the dataset to while still capturing over 80% of the variance. You can also use n_components another way: look into what happens when you set n_components = 0.8.
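As a sketch of what that float form does (using sklearn's `load_diabetes` as a stand-in for your dataset): when `n_components` is a fraction between 0 and 1, PCA keeps the smallest number of components whose explained variance reaches that fraction.

```python
# Sketch: n_components as a float = target fraction of explained variance.
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_diabetes().data)  # stand-in data

pca = PCA(n_components=0.8)        # keep enough components for >= 80% variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

This saves you from manually inspecting the cumulative explained-variance curve to pick the cutoff.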
Use the cluster outputs you got from running k-means as a new feature. Rerun your "worst" and "best" model from HW2, including this new feature along with your old features. Describe your results.
- Yes! Runtime is slower since you are adding a feature
- NB might have decreased in performance because NB handles correlated features pretty poorly
- Multinomial NB and the whole count vectorizer process is for converting text into numerical features (like the ham and spam lab). You landed in the right place - for numerical features, use Gaussian NB!
- Careful! You ran scale.fit_transform on both your train AND test data. Remember to call scale.fit_transform only on your training data, and then scale.transform right after on your test data. This ensures that your training and test data are transformed the same way.
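Here is a minimal sketch of that pattern end to end: scaler fit on train only, cluster labels appended as a feature, and Gaussian NB on the numeric result. `load_diabetes` (binarized at the median) is a stand-in; the variable names are illustrative, not from your notebook:

```python
# Sketch: fit the scaler on train only, reuse it on test,
# and append the k-means cluster label as an extra feature.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

data = load_diabetes()                                    # stand-in dataset
y = (data.target > np.median(data.target)).astype(int)    # binarize for classification
X_train, X_test, y_train, y_test = train_test_split(data.data, y, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on TRAIN only
X_test_s = scaler.transform(X_test)         # reuse the same transform on test

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_s)
X_train_f = np.column_stack([X_train_s, km.predict(X_train_s)])
X_test_f = np.column_stack([X_test_s, km.predict(X_test_s)])

acc = GaussianNB().fit(X_train_f, y_train).score(X_test_f, y_test)
print(round(acc, 3))
```

Note the k-means model is also fit on training data only and then applied to test data, for the same leakage reason as the scaler.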
Use the PCA outputs you got from running PCA above as your features. Rerun your "worst" and "best" model from HW2, with the PCA features instead of your old features. Describe your results.
Well done!
- it does seem a bit surprising that logistic regression took longer to run on the PCA features, but remember that you are doing a "penalty=l1" optimization. That means the algorithm keeps going until it finds an "optimal" solution, which might have happened to take longer to find for this dataset.
- Your comment about the better NB performance ("that surprised me, but the assumed feature independence might account for that") is spot-on :) PCA components are mutually uncorrelated by construction, which is a good fit for the feature-independence assumption of NB models.
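You can verify that uncorrelatedness directly: the correlation matrix of the PCA scores on the training data is (numerically) the identity. A sketch, again with `load_diabetes` as a stand-in and illustrative variable names:

```python
# Sketch: PCA features are mutually uncorrelated, which suits Gaussian NB.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

data = load_diabetes()                                    # stand-in dataset
y = (data.target > np.median(data.target)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(data.data, y, random_state=0)

scaler = StandardScaler().fit(X_train)                    # fit on train only
pca = PCA(n_components=0.8).fit(scaler.transform(X_train))

Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

# Off-diagonal correlations between components are ~0 on the training data.
corr = np.corrcoef(Z_train, rowvar=False)
max_off_diag = np.abs(corr - np.eye(corr.shape[0])).max()

acc = GaussianNB().fit(Z_train, y_train).score(Z_test, y_test)
print(round(max_off_diag, 8), round(acc, 3))
```

That near-zero off-diagonal is exactly why NB, which models each feature independently, tends to be comfortable with PCA features.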
Give your conclusions on the use of k-Means and PCA on the Diabetes dataset
Great work!!