
HW2 Feedback #2

@lemonsoup

Description


Status: Pass (Homework is graded on "Pass" or "Needs Improvement")
Comments:
Great work, Michelle. Your analysis is very articulate, and each step is clearly commented to communicate your thought process effectively. Comments for each section are below.

Describe the content of the dataset and its goals
Good research and reasoning to give context to the data. Your visuals using both layered histograms and boxplots are very helpful as well.
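For future write-ups, here is a minimal sketch of the layered-histogram-plus-boxplot view. It assumes the data is loaded as a DataFrame with a binary `Outcome` column and a numeric `Glucose` column; the file name and column names are my assumptions, not taken from your notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("diabetes.csv")  # hypothetical path to the dataset

feature, target = "Glucose", "Outcome"  # assumed column names
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Layered histograms: one distribution per class, overlaid with transparency
for label, group in df.groupby(target):
    ax1.hist(group[feature], bins=20, alpha=0.5, label=f"{target}={label}")
ax1.set_xlabel(feature)
ax1.legend()

# Boxplot of the same feature, split by class
df.boxplot(column=feature, by=target, ax=ax2)

plt.tight_layout()
plt.show()
```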
Describe the features and formulate a hypothesis on which might be relevant in predicting diabetes
Well done, your research helps guide the overall path of analysis, and helps the reader understand what you might be interested in finding. Good point about the much lower average occurrence of diabetes in the US. This also adds perspective to the uniqueness of the current dataset. Great idea for feature engineering!
Describe the missing/NULL values. Decide if you should impute or drop them and justify your choice.
Excellent, your reasoning is clear and well thought out. I'm excited to see you try both dropped and imputed versions of two-hour insulin. It's a judgment call when the loss is near the threshold, and often the only way to decide is to try both.
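For reference, a rough sketch of the two versions side by side, assuming the two-hour insulin column is named `Insulin` and that the missing readings have already been converted to NaN (both assumptions on my part):

```python
# Version 1: drop rows with missing two-hour insulin
dropped = df.dropna(subset=["Insulin"])

# Version 2: keep all rows and impute the median
imputed = df.copy()
imputed["Insulin"] = imputed["Insulin"].fillna(imputed["Insulin"].median())

print(f"dropped: {len(dropped)} rows, imputed: {len(imputed)} rows")
```

Fitting the same model on both and comparing cross-validated scores is usually the quickest way to settle it.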
Come up with a benchmark for the minimum performance that an algorithm should have on this dataset
Yes, 65% makes sense. Effective use of parallel_coordinates to visualize your three datasets.
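For readers who haven't seen it, something like the snippet below reproduces both pieces: the ~65% baseline (if it is the majority-class rate) and the parallel-coordinates view. The column names are again assumptions.

```python
from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt

# Majority-class baseline: accuracy of always predicting the most common class
print(df["Outcome"].value_counts(normalize=True))

# Parallel coordinates over a few (assumed) numeric columns, colored by class
cols = ["Glucose", "BMI", "Age", "Outcome"]
parallel_coordinates(df[cols], class_column="Outcome", colormap="coolwarm", alpha=0.4)
plt.show()
```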
What's the best performance you can get with kNN? Is kNN a good choice for this dataset?
Great work, your process of testing with three datasets is organized and clearly presented. The trade-off between running time and performance is also considered. You can also try other measures of performance, such as recall, precision, and F1; you might make different assessments based on those.
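A minimal sketch of what I mean, reusing `df` from above: a scaled kNN pipeline plus `classification_report`, which prints precision, recall, and F1 per class. The `n_neighbors` value is a placeholder, not the value you tuned.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
knn.fit(X_train, y_train)

# Precision, recall, and F1 per class, alongside overall accuracy
print(classification_report(y_test, knn.predict(X_test)))
```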
What's the best performance you can get with Naive Bayes? Is NB a good choice for this dataset?
Wild! I did not think of converting the numerical data to strings and then tokenizing. As you say, the bumpy learning curve suggests this is not a good model for the data, but you still got performance that beats the benchmark! I do suggest you try Gaussian NB; as you mentioned, it would be a better NB model for numerical inputs. But cool idea!
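To make the suggestion concrete, here is the shape of a GaussianNB run, reusing `X` and `y` from the kNN sketch above:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# GaussianNB models each numeric feature as a per-class normal distribution,
# so no string conversion or tokenizing is needed.
gnb = GaussianNB()
scores = cross_val_score(gnb, X, y, cv=5)
print(f"GaussianNB 5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```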
What's the best performance you can get with Logistic Regression? Is LR a good choice for this dataset?
Great work, clear analysis of the models, and interesting observation regarding Lasso and the lack of dropped features. Combined with the high bias in the learning curve, this implies that the model might be able to handle more complexity, and you can try adding more custom features in addition to your "ratio".
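One way to double-check the Lasso observation (that no feature is being dropped) is to print the L1-penalized coefficients directly. This is a sketch with a placeholder `C`, not the settings you used:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L1 (Lasso) penalty; liblinear supports it for binary problems
lasso_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
lasso_lr.fit(X_train, y_train)

# A coefficient driven exactly to zero means L1 has discarded that feature;
# none at zero is consistent with a model that could absorb more complexity.
coefs = lasso_lr.named_steps["logisticregression"].coef_[0]
for name, coef in zip(X_train.columns, coefs):
    print(f"{name:25s} {coef:+.3f}")
```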
What's the best performance you can get with Random Forest? Is RF a good choice for this dataset?
Excellent, it is interesting to see the performance of each of your datasets. Great use of RF to view feature importance!
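For reference, the importance view can be pulled out in a couple of lines (again a sketch, with an arbitrary number of trees, reusing the split from the kNN sketch):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances, sorted from most to least influential
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```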
If you could only choose one, which classifier from the above that you already ran is best? How do you define best? (hint: could be prediction accuracy, running time, interpretability, etc)
Well done! I agree that looking at recall for these would be a good next step. Great presentation of a well-thought-out analysis; keep up the good work!
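If you do follow up on recall, a quick way to line the models up is to cross-validate each with `scoring="recall"`, reusing the estimators from the sketches above:

```python
from sklearn.model_selection import cross_val_score

models = {"kNN": knn, "GaussianNB": gnb, "LogReg (L1)": lasso_lr, "RandomForest": rf}
for name, model in models.items():
    rec = cross_val_score(model, X, y, cv=5, scoring="recall")
    print(f"{name:15s} recall = {rec.mean():.3f}")
```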
