Cohort 14 Capstone Project for the Certificate of Data Science at Georgetown University School of Continuing Studies.

stewartm888/Pet-Adoption-1

Pet-Adoption

Project Proposal 2/23/19

Domain chosen: Our domain is animal welfare. Millions of abandoned or lost animals end up in shelters. Many shelters are overcrowded, and some euthanize animals that aren’t adopted quickly. Studying adoption data may help shelters and rescue organizations connect with potential adopting homes more effectively, reducing needless animal suffering.

Hypothesis or project topic: An algorithm can be made to predict animal adoption rates from traits in the animal’s online profile. These traits may include breed, color, gender, health, the quality of images, and the quality of descriptive text.

Our main data source is a dataset available at Kaggle.com on animal shelter records. Here are the tables we intend to use:

  • train.csv - Tabular/text data for the training set
  • test.csv - Tabular/text data for the test set
  • breed_labels.csv - Data dictionary on breed
  • color_labels.csv - Data dictionary on color
  • state_labels.csv - Data dictionary on location

Project Description:

  1. First, we will perform a data quality check and exploratory data analysis (EDA) on the training and testing data, for example: analyzing missing values, summary statistics, data visualizations, scatter plots, and a correlation matrix. Depending on the quality of the data, we may need to clean it, for example by imputing missing values with the mean, the median, or another imputation technique.

  2. Afterward, we will use machine learning techniques to narrow down the list of candidate variables. Since our target variable is categorical, this is a classification problem, so we will explore classification approaches. The results of these steps will provide insight into the best format for our final model.

  3. Finally, we will build this model and further refine it through testing.

  4. If necessary, we will return to the first step, consider new traits to measure and model, and then repeat this process.
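The data quality check in step 1 could be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the toy DataFrame stands in for train.csv, the column names and values are assumptions, and `quality_report` and `impute_median` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; column names and values are illustrative assumptions.
train = pd.DataFrame({
    "Age": [3, 12, np.nan, 24, 6, 48],
    "Fee": [0, 100, 0, 50, np.nan, 0],
    "PhotoAmt": [1, 5, 2, 0, 3, 4],
    "AdoptionSpeed": [2, 1, 4, 3, 0, 4],
})


def quality_report(df):
    """Return the count and share of missing values for each column."""
    missing = df.isna().sum()
    return pd.DataFrame({"n_missing": missing, "share_missing": missing / len(df)})


def impute_median(df, columns):
    """Return a copy with NaNs in the given columns replaced by the column median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out


report = quality_report(train)            # missing-value overview
summary = train.describe()                # summary statistics
clean = impute_median(train, ["Age", "Fee"])
corr = clean.corr()                       # pairwise Pearson correlations

print(report)
print(summary)
print(corr)
```

Median imputation is only one of the options named in step 1; mean imputation would swap `median()` for `mean()`.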

Questions or avenues of exploration required for the project:

  • What profile traits should we include in our model?
  • Which traits have the most statistically useful records in the Kaggle tables?
  • Considering these traits, which might suggest actions that could be taken to improve how animals are adopted?
  • What may be the best classification model in our case? K-nearest neighbors? Decision trees? Support vector machines?
  • After our best model has been created, how can we use it or present it in a way that shelters and rescue organizations could easily use?
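One way to approach the model-selection question above is to compare the three candidate classifiers with cross-validation. The sketch below assumes scikit-learn and uses synthetic data from `make_classification` in place of the real training table; the hyperparameters (5 neighbors, tree depth 5, RBF kernel) are placeholder choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the real features and target (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# KNN and SVM are distance-based, so they are wrapped with feature scaling.
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best)
```

Accuracy is used here only for simplicity; a different scoring metric may suit the real target better.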

Karen's Initial Commentary on Project Plan - 2/28/19

Thank you for your well-defined problem and topic! I think this dataset can help your team learn all the concepts while you work on something interesting (and fun!). You already have a good plan and you can give me more information in your next deliverable.

Your team name in our records will be: Pet Adoption (This is also the name of your GitHub repository)

GitHub Repository: I have created a repository for your team and given access to your team members to this repository. You will need to accept the membership invitation to the Georgetown-Analytics GitHub to gain access to the repository.

Your team project repository is here: https://github.com/georgetown-analytics/Pet-Adoption

Please use this GitHub repository for your code throughout the cohort. You are required to put your project code for all the steps of your project in this GitHub repository, which will be evaluated at the end of the cohort. I will also be reviewing your code in your team GitHub repository throughout the cohort to check on your progress, so please push your code consistently.


Yuwen "Next Steps" - 3/23/19

In case any of you have time to explore a bit further before our meeting at noon, here is a preliminary list of things to look into next. I will keep checking the results, and there is more to come.

  1. All the categorical variables need a frequency table: type, gender, color, MaturitySize, FurLength, Vaccinated, state, VideoAmt, PhotoAmt, AdoptionSpeed, Quantity
  2. We may want to group the breed variable. Does either of you have a good idea for how to group breeds? The problem right now is that the breed information is too detailed and adds little predictive power, because each category is thin. Ideally, we would create a new variable that groups breeds by certain characteristics, e.g. how active the breed is, how aggressive the breed is, etc. This could be a burdensome one-time workload.
  3. Created new variables: 1) "mixed breed", based on breed2 == 0 vs. breed2 != 0; 2) "no fee flag", based on fee == 0 vs. fee != 0
  4. Need to check the number of missing values in variables: breed1, age
  5. Need the 1% and 99% values (1st and 99th percentiles) for variables: age
  6. Data quality check: 1) any observation with color2 == 0 while color3 != 0; 2) any observation with breed1 == 0 while breed2 != 0
  7. The correlation matrix is truncated in Jupyter; can you generate the full matrix?
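Several of the items above (1, 3, 4, 5, 6, and 7) map onto short pandas operations. The following is a rough sketch on toy records: the capitalized column names and the new `MixedBreed` and `NoFee` flag names are assumptions, and the values are made up for illustration.

```python
import pandas as pd

# Toy records; column spellings and values are illustrative assumptions.
pets = pd.DataFrame({
    "Breed1": [307, 0, 265, 307, 266],
    "Breed2": [0, 307, 0, 266, 0],
    "Color2": [2, 0, 0, 7, 0],
    "Color3": [0, 5, 0, 0, 0],
    "Fee": [0, 100, 0, 50, 0],
    "Age": [3, 12, 1, 24, 6],
    "Gender": [1, 2, 2, 1, 3],
})

# Item 1: frequency table for a categorical variable.
gender_freq = pets["Gender"].value_counts().sort_index()

# Item 3: derived flags (1 = condition holds, 0 = it does not).
pets["MixedBreed"] = (pets["Breed2"] != 0).astype(int)
pets["NoFee"] = (pets["Fee"] == 0).astype(int)

# Item 4: number of missing values per variable.
n_missing = pets[["Breed1", "Age"]].isna().sum()

# Item 5: 1st and 99th percentiles of age.
age_quantiles = pets["Age"].quantile([0.01, 0.99])

# Item 6: rows that fail the consistency checks.
bad_color = pets[(pets["Color2"] == 0) & (pets["Color3"] != 0)]
bad_breed = pets[(pets["Breed1"] == 0) & (pets["Breed2"] != 0)]

# Item 7: to_string() renders the whole matrix without truncation.
full_corr = pets.corr().to_string()

print(gender_freq)
print(n_missing)
print(age_quantiles)
print(len(bad_color), "color inconsistencies;", len(bad_breed), "breed inconsistencies")
print(full_corr)
```

Item 2 (grouping breeds) is left out because it depends on a breed-to-trait mapping that still has to be decided.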
