GitHub - Joblizz123/Data-Wrangling

WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. The ratings almost always have a denominator of 10, while the numerators are almost always greater than 10: 11/10, 12/10, 13/10, etc. Why? Because they are good dogs! This project wrangles the WeRateDogs tweet data.

The wrangling process involves these steps:
i. Gathering the data
ii. Assessing the data
iii. Cleaning the data
iv. Storing the data
v. Analyzing the data

GATHERING THE DATA: The datasets were gathered from three different sources:

Enhanced Twitter Archive: This is an archive of tweets that provides information about the dogs, such as the tweet text, name and dog stage. This data was extracted programmatically.
Image Prediction file: This dataset holds the top three results of the image prediction obtained by running the images in the WeRateDogs Twitter archive through a neural network that classifies dog breeds. This file was loaded programmatically.
Additional Data via Twitter API: This data was obtained by querying Twitter’s API for the retweet count, favorite count and tweet id.
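
The three gathering steps can be sketched with pandas as below. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, and it assumes that tweet_json.txt holds one JSON object per line with `id`, `retweet_count` and `favorite_count` keys, that the archive is comma-separated, and that the image predictions file is tab-separated. The small in-memory samples stand in for the real files.

```python
import io
import json

import pandas as pd


def load_archive(source):
    """Read the enhanced Twitter archive (assumed comma-separated)."""
    return pd.read_csv(source)


def load_image_predictions(source):
    """Read the image predictions file (assumed tab-separated)."""
    return pd.read_csv(source, sep="\t")


def load_tweet_json(lines):
    """Build a dataframe from an iterable of JSON lines (assumed layout)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Tiny made-up samples so the sketch runs without the real files.
sample_json = io.StringIO(
    '{"id": 1, "retweet_count": 2, "favorite_count": 3}\n'
    '\n'
    '{"id": 4, "retweet_count": 5, "favorite_count": 6}\n'
)
api_df = load_tweet_json(sample_json)
images_df = load_image_predictions(io.StringIO("tweet_id\tjpg_url\n1\ta.jpg\n"))
print(api_df)
```

With the real files, the same functions would be called as `load_archive("twitter-archive.csv")`, `load_image_predictions("image-predictions.tsv")`, and `load_tweet_json(open("tweet_json.txt"))`.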
ASSESSING THE DATA: After the data were gathered from these different sources, they were assessed both visually and programmatically. The data were checked for quality and tidiness, and these issues were identified:

For quality issues:
The rating numerator column should have a float datatype
Timestamp and retweeted_status_timestamp datatypes should be datetime instead of object.
There are about 181 retweets in the retweeted_status_id column
Incorrect ratings of 75/10 and 5/10
The jpg_urls column in image_pred_copy contains duplicates
The name column contains 'None', an indication of incorrect imputations
The datatype of tweet_id for the three datasets is int
Unwanted columns in the df1 dataset

For the tidiness issues:
i. The dog stages should be in one column
ii. The three datasets should be one dataframe
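
Several of the issues above can be surfaced programmatically with a few pandas checks. The sketch below is illustrative only: the `assess` helper and the toy data are hypothetical, and the `jpg_url` column name is an assumption passed in as a parameter.

```python
import pandas as pd


def assess(archive, images, url_col="jpg_url"):
    """Return simple counts that flag some of the quality issues."""
    return {
        # Rows with a retweeted_status_id are retweets.
        "retweets": int(archive["retweeted_status_id"].notna().sum()),
        # The literal string 'None' marks a missing dog name.
        "names_none": int((archive["name"] == "None").sum()),
        # Repeated image URLs point to duplicated predictions.
        "duplicate_urls": int(images[url_col].duplicated().sum()),
        # False means the timestamp still needs converting.
        "timestamp_is_datetime": pd.api.types.is_datetime64_any_dtype(
            archive["timestamp"]
        ),
    }


# Made-up rows that mimic the shape of the real data.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-31 00:18:03 +0000",
    ],
    "retweeted_status_id": [float("nan"), 9.0, float("nan")],
    "name": ["Phineas", "None", "None"],
})
images = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
})

report = assess(archive, images)
print(report)
```

Visual assessment (scrolling through the data in a spreadsheet or notebook) complements these checks, since issues such as implausible ratings are easier to spot by eye.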
CLEANING THE DATA: The quality and tidiness issues outlined above were addressed by cleaning the datasets with the define-code-test method: each fix is defined in words, implemented in code, and then tested.
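
A define-code-test pass over a few of the issues might look like the sketch below. It is not the notebook's code: the function names are hypothetical, and the column names `rating_numerator`, `doggo`, `floofer`, `pupper` and `puppo`, as well as the choice to store `tweet_id` as a string, are assumptions made for illustration.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]  # assumed stage columns


def combine_stages(row):
    """Join the stage names present in a row, or return None if there are none."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ",".join(found) if found else None


def clean_archive(archive):
    df = archive.copy()  # clean a copy, keep the original intact
    # Define: retweets are not original ratings. Code: drop them.
    df = df[df["retweeted_status_id"].isna()].copy()
    df = df.drop(columns=["retweeted_status_id"])
    # Define: fix datatypes (tweet_id as string is an assumption).
    df["tweet_id"] = df["tweet_id"].astype(str)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["rating_numerator"] = df["rating_numerator"].astype(float)
    # Define: dog stages belong in one column.
    df["dog_stage"] = df[STAGES].apply(combine_stages, axis=1)
    return df.drop(columns=STAGES)


def merge_all(archive, images, api):
    """Join the three datasets on tweet_id into one dataframe."""
    images = images.assign(tweet_id=images["tweet_id"].astype(str))
    api = api.assign(tweet_id=api["tweet_id"].astype(str))
    return archive.merge(images, on="tweet_id").merge(api, on="tweet_id")


# Made-up rows that mimic the shape of the real data.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "retweeted_status_id": [float("nan"), 9.0, float("nan"), float("nan")],
    "rating_numerator": [13, 12, 11, 10],
    "doggo": ["None", "None", "doggo", "None"],
    "floofer": ["None"] * 4,
    "pupper": ["None", "None", "pupper", "None"],
    "puppo": ["puppo", "None", "None", "None"],
})
images = pd.DataFrame({"tweet_id": [1, 3, 5], "jpg_url": ["a.jpg", "b.jpg", "c.jpg"]})
api = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweet_count": [5, 6, 7, 8],
    "favorite_count": [50, 60, 70, 80],
})

clean = clean_archive(archive)
master = merge_all(clean, images, api)

# Test: confirm each fix took effect.
assert clean["retweeted_status_id" in clean.columns is False or "tweet_id"].notna().all()
assert pd.api.types.is_datetime64_any_dtype(clean["timestamp"])
```

The trailing assertions play the role of the "test" step; in a notebook each define, code and test would usually sit in its own cell.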
STORING THE DATA: The cleaned dataframe was stored and was therefore ready for analysis and visualization.
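
Storing the cleaned dataframe can be sketched as a CSV round trip. The helper names are hypothetical and the sample row is made up; the default file name matches the twitter_archive_master.csv file in this repository, and reading `tweet_id` back as a string is an assumption. An in-memory buffer is used here so the example runs without touching the disk.

```python
import io

import pandas as pd


def store_master(master, target="twitter_archive_master.csv"):
    """Write the cleaned dataframe without the pandas index column."""
    master.to_csv(target, index=False)


def load_master(source="twitter_archive_master.csv"):
    """Read the stored dataframe, keeping tweet_id as text."""
    return pd.read_csv(source, dtype={"tweet_id": str})


# Made-up row standing in for the cleaned data.
master = pd.DataFrame({
    "tweet_id": ["892420643555336193"],
    "rating_numerator": [13.0],
    "dog_stage": ["puppo"],
})

buffer = io.StringIO()
store_master(master, buffer)
buffer.seek(0)
restored = load_master(buffer)
print(restored)
```

Passing `index=False` keeps the row index out of the file, so reloading does not add an extra unnamed column.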

REPOSITORY FILES:
README.md
image-predictions.tsv
tweet_json.txt
twitter-archive.csv
twitter_archive_master.csv
wrangle_acts.ipynb
