Contact Information

  • Lindsay Moir
  • 778 679 4406
  • tragoes@gmail.com

Project Description

We are using a Twitter archive from @dog_rates, provided as a .csv file. Tweepy is used to interface with the Twitter API and retrieve JSON data. Finally, Udacity has provided a .tsv file of predictions that classify the pictures of dogs, which is downloaded via an HTTP request. This project covers data gathering, cleaning, storing, and analysis. We are most interested in which breeds this population of users likes and how their ratings of those breeds have changed over time.

In [636]:
# Import libraries
import collections
import datetime
from ggplot import *
import json
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib_venn import venn2, venn2_circles
from matplotlib_venn import venn3, venn3_circles
import numpy as np
from numpy.polynomial.polynomial import polyfit
import pandas as pd
import requests
import seaborn as sns
sns.set_style('whitegrid')
sns.set_palette("bright", 10)
import sqlite3
import tweepy

Gather

twitter_archive

In [637]:
# Load WeRateDogs enhanced twitter archive from Udacity.
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head(2)
Out[637]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU NaN NaN NaN https://twitter.com/dog_rates/status/892420643555336193/photo/1 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV NaN NaN NaN https://twitter.com/dog_rates/status/892177421306343426/photo/1 13 10 Tilly None None None None
In [638]:
twitter_archive.shape
Out[638]:
(2356, 17)

image_predictions

There is also an image file that contains tweet IDs and URLs for images. For example, this URL gives you Stuart's picture (the dog): https://twitter.com/dog_rates/status/889531135344209921. The part after status/ is the tweet ID for Stuart. Udacity wants us to download this file programmatically via the requests library.

In [639]:
# This loads the contents of this file into response
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
In [640]:
# Save .tsv to file for future use.
with open("image_predictions.tsv", mode = 'wb') as file:
    file.write(response.content)
In [641]:
# Now bring image_predictions.tsv back into memory.
image_predictions = pd.read_csv('image_predictions.tsv', sep = '\t')
image_predictions.head()
Out[641]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [642]:
len(image_predictions)
Out[642]:
2075

twitter_counts

We need to set up access to the Twitter API via Tweepy.

In [643]:
# This authorizes us to use the Twitter API.
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, parser=tweepy.parsers.JSONParser(), 
                 wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
In [644]:
# The api runs on tweet_id. We need a list of tweets to get the json data. 
# We are using the tweet_id from the archive.
"""
twitter_list = []
errors = []
ta_tweet_id_list = twitter_archive.tweet_id.tolist()
status_list = range(1, len(ta_tweet_id_list), 100)

# For loop to retrieve tweets in json format. We expect errors.
for index, id in enumerate(ta_tweet_id_list):
    try:
        page = api.get_status(id)
        favorites = page['favorite_count']
        retweets = page['retweet_count']
        twitter_list.append({'tweet_id': int(id),
                        'favorites': int(favorites),
                        'retweets': int(retweets)})
        if index in status_list:
            print('Completed ~', index, ' Length of twitter_list is', 
                  len(twitter_list), ' Length of errors list is', 
                  len(errors))
        else:
            pass
    except Exception as e:
        errors.append((id, e))

# Put the list of dictionaries into twitter_counts with tweet_id first.
twitter_counts = pd.DataFrame.from_dict(twitter_list)
# Reorder the columns
twitter_counts = twitter_counts[['tweet_id', 'favorites', 'retweets']]       

# Store twitter_counts on disk now that you have retrieved the data.
# It is time consuming to download this from twitter.
twitter_counts.to_csv('twitter_counts.csv', encoding='utf-8', index=False)
"""
In [645]:
# Check and make sure it reads in OK
twitter_counts = pd.read_csv('twitter_counts.csv', encoding='utf-8')
twitter_counts.head()
Out[645]:
tweet_id favorites retweets
0 892420643555336193 38690 8558
1 892177421306343426 33164 6292
2 891815181378084864 24963 4174
3 891689557279858688 42070 8681
4 891327558926688256 40226 9451

Assess

General

In [646]:
# Sanity check: confirm tweet_id is the same 18-digit ID in every source.
print('length of twitter_archive.tweet_id in excel spreadsheet is', 
    len('892420643555336000'))
print('length of twitter_archive.tweet_id is', 
    len('892177421306343426'))
print('length of image_predictions.tweet_id is', 
    len('666020888022790149')) 
print('length of twitter_archive tweet_id in expanded_urls is', 
    len('892420643555336193'))
length of twitter_archive.tweet_id in excel spreadsheet is 18
length of twitter_archive.tweet_id is 18
length of image_predictions.tweet_id is 18
length of twitter_archive tweet_id in expanded_urls is 18
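The lengths all check out, but the first value above (892420643555336000, from the Excel view) shows what happens when an 18-digit ID passes through a float: float64 only carries 15-16 significant digits. A minimal sketch of the hazard, using an in-memory CSV rather than the real archive:

```python
import io

import pandas as pd

# An 18-digit tweet_id exceeds float64's ~15-16 significant digits,
# so any tool that parses it as a float silently corrupts it.
csv = io.StringIO("tweet_id\n892420643555336193\n")
as_float = pd.read_csv(csv, dtype={'tweet_id': float})
csv.seek(0)
as_str = pd.read_csv(csv, dtype={'tweet_id': str})

print(int(as_float.tweet_id[0]) == 892420643555336193)  # False: precision lost
print(as_str.tweet_id[0])  # the string round-trips intact
```

This is one reason the quality notes below flag tweet_id for conversion to string.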
In [647]:
# See if there are any duplicate column names amongst the dataframes.
columns = pd.Series(list(twitter_archive) + 
    list(image_predictions) + list(twitter_counts))
columns[columns.duplicated()]
Out[647]:
17    tweet_id
29    tweet_id
dtype: object
In [648]:
# Let's see what the intersection of twitter_archive and image_predictions is.
# Want to see if anything is surprising.
taid = set(twitter_archive.tweet_id)
len(taid)
Out[648]:
2356
In [649]:
ipid = set(image_predictions.tweet_id)
len(ipid)
Out[649]:
2075
In [650]:
common = taid.intersection(ipid)
len(common)
Out[650]:
2075
In [651]:
not_common = taid - ipid
len(not_common)
Out[651]:
281
In [652]:
# Math works. No problem.
len(common) + len(not_common)
Out[652]:
2356

twitter_archive

In [653]:
twitter_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [654]:
twitter_archive.sample(5)
Out[654]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1582 687109925361856513 NaN NaN 2016-01-13 03:12:26 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Baxter. He looks like a fun dog. Prefers action shots. 11/10 the last one is impeccable https://t.co/LHcH1yhhIb NaN NaN NaN https://twitter.com/dog_rates/status/687109925361856513/photo/1,https://twitter.com/dog_rates/status/687109925361856513/photo/1,https://twitter.com/dog_rates/status/687109925361856513/photo/1 11 10 Baxter None None None None
188 855862651834028034 8.558616e+17 194351775.0 2017-04-22 19:15:32 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research NaN NaN NaN NaN 420 10 None None None None None
1585 686947101016735744 NaN NaN 2016-01-12 16:25:26 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Jackson. He was specifically told not to sleep in the fridge. Damn it Jackson. 11/10 would squeeze softly https://t.co/lJs10ZJsgj NaN NaN NaN https://twitter.com/dog_rates/status/686947101016735744/photo/1 11 10 Jackson None None None None
137 866094527597207552 NaN NaN 2017-05-21 00:53:21 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: Here's a pupper before and after being asked "who's a good girl?" Unsure as h*ck. 12/10 hint hint it's you https://t.co/ORiKā€¦ 8.378202e+17 4.196984e+09 2017-03-04 00:21:08 +0000 https://twitter.com/dog_rates/status/837820167694528512/photo/1,https://twitter.com/dog_rates/status/837820167694528512/photo/1 12 10 None None None pupper None
1475 693642232151285760 NaN NaN 2016-01-31 03:49:30 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Meet Freddery. He's a Westminster Toblerone. Seems to enjoy car rides. 9/10 would pat on the head approvingly https://t.co/6BS9XEip9a NaN NaN NaN https://twitter.com/dog_rates/status/693642232151285760/photo/1 9 10 Freddery None None None None
In [655]:
# Check for duplicated tweet_id's.
twitter_archive[twitter_archive.tweet_id.duplicated()]
Out[655]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

After visually inspecting the text field, we may be able to extract gender.

In [656]:
# Most of the ratings use 10 as a denominator. How many do not?
nof_not_10_as_denom = twitter_archive.rating_denominator.value_counts().tolist()
sum(nof_not_10_as_denom) - nof_not_10_as_denom[0]
Out[656]:
23
In [657]:
pd.set_option('display.max_colwidth', 200)
funny_ratings_denom = twitter_archive[
    twitter_archive.rating_denominator != 10]
funny_ratings_denom = funny_ratings_denom[['tweet_id', 'text', 
    'rating_numerator', 'rating_denominator']]
funny_ratings_denom.sample(5)
Out[657]:
tweet_id text rating_numerator rating_denominator
1634 684225744407494656 Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3 143 130
1351 704054845121142784 Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa 60 50
1433 697463031882764288 Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ 44 40
1598 686035780142297088 Yes I do realize a rating of 4/20 would've been fitting. However, it would be unjust to give these cooperative pups that low of a rating 4 20
2335 666287406224695296 This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv 1 2
In [658]:
len(funny_ratings_denom)
Out[658]:
23
In [659]:
twitter_archive.rating_numerator.value_counts().head(10)
Out[659]:
12    558
11    464
10    461
13    351
9     158
8     102
7      55
14     54
5      37
6      32
Name: rating_numerator, dtype: int64

Atticus

Time for some fun. Which is the most extreme outlier for numerator?

In [660]:
twitter_archive.rating_numerator.max()
Out[660]:
1776
In [661]:
# Check suspicious numerators
funny_ratings_num = twitter_archive[
    twitter_archive.rating_numerator > 14]
funny_ratings_num = funny_ratings_num[['tweet_id', 'text', 
    'rating_numerator', 'rating_denominator']] 
funny_ratings_num.sort_values('rating_numerator', 
    ascending = False, inplace = True)
funny_ratings_num.head()
Out[661]:
tweet_id text rating_numerator rating_denominator
979 749981277374128128 This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh 1776 10
313 835246439529840640 @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho 960 0
189 855860136149123072 @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10 666 10
2074 670842764863651840 After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY 420 10
188 855862651834028034 @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research 420 10

Let's see who Atticus is. [embedded image: Atticus]

Now we know why Atticus was the "top dog" at 1776. That is the year that the USA became an independent nation!

In [662]:
len(funny_ratings_num)
Out[662]:
28
In [663]:
twitter_archive.source.value_counts()
Out[663]:
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64
In [664]:
twitter_archive.name.value_counts().head()
Out[664]:
None       745
a           55
Charlie     12
Oliver      11
Lucy        11
Name: name, dtype: int64

The above is ugly. The name column contains articles, prepositions, and other non-names.
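There is a consistent pattern behind the bad values: real names in this archive are capitalized, while the extraction artifacts ("a", "the", "an", ...) are lowercase. A minimal sketch of that heuristic on toy data:

```python
import pandas as pd

# Toy sample of the name column: real names are capitalized,
# extraction artifacts are lowercase words.
names = pd.Series(['Charlie', 'a', 'None', 'the', 'Lucy', 'an'])
suspect = names[names.str.islower()]
print(sorted(suspect))  # ['a', 'an', 'the']
```

Note the string "None" is capitalized, so it escapes this filter; the "None" hunt in the quality notes handles it separately.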

image_predictions

In [665]:
image_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [666]:
image_predictions.head()
Out[666]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [667]:
image_predictions.sample(5)
Out[667]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
1090 719339463458033665 https://pbs.twimg.com/media/Cfuba6NW4AIeMHk.jpg 1 golden_retriever 0.765778 True borzoi 0.071148 True Leonberg 0.070371 True
560 677716515794329600 https://pbs.twimg.com/media/CWe7kw9W4AE8UJh.jpg 1 teddy 0.662908 False crib 0.031891 False chow 0.025438 True
929 702932127499816960 https://pbs.twimg.com/media/CcFRCfRW4AA5a72.jpg 1 wallaby 0.410710 False wombat 0.239332 False beaver 0.149605 False
1968 868552278524837888 https://pbs.twimg.com/media/DA23sCeVoAE3uF0.jpg 1 whippet 0.378151 True Italian_greyhound 0.275935 True American_Staffordshire_terrier 0.094991 True
290 671166507850801152 https://pbs.twimg.com/media/CVB2TnWUYAA2pAU.jpg 1 refrigerator 0.829772 False toilet_seat 0.030083 False shower_curtain 0.015461 False
In [668]:
# Let's see if there are any odd spellings that cause a problem.
image_predictions.p1.nunique()
Out[668]:
378
In [669]:
image_predictions.p1.str.lower().nunique()
Out[669]:
378
In [670]:
# Let's see if there are any odd spellings that cause a problem.
image_predictions.p2.nunique()
Out[670]:
405
In [671]:
image_predictions.p2.str.lower().nunique()
Out[671]:
404
In [672]:
original = image_predictions.p2.unique().tolist()
lowercase = image_predictions.p2.str.lower().unique().tolist()
original = [x.lower() for x in original]
In [673]:
# 'cardigan' appears both lowercase and capitalized in p2, so lowercasing merges two labels. I don't think this is serious at all.
print([item for item, count in collections.Counter(original).items() if count > 1])
['cardigan']
In [674]:
# Let's see if there are any odd spellings that cause a problem.
image_predictions.p3.nunique()
Out[674]:
408
In [675]:
image_predictions.p3.str.lower().nunique()
Out[675]:
408
In [676]:
# There are a few jpg urls that are identical. Mmm, not sure what that means. 
image_predictions[image_predictions.jpg_url.duplicated()].count()
Out[676]:
tweet_id    66
jpg_url     66
img_num     66
p1          66
p1_conf     66
p1_dog      66
p2          66
p2_conf     66
p2_dog      66
p3          66
p3_conf     66
p3_dog      66
dtype: int64
In [677]:
image_predictions[image_predictions.jpg_url.duplicated(keep = False)].sort_values(
    by = 'jpg_url').head()
Out[677]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
480 675354435921575936 https://pbs.twimg.com/ext_tw_video_thumb/675354114423808004/pu/img/qL1R_nGLqa6lmkOx.jpg 1 upright 0.303415 False golden_retriever 0.181351 True Brittany_spaniel 0.162084 True
1297 752309394570878976 https://pbs.twimg.com/ext_tw_video_thumb/675354114423808004/pu/img/qL1R_nGLqa6lmkOx.jpg 1 upright 0.303415 False golden_retriever 0.181351 True Brittany_spaniel 0.162084 True
1864 842892208864923648 https://pbs.twimg.com/ext_tw_video_thumb/807106774843039744/pu/img/8XZg1xW35Xp2J6JW.jpg 1 Chihuahua 0.505370 True Pomeranian 0.120358 True toy_terrier 0.077008 True
1641 807106840509214720 https://pbs.twimg.com/ext_tw_video_thumb/807106774843039744/pu/img/8XZg1xW35Xp2J6JW.jpg 1 Chihuahua 0.505370 True Pomeranian 0.120358 True toy_terrier 0.077008 True
1703 817181837579653120 https://pbs.twimg.com/ext_tw_video_thumb/815965888126062592/pu/img/JleSw4wRhgKDWQj5.jpg 1 Tibetan_mastiff 0.506312 True Tibetan_terrier 0.295690 True otterhound 0.036251 True
In [678]:
twitter_archive[twitter_archive.tweet_id == 675354435921575936].head()
Out[678]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1865 675354435921575936 NaN NaN 2015-12-11 16:40:19 +0000 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> Everyone needs to watch this. 13/10 https://t.co/Bb3xnpsWBC NaN NaN NaN https://twitter.com/dog_rates/status/675354435921575936/video/1 13 10 None None None None None
In [679]:
twitter_archive[twitter_archive.tweet_id == 752309394570878976].head()
Out[679]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
949 752309394570878976 NaN NaN 2016-07-11 01:11:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: Everyone needs to watch this. 13/10 https://t.co/Bb3xnpsWBC 6.753544e+17 4.196984e+09 2015-12-11 16:40:19 +0000 https://twitter.com/dog_rates/status/675354435921575936/video/1,https://twitter.com/dog_rates/status/675354435921575936/video/1 13 10 None None None None None

We need to be careful during cleaning: these repeated jpg_urls are retweets, and each retweet creates a second entry in the image_predictions file. Removing the retweets should also remove these duplicates, but we need to verify that.
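That claim can be checked programmatically. A minimal sketch with toy frames (column names follow the tables above; tweet 2 retweets tweet 1, so the same jpg_url appears twice):

```python
import numpy as np
import pandas as pd

# Toy frames mirroring the structure above.
archive = pd.DataFrame({'tweet_id': [1, 2],
                        'retweeted_status_id': [np.nan, 1.0]})
images = pd.DataFrame({'tweet_id': [1, 2],
                       'jpg_url': ['img_a.jpg', 'img_a.jpg']})

# Drop retweets first, then check whether duplicate jpg_urls remain.
rt_ids = archive[archive.retweeted_status_id.notnull()].tweet_id
remaining = images[~images.tweet_id.isin(rt_ids)]
print(remaining.jpg_url.duplicated().any())  # False: duplicates gone
```

On the real data the same check after the retweet-removal step confirms whether any duplicated jpg_urls survive.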

twitter_counts

In [680]:
twitter_counts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2345 entries, 0 to 2344
Data columns (total 3 columns):
tweet_id     2345 non-null int64
favorites    2345 non-null int64
retweets     2345 non-null int64
dtypes: int64(3)
memory usage: 55.0 KB
In [681]:
twitter_counts.head()
Out[681]:
tweet_id favorites retweets
0 892420643555336193 38690 8558
1 892177421306343426 33164 6292
2 891815181378084864 24963 4174
3 891689557279858688 42070 8681
4 891327558926688256 40226 9451
In [682]:
twitter_counts[twitter_counts.tweet_id.duplicated()]
Out[682]:
tweet_id favorites retweets

Quality

twitter_archive

  • Retweets should be removed.
  • Retweet-associated columns are not needed (only original tweets with images matter). The irrelevant columns are 'in_reply_to_user_id', 'in_reply_to_status_id', 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp'.
  • Erroneous datatype for timestamp.
  • See if gender can be extracted from the text column.
  • Erroneous rating_numerator and rating_denominator values. Several ratings are simply parsed wrong.
  • Columns doggo, floofer, pupper, and puppo use the string "None" where nulls belong. We should go on a "None" hunt across the remaining columns.
  • source has only 4 unique values, all wrapped in HTML anchor tags, which makes the column hard to read.
  • name is often not an actual name: "None" and "a" are the two most common values, and several articles and prepositions (the, an, this, etc.) appear. Capitalization is also inconsistent.
  • tweet_id should be a string.
  • There is one column per dog stage. This should be one column with 4 categorical values.
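The last bullet (one column per dog stage) can be collapsed into a single categorical column. A minimal sketch on toy rows, using the same "None" convention as the raw archive; note that a row tagged with two stages (e.g. doggo and pupper) would concatenate and needs separate handling:

```python
import numpy as np
import pandas as pd

stages = ['doggo', 'floofer', 'pupper', 'puppo']
df = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                   'doggo':   ['doggo', 'None',   'None'],
                   'floofer': ['None',  'None',   'None'],
                   'pupper':  ['None',  'pupper', 'None'],
                   'puppo':   ['None',  'None',   'None']})

# Join the non-"None" values per row; empty string -> NaN.
df['stage'] = (df[stages].replace('None', '')
               .agg(''.join, axis=1)
               .replace('', np.nan))
df = df.drop(columns=stages)
df['stage'] = df['stage'].astype('category')
print(df.stage.tolist())  # ['doggo', 'pupper', nan]
```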

image_predictions

  • Column names are not informative.
  • Capitalize breed names.
  • tweet_id is numeric. Should be string.
  • Change breed to categorical.
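The capitalization and categorical bullets can be sketched together, with toy ImageNet-style labels standing in for p1:

```python
import pandas as pd

# Toy p1 values: the labels mix case and use underscores.
breeds = pd.Series(['golden_retriever', 'Rhodesian_ridgeback', 'pug'])
clean = breeds.str.replace('_', ' ').str.title().astype('category')
print(clean.tolist())  # ['Golden Retriever', 'Rhodesian Ridgeback', 'Pug']
```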

twitter_counts

  • tweet_id is numeric. Should be string.

Flagged but not cleaned

  • In twitter_archive clean the url from the tweet text.
  • In p2 cardigan is a "problem". I don't think this is serious at all.
  • In image_clean img_num could be converted to integer but ...

Tidiness

  • We only need one table, built from a subset of these 3 tables.
  • image_predictions has 2075 rows versus 2356 in twitter_archive, so an inner merge keeps only tweets with images.
  • twitter_counts can be merged with the twitter_archive table as well.
  • The only image_predictions columns with useful information are tweet_id, p1, and p1_dog, so we can merge just this subset with twitter_archive. Let's put this large table on a diet.
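The merge plan above can be sketched with toy frames; the column names follow the three tables, and the inner joins keep only tweets present in all of them:

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating': [1.2, 1.3, 1.0]})
images = pd.DataFrame({'tweet_id': [1, 2],
                       'p1': ['pug', 'collie'], 'p1_dog': [True, True],
                       'jpg_url': ['a.jpg', 'b.jpg']})
counts = pd.DataFrame({'tweet_id': [1, 2, 3],
                       'favorites': [10, 20, 30], 'retweets': [1, 2, 3]})

# Keep only the informative image columns, then inner-join all three.
master = (archive
          .merge(images[['tweet_id', 'p1', 'p1_dog']], on='tweet_id')
          .merge(counts, on='tweet_id'))
print(master.shape)  # (2, 6): tweet 3 has no image prediction
```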

Cleaning

In [683]:
# Make copies of files
archive_clean = twitter_archive.copy()
In [684]:
image_clean = image_predictions.copy()
In [685]:
counts_clean = twitter_counts.copy()

archive_clean

Define

  • If retweeted_status_user_id and in_reply_to_user_id are both null, the tweet is an original (neither a retweet nor a reply). Identify those tweets and keep only them in a new version of archive_clean.
  • Retweet- and reply-associated columns are not needed (only original tweets with images matter): in_reply_to_user_id, in_reply_to_status_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp.

Code

In [686]:
archive_clean = archive_clean[
    archive_clean.retweeted_status_user_id.isnull()]
archive_clean = archive_clean[
    archive_clean.in_reply_to_user_id.isnull()]
archive_clean.drop(['in_reply_to_user_id', 'in_reply_to_status_id', 
    'retweeted_status_id', 'retweeted_status_user_id',
    'retweeted_status_timestamp'], inplace = True, axis = 1) 

Test

In [687]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null object
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB
In [688]:
archive_clean.head(2)
Out[688]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 13 10 Phineas None None None None
1 892177421306343426 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV https://twitter.com/dog_rates/status/892177421306343426/photo/1 13 10 Tilly None None None None

Define

  • Erroneous datatype for timestamp.

Code

In [689]:
archive_clean.timestamp = pd.to_datetime(
    archive_clean.timestamp, infer_datetime_format=True)

Test

In [690]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 213.0+ KB

Define

  • Erroneous rating_numerator and rating_denominator values. Several ratings are simply parsed wrong. I defined funny_ratings earlier. If the numerator is < 15 and the denominator is == 10, keep the row; otherwise delete it. We are doing programmatic cleaning, not visual (if at all possible).
  • Create a rating which is the correct math of the numerator and denominator. Drop the numerator and denominator columns.

Code

In [691]:
# Subset that has only valid ratings.
archive_clean = archive_clean[archive_clean.rating_numerator < 15]
archive_clean = archive_clean[archive_clean.rating_denominator == 10]
In [692]:
archive_clean['rating'] = archive_clean.rating_numerator / archive_clean.rating_denominator
In [693]:
archive_clean.drop(['rating_numerator', 'rating_denominator'], 
   inplace = True, axis = 1)

Test

In [694]:
len(archive_clean)
Out[694]:
2075
In [695]:
archive_clean.head(0)
Out[695]:
tweet_id timestamp source text expanded_urls name doggo floofer pupper puppo rating

Define

  • Extract gender from text field via defining gender words, matching them in text, and then creating a gender column with 0 for Male, 1 for Female, and NaN for unknown.

Code

In [696]:
# Set up gender lists and container
male = ['boy', 'he', "he's", 'hes', 'him', 'himself', 'his',
        'male', 'stud']
female = ['bitch', 'female', 'girl', 'her', 'hers', "her's",
          'herself', 'she', 'shes', "she's"]

# Create calc_gender function: 0 = male, 1 = female, NaN = unknown.
def calc_gender(text):
    text = text.replace("'", "") # Encoding problems (sigh)
    for word in text.lower().split():
        if word in male:
            return 0
        if word in female:
            return 1
    return np.nan
In [697]:
# Apply the function.
archive_clean['gender'] = archive_clean['text'].apply(calc_gender)

Test

In [698]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id         2075 non-null int64
timestamp        2075 non-null datetime64[ns]
source           2075 non-null object
text             2075 non-null object
expanded_urls    2072 non-null object
name             2075 non-null object
doggo            2075 non-null object
floofer          2075 non-null object
pupper           2075 non-null object
puppo            2075 non-null object
rating           2075 non-null float64
gender           1399 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(8)
memory usage: 210.7+ KB
In [699]:
archive_clean.gender.head(10)
Out[699]:
0    0.0
1    1.0
2    0.0
3    1.0
4    0.0
5    NaN
6    0.0
7    0.0
8    1.0
9    1.0
Name: gender, dtype: float64
In [700]:
archive_clean.text.head(10)
Out[700]:
0                                                             This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
1        This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
2                         This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
3                                                                   This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
4        This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
5        Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh
6    Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl
7                            When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq
8                        This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b
9             This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A
Name: text, dtype: object

These all match up to the 0 (male), 1 (female), and NaN values on the first 10 records. That is pretty good: 1399 records out of 2075 classified!

Define

  • Columns name, doggo, floofer, pupper, and puppo use the string "None" where the value is missing. These should be nulls. Change all "None" to np.nan.

Code

In [701]:
columns = ['name', 'doggo', 'floofer', 'pupper', 'puppo']
for column in columns:
    archive_clean[column].replace("None", np.nan, inplace = True)  

Test

In [702]:
for column in columns:
    print(archive_clean[archive_clean[column] == "None"])
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []

Define

  • source has only 4 unique entries that are in html format. Change this to a categorical feature with 4 entries.

Code

In [703]:
archive_clean.source.unique().tolist()
Out[703]:
['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
 '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>']
In [704]:
source_html = archive_clean.source.unique().tolist()
replace_html = ["iPhone", "Twitter", "Vine", "TweetDeck"]
archive_clean.source.replace(
    source_html, replace_html, inplace = True)
archive_clean.source = archive_clean.source.astype('category')

Test

In [705]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id         2075 non-null int64
timestamp        2075 non-null datetime64[ns]
source           2075 non-null category
text             2075 non-null object
expanded_urls    2072 non-null object
name             1485 non-null object
doggo            83 non-null object
floofer          10 non-null object
pupper           229 non-null object
puppo            24 non-null object
rating           2075 non-null float64
gender           1399 non-null float64
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 196.7+ KB

Define

  • name is often not an actual name. "None" and "a" are the two most common values. There are also several articles and other stopwords (the, an, this, etc.). Replace "None", articles, and words of the same ilk with np.nan. There are also some numbers in here, so convert the column to strings first.

Code

In [706]:
# Show some offending values
archive_clean['name'].value_counts().head()
Out[706]:
a          54
Lucy       11
Charlie    11
Oliver     10
Cooper     10
Name: name, dtype: int64
In [707]:
archive_clean['name'].head(10)
Out[707]:
0     Phineas
1       Tilly
2      Archie
3       Darla
4    Franklin
5         NaN
6         Jax
7         NaN
8        Zoey
9      Cassie
Name: name, dtype: object
In [708]:
# There are some numbers in here (sigh)
archive_clean['name'] = archive_clean['name'].astype(str)

# Regex matches values that are entirely lowercase letters, digits, or
# whitespace; .mask replaces those matches with NaN
archive_clean['name'] = archive_clean['name'].mask(archive_clean['name'].str.match(r'^[0-9a-z\s]+$'))

Test

In [709]:
archive_clean['name'].value_counts().head()
Out[709]:
Charlie    11
Lucy       11
Cooper     10
Oliver     10
Tucker      9
Name: name, dtype: int64

We can see that 'a' has been removed.
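The mask/regex step can be illustrated on a toy Series; this is a sketch, not the project data:

```python
import pandas as pd

# Toy Series mimicking the name column: real names are capitalized,
# non-names like 'a' and 'the' are all lowercase (or digits).
names = pd.Series(['Phineas', 'a', 'Tilly', 'the', '3', 'Zoey'])

# .mask replaces values where the condition is True with NaN, so every
# all-lowercase/digit "name" is dropped.
cleaned = names.mask(names.str.match(r'^[0-9a-z\s]+$'))
print(cleaned.dropna().tolist())  # ['Phineas', 'Tilly', 'Zoey']
```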

Define

  • tweet_id is numeric. It should be string. Too many to be categorical. Use astype to convert.

Code

In [710]:
archive_clean.tweet_id = archive_clean.tweet_id.astype(str)

Test

In [711]:
archive_clean.tweet_id.sample(5)
Out[711]:
608     797971864723324932
660     791312159183634433
1007    747600769478692864
610     797236660651966464
1957    673583129559498752
Name: tweet_id, dtype: object
In [712]:
# Produce a mask that selects tweet_ids whose length is != 18.
# This should create an empty Series.
mask = (archive_clean.tweet_id.str.len() != 18)
test = archive_clean.tweet_id.loc[mask]
test.head()
Out[712]:
Series([], Name: tweet_id, dtype: object)
In [713]:
archive_clean.head()
Out[713]:
tweet_id timestamp source text expanded_urls name doggo floofer pupper puppo rating gender
0 892420643555336193 2017-08-01 16:23:56 iPhone This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 Phineas NaN NaN NaN NaN 1.3 0.0
1 892177421306343426 2017-08-01 00:17:27 iPhone This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV https://twitter.com/dog_rates/status/892177421306343426/photo/1 Tilly NaN NaN NaN NaN 1.3 1.0
2 891815181378084864 2017-07-31 00:18:03 iPhone This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB https://twitter.com/dog_rates/status/891815181378084864/photo/1 Archie NaN NaN NaN NaN 1.2 0.0
3 891689557279858688 2017-07-30 15:58:51 iPhone This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ https://twitter.com/dog_rates/status/891689557279858688/photo/1 Darla NaN NaN NaN NaN 1.3 1.0
4 891327558926688256 2017-07-29 16:00:24 iPhone This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 Franklin NaN NaN NaN NaN 1.2 0.0

Define

  • There is one column per dog stage (doggo, floofer, pupper, puppo). These should be 1 column with 4 categorical choices. Use .apply with a function that checks for the stages in the received row and returns the stage or a NaN in a new column called dog_stages. Afterwards, drop the other columns and make dog_stages categorical.
  • Some dogs have more than one stage. Since this is problematic, we will concatenate the matches.

Code

In [714]:
# Let's count the numbers in each column first
dog_stages = ['doggo', 'floofer', 'pupper', 'puppo']
for stage in dog_stages:
    display(archive_clean.groupby(stage).count())
tweet_id timestamp source text expanded_urls name floofer pupper puppo rating gender
doggo
doggo 83 83 83 83 83 42 1 9 1 83 49
tweet_id timestamp source text expanded_urls name doggo pupper puppo rating gender
floofer
floofer 10 10 10 10 10 5 1 0 0 10 6
tweet_id timestamp source text expanded_urls name doggo floofer puppo rating gender
pupper
pupper 229 229 229 229 228 124 9 0 0 229 135
tweet_id timestamp source text expanded_urls name doggo floofer pupper rating gender
puppo
puppo 24 24 24 24 24 16 1 0 0 24 18

You can certainly see that there is more than 1 dog stage in some of the rows!
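One way to handle multiple stages is a set intersection that joins the matches into a single label. A minimal standalone sketch (with None standing in for np.nan, and a sorted join for a deterministic label):

```python
DOG_STAGES = {'doggo', 'floofer', 'pupper', 'puppo'}

def calc_dog_stage(stages):
    # Keep only recognized stage labels (drops the None fillers).
    found = set(stages) & DOG_STAGES
    # Join multiple stages into a single label, e.g. 'doggo_pupper'.
    return '_'.join(sorted(found)) if found else None

print(calc_dog_stage(['doggo', None, 'pupper', None]))  # doggo_pupper
print(calc_dog_stage([None, None, None, None]))         # None
```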

In [715]:
# Function that uses set intersection to collapse the four stage columns into one

# create a set first
dog_stages = set(dog_stages)

def calc_dog_stage(stages):
    
    # stages is a list. Convert it to a set
    stages = set(stages)
    
    # Keep only the recognized dog stages present in this row
    stages = stages.intersection(dog_stages)
    
    # Check and see if there is anything in stages
    if len(stages) > 0:
    
        # This concatenates the multiple stages (if they exist)
        stages = '_'.join(stages)

        return stages
    
    else:
        # Make sure it returns a NaN if there are no stages
        return np.nan
In [716]:
archive_clean['dog_stages'] = archive_clean[['doggo', 'floofer', 
    'pupper', 'puppo']].apply(calc_dog_stage, axis = 1)
In [717]:
archive_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], 
    inplace = True, axis = 1)
archive_clean.dog_stages = archive_clean.dog_stages.astype('category')

Test

In [718]:
archive_clean.head(0)
Out[718]:
tweet_id timestamp source text expanded_urls name rating gender dog_stages
In [719]:
archive_clean.dog_stages.value_counts()
Out[719]:
pupper           220
doggo             72
puppo             23
pupper_doggo       9
floofer            9
doggo_puppo        1
doggo_floofer      1
Name: dog_stages, dtype: int64
In [720]:
archive_clean.dog_stages.nunique()
Out[720]:
7
In [721]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2355
Data columns (total 9 columns):
tweet_id         2075 non-null object
timestamp        2075 non-null datetime64[ns]
source           2075 non-null category
text             2075 non-null object
expanded_urls    2072 non-null object
name             1384 non-null object
rating           2075 non-null float64
gender           1399 non-null float64
dog_stages       335 non-null category
dtypes: category(2), datetime64[ns](1), float64(2), object(4)
memory usage: 134.3+ KB

image_clean

Define

  • Column names are not informative. Change the p1 column names to more informative names.
  • Drop the other predictions. They are not useful.
  • Capitalize the breed name in p1 (best prediction).
  • Change breed to categorical by using .astype('category')
  • Change tweet_id to str

Code

In [722]:
image_clean.rename(columns = {'img_num': 'nof_images', 
                               'p1': 'best_prediction', 
                               'p1_conf': 'confidence', 
                               'p1_dog': 'dog_or_not'}, inplace = True)
In [723]:
image_clean.best_prediction = image_clean.best_prediction.str.capitalize()
In [724]:
image_clean.best_prediction = image_clean.best_prediction.astype('category')
In [725]:
image_clean.drop(['p2', 'p2_conf', 'p2_dog', 'p3', 
                  'p3_conf', 'p3_dog'], inplace = True, axis = 1)
In [726]:
# Reorder columns
image_clean = image_clean[['tweet_id', 'best_prediction', 'confidence', 
                           'dog_or_not', 'nof_images', 'jpg_url']]
In [727]:
image_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 6 columns):
tweet_id           2075 non-null int64
best_prediction    2075 non-null category
confidence         2075 non-null float64
dog_or_not         2075 non-null bool
nof_images         2075 non-null int64
jpg_url            2075 non-null object
dtypes: bool(1), category(1), float64(1), int64(2), object(1)
memory usage: 84.0+ KB

Test

In [728]:
image_clean.head(2)
Out[728]:
tweet_id best_prediction confidence dog_or_not nof_images jpg_url
0 666020888022790149 Welsh_springer_spaniel 0.465074 True 1 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
1 666029285002620928 Redbone 0.506826 True 1 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg
In [729]:
image_clean.best_prediction.value_counts().head(5)
Out[729]:
Golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
Pug                    57
Name: best_prediction, dtype: int64
In [730]:
image_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 6 columns):
tweet_id           2075 non-null int64
best_prediction    2075 non-null category
confidence         2075 non-null float64
dog_or_not         2075 non-null bool
nof_images         2075 non-null int64
jpg_url            2075 non-null object
dtypes: bool(1), category(1), float64(1), int64(2), object(1)
memory usage: 84.0+ KB

Define

  • tweet_id is numeric. Change it to string using .astype(str).

Code

In [731]:
image_clean.tweet_id = image_clean.tweet_id.astype(str)

Test

In [732]:
image_clean.tweet_id.sample(5)
Out[732]:
1734    821522889702862852
1530    789530877013393408
1944    861769973181624320
2064    890006608113172480
598     679722016581222400
Name: tweet_id, dtype: object
In [733]:
# Produce a mask that selects tweet_ids whose length is != 18.
# This should create an empty Series.
mask = (image_clean.tweet_id.str.len() != 18)
test = image_clean.tweet_id.loc[mask]
test.head()
Out[733]:
Series([], Name: tweet_id, dtype: object)

counts_clean

Define

  • tweet_id is numeric. Change it to string using .astype(str).

Code

In [734]:
counts_clean.tweet_id = counts_clean.tweet_id.astype(str)

Test

In [735]:
counts_clean.tweet_id.sample(5)
Out[735]:
929     753026973505581056
2123    670069087419133954
1902    674372068062928900
1151    723688335806480385
1435    696490539101908992
Name: tweet_id, dtype: object
In [736]:
# Produce a mask that selects tweet_ids whose length is != 18.
# This should create an empty Series.
mask = (counts_clean.tweet_id.str.len() != 18)
test = counts_clean.tweet_id.loc[mask]
test.head()
Out[736]:
Series([], Name: tweet_id, dtype: object)

Tidiness:

Define

  • Combine all 3 tables together and drop the columns that are not useful for analysis. Accomplish the drop by excluding columns. Rename any "funny" column names.
In [737]:
twitter_archive_master = archive_clean.merge(counts_clean, on = 'tweet_id')
twitter_archive_master = twitter_archive_master.merge(image_clean, 
  on = 'tweet_id', how = 'left')
twitter_archive_master.drop(['source', 'text', 'expanded_urls'], 
  inplace = True, axis = 1)

Test

In [738]:
twitter_archive_master.head(0)
Out[738]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
In [739]:
twitter_archive_master.shape
Out[739]:
(2075, 13)
In [740]:
# Want to find out the overlap between twitter_archive_master and image_clean.
c = twitter_archive_master.tweet_id.tolist()
c = set(c)
ic = image_clean.tweet_id.tolist()
ic = set(ic)
u = set.intersection(c, ic)
len(u)
Out[740]:
1949
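The overlap count above is just a set intersection; a toy sketch with made-up ids:

```python
# Toy id sets standing in for the two tweet_id columns.
master_ids = {'1', '2', '3', '4'}
image_ids = {'2', '3', '5'}

overlap = master_ids & image_ids  # same as set.intersection
print(len(overlap))  # 2
```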

Store

  • Store all of the files on disk as .csv and in sqlite.
In [741]:
# Store tables on disk.
twitter_archive_master.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)
archive_clean.to_csv('archive_clean.csv', encoding='utf-8', index=False)
image_clean.to_csv('image_clean.csv', encoding='utf-8', index=False)
counts_clean.to_csv('counts_clean.csv', encoding='utf-8', index=False)
In [742]:
# For practice, also store in a database.
# Connect to the sqlite database
database = 'weratedogs.db'
conn = sqlite3.connect(database)
In [743]:
twitter_archive_master.to_sql('twitter_archive_master', conn, if_exists = 'replace', index = False)
archive_clean.to_sql('archive_clean', conn, if_exists = 'replace', index = False)
image_clean.to_sql('image_clean', conn, if_exists = 'replace', index = False)
counts_clean.to_sql('counts_clean', conn, if_exists = 'replace', index = False)

Test

In [744]:
# Make sure they all read back in from sqlite db.
twitter_archive_master = pd.read_sql('SELECT * FROM twitter_archive_master', conn) 
archive_clean = pd.read_sql('SELECT * FROM archive_clean', conn) 
image_clean = pd.read_sql('SELECT * FROM image_clean', conn)
counts_clean = pd.read_sql('SELECT * FROM counts_clean', conn)
In [745]:
twitter_archive_master.head(1)
Out[745]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
0 892420643555336193 2017-08-01 16:23:56 Phineas 1.3 0.0 None 38690 8558 Orange 0.097049 0.0 1.0 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg
In [746]:
archive_clean.head(1)
Out[746]:
tweet_id timestamp source text expanded_urls name rating gender dog_stages
0 892420643555336193 2017-08-01 16:23:56 iPhone This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 Phineas 1.3 0.0 None
In [747]:
image_clean.head(1)
Out[747]:
tweet_id best_prediction confidence dog_or_not nof_images jpg_url
0 666020888022790149 Welsh_springer_spaniel 0.465074 1 1 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
In [748]:
# Check and make sure they read back fine from .csv.
twitter_archive_master = pd.read_csv(
    'twitter_archive_master.csv', encoding='utf-8')
archive_clean = pd.read_csv(
    'archive_clean.csv', encoding='utf-8')
image_clean = pd.read_csv(
    'image_clean.csv', encoding='utf-8')
counts_clean = pd.read_csv(
    'counts_clean.csv', encoding='utf-8')
In [749]:
twitter_archive_master.head(1)
Out[749]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
0 892420643555336193 2017-08-01 16:23:56 Phineas 1.3 0.0 NaN 38690 8558 Orange 0.097049 False 1.0 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg
In [750]:
# Is coming back as numeric. Should be string. 
twitter_archive_master.tweet_id = twitter_archive_master.tweet_id.astype(str)
In [751]:
archive_clean.head(1)
Out[751]:
tweet_id timestamp source text expanded_urls name rating gender dog_stages
0 892420643555336193 2017-08-01 16:23:56 iPhone This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 Phineas 1.3 0.0 NaN
In [752]:
image_clean.head(1)
Out[752]:
tweet_id best_prediction confidence dog_or_not nof_images jpg_url
0 666020888022790149 Welsh_springer_spaniel 0.465074 True 1 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
In [753]:
counts_clean.head(1)
Out[753]:
tweet_id favorites retweets
0 892420643555336193 38690 8558

Analyzing and Visualizing Data

Data Distribution

Let's take a look at each factor's data distribution and type.

In [755]:
twitter_archive_master.describe()
Out[755]:
rating gender favorites retweets confidence nof_images
count 2075.000000 1399.000000 2075.000000 2075.000000 1949.000000 1949.000000
mean 1.060627 0.247320 8906.798072 2774.800000 0.594315 1.202668
std 0.215384 0.431608 12519.998157 4708.667753 0.272293 0.560024
min 0.000000 0.000000 80.000000 13.000000 0.044333 1.000000
25% 1.000000 0.000000 1987.000000 617.500000 0.362596 1.000000
50% 1.100000 0.000000 4091.000000 1351.000000 0.587764 1.000000
75% 1.200000 0.000000 11203.500000 3185.000000 0.846986 1.000000
max 1.400000 1.000000 142993.000000 77130.000000 1.000000 4.000000

We can see that some dogs get 0s (must be a very bad dog) and some get 14/10 (1.4) in this column. There is a very large range for both favorites and retweets; it makes you wonder what those maximum numbers represent! The confidence factor is a little troubling, with quite a large range in values. nof_images tells us that there is a minimum of 1 picture and a maximum of 4, with only 1949/2075 rows having pictures.

In [756]:
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 13 columns):
tweet_id           2075 non-null object
timestamp          2075 non-null object
name               1384 non-null object
rating             2075 non-null float64
gender             1399 non-null float64
dog_stages         335 non-null object
favorites          2075 non-null int64
retweets           2075 non-null int64
best_prediction    1949 non-null object
confidence         1949 non-null float64
dog_or_not         1949 non-null object
nof_images         1949 non-null float64
jpg_url            1949 non-null object
dtypes: float64(4), int64(2), object(7)
memory usage: 210.8+ KB

Of special note here is the timestamp field, which has come back from the round trip as a plain object rather than a datetime; it would need pd.to_datetime applied again before any time-based analysis. dog_stages only has 335 rows. This will make it problematic to use this factor effectively.

In [757]:
twitter_archive_master.head()
Out[757]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
0 892420643555336193 2017-08-01 16:23:56 Phineas 1.3 0.0 NaN 38690 8558 Orange 0.097049 False 1.0 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg
1 892177421306343426 2017-08-01 00:17:27 Tilly 1.3 1.0 NaN 33164 6292 Chihuahua 0.323581 True 1.0 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg
2 891815181378084864 2017-07-31 00:18:03 Archie 1.2 0.0 NaN 24963 4174 Chihuahua 0.716012 True 1.0 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg
3 891689557279858688 2017-07-30 15:58:51 Darla 1.3 1.0 NaN 42070 8681 Paper_towel 0.170278 False 1.0 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg
4 891327558926688256 2017-07-29 16:00:24 Franklin 1.2 0.0 NaN 40226 9451 Basset 0.555712 True 2.0 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg

These values are expected based on our cleaning. Of special note, though, is dog_or_not. When you retrieve it from a .csv file it keeps its True/False values, but when it is stored in sqlite and read back it comes back as 1 for True and 0 for False. This potentially changes the way you code against it in this project.
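That SQLite behavior can be demonstrated with the standard library alone; a minimal sketch using an in-memory database:

```python
import sqlite3

# SQLite has no boolean storage class: Python bools are adapted to
# integers on the way in and come back as plain ints.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (dog_or_not BOOLEAN)')
conn.executemany('INSERT INTO t VALUES (?)', [(True,), (False,)])
rows = [r[0] for r in conn.execute('SELECT dog_or_not FROM t ORDER BY rowid')]
print(rows)  # [1, 0]
conn.close()
```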

In [758]:
# Ratings
plt.figure(figsize = (8, 5))
plt.xlabel('Ratings')
plt.ylabel('Count')
bins = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1, 1.1, 1.2, 1.3, 1.4]
plt.hist(x = twitter_archive_master.rating, 
         edgecolor='white', linewidth=8, bins = bins)
plt.title('Histogram of Dog Ratings');

Heavily skewed left: the mass sits at high ratings with the long tail toward the low ones. It seems that the vast majority of dogs are 10s or above:) I wonder who the below .5s (5/10) are:)

In [759]:
twitter_archive_master.gender.value_counts()
Out[759]:
0.0    1053
1.0     346
Name: gender, dtype: int64
In [760]:
mfg = twitter_archive_master.groupby('gender').count()
mfg.tweet_id.head()
Out[760]:
gender
0.0    1053
1.0     346
Name: tweet_id, dtype: int64
In [761]:
twitter_archive_master.gender.mean()
Out[761]:
0.24731951393852752

Males make up 75% of the classified tweets in this archive. That does not mean males make up 75% of all "dogs", but roughly 3/4 of the tweets we could classify are about male dogs.
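Because gender is coded 0/1, the column mean is exactly the proportion of females; a toy sketch:

```python
# gender is coded 0 (male) / 1 (female), so the column mean is the
# fraction of females; here 1 of 4 tweets is female.
gender = [0, 0, 0, 1]
proportion_female = sum(gender) / len(gender)
print(proportion_female)  # 0.25
```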

In [762]:
# Histograms of favorites and retweets
plt.figure(figsize=(8, 5))
plt.xlim(0, 60000)
plt.ylabel('Count')
plt.hist(twitter_archive_master.favorites, alpha=.5, label='# Favorites')
plt.hist(twitter_archive_master.retweets, alpha=.5, label='# Retweets')
plt.title('Distribution of Favorites and Retweets Counts')
plt.legend();