Contact Information

  • Lindsay Moir
  • 778 679 4406
  • tragoes@gmail.com

Project Description

We are using a Twitter archive from @dog_rates, provided as a .csv file. Tweepy is used to interface with the Twitter API and retrieve JSON data. Finally, Udacity has provided a file of image-based dog classifications (.tsv), which we download via an HTTP request. This project covers gathering, cleaning, storing, and analyzing the data. We are most interested in which breeds this population of users likes and how their ratings of those breeds have changed over time.

In [1]:
# Import libraries
import collections
import datetime
from ggplot import *
import json
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib_venn import venn2, venn2_circles
from matplotlib_venn import venn3, venn3_circles
import numpy as np
from numpy.polynomial.polynomial import polyfit
import pandas as pd
import requests
import seaborn as sns
sns.set_style('whitegrid')
sns.set_palette("bright", 10)
import sqlite3
import tweepy
C:\Users\linds\Anaconda2\envs\ipykernel_py3\lib\site-packages\ggplot\utils.py:81: FutureWarning: pandas.tslib is deprecated and will be removed in a future version.
You can access Timestamp as pandas.Timestamp
  pd.tslib.Timestamp,
C:\Users\linds\Anaconda2\envs\ipykernel_py3\lib\site-packages\ggplot\stats\smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp
C:\Users\linds\Anaconda2\envs\ipykernel_py3\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Gather

twitter_archive

In [2]:
# Load WeRateDogs enhanced twitter archive from Udacity.
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head(2)
Out[2]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
In [3]:
twitter_archive.shape
Out[3]:
(2356, 17)

image_predictions

There is also an image file that contains tweet ids and urls for images. For example, this url gives you Stuart's picture (the dog): https://twitter.com/dog_rates/status/889531135344209921. The part after "status" is the tweet id for Stuart. Udacity wants us to download this file programmatically via the requests library.
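For instance, a minimal sketch (plain Python, using the example URL above) of pulling the tweet id out of a status URL:

url = 'https://twitter.com/dog_rates/status/889531135344209921'
tweet_id = url.split('/status/')[-1].split('/')[0]  # -> '889531135344209921'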

In [4]:
# This loads the contents of this file into response
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
In [5]:
# Save .tsv to file for future use.
with open("image_predictions.tsv", mode = 'wb') as file:
    file.write(response.content)
In [6]:
# Now bring image_predictions.tsv back into memory.
image_predictions = pd.read_csv('image_predictions.tsv', sep = '\t')
image_predictions.head()
Out[6]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [7]:
len(image_predictions)
Out[7]:
2075

twitter_counts

We need to set up access to the Twitter API via Tweepy.

In [8]:
# This authorizes us to use the twitter api.
# Key values redacted; real credentials should never be committed to a notebook.
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, parser=tweepy.parsers.JSONParser(), 
                 wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
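
Note: the key values above are redacted. A safer pattern (a sketch, assuming a git-ignored twitter_keys.json holding the four fields) is to load the credentials at runtime rather than hardcoding them:

import json
with open('twitter_keys.json') as f:  # hypothetical, git-ignored file
    keys = json.load(f)
auth = tweepy.OAuthHandler(keys['consumer_key'], keys['consumer_secret'])
auth.set_access_token(keys['access_token'], keys['access_secret'])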
In [9]:
# The api runs on tweet_id. We need a list of tweets to get the json data. 
# We are using the tweet_id from the archive.
"""
twitter_list = []
errors = []
ta_tweet_id_list = twitter_archive.tweet_id.tolist()
status_list = range(1, len(ta_tweet_id_list), 100)

# For loop to retrieve tweets in json format. We expect errors.
for index, id in enumerate(ta_tweet_id_list):
    try:
        page = api.get_status(id)
        favorites = page['favorite_count']
        retweets = page['retweet_count']
        twitter_list.append({'tweet_id': int(id),
                        'favorites': int(favorites),
                        'retweets': int(retweets)})
        if index in status_list:
            print('Completed ~', index, ' Length of twitter_list is', 
                  len(twitter_list), ' Length of errors list is', 
                  len(errors))
        else:
            pass
    except Exception as e:
        errors.append((id, e))

# Put the list of dictionaries into twitter_counts with tweet_id first.
twitter_counts = pd.DataFrame.from_dict(twitter_list)
# Reorder the columns
twitter_counts = twitter_counts[['tweet_id', 'favorites', 'retweets']]       

# Store twitter_counts on disk now that you have retrieved the data.
# It is time consuming to download this from twitter.
twitter_counts.to_csv('twitter_counts.csv', encoding='utf-8', index=False)
"""
In [10]:
# Check and make sure it reads in OK
twitter_counts = pd.read_csv('twitter_counts.csv', encoding='utf-8')
twitter_counts.head()
Out[10]:
tweet_id favorites retweets
0 892420643555336193 38690 8558
1 892177421306343426 33164 6292
2 891815181378084864 24963 4174
3 891689557279858688 42070 8681
4 891327558926688256 40226 9451

Assess

General

In [11]:
# Sanity check: tweet_ids should be 18 characters everywhere 
# (Excel mangles the final digits of long ids).
print('length of twitter_archive.tweet_id in excel spreadsheet is', 
    len('892420643555336000'))
print('length of twitter_archive.tweet_id is', 
    len('892177421306343426'))
print('length of image_predictions.tweet_id is', 
    len('666020888022790149')) 
print('length of twitter_archive tweet_id in expanded_urls is', 
    len('892420643555336193'))
length of twitter_archive.tweet_id in excel spreadsheet is 18
length of twitter_archive.tweet_id is 18
length of image_predictions.tweet_id is 18
length of twitter_archive tweet_id in expanded_urls is 18
In [12]:
# See if there are any duplicate column names amongst the dataframes.
columns = pd.Series(list(twitter_archive) + 
    list(image_predictions) + list(twitter_counts))
columns[columns.duplicated()]
Out[12]:
17    tweet_id
29    tweet_id
dtype: object
In [13]:
# Let's see what the intersection of twitter_archive and image_predictions is.
# Want to see if anything is surprising.
taid = set(twitter_archive.tweet_id)
len(taid)
Out[13]:
2356
In [14]:
ipid = set(image_predictions.tweet_id)
len(ipid)
Out[14]:
2075
In [15]:
common = taid.intersection(ipid)
len(common)
Out[15]:
2075
In [16]:
not_common = taid - ipid  # both are already sets
len(not_common)
Out[16]:
281
In [17]:
# Math works. No problem.
len(common) + len(not_common)
Out[17]:
2356

twitter_archive

In [18]:
twitter_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [19]:
twitter_archive.sample(5)
Out[19]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1797 677269281705472000 NaN NaN 2015-12-16 23:29:14 +0000 <a href="http://twitter.com/download/iphone" r... This is the happiest pupper I've ever seen. 10... NaN NaN NaN https://twitter.com/dog_rates/status/677269281... 10 10 the None None pupper None
119 869772420881756160 NaN NaN 2017-05-31 04:27:59 +0000 <a href="http://twitter.com/download/iphone" r... This is Dewey (pronounced "covfefe"). He's hav... NaN NaN NaN https://twitter.com/dog_rates/status/869772420... 13 10 Dewey None None None None
1140 727685679342333952 NaN NaN 2016-05-04 02:26:00 +0000 <a href="http://twitter.com/download/iphone" r... This is Cilantro. She's a Fellation Gadzooks. ... NaN NaN NaN https://twitter.com/dog_rates/status/727685679... 12 10 Cilantro None None None None
1328 705970349788291072 NaN NaN 2016-03-05 04:17:02 +0000 <a href="http://twitter.com/download/iphone" r... This is Lucy. She's a Venetian Kerploof. Suppo... NaN NaN NaN https://twitter.com/dog_rates/status/705970349... 12 10 Lucy None None None None
1032 745314880350101504 NaN NaN 2016-06-21 17:58:09 +0000 <a href="http://twitter.com/download/iphone" r... This is Sugar. She excels underwater. 12/10 ph... NaN NaN NaN https://twitter.com/dog_rates/status/745314880... 12 10 Sugar None None None None
In [20]:
# Check for duplicated tweet_id's.
twitter_archive[twitter_archive.tweet_id.duplicated()]
Out[20]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

After visually inspecting the text field, we may be able to extract gender.
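
As a quick feasibility sketch (the real extraction happens in the Cleaning section), we can count how many tweets contain an obvious gendered pronoun:

twitter_archive.text.str.contains(r'\b(?:he|she)\b', case=False).sum()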

In [21]:
# Most of the ratings use 10 as a denominator. How many do not?
nof_not_10_as_denom = twitter_archive.rating_denominator.value_counts().tolist()
sum(nof_not_10_as_denom) - nof_not_10_as_denom[0]
Out[21]:
23
In [22]:
pd.set_option('display.max_colwidth', 200)
funny_ratings_denom = twitter_archive[
    twitter_archive.rating_denominator != 10]
funny_ratings_denom = funny_ratings_denom[['tweet_id', 'text', 
    'rating_numerator', 'rating_denominator']]
funny_ratings_denom.sample(5)
Out[22]:
tweet_id text rating_numerator rating_denominator
1202 716439118184652801 This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq 50 50
1351 704054845121142784 Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa 60 50
433 820690176645140481 The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd 84 70
1165 722974582966214656 Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a 4 20
1662 682962037429899265 This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 7 11
In [23]:
len(funny_ratings_denom)
Out[23]:
23
In [24]:
twitter_archive.rating_numerator.value_counts().head(10)
Out[24]:
12    558
11    464
10    461
13    351
9     158
8     102
7      55
14     54
5      37
6      32
Name: rating_numerator, dtype: int64

Atticus

Time for some fun. Which is the most extreme outlier for numerator?

In [25]:
twitter_archive.rating_numerator.max()
Out[25]:
1776
In [26]:
# Check suspicious numerators
funny_ratings_num = twitter_archive[
    twitter_archive.rating_numerator > 14]
funny_ratings_num = funny_ratings_num[['tweet_id', 'text', 
    'rating_numerator', 'rating_denominator']] 
funny_ratings_num.sort_values('rating_numerator', 
    ascending = False, inplace = True)
funny_ratings_num.head()
Out[26]:
tweet_id text rating_numerator rating_denominator
979 749981277374128128 This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh 1776 10
313 835246439529840640 @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho 960 0
189 855860136149123072 @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10 666 10
2074 670842764863651840 After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY 420 10
188 855862651834028034 @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research 420 10

Let's see who Atticus is. [Image: Atticus, the 1776/10 dog.]

Now we know why Atticus was the "top dog" at 1776. That is the year that the USA became an independent nation!

In [27]:
len(funny_ratings_num)
Out[27]:
28
In [28]:
twitter_archive.source.value_counts()
Out[28]:
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64
In [29]:
twitter_archive.name.value_counts().head(10)
Out[29]:
None       745
a           55
Charlie     12
Lucy        11
Cooper      11
Oliver      11
Penny       10
Tucker      10
Lola        10
Bo           9
Name: name, dtype: int64

The above is ugly. The name column is littered with lowercase articles, prepositions, and other non-names.

image_predictions

In [30]:
image_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [31]:
image_predictions.head()
Out[31]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [32]:
image_predictions.sample(5)
Out[32]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
1279 750117059602808832 https://pbs.twimg.com/media/Cmjzc-oWEAESFCm.jpg 2 Shih-Tzu 0.814405 True Lhasa 0.175220 True Pekinese 0.008072 True
1238 746818907684614144 https://pbs.twimg.com/media/Cl071YVWEAAlF7N.jpg 1 dingo 0.175518 False timber_wolf 0.133647 False Ibizan_hound 0.101537 True
215 670055038660800512 https://pbs.twimg.com/media/CUyDgChWUAAmNSI.jpg 1 snail 0.563631 False slug 0.296649 False bolete 0.031839 False
67 667177989038297088 https://pbs.twimg.com/media/CUJK18UWEAEg7AR.jpg 1 vizsla 0.259249 True Chesapeake_Bay_retriever 0.176293 True Weimaraner 0.112369 True
2017 880095782870896641 https://pbs.twimg.com/media/DDa6ckbXgAAM1vV.jpg 1 miniature_pinscher 0.120298 True Rhodesian_ridgeback 0.106395 True beagle 0.106073 True
In [33]:
# Let's see if there are any odd spellings that cause a problem.
image_predictions.p1.nunique()
Out[33]:
378
In [34]:
image_predictions.p1.str.lower().nunique()
Out[34]:
378
In [35]:
# Let's see if there are any odd spellings that cause a problem.
image_predictions.p2.nunique()
Out[35]:
405
In [36]:
image_predictions.p2.str.lower().nunique()
Out[36]:
404
In [37]:
original = image_predictions.p2.unique().tolist()
lowercase = image_predictions.p2.str.lower().unique().tolist()
original = [x.lower() for x in original]
In [38]:
# Well cardigan is a "problem". I don't think this is serious at all.
print([item for item, count in collections.Counter(original).items() if count > 1])
['cardigan']
In [39]:
# Let's see if there are any odd spellings that cause a problem.
image_predictions.p3.nunique()
Out[39]:
408
In [40]:
image_predictions.p3.str.lower().nunique()
Out[40]:
408
In [41]:
# There are a few jpg urls that are identical. Mmm, not sure what that means. 
image_predictions[image_predictions.jpg_url.duplicated()].count()
Out[41]:
tweet_id    66
jpg_url     66
img_num     66
p1          66
p1_conf     66
p1_dog      66
p2          66
p2_conf     66
p2_dog      66
p3          66
p3_conf     66
p3_dog      66
dtype: int64
In [42]:
image_predictions[image_predictions.jpg_url.duplicated(keep = False)].sort_values(
    by = 'jpg_url').head()
Out[42]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
480 675354435921575936 https://pbs.twimg.com/ext_tw_video_thumb/675354114423808004/pu/img/qL1R_nGLqa6lmkOx.jpg 1 upright 0.303415 False golden_retriever 0.181351 True Brittany_spaniel 0.162084 True
1297 752309394570878976 https://pbs.twimg.com/ext_tw_video_thumb/675354114423808004/pu/img/qL1R_nGLqa6lmkOx.jpg 1 upright 0.303415 False golden_retriever 0.181351 True Brittany_spaniel 0.162084 True
1864 842892208864923648 https://pbs.twimg.com/ext_tw_video_thumb/807106774843039744/pu/img/8XZg1xW35Xp2J6JW.jpg 1 Chihuahua 0.505370 True Pomeranian 0.120358 True toy_terrier 0.077008 True
1641 807106840509214720 https://pbs.twimg.com/ext_tw_video_thumb/807106774843039744/pu/img/8XZg1xW35Xp2J6JW.jpg 1 Chihuahua 0.505370 True Pomeranian 0.120358 True toy_terrier 0.077008 True
1703 817181837579653120 https://pbs.twimg.com/ext_tw_video_thumb/815965888126062592/pu/img/JleSw4wRhgKDWQj5.jpg 1 Tibetan_mastiff 0.506312 True Tibetan_terrier 0.295690 True otterhound 0.036251 True
In [43]:
twitter_archive[twitter_archive.tweet_id == 675354435921575936].head()
Out[43]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1865 675354435921575936 NaN NaN 2015-12-11 16:40:19 +0000 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> Everyone needs to watch this. 13/10 https://t.co/Bb3xnpsWBC NaN NaN NaN https://twitter.com/dog_rates/status/675354435921575936/video/1 13 10 None None None None None
In [44]:
twitter_archive[twitter_archive.tweet_id == 752309394570878976].head()
Out[44]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
949 752309394570878976 NaN NaN 2016-07-11 01:11:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: Everyone needs to watch this. 13/10 https://t.co/Bb3xnpsWBC 6.753544e+17 4.196984e+09 2015-12-11 16:40:19 +0000 https://twitter.com/dog_rates/status/675354435921575936/video/1,https://twitter.com/dog_rates/status/675354435921575936/video/1 13 10 None None None None None

We need to be careful when we do the data cleaning. These repeated jpg urls come from retweets, each of which creates a second entry in the image_predictions file. Removing the retweets should remove these duplicates as well, but we need to verify that.
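
As a quick check (a sketch, using the dataframes as named above), we can look up the tweet_ids behind the duplicated jpg_urls and count how many carry a non-null retweeted_status_id in the archive:

dup_ids = image_predictions.loc[
    image_predictions.jpg_url.duplicated(keep=False), 'tweet_id']
twitter_archive[twitter_archive.tweet_id.isin(dup_ids)] \
    .retweeted_status_id.notnull().sum()

If the retweet filter in the Cleaning section will catch them, roughly half of these rows should be flagged (one original plus one retweet per duplicated url).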

twitter_counts

In [45]:
twitter_counts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2345 entries, 0 to 2344
Data columns (total 3 columns):
tweet_id     2345 non-null int64
favorites    2345 non-null int64
retweets     2345 non-null int64
dtypes: int64(3)
memory usage: 55.0 KB
In [46]:
twitter_counts.head()
Out[46]:
tweet_id favorites retweets
0 892420643555336193 38690 8558
1 892177421306343426 33164 6292
2 891815181378084864 24963 4174
3 891689557279858688 42070 8681
4 891327558926688256 40226 9451
In [47]:
twitter_counts[twitter_counts.tweet_id.duplicated()]
Out[47]:
tweet_id favorites retweets

Quality

twitter_archive

  • Retweets should be removed.
  • Retweet-associated columns are not needed (we only want original tweets with images). The irrelevant columns are 'in_reply_to_user_id', 'in_reply_to_status_id', 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp'.
  • Erroneous datatype for timestamp.
  • Gender can likely be extracted from the text column.
  • Erroneous rating_numerator and rating_denominator values; several ratings are simply done wrong.
  • Columns doggo, floofer, pupper, and puppo use the string "None"; these should be nulls. We should go on a "None" hunt across the remaining columns.
  • source has only 4 unique values, each wrapped in an html anchor tag, which makes the column hard to read.
  • name is often not an actual name: "None" and "a" are the two most common values, and several articles and prepositions (the, an, this, etc.) appear. Capitalization is also inconsistent.
  • tweet_id should be string.
  • There is one column per dog stage; this should be a single column with 4 categorical choices.

image_predictions

  • Column names are not informative.
  • Capitalize breed names.
  • tweet_id is numeric. Should be string.
  • Change breed to categorical.

twitter_counts

  • tweet_id is numeric. Should be string.

Flagged but not cleaned

  • In twitter_archive clean the url from the tweet text.
  • In p2 cardigan is a "problem". I don't think this is serious at all.
  • In image_clean img_num could be converted to integer but ...

Tidiness:

  • We only need one table, built from a subset of these 3 tables.
  • There are 2075 rows in image_predictions and 2356 rows in twitter_archive.
  • We can merge twitter_counts with the twitter_archive table on tweet_id as well.
  • The only image_predictions columns with useful information are tweet_id, p1, and p1_dog, so we can simply merge this subset with the twitter_archive table. Let's put this large table on a diet.

Cleaning

In [184]:
# Make copies of files
archive_clean = twitter_archive.copy()
In [185]:
image_clean = image_predictions.copy()
In [186]:
counts_clean = twitter_counts.copy()

archive_clean

Define

  • If retweeted_status_user_id is null, the tweet is NOT a retweet; if in_reply_to_user_id is null, it is NOT a reply. Keep only the tweets where both are null and put them in a new version of archive_clean.
  • The retweet- and reply-related columns are then no longer needed (we only want original tweets with images): in_reply_to_user_id, in_reply_to_status_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp.

Code

In [187]:
archive_clean = archive_clean[
    archive_clean.retweeted_status_user_id.isnull()]
archive_clean = archive_clean[
    archive_clean.in_reply_to_user_id.isnull()]
archive_clean.drop(['in_reply_to_user_id', 'in_reply_to_status_id', 
    'retweeted_status_id', 'retweeted_status_user_id',
    'retweeted_status_timestamp'], inplace = True, axis = 1) 

Test

In [188]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null object
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB
In [189]:
archive_clean.head(2)
Out[189]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 13 10 Phineas None None None None
1 892177421306343426 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV https://twitter.com/dog_rates/status/892177421306343426/photo/1 13 10 Tilly None None None None

Define

  • Erroneous datatype for timestamp.

Code

In [190]:
archive_clean.timestamp = pd.to_datetime(
    archive_clean.timestamp, infer_datetime_format=True)

Test

In [191]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 213.0+ KB

Define

  • Erroneous rating_numerator and rating_denominator values; several ratings are done wrong. I defined funny_ratings earlier. Keep the rows where the numerator is < 15 and the denominator == 10; delete the rest. We are doing programmatic cleaning, not visual (as far as possible).
  • Create a rating column as the numerator divided by the denominator, then drop the numerator and denominator columns.

Code

In [192]:
# Subset that has only valid ratings.
archive_clean = archive_clean[archive_clean.rating_numerator < 15]
archive_clean = archive_clean[archive_clean.rating_denominator == 10]
In [193]:
archive_clean['rating'] = archive_clean.rating_numerator / archive_clean.rating_denominator
In [194]:
archive_clean.drop(['rating_numerator', 'rating_denominator'], 
   inplace = True, axis = 1)

Test

In [195]:
len(archive_clean)
Out[195]:
2075
In [196]:
archive_clean.head(0)
Out[196]:
tweet_id timestamp source text expanded_urls name doggo floofer pupper puppo rating

Define

  • Extract gender from the text field by defining gender word lists, matching them against the text, and creating a gender column with 0 for male, 1 for female, and NaN for unknown.

Code

In [197]:
# Set up gender word lists.
male = ['boy', 'he', "he's", 'hes', 'him', 'himself', 'his',
        'male', 'stud']
female = ['bitch', 'female', 'girl', 'her', 'hers', "her's",
          'herself', 'she', 'shes', "she's"]

# Create calc_gender function: 0 = male, 1 = female, NaN = unknown.
def calc_gender(text):
    text = text.replace("'", "") # Encoding problems (sigh)
    for word in text.lower().split():
        if word in male:
            return 0
        if word in female:
            return 1
    return np.nan # No gender word found
In [198]:
# Apply the function.
archive_clean['gender'] = archive_clean['text'].apply(calc_gender)

Test

In [199]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id         2075 non-null int64
timestamp        2075 non-null datetime64[ns]
source           2075 non-null object
text             2075 non-null object
expanded_urls    2072 non-null object
name             2075 non-null object
doggo            2075 non-null object
floofer          2075 non-null object
pupper           2075 non-null object
puppo            2075 non-null object
rating           2075 non-null float64
gender           1399 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(8)
memory usage: 210.7+ KB
In [200]:
archive_clean.gender.head(10)
Out[200]:
0    0.0
1    1.0
2    0.0
3    1.0
4    0.0
5    NaN
6    0.0
7    0.0
8    1.0
9    1.0
Name: gender, dtype: float64
In [201]:
archive_clean.text.head(10)
Out[201]:
0                                                             This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
1        This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
2                         This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
3                                                                   This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
4        This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
5        Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh
6    Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl
7                            When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq
8                        This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b
9             This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A
Name: text, dtype: object

These all match up to the 0 (male), 1 (female), and NaN values on the first 10 records. That is pretty good: 1399 records out of 2075 classified!
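
As a one-line cross-check of that 1399 figure (a sketch), the non-null count can be read off directly:

archive_clean.gender.notnull().sum()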

Define

  • Columns name, doggo, floofer, pupper, and puppo use the string "None"; these should be nulls. Change all "None" to np.nan.

Code

In [202]:
columns = ['name', 'doggo', 'floofer', 'pupper', 'puppo']
for column in columns:
    archive_clean[column].replace("None", np.nan, inplace = True)  

Test

In [203]:
for column in columns:
    print(archive_clean[archive_clean[column] == "None"])
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []
Empty DataFrame
Columns: [tweet_id, timestamp, source, text, expanded_urls, name, doggo, floofer, pupper, puppo, rating, gender]
Index: []

Define

  • source has only 4 unique values, each wrapped in an html anchor tag. Change this to a categorical feature with 4 levels.

Code

In [204]:
archive_clean.source.unique().tolist()
Out[204]:
['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
 '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>']
In [205]:
source_html = archive_clean.source.unique().tolist()
replace_html = ["iPhone", "Twitter", "Vine", "TweetDeck"]
archive_clean.source.replace(
    source_html, replace_html, inplace = True)
archive_clean.source = archive_clean.source.astype('category')

Test

In [206]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id         2075 non-null int64
timestamp        2075 non-null datetime64[ns]
source           2075 non-null category
text             2075 non-null object
expanded_urls    2072 non-null object
name             1485 non-null object
doggo            83 non-null object
floofer          10 non-null object
pupper           229 non-null object
puppo            24 non-null object
rating           2075 non-null float64
gender           1399 non-null float64
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 196.7+ KB

Define

  • name is often not an actual name: "None" and "a" are the two most common values, along with several articles and prepositions (the, an, this, etc.). Replace all prepositions, "None", and words of the same ilk with np.nan, then normalize the remaining names with str.capitalize.

Code

In [207]:
# From nltk plus some of the words that are in name that are not useful
stopwords = ['O', 'a', 'a', 'about', 'above', 'after', 'again', 
             'against', 'all', 'all', 'am', 'an', 'an', 'and', 
             'any', 'are', 'as', 'at', 'at', 'be', 'because', 
             'been', 'before', 'being', 'below', 'between', 
             'both', 'but', 'by', 'by', 'can', 'did', 'do', 
             'does', 'doing', 'don', 'down', 'during', 'each', 
             'few', 'for', 'from', 'further', 'had', 'has', 
             'have', 'having', 'he', 'her', 'here', 'hers', 
             'herself', 'him', 'himself', 'his', 'how', 'i', 
             'if', 'in', 'into', 'is', 'it', 'its', 'itself', 
             'just', 'just', 'life', 'light', 'me', 'more', 
             'most', 'my', 'my', 'myself', 'no', 'nor', 'not', 
             'not', 'now', 'of', 'off', 'old', 'on', 'once', 
             'only', 'or', 'other', 'our', 'ours', 'ourselves', 
             'out', 'over', 'own', 'quite', 's', 'same', 'she', 
             'should', 'so', 'some', 'space', 'such', 'such', 
             't', 'than', 'that', 'the', 'the', 'their', 'theirs', 
             'them', 'themselves', 'then', 'there', 'these', 'they', 
             'this', 'this', 'those', 'through', 'to', 'too', 'under', 
             'until', 'up', 'very', 'very', 'was', 'we', 'were', 
             'what', 'when', 'where', 'which', 'while', 'who', 
             'whom', 'why', 'will', 'with', 'you', 'your', 'yours', 
             'yourself', 'yourselves']
In [208]:
# Deduplicate the stopwords by converting the list to a set.
stopwords = set(stopwords)

# Write to disk so that we have these words for the next project.
with open('stopwords.txt', 'w') as f:
    for word in stopwords:
        f.write(word + '\n')
In [209]:
# Read back in to make sure it works and then sort it.
with open('stopwords.txt', 'r') as f:
    stopwords = [line.rstrip('\n') for line in f]
stopwords.sort()
In [210]:
# Replace the offending stopwords with NaNs.
archive_clean.name.replace(stopwords, np.nan, inplace = True)
In [211]:
# Capitalize the name.
archive_clean.name = archive_clean.name.str.capitalize()

Test

In [212]:
archive_clean.name.sample(10)
Out[212]:
245          NaN
108     Napolean
1180        Lucy
1902         NaN
86         Goose
2028       Dylan
1727    Penelope
650       Bailey
2317     Jimothy
496        Layla
Name: name, dtype: object

Define

  • tweet_id is numeric. It should be string. Too many to be categorical. Use astype to convert.

Code

In [213]:
archive_clean.tweet_id = archive_clean.tweet_id.astype(str)

Test

In [214]:
archive_clean.tweet_id.sample(5)
Out[214]:
318     834574053763584002
700     786233965241827333
802     772193107915964416
1504    691820333922455552
805     772114945936949249
Name: tweet_id, dtype: object
In [215]:
# Produce a mask that flags any tweet_id whose length is not 18.
# This should create an empty Series.
mask = (archive_clean.tweet_id.str.len() != 18)
test = archive_clean.tweet_id.loc[mask]
test.head()
Out[215]:
Series([], Name: tweet_id, dtype: object)

Define

  • There is one column per dog stage; this should be a single categorical column. Use .apply with a function that scans the four stage columns and returns the stage found (or NaN) into a new column called dog_stages. Afterwards, drop the four stage columns and make dog_stages categorical.
  • Some dogs have more than one stage. Since this is problematic, we will simply take the first stage listed. Perhaps they are just getting older? (The sketch below counts how many rows are affected.)
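
A quick sketch (relying on the "None"-to-NaN replacement done above) to count how many rows carry more than one stage:

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
(archive_clean[stage_cols].notnull().sum(axis=1) > 1).sum()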

Code

In [216]:
dog_stages = ['doggo', 'floofer', 'pupper', 'puppo']
def calc_dog_stage(stages):
    # Return the first stage found in the row; NaN if none match.
    for stage in stages:
        if stage in dog_stages:
            return stage
    return np.nan
In [217]:
archive_clean['dog_stages'] = archive_clean[['doggo', 'floofer', 
    'pupper', 'puppo']].apply(calc_dog_stage, axis = 1)
archive_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], 
    inplace = True, axis = 1)
archive_clean.dog_stages = archive_clean.dog_stages.astype('category')

Test

In [218]:
archive_clean.head(0)
Out[218]:
tweet_id timestamp source text expanded_urls name rating gender dog_stages
In [219]:
archive_clean.dog_stages.value_counts()
Out[219]:
pupper     220
doggo       83
puppo       23
floofer      9
Name: dog_stages, dtype: int64
In [220]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2355
Data columns (total 9 columns):
tweet_id         2075 non-null object
timestamp        2075 non-null datetime64[ns]
source           2075 non-null category
text             2075 non-null object
expanded_urls    2072 non-null object
name             1396 non-null object
rating           2075 non-null float64
gender           1399 non-null float64
dog_stages       335 non-null category
dtypes: category(2), datetime64[ns](1), float64(2), object(4)
memory usage: 134.1+ KB

image_clean

Define

  • Column names are not informative. Rename the p1 columns to more informative names.
  • Drop the other predictions (the p2 and p3 columns); they are not useful.
  • Capitalize the breed name in p1 (the best prediction).
  • Change breed to categorical using .astype('category').
  • Change tweet_id to str.

Code

In [221]:
image_clean.rename(columns = {'img_num': 'nof_images', 
                               'p1': 'best_prediction', 
                               'p1_conf': 'confidence', 
                               'p1_dog': 'dog_or_not'}, inplace = True)
In [222]:
image_clean.best_prediction = image_clean.best_prediction.str.capitalize()
In [223]:
image_clean.best_prediction = image_clean.best_prediction.astype('category')
In [224]:
image_clean.drop(['p2', 'p2_conf', 'p2_dog', 'p3', 
                  'p3_conf', 'p3_dog'], inplace = True, axis = 1)
In [225]:
# Reorder columns
image_clean = image_clean[['tweet_id', 'best_prediction', 'confidence', 
                           'dog_or_not', 'nof_images', 'jpg_url']]
In [226]:
image_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 6 columns):
tweet_id           2075 non-null int64
best_prediction    2075 non-null category
confidence         2075 non-null float64
dog_or_not         2075 non-null bool
nof_images         2075 non-null int64
jpg_url            2075 non-null object
dtypes: bool(1), category(1), float64(1), int64(2), object(1)
memory usage: 84.0+ KB

Test

In [227]:
image_clean.head(2)
Out[227]:
tweet_id best_prediction confidence dog_or_not nof_images jpg_url
0 666020888022790149 Welsh_springer_spaniel 0.465074 True 1 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
1 666029285002620928 Redbone 0.506826 True 1 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg
In [228]:
image_clean.best_prediction.value_counts().head(5)
Out[228]:
Golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
Pug                    57
Name: best_prediction, dtype: int64
In [229]:
image_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 6 columns):
tweet_id           2075 non-null int64
best_prediction    2075 non-null category
confidence         2075 non-null float64
dog_or_not         2075 non-null bool
nof_images         2075 non-null int64
jpg_url            2075 non-null object
dtypes: bool(1), category(1), float64(1), int64(2), object(1)
memory usage: 84.0+ KB

Define

  • tweet_id is numeric. Change it to string using .astype(str).

Code

In [230]:
image_clean.tweet_id = image_clean.tweet_id.astype(str)

Test

In [231]:
image_clean.tweet_id.sample(5)
Out[231]:
1126    727524757080539137
502     675870721063669760
541     677187300187611136
140     668544745690562560
1071    716439118184652801
Name: tweet_id, dtype: object
In [232]:
# Produce a mask that flags any tweet_id whose length is not 18.
# This should create an empty Series.
mask = (image_clean.tweet_id.str.len() != 18)
test = image_clean.tweet_id.loc[mask]
test.head()
Out[232]:
Series([], Name: tweet_id, dtype: object)

counts_clean

Define

  • tweet_id is numeric. Change it to string using .astype(str).

Code

In [233]:
counts_clean.tweet_id = counts_clean.tweet_id.astype(str)

Test

In [234]:
counts_clean.tweet_id.sample(5)
Out[234]:
447     818259473185828864
1283    707995814724026368
1033    743609206067040256
1832    675853064436391936
1777    677557565589463040
Name: tweet_id, dtype: object
In [235]:
# Produce a mask that flags any tweet_id whose length is not 18.
# This should create an empty Series.
mask = (counts_clean.tweet_id.str.len() != 18)
test = counts_clean.tweet_id.loc[mask]
test.head()
Out[235]:
Series([], Name: tweet_id, dtype: object)

Tidiness:

Define

  • Combine all 3 tables and drop the columns that are not useful for analysis (source, text, expanded_urls). Rename any "funny" column names.
In [236]:
twitter_archive_master = archive_clean.merge(counts_clean, on = 'tweet_id')
twitter_archive_master = twitter_archive_master.merge(image_clean, 
  on = 'tweet_id', how = 'left')
twitter_archive_master.drop(['source', 'text', 'expanded_urls'], 
  inplace = True, axis = 1)

Test

In [237]:
twitter_archive_master.head(0)
Out[237]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
In [238]:
twitter_archive_master.shape
Out[238]:
(2075, 13)
In [239]:
# Want to find out the overlap between twitter_archive_master and image_clean.
c = set(twitter_archive_master.tweet_id)
ic = set(image_clean.tweet_id)
u = c.intersection(ic)
len(u)
Out[239]:
1949

Store

  • Store all of the files on disk as .csv and in an SQLite database.
In [240]:
# Store tables on disk.
twitter_archive_master.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)
archive_clean.to_csv('archive_clean.csv', encoding='utf-8', index=False)
image_clean.to_csv('image_clean.csv', encoding='utf-8', index=False)
counts_clean.to_csv('counts_clean.csv', encoding='utf-8', index=False)
In [241]:
# For practice, also store in a database.
# Connect to the sqlite database
database = 'weratedogs.db'
conn = sqlite3.connect(database)
In [242]:
twitter_archive_master.to_sql('twitter_archive_master', conn, if_exists = 'replace', index = False)
archive_clean.to_sql('archive_clean', conn, if_exists = 'replace', index = False)
image_clean.to_sql('image_clean', conn, if_exists = 'replace', index = False)
counts_clean.to_sql('counts_clean', conn, if_exists = 'replace', index = False)

Test

In [243]:
# Make sure they all read back in from sqlite db.
twitter_archive_master = pd.read_sql('SELECT * FROM twitter_archive_master', conn) 
archive_clean = pd.read_sql('SELECT * FROM archive_clean', conn) 
image_clean = pd.read_sql('SELECT * FROM image_clean', conn)
counts_clean = pd.read_sql('SELECT * FROM counts_clean', conn)
In [244]:
twitter_archive_master.head(1)
Out[244]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
0 892420643555336193 2017-08-01 16:23:56 Phineas 1.3 0.0 None 38690 8558 Orange 0.097049 0.0 1.0 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg
In [245]:
archive_clean.head(1)
Out[245]:
tweet_id timestamp source text expanded_urls name rating gender dog_stages
0 892420643555336193 2017-08-01 16:23:56 iPhone This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 Phineas 1.3 0.0 None
In [246]:
image_clean.head(1)
Out[246]:
tweet_id best_prediction confidence dog_or_not nof_images jpg_url
0 666020888022790149 Welsh_springer_spaniel 0.465074 1 1 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
In [247]:
# Check and make sure they read back fine from .csv.
twitter_archive_master = pd.read_csv(
    'twitter_archive_master.csv', encoding='utf-8')
archive_clean = pd.read_csv(
    'archive_clean.csv', encoding='utf-8')
image_clean = pd.read_csv(
    'image_clean.csv', encoding='utf-8')
counts_clean = pd.read_csv(
    'counts_clean.csv', encoding='utf-8')
In [248]:
twitter_archive_master.head(1)
Out[248]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
0 892420643555336193 2017-08-01 16:23:56 Phineas 1.3 0.0 NaN 38690 8558 Orange 0.097049 False 1.0 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg
In [249]:
# tweet_id comes back as numeric. Should be string.
twitter_archive_master.tweet_id = twitter_archive_master.tweet_id.astype(str)
In [250]:
archive_clean.head(1)
Out[250]:
tweet_id timestamp source text expanded_urls name rating gender dog_stages
0 892420643555336193 2017-08-01 16:23:56 iPhone This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 Phineas 1.3 0.0 NaN
In [251]:
image_clean.head(1)
Out[251]:
tweet_id best_prediction confidence dog_or_not nof_images jpg_url
0 666020888022790149 Welsh_springer_spaniel 0.465074 True 1 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
In [252]:
counts_clean.head(1)
Out[252]:
tweet_id favorites retweets
0 892420643555336193 38690 8558

Analyzing and Visualizing Data

Data Distribution

Let's take a look at each factor's distribution and data type.

In [254]:
twitter_archive_master.describe()
Out[254]:
rating gender favorites retweets confidence nof_images
count 2075.000000 1399.000000 2075.000000 2075.000000 1949.000000 1949.000000
mean 1.060627 0.247320 8906.798072 2774.800000 0.594315 1.202668
std 0.215384 0.431608 12519.998157 4708.667753 0.272293 0.560024
min 0.000000 0.000000 80.000000 13.000000 0.044333 1.000000
25% 1.000000 0.000000 1987.000000 617.500000 0.362596 1.000000
50% 1.100000 0.000000 4091.000000 1351.000000 0.587764 1.000000
75% 1.200000 0.000000 11203.500000 3185.000000 0.846986 1.000000
max 1.400000 1.000000 142993.000000 77130.000000 1.000000 4.000000

We can see that ratings run from 0 (must be a very bad dog) up to 1.4 (14/10). There is a very large range for both favorites and retweets; it makes you wonder which tweets those maximums represent! The confidence factor is a little troubling, with quite a large spread of values. nof_images tells us there is a minimum of 1 picture and a maximum of 4, with only 1949 of 2075 rows having pictures.

In [255]:
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 13 columns):
tweet_id           2075 non-null object
timestamp          2075 non-null object
name               1396 non-null object
rating             2075 non-null float64
gender             1399 non-null float64
dog_stages         335 non-null object
favorites          2075 non-null int64
retweets           2075 non-null int64
best_prediction    1949 non-null object
confidence         1949 non-null float64
dog_or_not         1949 non-null object
nof_images         1949 non-null float64
jpg_url            1949 non-null object
dtypes: float64(4), int64(2), object(7)
memory usage: 210.8+ KB

Of special note here is the timestamp field, which came back from storage as a plain object and will need to be re-parsed as a datetime before any time-based work (we do this below). dog_stages has only 335 non-null rows, which will make it hard to use this factor effectively.

In [256]:
twitter_archive_master.head()
Out[256]:
tweet_id timestamp name rating gender dog_stages favorites retweets best_prediction confidence dog_or_not nof_images jpg_url
0 892420643555336193 2017-08-01 16:23:56 Phineas 1.3 0.0 NaN 38690 8558 Orange 0.097049 False 1.0 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg
1 892177421306343426 2017-08-01 00:17:27 Tilly 1.3 1.0 NaN 33164 6292 Chihuahua 0.323581 True 1.0 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg
2 891815181378084864 2017-07-31 00:18:03 Archie 1.2 0.0 NaN 24963 4174 Chihuahua 0.716012 True 1.0 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg
3 891689557279858688 2017-07-30 15:58:51 Darla 1.3 1.0 NaN 42070 8681 Paper_towel 0.170278 False 1.0 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg
4 891327558926688256 2017-07-29 16:00:24 Franklin 1.2 0.0 NaN 40226 9451 Basset 0.555712 True 2.0 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg

These values are expected based on our cleaning. Of special note, though, is dog_or_not: read back from a .csv file it keeps its True/False values, but stored in and returned from SQLite it becomes 1 for True and 0 for False. This potentially changes how downstream code must be written.
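
If consistent dtypes matter downstream, one option (a sketch, assuming the SQLite round-trip above) is to map the 0/1 values back to booleans while leaving the NaNs for imageless rows untouched:

# map() leaves values not in the dict (i.e. NaN) as NaN.
twitter_archive_master.dog_or_not = twitter_archive_master.dog_or_not.map(
    {0: False, 1: True})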

In [257]:
# Ratings
plt.figure(figsize = (8, 5))
plt.xlabel('Ratings')
plt.ylabel('Count')
bins = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1, 1.1, 1.2, 1.3, 1.4]
plt.hist(x = twitter_archive_master.rating, 
         edgecolor='white', linewidth=8, bins = bins)
plt.title('Histogram of Dog Ratings');

Heavily skewed, with the mass piled up at the high end and the tail pointing left. It seems that the vast majority of dogs are 10s or above:) I wonder who the below-0.5s (under 5/10) are:)
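
To answer that wondering, a one-line sketch listing the lowest-rated rows:

twitter_archive_master[twitter_archive_master.rating < 0.5][
    ['tweet_id', 'name', 'rating']]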

In [258]:
twitter_archive_master.gender.value_counts()
Out[258]:
0.0    1053
1.0     346
Name: gender, dtype: int64
In [259]:
mfg = twitter_archive_master.groupby('gender').count()
mfg.tweet_id.head()
Out[259]:
gender
0.0    1053
1.0     346
Name: tweet_id, dtype: int64
In [260]:
twitter_archive_master.gender.mean()
Out[260]:
0.24731951393852752

Males make up about 75% of the gender-classified tweets in this archive (the mean is 0.247, with female coded as 1). That does not mean males are 75% of all dogs, but roughly 3/4 of the classifiable tweets are about male dogs.

In [261]:
# Histograms of favorites and retweets
plt.figure(figsize=(8, 5))
plt.xlim(0, 60000)
plt.ylabel('Count')
plt.hist(twitter_archive_master.favorites, alpha=.5, label='# Favorites')
plt.hist(twitter_archive_master.retweets, alpha=.5, label='# Retweets')
plt.title('Distribution of Favorites and Retweets Counts')
plt.legend();

The distributions of favorites and retweet counts follow the same pattern, which is no surprise. Let's see what happens with log10.

In [262]:
log10 = twitter_archive_master.copy()
log10 = log10[['favorites', 'retweets', 'confidence', 'timestamp']]
log10['favorites'] = twitter_archive_master.favorites.apply(lambda x: np.log10(x))
log10['retweets'] = twitter_archive_master.retweets.apply(lambda x: np.log10(x))
log10['confidence'] = twitter_archive_master.confidence.apply(lambda x: np.log10(x))
In [263]:
# Log histogram favorites and retweets
plt.figure(figsize=(12, 8))
plt.ylabel('Count')
plt.hist(log10.favorites, alpha=.5, label='# Favorites')
plt.hist(log10.retweets, alpha=.5, label='# Retweets')
plt.title('Log10 Distribution of Favorites and Retweets Counts')
plt.legend();

The log transformation works its magic and the distribution of both features looks roughly normal.

In [264]:
# Histogram of Confidence Factor for Pictures
# Had to get rid of NaNs. Would not work unless I did.
sns.set_style('whitegrid')
plt.figure(figsize=(8, 5))
plt.xlim(.1, 1)
plt.xlabel('Confidence')
plt.ylabel('Count')
plt.hist(x = twitter_archive_master.confidence[
    ~np.isnan(twitter_archive_master.confidence)], 
    edgecolor='white', linewidth=1.2)
plt.title('Histogram of Confidence Factor For Pictures');
In [265]:
# Let's try a log
# Histogram of Confidence Factor for Pictures
# Had to get rid of NaNs. Would not work unless I did.
sns.set_style('whitegrid')
plt.figure(figsize=(8, 5))
plt.xlabel('Confidence')
plt.ylabel('Count')
plt.hist(x = log10.confidence[~np.isnan(log10.confidence)], 
        edgecolor='white', linewidth=1.2)
plt.title('Log10 Histogram of Confidence Factor For Pictures');

Wow, you can see the mind of a computer in here. This is VERY regular.

In [266]:
twitter_archive_master.boxplot(column='rating', 
   by = 'dog_stages', figsize=(8, 5));

It is a little bizarre that the median is right on 1.2 for all of the dog_stages except pupper. People seldom rate their dog lower than 10/10.

In [267]:
twitter_archive_master.groupby('dog_stages')['rating'].describe()
Out[267]:
count mean std min 25% 50% 75% max
dog_stages
doggo 83.0 1.177108 0.157205 0.5 1.10 1.2 1.3 1.4
floofer 9.0 1.188889 0.105409 1.0 1.10 1.2 1.3 1.3
pupper 220.0 1.073182 0.172171 0.3 1.00 1.1 1.2 1.4
puppo 23.0 1.200000 0.127920 0.9 1.15 1.2 1.3 1.4

Bivariate Analysis

Let's see what the correlations are in the dataset by building a correlation matrix.

In [268]:
# Compute the correlation matrix
corr = twitter_archive_master[['rating', 'gender', 
    'favorites', 'retweets', 'confidence', 'nof_images']].corr()
corr
Out[268]:
rating gender favorites retweets confidence nof_images
rating 1.000000 0.131150 0.398324 0.306715 0.100160 0.192380
gender 0.131150 1.000000 0.055134 0.022763 -0.008482 0.006425
favorites 0.398324 0.055134 1.000000 0.914895 0.076910 0.134553
retweets 0.306715 0.022763 0.914895 1.000000 0.054017 0.106179
confidence 0.100160 -0.008482 0.076910 0.054017 1.000000 0.202515
nof_images 0.192380 0.006425 0.134553 0.106179 0.202515 1.000000
In [269]:
# This is largely taken from 
# https://seaborn.pydata.org/examples/many_pairwise_correlations.html
# There are a couple of minor tweaks.

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # the builtin bool avoids the deprecated np.bool alias
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 8))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.set_style('darkgrid')
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .8});
  • The only correlation of interest is between favorites and rating: at .40 it would be useful for a linear model (see the sketch below).
  • We would not include both retweets and favorites in the same model, since the correlation coefficient between them is so high (.91).
  • Despite the cleaning and feature creation done on gender and dog_stages, neither shows much correlation with any other variable in this dataset.
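As a rough illustration of the first bullet, here is a minimal one-variable least-squares fit of favorites on rating, using the polyfit already imported above (a sketch only; a real model would want more predictors and a train/test split):

# Illustrative fit: favorites ~ rating. This polyfit returns (intercept, slope).
fit_data = twitter_archive_master[['rating', 'favorites']].dropna()
b0, b1 = polyfit(fit_data.rating, fit_data.favorites, 1)
print('favorites =', round(b0, 1), '+', round(b1, 1), '* rating')
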
In [270]:
# Scatter plot of favorites and retweets.
plt.figure(figsize=(12, 8))
plt.xlim(0, 40000)
plt.ylim(0, 8000)
plt.xlabel('Favorites Count')
plt.ylabel('Retweets Count')
plt.plot(twitter_archive_master.favorites, 
    twitter_archive_master.retweets, linestyle='', 
     marker='o', markersize=0.7)
plt.title('Scatter Plot of Favorites and Retweets');

Looks like a strong correlation (not surprising). Let's see if the log10 version is even tighter.

In [271]:
# Log10 scatter plot of favorites and retweets.
plt.figure(figsize=(12, 8))
plt.xlabel('Favorites Count')
plt.ylabel('Retweets Count')
plt.plot(log10.favorites, log10.retweets, 
           linestyle='', marker='o', markersize=0.7)
plt.title('Scatter Plot of Log10 Favorites and Retweets');

Not surprising! Tweets that collect many favorites almost always collect many retweets as well. Below is the correlation coefficient, which is very high.

In [272]:
log10.favorites.corr(log10.retweets)
Out[272]:
0.9666027994378725
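
Another way to see how tightly the two move together is the per-tweet retweet-to-favorite ratio, sketched below on the raw counts:

# Retweets as a fraction of favorites, per tweet (raw counts, not log10).
ratio = twitter_archive_master.retweets / twitter_archive_master.favorites
print(ratio.replace([np.inf, -np.inf], np.nan).describe())
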
In [273]:
# Favorites and retweets are a proxy for one another, so let's look at retweets over time.
time_log10_retweets = log10[['retweets', 'timestamp']].copy()
time_log10_retweets.timestamp = pd.to_datetime(time_log10_retweets['timestamp'], 
                                             format='%Y-%m-%d %H:%M:%S.%f')
time_log10_retweets.set_index(['timestamp'], inplace=True)
sns.set_style('whitegrid')
time_log10_retweets.plot(figsize=(12, 8), title = 'Time vs Log10 Retweets', 
           linestyle = '', marker = 'o', markersize = 0.7);

@dog_rates gained popularity quickly after opening their twitter account, with steady growth afterward. Let's see whether their tweeting rate has slowed down.

In [274]:
# Make sure timestamp works for the groupby function by date.
twitter_archive_master.timestamp = pd.to_datetime(
    twitter_archive_master['timestamp'], format='%Y-%m-%d %H:%M:%S.%f')
cg_month_tweets = twitter_archive_master.groupby(
    pd.Grouper(key = 'timestamp', freq = "M")).count().reset_index()
cg_month_tweets = cg_month_tweets[['timestamp', 'tweet_id']]
cg_month_tweets.head()
Out[274]:
timestamp tweet_id
0 2015-11-30 294
1 2015-12-31 369
2 2016-01-31 181
3 2016-02-29 119
4 2016-03-31 129
In [275]:
# Check sum on total rows aggregated in groupby.
cg_month_tweets.sum()
Out[275]:
tweet_id    2075
dtype: int64
In [276]:
# Line plot of Time vs Tweets.
sns.set_style('whitegrid')
plt.figure(figsize=(12, 8))
plt.xlim([datetime.date(2015, 11, 30), datetime.date(2017, 7, 30)]) 
plt.xlabel('Year and Month')
plt.ylabel('Tweets Count')
plt.plot(cg_month_tweets.timestamp, cg_month_tweets.tweet_id)
plt.title('Line Plot of Time vs Tweets');

The engine driving this account, original tweets, is clearly declining.
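
To put a rough number on the decline, this sketch compares average monthly tweet counts at the start and end of the archive (the first and last months may be partial, so treat it as indicative only):

# Mean tweets per month: first three months vs. last three months.
print(round(cg_month_tweets.tweet_id.head(3).mean(), 1),
      round(cg_month_tweets.tweet_id.tail(3).mean(), 1))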

In [277]:
cg_day_favorites = twitter_archive_master.groupby(
    pd.Grouper(key = 'timestamp', freq = "D")).sum().reset_index()
cg_day_favorites = cg_day_favorites[['timestamp', 'favorites']]
cg_day_favorites.head()
Out[277]:
timestamp favorites
0 2015-11-15 2817
1 2015-11-16 21064
2 2015-11-17 4410
3 2015-11-18 8577
4 2015-11-19 78811
In [278]:
# Plot Time vs Favorites Since Inception By Day Count
sns.set_style('darkgrid')
plt.ylim(0, 125000)
plt.xlabel('Days Since First Tweet')
plt.ylabel('Favorites per Day')
plt.title('Time vs Favorites', fontsize = 14)
cg_day_favorites.favorites.plot(figsize=(12, 8), 
           linestyle = '', marker = 'o', markersize = 1)

# Fit a regression line.
intercept, multiplier = polyfit(cg_day_favorites.index.values, 
            cg_day_favorites.favorites, 1)
label = "favorites/day = ", round(intercept, 2), "+ days * ", round(multiplier, 2)
plt.text(440, 45000, label)
plt.plot(cg_day_favorites.index.values, 
         intercept + multiplier * cg_day_favorites.index.values, '-');

The linear regression line has an intercept of about 15,774 favorites/day plus about 44 favorites/day for each day since inception. In other words, daily favorites have grown at a rate of roughly 44 per day!

In [279]:
print(intercept, multiplier)
15773.822691349327 43.99843064646805
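
As a usage example, the fitted line can be read off for any day since inception (day 600 is an arbitrary illustration):

# Predicted favorites/day on day 600: ~15774 + 44 * 600.
day = 600
print(round(intercept + multiplier * day))  # ~42173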

It would be nice to visualize the relative shares of the dog_stages.

In [280]:
dog_stages_counts = twitter_archive_master.dog_stages.value_counts()
In [281]:
dog_stages_counts
Out[281]:
pupper     220
doggo       83
puppo       23
floofer      9
Name: dog_stages, dtype: int64
In [282]:
# Build pie labels of the form 'pupper 65.67%'.
denominator = dog_stages_counts.sum()
labels = ['{} {}%'.format(stage, round(count / denominator * 100, 2))
          for stage, count in zip(dog_stages_counts.index, dog_stages_counts)]
In [283]:
plt.figure(figsize=(12, 8))
plt.pie(dog_stages_counts, labels = labels, explode = (0.1, 0.1, 0.1, 0.1), 
        shadow = True, startangle = 90)
plt.title('Share of Dog Stages');

Pupper is by far the most common stage. With only 9 floofers and 23 puppos, there is not much data to work with for the rarer stages.

Let's see if dog_stages has an effect on ratings.

In [284]:
rdsg = twitter_archive_master.groupby('dog_stages').mean().reset_index()
rdsg = rdsg[['dog_stages', 'rating']]
rdsg.head()
Out[284]:
dog_stages rating
0 doggo 1.177108
1 floofer 1.188889
2 pupper 1.073182
3 puppo 1.200000
In [285]:
# Plot dog_stages vs ratings
sns.set_style('darkgrid')
plt.figure(figsize=(8, 5))

ind = np.arange(len(rdsg))    # the locations for the groups on the x axis
width = .5                   # the width of the bars

p1 = plt.bar(ind, rdsg.rating, width)

plt.xlabel('Dog Stages')
plt.ylabel('Average Rating(s)')
plt.title('Effect of Dog Stages on Rating(s)')
plt.xticks(ind, rdsg.dog_stages)
plt.yticks(np.arange(0, 1.4, .2));

There is very little difference. The bar chart shows pupper averaging slightly lower, while the rest are virtually identical. This could be a simple statistical anomaly: floofer has only 9 observations versus 220 for pupper, and 9 is nowhere near the n of roughly 30 usually needed to assume an approximately normal sampling distribution. A quick test is sketched below.
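
To gauge whether the pupper gap could be noise, here is a quick sketch of a Welch's t-test comparing pupper against doggo ratings (assuming scipy is available; with n = 9, testing floofer would be pointless):

from scipy import stats

# Welch's t-test (unequal variances): pupper vs. doggo mean ratings.
pupper = twitter_archive_master.query("dog_stages == 'pupper'").rating.dropna()
doggo = twitter_archive_master.query("dog_stages == 'doggo'").rating.dropna()
t, p = stats.ttest_ind(pupper, doggo, equal_var=False)
print('t =', round(t, 2), 'p =', round(p, 4))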

Let's figure out which breeds get favorited the most.

In [286]:
fav_breed = twitter_archive_master[
    twitter_archive_master.dog_or_not == True]
fav_breed = fav_breed.groupby(
    'best_prediction').sum().reset_index()
fav_breed = fav_breed[[
    'best_prediction', 'favorites']]
fav_breed = fav_breed.sort_values(
    'favorites', ascending = False)[0:20]
fav_breed
Out[286]:
best_prediction favorites
43 Golden_retriever 1627628
59 Labrador_retriever 1078064
77 Pembroke 990896
26 Chihuahua 675838
85 Samoyed 530361
39 French_bulldog 475271
27 Chow 405755
79 Pug 312420
28 Cocker_spaniel 308251
78 Pomeranian 282307
24 Cardigan 253999
63 Malamute 249414
100 Toy_poodle 244535
25 Chesapeake_bay_retriever 237621
37 Eskimo_dog 210596
60 Lakeland_terrier 206633
40 German_shepherd 198292
67 Miniature_pinscher 183333
94 Staffordshire_bullterrier 182595
6 Basset 176356
In [287]:
# Bar graph of most 'favorites' breed.
sns.set_style('whitegrid')
plt.figure(figsize=(12, 8))
plt.barh(range(len(fav_breed.best_prediction)), fav_breed.favorites) 
plt.yticks(range(len(fav_breed.best_prediction)),fav_breed.best_prediction)
plt.xlabel('Sum of Favorites')
plt.ylabel('Dog Breed')
plt.title('Bar Chart of The Breeds That Had The Largest Total Number of Favorites');

No surprise here: the Golden_retriever is the expected winner, followed by the Labrador_retriever. The long tail of breeds reflects just how many different kinds of dog lovers, and in turn breeds, there are.

What is the most common name in the dataset?

In [288]:
twitter_archive_master.name.value_counts().head(10)
Out[288]:
Lucy       11
Charlie    11
Oliver     10
Cooper     10
Tucker      9
Penny       9
Sadie       8
Winston     8
Lola        8
Daisy       7
Name: name, dtype: int64

Not much to see here!

Let's see which breeds are the highest rated. There is really not much difference between the ratings, so to make the differences apparent I will add a column that takes the log10 of the rating and subtracts a constant. This makes the subsequent bar chart more informative.

In [289]:
top_rated_breeds = twitter_archive_master[
    twitter_archive_master.dog_or_not == True]
top_rated_breeds = top_rated_breeds.groupby(
    'best_prediction').mean().reset_index()
top_rated_breeds = top_rated_breeds[[
    'best_prediction', 'rating']]
top_rated_breeds['log10_rating'] = top_rated_breeds.rating.apply(
    lambda x: np.log10(x) - .05)
top_rated_breeds = top_rated_breeds.sort_values(
    'log10_rating', ascending = False)[0:20]
top_rated_breeds
Out[289]:
best_prediction rating log10_rating
84 Saluki 1.250000 0.046910
20 Briard 1.233333 0.041080
98 Tibetan_mastiff 1.225000 0.038136
15 Border_terrier 1.214286 0.034321
92 Silky_terrier 1.200000 0.029181
96 Standard_schnauzer 1.200000 0.029181
37 Eskimo_dog 1.177778 0.021063
44 Gordon_setter 1.175000 0.020038
50 Irish_setter 1.175000 0.020038
85 Samoyed 1.170000 0.018186
27 Chow 1.160976 0.014823
43 Golden_retriever 1.156716 0.013227
24 Cardigan 1.152941 0.011807
4 Australian_terrier 1.150000 0.010698
108 Wire-haired_fox_terrier 1.150000 0.010698
42 Giant_schnauzer 1.150000 0.010698
46 Great_pyrenees 1.146154 0.009243
56 Kelpie 1.145455 0.008978
77 Pembroke 1.144318 0.008547
61 Leonberg 1.133333 0.004358

As you can see above, the differences between the ratings of all of the breeds are vanishingly small. Given the source of this data, where people simply rated however they felt, it would be somewhat silly to test whether the differences are significant. If the data were more rigorous, it would make sense to apply p-values to these ratings.

In [290]:
# Bar graph of the top rated breeds.
sns.set_style('whitegrid')
plt.figure(figsize = (12, 8))
# Limit the value axis so the small differences are visible; a bare xlim()
# call resolves to ggplot's xlim layer and silently does nothing here.
plt.xlim(0, .05)
plt.barh(range(len(top_rated_breeds.best_prediction)), 
         top_rated_breeds.log10_rating) 
plt.yticks(range(len(top_rated_breeds.best_prediction)),
           top_rated_breeds.best_prediction)
plt.xlabel('log10(rating) - 0.05')
plt.ylabel('Dog Breed')
plt.title('Relative Breed Rating Bar Chart' );

The above bar graph is easier to read since we have manipulated the scale to make the differences more observable.

Let's see what the intersection of the favorite and top rated breeds is. We can plot this as a Venn diagram.

In [291]:
fav_breed_list = fav_breed.best_prediction.tolist()
top_rated_list = top_rated_breeds.best_prediction.tolist()
In [292]:
fav_breed_set = set(fav_breed_list)
top_rated_set = set(top_rated_list)
In [293]:
common = fav_breed_set & top_rated_set
common
Out[293]:
{'Cardigan', 'Chow', 'Eskimo_dog', 'Golden_retriever', 'Pembroke', 'Samoyed'}
In [294]:
plt.figure(figsize = (12, 8))
# venn2's '10' and '01' regions are the set-exclusive counts, so subtract
# the intersection to avoid double counting the 6 common breeds.
v = venn2(subsets = {'10': len(fav_breed_set - common), 
          '01': len(top_rated_set - common), '11': len(common)}, 
          set_labels = ('Favorite Breeds', 'Top Rated Breeds'))
plt.annotate(common, xy = v.get_label_by_id('11').get_position(),  
    xytext = (0,70), ha = 'center', textcoords = 'offset points', 
    bbox = dict(boxstyle = 'round, pad = 0.5', fc = 'lime', alpha = 0.3),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3, rad = 0', 
    color = 'gray'))
plt.title('Venn Diagram of Intersection Between Favorite and Top Rated Breeds');