Data Wrangling Twitter

Contact Information

  • Lindsay Moir
  • 778 679 4406
  • tragoes@gmail.com

Project Description

We are using the twitter_archive from @dog_rates, provided as a .csv file. Tweepy is used to interface with the Twitter API and receive JSON data. Finally, we have been provided with a file produced at Udacity that classifies pictures of dogs (.tsv), which is downloaded via an HTTP request. This project includes gathering, cleaning, storing, and analyzing the data. We are most interested in which breeds this population of users likes and how their ratings of those breeds have changed over time.

Gather

twitter_archive

image_predictions

There is also an image predictions file that contains tweet ids and URLs for images. For example, this URL gives you Stuart's picture (the dog): https://twitter.com/dog_rates/status/889531135344209921. The part after status is the tweet id for Stuart. Udacity wants us to download this file programmatically via the requests library.
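The link between a status URL and its tweet id can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper names `tweet_id_from_url` and `url_from_tweet_id` are hypothetical and are not taken from the notebook's code.

```python
from urllib.parse import urlparse

# URL pattern taken from the Stuart example; only the trailing id changes.
BASE_URL = "https://twitter.com/dog_rates/status/"


def tweet_id_from_url(url: str) -> int:
    """Return the numeric tweet id that follows 'status' in a tweet URL."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path shape: <account>/status/<tweet_id>
    idx = parts.index("status")
    return int(parts[idx + 1])


def url_from_tweet_id(tweet_id: int) -> str:
    """Rebuild the @dog_rates status URL for a given tweet id."""
    return f"{BASE_URL}{tweet_id}"


stuart_id = tweet_id_from_url(
    "https://twitter.com/dog_rates/status/889531135344209921"
)
print(stuart_id)
print(url_from_tweet_id(stuart_id))
```

The same round trip is handy later on, when a tweet id found in the data needs to be turned back into a link that can be opened in a browser.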

twitter_counts

We need to set up access to the Twitter API via tweepy.
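Once the API has returned JSON for each tweet, only a few fields are needed for the counts table. The sketch below assumes the JSON was saved one tweet per line (an in-memory sample stands in for that file) and that the fields of interest are `id`, `retweet_count`, and `favorite_count`, as named in Twitter's v1.1 status objects. The function name `parse_counts` and the sample payloads are made up for illustration.

```python
import io
import json


def parse_counts(lines):
    """Turn an iterable of JSON strings (one tweet each) into count records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweets": tweet.get("retweet_count", 0),
            "favorites": tweet.get("favorite_count", 0),
        })
    return records


# Made-up payloads standing in for the saved API responses.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 40}\n'
    '\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 12}\n'
)
counts = parse_counts(sample)
print(counts)
```

Keeping the parsing separate from the API calls means the slow, rate-limited download only has to happen once.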

Assess

General

twitter_archive

After visually inspecting the text field, we may be able to extract gender.
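One way this extraction might work is a pronoun search over the tweet text. The sketch below is an assumption about the approach, not the notebook's actual code: the word lists, the `infer_gender` name, and the sample sentences are all invented for illustration. Text that mentions both sets of pronouns, or neither, is left unclassified.

```python
import re

# Hypothetical word lists; whole-word matches only, case-insensitive.
MALE = re.compile(r"\b(he|him|his|himself|boy)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(she|her|hers|herself|girl)\b", re.IGNORECASE)


def infer_gender(text: str):
    """Return 'male', 'female', or None when the text is silent or ambiguous."""
    male = bool(MALE.search(text))
    female = bool(FEMALE.search(text))
    if male == female:
        return None  # neither pronoun set found, or both
    return "male" if male else "female"


# Made-up sentences, not real tweets.
for sentence in [
    "He loves his ball.",
    "She guards her toys.",
    "This is a good dog.",
    "He met her at the park.",
]:
    print(sentence, "->", infer_gender(sentence))
```

The word boundaries matter: without them, "This" would match "his" and "other" would match "her".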

Atticus

Time for some fun. Which is the most extreme outlier for numerator?

Let's see who Atticus is.

Now we know why Atticus was the "top dog" at 1776. That is the year the USA declared its independence!

The above is ugly. Articles, prepositions, and other non-name words appear in this name column.

image_predictions

We need to be careful when we do the data cleaning. These repeated jpgs are retweets, so each one creates two entries in the image_predictions file. Removing the retweets should also remove these duplicates, but that needs to be verified.
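A check along these lines could confirm whether dropping retweets really clears the repeated images. This is a sketch in plain Python with made-up rows; the helper names `repeated_urls` and `drop_retweets` are hypothetical, and the assumption is that each row pairs a tweet id with a jpg URL.

```python
from collections import Counter

# Made-up rows: (tweet_id, jpg_url). Tweet 103 repeats the image from 101.
rows = [
    (101, "https://pbs.twimg.com/media/a.jpg"),
    (102, "https://pbs.twimg.com/media/b.jpg"),
    (103, "https://pbs.twimg.com/media/a.jpg"),
]


def repeated_urls(rows):
    """Map each jpg URL that appears more than once to the tweet ids using it."""
    url_counts = Counter(url for _, url in rows)
    return {
        url: [tid for tid, u in rows if u == url]
        for url, n in url_counts.items()
        if n > 1
    }


def drop_retweets(rows, retweet_ids):
    """Keep only rows whose tweet id is not a known retweet."""
    return [(tid, url) for tid, url in rows if tid not in retweet_ids]


before = repeated_urls(rows)
after = repeated_urls(drop_retweets(rows, retweet_ids={103}))
print(before)
print(after)
```

If `after` is not empty on the real data, some duplicates have another cause and need their own cleaning step.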

twitter_counts

Quality

twitter_archive

image_predictions

twitter_counts

Flagged but not cleaned.

Tidiness:

Cleaning

archive_clean

Define

Code

Test

Define

Code

Test

Define

Code

Test

Define

Code

Test

These all match the True, False, and NaN values in the first 10 records. That is pretty good: 1399 of 2075 records classified!

Define

Code

Test

Define

Code

Test

Define

Code

Test

We can see that 'a' has been removed.
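A rule that would remove entries such as 'a' is to treat any name starting with a lowercase letter as a non-name. That rule is an assumption made for this sketch, not necessarily the one used in the notebook, and `clean_name` is a hypothetical helper.

```python
def clean_name(name):
    """Return the name, or None when it looks like an ordinary word."""
    if not name:
        return None  # empty or missing value
    if name[0].islower():
        return None  # 'a', 'the', 'quite', ... are not dog names
    return name


# 'Stuart' and 'Atticus' appear in this report; the lowercase words are examples.
names = ["Stuart", "a", "the", "Atticus", "quite"]
cleaned = [clean_name(n) for n in names]
print(cleaned)
```

The rule is blunt: it would also discard a real name that happened to be typed in lowercase, so a spot check of the discarded values is worthwhile.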

Define

Code

Test

Define

Code

You can certainly see that there is more than 1 dog stage in some of the rows!
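Collapsing the separate stage columns into one makes rows with multiple stages explicit. The sketch below assumes each stage column holds its own name when the stage applies and some other value when it does not; `combine_stages` and the sample rows are illustrative only.

```python
STAGES = ("doggo", "floofer", "pupper", "puppo")


def combine_stages(row):
    """Join every stage flagged in a row into one comma-separated value."""
    found = [stage for stage in STAGES if row.get(stage) == stage]
    return ",".join(found) if found else None


# Made-up rows in the assumed layout.
sample_rows = [
    {"doggo": "doggo", "floofer": None, "pupper": "pupper", "puppo": None},
    {"doggo": None, "floofer": None, "pupper": "pupper", "puppo": None},
    {"doggo": None, "floofer": None, "pupper": None, "puppo": None},
]
combined = [combine_stages(r) for r in sample_rows]
print(combined)
```

Rows that come back with a comma are the multi-stage cases and can be reviewed by hand or kept as a combined category.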

Test

image_clean

Define

Code

Test

Define

Code

Test

counts_clean

Define

Code

Test

Tidiness:

Define

Test

Store

Test

Analyzing and Visualizing Data

Data Distribution

Let's take a look at each factor's data distribution and type.

We can see that ratings in this column range from 0 (must be a very bad dog) to 14/10 (1.4). There is a very large range for both favorites and retweets, which makes you wonder what those maximum numbers represent! The confidence factor is a little troubling: there is quite a large range in values. nof_images tells us that there is a minimum of 1 picture and a maximum of 4, with only 1949 of 2075 rows having pictures.

Of special note here is the timestamp field, which is a proper datetime field. dog_stages is populated in only 335 rows, which will make it difficult to use this factor effectively.

These values are expected based on our cleaning. Of special note, though, is dog_or_not. When you retrieve it from a .csv file, it keeps its True/False values. When it is stored in sqlite and read back, it is 1 for True and 0 for False. This potentially changes the way you write your code in this project.
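The behaviour is easy to reproduce with the standard `sqlite3` module. This sketch uses an in-memory database and a made-up `predictions` table; only the `dog_or_not` column name comes from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no separate boolean storage class; values are kept as integers.
conn.execute("CREATE TABLE predictions (tweet_id INTEGER, dog_or_not BOOLEAN)")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?)",
    [(1, True), (2, False)],
)

stored = [
    row[0]
    for row in conn.execute(
        "SELECT dog_or_not FROM predictions ORDER BY tweet_id"
    )
]
conn.close()

# Convert back explicitly if the rest of the code expects True/False.
restored = [bool(value) for value in stored]
print(stored, restored)
```

Comparisons such as `== True` still work on the integers, but identity checks (`is True`) and type checks do not, so an explicit conversion after loading is the safer habit.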

Heavily skewed left: the bulk of the ratings sits at the high end of the scale, with a thin tail toward zero. It seems that the vast majority of dogs are 10s or above :) I wonder who the dogs below 0.5 (5/10) are :)

Males make up 75% of the tweets in this archive. That does not mean that they make up 75% of the "dogs", only that 3/4 of these tweets are from people who have male dogs.

The distributions of favorite and retweet counts follow the same pattern, which is no surprise. Let's see what happens with log10.

The log transformation works its magic, and the distributions of both features now look approximately normal.

Wow, you can see the mind of a computer in here. This is VERY regular.

It is a little bizarre that the median is right on 1.2 for every dog stage except pupper. People seldom rate their dog lower than 10/10.

Bivariate Analysis

See what the correlations are in the dataset by creating a correlation matrix.

Looks like a pretty good correlation (not surprising). See if log10 will be even better.

Not surprising! People who favorite seem to almost always retweet. Below is the correlation coefficient which is very high.
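For reference, the coefficient in question is Pearson's r, which can be computed by hand on both the raw and the log10-transformed counts. The sketch below uses made-up numbers rather than the project's data, assumes all counts are strictly positive (log10 of 0 is undefined), and the function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def log10_all(values):
    """Apply log10 to every value; requires strictly positive inputs."""
    return [math.log10(v) for v in values]


# Made-up counts for illustration only.
favorites = [120, 950, 4300, 18000, 96000]
retweets = [40, 310, 1200, 7000, 25000]

r_raw = pearson(favorites, retweets)
r_log = pearson(log10_all(favorites), log10_all(retweets))
print(r_raw, r_log)
```

On heavily skewed counts the raw coefficient is dominated by the few largest tweets, which is one reason to look at the log-scale version as well.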

@dog_rates gained notoriety quickly after opening their Twitter account, with steady growth afterwards. Let's see if their tweeting rate has slowed down.

The engine that is driving this account (original tweets) is certainly declining.

The fitted regression line is favorites = 15,777 + 44.03 × (number of days since inception). This means that favorites have increased at a rate of roughly 44 per day!
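The reported line can be written as a small function, alongside an ordinary least-squares fit of the kind that would produce such coefficients. Only the intercept (15,777) and slope (44.03) come from the analysis; `fit_line` and `predicted_favorites` are hypothetical names, and the fit is demonstrated on points generated from the line itself rather than on the real data.

```python
INTERCEPT = 15_777   # favorites at day 0, as reported
SLOPE = 44.03        # additional favorites per day, as reported


def predicted_favorites(days_since_inception):
    """Evaluate the reported regression line at a given day."""
    return INTERCEPT + SLOPE * days_since_inception


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Sanity check: fitting points that lie on the line recovers its coefficients.
days = [0, 100, 200, 300]
fitted_intercept, fitted_slope = fit_line(
    days, [predicted_favorites(d) for d in days]
)
print(fitted_intercept, fitted_slope)
print(predicted_favorites(365))
```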

It would be nice to visualize the relative share of each of the dog_stages.

Not sure what this means other than pupper is the most popular! Only 9 floofer and 23 puppo. Not much data to work with.

Let's see if dog_stages has an effect on ratings.

There is very little difference. The bar chart shows pupper slightly lower; the rest are virtually identical. This could be as simple as a statistical anomaly. There is very little data for floofer (n = 9), while pupper has 220 rows. A sample of 9 falls well short of the common rule of thumb of about 30 observations for a normal approximation.

Let's figure out which breeds get favorited the most.

No surprise here, the Golden_retriever is the expected winner, followed by the Labrador_retriever. The rest of the breeds reflects how many different types of dog lovers and in turn breeds there are.

What is the most common name in the dataset?

Not much to see here!

Let's see which are the highest rated breeds. There is really not much difference between the ratings. To make that difference apparent, I will add a column that takes the log10 of the rating and then subtracts a constant. This makes the subsequent bar chart more informative.
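The rescaling described here can be sketched as follows. The constant used in the notebook is not stated, so it is left as a parameter; the value 0.04 and the breed labels and ratings below are placeholders chosen purely for illustration.

```python
import math


def rescale(ratings, constant):
    """Map each rating to log10(rating) - constant, keeping the breed keys."""
    return {breed: math.log10(r) - constant for breed, r in ratings.items()}


# Placeholder breeds and ratings, not results from the dataset.
ratings = {"breed_A": 1.10, "breed_B": 1.12, "breed_C": 1.15}
rescaled = rescale(ratings, constant=0.04)

for breed in ratings:
    print(breed, ratings[breed], round(rescaled[breed], 4))
```

Because bars are drawn from zero, subtracting a constant leaves the ordering untouched but makes the bar heights differ by a much larger proportion. The chart then shows rank clearly, at the cost of exaggerating how big the differences really are.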

As you can see above, the differences between the ratings of all of the breeds are vanishingly small. Given the source of this data, it is somewhat silly to bother testing whether they are significant. People just did what they felt like. If the data were more rigorous, it would make sense to apply p-values to these ratings.

The above bar graph is easier to read since we have manipulated the scale to make the differences more observable.

Let's see what the intersection of favorite and top rated breeds is. We can plot this as a Venn diagram.

There are only 6 dogs that are in common in the two sets, which is surprising. One would think that a favorite would also mean a top rating. Obviously not!
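The overlap behind the Venn diagram is a plain set intersection. In the sketch below, the six shared breeds and the presence of Labrador_retriever among the favorites come from this report; the `other_*` entries are placeholders standing in for the remaining members of each top-20 list, which are not reproduced here.

```python
shared = {
    "Cardigan", "Chow", "Eskimo_dog",
    "Golden_retriever", "Pembroke", "Samoyed",
}

# Placeholders stand in for the breeds that appear in only one list.
top_favorited = shared | {"Labrador_retriever", "other_favorite_1", "other_favorite_2"}
top_rated = shared | {"other_rated_1", "other_rated_2"}

common = top_favorited & top_rated        # breeds in both lists
only_favorited = top_favorited - top_rated
only_rated = top_rated - top_favorited

print(sorted(common))
print(sorted(only_favorited))
```

The three resulting sets correspond to the three regions of the Venn diagram.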

Multivariate Analysis

Let's see if tweets vs retweets are changing over time.

The power of twitter is shown here. Despite tweeting A LOT LESS, it is clear that the retweets/interest in @dog_rates is still quite high AND actually growing!

Next, let's look at how ratings and favorites changed over time. Please note that because we group by day to make the data more manageable, each point is a daily average (mean) of the ratings.
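The daily averaging can be sketched without any DataFrame machinery. The records below are made up, and `daily_mean` is a hypothetical helper; the assumption is that each record pairs a tweet's timestamp with its rating (numerator divided by denominator).

```python
from collections import defaultdict
from datetime import datetime


def daily_mean(records):
    """Group (timestamp, rating) pairs by calendar day and average each group."""
    by_day = defaultdict(list)
    for timestamp, rating in records:
        by_day[timestamp.date()].append(rating)
    # Sort by day so the result is ready for a time-series plot.
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}


# Made-up records for illustration only.
records = [
    (datetime(2017, 1, 1, 9, 30), 1.2),
    (datetime(2017, 1, 1, 18, 5), 1.0),
    (datetime(2017, 1, 2, 12, 0), 1.3),
]
means = daily_mean(records)
for day, value in means.items():
    print(day, round(value, 2))
```

Averaging hides how many tweets fall on each day, so a day with a single tweet carries the same visual weight as a busy one.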

Look at the relationship between ratings, retweets, and favorites.

As expected, favorites are considerably above retweets. However, the response is most pronounced at the ends of the ratings scale. This makes sense: people respond to something that is rated very high or very low and ignore what is in the middle.

I thought I would look at this as a stacked bar graph to see if there was anything else that would come out of it. Not much except you get to see the total activity of retweets and favorites for each rating which is somewhat helpful.

Fun Analysis

Well, let's have some fun. What is the most retweeted tweet?

After appending the tweet_id to the standard URL, you get the video of a dog figuring out that he can stand in the water and does not have to swim. Very cute! https://twitter.com/dog_rates/status/744234799360020481

Conclusions

Limitations

  • The quality of the data is EXTREMELY suspect. However, as @dog_rates said to Brent, "they're good dogs Brent". This is supposed to be fun; it is not meant to be rigorous. Twitter data can be useful in a social context. However, you would need a LOT of data and an incredible amount of wrangling before it would be useful.

Summary

  • It appears that the @dog_rates community is going to keep this account going for a long time. People are tweeting, retweeting, and favoriting at an increasing rate, despite the decrease in @dog_rates original tweets. This is a good example of how it takes some energy to get these social media destinations going; after they gain traction, however, they can take on a life of their own.
  • Ratings did change over time. Again, this is most likely because of the "they're good dogs Brent" meme.
  • The retrievers (Golden and Labrador) are as usual very popular dogs. However, based on this data, the interest in a wide variety of breeds is readily apparent. Of the favorites only 6 of the top 20 dogs intersected with the top ratings. They were Cardigan, Chow, Eskimo_dog, Golden_retriever, Pembroke, and Samoyed. Not even Labrador retrievers ended up intersecting!