If for some reason you haven’t seen the WeRateDogs Twitter account (accurately described as ‘the best thing ever’ by barkpost.com), go check it out first, then come back to this page.
For this entry I continue with my dog-themed posts, building on a few existing blog posts that analyzed tweets by We Rate Dogs. This post by David Montgomery examined how the dog ratings have changed over time, and this post by Bastian Greshake used TensorFlow to try to train an image classifier to rate the dogs. Those two posts focus on the ratings, but I simply wanted to work with the dog names.
I’ve written some posts on tweet/text analysis in the past, so to document my progress with string processing and text analysis I decided to build a corpus of dog names from those tweeted by We Rate Dogs. Also, some of the names are hilarious and I wanted to see them all together in a single list.
For this post, I did not download the tweets via an API; instead, I worked with the CSV file provided in David Montgomery’s post. I also used some of his code to extract the ratings for each dog. All the R code in the code blocks below should be reproducible, although you may need to install a few packages first.
Get the data and extract the ratings
Processing strings to get a corpus of dog names
Once we have loaded the data using the code from the Pup Inflation blog post, we can use stringi, stringr and some regex to extract the names. Fortunately, the @dog_rates tweets have a pretty consistent format and use punctuation properly. From my experience following the account and from looking at the data frame, I identified three ways in which the dogs are introduced:
1.) This is X.
This is Aspen. She's never tasted a stick so succulent. On the verge of tears. A face of pure appreciation. 12/10 pic.twitter.com/VlyBzOXHEW
Knowing this, we can use regular expressions to extract the text between each of those phrases and a period.
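As a rough illustration of that idea, here is a minimal sketch with stringr. Only the "This is" phrase and the Aspen tweet come from the text above; the other two intro phrases ("Meet" and "Say hello to"), the second and third tweets, and the variable names are my own assumptions for the example, not necessarily what the original code used.

```r
library(stringr)

# Example tweets: the first is the Aspen tweet quoted above,
# the other two are hypothetical and only there for illustration.
tweets <- c(
  "This is Aspen. She's never tasted a stick so succulent. 12/10",
  "Meet Sam. Hypothetical tweet text. 11/10",
  "Here we have a pupper. 10/10"
)

# Assumed intro phrases: only "This is" is confirmed by the post.
intro <- "(?:This is|Meet|Say hello to)"

# Capture everything between the intro phrase and the first period.
pattern <- paste0("^", intro, " ([^.]+)\\.")

# str_match() returns a matrix; column 2 holds the first capture group.
# Tweets that do not match any intro phrase come back as NA.
dog_names <- str_match(tweets, pattern)[, 2]
dog_names
```

Tweets that don't introduce a dog by name simply yield NA, which makes them easy to drop in the next step.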
Next, we remove empty rows and those that didn’t start with a capital letter. For example, the phrase “this is not a dog.” would be picked up by the regex but we don’t want it.
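A small sketch of this filtering step, assuming the extracted text sits in a character vector (the vector and its contents here are made up for illustration):

```r
library(stringr)

# Hypothetical extraction results: two names, one false positive,
# one non-match (NA) and one empty string.
candidates <- c("Aspen", "not a dog", NA, "", "Sam")

# Keep entries that are not missing and start with a capital letter.
# The empty string fails the capital-letter check, so it is dropped too.
keep <- !is.na(candidates) & str_detect(candidates, "^[A-Z]")
dog_names <- candidates[keep]
dog_names
```

The capital-letter check is a blunt heuristic: it will also keep any capitalized phrase that is not really a name, which is why some manual cleanup is still needed later.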
In some cases, the tweets include two dogs, so we use tidyr to unnest rows that contain two names (e.g. Sam and Max, Sam & Max) … I wish I had known about this function back when I did all my unnesting by hand in Excel a few years ago.
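One possible way to do this with tidyr is separate_rows(), which splits a column on a separator and gives each piece its own row. This is only a sketch: the data frame, its column names and the ratings are hypothetical, and the original code may have used a different tidyr function.

```r
library(tidyr)

# Hypothetical rows; the ratings are made up for illustration.
dogs <- data.frame(
  name   = c("Sam and Max", "Sam & Max", "Aspen"),
  rating = c(11, 12, 12),
  stringsAsFactors = FALSE
)

# Split on " and " or " & " so that each dog gets its own row.
# The rating of the original tweet is repeated for both dogs.
dogs_long <- separate_rows(dogs, name, sep = " and | & ")
dogs_long
```

Rows with a single name pass through unchanged, so the step can be applied to the whole table at once.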
Finally, we can sort out some special characters and clean up some rows manually.
At this point we have a table with 860 names, of which 634 are unique. We also have the ratings in a separate column thanks to David Montgomery’s code.
Here is a random sample of 50 names:
To add a few more variables to play around with, we can use dplyr to summarize the data, getting the mean rating for each name. As a test, I also used the gender package to match the names against historical datasets of people’s names and add a column indicating whether each name is more likely to belong to a male or a female dog.
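The summary step can be sketched with dplyr as follows. The data frame and its values are hypothetical; the gender lookup is left out here because it depends on external historical name data.

```r
library(dplyr)

# Hypothetical table of names and ratings, one row per dog.
dogs <- data.frame(
  name   = c("Sam", "Sam", "Max"),
  rating = c(10, 12, 11),
  stringsAsFactors = FALSE
)

# Count how often each name appears and average its ratings,
# then sort by frequency (most common first) and name.
name_summary <- dogs %>%
  group_by(name) %>%
  summarise(n = n(), mean_rating = mean(rating)) %>%
  arrange(desc(n), name)

name_summary
```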
For visualization, I used ggplot to graph the 22 most common names in the corpus (ties are sorted in reverse alphabetical order) along with their sex and rating. I chose 22 arbitrarily, and even though the way I assigned male/female is questionable, this list of 22 names is heavily skewed towards names for male dogs. Keeping with the dog theme, I fed the ggplot object into the ggup function that I shared in my previous post. It can be sourced directly from a GitHub gist.
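A bar chart along these lines could be built roughly as shown here. Everything in the example data is made up, the column names are assumptions, and the ggup step is omitted because its interface is not described in this post.

```r
library(dplyr)
library(ggplot2)

# Hypothetical per-name summary; values are for illustration only.
name_summary <- data.frame(
  name        = c("Sam", "Max", "Aspen", "Rex"),
  n           = c(3, 3, 2, 1),
  mean_rating = c(11, 12, 12, 10),
  sex         = c("male", "male", "female", "male"),
  stringsAsFactors = FALSE
)

top_n_names <- 3  # the post uses 22; 3 keeps the example small

# Most common names first; ties broken in reverse alphabetical order.
top_names <- name_summary %>%
  arrange(desc(n), desc(name)) %>%
  head(top_n_names)

# Horizontal bars: length = frequency, fill = sex, label = mean rating.
p <- ggplot(top_names, aes(x = reorder(name, n), y = n, fill = sex)) +
  geom_col() +
  geom_text(aes(label = mean_rating), hjust = -0.2) +
  coord_flip() +
  labs(x = NULL, y = "Number of tweets", fill = "Sex")
```

Printing p draws the plot. Changing top_n_names is all it takes to show more or fewer names.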
There were 507 names that only appear once, so I chose 22 of these at random to make a similar plot, but in this case the bars represent the rating instead of the frequency.
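Picking the random subset can be sketched with dplyr's slice_sample(). The summary table is hypothetical, and the seed is an arbitrary choice that just makes the draw repeatable.

```r
library(dplyr)

set.seed(22)  # arbitrary seed so the same names are drawn each run

# Hypothetical per-name summary; values are for illustration only.
name_summary <- data.frame(
  name        = c("Sam", "Max", "Aspen", "Rex", "Bo"),
  n           = c(3, 1, 1, 1, 1),
  mean_rating = c(11, 12, 12, 10, 13),
  stringsAsFactors = FALSE
)

# Keep only names that appear once, then draw a random subset.
# The post draws 22; 2 keeps the example small.
singletons <- name_summary %>% filter(n == 1)
picked <- singletons %>% slice_sample(n = 2)
picked
```

Because each of these names appears only once, the mean rating is just the rating of that one dog, which is why it makes sense to plot the rating instead of the frequency.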
That’s all. Feel free to contact me if anything isn’t working or for any questions/comments.
Side note: I checked and the name Luna only appears three times in this dataset, with a mean rating of 12.333. This is kind of a low rating, but that’s because I only submitted photos of my own Luna for rating recently. I assume she will probably get at least twice the maximum rating, approximately 30/10. Here she is.