pupdate - Nov 2017. Claus Wilke has added a ‘jittered points’ option to the ridgeline geom that basically does the same as my hacky beeswarm approach but with less code. I added an example of this feature to the post.
pupdate - Oct 2017. I’ve updated this post to reflect the project-wide renaming of the ggridges package and added more code annotation.
In recent weeks there has been much interest in making cool-looking plots of overlapping density distributions. Basically: stacking many overlapping polygons/ribbons for a nice visual effect.
I saw this kind of plot a few weeks back in a New York Times infographic, with several more examples appearing in my Twitter feed this month. Most notably, the Free Time survey plot.
The overlapping density plots are very appealing visually, and definitely very challenging to make. Claus Wilke recently stepped up to the challenge and created ggridges, an R package for creating the ridgeline plots.
Kernel densities look good, and they work well for big datasets with clear unimodal or bimodal distributions. However, with smaller datasets I feel that density functions reflect the choice of smoothing parameters more than they reflect the actual distribution of the underlying data. The optimally smoothed kernels may not be the prettiest, so it is probably worth trying to show the densities together with the underlying data.
With density plots, it’s difficult to see where the data actually are, and as Andrew Gelman commented:
‘I’d rather just see what’s happening … rather than trying to guess by taking the density estimate and mentally un-convolving the kernel.’
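To make the point about smoothing concrete, here is a minimal sketch in base R. The sample is made up (15 values in two clusters, not the dog data), and the three bandwidths are arbitrary choices for illustration. The same data gives curves with a different number of bumps depending only on `bw`.

```r
# Made-up sample: 15 values drawn from two clusters (illustration only)
set.seed(42)
jumps <- c(rnorm(8, mean = 4, sd = 0.3), rnorm(7, mean = 6, sd = 0.3))

# Same data, three bandwidths: narrow, R's default rule of thumb, and wide
bandwidths <- c(narrow = 0.1, default = bw.nrd0(jumps), wide = 1.5)
dens <- lapply(bandwidths, function(bw) density(jumps, bw = bw))

# Count the local maxima of each estimated curve
count_peaks <- function(d) sum(diff(sign(diff(d$y))) == -2)
peaks <- vapply(dens, count_peaks, integer(1))
peaks
```

Plotting the three elements of `dens` with `plot()` and `lines()` shows the same thing visually: the narrow bandwidth chases individual points, while the wide one smooths the two clusters together.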
For this post, I go through some code for making density plots that also show the underlying data. As usual, I show this using the best type of data: dog data.
To plot the distribution of variable values for different groups, I used the maximum jump distance for several hundred dogs that participated in the SplashDogs (http://www.splashdogs.com/) ‘Super Air’ dock jumping competition during 2016. Dock jumping is essentially a long jump sport for dogs. Dogs run along a ~12 meter dock and jump into the water, usually chasing a toy. Jumps are measured from the edge of the dock to the point where the base of the dog’s tail first enters the water.
This post has three main steps: scraping the jump distance data, wrangling it, and plotting it. This post in particular would not have been possible without all the resources and advice from Bob Rudis that are floating around the web. This includes posts on his blog, answers to random Stack Overflow questions, tweets, and his helpful R packages. At the end of this post, I tried to add links to all the hrbrverse resources that helped me along the way. All the code here is fully reproducible, although you may need to install some packages first.
Web scraping
I did not find any Terms of Service prohibiting automated data grabbing or visualization by third parties anywhere on the SplashDogs website or in the site’s robots.txt file. Remember to always check whether scraping is allowed and adhere to all Terms and Conditions. Here’s a brief guide on how to crawl the web politely. Take breaks between sequential requests, be kind to web servers when scraping, and just be nice in general. What would the dogs think if you crashed a site?
To scrape the data, I used rvest to interact with the web form on the site, making queries for event results by breed and year. I was only after data for a few breeds, and I managed to abstract the scraping into a function and use purrr (a first for me!) to iterate through a small vector of breeds that I chose following two main criteria: personal bias and representation in the competitions. I wanted to compare groups with several hundred entries (Labradors) vs groups with just a few (American Pit Bull Terriers).
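The pattern of one scraping function iterated over a vector of breeds can be sketched as follows. This is not the code used for the post: `fake_results_page()` is a hypothetical stand-in that returns a tiny hard-coded HTML table so the example runs offline, and the column names, dog names, and distances are invented. A real version would submit the site's form at that step. The sketch assumes rvest 1.0 or later (for `html_element()`).

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical stand-in for a real request: returns a small results page.
# The table content is invented; a real function would query the site here.
fake_results_page <- function(breed) {
  "<table>
     <tr><th>Dog</th><th>Handler</th><th>Distance</th></tr>
     <tr><td>Rex</td><td>A. Smith</td><td>18.5</td></tr>
     <tr><td>Rex</td><td>A. Smith</td><td>20.1</td></tr>
   </table>"
}

# Parse one breed's results table into a data frame, pausing before each call
get_breed_results <- function(breed, pause = 1) {
  Sys.sleep(pause)  # take a break between sequential requests
  page <- read_html(fake_results_page(breed))
  results <- html_table(html_element(page, "table"))
  results$breed <- breed
  results
}

breeds <- c("Labrador Retriever", "Border Collie")

# Iterate over the breeds and stack the per-breed tables
all_results <- bind_rows(map(breeds, get_breed_results, pause = 0.1))
```

Keeping the pause inside the function means every request is throttled no matter how the function is called.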
Data wrangling
After putting the html tables into data frames, it was a straightforward process to summarize the data. I cleaned up some unnecessary spaces in the handler names, and kept only the maximum jump distance for each dog.
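The two wrangling steps described above can be sketched like this. The data frame and its column names (`breed`, `dog`, `handler`, `distance`) are assumptions made for the example, not the actual scraped tables.

```r
library(dplyr)
library(stringr)

# Invented example rows; note the stray spaces in the handler names
raw <- tibble(
  breed    = c("Labrador", "Labrador", "Labrador", "Whippet"),
  dog      = c("Rex", "Rex", "Rex", "Luna"),
  handler  = c("A. Smith", " A. Smith ", "A.  Smith", "B. Jones"),
  distance = c(18.5, 20.1, 19.0, 24.3)
)

best_jumps <- raw %>%
  # Trim leading/trailing spaces and collapse repeated internal spaces
  mutate(handler = str_squish(handler)) %>%
  # One row per dog, keeping only its longest jump
  group_by(breed, dog, handler) %>%
  summarise(max_jump = max(distance), .groups = "drop")
```

Cleaning the handler names before grouping matters here: without it, the three spellings of the same handler would be treated as three different dog and handler pairs.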
Plotting
My approach was to create a one-sided beeswarm plot object for each group and plot it over the respective density. I made two versions: one in which the densities and the point swarms are scaled, and one without scaling. I’m using faceting here, and I didn’t try to make the densities overlap.
This code is clunky and it needs different data frames with pre-summarized information, but I’m happy with the results. The forcats package was very useful for reordering the factor levels whenever I had to arrange the groups for plotting.
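As a small illustration of the factor reordering mentioned above, here is a sketch using `fct_reorder()` from forcats. The breeds and values are made up, and ordering by the median is just one possible choice; the post's own code may order the groups differently.

```r
library(forcats)

# Made-up best jumps for three breeds (not the SplashDogs data)
best_jumps <- data.frame(
  breed    = c("Labrador", "Labrador", "Whippet", "Whippet", "Pug", "Pug"),
  max_jump = c(18, 20, 24, 26, 6, 8)
)

# Reorder the breed levels by each breed's median jump, lowest first
best_jumps$breed <- fct_reorder(best_jumps$breed, best_jumps$max_jump, .fun = median)

levels(best_jumps$breed)
# "Pug" "Labrador" "Whippet"
```

ggplot2 arranges facets and discrete axes by factor level, so this ordering carries straight through to the plot.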
Here’s the result with scaled densities and point swarms.
Here’s a version with unscaled densities and point swarms.
For comparison, here’s a plot of the same data using geom_density_ridges() and some theming to make the plot look extra cool. It looks really crisp, and the default plots can be built with a single line of code.
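A default ridgeline plot really is about one line of plotting code. This sketch uses simulated values and an invented data frame (`toy`) rather than the scraped results, and leaves out all theming.

```r
library(ggplot2)
library(ggridges)

# Simulated values for three groups (illustration only)
set.seed(1)
toy <- data.frame(
  breed = rep(c("A", "B", "C"), each = 50),
  jump  = c(rnorm(50, mean = 5), rnorm(50, mean = 6), rnorm(50, mean = 4.5))
)

# The grouping variable goes on the y axis; one ridge is drawn per group
p <- ggplot(toy, aes(x = jump, y = breed)) + geom_density_ridges()
p
```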
I suspect that what I’ve done with the beeswarm points can be made into a geom to accompany geom_density_ridges. If you’re good at ggproto let me know and we can try it out.
Here’s an example with the new jittered_points argument in ggridges:
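A minimal sketch of that argument, again on simulated data rather than the dog jumps (the `toy` data frame, the group sizes, and the `alpha` value are assumptions for illustration):

```r
library(ggplot2)
library(ggridges)

# Simulated values: one large group and one small group (illustration only)
set.seed(2)
toy <- data.frame(
  breed = rep(c("Many dogs", "Few dogs"), times = c(200, 12)),
  jump  = c(rnorm(200, mean = 5), rnorm(12, mean = 6))
)

# jittered_points = TRUE draws the raw observations inside each ridge
p_points <- ggplot(toy, aes(x = jump, y = breed)) +
  geom_density_ridges(jittered_points = TRUE, alpha = 0.6)
p_points
```

With the points drawn in, it is easy to see that the smooth curve for the small group rests on only a dozen observations.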
Update 16/0/2017: After posting this I was very impressed with the jumping prowess of dogs in general so I decided to add a comparison with human jumping skills. I found data for the London 2012 Olympics in this Google Sheets document put together by The Guardian and used the googlesheets package to download the data and repeat the process. The new figure and the code for downloading the data are below:
Get the data by iterating through the sheets in the workbook. See this GitHub issue for a better explanation.