This is an updated version of a post from May 2017. I updated the code to keep up with updates in some packages, replaced all the functions from the apply family with map functions from the purrr package, replaced the figures with high-res versions, and added more detailed code annotations.
The function for adding dog images next to any plot object is in the Gist at the end of this post.
A recent study led by Heidi Parker produced some very interesting results about the origin of different dogs breeds, and how desirable traits from certain breeds have been bred into others. Read more about it here. This dog breed genome paper had a very pretty figure showing the relationship between 161 breeds. Fortunately, the authors made their phylogeny available. Supplementary dataset S2 provides a bootstrapped consensus cladogram built using genomic distances for over 1300 individuals. Having access to these data led to yet another dog-themed and R-themed post.
This post goes through three main steps:
Importing dog breed data.
Reading the cladogram.
Putting the tree and the breed data together for some visualizations.
Importing dog breed data
Knowing that reading and manipulate the dog breed tree shouldn’t be too much of a problem, I searched the web for dog breed data that I could match up with the cladogram. I found several relevant databases (such as this collection of spreadsheets) but I went with one in particular simply because of the format it came in.
After deciding to tackle json files I had to figure out how to work with this format in R, and I followed some of the steps that Jim Vallandingham used to analyse Kung Fu movies. This meant using the tidyjson package to read the json file and wrangle it into a tidy table structure. There are other packages for working with json data in R, but tidyjson has really smooth integration with dplyr and that sealed the deal.
All the R code in the code blocks should be fully reproducible. All the necessary data will be loaded directly from URLS.
Importing the breed attribute data
Match the tips labels with a separate table, even when there are errors or spelling variations
Once we have the data from the json file we can match it up with a table from the Parker et al. paper that explains the breed abbreviations and the clades they belong to. This way we can filter out the rows that aren’t present in the tree. To relate the breeds on the tree with the breed traits previously imported, I used the fuzzyjoin package to merge the table with the tip labels, with the breed names, with the table of breed attributes. The process of maximizing overlap between the taxa present in a tree and the taxa that have trait data available is a big deal in comparative studies.
By using fuzzyjoin, I managed to avoid losing information in the case of typos or minor variations in spelling. For example, “Toy Mnachester Terrier” is misspelled in the genome paper table but I was still able to match it, and the stringdist join also caught the alternative spellings of Xoloizcuintle. In the end I still had to make some matches manually (for example: specify that Foxhound in one table refers to American Foxhound in another).
Reading and pruning the tree
The cladogram provided as supplementary data is in nexus format (that for some reason came as plain text inside a pdf). Reading nexus files is straight-forward using ape, and because we are interested in a tree with only one tip for each breed, we can prune it using this clever set of steps written by Liam Revell (link here).
I rewrote the steps so that we use map functions from the purrr package to apply functions iteratively. The original steps use the apply family of functions, which I find more difficult to read and always end up using blindly hoping that they will work.
If you do any systematics or phylogeography work these steps are probably useful when you have a tree that contains many individuals from the same species and you only need one tip per species, or if you need only one species per genus from a tree.
Importing the tree
Afterwards, the tree can be matched up with the previously wrangled dog breed data using geiger to get both a trimmed tree and a trimmed dataset, sorted and ready for use. After these steps we end up with 136 breeds that are both present in the tree and in the table with the breed traits.
Visualize the tree and associated data
We can plot the tree in any number of ways. Lately I’ve been partial to Guangchuang Yu’sggtree package.
This is the tree in fan layout, and unsurprisingly, with this many tips it gets pretty cluttered. The dog genome paper provides additional information about the clades that different breeds belong to, so for a less cluttered visualization I chose a subset of some clades that I like and then trimmed the tree with another helpful set of steps also provided by Liam Revell. I worked with this subset of breeds for the rest of the post.
Once you get used to it, ggtree can be pretty flexible. Here I took advantage of ggtree to highlight some tips, change the fonts, and show the clades on the tree.
Showing the different clades on the figures already implies combining the tree topology with additional data, and ggtree has a convenient way to attach data to a tree (the %<+% operator). Here I used a nifty ggtree function (groupOTU()) for grouping coloring clades.
After plotting trees, the gheatmap function in comes in handy to show an associated data matrix. After associating the breed attribute data to the phylo object, we can draw a heatmap next to the tree to show some breed properties. Here I decided to plot the values for fur shedding, cold tolerance, and trainability for each breed. I arbitrarily categorized the breed scores into low, medium and high for a simpler three-color scheme.
Because these plots are actually showing data related to dogs, I’m well justified in using my silly ggpup function to add two dog photos next to my plot objects. The original ggpup function scraped two photos at random from a possible set of almost 200 breeds. I modified the function (see the gist at the end of this post) so that it now takes a vector of breeds to choose from, which will be matched against the available photos before sampling two at random. This way, the dog images added to the breed cladogram can actually correspond to breeds that appear in the tree.
The function itself has many dependencies and is a mess in terms of functional programming but it works, and writing it helped me learn about webscraping, working with grid objects, and table joins. I made some updates so that the scraped photos appear with labels.
Thanks for reading. Feel free to contact with my any questions or if the code isn’t working.
This post was written under the supervision of Luna the golden retriever.