Making sense of trees
In the words of Nick Matzke:
R’s tree structure is pretty non-intuitive, compared to the “a tree is a collection of node objects” structure that is taught in most phylogenetics courses and used in e.g. C++ software. It’s done this way because R likes lists, and doesn’t like objects.
The structure of phylogenetic trees in R (namely APE’s phylo objects) can be confusing because much of the information needed to describe how different nodes and branches make up a tree is implicit.
I learned this the hard way during the early days of my PhD research. My thesis work revolved around phylogenetic lineage ages, so one of my very first tasks after finding a suitable tree was to get a table with species and their ‘edge lengths’ as a way to (more or less) represent their evolutionary ages. At this time I was very new to R and I lost months trying to figure this out on my own. After going over the APE book I slowly realized that lineage age information is contained within the phylo objects and that it can be extracted, but only if you ask nicely.
Phylo objects have a vector of edge lengths for every node, but I was unable to figure out how to get a list of tips and internal nodes that would match up with the vector of edge lengths. After much struggle, I figured out that the branching.times() function computes the distance from each node to the tips, producing a named vector that corresponds to the nodes in the tree. I also learned that the mrca() function produces a matrix of node numbers corresponding to the most recent common ancestors between pairs of tips or nodes. The next challenge would be to put the two together.
I’m writing this brief post because I found an R script from early 2012 in a random USB stick, in which coding superstar Jeff Hanson walked me through indexing, building loops, and functions in order to extract the species lineage ages for any phylogeny object.
The function below has a couple of loops that iterate through the mrca matrix and the branching times vector to produce a data frame of edge lengths for all the tree tips. The for loops make for slow processing if we have large trees but I was patient and that exact function created the main dataset that I analysed for a large part of my thesis. If you’re interested have a look at Jeff’s crisp programming to see how clever he was when coming up with the loops.
A few years later I learned that the BioGeoBEARS R package includes the prt() function, which Nick Matzke wrote specifically to print the content of a tree into a tabular format, making all the implicit information explicit. This function is faster and provides much more information, and I was very relieved to see that the edge lengths it provides are the same Jeff and I obtained with the custom function.
Lets have a look. Both functions keep the tip order so comparing the results is easy. This examples uses the built in tree of bird families (Sibley and Ahlquist 1990) from the ape package.
Now we can do side-by-side glimpse of the two data frames, and time how long the two functions take to process the data.
prt is about 4.5x faster than the loops, and this becomes relevant as the number of tips increases or in the case of multiphylo objects with varying edge lengths so I recommend using it for any work on edge lengths.