Update: 02/07/2017
The function described in this post received an awesome makeover from Jenny Bryan. I’m leaving the code in this post as is, so go read her take here. Although my #messyverse approach works, the remixed function is a million times better. This was a huge learning experience for me and I’m grateful to Jenny and the R community in general for not putting me down for being a programming tourist.
Early in 2016 I asked the Twitter R crew for advice on how to untangle interspersed header rows in a table and put them into their own columns.
what's the #dplyr or #tidyr way of tidying these header rows into columns?#rstats pic.twitter.com/XRMiLpn02t
— Luis D. Verde (@LuisDVerde) January 24, 2016
In one of the replies, Jenny Bryan commented that:
this is sadly beyond the reach of tidyr; it is evil and yet totally typical of spreadsheets in the wild
Unofficially, this is Part Three (of many) in series of posts that break down some simple steps to wrangle the data that is often provided as part of comparative or collections-based studies. Part 1 is available here; Part 2 is here.
This post describes one way of using Tidy Evaluation in a function that will pull out any pesky header rows from a column and place them where they belong: in a separate column.
Let’s have a look at two examples showing how header rows usually appear in real datasets.
First up is something that I always see in my field of comparative biology: header rows interspersed within a list of species to show different hierarchical taxonomic levels (families end in dae). This is what motivated me to write the function in the first place. In this example a single variable that includes the taxonomic family to which different species of rodents belong to, along with the diet type for species within the families.
jumbled mess | |
---|---|
1 | Muridae |
2 | diet:seeds |
3 | Sp1 |
4 | Sp2 |
5 | Sp3 |
6 | diet:unknown |
7 | Sp4 |
8 | Sp5 |
9 | Cricetidae |
10 | diet:fruits |
11 | Sp11 |
12 | Sp32 |
13 | Sp113 |
The second example is a silly one, but it also shows the same structure:
jumbled mess |
---|
Monday |
school |
homework |
Friday |
school |
party |
This is a common way to save space and avoid repetition when presenting tables, but data with interspersed header rows is not exactly analysis-ready.
Recently, I had to deal with lots of tables with header rows, so I used dplyr and tidyr to:
- Put the headers in a new variable and tag the rows that would need to be removed afterwards. This involves using case_when() inside a mutate() call to create new variables and populate them conditionally based on string matching.
- Fill the empty rows in the new variable using the previous entry.
- Filter out unwanted rows and drop the temporary variable used to tag the unwanted rows.
This is what the resulting data frame looks like.
Note that the two important inputs here were the name of the new variable (“family”) and the string that was matched to pick out the header rows (“dae”). When header rows correspond to taxonomic ranks, zoological nomenclature makes it easy to match different suffixes. This set of steps could be repeated to pull out the diet categories, matching the string ‘diet’ to end up with a table like this one:
If a variable has several different header rows, copying and pasting code over and over isn’t functional at all and the possibility of making mistakes increases. At the time when I finally decided to try and make a function out of these steps, dplyr 0.7.7 was released, with a major change in the form of a new system for programming (Tidy Evaluation). Tidy eval is so well explained in the Rstudio blog and in the dplyr documentation that even a dummy like me was able to program a function to untangle header rows.
The documentation and examples helped me understand how to pass arguments onto functions and how to convert expressions into strings. Tidy evaluation uses a few new functions and helper operators to make custom functions that work like dplyr functions. The untangle() function can be sourced directly from the gist at the end of this post and used on either tibbles or data frames. Just make sure to have the variable with the messy rows at the beginning of the table (column 1).
Let’s try it out on the same example dataset.
The function also works in pipes, giving us this result:
Now, what if the header rows don’t follow a naming convention such as shared prefixes/suffixes or labels? Take a look at this example with the names of some traditional foods:
foods |
---|
Mexican |
tacos |
chilaquiles |
gorditas |
Swiss |
rösti |
Appenzeller biberli |
There is nothing in common between the header rows that in this case contain the country of origin for the different dishes. However, grepl is pretty flexible in the patterns it can match, so we can use the or operator (|) to look for multiple matches. Let’s see what happens:
Note how I didn’t spell out the entire strings to match, mostly to show that grepl handles partial matches. I’m sure this could be adapted to vectorize the matching over a separate vector.
This amateurish function can probably be improved, but for now I’ve used it effectively to deal with the tables from over ten different sources that all use various forms of header rows. If you have any feedback please let me know.
The untangle function: