Update: 02/07/2017

The function described in this post received an awesome makeover from Jenny Bryan. I’m leaving the code in this post as is, so go read her take here. Although my #messyverse approach works, the remixed function is a million times better. This was a huge learning experience for me and I’m grateful to Jenny and the R community in general for not putting me down for being a programming tourist.

Early in 2016 I asked the Twitter R crew for advice on how to untangle interspersed header rows in a table and put them into their own columns.

In one of the replies, Jenny Bryan commented that:

this is sadly beyond the reach of tidyr; it is evil and yet totally typical of spreadsheets in the wild

Unofficially, this is Part Three (of many) in series of posts that break down some simple steps to wrangle the data that is often provided as part of comparative or collections-based studies. Part 1 is available here; Part 2 is here.

This post describes one way of using Tidy Evaluation in a function that will pull out any pesky header rows from a column and place them where they belong: in a separate column.

Let’s have a look at two examples showing how header rows usually appear in real datasets.

First up is something that I always see in my field of comparative biology: header rows interspersed within a list of species to show different hierarchical taxonomic levels (families end in dae). This is what motivated me to write the function in the first place. In this example a single variable that includes the taxonomic family to which different species of rodents belong to, along with the diet type for species within the families.

  jumbled mess
1 Muridae
2 diet:seeds
3 Sp1
4 Sp2
5 Sp3
6 diet:unknown
7 Sp4
8 Sp5
9 Cricetidae
10 diet:fruits
11 Sp11
12 Sp32
13 Sp113

The second example is a silly one, but it also shows the same structure:

jumbled mess
Monday
school
homework
Friday
school
party

This is a common way to save space and avoid repetition when presenting tables, but data with interspersed header rows is not exactly analysis-ready.

Recently, I had to deal with lots of tables with header rows, so I used dplyr and tidyr to:

  1. Put the headers in a new variable and tag the rows that would need to be removed afterwards. This involves using case_when() inside a mutate() call to create new variables and populate them conditionally based on string matching.
  2. Fill the empty rows in the new variable using the previous entry.
  3. Filter out unwanted rows and drop the temporary variable used to tag the unwanted rows.
# set up data frame
dat <- data.frame(jumble=c("Muridae","diet:seeds","Sp1","Sp2","Sp3","diet:unknown","Sp4","Sp5","Cricetidae","diet:fruits","Sp11","Sp32","Sp113"),stringsAsFactors = FALSE)
# load packages
library(dplyr)
library(tidyr)
# create new variables
dat <- dat %>% mutate(family= case_when(grepl("dae",dat$jumble) ~ dat$jumble),
                      removeLater = case_when(grepl("dae",dat$jumble)~"yes"))
# clean up
dat <- dat %>% fill(family) %>% 
  filter(is.na(removeLater)) %>% select(-removeLater)

This is what the resulting data frame looks like.

> dat
         jumble     family
1    diet:seeds    Muridae
2           Sp1    Muridae
3           Sp2    Muridae
4           Sp3    Muridae
5  diet:unknown    Muridae
6           Sp4    Muridae
7           Sp5    Muridae
8   diet:fruits Cricetidae
9          Sp11 Cricetidae
10         Sp32 Cricetidae
11        Sp113 Cricetidae

Note that the two important inputs here were the name of the new variable (“family”) and the string that was matched to pick out the header rows (“dae”). When header rows correspond to taxonomic ranks, zoological nomenclature makes it easy to match different suffixes. This set of steps could be repeated to pull out the diet categories, matching the string ‘diet’ to end up with a table like this one:

  jumble     family     dietType
1    Sp1    Muridae   diet:seeds
2    Sp2    Muridae   diet:seeds
3    Sp3    Muridae   diet:seeds
4    Sp4    Muridae diet:unknown
5    Sp5    Muridae diet:unknown
6   Sp11 Cricetidae  diet:fruits
7   Sp32 Cricetidae  diet:fruits
8  Sp113 Cricetidae  diet:fruits

If a variable has several different header rows, copying and pasting code over and over isn’t functional at all and the possibility of making mistakes increases. At the time when I finally decided to try and make a function out of these steps, dplyr 0.7.7 was released, with a major change in the form of a new system for programming (Tidy Evaluation). Tidy eval is so well explained in the Rstudio blog and in the dplyr documentation that even a dummy like me was able to program a function to untangle header rows.

The documentation and examples helped me understand how to pass arguments onto functions and how to convert expressions into strings. Tidy evaluation uses a few new functions and helper operators to make custom functions that work like dplyr functions. The untangle() function can be sourced directly from the gist at the end of this post and used on either tibbles or data frames. Just make sure to have the variable with the messy rows at the beginning of the table (column 1).

Let’s try it out on the same example dataset.

# source the function
source("https://gist.githubusercontent.com/luisDVA/0a479881450181f06ac5a117a6aa5d46/raw/f16335e2d9813eb42c98f1c4388bb0b998522aac/untangleFunction.R")

# set up data frame
dat <- data.frame(jumble=c("Muridae","diet:seeds","Sp1","Sp2","Sp3","diet:unknown","Sp4","Sp5","Cricetidae","diet:fruits","Sp11","Sp32","Sp113"),stringsAsFactors = FALSE)
# show the header rows who is boss
dat <- dat %>% untangle("dae","family") %>% 
          untangle("diet","diet_type")
          

The function also works in pipes, giving us this result:

> dat
  jumble     family    diet_type
1    Sp1    Muridae   diet:seeds
2    Sp2    Muridae   diet:seeds
3    Sp3    Muridae   diet:seeds
4    Sp4    Muridae diet:unknown
5    Sp5    Muridae diet:unknown
6   Sp11 Cricetidae  diet:fruits
7   Sp32 Cricetidae  diet:fruits
8  Sp113 Cricetidae  diet:fruits

Now, what if the header rows don’t follow a naming convention such as shared prefixes/suffixes or labels? Take a look at this example with the names of some traditional foods:

foods
Mexican
tacos
chilaquiles
gorditas
Swiss
rösti
Appenzeller biberli

There is nothing in common between the header rows that in this case contain the country of origin for the different dishes. However, grepl is pretty flexible in the patterns it can match, so we can use the or operator (|) to look for multiple matches. Let’s see what happens:

# foods data frame
dat <- data.frame(foods=c("Mexican","tacos","chilaquiles","gorditas","Swiss","rösti","Appenzeler biberli"),stringsAsFactors = FALSE)
# untangle header rows
dat %>% untangle("xican|Swis","nation")

Note how I didn’t spell out the entire strings to match, mostly to show that grepl handles partial matches. I’m sure this could be adapted to vectorize the matching over a separate vector.

> dat
               foods  nation
1              tacos Mexican
2        chilaquiles Mexican
3           gorditas Mexican
4              rösti   Swiss
5 Appenzeler biberli   Swiss

This amateurish function can probably be improved, but for now I’ve used it effectively to deal with the tables from over ten different sources that all use various forms of header rows. If you have any feedback please let me know.

The untangle function: