Fixing broken and irregular column headers
Last week I saw Riva Quiroga masterfully import and clean a scary spreadsheet during an RLadies meetup, and I recognized some common issues with other people’s data for which I have already written relevant functions.
I also realized that I never wrote a dedicated post for the
mash_colnames function in the unheadr package. This functionality has been around for a while now, but is worth describing in detail (plus I get to create some new teaching materials). Credit goes to Jarret Byrnes, who contributed the initial version of the function in this GitHub issue. I just added some tidyeval and a few enhancements for a better fit within the package. Internally,
mash_colnames pivots the column headers and the first n rows of the data and ‘mashes’ whatever needs to be together columnwise.
mash_colnames() has two main uses. We can apply this nifty function to data with the following issues:
Data with the variable names split across > 1 rows (i.e. there are fragments of the headers in the first few data rows)
This little example here with rodent data has different bits of the column headers spread out ‘vertically’ for some reason. This is quite common, as far as I’ve seen. This next image below explains the problem a bit better. Notice that if there were any separators between the pieces of column headers, these are now implicit.
Ultimately, we most likely are interested in something like this:
Names split across >1 rows but with gaps in the headers at the very top
This happens when cells were originally merged in a spreadsheet or formatted table, or maybe the gaps are there intentionally to imply that the values along this row are the same until a new one appears. I’m not sure about the correct terms for this but Charlie Hadley referred to this as “non-regular spanning of column headers”.
With colors to show which columns are meant to share a piece of header, the data look like this:
A more usable version of the data would look like this:
Having shown the two common issues that we can address with
unheadr, let’s work through the same examples using code.
Working with code
Let’s set up the same example data from the images above and fix the issues.
The data in tibble form:
Note the NAs padding the empty spaces.
To fix messy names broken across rows, we tell
mash_colnames() how many data rows have header fragments in them. In this case, it’s two, the names don’t count as data rows. The default separator in the function is the underscore, but we can change it to spaces or dots or whatever.
The data with fixed names:
Pretty nice! No more unnecessary NAs.
In some cases we may recognize messy names or similar issues in a dataset, so we skip the first row. It may look like this, with automated names that do not mean anything.
For cases like these, we can use the
keep_names argument to ignore the names when we’re mashing. In this case we work with three data rows, which hold all the pieces of the names.
The result is the same as before.
Let’s set up the example data with the gaps in the names column. When there are gaps in the column names we tend to skip the names during the import step. The
keep_names argument really comes in handy here.
The approach is similar, and to deal with the gaps that imply a repeated value across columns, we use the
sliding_headers argument. By setting it to
TRUE we fill the gaps from left to right.
The cleaned-up version is ready for further analysis.
That’s it! Feel free to reach out with any questions/feedback or if I was using the wrong data-structuring terms for this.