Unbreaking values with R
The example below shows some data for three players on the 1999 New York Knicks roster. Tables often look like this one in reports, books, or publications.
The table actually looks nice: it has merged cells, custom borders and lines, and some values are wrapped across multiple lines so that everything is easy to read and fits on a sheet of paper or a computer screen. I often come across tables like this in the PDFs of relatively old scientific papers, and the trouble starts when I want to read or import the data from these tables into something more structured and manageable.
Without cell merging, the data tends to look like this:
player | listed_height_m. | teams_chronological | position |
---|---|---|---|
Marcus Camby | 2.11 | Raptors | Power forward |
NA | NA | Knicks | Center |
NA | NA | Nuggets | NA |
NA | NA | Clippers | NA |
NA | NA | Trail Blazers | NA |
NA | NA | Rockets | NA |
NA | NA | Knicks | NA |
Allan Houston | 1.98 | Pistons | Shooting guard |
NA | NA | Knicks | NA |
Latrell Sprewell | 1.96 | Warriors | Small forward |
NA | NA | Knicks | NA |
NA | NA | Timberwolves | NA |
There is an inconsistent number of empty or NA values padding out the vertical space in some of the columns. Lately I’ve had to ‘unbreak’ the values in these types of tables and get rid of all the unnecessary NAs before doing any further wrangling. I’ve written about unbreaking values in the past, but that approach was tailored for a very specific use case and not very flexible. I was getting nowhere until I found this post by Mark Needham about squashing multiple rows per group into one.
Mark’s post took advantage of how dplyr::summarize() reduces multiple values down to a single value, and fed this output into the paste() function. My sneaky upgrade to his post was to first sort out a grouping variable, and then use summarize_all() to summarize multiple columns, using an na.omit() call to get rid of the NA values. Thanks to tidyeval, I was able to write this into a function that has saved me lots of time.
Let’s check it out, using the example from before.
Set up the data
The tibble formatting in RStudio shows the NA mess.
For summarize() to work on grouped data, we use tidyr::fill() to populate missing values in a column with the previous entry until the value changes.
This structure is ready for the summarize_all() approach
It works!
Now let’s write that into a function that takes in the data, the name of the grouping variable, and whatever we want to use to separate the pasted values (i.e. the collapse argument).
Using the function should give us the same output as before.
I’m not referring to this as squashing or squishing because those terms are already used in other packages and they mean different things. I´ll stick with unbreaking. Note that stri_wrap() from the stringi package does more or less the opposite of this. Finally, this also works with blank values, we just need to replace empty with NA.
Let me know if you find this useful, or if anything is working as advertised. Thanks for reading.