Luis Verde Arregoitia bio photo

Luis Verde Arregoitia

Personal research page - ecology . evolution . conservation . biogeography

Twitter Publons ResearchGate Google Scholar

The example below shows some data for three players on the 1999 New York Knicks roster. Tables often look like this one in reports, books, or publications.

The table actually looks nice: it has merged cells, custom borders and lines, and some values are wrapped across multiple lines so that everything is easy to read and fits on a sheet of paper or a computer screen. I often come across tables like this in the PDFs of relatively old scientific papers, and the trouble starts when I want to read or import the data from these tables into something more structured and manageable.

Without cell merging, the data tends to look like this:

player listed_height_m. teams_chronological position
Marcus Camby 2.11 Raptors Power forward
NA NA Knicks Center
NA NA Nuggets NA
NA NA Clippers NA
NA NA Trail Blazers NA
NA NA Rockets NA
NA NA Knicks NA
Allan Houston 1.98 Pistons Shooting guard
NA NA Knicks NA
Latrell Sprewell 1.96 Warriors Small forward
NA NA Knicks NA
NA NA Timberwolves NA

There is an inconsistent number of empty or NA values padding out the vertical space in some of the columns. Lately I’ve had to ‘unbreak’ the values in these types of tables and get rid of all the unnecessary NAs before doing any further wrangling. I’ve written about unbreaking values in the past, but that approach was tailored for a very specific use case and not very flexible. I was getting nowhere until I found this post by Mark Needham about squashing multiple rows per group into one.

Mark’s post took advantage of how dplyr::summarize() reduces multiple values down to a single value, and fed this output into the paste() function. My sneaky upgrade to his post was to first sort out a grouping variable, and then use summarize_all() to summarize multiple columns, using an na.omit() call to get rid of the NA values. Thanks to tidyeval, I was able to write this into a function that has saved me lots of time.

Let’s check it out, using the example from before.

Set up the data

# load all the necessary packages
library(dplyr)
library(tibble)
library(rlang)
library(tidyr)
# set up the data
nyk <- tribble(
                                          ~player, ~listed_height_m., ~teams_chronological,           ~position,
                                   "Marcus Camby",               2.11,              "Raptors",  "Power forward",
                                               NA,                 NA,               "Knicks",         "Center",
                                               NA,                 NA,              "Nuggets",               NA,
                                               NA,                 NA,             "Clippers",               NA,
                                               NA,                 NA,        "Trail Blazers",               NA,
                                               NA,                 NA,              "Rockets",               NA,
                                               NA,                 NA,               "Knicks",               NA,
                                  "Allan Houston",               1.98,              "Pistons", "Shooting guard",
                                               NA,                 NA,               "Knicks",               NA,
                               "Latrell Sprewell",               1.96,             "Warriors",  "Small forward",
                                               NA,                 NA,               "Knicks",               NA,
                                               NA,                 NA,         "Timberwolves",               NA
                               )

The tibble formatting in RStudio shows the NA mess.

For summarize() to work on grouped data, we use tidyr::fill() to populate missing values in a column with the previous entry until the value changes.

nyk %>% fill(player)
# A tibble: 12 x 4
   player           listed_height_m. teams_chronological position      
   <chr>                       <dbl> <chr>               <chr>         
 1 Marcus Camby                 2.11 Raptors             Power forward 
 2 Marcus Camby                NA    Knicks              Center        
 3 Marcus Camby                NA    Nuggets             NA            
 4 Marcus Camby                NA    Clippers            NA            
 5 Marcus Camby                NA    Trail Blazers       NA            
 6 Marcus Camby                NA    Rockets             NA            
 7 Marcus Camby                NA    Knicks              NA            
 8 Allan Houston                1.98 Pistons             Shooting guard
 9 Allan Houston               NA    Knicks              NA            
10 Latrell Sprewell             1.96 Warriors            Small forward 
11 Latrell Sprewell            NA    Knicks              NA            
12 Latrell Sprewell            NA    Timberwolves        NA            

This structure is ready for the summarize_all() approach

nyk %>% fill(player) %>% 
        group_by(player) %>% 
        summarize_all(funs(paste(na.omit(.), collapse=", ")))
# A tibble: 3 x 4
  player           listed_height_m. teams_chronological                    position     
  <chr>            <chr>            <chr>                                  <chr>        
1 Allan Houston    1.98             Pistons, Knicks                        Shooting gua~
2 Latrell Sprewell 1.96             Warriors, Knicks, Timberwolves         Small forward
3 Marcus Camby     2.11             Raptors, Knicks, Nuggets, Clippers, T~ Power forwar~

It works!

Now let’s write that into a function that takes in the data, the name of the grouping variable, and whatever we want to use to separate the pasted values (i.e. the collapse argument).

unwrap_cols <- function(df,groupingVar,separator){
    groupingVar <- enquo(groupingVar)
      
    df %>% 
      fill(!!groupingVar) %>% 
      group_by(!!groupingVar) %>% 
      summarise_all(funs(paste(na.omit(.), collapse=separator)))
  }  

nyk %>% unwrap_cols(groupingVar = player, separator = ", ")

Using the function should give us the same output as before.

nyk %>% unwrap_cols(groupingVar = player, separator = ", ")
# A tibble: 3 x 4
  player           listed_height_m. teams_chronological                    position     
  <chr>            <chr>            <chr>                                  <chr>        
1 Allan Houston    1.98             Pistons, Knicks                        Shooting gua~
2 Latrell Sprewell 1.96             Warriors, Knicks, Timberwolves         Small forward
3 Marcus Camby     2.11             Raptors, Knicks, Nuggets, Clippers, T~ Power forwar~

I’m not referring to this as squashing or squishing because those terms are already used in other packages and they mean different things. I´ll stick with unbreaking. Note that stri_wrap() from the stringi package does more or less the opposite of this. Finally, this also works with blank values, we just need to replace empty with NA.

Let me know if you find this useful, or if anything is working as advertised. Thanks for reading.