Unbreaking values with R

The example below shows some data for three players on the 1999 New York Knicks roster. Tables often look like this one in reports, books, or publications.

The table actually looks nice: it has merged cells, custom borders and lines, and some values are wrapped across multiple lines so that everything is easy to read and fits on a sheet of paper or a computer screen. I often come across tables like this in the PDFs of relatively old scientific papers, and the trouble starts when I want to read or import the data from these tables into something more structured and manageable.

Without cell merging, the data tends to look like this:

player	listed_height_m.	teams_chronological	position
Marcus Camby	2.11	Raptors	Power forward
NA	NA	Knicks	Center
NA	NA	Nuggets	NA
NA	NA	Clippers	NA
NA	NA	Trail Blazers	NA
NA	NA	Rockets	NA
NA	NA	Knicks	NA
Allan Houston	1.98	Pistons	Shooting guard
NA	NA	Knicks	NA
Latrell Sprewell	1.96	Warriors	Small forward
NA	NA	Knicks	NA
NA	NA	Timberwolves	NA

There is an inconsistent number of empty or NA values padding out the vertical space in some of the columns. Lately I’ve had to ‘unbreak’ the values in these types of tables and get rid of all the unnecessary NAs before doing any further wrangling. I’ve written about unbreaking values in the past, but that approach was tailored for a very specific use case and not very flexible. I was getting nowhere until I found this post by Mark Needham about squashing multiple rows per group into one.

Mark’s post took advantage of how dplyr::summarize() reduces multiple values down to a single value, and fed this output into the paste() function. My sneaky upgrade to his post was to first sort out a grouping variable, and then use summarize_all() to summarize multiple columns, using an na.omit() call to get rid of the NA values. Thanks to tidyeval, I was able to write this into a function that has saved me lots of time.

Let’s check it out, using the example from before.

Set up the data

# load all the necessary packages
library(dplyr)
library(tibble)
library(rlang)
library(tidyr)
# set up the data
nyk <- tribble(
                                          ~player, ~listed_height_m., ~teams_chronological,           ~position,
                                   "Marcus Camby",               2.11,              "Raptors",  "Power forward",
                                               NA,                 NA,               "Knicks",         "Center",
                                               NA,                 NA,              "Nuggets",               NA,
                                               NA,                 NA,             "Clippers",               NA,
                                               NA,                 NA,        "Trail Blazers",               NA,
                                               NA,                 NA,              "Rockets",               NA,
                                               NA,                 NA,               "Knicks",               NA,
                                  "Allan Houston",               1.98,              "Pistons", "Shooting guard",
                                               NA,                 NA,               "Knicks",               NA,
                               "Latrell Sprewell",               1.96,             "Warriors",  "Small forward",
                                               NA,                 NA,               "Knicks",               NA,
                                               NA,                 NA,         "Timberwolves",               NA
                               )

The tibble formatting in RStudio shows the NA mess.

For summarize() to work on grouped data, we use tidyr::fill() to populate missing values in a column with the previous entry until the value changes.

nyk %>% fill(player)

# A tibble: 12 x 4
   player           listed_height_m. teams_chronological position      
   <chr>                       <dbl> <chr>               <chr>         
 1 Marcus Camby                 2.11 Raptors             Power forward 
 2 Marcus Camby                NA    Knicks              Center        
 3 Marcus Camby                NA    Nuggets             NA            
 4 Marcus Camby                NA    Clippers            NA            
 5 Marcus Camby                NA    Trail Blazers       NA            
 6 Marcus Camby                NA    Rockets             NA            
 7 Marcus Camby                NA    Knicks              NA            
 8 Allan Houston                1.98 Pistons             Shooting guard
 9 Allan Houston               NA    Knicks              NA            
10 Latrell Sprewell             1.96 Warriors            Small forward 
11 Latrell Sprewell            NA    Knicks              NA            
12 Latrell Sprewell            NA    Timberwolves        NA

This structure is ready for the summarize_all() approach

nyk %>% fill(player) %>% 
        group_by(player) %>% 
        summarize_all(funs(paste(na.omit(.), collapse=", ")))

# A tibble: 3 x 4
  player           listed_height_m. teams_chronological                    position     
  <chr>            <chr>            <chr>                                  <chr>        
1 Allan Houston    1.98             Pistons, Knicks                        Shooting gua~
2 Latrell Sprewell 1.96             Warriors, Knicks, Timberwolves         Small forward
3 Marcus Camby     2.11             Raptors, Knicks, Nuggets, Clippers, T~ Power forwar~

It works!

Now let’s write that into a function that takes in the data, the name of the grouping variable, and whatever we want to use to separate the pasted values (i.e. the collapse argument).

unwrap_cols <- function(df,groupingVar,separator){
    groupingVar <- enquo(groupingVar)
      
    df %>% 
      fill(!!groupingVar) %>% 
      group_by(!!groupingVar) %>% 
      summarise_all(funs(paste(na.omit(.), collapse=separator)))
  }  

nyk %>% unwrap_cols(groupingVar = player, separator = ", ")

Using the function should give us the same output as before.

nyk %>% unwrap_cols(groupingVar = player, separator = ", ")

# A tibble: 3 x 4
  player           listed_height_m. teams_chronological                    position     
  <chr>            <chr>            <chr>                                  <chr>        
1 Allan Houston    1.98             Pistons, Knicks                        Shooting gua~
2 Latrell Sprewell 1.96             Warriors, Knicks, Timberwolves         Small forward
3 Marcus Camby     2.11             Raptors, Knicks, Nuggets, Clippers, T~ Power forwar~

I’m not referring to this as squashing or squishing because those terms are already used in other packages and they mean different things. I´ll stick with unbreaking. Note that stri_wrap() from the stringi package does more or less the opposite of this. Finally, this also works with blank values, we just need to replace empty with NA.

Let me know if you find this useful, or if anything is working as advertised. Thanks for reading.

Share on

Twitter Facebook LinkedIn

Luis D. Verde Arregoitia

Unbreaking values with R

Share on

You may also enjoy

Cell and text formatting is everywhere

2024 LLMs/genAI + R roundup

Descargar, procesar, y acomodar imágenes en un mosaico hexagonal interactivo

Download, batch process, and tile images