Tidyeval meets PDF table hell

Update - April 2018 - In the most recent release of rlang (0.2.0), we can use ensym() as a new variant of enexpr() for cleaner code. I’ve updated the code to reflect this change. Thanks to Hadley Wickham for the heads up.

Although it first became a feature of dplyr in June of 2017, tidy evaluation is once again in the spotlight after the 2018 RStudio conference. This is a good compilation of tidyeval resources, and I suggest watching this five-minute video of Hadley Wickham explaining the big ideas behind tidy evaluation while wearing a stylish sweater.

When tidyeval originally came out, I jumped at the chance to program with dplyr. I blogged about writing a function to deal with non-data rows embedded as hierarchical headers in the data rectangle. Unsurprisingly, I butchered the use of tidyeval and function writing in general, but I was rescued by Jenny Bryan in this post.

As a biologist, the ‘untangle’ function that came out of that exchange has saved me hours upon hours of work, because comparative data always has taxonomic header rows that I usually had to tidy up by hand in a spreadsheet program.

PDF table hell

In my ongoing work with other people’s data, I came across values that are broken up into two lines for whatever reason (often to optimize space on a page in a table in a typeset pdf).

I encounter broken-up values frequently in my biology research, here’s an example that isn’t made up.

Thrichomys cunicularius is broken up into two lines

This is a very common practice, a lot of the pdf tables that I work with (using the awesome tabulizer package) have ‘merged’ cells that end up as broken values.

Here’s a toy example with some data from the summer Olympics.

Games	Country	Soccer_gold_medal
Los Angeles 1984	USA	France
Barcelona	Spain	Spain
1992	NA	NA
Atlanta 1996	USA	Nigeria
Sydney 2000	Australia	Cameroon
London	UK	Mexico
2012	NA	NA

The values for two of the games (Barcelona 1992 & London 2012) are broken up into separate rows, adding a bunch of empty/NA values in the rows that shouldn’t really be there.

This is what the table should look like:

Games_unbroken	Country	Soccer_gold_medal
Los Angeles 1984	USA	France
Barcelona 1992	Spain	Spain
Atlanta 1996	USA	Nigeria
Sydney 2000	Australia	Cameroon
London 2012	UK	Mexico

Using Jenny Bryan’s version of the untangle function as a template, I wrote the ‘unbreak_vals’ function below to unbreak values using tidyeval.

Assuming that:

the NA values in the table only correspond to the rows with broken-up values

the broken-up values can be matched with regex

this function will glue the two value fragments together and get rid of the extra row (via a hacky fill-then-slice operation).

Let’s try it out.

After loading the tidyverse set of packages and rlang, we’ll create the above table, define the “unbreak_vals” function, and use it – matching the rows that start out with numbers with the regex.

library(tidyverse)
library(rlang)

OGames <- tibble(Games = c("Los Angeles 1984","Barcelona","1992","Atlanta 1996","Sydney 2000","London","2012"),
                Country = c("USA","Spain",NA,"USA","Australia","UK",NA),
                Soccer_gold_medal = c("France","Spain",NA,"Nigeria","Cameroon","Mexico",NA))

Let’s check it out

> OGames
# A tibble: 7 x 3
  Games            Country   Soccer_gold_medal
  <chr>            <chr>     <chr>            
1 Los Angeles 1984 USA       France           
2 Barcelona        Spain     Spain            
3 1992             NA        NA               
4 Atlanta 1996     USA       Nigeria          
5 Sydney 2000      Australia Cameroon         
6 London           UK        Mexico           
7 2012             NA        NA

Unbreak the lines, matching strings that start with a number

unbreak_vals <- function(df,regex,ogcol,newcol){
  ogcol <- enquo(ogcol)
  newcol <- ensym(newcol)
  
  df %>% 
    mutate(
      !!newcol := ifelse(grepl(regex,!!ogcol),
                         yes = paste(lag(!!ogcol),!!ogcol),
                         no = !!ogcol)
    ) %>% 
    fill(everything()) %>% 
    slice(-(which(str_detect(!!ogcol,regex))-1)
    ) %>%
    select(-!!ogcol)
}

OGames %>% unbreak_vals("^[0-9]",Games,Games_unbroken) %>% 
  select(Games_unbroken,everything())

It worked!

A tibble: 5 x 3
  Games_unbroken   Country   Soccer_gold_medal
  <chr>            <chr>     <chr>            
1 Los Angeles 1984 USA       France           
2 Barcelona 1992   Spain     Spain            
3 Atlanta 1996     USA       Nigeria          
4 Sydney 2000      Australia Cameroon         
5 London 2012      UK        Mexico

Another case of broken values that I’ve seen is when additional descriptions are interspersed below the original values in separate rows. This is a single-column example from a spreadsheet I had lying around.

dogsDesc <- tibble(dogs=c("Terrier","(Lakeland)","Terrier","(Soft-coated wheaten)","Bulldog","(English)","Bulldog","(French)"))

> dogsDesc
# A tibble: 8 x 1
  dogs                 
  <chr>                
1 Terrier              
2 (Lakeland)           
3 Terrier              
4 (Soft-coated wheaten)
5 Bulldog              
6 (English)            
7 Bulldog              
8 (French)

Matching the opening bracket with the regex:

dogsDesc %>% unbreak_vals("^\\(",dogs,dogs_desc)

# A tibble: 4 x 1
  dogs_desc                    
  <chr>                        
1 Terrier (Lakeland)           
2 Terrier (Soft-coated wheaten)
3 Bulldog (English)            
4 Bulldog (French)

I have lots to learn about writing functions, but so far this ‘unbreak_vals’ function has already saved me lots of time and hassle and painful spreadsheet editing. If you have any questions or if you find this helpful please let me know.

Share on

Twitter Facebook LinkedIn

Luis D. Verde Arregoitia

Tidyeval meets PDF table hell

PDF table hell

Share on

You may also enjoy

Cell and text formatting is everywhere

2024 LLMs/genAI + R roundup

Descargar, procesar, y acomodar imágenes en un mosaico hexagonal interactivo

Download, batch process, and tile images