Reshaping and tidying data from awkward spreadsheet formats

la versión en español de esta publicación está aquí

The table below is a subset of data from a directory of dog rescue resources put together by Speaking of Dogs, a volunteer-based dog rescue organization in Toronto, Canada. The information is real, but for this example I garbled the original data into a particular ‘spreadsheet’ format that I struggled with recently. I chose this source of data in support of the Clear The H*ckin Shelters campaign happening this week (read more here and support dog shelters in general).

Organization Contact name phone website Organization Contact name phone website
“Small Breed”       “Bulldog (English)”      
Happy Tails Rescue Judy 905-357-5096 www.happytailsrescue.ca Homeward Bound Rescue* Kathy 905-987-1104 www.homewardboundrescue.ca
LOYAL Rescue Inc. Anne 888-739-1221 www.loyalrescue.com unknown Joan† 416-738-6059 unknown
Pomeranian and Small Breed Rescue Shelley 416-225-6808 www.psbrescue.com “Labrador Retriever “      
Tiny Paws Rescue Brenda 1-800-774-8315 www.tpdr.ca Labrador Retriever Adoption Service Laura or Karen 289-997-5227 www.lab-rescue.ca
“Senior Dogs”       Dog Rescuers Inc Joan 416-567-6249 ‡ www.thedogrescuersinc.ca
Speaking of Dogs Rescue Lorraine 705-444-7637 www.speakingofdogs.com        

Footnotes for the table:

  • * includes other Flat faced dogs: Bulldogs, Boxers, Bostons, Pugs etc
  • † limited foster care available
  • ‡ phone may not be up to date

This data is not ‘analysis-ready’. Notice the three main issues that need to get sorted:

  • The table has repeated columns. It appears that the table has been split in two (vertically) and the columns are stacked side-by-side in a sort of ‘wide’ format. We don’t really want duplicated variables because having duplicated column names is a very unnatural, complicated, and risky format for keeping data.

  • There are header rows sprinkled throughout the Organization column. These non-data rows are used quite often when we want to save space by having the value in one cell somehow apply to cells below (until we find the next header row used for grouping). These are easy for humans to parse, but not computers. Read more about header rows here.

  • Some ‘cells’ have special characters, these are used to refer to footnotes/information in the table caption, but in this case we would prefer to have this information inside the data rectangle.

rstats time

This post goes through a possible solution to reshape the table and deal with the header rows and footnotes. Make sure you have the necessary R packages installed, and once you do all the code in this block should be fully reproducible.

Start by putting the data into a character vector by simply pasting the table, delimited by tabs and line breaks.

# load packages
library(dplyr)
library(magrittr)
library(tidyr)
library(rlang)
library(purrr)

# chr vector with delimited data
resc <- 
c("Organization	Contact name	phone	website	Organization	Contact name	phone	website
'Small Breed'				'Bulldog (English)'			
Happy Tails Rescue	Judy	905-357-5096	www.happytailsrescue.ca	Homeward Bound Rescue*	Kathy	905-987-1104	www.homewardboundrescue.ca
LOYAL Rescue Inc.	Anne	888-739-1221	www.loyalrescue.com	unknown	Joan†	416-738-6059 	unknown
Pomeranian and Small Breed Rescue	Shelley	416-225-6808	www.psbrescue.com	'Labrador Retriever'			
Tiny Paws Rescue	Brenda	1-800-774-8315	www.tpdr.ca	Labrador Retriever Adoption Service	Laura or Karen 	289-997-5227	www.lab-rescue.ca
'Senior Dogs'				Dog Rescuers Inc	Joan	416-567-6249‡	www.thedogrescuersinc.ca
Speaking of Dogs Rescue	Lorraine	705-444-7637	www.speakingofdogs.com")				

Make each line a row in a tibble, then separate into the corresponding variables (yes, they are still duplicated).

# lines to rows
rescDF <- data_frame(unsep=unlist(strsplit(resc,"\n")))

# separate into variables (tab delimited)
rescDF %<>% separate(unsep,into=unlist(strsplit(rescDF$unsep[1],"\t")),sep ="\t")

Now, to stack the table into a long form. When I asked for advice on Twitter the consensus was to use the gather function in tidyr after sorting out the duplicated variable names (or by referring to columns by number). The sensible answer for this issue is to not have duplicated names in the first place, and there are various tools and functions for avoiding or fixing them.

However, the real world is a harsh place and duplicated variables are pretty common. I found this post on Stack Overflow for this exact problem, and SO user akrun had a pretty clever solution. The suggestion was to:

  • iterate through the unique names to extract all the observations for each variable name
  • unlist them
  • put them into a data frame (with variable names)

All I did was replace lapply with map and used a tibble instead of a data frame for the output because I’m in the process of learning purrr, and because tibbles never convert strings to factors or create row names.

# stack into long form
rescDFstacked <- 
map(unique(names(rescDF)), ~
      unlist(rescDF[names(rescDF)==.x], use.names = FALSE)) %>% 
  as.data.frame(stringsAsFactors = FALSE) %>% 
  set_names(unique(names(rescDF)))

The data is looking better but we still need to sort out the awkward header rows. Fortunately, there’s a function for that. Read about it here. In brief, my bumbling attempt at tidy evaluation received a makeover from Jenny Bryan and now we can define and use the untangle2 function. When that happened, it was like having Xzibit knocking at my door offering to enhance my car. Since then, the untangle2 function has been helping me shred through other people’s data because in my field everything follows a taxonomic hierarchy and everyone likes to use header rows. I feel that untangle belongs in tidyr, and maybe when I’m confident enough I’ll try to contribute to the tidyverse.

In this table, the header rows are quoted, making for smooth untangling.

# define untangle fn
untangle2 <- function(df, regex, orig, new) {
  orig <- enquo(orig)
  new <- sym(quo_name(enquo(new)))
  
  df %>%
    mutate(
      !!new := if_else(grepl(regex, !! orig), !! orig, NA_character_)
    ) %>%
    fill(!! new) %>%
    filter(!grepl(regex, !! orig))
}


# deal with header rows (anything quoted)
rescDFstacked %<>% untangle2("'",Organization,Category)

After that, there are some repeated, empty, and NA rows that need to be filtered out.

# remove repeated, NA, and empty rows
rescDFstacked %<>% filter(Organization != "Organization" & Organization != " ", !is.na(Organization))

The footnotes are the last major issue. To bring them into the data rectangle, I used case_when inside mutate to add the footnote text conditionally, but I’m not very happy with this approach. To figure out the columns to match with the different individual grepl statements I used map to iterate through the columns.

Ideally, I wanted to iterate though the special characters and the columns at the same time, because any given observation could have any combination of footnotes. I couldn’t figure out map2 and list columns :(

# bring footnotes into data rectangle
rescDFstacked %<>% mutate(observation = case_when(
  grepl("\\*",Organization)~"includes other Flat faced dogs: Bulldogs, Boxers, Bostons, Pugs etc",
  grepl("\u0086",`Contact name`)~"limited foster care available",
  grepl("\u0087",phone)~"phone may not be up to date"
  ))


# how I figured out which columns contained which special char
rescDFstacked %>% map(~grepl("\\*",.x)) %>% map(~.x[.x==TRUE]) %>% unlist() %>% names()
rescDFstacked %>% map(~grepl("\u0086",.x)) %>% map(~.x[.x==TRUE]) %>% unlist() %>% names()
rescDFstacked %>% map(~grepl("\u0087",.x)) %>% map(~.x[.x==TRUE]) %>% unlist() %>% names()

# DIDNT WORK
# map2(rescDFstacked,c("\\*","\u0086","\u0087"),~ grepl(.y,.x))

Because the footnotes were informative enough, we can wrap things up by removing all the special characters.

# remove special chars
rescDFstacked %<>% mutate_all(funs(gsub("[†|‡|'|\\*]","",.)))

The final table looks like this:

Organization Contact name phone website Category observation
Happy Tails Rescue Judy 905-357-5096 www.happytailsrescue.ca Small Breed NA
LOYAL Rescue Inc. Anne 888-739-1221 www.loyalrescue.com Small Breed NA
Pomeranian and Small Breed Rescue Shelley 416-225-6808 www.psbrescue.com Small Breed NA
Tiny Paws Rescue Brenda 1-800-774-8315 www.tpdr.ca Small Breed NA
Speaking of Dogs Rescue Lorraine 705-444-7637 www.speakingofdogs.com Senior Dogs NA
Homeward Bound Rescue Kathy 905-987-1104 www.homewardboundrescue.ca Bulldog (English) includes other Flat faced dogs: Bulldogs
unknown Joan 416-738-6059 unknown Bulldog (English) limited foster care available
Labrador Retriever Adoption Service Laura or Karen 289-997-5227 www.lab-rescue.ca Labrador Retriever NA
Dog Rescuers Inc Joan 416-567-6249 www.thedogrescuersinc.ca Labrador Retriever phone may not be up to date

That’s it. Let me know if anything isn’t working. For reference, the table that inspired this post was Table 2 in this 2013 paper by Alvarez et al.