Reshaping and tidying data from awkward spreadsheet formats
la versión en español de esta publicación está aquí
The table below is a subset of data from a directory of dog rescue resources put together by Speaking of Dogs, a volunteer-based dog rescue organization in Toronto, Canada. The information is real, but for this example I garbled the original data into a particular ‘spreadsheet’ format that I struggled with recently. I chose this source of data in support of the Clear The H*ckin Shelters campaign happening this week (read more here and support dog shelters in general).
Organization | Contact name | phone | website | Organization | Contact name | phone | website |
---|---|---|---|---|---|---|---|
“Small Breed” | “Bulldog (English)” | ||||||
Happy Tails Rescue | Judy | 905-357-5096 | www.happytailsrescue.ca | Homeward Bound Rescue* | Kathy | 905-987-1104 | www.homewardboundrescue.ca |
LOYAL Rescue Inc. | Anne | 888-739-1221 | www.loyalrescue.com | unknown | Joan† | 416-738-6059 | unknown |
Pomeranian and Small Breed Rescue | Shelley | 416-225-6808 | www.psbrescue.com | “Labrador Retriever “ | |||
Tiny Paws Rescue | Brenda | 1-800-774-8315 | www.tpdr.ca | Labrador Retriever Adoption Service | Laura or Karen | 289-997-5227 | www.lab-rescue.ca |
“Senior Dogs” | Dog Rescuers Inc | Joan | 416-567-6249 ‡ | www.thedogrescuersinc.ca | |||
Speaking of Dogs Rescue | Lorraine | 705-444-7637 | www.speakingofdogs.com |
Footnotes for the table:
- * includes other Flat faced dogs: Bulldogs, Boxers, Bostons, Pugs etc
- † limited foster care available
- ‡ phone may not be up to date
This data is not ‘analysis-ready’. Notice the three main issues that need to get sorted:
-
The table has repeated columns. It appears that the table has been split in two (vertically) and the columns are stacked side-by-side in a sort of ‘wide’ format. We don’t really want duplicated variables because having duplicated column names is a very unnatural, complicated, and risky format for keeping data.
-
There are header rows sprinkled throughout the Organization column. These non-data rows are used quite often when we want to save space by having the value in one cell somehow apply to cells below (until we find the next header row used for grouping). These are easy for humans to parse, but not computers. Read more about header rows here.
-
Some ‘cells’ have special characters, these are used to refer to footnotes/information in the table caption, but in this case we would prefer to have this information inside the data rectangle.
rstats time
This post goes through a possible solution to reshape the table and deal with the header rows and footnotes. Make sure you have the necessary R packages installed, and once you do all the code in this block should be fully reproducible.
Start by putting the data into a character vector by simply pasting the table, delimited by tabs and line breaks.
Make each line a row in a tibble, then separate into the corresponding variables (yes, they are still duplicated).
Now, to stack the table into a long form. When I asked for advice on Twitter the consensus was to use the gather function in tidyr after sorting out the duplicated variable names (or by referring to columns by number). The sensible answer for this issue is to not have duplicated names in the first place, and there are various tools and functions for avoiding or fixing them.
#rstats crew: how can I use #purrr to stack a 'wide' df with duplicated variable names?
— Luis D. Verde (@LuisDVerde) August 10, 2017
(I know I shouldn't have them in the first place) pic.twitter.com/yxJoHMQ6N3
However, the real world is a harsh place and duplicated variables are pretty common. I found this post on Stack Overflow for this exact problem, and SO user akrun had a pretty clever solution. The suggestion was to:
- iterate through the unique names to extract all the observations for each variable name
- unlist them
- put them into a data frame (with variable names)
All I did was replace lapply with map and used a tibble instead of a data frame for the output because I’m in the process of learning purrr, and because tibbles never convert strings to factors or create row names.
The data is looking better but we still need to sort out the awkward header rows. Fortunately, there’s a function for that. Read about it here. In brief, my bumbling attempt at tidy evaluation received a makeover from Jenny Bryan and now we can define and use the untangle2 function. When that happened, it was like having Xzibit knocking at my door offering to enhance my car. Since then, the untangle2 function has been helping me shred through other people’s data because in my field everything follows a taxonomic hierarchy and everyone likes to use header rows. I feel that untangle belongs in tidyr, and maybe when I’m confident enough I’ll try to contribute to the tidyverse.
In this table, the header rows are quoted, making for smooth untangling.
After that, there are some repeated, empty, and NA rows that need to be filtered out.
The footnotes are the last major issue. To bring them into the data rectangle, I used case_when inside mutate to add the footnote text conditionally, but I’m not very happy with this approach. To figure out the columns to match with the different individual grepl statements I used map to iterate through the columns.
Ideally, I wanted to iterate though the special characters and the columns at the same time, because any given observation could have any combination of footnotes. I couldn’t figure out map2 and list columns :(
Because the footnotes were informative enough, we can wrap things up by removing all the special characters.
The final table looks like this:
Organization | Contact name | phone | website | Category | observation |
---|---|---|---|---|---|
Happy Tails Rescue | Judy | 905-357-5096 | www.happytailsrescue.ca | Small Breed | NA |
LOYAL Rescue Inc. | Anne | 888-739-1221 | www.loyalrescue.com | Small Breed | NA |
Pomeranian and Small Breed Rescue | Shelley | 416-225-6808 | www.psbrescue.com | Small Breed | NA |
Tiny Paws Rescue | Brenda | 1-800-774-8315 | www.tpdr.ca | Small Breed | NA |
Speaking of Dogs Rescue | Lorraine | 705-444-7637 | www.speakingofdogs.com | Senior Dogs | NA |
Homeward Bound Rescue | Kathy | 905-987-1104 | www.homewardboundrescue.ca | Bulldog (English) | includes other Flat faced dogs: Bulldogs |
unknown | Joan | 416-738-6059 | unknown | Bulldog (English) | limited foster care available |
Labrador Retriever Adoption Service | Laura or Karen | 289-997-5227 | www.lab-rescue.ca | Labrador Retriever | NA |
Dog Rescuers Inc | Joan | 416-567-6249 | www.thedogrescuersinc.ca | Labrador Retriever | phone may not be up to date |
That’s it. Let me know if anything isn’t working. For reference, the table that inspired this post was Table 2 in this 2013 paper by Alvarez et al.