This recent post by Jeremy Selva shows a nice workflow for working with problematic formatted spreadsheets in R.
I’ve worked on this topic before, so I when I saw the post I had to meddle. I suggested using functions from the unheadr package to address some of these issues. They didn’t work because of assumptions I hard-coded into the package, but these are fixed now. These fixes are part of unheadr v0.4.0. To demonstrate the functions and also to celebrate 20k downloads (🥳), here is my take (with the benefit of hindsight) on tackling this troublesome spreadsheet. ALL CREDIT TO JEREMY FOR COMING UP WITH THE EXAMPLE AND WORKFLOW.
The spreadsheet/workbook in question can be downloaded from Jeremy’s GitHub here.
For context, I’m reproducing the images from the original post about the issues with the data, which are good examples of:
😢 Things we shouldn’t do but which happen anyway in spreadsheets:
1. A date variable with different cell formats, plus whatever strange thing Excel does to dates.
2. Meaningful formatting
Using text color (and nothing else) to indicate units.
3. Numeric variables with some cells formatted as text.
These issues vary in how hard they are to address in downstream analyses, but as they accumulate we are more tempted to give up and work with the file directly in Excel (or Calc).
Here’s my take, which can be a complement to the stricter approach in Jeremy’s post.
First, read the spreasheet with readxl. Out of caution, we can read everything as text initially.
Next, type_convert() from readr does a good job of parsing variables in a data frame, and has good heuristics to interpret numeric variables even when these had problematic combinations of cell formatting in the spreadsheet.
To fix the awkward date, lubridate can handle the dates that were imported more or less properly, and the excel_numeric_to_date() function from janitor was purpose-built to transform the dates stored as weird numbers. Afterwards we put the two together.
Lastly, there’s the issue of units of measurment for a variable encoded as text formatting. In this case the text color indicates the units. This information is lost when reading only the cell values with readxl, but we can embed the color code from each cell as a text annotation using the annotate_mf functions from unheadr.
If we run annotate_mf_all() with the path to the spreadsheet, we’ll get an error message. Apparently there are formatted cells outside of the data rectangle, which is what readxl focuses on during the data import.
We can use tidyxl to unravel the problem, looking at the tail end of the output, we see some blank-formatted cells in rows 1054 and 1055, even though our samp data frame only has 1053 rows. These formatted cells create “ghost” rows that trip up the functions from unheadr.
After opening the file directly in a spreadsheet program and deleting the problem rows altogether (I saved this as a new file called sample_excel_cln.xlsx), the annotate_mf_all() function is able to translate the colorful text into a text annotation of the hex8 code for each color.
Let’s have a look
If we drop the FF from the color codes (it’s shorthand for 100% opacity), recent versions of R Studio will preview a color right in the editor, so now we know which values are green and which are black.
“FF00B050” is green and “FF000000” is black
Let’s separate color code and weight value into their own columns:
Now we can conditionally convert pounds to kilograms, then remove the color annotation from the id variable, which we’ll use to merge this object with the dataframe we were working on initially.
After joining the two objects, some minor cleanup of the names leaves us with nice and usable data.
If it’s not too late, following good practices for Data organization in spreadsheets will avoid a lot of pain. Otherwise, tools like the ones shown here can be useful.
All feedback welcome and again thanks to Jeremy Selva for the idea and GitHub issue that sparked all of this.