unheadr
0.2.1 is now on CRAN, so here is an updated example for using the two main functions in the package.
The screenshot below comes from a spreadsheet with data from a regional event for the American Kennel Club’s Coursing Ability Test (CAT), which consists of timed 100-yard dashes for dogs. This is a subsample of an original xlsx file that I found online at some point and randomized.
The breeds are embedded in the same column (variable) that holds the names of the individual dogs. Fortunately, the rows with the breeds have cell highlighting. Otherwise there would be no easy way to identify the breed names without prior knowledge (e.g. is Vizsla a name or a breed?). These ‘embedded subheaders’ approach is not quite tidy, and it would not be easy to, for example, calculate the average time by breed.
To work with these data in R, we can use the readxl package to import the xlsx file.
Let’s have a look at the ouput from readxl::read_excel().
With unheadr, we can turn the spreadsheet formatting (meaningful formatting) into character strings and then work with that.
Applying the annotate_mf() function to annotate the Dogs variable, which adds a new variable that tells us about the type of formatting in each cell. The cell highlighting and HTML color code are appended onto the variable of interest.
We can then match the annotation strings with regex, and put the breed names into their own variable, leading to a tidy rectangular table (after some minor cleanup).
Tidiness
On the event website, I later learned that italic text in the Time variable indicates that a dog was disqualified (for whatever reason). We can repeat the previous steps with the times variable, and join the resulting tables.
Pretty cool!
Writing this post made me realize that I should probably rewrite annotate_mf() so that we can specify if we need to skip rows when importing spreadsheets, and also think about a version of the function that can be applied across multiple variables simultaneously. If anyone wants to contribute please reach out.