Bold indicates negative
Late last year I added functionality to unheadr
for importing and working with formatted spreadsheet data (see this post).
Recently, I saw this tweet by Mara Averick mentioning a spreadsheet in which bold text = negative values. The unheadr::annotate_mf
function can translate cell formatting into text annotations within a data frame or tibble, but only for one target variable at a time.
Based on true events of this afternoon...
— Mara Averick (@dataandme) April 29, 2020
I know, I know there's @nacnudus' {unpivotr}, but seriously?! 🤬
I know, @ChelseaParlett, it's not exactly #statsTikTok, but the feels are real! pic.twitter.com/QbngGYLoPd
After also seeing some COVID-related spreadsheets where bold = negative, it seemed like a good idea to generalize the approach and have a single function that annotates meaningful formatting for all the cells in all the columns of a spreadsheet.
Introducing annotate_mf_all()
annotate_mf_all
is now part of unheadr
(dev version). Install from GitHub if you haven’t already.
Let’s try it out with the example data bundled with the package. It is an .xlsx file that looks like this.
boutiques.xlsx is toy dataset with Q1 profits for different store locations. Additional information is encoded as meaningful formatting. Bold indicates losses (negative values), colors indicate continent, and italic indicates a second location in the same city.
If we know what the different formatting options represent, and have them embedded as text within each cell (value) in an R object, we can parse the formatting and do some cleanup following a relatively simple workflow. A walkthrough:
Load the relevant packages and data
The new annotate_mf_all
function has one argument, a path to a single-sheet spreadsheet (for now), and its output is a tibble with the cell contents and their respective annotations.
Let’s see:
All these annotations can be matched with regular expressions to wrangle the data into shape.
Looking better:
Now we can use a conditional mutate
to sort out instances of bold=negative (note the new syntax), and after some minor cleanup we have a usable, rectangular dataset.
Much better, and now we can actually work with these data (e.g. get totals, means, top and bottom values, etc.)
Hopefully I can submit this patch to CRAN soon, but please try this out with your own spreadsheets and feel free to send me any feedback.