Tables created in word processing programs often use merged cells to group values. These cells can cause trouble when we want to work with the data in a programming context, regardless of how the tables were shared (.doc, .docx, PDF).
In R, we can get the data from Word files and PDFs into data frames thanks to packages such as docxtractr, pdftools, tabulizer, and others (or even AI tools, apparently), often preserving the original lines and structure, which we can then wrangle into something usable.
This post is a brief walkthrough for cleaning this type of table, based on my experience with real documents, and it showcases a new helper function added to the unheadr package in version 0.4.0 (now on CRAN).
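Since unheadr 0.4.0 is on CRAN, it can be installed the usual way if you want to follow along:

```r
# install the CRAN release of unheadr (0.4.0 or later)
install.packages("unheadr")
```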
This example is based on publicly available data with information on income, property and property obligations for government officials in the Chukotka Autonomous Okrug, which is the easternmost federal subject of Russia.
Various examples can be downloaded here:
The files look like this; some are available as Word documents and some as PDFs.
For this example, a translated and simplified version looks like this:
To replicate this example, the docx file is here, and the code below uses docxtractr to pull the contents of the table into R. To skip the download, we could also recreate the output from the docx_extract_tbl() stage with this code:
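For readers without the file at hand, here is a rough stand-in for that step. The values are made-up placeholders that only mimic the shape of the extracted table (newline-separated values left behind by the merged cells), and the file name in the commented-out download route is an assumption, not the real file name:

```r
library(docxtractr)
library(tibble)

# Route 1: read the downloaded file and extract the table
# (the file name is a placeholder for wherever the docx was saved)
# doc <- read_docx("chukotka_declarations.docx")
# income_tbl <- docx_extract_tbl(doc, tbl_number = 1)

# Route 2: recreate a table with the same general shape by hand, including
# doubled and trailing line break sequences (\n) from the merged cells
income_tbl <- tibble(
  official      = c("Official A", "Official B"),
  property_type = c("apartment\nland plot\ngarage\n", "apartment\ngarage\n"),
  area_sq_m     = c("57.2\n\n600\n24\n", "44.1\n18\n")
)

income_tbl
```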
The output is ungainly, and everything that appears on separate lines within the borders of the ‘merged’ cells ends up together, separated only by line break sequences (\n).
As a first step, we can use the squish_newlines() function from unheadr to deduplicate and remove trailing line breaks. Because the rows in the initial table correspond to two different government officials, the operation is grouped using the .by argument.
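With the stand-in table from above, that step looks roughly like this. The exact interface is best checked in ?squish_newlines; the sketch below assumes the function can be applied across character columns inside mutate(), with dplyr's per-operation .by grouping:

```r
library(dplyr)
library(unheadr)

# assumed usage: squish the affected columns, grouping the operation by
# official as described above (column names follow the stand-in table)
squished_tbl <- income_tbl |>
  mutate(across(c(property_type, area_sq_m), squish_newlines),
         .by = official)

squished_tbl
```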
Now we can separate the cell values into their own rows using separate_rows() from tidyr.
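Continuing with the stand-in table, the values in both columns can be split on the remaining line breaks:

```r
library(tidyr)

# one row per newline-separated value; listing both columns keeps each
# property type aligned with its area
tidy_tbl <- squished_tbl |>
  separate_rows(property_type, area_sq_m, sep = "\n")

tidy_tbl
```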
Much nicer, and ready for any number of operations. I hope this is helpful, and as usual, let me know if you run into any issues.