From photo to tibble
When I first wrote about tidy evaluation and the untangle2 function, I used restaurant menus as an example of how embedded subheaders are used to create small multiples of data (by type of menu item).
After a recent update to the tesseract optical character recognition (ocr) engine, I decided to try and digitize and parse a restaruant menu from a photo. I wanted to try with a real photo from a real menu, and I found this photo in my camera roll.

The menu is from a nice taco place in Puerto Vallarta. I cropped the photo to the beer selection. There are craft beers and commercial beers as little subsets of the menu, each with their own heading.
Code-through
Let’s ocr the text and then restructure the data.
To prepare the image for ocr, I followed some existing imagemagick tutorials, and the implementation in magick lets us chain the different operations together, making for very readable code. The code below will read the photo striaght from the web, so anyone can follow along after installing all the required packages.
# load packages
library(magick)
library(dplyr)
library(tesseract)
library(stringr)
library(unheadr)
library(tidyr)
beerpath <- ("https://github.com/luisDVA/luisdva.github.io/raw/master/assets/images/beer.jpg")
# read and process
beer <- magick::image_read(beerpath) %>%
image_resize("25%") %>%
image_convolve("DoG:15,100,0") %>%
image_negate() %>%
image_normalize() %>%
image_blur(0,1) %>%
image_convert(type="grayscale") %>%
image_contrast() %>%
image_enhance()
The resulting image looks like this (even though we don’t need to write it to disk for the ocr process).

The image_ocr function runs the optical character recognition on the image and returns a character vector, which we can structure into a tibble.
# ocr
beer_prices <- image_ocr(beer,language = c("spa","eng"))
beer_prices <- beer_prices %>% strsplit("\n") %>% unlist() %>% data_frame(prices=.)
In tibble form:
# A tibble: 15 x 1
prices
<chr>
1 - CERVEZA ARTESANAL -
2 Colimita Loger $70
3 Piedra Lisa Session IPA $70
4 Los Cuentos Kumúkite Tropical IPA $70
5 Páramo APA $70
6 Los Chiltes Capomo Porter $70
7 Ticus Porter $70
8 * CERVEZA COMERCIAL -+
9 XX Loger especial $30
10 Bohemia Clara Pilsner $35
11 Bohemia Obscura Vienna $35
12 Heineken Lager $40
13 Tecate Pilsner $25
14 Tecate Light $25
15 Indio Lombic $25
There is some minor cleanup to be done. I like this real-world example because the original photo is not the best to begin, and the text itself is a mix of English, Spanish, and made up brand names. Even so, the ocr performed really well. Once the text is clean, we can put the prices into their own variable, and then use unheadr to turn the subheaders into a tidy grouping variable
# cleanup
beer_prices <-
beer_prices %>% mutate(prices=str_remove_all(prices,"[\\+\\-\\*]")) %>%
mutate_all(str_squish) %>% mutate(prices=str_replace(prices,"Loger","Lager")) %>%
mutate(prices=str_replace(prices,"Lombic","Lambic"))
# rectangling
beer_prices <-
beer_prices %>% separate(prices,into = c("beer","price"),sep = " \\$") %>%
mutate(price=readr::parse_number(price)) %>%
untangle2("CERVE",beer,type)
End result:
# A tibble: 13 x 3
beer price type
<chr> <dbl> <chr>
1 Colimita Lager 70 CERVEZA ARTESANAL
2 Piedra Lisa Session IPA 70 CERVEZA ARTESANAL
3 Los Cuentos Kumúkite Tropical IPA 70 CERVEZA ARTESANAL
4 Páramo APA 70 CERVEZA ARTESANAL
5 Los Chiltes Capomo Porter 70 CERVEZA ARTESANAL
6 Ticus Porter 70 CERVEZA ARTESANAL
7 XX Lager especial 30 CERVEZA COMERCIAL
8 Bohemia Clara Pilsner 35 CERVEZA COMERCIAL
9 Bohemia Obscura Vienna 35 CERVEZA COMERCIAL
10 Heineken Lager 40 CERVEZA COMERCIAL
11 Tecate Pilsner 25 CERVEZA COMERCIAL
12 Tecate Light 25 CERVEZA COMERCIAL
13 Indio Lambic 25 CERVEZA COMERCIAL
With the menu items in a tidy structure, we can now group by beer type and get summary data or find the most or least expensive options.
# average price by type
beer_prices %>% group_by(type) %>% summarise(avg_price=mean(price))
# most expensive commercial brew
beer_prices %>% filter(type=="CERVEZA COMERCIAL") %>% top_n(1,price)
# least expensive
beer_prices %>% filter(type=="CERVEZA COMERCIAL") %>% top_n(-1,price)
# prices in US Dollars
beer_prices %>% mutate(prices_usd = price*0.05)
Let’s have a look at the results:
Average prices
# A tibble: 2 x 2
type avg_price
<chr> <dbl>
1 CERVEZA ARTESANAL 70
2 CERVEZA COMERCIAL 30.7
Most expensive commercial beer
# A tibble: 1 x 3
beer price type
<chr> <dbl> <chr>
1 Heineken Lager 40 CERVEZA COMERCIAL
Least expensive (three-way tie)
# A tibble: 3 x 3
beer price type
<chr> <dbl> <chr>
1 Tecate Pilsner 25 CERVEZA COMERCIAL
2 Tecate Light 25 CERVEZA COMERCIAL
3 Indio Lambic 25 CERVEZA COMERCIAL
Convert prices to USD
# A tibble: 13 x 4
beer price type prices_usd
<chr> <dbl> <chr> <dbl>
1 Colimita Lager 70 CERVEZA ARTE… 3.5
2 Piedra Lisa Session IPA 70 CERVEZA ARTE… 3.5
3 Los Cuentos Kumúkite Trop… 70 CERVEZA ARTE… 3.5
4 Páramo APA 70 CERVEZA ARTE… 3.5
5 Los Chiltes Capomo Porter 70 CERVEZA ARTE… 3.5
6 Ticus Porter 70 CERVEZA ARTE… 3.5
7 XX Lager especial 30 CERVEZA COME… 1.5
8 Bohemia Clara Pilsner 35 CERVEZA COME… 1.75
9 Bohemia Obscura Vienna 35 CERVEZA COME… 1.75
10 Heineken Lager 40 CERVEZA COME… 2
11 Tecate Pilsner 25 CERVEZA COME… 1.25
12 Tecate Light 25 CERVEZA COME… 1.25
13 Indio Lambic 25 CERVEZA COME… 1.25
The example is a little silly because all the craft beers have the same price, but I always wanted to apply the untangle2 function to a real restaurant menu, especially a non-digitized one from an actual photo.
Cheers!