While trying to understand why my friends beat me in Age of Empires II so often, I came across this site with stats for each unit. It’s a great reference, but I wanted to explore the data further. This post describes the various steps for reading and cleaning the data, followed by some visualization of the main outputs from a dimension reduction via Principal Component Analysis (PCA).

The code in this post tackles many small issues that come up when getting a dataset ready for analysis and/or visualization. These include:

  • Separating compound values
  • Placing embedded subheaders into their own variables
  • Extracting values from compound variables using regular expressions
  • Finding minimum and maximum values
  • Rearranging factor levels within groups

This brief exploration gave me chance to try out the new syntax in dplyr 1.0.0 and to create some explanatory figures that can help us interpret the typical scatterplots often made for the first two dimensions in a PCA. The plotting code here includes examples of:

  • Leveraging gghighlight to show the relative position of different groups of observations in principal component space
  • A somewhat hacky replacement of bars with crossbars for showing the contribution and correlations of the different variables to the PCA
  • Text labels for points of the scatterplot with top and bottom values
  • Adding text and arrow annotations to the plot to make the interpretation more explicit

The data on the site looks like this:

click to enlarge

It’s not bad, but the different unit types appear as embedded subheaders, and a few variables contain compound values.

Let’s clean the data. First, we need the following packages:

library(tidyr)       # CRAN v1.1.0
library(unheadr)     # CRAN v0.2.1
library(dplyr)       # [github::tidyverse/dplyr] v1.0.0.9000
library(stringr)     # CRAN v1.4.0
library(ggplot2)     # CRAN v3.3.1
library(scico)       # CRAN v1.1.0
library(ggrepel)     # CRAN v0.8.2
library(gghighlight) # CRAN v0.3.0
library(FactoMineR)  # CRAN v2.3
library(tidytext)    # CRAN v0.2.4 

The data came from the aoe2stats website, simply copied and pasted with datapasta

Click here to see code block with raw data
# paste and read data
aocdedat <-
  data.frame(
    stringsAsFactors = FALSE,
    check.names = FALSE,
    `T	V	Name	Age	Cost	BT	RT	AD	MR	LOS	HP	RA	AT	AR	Act` = c(
      "Archery Range",
      "Crossbowman\tCastle\t25W 45G\t0:27\t2\t0.35\t0.96\t7\t35\t5\t5\t0/0",
      "Arbalest\tImperial\t25W 45G\t0:27\t2\t0.35\t0.96\t7\t40\t5\t6\t0/0",
      "Hand Cannoneer\tImperial\t45F 50G\t0:34\t3.45\t0.35\t0.96\t9\t35\t7\t17\t1/0",
      "Genitour\tCastle\t35W 50F\t0:25\t3\t0.5\t1.35\t5\t50\t1-4\t3\t0/3",
      "Elite Genitour\tImperial\t35W 50F\t0:23\t3\t0.5\t1.35\t6\t55\t1-4\t4\t0/4",
      "Skirmisher\tFeudal\t25F 35W\t0:22\t3\t0.5\t0.96\t6\t30\t1-4\t2\t0/3",
      "Elite Skirmisher\tCastle\t25F 35W\t0:22\t3\t0.5\t0.96\t7\t35\t1-5\t3\t0/4",
      "Imperial Skirmisher\tImperial\t35W 25F\t0:22\t3\t0.5\t0.96\t7\t35\t1-5\t4\t0/4",
      "Archer\tFeudal\t25W 45G\t0:35\t2\t0.525\t0.96\t6\t30\t4\t4\t0/0",
      "Slinger\tCastle\t30F 40G\t0:25\t2\t0.775\t0.96\t7\t40\t1-5\t5\t0/0",
      "Cavalry Archer\tCastle\t40W 60G\t0:34\t2\t1\t1.4\t5\t50\t4\t6\t0/0",
      "Heavy Cavalry Archer\tImperial\t40W 60G\t0:27\t2\t1\t1.4\t6\t60\t4\t7\t1/0",
      "Barracks",
      "Condottiero\tImperial\t50F 35G\t0:18\t1.9\t0\t1.2\t6\t80\t-\t9\t1/0",
      "Eagle Scout\tDark\t-\t-\t2\t0\t1.1\t6\t50\t-\t4\t0/2",
      "Eagle Scout\tFeudal\t20F 50G\t1:00\t2\t0\t1.1\t6\t50\t-\t4\t0/2",
      "Eagle Scout\tCastle\t20F 50G\t0:35\t2\t0\t1.1\t8\t50\t-\t7\t0/2",
      "Eagle Warrior\tCastle\t20F 50G\t0:35\t2\t0\t1.15\t6\t55\t-\t7\t0/3",
      "Elite Eagle Warrior\tImperial\t20F 50G\t0:20\t2\t0\t1.3\t6\t60\t-\t9\t0/4",
      "Spearman\tFeudal\t35F 25W\t0:22\t3\t0\t1\t4\t45\t-\t3\t0/0",
      "Pikeman\tCastle\t35F 25W\t0:22\t3\t0\t1\t4\t55\t-\t4\t0/0",
      "Halberdier\tImperial\t35F 25W\t0:22\t3\t0\t1\t4\t60\t-\t6\t0/0",
      "Militia\tDark\t60F 20G\t0:21\t2\t0\t0.9\t4\t40\t-\t4\t0/1",
      "Man-at-arms\tFeudal\t60F 20G\t0:21\t2\t0\t0.9\t4\t45\t-\t6\t0/1",
      "Long Swordsman\tCastle\t60F 20G\t0:21\t2\t0\t0.9\t4\t60\t-\t9\t0/1",
      "Two-handed Swordsman\tImperial\t60F 20G\t0:21\t2\t0\t0.9\t5\t60\t-\t12\t0/1",
      "Champion\tImperial\t60F 20G\t0:21\t2\t0\t0.9\t5\t70\t-\t13\t1/1",
      "Castle",
      "King\tDark\t50F\t0:30\t-\t-\t1.32\t6\t75\t-\t-\t0/0",
      "Petard\tCastle\t65F 20G\t0:25\t-\t0\t0.8\t4\t50\t-\t25\t0/2",
      "Trebuchet (packed)\tImperial\t200W 200G\t0:50\t10 (-)\t0.6 (-)\t- (0.8)\t18\t150\t4-16 (-)\t200 (-)\t1/150 (2/8)",
      "Dock",
      "Cannon Galleon\tImperial\t200W 150G\t0:46\t10\t0\t1.1\t15\t120\t3-13\t35\t0/6",
      "Elite Cannon Galleon\tImperial\t200W 150G\t0:46\t10\t0\t1.1\t17\t150\t3-15\t45\t0/8",
      "Caravel\tCastle\t90W 40G\t0:36\t3\t0\t1.43\t9\t143\t6\t6\t0/8",
      "Elite Caravel\tImperial\t90W 40G\t0:36\t3\t0\t1.43\t9\t165\t7\t8\t0/8",
      "Demolition Raft\tFeudal\t70W 50G\t0:45\t-\t0\t1.5\t6\t45\t-\t90\t0/2",
      "Demolition Ship\tCastle\t70W 50G\t0:31\t-\t0\t1.6\t6\t60\t-\t110\t0/3",
      "Heavy Demolition Ship\tImperial\t70W 50G\t0:31\t-\t0\t1.6\t6\t70\t-\t140\t0/3",
      "Fire Galley\tFeudal\t75W 45G\t1:00\t0.25\t0\t1.3\t5\t100\t2.49\t1\t0/4",
      "Fire Ship\tCastle\t75W 45G\t0:36\t0.25\t0\t1.35\t5\t120\t2.49\t2\t0/6",
      "Fast Fire Ship\tImperial\t75W 45G\t0:36\t0.25\t0\t1.43\t6\t140\t2.49\t3\t0/8",
      "Galley\tFeudal\t90W 30G\t1:00\t3\t0\t1.43\t7\t120\t5\t6\t0/6",
      "War Galley\tCastle\t90W 30G\t0:36\t3\t0\t1.43\t8\t135\t6\t7\t0/6",
      "Galleon\tImperial\t90W 30G\t0:36\t3\t0\t1.43\t9\t165\t7\t8\t0/8",
      "Longboat\tCastle\t85W 43G\t0:25\t3\t0\t1.54\t8\t130\t6\t7\t0/6",
      "Elite Longboat\tImperial\t80W 40G\t0:25\t3\t0\t1.54\t9\t160\t7\t8\t0/8",
      "Fishing Ship\tDark\t75W\t0:40\t-\t0\t1.26\t5\t60\t-\t-\t0/4",
      "Trade Cog\tFeudal\t100W 50G\t0:36\t-\t0\t1.32\t6\t80\t-\t-\t0/6",
      "Transport Ship\tDark\t125W\t0:45\t-\t0\t1.45\t5\t100\t-\t-\t4/8",
      "Turtle Ship\tCastle\t180W 180G\t0:50\t6\t0\t0.9\t8\t200\t6\t50\t6/5",
      "Elite Turtle Ship\tImperial\t180W 180G\t0:50\t6\t0\t0.9\t8\t300\t6\t50\t8/6", "Gaia",
      "Market",
      "Trade Cart\tFeudal\t100W 50G\t0:50\t-\t0\t1\t7\t70\t-\t-\t0/0",
      "Monastery",
      "Missionary\tCastle\t100G\t0:51\t-\t0\t1.1\t9\t30\t7\t0\t0/0",
      "Monk\tCastle\t100G\t0:51\t5.5\t0\t0.7\t11\t30\t9\t0\t0/0",
      "Siege Workshop",
      "Siege Tower\tCastle\t200W 160G\t0:36\t-\t-\t0.8 (+0.05/unit)\t8\t220\t-\t-\t-2/100",
      "Mangonel\tCastle\t160W 135G\t0:46\t6\t0\t0.6\t9\t50\t3-7\t40\t0/6",
      "Onager\tImperial\t160W 135G\t0:46\t6\t0\t0.6\t10\t60\t3-8\t50\t0/7",
      "Siege Onager\tImperial\t160W 135G\t0:46\t6\t0\t0.6\t10\t70\t3-8\t75\t0/8",
      "Battering Ram\tCastle\t160W 75G\t0:36\t5\t0\t0.5 (+0.05/unit)\t3\t175\t0\t2\t0/180",
      "Capped Ram\tImperial\t160W 75G\t0:36\t5\t0\t0.5 (+0.05/unit)\t3\t200\t0\t3\t0/190",
      "Siege Ram\tImperial\t160W 75G\t0:36\t5\t0\t0.6 (+0.05/unit)\t3\t270\t0\t4\t0/195",
      "Bombard Cannon\tImperial\t225W 225G\t0:56\t6.5\t0.21\t0.7\t14\t80\t5-12\t40\t2/5",
      "Scorpion\tCastle\t75W 75G\t0:30\t3.6\t0.21\t0.65\t9\t40\t2-7\t12\t0/7",
      "Heavy Scorpion\tImperial\t75W 75G\t0:30\t3.6\t0.21\t0.65\t9\t50\t2-7\t16\t0/7", "Stable",
      "Battle Elephant\tCastle\t120F 70G\t0:28\t2\t0\t0.85\t4\t250\t-\t12\t1/2",
      "Elite Battle Elephant\tCastle\t120F 70G\t0:28\t2\t0\t0.85\t5\t300\t-\t16\t1/3",
      "Camel\tCastle\t55F 60G\t0:22\t2\t0\t1.45\t4\t100\t-\t6\t0/0",
      "Heavy Camel\tImperial\t55F 60G\t0:22\t2\t0\t1.45\t5\t120\t-\t7\t0/0",
      "Imperial Camel\tImperial\t55F 60G\t0:20\t2\t0\t1.45\t5\t140\t-\t9\t0/0",
      "Knight\tCastle\t60F 75G\t0:30\t1.8\t0\t1.35\t4\t100\t-\t10\t2/2",
      "Cavalier\tImperial\t60F 75G\t0:30\t1.8\t0\t1.35\t4\t120\t-\t12\t2/2",
      "Paladin\tImperial\t60F 75G\t0:30\t1.9\t0\t1.35\t5\t160\t-\t14\t2/3",
      "Scout Cavalry\tDark\t-\t-\t2\t0\t1.2\t4\t45\t-\t3\t0/2",
      "Scout Cavalry\tFeudal\t80F\t0:30\t2.04\t0\t1.55\t6\t45\t-\t5\t0/2",
      "Light Cavalry\tCastle\t80F\t0:30\t2\t0\t1.5\t8\t60\t-\t7\t0/2",
      "Hussar\tImperial\t80F\t0:30\t1.9\t0\t1.5\t10\t75\t-\t7\t0/2",
      "Tarkan\tCastle\t60F 60G\t0:21.7\t2.1\t0\t1.35\t5\t100\t-\t8\t1/3",
      "Elite Tarkan\tCastle\t60F 60G\t0:20\t2.1\t0\t1.35\t7\t150\t-\t11\t1/4",
      "Town Center",
      "Villager\tDark\t50F\t0:25\t2\t0\t0.8\t4\t25\t-\t3\t0/0",
      "Unique",
      "Berserk\tCastle\t65F 25G\t0:14\t2\t0\t1.05\t3\t61\t-\t9\t0/1",
      "Elite Berserk\tImperial\t65F 25G\t0:14\t2\t0\t1.05\t5\t75\t-\t14\t2/1",
      "Boyar\tCastle\t50F 80G\t0:23\t1.9\t0\t1.4\t5\t100\t-\t12\t4/1",
      "Elite Boyar\tImperial\t50F 80G\t0:20\t1.9\t0\t1.4\t5\t130\t-\t14\t6/2",
      "Cataphract\tCastle\t70F 75G\t0:20\t1.8\t0\t1.35\t4\t110\t-\t9\t2/1",
      "Elite Cataphract\tImperial\t70F 75G\t0:20\t1.7\t0\t1.35\t5\t150\t-\t12\t2/1",
      "Huskarl\tCastle\t52F 26G\t0:16\t2\t0\t1.05\t3\t60\t-\t10\t0/6",
      "Elite Huskarl\tImperial\t52F 26G\t0:16\t2\t0\t1.05\t5\t70\t-\t12\t0/8",
      "Jaguar Warrior\tCastle\t60F 30G\t0:17\t2\t0\t1\t3\t50\t-\t10\t1/1",
      "Elite Jaguar Warrior\tImperial\t60F 30G\t0:17\t2\t0\t1\t5\t75\t-\t12\t2/1",
      "Elite Janissary\tImperial\t60F 55G\t0:17\t3.45\t0\t0.96\t10\t50\t8\t22\t2/0",
      "Kamayuk\tCastle\t60F 30G\t0:10\t2\t0\t1.0\t4\t60\t1\t7\t0/0",
      "Elite Kamayuk\tImperial\t60F 30G\t0:10\t2\t0\t1.0\t5\t80\t1\t8\t1/0",
      "Karambit Warrior\tCastle\t30F 15G\t0:06\t2\t0\t1.2\t3\t30\t-\t6\t0/1",
      "Elite Karambit Warrior\tImperial\t30F 15G\t0:06\t2\t0\t1.2\t3\t40\t-\t7\t1/1",
      "Magyar Huszar\tCastle\t80F 10G\t0:16\t1.8\t0\t1.5\t5\t70\t-\t9\t0/2",
      "Elite Magyar Huszar\tImperial\t80F 10G\t0:16\t1.8\t0\t1.5\t6\t85\t-\t10\t0/2",
      "Samurai\tCastle\t60F 30G\t0:09\t1.425\t0\t1\t4\t60\t-\t8\t1/1",
      "Elite Samurai\tImperial\t60F 30G\t0:09\t1.425\t0\t1\t5\t80\t-\t12\t1/1",
      "Shotel Warrior\tCastle\t50F 35G\t0:08\t2\t0\t1.2\t3\t40\t-\t16\t0/0",
      "Elite Shotel Warrior\tImperial\t50F 35G\t0:08\t2\t0\t1.2\t3\t50\t-\t18\t0/1",
      "Tarkan\tCastle\t60F 60G\t0:14\t2.1\t0\t1.35\t5\t100\t-\t8\t1/3",
      "Elite Tarkan\tCastle\t60F 60G\t0:14\t2.1\t0\t1.35\t7\t150\t-\t11\t1/4",
      "Teutonic Knight\tCastle\t85F 40G\t0:12\t2\t0\t0.7\t3\t80\t-\t12\t5/2",
      "Elite Teutonic Knight\tImperial\t85F 40G\t0:12\t2\t0\t0.7\t5\t100\t-\t17\t10/2",
      "War Elephant\tCastle\t200F 75G\t0:31\t2\t0\t0.6\t4\t450\t-\t15\t1/2",
      "Elite War Elephant\tImperial\t200F 75G\t0:31\t2\t0\t0.6\t5\t600\t-\t20\t1/3",
      "Woad Raider\tCastle\t65F 25G\t0:10\t2\t0\t1.38\t3\t65\t-\t8\t0/1",
      "Elite Woad Raider\tImperial\t65F 25G\t0:10\t2\t0\t1.38\t5\t80\t-\t13\t0/1",
      "Chu Ko Nu\tCastle\t40W 35G\t0:16\t3.6\t0.21\t0.96\t6\t45\t4\t8\t0/0",
      "Elite Chu Ko Nu\tImperial\t40W 35G\t0:13\t3.75\t0.21\t0.96\t6\t50\t4\t8\t0/0",
      "Ballista Elephant\tCastle\t100F 80G\t0:25\t2.5\t0.394\t0.8\t7\t250\t5\t8\t0/3",
      "Elite Ballista Elephant\tImperial\t100F 80G\t0:25\t2.5\t0.394\t0.8\t7\t290\t5\t9\t0/3",
      "Elephant Archer\tCastle\t100F 80G\t0:25\t2.5\t0.394\t0.8\t7\t280\t4\t6\t0/3",
      "Elite Elephant Archer\tImperial\t100F 80G\t0:25\t2.5\t0.394\t0.8\t7\t330\t4\t7\t0/3",
      "Conquistador\tCastle\t60F 70G\t0:24\t2.9\t0.4\t1.3\t8\t55\t6\t16\t2/2",
      "Elite Conquistador\tImperial\t60F 70G\t0:24\t2.9\t0.4\t1.3\t8\t70\t6\t18\t2/2",
      "Janissary\tCastle\t60F 55G\t0:17\t3.45\t0.4\t0.96\t10\t44\t8\t17\t1/0",
      "Genoese Crossbowman\tCastle\t45F 45G\t0:22\t3\t0.5\t0.96\t8\t45\t4\t6\t1/0",
      "Elite Genoese Crossbowman\tCastle\t45F 45G\t0:19\t2\t0.5\t0.96\t8\t50\t4\t6\t1/0",
      "Longbowman\tCastle\t35W 40G\t0:18\t2\t0.5\t0.96\t8\t35\t6\t6\t0/0",
      "Elite Longbowman\tImperial\t35W 40G\t0:18\t2\t0.5\t0.96\t10\t40\t8\t7\t0/1",
      "Elite Mameluke\tImperial\t55F 85G\t0:23\t2\t0.5\t1.4\t5\t80\t3\t10\t1/0",
      "Mangudai\tCastle\t55W 65G\t0:26\t1.68\t0.5\t1.45\t6\t60\t4\t6\t0/0",
      "Elite Mangudai\tImperial\t55W 65G\t0:26\t1.68\t0.5\t1.45\t6\t60\t4\t8\t1/0",
      "Plumed Archer\tCastle\t40W 40G\t0:16\t1.9\t0.5\t1.2\t6\t50\t4\t5\t0/1",
      "Elite Plumed Archer\tCastle\t35W 35G\t0:16\t1.9\t0.5\t1.2\t7\t65\t5\t5\t0/2",
      "Mameluke\tCastle\t55F 85G\t0:23\t2\t0.6\t1.4\t5\t65\t3\t8\t0/0",
      "Organ Gun\tCastle\t80W 60G\t0:21\t3.45\t0.6\t0.85\t9\t60\t1-7\t16\t2/4",
      "Elite Organ Gun\tImperial\t80W 60G\t0:21\t3.45\t0.6\t0.85\t9\t70\t1-7\t20\t2/6",
      "Camel Archer\tCastle\t50W 60G\t0:21\t2\t0.625\t1.4\t5\t55\t4\t7\t0/1",
      "Elite Camel Archer\tImperial\t50W 60G\t0:21\t2\t0.625\t1.4\t5\t60\t4\t8\t1/1",
      "Rattan Archer\tCastle\t50W 45G\t0:16\t2\t0.675\t1.1\t6\t40\t4\t6\t0/4",
      "Elite Rattan Archer\tImperial\t50W 45G\t0:16\t2\t0.675\t1.1\t6\t45\t5\t7\t0/6",
      "Arambai\tCastle\t80W 60G\t0:21\t2\t0.7\t1.30\t7\t60\t5\t18\t0/1",
      "Elite Arambai\tImperial\t80W 60G\t0:21\t2\t0.7\t1.30\t7\t65\t5\t20\t0/2",
      "Elite Throwing Axeman\tImperial\t55F 25G\t0:17\t2\t0.8\t1.0\t6\t70\t4\t8\t1/0",
      "War Wagon\tCastle\t110W 60G\t0:21\t2.5\t1\t1.2\t6\t150\t4\t9\t0/3",
      "Elite War Wagon\tImperial\t110W 60G\t0:21\t2.5\t1\t1.2\t8\t200\t5\t9\t0/4",
      "Throwing Axeman\tCastle\t55F 25G\t0:17\t2\t1.0\t1.0\t5\t60\t3\t7\t0/0",
      "Gbeto\tCastle\t50F 40G\t0:17\t2\t1.2\t1.25\t6\t30\t5\t10\t0/0",
      "Elite Gbeto\tImperial\t50F 40G\t0:17\t2\t1.2\t1.25\t6\t45\t6\t13\t0/0"
    )
  )


The web content is now a data frame that looks like this:

> head(aocdedat)
               T\tV\tName\tAge\tCost\tBT\tRT\tAD\tMR\tLOS\tHP\tRA\tAT\tAR\tAct
1                                                                Archery Range
2          Crossbowman\tCastle\t25W 45G\t0:27\t2\t0.35\t0.96\t7\t35\t5\t5\t0/0
3           Arbalest\tImperial\t25W 45G\t0:27\t2\t0.35\t0.96\t7\t40\t5\t6\t0/0
4 Hand Cannoneer\tImperial\t45F 50G\t0:34\t3.45\t0.35\t0.96\t9\t35\t7\t17\t1/0
5            Genitour\tCastle\t35W 50F\t0:25\t3\t0.5\t1.35\t5\t50\t1-4\t3\t0/3
6    Elite Genitour\tImperial\t35W 50F\t0:23\t3\t0.5\t1.35\t6\t55\t1-4\t4\t0/4

Each row of the sole column in this object is tab-delimited, so let’s separate the values:

# separate into columns
aocPI <-
  separate(aocdedat, `T	V	Name	Age	Cost	BT	RT	AD	MR	LOS	HP	RA	AT	AR	Act`,
    into = c("Name", "Age", "Cost", "BT", "RT", "AD", "MR", "LOS", "HP", "RA", "AT", "AR"), sep = "\\t")

…and have a look

> head(aocPI)
            Name      Age    Cost   BT   RT   AD   MR  LOS   HP   RA   AT   AR
1  Archery Range     <NA>    <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2    Crossbowman   Castle 25W 45G 0:27    2 0.35 0.96    7   35    5    5  0/0
3       Arbalest Imperial 25W 45G 0:27    2 0.35 0.96    7   40    5    6  0/0
4 Hand Cannoneer Imperial 45F 50G 0:34 3.45 0.35 0.96    9   35    7   17  1/0
5       Genitour   Castle 35W 50F 0:25    3  0.5 1.35    5   50  1-4    3  0/3
6 Elite Genitour Imperial 35W 50F 0:23    3  0.5 1.35    6   55  1-4    4  0/4         

Now we can start by cleaning up the embedded subheaders with unheadr. These are identifiable as the rows with NAs in all the variables except for the fist one, which I defined using the relatively new across approach.

# tag subheaders
subhtags <- aocPI %>%
  filter(across(Age:AR, is.na)) %>%
  select(Name) %>%
  mutate(nametag = paste0("category", Name))
# untangle embedded subheaders
aocPIut <- left_join(aocPI, subhtags) %>%
  untangle2("^category", nametag, unit_type) %>%
  select(-nametag) %>%
  mutate(unit_type = stringr::str_remove(unit_type, "^category"))

After this step, we no longer have rows a grouping variable somewhere but which are otherwise empty. Look at how Archery range now groups unit types now.

> head(aocPIut)
            Name      Age    Cost   BT   RT   AD   MR LOS HP  RA AT  AR     unit_type
1    Crossbowman   Castle 25W 45G 0:27    2 0.35 0.96   7 35   5  5 0/0 Archery Range
2       Arbalest Imperial 25W 45G 0:27    2 0.35 0.96   7 40   5  6 0/0 Archery Range
3 Hand Cannoneer Imperial 45F 50G 0:34 3.45 0.35 0.96   9 35   7 17 1/0 Archery Range
4       Genitour   Castle 35W 50F 0:25    3  0.5 1.35   5 50 1-4  3 0/3 Archery Range
5 Elite Genitour Imperial 35W 50F 0:23    3  0.5 1.35   6 55 1-4  4 0/4 Archery Range
6     Skirmisher   Feudal 25F 35W 0:22    3  0.5 0.96   6 30 1-4  2 0/3 Archery Range

Next is some minor cleanup. The Cost column has compound values, which need to be extracted with regex that matches numbers followed by units. A similar approach extracts ranges (minimum and maximum) from compound values that show them in brackets.

# Castle and Imperial Age units only
aocImp <- aocPIut %>% filter(Age == "Castle" | Age == "Imperial")
# clean cost columns
aocImp <- aocImp %>%
  mutate(
    cost_wood = stringr::str_extract(Cost, "(\\d+)(?=W)"),
    cost_food = stringr::str_extract(Cost, "(\\d+)(?=F)"),
    cost_gold = stringr::str_extract(Cost, "(\\d+)(?=G)")) %>%
  select(-Age, -Cost)
# better colnames
aocImp <- aocImp %>% rename(
  c(
    "attack_delay" = "AD",
    "build_time" = "BT",
    "reload_time" = "RT",
    "speed" = "MR",
    "line_of_sight" = "LOS",
    "hit_points" = "HP",
    "range" = "RA",
    "damage_attack" = "AT",
    "armor" = "AR"))

# things in brackets (additional info to drop)
aocImp <-
  aocImp %>% mutate(across(everything(), stringr::str_remove, "\\(.*\\)"))
# ranged units (get min and max)
aocImp <-
  aocImp %>%
  mutate(range = str_replace(range, "^-$", "0")) %>%
  mutate(range_max = str_extract(range, "(\\d)(?!-)")) %>%
  mutate(range_min = str_extract(range, "(\\d)(?=-)")) %>%
  select(-range)
# armor
aocImp <- aocImp %>% separate(armor, into = c("melee_armor", "pierce_armor"), sep = "\\/")
# cleanup
aocImp <- aocImp %>%
  select(unit_type, unit = Name, everything()) %>%
  mutate(across(everything(), str_trim, "both"))
# manually assign value for (packed) trebuchet
aocImp$speed[which(aocImp$unit == "Trebuchet")] <- 0.8
# NAs can be recoded as 0
aocImp <- aocImp %>%
  mutate(across(everything(), replace_na, 0)) %>%
  mutate(build_time = str_remove(build_time, "0:")) %>%
  mutate(across(build_time:range_min, as.numeric)) %>%
  mutate(across(everything(), replace_na, 99))
# normalized cost in gold
aocImp <- aocImp %>% mutate(
  costfg = round(cost_food * 1.2),
  costwg = round(cost_wood * 1.2),
  cost_norm = costfg + costwg + cost_gold)
aocImpw <- aocImp %>% select(-cost_food, -cost_wood, -cost_gold, -costfg, -costwg, -range_min)
# terrestrial units only
aocImpw %>% filter(unit_type != "Dock") -> aocImpwterr

I dropped aquatic units and only kept Post-Imperial units because that’s how I usually play online.

A glimpse of the analysis-ready data:

# A tibble: 121 x 13
   unit_type unit  build_time reload_time attack_delay speed line_of_sight hit_points
   <chr>     <chr>      <dbl>       <dbl>        <dbl> <dbl>         <dbl>      <dbl>
 1 Archery … Cros…         27        2           0.35   0.96             7         35
 2 Archery … Arba…         27        2           0.35   0.96             7         40
 3 Archery … Hand…         34        3.45        0.35   0.96             9         35
 4 Archery … Geni…         25        3           0.5    1.35             5         50
 5 Archery … Elit…         23        3           0.5    1.35             6         55
 6 Archery … Elit…         22        3           0.5    0.96             7         35
 7 Archery … Impe…         22        3           0.5    0.96             7         35
 8 Archery … Slin…         25        2           0.775  0.96             7         40
 9 Archery … Cava…         34        2           1      1.4              5         50
10 Archery … Heav…         27        2           1      1.4              6         60
# … with 111 more rows, and 5 more variables: damage_attack <dbl>, melee_armor <dbl>,
#   pierce_armor <dbl>, range_max <dbl>, cost_norm <dbl>

Next, the PCA. I particularly like FactoMineR::PCA, but other implementations can provide the same information. Note the use of tidytext::reorder_within to reorder the variables according to their contribution and correlation values within each component.

# principal component analysis
agepc <- PCA(aocImpwterr[, 3:13])
# get descriptions for each dimension
dimsload <- dimdesc(agepc)
# get correlations
dim1corr <- as_tibble(dimsload$Dim.1[[1]], rownames = "unit_attribute") %>% mutate(component = "Dim.1")
dim2corr <- as_tibble(dimsload$Dim.2[[1]], rownames = "unit_attribute") %>% mutate(component = "Dim.2")
dimscorr <- bind_rows(dim1corr, dim2corr)
# variable contributions
contribs <- agepc$var$contrib
# PC coordinates
pccooords <- agepc$ind$coord
# together
pcssbd <- bind_cols(aocImpwterr, as_tibble(pccooords))
# melt
contribslong <- contribs %>%
  as_tibble(rownames = "unit_property") %>%
  pivot_longer(
    cols = Dim.1:Dim.5, names_to = "component",
    values_to = "contribution")
# arrange by contribution
contrDat <-
  contribslong %>%
  filter(component == "Dim.1" | component == "Dim.2") %>%
  mutate(unit_property = tidytext::reorder_within(unit_property, -contribution, within = component, sep = "."))
# arrange by correlation
corrDat <-
  dimscorr %>%
  mutate(unit_property = tidytext::reorder_within(unit_attribute, -correlation, within = component, sep = "."))

Ready for plotting. Let’s see the variable contributions and correlations. Note the use of stat_summary to replace bars with crossbars.

loadingsPlt <-
  ggplot(contrDat, aes(x = unit_property, y = contribution, color = contribution)) +
  stat_summary(fun = "identity", position = "identity", geom = "crossbar", size = 2, show.legend = FALSE) +
  geom_hline(yintercept = 0, size = 0.5, color = "dark grey", linetype = 2) +
  scale_color_scico(palette = "imola", direction = -1) +
  ggthemes::theme_few(base_size = 16, base_family = "Laksaman") +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
    axis.ticks.x = element_blank()) +
  labs(x = "unit attributes") +
  facet_grid(~component, scales = "free_x")
loadingsPlt

correlatPlt <-
  ggplot(corrDat, aes(x = unit_property, y = correlation, color = correlation)) +
  stat_summary(fun = "identity", position = "identity", geom = "crossbar", size = 2, show.legend = FALSE) +
  scale_color_scico(palette = "tofino", direction = -1) +
  geom_hline(yintercept = 0, size = 0.5, color = "dark grey", linetype = 2) +
  ggthemes::theme_few(base_size = 16, base_family = "Laksaman") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
    axis.ticks.x = element_blank()) +
  labs(x = "unit attributes") +
  facet_grid(~component, scales = "free_x")
correlatPlt

Dimension descriptions

click to enlarge
click to enlarge

PCA scatterplot

Now, the classic PCA scatterplot. gghighlight is quite useful here for facetting and graying out the data relative to a ‘focal’ group.

unitsPCAplt <-
  pcssbd %>% filter(unit_type != "Monastery") %>% # remove facet with just 2 points
  ggplot() +
  geom_point(aes(Dim.1, Dim.2, fill = unit_type),
    color = "black", pch = 21, size = 4) +
  scale_fill_scico_d(palette = "nuuk") +
  geom_hline(yintercept = 0, size = 0.5, color = "dark grey", linetype = 3) +
  geom_vline(xintercept = 0, size = 0.5, color = "dark grey", linetype = 3) +
  coord_equal() +
  ggthemes::theme_few(base_size = 16, base_family = "Laksaman") +
  facet_wrap(~unit_type, nrow = 2) +
  gghighlight()
unitsPCAplt
click to enlarge

To summarize the above figures, we can get some top and bottom cases for the first two dimensions and use these to label those points at the edges of the PCA space. Note how slice_max and slice_min are the new replacements for top_n.

# label top and bottom cases
maxD1 <- pcssbd %>% slice_max(Dim.1, n = 2)
maxD2 <- pcssbd %>% slice_max(Dim.2, n = 2)
minD1 <- pcssbd %>% slice_min(Dim.1, n = 4)
minD2 <- pcssbd %>% slice_min(Dim.2, n = 3)
ext_cases <- bind_rows(maxD1, maxD2, minD1, minD2) %>% distinct()
ext_cases <- ext_cases %>%
  filter(!str_detect(unit, "^K")) %>%
  mutate(unit_labs = unit)

Now, the ‘global’ PCA plot. The ggplot2 code here looks bloated because I added some text and segment annotations to illustrate what we found out when looking at the contributions and correlations. The arrow lengths are arbitrary.

# global PCA plot
glblpca <-
  left_join(pcssbd, ext_cases) %>%
  ggplot() +
  geom_point(aes(Dim.1, Dim.2, fill = unit_type, label = unit_labs),
    color = "black", pch = 21, size = 4) +
  scale_fill_scico_d(palette = "nuuk", guide = FALSE) +
  geom_hline(yintercept = 0, size = 0.5, color = "dark grey", linetype = 3) +
  geom_vline(xintercept = 0, size = 0.5, color = "dark grey", linetype = 3) +
  geom_label_repel(aes(Dim.1, Dim.2, label = unit_labs),
    fill = "#355070",
    label.r = 0.07, color = "#f4e285", family = "bold", segment.color = "black") +
  annotate("text",
    x = -0.7, y = 3.4,
    label = "Cheaper, faster build times",
    size = 3.5, color = "#390099",
    fontface = "italic") +
  annotate(
    geom = "segment", x = -0.5, y = 3.7,
    xend = -2.5, yend = 3.7, color = "#ffcad4", size = 1,
    arrow = arrow(length = unit(2, "mm"), type = "closed")) +
  annotate("text",
    x = 7.2, y = 3.4,
    label = "More attack damage",
    size = 3.5, color = "#390099",
    fontface = "italic") +
  annotate(geom = "segment", x = 7.2, y = 3.7,
    xend = 9.2, yend = 3.7, color = "#ffcad4", size = 1,
    arrow = arrow(length = unit(2, "mm"), type = "closed")) +
  annotate("text",
    x = 4.8, y = -3.2,
    label = "More hit points",
    size = 3.5, color = "#390099",
    fontface = "italic") +
  annotate(
    geom = "segment", x = 4.6, y = -1,
    xend = 4.6, yend = -3, color = "#ffcad4", size = 1, alpha = 0.5,
    arrow = arrow(length = unit(2, "mm"), type = "closed")) +
  annotate("text",
    x = 4.5, y = 2.8,
    label = "More attack range and line of sight",
    size = 3.5, color = "#390099",
    fontface = "italic") +
  annotate(
    geom = "segment", x = 4.6, y = 0.5,
    xend = 4.6, yend = 2.5, color = "#ffcad4", size = 1, alpha = 0.5,
    arrow = arrow(length = unit(2, "mm"), type = "closed")) +
  coord_equal() +
  labs(x = "Dim.1 (32%)", y = "Dim.2 (17%)") +
  ggthemes::theme_few(base_size = 16, base_family = "Laksaman")
glblpca

Pretty cool.

click to enlarge

That’s all. For better explanations of PCA, read these great posts by Julia Silge:

All feedback welcome.
Start the game already!