Matrix Indexing
I recently received a file from a collaborator in which some categorical variables describing various primate species had been recoded into binary columns. I later learned that this is known as a design or model matrix, in which categories (factors) are expanded into a set of dummy variables.
For example, I was looking at something like this:
species | arboreal | terrestrial |
---|---|---|
sp a | 0 | 1 |
sp b | 1 | 0 |
sp c | 1 | 0 |
Instead of something like this:
species | locomotion |
---|---|
sp a | terrestrial |
sp b | arboreal |
sp c | arboreal |
About ten of the variables that I needed were coded as binary columns and I found myself unsure of how I could change them back without too much work. I didn’t know what to call this or what terms to search for, so I took to Twitter and asked:
#rstats people:
— Luis D. Verde (@LuisDVerde) May 25, 2017
what's the dplyr or #tidyr way to do this?
help pls I'm stuck :( pic.twitter.com/OAt5jGed8L
I’m a tidyverse type of person so I specifically asked for a dplyr or tidyr approach. By then I had already written a loop that more or less worked, but I knew I was missing something. Almost immediately the Twitter #rstats community came through and both Naupaka Zimmerman and Giulio Valentino Dalla Riva suggested that I ‘melt’ the data into long format; filter only the rows with value 1, and then select out the column with the values.
Essentialy:
My mistake was not leaving a species/ID column in the rough screenshot that I posted and in the toy dataset that I was using, without which I couldn’t get the above approach to work straight away. After realizing that I needed row IDs I replied in the Twitter thread and T.J. Mahr pointed out that the tibble package has a new function to add row IDs to columns (rowid_to_column()).
If you have a table that already has row IDs, then there’s no need to create them.
That was the last piece missing and I got everything working. Let’s have a look at how to recode dummy binary columns into a single variable (also known as matrix indexing).
First, the tidyverse approach:
With a loop (thanks to Daijiang Li for this suggestion)
baseR approach using the apply family of functions (thanks to Damien R. Farine for this one)
When this indexing has to be done many times for different variables, I came across a nifty way of putting the new tbls together using Reduce() to perform multiple left joins.
Feel free to contact me with any questions or simply to let me know if you found this useful.