Matrix Indexing
I recently received a file from a collaborator in which some categorical variables describing various primate species had been recoded into binary columns. I later learned that this is known as a design or model matrix, in which categories (factors) are expanded into a set of dummy variables.
For example, I was looking at something like this:
species | arboreal | terrestrial |
---|---|---|
sp a | 0 | 1 |
sp b | 1 | 0 |
sp c | 1 | 0 |
Instead of something like this:
species | locomotion |
---|---|
sp a | terrestrial |
sp b | arboreal |
sp c | arboreal |
About ten of the variables that I needed were coded as binary columns and I found myself unsure of how I could change them back without too much work. I didn’t know what to call this or what terms to search for, so I took to Twitter and asked:
#rstats people:
— Luis D. Verde (@LuisDVerde) May 25, 2017
what's the dplyr or #tidyr way to do this?
help pls I'm stuck :( pic.twitter.com/OAt5jGed8L
I’m a tidyverse type of person so I specifically asked for a dplyr or tidyr approach. By then I had already written a loop that more or less worked, but I knew I was missing something. Almost immediately the Twitter #rstats community came through and both Naupaka Zimmerman and Giulio Valentino Dalla Riva suggested that I ‘melt’ the data into long format; filter only the rows with value 1, and then select out the column with the values.
Essentialy:
gather() %>% filter() %>% select()
My mistake was not leaving a species/ID column in the rough screenshot that I posted and in the toy dataset that I was using, without which I couldn’t get the above approach to work straight away. After realizing that I needed row IDs I replied in the Twitter thread and T.J. Mahr pointed out that the tibble package has a new function to add row IDs to columns (rowid_to_column()).
If you have a table that already has row IDs, then there’s no need to create them.
That was the last piece missing and I got everything working. Let’s have a look at how to recode dummy binary columns into a single variable (also known as matrix indexing).
First, the tidyverse approach:
# load packages
library(dplyr)
library(tibble)
library(tidyr)
# create the example dataframe
## Biogeographic regions
regs <- matrix(c(0,0,0,0,0,0,0,
0,1,0,0,0,0,0,
1,0,1,1,1,0,0,
0,0,0,0,0,1,1),ncol = 4, nrow = 7)
colnames(regs) <- c("Asia","Madagascar","Mainland","Neotropics")
regsdf <- data.frame(regs) #coerce to dataframe
# tidyverse approach
regions <- regsdf %>% rowid_to_column() %>% gather(region,present,Asia:Neotropics) %>%
filter(present==1) %>% select(-present) %>% arrange(rowid)
With a loop (thanks to Daijiang Li for this suggestion)
# create an empty vector and populate it with the variable name that isn't cero within each row
regsvec <- c()
for(i in 1:nrow(regsdf)) {
regsvec[i] <- names(regsdf)[which(regsdf[i,]!=0)]
}
baseR approach using the apply family of functions (thanks to Damien R. Farine for this one)
# similar but using the apply family of functions
regionvec <- names(regsdf)[apply(regsdf,1,function(x) {which(x==1)})]
When this indexing has to be done many times for different variables, I came across a nifty way of putting the new tbls together using Reduce() to perform multiple left joins.
# another variable to recode
## locomotion mode
locomotionType <- matrix(c(0,0,1,0,1,0,0,
1,1,0,1,0,1,1),ncol=2, nrow = 7,)
colnames(locomotionType) <- c("loc_arboreal","loc_terrestrial")
locomotionTypedf <- data.frame(locomotionType)
# indexing
locType <- locomotionTypedf %>% rowid_to_column() %>% gather(loctype,present,loc_arboreal:loc_terrestrial) %>%
filter(present==1) %>% select(-present) %>% arrange(rowid)
# one more variable
## habitat type
habt <- matrix(c(1,0,1,0,0,0,0,
0,0,0,0,0,1,1,
0,0,0,1,1,0,0,
0,1,0,0,0,0,0),ncol = 4, nrow = 7)
colnames(habt) <- c("urban","forest","dry","crops")
habtdf <- data.frame(habt)
# indexing
habType <- habtdf %>% rowid_to_column() %>% gather(habitatType,present,urban:crops) %>%
filter(present==1) %>% select(-present) %>% arrange(rowid)
# join the three
sptraits <- Reduce(left_join,list(regions,locType,habType))
Feel free to contact me with any questions or simply to let me know if you found this useful.