Matrix Indexing

I recently received a file from a collaborator in which some categorical variables describing various primate species had been recoded into binary columns. I later learned that this is known as a design or model matrix, in which categories (factors) are expanded into a set of dummy variables.

For example, I was looking at something like this:

species arboreal terrestrial
sp a 0 1
sp b 1 0
sp c 1 0

Instead of something like this:

species locomotion
sp a terrestrial
sp b arboreal
sp c arboreal

About ten of the variables that I needed were coded as binary columns and I found myself unsure of how I could change them back without too much work. I didn’t know what to call this or what terms to search for, so I took to Twitter and asked:

I’m a tidyverse type of person so I specifically asked for a dplyr or tidyr approach. By then I had already written a loop that more or less worked, but I knew I was missing something. Almost immediately the Twitter #rstats community came through and both Naupaka Zimmerman and Giulio Valentino Dalla Riva suggested that I ‘melt’ the data into long format; filter only the rows with value 1, and then select out the column with the values.


gather() %>% filter() %>% select()

My mistake was not leaving a species/ID column in the rough screenshot that I posted and in the toy dataset that I was using, without which I couldn’t get the above approach to work straight away. After realizing that I needed row IDs I replied in the Twitter thread and T.J. Mahr pointed out that the tibble package has a new function to add row IDs to columns (rowid_to_column()).

If you have a table that already has row IDs, then there’s no need to create them.

That was the last piece missing and I got everything working. Let’s have a look at how to recode dummy binary columns into a single variable (also known as matrix indexing).

First, the tidyverse approach:

# load packages

# create the example dataframe
## Biogeographic regions
regs <- matrix(c(0,0,0,0,0,0,0,
                 0,0,0,0,0,1,1),ncol = 4, nrow = 7)
colnames(regs) <- c("Asia","Madagascar","Mainland","Neotropics")
regsdf <- data.frame(regs) #coerce to dataframe

# tidyverse approach
regions <- regsdf %>% rowid_to_column() %>% gather(region,present,Asia:Neotropics) %>% 
              filter(present==1) %>% select(-present) %>% arrange(rowid)

With a loop (thanks to Daijiang Li for this suggestion)

# create an empty vector and populate it with the variable name that isn't cero within each row
regsvec <- c()
for(i in 1:nrow(regsdf)) {
    regsvec[i] <- names(regsdf)[which(regsdf[i,]!=0)]

baseR approach using the apply family of functions (thanks to Damien R. Farine for this one)

# similar but using the apply family of functions
regionvec <- names(regsdf)[apply(regsdf,1,function(x) {which(x==1)})]

When this indexing has to be done many times for different variables, I came across a nifty way of putting the new tbls together using Reduce() to perform multiple left joins.

# another variable to recode
## locomotion mode
locomotionType <- matrix(c(0,0,1,0,1,0,0,
                           1,1,0,1,0,1,1),ncol=2, nrow = 7,)
colnames(locomotionType) <- c("loc_arboreal","loc_terrestrial")
locomotionTypedf <- data.frame(locomotionType)

# indexing
locType <- locomotionTypedf %>% rowid_to_column() %>% gather(loctype,present,loc_arboreal:loc_terrestrial) %>% 
  filter(present==1) %>% select(-present) %>% arrange(rowid)

# one more variable
## habitat type
habt <- matrix(c(1,0,1,0,0,0,0,
                 0,1,0,0,0,0,0),ncol = 4, nrow = 7)
colnames(habt) <- c("urban","forest","dry","crops")
habtdf <- data.frame(habt)

# indexing
habType <- habtdf %>% rowid_to_column() %>% gather(habitatType,present,urban:crops) %>% 
  filter(present==1) %>% select(-present) %>% arrange(rowid)

# join the three
sptraits <- Reduce(left_join,list(regions,locType,habType)) 

Feel free to contact me with any questions or simply to let me know if you found this useful.