Matrix Indexing

I recently received a file from a collaborator in which some categorical variables describing various primate species had been recoded into binary columns. I later learned that this is known as a design or model matrix, in which categories (factors) are expanded into a set of dummy variables.

For example, I was looking at something like this:

species	arboreal	terrestrial
sp a	0	1
sp b	1	0
sp c	1	0

Instead of something like this:

species	locomotion
sp a	terrestrial
sp b	arboreal
sp c	arboreal
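
For reference, the forward direction (from one categorical column to a set of dummy columns) can be sketched in base R with model.matrix(). The toy data frame mirrors the small table above; the object names sp and dummies are made up for illustration.

```r
# toy data mirroring the example table (object names are illustrative)
sp <- data.frame(
  species = c("sp a", "sp b", "sp c"),
  locomotion = factor(c("terrestrial", "arboreal", "arboreal"))
)

# "- 1" drops the intercept so every factor level gets its own 0/1 column
dummies <- model.matrix(~ locomotion - 1, data = sp)
colnames(dummies)
```

Note that model.matrix() prefixes each level with the variable name, so the columns come out as locomotionarboreal and locomotionterrestrial.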

About ten of the variables that I needed were coded as binary columns and I found myself unsure of how I could change them back without too much work. I didn’t know what to call this or what terms to search for, so I took to Twitter and asked:

#rstats people:
what's the dplyr or #tidyr way to do this?
help pls I'm stuck :( pic.twitter.com/OAt5jGed8L
— Luis D. Verde (@LuisDVerde) May 25, 2017

I’m a tidyverse type of person so I specifically asked for a dplyr or tidyr approach. By then I had already written a loop that more or less worked, but I knew I was missing something. Almost immediately the Twitter #rstats community came through: both Naupaka Zimmerman and Giulio Valentino Dalla Riva suggested that I ‘melt’ the data into long format, keep only the rows with value 1, and then drop the column holding the values.

Essentially:

gather() %>% filter() %>% select()
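
Applied to the small species table shown earlier, the chain can be sketched as follows. This is a minimal example, not the exact code from the thread; the column names locomotion and present are arbitrary choices for the key and value columns.

```r
library(dplyr)
library(tidyr)

# toy table from the start of the post, with its species/ID column
toy <- data.frame(
  species = c("sp a", "sp b", "sp c"),
  arboreal = c(0, 1, 1),
  terrestrial = c(1, 0, 0)
)

toy_long <- toy %>%
  gather(locomotion, present, arboreal:terrestrial) %>% # wide to long
  filter(present == 1) %>%                              # keep the category that applies
  select(-present) %>%                                  # drop the 0/1 column
  arrange(species)
```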

My mistake was not leaving a species/ID column in the rough screenshot that I posted or in the toy dataset that I was using; without one, I couldn’t get the above approach to work straight away. After realizing that I needed row IDs, I replied in the Twitter thread, and T.J. Mahr pointed out that the tibble package has a new function that adds row IDs as a column (rowid_to_column()).

If you have a table that already has row IDs, then there’s no need to create them.
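
As a quick sketch of what rowid_to_column() does on a table without IDs (the toy data frame here is made up for illustration):

```r
library(tibble)

# a table of dummy columns with no ID column
toy_noid <- data.frame(arboreal = c(0, 1, 1), terrestrial = c(1, 0, 0))

# adds a sequential ID as the first column; "rowid" is the default name
toy_ids <- rowid_to_column(toy_noid, var = "rowid")
names(toy_ids)
```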

That was the last piece missing and I got everything working. Let’s have a look at how to recode dummy binary columns into a single variable (also known as matrix indexing).

First, the tidyverse approach:

# load packages
library(dplyr)
library(tibble)
library(tidyr)

# create the example dataframe
## Biogeographic regions
regs <- matrix(c(0,0,0,0,0,0,0,
                 0,1,0,0,0,0,0,
                 1,0,1,1,1,0,0,
                 0,0,0,0,0,1,1),ncol = 4, nrow = 7)
colnames(regs) <- c("Asia","Madagascar","Mainland","Neotropics")
regsdf <- data.frame(regs) #coerce to dataframe

# tidyverse approach
regions <- regsdf %>%
  rowid_to_column() %>%
  gather(region, present, Asia:Neotropics) %>%
  filter(present == 1) %>%
  select(-present) %>%
  arrange(rowid)
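
gather() still works, but the tidyr documentation now marks it as superseded in favour of pivot_longer(). A sketch of the same recoding with pivot_longer(), assuming tidyr 1.0.0 or later is installed (the object name regions_longer is made up here):

```r
library(dplyr)
library(tibble)
library(tidyr)

# same example data as above
regs <- matrix(c(0,0,0,0,0,0,0,
                 0,1,0,0,0,0,0,
                 1,0,1,1,1,0,0,
                 0,0,0,0,0,1,1), ncol = 4, nrow = 7)
colnames(regs) <- c("Asia", "Madagascar", "Mainland", "Neotropics")
regsdf <- data.frame(regs)

regions_longer <- regsdf %>%
  rowid_to_column() %>%
  pivot_longer(Asia:Neotropics, names_to = "region", values_to = "present") %>%
  filter(present == 1) %>%
  select(-present) %>%
  arrange(rowid)
```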

With a loop (thanks to Daijiang Li for this suggestion):

# create an empty character vector and fill it with the name of the column that isn't zero within each row
regsvec <- character(nrow(regsdf))
for (i in seq_len(nrow(regsdf))) {
    regsvec[i] <- names(regsdf)[which(regsdf[i, ] != 0)]
}

Base R approach using the apply family of functions (thanks to Damien R. Farine for this one):

# similar, but using apply() over the rows instead of an explicit loop
regionvec <- names(regsdf)[apply(regsdf, 1, function(x) which(x == 1))]
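
Both base R versions rely on every row having exactly one column equal to 1; if a row has none or several, they can no longer return one name per row. A minimal sketch of a guard for that assumption, followed by max.col() as another base R option (this is a swapped-in alternative, not one suggested in the thread; regionvec2 is a made-up name):

```r
# same example data as above
regs <- matrix(c(0,0,0,0,0,0,0,
                 0,1,0,0,0,0,0,
                 1,0,1,1,1,0,0,
                 0,0,0,0,0,1,1), ncol = 4, nrow = 7)
colnames(regs) <- c("Asia", "Madagascar", "Mainland", "Neotropics")
regsdf <- data.frame(regs)

# guard: stop early unless each row has exactly one 1
stopifnot(all(rowSums(regsdf == 1) == 1))

# max.col() gives, for each row, the index of the column holding the maximum
regionvec2 <- names(regsdf)[max.col(as.matrix(regsdf), ties.method = "first")]
```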

When this recoding has to be repeated for several variables, the resulting tbls need to be put back together. I came across a nifty way of doing this, using Reduce() to perform multiple left joins.

# another variable to recode
## locomotion mode
locomotionType <- matrix(c(0,0,1,0,1,0,0,
                           1,1,0,1,0,1,1),ncol=2, nrow = 7)
colnames(locomotionType) <- c("loc_arboreal","loc_terrestrial")
locomotionTypedf <- data.frame(locomotionType)

# indexing
locType <- locomotionTypedf %>%
  rowid_to_column() %>%
  gather(loctype, present, loc_arboreal:loc_terrestrial) %>%
  filter(present == 1) %>%
  select(-present) %>%
  arrange(rowid)

# one more variable
## habitat type
habt <- matrix(c(1,0,1,0,0,0,0,
                 0,0,0,0,0,1,1,
                 0,0,0,1,1,0,0,
                 0,1,0,0,0,0,0),ncol = 4, nrow = 7)
colnames(habt) <- c("urban","forest","dry","crops")
habtdf <- data.frame(habt)

# indexing
habType <- habtdf %>%
  rowid_to_column() %>%
  gather(habitatType, present, urban:crops) %>%
  filter(present == 1) %>%
  select(-present) %>%
  arrange(rowid)
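
Since the same chain is repeated for every variable, it could be wrapped in a small function. The sketch below uses a hypothetical helper, collapse_dummies(), which is not part of any package. It uses pivot_longer() rather than gather() because the new column name can then be passed as a plain string, so it assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tibble)
library(tidyr)

# hypothetical helper: collapse all dummy columns in `df` into one column called `name`
collapse_dummies <- function(df, name) {
  df %>%
    rowid_to_column() %>%
    pivot_longer(-rowid, names_to = name, values_to = "present") %>%
    filter(present == 1) %>%
    select(-present) %>%
    arrange(rowid)
}

# usage with the first three rows of the locomotion example
loc_demo <- data.frame(loc_arboreal = c(0, 0, 1), loc_terrestrial = c(1, 1, 0))
loc_type_demo <- collapse_dummies(loc_demo, "loctype")
```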

# join the three
sptraits <- Reduce(left_join,list(regions,locType,habType))
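
Called this way, left_join() works out the join column on its own (here rowid, the only column the tables share) and prints a message saying so. A sketch of the same idea with the key spelled out, using small stand-in tables built from the first three rows of the examples above (the *_demo names are made up):

```r
library(dplyr)

# stand-ins for the recoded tables, first three rows only
regions_demo <- data.frame(rowid = 1:3, region = c("Mainland", "Madagascar", "Mainland"))
loctype_demo <- data.frame(rowid = 1:3, loctype = c("loc_terrestrial", "loc_terrestrial", "loc_arboreal"))
habtype_demo <- data.frame(rowid = 1:3, habitatType = c("urban", "crops", "urban"))

# Reduce() joins the first two tables, then joins the result with the third
traits_demo <- Reduce(
  function(x, y) left_join(x, y, by = "rowid"),
  list(regions_demo, loctype_demo, habtype_demo)
)
```

Naming the key explicitly guards against an accidental join on some other column that two tables happen to share.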

Feel free to contact me with any questions or simply to let me know if you found this useful.

Luis D. Verde Arregoitia
