This is Part Two (of many) in a series of posts that will go through some simple steps to wrangle the data that is often provided as part of collections-based studies. Data wrangling involves importing, cleaning, and transforming raw data into useful information that is ready for analysis. It is a time-consuming process, so having a reusable and flexible suite of scripts and code can really save us time in the long run.
For Part 1, click here.
This post is on handling multispecies tables of species trait values in which two related values appear together in the same observation or ‘cell’. For example: showing original and transformed values together using some form of punctuation or special character to separate the different values. This saves space when presenting tables in publications and is relatively common in comparative datasets, but if we only want to work with one of the two values that appear together we need to separate them first.
As I work my way through the tables and appendices of various rodent phylogeny/morphology papers, I’ll continue to post more of the R code that has saved me lots of time by not having to edit large datasets by hand. This series of posts is a way for me to document my code so that it may help others; it is not at all a critique of how different authors present their data, which in many cases is influenced by space restrictions from the journals.
For this post, I wanted to tidy up a simple table in this paper from 2008 by Lessa et al. on the morphological evolution of caviomorph rodents. The table contains discretised values, with the original data used to derive them helpfully placed in parentheses next to each value. The table actually looks pretty crisp in the PDF version of the paper, but this format is not ideal if we want to work with the values directly.
A subset of the data showing the issues that need to be wrangled:
| Taxa | 1. Burrow structure | 2. Deltoid process | 3. Epicondylar width |
|---|---|---|---|
| Octodontomys gliroides | 2 | 1 (0.33) | 1 (0.24) |
| Octomys mimax | 2 | 1 (*) | 1 (0.22) |
| Eucelophorus zaratei | ? | 2 (†) | 1 (†) |
| Praectenomys rhombidens | ? | 2 (‡) | 1 (‡) |
Simple enough, but notice that:
- not all the variables have compound values
- there are missing values in some observations
- there are special characters instead of values in some cases (these are used to explain data sources in the table caption)
Let’s load a few packages and set up the example data. This is how the table looked after reading in a csv file created directly from the html version of the paper.
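The original post’s code chunk didn’t survive here, so as a sketch I recreate the subset of the table shown above with base R, mimicking the column names that `read.csv()` produces from the html table (note the automatic “X” prefix and the dots standing in for spaces); the object name `caviomorphs` is my own.

```r
# Recreate the imported table; column names mimic read.csv()'s
# check.names behaviour ("1. Burrow structure" -> "X1..Burrow.structure")
caviomorphs <- data.frame(
  Taxa = c("Octodontomys gliroides", "Octomys mimax",
           "Eucelophorus zaratei", "Praectenomys rhombidens"),
  X1..Burrow.structure  = c("2", "2", "?", "?"),
  X2..Deltoid.process   = c("1 (0.33)", "1 (*)", "2 (†)", "2 (‡)"),
  X3..Epicondylar.width = c("1 (0.24)", "1 (0.22)", "1 (†)", "1 (‡)"),
  stringsAsFactors = FALSE
)
```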
We can start by replacing the question marks that stand in for missing values with NA, and then cleaning up the variable names. The spaces in these names get garbled when importing the data, and a prefix is added automatically because variable names aren’t allowed to start with a number. With the janitor package and then some regex we can clean up the spaces and prefixes in the column names.
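A minimal sketch of that cleanup, assuming a small version of the imported table (the object and column names here are my own recreation):

```r
library(dplyr)
library(janitor)

caviomorphs <- data.frame(
  Taxa = c("Octodontomys gliroides", "Eucelophorus zaratei"),
  X1..Burrow.structure = c("2", "?"),
  X2..Deltoid.process  = c("1 (0.33)", "2 (†)"),
  stringsAsFactors = FALSE
)

caviomorphs <- caviomorphs %>%
  mutate(across(everything(), ~ na_if(., "?"))) %>%  # "?" becomes NA
  clean_names()                                      # lower_snake_case names
# strip the automatic prefix and digits, e.g. x1_burrow_structure
names(caviomorphs) <- gsub("^x\\d+_", "", names(caviomorphs))
```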
With clean variable names:
To split the compound values easily, we can melt the data into long form using tidyr to gather the columns with the species traits into a key-value pair. The Key will be the variable name and the Value the species trait value.
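A sketch of the reshape with tidyr’s gather(); the cleaned table is recreated inline here so the example stands on its own:

```r
library(tidyr)

clean_tbl <- data.frame(
  taxa = c("Octodontomys gliroides", "Octomys mimax"),
  burrow_structure = c("2", "2"),
  deltoid_process  = c("1 (0.33)", "1 (*)"),
  stringsAsFactors = FALSE
)

# stack the trait columns into Key/Value pairs, keeping taxa as an id
long_tbl <- gather(clean_tbl, key = "Key", value = "Value", -taxa)
```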
The first few lines look like this:
We can now split the Value column to get the discrete and raw/original values into separate columns. Here the separator is simply the space between the two. The new variable with the original species trait values has some special characters that we can clean up with regex before converting to numeric.
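A hedged sketch of the split, assuming a long-form table like the one above; values without a second element, and the special characters, end up as NA after the regex cleanup:

```r
library(dplyr)
library(tidyr)

long_tbl <- data.frame(
  taxa  = c("Octodontomys gliroides", "Octodontomys gliroides"),
  Key   = c("burrow_structure", "deltoid_process"),
  Value = c("2", "1 (0.33)"),
  stringsAsFactors = FALSE
)

long_sep <- long_tbl %>%
  separate(Value, into = c("discrete", "original"),
           sep = " ", fill = "right") %>%               # split on the space
  mutate(original = gsub("[()*†‡]", "", original),      # drop punctuation
         original = as.numeric(na_if(original, "")))    # empty strings to NA
```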
To change the data back from long to wide form, I figured it was simpler to do one variable type (discrete/original) at a time. This makes it easier to add a suffix that identifies variables as discrete or original before joining everything.
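That step might look something like this, assuming a separated table like the one above (names are my own):

```r
library(dplyr)
library(tidyr)

long_sep <- data.frame(
  taxa = c("Octodontomys gliroides", "Octodontomys gliroides"),
  Key  = c("burrow_structure", "deltoid_process"),
  discrete = c("2", "1"),
  original = c(NA, 0.33),
  stringsAsFactors = FALSE
)

discrete_wide <- long_sep %>%
  select(taxa, Key, discrete) %>%
  mutate(Key = paste0(Key, "_discrete")) %>%  # suffix identifies the type
  spread(Key, discrete)

original_wide <- long_sep %>%
  select(taxa, Key, original) %>%
  mutate(Key = paste0(Key, "_original")) %>%
  spread(Key, original)
```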
Here’s another way of looking at the new data frames. This is for the discrete values.
After joining the separate data frames for the different variables, the final table is ready for analysis. For example: if we want to subset the table to keep only the original data, we can use the convenience and power of dplyr’s select() and the contains() helper to keep only columns whose names contain a specific string.
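A sketch of the final join and subset, with the two wide tables recreated inline (object names are assumptions from the earlier steps):

```r
library(dplyr)

discrete_wide <- data.frame(taxa = "Octodontomys gliroides",
                            deltoid_process_discrete = "1",
                            stringsAsFactors = FALSE)
original_wide <- data.frame(taxa = "Octodontomys gliroides",
                            deltoid_process_original = 0.33,
                            stringsAsFactors = FALSE)

final_tbl <- left_join(discrete_wide, original_wide, by = "taxa")

# keep only the columns holding the original measurements
originals_only <- final_tbl %>% select(taxa, contains("original"))
```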
Simple enough, but I have the feeling that I’ll be using this code to clean up the tables of several papers in the near future. I think this code should be robust to the new release of dplyr; if it isn’t, I’ll update the post.