Basic biodiversity data-wrangling operations
At some point we all end up working with datasets that describe biodiversity in some form or another. For example: morphological measurements for museum specimens, occurrence records from bird surveys, ecological traits for many different species, etc. Becuse these datasets share some common features, there are certain tools and operations that we can use to get the rows, columns, and content into something we can use in our analyses.
I’ll admit here that as recently as 2013 - already several years into my PhD program and with a statistics and programming expert as a supervisor, I still wasted a lot of time getting my data ready manually. I took a long time transitioning from xls spreadsheets, doing a lot of copying and pasting, writing things down on paper, and retyping data. These examples deal with very basic operations, but I wish I knew all this when I first got into comparative analyses. The first time I tackled some of these operations for a terrestrial mammal dataset of 3300 species, I wasted months organizing a single flat table, running the risk of introducing errors and with no way of keeping track of what I did.
Hopefully people can find this post with web searches, but this also a reference for myself. I often have to go through my old scripts to remember how to do these basic operations. This post walks through five data-wrangling tips, and I’ll follow it up in the near future.
These little operations are all reproducible on their own, I put everything in code blocks so there is some redundance, just make sure to load dplyr.
Separators
Sometimes the different parts of a scientific name are separated by non-whitespace characters (probably because many phylogenetics programs don’t like spaces). We can replace them all using pattern matching + replacement with base::gsub.
Set up an example table of Mesoamerican rodents
spNames | Genus | spEpithet |
---|---|---|
Peromyscus_melanophrys | Peromyscus | melanophrys |
Peromyscus_nasutus | Peromyscus | nasutus |
Peromyscus_schmidlyi | Peromyscus | schmidlyi |
Peromyscus_melanotis | Peromyscus | melanotis |
Liomys_pictus | Liomys | pictus |
Baiomys_musculus | Baiomys | musculus |
Let’s rewrite the spNames column, with spaces instead of underscores
spNames | Genus | spEpithet |
---|---|---|
Peromyscus melanophrys | Peromyscus | melanophrys |
Peromyscus nasutus | Peromyscus | nasutus |
Peromyscus schmidlyi | Peromyscus | schmidlyi |
Peromyscus melanotis | Peromyscus | melanotis |
Liomys pictus | Liomys | pictus |
Baiomys musculus | Baiomys | musculus |
We can also replace spaces with underscores, or with any other characters.
Combining columns
Sometimes, we find that a scientific name has been split into separate columns when we open a table. This is particularly prevalent in IUCN Red List tables. In Excel, I used to do this by concatenating cells with the “&” operator, then I would copy and paste the values into a new column before removing the one with the formula.
Using the same example table from above, we can use:
spNames | Genus | spEpithet | binomial |
---|---|---|---|
Peromyscus_melanophrys | Peromyscus | melanophrys | Peromyscus melanophrys |
Peromyscus_nasutus | Peromyscus | nasutus | Peromyscus nasutus |
Peromyscus_schmidlyi | Peromyscus | schmidlyi | Peromyscus schmidlyi |
Peromyscus_melanotis | Peromyscus | melanotis | Peromyscus melanotis |
Liomys_pictus | Liomys | pictus | Liomys pictus |
Baiomys_musculus | Baiomys | musculus | Baiomys musculus |
Change the order of variables
With comparative data I’m always more comfortable having the column with taxon names at the very beginning. dplyr comes in handy here.
Let’s make a table in which the column with binomial names we want is at the end and not nice for when we scroll through.
Note how this code includes a line for creating a matrix with random values, this becomes useful when making example datasets or simulations.
spNames | Genus | spEpithet | V1 | V2 | V3 | V4 | V5 | binomial |
---|---|---|---|---|---|---|---|---|
Peromyscus_melanophrys | Peromyscus | melanophrys | -0.265632782924824 | 0.327897273820304 | 0.219972954482649 | -0.64632767838069 | -1.89915607125176 | Peromyscus melanophrys |
Peromyscus_nasutus | Peromyscus | nasutus | -0.973022546256686 | 0.333634724290662 | -0.395166433835357 | 1.91347143971885 | 0.957920666155521 | Peromyscus nasutus |
Peromyscus_schmidlyi | Peromyscus | schmidlyi | 0.799411588833332 | 0.275100569368765 | -0.366978824994484 | 0.824579802851748 | 1.47553578545787 | Peromyscus schmidlyi |
Peromyscus_melanotis | Peromyscus | melanotis | -1.71310752865598 | 0.4877680373641 | -1.30147483454924 | -1.65639349280446 | 0.562997113124786 | Peromyscus melanotis |
Liomys_pictus | Liomys | pictus | -1.14047435456842 | 0.656050064169863 | 0.481688180071972 | 0.652645856842719 | 1.20831417376165 | Liomys pictus |
Baiomys_musculus | Baiomys | musculus | 0.550892132837616 | -0.0246539689028568 | -0.260391572360798 | 0.803264181793322 | -0.583093143639412 | Baiomys musculus |
We can use the everything() argument in dplyr::select to put the species names first and then everything else, no need to type all the column names.
binomial | spNames | Genus | spEpithet | V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|---|---|---|---|
Peromyscus melanophrys | Peromyscus_melanophrys | Peromyscus | melanophrys | -0.490526050918862 | -2.29797595130779 | 2.11160930871023 | -0.430962443007775 | 0.50344941399482 |
Peromyscus nasutus | Peromyscus_nasutus | Peromyscus | nasutus | -0.781036682214806 | 1.40044764539878 | -0.209455037987789 | -0.935971109894094 | -0.366825311604144 |
Peromyscus schmidlyi | Peromyscus_schmidlyi | Peromyscus | schmidlyi | -2.51591860901864 | -0.23778409071267 | -1.1462302628069 | 0.0739518575091712 | 0.107382290698513 |
Peromyscus melanotis | Peromyscus_melanotis | Peromyscus | melanotis | 1.37170335168867 | 1.44752591932182 | -0.12263680063621 | -0.552935232650411 | 1.27964207296267 |
Liomys pictus | Liomys_pictus | Liomys | pictus | -0.575792000674879 | -1.22862981043979 | 0.400478776675252 | -2.41569084734938 | 0.184892341645976 |
Baiomys musculus | Baiomys_musculus | Baiomys | musculus | 0.876058186669997 | 0.411338959287477 | -1.71683685761396 | -0.333207472623785 | 0.23739134255065 |
Complete NA values in one column with data from another column
If we have a column with gaps, and we want to replace these missing values with values from another columnn in the same table, we can use an ifelse statement to find NA values and replace them with the value on the same row but for a different column.
I often use this when working with mammalian body mass data from various source.
spNames | Genus | spEpithet | tailLength | tailLengthNew |
---|---|---|---|---|
Peromyscus_melanophrys | Peromyscus | melanophrys | 90 | 92 |
Peromyscus_nasutus | Peromyscus | nasutus | NA | 100 |
Peromyscus_schmidlyi | Peromyscus | schmidlyi | NA | 119 |
Peromyscus_melanotis | Peromyscus | melanotis | 123 | 144 |
Liomys_pictus | Liomys | pictus | NA | 89 |
Baiomys_musculus | Baiomys | musculus | 60 | 68 |
spNames | Genus | spEpithet | tailLength | tailLengthNew | tailLengthBoth |
---|---|---|---|---|---|
Peromyscus_melanophrys | Peromyscus | melanophrys | 90 | 92 | 90 |
Peromyscus_nasutus | Peromyscus | nasutus | NA | 100 | 100 |
Peromyscus_schmidlyi | Peromyscus | schmidlyi | NA | 119 | 119 |
Peromyscus_melanotis | Peromyscus | melanotis | 123 | 144 | 123 |
Liomys_pictus | Liomys | pictus | NA | 89 | 89 |
Baiomys_musculus | Baiomys | musculus | 60 | 68 | 60 |
Count how many of each
We often need to group our data and count how many instances we have of each group.
Output from the count function
Genus | n |
---|---|
Baiomys | 1 |
Liomys | 1 |
Peromyscus | 4 |
Using mutate and the n() alternative to count()
spNames | Genus | spEpithet | howMany |
---|---|---|---|
Peromyscus_melanophrys | Peromyscus | melanophrys | 4 |
Peromyscus_nasutus | Peromyscus | nasutus | 4 |
Peromyscus_schmidlyi | Peromyscus | schmidlyi | 4 |
Peromyscus_melanotis | Peromyscus | melanotis | 4 |
Liomys_pictus | Liomys | pictus | 1 |
Baiomys_musculus | Baiomys | musculus | 1 |
That’s it for now, I’ll post five more tips and tricks next month. If there is any mistake in the code please let me know. I hope this helps.