This series of posts will go through some simple steps to wrangle the data that often accompanies collections-based studies. This post is on handling multispecies tables of GenBank accession numbers, a common component of systematics and phylogeography papers.
As I work my way through the tables and appendices of various rodent phylogeny/morphology papers, I’ll continue to post more of the R code that has saved me lots of time by not having to edit large datasets by hand.
In this case, I wanted to tidy up a pretty basic table from this 2013 paper by Morgan and Alvarez. This supplementary table had species with their corresponding GenBank accession numbers for four different gene sequences.
A subset of the data:
|Cuniculus paca||AF433906 (Agouti paca)||AY206573 (Agouti paca)||AF433928||AF433881 (Agouti paca)|
This table in particular had an interesting twist: some of the accession numbers also included a species name for some special cases in which there have been taxonomic changes. This is useful information, and the original way of presenting it as merged cells in a table in a Word document was visually helpful. However, this format is not ideal for further analyses. We often need to download the sequences programmatically, or compute summary statistics about the taxa being studied, and this is all easier with tidier data.
Let’s use some helpful tidyverse functions to change the overall structure of the data and separate the accession numbers from the species synonyms for those species that have any.
Setting up an example data frame from a subset of the actual table:
Looks like the example above.
First we can ‘melt’ the data into long form, gathering the columns with the accession numbers into a key-value pair. The Key will be the gene and Value the accession number. Note that I’m a recent convert to the magrittr %<>% compound assignment operator.
Next we split the accession numbers that also contain a synonym and clean up the synonym column to remove punctuation or extra whitespace. The extra argument for separate() tells the function to only split the values based on the length of the into argument. In this case the function stops separating after the first split.
Pretty simple, but it beats having to manually cut out the synonyms. The original dataset had more genes and many more species but the principle is still the same.