This post from 2019 describes an approach for making Structure-style plots for model-based clusters of population genetic structure using ggplot2. The code still runs fine, but (a) the post was unrealistic and used made-up data that looks odd given the lack of structure, and (b) we can improve on the plots using new ggplot extensions. (I also wrote the post before learning to use the tidyr::pivot_* functions.)
Here I’ll recreate a Discriminant Analysis of Principal Components (DAPC) from this (Open Access) publication by Amélie Desvars-Larrive et al. from 2019. The authors used microsatellites to examine the genetic structure of brown rat populations in Eastern France, and ran DAPC in R using the adegenet package. This is a really cool paper with a very large sample that also examined resistance to rodenticides.
Figure 1 in the publication shows the study sites, the genetic structure in discriminant space, and the cluster assignment in panel C. We’ll focus on panel C.
Let’s repeat the analysis but then use ggplot to show the individual membership assignment of the sampled animals to the genetic clusters identified by DAPC. The underlying data was shared in an xlsx file here, which we can work with once we have it in our working directory.
Setup
To get going, we need to load a few packages and import the allele data (sources and versions added with annotater).
Preparing the data
At this point, there are a few easy steps to prepare the allele data for the df2genind() function, which converts this input to a genind object. I used sprintf to pad the repeats so that the ncode argument works (ncode is an optional integer giving the number of characters used for coding one genotype at one locus).
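As a rough sketch of the padding step, assuming a hypothetical table with one column per allele copy (the column names and repeat sizes below are made up for illustration and are not the published data), the alleles can be padded to a fixed width and pasted into one genotype string per locus. In this sketch ncode is set to the padded width of a single allele, which is how I understand df2genind() to split the strings; treat that as an assumption and check it against your adegenet version.

```r
# Hypothetical allele table: two columns per locus, repeat sizes as numbers.
alleles <- data.frame(
  ind    = c("rat1", "rat2", "rat3"),
  loc1_a = c(96, 102, 96),
  loc1_b = c(102, 102, 110),
  stringsAsFactors = FALSE
)

# Pad every allele to three characters so that all alleles have the same width.
pad3 <- function(x) sprintf("%03d", as.integer(x))

# One column per locus, with both alleles pasted together (e.g. "096102").
genotypes <- data.frame(
  loc1 = paste0(pad3(alleles$loc1_a), pad3(alleles$loc1_b)),
  row.names = alleles$ind,
  stringsAsFactors = FALSE
)

# Convert to genind only if adegenet is installed.
# ncode = 3 matches the three-character padding used above.
if (requireNamespace("adegenet", quietly = TRUE)) {
  rats_gen <- adegenet::df2genind(genotypes, ncode = 3, ploidy = 2)
  print(rats_gen)
}
```

Without the padding, a two-digit allele such as 96 would shift every character after it, and the fixed-width split would cut genotypes in the wrong place.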
The converted object now prints this:
Run DAPC
With the data in genind format, we can run find.clusters and then dapc with the newly identified clusters. For the Structure-style plot, we need the membership probabilities of each individual for each cluster. We pull these from the dapc result, pivot them to long format, and add labels. Then we’ll be ready for plotting.
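The sequence described above might look roughly like the following. This is a sketch under assumptions, not the exact code from the analysis: it uses the nancycats microsatellite dataset bundled with adegenet as a stand-in for the rat data, and the n.pca, n.clust, and n.da values are arbitrary choices that simply avoid the interactive prompts.

```r
library(adegenet)
library(tidyr)

# Stand-in data: the nancycats dataset shipped with adegenet,
# NOT the rat genotypes from the paper.
data(nancycats)

# Supplying n.pca and n.clust skips the interactive prompts.
# These values are arbitrary and chosen only for this sketch.
set.seed(1)
grp <- find.clusters(nancycats, n.pca = 50, n.clust = 5)
dapc_res <- dapc(nancycats, pop = grp$grp, n.pca = 20, n.da = 4)

# Membership probabilities: one row per individual, one column per cluster.
post <- as.data.frame(dapc_res$posterior)
post$ind <- rownames(post)

# Long format: one row per individual-cluster combination.
long_probs <- pivot_longer(post, cols = -ind,
                           names_to = "cluster", values_to = "prob")
head(long_probs)
```

Labels such as sampling site or municipality can then be joined onto long_probs by individual ID before plotting.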
For clarity, this is what the raw probabilities look like:
Then the long-format probabilities with labels for plotting and faceting look like this:
Plotting
facet_nested from the ggh4x package lets us implement nested facets to show sampling sites within their respective municipalities. Before calling ggplot, we need to set up some customization parameters for the nested facets via strip_nested. This lets us set the font size for each strip layer and turn off clipping of the strip text for locations with few samples (and consequently, narrow facets). I was previously unaware of ggh4x, but the package comes with lots of cool utility functions for ggplot and has a great hex logo.
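A strip setup along these lines could look like the sketch below. The font sizes are arbitrary, and the object name nested_strips is just a placeholder for illustration.

```r
library(ggplot2)
library(ggh4x)

# Strip settings for two nested facet layers
# (outer layer: municipality, inner layer: sampling site).
nested_strips <- strip_nested(
  text_x     = elem_list_text(size = c(10, 7)),  # one font size per layer
  by_layer_x = TRUE,   # apply the listed sizes layer by layer
  clip       = "off"   # let labels spill out of narrow strips
)
```

The resulting object is passed to the strip argument of facet_nested.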
On to the ggplot call. The suitable geom here is geom_col, because each individual’s membership probabilities are stacked into a single bar that adds up to 1. We control the spacing between locations with facets, the expand argument for the scales, and the panel.spacing argument for the overall plot theme. Note how the scales and space arguments to facet_nested help us accommodate the different number of individuals per location, and how switch places the facet labels below the plot.
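To make the pieces concrete, here is a minimal sketch of such a call on a tiny made-up table. The individuals, sites, municipalities, and probabilities are all hypothetical, and the spacing values are arbitrary; the real plot uses the long-format DAPC probabilities instead.

```r
library(ggplot2)
library(ggh4x)

# Hypothetical long-format membership probabilities (two clusters, six animals).
long_probs <- data.frame(
  ind          = rep(c("a1", "a2", "b1", "c1", "c2", "c3"), each = 2),
  cluster      = rep(c("1", "2"), times = 6),
  prob         = c(0.9, 0.1, 0.8, 0.2, 0.5, 0.5, 0.1, 0.9, 0.2, 0.8, 0.05, 0.95),
  municipality = rep(c("Town A", "Town A", "Town A",
                       "Town B", "Town B", "Town B"), each = 2),
  site         = rep(c("Site 1", "Site 1", "Site 2",
                       "Site 3", "Site 3", "Site 3"), each = 2)
)

structure_plot <- ggplot(long_probs, aes(x = ind, y = prob, fill = cluster)) +
  geom_col(width = 1) +                       # stacked bars, one per individual
  facet_nested(~ municipality + site,         # sites nested in municipalities
               scales = "free_x",             # each facet shows only its own animals
               space = "free",                # facet width follows the sample size
               switch = "x",                  # strip labels below the panels
               nest_line = element_line(colour = "black"),
               strip = strip_nested(clip = "off")) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expansion(add = 0.6)) +
  labs(x = NULL, y = "Membership probability", fill = "Cluster") +
  theme_minimal() +
  theme(panel.spacing = unit(0.2, "lines"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

structure_plot
```

Setting width = 1 removes the gaps between bars within a facet, so the only visible breaks are the ones between sampling locations.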
Individuals are represented by vertical bars, colors correspond to different genetic clusters, and each individual’s color proportion indicates its membership to the corresponding cluster. Individuals are faceted by sampling location and a thicker line groups these locations. Compare this plot with the output of adegenet::compoplot:
The final result looks pretty good and those interested in population genetics can now see the structure and migration. This code will work with any number of clusters (K values) as long as the data are in long format. Try this with your own data and let me know if it works. Special thanks to conservation genomics specialist Lilly D. Parker for answering all my microsatellite questions :)