This post from 2019 describes an approach for making Structure-style plots for model-based clusters of population genetic structure using
ggplot2. The code still runs fine, but a) the post was unrealistic and used made-up data that looks odd given the lack of structure and b) we can improve on the plots using new ggplot extensions. (I also wrote the post before learning to use the
Here I’ll recreate a Discriminant Analysis on Principal Components (DAPC) from this (Open Access) publication by Amélie Desvars-Larrive et al. from 2019. The authors used microsatellites to examine the genetic structure of brown rat populations in Eastern France, and ran DAPC in R using the
adegenet package. This is a really cool paper with a very large sample that also examined resistance to rodenticides.
Figure 1 in the publication shows the study sites, the genetic structure in discriminant space, and the cluster assignment in panel C. We’ll focus on panel C.
Let’s repeat the analysis but then use
ggplot to show the individual membership assignment of the sampled animals to the genetic clusters identified by DAPC. The underlying data were shared in an xlsx file here, which we can work with once we have it in our working directory.
To get going, we need to load a few packages and import the allele data (sources and versions added with
Preparing the data
At this point, there are a few easy steps to prepare the allele data for the
df2genind() function that converts this input to a
genind object. I used
sprintf to pad the repeats so that the
ncode argument works (
ncode is an optional integer giving the number of characters used for coding one genotype at one locus).
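As a minimal sketch of the padding step, the example below uses a made-up allele table (not the data from the publication). The column names, the `pad3` helper, and the three-character width are assumptions for illustration; the point is that every allele ends up with the same number of characters so that `ncode = 3` can split each genotype string cleanly.

```r
library(adegenet)

# Toy allele table (hypothetical): two loci, two allele columns per locus
alleles <- data.frame(
  ind  = c("rat01", "rat02", "rat03"),
  pop  = c("siteA", "siteA", "siteB"),
  L1_a = c(98, 102, 98),
  L1_b = c(102, 102, 110),
  L2_a = c(210, 214, 210),
  L2_b = c(214, 214, 218)
)

# Pad each allele to a fixed width of three characters, e.g. 98 -> "098"
pad3 <- function(x) sprintf("%03d", as.integer(x))

# One column per locus, both alleles pasted together without a separator
genotypes <- data.frame(
  L1 = paste0(pad3(alleles$L1_a), pad3(alleles$L1_b)),
  L2 = paste0(pad3(alleles$L2_a), pad3(alleles$L2_b)),
  row.names = alleles$ind
)

# ncode = 3 tells df2genind that each allele takes up three characters
rats <- df2genind(genotypes, ncode = 3, ploidy = 2, pop = alleles$pop)
rats
```

Without the padding, an allele such as 98 would take up two characters while 102 takes up three, and a fixed `ncode` could not split the genotype strings consistently.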
The converted object now prints this:
With the data in
genind format, we can run
find.clusters and then
dapc with the newly identified clusters. For the Structure-style plot, we need the membership probabilities of each individual for each cluster. We pull these from the dapc result, pivot these to long format, and add labels. Then we’ll be ready for plotting.
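The sequence above can be sketched with the `nancycats` example data that ships with adegenet, used here as a stand-in for the rat data. The values for `n.pca`, `n.clust`, and `n.da` are arbitrary assumptions chosen so the functions run without interactive prompts; they are not the settings from the publication. The reshaping uses `pivot_longer()` from tidyr.

```r
library(adegenet)
library(tidyr)

data(nancycats)  # example genind object bundled with adegenet
set.seed(42)     # find.clusters relies on k-means, which has random starts

# Identify clusters, then run DAPC using those clusters as groups
grp <- find.clusters(nancycats, n.pca = 40, n.clust = 3)
dapc_res <- dapc(nancycats, pop = grp$grp, n.pca = 20, n.da = 2)

# Membership probabilities: one row per individual, one column per cluster
post <- as.data.frame(dapc_res$posterior)
post$ind <- indNames(nancycats)
post$site <- pop(nancycats)

# Long format: one row per individual-cluster combination
long <- pivot_longer(
  post,
  cols = -c(ind, site),
  names_to = "cluster",
  values_to = "prob"
)
head(long)
```

Each individual now contributes one row per cluster, which is the shape a stacked bar plot needs.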
For clarity, this is what the probabilities look like raw:
Then the long-format probabilities with labels for plotting and faceting look like this:
facet_nested from the
ggh4x package lets us implement nested facets to show sampling sites in their respective municipalities. Before calling
ggplot, we need to set up some customization parameters for the nested facets via
strip_nested. This lets us set the font size for each strip layer and turn off clipping for the strip text at locations with few samples (and consequently, narrow facets). I was unaware of
ggh4x, but the package comes with lots of cool utility functions for ggplot and has a great hex logo.
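A strip specification along these lines might look like the sketch below. The font sizes and the toy data frame are assumptions for illustration, not the values used for the figure in this post. With `by_layer_x = TRUE`, the text elements are applied per strip layer, so the first size goes to the outer layer (municipality) and the second to the inner layer (site).

```r
library(ggplot2)
library(ggh4x)

# Per-layer strip text sizes, with clipping turned off so long labels
# can spill over the edges of narrow facets
nested_strips <- strip_nested(
  text_x = elem_list_text(size = c(10, 7)),
  by_layer_x = TRUE,
  clip = "off"
)

# Toy data (hypothetical) just to show where the strip object is passed
toy <- data.frame(
  municipality = c("Town A", "Town A", "Town B"),
  site = c("A1", "A2", "B1"),
  x = 1:3,
  y = 1:3
)

p_strips <- ggplot(toy, aes(x, y)) +
  geom_point() +
  facet_nested(cols = vars(municipality, site), strip = nested_strips)
p_strips
```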
ggplot call. The suitable geom here is
geom_col because each individual's membership probabilities stack into a single bar that adds up to 1. This way we are in control of the spacing of different locations by using facets, the
expand argument for the scales, and the
panel.spacing argument for the overall plot theme. Note how the
space arguments to
facet_nested help us accommodate the different number of individuals per location.
switch places the facet labels below the plot.
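Putting these pieces together, here is a minimal sketch on made-up membership probabilities (12 individuals, three clusters, three sites in two municipalities). All names and values are hypothetical, and the spacing and line width settings are assumptions to be tuned by eye. The `linewidth` argument to `element_line()` assumes ggplot2 3.4.0 or later, and passing a line element to `nest_line` assumes a recent ggh4x release.

```r
library(ggplot2)
library(ggh4x)

set.seed(1)
# Toy individuals with unequal sample sizes per site
inds <- data.frame(
  ind = sprintf("ind%02d", 1:12),
  municipality = rep(c("Town A", "Town B"), times = c(8, 4)),
  site = rep(c("A1", "A2", "B1"), times = c(6, 2, 4))
)

# Random memberships, normalised so each individual sums to 1
raw <- matrix(runif(12 * 3), ncol = 3)
probs <- raw / rowSums(raw)

# Long format: individuals repeated once per cluster
memb <- data.frame(
  inds[rep(1:12, times = 3), ],
  cluster = factor(rep(1:3, each = 12)),
  prob = as.vector(probs)
)

p <- ggplot(memb, aes(x = ind, y = prob, fill = cluster)) +
  geom_col(width = 1) +  # stacked bars with no gaps between individuals
  facet_nested(
    cols = vars(municipality, site),
    scales = "free_x",   # each facet only shows its own individuals
    space = "free_x",    # facet width scales with the number of individuals
    switch = "x",        # strip labels below the panels
    nest_line = element_line(linewidth = 1)
  ) +
  scale_x_discrete(expand = expansion(add = 0.6)) +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = NULL, y = "Membership probability", fill = "Cluster") +
  theme_minimal() +
  theme(
    panel.spacing.x = unit(0.2, "lines"),
    axis.text.x = element_blank(),
    panel.grid = element_blank()
  )
p
```

Because the bars are stacked per individual and the facets have free widths, a site with two individuals gets a narrow panel while a site with six gets a wider one, and every bar has the same width.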
Individuals are represented by vertical bars, colors correspond to different genetic clusters, and each individual’s color proportion indicates its membership to the corresponding cluster. Individuals are faceted by sampling location and a thicker line groups these locations. Compare this plot with the output of
The final result looks pretty good and those interested in population genetics can now see the structure and migration. This code will work with any number of clusters (K values) as long as the data are in long format. Try this with your own data and let me know if it works. Special thanks to conservation genomics specialist Lilly D. Parker for answering all my microsatellite questions :)