Introduction

Now, we will discuss in more detail how to assess population genetic structure from sequence data. Assuming that you have a priori information about how the individuals are grouped in subpopulations, one can do the following analyses: 1) quantify pairwise subpopulation structure and their significance, 2) test for hierarchical structure among groups of subpopulations, and 3) use population clustering algorithms to corroborate the a priori grouping hypothesis. We will go into each of these analyses in this vignette.

Resources

Data

Data used for the following analyses come from one mitochondrial gene (Cytochrome Oxidase I = ‘Ebom_mt.fas’) and one nuclear gene (Calcium/Calmodulin-Dependent Protein Kinase II = ‘Ebom_CAD.fas’) of the orchid bee Eulaema bombiformis.

Packages

library("apex")
library("adegenet")
library("pegas")
library("mmod")
library("poppr")

Importing data

We will first import FASTA files for all genes at once using the apex function read.multiFASTA(). This will allow us to read in multiple FASTA formatted files at once into a “multiDNA” object from the apex package. We can plot it to see what the two genes look like for all of our samples

# Creating DNAbin objects
beeData <- read.multiFASTA(c("Ebom_mt.fas", "Ebom_CAD.fas"))
plot(beeData, cex = 0.2)

Pro tip: You can list all the files ending in “.fas” in your folder by using dir(pattern = "fas", full.names = TRUE)

Now, this “multidna” object (beeData) will be converted into a “genind” object (beeData.gid), which will be used for downstream analyses. One thing we need to be aware of is the fact that we can’t have any periods in our locus names. By default, the locus names are set to the file names, which both contain “.fas” in the title. We can fix this by using the text-replacing function gsub():

getLocusNames(beeData)
## [1] "Ebom_mt.fas"  "Ebom_CAD.fas"
(setLocusNames(beeData) <- gsub(".fas", "", getLocusNames(beeData)))
## [1] "Ebom_mt"  "Ebom_CAD"

Now we can create our “genind” object.

# Creating genind object by multilocus sequence types
beeData.gid <- multidna2genind(beeData, mlst = TRUE)
beeData.gid
## /// GENIND OBJECT /////////
## 
##  // 40 individuals; 2 loci; 36 alleles; size: 50.6 Kb
## 
##  // Basic content
##    @tab:  40 x 36 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 17-19)
##    @loc.fac: locus factor for the 36 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 1-1)
##    @type:  codom
##    @call: df2genind(X = xdfnum, ind.names = x@labels, ploidy = 1)
## 
##  // Optional content
##    - empty -

We also want to set the population strata.

my_strata <- data.frame(regions = rep(c("West", "East"), each = 20), 
                        populations = rep(c("CA", "Ch", "Am", "AF"), each = 10))
strata(beeData.gid) <- my_strata
setPop(beeData.gid) <- ~populations
beeData.gid
## /// GENIND OBJECT /////////
## 
##  // 40 individuals; 2 loci; 36 alleles; size: 53.4 Kb
## 
##  // Basic content
##    @tab:  40 x 36 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 17-19)
##    @loc.fac: locus factor for the 36 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 1-1)
##    @type:  codom
##    @call: df2genind(X = xdfnum, ind.names = x@labels, ploidy = 1)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 10-10)
##    @strata: a data frame with 2 columns ( regions, populations )

Pairwise population differentiation

Overall F-statistics

diff_stats(beeData.gid) # this function calculates overall Nei's Gst, Hedrick's Gst and  of the dataset
## $per.locus
##                 Hs        Ht       Gst Gprime_st         D
## Ebom_mt  0.7421053 0.9117763 0.1860885 0.9058976 0.8772109
## Ebom_CAD 0.7263158 0.9190789 0.2097351 0.9550196 0.9391026
## 
## $global
##        Hs        Ht   Gst_est Gprime_st     D_het    D_mean 
## 0.7342105 0.9154276 0.1979590 0.9315893 0.9090759 0.9071022
Phi_st_Meirmans(beeData.gid) # this function calculates overall PhiST, the Fst analog for DNA sequence data
## $per.locus
##   Ebom_mt  Ebom_CAD 
## 0.8817688 0.9443099 
## 
## $global
## [1] 0.8938182

Pairwise Fst

pairwise_Gst_Nei(beeData.gid, linearized = FALSE) # Calculates pairwise Gst. If linearized = TRUE, it calculates 1/(1- Gst)  
##            CA         Ch         Am
## Ch 0.08227848                      
## Am 0.12513019 0.15853659           
## AF 0.14803625 0.18012422 0.15297092
pairwise_Gst_Hedrick(beeData.gid, linearized = FALSE)# Calculates pairwise Gst. If linearized = TRUE, it calculates 1/(1- Gst')  
##           CA        Ch        Am
## Ch 0.6419753                    
## Am 0.9828211 1.0000000          
## AF 1.0000000 1.0000000 0.9002976
pairwise_D(beeData.gid, linearized = FALSE, hsht_mean = "harmonic") # Calculates pairwise Gst. If linearized = TRUE, it calculates 1/(1- D)  
##           CA        Ch        Am
## Ch 0.5636175                    
## Am 0.9780685 1.0000000          
## AF 1.0007468 1.0012306 0.8632757

Testing for significance

To estimate if populations are significantly different, we will generate 100 replicates of the dataset using the function chao_bootstrap(). Then, summary statistics (mean and 95% CI) will be calculated for each of the different parameters of population differentiation.

bs <- chao_bootstrap(beeData.gid, nreps = 100)
summarise_bootstrap(bs, Gst_Nei)     # for Nei's Gst
## 
## Estimates for each locus
## Locus    Mean     95% CI
## Ebom_mt  0.1861  (0.095-0.277)
## Ebom_CAD 0.2097  (0.078-0.341)
## 
## Global Estimate based on average heterozygosity
## 0.1980   (0.114-0.282)
summarise_bootstrap(bs, Gst_Hedrick) # for Hedrick's Gst
## 
## Estimates for each locus
## Locus    Mean     95% CI
## Ebom_mt  0.9059  (0.844-0.968)
## Ebom_CAD 0.9550  (0.900-1.010)
## 
## Global Estimate based on average heterozygosity
## 0.9316   (0.888-0.975)
summarise_bootstrap(bs, D_Jost)      # for Jost's D
## 
## Estimates for each locus
## Locus    Mean     95% CI
## Ebom_mt  0.8772  (0.794-0.961)
## Ebom_CAD 0.9391  (0.864-1.014)
## 
## Global Estimate based on average heterozygosity
## 0.9091   (0.850-0.968)
## 
## Global Estimate based on harmonic mean of statistic
## 0.9071   (0.849-0.965)

AMOVA (Analysis of Molecular Variance)

Analysis of Molecular Variance (AMOVA) is a method for estimating population differentiation from molecular data taking into account the mutational distance between alleles. Unlike \(F_{st}\), which quantifies genetic differentiation based on allele frequencies, AMOVA treats molecular data as vectors and estimates Euclidean distances between alleles. Furthermore, it is possible to test hypotheses about differentiation by grouping subpopulations in a hierarchical structure (Excoffier et al., 1992).

beeData_dist <- dist.multidna(beeData, pool = TRUE)
amova(beeData_dist ~ populations, data = strata(beeData.gid), nperm = 100)
## 
##  Analysis of Molecular Variance
## 
## Call: amova(formula = beeData_dist ~ populations, data = strata(beeData.gid), 
##     nperm = 100)
## 
##                      SSD          MSD df
## populations 0.0009468508 3.156169e-04  3
## Error       0.0005930812 1.647448e-05 36
## Total       0.0015399320 3.948544e-05 39
## 
## Variance components:
##                 sigma2 P.value
## populations 2.9914e-05       0
## Error       1.6474e-05        
## 
## Variance coefficients:
##  a 
## 10

Conclusions

In this vignette, we learned how to estimate different parameters of overall and pairwise population differentiation (Nei’s Gst, Hedrick’s Gst, Jost D) from DNA sequence data. We also learned how to do a hierarchical analysis of population structure using AMOVA.

What’s next

With these exploratory analyses, we understand how genetic diversity is distributed among populations. One could now move to testing specific hypotheses than could explain the patterns of population structure found with the F-statistics.

Contributors

References

Excoffier L., Smouse PE., Quattro JM. 1992. Analysis of molecular variance inferred from metric distances among dNA haplotypes: Application to human mitochondrial dNA restriction data. Genetics 131:479–491. Available at: http://www.genetics.org/content/131/2/479.abstract

Session Information

This shows us useful information for reproducibility. Of particular importance are the versions of R and the packages used to create this workflow. It is considered good practice to record this information with every analysis.

options(width = 100)
devtools::session_info()
## Session info ---------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.2 (2016-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       <NA>                        
##  date     2017-01-04
## Packages -------------------------------------------------------------------------------------------
##  package    * version date       source        
##  ade4       * 1.7-5   2016-12-13 CRAN (R 3.3.2)
##  adegenet   * 2.0.1   2016-02-15 CRAN (R 3.3.2)
##  ape        * 4.0     2016-12-01 CRAN (R 3.3.2)
##  apex       * 1.0.2   2016-12-05 CRAN (R 3.3.2)
##  assertthat   0.1     2013-12-06 CRAN (R 3.3.2)
##  backports    1.0.4   2016-10-24 CRAN (R 3.3.2)
##  boot         1.3-18  2016-02-23 CRAN (R 3.3.2)
##  cluster      2.0.5   2016-10-08 CRAN (R 3.3.2)
##  coda         0.19-1  2016-12-08 CRAN (R 3.3.2)
##  colorspace   1.3-2   2016-12-14 CRAN (R 3.3.2)
##  DBI          0.5-1   2016-09-10 CRAN (R 3.3.2)
##  deldir       0.1-12  2016-03-06 CRAN (R 3.3.2)
##  devtools     1.12.0  2016-12-05 CRAN (R 3.3.2)
##  digest       0.6.10  2016-08-02 CRAN (R 3.3.2)
##  dplyr        0.5.0   2016-06-24 CRAN (R 3.3.2)
##  evaluate     0.10    2016-10-11 CRAN (R 3.3.2)
##  fastmatch    1.0-4   2012-01-21 CRAN (R 3.3.2)
##  gdata        2.17.0  2015-07-04 CRAN (R 3.3.2)
##  ggplot2      2.2.0   2016-11-11 CRAN (R 3.3.2)
##  gmodels      2.16.2  2015-07-22 CRAN (R 3.3.2)
##  gtable       0.2.0   2016-02-26 CRAN (R 3.3.2)
##  gtools       3.5.0   2015-05-29 CRAN (R 3.3.2)
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.3.2)
##  httpuv       1.3.3   2015-08-04 CRAN (R 3.3.2)
##  igraph       1.0.1   2015-06-26 CRAN (R 3.3.2)
##  knitr        1.15.1  2016-11-22 CRAN (R 3.3.2)
##  lattice      0.20-34 2016-09-06 CRAN (R 3.3.2)
##  lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.2)
##  LearnBayes   2.15    2014-05-29 CRAN (R 3.3.2)
##  magrittr     1.5     2014-11-22 CRAN (R 3.3.2)
##  MASS         7.3-45  2016-04-21 CRAN (R 3.3.2)
##  Matrix       1.2-7.1 2016-09-01 CRAN (R 3.3.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.3.2)
##  mgcv         1.8-16  2016-11-07 CRAN (R 3.3.2)
##  mime         0.5     2016-07-07 CRAN (R 3.3.2)
##  mmod       * 1.3.2   2016-09-16 CRAN (R 3.3.2)
##  munsell      0.4.3   2016-02-13 CRAN (R 3.3.2)
##  nlme         3.1-128 2016-05-10 CRAN (R 3.3.2)
##  pegas      * 0.9     2016-04-16 CRAN (R 3.3.2)
##  permute      0.9-4   2016-09-09 CRAN (R 3.3.2)
##  phangorn   * 2.1.1   2016-12-04 CRAN (R 3.3.2)
##  plyr         1.8.4   2016-06-08 CRAN (R 3.3.2)
##  poppr      * 2.3.0   2016-11-23 CRAN (R 3.3.2)
##  quadprog     1.5-5   2013-04-17 CRAN (R 3.3.2)
##  R6           2.2.0   2016-10-05 CRAN (R 3.3.2)
##  Rcpp         0.12.8  2016-11-17 CRAN (R 3.3.2)
##  reshape2     1.4.2   2016-10-22 CRAN (R 3.3.2)
##  rmarkdown    1.3     2016-12-21 CRAN (R 3.3.2)
##  rprojroot    1.1     2016-10-29 CRAN (R 3.3.2)
##  scales       0.4.1   2016-11-09 CRAN (R 3.3.2)
##  seqinr       3.3-3   2016-10-13 CRAN (R 3.3.2)
##  shiny        0.14.2  2016-11-01 CRAN (R 3.3.2)
##  sp           1.2-4   2016-12-22 CRAN (R 3.3.2)
##  spdep        0.6-8   2016-09-21 CRAN (R 3.3.2)
##  stringi      1.1.2   2016-10-01 CRAN (R 3.3.2)
##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.2)
##  tibble       1.2     2016-08-26 CRAN (R 3.3.2)
##  vegan        2.4-1   2016-09-07 CRAN (R 3.3.2)
##  withr        1.0.2   2016-06-20 CRAN (R 3.3.2)
##  xtable       1.8-2   2016-02-05 CRAN (R 3.3.2)
##  yaml         2.1.14  2016-11-12 CRAN (R 3.3.2)