Introduction

In this vignette, you will calculate basic population genetic statistics from SNP data using R packages. These statistics serve as exploratory analysis and require to work at the population level. We will calculate:

  1. Genetic diversity,
  2. Test Hardy Weinberg
  3. \(F_{is}\) and global \(F_{st}\).

The dataset used for these analyses are for lodgepole pine (Pinus contorta, Pinaceae), a plant species. More information on the dataset and the species is available from A. Eckert’s website. Here we use the dataset with no presupposed idea of interpreting the results in a biological way. To make the calculations faster, we work only with a subset of the full dataset.

Load necessary resources and packages

library("adegenet")
library("hierfstat")
library("pegas")

Workflow for SNP data

Import data

The data are stored in a text file (genotype=AA..). We will import the dataset into R as a data frame, and then convert the SNP data file into a “genind” object.

The dataset “Master_Pinus_data_genotype.txt” can be downloaded here.

The text file is a matrix of (550 rows x 3086 columns). It contains 4 extra columns: first column is the label of the individuals, the three other are description of the region, all the other columns are for the genotypes as (AA or AT…).

When you import the data into R, the data file needs to be in your working directory, or adjust the path in the read.table() invocation below accordingly.

Mydata <- read.table("Master_Pinus_data_genotype.txt", header = TRUE)
dim(Mydata) # Matrix of dimension 550x3086
## [1]  550 3086

To work with the data, we need to convert the R object returned by read.table() to a “genind” object. To achieve this, we create a matrix with only genotypes, and keep only a subset of the first 11 SNP loci (to make calculations faster). The result can then be converted to a “genind” object (for package adegent). The “genind” object can then easily be converted into a “hierfstat” (package hierfstat) object.

locus <- Mydata[, -c(1, 2, 3, 4, 17:3086)]    
colnames(locus) <- gsub("\\.", "_", colnames(locus)) # locus names can't have "."
ind <- as.character(Mydata$tree_id) # labels of the individuals
population <- as.character(Mydata$state) # labels of the populations
Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind, pop = population, sep = "")
Mydata1
## /// GENIND OBJECT /////////
## 
##  // 550 individuals; 12 loci; 24 alleles; size: 97.5 Kb
## 
##  // Basic content
##    @tab:  550 x 24 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-2)
##    @loc.fac: locus factor for the 24 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: df2genind(X = locus, sep = "", ind.names = ind, pop = population, 
##     ploidy = 2)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 4-177)
nAll(Mydata1) # Number of alleles per locus
## X0_10037_01_257 X0_10040_02_394 X0_10044_01_392  X0_10048_01_60 
##               2               2               2               2 
## X0_10051_02_166 X0_10054_01_402 X0_10067_03_111 X0_10079_02_168 
##               2               2               2               2 
## X0_10112_01_169 X0_10113_01_119 X0_10116_01_165  X0_10151_01_86 
##               2               2               2               2
Mydata2 <- genind2hierfstat(Mydata1) # Create hierfstat object

Genetic diversity (observed and expected heterozygosity)

With adegenet:

div <- summary(Mydata1)
div
## 
## // Number of individuals: 550
## // Group sizes: 169 10 24 10 4 9 177 7 58 9 73
## // Number of alleles per locus: 2 2 2 2 2 2 2 2 2 2 2 2
## // Number of alleles per group: 24 19 23 22 18 21 24 20 23 19 24
## // Percentage of missing data: 1.62 %
## // Observed heterozygosity: 0.42 0.49 0.46 0.4 0.12 0.47 0.07 0.08 0.11 0.46 0.04 0.14
## // Expected heterozygosity: 0.41 0.5 0.5 0.45 0.13 0.5 0.07 0.08 0.11 0.44 0.04 0.16
names(div)
## [1] "n"         "n.by.pop"  "loc.n.all" "pop.n.all" "NA.perc"   "Hobs"     
## [7] "Hexp"
plot(div$Hobs, xlab="Loci number", ylab="Observed Heterozygosity", 
     main="Observed heterozygosity per locus")

plot(div$Hobs,div$Hexp, xlab="Hobs", ylab="Hexp", 
     main="Expected heterozygosity as a function of observed heterozygosity per locus")

bartlett.test(list(div$Hexp, div$Hobs)) # a test : H0: Hexp = Hobs
## 
##  Bartlett test of homogeneity of variances
## 
## data:  list(div$Hexp, div$Hobs)
## Bartlett's K-squared = 0.011457, df = 1, p-value = 0.9148

We observed that heterozygosity varies among loci. We observed no difference between expected and observed heterozygosity.

Basic statistics with hierfstat. Populations are states. The function basic.stats() provides the observed heterozygosity (\(H_o\)), mean gene diversities within population (\(H_s\)), \(F_{is}\), and \(F_{st}\). The function boot.ppfis() provides confidence interval for \(F_{is}\). The function indpca() does an PCA on the centered matrix of individuals’ allele frequencies.

basicstat <- basic.stats(Mydata2, diploid = TRUE, digits = 2) 
names(basicstat)   
## [1] "n.ind.samp" "pop.freq"   "Ho"         "Hs"         "Fis"       
## [6] "perloc"     "overall"
boot.ppfis(Mydata2) 
## $call
## boot.ppfis(dat = Mydata2)
## 
## $fis.ci
##         ll      hl
## 1  -0.0434  0.0898
## 2  -0.2655  0.1689
## 3  -0.2090 -0.0765
## 4  -0.0362  0.2955
## 5  -0.4398 -0.1249
## 6  -0.1759  0.2867
## 7   0.0152  0.1082
## 8  -0.4712  0.1966
## 9  -0.0521  0.0736
## 10  0.0228  0.3708
## 11 -0.0589  0.0409
x <- indpca(Mydata2) 
plot(x, cex = 0.7)

Testing for Hardy-Weinberg Equilibrium

We used the function hw.test() from the pegas package.

hw.test(Mydata1, B = 1000)
##                      chi^2 df Pr(chi^2 >) Pr.exact
## X0_10037_01_257 0.75465490  1 0.385006451    0.448
## X0_10040_02_394 0.16628227  1 0.683437231    0.738
## X0_10044_01_392 2.96053195  1 0.085319867    0.096
## X0_10048_01_60  6.39076800  1 0.011471539    0.013
## X0_10051_02_166 1.12035389  1 0.289842265    0.293
## X0_10054_01_402 1.54582494  1 0.213752853    0.235
## X0_10067_03_111 0.04868128  1 0.825374035    0.550
## X0_10079_02_168 0.01579313  1 0.899992594    0.612
## X0_10112_01_169 0.41126025  1 0.521330544    1.000
## X0_10113_01_119 0.74248768  1 0.388865151    0.449
## X0_10116_01_165 1.87098306  1 0.171362531    0.254
## X0_10151_01_86  8.94232481  1 0.002786379    0.007

We get for each locus a test of significance of the null hypothesis: \(H_0\) the locus is in HW equilibrium in the population. All but one locus are in HW equilibrium.

Conclusion

What did we learn today?

In this vignette, we learned how to explore the patterns of genetic diversity and how to estimate the level of genetic differentiation in one population. Also, you have an idea of potential violations of the dataset to the null Wright-Fischer model.

What is next?

You may now want to move into looking into population differentiation in more detail (See Calculating genetic differentiation and clustering methods from SNP data)

References

Eckert, A. J., A. D. Bower, S. C. González-Martínez, J. L. Wegrzyn, G. Coop and D. B. Neale. 2010. Back to nature: Ecological genomics of loblolly pine (Pinus taeda, Pinaceae). Molecular Ecology 19: 3789-3805. doi:[10.1111/j.1365-294X.2010.04698.x](http://dx.doi.org/10.1111/j.1365-294X.2010.04698.x)

Contributors

Session Information

This shows us useful information for reproducibility. Of particular importance are the versions of R and the packages used to create this workflow. It is considered good practice to record this information with every analysis.

options(width = 100)
devtools::session_info()
## Session info ---------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.2 (2016-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       <NA>                        
##  date     2017-01-04
## Packages -------------------------------------------------------------------------------------------
##  package    * version date       source        
##  ade4       * 1.7-5   2016-12-13 CRAN (R 3.3.2)
##  adegenet   * 2.0.1   2016-02-15 CRAN (R 3.3.2)
##  ape        * 4.0     2016-12-01 CRAN (R 3.3.2)
##  assertthat   0.1     2013-12-06 CRAN (R 3.3.2)
##  backports    1.0.4   2016-10-24 CRAN (R 3.3.2)
##  boot         1.3-18  2016-02-23 CRAN (R 3.3.2)
##  cluster      2.0.5   2016-10-08 CRAN (R 3.3.2)
##  coda         0.19-1  2016-12-08 CRAN (R 3.3.2)
##  colorspace   1.3-2   2016-12-14 CRAN (R 3.3.2)
##  DBI          0.5-1   2016-09-10 CRAN (R 3.3.2)
##  deldir       0.1-12  2016-03-06 CRAN (R 3.3.2)
##  devtools     1.12.0  2016-12-05 CRAN (R 3.3.2)
##  digest       0.6.10  2016-08-02 CRAN (R 3.3.2)
##  dplyr        0.5.0   2016-06-24 CRAN (R 3.3.2)
##  evaluate     0.10    2016-10-11 CRAN (R 3.3.2)
##  gdata        2.17.0  2015-07-04 CRAN (R 3.3.2)
##  ggplot2      2.2.0   2016-11-11 CRAN (R 3.3.2)
##  gmodels      2.16.2  2015-07-22 CRAN (R 3.3.2)
##  gtable       0.2.0   2016-02-26 CRAN (R 3.3.2)
##  gtools       3.5.0   2015-05-29 CRAN (R 3.3.2)
##  hierfstat  * 0.04-22 2015-12-04 CRAN (R 3.3.2)
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.3.2)
##  httpuv       1.3.3   2015-08-04 CRAN (R 3.3.2)
##  igraph       1.0.1   2015-06-26 CRAN (R 3.3.2)
##  knitr        1.15.1  2016-11-22 CRAN (R 3.3.2)
##  lattice      0.20-34 2016-09-06 CRAN (R 3.3.2)
##  lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.2)
##  LearnBayes   2.15    2014-05-29 CRAN (R 3.3.2)
##  magrittr     1.5     2014-11-22 CRAN (R 3.3.2)
##  MASS         7.3-45  2016-04-21 CRAN (R 3.3.2)
##  Matrix       1.2-7.1 2016-09-01 CRAN (R 3.3.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.3.2)
##  mgcv         1.8-16  2016-11-07 CRAN (R 3.3.2)
##  mime         0.5     2016-07-07 CRAN (R 3.3.2)
##  munsell      0.4.3   2016-02-13 CRAN (R 3.3.2)
##  nlme         3.1-128 2016-05-10 CRAN (R 3.3.2)
##  pegas      * 0.9     2016-04-16 CRAN (R 3.3.2)
##  permute      0.9-4   2016-09-09 CRAN (R 3.3.2)
##  plyr         1.8.4   2016-06-08 CRAN (R 3.3.2)
##  R6           2.2.0   2016-10-05 CRAN (R 3.3.2)
##  Rcpp         0.12.8  2016-11-17 CRAN (R 3.3.2)
##  reshape2     1.4.2   2016-10-22 CRAN (R 3.3.2)
##  rmarkdown    1.3     2016-12-21 CRAN (R 3.3.2)
##  rprojroot    1.1     2016-10-29 CRAN (R 3.3.2)
##  scales       0.4.1   2016-11-09 CRAN (R 3.3.2)
##  seqinr       3.3-3   2016-10-13 CRAN (R 3.3.2)
##  shiny        0.14.2  2016-11-01 CRAN (R 3.3.2)
##  sp           1.2-4   2016-12-22 CRAN (R 3.3.2)
##  spdep        0.6-8   2016-09-21 CRAN (R 3.3.2)
##  stringi      1.1.2   2016-10-01 CRAN (R 3.3.2)
##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.2)
##  tibble       1.2     2016-08-26 CRAN (R 3.3.2)
##  vegan        2.4-1   2016-09-07 CRAN (R 3.3.2)
##  withr        1.0.2   2016-06-20 CRAN (R 3.3.2)
##  xtable       1.8-2   2016-02-05 CRAN (R 3.3.2)
##  yaml         2.1.14  2016-11-12 CRAN (R 3.3.2)