Introduction

The purpose of the vignette is to detect outlier SNP loci from a large number of SNPs genotyped for a large number of individuals. Individuals are organised in well defined populations. Then, we tested if allele counts in outlier loci are correlated to the temperature. We will first use a method based on principal component analysis (PCAdapt) (section 2) and a second method based on atypical values of Fst (OutFLANK; Whitlok and Lotterhos (2015); https://github.com/whitlock/OutFLANK) (section 3) to detect outliers loci. Then, we will test if the allele counts in outliers are correlated to the temperature using a Poisson regression (section 4).

Assumptions

Data

We will use simulated data provided by Katie Lotterhos (Lotterhos et al. in prep). The initial data set is constituted of 509 individuals, 19 populations and 3084 SNP. It has 2 extra-columns: Population and Temperature measures. SNP are in columns: each row contains one value at each SNP locus (separated by spaces or tabulations) corresponding to the number of reference alleles: 0 means zero copies of the reference allele, 1 means one copy of the reference allele (i.e. heterozygote), and 2 means two copies of the reference allele.

Resources/Packages required

Since OutFLANK is not yet on CRAN, you will need to install it from GitHub using the devtools R package. Additionally, OutFLANK depends on qvalue from Bioconductor, which must be installed beforehand.

install.packages("devtools")
library("devtools")
source("https://bioconductor.org/biocLite.R")
biocLite("qvalue")
install_github("whitlock/OutFLANK")

Loading the required packages:

library("pcadapt")
library("qvalue")
library("OutFLANK")
library("ggplot2")

Section 1: Load the data

sel <- read.table("data/SNPselection1.txt", head = TRUE)
dim(sel)
## [1]  509 3086

Section 2: PCAdapt

To run the function pcadapt(), the user should specify the number K of principal components (PC) to work with: first perform with a large number of principal components (e.g. higher than the number of populations), then use the ‘scree plot’ to chose the value of K. It displays the percentage of variance that is explained by each PC. The recommended value of K corresponds to the largest value of K before the plateau of ‘scree plot’ is reached. Then for a given SNP, a statistical test to define the SNP as outlier or not is based on the “loadings” that are defined as the correlation between the SNP and the PCs. The statistic test for detecting outlier SNPs is the Mahalanobis distance (between the K correlations of the SNP and each axis and mean correlations) and, which scaled by a constant, should have a chi-square distribution with K degrees of freedom under the assumption that there are no outlier. By default P-values of SNPs with a minor allele frequency smaller than 0.05 are not computed. A Manhattan plot displays log10 of the p-values. It is also possible to check the distribution of the p-values using a Q-Q plot. The authors suggest to use false discovery rate (q-value) to provide a list of outliers.

Note: PCAdapt expects the incoming matrix to have samples in columns and loci in rows. Because our data is the opposite, we need to transpose our data matrix with the t() function.

genotype <- sel[, 3:ncol(sel)]
dim(genotype)
## [1]  509 3084
# As of version 3.0.3, PCAdapt does not allow matrix input, so the input must be
# stored in a separate file. We will store this in a temporary file that will be
# automatically cleaned up after use.
genotype_file <- tempfile()
write.table(x = t(genotype), # transposing the genotype matrix with t()
            file = genotype_file, 
            sep = " ", 
            col.names = FALSE, 
            row.names = FALSE)
K <- 25
x <- pcadapt(genotype_file, K = K)
## Reading file /tmp/Rtmpr5E150/file149287ba7...
## Number of SNPs: 3084
## Number of individuals: 509
## Number of SNPs with minor allele frequency lower than 0.05 ignored: 0
plot(x, option = "screeplot") # 19 groups seems to be the correct value

plot(x, option = "scores", pop = sel[, 1]) # how populations are shared among the 19 groups

K <- 19
x <- pcadapt(genotype_file, K = K, min.maf = 0)
## Reading file /tmp/Rtmpr5E150/file149287ba7...
## Number of SNPs: 3084
## Number of individuals: 509
## Number of SNPs with minor allele frequency lower than 0 ignored: 0
summary(x) # numerical quantities obtained after performing a PCA
##                 Length Class  Mode   
## maf              3084  -none- numeric
## loadings        58596  -none- numeric
## singular.values    19  -none- numeric
## scores           9671  -none- numeric
## zscores         58596  -none- numeric
## stat             3084  -none- numeric
## gif                 1  -none- numeric
## chi2.stat        3084  -none- numeric
## pvalues          3084  -none- numeric
plot(x, option = "manhattan")

plot(x, option = "qqplot", threshold = 0.1)

plot(x, option = "stat.distribution") # Distribution of Mahalanobis distances.

qval <- qvalue(x$pvalues)$qvalues
alpha <- 0.1
outliers_pcadapt <- which(qval < alpha)
print(outliers_pcadapt)
##  [1]  663  706  796  923 1155 1163 1347 1378 1639 1666 1825 2133 2556 2871
length(outliers_pcadapt) # 14 outliers
## [1] 14
alpha <- 0.05 # use of a more stringent threshold to detect outliers
outliers <- which(qval < alpha)
print(outliers)
##  [1]  663  706  796  923 1155 1163 1347 1378 1639 1666 1825 2133 2556 2871
length(outliers) # 14 outliers
## [1] 14

section 3: OutFLANK (https://github.com/whitlock/OutFLANK)

A procedure to find Fst outliers based on an inferred distribution of neutral Fst: it uses likelihood on a trimmed distribution of Fst values to infer the distribution of Fst for neutral markers. This distribution is used to assign q-values to each locus to detect outliers loci potentially due to spatially heterogeneous selection. In practice, first the function MakeDiploidFSTMat() allows to calculate the appropriate input data frame among other parameters, Fst for each locus, and Fst without sampling size correction (FSTNoCorr). Then the function OutFLANK() estimate q-values and provides a list of outliers.

ind <- paste("pop", sel[, 1]) # vector with the name of population

locinames <- as.character(seq(ncol(genotype))) # vector with the name of loci

FstDataFrame <- MakeDiploidFSTMat(genotype, locinames, ind)
## Calculating FSTs, may take a few minutes...
plot(FstDataFrame$FST, FstDataFrame$FSTNoCorr, xlim = c(-0.01,0.3), 
     ylim = c(-0.01, 0.3), pch = 20)
abline(0, 1) # Checking the effect of sample size on Fst since FSTCoCorr will be used in the follow

hist(FstDataFrame$FSTNoCorr) 

OF <- OutFLANK(FstDataFrame, NumberOfSamples=19, qthreshold = 0.05, 
               RightTrimFraction = 0.05)

# Plot the ditribution of Fst with the chi squared distribution
OutFLANKResultsPlotter(OF, withOutliers = TRUE, NoCorr = TRUE, Hmin = 0.1, 
                       binwidth = 0.005, Zoom = FALSE, RightZoomFraction = 0.05, 
                       titletext = NULL)

outliers_OF <- OF$results$LocusName[OF$results$OutlierFlag == TRUE]
print(outliers_OF)
##  [1] 663  706  923  1163 1378 1639 1666 1825 2133 2556 2871
## 3084 Levels: 1 10 100 1000 1001 1002 1003 1004 1005 1006 1007 1008 ... 999
length(outliers_OF) # 11 outliers
## [1] 11

Section 4: Logistic regression : linking outliers and temperature

We aim to test if the outliers detected with PCAdapt and OutFLANK (663 706 923 1163 1378 1639 1666 1825 2133 2556 2871) are correlated to the temperature. We use a GLM with a binomial error to model the probability to have the allele “A” out of the two alleles. We derive the relation for some of the shared outliers. You can do the test for all.

# We keep the outlier loci selected by both pcadapt and outflank
outliers <- outliers_pcadapt[outliers_pcadapt %in% outliers_OF]
length(outliers) # 11 outliers
## [1] 11
loc1 <- genotype[, outliers[1]]
temp <- sel$Temperature
loc1temp <- data.frame(loc1, temp)
 
mod <- glm(cbind(loc1, 2 - loc1) ~ temp, family = binomial) 
summary(mod) # This locus is significantly correlated to temperature
## 
## Call:
## glm(formula = cbind(loc1, 2 - loc1) ~ temp, family = binomial)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.43957  -1.10354   0.09418   1.03271   2.62515  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.02085    0.06770  -0.308    0.758    
## temp         0.27603    0.02603  10.606   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 869.59  on 508  degrees of freedom
## Residual deviance: 738.61  on 507  degrees of freedom
## AIC: 1011.6
## 
## Number of Fisher Scoring iterations: 3
loc2 <- genotype[, outliers[2]]
loc2temp <- data.frame(loc2, temp)
ggplot(loc2temp, aes(x = factor(loc2), y = temp)) + 
 geom_boxplot() + 
 xlab("Major allele count") +
 ylab("Temperature (Centigrade)")

mod <- glm(cbind(loc2, 2 - loc2) ~ temp, family = binomial) 
summary(mod) # This locus is significantly correlated to temperature
## 
## Call:
## glm(formula = cbind(loc2, 2 - loc2) ~ temp, family = binomial)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.19071  -1.25625   0.06401   1.21800   2.20092  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.02568    0.06552  -0.392    0.695    
## temp         0.19104    0.02411   7.923 2.32e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 837.64  on 508  degrees of freedom
## Residual deviance: 769.88  on 507  degrees of freedom
## AIC: 1059.5
## 
## Number of Fisher Scoring iterations: 3
# Test with a locus that is only selected by one method
loc7 <- genotype[, outliers_pcadapt[!outliers_pcadapt %in% outliers_OF][1]]
loc7temp <- data.frame(loc7, temp)
  
ggplot(loc2temp, aes(x = factor(loc7), y = temp)) + 
 geom_boxplot() + 
 xlab("Major allele count") +
 ylab("Temperature (Centigrade)")

mod <- glm(cbind(loc7, 2 - loc7) ~ temp, family = binomial)
summary(mod) ## This locus is not significantly correlated to temperature
## 
## Call:
## glm(formula = cbind(loc7, 2 - loc7) ~ temp, family = binomial)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.84798  -1.59313  -0.06144   1.53467   1.73831  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.06980    0.06346   1.100    0.271  
## temp        -0.04202    0.02243  -1.874    0.061 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 802.15  on 508  degrees of freedom
## Residual deviance: 798.62  on 507  degrees of freedom
## AIC: 1106.2
## 
## Number of Fisher Scoring iterations: 3

Conclusions

What’s next

Information on further analysis that could be done as LFFM (package LEA..).

Contributors

References

Duforet-Frebourg N, K Luu, G Laval, E Bazin, MGB Blum. (2016) Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 Genomes data. Molecular Biology and Evolution (http://arxiv.org/abs/1504.04543)

Whitlock MC, Lotterhos KE (2015) Reliable Detection of Loci Responsible for Local Adaptation: Inference of a Null Model through Trimming the Distribution of Fst. The American Naturalist 186, S24-S36. http://www.jstor.org/stable/10.1086/682949

Session Information

This shows us useful information for reproducibility. Of particular importance are the versions of R and the packages used to create this workflow. It is considered good practice to record this information with every analysis.

options(width = 100)
devtools::session_info()
## Session info ---------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.2 (2016-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       <NA>                        
##  date     2017-01-04
## Packages -------------------------------------------------------------------------------------------
##  package     * version date       source                            
##  ape           4.0     2016-12-01 CRAN (R 3.3.2)                    
##  assertthat    0.1     2013-12-06 CRAN (R 3.3.2)                    
##  backports     1.0.4   2016-10-24 CRAN (R 3.3.2)                    
##  cluster       2.0.5   2016-10-08 CRAN (R 3.3.2)                    
##  colorspace    1.3-2   2016-12-14 CRAN (R 3.3.2)                    
##  DEoptimR      1.0-8   2016-11-19 CRAN (R 3.3.2)                    
##  devtools      1.12.0  2016-12-05 CRAN (R 3.3.2)                    
##  digest        0.6.10  2016-08-02 CRAN (R 3.3.2)                    
##  evaluate      0.10    2016-10-11 CRAN (R 3.3.2)                    
##  fit.models  * 0.5-10  2013-02-23 CRAN (R 3.3.2)                    
##  ggplot2     * 2.2.0   2016-11-11 CRAN (R 3.3.2)                    
##  gtable        0.2.0   2016-02-26 CRAN (R 3.3.2)                    
##  htmltools     0.3.5   2016-03-21 CRAN (R 3.3.2)                    
##  knitr         1.15.1  2016-11-22 CRAN (R 3.3.2)                    
##  labeling      0.3     2014-08-23 CRAN (R 3.3.2)                    
##  lattice     * 0.20-34 2016-09-06 CRAN (R 3.3.2)                    
##  lazyeval      0.2.0   2016-06-12 CRAN (R 3.3.2)                    
##  magrittr      1.5     2014-11-22 CRAN (R 3.3.2)                    
##  MASS        * 7.3-45  2016-04-21 CRAN (R 3.3.2)                    
##  Matrix        1.2-7.1 2016-09-01 CRAN (R 3.3.2)                    
##  memoise       1.0.0   2016-01-29 CRAN (R 3.3.2)                    
##  mgcv          1.8-16  2016-11-07 CRAN (R 3.3.2)                    
##  munsell       0.4.3   2016-02-13 CRAN (R 3.3.2)                    
##  mvtnorm       1.0-5   2016-02-02 CRAN (R 3.3.2)                    
##  nlme          3.1-128 2016-05-10 CRAN (R 3.3.2)                    
##  OutFLANK    * 0.1     2017-01-04 Github (whitlock/OutFLANK@1379c26)
##  pcadapt     * 3.0.4   2017-01-02 CRAN (R 3.3.2)                    
##  pcaPP         1.9-61  2016-10-11 CRAN (R 3.3.2)                    
##  permute       0.9-4   2016-09-09 CRAN (R 3.3.2)                    
##  pinfsc50      1.1.0   2016-12-02 CRAN (R 3.3.2)                    
##  plyr          1.8.4   2016-06-08 CRAN (R 3.3.2)                    
##  qvalue      * 2.6.0   2017-01-04 Bioconductor                      
##  Rcpp          0.12.8  2016-11-17 CRAN (R 3.3.2)                    
##  reshape2      1.4.2   2016-10-22 CRAN (R 3.3.2)                    
##  rmarkdown     1.3     2016-12-21 CRAN (R 3.3.2)                    
##  robust      * 0.4-16  2014-05-18 CRAN (R 3.3.2)                    
##  robustbase  * 0.92-7  2016-12-09 CRAN (R 3.3.2)                    
##  rprojroot     1.1     2016-10-29 CRAN (R 3.3.2)                    
##  rrcov       * 1.4-3   2016-09-06 CRAN (R 3.3.2)                    
##  scales        0.4.1   2016-11-09 CRAN (R 3.3.2)                    
##  stringi       1.1.2   2016-10-01 CRAN (R 3.3.2)                    
##  stringr       1.1.0   2016-08-19 CRAN (R 3.3.2)                    
##  tibble        1.2     2016-08-26 CRAN (R 3.3.2)                    
##  vcfR        * 1.3.0   2016-12-08 CRAN (R 3.3.2)                    
##  vegan         2.4-1   2016-09-07 CRAN (R 3.3.2)                    
##  viridisLite   0.1.3   2016-03-12 CRAN (R 3.3.2)                    
##  withr         1.0.2   2016-06-20 CRAN (R 3.3.2)                    
##  yaml          2.1.14  2016-11-12 CRAN (R 3.3.2)