Introduction

In this vignette, we will estimate individual genetic distances from SNP data. It is useful when you have individual genotype data and you don’t know the populations.

Assumptions

We will use non-evolutionary genetic distances, i.e. not based on Hardy-Weinberg assumptions: free from assumptions.

Data

The dataset used for those analysis concerns the plant: lodgepole pine (Pinus contorta, Pinaceae). You can have more information on this data set and the species on the web site of A. Eckert: (http://eckertdata.blogspot.fr/). But here the dataset is used as a test dataset with no idea of interpreting the results in a biological way. We will work on a subset of the dataset to make the calculations faster.

The data are stored in a text file (genotype = AA). We will import the dataset in R as a data frame, and then convert the SNP data file into genind objects.

The dataset “Master_Pinus_data_genotype.txt” can be downloaded here.

The text file is a matrix of (550 rows x 3086 columns). It contains 4 extra columns: first column is the label of the individuals, the three other are description of the region, all the other columns are for the genotypes as (AA or AT…).

When you import the data into R, the data file needs to be in your working directory, or adjust the path in the read.table() invocation below accordingly.

Mydata <- read.table("Master_Pinus_data_genotype.txt", header = TRUE, check.names = FALSE)   
dim(Mydata) 
## [1]  550 3086
ind <- as.character(Mydata$tree_id) # individual labels 
population <- as.character(Mydata$state) # population labels
county <- Mydata$county 
dim(Mydata) # 550 individuals x 3082 SNPs
## [1]  550 3086

Resources/Packages required

Loading the required packages:

library("poppr")
library("pegas")
library("ape")
library("adegenet")
library("ade4")

Analysis

Section 1: Convert the data

To work with the data, we need to convert the R object returned by read.table() to a genind object. To achieve this, we create a matrix with only genotypes, and keep only a subset of the first 100 SNP loci (to make calculations faster). The result can then be converted to a genind object (for the package adegenet). The genind object can then easily be converted into loci objects (package pegas) (i.e. Mydata2)

locus <- Mydata[, -c(1, 2, 3, 4, 105:ncol(Mydata))] 
Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind, pop = population, sep="")
Mydata1
## /// GENIND OBJECT /////////
## 
##  // 550 individuals; 100 loci; 200 alleles; size: 519.8 Kb
## 
##  // Basic content
##    @tab:  550 x 200 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-2)
##    @loc.fac: locus factor for the 200 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: df2genind(X = locus, sep = "", ind.names = ind, pop = population, 
##     ploidy = 2)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 4-177)
Mydata2 <-genind2loci(Mydata1)

Section 2. Individual genetic distance: euclidean distance (dist {adegenet})

The unit of the observation is the individuals.

  • 550 genotypes
  • 100 binary SNPs
  • Ploidy : 2

The analysis is applied on allele frequency within individuals as represented in the genind object. We can use the function dist() from adegenet which provides different options. We will use the euclidean distance among vector of allele frequencies.

distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
hist(distgenEUCL)

Section 3. Individual genetic distance: number of loci for which individuals differ (dist.gene {ape})

The option pairwise.deletion = FALSE in the command dist.gene() removes all loci with one missing values : you an see on the histogram that we get a maximum distance of 3 loci out of 100.

We can see that we get 98 loci with at least one sample missing. Then using the option pairwise.deletion = TRUE in the command dist.gene() allows you to keep loci with one missing value.

distgenDIFF <- dist.gene(Mydata2, method="pairwise", pairwise.deletion = FALSE, variance = FALSE)
hist(distgenDIFF)

# Get percent missing data per population
missing_data <- info_table(Mydata1, type = "missing")
sum(missing_data["Total", 1:100] > 0)
## [1] 98
barplot(missing_data["Total", 1:100], xlab = "Locus", ylab = "Percent Missing")

distgenDIFF <- dist.gene(Mydata2, method="pairwise", pairwise.deletion = TRUE, variance = FALSE)
hist(distgenDIFF)

Section 4: number of allelic differences between two individuals (diss.dist {poppr})

distgenDISS <- diss.dist(Mydata1, percent = FALSE, mat = FALSE)
hist(distgenDISS)

Section 5: Conclusions drawn from the analysis

boxplot(distgenEUCL, distgenDIFF, distgenDISS)

The number of allelic differences between two individuals is a different measure from euclidean distance or number of locus differences between two individuals.

Conclusion

What did we learn today?

In this vignette, we explore different measures of individual genetic distances. It is important to investigate the different options of each command. Missing data can be handled in different ways.

What is next?

These individuals genetic distances can then be used in analyses such as Mantel tests to test for isolation by distance, more complex analyses in landscape genetics to test for resistance by distance, cluster analysis, or spatial networks.

Contributors

Session Information

This shows us useful information for reproducibility. Of particular importance are the versions of R and the packages used to create this workflow. It is considered good practice to record this information with every analysis.

options(width = 100)
devtools::session_info()
## Session info ---------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.2 (2016-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       <NA>                        
##  date     2017-01-04
## Packages -------------------------------------------------------------------------------------------
##  package    * version date       source        
##  ade4       * 1.7-5   2016-12-13 CRAN (R 3.3.2)
##  adegenet   * 2.0.1   2016-02-15 CRAN (R 3.3.2)
##  ape        * 4.0     2016-12-01 CRAN (R 3.3.2)
##  assertthat   0.1     2013-12-06 CRAN (R 3.3.2)
##  backports    1.0.4   2016-10-24 CRAN (R 3.3.2)
##  boot         1.3-18  2016-02-23 CRAN (R 3.3.2)
##  cluster      2.0.5   2016-10-08 CRAN (R 3.3.2)
##  coda         0.19-1  2016-12-08 CRAN (R 3.3.2)
##  colorspace   1.3-2   2016-12-14 CRAN (R 3.3.2)
##  DBI          0.5-1   2016-09-10 CRAN (R 3.3.2)
##  deldir       0.1-12  2016-03-06 CRAN (R 3.3.2)
##  devtools     1.12.0  2016-12-05 CRAN (R 3.3.2)
##  digest       0.6.10  2016-08-02 CRAN (R 3.3.2)
##  dplyr        0.5.0   2016-06-24 CRAN (R 3.3.2)
##  evaluate     0.10    2016-10-11 CRAN (R 3.3.2)
##  fastmatch    1.0-4   2012-01-21 CRAN (R 3.3.2)
##  gdata        2.17.0  2015-07-04 CRAN (R 3.3.2)
##  ggplot2      2.2.0   2016-11-11 CRAN (R 3.3.2)
##  gmodels      2.16.2  2015-07-22 CRAN (R 3.3.2)
##  gtable       0.2.0   2016-02-26 CRAN (R 3.3.2)
##  gtools       3.5.0   2015-05-29 CRAN (R 3.3.2)
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.3.2)
##  httpuv       1.3.3   2015-08-04 CRAN (R 3.3.2)
##  igraph       1.0.1   2015-06-26 CRAN (R 3.3.2)
##  knitr        1.15.1  2016-11-22 CRAN (R 3.3.2)
##  lattice      0.20-34 2016-09-06 CRAN (R 3.3.2)
##  lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.2)
##  LearnBayes   2.15    2014-05-29 CRAN (R 3.3.2)
##  magrittr     1.5     2014-11-22 CRAN (R 3.3.2)
##  MASS         7.3-45  2016-04-21 CRAN (R 3.3.2)
##  Matrix       1.2-7.1 2016-09-01 CRAN (R 3.3.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.3.2)
##  mgcv         1.8-16  2016-11-07 CRAN (R 3.3.2)
##  mime         0.5     2016-07-07 CRAN (R 3.3.2)
##  munsell      0.4.3   2016-02-13 CRAN (R 3.3.2)
##  nlme         3.1-128 2016-05-10 CRAN (R 3.3.2)
##  pegas      * 0.9     2016-04-16 CRAN (R 3.3.2)
##  permute      0.9-4   2016-09-09 CRAN (R 3.3.2)
##  phangorn     2.1.1   2016-12-04 CRAN (R 3.3.2)
##  plyr         1.8.4   2016-06-08 CRAN (R 3.3.2)
##  poppr      * 2.3.0   2016-11-23 CRAN (R 3.3.2)
##  quadprog     1.5-5   2013-04-17 CRAN (R 3.3.2)
##  R6           2.2.0   2016-10-05 CRAN (R 3.3.2)
##  Rcpp         0.12.8  2016-11-17 CRAN (R 3.3.2)
##  reshape2     1.4.2   2016-10-22 CRAN (R 3.3.2)
##  rmarkdown    1.3     2016-12-21 CRAN (R 3.3.2)
##  rprojroot    1.1     2016-10-29 CRAN (R 3.3.2)
##  scales       0.4.1   2016-11-09 CRAN (R 3.3.2)
##  seqinr       3.3-3   2016-10-13 CRAN (R 3.3.2)
##  shiny        0.14.2  2016-11-01 CRAN (R 3.3.2)
##  sp           1.2-4   2016-12-22 CRAN (R 3.3.2)
##  spdep        0.6-8   2016-09-21 CRAN (R 3.3.2)
##  stringi      1.1.2   2016-10-01 CRAN (R 3.3.2)
##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.2)
##  tibble       1.2     2016-08-26 CRAN (R 3.3.2)
##  vegan        2.4-1   2016-09-07 CRAN (R 3.3.2)
##  withr        1.0.2   2016-06-20 CRAN (R 3.3.2)
##  xtable       1.8-2   2016-02-05 CRAN (R 3.3.2)
##  yaml         2.1.14  2016-11-12 CRAN (R 3.3.2)