The following table describes specialized objects to store data represented in population genetics packages. Conversion between all types is possible.
Anyone developing a package for population genetic analysis is encouraged to use or build upon these data structures. If a new data structure is needed, please provide a conversion method to one or more of the classes listed below.
Class {type} (package) | Strengths | Weaknesses |
---|---|---|
DNAbin {S3} (ape) | stores all sets of sequences (aligned or not) | less compact than 2-bit coding (but by a factor 4 at most) |
uses matrices (aligned) or lists so usual R’s commands (names , rownames , [ , [[ , $ ) can be used |
||
many as.DNAbin methods in ape (inc. from BioConductor) |
||
efficient functions in ape (dist.dna , seg.sites , base.freq , read.FASTA ) and in pegas (haplotype ) |
||
loci {S3} (pegas) | low memory usage | not really appropriate for some analyses (e.g., multivariate analyses) |
all levels of ploidy and any number of alleles | needs to improve the treatment of NA’s (especially when data are read with read.vcf() | |
genotypes can be phased | ||
any kind of individual data can be associated in the data frame | ||
efficient to compute genotype and allele frequencies | ||
genind {S4} (adegenet) | stores allelic counts; ideal for multivariate analyses | requires more memory |
additional slots for individual data | less efficient to compute frequencies | |
additional slot for population strata | ||
all levels of ploidy | ||
genpop {S4} (adegenet) | equivalent to genind at group level; ideal for multivariate analysis | requires more memory |
genlight {S4} (adegenet) | stores binary SNPs using bit-level coding; very memory efficient | more computationally intensive to handle; less functionalities |
additional slots for individual data and population strata | ||
all levels of ploidy | assumes bi-allelic loci | |
genclone {S4} (poppr) | inherits genind object; gains all advantages | all the same weaknesses plus slightly more memory |
stores multilocus genotype/lineage definitions (@mlg slot) for clonal populations |
||
snpclone {S4} (poppr) | inherits genlight object; gains all advantages | all the same weaknesses plus slightly more memory |
stores multilocus genotype/lineage definitions (@mlg slot) for clonal populations |
||
genambig {S4} (polysat) | stores microsatellite data with ambiguous ploidy | does not handle any other data type |
exports to genpop objects | cannot easily be transferred to any other object | |
phyDat {S3} (phangorn) | very general inspired by R data.frame , factor and contrasts , can contain any discrete data type; nucleotides, amino acids and codons have some more support |
designed having phylogenetic analysis in mind; requires alignments, where all sequences have same length |
can be converted to and from DNAbin objects (as.DNAbin / as.phyDat ) |
||
a few generic functions work on it: c , unique , subset and utility functions baseFreq , allSitePattern , etc. |
data are not necessarily very memory efficient (as integer + contrast matrix), but stores only unique site patterns and their weights (as double) | |
“efficient” maximum likelihood, maximum parsimony and distance functions in phangorn | ||
gtype {S3} (strataG) | a simple R list containing a matrix where the first column is a stratification scheme and columns afterward are either haplotypes or diploid loci. If haploid data, the gtype object can also contain a list of DNA sequences. |
Can likely be made more efficient in terms of storage and preprocessing for other analytical routines in package |
can be converted to data.frame or matrix with appropriate as. functions. |
||
has manipulation functions like subset which will select certain strata and/or loci, merge to combine mulitple gtypes , and summary . |
||
can create input files for Genepop, STRUCTURE, fastsimcoal, Arlequin, MEGA, and PHASE | ||
multiDNA {S4} (apex) | stores multiple DNAbin objects from ape |
|
multiPhyDat {S4} (apex) | stores multiple phyDat objects from phangorn |