Data classes

The following table describes the specialized objects that population genetics packages use to store data. Conversion between most types is possible; exceptions are noted in the table.

Anyone developing a package for population genetic analysis is encouraged to use or build upon these data structures. If a new data structure is needed, please provide a conversion method to one or more of the classes listed below.
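
As a minimal sketch of what such a conversion method might look like, the R code below defines a hypothetical S3 class `myseqs` (not part of any package) that holds aligned sequences as a character matrix. It gives that class an `as.DNAbin` method by delegating to ape's existing method for character matrices. The class name, its `bases` field, the `new_myseqs` constructor and the toy sequences are all assumptions made for illustration.

```r
library(ape)  # provides the as.DNAbin generic and the DNAbin class

# Hypothetical container: aligned sequences, one row per individual
new_myseqs <- function(bases) {
  stopifnot(is.matrix(bases), is.character(bases))
  structure(list(bases = bases), class = "myseqs")
}

# Conversion method: hand the character matrix to ape's own method
as.DNAbin.myseqs <- function(x, ...) {
  as.DNAbin(x$bases, ...)
}

# Toy data (made up for this example)
toy <- new_myseqs(rbind(
  ind1 = c("a", "c", "g", "t"),
  ind2 = c("a", "c", "g", "a"),
  ind3 = c("a", "t", "g", "a")
))

dna <- as.DNAbin(toy)   # dispatches to as.DNAbin.myseqs
seg <- seg.sites(dna)   # indices of segregating sites
```

Once the object is a `DNAbin`, the ape functions listed in the table (for example `seg.sites` or `dist.dna`) can be applied to it without further work.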

Class {type} (package)	Strengths	Weaknesses
DNAbin {S3} (ape)	stores all sets of sequences (aligned or not)	less compact than 2-bit coding (but by a factor of 4 at most)
	uses matrices (aligned) or lists, so the usual R commands (`names`, `rownames`, `[`, `[[`, `$`) can be used
	many `as.DNAbin` methods in ape (including from Bioconductor)
	efficient functions in ape (`dist.dna`, `seg.sites`, `base.freq`, `read.FASTA`) and in pegas (`haplotype`)
loci {S3} (pegas)	low memory usage	not really appropriate for some analyses (e.g., multivariate analyses)
	all levels of ploidy and any number of alleles	treatment of NAs needs improvement (especially when data are read with `read.vcf()`)
	genotypes can be phased
	any kind of individual data can be associated in the data frame
	efficient to compute genotype and allele frequencies
genind {S4} (adegenet)	stores allelic counts; ideal for multivariate analyses	requires more memory
	additional slots for individual data	less efficient to compute frequencies
	additional slot for population strata
	all levels of ploidy
genpop {S4} (adegenet)	equivalent to genind at group level; ideal for multivariate analysis	requires more memory
genlight {S4} (adegenet)	stores binary SNPs using bit-level coding; very memory efficient	more computationally intensive to handle; less functionality
	additional slots for individual data and population strata
	all levels of ploidy	assumes bi-allelic loci
genclone {S4} (poppr)	inherits from genind; gains all its advantages	shares all genind weaknesses and uses slightly more memory
	stores multilocus genotype/lineage definitions (`@mlg` slot) for clonal populations
snpclone {S4} (poppr)	inherits from genlight; gains all its advantages	shares all genlight weaknesses and uses slightly more memory
	stores multilocus genotype/lineage definitions (`@mlg` slot) for clonal populations
genambig {S4} (polysat)	stores microsatellite data with ambiguous ploidy	does not handle any other data type
	exports to genpop objects	cannot easily be converted to any other class
phyDat {S3} (phangorn)	very general; inspired by R's `data.frame`, `factor` and `contrasts`; can contain any discrete data type; nucleotides, amino acids and codons have additional support	designed with phylogenetic analysis in mind; requires alignments in which all sequences have the same length
	can be converted to and from `DNAbin` objects (`as.DNAbin` / `as.phyDat`)
	a few generic functions work on it: `c`, `unique`, `subset` and utility functions `baseFreq`, `allSitePattern`, etc.	not necessarily very memory efficient (data are stored as integers plus a contrast matrix), but only unique site patterns and their weights (as doubles) are stored
	“efficient” maximum likelihood, maximum parsimony and distance functions in phangorn
gtype {S3} (strataG)	a simple R `list` containing a `matrix` in which the first column is a stratification scheme and the subsequent columns are either haplotypes or diploid loci; for haploid data, the `gtype` object can also contain a list of DNA sequences	storage and preprocessing for other analytical routines in the package could likely be made more efficient
	can be converted to `data.frame` or `matrix` with the appropriate `as.` functions
	has manipulation functions such as `subset`, which selects certain strata and/or loci, `merge`, which combines multiple `gtype` objects, and `summary`
	can create input files for Genepop, STRUCTURE, fastsimcoal, Arlequin, MEGA, and PHASE
multiDNA {S4} (apex)	stores multiple `DNAbin` objects from ape
multiPhyDat {S4} (apex)	stores multiple `phyDat` objects from phangorn
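
The `DNAbin`/`phyDat` conversion listed in the table can be sketched as follows. This is only an illustration on made-up toy sequences, assuming ape and phangorn are installed. It uses `as.phyDat` and `as.DNAbin` for the round trip, then reads the `nr` and `weight` attributes of the `phyDat` object to show that identical alignment columns are stored once, as a site pattern with a weight.

```r
library(ape)       # DNAbin class
library(phangorn)  # phyDat class, as.phyDat() and as.DNAbin() methods

# Toy alignment (made up): columns 1 and 5 are identical
aln <- as.DNAbin(rbind(
  ind1 = c("a", "c", "g", "t", "a"),
  ind2 = c("a", "c", "g", "a", "a"),
  ind3 = c("a", "t", "g", "a", "a")
))

pd   <- as.phyDat(aln)   # DNAbin -> phyDat
back <- as.DNAbin(pd)    # phyDat -> DNAbin

n_patterns <- attr(pd, "nr")      # number of unique site patterns
weights    <- attr(pd, "weight")  # how often each pattern occurs
```

Because only unique site patterns are kept, the weights sum to the number of columns in the original alignment, which is how the full alignment can be recovered on the way back.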