Great advances have been made in the field of genetic analysis in recent years. The availability of millions
of single nucleotide polymorphisms (SNPs) in public databases, coupled with major advances in SNP genotyping
technology that reduce costs and increase throughput, is enabling a host of studies aimed at elucidating the genetic basis
of complex disease. The focus in this task view is on R packages implementing statistical methods and algorithms for the
analysis of genetic data and for related population genetics studies.
A number of R packages are already available and many more are likely to be developed in the near future.
Please send your comments and suggestions to the task view maintainer.
-
Population Genetics
:
genetics
implements classes and methods for representing genotype and haplotype data, and has several
functions for population genetic analysis (e.g. functions for estimating and testing
Hardy-Weinberg and linkage disequilibrium).
rmetasim
provides an interface to the metasim engine
for population genetics simulations.
A few population genetics functions are also implemented in
gap.
hwde
fits models for genotypic disequilibria, whilst
HardyWeinberg
provides graphical representations of disequilibria via ternary plots (also known as de Finetti diagrams). The
Biodem
package provides functions for biodemographic analysis, e.g.
Fst()
calculates the Fst from the conditional kinship matrix. The
adegenet
package implements a number of different methods for analysing population structure using multivariate
statistics, graphics and spatial statistics.
The
hierfstat
package allows the estimation of hierarchical F-statistics from haploid or diploid genetic data with any number of levels in the hierarchy.
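As an illustration of the kind of Hardy-Weinberg testing mentioned above, here is a minimal, language-neutral sketch in Python. It is not the API of any of the R packages listed; the function name `hwe_chisq` and its arguments are hypothetical. It assumes a single biallelic marker and uses the standard one-degree-of-freedom chi-square goodness-of-fit test, which is unreliable when expected genotype counts are small.

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic marker.

    Hypothetical helper: takes observed genotype counts (AA, AB, BB) and
    returns (frequency of allele A, chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    # Allele frequency of A estimated by gene counting.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("marker is monomorphic; the test is undefined")
    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value
```

For example, genotype counts of 25, 50 and 25 match Hardy-Weinberg proportions exactly, so the statistic is 0 and the p-value is 1.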
-
Phylogenetics
:
The
Phylogenetics
view has more detailed information;
the most important packages are also mentioned here.
Phylogenetic and evolutionary analyses can be performed via
ape. Package
ouch
provides Ornstein-Uhlenbeck models for phylogenetic comparative hypotheses.
phangorn
estimates phylogenetic trees and networks using maximum likelihood, maximum parsimony, distance
methods and Hadamard conjugation.
-
Linkage
:
There are few native packages for performing parametric or non-parametric linkage analysis
from within R itself, so the calculations must usually be performed using external programs. However,
there are a number of ancillary R packages that facilitate interfacing with these stand-alone
programs and using the results for further analysis and presentation.
ibdreg
uses Identity By Descent (IBD) Non-Parametric Linkage (NPL) statistics for related pairs calculated
externally to test for genetic linkage with covariates by regression modelling.
Whilst it is not an official R package, one software suite in particular is worthy of mention.
PLINK
is a C++ program for genome-wide linkage analysis that supports R-based plug-ins via Rserve, allowing
users to utilise the rich suite of statistical functions in R for analysis.
-
QTL mapping
:
Packages in this category develop methods for the analysis of experimental crosses
to identify markers contributing to variation in quantitative traits.
bqtl
implements both likelihood-based and Bayesian methods for inbred crosses and recombinant inbred
lines.
qtl
provides several functions and a data structure for QTL mapping, including a function
scanone()
for genome-wide scans.
wgaim
builds on the
qtl
package by including functions for the modelling and summary of QTL intervals from the
full linkage map, whilst
dlmap
can be used to perform QTL mapping in a mixed model framework with separate detection and localization stages.
-
Association
:
Packages in this category provide statistical methods to test associations between individual genetic markers
and a phenotype.
gap
is a package for genetic data analysis of both population and family data; it contains functions for sample
size calculations, probability of familial disease aggregation, kinship calculation, and some tests for linkage
and association analyses. Among the other functions,
genecounting()
estimates haplotype frequencies from genotype data, and
gcontrol()
implements Bayesian genomic control statistics for association studies. For family data,
tdthap
offers an implementation of the Transmission/Disequilibrium Test (TDT) for extended marker haplotypes.
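To make the idea behind the TDT concrete, here is a minimal sketch in Python of the basic single-marker, biallelic form of the test. This is deliberately simpler than the extended-haplotype version that tdthap implements, and it is not that package's interface; the function name `tdt` and its arguments are hypothetical. It assumes the transmissions from heterozygous parents to affected offspring have already been counted.

```python
import math


def tdt(transmitted, untransmitted):
    """Basic biallelic transmission/disequilibrium test.

    Hypothetical helper: `transmitted` is the number of heterozygous parents
    who passed the candidate allele to an affected child, `untransmitted`
    the number who passed the other allele. Returns (chi-square, p-value).
    """
    b, c = transmitted, untransmitted
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    # McNemar-type statistic: under no linkage, each allele is transmitted
    # with probability 1/2, so b and c should be about equal.
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With 60 transmissions against 40 non-transmissions, the statistic is (60 - 40)^2 / 100 = 4.0, which corresponds to a p-value of about 0.046.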
-
Linkage Disequilibrium and haplotype mapping
:
A number of packages provide haplotype estimation for unrelated individuals with ambiguous haplotypes
(due to unknown linkage phase) and allow testing for associations between the estimated haplotypes and
phenotypes (including covariates) under a GLM framework.
hapassoc
performs likelihood inference of trait associations with haplotypes in GLMs.
haplo.stats
also contains tests for haplotype associations under a GLM framework, and additionally provides score tests of
association, novel functionality for building haplotypes in a sequential manner,
power and sample-size calculations, and the preparation of data matrices for use in other methods.
tdthap
implements transmission/disequilibrium tests for extended marker haplotypes.
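For readers unfamiliar with the quantities behind this heading, the following Python sketch computes the usual pairwise linkage disequilibrium measures (D, |D'| and r^2) for two biallelic loci. It assumes that phased haplotype counts are already available, which is exactly what the packages above must estimate when phase is unknown. The function name `ld_measures` and its arguments are hypothetical and do not correspond to any of the R packages listed.

```python
def ld_measures(n_ab_upper, n_a_upper_b_lower, n_a_lower_b_upper, n_ab_lower):
    """Pairwise LD from phased haplotype counts at two biallelic loci.

    Hypothetical helper: arguments are the counts of haplotypes
    AB, Ab, aB and ab. Returns (D, |D'|, r^2).
    """
    n = n_ab_upper + n_a_upper_b_lower + n_a_lower_b_upper + n_ab_lower
    if n == 0:
        raise ValueError("no haplotypes observed")
    p_ab = n_ab_upper / n                        # frequency of haplotype AB
    p_a = (n_ab_upper + n_a_upper_b_lower) / n   # frequency of allele A
    p_b = (n_ab_upper + n_a_lower_b_upper) / n   # frequency of allele B
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("a locus is monomorphic; LD is undefined")
    # D: deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d != 0 else 0.0
    # Squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

If only haplotypes AB and ab are observed in equal numbers, the loci are in complete disequilibrium and both |D'| and r^2 equal 1; if all four haplotypes are equally frequent, D is 0.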
-
Genome-Wide Association Studies (GWAS)
:
Recent technical advances in high-throughput genotyping technologies have made
Genome-Wide Association Studies a feasible strategy. A number of packages are available to facilitate
the analysis of these large data sets.
pbatR
provides a GUI to the powerful PBAT software, which performs family- and
population-based studies. The software has been implemented to take advantage of parallel processing, which
vastly reduces the computational time required for GWAS.
snpMatrix
implements classes and methods for large-scale SNP association studies.
-
Multiple testing
:
The package
qvalue
on Bioconductor
implements False Discovery Rate estimation; the main function
qvalue()
estimates the q-values from a list of p-values.
Package
multtest
on Bioconductor
also offers several non-parametric bootstrap and permutation resampling-based multiple testing procedures.
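As a rough illustration of false discovery rate control, the sketch below computes Benjamini-Hochberg adjusted p-values in Python. This is a swapped-in, simpler technique: it is not the estimator used by `qvalue()`, which additionally estimates the proportion of true null hypotheses. The Benjamini-Hochberg values correspond to the conservative case in which that proportion is taken to be 1. The function name `bh_adjust` is hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Hypothetical helper for illustration only; rejecting every hypothesis
    whose adjusted value is at most alpha controls the FDR at level alpha
    under independence.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For the p-values 0.01, 0.04, 0.03 and 0.005, the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02.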
-
Importing Sequence Data
:
There are utilities in the
seqinr
package to import sequence data from various sources, including files of aligned sequences in mase, clustal,
phylip, fasta and msf formats, which will be useful for some population genetic analyses. Users interested in
using R for sequence data and bioinformatics are also referred to the
Bioconductor
project.