Social scientists use a wide range of statistical methods, most of which are not unique to the social sciences. Indeed, most statistical data analysis in the social sciences is covered by the facilities in the base and recommended packages, which are part of the standard R distribution. In the package descriptions below, I identify base and recommended packages on first mention; packages that are not specifically identified as "R-base" or "recommended" are contributed packages.
Other Relevant Task Views:
Beyond the base and contributed packages, many of the methods commonly employed in the social sciences are covered extensively in other CRAN task views, including the following, listed in alphabetical order. I will try to avoid duplicating information presented in these other task views.
-
Bayesian: Methods of Bayesian inference in a variety of settings of interest to social scientists, including mixed-effects models.
-
Econometrics
and
Finance: In addition to methods of specific interest to economists and financial analysts, these task views cover a variety of commonly used regression models and methods, instrumental-variables estimation, models for panel data, and some time-series models.
-
MetaAnalysis: Methods of meta-analysis for combining results from primary studies. If data on individuals in each study are available, meta-analysis can be performed using
mixed-effects models
.
-
Multivariate: A broad, if far from exhaustive, catalog of methods implemented in R for analyzing multivariate data, from data visualization to statistical modeling, and including correspondence analysis for multivariate categorical data.
-
OfficialStatistics: Covers not only official statistics but also methods for collecting and analyzing data from complex sample surveys, such as those in the
survey
package.
-
Psychometrics: Extensively covers methods of scale construction, including item-response theory, multidimensional scaling, and classical test theory, along with other topics of interest in the social sciences, such as structural-equation modeling.
-
Spatial: Methods for managing, visualizing, and modeling spatial data, including spatial regression analysis.
-
SpatioTemporal: Methods for representing, visualizing, and analyzing data with information both on time and location.
-
Survival: Methods for survival analysis (often termed "event-history analysis" in the social sciences), beyond the basic and standard methods, such as Cox regression, included in the recommended
survival
package.
-
TimeSeries: Methods for representing, manipulating, visualizing, and modeling time-series data, including time-series regression methods.
It is noteworthy that this enumeration includes about a third of the CRAN task views. Moreover, there are other task views of potential interest to social scientists (such as the
Graphics
task view on statistical graphics); I suggest that you look at the
list of all task views on CRAN
.
Linear and Generalized Linear Models:
Univariate and multivariate linear models are fit by the
lm
function, generalized linear models by the
glm
function, both in the R-base
stats
package. Beyond
summary
and
plot
methods for
lm
and
glm
objects, there is a wide array of functions that support these objects.
-
The generic
anova
function in the
stats
package constructs sequential ("Type-I") analysis of variance and analysis of deviance tables, and can also compute
F
and chi-square likelihood-ratio tests for nested models. (Other classes of statistical models in R typically have
anova
methods as well, along with methods for other standard generics, such as
coef, for returning regression coefficients;
vcov
for the coefficient covariance matrix;
residuals; and
fitted
for fitted values of the response.) The generic
Anova
function in the
car
package (associated with Fox and Weisberg,
An R Companion to Applied Regression, Second Edition
, Sage, 2011) constructs so-called "Type-II" and "Type-III" partial tests for linear, generalized linear, and many other classes of regression models.
-
F
and chi-square Wald tests for a variety of hypotheses are available from the
coeftest
and
waldtest
functions in the
lmtest
package, and the
linearHypothesis
function in the
car
package. All of these functions permit the use of heteroscedasticity-consistent and heteroscedasticity/autocorrelation-consistent covariance matrices, as computed, e.g., by functions in the
sandwich
and
car
packages. Also see the
glh.test
function in the
gmodels
package. Nonlinear functions of parameters can be tested via the
deltaMethod
function in the
car
package. The
multcomp
package includes functions for multiple comparisons. The
vuong
function in the
pscl
package tests non-nested hypotheses for generalized linear and some other models. Also see the
rms
package for tests on linear and generalized linear models.
-
The standard R distribution has excellent basic facilities for linear and generalized linear model "diagnostics," including, for example, hat-values and deletion statistics such as studentized residuals and Cook's distances (
hatvalues,
rstudent, and
cooks.distance, all in the
stats
package). These are augmented by other packages: several functions in the
car
package, which emphasizes graphical methods, e.g.,
crPlots
for component-plus-residual plots and
avPlots
for added-variable plots (among others), in addition to numerical diagnostics, such as
vif
for (generalized) variance-inflation factors; the
dr
package for dimension reduction in regression, including SIR, SAVE, and pHd; and the
lmtest
package, which implements a variety of diagnostic tests (e.g., for heteroscedasticity, nonlinearity, and autocorrelation). Other collinearity diagnostics are in the
perturb
package. Diagnostics may also be found in the
rms
package. See the
influence.ME
package for influential-data diagnostics for mixed-effects models.
-
Several packages contain functions that are useful for interpreting linear and generalized linear models that have been fit to data: The
qvcalc
package computes "quasi variances" for factors in linear and generalized linear models (and more generally). The
effects
package constructs effect displays, including, e.g., "adjusted means," for linear, generalized linear, and many other regression models; diagnostic partial-residual plots are available for linear and generalized linear models. Similar, if somewhat less general, plots are available in the
visreg
package. The
lsmeans
package implements so-called "least-squares means" for linear, generalized linear, and mixed models, and includes provisions for hypothesis tests.
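The base facilities described above can be illustrated with a minimal sketch. It uses only the stats package and the built-in mtcars data; the choice of data set and of the particular models is an assumption made purely for illustration.

```r
# Fit a linear model, a nested linear model, and a logistic GLM
# to the built-in mtcars data (illustrative choice of variables).
fit_lm0 <- lm(mpg ~ wt, data = mtcars)
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(am ~ wt, family = binomial, data = mtcars)

# Standard generics that most model classes support
coef(fit_lm)             # regression coefficients
vcov(fit_lm)             # coefficient covariance matrix
head(residuals(fit_lm))  # residuals
head(fitted(fit_lm))     # fitted values of the response

# Sequential ("Type-I") analysis-of-variance table
anova(fit_lm)

# F test comparing two nested linear models
nested_test <- anova(fit_lm0, fit_lm)
nested_test

# Sequential analysis-of-deviance table with chi-square
# likelihood-ratio tests for the GLM
anova(fit_glm, test = "Chisq")

# Basic diagnostics: hat-values, studentized residuals, Cook's distances
diag_stats <- data.frame(
  hat      = hatvalues(fit_lm),
  rstudent = rstudent(fit_lm),
  cook     = cooks.distance(fit_lm)
)

# The three cases with the largest Cook's distances
head(diag_stats[order(-diag_stats$cook), ], 3)
```

The same generics (coef, vcov, residuals, fitted, anova) generally work on objects produced by the contributed packages mentioned in this section, which is why they are worth learning first.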
Analysis of Categorical and Count Data:
Binomial logit and probit models, as well as Poisson-regression and loglinear models for contingency tables (including models for "over-dispersed" binomial and Poisson data), can be fit with the
glm
function in the
stats
package. For over-dispersed data, see also the
aod
package, the
dispmod
package, and the
glm.nb
function in the recommended
MASS
package (associated with Venables and Ripley,
Modern Applied Statistics with S, Fourth Ed.
, Springer, 2002), which fits negative-binomial GLMs. The
pscl
package includes functions for fitting zero-inflated and hurdle regression models to count data.
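As a rough sketch of the options for over-dispersed counts, the following compares a Poisson GLM, a quasi-Poisson GLM, and a negative-binomial GLM. It assumes that the recommended MASS package is installed; the built-in warpbreaks data and the model formula are illustrative choices, not taken from the text above.

```r
library(MASS)  # recommended package; provides glm.nb()

# Poisson regression for counts
fit_pois <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Quasi-Poisson: same coefficient estimates, but standard errors are
# scaled by an estimated dispersion parameter
fit_qp <- glm(breaks ~ wool + tension, family = quasipoisson, data = warpbreaks)
disp <- summary(fit_qp)$dispersion
disp  # values well above 1 suggest over-dispersion

# Negative-binomial GLM, which estimates an extra parameter theta
fit_nb <- glm.nb(breaks ~ wool + tension, data = warpbreaks)
fit_nb$theta

# Compare the likelihood-based fits by AIC
# (the quasi-Poisson fit has no likelihood, so it is omitted)
AIC(fit_pois, fit_nb)
```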
The multinomial logit model is fit by the
multinom
function in the recommended
nnet
package, and ordered logit and probit models by the
polr
function in the
MASS
package. Also see the
mlogit
package for the multinomial logit model, the
MNP
package for the multinomial probit model, and the
multinomRob
package for the analysis of overdispersed multinomial data. The
VGAM
package is capable of fitting a very wide variety of fixed-effect regression models within a unified framework, including models for ordered and unordered categorical responses and for count data.
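The following sketch fits ordered logit, ordered probit, and multinomial logit models to the housing data supplied with MASS, where the response Sat is an ordered factor and the data are in frequency form. It assumes that the recommended MASS and nnet packages are installed; treating the same response as unordered in the last model is done here only to show the call.

```r
library(MASS)  # polr() and the housing data
library(nnet)  # multinom()

# Ordered logit (proportional-odds) model; Freq holds the cell counts
fit_ologit <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
                   data = housing, Hess = TRUE)

# Ordered probit model: same call with a different link
fit_oprobit <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
                    data = housing, method = "probit", Hess = TRUE)

# Multinomial logit model, ignoring the ordering of the response
fit_mlogit <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                       data = housing, trace = FALSE)

coef(fit_ologit)   # slope coefficients
fit_ologit$zeta    # estimated thresholds (cut points)
coef(fit_mlogit)   # one row of coefficients per non-baseline category
```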
There are other noteworthy facilities for analyzing categorical and count data.
-
The
table
function in the R-base
base
package and the
xtabs
and
ftable
functions in the
stats
package construct contingency tables.
-
The
chisq.test
and
fisher.test
functions in the
stats
package may be used to test for independence in two-way contingency tables.
-
The
loglin
function in the
stats
package fits hierarchical loglinear models to contingency tables by iterative proportional fitting; the
loglm
function in the
MASS
package provides a formula-based front end to
loglin.
-
See the
brglm
and
logistf
packages for bias-reduction in binomial-response GLMs (useful, e.g., in cases of complete separation); the
exactLoglinTest
package for exact tests of loglinear models; the
clogit
function in the
survival
package for conditional logistic regression; and the
vcd
package for graphical displays of categorical data, including mosaic plots.
-
The
gnm
package estimates generalized
nonlinear
models, and can be used, e.g., to fit certain specialized models to mobility tables. The
logmult
package provides convenience functions based on
gnm
to fit log-multiplicative (e.g., UNIDIFF) and association (e.g., Goodman's RC) models. Also see the
catspec
package for estimating various special models for square tables.
-
As previously mentioned, the
Multivariate
task view covers correspondence analysis of multivariate categorical data.
-
See the
betareg
package for beta regression of data on rates and proportions, a topic closely associated with categorical data.
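A short sketch of the table-construction and testing functions listed above follows. It uses the built-in mtcars and UCBAdmissions data and assumes that the recommended MASS package is installed; the particular loglinear model (admission and gender conditionally independent given department) is chosen only as an example.

```r
library(MASS)  # recommended package; provides loglm()

# table() cross-classifies raw observations
tab_raw <- table(cyl = mtcars$cyl, am = mtcars$am)
tab_raw

# xtabs() builds a table from data in frequency form
ucb  <- as.data.frame(UCBAdmissions)
tab2 <- xtabs(Freq ~ Gender + Admit, data = ucb)

# ftable() prints a multiway table in "flat" form
ftable(UCBAdmissions, row.vars = c("Dept", "Gender"))

# Tests of independence in the two-way (marginal) table
chi <- chisq.test(tab2)
chi
fisher.test(tab2)

# Loglinear model with Admit and Gender independent given Dept.
# The dimensions of UCBAdmissions are 1 = Admit, 2 = Gender, 3 = Dept.
fit_ipf <- loglin(UCBAdmissions, margin = list(c(1, 3), c(2, 3)),
                  fit = TRUE, print = FALSE)

# The same model specified through a formula with loglm()
fit_ll <- loglm(~ Admit * Dept + Gender * Dept, data = UCBAdmissions)

c(lrt = fit_ipf$lrt, df = fit_ipf$df)
fit_ll
```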
Other Regression Models:
It is possible to fit a very wide variety of regression models with the facilities provided by the base and recommended packages, and an even wider variety of models with contributed packages, in addition to those covered extensively in
other task views
.
-
Nonlinear regression:
The
nls
function in the
stats
package fits nonlinear models by least-squares. The
nlstools
package includes several functions for assessing models fit by
nls.
-
Mixed-effects models:
The recommended
nlme
package, associated with Pinheiro and Bates,
Mixed-Effects Models in S and S-PLUS
(Springer, 2000), fits linear (
lme) and nonlinear (
nlme) mixed-effects models, commonly used in the social sciences for hierarchical and longitudinal data. Generalized linear mixed-effects models may be fit by the
glmmPQL
function in the
MASS
package, or (preferably) by the
glmer
function in the
lme4
package. The
lme4
package also largely supersedes
nlme
for
linear
mixed models, via its
lmer
function. Unlike
lme,
lmer
supports crossed random effects, but does not support autocorrelated or heteroscedastic individual-level errors. Also see the
lmeSplines,
lmm, and
MCMCglmm
packages.
-
Generalized estimating equations:
The
gee
and
geepack
packages fit marginal models by generalized estimating equations; see the
multgee
package for GEE estimation of models for correlated nominal or ordinal multinomial responses.
-
Nonparametric regression analysis:
This is one of the conspicuous strengths of R. The standard R distribution includes several functions for smoothing scatterplots, including
loess.smooth
and
smooth.spline, both in the
stats
package. The
loess
function, also in the
stats
package, fits simple and multiple nonparametric-regression models by local polynomial regression. Generalized additive models are covered by several packages, including the recommended
mgcv
package and the
gam
package, the latter associated with Hastie and Tibshirani,
Generalized Additive Models
(Chapman and Hall, 1990); also see the
VGAM
package. Some other noteworthy contributed packages in this area are
gss, which fits spline regressions;
locfit, for local-polynomial regression (and also density estimation) (Loader,
Local Regression and Likelihood,
Springer, 1999);
sm, for a variety of smoothing techniques, including for regression (Bowman and Azzalini,
Applied Smoothing Techniques for Data Analysis,
Oxford, 1997);
np, which implements kernel smoothing methods for mixed data types; and
acepack
for ACE (alternating conditional expectations) and AVAS (additivity and variance stabilization) nonparametric transformation of the response and explanatory variables in regression.
-
Quantile regression:
Methods for linear, nonlinear, and nonparametric quantile regression are extensively provided by the
quantreg
package.
-
Regression splines:
Parametric regression splines (as opposed to nonparametric smoothing splines), supported by the R-base
splines
package, can be used by
lm,
glm, and other statistical modeling functions that employ model formulas. See the
bs
(B-spline) and
ns
(natural spline) functions.
-
Very large data sets:
The
biglm
package can fit linear and generalized linear models to data sets too large to fit in memory.
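Several of the base facilities in the list above can be sketched briefly: nonlinear least squares with nls, local polynomial regression with loess, and regression splines via bs and ns in a model formula. The example uses only the stats and splines packages; the data sets (Puromycin and cars), the Michaelis-Menten model, the start values, and the spline degrees of freedom are all illustrative assumptions.

```r
library(splines)  # bs() and ns()

# Nonlinear least squares: Michaelis-Menten model for the treated cases
# in the built-in Puromycin data, with rough start values
fit_nls <- nls(rate ~ Vm * conc / (K + conc),
               data = Puromycin, subset = state == "treated",
               start = list(Vm = 200, K = 0.05))
coef(fit_nls)

# Local polynomial regression of stopping distance on speed
fit_lo <- loess(dist ~ speed, data = cars)
predict(fit_lo, newdata = data.frame(speed = 15))

# Regression splines used directly inside an lm() formula
fit_lin <- lm(dist ~ speed, data = cars)
fit_ns  <- lm(dist ~ ns(speed, df = 3), data = cars)  # natural spline
fit_bs  <- lm(dist ~ bs(speed, df = 4), data = cars)  # B-spline

# F test of the linear fit against the natural-spline fit
anova(fit_lin, fit_ns)
```

Because bs and ns simply generate basis columns, the same terms can be used in glm and in other modeling functions that accept model formulas.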
Other Statistical Methods:
Here is a brief survey of implementations in R of other statistical methods commonly used by social scientists.
-
Missing Data:
Several packages implement methods for handling missing data by multiple imputation, including the (conspicuously aging)
mix,
norm, and
pan
packages associated with Schafer,
Analysis of Incomplete Multivariate Data
(Chapman and Hall, 1997), and the newer and more actively maintained
Amelia,
mi,
mice, and
mitools
packages (the last for drawing inferences from multiply imputed data sets). There are also some facilities for missing-data imputation in the general
Hmisc
package, which is described below, under
"Collections"
. Some of the structural-equation modeling software discussed in the
Psychometrics
task view is capable of maximum-likelihood estimation of regression models with missing data. The
VIM
package has functions for visualizing missing and imputed values.
-
Bootstrapping and Other Resampling Methods:
The recommended package
boot, associated with Davison and Hinkley,
Bootstrap Methods and Their Application
(Cambridge, 1997), has excellent facilities for bootstrapping and some related methods. Also notable is the
bootstrap
package, associated with Efron and Tibshirani,
An Introduction to the Bootstrap
(Chapman and Hall, 1993), which has functions for bootstrapping and jackknifing. In addition, see the functions
Boot
and
bootCase
in the
car
package, and
nlsBoot
in the
nlstools
package, along with the
simpleboot
package.
-
Model Selection:
The
step
function in the
stats
package and the more broadly applicable
stepAIC
function in the
MASS
package perform forward, backward, and forward-backward stepwise selection for a variety of statistical models. The
regsubsets
function in the
leaps
package performs all-subsets regression. The
BMA
package performs Bayesian model averaging. The standard
AIC
and
BIC
functions are also relevant to model selection. Beyond these packages and functions, see the
MachineLearning
task view.
-
Social Network Analysis:
There are several packages useful for social network analysis, including
sna
for sociometric analysis of networks (e.g., blockmodeling),
network
for manipulating and displaying network objects,
latentnet
for latent position and cluster models for networks,
ergm
for exponential random graph models of networks, and the "metapackage"
statnet, all associated with the
statnet project
. Also see the
RSiena
and
PAFit
packages for longitudinal social network analysis; and the
multiplex
package, which implements algebraic procedures for the analysis of multiple social networks.
-
Propensity Scores and Matching:
See the
Matching,
MatchIt,
optmatch, and
PSAgraphics
packages, and the
matching
function in the
arm
package (associated with Gelman and Hill,
Data Analysis Using Regression and Multilevel/Hierarchical Models,
Cambridge, 2007).
-
Demographic methods:
The
demography
package includes functions for constructing life tables, for analyzing mortality, fertility, and immigration, and for forecasting population.
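Two of the methods surveyed above, bootstrapping and stepwise model selection, can be sketched with the standard R distribution alone. The example assumes that the recommended boot package is installed; the helper coef_fun, the mtcars data, the candidate predictors, and the number of bootstrap replicates are hypothetical choices made for illustration.

```r
library(boot)  # recommended package; provides boot() and boot.ci()

# Case-resampling bootstrap of regression coefficients.
# coef_fun is a hypothetical helper: boot() passes it the data and a
# vector of resampled row indices.
coef_fun <- function(data, idx) {
  coef(lm(mpg ~ wt + hp, data = data[idx, ]))
}

set.seed(123)  # for reproducibility of the resampling
boot_out <- boot(mtcars, statistic = coef_fun, R = 999)

# Percentile confidence interval for the second coefficient (wt)
ci_wt <- boot.ci(boot_out, type = "perc", index = 2)
ci_wt

# Backward stepwise selection by AIC, starting from a larger model
fit_full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit_sel  <- step(fit_full, direction = "backward", trace = 0)
formula(fit_sel)

# Information criteria for the full and selected models
AIC(fit_full, fit_sel)
BIC(fit_full, fit_sel)
```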
Collections of Functions:
There are some packages that are so heterogeneous that they are difficult to classify, yet contain functions (typically in multiple domains) that are of interest to social scientists:
-
I have already made several references to the recommended
MASS
package, which is associated with Venables and Ripley's
Modern Applied Statistics With S.
Other recommended packages associated with this book are
nnet, for fitting neural networks (but also, as mentioned, multinomial logistic-regression models);
spatial
for spatial statistics; and
class, which contains functions for classification.
-
I've also mentioned the
car
package, associated with Fox and Weisberg,
An R Companion to Applied Regression, Second Edition
, which has a variety of functions supporting regression analysis, data exploration, and data transformation.
-
The
Hmisc
and
rms
packages (both mentioned above), associated with Harrell,
Regression Modeling Strategies, Second Edition
(Springer, 2015), provide functions for data manipulation, linear models, logistic-regression models, and survival analysis, many of them "front ends" to or modifications of other facilities in R.
Acknowledgments:
Jangman Hong contributed to the general revision of this task view, as did other individuals who made a variety of specific suggestions.
If I have omitted something of importance not covered in one of the other task views cited, or if a new package or function should be mentioned here,
please let me know.
Compilation of this task view was partly supported by grants from the Social Sciences and Humanities Research Council of Canada.