As a first example of the functions in this package we will fit a model to the pied flycatcher data. We will start by fitting a model with fixed effects on the mean and variance components. Specifically, we will model the load returned on each trip by the adult flycatchers a independent normal random variables with mean given by a linear combination of the logarithm of the inter-visit interval (IVI), the broodsize manipulation (broodsize), and the adult’s sex, and standard deviation given by a linear combination of broodsize and sex, on the log scale. We will then incorporate individual random effects into this model.
First you have to load the package:
## Load package
library(dalmatian)
The raw data for this example is provided in the package and can be accessed with:
## Load pied flycatcher data
data(pied_flycatchers_1)
This data contains records on 5795 different food deliveries recorded during the pied flycatcher experiment. The response variable, load
, records the load rounded to the nearest .1 gram. We want to model the logarithm of the load, and we also want to account for the rounding. To do this we will create two new variables in the data frame, lower
and upper
, which bound the logarithm of the true load. Not that we add .001 when the observed value is zero to avoid problems with the logarithm.
## Create variables bounding the true load
pfdata$lower=ifelse(pfdata$load==0,log(.001),log(pfdata$load-.049))
## Warning in log(pfdata$load - 0.049): NaNs produced
pfdata$upper=log(pfdata$load+.05)
Our initial model for the logarithm of the load returned on the \(j\)th trip by adult \(i\) will be \(N(\mu_{ij},\sigma_{ij}^2)\) where: \[\mu=\beta_0 + \beta_1 \mathrm{log(IVI)_{ij} + \beta_2 broodsize_i + \beta_3 sex_i}\] and \[\log(\sigma)=\psi_0 + \psi_1 \mathrm{broodsize}_i + \psi_2\mathrm{sex}_i\].
The first step is to define the models for the mean and variance components. These models are simply lists with (up to) two named components, fixed
and random
, which provide the details on the fixed and random effects. Both lists must contain a variable name
specifying the basename for the coefficients and a variable formula
specifying the effects. The fixed
list should also contain an object called priors
which specifies the priors for coefficients in the fixed effects portion of the model. Random effects are currently assumed to be normal with mean zero and unknown variances which are assigned the half \(t\)-distribution with ? degrees of freedom and scale ?. The optional parameter link
can also be used to specify a link function for either component.
The components of the model for the pied flycatcher data specified above would be generated with:
## Mean model
mymean=list(fixed=list(name="alpha",
formula=~ log(IVI) + broodsize + sex,
priors=list(c("dnorm",0,.001))))
## Variance model
myvar=list(fixed=list(name="psi",
link="log",
formula=~broodsize + sex,
priors=list(c("dnorm",0,.001))))
These two objects will now be used to generate the JAGS code, data, and initial values for running the model.
The primary function in the package dalmatian
automates the creation of the JAGS
code, data, and initial values and then passes these to JAGS
via the functions in the rjags
package. The dalmatian
function takes requires several arguments including the data frame for the analysis, the model objects created above, and the name of the JAGS
model file. It also requires two lists containing named arguments for the functions jags.model
and coda.samples
from the rjags
package. Descriptions of the arguments for these two functions are available in their own help pages. Any arguments that are not specified in these lists will take the default values given by rjags
. The two exceptions are the file
argument of jags.model
and the n.iter
argument of coda.samples
which do not have default values and must be specified. The number of parallel chains will be taken from the jags.model
list. If this value is not specified then three chains will be run in order that convergence diagnostics can be computed. Note that the chains have been run for a relatively small number of iterations and heavily subsampled to satisfy the size requirements for packages on . Neither of these is recommended in practice.
## Set working directory
## By default uses a system temp directory. You probably want to change this.
workingDir <- tempdir()
## Define list of arguments for jags.model()
jm.args <- list(file=file.path(workingDir,"pied_flycatcher_1_jags.R"),n.adapt=1000)
## Define list of arguments for coda.samples()
cs.args <- list(n.iter=5000,thin=20)
## Run the model using dalmatian
## This is how the model is run. However, to save you time we will load output from a previous run instead.
if(runModels){
pfresults <- dalmatian(df=pfdata,
mean.model=mymean,
variance.model=myvar,
jags.model.args=jm.args,
coda.samples.args=cs.args,
rounding=TRUE,
lower="lower",
upper="upper",
debug=FALSE)
save(pfresults,"pfresults.RData")
}
if(!runModels){
load(system.file("Pied_Flycatchers_1","pfresults.RData",package="dalmatian"))
}
The function dalmatian
returns list containing three objects: 1. jags.model.args
The full list of arguments passed to jags.model
. This includes the variables in the input list along with the name of the model file, the formatted data, and the initial values. 2. coda.samples.args
The full list of arguments passed to coda.samples
. This includes the variables in the input list along with the complete object returned by jags.model.args
and the character vector including the names of the monitored parameters. 3. coda
An object of class mcmc.list
and length n.chains
containing the samples generated by JAGS
.
Inference is based on the output from the MCMC chain which is stored in the coda
object. Summaries of the MCMC sampler and the posterior distribution can be constructed from coda
with a variety of tools designed to work with objects of the mcmc.list
class. For example, we can use the tools in the package coda
to generate traceplots, density plots and histograms, numerical summaries of the posterior, and convergence diagnostics. The ggmcmc
package can also be used to create fancier looking traceplots, density plots, and plots summarizing the posterior distribition in terms of posterior means and quantiles (credible intervals). However, several wrapper functions exist within the package to automate some of these tasks.
The first thing we should do is to check the convergence of the chains. Summary statistics computed from the output in coda
are only valid if the chain has converged and is sampling from the correct distribution. The wrapper function convergence()
applies three of the convergence diagnostics from the coda
package.
## Compute convergence diagnostics
pfconvergence <- convergence(pfresults)
## Gelman-Rubin diagnostics
pfconvergence$gelman
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## alpha.(Intercept) 1 1.00
## alpha.log(IVI) 1 1.00
## alpha.broodsize 1 1.01
## alpha.sex 1 1.01
## psi.(Intercept) 1 1.02
## psi.broodsize 1 1.01
## psi.sex 1 1.01
## Raftery diagnostics
pfconvergence$raftery
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
## Effective sample size
pfconvergence$effectiveSize
## alpha.(Intercept) alpha.log(IVI) alpha.broodsize alpha.sex
## 750.000 680.420 750.000 750.000
## psi.(Intercept) psi.broodsize psi.sex
## 1242.813 750.000 1095.642
Details of these measures can be found in the help files of the coda
package. Not surprisingly, these all suggest that the chains need to be run for longer.
Convergence and mixing of the chains can also be examined visually through traceplots. The wrapper function traceplots
generates traces using the functions from the ggmcmc
package and separates the variables by the model components. One of the advantages of using ggmcmc
which is based on ggplot2
is that the plots can be saved and easily reproduced. Here I have thinned the chains further prior to plotting to reduce the size of the vignette. Note that thinning is conducted relative to the original chain, not the already thinned chain. I.e., thinning by 100 selects the output from every 100th iteration of the original chain.
## Generate traceplots
pftraceplots <- traceplots(pfresults,plot=FALSE,nthin=100)
## Fixed effects for mean
pftraceplots$meanFixed
## Fixed effects for variance
pftraceplots$varianceFixed
Numerical summaries of the posterior distribution computed from the output in coda
can be obtained with the summary function.
## Compute numerical summaries
summary(pfresults)
##
## Iterations = 1001:5981
## Thinning interval = 20
## Number of chains = 3
## Sample size per chain = 250
##
## Posterior Summary Statistics for Each Model Component
##
## Mean Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -3.34 -3.34 0.08 -3.50 -3.39 -3.28 -3.18
## log(IVI) 0.15 0.15 0.01 0.13 0.14 0.16 0.17
## broodsize 0.11 0.11 0.03 0.06 0.10 0.13 0.16
## sex -0.07 -0.07 0.02 -0.12 -0.10 -0.06 -0.03
##
## Variance Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -0.39 -0.39 0.06 -0.50 -0.44 -0.35 -0.25
## broodsize 0.02 0.02 0.03 -0.04 0.00 0.04 0.07
## sex 0.01 0.01 0.03 -0.04 0.00 0.04 0.07
For each component of the model the tables provide statistics including the estimated posterior mean, median, and standard deviation and the end points of the 50% and 95% highest posterior density credible intervals. Note that these summaries are only valid if the previous diagnostics have suggested that the chain has converged and the effective sample size is large enough. Graphical summaries can also be computed with the caterpillar
which, like traceplots
, is a simple wrapper for the functions in the ggmcmc
package.
## Generate caterpillar
pfcaterpillar <- caterpillar(pfresults,plot = FALSE)
## Fixed effects for mean
pfcaterpillar$meanFixed
## Fixed effects for variance
pfcaterpillar$varianceFixed
We will now consider adding individual (bird) random effects to both the fixed and random components of our model. This can be done by simply adding to the model components created above and then rerunnig the model. The new model will have mean and variance \[\mu=\beta_0 + \beta_1 \mathrm{log(IVI)_{ij} + \beta_2 broodsize_i + \beta_3 sex_i + \epsilon_{i}}\] and \[\log(\sigma)=\psi_0 + \psi_1 \mathrm{broodsize}_i + \psi_2\mathrm{sex}_i + \xi_j\] where \(\epsilon_i \sim N(0,\tau^2_\mu)\) and \(\xi_i \sim N(0,\tau^2_\sigma)\). By default the variances \(\tau^2_\epsilon\) and \(\tau^2_\xi\) are assigned half-\(t\) prior distributions.
Instead of creating new model objects we can simply add to the old objects by creating the random
fields. As with the fixed effects, these fields are lists which must include a basename for the random effects and formula specifying the random effects model.
# Random component of mean
mymean$random=list(name="epsilon",formula=~-1 + indidx)
# Random component of variance
myvar$random=list(name="xi",formula=~-1 + indidx)
The new model can now be run using dalmatian
in exactly the same way as above. However, we will make one change. In order to shorten the convergence time, we will base initial values on the results of the fixed effects model and then provide these as input. In particular, we will define initial values for the fixed effects of both the mean and variance components by taking the values from the final iterations of each of the chains run for the fixed effects model. I have also found that it is best to set initial values for the response variable, y
, if the response is rounded. Otherwise, JAGS often fails to generate initial values that fall between the rounding limits. For convenience, we set the random effects variances equal to 1 in all three chains. Note that the chains have been run for a relatively small number of iterations and heavily subsampled to satisfy the size requirements for packages on . Neither of these is recommended in practice.
## Define initial values
inits <- lapply(1:3,function(i){
setJAGSInits(mymean,
myvar,
y = runif(nrow(pfdata),pfdata$lower,pfdata$upper),
fixed.mean = tail(pfresults$coda[[i]],1)[1:4],
fixed.variance = tail(pfresults$coda[[i]],1)[5:7],
sd.mean = 1,
sd.variance=1)
})
## Define list of arguments for jags.model()
jm.args <- list(file=file.path(workingDir,"pied_flycatcher_2_jags.R"),inits=inits,n.adapt=1000)
## Define list of arguments for coda.samples()
cs.args <- list(n.iter=5000,thin=10)
## Run the model using dalmatian
## This is how the model is run. However, to save you time we will load output from a previous run instead.
if(runModels){
pfresults2 <- dalmatian(df=pfdata,
mean.model=mymean,
variance.model=myvar,
jags.model.args=jm.args,
coda.samples.args=cs.args,
rounding=TRUE,
lower="lower",
upper="upper",
debug=FALSE)
save(pfresults2,"pfresults2.RData")
}
if(!runModels){
load(system.file("Pied_Flycatchers_1","pfresults2.RData",package="dalmatian"))
}
As before, we can use the wrapper functions within the package to examine the convergence and compute summaries of the posterior distribution.
## Compute convergence diagnostics
pfconvergence2 <- convergence(pfresults2)
## Gelman-Rubin diagnostics
pfconvergence2$gelman
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## alpha.(Intercept) 1.01 1.04
## alpha.log(IVI) 1.00 1.00
## alpha.broodsize 1.01 1.03
## alpha.sex 1.01 1.03
## psi.(Intercept) 1.00 1.00
## psi.broodsize 1.01 1.03
## psi.sex 1.00 1.02
## sd.epsilon.indidx 1.08 1.11
## sd.epsilon.indidx 1.08 1.11
## sd.xi.indidx 1.01 1.02
## Raftery diagnostics
pfconvergence2$raftery
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
## Effective sample size
pfconvergence2$effectiveSize
## alpha.(Intercept) alpha.log(IVI) alpha.broodsize alpha.sex
## 654.0340 829.4722 511.8398 594.6217
## psi.(Intercept) psi.broodsize psi.sex sd.epsilon.indidx
## 790.6854 688.6121 704.4285 558.6042
## sd.epsilon.indidx sd.xi.indidx
## 558.6042 565.8051
## Generate traceplots
pftraceplots2 <- traceplots(pfresults2,plot = FALSE,nthin=100)
## Fixed effects for mean
pftraceplots2$meanFixed
## Fixed effects for variance
pftraceplots2$varianceFixed
## Random effects variances for mean
pftraceplots2$meanRandom
## Random effects variances for variances
pftraceplots2$varianceRandom
## Compute numerical summaries
summary(pfresults2)
##
## Iterations = 1001:5981
## Thinning interval = 20
## Number of chains = 3
## Sample size per chain = 250
##
## Posterior Summary Statistics for Each Model Component
##
## Mean Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -3.08 -3.09 0.17 -3.37 -3.16 -2.94 -2.70
## log(IVI) 0.11 0.11 0.01 0.09 0.10 0.12 0.14
## broodsize 0.11 0.12 0.07 -0.03 0.06 0.15 0.24
## sex -0.10 -0.10 0.07 -0.25 -0.14 -0.04 0.03
##
## Mean Model: Random Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## indidx 0.23 0.23 0.04 0.18 0.21 0.24 0.29
##
## Variance Model: Fixed Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -0.45 -0.45 0.10 -0.63 -0.53 -0.41 -0.26
## broodsize 0.01 0.00 0.04 -0.08 -0.02 0.03 0.09
## sex 0.05 0.05 0.04 -0.03 0.03 0.09 0.13
##
## Variance Model: Random Effects
## Mean Median SD Lower 95% Lower 50% Upper 50% Upper 95%
## indidx 0.11 0.11 0.02 0.07 0.09 0.12 0.16
## Generate caterpillar
pfcaterpillar2 <- caterpillar(pfresults2,plot = FALSE)
## Fixed effects for mean
pfcaterpillar2$meanFixed
## Fixed effects for variance
pfcaterpillar2$varianceFixed
By using ‘prediction’ function, predictions for the mean and variance models can be calculated. The ‘prediction’ function should take a data frame including all variables required and their names defined for the usage of ‘dalmatian’ function. For the Pied Flycatcher data, previously ‘log(IVI)’ was used to define the mean and variance models. Hence, it should be included in the data frame as following:
## Add 'log(IVI)' variable in pfdata
pfdata$'log(IVI)' <- log(pfdata$IVI)
We can now get predictions.
## Plot predictions for the Model 1
pred.pfresults <- fitted(object = pfresults,
df = pfdata,
method = "mean",
ci = TRUE,
level = 0.95)
# predictions for the Model 2
pred.pfresults2 <- fitted(object = pfresults2,
df = pfdata,
method = "mean",
ci = TRUE,
level = 0.95)