For even quicker usage instructions, see the README.
The general idea of this package is that you can throw nearly any model built in the typical R fashion. For now, we just need a somewhat standard way of extracting fixed effects, and that update
works, which is nearly always the case.
Assuming your model is called base_model
, if coef(summary(base_model))
gives the estimates and standard errors, the method will work. The result of coef
is often a list; this package will then produce estimates for each element of the list. Common examples where there are multiple sets of coefficients are zero-inflated models.
The data should be supplied manually, although it’s currently the case that if the dataframe that creates the model is accessible from base_model$model
, base_model$frame
or base_model@frame
, you won’t have to send the data in to the function separately (but you will get a warning). Be careful here - some models supply this but the method introduces bugs. Much safer to supply the data explicitly.
Let’s work through an example with glmmTMB
. We’ll use the data that comes with glmmboot
.
library(glmmboot)
data(test_data)
head(test_data)
# x_var1 x_var2 x_var3 subj y
# 1 -0.3603511 2.7427146 0.2636070 3 1.00000000
# 2 2.6866123 1.5713332 0.1509874 20 0.08064748
# 3 0.6342672 0.9936967 0.1261363 19 0.43073744
# 4 1.1648519 1.6567148 0.6139765 33 0.98353502
# 5 0.7842167 1.3459927 -0.7063382 1 1.00000000
# 6 2.8261973 1.9342870 1.2697793 3 1.00000000
We’re assuming the x_var
variables are fixed, and subj
is to be treated as a random effect.
Thus our base analysis is:
library(glmmTMB)
model_formula <- as.formula(y ~ x_var1 + x_var2 + x_var2 + (1 | subj))
base_run <- glmmTMB(formula = model_formula,
data = test_data,
family = binomial)
# Warning in eval(family$initialize): non-integer #successes in a binomial glm!
We get a warning because the outcome data is proportional. Not to worry.
Now we’ll use the bootstrap. By default it’ll perform block bootstrapping over the highest entropy random effect - but there’s only one, so of course the winner is subj
.
bootstrap_over_subj <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 99)
For publications etc, better to run about ten thousand resamples to avoid noise having much of an effect. Of course 99 is far too small, only for an example.
And comparing results:
print(bootstrap_over_subj)
# estimate boot 2.5% boot 97.5% boot p_value base p_value base 2.5%
# (Intercept) 0.6018968 -0.1050 1.2018 0.14 0.1636 -0.2449
# x_var1 0.1909343 -0.1192 0.5888 0.24 0.2338 -0.1234
# x_var2 0.3140507 0.0535 0.5376 0.04 0.0396 0.0150
# base 97.5% boot/base width
# (Intercept) 1.4487 0.7715774
# x_var1 0.5052 1.1263951
# x_var2 0.6131 0.8093863
The above might take a long time in a real setting. If it takes far too long on your machine, you can ideally run it on a bunch of computers. We don’t want each computer to output the fully processed output, only the intermediate outcome. To do this, we set return_coefs_instead = TRUE
for each run:
b_list1 <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 29,
return_coefs_instead = TRUE)
b_list2 <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 30,
return_coefs_instead = TRUE)
b_list3 <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 30,
return_coefs_instead = TRUE)
Combining this is simple enough. If we’ve used a few, we don’t want to mess around with even more lists, so we can enter them into the relevant function:
print(combine_resampled_lists(b_list1, b_list2, b_list3))
# estimate boot 2.5% boot 97.5% boot p_value base p_value base 2.5%
# (Intercept) 0.6018968 0.0168 1.0446 0.0444 0.1636 -0.2449
# x_var1 0.1909343 -0.1379 0.4916 0.4222 0.2338 -0.1234
# x_var2 0.3140507 0.0992 0.5177 0.0222 0.0396 0.0150
# base 97.5% boot/base width
# (Intercept) 1.4487 0.6068176
# x_var1 0.5052 1.0014936
# x_var2 0.6131 0.6997574
If we’ve run a huge number of such runs, ideally we’ll combine all output to a list of lists, like so:
And we’ll get the same result:
print(combine_resampled_lists(list_of_lists_output))
# estimate boot 2.5% boot 97.5% boot p_value base p_value base 2.5%
# (Intercept) 0.6018968 0.0168 1.0446 0.0444 0.1636 -0.2449
# x_var1 0.1909343 -0.1379 0.4916 0.4222 0.2338 -0.1234
# x_var2 0.3140507 0.0992 0.5177 0.0222 0.0396 0.0150
# base 97.5% boot/base width
# (Intercept) 1.4487 0.6068176
# x_var1 0.5052 1.0014936
# x_var2 0.6131 0.6997574
You MUST set return_coefs_instead = TRUE
for methods like these that combine output.
There are two primary ways to run models in parallel:
parallel
: this will use parallel::mclapply
to run the models.future
: this will use future.apply::future_lapply
to run the modelsUse he parameter parallelism
in the function bootstrap_model()
to choose how to parallelise.
parallelism = "none"
This is the default, and will run each model sequentially, using lapply()
. If you set num_cores
to a number greater than 1, you’ll get an error with this option.
This isn’t the default if you set num_cores
to a value greater than 1.
parallelism = "parallel"
This will use parallel::mclapply
to run the models. If you set num_cores
, then that many cores will be used. If you don’t set num_cores
, the function will use num_cores = parallel::detectCores() - 1
.
Example:
## will use `parallel::detectCores() - 1` cores
model_results <- bootstrap_model(base_model = some_base_run,
base_data = some_interesting_data,
resamples = 9999,
parallelism = "parallel")
## will use 4 cores:
model_results <- bootstrap_model(base_model = some_base_run,
base_data = some_interesting_data,
resamples = 9999,
parallelism = "parallel",
num_cores = 4)
I’ve heard that parallel::mclapply
doesn’t play well on Windows. Instead of implementing parallel::parLapply
(which I will do if there’s interest), I would recommend using parallelism = "future"
, see below.
This becomes the default if you don’t set parallelism
but just num_cores
.
parallelism = "future"
This uses the very nice future.apply::future_lapply
function to run the models in parallel. Note that you MUST set the future::plan
(see the docs for the future
and future.apply
packages) to actually make use of multiple cores etc, or else you’ll just get a sequential run. Using num_cores
is NOT the right way to set the backend, and will cause an error.
Further, in many cases you should supply any packages required for the model to run with the argument future_packages
, because the S3 generic update()
is used to update the model, which won’t ship the required globals. You don’t need this if using plan(multicore)
, since all futures will have access to the same shared memory (and you also don’t need it if your model is a base model, e.g. glm
).
This should work well with Windows.
Example:
num_cores
If you don’t touch the parallelism
argument, but just set num_cores
, e.g.:
model_results <- bootstrap_model(base_model = some_base_run,
base_data = some_interesting_data,
resamples = 9999,
num_cores = 8)
It’ll be as if you set parallelism = "parallel"
, and will use the parallel::mclapply
backend (unless you set num_cores = 1
!).
Unfortunately currently none of these methods have any progress bars setup. There used to be a call to a parallel package that did this, but it seemed to have some bugs. Hopefully this will be possible in the future.
The (not great) options for now: * run a small set to estimate timing before running a large set of models * using the split approach (see the Combining Runs approach above), print out some intermediate output between runs.
Let’s look at an example of the second approach, using the future
argument. First, we’ll write a somewhat rough logging function:
log_remaining <- function(start_time,
cur_time,
j,
total_iters,
time_units = "hours"){
cur_time <- Sys.time()
total_time <- difftime(cur_time, start_time, units = time_units)
est_remaining <- (total_iters - j) * (total_time / j)
paste0("[", cur_time, "] [iteration ", j, "] ",
"total ", time_units, ": ", round(total_time, 3), " // ",
"est remaining ", time_units, ": ", round(est_remaining, 3))
}
This looks like:
total_iterations <- 5
start_time <- Sys.time()
for (j in 1:total_iterations) {
## simulate expensive operation...
Sys.sleep(2)
print(log_remaining(start_time, Sys.time(), j, total_iterations, "secs"))
}
# [1] "[2020-03-29 15:42:09] [iteration 1] total secs: 2.004 // est remaining secs: 8.017"
# [1] "[2020-03-29 15:42:11] [iteration 2] total secs: 4.015 // est remaining secs: 6.023"
# [1] "[2020-03-29 15:42:13] [iteration 3] total secs: 6.02 // est remaining secs: 4.014"
# [1] "[2020-03-29 15:42:15] [iteration 4] total secs: 8.025 // est remaining secs: 2.006"
# [1] "[2020-03-29 15:42:17] [iteration 5] total secs: 10.028 // est remaining secs: 0"
As before, don’t forget return_coefs_instead = TRUE
. We’d run:
library(future)
plan("multiprocess")
results_list <- list()
num_blocks <- 50
runs_per_block <- 100
start_time <- Sys.time()
for (j in 1:num_blocks) {
results_list[[j]] <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = runs_per_block,
parallelism = "future",
return_coefs_instead = TRUE,
suppress_sampling_message = TRUE)
print(log_remaining(start_time, Sys.time(), j, num_blocks, "secs"))
}
combined_results <- combine_resampled_lists(results_list)
Of course with 50 blocks of size 100, we’d get the equivalent of 5000 samples.
Let’s first reuse the model from the glmmTMB vignettes:
owls <- transform(Owls,
nest = reorder(Nest, NegPerChick),
ncalls = SiblingNegotiation,
ft = FoodTreatment)
fit_zipoisson <- glmmTMB(
ncalls ~ (ft + ArrivalTime) * SexParent +
offset(log(BroodSize)) + (1 | nest),
data = owls,
ziformula = ~1,
family = poisson)
We run this more or less the same way as the other models:
In this case the output would be a list with two entries, one matrix in zero_boot$cond
, the typical conditional
model, and another matrix (in this case a single row, because ziformula = ~1
) in zero_cond$zi
, each giving the same shape output as we’ve seen above.
In this particular model, the bootstrap model is quite conservative. Assumptions are often useful!