Full infer Pipeline Examples

Introduction

This vignette provides a set of examples that demonstrate nearly all of the functionality provided by infer. Commentary on these examples is limited; for more discussion of the intuition behind the package, see the “Getting to Know infer” vignette, accessible by calling vignette("infer").

Throughout this vignette, we’ll make use of the gss dataset supplied by infer, which contains a sample of data from the General Social Survey. See ?gss for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let’s suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:

# load the package and the dataset
library(infer)
data(gss)

# take a look at its structure
dplyr::glimpse(gss)
## Rows: 500
## Columns: 11
## $ year    <dbl> 2014, 1994, 1998, 1996, 1994, 1996, 1990, 2016, 2000, 1998, 2…
## $ age     <dbl> 36, 34, 24, 42, 31, 32, 48, 36, 30, 33, 21, 30, 38, 49, 25, 5…
## $ sex     <fct> male, female, male, male, male, female, female, female, femal…
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d…
## $ partyid <fct> ind, rep, ind, ind, rep, rep, dem, ind, rep, dem, dem, ind, d…
## $ hompop  <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5…
## $ hours   <dbl> 50, 31, 40, 40, 40, 53, 32, 20, 40, 40, 23, 52, 38, 72, 48, 4…
## $ income  <ord> $25000 or more, $20000 - 24999, $25000 or more, $25000 or mor…
## $ class   <fct> middle class, working class, working class, working class, mi…
## $ finrela <fct> below average, below average, below average, above average, a…
## $ weight  <dbl> 0.8960, 1.0825, 0.5501, 1.0864, 1.0825, 1.0864, 1.0627, 0.478…

Hypothesis tests

One numerical variable (mean)

Calculating the observed statistic,

x_bar <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

Then, generating the null distribution,

null_distn <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "mean")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = x_bar, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = x_bar, direction = "two-sided")
p_value
0.026
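
To make the last step concrete, the following base R sketch shows one common way to turn a simulated null distribution into a two-sided p-value: take the proportion of simulated statistics in each tail relative to the observed value, then double the smaller one. The vector null_stats and the value obs are made-up stand-ins for the stat column of null_distn and for x_bar; this illustrates the idea and is not necessarily infer's exact rule.

```r
# Hypothetical inputs: a made-up null distribution and observed statistic.
set.seed(1)
null_stats <- rnorm(1000, mean = 40, sd = 0.65)  # stand-in for null_distn$stat
obs <- 41.4                                      # stand-in for x_bar$stat

# Proportion of simulated statistics at least as extreme in each tail.
p_left  <- mean(null_stats <= obs)
p_right <- mean(null_stats >= obs)

# Two-sided p-value: double the smaller tail, capped at 1.
p_two_sided <- min(1, 2 * min(p_left, p_right))
p_two_sided
```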

One numerical variable (standardized mean \(t\))

Calculating the observed statistic,

t_bar <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

Alternatively, using the wrapper to calculate the test statistic,

t_bar <- gss %>%
  t_stat(response = hours, mu = 40)

Then, generating the null distribution,

null_distn <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "t")

Alternatively, finding the null distribution using theoretical methods by skipping the generate() step,

null_distn_theoretical <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = t_bar, direction = "two-sided")

Alternatively, visualizing the observed statistic using the theory-based null distribution,

visualize(null_distn_theoretical, method = "theoretical") +
  shade_p_value(obs_stat = t_bar, direction = "two-sided")

Alternatively, visualizing the observed statistic using both of the null distributions,

visualize(null_distn, method = "both") +
  shade_p_value(obs_stat = t_bar, direction = "two-sided")

Note that the above code makes use of the randomization-based null distribution.

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = t_bar, direction = "two-sided")
p_value
0.044

Alternatively, using the t_test wrapper:

gss %>%
  t_test(response = hours, mu = 40)
statistic t_df p_value alternative lower_ci upper_ci
2.085 499 0.0376 two.sided 40.08 42.68
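
For reference, the theory-based quantities reported above can be reproduced by hand. The sketch below uses simulated data (hours_sim is made up and is not gss$hours) to compute the one-sample t statistic and its two-sided p-value directly, then compares them with base R's t.test().

```r
# Made-up data standing in for a numeric response; not the gss values.
set.seed(2)
hours_sim <- rnorm(500, mean = 41, sd = 14)
mu0 <- 40  # hypothesized mean under the null

# t = (sample mean - hypothesized mean) / (standard error of the mean)
n <- length(hours_sim)
t_by_hand <- (mean(hours_sim) - mu0) / (sd(hours_sim) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_by_hand <- 2 * pt(-abs(t_by_hand), df = n - 1)

# The same quantities from stats::t.test().
fit <- t.test(hours_sim, mu = mu0)
c(t_by_hand = t_by_hand, t_test = unname(fit$statistic))
c(p_by_hand = p_by_hand, t_test = fit$p.value)
```

With 500 observations the degrees of freedom are 499, matching the t_df column in the output above.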

One numerical variable (median)

Calculating the observed statistic,

x_tilde <- gss %>%
  specify(response = age) %>%
  calculate(stat = "median")

Then, generating the null distribution,

null_distn <- gss %>%
  specify(response = age) %>%
  hypothesize(null = "point", med = 40) %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "median")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = x_tilde, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = x_tilde, direction = "two-sided")
p_value
0.008

One categorical (one proportion)

Calculating the observed statistic,

p_hat <- gss %>%
  specify(response = sex, success = "female") %>%
  calculate(stat = "prop")

Then, generating the null distribution,

null_distn <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = p_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = p_hat, direction = "two-sided")
p_value
0.248

Note that logical variables will be coerced to factors:

null_distn <- gss %>%
  dplyr::mutate(is_female = (sex == "female")) %>%
  specify(response = is_female, success = "TRUE") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")
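
The success value is written as the string "TRUE" because a logical vector converted to a factor has character levels. A small base R sketch with made-up values shows what the coercion produces:

```r
# Made-up logical values, standing in for the is_female column.
is_female <- c(TRUE, FALSE, TRUE, TRUE)

# Converting to a factor yields the character levels "FALSE" and "TRUE".
f <- factor(is_female)
levels(f)
# → "FALSE" "TRUE"

# The proportion of "successes" is then computed against the level name.
mean(f == "TRUE")
# → 0.75
```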

One categorical variable (standardized proportion \(z\))

While the standardized proportion \(z\) statistic has not yet been implemented in the randomization-based framework, the package supplies a wrapper around prop.test to allow for tests of a single proportion on tidy data.

prop_test(gss,
          college ~ NULL,
          p = .2)
statistic chisq_df p_value alternative
635.6 1 0 two.sided
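
As a point of comparison, here is a sketch of the underlying base R call on made-up counts (120 successes out of 500 are hypothetical, not the gss counts). It also recomputes the continuity-corrected chi-squared statistic by hand, assuming prop.test()'s default of correct = TRUE.

```r
# Hypothetical counts: 120 successes in 500 trials, null proportion 0.2.
x <- 120
n <- 500
p0 <- 0.2

fit <- prop.test(x = x, n = n, p = p0)

# Continuity-corrected statistic: (|x - n * p0| - 0.5)^2 / (n * p0 * (1 - p0))
stat_by_hand <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))

c(by_hand = stat_by_hand, prop_test = unname(fit$statistic))
```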

Two categorical (2 level) variables

The infer package provides several statistics to work with data of this type. One of them is the statistic for difference in proportions.

Calculating the observed statistic,

d_hat <- gss %>% 
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "diff in props", order = c("female", "male"))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "diff in props", order = c("female", "male"))

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = d_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
p_value
1

infer also provides functionality to calculate ratios of proportions. The workflow looks similar to that for diff in props.

Calculating the observed statistic,

r_hat <- gss %>% 
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "ratio of props", order = c("female", "male"))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "ratio of props", order = c("female", "male"))

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = r_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = r_hat, direction = "two-sided")
p_value
0.964

In addition, the package provides functionality to calculate odds ratios. The workflow also looks similar to that for diff in props.

Calculating the observed statistic,

or_hat <- gss %>% 
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "odds ratio", order = c("female", "male"))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "odds ratio", order = c("female", "male"))

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = or_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = or_hat, direction = "two-sided")
p_value
1
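
The three statistics in this section are different summaries of the same two group proportions. The sketch below computes each from made-up counts (the numbers are hypothetical, not taken from gss), using the same female-then-male ordering as the order argument above.

```r
# Hypothetical counts of "no degree" responses (successes) and group sizes.
succ  <- c(female = 170, male = 156)
total <- c(female = 263, male = 237)

# Proportion of successes within each group.
p <- succ / total

# Difference and ratio of proportions, female relative to male.
diff_in_props  <- p[["female"]] - p[["male"]]
ratio_of_props <- p[["female"]] / p[["male"]]

# Odds of success within each group, then their ratio.
odds <- p / (1 - p)
odds_ratio <- odds[["female"]] / odds[["male"]]

c(diff = diff_in_props, ratio = ratio_of_props, odds_ratio = odds_ratio)
```

Under independence the difference is centered at 0 while the ratio and the odds ratio are centered at 1, which is worth keeping in mind when reading the corresponding null distribution plots.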

Two categorical (2 level) variables (z)

Finding the standardized observed statistic,

z_hat <- gss %>% 
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "z", order = c("female", "male"))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "z", order = c("female", "male"))

Alternatively, finding the null distribution using theoretical methods by skipping the generate() step,

null_distn_theoretical <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%  
  calculate(stat = "z", order = c("female", "male"))

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = z_hat, direction = "two-sided")

Alternatively, visualizing the observed statistic using the theory-based null distribution,

visualize(null_distn_theoretical, method = "theoretical") +
  shade_p_value(obs_stat = z_hat, direction = "two-sided")

Alternatively, visualizing the observed statistic using both of the null distributions,

visualize(null_distn, method = "both") +
  shade_p_value(obs_stat = z_hat, direction = "two-sided")

Note that the above code makes use of the randomization-based null distribution.

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = z_hat, direction = "two-sided")
p_value
1

Note the similarities in this plot and the previous one.

The package also supplies a wrapper around prop.test to allow for tests of equality of proportions on tidy data.

prop_test(gss, 
          college ~ sex,  
          order = c("female", "male"))
statistic chisq_df p_value alternative lower_ci upper_ci
0 1 0.9964 two.sided -0.1009 0.0917

One categorical (>2 level) - GoF

Calculating the observed statistic,

Note the need to add in the hypothesized values here to compute the observed statistic.

Chisq_hat <- gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  calculate(stat = "Chisq")

Alternatively, using the chisq_stat wrapper to calculate the test statistic,

Chisq_hat <- gss %>%
  chisq_stat(response = finrela,
             p = c("far below average" = 1/6,
                   "below average" = 1/6,
                   "average" = 1/6,
                   "above average" = 1/6,
                   "far above average" = 1/6,
                   "DK" = 1/6))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  generate(reps = 1000, type = "simulate") %>%
  calculate(stat = "Chisq")

Alternatively, finding the null distribution using theoretical methods by skipping the generate() step,

null_distn_theoretical <- gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  calculate(stat = "Chisq")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Alternatively, visualizing the observed statistic using the theory-based null distribution,

visualize(null_distn_theoretical, method = "theoretical") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Alternatively, visualizing the observed statistic using both of the null distributions,

visualize(null_distn, method = "both") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Note that the above code makes use of the randomization-based null distribution.

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
p_value
0

Alternatively, using the chisq_test wrapper:

chisq_test(gss, 
           response = finrela,
           p = c("far below average" = 1/6,
                 "below average" = 1/6,
                 "average" = 1/6,
                 "above average" = 1/6,
                 "far above average" = 1/6,
                 "DK" = 1/6))
statistic chisq_df p_value
488 5 0
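
The goodness-of-fit statistic compares observed category counts with the counts expected under the hypothesized proportions. A minimal base R sketch on made-up counts (obs_counts is hypothetical, not the finrela counts) computes it by hand and checks it against chisq.test():

```r
# Hypothetical observed counts for six categories.
obs_counts <- c(10, 60, 200, 150, 70, 10)

# Null hypothesis: all six categories are equally likely.
p0 <- rep(1 / 6, 6)
expected <- sum(obs_counts) * p0

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chisq_by_hand <- sum((obs_counts - expected)^2 / expected)

fit <- chisq.test(obs_counts, p = p0)
c(by_hand = chisq_by_hand, chisq_test = unname(fit$statistic))
```

With six categories the test has 6 - 1 = 5 degrees of freedom, as in the chisq_df column above.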

Two categorical (>2 level): Chi-squared test of independence

Calculating the observed statistic,

Chisq_hat <- gss %>%
  specify(formula = finrela ~ sex) %>% 
  calculate(stat = "Chisq")

Alternatively, using the wrapper to calculate the test statistic,

Chisq_hat <- gss %>%
  chisq_stat(formula = finrela ~ sex)

Then, generating the null distribution,

null_distn <- gss %>%
  specify(finrela ~ sex) %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "Chisq")

Alternatively, finding the null distribution using theoretical methods by skipping the generate() step,

null_distn_theoretical <- gss %>%
  specify(finrela ~ sex) %>%
  hypothesize(null = "independence") %>% 
  calculate(stat = "Chisq")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Alternatively, visualizing the observed statistic using the theory-based null distribution,

visualize(null_distn_theoretical, method = "theoretical") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Alternatively, visualizing the observed statistic using both of the null distributions,

visualize(null_distn, method = "both") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")

Note that the above code makes use of the randomization-based null distribution.

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
p_value
0.092

Alternatively, using the wrapper to carry out the test,

gss %>%
  chisq_test(formula = finrela ~ sex)
statistic chisq_df p_value
9.105 5 0.1049
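
For the test of independence, the expected counts come from the row and column totals of the two-way table. The sketch below uses a small made-up 3-by-2 table (fewer levels than finrela actually has) to compute the statistic by hand and compare it with chisq.test().

```r
# Hypothetical two-way table of counts; not the gss data.
tab <- matrix(
  c(30, 45, 25,
    35, 40, 25),
  nrow = 3,
  dimnames = list(finrela = c("below", "average", "above"),
                  sex = c("female", "male"))
)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

chisq_by_hand <- sum((tab - expected)^2 / expected)

fit <- chisq.test(tab, correct = FALSE)
c(by_hand = chisq_by_hand, chisq_test = unname(fit$statistic))
```

The degrees of freedom are (rows - 1) * (columns - 1), so 2 for this toy table and 5 for the six-level finrela against the two-level sex.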

One numerical variable, one categorical (2 levels) (diff in means)

Calculating the observed statistic,

d_hat <- gss %>% 
  specify(age ~ college) %>% 
  calculate(stat = "diff in means", order = c("degree", "no degree"))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = d_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
p_value
0.464
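
The permutation approach used here can be sketched in a few lines of base R: shuffle the group labels, recompute the difference in means, and repeat. The data below are simulated and the helper diff_in_means() is a hypothetical name introduced for illustration; this is not infer's implementation.

```r
# Made-up data: a numeric response and a two-level grouping variable.
set.seed(3)
y <- c(rnorm(60, mean = 40, sd = 12), rnorm(90, mean = 41, sd = 12))
g <- rep(c("degree", "no degree"), times = c(60, 90))

# Hypothetical helper: mean of "degree" minus mean of "no degree".
diff_in_means <- function(y, g) {
  mean(y[g == "degree"]) - mean(y[g == "no degree"])
}

obs_diff <- diff_in_means(y, g)

# Under independence the labels are exchangeable, so shuffle them.
perm_diffs <- replicate(1000, diff_in_means(y, sample(g)))

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- min(1, 2 * min(mean(perm_diffs <= obs_diff),
                          mean(perm_diffs >= obs_diff)))
p_value
```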

One numerical variable, one categorical (2 levels) (t)

Finding the standardized observed statistic,

t_hat <- gss %>% 
  specify(age ~ college) %>% 
  calculate(stat = "t", order = c("degree", "no degree"))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("degree", "no degree"))

Alternatively, finding the null distribution using theoretical methods by skipping the generate() step,

null_distn_theoretical <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "t", order = c("degree", "no degree"))

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")

Alternatively, visualizing the observed statistic using the theory-based null distribution,

visualize(null_distn_theoretical, method = "theoretical") +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")

Alternatively, visualizing the observed statistic using both of the null distributions,

visualize(null_distn, method = "both") +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")

Note that the above code makes use of the randomization-based null distribution.

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = t_hat, direction = "two-sided")
p_value
0.412

Note the similarities in this plot and the previous one.

One numerical variable, one categorical (2 levels) (diff in medians)

Calculating the observed statistic,

d_hat <- gss %>% 
  specify(age ~ college) %>% 
  calculate(stat = "diff in medians", order = c("degree", "no degree"))

Then, generating the null distribution,

null_distn <- gss %>%
  specify(age ~ college) %>% # alt: response = age, explanatory = college
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in medians", order = c("degree", "no degree"))

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = d_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
p_value
0.172

One numerical, one categorical (>2 levels) - ANOVA

Calculating the observed statistic,

F_hat <- gss %>% 
  specify(age ~ partyid) %>%
  calculate(stat = "F")

Then, generating the null distribution,

null_distn <- gss %>%
   specify(age ~ partyid) %>%
   hypothesize(null = "independence") %>%
   generate(reps = 1000, type = "permute") %>%
   calculate(stat = "F")

Alternatively, finding the null distribution using theoretical methods by skipping the generate() step,

null_distn_theoretical <- gss %>%
   specify(age ~ partyid) %>%
   hypothesize(null = "independence") %>%
   calculate(stat = "F")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = F_hat, direction = "greater")

Alternatively, visualizing the observed statistic using the theory-based null distribution,

visualize(null_distn_theoretical, method = "theoretical") +
  shade_p_value(obs_stat = F_hat, direction = "greater")

Alternatively, visualizing the observed statistic using both of the null distributions,

visualize(null_distn, method = "both") +
  shade_p_value(obs_stat = F_hat, direction = "greater")

Note that the above code makes use of the randomization-based null distribution.

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = F_hat, direction = "greater")
p_value
0.047
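
The F statistic is the ratio of between-group to within-group variability. The following sketch computes it by hand on simulated data with three made-up groups (fewer than partyid has) and compares the result with base R's anova() on a linear model.

```r
# Made-up data: a numeric response and a three-level factor.
set.seed(4)
grp <- factor(rep(c("dem", "ind", "rep"), each = 40))
age_sim <- rnorm(120, mean = 40, sd = 13)

k <- nlevels(grp)
n <- length(age_sim)

# Between-group sum of squares: group sizes times squared deviations
# of the group means from the grand mean.
grp_means <- tapply(age_sim, grp, mean)
grp_sizes <- tapply(age_sim, grp, length)
ss_between <- sum(grp_sizes * (grp_means - mean(age_sim))^2)

# Within-group sum of squares: deviations from each observation's group mean.
ss_within <- sum((age_sim - ave(age_sim, grp))^2)

f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))

f_anova <- anova(lm(age_sim ~ grp))[["F value"]][1]
c(by_hand = f_by_hand, anova = f_anova)
```

Because F is a ratio of non-negative quantities, only large values count as evidence against the null, which is why the examples above use direction = "greater".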

Two numerical vars - SLR

Calculating the observed statistic,

slope_hat <- gss %>% 
  specify(hours ~ age) %>% 
  calculate(stat = "slope")

Then, generating the null distribution,

null_distn <- gss %>%
   specify(hours ~ age) %>% 
   hypothesize(null = "independence") %>%
   generate(reps = 1000, type = "permute") %>%
   calculate(stat = "slope")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = slope_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = slope_hat, direction = "two-sided")
p_value
0.852

Two numerical vars - correlation

Calculating the observed statistic,

correlation_hat <- gss %>% 
  specify(hours ~ age) %>% 
  calculate(stat = "correlation")

Then, generating the null distribution,

null_distn <- gss %>%
   specify(hours ~ age) %>% 
   hypothesize(null = "independence") %>%
   generate(reps = 1000, type = "permute") %>%
   calculate(stat = "correlation")

Visualizing the observed statistic alongside the null distribution,

visualize(null_distn) +
  shade_p_value(obs_stat = correlation_hat, direction = "two-sided")

Calculating the p-value from the null distribution and observed statistic,

null_distn %>%
  get_p_value(obs_stat = correlation_hat, direction = "two-sided")
p_value
0.9
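
The slope and correlation statistics from the last two sections are closely related: in simple linear regression the fitted slope equals the correlation rescaled by the ratio of standard deviations. A short base R check on simulated data (x and y are made up, not gss columns):

```r
# Made-up explanatory and response variables.
set.seed(5)
x <- rnorm(200, mean = 40, sd = 13)
y <- 41 + 0.02 * x + rnorm(200, mean = 0, sd = 14)

# Slope from a least-squares fit.
slope_lm <- unname(coef(lm(y ~ x))[2])

# Slope recovered from the correlation: r * sd(y) / sd(x).
r <- cor(x, y)
slope_from_r <- r * sd(y) / sd(x)

c(lm = slope_lm, from_r = slope_from_r)
```

Since the rescaling factor is positive, the two statistics always share a sign, so it is unsurprising that permutation tests based on either give similar p-values.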

Two numerical vars - SLR (t)

Not currently implemented since \(t\) could refer to standardized slope or standardized correlation.

Confidence intervals

One numerical (one mean)

Finding the observed statistic,

x_bar <- gss %>% 
  specify(response = hours) %>%
  calculate(stat = "mean")

Then, generating a bootstrap distribution,

boot <- gss %>%
   specify(response = hours) %>%
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "mean")

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- get_ci(boot, type = "se", point_estimate = x_bar)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)
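
The two interval types differ only in how they use the bootstrap replicates. The base R sketch below, on simulated data, builds a 95% percentile interval from the quantiles of the bootstrap means and a standard-error interval from the observed mean plus or minus a normal multiplier times the bootstrap standard deviation. The data are made up and the 95% level is an assumption chosen for illustration; this is a sketch of the idea, not infer's code.

```r
# Made-up sample standing in for a numeric response.
set.seed(6)
x <- rnorm(500, mean = 41, sd = 14)

# Bootstrap: resample with replacement and recompute the mean each time.
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# Percentile interval: middle 95% of the bootstrap statistics.
percentile_ci <- quantile(boot_means, c(0.025, 0.975))

# Standard-error interval: point estimate +/- z * sd(bootstrap statistics).
se_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(boot_means)

percentile_ci
se_ci
```

When the bootstrap distribution is roughly symmetric and bell-shaped, as it is for a mean with this sample size, the two intervals should be close.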

One numerical (one mean - standardized)

Finding the observed statistic,

t_hat <- gss %>% 
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

Then, generating a bootstrap distribution,

boot <- gss %>%
   specify(response = hours) %>%
   hypothesize(null = "point", mu = 40) %>%
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "t")

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = t_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

One categorical (one proportion)

Finding the observed statistic,

p_hat <- gss %>% 
   specify(response = sex, success = "female") %>%
   calculate(stat = "prop")

Then, generating a bootstrap distribution,

boot <- gss %>%
 specify(response = sex, success = "female") %>%
 generate(reps = 1000, type = "bootstrap") %>%
 calculate(stat = "prop")

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = p_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

One categorical variable (standardized proportion \(z\))

Not yet implemented.

One numerical variable, one categorical (2 levels) (diff in means)

Finding the observed statistic,

d_hat <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))

Then, generating a bootstrap distribution,

boot <- gss %>%
   specify(hours ~ college) %>%
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "diff in means", order = c("degree", "no degree"))

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = d_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

One numerical variable, one categorical (2 levels) (t)

Finding the standardized point estimate,

t_hat <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "t", order = c("degree", "no degree"))

Then, generating a bootstrap distribution,

boot <- gss %>%
   specify(hours ~ college) %>%
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "t", order = c("degree", "no degree"))

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = t_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

Two categorical variables (diff in proportions)

Finding the observed statistic,

d_hat <- gss %>% 
  specify(college ~ sex, success = "degree") %>%
  calculate(stat = "diff in props", order = c("female", "male"))

Then, generating a bootstrap distribution,

boot <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "diff in props", order = c("female", "male"))

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = d_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

Two categorical variables (z)

Finding the standardized point estimate,

z_hat <- gss %>% 
  specify(college ~ sex, success = "degree") %>%
  calculate(stat = "z", order = c("female", "male"))

Then, generating a bootstrap distribution,

boot <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "z", order = c("female", "male"))

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = z_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

Two numerical vars - SLR

Finding the observed statistic,

slope_hat <- gss %>% 
  specify(hours ~ age) %>%
  calculate(stat = "slope")

Then, generating a bootstrap distribution,

boot <- gss %>%
   specify(hours ~ age) %>% 
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "slope")

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = slope_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

Two numerical vars - correlation

Finding the observed statistic,

correlation_hat <- gss %>% 
  specify(hours ~ age) %>%
  calculate(stat = "correlation")

Then, generating a bootstrap distribution,

boot <- gss %>%
   specify(hours ~ age) %>% 
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "correlation")

Use the bootstrap distribution to find a confidence interval,

percentile_ci <- get_ci(boot)

Visualizing the confidence interval alongside the bootstrap distribution,

visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci)

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

standard_error_ci <- boot %>%
  get_ci(type = "se", point_estimate = correlation_hat)

visualize(boot) +
  shade_confidence_interval(endpoints = standard_error_ci)

Two numerical vars - t

Not currently implemented since \(t\) could refer to standardized slope or standardized correlation.