In this vignette, we’ll walk through conducting t-tests using infer. We’ll start out with a 1-sample t-test, which compares a sample mean to a hypothesized true mean value. Then, we’ll discuss paired t-tests, a special case of 1-sample t-tests that evaluates whether differences in paired values (e.g. some measure taken of a person before and after an experiment) differ from 0. Finally, we’ll wrap up with 2-sample t-tests, which test the difference in means of two populations using samples of data drawn from them.
Throughout this vignette, we’ll make use of the gss dataset supplied by infer, which contains a sample of data from the General Social Survey. See ?gss for more information on the variables included and their source. Note that these data (and our examples on them) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let’s suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
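The overview below resembles the output of dplyr’s glimpse(); a call along these lines would produce a comparable summary:

# preview the structure of the gss dataset
dplyr::glimpse(gss)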
## Rows: 500
## Columns: 11
## $ year <dbl> 2014, 1994, 1998, 1996, 1994, 1996, 1990, 2016, 2000, 1998, 2…
## $ age <dbl> 36, 34, 24, 42, 31, 32, 48, 36, 30, 33, 21, 30, 38, 49, 25, 5…
## $ sex <fct> male, female, male, male, male, female, female, female, femal…
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d…
## $ partyid <fct> ind, rep, ind, ind, rep, rep, dem, ind, rep, dem, dem, ind, d…
## $ hompop <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5…
## $ hours <dbl> 50, 31, 40, 40, 40, 53, 32, 20, 40, 40, 23, 52, 38, 72, 48, 4…
## $ income <ord> $25000 or more, $20000 - 24999, $25000 or more, $25000 or mor…
## $ class <fct> middle class, working class, working class, working class, mi…
## $ finrela <fct> below average, below average, below average, above average, a…
## $ weight <dbl> 0.8960, 1.0825, 0.5501, 1.0864, 1.0825, 1.0864, 1.0627, 0.478…
The 1-sample t-test can be used to test whether a sample of continuous data could have plausibly come from a population with a specified mean. As an example, we’ll test whether the average American adult works 40 hours a week using data from the gss. To do so, we make use of the hours variable, giving the number of hours that respondents reported having worked in the previous week. The distribution of hours in the observed data looks like this:
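A minimal ggplot2 sketch of this distribution might look like the following (the number of bins is an arbitrary choice, not necessarily the one used for the original figure):

# plot the distribution of hours worked per week
library(ggplot2)

ggplot(gss, aes(x = hours)) +
  geom_histogram(bins = 20) +
  labs(x = "Hours worked per week", y = "Number of responses")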
Note the warning about missing values—many respondents’ entries are missing. If we were actually carrying out this hypothesis test, we might look further into how this data was collected; it’s possible that some of the missing values should actually be 0 hours.
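One quick way to check this is to count the missing hours values directly (a small base R snippet, not part of the original analysis):

# count respondents with a missing value for hours
sum(is.na(gss$hours))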
In general, though, it looks like most respondents reported having worked 40 hours, but there’s quite a bit of variability. Let’s test whether we have evidence that the true mean number of hours that Americans work per week is 40.
First, to calculate the observed statistic, we can use specify() and calculate().
# calculate the observed statistic
observed_statistic <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")
The observed statistic is 2.0852. Now, we want to compare this statistic to a null distribution, generated under the assumption that the mean was actually 40, to get a sense of how likely it would be for us to see this observed statistic if the true number of hours worked per week in the population was really 40.
We can generate the null distribution using the bootstrap. In the bootstrap, for each replicate, a sample of size equal to the input sample size is drawn (with replacement) from the input sample data. This allows us to get a sense of how much variability we’d expect to see in the entire population so that we can then understand how unlikely our sample mean would be.
# generate the null distribution
null_distribution_1_sample <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t")
To get a sense for what this null distribution looks like, and where our observed statistic falls, we can use visualize():
# visualize the null distribution and test statistic!
null_distribution_1_sample %>%
  visualize() +
  shade_p_value(observed_statistic,
                direction = "two-sided")
It looks like our observed statistic of 2.0852 would be relatively unlikely if the true mean was actually 40 hours a week. More exactly, we can calculate the p-value:
# calculate the p value from the test statistic and null distribution
p_value_1_sample <- null_distribution_1_sample %>%
  get_p_value(obs_stat = observed_statistic,
              direction = "two-sided")

p_value_1_sample
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.026
Thus, if the true mean number of hours worked per week was really 40, the probability that we would see a test statistic as or more extreme than 2.0852 is approximately 0.026.
Note that, equivalently to the steps shown above, the package supplies a wrapper function, t_test(), to carry out 1-sample t-tests on tidy data. The syntax looks like this:
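A sketch of that syntax, relying on t_test()’s response and mu arguments and the default two-sided alternative, would be:

# 1-sample t-test: is the mean of hours equal to 40?
t_test(gss, response = hours, mu = 40)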
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2.09 499 0.0376 two.sided 40.1 42.7
Also, to just calculate the observed statistic, the package supplies the wrapper t_stat().
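A sketch of that call, mirroring the arguments used above (and assuming t_stat() accepts the same response and mu arguments), might look like this:

# shortcut for just the observed 1-sample t statistic
t_stat(gss, response = hours, mu = 40)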
## t
## 2.085
You might be interested in running a paired t-test. Paired t-tests can be used in situations where there is a natural pairing between values in two distributions; a common example would be two columns, before and after, say, that contain measurements from a patient before and after some treatment. To compare these two distributions, then, we’re not necessarily interested in how the two distributions look different altogether, but in how the two measurements from each individual change across time. (Pairings don’t necessarily have to be over time; another common usage is measurements from two married people, for example.) Thus, we can create a new column (see mutate() from the dplyr package if you’re not sure how to do this) that is the difference between the two, difference = after - before, and then examine this distribution to see how each individual’s measurements changed over time.
Once we’ve mutate()d that new difference column, we can run a 1-sample t-test on it, where our null hypothesis is that mu = 0 (i.e. the difference between these measurements before and after treatment is, on average, 0). To do so, we’d use the procedure outlined in the above section.
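As a minimal sketch, assuming a hypothetical data frame treatment_data with before and after columns (illustrative names, not part of the gss data) and with dplyr and infer loaded, the workflow might look like:

# hypothetical paired data: treatment_data, before, and after are illustrative names
treatment_data %>%
  mutate(difference = after - before) %>%
  specify(response = difference) %>%
  hypothesize(null = "point", mu = 0) %>%
  calculate(stat = "t")

From there, generate(), visualize(), and get_p_value() (or the t_test() wrapper with mu = 0) apply just as in the 1-sample example above.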
2-sample t-tests evaluate the difference in mean values of two populations, using samples of data randomly drawn from populations that each approximately follow a normal distribution. As an example, we’ll test if Americans work the same number of hours a week regardless of whether they have a college degree or not, using data from the gss. The college and hours variables allow us to do so:
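A ggplot2 sketch comparing the two distributions might look like the following (faceting by college and the number of bins are arbitrary choices):

# compare the distribution of hours across degree status
library(ggplot2)

ggplot(gss, aes(x = hours)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ college) +
  labs(x = "Hours worked per week", y = "Number of responses")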
It looks like both of these distributions are centered near 40 hours a week, but the distribution for those with a degree is slightly right skewed.
Again, note the warning about missing values—many respondents’ values are missing. If we were actually carrying out this hypothesis test, we might look further into how this data was collected; it’s possible that whether or not a value in either of these columns is missing is related to what that value would be.
Now, to calculate the observed \(t\) statistic, we can use specify() and calculate().
# calculate the observed statistic
observed_statistic <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "t", order = c("degree", "no degree"))

observed_statistic
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 1.12
Note that, in the line specify(hours ~ college), we could have swapped this out with the syntax specify(response = hours, explanatory = college)!

The order argument in that calculate() line gives the order to subtract the mean values in: in our case, we’re taking the mean number of hours worked by those with a degree minus the mean number of hours worked by those without a degree; a positive difference, then, would mean that people with degrees worked more than those without a degree.
Now, we want to compare this \(t\)-statistic to a null distribution, generated under the assumption that the number of hours worked a week has no relationship with whether or not one has a college degree, to get a sense of how likely it would be for us to see this \(t\)-statistic if there were really no relationship between these two variables.
We can generate the null distribution using permutation, where, for each replicate, each value of degree status will be randomly reassigned (without replacement) to a new number of hours worked per week in the sample in order to break any association between the two.
# generate the null distribution with randomization
null_distribution_2_sample_permute <- gss %>%
  specify(hours ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("degree", "no degree"))
We can also generate the null distribution using the theoretical \(t\) distribution. The code to do that looks like this:
# generate the null distribution with the theoretical t
null_distribution_2_sample_theoretical <- gss %>%
  specify(hours ~ college) %>%
  hypothesize(null = "independence") %>%
  # generate() isn't used for the theoretical version!
  calculate(stat = "t", order = c("degree", "no degree"))
Again, note that, in the lines specify(hours ~ college) in the above two chunks, we could have used the syntax specify(response = hours, explanatory = college) instead!
To get a sense for what these distributions look like, and where our observed statistic falls, we can use visualize(). First, to do this with the randomization-based null distribution:
# visualize the randomization-based null distribution and test statistic!
null_distribution_2_sample_permute %>%
  visualize() +
  shade_p_value(observed_statistic,
                direction = "two-sided")
The syntax to do the same thing with the theory-based null distribution looks similar; we just pipe in the theory-based null distribution instead, and drop in method = "theoretical" to visualize().
# visualize the theoretical null distribution and test statistic!
null_distribution_2_sample_theoretical %>%
  visualize(method = "theoretical") +
  shade_p_value(observed_statistic,
                direction = "two-sided")
We could also visualize both null distributions to get a sense of how the randomization-based and theoretical versions relate. To do so, start off with the randomization-based null distribution, and then drop in method = "both" to visualize().
# visualize both null distributions and test statistic!
null_distribution_2_sample_permute %>%
  visualize(method = "both") +
  shade_p_value(observed_statistic,
                direction = "two-sided")
Regardless, it looks like our observed statistic of 1.1193 would not be unusual if there were truly no relationship between degree status and number of hours worked. More exactly, we can calculate the p-value; theoretical p-values are not yet supported, so we’ll use the randomization-based null distribution to calculate it.
# calculate the p value from the randomization-based null
# distribution and the observed statistic
p_value_2_sample <- null_distribution_2_sample_permute %>%
  get_p_value(obs_stat = observed_statistic,
              direction = "two-sided")

p_value_2_sample
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.290
Thus, if there were really no relationship between the number of hours worked a week and whether one has a college degree, the probability that we would see a statistic as or more extreme than 1.1193 is approximately 0.29.
Note that, equivalently to the steps shown above, the package supplies a wrapper function, t_test(), to carry out 2-sample t-tests on tidy data. The syntax looks like this:
t_test(x = gss,
       formula = hours ~ college,
       order = c("degree", "no degree"),
       alternative = "two-sided")
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1.12 366. 0.264 two.sided -1.16 4.24
In the above example, we specified the relationship with the syntax formula = hours ~ college; we could have also written response = hours, explanatory = college.