Note: This is a work in progress…

This vignette discusses the basics of using Difference-in-Differences (DID) designs to identify and estimate the average effect of participating in a treatment with a particular focus on tools from the did package.

A Running Example

Throughout the vignette, we use a subset of data that comes from Callaway and Sant'Anna (2019). This is a dataset that contains county-level teen employment rates from 2003-2007. The data can be loaded by

library(did)
data(mpdta)

mpdta is a balanced panel with 2500 observations, and the first few rows of the dataset look like this:

head(mpdta)
#>     year countyreal     lpop     lemp first.treat treat
#> 866 2003       8001 5.896761 8.461469        2007     1
#> 841 2004       8001 5.896761 8.336870        2007     1
#> 842 2005       8001 5.896761 8.340217        2007     1
#> 819 2006       8001 5.896761 8.378161        2007     1
#> 827 2007       8001 5.896761 8.487352        2007     1
#> 937 2003       8019 2.232377 4.997212        2007     1

In particular applications, the dataset should have this same structure, with the key parts being:

- an id variable that identifies each unit (here, countyreal)
- a time variable (here, year)
- a variable recording the period in which a unit first becomes treated (here, first.treat), which is set to 0 for units that never participate in the treatment
- an outcome variable (here, lemp, the log of county-level teen employment)

One additional comment about the data structure: mpdta happens to be a balanced panel, but the did package can also accommodate unbalanced panels and repeated cross sections data.
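As a quick, illustrative check of the panel structure (a sketch; the variable names are taken from the head() output above), note that 2500 observations over the 5 years 2003-2007 imply 500 counties if the panel is balanced:

# number of distinct counties in the data
length(unique(mpdta$countyreal))

# number of observations in each year
table(mpdta$year)

# number of years each county is observed (all should be 5 in a balanced panel)
table(table(mpdta$countyreal))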

Identification

First, we provide a brief overview of how identification works as well as parameters of interest in DID designs.

The main identifying assumption in DID designs is called a parallel trends assumption. Let \(Y_{it}(0)\) denote an individual's untreated “potential” outcome in time period \(t\) and \(Y_{it}(1)\) denote an individual's treated “potential” outcome in time period \(t\). Let \(D_i = 1\) for individuals that participate in the treatment and \(D_i = 0\) for individuals that do not, so that the observed outcome for an individual is \(Y_{it} = D_i Y_{it}(1) + (1-D_i)Y_{it}(0)\). Writing \(\Delta Y_t(0) = Y_t(0) - Y_{t-1}(0)\) for the change in untreated potential outcomes, the (conditional) parallel trends assumption says that, conditional on covariates \(X\), the average change in untreated potential outcomes is the same for the treated and untreated groups: \begin{align} E[\Delta Y_t(0) | X, D=1] = E[\Delta Y_t(0)|X,D=0] \end{align}
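To see why parallel trends identifies the average treatment effect on the treated (ATT), it may help to sketch the argument in the simplest case: two periods, no covariates, and no one treated in period \(t-1\) (the conditional-on-\(X\) version follows the same logic). Under these assumptions, \begin{align} ATT &= E[Y_t(1) - Y_t(0) | D=1] \\ &= E[Y_t(1) - Y_{t-1}(0) | D=1] - E[Y_t(0) - Y_{t-1}(0) | D=1] \\ &= E[\Delta Y_t | D=1] - E[\Delta Y_t(0) | D=1] \\ &= E[\Delta Y_t | D=1] - E[\Delta Y_t | D=0] \end{align} where the second equality adds and subtracts \(Y_{t-1}(0)\), the third uses the fact that treated units are untreated in period \(t-1\), and the last uses parallel trends together with the fact that \(\Delta Y_t = \Delta Y_t(0)\) for untreated units. In other words, the ATT is the difference between the average outcome change for the treated group and the average outcome change for the untreated group.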

Estimation

Two-Groups / Two Periods
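As a simple illustration of the two-group, two-period case using mpdta (a hand-rolled sketch to show the mechanics, not the did package's estimator), one can compare counties first treated in 2004 to never-treated counties, using 2003 as the pre period and 2004 as the post period:

# treated group: counties first treated in 2004; comparison group: never treated (first.treat == 0)
d <- subset(mpdta, first.treat %in% c(0, 2004) & year %in% c(2003, 2004))
d$g <- as.numeric(d$first.treat == 2004)

# group-by-period means of log teen employment
m <- aggregate(lemp ~ g + year, data = d, FUN = mean)

# difference-in-differences: (treated group change) minus (comparison group change)
att_2x2 <- (m$lemp[m$g == 1 & m$year == 2004] - m$lemp[m$g == 1 & m$year == 2003]) -
  (m$lemp[m$g == 0 & m$year == 2004] - m$lemp[m$g == 0 & m$year == 2003])
att_2x2

Under parallel trends, this sample analogue estimates the ATT for the group first treated in 2004, in 2004.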

Multiple Groups and Periods
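With multiple groups and multiple periods, the main tool in the did package is the att_gt() function, which estimates a group-time average treatment effect for each group (defined by the period in which units are first treated) in each period. A minimal sketch using mpdta (argument names assume a recent version of the package) is

out <- att_gt(yname = "lemp",          # outcome variable
              tname = "year",          # time variable
              idname = "countyreal",   # unit id
              gname = "first.treat",   # period when a unit is first treated (0 = never treated)
              data = mpdta)
summary(out)

# aggregate the group-time effects into an event-study (dynamic) summary and plot it
es <- aggte(out, type = "dynamic")
summary(es)
ggdid(es)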

Common Issues using the did package

References