Note: This is a work in progress…
This vignette discusses the basics of using Difference-in-Differences (DID) designs to identify and estimate the average effect of participating in a treatment, with a particular focus on tools from the did package.
Throughout the vignette, we use a subset of data that comes from Callaway and Sant'Anna (2019). This is a dataset that contains county-level teen employment rates from 2003-2007. The data can be loaded by
library(did)
data(mpdta)
mpdta is a balanced panel with 2500 observations. The first few rows of the dataset look like this:
head(mpdta)
#> year countyreal lpop lemp first.treat treat
#> 866 2003 8001 5.896761 8.461469 2007 1
#> 841 2004 8001 5.896761 8.336870 2007 1
#> 842 2005 8001 5.896761 8.340217 2007 1
#> 819 2006 8001 5.896761 8.378161 2007 1
#> 827 2007 8001 5.896761 8.487352 2007 1
#> 937 2003 8019 2.232377 4.997212 2007 1
In your own application, the dataset should have the same structure. The key parts are:
There needs to be an id variable. In mpdta, it is the variable countyreal. This should not vary over time for particular units.
There needs to be a time variable. In mpdta, it is the variable year.
There needs to be an outcome variable. In this application, the outcome is lemp.
There needs to be a treatment variable. In mpdta, it is the variable treat. This variable indicates whether or not an individual is ever treated. Thus, it should be set equal to 1 in all periods for individuals that are treated at any point, and equal to 0 in all periods for individuals that are never treated.
There needs to be a first.treated variable. In mpdta, it is the variable first.treat. This is the time period when an individual first becomes treated. For individuals that are never treated, this variable should be set equal to 0.
The did package allows for incorporating covariates (see below). In mpdta, lpop is the log of county population. The did package requires that covariates be time-invariant. For time-varying covariates like county population, we recommend using the value of the covariate in the first time period.
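The requirements listed above can be checked before any estimation is attempted. The following is a minimal sketch in base R on a small made-up data frame. The data frame toy and the helper check_did_data() are hypothetical names introduced only for illustration and are not part of the did package; the column names simply mirror those in mpdta.

```r
# Hypothetical toy panel: 3 units observed in 2003-2005 (all values are made up)
toy <- data.frame(
  countyreal  = rep(c(1, 2, 3), each = 3),
  year        = rep(2003:2005, times = 3),
  lemp        = c(8.4, 8.3, 8.5, 5.0, 5.1, 5.2, 6.1, 6.0, 6.2),
  lpop        = c(5.9, 6.0, 6.1, 2.2, 2.2, 2.3, 3.0, 3.1, 3.1),  # varies over time
  first.treat = rep(c(2004, 0, 2005), each = 3),
  treat       = rep(c(1, 0, 1), each = 3)
)

# Hypothetical helper (not part of the did package): returns TRUE if the data
# follow the layout described above and stops with an error otherwise.
check_did_data <- function(df) {
  # treat and first.treat must be constant within each unit
  n_treat <- tapply(df$treat, df$countyreal, function(x) length(unique(x)))
  n_first <- tapply(df$first.treat, df$countyreal, function(x) length(unique(x)))
  stopifnot(all(n_treat == 1), all(n_first == 1))
  # treat must be coded 0/1
  stopifnot(all(df$treat %in% c(0, 1)))
  # never-treated units have first.treat == 0; ever-treated units do not
  stopifnot(all((df$treat == 0) == (df$first.treat == 0)))
  TRUE
}

check_did_data(toy)

# Make the covariate time-invariant by using each unit's first-period value
toy <- toy[order(toy$countyreal, toy$year), ]
first_lpop <- tapply(toy$lpop, toy$countyreal, function(x) x[1])
toy$lpop <- as.numeric(first_lpop[as.character(toy$countyreal)])
```

After the last step, lpop takes a single value per unit (its 2003 value in this toy example), which matches the recommendation for time-varying covariates.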
Here are some additional comments about the data structure:
With panel data, the did package “balances” the panel by dropping units that do not have observations in each time period.
The did package requires that there are some individuals that are never treated. In some applications, eventually all individuals become treated, and identifying treatment effects is based on exploiting differences in the timing of when individuals become treated. One way around this is to drop the last time period and use the group of individuals that are first treated in the last period as the control group (i.e., set treat=0 for that group).
The did package is only built to handle staggered treatment adoption designs. This means that once an individual becomes treated, they remain treated in all subsequent periods.
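Two of the comments above describe data manipulations that can be sketched by hand: balancing the panel, and creating a control group when every unit is eventually treated. The base R code below is one possible way to do this on a made-up data frame. The names toy, balanced, and recoded are hypothetical, and setting first.treat to 0 for the recoded group is an assumption made here to stay consistent with the never-treated convention described earlier.

```r
# Hypothetical unbalanced panel (made-up values): unit 3 is missing 2004,
# and every unit is eventually treated.
toy <- data.frame(
  id          = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  year        = c(2003, 2004, 2005, 2003, 2004, 2005, 2003, 2005, 2003, 2004, 2005),
  first.treat = c(rep(2004, 3), rep(2005, 3), rep(2005, 2), rep(2004, 3)),
  treat       = 1
)

# 1. Balance the panel: keep only units observed in every time period.
#    This assumes there are no duplicated (id, year) rows.
n_periods    <- length(unique(toy$year))
obs_per_unit <- table(toy$id)
keep_ids     <- names(obs_per_unit)[as.vector(obs_per_unit) == n_periods]
balanced     <- toy[as.character(toy$id) %in% keep_ids, ]

# 2. No never-treated units: drop the last period and use the group first
#    treated in that period as the control group.
last_period   <- max(balanced$year)
recoded       <- balanced[balanced$year < last_period, ]
is_last_group <- recoded$first.treat == last_period
recoded$treat[is_last_group] <- 0
# Assumption: also reset first.treat so it matches the never-treated coding
recoded$first.treat[is_last_group] <- 0
```

In this toy example, unit 3 is dropped when balancing, and unit 2 (first treated in 2005) serves as the control group for 2003 and 2004.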
First, we provide a brief overview of how identification works as well as parameters of interest in DID designs.
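The main identifying assumption in DID designs is called a parallel trends assumption. Let \(Y_{it}(0)\) denote an individual's untreated “potential” outcome in time period \(t\) and \(Y_{it}(1)\) denote an individual's treated “potential” outcome in time period \(t\). The observed outcome for an individual is \(Y_{it} = D_i Y_{it}(1) + (1-D_i)Y_{it}(0)\). The parallel trends assumption says that, conditional on covariates \(X\), the average change in untreated potential outcomes, \(\Delta Y_t(0) = Y_t(0) - Y_{t-1}(0)\), is the same for the treated group and the untreated group:
\begin{align} E[\Delta Y_t(0) | X, D=1] = E[\Delta Y_t(0) | X, D=0] \end{align}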
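To make the role of parallel trends concrete, the base R sketch below builds a two-period, two-group example without covariates. All numbers and variable names are made up for illustration. Untreated potential outcomes are constructed to follow the same trend in both groups, so parallel trends holds exactly, and the difference in average outcome changes between groups then equals the average treatment effect on the treated.

```r
# Made-up example: 4 units, 2 periods, no covariates
D       <- c(1, 1, 0, 0)             # 1 = treated group, 0 = untreated group
y0_pre  <- c(10, 12, 5, 7)           # untreated potential outcomes, period t-1
trend   <- 2                         # common change in untreated outcomes
y0_post <- y0_pre + trend            # untreated potential outcomes, period t
y1_post <- y0_post + c(3, 5, 1, 4)   # treated potential outcomes, period t

# Observed outcomes: nobody is treated in the first period; in the second
# period Y = D * Y(1) + (1 - D) * Y(0)
y_pre  <- y0_pre
y_post <- D * y1_post + (1 - D) * y0_post

# Average treatment effect on the treated (uses unobserved potential outcomes)
true_att <- mean((y1_post - y0_post)[D == 1])

# DID estimate (uses observed outcomes only)
dy <- y_post - y_pre
did_estimate <- mean(dy[D == 1]) - mean(dy[D == 0])

c(true_att = true_att, did_estimate = did_estimate)
```

Both quantities equal 4 here. If the untreated outcomes of the two groups were given different trends, the two numbers would no longer agree, which is why the assumption matters.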
did
package