Angrist & Pischke, Mostly Harmless Econometrics — Chapter 5
"The first thing to realize about parallel universes... is that they are not parallel." — Douglas Adams
Suhyeon Lee
When important confounders are unobserved but fixed over time, we can eliminate them using panel data strategies: fixed effects (within-person variation) or differences-in-differences (parallel trends assumption). These methods "punt on comparisons in levels" while requiring counterfactual trends to be the same.
The identification toolkit so far: regression under a conditional independence assumption (Chapter 3) and instrumental variables (Chapter 4). This chapter adds panel-data strategies — fixed effects and differences-in-differences.
Classic question in Labor Economics: Do workers whose wages are set by collective bargaining earn more because of this, or would they earn more anyway (perhaps because they are more experienced or skilled)?
Let yit = log earnings of worker i at time t, and dit = union status. Assume:
Conditional independence:
E(y0it | Ai, Xit, t, dit) = E(y0it | Ai, Xit, t)
Union status is as good as randomly assigned conditional on unobserved ability Ai, observed covariates Xit, and time.
Key assumption: The unobserved Ai appears without a time subscript in a linear model for the counterfactual:

E(y0it | Ai, Xit, t) = α + λt + A'iγ + X'itβ

With constant, additive treatment effect ρ:

E(y1it | Ai, Xit, t) = E(y0it | Ai, Xit, t) + ρ

This implies the fixed effects model:

yit = αi + λt + ρdit + X'itβ + εit

where αi ≡ α + A'iγ is the individual fixed effect (treated as a parameter to be estimated), λt is a year effect (coefficients on time dummies), and εit ≡ y0it − E(y0it | Ai, Xit, t).
With panel data (repeated observations on individuals), we can eliminate αi. First, calculate individual averages:

ȳi = αi + λ̄ + ρd̄i + X̄'iβ + ε̄i

Subtracting from the original equation:

yit − ȳi = (λt − λ̄) + ρ(dit − d̄i) + (Xit − X̄i)'β + (εit − ε̄i)

The fixed effect αi is eliminated! This is called the "within estimator" or "analysis of covariance".
Why does this work algebraically?
By the regression anatomy formula (3.1.3), estimating with a full set of person dummies is the same as regressing on the residuals from a regression on those dummies. The residuals from regressing on person dummies are exactly deviations from person means.
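This equivalence is easy to verify numerically. The sketch below (illustrative only; all variable names and parameter values are ours, not from the text) simulates a panel in which union status is correlated with unobserved ability, then confirms that the demeaned regression and OLS with a full set of person dummies give exactly the same coefficient:

```python
# Within estimator vs. dummy-variable OLS (illustrative simulation)
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 4
alpha = rng.normal(0, 1, N)                    # individual fixed effects
# treatment correlated with alpha (the source of OVB in pooled OLS)
d = (alpha[:, None] + rng.normal(0, 1, (N, T)) > 0).astype(float)
rho = 0.5
y = alpha[:, None] + rho * d + rng.normal(0, 1, (N, T))

# Within estimator: deviations from person means
y_dm = y - y.mean(axis=1, keepdims=True)
d_dm = d - d.mean(axis=1, keepdims=True)
rho_within = (d_dm * y_dm).sum() / (d_dm ** 2).sum()

# Equivalent: OLS with a dummy for every person
X = np.hstack([d.reshape(-1, 1), np.kron(np.eye(N), np.ones((T, 1)))])
beta, *_ = np.linalg.lstsq(X, y.reshape(-1), rcond=None)
rho_dummies = beta[0]

assert np.isclose(rho_within, rho_dummies)     # same number, by FWL
```

The equality is exact (not just approximate) by the Frisch–Waugh–Lovell logic the text describes.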
An alternative to deviations from means is first differencing:

Δyit = Δλt + ρΔdit + ΔX'itβ + Δεit

where Δyit = yit − yit−1.
| Method | Deviations from Means | First Differencing |
|---|---|---|
| T = 2 | Algebraically identical to first differencing | Algebraically identical to deviations from means |
| T > 2 | More efficient if εit is homoskedastic & serially uncorrelated | May be more convenient; note Δεit is serially correlated |
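A quick numerical check of the T = 2 case (illustrative simulation, our parameter choices; the first-difference regression omits a constant because the simulated model has no time effects):

```python
# With T = 2, within and first-difference estimators coincide exactly
import numpy as np

rng = np.random.default_rng(1)
N = 500
alpha = rng.normal(0, 1, N)
d = (alpha[:, None] + rng.normal(0, 1, (N, 2)) > 0).astype(float)
y = alpha[:, None] + 0.5 * d + rng.normal(0, 1, (N, 2))

# Within estimator (deviations from person means, pooled over both periods)
y_dm = y - y.mean(axis=1, keepdims=True)
d_dm = d - d.mean(axis=1, keepdims=True)
rho_within = (d_dm * y_dm).sum() / (d_dm ** 2).sum()

# First-difference estimator (one cross-section regression of Δy on Δd)
dy, dd = y[:, 1] - y[:, 0], d[:, 1] - d[:, 0]
rho_fd = (dd * dy).sum() / (dd ** 2).sum()

assert np.isclose(rho_within, rho_fd)  # algebraically identical
```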
Random effects assumes αi is uncorrelated with the regressors. Then αi becomes part of the residual (no OVB from ignoring it), but residuals for a given person are correlated across periods.
The authors prefer: OLS with fixed effects + robust standard errors, rather than GLS under random effects. GLS requires stronger assumptions (linear CEF, homoskedasticity) and efficiency gains are typically modest.
Freeman uses four panel data sets to estimate union wage effects:
| Survey | Cross-section | Fixed Effects |
|---|---|---|
| May CPS, 1974-75 | 0.19 | 0.09 |
| NLS Young Men, 1970-78 | 0.28 | 0.19 |
| Michigan PSID, 1970-79 | 0.23 | 0.14 |
| QES, 1973-77 | 0.14 | 0.16 |
Pattern: FE estimates (0.09–0.19) are generally smaller than cross-section estimates (0.14–0.28). This suggests positive selection bias in cross-section — more able workers join unions and earn more.
FE estimates are notoriously susceptible to attenuation bias: differencing or demeaning removes much of the signal in the regressor while leaving the measurement error, so the noise-to-signal ratio — and with it the attenuation — rises. Possible fixes: find better (e.g., administrative) data, or instrument the mis-measured regressor.
Differencing/demeaning removes both good and bad variation: the transformation throws out the omitted-variables-bias bathwater, but some of the baby (useful identifying variation) goes with it.
Twins and Returns to Schooling:
Ashenfelter & Krueger (1994) and Ashenfelter & Rouse (1998) estimate returns to schooling using twins, controlling for family fixed effects (common family/genetic background).
Surprising result: Within-family estimates are larger than OLS!
Bound & Solon (1999) critique: the within-pair schooling difference is itself a choice, so even small ability differences between twins can drive both schooling and earnings differences; differencing amplifies this endogeneity along with measurement error. Within-family estimates are therefore not guaranteed to be closer to the causal effect than OLS.
FE requires panel data with repeated observations on the same individuals. Often, however, treatment varies only at a more aggregate level (state, cohort). Examples: state minimum wage changes (state × year) or schooling laws (state × cohort).
The source of OVB must therefore be unobserved variables at the state and year level.
Classic question: In a competitive labor market, higher minimum wages should reduce employment (moving up a downward-sloping demand curve). Does this actually happen?
Natural experiment (Card & Krueger, 1994): In April 1992, New Jersey raised its state minimum wage from $4.25 to $5.05, while neighboring eastern Pennsylvania's minimum stayed at $4.25. Fast-food restaurants in both states were surveyed in February (before) and November (after) the increase.
Define potential outcomes: y1ist = employment at restaurant i in state s at time t with a high minimum wage; y0ist = employment otherwise.

Key assumption — parallel trends in absence of treatment:

E(y0ist | s, t) = γs + λt

This says: in the absence of a minimum wage change, employment is determined by the sum of a time-invariant state effect (γs) and a year effect (λt) common to both states.

With constant treatment effect δ:

yist = γs + λt + δdst + εist

where dst is a dummy for high-minimum-wage state-periods and E(εist | s, t) = 0.
Control state (PA):
E[y|PA, Nov] − E[y|PA, Feb] = λNov − λFeb
Treatment state (NJ):
E[y|NJ, Nov] − E[y|NJ, Feb] = λNov − λFeb + δ
Difference-in-differences:
[E[y|NJ, Nov] − E[y|NJ, Feb]] − [E[y|PA, Nov] − E[y|PA, Feb]] = δ
| FTE Employment | PA (Control) | NJ (Treatment) | NJ − PA |
|---|---|---|---|
| Before (Feb) | 23.33 (1.35) | 20.44 (0.51) | −2.89 (1.44) |
| After (Nov) | 21.17 (0.94) | 21.03 (0.52) | −0.14 (1.07) |
| Change | −2.16 (1.25) | +0.59 (0.54) | +2.76 (1.36) |
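The DD estimate can be recomputed directly from the cell means in the table (small rounding differences against the published +2.76 are expected, since the published figure is computed from unrounded means):

```python
# DD from the Card & Krueger cell means reported above
nj_feb, nj_nov = 20.44, 21.03   # NJ (treatment) FTE employment
pa_feb, pa_nov = 23.33, 21.17   # PA (control) FTE employment

dd = (nj_nov - nj_feb) - (pa_nov - pa_feb)
assert round(dd, 2) == 2.75     # matches the reported +2.76 up to rounding
```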
Interpretation: Employment fell in PA but rose slightly in NJ. The DD estimate is positive (+2.76), so there is no evidence that the minimum wage increase reduced employment — if anything, the opposite.
```
Employment
│
│        ●───────● Treatment (observed)
│       ╱
│      ╱   ← Treatment effect (δ)
│     ╱
│    ●─ ─ ─ ─ ─ ● Counterfactual
│   ╱             (parallel to control)
│  ╱
│ ●─────────● Control (observed)
│
└────────────────────────────── Time
      Before        After
```

Key insight: We never observe the counterfactual. The parallel trends assumption lets us use the control group's change as a proxy.
The identifying assumption can be investigated with multiple pre-treatment periods. Do treatment and control follow similar trends before treatment?
Card & Krueger (2000) Follow-up:
Administrative payroll data for restaurants in NJ and PA over a longer period reveal: employment in the two states fluctuates substantially, and the NJ–PA gap moves around even in periods with no minimum wage change, so the two series are not obviously parallel.
Concern: PA may not provide a good measure of counterfactual NJ employment.
Better Example: Pischke (2007) — German School Term Length
Results: the short school years increased grade repetition, but there is little evidence of lower earnings or employment later in life.
DD can be estimated via regression. Let NJs = dummy for NJ, dt = dummy for November:

yist = α + γNJs + λdt + δ(NJs × dt) + εist
Parameter interpretation:
| Parameter | Meaning |
|---|---|
| α | E[y \| PA, Feb] = γPA + λFeb |
| γ | E[y \| NJ, Feb] − E[y \| PA, Feb] = γNJ − γPA |
| λ | E[y \| PA, Nov] − E[y \| PA, Feb] = λNov − λFeb |
| δ | DD estimate = {E[y \| NJ, Nov] − E[y \| NJ, Feb]} − {E[y \| PA, Nov] − E[y \| PA, Feb]} |
This is a saturated model: 4 possible values of E(y|s,t), 4 parameters.
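Because the model is saturated, fitting it to the four Card & Krueger cell means reproduces the DD table entries exactly (up to rounding of the published means). A minimal sketch:

```python
# Saturated DD regression on the four state-period cell means
import numpy as np

# columns: (constant, NJ, Nov, NJ×Nov)
X = np.array([
    [1, 0, 0, 0],   # PA, Feb
    [1, 1, 0, 0],   # NJ, Feb
    [1, 0, 1, 0],   # PA, Nov
    [1, 1, 1, 1],   # NJ, Nov
], dtype=float)
y = np.array([23.33, 20.44, 21.17, 21.03])

alpha, gamma, lam, delta = np.linalg.solve(X, y)
assert round(alpha, 2) == 23.33   # E[y | PA, Feb]
assert round(gamma, 2) == -2.89   # NJ − PA gap in February
assert round(lam, 2) == -2.16     # PA change, Feb → Nov
assert round(delta, 2) == 2.75    # DD estimate (published: 2.76)
```

With 4 parameters and 4 cells the fit is exact, so the regression coefficients are just the cell-mean contrasts from the table.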
1. Easy to add states/periods: Just include more dummies. The generalization includes a dummy for each state and period.
2. Variable treatment intensity: Instead of switched-on/off treatment, use continuous measures.
Example: Card (1992) — Federal Minimum Wage
In 1990, federal minimum increased from $3.35 to $3.80. Impact varies by state (irrelevant in high-wage Connecticut, big deal in low-wage Mississippi).
The DD model interacts treatment intensity with a post-increase dummy dt:

yst = γs + λt + δ(fas × dt) + εst

where fas = baseline fraction of teens earning below $3.80 in state s (treatment intensity).
| Regressor | Δ Mean Log Wage | Δ Emp/Pop Ratio |
|---|---|---|
| Fraction affected (fas) | 0.15 (0.03) | 0.02 (0.03) |
Wages rose more in states where minimum wage had more bite (0.15), but employment was largely unrelated to fraction affected (0.02 ≈ 0).
3. Easy to add covariates: Control for time-varying state characteristics Xst (e.g., adult employment as proxy for state economic conditions).
When the sample includes many years and treatment timing varies across states, we can test whether "causes happen before consequences" by including leads and lags of treatment:

yst = γs + λt + Σ(τ=0..m) δ−τ ds,t−τ + Σ(τ=1..q) δ+τ ds,t+τ + εst

The lags (δ−τ) trace out post-treatment effects; the leads (δ+τ) should be zero if policy adoption is not anticipated.
Example: Autor (2003) — Employment Protection & Temp Workers
State court rulings allowing "unjust dismissal" lawsuits → Do firms use more temp workers?
Estimated leads and lags pattern: coefficients on the leads are small and close to zero (no effect before the rulings), while the lag coefficients grow for several years after adoption.
This pattern is consistent with a causal interpretation: no anticipation, gradual adjustment.
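The logic can be illustrated with a simulation (not Autor's data; all parameter values are ours). States adopt a policy in different years, the true effect is zero before adoption and +1.0 afterwards, and an event-study regression with state and year dummies recovers near-zero leads and positive lags:

```python
# Event-study (leads/lags) regression on simulated staggered adoption
import numpy as np

rng = np.random.default_rng(2)
S, T = 60, 10
adopt = rng.integers(4, 7, S)            # adoption year per state, in {4, 5, 6}
state_fe = rng.normal(0, 1, S)
year_fe = rng.normal(0, 1, T)

# event-time bins relative to adoption; k = -1 is the omitted reference
bins = [("lead2+", lambda k: k <= -2), ("k0", lambda k: k == 0),
        ("k1", lambda k: k == 1), ("k2+", lambda k: k >= 2)]
rows, ys = [], []
for s in range(S):
    for t in range(T):
        k = t - adopt[s]
        x = [1.0]                                          # intercept
        x += [1.0 if i == s else 0.0 for i in range(1, S)] # state dummies
        x += [1.0 if j == t else 0.0 for j in range(1, T)] # year dummies
        x += [1.0 if f(k) else 0.0 for _, f in bins]       # event-time bins
        rows.append(x)
        ys.append(state_fe[s] + year_fe[t] + (1.0 if k >= 0 else 0.0)
                  + rng.normal(0, 0.05))

beta, *_ = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)
lead2, k0, k1, k2p = beta[-4:]
assert abs(lead2) < 0.1                           # no pre-adoption "effect"
assert all(abs(b - 1.0) < 0.1 for b in (k0, k1, k2p))  # post effect ≈ 1
```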
Alternative robustness check: allow treatment and control states to follow different linear trends:

yst = γ0s + γ1s·t + λt + δdst + εst

where γ1s is a state-specific trend coefficient.
This allows limited heterogeneity in trends. It's heartening if results survive, discouraging otherwise.
Example: Besley & Burgess (2004) — Labor Regulation in India
| Specification | Labor Regulation Effect |
|---|---|
| DD only | −0.186 (0.064) |
| DD + state-level controls | −0.104 (0.039) |
| DD + state-specific trends | 0.0002 (0.02) |
Interpretation: Without trends, labor regulation appears to reduce output. With state trends, the effect disappears → regulation increased in states where output was already declining.
DD sets up an implicit treatment-control comparison. A potential pitfall: composition changes as a result of treatment.
Example: Welfare benefits and labor supply
If generous welfare states attract poor people with weak labor force attachment (program-induced migration), DD makes generous welfare look worse for labor supply than it really is.
Fix: Use state of birth or previous residence (unchanged by treatment but correlated with current location). This can be implemented as an IV strategy.
When treatment varies along three dimensions (state × time × age), use higher-order contrasts: subtract the DD for an unaffected comparison group in the same states (e.g., adults) from the DD for the affected group (children), giving a triple-differences (DDD) estimate.

This controls for: state-specific shocks common to all age groups, age-specific shocks common to all states, and time-invariant state × age differences.
Example: Yelowitz (1995) — Medicaid Expansion
Medicaid eligibility was once tied to AFDC (cash welfare). In the 1980s, some states extended coverage to children in families ineligible for AFDC.
Treatment varies by state, time, and child's age. DDD compares across all three dimensions, providing more convincing control than standard DD.
FE and DD are based on time-invariant omitted variables. But for many questions, this assumption doesn't seem plausible.
Example: Training Program Evaluation
People in government training programs have often suffered a recent setback (job loss). Many programs explicitly target such people.
Ashenfelter (1978), Ashenfelter & Card (1985): Training participants exhibit a pre-program earnings dip.
Past earnings is a time-varying confounder that cannot be subsumed in a time-invariant αi.
|  | Fixed Effects | Lagged Dependent Variable |
|---|---|---|
| Selection based on | Time-invariant unobservables (αi) | Past outcomes (yit−h) |
| CIA | E(y0it \| αi, Xit, dit) = E(y0it \| αi, Xit) | E(y0it \| yit−h, Xit, dit) = E(y0it \| yit−h, Xit) |
| Model | yit = αi + λt + ρdit + X'itβ + εit | yit = θ + γyit−h + λt + ρdit + X'itβ + εit |
| Appropriate when | Permanent unobserved ability/preferences drive selection | A recent setback or change drives selection (training programs) |
Tempting to estimate a model with both αi and yit−1:

yit = αi + λt + θyit−1 + ρdit + X'itβ + εit

To remove αi, we difference:

Δyit = Δλt + θΔyit−1 + ρΔdit + ΔX'itβ + Δεit

Nickell (1981) Problem:
Δyit−1 = yit−1 − yit−2 contains εit−1
Δεit = εit − εit−1 also contains εit−1
→ Regressor correlated with error! OLS is inconsistent, even as N → ∞ with T fixed.

Possible fix: Use yit−2 as an instrument for Δyit−1. But this requires εit to be serially uncorrelated — otherwise yit−2 is itself correlated with Δεit and the instrument is invalid.
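A simulation makes the Nickell problem concrete (illustrative parameters, our own; the instrument-based fix here is the Anderson–Hsiao idea of using yit−2 as an instrument for Δyit−1):

```python
# Nickell bias in the first-differenced dynamic panel, and the IV fix
import numpy as np

rng = np.random.default_rng(3)
N, T, gamma = 5000, 6, 0.5
alpha = rng.normal(0, 1, N)

y = np.zeros((N, T + 50))
for t in range(1, T + 50):                 # burn-in toward stationarity
    y[:, t] = alpha + gamma * y[:, t - 1] + rng.normal(0, 1, N)
y = y[:, -T:]                              # keep the last T periods

# First-difference OLS: regress Δy(t) on Δy(t-1), pooled over t
dy = np.diff(y, axis=1)
x, z = dy[:, :-1].ravel(), dy[:, 1:].ravel()
gamma_fd = (x * z).sum() / (x * x).sum()   # plim is (gamma - 1)/2 = -0.25

# Anderson-Hsiao: instrument Δy(t-1) with the level y(t-2)
inst = y[:, :-2].ravel()
num = (inst * z).sum() - inst.size * inst.mean() * z.mean()
den = (inst * x).sum() - inst.size * inst.mean() * x.mean()
gamma_iv = num / den

assert gamma_fd < 0.1                      # badly biased below gamma = 0.5
assert abs(gamma_iv - gamma) < 0.1         # IV is approximately consistent
```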
The FE and LDV models are not nested. Only the combined model (which is hard to estimate) nests both. However, they have a useful bracketing property:
| If True Model Is... | But You Estimate... | Bias Direction |
|---|---|---|
| LDV (selection on yit−1) | FE (differencing) | Upward — estimate too big |
| FE (selection on αi) | LDV (control for yit−1) | Downward — estimate too small |
Implication: FE and LDV estimates bracket the true causal effect. You can think of them as providing bounds.
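The bracketing property can be seen in a simulation (illustrative numbers, not from the text). Here the true model is LDV with selection on the lagged outcome, so the LDV estimate is about right while the FE (differenced) estimate is too big:

```python
# Bracketing: LDV truth, comparing LDV and FE estimates
import numpy as np

rng = np.random.default_rng(4)
N, gamma, rho = 20000, 0.5, 1.0

y0 = rng.normal(0, 1, N)
y1 = gamma * y0 + rng.normal(0, 1, N)
d = (y1 + rng.normal(0, 0.5, N) < 0).astype(float)   # low earners enroll
y2 = gamma * y1 + rho * d + rng.normal(0, 1, N)

# LDV: regress y2 on d and y1 (the correct control in this world)
X = np.column_stack([np.ones(N), d, y1])
rho_ldv = np.linalg.lstsq(X, y2, rcond=None)[0][1]

# FE: difference away a (nonexistent) fixed effect; d was 0 before, so Δd = d
Xd = np.column_stack([np.ones(N), d])
rho_fe = np.linalg.lstsq(Xd, y2 - y1, rcond=None)[0][1]

assert abs(rho_ldv - rho) < 0.1   # LDV recovers the truth (rho = 1)
assert rho_fe > rho + 0.2         # FE estimate is biased upward
```

Reversing the data-generating process (FE truth, LDV estimated) would show the mirror-image downward bias, so the two estimates bound the true effect from both sides.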
Case 1: FE is correct, but you use LDV

True model (simplified, no covariates/time effects, dit−1 = 0):

yit = αi + ρdit + εit, so yit−1 = αi + εit−1

where εit is serially uncorrelated and uncorrelated with αi and dit.

You mistakenly control for yit−1 = αi + εit−1. The LDV estimator has probability limit:

plim ρ̂LDV = ρ + Cov(d̃it, αi)/V(d̃it)

where d̃it = dit − πyit−1 is the residual from a regression of dit on yit−1.

Substituting αi = yit−1 − εit−1, and using Cov(d̃it, yit−1) = 0 and Cov(dit, εit−1) = 0:

Cov(d̃it, αi) = −Cov(d̃it, εit−1) = πσ²ε

The LDV estimator picks up ρ plus this bias term. Since trainees have low yit−1, the coefficient from a regression of dit on yit−1 is negative (π < 0). The bias term is therefore negative → the LDV estimate is too small.
Case 2: LDV is correct, but you use FE

True model:

yit = θ + γyit−1 + ρdit + εit

where εit is serially uncorrelated and 0 < γ < 1 (stationarity).

You mistakenly difference (FE). Subtracting yit−1 from both sides:

yit − yit−1 = θ + (γ − 1)yit−1 + ρdit + εit

The differenced estimator picks up ρ plus a bias term proportional to (γ − 1)Cov(dit, yit−1). Since γ < 1 (so γ − 1 < 0) and trainees have low yit−1 (negative correlation), the bias term is positive → the FE estimate is too big.
Example: Guryan (2004) uses this bracketing reasoning in studying the effects of court-ordered busing on Black high school graduation rates.
| Concept | Key Point |
|---|---|
| Fixed Effects | Eliminates time-invariant unobserved confounders using within-unit variation |
| FE Estimation | Deviations from means or first-differencing (equivalent with T=2) |
| FE Limitations | Measurement error amplified; removes both good and bad variation |
| DD | FE for aggregate data: (ΔTreatment) − (ΔControl) |
| Parallel Trends | Key DD assumption — treatment & control would follow same trend absent treatment |
| Regression DD | State + time dummies + interaction; allows variable treatment intensity, covariates |
| Testing DD | Pre-trends, leads/lags (Granger), state-specific trends, triple differences |
| FE vs. LDV | Different assumptions; not nested; estimates bracket true effect |
| Bracketing | FE too big if LDV true; LDV too small if FE true → bounds on causal effect |
Practical Checklist:
- Plot raw trends for treatment and control groups before treatment (pre-trends).
- Test leads of treatment (Granger-style): causes should not follow consequences.
- Check robustness to state-specific linear trends.
- Watch for composition changes induced by the treatment itself.
- Use robust/clustered standard errors to handle serially correlated residuals.
- When the selection mechanism is uncertain, report both FE and LDV estimates as brackets on the causal effect.