Angrist & Pischke, Mostly Harmless Econometrics — Chapter 5
"The first thing to realize about parallel universes... is that they are not parallel." — Douglas Adams
Suhyeon Lee
When important confounders are unobserved but fixed over time, we can eliminate them using panel data strategies: fixed effects (within-person variation) or differences-in-differences (parallel trends assumption). These methods "punt on comparisons in levels" while requiring counterfactual trends to be the same.
The identification toolkit so far: regression under a conditional independence assumption (Chapter 3) and instrumental variables (Chapter 4). This chapter adds panel-data strategies — fixed effects and differences-in-differences.
Classic question in Labor Economics: Do workers whose wages are set by collective bargaining earn more because of this, or would they earn more anyway (perhaps because they are more experienced or skilled)?
Let yit = log earnings of worker i at time t, and dit = union status. Assume:
Conditional independence:
E(y0it | Ai, Xit, t, dit) = E(y0it | Ai, Xit, t)
Union status is as good as randomly assigned conditional on unobserved ability Ai, observed covariates Xit, and time.
Key assumption: The unobserved Ai appears without a time subscript in a linear model for the counterfactual:

E(y0it | Ai, Xit, t) = α + λt + A'iγ + X'itβ

With constant, additive treatment effect ρ:

E(y1it | Ai, Xit, t) = E(y0it | Ai, Xit, t) + ρ

This implies the fixed effects model:

yit = αi + λt + ρdit + X'itβ + εit

where αi ≡ α + A'iγ is the individual fixed effect (treated as a parameter to be estimated), λt is a year effect (coefficients on time dummies), and εit ≡ y0it − E(y0it | Ai, Xit, t).
With panel data (repeated observations on individuals), we can eliminate αi. First, calculate individual averages:

ȳi = αi + λ̄ + ρd̄i + X̄'iβ + ε̄i

Subtracting from the original equation:

yit − ȳi = (λt − λ̄) + ρ(dit − d̄i) + (Xit − X̄i)'β + (εit − ε̄i)

The fixed effect αi is eliminated! This is called the "within estimator" or "analysis of covariance".
Why does this work algebraically?
By the regression anatomy formula (3.1.3), estimating with a full set of person dummies is the same as regressing on the residuals from a regression on those dummies. The residuals from regressing on person dummies are exactly deviations from person means.
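This equivalence is easy to verify numerically. The sketch below (illustrative only; all variable names and parameter values are ours, not from the text) simulates a panel in which union status is correlated with unobserved ability, then confirms that the demeaned regression and OLS with a full set of person dummies give exactly the same coefficient:

```python
# Within estimator vs. dummy-variable OLS (illustrative simulation)
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 4
alpha = rng.normal(0, 1, N)                    # individual fixed effects
# treatment correlated with alpha (the source of OVB in pooled OLS)
d = (alpha[:, None] + rng.normal(0, 1, (N, T)) > 0).astype(float)
rho = 0.5
y = alpha[:, None] + rho * d + rng.normal(0, 1, (N, T))

# Within estimator: deviations from person means
y_dm = y - y.mean(axis=1, keepdims=True)
d_dm = d - d.mean(axis=1, keepdims=True)
rho_within = (d_dm * y_dm).sum() / (d_dm ** 2).sum()

# Equivalent: OLS with a dummy for every person
X = np.hstack([d.reshape(-1, 1), np.kron(np.eye(N), np.ones((T, 1)))])
beta, *_ = np.linalg.lstsq(X, y.reshape(-1), rcond=None)
rho_dummies = beta[0]

assert np.isclose(rho_within, rho_dummies)     # same number, by FWL
```

The equality is exact (not just approximate) by the Frisch–Waugh–Lovell logic the text describes.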
An alternative to deviations from means is first differencing:

Δyit = Δλt + ρΔdit + ΔX'itβ + Δεit

where Δyit = yit − yit−1.
| Method | Deviations from Means | First Differencing |
|---|---|---|
| T = 2 | Algebraically identical to first differencing | Algebraically identical to deviations from means |
| T > 2 | More efficient if εit is homoskedastic & serially uncorrelated | May be more convenient; note Δεit is serially correlated |
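A quick numerical check of the T = 2 case (illustrative simulation, our parameter choices; the first-difference regression omits a constant because the simulated model has no time effects):

```python
# With T = 2, within and first-difference estimators coincide exactly
import numpy as np

rng = np.random.default_rng(1)
N = 500
alpha = rng.normal(0, 1, N)
d = (alpha[:, None] + rng.normal(0, 1, (N, 2)) > 0).astype(float)
y = alpha[:, None] + 0.5 * d + rng.normal(0, 1, (N, 2))

# Within estimator (deviations from person means, pooled over both periods)
y_dm = y - y.mean(axis=1, keepdims=True)
d_dm = d - d.mean(axis=1, keepdims=True)
rho_within = (d_dm * y_dm).sum() / (d_dm ** 2).sum()

# First-difference estimator (one cross-section regression of Δy on Δd)
dy, dd = y[:, 1] - y[:, 0], d[:, 1] - d[:, 0]
rho_fd = (dd * dy).sum() / (dd ** 2).sum()

assert np.isclose(rho_within, rho_fd)  # algebraically identical
```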
Random effects assumes αi is uncorrelated with the regressors. Then αi becomes part of the residual (no OVB from ignoring it), but residuals for a given person are correlated across periods.
The authors prefer: OLS with fixed effects + robust standard errors, rather than GLS under random effects. GLS requires stronger assumptions (linear CEF, homoskedasticity) and efficiency gains are typically modest.
Freeman uses four panel data sets to estimate union wage effects:
| Survey | Cross-section | Fixed Effects |
|---|---|---|
| May CPS, 1974-75 | 0.19 | 0.09 |
| NLS Young Men, 1970-78 | 0.28 | 0.19 |
| Michigan PSID, 1970-79 | 0.23 | 0.14 |
| QES, 1973-77 | 0.14 | 0.16 |
Pattern: FE estimates (0.09–0.19) are generally smaller than cross-section estimates (0.14–0.28). This suggests positive selection bias in cross-section — more able workers join unions and earn more.
FE estimates are notoriously susceptible to attenuation bias: differencing or demeaning removes much of the signal in the regressor while leaving the measurement error, so the noise-to-signal ratio — and with it the attenuation — rises. Possible fixes: find better (e.g., administrative) data, or instrument the mis-measured regressor.
Differencing/demeaning removes both good and bad variation: the transformation throws out the omitted-variables-bias bathwater, but some of the baby (useful identifying variation) goes with it.
Twins and Returns to Schooling:
Ashenfelter & Krueger (1994) and Ashenfelter & Rouse (1998) estimate returns to schooling using twins, controlling for family fixed effects (common family/genetic background).
Surprising result: Within-family estimates are larger than OLS!
Bound & Solon (1999) critique: the within-pair schooling difference is itself a choice, so even small ability differences between twins can drive both schooling and earnings differences; differencing amplifies this endogeneity along with measurement error. Within-family estimates are therefore not guaranteed to be closer to the causal effect than OLS.
FE requires panel data with repeated observations on the same individuals. Often, however, treatment varies only at a more aggregate level (state, cohort). Examples: state minimum wage changes (state × year) or schooling laws (state × cohort).
The source of OVB must therefore be unobserved variables at the state and year level.
Classic question: In a competitive labor market, higher minimum wages should reduce employment (moving up a downward-sloping demand curve). Does this actually happen?
Natural experiment (Card & Krueger, 1994): In April 1992, New Jersey raised its state minimum wage from $4.25 to $5.05, while neighboring eastern Pennsylvania's minimum stayed at $4.25. Fast-food restaurants in both states were surveyed in February (before) and November (after) the increase.
Define potential outcomes: y1ist = employment at restaurant i in state s at time t with a high minimum wage; y0ist = employment otherwise.

Key assumption — parallel trends in absence of treatment:

E(y0ist | s, t) = γs + λt

This says: in the absence of a minimum wage change, employment is determined by the sum of a time-invariant state effect (γs) and a year effect (λt) common to both states.

With constant treatment effect δ:

yist = γs + λt + δdst + εist

where dst is a dummy for high-minimum-wage state-periods and E(εist | s, t) = 0.
Control state (PA):
E[y|PA, Nov] − E[y|PA, Feb] = λNov − λFeb
Treatment state (NJ):
E[y|NJ, Nov] − E[y|NJ, Feb] = λNov − λFeb + δ
Difference-in-differences:
[E[y|NJ, Nov] − E[y|NJ, Feb]] − [E[y|PA, Nov] − E[y|PA, Feb]] = δ
| FTE Employment | PA (Control) | NJ (Treatment) | NJ − PA |
|---|---|---|---|
| Before (Feb) | 23.33 (1.35) | 20.44 (0.51) | −2.89 (1.44) |
| After (Nov) | 21.17 (0.94) | 21.03 (0.52) | −0.14 (1.07) |
| Change | −2.16 (1.25) | +0.59 (0.54) | +2.76 (1.36) |
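The DD estimate can be recomputed directly from the cell means in the table (small rounding differences against the published +2.76 are expected, since the published figure is computed from unrounded means):

```python
# DD from the Card & Krueger cell means reported above
nj_feb, nj_nov = 20.44, 21.03   # NJ (treatment) FTE employment
pa_feb, pa_nov = 23.33, 21.17   # PA (control) FTE employment

dd = (nj_nov - nj_feb) - (pa_nov - pa_feb)
assert round(dd, 2) == 2.75     # matches the reported +2.76 up to rounding
```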
Interpretation: Employment fell in PA but rose slightly in NJ. The DD estimate is positive (+2.76), so there is no evidence that the minimum wage increase reduced employment — if anything, the opposite.
```
Employment
│
│        ●───────● Treatment (observed)
│       ╱
│      ╱   ← Treatment effect (δ)
│     ╱
│    ●─ ─ ─ ─ ─ ● Counterfactual
│   ╱             (parallel to control)
│  ╱
│ ●─────────● Control (observed)
│
└────────────────────────────── Time
      Before        After
```

Key insight: We never observe the counterfactual. The parallel trends assumption lets us use the control group's change as a proxy.
The identifying assumption can be investigated with multiple pre-treatment periods. Do treatment and control follow similar trends before treatment?
Card & Krueger (2000) Follow-up:
Administrative payroll data for restaurants in NJ and PA over a longer period reveal: employment in the two states fluctuates substantially, and the NJ–PA gap moves around even in periods with no minimum wage change, so the two series are not obviously parallel.
Concern: PA may not provide a good measure of counterfactual NJ employment.
Better Example: Pischke (2007) — German School Term Length
Results: the short school years increased grade repetition, but there is little evidence of lower earnings or employment later in life.
DD can be estimated via regression. Let NJs = dummy for NJ, dt = dummy for November:

yist = α + γNJs + λdt + δ(NJs × dt) + εist
Parameter interpretation:
| Parameter | Meaning |
|---|---|
| α | E[y \| PA, Feb] = γPA + λFeb |
| γ | E[y \| NJ, Feb] − E[y \| PA, Feb] = γNJ − γPA |
| λ | E[y \| PA, Nov] − E[y \| PA, Feb] = λNov − λFeb |
| δ | DD estimate = {E[y \| NJ, Nov] − E[y \| NJ, Feb]} − {E[y \| PA, Nov] − E[y \| PA, Feb]} |
This is a saturated model: 4 possible values of E(y|s,t), 4 parameters.
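Because the model is saturated, fitting it to the four Card & Krueger cell means reproduces the DD table entries exactly (up to rounding of the published means). A minimal sketch:

```python
# Saturated DD regression on the four state-period cell means
import numpy as np

# columns: (constant, NJ, Nov, NJ×Nov)
X = np.array([
    [1, 0, 0, 0],   # PA, Feb
    [1, 1, 0, 0],   # NJ, Feb
    [1, 0, 1, 0],   # PA, Nov
    [1, 1, 1, 1],   # NJ, Nov
], dtype=float)
y = np.array([23.33, 20.44, 21.17, 21.03])

alpha, gamma, lam, delta = np.linalg.solve(X, y)
assert round(alpha, 2) == 23.33   # E[y | PA, Feb]
assert round(gamma, 2) == -2.89   # NJ − PA gap in February
assert round(lam, 2) == -2.16     # PA change, Feb → Nov
assert round(delta, 2) == 2.75    # DD estimate (published: 2.76)
```

With 4 parameters and 4 cells the fit is exact, so the regression coefficients are just the cell-mean contrasts from the table.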
1. Easy to add states/periods: Just include more dummies. The generalization includes a dummy for each state and period.
2. Variable treatment intensity: Instead of switched-on/off treatment, use continuous measures.
Example: Card (1992) — Federal Minimum Wage
In 1990, federal minimum increased from $3.35 to $3.80. Impact varies by state (irrelevant in high-wage Connecticut, big deal in low-wage Mississippi).
The DD model interacts treatment intensity with a post-increase dummy dt:

yst = γs + λt + δ(fas × dt) + εst

where fas = baseline fraction of teens earning below $3.80 in state s (treatment intensity).
| Regressor | Δ Mean Log Wage | Δ Emp/Pop Ratio |
|---|---|---|
| Fraction affected (fas) | 0.15 (0.03) | 0.02 (0.03) |
Wages rose more in states where minimum wage had more bite (0.15), but employment was largely unrelated to fraction affected (0.02 ≈ 0).
3. Easy to add covariates: Control for time-varying state characteristics Xst (e.g., adult employment as proxy for state economic conditions).
When the sample includes many years and treatment timing varies across states, we can test whether "causes happen before consequences" by including leads and lags of treatment:

yst = γs + λt + Σ(τ=0..m) δ−τ ds,t−τ + Σ(τ=1..q) δ+τ ds,t+τ + εst

The lags (δ−τ) trace out post-treatment effects; the leads (δ+τ) should be zero if policy adoption is not anticipated.
Example: Autor (2003) — Employment Protection & Temp Workers
State court rulings allowing "unjust dismissal" lawsuits → Do firms use more temp workers?
Estimated leads and lags pattern: coefficients on the leads are small and close to zero (no effect before the rulings), while the lag coefficients grow for several years after adoption.
This pattern is consistent with a causal interpretation: no anticipation, gradual adjustment.
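The logic can be illustrated with a simulation (not Autor's data; all parameter values are ours). States adopt a policy in different years, the true effect is zero before adoption and +1.0 afterwards, and an event-study regression with state and year dummies recovers near-zero leads and positive lags:

```python
# Event-study (leads/lags) regression on simulated staggered adoption
import numpy as np

rng = np.random.default_rng(2)
S, T = 60, 10
adopt = rng.integers(4, 7, S)            # adoption year per state, in {4, 5, 6}
state_fe = rng.normal(0, 1, S)
year_fe = rng.normal(0, 1, T)

# event-time bins relative to adoption; k = -1 is the omitted reference
bins = [("lead2+", lambda k: k <= -2), ("k0", lambda k: k == 0),
        ("k1", lambda k: k == 1), ("k2+", lambda k: k >= 2)]
rows, ys = [], []
for s in range(S):
    for t in range(T):
        k = t - adopt[s]
        x = [1.0]                                          # intercept
        x += [1.0 if i == s else 0.0 for i in range(1, S)] # state dummies
        x += [1.0 if j == t else 0.0 for j in range(1, T)] # year dummies
        x += [1.0 if f(k) else 0.0 for _, f in bins]       # event-time bins
        rows.append(x)
        ys.append(state_fe[s] + year_fe[t] + (1.0 if k >= 0 else 0.0)
                  + rng.normal(0, 0.05))

beta, *_ = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)
lead2, k0, k1, k2p = beta[-4:]
assert abs(lead2) < 0.1                           # no pre-adoption "effect"
assert all(abs(b - 1.0) < 0.1 for b in (k0, k1, k2p))  # post effect ≈ 1
```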
Alternative robustness check: allow treatment and control states to follow different linear trends:

yst = γ0s + γ1s·t + λt + δdst + εst

where γ1s is a state-specific trend coefficient.
This allows limited heterogeneity in trends. It's heartening if results survive, discouraging otherwise.
Example: Besley & Burgess (2004) — Labor Regulation in India
| Specification | Labor Regulation Effect |
|---|---|
| DD only | −0.186 (0.064) |
| DD + state-level controls | −0.104 (0.039) |
| DD + state-specific trends | 0.0002 (0.02) |
Interpretation: Without trends, labor regulation appears to reduce output. With state trends, the effect disappears → regulation increased in states where output was already declining.
DD sets up an implicit treatment-control comparison. A potential pitfall: composition changes as a result of treatment.
Example: Welfare benefits and labor supply
If generous welfare states attract poor people with weak labor force attachment (program-induced migration), DD makes generous welfare look worse for labor supply than it really is.
Fix: Use state of birth or previous residence (unchanged by treatment but correlated with current location). This can be implemented as an IV strategy.
When treatment varies along three dimensions (state × time × age), use higher-order contrasts: subtract the DD for an unaffected comparison group in the same states (e.g., adults) from the DD for the affected group (children), giving a triple-differences (DDD) estimate.

This controls for: state-specific shocks common to all age groups, age-specific shocks common to all states, and time-invariant state × age differences.
Example: Yelowitz (1995) — Medicaid Expansion
Medicaid eligibility was once tied to AFDC (cash welfare). In the 1980s, some states extended coverage to children in families ineligible for AFDC.
Treatment varies by state, time, and child's age. DDD compares across all three dimensions, providing more convincing control than standard DD.
FE and DD are based on time-invariant omitted variables. But for many questions, this assumption doesn't seem plausible.
Example: Training Program Evaluation
People in government training programs have often suffered a recent setback (job loss). Many programs explicitly target such people.
Ashenfelter (1978), Ashenfelter & Card (1985): Training participants exhibit a pre-program earnings dip.
Past earnings is a time-varying confounder that cannot be subsumed in a time-invariant αi.
|  | Fixed Effects | Lagged Dependent Variable |
|---|---|---|
| Selection based on | Time-invariant unobservables (αi) | Past outcomes (yit−h) |
| CIA | E(y0it \| αi, Xit, dit) = E(y0it \| αi, Xit) | E(y0it \| yit−h, Xit, dit) = E(y0it \| yit−h, Xit) |
| Model | yit = αi + λt + ρdit + X'itβ + εit | yit = θ + γyit−h + λt + ρdit + X'itβ + εit |
| Appropriate when | Permanent unobserved ability/preferences drive selection | A recent setback or change drives selection (training programs) |
Tempting to estimate a model with both αi and yit−1:

yit = αi + λt + θyit−1 + ρdit + X'itβ + εit

To remove αi, we difference:

Δyit = Δλt + θΔyit−1 + ρΔdit + ΔX'itβ + Δεit

Nickell (1981) Problem:
Δyit−1 = yit−1 − yit−2 contains εit−1
Δεit = εit − εit−1 also contains εit−1
→ Regressor correlated with error! OLS is inconsistent, even as N → ∞ with T fixed.

Possible fix: Use yit−2 as an instrument for Δyit−1. But this requires εit to be serially uncorrelated — otherwise yit−2 is itself correlated with Δεit and the instrument is invalid.
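A simulation makes the Nickell problem concrete (illustrative parameters, our own; the instrument-based fix here is the Anderson–Hsiao idea of using yit−2 as an instrument for Δyit−1):

```python
# Nickell bias in the first-differenced dynamic panel, and the IV fix
import numpy as np

rng = np.random.default_rng(3)
N, T, gamma = 5000, 6, 0.5
alpha = rng.normal(0, 1, N)

y = np.zeros((N, T + 50))
for t in range(1, T + 50):                 # burn-in toward stationarity
    y[:, t] = alpha + gamma * y[:, t - 1] + rng.normal(0, 1, N)
y = y[:, -T:]                              # keep the last T periods

# First-difference OLS: regress Δy(t) on Δy(t-1), pooled over t
dy = np.diff(y, axis=1)
x, z = dy[:, :-1].ravel(), dy[:, 1:].ravel()
gamma_fd = (x * z).sum() / (x * x).sum()   # plim is (gamma - 1)/2 = -0.25

# Anderson-Hsiao: instrument Δy(t-1) with the level y(t-2)
inst = y[:, :-2].ravel()
num = (inst * z).sum() - inst.size * inst.mean() * z.mean()
den = (inst * x).sum() - inst.size * inst.mean() * x.mean()
gamma_iv = num / den

assert gamma_fd < 0.1                      # badly biased below gamma = 0.5
assert abs(gamma_iv - gamma) < 0.1         # IV is approximately consistent
```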
The FE and LDV models are not nested. Only the combined model (which is hard to estimate) nests both. However, they have a useful bracketing property:
| If True Model Is... | But You Estimate... | Bias Direction |
|---|---|---|
| LDV (selection on yit−1) | FE (differencing) | Upward — estimate too big |
| FE (selection on αi) | LDV (control for yit−1) | Downward — estimate too small |
Implication: FE and LDV estimates bracket the true causal effect. You can think of them as providing bounds.
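The bracketing property can be seen in a simulation (illustrative numbers, not from the text). Here the true model is LDV with selection on the lagged outcome, so the LDV estimate is about right while the FE (differenced) estimate is too big:

```python
# Bracketing: LDV truth, comparing LDV and FE estimates
import numpy as np

rng = np.random.default_rng(4)
N, gamma, rho = 20000, 0.5, 1.0

y0 = rng.normal(0, 1, N)
y1 = gamma * y0 + rng.normal(0, 1, N)
d = (y1 + rng.normal(0, 0.5, N) < 0).astype(float)   # low earners enroll
y2 = gamma * y1 + rho * d + rng.normal(0, 1, N)

# LDV: regress y2 on d and y1 (the correct control in this world)
X = np.column_stack([np.ones(N), d, y1])
rho_ldv = np.linalg.lstsq(X, y2, rcond=None)[0][1]

# FE: difference away a (nonexistent) fixed effect; d was 0 before, so Δd = d
Xd = np.column_stack([np.ones(N), d])
rho_fe = np.linalg.lstsq(Xd, y2 - y1, rcond=None)[0][1]

assert abs(rho_ldv - rho) < 0.1   # LDV recovers the truth (rho = 1)
assert rho_fe > rho + 0.2         # FE estimate is biased upward
```

Reversing the data-generating process (FE truth, LDV estimated) would show the mirror-image downward bias, so the two estimates bound the true effect from both sides.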
Case 1: FE is correct, but you use LDV

True model (simplified, no covariates/time effects, dit−1 = 0):

yit = αi + ρdit + εit, so yit−1 = αi + εit−1

where εit is serially uncorrelated and uncorrelated with αi and dit.

You mistakenly control for yit−1 = αi + εit−1. The LDV estimator has probability limit:

plim ρ̂LDV = ρ + Cov(d̃it, αi)/V(d̃it)

where d̃it = dit − πyit−1 is the residual from a regression of dit on yit−1.

Substituting αi = yit−1 − εit−1, and using Cov(d̃it, yit−1) = 0 and Cov(dit, εit−1) = 0:

Cov(d̃it, αi) = −Cov(d̃it, εit−1) = πσ²ε

The LDV estimator picks up ρ plus this bias term. Since trainees have low yit−1, the coefficient from a regression of dit on yit−1 is negative (π < 0). The bias term is therefore negative → the LDV estimate is too small.
Case 2: LDV is correct, but you use FE

True model:

yit = θ + γyit−1 + ρdit + εit

where εit is serially uncorrelated and 0 < γ < 1 (stationarity).

You mistakenly difference (FE). Subtracting yit−1 from both sides:

yit − yit−1 = θ + (γ − 1)yit−1 + ρdit + εit

The differenced estimator picks up ρ plus a bias term proportional to (γ − 1)Cov(dit, yit−1). Since γ < 1 (so γ − 1 < 0) and trainees have low yit−1 (negative correlation), the bias term is positive → the FE estimate is too big.
Example: Guryan (2004) uses this bracketing reasoning in studying the effects of court-ordered busing on Black high school graduation rates.
| Concept | Key Point |
|---|---|
| Fixed Effects | Eliminates time-invariant unobserved confounders using within-unit variation |
| FE Estimation | Deviations from means or first-differencing (equivalent with T=2) |
| FE Limitations | Measurement error amplified; removes both good and bad variation |
| DD | FE for aggregate data: (ΔTreatment) − (ΔControl) |
| Parallel Trends | Key DD assumption — treatment & control would follow same trend absent treatment |
| Regression DD | State + time dummies + interaction; allows variable treatment intensity, covariates |
| Testing DD | Pre-trends, leads/lags (Granger), state-specific trends, triple differences |
| FE vs. LDV | Different assumptions; not nested; estimates bracket true effect |
| Bracketing | FE too big if LDV true; LDV too small if FE true → bounds on causal effect |
Practical Checklist:
- Plot raw trends for treatment and control groups before treatment (pre-trends).
- Test leads of treatment (Granger-style): causes should not follow consequences.
- Check robustness to state-specific linear trends.
- Watch for composition changes induced by the treatment itself.
- Use robust/clustered standard errors to handle serially correlated residuals.
- When the selection mechanism is uncertain, report both FE and LDV estimates as brackets on the causal effect.