Chapter 5: Fixed Effects, Differences-in-Differences, and Panel Data


Angrist & Pischke, Mostly Harmless Econometrics — Chapter 5

"The first thing to realize about parallel universes... is that they are not parallel." — Douglas Adams

Core Message

When important confounders are unobserved but fixed over time, we can eliminate them using panel data strategies: fixed effects (within-person variation) or differences-in-differences (parallel trends assumption). These methods "punt on comparisons in levels" while requiring counterfactual trends to be the same.

The identification toolkit so far:

  • Chapter 3: Control for observed confounders (regression, matching)
  • Chapter 4: Use instruments when confounders are unobserved
  • Chapter 5: Exploit time/cohort dimension when confounders are unobserved but fixed

5.1 Individual Fixed Effects

Motivation: Union Wage Premium

A classic question in labor economics: Do workers whose wages are set by collective bargaining earn more because of this, or would they earn more anyway (perhaps because they are more experienced or skilled)?

The Problem: Unobserved worker ability Ai affects both union status and wages. If more able workers are more likely to join unions, OLS overstates the union effect.

The Fixed Effects Setup

Let yit = log earnings of worker i at time t, and dit = union status. Assume:

Conditional independence:

E(y0it | Ai, Xit, t, dit) = E(y0it | Ai, Xit, t)

Union status is as good as randomly assigned conditional on unobserved ability Ai, observed covariates Xit, and time.

Key assumption: The unobserved Ai appears without a time subscript in a linear model:

E(y0it | Ai, Xit, t) = α + λt + A'iγ + Xitβ

With constant, additive treatment effect ρ:

E(y1it | Ai, Xit, t) = E(y0it | Ai, Xit, t) + ρ

This implies the fixed effects model:

yit = αi + λt + ρdit + Xitβ + εit

where αi ≡ α + A'iγ is the individual fixed effect (treated as a parameter to be estimated), and λt is a year effect (coefficients on time dummies).

Note: These assumptions are more restrictive than those in Chapter 3. We need the linear, additive functional form to make progress on unobserved confounders using panel data without instruments.

Estimation Strategy 1: Deviations from Means

With panel data (repeated observations on individuals), we can eliminate αi. First, calculate individual averages:

ȳi = αi + λ̄ + ρd̄i + X̄iβ + ε̄i

Subtracting from the original equation:

(yit − ȳi) = (λt − λ̄) + ρ(dit − d̄i) + (Xit − X̄i)β + (εit − ε̄i)

The fixed effect αi is eliminated! This is called the "within estimator" or "analysis of covariance".

Why does this work algebraically?

By the regression anatomy formula (3.1.3), estimating with a full set of person dummies is the same as regressing on the residuals from a regression on those dummies. The residuals from regressing on person dummies are exactly deviations from person means.
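The within transformation can be illustrated with a small simulation (not from the book; all numbers are invented). Here unobserved ability raises both union membership and wages, so pooled OLS is biased upward, while demeaning recovers the true ρ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, rho = 500, 5, 0.10                       # true union premium (log points)

alpha = rng.normal(0, 1, n)                    # unobserved ability, fixed over time
# union status correlated with ability -> pooled OLS suffers from OVB
d = (alpha[:, None] + rng.normal(0, 1, (n, T)) > 0).astype(float)
y = alpha[:, None] + rho * d + rng.normal(0, 0.5, (n, T))

def slope(y, x):                               # bivariate OLS slope
    x = x - x.mean()
    return (x * (y - y.mean())).sum() / (x * x).sum()

b_pooled = slope(y.ravel(), d.ravel())         # ignores alpha_i: far too big

# within estimator: deviations from person means wipe out alpha_i
y_dm = y - y.mean(axis=1, keepdims=True)
d_dm = d - d.mean(axis=1, keepdims=True)
b_within = (d_dm * y_dm).sum() / (d_dm ** 2).sum()

print(b_pooled, b_within)                      # b_within is near the true 0.10
```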

Estimation Strategy 2: First Differencing

An alternative to deviations from means:

Δyit = Δλt + ρΔdit + ΔXitβ + Δεit

where Δyit = yit − yit−1.

Comparing the two strategies:

  • With T = 2: deviations from means and first differencing are algebraically identical.
  • With T > 2: deviations from means is more efficient if εit is homoskedastic and serially uncorrelated; first differencing may be more convenient, though note that Δεit is serially correlated.
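A quick numpy check of the T = 2 equivalence (illustrative only, on simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
alpha = rng.normal(0, 1, n)
d = (alpha[:, None] + rng.normal(0, 1, (n, 2)) > 0).astype(float)
y = alpha[:, None] + 0.1 * d + rng.normal(0, 0.5, (n, 2))

# within estimator (deviations from person means)
yd = y - y.mean(axis=1, keepdims=True)
dd = d - d.mean(axis=1, keepdims=True)
b_within = (dd * yd).sum() / (dd ** 2).sum()

# first differencing (one difference per person when T = 2)
b_fd = ((d[:, 1] - d[:, 0]) * (y[:, 1] - y[:, 0])).sum() \
       / ((d[:, 1] - d[:, 0]) ** 2).sum()

assert np.isclose(b_within, b_fd)   # algebraically identical with T = 2
```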

Fixed Effects vs. Random Effects

Random effects assumes αi is uncorrelated with the regressors. Then αi becomes part of the residual (no OVB from ignoring it), but residuals for a given person are correlated across periods.

The authors prefer: OLS with fixed effects + robust standard errors, rather than GLS under random effects. GLS requires stronger assumptions (linear CEF, homoskedasticity) and efficiency gains are typically modest.

Example: Union Wage Effects (Freeman 1984)

Freeman uses four panel data sets to estimate union wage effects:

Survey                    Cross-section   Fixed Effects
May CPS, 1974-75          0.19            0.09
NLS Young Men, 1970-78    0.28            0.19
Michigan PSID, 1970-79    0.23            0.14
QES, 1973-77              0.14            0.16

Pattern: FE estimates (0.09–0.19) are generally smaller than cross-section estimates (0.14–0.28). This suggests positive selection bias in cross-section — more able workers join unions and earn more.

Caution 1: Measurement Error

FE estimates are notoriously susceptible to attenuation bias:

  • Economic variables like union status tend to be persistent (a union member this year is likely a union member next year)
  • Measurement error often changes year-to-year (union status may be misreported this year but not next)
  • → While few workers are misclassified in any single year, observed year-to-year changes in union status may be mostly noise
  • → More measurement error in Δdit than in dit → FE estimates biased toward zero
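The attenuation mechanism is easy to see in a simulation (illustrative, not from the book). Ability bias is deliberately switched off here so that only measurement error operates; status is persistent and misreported 5% of the time each year:

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 5000, 0.2

alpha = rng.normal(0, 1, n)                   # fixed effect, independent of d here
                                              # (isolates the measurement-error channel)
d0 = (rng.normal(0, 1, n) > 0).astype(float)  # period-1 union status
switch = rng.random(n) < 0.10                 # only 10% truly change status
d = np.column_stack([d0, np.where(switch, 1 - d0, d0)])
y = alpha[:, None] + rho * d + rng.normal(0, 0.3, (n, 2))

flip = rng.random((n, 2)) < 0.05              # 5% misreported, independent each year
d_obs = np.where(flip, 1 - d, d)

def slope(y, x):
    x = x - x.mean()
    return (x * (y - y.mean())).sum() / (x * x).sum()

b_cs = slope(y.ravel(), d_obs.ravel())                        # levels: mild attenuation
b_fd = slope(y[:, 1] - y[:, 0], d_obs[:, 1] - d_obs[:, 0])    # differences: severe

print(b_cs, b_fd)   # b_fd is pulled much closer to zero than b_cs
```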

Possible fixes:

  • IV: Use cross-sibling reports as instruments (Ashenfelter & Krueger 1994)
  • External validation: Adjust estimates using measurement error rates from validation surveys (Card 1996)

Caution 2: Removing Good Variation (Twins Example)

Differencing and demeaning remove good and bad variation alike. The transformation throws out the OVB bathwater, but useful identifying variation can go out with it.

Twins and Returns to Schooling:

Ashenfelter & Krueger (1994) and Ashenfelter & Rouse (1998) estimate returns to schooling using twins, controlling for family fixed effects (common family/genetic background).

Surprising result: Within-family estimates are larger than OLS!

Bound & Solon (1999) critique:

  • Even twins differ: first-borns typically have higher birth weight and higher IQ
  • Within-twin ability differences are small, but so are within-twin differences in schooling
  • → Even a small unobserved ability difference can therefore cause substantial bias

Bottom line: Avoid overly strong claims when interpreting fixed-effects estimates. The exact nature of the unobserved variables typically remains somewhat mysterious.

5.2 Differences-in-Differences (DD)

When Treatment Varies at Group Level

FE requires panel data with repeated observations on the same individuals. Often, however, treatment varies only at a more aggregate level (state, cohort). Examples:

  • State policies on health care benefits for pregnant workers
  • State minimum wages
  • Court rulings on employment law

The source of OVB must therefore be unobserved variables at the state and year level.

Classic Example: Card & Krueger (1994) — Minimum Wage

Classic question: In a competitive labor market, higher minimum wages should reduce employment (moving up a downward-sloping demand curve). Does this actually happen?

Natural experiment:

  • April 1, 1992: New Jersey raised state minimum from $4.25 to $5.05
  • Pennsylvania: Stayed at $4.25 (federal minimum)
  • Data: Employment at fast food restaurants (Burger King, Wendy's, etc.) in NJ and eastern PA
  • Timing: February 1992 (before) and November 1992 (after)

The DD Model

Define potential outcomes:

y1ist = employment if high minimum wage
y0ist = employment if low minimum wage

Key assumption — parallel trends in absence of treatment:

E(y0ist | s, t) = γs + λt

This says: in the absence of a minimum wage change, employment is determined by the sum of:

  • γs: Time-invariant state effect (plays the role of αi in individual FE)
  • λt: Year effect common across states

With constant treatment effect δ:

yist = γs + λt + δdst + εist

where dst is a dummy for high-minimum-wage state-periods and E(εist | s, t) = 0.

Deriving the DD Estimator

Control state (PA):

E[y|PA, Nov] − E[y|PA, Feb] = λNov − λFeb

Treatment state (NJ):

E[y|NJ, Nov] − E[y|NJ, Feb] = λNov − λFeb + δ

Difference-in-differences:

[E[y|NJ, Nov] − E[y|NJ, Feb]] − [E[y|PA, Nov] − E[y|PA, Feb]] = δ
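In code, the DD estimator is just this double subtraction of four group-by-period means (the numbers below are hypothetical, not Card and Krueger's):

```python
def dd(treat_post, treat_pre, ctrl_post, ctrl_pre):
    """Difference-in-differences from four group-by-period means."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# treatment group rises by 10, control rises by 3 -> delta = 7
print(dd(30, 20, 28, 25))  # 7
```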

Card & Krueger Results

FTE Employment   PA (Control)   NJ (Treatment)   NJ − PA
Before (Feb)     23.33 (1.35)   20.44 (0.51)     −2.89 (1.44)
After (Nov)      21.17 (0.94)   21.03 (0.52)     −0.14 (1.07)
Change           −2.16 (1.25)   +0.59 (0.54)     +2.76 (1.36)

(Standard errors in parentheses.)

Interpretation:

  • PA employment fell by 2.16 workers per store
  • NJ employment rose by 0.59 workers per store
  • DD = +2.76 — opposite of standard prediction!
  • Higher minimum wage did not reduce employment; if anything, it slightly increased it

Visual Representation

Employment
    │
    │                    ●───────● Treatment (observed)
    │                   ╱         
    │                  ╱  ← Treatment effect (δ)
    │                 ╱           
    │                ●─ ─ ─ ─ ─ ●  Counterfactual
    │               ╱               (parallel to control)
    │              ╱
    │  ●─────────●  Control (observed)
    │
    └────────────────────────────── Time
              Before      After

Key insight: We never observe the counterfactual.
The parallel trends assumption lets us use
the control group's change as a proxy.
                

Testing Parallel Trends

The identifying assumption can be investigated with multiple pre-treatment periods. Do treatment and control follow similar trends before treatment?

Card & Krueger (2000) Follow-up:

Administrative payroll data for restaurants in NJ and PA for multiple years reveal:

  • Feb-Nov 1992: Slight PA decline, little NJ change (consistent with original survey)
  • But: Substantial year-to-year variation in other periods
  • Employment swings often differ substantially between states
  • PA employment fell relative to NJ over 1992-1995, mostly before the 1996 federal minimum increase

Concern: PA may not provide a good measure of counterfactual NJ employment.

Better Example: Pischke (2007) — German School Term Length

  • Until 1960s: German states (except Bavaria) started school in Spring
  • 1966-67: Non-Bavarian states switched to Fall start
  • Transition required two short school years (24 weeks instead of 37)
  • Outcome: Grade repetition rates for 2nd graders

Results:

  • Bavaria (control): Flat repetition rates ~2.5% from 1966 onwards
  • Treatment states: Higher baseline (~4-4.5%), jump by ~1 percentage point for affected cohorts, then return to baseline
  • → Strong visual evidence of parallel trends + transitory treatment effect

5.2.1 Regression DD

DD can be estimated via regression. Let NJs = dummy for NJ, dt = dummy for November:

yist = α + γ·NJs + λ·dt + δ·(NJs × dt) + εist

Parameter interpretation:

Parameter   Meaning
α           E[y | PA, Feb] = γPA + λFeb
γ           E[y | NJ, Feb] − E[y | PA, Feb] = γNJ − γPA
λ           E[y | PA, Nov] − E[y | PA, Feb] = λNov − λFeb
δ           DD estimate = {E[y|NJ,Nov] − E[y|NJ,Feb]} − {E[y|PA,Nov] − E[y|PA,Feb]}

This is a saturated model: 4 possible values of E(y|s,t), 4 parameters.
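Because the model is saturated, the OLS interaction coefficient reproduces the DD of cell means exactly. A simulated check (illustrative; δ = 0.6 and the other coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
nj = rng.integers(0, 2, n).astype(float)    # 1 = treatment state
nov = rng.integers(0, 2, n).astype(float)   # 1 = post period
y = 1.0 + 0.5 * nj - 0.3 * nov + 0.6 * nj * nov + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), nj, nov, nj * nov])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

cell = {(s, t): y[(nj == s) & (nov == t)].mean() for s in (0, 1) for t in (0, 1)}
dd = (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0])

assert np.isclose(beta[3], dd)   # interaction coefficient = DD of cell means
```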

Advantages of Regression DD:

1. Easy to add states/periods: Just include more dummies. The generalization includes a dummy for each state and period.

2. Variable treatment intensity: Instead of switched-on/off treatment, use continuous measures.

Example: Card (1992) — Federal Minimum Wage

In 1990, federal minimum increased from $3.35 to $3.80. Impact varies by state (irrelevant in high-wage Connecticut, big deal in low-wage Mississippi).

yist = γs + λt + δ·(fas × dt) + εist

where fas = baseline fraction of teens earning below $3.80 in state s (treatment intensity).

Outcome:                  Δ Mean Log Wage   Δ Emp/Pop Ratio
Fraction affected (fas)   0.15 (0.03)       0.02 (0.03)

Wages rose more in states where minimum wage had more bite (0.15), but employment was largely unrelated to fraction affected (0.02 ≈ 0).

3. Easy to add covariates: Control for time-varying state characteristics Xst (e.g., adult employment as proxy for state economic conditions).

Granger-Style Causality Tests: Leads and Lags

When the sample includes many years and treatment timing varies across states, we can test whether "causes happen before consequences":

yist = γs + λt + Σ(τ=0 to m) δ−τ·ds,t−τ + Σ(τ=1 to q) δ+τ·ds,t+τ + Xistβ + εist

  • Lags (δ−τ): post-treatment effects — how do effects evolve over time?
  • Leads (δ+τ): pre-treatment "effects" — should be zero if treatment is causal!
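Constructing the lead and lag dummies is mechanical once an adoption year is known for each state. A toy sketch (the state count, years, and adoption dates are invented):

```python
import numpy as np

states = np.repeat(np.arange(4), 10)                 # 4 states x 10 years
years = np.tile(np.arange(1990, 2000), 4)
adopt = np.array([1993, 1995, 1997, 0])[states]      # 0 = never treated

# event time relative to adoption; never-treated states get a sentinel
event = np.where(adopt > 0, years - adopt, -999)

lags = {tau: (event == tau).astype(float) for tau in (0, 1, 2)}    # delta_{-tau}
leads = {tau: (event == -tau).astype(float) for tau in (1, 2)}     # delta_{+tau}

# each adopting state contributes exactly one observation at each event time
assert lags[0].sum() == 3 and leads[2].sum() == 3
```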

Example: Autor (2003) — Employment Protection & Temp Workers

State court rulings allowing "unjust dismissal" lawsuits → Do firms use more temp workers?

Estimated leads and lags pattern:

  • 2 years before, 1 year before: No effect (leads ≈ 0) ✓
  • Year of adoption: Small positive effect
  • 1-3 years after: Sharply increasing effects
  • 4+ years after: Effects flatten at permanently higher level

This pattern is consistent with a causal interpretation: no anticipation, gradual adjustment.

State-Specific Trends

Alternative robustness check: allow treatment and control to follow different linear trends:

yist = γ0s + γ1s·t + λt + δdst + Xistβ + εist

This allows limited heterogeneity in trends. It's heartening if results survive, discouraging otherwise.
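One way to see what this specification adds: each state gets its own intercept and its own slope in t, on top of common year effects. A design-matrix sketch (dimensions invented; one trend column is dropped because the common linear trend is already spanned by the year dummies):

```python
import numpy as np

S, T = 3, 8
states = np.repeat(np.arange(S), T)
t = np.tile(np.arange(T), S).astype(float)

state_fx = (states[:, None] == np.arange(S)).astype(float)   # gamma_0s
trends = state_fx * t[:, None]                               # gamma_1s * t
year_fx = (t[:, None] == np.arange(1, T)).astype(float)      # lambda_t (t = 0 omitted)
d = ((states == 0) & (t >= 4)).astype(float)                 # treatment turns on in state 0

# drop one state trend to avoid collinearity with the year dummies
X = np.column_stack([state_fx, trends[:, 1:], year_fx, d])
assert np.linalg.matrix_rank(X) == X.shape[1]                # full rank: delta identified
```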

Example: Besley & Burgess (2004) — Labor Regulation in India

Specification                  Labor Regulation Effect
DD only                        −0.186 (0.064)
DD + state-level controls      −0.104 (0.039)
DD + state-specific trends     0.0002 (0.02)

Interpretation: Without trends, labor regulation appears to reduce output. With state trends, the effect disappears → regulation increased in states where output was already declining.

Picking Controls: Composition Changes

DD sets up an implicit treatment-control comparison. A potential pitfall: composition changes as a result of treatment.

Example: Welfare benefits and labor supply

If generous welfare states attract poor people with weak labor force attachment (program-induced migration), DD makes generous welfare look worse for labor supply than it really is.

Fix: Use state of birth or previous residence (unchanged by treatment but correlated with current location). This can be implemented as an IV strategy.

Triple Differences (DDD)

When treatment varies along three dimensions (state × time × age), use higher-order contrasts:

yiast = γst + λat + μas + δdast + Xiastβ + εiast

This controls for:

  • γst: State × time effects (common across age groups)
  • λat: Age × time effects (common across states)
  • μas: State × age effects (common across time)

Example: Yelowitz (1995) — Medicaid Expansion

Medicaid eligibility was once tied to AFDC (cash welfare). In the 1980s, some states extended coverage to children in families ineligible for AFDC.

Treatment varies by state, time, and child's age. DDD compares across all three dimensions, providing more convincing control than standard DD.

5.3 Fixed Effects versus Lagged Dependent Variables

The Dilemma

FE and DD are based on time-invariant omitted variables. But for many questions, this assumption doesn't seem plausible.

Example: Training Program Evaluation

People in government training programs have often suffered a recent setback (job loss). Many programs explicitly target such people.

Ashenfelter (1978), Ashenfelter & Card (1985): Training participants exhibit a pre-program earnings dip.

Past earnings is a time-varying confounder that cannot be subsumed in a time-invariant αi.

Two Competing Models

The fixed effects model:

  • Selection based on: time-invariant unobservables (αi)
  • CIA: E(y0it | αi, Xit, dit) = E(y0it | αi, Xit)
  • Model: yit = αi + λt + ρdit + Xitβ + εit
  • Appropriate when permanent unobserved ability or preferences drive selection

The lagged dependent variable model:

  • Selection based on: past outcomes (yit−h)
  • CIA: E(y0it | yit−h, Xit, dit) = E(y0it | yit−h, Xit)
  • Model: yit = θ + γyit−h + λt + ρdit + Xitβ + εit
  • Appropriate when a recent setback or change drives selection (e.g., training programs)

Can We Include Both?

Tempting to estimate a model with both αi and yit−1:

yit = αi + γyit−1 + λt + ρdit + Xitβ + εit

To remove αi, we difference:

Δyit = γΔyit−1 + Δλt + ρΔdit + ΔXitβ + Δεit

Nickell (1981) Problem:

Δyit−1 = yit−1 − yit−2 contains εit−1

Δεit = εit − εit−1 also contains εit−1

Regressor correlated with error! OLS is inconsistent.

Possible fix: Use yit−2 as an instrument for Δyit−1. But this requires:

  • At least 3 periods of data
  • εit to be serially uncorrelated (unlikely — earnings are highly persistent)
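Both the Nickell bias and the instrumental-variable fix can be checked in a short simulation (illustrative; γ = 0.5 and all sample sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T, gamma = 5000, 5, 0.5

alpha = rng.normal(0, 1, n)
y = np.zeros((n, T))
y[:, 0] = alpha + rng.normal(0, 1, n)
for s in range(1, T):
    y[:, s] = alpha + gamma * y[:, s - 1] + rng.normal(0, 1, n)

dy = np.diff(y, axis=1)
x = dy[:, :-1].ravel()           # Δy_{t-1}
z = dy[:, 1:].ravel()            # Δy_t

# OLS on differences: Δy_{t-1} and Δε_t share ε_{t-1} -> inconsistent
xc, zc = x - x.mean(), z - z.mean()
g_ols = (xc * zc).sum() / (xc * xc).sum()

# the fix suggested above: instrument Δy_{t-1} with the level y_{t-2}
w = y[:, :T - 2].ravel()
wc = w - w.mean()
g_iv = (wc * zc).sum() / (wc * xc).sum()

print(g_ols, g_iv)   # g_ols is far below the true 0.5; g_iv is close to 0.5
```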

The Bracketing Property

The FE and LDV models are not nested. Only the combined model (which is hard to estimate) nests both. However, they have a useful bracketing property:

If the true model is...     But you estimate...        Bias direction
LDV (selection on yit−1)    FE (differencing)          Upward — estimate too big
FE (selection on αi)        LDV (control for yit−1)    Downward — estimate too small

Implication: FE and LDV estimates bracket the true causal effect. You can think of them as providing bounds.
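The bracketing property can be checked by simulation (illustrative; ρ = 1 and all other numbers are invented). Under each data-generating process, the wrong estimator errs in the predicted direction:

```python
import numpy as np

rng = np.random.default_rng(5)
n, rho = 50_000, 1.0

def slope(y, x):
    x = x - x.mean()
    return (x * (y - y.mean())).sum() / (x * x).sum()

def resid(v, c):                     # residual from OLS of v on [1, c]
    C = np.column_stack([np.ones_like(c), c])
    return v - C @ np.linalg.lstsq(C, v, rcond=None)[0]

# Case A: selection on the lagged outcome (LDV is the right model)
y0 = rng.normal(0, 1, n)
dA = (y0 + rng.normal(0, 0.5, n) < -0.5).astype(float)     # low past earnings -> trained
y1 = 0.5 * y0 + rho * dA + rng.normal(0, 1, n)
fe_a = slope(y1 - y0, dA)                       # differencing: too big
ldv_a = slope(resid(y1, y0), resid(dA, y0))     # controlling for y0: about right

# Case B: selection on a fixed effect (FE is the right model)
alpha = rng.normal(0, 1, n)
dB = (alpha + rng.normal(0, 0.5, n) < -0.5).astype(float)  # low ability -> trained
z0 = alpha + rng.normal(0, 1, n)                # pre-program outcome
z1 = alpha + rho * dB + rng.normal(0, 1, n)
fe_b = slope(z1 - z0, dB)                       # differencing: about right
ldv_b = slope(resid(z1, z0), resid(dB, z0))     # controlling for z0: too small

print(fe_a, ldv_a)   # FE overshoots 1.0 when LDV is true
print(fe_b, ldv_b)   # LDV undershoots 1.0 when FE is true
```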

Appendix: Why Bracketing Works


Case 1: FE is correct, but you use LDV

True model (simplified, no covariates/time effects, dit−1 = 0):

yit = αi + ρdit + εit

where εit is serially uncorrelated and uncorrelated with αi, dit.

You mistakenly control for yit−1 = αi + εit−1. The LDV estimator has probability limit:

Cov(yit, d̃it) / V(d̃it)

where d̃it = dit − [regression of dit on yit−1].

Substituting αi = yit−1 − εit−1:

yit = yit−1 + ρdit + εit − εit−1

The LDV estimator picks up:

ρ + π·σ²ε / V(d̃it)

where π < 0 is the coefficient from the regression of dit on yit−1 (trainees have low lagged earnings, so dit and yit−1 are negatively correlated). The bias term is therefore negative → the LDV estimate is too small.


Case 2: LDV is correct, but you use FE

True model:

yit = θ + γyit−1 + ρdit + εit

where εit is serially uncorrelated and 0 < γ < 1 (stationarity).

You mistakenly difference (FE). Subtracting yit−1:

yit − yit−1 = θ + (γ−1)yit−1 + ρdit + εit

The differenced estimator picks up:

ρ + (γ−1) × Cov(yit−1, dit) / V(dit)

Since γ < 1 (so γ−1 < 0) and trainees have low yit−1 (negative correlation), the bias term is positive → FE estimate is too big.

Practical Advice

  1. Check robustness: Estimate both FE and LDV models. If they give similar results, you can be more confident.
  2. Interpret as bounds: If results differ, the truth likely lies between them (FE upper bound, LDV lower bound for positive effects).
  3. Think about selection: Is selection more plausibly based on permanent characteristics (FE) or recent history (LDV)?

Example: Guryan (2004) uses this bracketing reasoning in studying the effects of court-ordered busing on Black high school graduation rates.

Chapter 5 Summary

Concept            Key Point
Fixed Effects      Eliminates time-invariant unobserved confounders using within-unit variation
FE Estimation      Deviations from means or first differencing (equivalent with T = 2)
FE Limitations     Measurement error amplified; removes both good and bad variation
DD                 FE for aggregate data: (Δ Treatment) − (Δ Control)
Parallel Trends    Key DD assumption — treatment & control would follow the same trend absent treatment
Regression DD      State + time dummies + interaction; allows variable treatment intensity, covariates
Testing DD         Pre-trends, leads/lags (Granger), state-specific trends, triple differences
FE vs. LDV         Different assumptions; not nested; estimates bracket the true effect
Bracketing         FE too big if LDV true; LDV too small if FE true → bounds on the causal effect

Practical Checklist:

  1. ✓ FE/DD exploit within-unit variation over time — gives up level comparisons
  2. ✓ Always test parallel trends with pre-treatment data when possible
  3. ✓ Check for measurement error effects (FE may be attenuated)
  4. ✓ Run leads/lags specification — leads should be zero
  5. ✓ Try state-specific trends as robustness check
  6. ✓ Consider both FE and LDV — they bracket the truth
  7. ✓ Watch for composition changes in treatment/control groups
This note was written with the assistance of LLM (Claude).