Chapter 3: Making Regression Make Sense

Angrist & Pischke, Mostly Harmless Econometrics

Core Message

Regression is useful because it provides the best linear approximation to the Conditional Expectation Function (CEF). Whether a regression has a causal interpretation depends on the Conditional Independence Assumption (CIA).

3.1 Regression Fundamentals

3.1.1 The Conditional Expectation Function (CEF)

The CEF is the expected value of Yi given Xi:

E[Yi | Xi]

Example: The CEF of log weekly wages given years of schooling is upward sloping; on average, each additional year of schooling is associated with roughly 10% higher earnings.

The Law of Iterated Expectations

An unconditional expectation equals the expectation of the CEF:

E[Yi] = E{ E[Yi | Xi] }

Three Key Properties of the CEF

Property 1: CEF Decomposition

Yi = E[Yi | Xi] + εi

where:

  • εi is mean-independent of Xi: E[εi | Xi] = 0
  • εi is uncorrelated with any function of Xi

→ Any random variable can be decomposed into a part "explained by X" (the CEF) and an orthogonal residual.

Property 2: CEF Prediction

E[Yi | Xi] = arg min_{m(·)} E[(Yi − m(Xi))²]

→ The CEF is the Minimum Mean Squared Error (MMSE) predictor of Y given X.

Property 3: ANOVA Theorem

V(Yi) = V(E[Yi | Xi]) + E[V(Yi | Xi)]

→ Total variance = Variance explained by X + Residual variance

3.1.2 Linear Regression and the CEF

The population regression coefficient is defined as:

β = arg min_b E[(Yi − Xi'b)²]

Solution:

β = E[XiXi']⁻¹ E[XiYi]

Regression Anatomy Formula

For the k-th regressor in a multivariate regression:

βk = Cov(Yi, x̃ki) / V(x̃ki)

where x̃ki is the residual from regressing xki on all other covariates.

Interpretation: Each coefficient in a multivariate regression is the bivariate slope after "partialling out" all other variables.

Three Justifications for Regression

| Theorem | Statement | When it applies |
|---|---|---|
| Linear CEF | If the CEF is linear, regression recovers it exactly | Joint normality; saturated models |
| Best Linear Predictor | Xi'β is the MMSE linear predictor of Yi | Always |
| Regression-CEF | Xi'β is the best linear approximation to the CEF | Always, even if the CEF is nonlinear |

Key insight: Even if the CEF is nonlinear, regression provides the best linear approximation to it. This is the most general justification for using regression.

3.1.3 Asymptotic OLS Inference

The OLS estimator:

β̂ = (Σ XiXi')⁻¹ Σ XiYi

Key Asymptotic Results

| Result | What it says |
|---|---|
| Law of Large Numbers | Sample moments converge in probability to population moments |
| Central Limit Theorem | √N(β̂ − β) converges in distribution to a normal |
| Slutsky's Theorem | Terms that converge in probability to constants can be replaced by those constants in distributional limits |

Heteroskedasticity-Robust Standard Errors

The robust variance estimator:

V(β̂) = E[XiXi']⁻¹ E[XiXi'ei²] E[XiXi']⁻¹

Why use robust SEs?

  • If the CEF is nonlinear, the regression residuals necessarily vary with Xi, so heteroskedasticity is the natural case
  • Conventional (homoskedastic) SEs assume E[ei² | Xi] = σ², a constant
  • Robust SEs remain valid without this assumption (a sample-analogue computation is sketched below)

3.1.4 Saturated Models

Definition: A saturated model has a separate parameter for every possible combination of values of its (discrete) regressors.

Example with two dummies (x1 = college, x2 = female):

Yi = α + β·x1i + γ·x2i + δ·(x1i·x2i) + εi

| Term | Name | Interpretation |
|---|---|---|
| β, γ | Main effects | Effect of each dummy on its own |
| δ | Interaction term | How the college effect differs by gender |

Key point: Saturated models fit the CEF perfectly because the CEF is linear in the dummy regressors.

3.2 Regression and Causality

Central Question: When does regression have a causal interpretation?

Answer: When the CEF it approximates is causal, which requires the Conditional Independence Assumption (CIA).

3.2.1 The Conditional Independence Assumption (CIA)

Setup: Potential Outcomes

For schooling s, let Ysi = fi(s) denote person i's potential earnings with s years of education.

The CIA states:

Ysi ⊥ si | Xi   for all s

"Potential outcomes are independent of actual schooling, conditional on X"

What does CIA mean?

  • Selection on observables: Xi captures all reasons why schooling and potential outcomes are correlated
  • As good as random: Conditional on X, schooling is "as good as randomly assigned"

Implications of CIA

Given CIA, conditional comparisons are causal:

E[Yi | Xi, si = s] − E[Yi | Xi, si = s−1] = E[fi(s) − fi(s−1) | Xi]

→ The difference in mean earnings between schooling levels has a causal interpretation!

From CIA to Regression

Assume a linear constant-effects model:

fi(s) = α + ρs + ηi

where ηi is the random part of potential earnings.

Decompose ηi:

ηi = Xi'γ + vi

The causal regression model becomes:

Yi = α + ρsi + Xi'γ + vi

By construction of the linear projection, vi is uncorrelated with Xi; given the CIA, it is also uncorrelated with si, so OLS on this equation recovers the causal effect ρ.

3.2.2 The Omitted Variables Bias (OVB) Formula

Consider a "long" regression with ability controls Ai:

Yi = α + ρsi + Ai'γ + εi

And a "short" regression without Ai:

Yi = α̃ + ρ̃si + ε̃i

The OVB Formula

ρ̃ = ρ + γ'δAs

Short = Long + (Effect of omitted) × (Regression of omitted on included)

where δAs is the coefficient from regressing Ai on si.

Application: Returns to Schooling

| Controls | Schooling coefficient |
|---|---|
| None | 0.132 |
| Age dummies | 0.131 |
| + Family background | 0.114 |
| + AFQT score | 0.087 |
| + Occupation dummies | 0.066 |

Source: NLSY data

→ The coefficient falls as we add controls that are positively correlated with both wages and schooling. (The occupation-dummies row previews the "bad control" problem discussed next: occupation is itself an outcome of schooling.)

3.2.3 Bad Control

Bad controls are variables that are themselves outcomes of the treatment.

Good controls are variables determined before the treatment.

Example: Controlling for Occupation

Should we control for occupation in a schooling regression?

Problem: College affects occupation choice!

  • wi = 1 if white collar job
  • College → more likely white collar

Comparing within occupation:

E[Yi | wi = 1, ci = 1] − E[Yi | wi = 1, ci = 0]

  = E[Y1i − Y0i | w1i = 1]   ← causal effect for those who would be white collar with college

  + {E[Y0i | w1i = 1] − E[Y0i | w0i = 1]}   ← selection bias from the change in composition

Why is this biased?

  • College graduates in white-collar jobs are typical graduates
  • Non-graduates in white-collar jobs are exceptional, positively selected non-graduates
  • → Conditioning on occupation compares different types of people (see the simulation below)

Proxy Control Problem

What if we control for a "late" ability measure ali, taken after schooling is complete?

ali = π0 + π1·si + π2·ai

where ai is innate (early) ability. If schooling raises measured ability (π1 > 0), controlling for late ability biases the schooling coefficient downward.

Rule of Thumb

Timing matters!

  • ✅ Variables measured before treatment → Good controls
  • ❌ Variables measured after treatment → Potentially bad controls

Chapter 3 Summary

| Concept | Key point |
|---|---|
| CEF | The MMSE predictor of Y given X |
| Regression | Best linear approximation to the CEF |
| Regression anatomy | βk is the bivariate slope after partialling out the other regressors |
| CIA | Potential outcomes independent of treatment conditional on X; makes regression causal |
| OVB formula | Short = Long + (effect of omitted) × (regression of omitted on included) |
| Bad control | Don't control for variables that are themselves outcomes of treatment |

References

  • Barnow, B. S., Cain, G. G., & Goldberger, A. S. (1981). Issues in the Analysis of Selectivity Bias. Evaluation Studies Review Annual, 5.
  • White, H. (1980). A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica, 48(4), 817–838.
  • Frisch, R., & Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. Econometrica, 1(4), 387–401.
  • Angrist, J. D. (1998). Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants. Econometrica, 66(2), 249–288.
This note was written with the assistance of an LLM (Claude).