Chapter 3: Making Regression Make Sense

Angrist & Pischke, Mostly Harmless Econometrics

Core Message

Regression is useful because it provides the best linear approximation to the Conditional Expectation Function (CEF). Whether a regression has a causal interpretation depends on the Conditional Independence Assumption (CIA).

3.1 Regression Fundamentals

3.1.1 The Conditional Expectation Function (CEF)

The CEF is the expected value of Yi given Xi:

E[Yi | Xi]

Example: The CEF of log weekly wages given years of schooling is upward sloping; on average, each additional year of schooling is associated with roughly 10% higher earnings.

The Law of Iterated Expectations

An unconditional expectation equals the expectation of the CEF:

E[Yi] = E{ E[Yi | Xi] }

Three Key Properties of the CEF

Property 1: CEF Decomposition

Yi = E[Yi | Xi] + εi

where:

  • εi is mean-independent of Xi: E[εi | Xi] = 0
  • εi is uncorrelated with any function of Xi

→ Any random variable can be decomposed into a part "explained by X" (the CEF) and an orthogonal residual.

Property 2: CEF Prediction

E[Yi | Xi] = arg min_{m(·)} E[(Yi − m(Xi))²]

→ The CEF is the Minimum Mean Squared Error (MMSE) predictor of Y given X.

Property 3: ANOVA Theorem

V(Yi) = V(E[Yi | Xi]) + E[V(Yi | Xi)]

→ Total variance = Variance explained by X + Residual variance

3.1.2 Linear Regression and the CEF

The population regression coefficient is defined as:

β = arg min_b E[(Yi − Xi'b)²]

Solution:

β = E[XiXi']⁻¹ E[XiYi]

Regression Anatomy Formula

For the k-th regressor in a multivariate regression:

βk = Cov(Yi, x̃ki) / V(x̃ki)

where x̃ki is the residual from regressing xki on all other covariates.

Interpretation: Each coefficient in a multivariate regression is the bivariate slope after "partialling out" all other variables.

Three Justifications for Regression

| Theorem | Statement | When it applies |
|---|---|---|
| Linear CEF | If the CEF is linear, regression recovers it exactly | Joint normality; saturated models |
| Best Linear Predictor | Xi'β is the MMSE linear predictor of Yi | Always |
| Regression-CEF | Xi'β is the best linear approximation to the CEF | Always, even if the CEF is nonlinear |

Key insight: Even if the CEF is nonlinear, regression provides the best linear approximation to it. This is the most general justification for using regression.

3.1.3 Asymptotic OLS Inference

The OLS estimator:

β̂ = (Σ XiXi')⁻¹ Σ XiYi

Key Asymptotic Results

| Result | What it says |
|---|---|
| Law of Large Numbers | Sample moments converge in probability to population moments |
| Central Limit Theorem | √N(β̂ − β) converges in distribution to a normal |
| Slutsky's Theorem | Terms that converge in probability to constants can be replaced by those constants in distributional limits |

Heteroskedasticity-Robust Standard Errors

The robust variance estimator:

V(β̂) = E[XiXi']⁻¹ E[XiXi'ei²] E[XiXi']⁻¹

Why use robust SEs?

  • If the CEF is nonlinear, the regression residuals necessarily vary with Xi, so heteroskedasticity is the natural case
  • Conventional (homoskedastic) SEs assume E[ei² | Xi] = σ², a constant
  • Robust SEs remain valid without this assumption (a sample-analogue computation is sketched below)

3.1.4 Saturated Models

Definition: A saturated model has a separate parameter for every possible combination of values of its (discrete) regressors.

Example with two dummies (x1 = college, x2 = female):

Yi = α + β·x1i + γ·x2i + δ·(x1i·x2i) + εi

| Term | Name | Interpretation |
|---|---|---|
| β, γ | Main effects | Effect of each dummy on its own |
| δ | Interaction term | How the college effect differs by gender |

Key point: Saturated models fit the CEF perfectly because the CEF is linear in the dummy regressors.

3.2 Regression and Causality

Central Question: When does regression have a causal interpretation?

Answer: When the CEF it approximates is causal, which requires the Conditional Independence Assumption (CIA).

3.2.1 The Conditional Independence Assumption (CIA)

Setup: Potential Outcomes

For schooling s, let Ysi = fi(s) denote person i's potential earnings with s years of education.

The CIA states:

Ysi ⊥ si | Xi   for all s

"Potential outcomes are independent of actual schooling, conditional on X"

What does CIA mean?

  • Selection on observables: Xi captures all reasons why schooling and potential outcomes are correlated
  • As good as random: Conditional on X, schooling is "as good as randomly assigned"

Implications of CIA

Given CIA, conditional comparisons are causal:

E[Yi | Xi, si = s] − E[Yi | Xi, si = s−1] = E[fi(s) − fi(s−1) | Xi]

→ The difference in mean earnings between schooling levels has a causal interpretation!

From CIA to Regression

Assume a linear constant-effects model:

fi(s) = α + ρs + ηi

where ηi is the random part of potential earnings.

Decompose ηi:

ηi = Xi'γ + vi

The causal regression model becomes:

Yi = α + ρsi + Xi'γ + vi

By construction of the linear projection, vi is uncorrelated with Xi; given the CIA, it is also uncorrelated with si, so OLS on this equation recovers the causal effect ρ.

3.2.2 The Omitted Variables Bias (OVB) Formula

Consider a "long" regression with ability controls Ai:

Yi = α + ρsi + Ai'γ + εi

And a "short" regression without Ai:

Yi = α̃ + ρ̃si + ε̃i

The OVB Formula

ρ̃ = ρ + γ'δAs

Short = Long + (Effect of omitted) × (Regression of omitted on included)

where δAs is the coefficient from regressing Ai on si.

Application: Returns to Schooling

| Controls | Schooling coefficient |
|---|---|
| None | 0.132 |
| Age dummies | 0.131 |
| + Family background | 0.114 |
| + AFQT score | 0.087 |
| + Occupation dummies | 0.066 |

Source: NLSY data

→ The coefficient falls as we add controls that are positively correlated with both wages and schooling. (The occupation-dummies row previews the "bad control" problem discussed next: occupation is itself an outcome of schooling.)

3.2.3 Bad Control

Bad controls are variables that are themselves outcomes of the treatment.

Good controls are variables determined before the treatment.

Example: Controlling for Occupation

Should we control for occupation in a schooling regression?

Problem: College affects occupation choice!

  • wi = 1 if white collar job
  • College → more likely white collar

Comparing within occupation:

E[Yi | wi = 1, ci = 1] − E[Yi | wi = 1, ci = 0]

  = E[Y1i − Y0i | w1i = 1]   ← causal effect for those who would be white collar with college

  + {E[Y0i | w1i = 1] − E[Y0i | w0i = 1]}   ← selection bias from the change in composition

Why is this biased?

  • College graduates in white-collar jobs are typical graduates
  • Non-graduates in white-collar jobs are exceptional, positively selected non-graduates
  • → Conditioning on occupation compares different types of people (see the simulation below)

Proxy Control Problem

What if we control for a "late" ability measure ali, taken after schooling is complete?

ali = π0 + π1·si + π2·ai

where ai is innate (early) ability. If schooling raises measured ability (π1 > 0), controlling for late ability biases the schooling coefficient downward.

Rule of Thumb

Timing matters!

  • ✅ Variables measured before treatment → Good controls
  • ❌ Variables measured after treatment → Potentially bad controls

Chapter 3 Summary

| Concept | Key point |
|---|---|
| CEF | The MMSE predictor of Y given X |
| Regression | Best linear approximation to the CEF |
| Regression anatomy | βk is the bivariate slope after partialling out the other regressors |
| CIA | Potential outcomes independent of treatment conditional on X; makes regression causal |
| OVB formula | Short = Long + (effect of omitted) × (regression of omitted on included) |
| Bad control | Don't control for variables that are themselves outcomes of treatment |

References

  • Barnow, B. S., Cain, G. G., & Goldberger, A. S. (1981). Issues in the Analysis of Selectivity Bias. Evaluation Studies Review Annual, 5.
  • White, H. (1980). A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica, 48(4), 817–838.
  • Frisch, R., & Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. Econometrica, 1(4), 387–401.
  • Angrist, J. D. (1998). Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants. Econometrica, 66(2), 249–288.
This note was written with the assistance of an LLM (Claude).