Angrist Ch.6 - Regression Discontinuity Designs

Chapter 6: Regression Discontinuity Designs

Angrist & Pischke, Mostly Harmless Econometrics — Chapter 6

"The more rules, the tinier the rules, the more arbitrary they are, the better." — Douglas Adams

Core Message

Regression Discontinuity (RD) exploits precise knowledge of the rules determining treatment. In a rule-based world, some rules are arbitrary and therefore provide good natural experiments. The key insight: if treatment switches on/off at a known cutoff, units just above and just below the cutoff are essentially comparable — like a local randomized experiment.

Two flavors of RD:

Sharp RD: Treatment is a deterministic function of a running variable — crossing the cutoff switches treatment on/off completely
Fuzzy RD: Crossing the cutoff changes the probability of treatment — leads to an IV setup

6.1 Sharp RD

The Setup

Sharp RD is used when treatment status is a deterministic and discontinuous function of a covariate x_i (the "running variable" or "forcing variable"):

d_i = 1(x_i ≥ x₀), that is, d_i = 1 if x_i ≥ x₀ and d_i = 0 if x_i < x₀

where x₀ is a known threshold or cutoff.

Deterministic: Once we know x_i, we know d_i
Discontinuous: No matter how close x_i gets to x₀, treatment is unchanged until x_i = x₀

Motivating Example: National Merit Scholarships

The first RD study (Thistlethwaite & Campbell, 1960) asked: Do students who win National Merit Scholarship Awards have higher college completion rates because of the award?

Running variable (x_i): PSAT score
Cutoff (x₀): Award threshold
Treatment (d_i): Receiving the scholarship
Outcome (y_i): College completion

RD approach: Compare students with PSAT scores just above and just below the threshold. Any jump in college completion at the threshold is evidence of a treatment effect.

Key Feature: No Overlap

Important distinction from matching/regression:

In RD, there is no value of x_i where we observe both treatment and control units. Unlike matching strategies based on overlap, RD validity turns on extrapolation — our willingness to assume the conditional mean function is smooth through the cutoff.

→ This is why we cannot be as agnostic about functional form in RD as in Chapter 3.

The Sharp RD Model

Assume potential outcomes follow a linear, constant-effects model:

E[y_0i | x_i] = α + βx_i
y_1i = y_0i + ρ

This leads to the regression:

y_i = α + βx_i + ρd_i + ε_i

where ρ is the causal effect of interest.
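
The regression above can be illustrated on simulated data. This is a minimal sketch, not an analysis of any real dataset: the sample size, cutoff, parameter values, and noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (every number here is an illustrative assumption).
n = 2000
x0 = 0.5                        # known cutoff
x = rng.uniform(0.0, 1.0, n)    # running variable
d = (x >= x0).astype(float)     # sharp RD: d is a deterministic function of x
alpha, beta, rho = 1.0, 2.0, 0.5
y = alpha + beta * x + rho * d + rng.normal(0.0, 0.2, n)

# OLS of y on a constant, x, and d; the coefficient on d estimates rho.
X = np.column_stack([np.ones(n), x, d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
rho_hat = coef[2]
print(f"estimated jump at the cutoff: {rho_hat:.3f} (true value {rho})")
```

Because the simulated E[y_0i | x_i] really is linear, the linear specification is correct here; the sections below deal with the case where it is not.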

Key difference from Chapter 3 regression:

Here d_i is not just correlated with x_i — it's a deterministic function of x_i. RD captures causal effects by distinguishing:

The discontinuous function: 1(x_i ≥ x₀)
The smooth function: x_i

Visual Intuition

Panel A: Linear E[y₀|x]          Panel B: Nonlinear E[y₀|x]

  y│                               y│
   │        ●●●●                    │           ●●●●
   │       ●                        │         ●●
   │      ● ← Jump (ρ)              │       ●● ← Jump (ρ)
   │     ●                          │     ●●
   │   ●●                           │   ●●
   │ ●●                             │ ●●
   └──────────────── x              └──────────────── x
          x₀                               x₀

Panel C: Nonlinearity mistaken for discontinuity

  y│
   │               ●●●●
   │           ●●●●
   │        ●●●    ← Sharp curve, NOT treatment!
   │     ●●●
   │   ●●
   │ ●●
   └──────────────── x
          x₀

Polynomial Controls

What if E[y_0i | x_i] = f(x_i) is nonlinear? Model f(x_i) with a p^th-order polynomial:

y_i = α + β₁x_i + β₂x_i² + ... + β_px_i^p + ρd_i + ε_i

As long as f(x_i) is continuous at x₀, we can still identify the discontinuous jump ρ.
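
A short simulation can show why the polynomial terms matter. In this sketch the function name `rd_polynomial`, the cubic form of f(x_i), and all parameter values are assumptions made for illustration.

```python
import numpy as np

def rd_polynomial(y, x, d, p):
    """OLS of y on 1, x, ..., x^p and d; returns the coefficient on d."""
    X = np.column_stack([x**k for k in range(p + 1)] + [d])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[-1]

rng = np.random.default_rng(1)
n, x0, rho = 4000, 0.5, 0.5
x = rng.uniform(0.0, 1.0, n)
d = (x >= x0).astype(float)

# Assumed smooth but nonlinear f(x): continuous at x0, so the only jump is rho.
f = 1.0 + x - 2.0 * x**2 + 3.0 * x**3
y = f + rho * d + rng.normal(0.0, 0.2, n)

rho_linear = rd_polynomial(y, x, d, p=1)   # misspecified trend
rho_cubic = rd_polynomial(y, x, d, p=3)    # trend matches the simulated f
print(f"linear control: {rho_linear:.3f}, cubic control: {rho_cubic:.3f}")
```

With the cubic control the estimate should land near the simulated ρ = 0.5, while the linear control can attribute part of the curvature to the dummy, which is the Panel C problem.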

Allowing Different Slopes on Each Side

A more flexible model allows different trend functions for E[y_0i|x_i] and E[y_1i|x_i]. Define x̃_i ≡ x_i − x₀ (centering at the cutoff):

y_i = α + β₀₁x̃_i + β₀₂x̃_i² + ... + β₀ₚx̃_i^p
      + ρd_i + δ₁ d_i x̃_i + δ₂ d_i x̃_i² + ... + δₚ d_i x̃_i^p + ε_i

ρ = treatment effect at x_i = x₀
Interactions (d_ix̃_i, d_ix̃_i², ...) allow different slopes above/below cutoff
Centering at x₀ ensures ρ still captures the effect at the cutoff
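
The interacted specification can be sketched as follows. The helper name `rd_interacted` and the simulated slopes are assumptions; the point is only that, with the running variable centered at x₀, the coefficient on d_i is the jump at the cutoff even when the slopes differ on each side.

```python
import numpy as np

def rd_interacted(y, x, x0, p=1):
    """Regress y on centered polynomial terms, d, and d-by-polynomial interactions.

    Returns the coefficient on d, i.e. the estimated jump at x0.
    """
    xc = x - x0                                   # centered running variable
    d = (x >= x0).astype(float)
    powers = [xc**k for k in range(1, p + 1)]
    X = np.column_stack(
        [np.ones_like(xc)] + powers + [d] + [d * term for term in powers]
    )
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[p + 1]                            # column order: 1, powers, d, interactions

rng = np.random.default_rng(2)
n, x0 = 2000, 0.5
x = rng.uniform(0.0, 1.0, n)
treated = x >= x0

# Assumed potential-outcome means with different slopes; the gap at x0 is 0.5.
y0_mean = 1.0 + 2.0 * (x - x0)
y1_mean = 1.5 + 0.5 * (x - x0)
y = np.where(treated, y1_mean, y0_mean) + rng.normal(0.0, 0.2, n)

rho_hat = rd_interacted(y, x, x0, p=1)
print(f"estimated effect at the cutoff: {rho_hat:.3f}")
```

Without centering, the same regression would report the gap between the two fitted lines at x_i = 0 rather than at x₀.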

Nonparametric RD

To avoid functional form dependence entirely, focus on a narrow window around the cutoff:

lim_ε→0 { E[y_i | x₀ < x_i < x₀+ε] − E[y_i | x₀−ε < x_i < x₀] } = E[y_1i − y_0i | x_i = x₀]

Comparing averages in small neighborhoods left and right of x₀ provides an estimate that doesn't depend on correctly specifying f(x_i).

Practical approaches:

Local linear regression: Weighted least squares with more weight near x₀ (Hahn, Todd, and van der Klaauw, 2001)
Discontinuity sample: Restrict to observations within [x₀−h, x₀+h] for bandwidth h (Angrist & Lavy, 1999)
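
Both approaches can be combined in one small routine: keep only observations within h of the cutoff and fit a weighted linear regression on each side. This is a sketch under stated assumptions, not the exact procedure of either cited paper. The triangular kernel, the bandwidth h = 0.1, and the simulated data are all illustrative choices.

```python
import numpy as np

def local_linear_rd(y, x, x0, h):
    """Local linear RD estimate of the jump at x0 using a triangular kernel."""
    xc = x - x0
    keep = np.abs(xc) < h                 # discontinuity sample: (x0 - h, x0 + h)
    xc, yk = xc[keep], y[keep]
    d = (xc >= 0).astype(float)
    w = 1.0 - np.abs(xc) / h              # weights fall linearly to zero at the edge
    X = np.column_stack([np.ones_like(xc), xc, d, d * xc])
    sw = np.sqrt(w)                       # WLS via rescaling rows by sqrt(weight)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], yk * sw, rcond=None)
    return coef[2]

rng = np.random.default_rng(3)
n, x0, rho = 5000, 0.5, 0.5
x = rng.uniform(0.0, 1.0, n)
y = np.exp(x) + rho * (x >= x0) + rng.normal(0.0, 0.2, n)   # nonlinear f(x)

rho_hat = local_linear_rd(y, x, x0, h=0.1)
print(f"local linear estimate: {rho_hat:.3f}")
```

No global functional form is specified for f(x_i); the price is that only observations near x₀ contribute, so the estimate is noisier than a full-sample fit.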

Robustness Checks for Sharp RD

Check	What to Look For
Bandwidth sensitivity	Estimates should be stable as the window around x₀ narrows; narrower windows need fewer polynomial terms
Pre-treatment covariates	No jump in covariates determined before treatment (balance check)
Density of running variable	No bunching/manipulation around x₀ (McCrary, 2008 test)
Placebo cutoffs	No jumps at other values of x_i where there is no policy change
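
Three of the checks in the table (bandwidth sensitivity, covariate balance, placebo cutoffs) reuse the same jump estimator with different inputs. The sketch below assumes simulated data and a hypothetical helper `jump_at`; the density check is a different calculation and is not covered here.

```python
import numpy as np

def jump_at(y, x, c, h):
    """Local linear (uniform kernel) estimate of the jump in E[y|x] at c."""
    xc = x - c
    keep = np.abs(xc) <= h
    xc, yk = xc[keep], y[keep]
    d = (xc >= 0).astype(float)
    X = np.column_stack([np.ones_like(xc), xc, d, d * xc])
    coef, *_ = np.linalg.lstsq(X, yk, rcond=None)
    return coef[2]

rng = np.random.default_rng(4)
n, x0, rho = 20000, 0.5, 0.5
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * x) + rho * (x >= x0) + rng.normal(0.0, 0.2, n)
w = 0.5 * x + rng.normal(0.0, 0.2, n)   # covariate fixed before treatment (assumed)

# 1. Bandwidth sensitivity: the estimate should not move much as h shrinks.
by_bandwidth = {h: jump_at(y, x, x0, h) for h in (0.2, 0.1, 0.05)}

# 2. Covariate balance: a pre-treatment covariate should show no jump at x0.
covariate_jump = jump_at(w, x, x0, 0.1)

# 3. Placebo cutoffs: use only untreated observations so the true jump
#    at x0 cannot contaminate the comparison.
left = x < x0
placebo = {c: jump_at(y[left], x[left], c, 0.1) for c in (0.2, 0.3)}

print(by_bandwidth, covariate_jump, placebo)
```

In this simulation the bandwidth estimates should all sit near 0.5 and the covariate and placebo jumps near zero; with real data, large departures from that pattern would be a warning sign.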

Example: Lee (2008) — Incumbency Advantage

Question: Does winning an election give parties an advantage in the next election (incumbency effect)?

Running variable (x_i): Democratic vote share margin in election t
Cutoff (x₀): 0 (50% vote share)
Treatment (d_i): Democrat won election t (incumbent party)
Outcome (y_i): Probability Democrat wins election t+1

Key insight: Because d_i = 1(vote margin ≥ 0) is a deterministic function of x_i, there are no confounding variables other than x_i. This is a signal feature of RD.

Results:

Win probability is an increasing function of past vote share (unsurprising)
Dramatic jump of about 40 percentage points where the Democratic margin crosses zero
Interpretation: barely winning (vs. barely losing) raises the party's probability of winning the next election by about 40 pp

Validity check: Lee examines Democratic victories before the last election. These should show no jump at the current cutoff — and they don't, increasing confidence in the design.

Manipulation concern: Could parties manipulate vote shares near the cutoff?

The 2000 Florida recount suggests this is a real concern in close elections. McCrary (2008) proposes formal tests for manipulation by examining the density of x_i around x₀.
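
The idea behind a density check can be sketched with a much cruder calculation than McCrary's actual test, which is based on local linear density estimation. The version below is a stand-in, labeled as such: it only compares the number of observations in equal-width bins just below and just above the cutoff. The function name `bin_count_z`, the bin width, and the simulated manipulation are assumptions.

```python
import numpy as np

def bin_count_z(x, x0, h):
    """z-statistic for equal counts in [x0 - h, x0) and [x0, x0 + h).

    Under the null that an observation in the window is equally likely to fall
    on either side, (above - below) / sqrt(above + below) is roughly N(0, 1).
    """
    below = np.sum((x >= x0 - h) & (x < x0))
    above = np.sum((x >= x0) & (x < x0 + h))
    return (above - below) / np.sqrt(above + below)

rng = np.random.default_rng(5)
x_clean = rng.uniform(-1.0, 1.0, 20000)      # no manipulation

# Simulated manipulation: units that fall just short of the cutoff
# are pushed just over it.
just_below = (x_clean > -0.02) & (x_clean < 0.0)
x_manip = np.where(just_below, -x_clean, x_clean)

z_clean = bin_count_z(x_clean, 0.0, 0.05)
z_manip = bin_count_z(x_manip, 0.0, 0.05)
print(f"no manipulation: z = {z_clean:.2f}; with manipulation: z = {z_manip:.2f}")
```

The bin comparison ignores any slope in the density near x₀, so it is only a rough screen; it illustrates what bunching looks like rather than replacing the formal test.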

6.2 Fuzzy RD is IV

When Treatment Isn't Deterministic

In many settings, crossing the cutoff doesn't perfectly determine treatment — it only changes the probability of treatment. This is fuzzy RD.

P[d_i = 1 | x_i] = g₁(x_i) if x_i ≥ x₀, and g₀(x_i) if x_i < x₀, where g₁(x₀) ≠ g₀(x₀)

The functions g₀ and g₁ can be anything as long as they differ at x₀ (and the more the better!).

Fuzzy RD = IV

Define t_i = 1(x_i ≥ x₀) as a dummy for crossing the threshold. The discontinuity t_i becomes an instrument for treatment d_i.

2SLS Setup:

First stage:

d_i = π₀ + π₁x_i + π₂x_i² + ... + π_px_i^p + γt_i + η_1i

where γ is the first-stage effect (jump in treatment probability at cutoff).

Second stage (the structural equation, estimated with d_i instrumented by t_i):

y_i = α + β₁x_i + β₂x_i² + ... + β_px_i^p + ρd_i + ε_i
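
The two stages can be run by hand with ordinary least squares. This is a sketch on simulated data, assuming a constant treatment effect and a hypothetical unobserved variable `v` that drives both take-up and outcomes. The second-stage standard errors from this manual approach would not be valid, so only the point estimate is shown.

```python
import numpy as np

def ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def fuzzy_rd_2sls(y, d, x, x0, p=1):
    """Manual 2SLS: t = 1(x >= x0) instruments d; polynomial in x is the control."""
    t = (x >= x0).astype(float)
    poly = np.column_stack([x**k for k in range(p + 1)])   # 1, x, ..., x^p
    Z = np.column_stack([poly, t])
    d_hat = Z @ ols(Z, d)                                   # first-stage fitted values
    return ols(np.column_stack([poly, d_hat]), y)[-1]       # coefficient on d_hat

rng = np.random.default_rng(6)
n, x0, rho = 10000, 0.0, 0.5
x = rng.uniform(-1.0, 1.0, n)
t = (x >= x0).astype(float)
v = rng.uniform(0.0, 1.0, n)                 # unobserved confounder (assumed)
d = (v < 0.2 + 0.1 * x + 0.5 * t).astype(float)   # take-up probability jumps by 0.5
y = 1.0 + 0.5 * x + rho * d + (0.5 - v) + rng.normal(0.0, 0.2, n)

rho_ols = ols(np.column_stack([np.ones(n), x, d]), y)[-1]
rho_2sls = fuzzy_rd_2sls(y, d, x, x0, p=1)
print(f"OLS: {rho_ols:.3f}, 2SLS: {rho_2sls:.3f}, simulated truth: {rho}")
```

In this setup OLS is biased upward because units with low `v` are both more likely to be treated and have higher outcomes, while the 2SLS estimate uses only the variation in d_i created by crossing the cutoff.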

Reduced Form

Substituting the first stage into the second stage:

y_i = α' + β'₁x_i + β'₂x_i² + ... + β'_px_i^p + (ργ)t_i + η_2i

The reduced-form coefficient on t_i equals ργ (causal effect × first stage).
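
Written out term by term, with the first-stage and second-stage symbols defined above, the substitution is:

```latex
\begin{aligned}
y_i &= \alpha + \sum_{k=1}^{p} \beta_k x_i^k + \rho\, d_i + \varepsilon_i \\
    &= \alpha + \sum_{k=1}^{p} \beta_k x_i^k
       + \rho\Big(\pi_0 + \sum_{k=1}^{p} \pi_k x_i^k + \gamma\, t_i + \eta_{1i}\Big)
       + \varepsilon_i \\
    &= \underbrace{(\alpha + \rho\pi_0)}_{\alpha'}
       + \sum_{k=1}^{p} \underbrace{(\beta_k + \rho\pi_k)}_{\beta'_k}\, x_i^k
       + (\rho\gamma)\, t_i
       + \underbrace{(\rho\,\eta_{1i} + \varepsilon_i)}_{\eta_{2i}}
\end{aligned}
```

So ρ can be recovered as the reduced-form coefficient on t_i divided by the first-stage coefficient γ, provided γ ≠ 0.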

Nonparametric Fuzzy RD: The Wald Estimator

In a small neighborhood around x₀, fuzzy RD becomes a simple Wald/IV estimator:

ρ = lim_{ε→0} ( E[y_i | x₀ < x_i < x₀+ε] − E[y_i | x₀−ε < x_i < x₀] ) / ( E[d_i | x₀ < x_i < x₀+ε] − E[d_i | x₀−ε < x_i < x₀] )
  = (reduced-form jump) / (first-stage jump)
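
In a finite sample the limit is replaced by a small fixed window. A minimal sketch, assuming simulated data, a hypothetical helper `wald_rd`, and a window half-width of h = 0.05:

```python
import numpy as np

def wald_rd(y, d, x, x0, h):
    """Ratio of the jump in mean y to the jump in mean d within h of the cutoff."""
    above = (x >= x0) & (x < x0 + h)
    below = (x >= x0 - h) & (x < x0)
    reduced_form = y[above].mean() - y[below].mean()
    first_stage = d[above].mean() - d[below].mean()
    return reduced_form / first_stage

rng = np.random.default_rng(7)
n, x0, rho = 100000, 0.0, 0.5
x = rng.uniform(-1.0, 1.0, n)
t = (x >= x0).astype(float)
v = rng.uniform(0.0, 1.0, n)                     # unobserved confounder (assumed)
d = (v < 0.2 + 0.5 * t).astype(float)            # take-up jumps from 0.2 to 0.7
y = 1.0 + 0.2 * x + rho * d + (0.5 - v) + rng.normal(0.0, 0.2, n)

rho_wald = wald_rd(y, d, x, x0, h=0.05)
print(f"Wald estimate: {rho_wald:.3f} (simulated truth {rho})")
```

Because the window has positive width, the trend in x_i inside the window leaves a small bias; shrinking h reduces it at the cost of fewer observations.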

LATE Interpretation

Fuzzy RD estimates a Local Average Treatment Effect (LATE):

The effect is for compliers — individuals whose treatment status changes as x_i moves from just below to just above x₀.

Double locality:

LATE is for compliers only (as with any IV)
Effect is estimated at x_i = x₀ (local to the cutoff)
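
A simulation with heterogeneous effects can make the complier interpretation concrete. Everything below is assumed for illustration: the three compliance types, their shares, and their effect sizes. The Wald ratio near the cutoff should recover the complier effect rather than the average effect among all treated units.

```python
import numpy as np

rng = np.random.default_rng(8)
n, x0, h = 200000, 0.0, 0.05
x = rng.uniform(-1.0, 1.0, n)
t = x >= x0

# Hypothetical compliance types, drawn independently of x.
kind = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
d = np.where(kind == "always", 1.0,
             np.where(kind == "never", 0.0, t.astype(float)))

# Heterogeneous effects: compliers gain 1.0, always-takers gain 2.0.
# Never-takers are never treated, so their effect is never realized.
effect = np.where(kind == "complier", 1.0,
                  np.where(kind == "always", 2.0, 0.0))
y = 0.2 * x + effect * d + rng.normal(0.0, 0.2, n)

above = t & (x < x0 + h)
below = ~t & (x >= x0 - h)
wald = (y[above].mean() - y[below].mean()) / (d[above].mean() - d[below].mean())
effect_on_treated = effect[d == 1].mean()

print(f"Wald estimate near cutoff: {wald:.3f} (complier effect is 1.0)")
print(f"average effect among the treated: {effect_on_treated:.3f}")
```

Always-takers are treated on both sides of the cutoff, so their larger effect cancels out of the reduced-form jump; only compliers change treatment status at x₀.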

Example: Angrist & Lavy (1999) — Class Size Effects

Question: Do smaller classes improve student test scores? (Same question as Tennessee STAR experiment)

Setting: Israeli schools have a maximum class size of 40 ("Maimonides' Rule").

Grades with ≤40 students → 1 class (up to 40 students)
Grades with 41 students → 2 classes (~20 students each)
Grades with 81 students → 3 classes (~27 students each)

Maimonides' Rule formula:

m_sc = e_s / (int[(e_s−1)/40] + 1)

where e_s = enrollment, m_sc = predicted class size.
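
The formula translates directly into a few lines of code. The function name and the `cap` parameter are illustrative; the arithmetic follows the rule as stated above.

```python
def maimonides_class_size(enrollment: int, cap: int = 40) -> float:
    """Predicted class size: enrollment split evenly into the fewest
    classes such that no class exceeds the cap."""
    if enrollment < 1:
        raise ValueError("enrollment must be a positive integer")
    n_classes = (enrollment - 1) // cap + 1    # int[(e_s - 1)/40] + 1
    return enrollment / n_classes

for e in (40, 41, 80, 81, 120, 121):
    print(e, maimonides_class_size(e))
# 40 40.0
# 41 20.5
# 80 40.0
# 81 27.0
# 120 40.0
# 121 30.25
```

Predicted class size rises one-for-one with enrollment up to 40, drops sharply at 41, and repeats the pattern with smaller drops at each later multiple of 40.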

Why Fuzzy?

Maimonides' Rule doesn't predict class size perfectly — some schools split classes at enrollments below 40. This creates a fuzzy design.

The RD Setup

RD Component	In This Study
Running variable (x_i)	Grade enrollment (e_s)
Cutoffs (x₀)	40, 80, 120, ...
Treatment (d_i)	Actual class size (n_sc)
Instrument (t_i)	Predicted class size from Maimonides' Rule (m_sc)
Outcome (y_i)	Test scores

Visual: The Sawtooth Pattern

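Class size
    │
 40 │                ●               ●               ●
 35 │              ●  |          ●    |        ●
 30 │            ●    |      ●        |  ●
 25 │          ●      |  ●            ●
 20 │        ●        ●
    └────────────────────────────────────────────────── Enrollment
                    40              80              120

    ●●● = class size predicted by Maimonides' Rule (schematic)
    |   = drop when enrollment crosses a multiple of 40 (to 20.5 at 41, to 27 at 81)
    Actual class sizes follow this sawtooth only approximately, which is what makes the design fuzzy.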

Results: 5th Grade Math Scores

	OLS			2SLS (Full)		2SLS (±5)		Wald (±3)
Class size	+.322	+.076	+.019	−.230	−.261	−.185	−.443	−.270
(s.e.)	(.039)	(.036)	(.044)	(.092)	(.113)	(.151)	(.236)	(.281)
Controls	None	%disadv	+enroll	linear	quadratic	linear	quadratic	dummies

Key findings:

OLS: Positive relationship (larger classes → higher scores) — likely due to selection (better schools have larger classes)
OLS + controls: Effect shrinks toward zero
2SLS: Strong negative effect (−0.23 to −0.26) — smaller classes improve scores
Discontinuity samples: Less precise, with estimates from −0.19 to −0.44 (−0.27 in the ±3 sample)

Interpretation: Taking an estimate of about −0.25, a 7-student reduction in class size (as in Tennessee STAR) raises math scores by about 7 × 0.25 ≈ 1.75 points, an effect size of roughly 0.18σ. This is similar to the Tennessee STAR results.

Precision vs. Robustness tradeoff:

As we shrink the discontinuity sample, estimates become less precise (larger s.e.) but more robust to functional form assumptions. The 2SLS estimates stay negative across specifications (roughly −0.19 to −0.44), which is reassuring, although the discontinuity-sample estimates are individually imprecise.

Chapter 6 Summary

Concept	Key Point
RD Core Idea	Arbitrary rules create natural experiments — treatment determined by cutoff in running variable
Sharp RD	d_i = 1(x_i ≥ x₀) deterministically; selection-on-observables story
Fuzzy RD	P(d_i=1) jumps at x₀; IV setup where t_i=1(x_i≥x₀) instruments for d_i
Identification	Distinguish discontinuous jump (treatment) from smooth trend (running variable)
Functional Form	Must model E[y₀|x]: use polynomials, allow different slopes, or focus on a narrow bandwidth
Validity Checks	Pre-treatment covariate balance, no manipulation (density test), placebo cutoffs, bandwidth sensitivity
LATE	RD estimates are local to x₀; fuzzy RD is LATE for compliers at the cutoff

Practical Checklist for RD:

✓ Verify treatment assignment rule is based on known cutoff
✓ Check whether design is sharp or fuzzy
✓ Plot outcome vs. running variable — look for visible jump
✓ Control for smooth function of running variable (polynomial)
✓ Allow different slopes on each side of cutoff
✓ Check balance of pre-treatment covariates at cutoff
✓ Test for manipulation (density of running variable)
✓ Vary bandwidth — estimates should be stable
✓ For fuzzy RD: check first-stage strength

Sharp vs. Fuzzy Summary:

	Sharp RD	Fuzzy RD
Treatment at cutoff	Switches 0→1 with certainty	Probability increases
Estimation	OLS with polynomial controls	2SLS (IV)
Estimand	ATE at x₀	LATE for compliers at x₀
Example	Lee (2008) — election win	Angrist & Lavy (1999) — class size