Chapter 6: Regression Discontinuity Designs

Angrist & Pischke, Mostly Harmless Econometrics — Chapter 6

"The more rules, the tinier the rules, the more arbitrary they are, the better." — Douglas Adams

Core Message

Regression Discontinuity (RD) exploits precise knowledge of the rules determining treatment. In a rule-based world, some rules are arbitrary and therefore provide good natural experiments. The key insight: if treatment switches on/off at a known cutoff, units just above and just below the cutoff are essentially comparable — like a local randomized experiment.

Two flavors of RD:

  • Sharp RD: Treatment is a deterministic function of a running variable — crossing the cutoff switches treatment on/off completely
  • Fuzzy RD: Crossing the cutoff changes the probability of treatment — leads to an IV setup

6.1 Sharp RD

The Setup

Sharp RD is used when treatment status is a deterministic and discontinuous function of a covariate xi (the "running variable" or "forcing variable"):

$$ d_i = 1(x_i \ge x_0) = \begin{cases} 1 & \text{if } x_i \ge x_0 \\ 0 & \text{if } x_i < x_0 \end{cases} $$

where x0 is a known threshold or cutoff.

  • Deterministic: Once we know xi, we know di
  • Discontinuous: No matter how close xi gets to x0, treatment is unchanged until xi = x0

Motivating Example: National Merit Scholarships

The first RD study (Thistlethwaite & Campbell, 1960) asked: Do students who win National Merit Scholarship Awards have higher college completion rates because of the award?

  • Running variable (xi): PSAT score
  • Cutoff (x0): Award threshold
  • Treatment (di): Receiving the scholarship
  • Outcome (yi): College completion

RD approach: Compare students with PSAT scores just above and just below the threshold. Any jump in college completion at the threshold is evidence of a treatment effect.

Key Feature: No Overlap

Important distinction from matching/regression:

In RD, there is no value of xi where we observe both treatment and control units. Unlike matching strategies based on overlap, RD validity turns on extrapolation — our willingness to assume the conditional mean function is smooth through the cutoff.

→ This is why we cannot be as agnostic about functional form in RD as in Chapter 3.

The Sharp RD Model

Assume potential outcomes follow a linear, constant-effects model:

E[y0i | xi] = α + βxi
y1i = y0i + ρ

This leads to the regression:

yi = α + βxi + ρdi + εi

where ρ is the causal effect of interest.

Key difference from Chapter 3 regression:

Here di is not just correlated with xi — it's a deterministic function of xi. RD captures causal effects by distinguishing:

  • The discontinuous function: 1(xi ≥ x0)
  • The smooth function: xi
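
The linear sharp RD regression can be sketched on simulated data. This is a minimal illustration, not the book's code: the cutoff, sample size, noise level, and "true" parameter values below are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x0 = 0.5                                   # known cutoff (assumed value)
x = rng.uniform(0.0, 1.0, n)               # running variable
d = (x >= x0).astype(float)                # sharp RD: treatment is a deterministic function of x

alpha, beta, rho = 1.0, 2.0, 0.5           # hypothetical true parameters
y = alpha + beta * x + rho * d + rng.normal(0.0, 0.5, n)

# OLS of y on a constant, the smooth term x, and the discontinuous term d
X = np.column_stack([np.ones(n), x, d])
alpha_hat, beta_hat, rho_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"estimated jump at the cutoff: {rho_hat:.3f} (true value {rho})")
```

Even though d is perfectly determined by x, the regression separates the two because x enters smoothly while d jumps at x0.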

Visual Intuition

Panel A: Linear E[y₀|x]          Panel B: Nonlinear E[y₀|x]

  y│                               y│
   │        ●●●●                    │           ●●●●
   │       ●                        │         ●●
   │      ● ← Jump (ρ)              │       ●● ← Jump (ρ)
   │     ●                          │     ●●
   │   ●●                           │   ●●
   │ ●●                             │ ●●
   └──────────────── x              └──────────────── x
          x₀                               x₀

Panel C: Nonlinearity mistaken for discontinuity

  y│
   │               ●●●●
   │           ●●●●
   │        ●●●    ← Sharp curve, NOT treatment!
   │     ●●●
   │   ●●
   │ ●●
   └──────────────── x
          x₀
                

Polynomial Controls

What if E[y0i | xi] = f(xi) is nonlinear? Model f(xi) with a pth-order polynomial:

$$ y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p + \rho d_i + \varepsilon_i $$

As long as f(xi) is continuous at x0, we can still identify the discontinuous jump ρ.
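
A small simulation can show why the polynomial order matters. In this sketch the smooth trend is assumed to be cubic, so a linear control misstates the jump while a cubic control recovers it. The helper `rd_poly`, the data-generating process, and all parameter values are hypothetical.

```python
import numpy as np

def rd_poly(x, y, d, p):
    """OLS of y on 1, x, ..., x^p and d; returns the coefficient on d."""
    X = np.column_stack([x**k for k in range(p + 1)] + [d])
    return np.linalg.lstsq(X, y, rcond=None)[0][-1]

rng = np.random.default_rng(1)
n = 4000
x0 = 1.0                                   # assumed cutoff
x = rng.uniform(0.0, 2.0, n)
d = (x >= x0).astype(float)
rho = 0.5                                  # hypothetical true jump
# Smooth but nonlinear trend (continuous at x0) plus a jump of size rho
y = (x - x0) ** 3 + rho * d + rng.normal(0.0, 0.3, n)

rho_linear = rd_poly(x, y, d, p=1)         # misspecified trend
rho_cubic = rd_poly(x, y, d, p=3)          # trend matches the assumed f(x)
print(f"p=1: {rho_linear:.3f}   p=3: {rho_cubic:.3f}   truth: {rho}")
```

With the linear control, part of the curvature near x0 is attributed to the dummy, which is the mirror image of the problem in Panel C.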

Allowing Different Slopes on Each Side

A more flexible model allows different trend functions for E[y0i|xi] and E[y1i|xi]. Define x̃i ≡ xi − x0 (centering at the cutoff):

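$$ y_i = \alpha + \beta_{01}\tilde{x}_i + \beta_{02}\tilde{x}_i^2 + \dots + \beta_{0p}\tilde{x}_i^p + \rho d_i + \delta_1 d_i\tilde{x}_i + \delta_2 d_i\tilde{x}_i^2 + \dots + \delta_p d_i\tilde{x}_i^p + \varepsilon_i $$

  • ρ = treatment effect at xi = x0
  • Interactions (di·x̃i, di·x̃i², ...) allow different slopes above/below cutoff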
  • Centering at x0 ensures ρ still captures the effect at the cutoff
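
The role of centering can be checked numerically. The sketch below uses p = 1 and simulated data in which the slope changes at the cutoff. All parameter values are assumptions. It fits the interacted model once with the centered running variable and once without centering.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x0 = 6.0                                   # assumed cutoff
x = rng.uniform(0.0, 10.0, n)
d = (x >= x0).astype(float)
xt = x - x0                                # centered running variable

rho, delta1 = 2.0, 0.8                     # hypothetical jump at x0 and slope change
y = 1.0 + 0.5 * xt + rho * d + delta1 * d * xt + rng.normal(0.0, 0.5, n)

# Centered specification: the coefficient on d is the effect at x = x0
Xc = np.column_stack([np.ones(n), xt, d, d * xt])
rho_centered = np.linalg.lstsq(Xc, y, rcond=None)[0][2]

# Same model without centering: the coefficient on d is the gap between
# the two fitted lines extrapolated to x = 0, i.e. rho - delta1 * x0
Xu = np.column_stack([np.ones(n), x, d, d * x])
rho_uncentered = np.linalg.lstsq(Xu, y, rcond=None)[0][2]

print(f"centered: {rho_centered:.3f}   uncentered: {rho_uncentered:.3f}")
```

Both regressions fit the data equally well; only the interpretation of the coefficient on di changes.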

Nonparametric RD

To avoid functional form dependence entirely, focus on a narrow window around the cutoff:

$$ \lim_{\varepsilon \to 0} \left\{ E[y_i \mid x_0 < x_i < x_0 + \varepsilon] - E[y_i \mid x_0 - \varepsilon < x_i < x_0] \right\} = E[y_{1i} - y_{0i} \mid x_i = x_0] $$

Comparing averages in small neighborhoods left and right of x0 provides an estimate that doesn't depend on correctly specifying f(xi).

Practical approaches:

  • Local linear regression: Weighted least squares with more weight near x0 (Hahn, Todd, van der Klaauw, 2001)
  • Discontinuity sample: Restrict to observations within [x0−h, x0+h] for bandwidth h (Angrist & Lavy, 1999)
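
Both practical approaches can be sketched in a few lines. The code below is one possible implementation, not the cited authors' procedure: the triangular kernel, the bandwidth h = 0.2, the function name `local_linear_rd`, and the simulated data are all assumptions.

```python
import numpy as np

def local_linear_rd(x, y, x0, h):
    """Jump at x0 from a kernel-weighted linear fit on each side of the cutoff."""
    xt = x - x0
    w = np.clip(1.0 - np.abs(xt) / h, 0.0, None)   # triangular kernel, zero outside the window
    keep = w > 0
    xt, yk, w = xt[keep], y[keep], w[keep]
    d = (xt >= 0).astype(float)
    X = np.column_stack([np.ones(xt.size), xt, d, d * xt])
    sw = np.sqrt(w)                                 # WLS by rescaling rows with sqrt(weight)
    coef = np.linalg.lstsq(X * sw[:, None], yk * sw, rcond=None)[0]
    return coef[2]

rng = np.random.default_rng(3)
n = 20000
x = rng.uniform(-1.0, 1.0, n)
rho = 1.0                                           # hypothetical true jump at x0 = 0
y = np.sin(3.0 * x) + rho * (x >= 0) + rng.normal(0.0, 0.5, n)

rho_ll = local_linear_rd(x, y, x0=0.0, h=0.2)

# Discontinuity sample with a raw difference in means (no trend control)
window = np.abs(x) <= 0.2
rho_means = y[window & (x >= 0)].mean() - y[window & (x < 0)].mean()

print(f"local linear: {rho_ll:.3f}   raw difference in means: {rho_means:.3f}")
```

In this simulated example the raw difference in means still picks up the slope of the trend inside the window. That bias shrinks only as h goes to zero, which is why a local slope is usually included even in a narrow window.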

Robustness Checks for Sharp RD

Check | What to Look For
--- | ---
Bandwidth sensitivity | Estimates should be stable as you narrow the window around x0 (fewer polynomial terms needed)
Pre-treatment covariates | No jump in covariates determined before treatment (balance check)
Density of running variable | No bunching/manipulation around x0 (McCrary, 2008 test)
Placebo cutoffs | No jumps at other values of xi where there is no policy change
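
Three of these checks reuse the same estimator on different inputs, which the sketch below illustrates on simulated data. The helper `rd_jump` (separate linear fits on each side within a window), the bandwidths, the placebo cutoff, and the data-generating process are all assumptions made for the example.

```python
import numpy as np

def rd_jump(x, v, cutoff, h):
    """Jump in v at `cutoff`: separate linear fits on each side within bandwidth h."""
    keep = np.abs(x - cutoff) <= h
    xt, vk = x[keep] - cutoff, v[keep]
    d = (xt >= 0).astype(float)
    X = np.column_stack([np.ones(xt.size), xt, d, d * xt])
    return np.linalg.lstsq(X, vk, rcond=None)[0][2]

rng = np.random.default_rng(4)
n = 20000
x = rng.uniform(-1.0, 1.0, n)
d = (x >= 0).astype(float)                          # true cutoff at 0
covariate = 0.4 * x + rng.normal(0.0, 0.3, n)       # determined before treatment, no jump
y = 0.5 * x + 0.3 * x**2 + 1.0 * d + rng.normal(0.0, 0.3, n)

# 1. Bandwidth sensitivity: the estimate should not move much as h changes
by_bandwidth = {h: rd_jump(x, y, 0.0, h) for h in (0.1, 0.2, 0.4)}

# 2. Covariate balance: a pre-treatment covariate should show no jump at x0
covariate_jump = rd_jump(x, covariate, 0.0, 0.2)

# 3. Placebo cutoff: no jump where no rule changes (window stays below the true cutoff)
placebo_jump = rd_jump(x, y, -0.5, 0.3)

print(by_bandwidth, round(covariate_jump, 3), round(placebo_jump, 3))
```

The placebo window is kept entirely on one side of the true cutoff so that the real discontinuity cannot contaminate it.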

Example: Lee (2008) — Incumbency Advantage

Question: Does winning an election give parties an advantage in the next election (incumbency effect)?

  • Running variable (xi): Democratic vote share margin in election t
  • Cutoff (x0): 0 (50% vote share)
  • Treatment (di): Democrat won election t (incumbent party)
  • Outcome (yi): Probability Democrat wins election t+1

Key insight: Because di = 1(vote margin ≥ 0) is a deterministic function of xi, there are no confounding variables other than xi. This is a signal feature of RD.

Results:

  • Win probability is an increasing function of past vote share (unsurprising)
  • Dramatic jump at the cutoff (a vote margin of 0): barely winning rather than barely losing raises the probability of winning the next election by about 40 percentage points

Validity check: Lee examines Democratic victories before the last election. These should show no jump at the current cutoff — and they don't, increasing confidence in the design.

Manipulation concern: Could parties manipulate vote shares near the cutoff?

The 2000 Florida recount suggests this is a real concern in close elections. McCrary (2008) proposes formal tests for manipulation by examining the density of xi around x0.
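
The intuition behind a density check can be illustrated with a crude count comparison. This is not McCrary's test, which fits local linear regressions to a finely binned histogram. It is a simplified stand-in that assumes the density is roughly flat near the cutoff. The function `density_z` and the simulated manipulation are hypothetical.

```python
import math
import random

def density_z(x, x0, h):
    """z-statistic for equal counts in [x0-h, x0) and [x0, x0+h)."""
    left = sum(1 for v in x if x0 - h <= v < x0)
    right = sum(1 for v in x if x0 <= v < x0 + h)
    return (right - left) / math.sqrt(left + right)

random.seed(5)
clean = [random.uniform(-1.0, 1.0) for _ in range(40000)]

# Hypothetical manipulation: half of the units within 0.05 below the cutoff
# manage to push their score just above it.
manipulated = [
    -v if (-0.05 <= v < 0.0 and random.random() < 0.5) else v
    for v in clean
]

z_clean = density_z(clean, 0.0, 0.1)
z_manipulated = density_z(manipulated, 0.0, 0.1)
print(f"z without manipulation: {z_clean:.2f}   with manipulation: {z_manipulated:.2f}")
```

A large positive z signals bunching just above the cutoff. With a sloped density, a count comparison like this would need a trend adjustment, which is what the formal test provides.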

6.2 Fuzzy RD is IV

When Treatment Isn't Deterministic

In many settings, crossing the cutoff doesn't perfectly determine treatment — it only changes the probability of treatment. This is fuzzy RD.

$$ P[d_i = 1 \mid x_i] = \begin{cases} g_1(x_i) & \text{if } x_i \ge x_0 \\ g_0(x_i) & \text{if } x_i < x_0 \end{cases} \qquad \text{where } g_1(x_0) \ne g_0(x_0) $$

The functions g0 and g1 can be anything as long as they differ at x0 (and the more the better!).

Fuzzy RD = IV

Define ti = 1(xi ≥ x0) as a dummy for crossing the threshold. The discontinuity ti becomes an instrument for treatment di.

2SLS Setup:

First stage:

$$ d_i = \pi_0 + \pi_1 x_i + \pi_2 x_i^2 + \dots + \pi_p x_i^p + \gamma t_i + \eta_{1i} $$

where γ is the first-stage effect (jump in treatment probability at cutoff).

Second stage:

$$ y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p + \rho d_i + \varepsilon_i $$
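
The two stages can be run by hand on simulated data. This is a sketch under assumed values: p = 1, a cutoff at 0, a first-stage jump of 0.5, and an unobserved variable u that drives both treatment take-up and the outcome, so that naive OLS is biased.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
x = rng.uniform(-1.0, 1.0, n)
t = (x >= 0).astype(float)                          # crossed the cutoff (instrument)
u = rng.uniform(0.0, 1.0, n)                        # unobserved; low u means more likely treated
d = (u < 0.25 + 0.1 * x + 0.5 * t).astype(float)    # P[d=1|x] jumps by gamma = 0.5 at x0 = 0
rho = 2.0                                           # hypothetical true effect
y = 1.0 + 0.5 * x + rho * d + 1.5 * (0.5 - u) + rng.normal(0.0, 0.5, n)

ones = np.ones(n)
# First stage: d on 1, x and the instrument t
Z = np.column_stack([ones, x, t])
pi_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
gamma_hat = pi_hat[2]
d_hat = Z @ pi_hat

# Second stage: y on 1, x and the fitted treatment
rho_2sls = np.linalg.lstsq(np.column_stack([ones, x, d_hat]), y, rcond=None)[0][2]

# Naive OLS of y on 1, x and actual treatment, for comparison
rho_ols = np.linalg.lstsq(np.column_stack([ones, x, d]), y, rcond=None)[0][2]

print(f"first stage: {gamma_hat:.3f}   2SLS: {rho_2sls:.3f}   OLS: {rho_ols:.3f}")
```

The manual second stage gives the right point estimate, but its conventional standard errors are not valid. A dedicated IV routine would be needed for inference.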

Reduced Form

Substituting the first stage into the second stage:

$$ y_i = \alpha' + \beta'_1 x_i + \beta'_2 x_i^2 + \dots + \beta'_p x_i^p + (\rho\gamma) t_i + \eta_{2i} $$

The reduced-form coefficient on ti equals ργ (causal effect × first stage).
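
The substitution can be written out term by term. This sketch uses only the first-stage and second-stage equations as stated in these notes. The primed coefficients and the reduced-form error are defined by the last line.

```latex
\begin{aligned}
y_i &= \alpha + \sum_{k=1}^{p} \beta_k x_i^k
      + \rho\Big(\pi_0 + \sum_{k=1}^{p} \pi_k x_i^k + \gamma t_i + \eta_{1i}\Big) + \varepsilon_i \\
    &= \underbrace{(\alpha + \rho\pi_0)}_{\alpha'}
      + \sum_{k=1}^{p} \underbrace{(\beta_k + \rho\pi_k)}_{\beta'_k}\, x_i^k
      + (\rho\gamma)\, t_i
      + \underbrace{(\rho\,\eta_{1i} + \varepsilon_i)}_{\eta_{2i}}
\end{aligned}
```

Dividing the reduced-form coefficient on ti by the first-stage coefficient γ therefore recovers ρ, provided γ ≠ 0.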

Nonparametric Fuzzy RD: The Wald Estimator

In a small neighborhood around x0, fuzzy RD becomes a simple Wald/IV estimator:

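$$ \rho = \lim_{\varepsilon \to 0} \frac{E[y_i \mid x_0 < x_i < x_0 + \varepsilon] - E[y_i \mid x_0 - \varepsilon < x_i < x_0]}{E[d_i \mid x_0 < x_i < x_0 + \varepsilon] - E[d_i \mid x_0 - \varepsilon < x_i < x_0]} = \frac{\text{Reduced form jump}}{\text{First stage jump}} $$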
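
A finite-sample version of this ratio can be computed from four group means inside a narrow window. The sketch below uses only the standard library. The window half-width, the take-up probabilities (0.2 below and 0.8 above the cutoff), and the outcome equation are assumptions made for illustration.

```python
import random
from statistics import mean

random.seed(7)
x0, h, rho = 0.0, 0.05, 2.0            # cutoff, window half-width, true effect (hypothetical)

rows = []
for _ in range(100000):
    x = random.uniform(-1.0, 1.0)
    if abs(x - x0) >= h:
        continue                        # keep only the discontinuity sample
    t = 1 if x >= x0 else 0
    u = random.random()                 # unobserved; low u means treated and higher y
    d = 1 if u < 0.2 + 0.6 * t else 0   # treatment probability jumps from 0.2 to 0.8
    y = 1.0 + 0.5 * x + rho * d + 1.5 * (0.5 - u) + random.gauss(0.0, 0.5)
    rows.append((t, d, y))

def side_mean(col, side):
    """Mean of column `col` among rows on the given side of the cutoff."""
    return mean(r[col] for r in rows if r[0] == side)

reduced_form = side_mean(2, 1) - side_mean(2, 0)    # jump in E[y] at the cutoff
first_stage = side_mean(1, 1) - side_mean(1, 0)     # jump in E[d] at the cutoff
wald = reduced_form / first_stage

# Naive treated-vs-untreated comparison inside the same window
naive = mean(r[2] for r in rows if r[1] == 1) - mean(r[2] for r in rows if r[1] == 0)

print(f"Wald: {wald:.3f}   naive: {naive:.3f}   truth: {rho}")
```

The naive comparison is contaminated by selection into treatment, while the Wald ratio uses only the variation created by crossing the cutoff.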

LATE Interpretation

Fuzzy RD estimates a Local Average Treatment Effect (LATE):

The effect is for compliers — individuals whose treatment status changes as xi moves from just below to just above x0.

Double locality:

  1. LATE is for compliers only (as with any IV)
  2. Effect is estimated at xi = x0 (local to the cutoff)

Example: Angrist & Lavy (1999) — Class Size Effects

Question: Do smaller classes improve student test scores? (The same question as the Tennessee STAR experiment.)

Setting: Israeli schools have a maximum class size of 40 ("Maimonides' Rule").

  • Grades with ≤40 students → 1 class (up to 40 students)
  • Grades with 41–80 students → 2 classes (20.5 students each at an enrollment of 41)
  • Grades with 81–120 students → 3 classes (27 students each at an enrollment of 81)

Maimonides' Rule formula:

msc = es / (int[(es−1)/40] + 1)

where es = enrollment, msc = predicted class size.
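
The formula translates directly into a small function. The name `predicted_class_size` and the `cap` parameter are illustrative choices. Integer floor division plays the role of int[·] for positive enrollments.

```python
def predicted_class_size(enrollment: int, cap: int = 40) -> float:
    """Class size implied by Maimonides' Rule: m_sc = e_s / (int[(e_s - 1) / cap] + 1)."""
    n_classes = (enrollment - 1) // cap + 1
    return enrollment / n_classes

for e in (40, 41, 80, 81, 120, 121):
    print(e, predicted_class_size(e))
# 40 40.0
# 41 20.5
# 80 40.0
# 81 27.0
# 120 40.0
# 121 30.25
```

Predicted class size climbs one-for-one with enrollment up to 40, drops sharply at 41, and then climbs again more slowly, which produces the sawtooth used for identification.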

Why Fuzzy?

Maimonides' Rule doesn't predict class size perfectly — some schools split classes at enrollments below 40. This creates a fuzzy design.

The RD Setup

RD Component | In This Study
--- | ---
Running variable (xi) | Grade enrollment (es)
Cutoffs (x0) | 40, 80, 120, ...
Treatment (di) | Actual class size (nsc)
Instrument (ti) | Predicted class size from Maimonides' Rule (msc)
Outcome (yi) | Test scores

Visual: The Sawtooth Pattern

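Class size
    │
 40 │                   ●                   ●                   ●
    │                ●                 ●                ●
 30 │              ●              ●              ●
    │           ●            ●               ●
 20 │         ●          ●
    └────────────────────────────────────────────────────────────── Enrollment
                       40                  80                  120

    ● = class size predicted by Maimonides' Rule (msc); schematic, vertical axis truncated at 20
    Predicted size rises to 40, drops to 20.5 at enrollment 41, rises to 40 again at 80, drops to 27 at 81
    Actual class size follows the same sawtooth, but less sharply (fuzzy)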

Results: 5th Grade Math Scores

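 | OLS | OLS | OLS | 2SLS (Full) | 2SLS (Full) | 2SLS (±5) | 2SLS (±5) | Wald (±3)
--- | --- | --- | --- | --- | --- | --- | --- | ---
Class size | +.322 | +.076 | +.019 | −.230 | −.261 | −.185 | −.443 | −.270
(s.e.) | (.039) | (.036) | (.044) | (.092) | (.113) | (.151) | (.236) | (.281)
Controls | None | %disadv | +enroll | linear | quadratic | linear | quadratic | dummies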

Key findings:

  • OLS: Positive relationship (larger classes → higher scores) — likely due to selection (better schools have larger classes)
  • OLS + controls: Effect shrinks toward zero
  • 2SLS: Strong negative effect (−0.23 to −0.26) — smaller classes improve scores
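  • Discontinuity samples: Less precise, with estimates ranging from −0.19 to −0.44 (±5) and −0.27 for the ±3 Wald estimate, but the same sign and broadly similar magnitude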

Interpretation: A 7-student reduction in class size (as in Tennessee STAR) raises math scores by about 1.75 points, an effect size of roughly 0.18σ, where σ is the standard deviation of class-average scores. This is similar to the Tennessee STAR results.
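
The arithmetic behind these figures can be spelled out. The coefficient of −0.25 used here is an assumption: a round number between the two full-sample 2SLS estimates in the table. The implied σ is simply backed out from the 0.18 effect size quoted above.

```latex
\Delta\,\text{score} \approx \hat{\rho} \times \Delta\,\text{class size}
  \approx (-0.25) \times (-7) = 1.75 \text{ points},
\qquad
\frac{1.75}{\sigma} \approx 0.18 \;\Rightarrow\; \sigma \approx 9.7 \text{ points}
```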

Precision vs. Robustness tradeoff:

As we shrink the discontinuity sample, estimates become less precise (larger s.e.) but more robust to functional form assumptions. The fact that the estimates stay negative and of broadly similar size across specifications (mostly between about −0.19 and −0.27, with one at −0.44) is reassuring.

Chapter 6 Summary

Concept | Key Point
--- | ---
RD Core Idea | Arbitrary rules create natural experiments: treatment determined by cutoff in running variable
Sharp RD | di = 1(xi ≥ x0) deterministically; selection-on-observables story
Fuzzy RD | P(di=1) jumps at x0; IV setup where ti=1(xi≥x0) instruments for di
Identification | Distinguish discontinuous jump (treatment) from smooth trend (running variable)
Functional Form | Must model E[y0|x]: use polynomials, allow different slopes, or focus on narrow bandwidth
Validity Checks | Pre-treatment covariate balance, no manipulation (density test), placebo cutoffs, bandwidth sensitivity
LATE | RD estimates are local to x0; fuzzy RD is LATE for compliers at the cutoff

Practical Checklist for RD:

  1. ✓ Verify treatment assignment rule is based on known cutoff
  2. ✓ Check whether design is sharp or fuzzy
  3. ✓ Plot outcome vs. running variable — look for visible jump
  4. ✓ Control for smooth function of running variable (polynomial)
  5. ✓ Allow different slopes on each side of cutoff
  6. ✓ Check balance of pre-treatment covariates at cutoff
  7. ✓ Test for manipulation (density of running variable)
  8. ✓ Vary bandwidth — estimates should be stable
  9. ✓ For fuzzy RD: check first-stage strength

Sharp vs. Fuzzy Summary:

 | Sharp RD | Fuzzy RD
--- | --- | ---
Treatment at cutoff | Switches 0→1 with certainty | Probability increases
Estimation | OLS with polynomial controls | 2SLS (IV)
Estimand | ATE at x0 | LATE for compliers at x0
Example | Lee (2008): election win | Angrist & Lavy (1999): class size
This note was written with the assistance of LLM (Claude).