Angrist & Pischke, Mostly Harmless Econometrics — Chapter 6
"The more rules, the tinier the rules, the more arbitrary they are, the better." — Douglas Adams
Suhyeon Lee
Regression Discontinuity (RD) exploits precise knowledge of the rules determining treatment. In a rule-based world, some rules are arbitrary and therefore provide good natural experiments. The key insight: if treatment switches on/off at a known cutoff, units just above and just below the cutoff are essentially comparable — like a local randomized experiment.
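Two flavors of RD: sharp, where treatment switches on with certainty at the cutoff, and fuzzy, where only the probability of treatment jumps at the cutoff.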
Sharp RD is used when treatment status is a deterministic and discontinuous function of a covariate xi (the "running variable" or "forcing variable"):
$$d_i = \begin{cases} 1 & \text{if } x_i \ge x_0 \\ 0 & \text{if } x_i < x_0 \end{cases}$$
where x0 is a known threshold or cutoff.
The first RD study (Thistlethwaite & Campbell, 1960) asked: Do students who win National Merit Scholarship Awards have higher college completion rates because of the award?
RD approach: Compare students with PSAT scores just above and just below the threshold. Any jump in college completion at the threshold is evidence of a treatment effect.
Important distinction from matching/regression:
In RD, there is no value of xi where we observe both treatment and control units. Unlike matching strategies based on overlap, RD validity turns on extrapolation — our willingness to assume the conditional mean function is smooth through the cutoff.
→ This is why we cannot be as agnostic about functional form in RD as in Chapter 3.
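Assume potential outcomes follow a linear, constant-effects model:
$$E[y_{0i} \mid x_i] = \alpha + \beta x_i, \qquad y_{1i} = y_{0i} + \rho$$
This leads to the regression:
$$y_i = \alpha + \beta x_i + \rho d_i + \eta_i$$
where ρ is the causal effect of interest.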
Key difference from Chapter 3 regression:
Here di is not just correlated with xi; it is a deterministic function of xi. RD captures causal effects by distinguishing the nonlinear and discontinuous function 1(xi ≥ x0) from the smooth (here linear) function of xi.
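A minimal simulation sketch of this regression, assuming made-up parameter values (cutoff 0.5, true jump 2.0, uniform running variable) that are not from the chapter. It only illustrates that OLS on a constant, xi, and di recovers the jump when the trend really is linear.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5_000
x0 = 0.5          # hypothetical cutoff
rho_true = 2.0    # hypothetical constant treatment effect

x = rng.uniform(0.0, 1.0, n)      # running variable
d = (x >= x0).astype(float)       # sharp RD: d is a deterministic function of x
y = 1.0 + 3.0 * x + rho_true * d + rng.normal(0.0, 1.0, n)

# OLS of y on a constant, x, and d; the coefficient on d estimates rho
X = np.column_stack([np.ones(n), x, d])
alpha_hat, beta_hat, rho_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"estimated jump at the cutoff: {rho_hat:.2f} (true value {rho_true})")
```

Identification here comes entirely from the functional form: the smooth term in x absorbs the trend, and the dummy picks up what is left at the cutoff.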
Figure (three panels, y plotted against x, cutoff x₀ marked). Panel A: Linear E[y₀|x], with a jump ρ at x₀. Panel B: Nonlinear E[y₀|x], with a jump ρ at x₀. Panel C: Nonlinearity mistaken for discontinuity; the sharp curve near x₀ is not a treatment effect.
What if E[y0i | xi] = f(xi) is nonlinear? Model f(xi) with a pth-order polynomial:
$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \rho d_i + \eta_i$$
As long as f(xi) is continuous at x0, we can still identify the discontinuous jump ρ.
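The risk of mistaking curvature for a jump can be sketched with simulated data. In the example below the true treatment effect is zero and E[y0|x] is a smooth cubic; the cubic, the sample size, and the helper name `rd_jump` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, x0 = 5_000, 0.5
x = rng.uniform(0.0, 1.0, n)
d = (x >= x0).astype(float)

# Smooth but nonlinear E[y0|x]; there is NO treatment effect in this simulation
f = 30.0 * (x - x0) - 40.0 * (x - x0) ** 3
y = f + rng.normal(0.0, 1.0, n)

def rd_jump(y, x, d, p):
    """OLS coefficient on d, controlling for a pth-order polynomial in x."""
    poly = np.column_stack([x ** k for k in range(p + 1)])  # 1, x, ..., x^p
    X = np.column_stack([poly, d])
    return np.linalg.lstsq(X, y, rcond=None)[0][-1]

jump_linear = rd_jump(y, x, d, p=1)   # misspecified: reports a spurious jump
jump_cubic = rd_jump(y, x, d, p=3)    # flexible enough: jump is close to zero

print(f"linear control: {jump_linear:.2f}, cubic control: {jump_cubic:.2f}")
```

With only a linear control the dummy absorbs the curvature and reports a sizeable "effect"; once the polynomial is rich enough to capture f, the estimated jump falls back toward zero.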
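A more flexible model allows different trend functions for E[y0i|xi] and E[y1i|xi]. Define x̃i ≡ xi − x0 (centering at the cutoff):
$$y_i = \alpha + \beta_{01}\tilde{x}_i + \cdots + \beta_{0p}\tilde{x}_i^p + \rho d_i + \beta_1^* d_i \tilde{x}_i + \cdots + \beta_p^* d_i \tilde{x}_i^p + \eta_i$$
Because x̃i = 0 at the cutoff, ρ remains the treatment effect at xi = x0.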
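A sketch of the interacted model for p = 1, again on simulated data with assumed parameter values. It also shows why centering matters: without it, the coefficient on the treatment dummy measures the gap between the two fitted lines at x = 0 rather than at the cutoff.

```python
import numpy as np

rng = np.random.default_rng(2)
n, x0, rho_true = 5_000, 0.5, 2.0

x = rng.uniform(0.0, 1.0, n)
d = (x >= x0).astype(float)
x_tilde = x - x0                      # running variable centered at the cutoff

# Different slopes on each side of the cutoff (hypothetical values)
y = 1.0 + 2.0 * x_tilde + rho_true * d + 3.0 * d * x_tilde + rng.normal(0.0, 1.0, n)

# Centered specification: constant, x_tilde, d, d * x_tilde
X = np.column_stack([np.ones(n), x_tilde, d, d * x_tilde])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
rho_hat, slope_change_hat = coef[2], coef[3]

# Same model without centering: the coefficient on d is no longer the jump at x0
X_raw = np.column_stack([np.ones(n), x, d, d * x])
rho_uncentered = np.linalg.lstsq(X_raw, y, rcond=None)[0][2]

print(f"centered: {rho_hat:.2f}, uncentered: {rho_uncentered:.2f}")
```

The two specifications fit the data identically; only the interpretation of the coefficient on d changes, which is why the centered form is the convenient one.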
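To avoid functional form dependence entirely, focus on a narrow window around the cutoff. For a small positive number δ:
$$E[y_i \mid x_0 - \delta < x_i < x_0] \simeq E[y_{0i} \mid x_i = x_0], \qquad E[y_i \mid x_0 \le x_i < x_0 + \delta] \simeq E[y_{1i} \mid x_i = x_0]$$
so that
$$\lim_{\delta \to 0} \left( E[y_i \mid x_0 \le x_i < x_0 + \delta] - E[y_i \mid x_0 - \delta < x_i < x_0] \right) = E[y_{1i} - y_{0i} \mid x_i = x_0]$$
Comparing averages in small neighborhoods left and right of x0 provides an estimate that doesn't depend on correctly specifying f(xi).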
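The window comparison can be sketched as a difference in means for shrinking half-widths. The half-width name `delta`, the data-generating values, and the helper `window_difference` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, x0, rho_true = 20_000, 0.5, 2.0

x = rng.uniform(0.0, 1.0, n)
d = (x >= x0).astype(float)
y = 1.0 + 3.0 * x + rho_true * d + rng.normal(0.0, 1.0, n)

def window_difference(y, x, x0, delta):
    """Mean of y just right of x0 minus mean of y just left of x0."""
    right = (x >= x0) & (x < x0 + delta)
    left = (x > x0 - delta) & (x < x0)
    n_used = int(right.sum() + left.sum())
    return y[right].mean() - y[left].mean(), n_used

estimates = {}
for delta in (0.5, 0.1, 0.02):
    est, n_used = window_difference(y, x, x0, delta)
    estimates[delta] = est
    print(f"delta={delta:<5} n={n_used:<6} estimate={est:.2f}")
```

Wide windows fold the trend in x into the comparison and overstate the jump in this setup; narrow windows remove that bias but use far fewer observations, so the estimate gets noisier.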
Practical approaches:
| Check | What to Look For |
|---|---|
| Bandwidth sensitivity | Estimates should be stable as you narrow the window around x0 (fewer polynomial terms needed) |
| Pre-treatment covariates | No jump in covariates determined before treatment (balance check) |
| Density of running variable | No bunching/manipulation around x0 (McCrary, 2008 test) |
| Placebo cutoffs | No jumps at other values of xi where there is no policy change |
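The checks in the table can be sketched on simulated data. Everything here is an assumption made for illustration: the data-generating process, the local linear helper `local_linear_jump`, the bandwidths, and the crude bin counts, which only stand in for a density check and are not McCrary's (2008) test.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x = rng.uniform(-1.0, 1.0, n)                 # running variable, cutoff at 0
d = (x >= 0.0).astype(float)
w = 0.5 * x + rng.normal(0.0, 1.0, n)         # pre-treatment covariate, smooth in x
y = 1.0 + 2.0 * x + 1.5 * d + 0.8 * w + rng.normal(0.0, 1.0, n)

def local_linear_jump(outcome, x, cutoff, h):
    """Jump at `cutoff` from a linear fit with separate slopes on |x - cutoff| < h."""
    keep = np.abs(x - cutoff) < h
    xc = x[keep] - cutoff
    t = (xc >= 0.0).astype(float)
    X = np.column_stack([np.ones(xc.size), xc, t, t * xc])
    return np.linalg.lstsq(X, outcome[keep], rcond=None)[0][2]

# Bandwidth sensitivity: the outcome jump should not move much as h shrinks
jumps_by_h = {h: local_linear_jump(y, x, 0.0, h) for h in (0.5, 0.25, 0.1)}

# Covariate balance: a pre-treatment covariate should show no jump
jump_w = local_linear_jump(w, x, 0.0, 0.5)

# Placebo cutoff: no jump at 0.5, using treated-side observations only
treated = x >= 0.0
jump_placebo = local_linear_jump(y[treated], x[treated], 0.5, 0.5)

# Crude density check: similar counts just left and just right of the cutoff
n_left = int(((x >= -0.1) & (x < 0.0)).sum())
n_right = int(((x >= 0.0) & (x < 0.1)).sum())

print(jumps_by_h, round(jump_w, 2), round(jump_placebo, 2), n_left, n_right)
```

In real data none of these checks proves the design valid; a failed check is informative, while a passed one only removes one reason for doubt.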
Question (Lee, 2008): Does winning an election give a party an advantage in the next election (the incumbency effect)?
Key insight: Because di = 1(vote margin ≥ 0) is a deterministic function of xi, there are no confounding variables other than xi. This is a signal feature of RD.
Results:
Validity check: Lee examines Democratic victories before the last election. These should show no jump at the current cutoff — and they don't, increasing confidence in the design.
Manipulation concern: Could parties manipulate vote shares near the cutoff?
The 2000 Florida recount suggests this is a real concern in close elections. McCrary (2008) proposes formal tests for manipulation by examining the density of xi around x0.
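In many settings, crossing the cutoff doesn't perfectly determine treatment; it only changes the probability of treatment. This is fuzzy RD:
$$P(d_i = 1 \mid x_i) = \begin{cases} g_1(x_i) & \text{if } x_i \ge x_0 \\ g_0(x_i) & \text{if } x_i < x_0 \end{cases}, \qquad g_1(x_0) \ne g_0(x_0)$$
The functions g0 and g1 can be anything as long as they differ at x0 (and the more the better!).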
Define ti = 1(xi ≥ x0) as a dummy for crossing the threshold. The discontinuity ti becomes an instrument for treatment di.
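2SLS Setup:
First stage:
$$d_i = \pi_0 + \pi_1 x_i + \cdots + \pi_p x_i^p + \gamma t_i + \xi_{1i}$$
where γ is the first-stage effect (jump in treatment probability at cutoff).
Second stage:
$$y_i = \alpha + \beta_1 x_i + \cdots + \beta_p x_i^p + \rho d_i + \eta_i$$
Substituting the first stage into the second stage:
$$y_i = \mu + \kappa_1 x_i + \cdots + \kappa_p x_i^p + \rho\gamma\, t_i + \xi_{2i}$$
with μ = α + ρπ0, κj = βj + ρπj, and ξ2i = ηi + ρξ1i. The reduced-form coefficient on ti equals ργ (causal effect × first stage).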
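The relationship between first stage, reduced form, and 2SLS can be checked numerically. The sketch below uses simulated data with a linear control in xi; the unobserved variable `v`, the parameter values, and the helper `ols` are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, rho_true, gamma_true = 20_000, 2.0, 0.5

x = rng.uniform(-1.0, 1.0, n)            # running variable, cutoff at 0
t = (x >= 0.0).astype(float)             # instrument: crossing the threshold
v = rng.uniform(0.0, 1.0, n)             # unobserved taste for treatment
d = (v < 0.2 + 0.1 * x + gamma_true * t).astype(float)  # P(d=1|x) jumps by gamma at 0
# The outcome also depends on v, so d is endogenous and OLS on d is biased
y = 1.0 + 1.0 * x + rho_true * d + 2.0 * (0.5 - v) + rng.normal(0.0, 1.0, n)

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

Z = np.column_stack([np.ones(n), x, t])  # constant, control in x, instrument
first_stage = ols(Z, d)                  # coefficient on t estimates gamma
reduced_form = ols(Z, y)                 # coefficient on t estimates rho * gamma
gamma_hat = first_stage[2]
rho_iv = reduced_form[2] / gamma_hat

# Manual 2SLS: replace d with its first-stage fitted values
d_hat = Z @ first_stage
rho_2sls = ols(np.column_stack([np.ones(n), x, d_hat]), y)[2]

# Naive OLS for comparison
rho_ols = ols(np.column_stack([np.ones(n), x, d]), y)[2]

print(f"gamma={gamma_hat:.2f} IV={rho_iv:.2f} 2SLS={rho_2sls:.2f} OLS={rho_ols:.2f}")
```

With one instrument and the same controls in both stages, the 2SLS coefficient and the ratio of reduced form to first stage are the same number; standard errors from the manual second stage would be wrong, so a dedicated IV routine is preferable in applied work.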
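In a small neighborhood around x0, fuzzy RD becomes a simple Wald/IV estimator:
$$\lim_{\delta \to 0} \frac{E[y_i \mid x_0 \le x_i < x_0 + \delta] - E[y_i \mid x_0 - \delta < x_i < x_0]}{E[d_i \mid x_0 \le x_i < x_0 + \delta] - E[d_i \mid x_0 - \delta < x_i < x_0]} = \rho$$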
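A sketch of the Wald estimator in a narrow window: the jump in mean outcomes divided by the jump in treatment rates. The data-generating process, the half-width 0.05, and the helper name `wald_rd` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho_true = 100_000, 2.0

x = rng.uniform(-1.0, 1.0, n)            # running variable, cutoff at 0
v = rng.uniform(0.0, 1.0, n)             # unobserved taste for treatment
d = (v < 0.2 + 0.1 * x + 0.5 * (x >= 0.0)).astype(float)
y = 1.0 + 1.0 * x + rho_true * d + 2.0 * (0.5 - v) + rng.normal(0.0, 1.0, n)

def wald_rd(y, d, x, x0, delta):
    """Ratio of the outcome jump to the treatment-rate jump within delta of x0."""
    right = (x >= x0) & (x < x0 + delta)
    left = (x > x0 - delta) & (x < x0)
    jump_y = y[right].mean() - y[left].mean()
    jump_d = d[right].mean() - d[left].mean()
    return jump_y / jump_d

rho_wald = wald_rd(y, d, x, 0.0, 0.05)
print(f"Wald estimate: {rho_wald:.2f} (true value {rho_true})")
```

No functional form for the trend is specified here; the price is that only observations within the window contribute, and a weak jump in treatment rates would make the ratio unstable.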
Fuzzy RD estimates a Local Average Treatment Effect (LATE): the effect for compliers, individuals whose treatment status changes as xi moves from just below to just above x0.
Double locality: the estimate applies only to units with xi near x0 and, among those, only to compliers.
Question (Angrist & Lavy, 1999): Do smaller classes improve student test scores? (The same question as the Tennessee STAR experiment.)
Setting: Israeli schools have a maximum class size of 40 ("Maimonides' Rule").
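Maimonides' Rule formula:
$$m_{sc} = \frac{e_s}{\operatorname{int}\left[\frac{e_s - 1}{40}\right] + 1}$$
where es = enrollment, msc = predicted class size, and int[·] takes the integer part.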
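The rule is easy to tabulate. The function below is a minimal sketch; the name `maimonides_class_size` and the `cap` parameter are hypothetical, with the cap of 40 taken from the setting described above.

```python
def maimonides_class_size(enrollment: int, cap: int = 40) -> float:
    """Predicted class size when a grade is split into equal classes of at most `cap`."""
    n_classes = (enrollment - 1) // cap + 1   # integer part of (e - 1) / cap, plus one
    return enrollment / n_classes

for e in (39, 40, 41, 80, 81, 120, 121):
    print(e, maimonides_class_size(e))
```

Predicted class size climbs to 40 and then drops sharply when one more student forces an extra class (41 students gives two classes of 20.5; 81 gives three classes of 27), which produces the sawtooth used as the source of discontinuities.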
Maimonides' Rule doesn't predict class size perfectly — some schools split classes at enrollments below 40. This creates a fuzzy design.
| RD Component | In This Study |
|---|---|
| Running variable (xi) | Grade enrollment (es) |
| Cutoffs (x0) | 40, 80, 120, ... |
| Treatment (di) | Actual class size (nsc) |
| Instrument (ti) | Predicted class size from Maimonides' Rule (msc) |
| Outcome (yi) | Test scores |
Figure: Class size (vertical axis) against enrollment (horizontal axis). Predicted class size under Maimonides' Rule rises to 40 and drops sharply at enrollments of 41 and 81; actual class size follows the same sawtooth pattern less exactly (fuzzy).
| | OLS | OLS | OLS | 2SLS (Full) | 2SLS (Full) | 2SLS (±5) | 2SLS (±5) | Wald (±3) |
|---|---|---|---|---|---|---|---|---|
| Class size | +.322 | +.076 | +.019 | −.230 | −.261 | −.185 | −.443 | −.270 |
| (s.e.) | (.039) | (.036) | (.044) | (.092) | (.113) | (.151) | (.236) | (.281) |
| Controls | None | %disadv | +enroll | linear | quadratic | linear | quadratic | dummies |
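Key findings: OLS estimates are positive and shrink toward zero as controls are added (+.322, +.076, +.019), while every 2SLS and Wald estimate is negative (between −.185 and −.443).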
Interpretation: With a class-size coefficient of about −0.25, a 7-student reduction in class size (as in Tennessee STAR) raises math scores by about 7 × 0.25 = 1.75 points, an effect size of roughly 0.18σ. This is similar to the Tennessee STAR results.
Precision vs. Robustness tradeoff:
As we shrink the discontinuity sample, estimates become less precise (larger s.e.) but more robust to functional form assumptions. The fact that estimates remain stable (~−0.25) across specifications is reassuring.
| Concept | Key Point |
|---|---|
| RD Core Idea | Arbitrary rules create natural experiments — treatment determined by cutoff in running variable |
| Sharp RD | di = 1(xi ≥ x0) deterministically; selection-on-observables story |
| Fuzzy RD | P(di=1) jumps at x0; IV setup where ti=1(xi≥x0) instruments for di |
| Identification | Distinguish discontinuous jump (treatment) from smooth trend (running variable) |
| Functional Form | Must model E[y0|x] — use polynomials, allow different slopes, or focus on narrow bandwidth |
| Validity Checks | Pre-treatment covariate balance, no manipulation (density test), placebo cutoffs, bandwidth sensitivity |
| LATE | RD estimates are local to x0; fuzzy RD is LATE for compliers at the cutoff |
Practical Checklist for RD:
Sharp vs. Fuzzy Summary:
| | Sharp RD | Fuzzy RD |
|---|---|---|
| Treatment at cutoff | Switches 0→1 with certainty | Probability increases |
| Estimation | OLS with polynomial controls | 2SLS (IV) |
| Estimand | ATE at x0 | LATE for compliers at x0 |
| Example | Lee (2008) — election win | Angrist & Lavy (1999) — class size |