“Robust” Standard Errors

Heteroskedasticity and Clustering

Francis J. DiTraglia

University of Oxford

The Linear Regression “Assumptions”

You can always do a linear projection. – Jim Hamilton

  • No assumptions needed: can always run linear regression!
    • Minimize \(\mathbb{E}[(Y - X'\beta)^2]\) over \(\beta\).
  • Assumptions required for:
    • Interpretation (causality etc.)
    • Inference (standard errors)
  • Today we’ll focus on inference for linear regression.

Review of “Classical” Regression SEs

This is Just Algebra

If \(Y_i = \alpha + \beta X_i + U_i\), then

\[ \begin{align} \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_{i=1}^n (X_i - \bar{X})Y_i \\ &= \sum_{i=1}^n (X_i - \bar{X}) (\alpha + \beta X_i + U_i) \\ &= \alpha \sum_{i=1}^n (X_i - \bar{X}) + \beta \sum_{i=1}^n (X_i - \bar{X})X_i + \sum_{i=1}^n (X_i - \bar{X}) U_i\\ &= \beta \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (X_i - \bar{X}) U_i \end{align} \]

OLS Slope Estimator

\[ \begin{align*} \widehat{\beta} &= \frac{\sum_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}\\ \\ &= \frac{\beta \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (X_i - \bar{X}) U_i }{\sum_{i=1}^n (X_i - \bar{X})^2} \\ \\ &= \beta + \frac{\sum_{i=1}^n U_i (X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2} \end{align*} \]
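A quick numerical check of this algebra (the DGP and seed below are illustrative, not from the lecture):

set.seed(1066)  # illustrative seed
n <- 100
x <- runif(n)
y <- 0.2 + 0.9 * x + rnorm(n)  # homoskedastic errors, for now
# closed-form slope from the derivation above
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
all.equal(b_hat, unname(coef(lm(y ~ x))[2]))  # TRUE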

Inference Conditional on \(X\)

\[ \begin{align} \text{Var}(\widehat{\beta}|X_1, \dots, X_n) &= \text{Var}\left[\left.\beta + \frac{\sum_{i=1}^n U_i (X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2}\right| \mathbf{X}\right] \\ \\ &=\frac{\text{Var}\left[\left.\sum_{i=1}^n U_i (X_i - \bar{X})\right| \mathbf{X}\right]}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} \end{align} \]

The question is how we simplify the numerator…

“Classical” Standard Errors

Conditional on all the \(X_i\) observations, all of the errors \(U_i\) are uncorrelated and have the same variance: \(\text{Var}[\mathbf{U}|\mathbf{X}] = \sigma^2 \mathbf{I}\)

\[ \begin{align} \text{Var}\left[\left.\sum_{i=1}^n U_i (X_i - \bar{X})\right| \mathbf{X}\right] &= \sum_{i=1}^n \text{Var}\left[\left. U_i (X_i - \bar{X})\right| \mathbf{X}\right] \\ &= \sum_{i=1}^n (X_i - \bar{X})^2 \text{Var}[U_i|\mathbf{X}] \\ &= \sigma^2 \sum_{i=1}^n (X_i - \bar{X})^2 \end{align} \]

“Classical” Standard Errors

\(\text{Var}[\mathbf{U}|\mathbf{X}] = \sigma^2 \mathbf{I}\) implies that

\[ \text{Var}(\widehat{\beta}|\mathbf{X}) = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} =\frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}. \] Then simply estimate \(\sigma^2\) using OLS residuals.
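Continuing the illustrative simulation from above, a minimal sketch of this formula computed by hand; it reproduces the SE that summary(lm()) reports:

fit <- lm(y ~ x)
sigma2_hat <- sum(resid(fit)^2) / (n - 2)  # estimate sigma^2 from OLS residuals
se_classical <- sqrt(sigma2_hat / sum((x - mean(x))^2))
all.equal(se_classical, coef(summary(fit))['x', 'Std. Error'])  # TRUE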

Weakening the Assumptions

  • Heteroskedasticity
    • Variance of \(U_i\) changes with \(X_i\)
    • Can’t factor out \(\sigma^2\) in the sum!
  • Clustered Sampling
    • Correlation between \(U_i\) and \(U_j\) given \(\mathbf{X}\)
    • Can’t exchange sum and variance!
  • In either case, it all comes down to how we estimate the variance of the sum \(\sum_{i=1}^n U_i(X_i - \bar{X})\) given \(\mathbf{X}\).

Heteroskedasticity

What is heteroskedasticity?

  • Greek-inspired word that literally means “different scatter”
    • Homoskedasticity / Homoskedastic: \(\text{Var}(U_i|X_i) = \mathbb{E}(U_i^2|X_i) = \sigma^2(X_i) = \sigma^2\)
    • Heteroskedasticity / Heteroskedastic: \(\text{Var}(U_i|X_i) = \mathbb{E}(U_i^2|X_i) = \sigma^2(X_i)\) varies with \(X_i\)
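To make the definition concrete, here is a heteroskedastic version of the illustrative simulation from above; the variance function 2x matches the one used in the exercise below:

u <- rnorm(n, sd = sqrt(2 * x))  # Var(U|X = x) = 2x, increasing in x
y <- 0.2 + 0.9 * x + u
c(sd(u[x < 0.5]), sd(u[x >= 0.5]))  # error spread is larger for larger x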

Heteroskedasticity

Conditional on all the \(X_i\) observations, all of the errors \(U_i\) are uncorrelated but \(\text{Var}(U_i|\mathbf{X}) = \text{Var}(U_i|X_i) = \sigma^2(X_i)\)

\[ \begin{align} \text{Var}\left[\left.\sum_{i=1}^n U_i (X_i - \bar{X})\right| \mathbf{X}\right] &= \sum_{i=1}^n \text{Var}\left[\left. U_i (X_i - \bar{X})\right| \mathbf{X}\right] \\ &= \sum_{i=1}^n (X_i - \bar{X})^2 \text{Var}[U_i|\mathbf{X}] \\ &= \sum_{i=1}^n \sigma^2(X_i) (X_i - \bar{X})^2 \end{align} \]

Dealing with Heteroskedasticity

\[ \text{Var}(\mathbf{U}|\mathbf{X}) = \text{diag}\{ \sigma^2(X_i)\}, \; \text{Var}(\widehat{\beta}|\mathbf{X}) =\frac{\sum_{i=1}^n \sigma^2(X_i) (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} \]

  1. Weighted Least Squares
    • Model and estimate the function \(\sigma^2(X_i)\)
    • Run OLS on a re-weighted model: new estimates and SEs (sketched after this list)
  2. Heteroskedasticity-Consistent Standard Errors
    • Don’t model \(\sigma^2(X_i)\), just substitute \(\hat{U}_i^2\) for it.
    • Same estimates as original OLS, but new SEs.
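A minimal sketch of option 1, using the heteroskedastic simulation from above. Here \(\text{Var}(U|X) = 2x\) is known by construction, so the weights are 1/(2x); in practice you would first have to model and estimate \(\sigma^2(X)\):

# lm() with weights minimizes sum_i w_i * (y_i - a - b * x_i)^2
fit_wls <- lm(y ~ x, weights = 1 / (2 * x))
coef(fit_wls)  # new estimates; summary(fit_wls) reports the corresponding SEs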

Heteroskedasticity-Consistent SEs

  • By definition, \(\text{Var}(U_i|X_i) = \mathbb{E}(U_i^2|X_i) = \sigma^2(X_i)\).
  • If \(U_i^2\) were observed, could substitute for \(\sigma^2(X_i)\) to get unbiased variance estimator: \[ \mathbb{E}\left\{\left. \frac{\sum_{i=1}^n U_i^2 (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2}\right| \mathbf{X}\right\} =\frac{\sum_{i=1}^n \mathbb{E}(U_i^2|X_i) (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} = \text{Var}(\widehat{\beta}|\mathbf{X}) \]
  • Squared OLS residual \(\hat{U}_i^2\) is a consistent estimator of \(U_i^2\)
  • Hence the Heteroskedasticity-Consistent variance estimator (aka Huber-Eicker-White), computed by hand in the sketch after this list: \[ \frac{\sum_{i=1}^n \hat{U}_i^2 (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} \approx \text{Var}(\widehat{\beta}|\mathbf{X}) \]
  • The above is called the “HC0” estimator to distinguish it from some alternatives:
    • HC1, HC2, and HC3: “tweaks” to improve finite-sample performance (problem set).
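A minimal by-hand sketch of HC0, continuing the heteroskedastic simulation from above:

u_hat <- resid(lm(y ~ x))  # squared OLS residuals stand in for sigma^2(X_i)
xt <- x - mean(x)
se_hc0 <- sqrt(sum(u_hat^2 * xt^2) / sum(xt^2)^2)
se_hc0  # compare with lm_robust(y ~ x, se_type = 'HC0')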

lm_robust()

  • The lm_robust() function from the estimatr package provides heteroskedasticity-consistent SEs.
  • Syntax is nearly identical to lm() except:
    • se_type = 'classical' gives same SEs as lm()
    • se_type = 'stata' matches Stata’s “reg, robust”
    • Others: HC0, HC1 (Stata), HC2 (default), HC3.
    • You’ll explore these on the problem set.
  • Plays nicely with broom::tidy() and modelsummary()

lm_robust() - Example

library(estimatr)
library(modelsummary)

# mydat holds simulated data (columns y1 and x) from earlier in the lecture
reg_classical <- lm_robust(y1 ~ x, mydat, se_type = 'classical')

reg_robust <- lm_robust(y1 ~ x, mydat, se_type = 'HC0')

modelsummary(list(Classical = reg_classical, Robust = reg_robust), 
             fmt = 2, 
             gof_omit = 'R2 Adj.|AIC|BIC')

lm_robust() - Example

              Classical   Robust
(Intercept)   0.34        0.34
              (0.22)      (0.17)
x             0.78        0.78
              (0.38)      (0.40)
Num.Obs.      100         100
R2            0.041       0.041
RMSE          1.03        1.03

Things to notice

  • The estimates are identical; only the standard errors change
  • The robust SE for the intercept was smaller
  • The robust SE for the slope was larger
  • Different SEs \(\implies\) different CIs and p-values
  • It can go either way, but robust SEs are typically larger than classical ones.

F-tests with lm_robust

  • Heteroskedasticity invalidates the lm-based F-test from linearHypothesis()
  • Easy to fix: just use lm_robust with desired SEs instead
library(car)
library(broom)
linearHypothesis(reg_classical, 'x = 0') |> tidy()
# A tibble: 1 × 8
  term  null.value estimate std.error statistic p.value df.residual    df
  <chr>      <dbl>    <dbl>     <dbl>     <dbl>   <dbl>       <dbl> <dbl>
1 x              0    0.777     0.379      4.21  0.0402          98     1
linearHypothesis(reg_robust, 'x = 0') |> tidy()
# A tibble: 1 × 8
  term  null.value estimate std.error statistic p.value df.residual    df
  <chr>      <dbl>    <dbl>     <dbl>     <dbl>   <dbl>       <dbl> <dbl>
1 x              0    0.777     0.403      3.72  0.0538          98     1

💪 Exercise

  1. Replicate the simulation from the right panel of my plot above:
    1. Set the seed to 4321 and generate n = 100 uniform draws for x
    2. Set y equal to 0.2 + 0.9 * x + error where error is a vector of independent, mean-zero normal errors with standard deviation sqrt(2 * x).
    3. Replicate my plot and check that yours matches it.
  2. Using x and y, replicate my regression and F-test results from above.
  3. Use the formulas from earlier in this lecture to compute the “classical” and “HC0” standard errors for the regression slope “by hand” based on x and y. Check that your results match those of lm_robust().

Clustered Sampling

What is a clustered sample?

  • Complicated topic; lots of ongoing research.
  • Failure of iid sampling assumption
  • Roughly: correlation between errors in “groups”
  • Panel data:
    • If I randomly sample Ana, I observe her in all \(T\) periods.
  • Classrooms / Villages / etc.
    • First randomly sample villages.
    • Then randomly sample villagers within sampled villages.

Why is this a problem?

Mechanically: \[ \text{Var}\left[\left.\sum_{i=1}^n U_i(X_i - \bar{X})\right|\mathbf{X}\right] \neq \sum_{i=1}^n \text{Var}[U_i|\mathbf{X}](X_i - \bar{X})^2 \] Intuitively:

  • Observations “more similar” within cluster than between.
  • An extra observation from an already-sampled cluster provides “less info” than one from an un-sampled cluster.

Upgrading our Notation

\[ Y_{ig} = \alpha + \beta X_{ig} + U_{ig} \]

  • \(g = 1, 2, \dots, G\) indexes groups
  • \(i = 1, 2, \dots, n_g\) indexes individuals within group \(g\)
  • Total sample size: \(N = \sum_{g=1}^G n_g\)

Note

In a (balanced) panel dataset the groups are individuals and \(n_g = T\).

Re-writing the Variance Expression

\[ \begin{align} \text{Var}(\widehat{\beta}|\mathbf{X}) &=\frac{\text{Var}\left[\left.\sum_{j=1}^N U_j (X_j - \bar{X})\right| \mathbf{X}\right]}{\left[\sum_{j=1}^N (X_j - \bar{X})^2\right]^2} \\ \\ &=\frac{\text{Var}\left[\left.\sum_{g=1}^G \sum_{i=1}^{n_g} U_{ig} (X_{ig} - \bar{X})\right| \mathbf{X}\right]}{\left[\sum_{g=1}^G \sum_{i=1}^{n_g} (X_{ig} - \bar{X})^2\right]^2} \end{align} \]

As before, the key is the numerator; the denominator is easy.

Upgrading Our Notation Again

  • Define the shorthand \(\tilde{X}_{ig} \equiv (X_{ig} - \bar{X})\)
  • Define vectors of errors and regressors for each group
    • \(\mathbf{U}_g' = (U_{1g}, \dots, U_{n_g g})\)
    • \(\tilde{\mathbf{X}}_g' = (\tilde{X}_{1g}, \dots, \tilde{X}_{n_g g})\)

\[ \begin{align} \text{Var}(\widehat{\beta}|\mathbf{X}) &=\frac{\text{Var}\left[\left.\sum_{g=1}^G \sum_{i=1}^{n_g} U_{ig} \tilde{X}_{ig} \right| \mathbf{X}\right]}{\left[\sum_{g=1}^G \sum_{i=1}^{n_g} \tilde{X}_{ig}^2\right]^2}=\frac{\text{Var}\left[\left.\sum_{g=1}^G \tilde{\mathbf{X}}'_g \mathbf{U}_g \right| \mathbf{X}\right]}{\left[\sum_{g=1}^G \tilde{\mathbf{X}}_g'\tilde{\mathbf{X}}_g\right]^2} \end{align} \]

  • Just in case I haven’t said it enough already: the key is the numerator!

Random Sampling of Clusters

If clusters are drawn at random from some population, then there is no correlation between them: \[ \begin{align} \text{Var}\left[\left.\sum_{g=1}^G \tilde{\mathbf{X}}'_g \mathbf{U}_g \right| \mathbf{X}\right] &= \sum_{g=1}^G \text{Var}\left[\left. \tilde{\mathbf{X}}'_g \mathbf{U}_g\right|\mathbf{X} \right] \\ &= \sum_{g=1}^G \tilde{\mathbf{X}}_g' \, \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'|\mathbf{X}_g]\, \tilde{\mathbf{X}}_g\\ &= \sum_{g=1}^G \tilde{\mathbf{X}}_g' \Sigma_g \tilde{\mathbf{X}}_g. \end{align} \]

Cluster-Robust Standard Errors

\[ \begin{align} \text{Var}(\widehat{\beta}|\mathbf{X}) = \frac{\sum_{g=1}^G \tilde{\mathbf{X}}'_g \Sigma_g \tilde{\mathbf{X}}_g }{\left[\sum_{g=1}^G \tilde{\mathbf{X}}_g'\tilde{\mathbf{X}}_g\right]^2} \approx \frac{\sum_{g=1}^G \tilde{\mathbf{X}}'_g \hat{\mathbf{U}}_g \hat{\mathbf{U}}_g' \tilde{\mathbf{X}}_g}{\left[\sum_{g=1}^G \tilde{\mathbf{X}}_g'\tilde{\mathbf{X}}_g\right]^2} \end{align} \]

  • \(\Sigma_g \equiv \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'|\mathbf{X}_g]\) allows within-cluster correlation.
  • Also allows for heteroskedasticity since \(\Sigma_g\) depends on \(\mathbf{X}_g\).
  • The problem is that we don’t know \(\Sigma_g\)!
  • By analogy with Huber-Eicker-White SEs, the cluster-robust SE estimator replaces \(\Sigma_g\) with the outer product of the residuals, \(\hat{\mathbf{U}}_g \hat{\mathbf{U}}_g'\), as sketched below.
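A minimal by-hand sketch of the corresponding "CR0" computation for a single regressor, assuming a data frame dat with outcome y, regressor x, and cluster id g (all names illustrative):

cr0_se <- function(dat) {
  xt <- dat$x - mean(dat$x)
  u_hat <- resid(lm(y ~ x, data = dat))
  # sum over clusters of the squared within-cluster sum of u_hat * x_tilde
  meat <- sum(tapply(u_hat * xt, dat$g, sum)^2)
  sqrt(meat / sum(xt^2)^2)
}

# compare with lm_robust(y ~ x, data = dat, clusters = g, se_type = 'CR0')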

Does it really matter in practice?

Simplest interesting example:

  • Identically-sized groups: \(n_g = n\) for all \(g\)
  • Errors independent of regressors \(\implies\) homoskedastic
  • \(X_{ig}\) takes the same value for all \(i\) within a given cluster: \(X_g\)
  • \(U_{ig} = \epsilon_{ig} + \eta_g\)
    • group-specific unobserved error \(\eta_g\), indep across \(g\)
    • \(\epsilon_{ig}\) is iid across \(i\) and \(g\)
    • \(\text{Var}(U_{ig}) = \sigma^2\)

Simplest Interesting Example

\[ \begin{align} \Sigma_g &= \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'|\mathbf{X}_g] = \mathbb{E}[\mathbf{U}_g\mathbf{U}_g'|X_g] = \mathbb{E}[\mathbf{U}_g\mathbf{U}_g']\\ &= \sigma^2 \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots\\ \rho & \rho & \cdots & 1 \end{bmatrix} = \sigma^2[(1 - \rho) \mathbf{I}_n + \rho \,\boldsymbol{\iota}_n\boldsymbol{\iota}_n'] \end{align} \] where \(\rho \equiv \text{Var}(\eta_g) / \sigma^2\) and \(\sigma^2 \equiv \text{Var}(U_{ig})\).

Simplest Interesting Example

Suppose that:

  • \(\Sigma_g = \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'] = \sigma^2[(1 - \rho) \mathbf{I}_n + \rho \,\boldsymbol{\iota}_n\boldsymbol{\iota}_n']\)
  • \(X_{ig} = X_g\) is the same for everyone in group \(g\).

After some algebra: \[ \text{Var}(\widehat{\beta}|\mathbf{X}) = [1 + (n - 1)\rho]\times \left[ \frac{\sigma^2}{\sum_{j=1}^N (X_j - \bar{X})^2}\right] \] Clustering scales “usual” variance of \(\widehat{\beta}\) by \([1 + (n-1)\rho]\)!
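The omitted algebra is short. Since \(X_{ig} = X_g\) for all \(i\), we have \(\tilde{\mathbf{X}}_g = \tilde{X}_g \boldsymbol{\iota}_n\) with \(\tilde{X}_g \equiv X_g - \bar{X}\), and \(n \sum_{g=1}^G \tilde{X}_g^2 = \sum_{j=1}^N (X_j - \bar{X})^2\), so

\[ \begin{align} \tilde{\mathbf{X}}_g' \Sigma_g \tilde{\mathbf{X}}_g &= \tilde{X}_g^2 \, \boldsymbol{\iota}_n' \Sigma_g \boldsymbol{\iota}_n = \tilde{X}_g^2 \, \sigma^2 \left[(1 - \rho) n + \rho n^2\right] = \sigma^2 n [1 + (n-1)\rho] \, \tilde{X}_g^2 \\ \text{Var}(\widehat{\beta}|\mathbf{X}) &= \frac{\sigma^2 n [1 + (n-1)\rho] \sum_{g=1}^G \tilde{X}_g^2}{\left[n \sum_{g=1}^G \tilde{X}_g^2\right]^2} = [1 + (n-1)\rho] \times \left[\frac{\sigma^2}{\sum_{j=1}^N (X_j - \bar{X})^2}\right] \end{align} \]

For example, with \(n = 40\) individuals per cluster and \(\rho = 0.1\), the variance inflation factor is \(1 + 39 \times 0.1 = 4.9\): the correct SE is more than double the classical one (\(\sqrt{4.9} \approx 2.2\)).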

Clustered SEs with lm_robust()

  • Set argument clusters to name of grouping variable.
  • Set se_type to CR0, CR2 (default) or stata
  • Otherwise same as lm()
  • Works with tidy(), modelsummary(), linearHypothesis()

Duflo, Dupas & Kremer (2011)

  • Experimental data for 121 schools in Kenya
  • Schools randomly assigned to tracking or control (50/50)
    • tracking: students allocated to classes based on a pre-test score.
    • control: students allocated to classes at random.
  • Does tracking improve test scores?
  • Clustered sample at level of schools.
  • Expect correlation within schools, but not between.

Duflo, Dupas & Kremer (2011)

library(haven)

ddk2011 <- read_dta('https://ditraglia.com/data/DDK2011.dta')

# convert outcome to z-score
ddk2011 <- ddk2011 |> 
  mutate(testscore = scale(totalscore)) 

reg_classical <- lm(testscore ~ tracking, ddk2011)

# Defaults to HC2 
reg_heterosked <- lm_robust(testscore ~ tracking, ddk2011) 

# Defaults to CR2
reg_cluster <- lm_robust(testscore ~ tracking, ddk2011, 
                         clusters = schoolid)  

modelsummary(list(Classical = reg_classical, 
                  Heterosked = reg_heterosked,
                  Cluster = reg_cluster), 
             gof_omit = 'R2 Adj.|AIC|BIC|F|Log.Lik.')

Duflo, Dupas & Kremer (2011)

              Classical   Heterosked   Cluster
(Intercept)   −0.071      −0.071       −0.071
              (0.019)     (0.019)      (0.055)
tracking      0.138       0.138        0.138
              (0.026)     (0.026)      (0.078)
Num.Obs.      5795        5795         5795
R2            0.005       0.005        0.005
RMSE          1.00        1.00         1.00
Std.Errors                             by: schoolid

Wait, what about IV / TSLS?

  • iv_robust() from estimatr has you covered!
  • Syntax is very similar to ivreg() from AER
    • E.g. formula y ~ x1 + x2 | z1 + z2 declares x1 and x2 as endogenous regressors and z1 and z2 as exogenous IVs; exogenous regressors, if any, appear on both sides of the |.
  • Set clusters and se_type as in lm_robust()
  • Works with tidy(), modelsummary() etc.
  • More information and examples in the estimatr package documentation.
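A minimal hypothetical example (all variable names, including log_wage, education, experience, distance, and region, are illustrative):

library(estimatr)

# education is endogenous, distance is its instrument, and
# experience is an exogenous control appearing on both sides of the |
iv_fit <- iv_robust(log_wage ~ education + experience | distance + experience,
                    data = mydata, clusters = region, se_type = 'CR2')

summary(iv_fit)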