“Robust” Standard Errors

Heteroskedasticity and Clustering

Francis J. DiTraglia

University of Oxford

The Linear Regression “Assumptions”

You can always do a linear projection. – Jim Hamilton

  • No assumptions needed: can always run linear regression!
    • Minimize \(\mathbb{E}[(Y - X'\beta)^2]\) over \(\beta\).
  • Assumptions required for:
    • Interpretation (causality etc.)
    • Inference (standard errors)
  • Today we’ll focus on inference for linear regression.

Review of “Classical” Regression SEs

This is Just Algebra

If \(Y_i = \alpha + \beta X_i + U_i\), then

\[ \begin{align} \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_{i=1}^n (X_i - \bar{X})Y_i \\ &= \sum_{i=1}^n (X_i - \bar{X}) (\alpha + \beta X_i + U_i) \\ &= \alpha \sum_{i=1}^n (X_i - \bar{X}) + \beta \sum_{i=1}^n (X_i - \bar{X})X_i + \sum_{i=1}^n (X_i - \bar{X}) U_i\\ &= \beta \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (X_i - \bar{X}) U_i \end{align} \]

OLS Slope Estimator

\[ \begin{align*} \widehat{\beta} &= \frac{\sum_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}\\ \\ &= \frac{\beta \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (X_i - \bar{X}) U_i }{\sum_{i=1}^n (X_i - \bar{X})^2} \\ \\ &= \beta + \frac{\sum_{i=1}^n U_i (X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2} \end{align*} \]
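A quick numerical check of this algebra (the DGP and seed below are illustrative, not from the lecture):

set.seed(1066)  # illustrative seed
n <- 100
x <- runif(n)
y <- 0.2 + 0.9 * x + rnorm(n)  # homoskedastic errors, for now
# closed-form slope from the derivation above
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
all.equal(b_hat, unname(coef(lm(y ~ x))[2]))  # TRUE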

Inference Conditional on \(X\)

\[ \begin{align} \text{Var}(\widehat{\beta}|X_1, \dots, X_n) &= \text{Var}\left[\left.\beta + \frac{\sum_{i=1}^n U_i (X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2}\right| \mathbf{X}\right] \\ \\ &=\frac{\text{Var}\left[\left.\sum_{i=1}^n U_i (X_i - \bar{X})\right| \mathbf{X}\right]}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} \end{align} \]

The question is how we simplify the numerator…

“Classical” Standard Errors

Conditional on all the \(X_i\) observations, all of the errors \(U_i\) are uncorrelated and have the same variance: \(\text{Var}[\mathbf{U}|\mathbf{X}] = \sigma^2 \mathbf{I}\)

\[ \begin{align} \text{Var}\left[\left.\sum_{i=1}^n U_i (X_i - \bar{X})\right| \mathbf{X}\right] &= \sum_{i=1}^n \text{Var}\left[\left. U_i (X_i - \bar{X})\right| \mathbf{X}\right] \\ &= \sum_{i=1}^n (X_i - \bar{X})^2 \text{Var}[U_i|\mathbf{X}] \\ &= \sigma^2 \sum_{i=1}^n (X_i - \bar{X})^2 \end{align} \]

“Classical” Standard Errors

\(\text{Var}[\mathbf{U}|\mathbf{X}] = \sigma^2 \mathbf{I}\) implies that

\[ \text{Var}(\widehat{\beta}|\mathbf{X}) = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} =\frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}. \] Then simply estimate \(\sigma^2\) using OLS residuals.
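Continuing the illustrative simulation from above, a minimal sketch of this formula computed by hand; it reproduces the SE that summary(lm()) reports:

fit <- lm(y ~ x)
sigma2_hat <- sum(resid(fit)^2) / (n - 2)  # estimate sigma^2 from OLS residuals
se_classical <- sqrt(sigma2_hat / sum((x - mean(x))^2))
all.equal(se_classical, coef(summary(fit))['x', 'Std. Error'])  # TRUE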

Weakening the Assumptions

  • Heteroskedasticity
    • Variance of \(U_i\) changes with \(X_i\)
    • Can’t factor out \(\sigma^2\) in the sum!
  • Clustered Sampling
    • Correlation between \(U_i\) and \(U_j\) given \(\mathbf{X}\)
    • Can’t exchange sum and variance!
  • In either case, it all comes down to how we estimate the variance of the sum \(\sum_{i=1}^n U_i(X_i - \bar{X})\) given \(\mathbf{X}\).

Heteroskedasticity

What is heteroskedasticity?

  • Greek-inspired word that literally means “different scatter”
    • Homoskedasticity / Homoskedastic: \(\text{Var}(U_i|X_i) = \mathbb{E}(U_i^2|X_i) = \sigma^2(X_i) = \sigma^2\)
    • Heteroskedasticity / Heteroskedastic: \(\text{Var}(U_i|X_i) = \mathbb{E}(U_i^2|X_i) = \sigma^2(X_i)\) varies with \(X_i\)
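To make the definition concrete, here is a heteroskedastic version of the illustrative simulation from above; the variance function 2x matches the one used in the exercise below:

u <- rnorm(n, sd = sqrt(2 * x))  # Var(U|X = x) = 2x, increasing in x
y <- 0.2 + 0.9 * x + u
c(sd(u[x < 0.5]), sd(u[x >= 0.5]))  # error spread is larger for larger x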

Heteroskedasticity

Conditional on all the \(X_i\) observations, all of the errors \(U_i\) are uncorrelated but \(\text{Var}(U_i|\mathbf{X}) = \text{Var}(U_i|X_i) = \sigma^2(X_i)\)

\[ \begin{align} \text{Var}\left[\left.\sum_{i=1}^n U_i (X_i - \bar{X})\right| \mathbf{X}\right] &= \sum_{i=1}^n \text{Var}\left[\left. U_i (X_i - \bar{X})\right| \mathbf{X}\right] \\ &= \sum_{i=1}^n (X_i - \bar{X})^2 \text{Var}[U_i|\mathbf{X}] \\ &= \sum_{i=1}^n \sigma^2(X_i) (X_i - \bar{X})^2 \end{align} \]

Dealing with Heteroskedasticity

\[ \text{Var}(\mathbf{U}|\mathbf{X}) = \text{diag}\{ \sigma^2(X_i)\}, \; \text{Var}(\widehat{\beta}|\mathbf{X}) =\frac{\sum_{i=1}^n \sigma^2(X_i) (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} \]

  1. Weighted Least Squares
    • Model and estimate the function \(\sigma^2(X_i)\)
    • Run OLS on a re-weighted model: new estimates and SEs (sketched after this list)
  2. Heteroskedasticity-Consistent Standard Errors
    • Don’t model \(\sigma^2(X_i)\), just substitute \(\hat{U}_i^2\) for it.
    • Same estimates as original OLS, but new SEs.
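A minimal sketch of option 1, using the heteroskedastic simulation from above. Here \(\text{Var}(U|X) = 2x\) is known by construction, so the weights are 1/(2x); in practice you would first have to model and estimate \(\sigma^2(X)\):

# lm() with weights minimizes sum_i w_i * (y_i - a - b * x_i)^2
fit_wls <- lm(y ~ x, weights = 1 / (2 * x))
coef(fit_wls)  # new estimates; summary(fit_wls) reports the corresponding SEs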

Heteroskedasticity-Consistent SEs

  • By definition, \(\text{Var}(U_i|X_i) = \mathbb{E}(U_i^2|X_i) = \sigma^2(X_i)\).
  • If \(U_i^2\) were observed, could substitute for \(\sigma^2(X_i)\) to get unbiased variance estimator: \[ \mathbb{E}\left\{\left. \frac{\sum_{i=1}^n U_i^2 (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2}\right| \mathbf{X}\right\} =\frac{\sum_{i=1}^n \mathbb{E}(U_i^2|X_i) (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} = \text{Var}(\widehat{\beta}|\mathbf{X}) \]
  • Squared OLS residual \(\hat{U}_i^2\) is a consistent estimator of \(U_i^2\)
  • Hence the Heteroskedasticity-Consistent variance estimator (aka Huber-Eicker-White), computed by hand in the sketch after this list: \[ \frac{\sum_{i=1}^n \hat{U}_i^2 (X_i - \bar{X})^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2} \approx \text{Var}(\widehat{\beta}|\mathbf{X}) \]
  • The above is called the “HC0” estimator to distinguish it from some alternatives:
    • HC1, HC2, and HC3: “tweaks” to improve finite-sample performance (problem set).
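A minimal by-hand sketch of HC0, continuing the heteroskedastic simulation from above:

u_hat <- resid(lm(y ~ x))  # squared OLS residuals stand in for sigma^2(X_i)
xt <- x - mean(x)
se_hc0 <- sqrt(sum(u_hat^2 * xt^2) / sum(xt^2)^2)
se_hc0  # compare with lm_robust(y ~ x, se_type = 'HC0')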

lm_robust()

  • The lm_robust() function from the estimatr package provides heteroskedasticity-consistent SEs.
  • Syntax is nearly identical to lm() except:
    • se_type = 'classical' gives same SEs as lm()
    • se_type = 'stata' matches Stata’s “reg, robust”
    • Others: HC0, HC1 (Stata), HC2 (default), HC3.
    • You’ll explore these on the problem set.
  • Plays nicely with broom::tidy() and modelsummary()

lm_robust() - Example

library(estimatr)
library(modelsummary)

# mydat holds simulated data (columns y1 and x) from earlier in the lecture
reg_classical <- lm_robust(y1 ~ x, mydat, se_type = 'classical')

reg_robust <- lm_robust(y1 ~ x, mydat, se_type = 'HC0')

modelsummary(list(Classical = reg_classical, Robust = reg_robust), 
             fmt = 2, 
             gof_omit = 'R2 Adj.|AIC|BIC')

lm_robust() - Example

              Classical   Robust
(Intercept)   0.34        0.34
              (0.22)      (0.17)
x             0.78        0.78
              (0.38)      (0.40)
Num.Obs.      100         100
R2            0.041       0.041
RMSE          1.03        1.03

Things to notice

  • The estimates are identical; only the standard errors change
  • The robust SE for the intercept was smaller
  • The robust SE for the slope was larger
  • Different SEs \(\implies\) different CIs and p-values
  • It can go either way, but robust SEs are typically larger than classical ones.

F-tests with lm_robust

  • Heteroskedasticity invalidates the lm-based F-test from linearHypothesis()
  • Easy to fix: just use lm_robust with desired SEs instead
library(car)
library(broom)
linearHypothesis(reg_classical, 'x = 0') |> tidy()
# A tibble: 1 × 8
  term  null.value estimate std.error statistic p.value df.residual    df
  <chr>      <dbl>    <dbl>     <dbl>     <dbl>   <dbl>       <dbl> <dbl>
1 x              0    0.777     0.379      4.21  0.0402          98     1
linearHypothesis(reg_robust, 'x = 0') |> tidy()
# A tibble: 1 × 8
  term  null.value estimate std.error statistic p.value df.residual    df
  <chr>      <dbl>    <dbl>     <dbl>     <dbl>   <dbl>       <dbl> <dbl>
1 x              0    0.777     0.403      3.72  0.0538          98     1

💪 Exercise

  1. Replicate the simulation from the right panel of my plot above:
    1. Set the seed to 4321 and generate n = 100 uniform draws for x
    2. Set y equal to 0.2 + 0.9 * x + error where error is a vector of independent, mean-zero normal errors with standard deviation sqrt(2 * x).
    3. Replicate my plot and check that yours matches it.
  2. Using x and y, replicate my regression and F-test results from above.
  3. Use the formulas from earlier in this lecture to compute the “classical” and “HC0” standard errors for the regression slope “by hand” based on x and y. Check that your results match those of lm_robust().

Clustered Sampling

What is a clustered sample?

  • Complicated topic; lots of ongoing research.
  • Failure of iid sampling assumption
  • Roughly: correlation between errors in “groups”
  • Panel data:
    • If I randomly sample Ana, I observe her in all \(T\) periods.
  • Classrooms / Villages / etc.
    • First randomly sample villages.
    • Then randomly sample villagers within sampled villages.

Why is this a problem?

Mechanically: \[ \text{Var}\left[\left.\sum_{i=1}^n U_i(X_i - \bar{X})\right|\mathbf{X}\right] \neq \sum_{i=1}^n \text{Var}[U_i|\mathbf{X}](X_i - \bar{X})^2 \] Intuitively:

  • Observations “more similar” within cluster than between.
  • An extra observation from an already-sampled cluster provides “less info” than one from an un-sampled cluster.

Upgrading our Notation

\[ Y_{ig} = \alpha + \beta X_{ig} + U_{ig} \]

  • \(g = 1, 2, \dots, G\) indexes groups
  • \(i = 1, 2, \dots, n_g\) indexes individuals within group \(g\)
  • Total sample size: \(N = \sum_{g=1}^G n_g\)

Note

In a (balanced) panel dataset the groups are individuals and \(n_g = T\).

Re-writing the Variance Expression

\[ \begin{align} \text{Var}(\widehat{\beta}|\mathbf{X}) &=\frac{\text{Var}\left[\left.\sum_{j=1}^N U_j (X_j - \bar{X})\right| \mathbf{X}\right]}{\left[\sum_{j=1}^N (X_j - \bar{X})^2\right]^2} \\ \\ &=\frac{\text{Var}\left[\left.\sum_{g=1}^G \sum_{i=1}^{n_g} U_{ig} (X_{ig} - \bar{X})\right| \mathbf{X}\right]}{\left[\sum_{g=1}^G \sum_{i=1}^{n_g} (X_{ig} - \bar{X})^2\right]^2} \end{align} \]

As before, the key is the numerator; the denominator is easy.

Upgrading Our Notation Again

  • Define the shorthand \(\tilde{X}_{ig} \equiv (X_{ig} - \bar{X})\)
  • Define vectors of errors and regressors for each group
    • \(\mathbf{U}_g' = (U_{1g}, \dots, U_{n_g g})\)
    • \(\tilde{\mathbf{X}}_g' = (\tilde{X}_{1g}, \dots, \tilde{X}_{n_g g})\)

\[ \begin{align} \text{Var}(\widehat{\beta}|\mathbf{X}) &=\frac{\text{Var}\left[\left.\sum_{g=1}^G \sum_{i=1}^{n_g} U_{ig} \tilde{X}_{ig} \right| \mathbf{X}\right]}{\left[\sum_{g=1}^G \sum_{i=1}^{n_g} \tilde{X}_{ig}^2\right]^2}=\frac{\text{Var}\left[\left.\sum_{g=1}^G \tilde{\mathbf{X}}'_g \mathbf{U}_g \right| \mathbf{X}\right]}{\left[\sum_{g=1}^G \tilde{\mathbf{X}}_g'\tilde{\mathbf{X}}_g\right]^2} \end{align} \]

  • Just in case I haven’t said it enough already: the key is the numerator!

Random Sampling of Clusters

If clusters are drawn at random from some population, then there is no correlation between them: \[ \begin{align} \text{Var}\left[\left.\sum_{g=1}^G \tilde{\mathbf{X}}'_g \mathbf{U}_g \right| \mathbf{X}\right] &= \sum_{g=1}^G \text{Var}\left[\left. \tilde{\mathbf{X}}'_g \mathbf{U}_g\right|\mathbf{X} \right] \\ &= \sum_{g=1}^G \tilde{\mathbf{X}}_g' \, \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'|\mathbf{X}_g]\, \tilde{\mathbf{X}}_g\\ &= \sum_{g=1}^G \tilde{\mathbf{X}}_g' \Sigma_g \tilde{\mathbf{X}}_g. \end{align} \]

Cluster-Robust Standard Errors

\[ \begin{align} \text{Var}(\widehat{\beta}|\mathbf{X}) = \frac{\sum_{g=1}^G \tilde{\mathbf{X}}'_g \Sigma_g \tilde{\mathbf{X}}_g }{\left[\sum_{g=1}^G \tilde{\mathbf{X}}_g'\tilde{\mathbf{X}}_g\right]^2} \approx \frac{\sum_{g=1}^G \tilde{\mathbf{X}}'_g \hat{\mathbf{U}}_g \hat{\mathbf{U}}_g' \tilde{\mathbf{X}}_g}{\left[\sum_{g=1}^G \tilde{\mathbf{X}}_g'\tilde{\mathbf{X}}_g\right]^2} \end{align} \]

  • \(\Sigma_g \equiv \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'|\mathbf{X}_g]\) allows within-cluster correlation.
  • Also allows for heteroskedasticity since \(\Sigma_g\) depends on \(\mathbf{X}_g\).
  • The problem is that we don’t know \(\Sigma_g\)!
  • By analogy with Huber-Eicker-White SEs, the cluster-robust SE estimator replaces \(\Sigma_g\) with the outer product of the residuals, \(\hat{\mathbf{U}}_g \hat{\mathbf{U}}_g'\), as sketched below.
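A minimal by-hand sketch of the corresponding "CR0" computation for a single regressor, assuming a data frame dat with outcome y, regressor x, and cluster id g (all names illustrative):

cr0_se <- function(dat) {
  xt <- dat$x - mean(dat$x)
  u_hat <- resid(lm(y ~ x, data = dat))
  # sum over clusters of the squared within-cluster sum of u_hat * x_tilde
  meat <- sum(tapply(u_hat * xt, dat$g, sum)^2)
  sqrt(meat / sum(xt^2)^2)
}

# compare with lm_robust(y ~ x, data = dat, clusters = g, se_type = 'CR0')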

Does it really matter in practice?

Simplest interesting example:

  • Identically-sized groups: \(n_g = n\) for all \(g\)
  • Errors independent of regressors \(\implies\) homoskedastic
  • \(X_{ig}\) takes the same value for all \(i\) within a given cluster: \(X_g\)
  • \(U_{ig} = \epsilon_{ig} + \eta_g\)
    • group-specific unobserved error \(\eta_g\), indep across \(g\)
    • \(\epsilon_{ig}\) is iid across \(i\) and \(g\)
    • \(\text{Var}(U_{ig}) = \sigma^2\)

Simplest Interesting Example

\[ \begin{align} \Sigma_g &= \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'|\mathbf{X}_g] = \mathbb{E}[\mathbf{U}_g\mathbf{U}_g'|X_g] = \mathbb{E}[\mathbf{U}_g\mathbf{U}_g']\\ &= \sigma^2 \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots\\ \rho & \rho & \cdots & 1 \end{bmatrix} = \sigma^2[(1 - \rho) \mathbf{I}_n + \rho \,\boldsymbol{\iota}_n\boldsymbol{\iota}_n'] \end{align} \] where \(\rho \equiv \text{Var}(\eta_g) / \sigma^2\) and \(\sigma^2 \equiv \text{Var}(U_{ig})\).

Simplest Interesting Example

Suppose that:

  • \(\Sigma_g = \mathbb{E}[\mathbf{U}_g \mathbf{U}_g'] = \sigma^2[(1 - \rho) \mathbf{I}_n + \rho \,\boldsymbol{\iota}_n\boldsymbol{\iota}_n']\)
  • \(X_{ig} = X_g\) is the same for everyone in group \(g\).

After some algebra: \[ \text{Var}(\widehat{\beta}|\mathbf{X}) = [1 + (n - 1)\rho]\times \left[ \frac{\sigma^2}{\sum_{j=1}^N (X_j - \bar{X})^2}\right] \] Clustering scales “usual” variance of \(\widehat{\beta}\) by \([1 + (n-1)\rho]\)!
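The omitted algebra is short. Since \(X_{ig} = X_g\) for all \(i\), we have \(\tilde{\mathbf{X}}_g = \tilde{X}_g \boldsymbol{\iota}_n\) with \(\tilde{X}_g \equiv X_g - \bar{X}\), and \(n \sum_{g=1}^G \tilde{X}_g^2 = \sum_{j=1}^N (X_j - \bar{X})^2\), so

\[ \begin{align} \tilde{\mathbf{X}}_g' \Sigma_g \tilde{\mathbf{X}}_g &= \tilde{X}_g^2 \, \boldsymbol{\iota}_n' \Sigma_g \boldsymbol{\iota}_n = \tilde{X}_g^2 \, \sigma^2 \left[(1 - \rho) n + \rho n^2\right] = \sigma^2 n [1 + (n-1)\rho] \, \tilde{X}_g^2 \\ \text{Var}(\widehat{\beta}|\mathbf{X}) &= \frac{\sigma^2 n [1 + (n-1)\rho] \sum_{g=1}^G \tilde{X}_g^2}{\left[n \sum_{g=1}^G \tilde{X}_g^2\right]^2} = [1 + (n-1)\rho] \times \left[\frac{\sigma^2}{\sum_{j=1}^N (X_j - \bar{X})^2}\right] \end{align} \]

For example, with \(n = 40\) individuals per cluster and \(\rho = 0.1\), the variance inflation factor is \(1 + 39 \times 0.1 = 4.9\): the correct SE is more than double the classical one (\(\sqrt{4.9} \approx 2.2\)).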

Clustered SEs with lm_robust()

  • Set argument clusters to name of grouping variable.
  • Set se_type to CR0, CR2 (default) or stata
  • Otherwise same as lm()
  • Works with tidy(), modelsummary(), linearHypothesis()

Duflo, Dupas & Kremer (2011)

  • Experimental data for 121 schools in Kenya
  • Schools randomly assigned to tracking or control (50/50)
    • tracking: students allocated to classes based on a pre-test score.
    • control: students allocated to classes at random.
  • Does tracking improve test scores?
  • Clustered sample at level of schools.
  • Expect correlation within schools, but not between.

Duflo, Dupas & Kremer (2011)

library(haven)

ddk2011 <- read_dta('https://ditraglia.com/data/DDK2011.dta')

# convert outcome to z-score
ddk2011 <- ddk2011 |> 
  mutate(testscore = scale(totalscore)) 

reg_classical <- lm(testscore ~ tracking, ddk2011)

# Defaults to HC2 
reg_heterosked <- lm_robust(testscore ~ tracking, ddk2011) 

# Defaults to CR2
reg_cluster <- lm_robust(testscore ~ tracking, ddk2011, 
                         clusters = schoolid)  

modelsummary(list(Classical = reg_classical, 
                  Heterosked = reg_heterosked,
                  Cluster = reg_cluster), 
             gof_omit = 'R2 Adj.|AIC|BIC|F|Log.Lik.')

Duflo, Dupas & Kremer (2011)

              Classical   Heterosked   Cluster
(Intercept)   −0.071      −0.071       −0.071
              (0.019)     (0.019)      (0.055)
tracking      0.138       0.138        0.138
              (0.026)     (0.026)      (0.078)
Num.Obs.      5795        5795         5795
R2            0.005       0.005        0.005
RMSE          1.00        1.00         1.00
Std.Errors                             by: schoolid

Wait, what about IV / TSLS?

  • iv_robust() from estimatr has you covered!
  • Syntax is very similar to ivreg() from AER
    • E.g. formula y ~ x1 + x2 | z1 + z2 declares x1 and x2 as endogenous regressors and z1 and z2 as exogenous IVs; exogenous regressors, if any, appear on both sides of the |.
  • Set clusters and se_type as in lm_robust()
  • Works with tidy(), modelsummary() etc.
  • More information and examples in the estimatr package documentation.
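A minimal hypothetical example (all variable names, including log_wage, education, experience, distance, and region, are illustrative):

library(estimatr)

# education is endogenous, distance is its instrument, and
# experience is an exogenous control appearing on both sides of the |
iv_fit <- iv_robust(log_wage ~ education + experience | distance + experience,
                    data = mydata, clusters = region, se_type = 'CR2')

summary(iv_fit)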