Problem Set - Heteroskedasticity-Robust SE Simulation

As we saw in the lecture, the command lm_robust() from the estimatr package provides various “flavors” of heteroskedasticity-robust standard errors: HC0, HC1, HC2, and HC3. In this problem you will use simulation to compare the performance of these alternatives in the simplest possible example: a regression model with a single dummy variable \(D_i \in \{0, 1\}\) and normal errors.

Model set-up

Suppose that: \[ Y_i = \alpha + \beta D_i + U_i, \quad U_i|D_i \sim \text{indep. N}\big(0, (1 - D_i) \sigma^2 + D_i \big) \] for \(i = 1, ..., n\). Notice that this design allows for heteroskedasticity: if \(D_i = 1\) then \(U_i\) is standard normal, but if \(D_i = 0\) then \(U_i \sim \text{N}(0, \sigma^2)\). By changing the value of \(\sigma^2\) we can change the extent of heteroskedasticity in the simulation design.

OLS estimator

Our goal is to determine which method provides the most accurate approximation to the standard error of \(\widehat{\beta}\), the ordinary least squares (OLS) estimator of \(\beta\). Because this example is so simple, there’s no need to use lm_robust() or even lm() to calculate \(\widehat{\beta}\). It simply equals the difference of sample means for observations with \(D_i = 1\) compared to those with \(D_i = 0\), i.e. \[ \widehat{\beta} = \bar{Y}_1 - \bar{Y}_0, \quad \bar{Y_1} \equiv \frac{1}{n_1}\sum_{\{i \colon D_i = 1\}} Y_i, \quad \bar{Y_0} \equiv \frac{1}{n_0}\sum_{\{i \colon D_i = 0\}} Y_i \] where \(n\) is the total sample size, \(n_1\) is the number of observations for which \(D_i = 1\) and \(n_0\) is the number of observations for which \(D_i = 0\): \[ n_1 = \sum_{i=1}^n D_i, \quad n_0 = n - n_1. \]

Robust standard errors

Again, because this example is so simple, we don’t need to use lm_robust() to calculate the standard errors for us. Let \(\text{RSS}_0\) denote the residual sum of squares for the observations with \(D_i=0\) and \(\text{RSS}_1\) denote the total sum of squares for the observations with \(D_i=1\), in other words \[ \text{RSS}_0 \equiv \sum_{\{i \colon D_i = 0\}}^n (Y_i - \bar{Y}_0)^2,\quad \text{RSS}_1 \equiv \sum_{\{i \colon D_i = 1\}}^n (Y_i - \bar{Y}_1)^2 \] Using this notation, it can be shown that the usual, i.e. non-robust, standard error for \(\widehat{\beta}\) is given by \[ \text{SE}_{\text{usual}} = \sqrt{\left(\frac{n}{n_0 n_1}\right) \left( \frac{\text{RSS}_0 + \text{RSS}_1}{n - 2}\right)}. \] while the various flavors of “robust” standard errors are given by \[ \begin{aligned} \text{SE}_0 &= \sqrt{\frac{\text{RSS}_0}{n_0^2} + \frac{\text{RSS}_1}{n_1^2}}\\ \text{SE}_1 &= \sqrt{\frac{n}{n-2}\left(\frac{\text{RSS}_0}{n_0^2} + \frac{\text{RSS}_1}{n_1^2}\right)}\\ \text{SE}_2 &= \sqrt{\frac{\text{RSS}_0}{n_0(n_0 - 1)} + \frac{\text{RSS}_1}{n_1(n_1 - 1)}}\\ \text{SE}_3 &= \sqrt{\frac{\text{RSS}_0}{(n_0 - 1)^2} + \frac{\text{RSS}_1}{(n_1 - 1)^2}} \end{aligned} \] where \(\text{SE}_0\) corresponds to HC0, \(\text{SE}_1\) to HC1, and so on.

Simulation set-up

To keep things simple, set \(n = 30\) and \(\alpha = \beta = 0\) in all of your simulations below. The choice of \(\alpha\) and \(\beta\) doesn’t matter for the results. The choice of \(n\) does matter. I’ve chosen \(n = 30\) so that you can check your results against those from Table 8.1.1 of Mostly Harmless Econometrics if you wish. The parameters that will vary in this exercise are \(n_1\) and \(\sigma\).

Exercises

Exercise 1: Write code to generate simulation data.

Simulate one draw of the simulation design with \(n_1 = 3\) and \(\sigma = 0.5\). Is the sample balanced for \(n_1=3\)?

Exercise 2: Write code to calculate your estimates.

Write code to calculate each of the standard error estimators from above. Compute the standard errors for your simulation draw from exercise 1. Check your work by comparing against the standard errors provided by lm() and lm_robust(). They should match exactly.

Exercise 3: Run the simulation for fixed parameters.

Compared to \(\sigma=1\), does a design with \(\sigma=0.5\) have more heteroskedasticity? Why? (You should be able to answer this from the model set-up only.) Carry out a simulation study with \(n_1 = 3\), \(\sigma = 0.5\), and nreps = 100,000 simulation replications. You may want to consider parallelizing with future, although there are other ways to speed up the code by thinking carefully about the problem. Based on your results, answer the following questions: (i) What is the mean of the sampling distribution of \(\widehat{\beta}\)? (ii) What is the true standard error of \(\widehat{\beta}\)? (In other words, what is the standard deviation of the sampling distribution of \(\widehat{\beta}\)?) (iii) Like \(\widehat{\beta}\), each of the five standard errors from above has its own sampling distribution. What are the means and standard deviations of these distributions? (iv) Given a point estimate \(\widehat{\beta}\) and standard error estimate \(\widehat{\text{SE}}\) we can form a t-ratio \(|\widehat{\beta}|/\widehat{\text{SE}}\) to test the null hypothesis \(H_0\colon \beta =0\) against the two-sided alternative. In our simulation design \(\beta = 0\) so a 5% test should reject the null 5% of the time. If you base your test on a standard normal critical value, what is the actual rejection rate of a nominal 5% test for each of the five standard error estimators?

Pep talk: This is just another simulation study – you know how to code these up! If you’re stuck, go back the Monte Carlo recipe and use your code from exercises 1 and 2 to get you started. You can code this up in three functions:

draw_sim_data(sigma, n1, nreps = 1e5) draws the simulation data (based on exercise 1),
compute_SEs(sim_data) computes the SEs based on simulation data and returns (i)-(iv), and
get_sim_results(n1, sigma) combines the two functions and runs the study for fixed \(n_1\) and \(\sigma\).

Exercise 4

Repeat the preceding for \(n_1 = 3\) and \(\sigma = 0.85\). How much heteroskedasticity does this design have?
Repeat the preceding for \(n_1 = 3\) and \(\sigma = 1\). How much heteroskedasticity does this design have?

Exercise 5: Plot the SE sampling distribution.

Use the simulation results from exercises 3 and 4 above multi-density plot of the standard error sampling distribution. You should have three panels, one for each of the three simulation designs. Each panel should show the sampling distribution of the five standard error estimators from above.

Hint: You could create a faceted plot with one panel for each simulation exercise, or three separate plots. To create the multi-density plot, you may want to pivot_longer() your simulation data.

Exercise 6: Interpret your findings

Briefly discuss your results from the preceding three parts. Refer to the numbers you computed in exercises 2-4 and the plots you created in exercise 5. Are these good standard error estimates? You’ll notice that the robust standard errors have a much larger sample variance – why?

Hint: Consult chapter 8 of “Mostly Harmless Econometrics” to compare your results against Angrist & Pischke’s, and to see their interpretation.