# Exercise - Heteroskedasticity-Robust SEs

As we saw in lecture, the command lm_robust() from the estimatr package provides various “flavors” of heteroskedasticity-robust standard errors: HC0, HC1, HC2, and HC3. In this problem you will use simulation to compare the performance of these alternatives in the simplest possible example: a regression model with a single dummy variable $$D_i \in \{0, 1\}$$ and normal errors. Suppose that: $Y_i = \alpha + \beta D_i + U_i, \quad U_i|D_i \sim \text{indep. N}\big(0, (1 - D_i) \sigma^2 + D_i \big)$ for $$i = 1, ..., n$$. Notice that this design allows for heteroskedasticity. If $$D_i = 1$$ then $$U_i$$ is standard normal, but if $$D_i = 0$$ then $$U_i \sim \text{N}(0, \sigma^2)$$. By changing the value of $$\sigma^2$$ we can change the extent of heteroskedasticity in the simulation design.

Our goal is to determine which method provides the most accurate approximation to the standard error of $$\widehat{\beta}$$, the ordinary least squares estimator of $$\beta$$. Because this example is so simple, there’s no need to use lm_robust() or even lm() to calculate $$\widehat{\beta}$$. It simply equals the difference of sample means for observations with $$D_i = 1$$ compared to those with $$D_i = 0$$, i.e. $\widehat{\beta} = \bar{Y}_1 - \bar{Y}_0, \quad \bar{Y_1} \equiv \frac{1}{n_1}\sum_{\{i \colon D_i = 1\}} Y_i, \quad \bar{Y_0} \equiv \frac{1}{n_0}\sum_{\{i \colon D_i = 0\}} Y_i$ where $$n$$ is the total sample size, $$n_1$$ is the number of observations for which $$D_i = 1$$ and $$n_0$$ is the number of observations for which $$D_i = 0$$: $n_1 = \sum_{i=1}^n D_i, \quad n_0 = n - n_1.$ Again, because this example is so simple, we don’t need to use lm_robust() to calculate the standard errors for us. Let $$\text{RSS}_0$$ denote the residual sum of squares for the observations with $$D_i=0$$ and $$\text{RSS}_1$$ denote the total sum of squares for the observations with $$D_i=1$$, in other words $\text{RSS}_0 \equiv \sum_{\{i \colon D_i = 0\}}^n (Y_i - \bar{Y}_0)^2,\quad \text{RSS}_1 \equiv \sum_{\{i \colon D_i = 1\}}^n (Y_i - \bar{Y}_1)^2$ Using this notation, it can be shown that the usual, i.e. non-robust, standard error for $$\widehat{\beta}$$ is given by $\text{SE}_{\text{usual}} = \sqrt{\left(\frac{n}{n_0 n_1}\right) \left( \frac{\text{RSS}_0 + \text{RSS}_1}{n - 2}\right)}.$ while the various flavors of “robust” standard errors are given by \begin{aligned} \text{SE}_0 &= \sqrt{\frac{\text{RSS}_0}{n_0^2} + \frac{\text{RSS}_1}{n_1^2}}\\ \text{SE}_1 &= \sqrt{\frac{n}{n-2}\left(\frac{\text{RSS}_0}{n_0^2} + \frac{\text{RSS}_1}{n_1^2}\right)}\\ \text{SE}_2 &= \sqrt{\frac{\text{RSS}_0}{n_0(n_0 - 1)} + \frac{\text{RSS}_1}{n_1(n_1 - 1)}}\\ \text{SE}_3 &= \sqrt{\frac{\text{RSS}_0}{(n_0 - 1)^2} + \frac{\text{RSS}_1}{(n_1 - 1)^2}} \end{aligned} where $$\text{SE}_0$$ corresponds to HC0, $$\text{SE}_1$$ to HC1, and so on. To keep things simple, set $$n = 30$$ and $$\alpha = \beta = 0$$ in all of your simulations below. The choice of $$\alpha$$ and $$\beta$$ doesn’t matter for the results. The choice of $$n$$ does matter. I’ve chosen $$n = 30$$ so that you can check your results against those from Table 8.1.1 of Mostly Harmless Econometrics if you wish. The parameters that will vary in this exercise are $$n_1$$ and $$\sigma$$.

1. Write code to calculate each of the standard error estimators from above. Then simulate one draw of the simulation design with $$n_1 = 3$$ and $$\sigma = 0.5$$ and use it to check your work by comparing against the standard errors provided by lm() and lm_robust(). They should match exactly.

2. Carry out a simulation study with $$n_1 = 3$$ and $$\sigma = 0.5$$ to answer the following questions:

1. What is the mean of the sampling distribution of $$\widehat{\beta}$$?
2. What is the true standard error of $$\widehat{\beta}$$? (In other words, what is the standard deviation of the sampling distribution of $$\widehat{\beta}$$?)
3. Like $$\widehat{\beta}$$, each of the five standard errors from above has its own sampling distribution. What are the means and standard deviations of these distributions?
4. Given a point estimate $$\widehat{\beta}$$ and standard error estimate $$\widehat{\text{SE}}$$ we can form a t-ratio $$|\widehat{\beta}|/\widehat{\text{SE}}$$ to test the null hypothesis $$H_0\colon \beta =0$$ against the two-sided alternative. In our simulation design $$\beta = 0$$ so a 5% test should reject the null 5% of the time. If you base your test on a standard normal critical value, what is the actual rejection rate of a nominal 5% test for each of the five standard error estimators?
3. Repeat the preceding for $$n_1 = 3$$ and $$\sigma = 0.85$$.

4. Repeat the preceding for $$n_1 = 3$$ and $$\sigma = 1$$.

5. Briefly discuss your results from the preceding three parts.