Problem Set: Weak Instrument Simulation

In this problem you will simulate from the following simple instrumental variables model \[ \begin{aligned} Y_i &= \beta X_i + U_i \\ X_i &= \pi Z_i + V_i \end{aligned}, \quad \begin{bmatrix} U_i \\ V_i \end{bmatrix} \sim \text{iid Normal}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right) \] for \(i = 1, 2, ..., n\) where \((U_i, V_i)\) are independent of \(Z_i\). The purpose of this simulation is to illustrate what can go wrong with the IV estimator of \(\beta\) and its associated standard error when \(Z_i\) is a weak instrument, i.e. \(\pi\) is small in absolute value. In the simulations below you will fix the true value of \(\beta\) equal to zero. (Any value would give the same qualitative results, and this is the simplest possibility.) You will also hold the values of the instrument \(Z_1, Z_2, ..., Z_n\) fixed across simulation replications. Note that, since the errors in this model are independent of \(Z_i\), there is no need for heteroskedasticity-robust standard errors: use the plain-vanilla ones instead. Use \(n = 100\) throughout.

  1. Generate \(n = 100\) standard normal draws z_fixed. Center the result so that mean(z_fixed) is zero to the numerical precision of your machine, and scale them so that sum(z_fixed^2) equals 100. You will use these draws as the instrumental variable in all simulation replications below. Do not re-draw \(Z_i\) at any point below!

  2. Write a function called draw_sim_data() to simulate from the above model with \(n = 100\), \(\beta=0\) and instruments z_fixed. Your function should take two arguments: pi is the first-stage coefficient, and rho is the correlation between \(U_i\) and \(V_i\). It should return a data frame (or tibble) with three named columns: x, y, and z, where x is the endogenous regressor, y is the outcome, and z is z_fixed.

  3. Write a function called get_iv_stats() that takes a single argument dat, a data frame (or tibble) matching the output of draw_sim_data() from above. Your function should return a vector with named elements est and se: the instrumental variables estimator for \(\beta\) and corresponding standard error computed from dat$y, dat$x, and dat$z. Compute these by hand, i.e. without using ivreg.

  4. Using the functions you wrote in the two preceding parts, simulate 100 observations from the model described above with \(\pi = 1\) and \(\rho = 0.5\) and calculate the instrumental variables estimator and associated standard error for \(\beta\). Use ivreg to check your results.

  5. The concentration parameter \(\mu^2\) quantifies the strength of an instrumental variable. Larger values indicate a stronger instrument. In the model described above \(\mu^2 = \pi^2 \mathbf{Z}'\mathbf{Z}\). Given the way we have scaled z_fixed, this simplifies to \(\mu^2 = 100 \pi^2\). Write a function called replicate_iv_sim() that replicates the IV simulation experiment described above at fixed parameter values. It should take three arguments: n_reps is the number of simulation replications, conc is the concentration parameter \(\mu^2\) and rho is the correlation between \(U_i\) and \(V_i\) described above. It should return a data frame (or tibble) with columns est and se corresponding to the output of get_iv_stats(), and columns conc and rho that list the parameter values used. (The latter two columns will be helpful in the next part for use with ggplot2.)

  6. Use pmap() from purrr to call replicate_iv_sim() over a grid of parameter values: \(\rho \in\{ 0.5, 0.9, 0.99\}\) and \(\mu^2 \in \{0, 0.25, 10, 100\}\) with nreps = 10000. For faster computation, you may optionally use future_pmap() from the furrr package for parallel processing. Organize your results in a tibble called simulations. This will be quite a large tibble indeed: 120,000 rows! Use simulations to make a table showing how the median bias of the IV estimator changes with the parameter values. In other words: for each combination of values \(\rho\) and \(\mu^2\) calculate the difference between the median of the sampling distribution of the IV estimator and the true value of the parameter \(\beta\). Since the true parameter value is zero, the median of the sampling distribution equals the median bias. Comment on your results.

  7. Using the tibble simulations from the preceding part, make kernel density plots similar to those in Figure 1 (a) and (b) of Stock, Wright, & Yogo (2002; JBES) “A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments.” I suggest using ggplot2 and faceting by rho so that the different values of the concentration parameter appear on the same plot. I also strongly suggest filtering to exclude extremely large or small values before making your kernel density plots. Otherwise you won’t be able to see anything! (You might start with the range \([-5,5]\) and tweak from there.) You may also want to create similar plots for the t-statistics (est/se) to see how the sampling distribution of the test statistic changes with instrument strength. Comment on your findings.