Problem Set: Weak Instrument Simulation
In this problem you will simulate from the following simple instrumental variables model \[ \begin{aligned} Y_i &= \beta X_i + U_i \\ X_i &= \pi Z_i + V_i \end{aligned}, \quad \begin{bmatrix} U_i \\ V_i \end{bmatrix} \sim \text{iid Normal}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right) \] for \(i = 1, 2, ..., n\) where \((U_i, V_i)\) are independent of \(Z_i\). The purpose of this simulation is to illustrate what can go wrong with the IV estimator of \(\beta\) and its associated standard error when \(Z_i\) is a weak instrument, i.e. \(\pi\) is small in absolute value. In the simulations below you will fix the true value of \(\beta\) equal to zero. (Any value would give the same qualitative results, and this is the simplest possibility.) You will also hold the values of the instrument \(Z_1, Z_2, ..., Z_n\) fixed across simulation replications. Note that, since the errors in this model are independent of \(Z_i\), there is no need for heteroskedasticity-robust standard errors: use the plain-vanilla ones instead. Use \(n = 100\) throughout.
Generate \(n = 100\) standard normal draws
z_fixed. Center the result so thatmean(z_fixed)is zero to the numerical precision of your machine, and scale them so thatsum(z_fixed^2)equals 100. You will use these draws as the instrumental variable in all simulation replications below. Do not re-draw \(Z_i\) at any point below!Write a function called
draw_sim_data()to simulate from the above model with \(n = 100\), \(\beta=0\) and instrumentsz_fixed. Your function should take two arguments:piis the first-stage coefficient, andrhois the correlation between \(U_i\) and \(V_i\). It should return a data frame (or tibble) with three named columns:x,y, andz, wherexis the endogenous regressor,yis the outcome, andzisz_fixed.Write a function called
get_iv_stats()that takes a single argumentdat, a data frame (or tibble) matching the output ofdraw_sim_data()from above. Your function should return a vector with named elementsestandse: the instrumental variables estimator for \(\beta\) and corresponding standard error computed fromdat$y,dat$x, anddat$z. Compute these by hand, i.e. without usingivreg.Using the functions you wrote in the two preceding parts, simulate 100 observations from the model described above with \(\pi = 1\) and \(\rho = 0.5\) and calculate the instrumental variables estimator and associated standard error for \(\beta\). Use
ivregto check your results.The concentration parameter \(\mu^2\) quantifies the strength of an instrumental variable. Larger values indicate a stronger instrument. In the model described above \(\mu^2 = \pi^2 \mathbf{Z}'\mathbf{Z}\). Given the way we have scaled
z_fixed, this simplifies to \(\mu^2 = 100 \pi^2\). Write a function calledreplicate_iv_sim()that replicates the IV simulation experiment described above at fixed parameter values. It should take three arguments:n_repsis the number of simulation replications,concis the concentration parameter \(\mu^2\) andrhois the correlation between \(U_i\) and \(V_i\) described above. It should return a data frame (or tibble) with columnsestandsecorresponding to the output ofget_iv_stats(), and columnsconcandrhothat list the parameter values used. (The latter two columns will be helpful in the next part for use withggplot2.)Use
pmap()frompurrrto callreplicate_iv_sim()over a grid of parameter values: \(\rho \in\{ 0.5, 0.9, 0.99\}\) and \(\mu^2 \in \{0, 0.25, 10, 100\}\) withnreps = 10000. For faster computation, you may optionally usefuture_pmap()from thefurrrpackage for parallel processing. Organize your results in a tibble calledsimulations. This will be quite a large tibble indeed: 120,000 rows! Usesimulationsto make a table showing how the median bias of the IV estimator changes with the parameter values. In other words: for each combination of values \(\rho\) and \(\mu^2\) calculate the difference between the median of the sampling distribution of the IV estimator and the true value of the parameter \(\beta\). Since the true parameter value is zero, the median of the sampling distribution equals the median bias. Comment on your results.Using the tibble
simulationsfrom the preceding part, make kernel density plots similar to those in Figure 1 (a) and (b) of Stock, Wright, & Yogo (2002; JBES) “A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments.” I suggest usingggplot2and faceting byrhoso that the different values of the concentration parameter appear on the same plot. I also strongly suggest filtering to exclude extremely large or small values before making your kernel density plots. Otherwise you won’t be able to see anything! (You might start with the range \([-5,5]\) and tweak from there.) You may also want to create similar plots for the t-statistics (est/se) to see how the sampling distribution of the test statistic changes with instrument strength. Comment on your findings.