The “Classic” Linear Homogeneous Coefficients Model
University of Oxford
The more traditional way of writing this model: \[ Y = \alpha + \beta X + U, \quad \text{Cov}(Z, X) \neq 0, \quad \text{Cov}(Z,U) = 0 \]
\[ \begin{align} \beta_{OLS} &\equiv \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\\ \\ &= \frac{\text{Cov}(X, \alpha + \beta X + U)}{\text{Var}(X)} \\ \\ &= \beta + \frac{\text{Cov}(X,U)}{\text{Var}(X)}\\ \end{align} \]
\[ \begin{align} \beta_{IV} &\equiv \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,X)}\\ \\ &= \frac{\text{Cov}(Z, \alpha + \beta X + U)}{\text{Cov}(Z,X)} \\ \\ &= \beta + \frac{\text{Cov}(Z,U)}{\text{Cov}(Z,X)}\\ \end{align} \]
Observe that:
\[ \beta_{IV} \equiv \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,X)} = \frac{\text{Cov}(Z,Y)/\text{Var}(Z)}{\text{Cov}(Z,X)/\text{Var}(Z)} \equiv \frac{\gamma_1}{\pi_1} = \frac{\text{Reduced Form}}{\text{First Stage}} \] Reduced Form: Regress \(Y\) on \(Z\) \[ \gamma_0 \equiv \mathbb{E}(Y) - \gamma_1 \mathbb{E}(Z); \quad \gamma_1 \equiv \frac{\text{Cov}(Z,Y)}{\text{Var}(Z)}; \quad \epsilon \equiv Y - \gamma_0 - \gamma_1 Z \] \[ \implies Y = \gamma_0 + \gamma_1 Z + \epsilon; \quad \mathbb{E}(\epsilon) = \text{Cov}(Z, \epsilon) = 0 \] First Stage: Regress \(X\) on \(Z\) \[ \pi_0 \equiv \mathbb{E}(X) - \pi_1 \mathbb{E}(Z); \quad \pi_1 \equiv \frac{\text{Cov}(Z,X)}{\text{Var}(Z)}; \quad V \equiv X - \pi_0 - \pi_1 Z \] \[ \implies X = \pi_0 + \pi_1 Z + V; \quad \mathbb{E}(V) = \text{Cov}(Z, V) = 0 \]
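A quick way to see this identity in action is by simulation. Here is a minimal sketch in R; the data-generating process and all coefficient values below are made up purely for illustration (they are not the values used in the exercises).

```r
# Minimal sketch: made-up DGP to illustrate beta_OLS vs. beta_IV (all values arbitrary)
set.seed(42)
n <- 1e5
z <- rnorm(n)                  # instrument
u <- rnorm(n)                  # error in the causal model
v <- 0.6 * u + rnorm(n)        # first-stage error, correlated with u => X is endogenous
x <- 0.5 + 0.7 * z + v         # first stage
y <- -0.3 + 2 * x + u          # causal model with beta = 2

cov(x, y) / var(x)             # beta_OLS: biased, since Cov(X, U) != 0
cov(z, y) / cov(z, x)          # beta_IV: close to 2
coef(lm(y ~ z))[2] / coef(lm(x ~ z))[2]   # reduced form / first stage: same as beta_IV
```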
Note
E.g. \(Z \sim \text{N}(0,1)\) independent of \((U, V)\), a bivariate standard normal vector with correlation \(\rho\).
Exercise A
1. Set the seed to `1234` and then generate 5000 iid standard normal draws to serve as your instrument `z`. Next, use `rmvnorm()` to generate 5000 iid draws from a bivariate standard normal distribution with correlation \(\rho = 0.5\), independently of `z`. These draws will play the role of \((U, V)\).
2. Generate `x` and `y` according to the IV model with \(\pi_0 = 0.5\), \(\pi_1 = 0.8\), \(\alpha = -0.3\), and \(\beta = 1\).
3. What do you expect to obtain if you regress `y` on `x`? Run this regression to check.
4. What do you expect to obtain if you regress `y` on `z`? Run this regression to check.
5. Regress `y` on both `x` and `z`. What is your estimate of the slope coefficient on `x`? How does it compare to the OLS and IV estimates? What gives?
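One way to set up this simulation is sketched below. This is not a full solution: which column of the `rmvnorm()` output plays the role of \(U\) versus \(V\) is an arbitrary choice, and the comments only hint at what to look for.

```r
library(mvtnorm)

set.seed(1234)
n <- 5000
z <- rnorm(n)                                        # instrument
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)         # correlation rho = 0.5
errors <- rmvnorm(n, mean = c(0, 0), sigma = Sigma)  # (U, V), independent of z
u <- errors[, 1]
v <- errors[, 2]

x <- 0.5 + 0.8 * z + v   # first stage: pi0 = 0.5, pi1 = 0.8
y <- -0.3 + 1 * x + u    # causal model: alpha = -0.3, beta = 1

lm(y ~ x)       # OLS: compare the slope to beta + Cov(X, U) / Var(X)
lm(y ~ z)       # reduced form: compare the slope to beta * pi1
coef(lm(y ~ z))[2] / coef(lm(x ~ z))[2]   # IV: reduced form / first stage
lm(y ~ x + z)   # compare the coefficient on x to the OLS and IV estimates
```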
Now consider a model with an exogenous control \(W\), an endogenous regressor \(X\), and two instruments \(Z_1\) and \(Z_2\): \[ \begin{align*} Y &= \beta_0 + \beta_1 X + \beta_2 W + U\\ X &= \pi_0 + \pi_1 Z_1 + \pi_2 Z_2 + \pi_3 W + V\\ \end{align*} \]
Two-stage least squares (TSLS): regress \(X\) on \(Z_1\), \(Z_2\), and \(W\) and save the fitted values \(\widehat{X}\); then regress \(Y\) on \(\widehat{X}\) and \(W\). The coefficient on \(\widehat{X}\) is the TSLS estimate of \(\beta_1\).
Warning
This procedure gives the correct point estimates but not the correct standard errors.
Note
In a model with a single endogenous regressor and a single instrument, TSLS is equivalent to the simple IV estimator from above. It is also equivalent to a control function approach. See Three Ways of Thinking about Instrumental Variables for more.
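The sketch below illustrates this equivalence numerically, re-using the simulated `z`, `x`, and `y` from the Exercise A sketch above: the simple IV estimator, TSLS "by hand", and the control function approach all return the same slope estimate.

```r
# Re-uses z, x, y from the Exercise A sketch above

# (1) Simple IV estimator
cov(z, y) / cov(z, x)

# (2) TSLS "by hand": regress x on z, then regress y on the fitted values
first_stage <- lm(x ~ z)
xhat <- fitted(first_stage)
coef(lm(y ~ xhat))['xhat']

# (3) Control function: regress y on x and the first-stage residuals
vhat <- residuals(first_stage)
coef(lm(y ~ x + vhat))['x']
```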
ivreg()
- The `ivreg()` function from the `ivreg` package carries out TSLS estimation and calculates the correct standard errors. (Today: assume homoskedasticity.)
- `tidy()`, `augment()`, and `glance()` from `broom` work with `ivreg()`. See here.
- `ivreg()` syntax: `ivreg([FORMULA_HERE], data = [DATAFRAME_HERE])`
- Formula syntax: `[CAUSAL_MODEL_FORMULA] | [FIRST_STAGE_FORMULA]`, where the part after `|` is just the right-hand side of the first stage: don't repeat the `~`.
- E.g. `y ~ x + w | z1 + z2 + w`
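For concreteness, here is a self-contained usage sketch. The simulated data and coefficient values are made up for illustration; the point is only the formula and `data =` syntax.

```r
library(ivreg)
library(broom)

# Made-up DGP: w is an exogenous control, x is endogenous, z1 and z2 are instruments
set.seed(42)
n <- 1000
w  <- rnorm(n)
z1 <- rnorm(n)
z2 <- rnorm(n)
u  <- rnorm(n)
v  <- 0.5 * u + rnorm(n)                 # correlated errors make x endogenous
x  <- 0.4 * z1 - 0.3 * z2 + 0.2 * w + v
y  <- 1 + 2 * x - 0.5 * w + u
dat <- data.frame(y, x, w, z1, z2)

# Causal model formula | first-stage right-hand side (instruments + exogenous controls)
tsls <- ivreg(y ~ x + w | z1 + z2 + w, data = dat)
tidy(tsls)     # the broom verbs work as usual
```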
Warning
You will be tempted to drop `data =` and just write the name of your data frame as the second argument to `ivreg()`, but this does not work. That's because the second argument of `ivreg()` is something else that we don't need: an older, alternative way of specifying the instruments.
Exercise B
1. Set the seed to `1234` and then generate 10000 draws of \((Z_1, Z_2, W)\) from a trivariate standard normal distribution in which each pair of RVs has correlation \(0.3\). Then generate \((U, V)\) independently of \((Z_1, Z_2, W)\) as in Exercise A above.
2. Generate `x` and `y` according to the IV model from above with coefficients \((\pi_0, \pi_1, \pi_2, \pi_3) = (0.5, 0.2, -0.15, 0.25)\) for the first stage and \((\beta_0, \beta_1, \beta_2) = (-0.3, 1, -0.7)\) for the causal model.
3. Carry out TSLS "by hand" using two calls to `lm()`. Compare your estimated coefficients and standard errors to those from `ivreg()`.
4. Repeat the preceding part, but this time omit `w` from your first-stage regression, including it only in your second-stage regression. What happens? Why?
5. Could you make the same mistake using `ivreg()`? Explain.
6. What happens if you omit `w` from both your first-stage and causal model formulas in `ivreg()`? Are there any situations in which this would work? Explain.

Next, some real data: Ginsburgh & van Ours (2003) study the results of the Queen Elisabeth piano competition.
library(tidyverse)
data_url <- 'https://ditraglia.com/data/Ginsburgh-van-Ours-2003.csv'
qe <- read_csv(data_url)
qe
# A tibble: 132 × 13
`birth year` year ranking order age belgium ussr usa female critics
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1928 1952 1 3 24 0 0 1 0 41
2 1923 1952 2 8 29 0 0 0 0 28
3 1931 1952 3 9 21 0 0 0 1 29
4 1929 1952 4 12 23 1 0 0 0 12
5 1929 1952 5 6 23 0 0 0 0 0
6 1926 1952 6 2 26 0 0 1 0 4
7 1926 1952 7 10 26 0 0 1 0 6
8 1923 1952 8 11 29 0 0 0 0 26
9 1928 1952 9 5 24 0 0 0 0 0
10 1934 1952 10 4 18 0 0 0 0 31
# ℹ 122 more rows
# ℹ 3 more variables: bll <dbl>, gccd <dbl>, presen <dbl>
Instruments / Regressors:
- `birth year` = year of birth
- `year` = year of competition
- `ranking` = ranking by the jury (1-12)
- `order` = order of appearance (1-12)
- `age` = age at time of performance
- `belgium` = dummy for Belgian pianist
- `ussr` = dummy for USSR pianist
- `usa` = dummy for USA pianist
- `female` = dummy for female pianist

Outcomes (Success Indicators):
- `critics` = ratings by critics
- `bll` = # records in BLL catalogue
- `gccd` = # records in GCC/D catalogues
- `presen` = # of catalogues present (1-3)

In this exercise you will need to work with the columns `critics`, `order`, and `ranking` from the `qe` tibble.
1. Add a column called `first` to `qe` that takes on the value `TRUE` if `order` equals one. You will need this variable in the following parts.
2. Use `first` to instrument for `ranking`. Discuss your findings.
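A sketch of one possible approach to these parts is given below. It assumes, based on the column list above, that `critics` is the outcome of interest and that an OLS benchmark is worth reporting alongside the IV results; adapt as needed.

```r
# Assumes the tidyverse and the qe tibble from above are already loaded
library(ivreg)
library(broom)

# Instrument: TRUE if the pianist performed first
qe <- mutate(qe, first = order == 1)

# OLS benchmark: critics' ratings on the jury ranking
tidy(lm(critics ~ ranking, data = qe))

# IV: instrument ranking with the dummy for performing first
tidy(ivreg(critics ~ ranking | first, data = qe))
```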