The “Classic” Linear Homogeneous Coefficients Model
University of Oxford
The more traditional way of writing this model: \[ Y = \alpha + \beta X + U, \quad \text{Cov}(Z, X) \neq 0, \quad \text{Cov}(Z,U) = 0 \]
\[ \begin{align} \beta_{OLS} &\equiv \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\\ \\ &= \frac{\text{Cov}(X, \alpha + \beta X + U)}{\text{Var}(X)} \\ \\ &= \beta + \frac{\text{Cov}(X,U)}{\text{Var}(X)}\\ \end{align} \]
\[ \begin{align} \beta_{IV} &\equiv \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,X)}\\ \\ &= \frac{\text{Cov}(Z, \alpha + \beta X + U)}{\text{Cov}(Z,X)} \\ \\ &= \beta + \frac{\text{Cov}(Z,U)}{\text{Cov}(Z,X)}\\ \end{align} \]
Observe that:
\[ \beta_{IV} \equiv \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,X)} = \frac{\text{Cov}(Z,Y)/\text{Var}(Z)}{\text{Cov}(Z,X)/\text{Var}(Z)} \equiv \frac{\gamma_1}{\pi_1} = \frac{\text{Reduced Form}}{\text{First Stage}} \] Reduced Form: Regress \(Y\) on \(Z\) \[ \gamma_0 \equiv \mathbb{E}(Y) - \gamma_1 \mathbb{E}(Z); \quad \gamma_1 \equiv \frac{\text{Cov}(Z,Y)}{\text{Var}(Z)}; \quad \epsilon \equiv Y - \gamma_0 - \gamma_1 Z \] \[ \implies Y = \gamma_0 + \gamma_1 Z + \epsilon; \quad \mathbb{E}(\epsilon) = \text{Cov}(Z, \epsilon) = 0 \] First Stage: Regress \(X\) on \(Z\) \[ \pi_0 \equiv \mathbb{E}(X) - \pi_1 \mathbb{E}(Z); \quad \pi_1 \equiv \frac{\text{Cov}(Z,X)}{\text{Var}(Z)}; \quad V \equiv X - \pi_0 - \pi_1 Z \] \[ \implies X = \pi_0 + \pi_1 Z + V; \quad \mathbb{E}(V) = \text{Cov}(Z, V) = 0 \]
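A quick way to see this identity in action is by simulation. Here is a minimal sketch in R; the data-generating process and all coefficient values below are made up purely for illustration (they are not the values used in the exercises).

```r
# Minimal sketch: made-up DGP to illustrate beta_OLS vs. beta_IV (all values arbitrary)
set.seed(42)
n <- 1e5
z <- rnorm(n)                  # instrument
u <- rnorm(n)                  # error in the causal model
v <- 0.6 * u + rnorm(n)        # first-stage error, correlated with u => X is endogenous
x <- 0.5 + 0.7 * z + v         # first stage
y <- -0.3 + 2 * x + u          # causal model with beta = 2

cov(x, y) / var(x)             # beta_OLS: biased, since Cov(X, U) != 0
cov(z, y) / cov(z, x)          # beta_IV: close to 2
coef(lm(y ~ z))[2] / coef(lm(x ~ z))[2]   # reduced form / first stage: same as beta_IV
```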
Note
E.g. \(Z \sim \text{N}(0,1)\) independent of \((U, V)\), a bivariate standard normal vector with correlation \(\rho\).
Exercise A
1. Set the seed to `1234` and then generate 5000 iid standard normal draws to serve as your instrument `z`. Next, use `rmvnorm()` to generate 5000 iid draws from a bivariate standard normal distribution with correlation \(\rho = 0.5\), independently of `z`. These draws will play the role of \((U, V)\).
2. Generate `x` and `y` according to the IV model with \(\pi_0 = 0.5\), \(\pi_1 = 0.8\), \(\alpha = -0.3\), and \(\beta = 1\).
3. What do you expect to obtain if you regress `y` on `x`? Run this regression to check.
4. What do you expect to obtain if you regress `y` on `z`? Run this regression to check.
5. Regress `y` on both `x` and `z`. What is your estimate of the slope coefficient on `x`? How does it compare to the OLS and IV estimates? What gives?
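One way to set up this simulation is sketched below. This is not a full solution: which column of the `rmvnorm()` output plays the role of \(U\) versus \(V\) is an arbitrary choice, and the comments only hint at what to look for.

```r
library(mvtnorm)

set.seed(1234)
n <- 5000
z <- rnorm(n)                                        # instrument
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)         # correlation rho = 0.5
errors <- rmvnorm(n, mean = c(0, 0), sigma = Sigma)  # (U, V), independent of z
u <- errors[, 1]
v <- errors[, 2]

x <- 0.5 + 0.8 * z + v   # first stage: pi0 = 0.5, pi1 = 0.8
y <- -0.3 + 1 * x + u    # causal model: alpha = -0.3, beta = 1

lm(y ~ x)       # OLS: compare the slope to beta + Cov(X, U) / Var(X)
lm(y ~ z)       # reduced form: compare the slope to beta * pi1
coef(lm(y ~ z))[2] / coef(lm(x ~ z))[2]   # IV: reduced form / first stage
lm(y ~ x + z)   # compare the coefficient on x to the OLS and IV estimates
```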
Now consider a model with an exogenous control \(W\), an endogenous regressor \(X\), and two instruments \(Z_1\) and \(Z_2\): \[ \begin{align*} Y &= \beta_0 + \beta_1 X + \beta_2 W + U\\ X &= \pi_0 + \pi_1 Z_1 + \pi_2 Z_2 + \pi_3 W + V\\ \end{align*} \]
Two-stage least squares (TSLS): regress \(X\) on \(Z_1\), \(Z_2\), and \(W\) and save the fitted values \(\widehat{X}\); then regress \(Y\) on \(\widehat{X}\) and \(W\). The coefficient on \(\widehat{X}\) is the TSLS estimate of \(\beta_1\).
Warning
This procedure gives the correct point estimates but not the correct standard errors.
Note
In a model with a single endogenous regressor and a single instrument, TSLS is equivalent to the simple IV estimator from above. It is also equivalent to a control function approach. See Three Ways of Thinking about Instrumental Variables for more.
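The sketch below illustrates this equivalence numerically, re-using the simulated `z`, `x`, and `y` from the Exercise A sketch above: the simple IV estimator, TSLS "by hand", and the control function approach all return the same slope estimate.

```r
# Re-uses z, x, y from the Exercise A sketch above

# (1) Simple IV estimator
cov(z, y) / cov(z, x)

# (2) TSLS "by hand": regress x on z, then regress y on the fitted values
first_stage <- lm(x ~ z)
xhat <- fitted(first_stage)
coef(lm(y ~ xhat))['xhat']

# (3) Control function: regress y on x and the first-stage residuals
vhat <- residuals(first_stage)
coef(lm(y ~ x + vhat))['x']
```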
ivreg()
- The `ivreg()` function from the `ivreg` package carries out TSLS estimation and calculates the correct standard errors. (Today: assume homoskedasticity.)
- `tidy()`, `augment()`, and `glance()` from `broom` work with `ivreg()`. See here.
- `ivreg()` syntax: `ivreg([FORMULA_HERE], data = [DATAFRAME_HERE])`
- Formula syntax: `[CAUSAL_MODEL_FORMULA] | [FIRST_STAGE_FORMULA]`, where the part after `|` is just the right-hand side of the first stage: don't repeat the `~`.
- E.g. `y ~ x + w | z1 + z2 + w`
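For concreteness, here is a self-contained usage sketch. The simulated data and coefficient values are made up for illustration; the point is only the formula and `data =` syntax.

```r
library(ivreg)
library(broom)

# Made-up DGP: w is an exogenous control, x is endogenous, z1 and z2 are instruments
set.seed(42)
n <- 1000
w  <- rnorm(n)
z1 <- rnorm(n)
z2 <- rnorm(n)
u  <- rnorm(n)
v  <- 0.5 * u + rnorm(n)                 # correlated errors make x endogenous
x  <- 0.4 * z1 - 0.3 * z2 + 0.2 * w + v
y  <- 1 + 2 * x - 0.5 * w + u
dat <- data.frame(y, x, w, z1, z2)

# Causal model formula | first-stage right-hand side (instruments + exogenous controls)
tsls <- ivreg(y ~ x + w | z1 + z2 + w, data = dat)
tidy(tsls)     # the broom verbs work as usual
```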
Warning
You will be tempted to drop `data =` and just write the name of your data frame as the second argument to `ivreg()`, but this does not work. That's because the second argument of `ivreg()` is something else that we don't need: an older, alternative way of specifying the instruments.
Exercise B
1. Set the seed to `1234` and then generate 10000 draws of \((Z_1, Z_2, W)\) from a trivariate standard normal distribution in which each pair of RVs has correlation \(0.3\). Then generate \((U, V)\) independently of \((Z_1, Z_2, W)\) as in Exercise A above.
2. Generate `x` and `y` according to the IV model from above with coefficients \((\pi_0, \pi_1, \pi_2, \pi_3) = (0.5, 0.2, -0.15, 0.25)\) for the first stage and \((\beta_0, \beta_1, \beta_2) = (-0.3, 1, -0.7)\) for the causal model.
3. Carry out TSLS "by hand" using two calls to `lm()`. Compare your estimated coefficients and standard errors to those from `ivreg()`.
4. Repeat the preceding part, but this time omit `w` from your first-stage regression, including it only in your second-stage regression. What happens? Why?
5. Could you make the same mistake using `ivreg()`? Explain.
6. What happens if you omit `w` from both your first-stage and causal model formulas in `ivreg()`? Are there any situations in which this would work? Explain.

Next, some real data: Ginsburgh & van Ours (2003) study the results of the Queen Elisabeth piano competition.
library(tidyverse)
data_url <- 'https://ditraglia.com/data/Ginsburgh-van-Ours-2003.csv'
qe <- read_csv(data_url)
qe
# A tibble: 132 × 13
`birth year` year ranking order age belgium ussr usa female critics
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1928 1952 1 3 24 0 0 1 0 41
2 1923 1952 2 8 29 0 0 0 0 28
3 1931 1952 3 9 21 0 0 0 1 29
4 1929 1952 4 12 23 1 0 0 0 12
5 1929 1952 5 6 23 0 0 0 0 0
6 1926 1952 6 2 26 0 0 1 0 4
7 1926 1952 7 10 26 0 0 1 0 6
8 1923 1952 8 11 29 0 0 0 0 26
9 1928 1952 9 5 24 0 0 0 0 0
10 1934 1952 10 4 18 0 0 0 0 31
# ℹ 122 more rows
# ℹ 3 more variables: bll <dbl>, gccd <dbl>, presen <dbl>
Instruments / Regressors:
- `birth year` = year of birth
- `year` = year of competition
- `ranking` = ranking by the jury (1-12)
- `order` = order of appearance (1-12)
- `age` = age at time of performance
- `belgium` = dummy for Belgian pianist
- `ussr` = dummy for USSR pianist
- `usa` = dummy for USA pianist
- `female` = dummy for female pianist

Outcomes (Success Indicators):
- `critics` = ratings by critics
- `bll` = # records in BLL catalogue
- `gccd` = # records in GCC/D catalogues
- `presen` = # of catalogues present (1-3)

In this exercise you will need to work with the columns `critics`, `order`, and `ranking` from the `qe` tibble.
1. Add a column called `first` to `qe` that takes on the value `TRUE` if `order` equals one. You will need this variable in the following parts.
2. Use `first` to instrument for `ranking`. Discuss your findings.
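A sketch of one possible approach to these parts is given below. It assumes, based on the column list above, that `critics` is the outcome of interest and that an OLS benchmark is worth reporting alongside the IV results; adapt as needed.

```r
# Assumes the tidyverse and the qe tibble from above are already loaded
library(ivreg)
library(broom)

# Instrument: TRUE if the pianist performed first
qe <- mutate(qe, first = order == 1)

# OLS benchmark: critics' ratings on the jury ranking
tidy(lm(critics ~ ranking, data = qe))

# IV: instrument ranking with the dummy for performing first
tidy(ivreg(critics ~ ranking | first, data = qe))
```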