Panel Data Basics

Francis J. DiTraglia

University of Oxford

Simplest Panel Data Model

\[ Y_{it} = \alpha + \beta X_{it} + U_{it} \]

  • Observe \((X_{it}, Y_{it})\) repeatedly for the same individuals
    • \(i = 1, \dots, N\) people in \(t = 1, \dots, T\) time periods
  • Clustered Sampling
    • Sampling Ana means we observe her for all time periods.
    • Last lecture’s notation: \(g\) becomes \(i\) and \(i\) becomes \(t\).
    • iid observations between people (\(i\) dimension)
    • Dependent observations within person (\(t\) dimension)

One-way Error Components Model

\[ Y_{it} = \alpha + \beta X_{it} + U_{it}, \quad U_{it} = \eta_i + \epsilon_{it} \]

  • Idiosyncratic error: \(\epsilon_{it}\) varies over \(i\) and \(t\).
    • \(\epsilon_{is}\) may be correlated with \(\epsilon_{it}\)
    • For today: \(\text{Cov}(\epsilon_{it}, X_{it}) = 0\)
  • Individual effect: \(\eta_i\) varies over \(i\) but is fixed over time.
    • \(\text{Cov}(\eta_i, X_{it}) = 0\) is traditionally called “random effects”
    • \(\text{Cov}(\eta_i, X_{it}) \neq 0\) is traditionally called “fixed effects”
    • (I hate this terminology, but it’s standard in economics.)

What is the role of \(\eta_i\)?

\[ Y_{it} = \alpha + \beta X_{it} + U_{it}, \quad U_{it} = \eta_i + \epsilon_{it} \]

  1. Creates within-cluster dependence.
    • E.g. if \(\epsilon_{it}\) is iid over both \(i\) and \(t\) and uncorrelated with \(\eta_i\), then \[ \text{Cor}(U_{it}, U_{is}) = \sigma_\eta^2/(\sigma_\eta^2 + \sigma_\epsilon^2) \text{ for all } t \neq s \]
  2. Could be viewed as an unobserved confounder.
    • If \(\text{Cov}(\eta_i, X_{it}) \neq 0\) then \(X_{it}\) is endogenous.
    • \(\eta_i\) is a time-invariant omitted variable, e.g. “ability”
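
Both roles of \(\eta_i\) can be checked in a small simulation. The following is a minimal sketch in base R, not the code used to generate fake-panel-data.csv: the sample sizes, variances, and the way x depends on eta are all illustrative assumptions.

```r
# Simulate the one-way error components model (all parameter values are assumptions)
set.seed(1234)
N  <- 5000        # people
nT <- 2           # time periods
sigma_eta <- 2
sigma_eps <- 1
beta <- -1

id  <- rep(seq_len(N), each = nT)     # person index, sorted by person then time
eta <- rnorm(N, sd = sigma_eta)[id]   # constant within person
eps <- rnorm(N * nT, sd = sigma_eps)  # iid over i and t, independent of eta
u   <- eta + eps

# Role 1: within-person correlation of the composite error
u_mat      <- matrix(u, ncol = nT, byrow = TRUE)  # one row per person
cor_sim    <- cor(u_mat[, 1], u_mat[, 2])
cor_theory <- sigma_eta^2 / (sigma_eta^2 + sigma_eps^2)

# Role 2: eta as a confounder, because x is built to depend on eta
x <- 0.5 * eta + rnorm(N * nT)
y <- 1 + beta * x + u
slope_pooled <- unname(coef(lm(y ~ x))["x"])

c(cor_sim = cor_sim, cor_theory = cor_theory,
  slope_pooled = slope_pooled, beta = beta)
```

With these assumed values, pooled OLS converges to \(\beta + \text{Cov}(X_{it}, \eta_i)/\text{Var}(X_{it}) = -1 + 2/2 = 0\) rather than to \(\beta = -1\), so the pooled slope should land far from the true coefficient.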

💪 Exercise A - (20 min)

Download fake-panel-data.csv from https://ditraglia.com/data. This dataset was simulated according to the one-way error components model described above. It contains six columns: person is a unique person identifier (name), year is a year index (1-5), x and y are the regressor and outcome variable, and epsilon and eta are the error terms. (In real data you wouldn’t have the errors, but this is a simulation!)

  1. Use lm() to regress y on x with “classical” standard errors. Repeat with standard errors clustered by person using lm_robust(). Discuss your results.
  2. Plot y against x along with the regression line from part 1.
  3. Repeat 2, but use a different color for the points that correspond to each person in the dataset and plot a separate regression line for each person.
  4. What does the plot you made in part 3 suggest? Use the columns epsilon and eta to check your conjecture.
  5. Finally, use lm_robust() to regress y on x and a dummy variable for each person, clustering the standard errors by person. Discuss your results.

“Fixed-Effects” Regression

\[ Y_{it} = \alpha + \beta X_{it} + U_{it}, \quad U_{it} = \eta_i + \epsilon_{it} \]

  • Equivalent to adding a dummy variable for each individual in the panel (and removing the intercept).
  • “Soaks up” endogeneity from \(\text{Cov}(\eta_i, X_{it}) \neq 0\).
  • Equivalent to OLS after subtracting individual time averages: \[ \begin{align} Y_{it} &= \alpha + \beta X_{it} + \eta_i + \epsilon_{it} \\ Y_{it} - \bar{Y}_i &= \beta (X_{it} - \bar{X}_i) + (\epsilon_{it} - \bar{\epsilon}_i) \end{align} \]
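
The step between the two lines of the demeaning argument is worth spelling out. Writing \(\bar{Y}_i = \frac{1}{T}\sum_{t=1}^T Y_{it}\), and similarly for \(\bar{X}_i\) and \(\bar{\epsilon}_i\), average the model over \(t\) for a fixed person \(i\) and subtract the result from the original equation:

\[ \begin{align} \bar{Y}_i &= \alpha + \beta \bar{X}_i + \eta_i + \bar{\epsilon}_i \\ Y_{it} - \bar{Y}_i &= (\alpha - \alpha) + \beta (X_{it} - \bar{X}_i) + (\eta_i - \eta_i) + (\epsilon_{it} - \bar{\epsilon}_i) \end{align} \]

Both \(\alpha\) and \(\eta_i\) drop out because neither varies over \(t\), so their time averages equal themselves.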

Fixed-Effects Gotchas

  • Explicitly adding dummies for each group will make your computer explode if there’s a large number of groups.
  • De-meaning “by hand” and running OLS will not give you the right standard errors:
    • OLS doesn’t know that you lose a degree of freedom for each of the \(N\) estimated time averages
  • Can’t include time-invariant regressors:
    • Perfectly collinear with the dummies
    • Disappear when subtracting group means
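
The last two gotchas can be seen in a small simulation. This is a sketch in base R on made-up data (the panel dimensions and data-generating process are assumptions, and the dataset is small enough that explicit dummies are harmless). It compares classical standard errors from the demeaned regression with those from the dummy-variable regression, then shows what happens to a hypothetical time-invariant regressor w.

```r
# Made-up balanced panel: N people, nT periods (illustrative values)
set.seed(42)
N  <- 10
nT <- 5
id  <- rep(seq_len(N), each = nT)
eta <- rnorm(N)[id]
x   <- eta + rnorm(N * nT)
y   <- 1 - x + eta + rnorm(N * nT, sd = 0.5)

# Within transformation "by hand": subtract each person's time average
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
fit_dm   <- lm(y_dm ~ 0 + x_dm)       # residual df = N * nT - 1
fit_lsdv <- lm(y ~ x + factor(id))    # residual df = N * nT - N - 1

se_dm   <- coef(summary(fit_dm))["x_dm", "Std. Error"]
se_lsdv <- coef(summary(fit_lsdv))["x", "Std. Error"]

# Same slope, but the classical SEs differ by a degrees-of-freedom factor
correction <- sqrt((N * nT - 1) / (N * nT - N - 1))
c(slope_dm = unname(coef(fit_dm)), slope_lsdv = unname(coef(fit_lsdv)["x"]),
  se_dm = se_dm, se_dm_corrected = se_dm * correction, se_lsdv = se_lsdv)

# A time-invariant regressor vanishes after demeaning ...
w <- rnorm(N)[id]
max(abs(w - ave(w, id)))
# ... and is perfectly collinear with the person dummies, so lm() reports NA
coef(lm(y ~ x + factor(id) + w))["w"]
```

The two regressions share the same residuals, so the only difference in the classical standard errors is the divisor used to estimate the error variance.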

Don’t do it by hand!

Many helpful packages:

  • lm_robust() has a fixed_effects option.
  • plm() from the plm package with model = 'within'
    • plm has a tremendous number of features but I find it a bit hard to use
  • feols() from the fixest package
    • My preferred solution: fixest is fast, flexible, and easy

feols(formula, data)

  • Two-part formula: everything after | specifies fixed effect(s)
    • y ~ x + w | person_id + year
  • Default: doesn’t display fixed effect estimates
  • Default: clusters SEs on the first fixed-effect variable after | (you can override this with the vcov or cluster argument)
  • Millions of alternative SE options!

Note

It’s sometimes hard to exactly match fixed effect SEs computed by someone else, because there are so many different finite-sample adjustments in use.
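
To make the standard error choice explicit rather than relying on defaults, you can pass it to feols() directly. The sketch below assumes fixest is installed and uses a simulated data frame called sim (a made-up stand-in, not the course dataset). Only the vcov and cluster arguments are shown; these are a small subset of what fixest offers.

```r
library(fixest)

# Made-up panel for illustration only
set.seed(1)
N  <- 50
nT <- 5
sim <- data.frame(person = rep(seq_len(N), each = nT),
                  year   = rep(seq_len(nT), times = N))
eta   <- rnorm(N)[sim$person]
sim$x <- eta + rnorm(N * nT)
sim$y <- 1 - sim$x + eta + rnorm(N * nT)

# Same model and point estimate, three different standard errors
fe_iid     <- feols(y ~ x | person, data = sim, vcov = "iid")     # classical
fe_hetero  <- feols(y ~ x | person, data = sim, vcov = "hetero")  # heteroskedasticity-robust
fe_cluster <- feols(y ~ x | person, data = sim, cluster = ~person) # clustered by person

etable(fe_iid, fe_hetero, fe_cluster)
```

Stating the choice in the call also makes it easier to work out why your standard errors differ from someone else’s.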

feols() - Example

library(fixest)
library(broom)
library(readr) # provides read_csv()

data_url <- 'https://ditraglia.com/data/fake-panel-data.csv'
fake_panel <- read_csv(data_url)

feols(y ~ x | person, data = fake_panel) |> 
  tidy()
# A tibble: 1 × 5
  term  estimate std.error statistic       p.value
  <chr>    <dbl>     <dbl>     <dbl>         <dbl>
1 x       -0.999    0.0501     -19.9 0.00000000932
feols(y ~ x | person + year, data = fake_panel) |> 
  tidy()
# A tibble: 1 × 5
  term  estimate std.error statistic      p.value
  <chr>    <dbl>     <dbl>     <dbl>        <dbl>
1 x        -1.00    0.0567     -17.6 0.0000000275

💪 Exercise B - (10 min)

  1. Use dplyr to subtract the individual time averages from x and y in the simulated dataset from above. Then run OLS on the demeaned dataset with classical SEs.
  2. Compare the point estimates and standard errors from 1 to those from an OLS regression of y on x and a full set of person dummies, again with classical SEs.
  3. Consult ?lm_robust() to find out how to use the fixed_effects option. Use what you learn to regress y on x with person fixed effects, clustering by person.
  4. Compare your results from 3 to mine computed using feols() above and to your calculations with lm_robust() and clustered standard errors from Exercise A above.

“Random Effects” Estimator

  • If \(\text{Cov}(\eta_i, X_{it}) = 0\) then there is no endogeneity in \(X_{it}\) for fixed effects to “soak up”
  • They still soak up exogenous variation \(\implies\) estimator with needlessly higher variance.
  • Could use OLS, but \(\eta_i\) creates within-cluster correlation, so need cluster-robust SEs.
  • Random Effects: GLS approach that corrects the standard errors and yields an efficient estimator at the same time.
    • Relies on assumptions: usual approach assumes \(\text{Cov}(\epsilon_{is}, \epsilon_{it}) = 0\) for \(s \neq t\)
    • In this case: OLS regression of \((Y_{it} - \theta \bar{Y}_i)\) on \((X_{it} - \theta \bar{X}_i)\) where \[ \theta = 1 - \sqrt{\frac{\sigma_\epsilon^2}{\sigma_\epsilon^2 + T \sigma_\eta^2}} \in [0, 1] \]
    • Notice: if \(\theta = 0\) this is (pooled) OLS; if \(\theta = 1\) it’s fixed effects!
    • Can estimate \(\theta\) using OLS and FE residuals.
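
The quasi-demeaning regression can be sketched in base R. The example below uses simulated data in which \(X_{it}\) is uncorrelated with \(\eta_i\), and it plugs the true variances into \(\theta\). That is an assumption possible only because this is a simulation; in practice \(\theta\) has to be estimated. The helper quasi_demeaned_slope() is a hypothetical name introduced for illustration.

```r
# Made-up panel satisfying the random effects assumption (illustrative values)
set.seed(2024)
N  <- 200
nT <- 5
sigma_eta <- 1
sigma_eps <- 1
id  <- rep(seq_len(N), each = nT)
eta <- rnorm(N, sd = sigma_eta)[id]
x   <- rnorm(N * nT)                              # uncorrelated with eta
y   <- 1 - x + eta + rnorm(N * nT, sd = sigma_eps)

# OLS of (y - theta * ybar_i) on (x - theta * xbar_i), returning the slope
quasi_demeaned_slope <- function(theta) {
  y_star <- y - theta * ave(y, id)
  x_star <- x - theta * ave(x, id)
  unname(coef(lm(y_star ~ x_star))["x_star"])
}

# theta from the formula above, using the true (known) variances
theta <- 1 - sqrt(sigma_eps^2 / (sigma_eps^2 + nT * sigma_eta^2))

slope_pooled <- unname(coef(lm(y ~ x))["x"])               # pooled OLS
slope_within <- unname(coef(lm(y ~ x + factor(id)))["x"])  # person dummies

c(theta = theta, RE = quasi_demeaned_slope(theta),
  pooled = slope_pooled, within = slope_within)

# The two limiting cases
all.equal(quasi_demeaned_slope(0), slope_pooled)
all.equal(quasi_demeaned_slope(1), slope_within)
```

Setting theta to 0 leaves the data untouched, and setting it to 1 subtracts the full person average, which is why the two limiting cases reproduce pooled OLS and fixed effects.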

Random Effects with plm()

library(plm)
library(dplyr)        # provides mutate()
library(modelsummary) # provides modelsummary()

# plm() needs to know which columns index person and time
fake_panel <- fake_panel |> 
  mutate(year = factor(year))

random_effects <- plm(y ~ x, data = fake_panel, 
                      index = c('person', 'year'),
                      model = 'random')

fixed_effects <- feols(y ~ x | person, fake_panel)
ols <- lm(y ~ x, fake_panel)

# Compare the three estimators side by side
modelsummary(list(OLS = ols, 
                  RE = random_effects, 
                  FE = fixed_effects),
             gof_omit = 'AIC|BIC|F|RMSE|R2|Log.Lik.')

Random Effects with plm()

              OLS       RE        FE
(Intercept)   1.134     1.022
              (0.196)   (0.402)
x             1.345     −0.986    −0.999
              (0.378)   (0.058)   (0.050)
Num.Obs.      50        50        50
Std.Errors                        by: person

Why were RE and FE so similar?

s_eta_sq <- fake_panel |>  
  group_by(person) |> 
  summarize(eta = eta[1]) |> 
  pull(eta) |> 
  var()

s_epsilon_sq <- fake_panel |> 
  pull(epsilon) |> 
  var()

nT <- fake_panel |> 
  pull(year) |> 
  unique() |> 
  length()

c(nT = nT, s_epsilon_sq = s_epsilon_sq, s_eta_sq = s_eta_sq)
          nT s_epsilon_sq     s_eta_sq 
  5.00000000   0.01169809   3.65508218 
# Theta is approximately one in this example 
1 - sqrt(s_epsilon_sq / (s_epsilon_sq + nT * s_eta_sq))
[1] 0.9747079

💪 Exercise C - (\(\infty\) min)

  1. Install the wooldridge package and read the help file for wagepan.
  2. Run an OLS regression of lwage on educ, black, hisp, exper, exper squared, married, union, and year. Use classical standard errors.
  3. Repeat 2, but use plm() to estimate a random effects specification of the same model.
  4. Repeat 3, but use feols() to estimate a fixed-effects specification with clustered standard errors. Can you include the same variables as in parts 2 and 3? Explain.
  5. How do your estimates and standard errors of the effects of union membership vary across these three specifications? Discuss briefly.