Problem Set: USDOT Airfare Dataset

In this exercise you will use panel data methods to examine data from the Domestic Airline Consumer Airfare Report, courtesy of the U.S. Department of Transportation (USDOT). For convenience, I have posted the required data on my personal website: https://ditraglia.com/data/usdot.csv. This file contains information on average airfares on different routes throughout the U.S. between 1997 and 2000. A route is a pair of cities between which you can fly non-stop: e.g. Philadelphia-Chicago. The columns in usdot.csv are as follows:

Name              Description
year              Year of a given observation (1997, 1998, 1999, or 2000)
route_id          Unique numeric identifier for each route (pair of cities)
distance          Distance between the pair of cities (miles)
passengers_daily  Average number of passengers per day who flew this route
airfare           Average one-way airfare on this route (nominal U.S. $)
market_share      Market share of largest carrier on this route (decimal)

Exercises

Exercise 1: Simple OLS model

Consider an OLS regression of log(airfare) on market_share, log(distance), log(distance)^2, and a full set of year dummies.

  (a) Why might it make sense to include year dummies in this regression? Do the estimated coefficients for the time dummies make sense in light of this explanation?
  (b) Interpret the estimated coefficient on market_share in this regression, along with the associated 95% confidence interval based on “plain-vanilla,” i.e. non-robust, standard errors.
  (c) Repeat (b) accounting for potential heteroskedasticity and clustering. What is the appropriate level at which to cluster in this example? How do your results change?

Exercise 2: Airfare elasticities

  (a) Recall that elasticities measure the sensitivity of one variable to another. The elasticity of \(Y\) wrt \(X\) is given by \(\frac{d Y/Y}{dX/X}\). Compute the derivative of the logarithm, \(\frac{d\log(Y)}{dY}\), rearrange, and express the formula for elasticity in terms of logarithms. Hint: Your formula should have only logarithms in it!
  (b) What is the elasticity of airfare with respect to distance? Use your formula from (a) to derive an expression for the elasticity in your log-log model. Hint: Your formula should give the elasticity as a function of the model coefficients and distance.
  (c) Use this expression to add the variable elasticity to your data set. elasticity should contain the airfare elasticity for each route. Visualise this variable in a figure.
  (d) Comment on your findings. Do the estimated elasticities make economic sense?

Exercise 3: Route fixed effects

Suppose that we decided to add route fixed effects to the regression from above.

  (a) What inference issue do fixed effects solve? Why might adding route fixed effects make sense here?
  (b) Typeset the estimation equation. If we add route fixed effects, will we be able to estimate the elasticity of airfare with respect to distance? Why or why not? Hint: This depends on whether we can include time-invariant regressors in our model. Can we do that in fixed effects?
  (c) Carry out the fixed effects regression, clustering standard errors appropriately. How do your results compare to those from above? Suggest a possible explanation for any differences that you find.

Exercise 4: Correlated random effects

Suppose that we instead decided to estimate a random effects specification, but with a slight twist: in addition to the regressors from Exercise 1 above, we add the time average of market_share. This is called a correlated random effects model and is usually attributed to Mundlak. (See Baltagi (2006) for details.) To be clear, in this specification both market_share and its time average within route will appear in the regression. This specification assumes that individual effects are a linear function of the average of market_share across time.

  (a) Think back to Core Metrics: what is the benefit of random effects over fixed effects? What assumption do we need for random effects? How does the Mundlak specification relax this assumption?
  (b) Typeset the estimation equation. Can we estimate the elasticity of airfares with respect to distance in this specification? Why or why not?
  (c) Estimate the random effects model described above. For simplicity, use the “default” random effects standard errors.
  (d) Discuss your findings. How do the point estimates from this specification compare to those from above? You may find it helpful to consult Baltagi (2006).

Solutions

Begin by constructing a full table of results for the required regressions.

# Setup
library(tidyverse)
library(broom)
library(estimatr)
library(fixest)        # to use feols() for fixed effects
library(plm)           # to use plm() for random effects
library(modelsummary)

usdot <- read_csv('https://ditraglia.com/data/usdot.csv')

# Convert year to factor 
usdot <- usdot |> 
  mutate(year = as.factor(year))

# Part 1 - Plain-vanilla standard errors
ols_formula <- log(airfare) ~ year + market_share + log(distance) + 
  I(log(distance)^2)
ols <- lm(ols_formula, usdot)

# Part 1 - Robust standard errors
ols_robust <- lm_robust(ols_formula, usdot, clusters = route_id)

# Part 3 - Fixed effects

# Drop time-invariant regressors ourselves, or feols will do it for us
fixest_formula <- log(airfare) ~ market_share + year | route_id
fe <- feols(fixest_formula, usdot, cluster = ~ route_id)

# Part 4 - Mundlak correlated random effects approach

# Between() from plm package computes time averages. 
# Alternatively use dplyr to compute the mean by group.
mundlak_formula <- log(airfare) ~ year + market_share + 
  Between(market_share) + log(distance) + I(log(distance)^2)

mundlak <- plm(mundlak_formula, data = usdot, 
               index = c('route_id', 'year'), # individual & time index
               model = 'random') # random effects 

# Full table of results
modelsummary(list(`OLS` = ols, `OLS Robust` = ols_robust, 
                  `FE` = fe, `Mundlak` = mundlak), 
             fmt = 2, 
             gof_omit = 'R2 Adj.|AIC|BIC|Log.Lik.|F|RMSE|R2 Within Adj.')
                        OLS       OLS Robust     FE             Mundlak
(Intercept)             6.21      6.21                          6.21
                        (0.42)    (0.92)                        (0.81)
year1998                0.02      0.02           0.02           0.02
                        (0.01)    (0.00)         (0.00)         (0.00)
year1999                0.04      0.04           0.04           0.04
                        (0.01)    (0.01)         (0.01)         (0.00)
year2000                0.10      0.10           0.10           0.10
                        (0.01)    (0.01)         (0.01)         (0.00)
market_share            0.36      0.36           0.17           0.17
                        (0.03)    (0.06)         (0.05)         (0.03)
log(distance)           -0.90     -0.90                         -0.91
                        (0.13)    (0.27)                        (0.25)
I(log(distance)^2)      0.10      0.10                          0.10
                        (0.01)    (0.02)                        (0.02)
Between(market_share)                                           0.21
                                                                (0.07)
Num.Obs.                4596      4596           4596           4596
R2                      0.406     0.406          0.955          0.230
R2 Within                                        0.135
Std.Errors                        by: route_id   by: route_id
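
The setup code mentions dplyr as an alternative to Between() for computing time averages. A minimal sketch of that alternative follows, using a small made-up data frame rather than usdot.csv; the column name market_share_avg is a hypothetical choice, not something from the original solutions.

```r
library(dplyr)

# Toy panel (made-up numbers): two routes observed in two years each
toy <- tibble(
  route_id     = c(1, 1, 2, 2),
  year         = c(1997, 1998, 1997, 1998),
  market_share = c(0.4, 0.6, 0.7, 0.9)
)

# Time average of market_share within each route, repeated on every row.
# market_share_avg is a hypothetical name for the Mundlak regressor.
toy_avg <- toy |>
  group_by(route_id) |>
  mutate(market_share_avg = mean(market_share)) |>
  ungroup()

toy_avg
```

The resulting column could then be used in place of Between(market_share) in a regression formula, under the assumption that the data are grouped by the same route identifier passed to plm().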

Exercise 1: Simple OLS model

1(a)

There are many possible reasons, among them changes in the overall price level between 1997 and 2000. Given that airfares are measured in nominal USD, we would expect them to grow on average across all routes over the sample period.

Because the outcome variable is measured in logs and 1997 is the omitted category, the coefficients on 1998, 1999, and 2000 represent the approximate average percentage difference in fares between 1997 and the year in question, all else equal. And indeed, we find that airfare is about 2% higher in 1998 than in 1997, 4% higher in 1999, and 10% higher in 2000. It makes sense that these coefficients should be positive and grow over time, because they represent a cumulative percentage change in price. The magnitudes also seem about right if we interpret this as broadly capturing inflation. Of course there may be other considerations, e.g. fuel costs.
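
Reading a log-point coefficient as a percentage change is an approximation. The sketch below converts the rounded year coefficients from the results table into exact percentage changes via 100 * (exp(b) - 1). The inputs are the two-decimal table values, so the results are only as precise as that rounding.

```r
# Rounded year-dummy coefficients copied from the results table
log_coefs <- c(`1998` = 0.02, `1999` = 0.04, `2000` = 0.10)

# Exact percentage change implied by a coefficient b in a log-outcome model
exact_pct <- 100 * (exp(log_coefs) - 1)

round(exact_pct, 2)
```

For coefficients this small the approximation error is minor, which is why the solution reads them directly as percentages.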

1(b)

The coefficient is around 0.36 with a standard error of 0.03, yielding an approximate 95% confidence interval of [0.30, 0.42]. This means that, comparing two routes with identical covariates except a difference of 10 percentage points in the market share of the largest carrier, we would predict that airfares are 3-4% higher on the route with higher concentration.

1(c)

Clustering by route_id allows for correlation between the error terms across years for a given route, while assuming no correlation between the error terms for different routes. With this adjustment, the standard error becomes about twice as large as it was before. Now the confidence interval is approximately [0.24, 0.48], so we’d predict fares that are between 2.5% and 5% higher on the route in which the dominant carrier has 10 percentage points more market share.
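
The intervals quoted in 1(b) and 1(c) can be reproduced with a normal approximation. This is a minimal sketch that plugs in the rounded estimate and standard errors from the results table; ci95 is a hypothetical helper, not a function from any of the packages loaded above.

```r
# Normal-approximation 95% interval: estimate +/- z * standard error
# (ci95 is a hypothetical helper defined here for illustration)
ci95 <- function(estimate, std_error) {
  z <- qnorm(0.975)
  c(lower = estimate - z * std_error, upper = estimate + z * std_error)
}

ci_plain   <- ci95(0.36, 0.03)  # plain-vanilla standard error
ci_cluster <- ci95(0.36, 0.06)  # clustered by route_id

round(ci_plain, 2)
round(ci_cluster, 2)

# Implied % fare difference for 10 percentage points more market share
round(100 * 0.10 * ci_cluster, 1)
```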

Exercise 2: Airfare elasticities

2(a)

The elasticity of \(Y\) wrt \(X\) is \[ \frac{d Y/Y}{dX/X} \] The derivative of the logarithm is \[ \frac{d\log(Y)}{dY} = \frac{1}{Y} \iff dY/Y = d \log(Y) \] Applying the same step to \(X\) and substituting gives \[ \frac{d Y/Y}{dX/X} = \frac{d\log(Y)}{d\log(X)} \] So the elasticity of airfare wrt distance is the derivative of \(\log(\text{airfare})\) wrt \(\log(\text{distance})\). We can compute this conveniently from the log-log estimation equation.

2(b)

Suppressing the other regressors, our regression model takes the form \[ \log (Y) = \alpha + \beta \log(X) + \gamma \log(X)^2 \] where \(Y\) is airfare in USD and \(X\) is distance in miles. Hence, the elasticity is \[ \frac{d\log(Y)}{d\log(X)} = \beta + 2\gamma\log(X). \] Substituting the estimates from above, this becomes \(-0.9 + 0.2 \times \log(\texttt{distance})\).
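
The elasticity formula can be wrapped in a small function so that the coefficients are arguments rather than hard-coded numbers. This is a sketch under the assumption that beta and gamma are the coefficients on log(distance) and log(distance)^2; elasticity_wrt_distance is a hypothetical name, and the distances in the usage line are arbitrary illustrative values.

```r
# Elasticity of airfare wrt distance in the quadratic log-log model:
# beta + 2 * gamma * log(distance)
elasticity_wrt_distance <- function(distance, beta, gamma) {
  beta + 2 * gamma * log(distance)
}

# Usage with the rounded estimates from the results table
elasticity_wrt_distance(c(100, 500, 1000, 2500), beta = -0.9, gamma = 0.1)

# With a fitted lm object named ols, the unrounded coefficients could be
# passed instead, e.g. coef(ols)[["log(distance)"]] and
# coef(ols)[["I(log(distance)^2)"]].
```

Passing the unrounded coefficients would give slightly different values from the rounded ones, since gamma is multiplied by 2 * log(distance).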

2(c)

There are many ways to summarize the elasticities. One simple option is to make a histogram and a line plot:

library(patchwork)
usdot <- usdot |> 
  mutate(elasticity = -0.9 + 0.2 * log(distance))

myhist <- usdot |> 
  ggplot(aes(x = elasticity)) +
  geom_histogram() +
  labs(title="Histogram of elasticities \nfor different routes")

myline <- usdot |> 
  ggplot(aes(x = distance, y = elasticity)) +
  geom_line() +
  labs(title = "Elasticity of airfare \nwrt distance")

myhist + myline + plot_annotation(caption = "Source: USDOT.")

2(d)

The airfare elasticities for the different routes vary from 0.04 to 0.73. Since the elasticity is below 1 in absolute value, airfare is inelastic with respect to distance for all routes in the data set: a 1% increase in distance raises airfare by less than 1%. This makes economic sense, given that the cost of air travel rises less than proportionally with distance: the acceleration during take-off uses up much more fuel than normal flight, and jet engines have better fuel efficiency at higher altitudes.

We would expect airfares to be increasing in distance but, somewhat alarmingly, the estimated elasticity could be negative depending on the value of distance. Re-arranging the expression from above (and using \(\gamma > 0\)), the elasticity is positive provided that \[ \texttt{distance} > \exp\left(-\frac{\beta}{2\gamma}\right) = \exp(0.9/0.2) = \exp(4.5) \approx 90. \] Fortunately, the minimum value of distance in our dataset is 95 miles, so the elasticities are always positive.
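
The threshold calculation can be checked numerically. The sketch below uses the rounded coefficients from the results table; positive_above is a hypothetical helper name, and the 95-mile minimum is taken from the text above rather than recomputed from the data.

```r
# Distance above which beta + 2 * gamma * log(distance) is positive
# (valid when gamma > 0); positive_above is a hypothetical helper
positive_above <- function(beta, gamma) {
  exp(-beta / (2 * gamma))
}

threshold <- positive_above(beta = -0.9, gamma = 0.1)
round(threshold, 1)

# Elasticity at the minimum distance reported in the text (95 miles)
elasticity_at_min <- -0.9 + 2 * 0.1 * log(95)
elasticity_at_min > 0
```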

Exercise 3: Route fixed effects

3(a)

Route-specific characteristics that do not change over time might confound the relationship between market_share and airfare. To control for them, we use route fixed effects. If there is an unobserved characteristic that is correlated with airfare and market_share and is constant over time (or approximately so), the fixed effects approach will likely provide a more reliable estimate of the causal effect of market share on airfares.

The revised model includes a route-specific intercept.
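
One way to typeset this model is sketched below. The notation is an assumption made here for illustration: \(\text{share}_{it}\) stands for market_share, \(\lambda\) is its coefficient, \(\delta_t\) are the year effects, \(\eta_i\) is the route-specific intercept, and \(\nu_{it}\) is the remaining error, for route \(i\) in year \(t\).

\[ \log(\text{airfare}_{it}) = \eta_i + \delta_t + \lambda \, \text{share}_{it} + \nu_{it} \]

Subtracting route-level time averages from both sides removes \(\eta_i\):

\[ \log(\text{airfare}_{it}) - \overline{\log(\text{airfare})}_i = (\delta_t - \bar{\delta}) + \lambda \left( \text{share}_{it} - \overline{\text{share}}_i \right) + (\nu_{it} - \bar{\nu}_i) \]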

3(b)

We can’t compute the elasticity since time-invariant regressors can’t be included in a fixed-effects regression. Think about the de-meaning approach to calculation: since these variables are constant over time, de-meaning them turns them all into zeros! If we try to include such variables in our specification, they will be automatically dropped by feols.
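
The de-meaning argument can be seen on a toy example. This is a sketch with made-up numbers, not the usdot data; demean is a hypothetical helper built on base R's ave().

```r
# Toy panel (made-up numbers): two routes, three years each
toy <- data.frame(
  route_id     = c(1, 1, 1, 2, 2, 2),
  distance     = c(500, 500, 500, 1200, 1200, 1200),
  market_share = c(0.40, 0.45, 0.50, 0.70, 0.65, 0.60)
)

# Subtract the within-route mean (ave() returns group means by default)
demean <- function(x, group) x - ave(x, group)

toy$log_distance_within <- demean(log(toy$distance), toy$route_id)
toy$market_share_within <- demean(toy$market_share, toy$route_id)

# The time-invariant regressor is wiped out (up to floating-point error),
# while market_share keeps its within-route variation
toy
```

Since the de-meaned log(distance) column carries no variation, no coefficient on it can be estimated.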

3(c)

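The fixed effects estimate is about half as large as the OLS estimate: a point estimate of 0.17 with a confidence interval of approximately [0.07, 0.27]. There are many possible explanations, and any logically coherent one is an acceptable answer. For example, some routes might have time-invariant features that raise both concentration and fares, in which case pooled OLS would attribute part of their effect to market_share.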

Exercise 4: Correlated random effects

4(a)

Fixed effects control for individual-specific effects that could confound the relationship of interest.

Random effects give the efficient GLS estimator, but this assumes that none of the explanatory variables are correlated with potential omitted individual-specific effects. This is highly restrictive, as it requires the model to fully capture individual effects. The Mundlak model alleviates this concern because it incorporates individual effects and models them as a linear function of the average of market_share across time.

4(b)

Yes: because we can include time-invariant regressors in a random effects specification, we can estimate the desired elasticity.
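
A possible way to typeset the Mundlak specification is sketched below. The notation is an assumption made here for illustration: \(\text{share}_{it}\) stands for market_share with coefficient \(\lambda\), \(\delta_t\) are year effects, \(\beta\) and \(\gamma\) are the distance coefficients as in 2(b), and the individual effect is modelled as \(\eta_i = \pi \, \overline{\text{share}}_i + \omega_i\), with \(\omega_i\) assumed uncorrelated with the regressors.

\[ \log(\text{airfare}_{it}) = \alpha + \delta_t + \lambda \, \text{share}_{it} + \pi \, \overline{\text{share}}_i + \beta \log(\text{distance}_i) + \gamma \log(\text{distance}_i)^2 + \omega_i + \nu_{it} \]

Because \(\log(\text{distance}_i)\) enters directly, the elasticity has the same form as before:

\[ \frac{d \log(\text{airfare}_{it})}{d \log(\text{distance}_i)} = \beta + 2 \gamma \log(\text{distance}_i) \]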

4(c)-(d)

The coefficient on market_share is exactly identical to that of the fixed effects regression:

c(FE = fe |> 
    tidy() |>  
    filter(term == 'market_share') |>  
    pull(estimate),
  Mundlak = mundlak |>  
    tidy() |>  
    filter(term == 'market_share') |>  
    pull(estimate)
  )
      FE  Mundlak 
0.168859 0.168859 

It turns out that this is always the case when we use the Mundlak approach: the within-groups estimator (fixed effects) coincides with the GLS estimator.

Recall theta-differencing: \(\hat{\beta}_{GLS}\) is obtained as the OLS estimator in the regression of \((Y_{it} - \theta \bar{Y}_i)\) on \((X_{it} - \theta \bar{X}_i)\), where \[ \theta = 1 - \sqrt{\frac{\sigma^2_{\nu}}{\sigma^2_{\nu} + T \sigma^2_{\eta}}} \in [0, 1] \] Here \(\nu\) denotes the time-varying error term component and \(\eta\) denotes the time-invariant error term component (i.e. the individual-specific effects). If \(\theta\) is very close to 1, then \(\hat{\beta}_{GLS}\) approaches the within-groups estimator \(\hat{\beta}_{WG}\), which is equivalent to the fixed effects estimator. From the formula, this happens in big \(T\) panels or when \(\sigma^2_{\eta}\) is large relative to \(\sigma^2_{\nu}\). At the other extreme, \(\sigma^2_{\eta} = 0\) gives \(\theta = 0\) and GLS reduces to pooled OLS. In the Mundlak specification, however, the equivalence does not depend on the value of \(\theta\). After theta-differencing, the regressors \(X_{it} - \theta \bar{X}_i\) and \((1 - \theta)\bar{X}_i\) span the same space as \(X_{it} - \bar{X}_i\) and \(\bar{X}_i\) whenever \(\theta < 1\). Because within-route deviations are orthogonal to anything that is constant within a route, the coefficient on market_share is determined by within-route variation alone, so in a balanced panel it equals the fixed effects estimate.
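
A small simulation can check the claim that adding the group mean of a regressor reproduces the within-groups coefficient. This is a sketch on simulated data using base R only. The data-generating process, the sample sizes, and the value theta = 0.6 are arbitrary assumptions made for illustration, and theta is fixed by hand rather than estimated as plm would do.

```r
set.seed(1)

# Simulated balanced panel: 50 units observed for 4 periods each
n_units <- 50
n_periods <- 4
id  <- rep(seq_len(n_units), each = n_periods)
eta <- rep(rnorm(n_units), each = n_periods)    # individual effect
x   <- 0.5 * eta + rnorm(n_units * n_periods)   # regressor correlated with eta
y   <- 1 + 0.2 * x + eta + rnorm(n_units * n_periods)

x_bar <- ave(x, id)  # unit-level time average of x
y_bar <- ave(y, id)  # unit-level time average of y

# Within-groups (fixed effects) slope: regress de-meaned y on de-meaned x
b_within <- coef(lm(I(y - y_bar) ~ 0 + I(x - x_bar)))[[1]]

# Mundlak regression by pooled OLS (the theta = 0 case)
b_mundlak_ols <- coef(lm(y ~ x + x_bar))[["x"]]

# Mundlak regression after theta-differencing with an arbitrary theta
theta   <- 0.6
y_td    <- y - theta * y_bar
x_td    <- x - theta * x_bar
xbar_td <- (1 - theta) * x_bar
b_mundlak_gls <- coef(lm(y_td ~ x_td + xbar_td))[["x_td"]]

c(within = b_within, mundlak_ols = b_mundlak_ols, mundlak_gls = b_mundlak_gls)
```

The three slopes agree up to floating-point error, consistent with the identical FE and Mundlak estimates of the market_share coefficient shown above.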

The coefficients on log(distance) and log(distance)^2 are extremely similar, although not quite identical, to those from the OLS regressions without fixed effects, so the elasticities are also very similar.