Regression - Solutions

Setup

library(tidyverse)
library(janitor)
# Read the data and convert column names to snake_case
kids <- read_csv('https://ditraglia.com/data/child_test_data.csv')
kids <- clean_names(kids)
# Recode mom_hs as a factor with 'NoHS' as the baseline level
kids <- kids |> 
  mutate(mom_education = if_else(mom_hs == 1, 'HS', 'NoHS')) |> 
  mutate(mom_education = fct_relevel(mom_education, 'NoHS'))

Exercise A - (10 min)

  1. Interpret the regression coefficients from the following regression. Is there any way to change the model so the intercept has a more natural interpretation?
lm(kid_score ~ mom_iq, kids)
  2. Use ggplot2 to make a scatter plot with mom_age on the horizontal axis and kid_score on the vertical axis.
  3. Use geom_smooth() to add the regression line kid_score ~ mom_age to your plot. What happens if you drop method = 'lm'?
  4. Plot a histogram of the residuals from the regression kid_score ~ mom_iq using ggplot with a bin width of 5. Is anything noteworthy?
  5. Calculate the residuals “by hand” by subtracting the fitted values of reg1, the regression from part 1, from the column kid_score in kids. Check that this gives the same result as resid().
  6. As long as you include an intercept in your regression, the residuals will sum to zero. Verify numerically using the residuals from the preceding part.
  7. Regression residuals are uncorrelated with any predictors included in the regression. Verify numerically using the residuals you computed in part 5.

Solution

Part 1

Comparing two kids whose moms differ by one IQ point, we would predict that the kid whose mom has the higher IQ will score 0.61 points higher, on average, on the test. The intercept lacks a meaningful interpretation: it is the predicted test score for a kid whose mom has an IQ of zero, which is impossible.

lm(kid_score ~ mom_iq, kids)

Call:
lm(formula = kid_score ~ mom_iq, data = kids)

Coefficients:
(Intercept)       mom_iq  
      25.80         0.61  

While this doesn’t affect our predictions in any way, it is arguably better to re-express the regression model so the intercept does have a meaningful interpretation. Consider the population linear regression \(Y = \alpha + \beta X + U\). The least squares estimate of \(\alpha\) is \(\widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}\) where \(\widehat{\beta}\) is the least-squares estimate of \(\beta\). In other words: the least squares regression line passes through the point \((\bar{x}, \bar{y})\). We can use this fact to give the regression intercept a meaningful interpretation as follows:

kids |> 
  mutate(mom_iq = mom_iq - mean(mom_iq)) |> 
  lm(kid_score ~ mom_iq, data = _)

Call:
lm(formula = kid_score ~ mom_iq, data = mutate(kids, mom_iq = mom_iq - 
    mean(mom_iq)))

Coefficients:
(Intercept)       mom_iq  
      86.80         0.61  

Another way to obtain the same result is as follows:

lm(kid_score ~ I(mom_iq - mean(mom_iq)), kids)

Call:
lm(formula = kid_score ~ I(mom_iq - mean(mom_iq)), data = kids)

Coefficients:
             (Intercept)  I(mom_iq - mean(mom_iq))  
                   86.80                      0.61  
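
For completeness, a third equivalent route (not part of the original exercise, but worth knowing) is base R's scale() function with scale = FALSE, which subtracts the mean without dividing by the standard deviation. It should reproduce the same intercept and slope as above:

# scale(..., scale = FALSE) centers mom_iq at its mean
lm(kid_score ~ scale(mom_iq, scale = FALSE), kids)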

Replacing mom_iq with mom_iq - mean(mom_iq) gives us a transformed \(x\) variable with \(\bar{x} = 0\). Hence, the \(\widehat{\beta} \bar{x}\) term in the expression for \(\widehat{\alpha}\) vanishes and we end up with \(\widehat{\alpha} = \bar{y}\). Just to double-check:

mean(kids$kid_score)
[1] 86.79724
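
We can likewise confirm that the centered version of mom_iq has mean zero, at least up to floating-point rounding:

# mean of the centered regressor: should be (numerically) zero
mean(kids$mom_iq - mean(kids$mom_iq))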

In this transformed regression, the slope has the same interpretation as before, but now the intercept has a meaningful interpretation as well.
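
Finally, as a sanity check on the claim that the least-squares line passes through \((\bar{x}, \bar{y})\), here is a minimal sketch (defining reg1 as the regression from part 1) that plugs \(\bar{x}\) into the fitted model and confirms the prediction equals \(\bar{y}\):

reg1 <- lm(kid_score ~ mom_iq, kids)
# prediction at mom_iq = mean(mom_iq): should match mean(kid_score), 86.79724
predict(reg1, newdata = data.frame(mom_iq = mean(kids$mom_iq)))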

Parts 2 and 3

Consulting the help file for geom_smooth(), we see that dropping method = 'lm' causes ggplot2 to fit a smoother (a non-parametric regression curve) using stats::loess() or mgcv::gam(), depending on how many observations are in our dataset.

myplot <- kids |> 
  ggplot(aes(x = mom_age, y = kid_score)) + 
  geom_point()

myplot

myplot + 
  geom_smooth(method = 'lm')

myplot + 
  geom_smooth()
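
The help file's rule of thumb is stats::loess() for fewer than 1,000 observations and mgcv::gam() otherwise. A quick way to see which applies here is to check the sample size and then request loess explicitly, which should match the default curve for a dataset of this size:

nrow(kids)  # if this is under 1,000, the default smoother is loess

myplot + 
  geom_smooth(method = 'loess')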