# Regression - Solutions

# Setup

```
library(tidyverse)
library(janitor)
kids <- read_csv('https://ditraglia.com/data/child_test_data.csv')
kids <- clean_names(kids)
kids <- kids |>
  mutate(mom_education = if_else(mom_hs == 1, 'HS', 'NoHS')) |>
  mutate(mom_education = fct_relevel(mom_education, 'NoHS'))
```

# Exercise A - (10 min)

1. Interpret the regression coefficients from the regression `lm(kid_score ~ mom_iq, kids)`. Is there any way to change the model so the intercept has a more natural interpretation?
2. Use `ggplot2` to make a scatter plot with `mom_age` on the horizontal axis and `kid_score` on the vertical axis.
3. Use `geom_smooth()` to add the regression line `kid_score ~ mom_age` to your plot. What happens if you drop `method = 'lm'`?
4. Plot a histogram of the residuals from the regression `kid_score ~ mom_iq` using `ggplot` with a bin width of 5. Is anything noteworthy?
5. Calculate the residuals "by hand" by subtracting the fitted values from `reg1` from the column `kid_score` in `kids`. Check that this gives the same result as `resid()`.
6. As long as you include an intercept in your regression, the residuals will sum to zero. Verify this numerically using the residuals from the preceding part.
7. Regression residuals are uncorrelated with any predictor included in the regression. Verify this numerically using the residuals you computed in part 5.
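The residual properties in the last three parts each come down to a line or two of code. As a quick illustration of the claims (a minimal sketch using simulated stand-in data with hypothetical values, so that it runs on its own without the `kids` file):

```
set.seed(1)
# Hypothetical stand-ins playing the roles of mom_iq and kid_score
mom_iq <- rnorm(200, mean = 100, sd = 15)
kid_score <- 25 + 0.6 * mom_iq + rnorm(200, sd = 18)
reg1 <- lm(kid_score ~ mom_iq)
u <- kid_score - fitted(reg1)   # residuals "by hand"
all.equal(u, resid(reg1))       # same as resid()
sum(u)                          # numerically zero, thanks to the intercept
cor(u, mom_iq)                  # also numerically zero
```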

## Solution

### Part 1

Comparing two kids whose moms differ by one IQ point, we would predict that the kid whose mom has the higher IQ will score 0.61 points higher on the test, on average. The intercept lacks a meaningful interpretation: it is the predicted test score for a kid whose mom has an IQ of zero, an impossible value.

`lm(kid_score ~ mom_iq, kids)`

```
Call:
lm(formula = kid_score ~ mom_iq, data = kids)
Coefficients:
(Intercept) mom_iq
25.80 0.61
```

While this doesn’t affect our predictions in any way, it is arguably better to re-express the regression model so the intercept *does* have a meaningful interpretation. Consider the population linear regression \(Y = \alpha + \beta X + U\). The least squares estimate of \(\alpha\) is \(\widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}\) where \(\widehat{\beta}\) is the least-squares estimate of \(\beta\). In other words: the least squares regression line passes through the point \((\bar{x}, \bar{y})\). We can use this fact to give the regression intercept a meaningful interpretation as follows:

```
kids |>
  mutate(mom_iq = mom_iq - mean(mom_iq)) |>
  lm(kid_score ~ mom_iq, data = _)
```

```
Call:
lm(formula = kid_score ~ mom_iq, data = mutate(kids, mom_iq = mom_iq -
mean(mom_iq)))
Coefficients:
(Intercept) mom_iq
86.80 0.61
```

Another way to obtain the same result is as follows:

`lm(kid_score ~ I(mom_iq - mean(mom_iq)), kids)`

```
Call:
lm(formula = kid_score ~ I(mom_iq - mean(mom_iq)), data = kids)
Coefficients:
(Intercept) I(mom_iq - mean(mom_iq))
86.80 0.61
```

Replacing `mom_iq` with `mom_iq - mean(mom_iq)` gives us a transformed \(x\) variable with \(\bar{x} = 0\). Hence, the \(\widehat{\beta} \bar{x}\) term in the expression for \(\widehat{\alpha}\) vanishes and we end up with \(\widehat{\alpha} = \bar{y}\). Just to double-check:

`mean(kids$kid_score)`

`[1] 86.79724`

In this transformed regression, the slope has the same interpretation as before, but now the intercept has a meaningful interpretation as well.
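The identity \(\widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}\) is also easy to verify numerically. A minimal sketch, using simulated data in place of `kids` so it runs standalone (the variables play the roles of `mom_iq` and `kid_score` but the values are hypothetical):

```
set.seed(1234)
x <- rnorm(200, mean = 100, sd = 15)   # stands in for mom_iq
y <- 25 + 0.6 * x + rnorm(200, sd = 18)  # stands in for kid_score
b <- coef(lm(y ~ x))
# The least squares intercept equals ybar - slope * xbar
all.equal(unname(b[1]), mean(y) - unname(b[2]) * mean(x))
# After centering x, the intercept is exactly ybar
a_centered <- coef(lm(y ~ I(x - mean(x))))[1]
all.equal(unname(a_centered), mean(y))
```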

### Parts 2 and 3

Consulting the help file for `geom_smooth()`, we see that dropping `method = 'lm'` causes `ggplot2` to plot a smoother (a non-parametric regression curve) using `stats::loess()` or `mgcv::gam()`, depending on how many observations are in our dataset.

```
myplot <- kids |>
  ggplot(aes(x = mom_age, y = kid_score)) +
  geom_point()
myplot
```

```
myplot +
  geom_smooth(method = 'lm')
```

```
myplot +
  geom_smooth()
```
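Since our dataset has fewer than 1000 observations, the default smoother above is loess, so requesting it explicitly should reproduce the same curve. A small self-contained sketch (using simulated stand-in data with hypothetical values rather than `kids`):

```
library(ggplot2)
set.seed(42)
# Hypothetical stand-in for the kids data
dat <- data.frame(mom_age = runif(200, 17, 29))
dat$kid_score <- 60 + 1.2 * dat$mom_age + rnorm(200, sd = 18)
# With n < 1000, geom_smooth() defaults to loess, so naming the
# method explicitly gives the same curve as the default call:
p <- ggplot(dat, aes(x = mom_age, y = kid_score)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = y ~ x)
```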