# Regression - Solutions

# Setup

```
library(tidyverse)
library(janitor)
kids <- read_csv('https://ditraglia.com/data/child_test_data.csv')
kids <- clean_names(kids)
kids <- kids |>
  mutate(mom_education = if_else(mom_hs == 1, 'HS', 'NoHS')) |>
  mutate(mom_education = fct_relevel(mom_education, 'NoHS'))
```

# Exercise A - (10 min)

1. Interpret the regression coefficients from the regression `lm(kid_score ~ mom_iq, kids)`. Is there any way to change the model so the intercept has a more natural interpretation?
2. Use `ggplot2` to make a scatter plot with `mom_age` on the horizontal axis and `kid_score` on the vertical axis.
3. Use `geom_smooth()` to add the regression line `kid_score ~ mom_age` to your plot. What happens if you drop `method = 'lm'`?
4. Plot a histogram of the residuals from the regression `kid_score ~ mom_iq` using `ggplot` with a bin width of 5. Is anything noteworthy?
5. Calculate the residuals "by hand" by subtracting the fitted values from `reg1` from the column `kid_score` in `kids`. Check that this gives the same result as `resid()`.
6. As long as you include an intercept in your regression, the residuals will sum to zero. Verify this numerically using the residuals from the preceding part.
7. Regression residuals are uncorrelated with any predictor included in the regression. Verify this numerically using the residuals you computed in part 5.
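The residual properties in the last three parts each come down to a line or two of code. As a quick illustration of the claims (a minimal sketch using simulated stand-in data with hypothetical values, so that it runs on its own without the `kids` file):

```
set.seed(1)
# Hypothetical stand-ins playing the roles of mom_iq and kid_score
mom_iq <- rnorm(200, mean = 100, sd = 15)
kid_score <- 25 + 0.6 * mom_iq + rnorm(200, sd = 18)
reg1 <- lm(kid_score ~ mom_iq)
u <- kid_score - fitted(reg1)   # residuals "by hand"
all.equal(u, resid(reg1))       # same as resid()
sum(u)                          # numerically zero, thanks to the intercept
cor(u, mom_iq)                  # also numerically zero
```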

## Solution

### Part 1

Comparing two kids whose moms differ by one IQ point, we would predict that the kid whose mom has the higher IQ will score 0.61 points higher on the test, on average. The intercept lacks a meaningful interpretation: it is the predicted test score for a kid whose mom has an IQ of zero, an impossible value.

`lm(kid_score ~ mom_iq, kids)`

```
Call:
lm(formula = kid_score ~ mom_iq, data = kids)
Coefficients:
(Intercept) mom_iq
25.80 0.61
```

While this doesn’t affect our predictions in any way, it is arguably better to re-express the regression model so the intercept *does* have a meaningful interpretation. Consider the population linear regression \(Y = \alpha + \beta X + U\). The least squares estimate of \(\alpha\) is \(\widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}\) where \(\widehat{\beta}\) is the least-squares estimate of \(\beta\). In other words: the least squares regression line passes through the point \((\bar{x}, \bar{y})\). We can use this fact to give the regression intercept a meaningful interpretation as follows:

```
kids |>
  mutate(mom_iq = mom_iq - mean(mom_iq)) |>
  lm(kid_score ~ mom_iq, data = _)
```

```
Call:
lm(formula = kid_score ~ mom_iq, data = mutate(kids, mom_iq = mom_iq -
mean(mom_iq)))
Coefficients:
(Intercept) mom_iq
86.80 0.61
```

Another way to obtain the same result is as follows:

`lm(kid_score ~ I(mom_iq - mean(mom_iq)), kids)`

```
Call:
lm(formula = kid_score ~ I(mom_iq - mean(mom_iq)), data = kids)
Coefficients:
(Intercept) I(mom_iq - mean(mom_iq))
86.80 0.61
```

Replacing `mom_iq` with `mom_iq - mean(mom_iq)` gives us a transformed \(x\) variable with \(\bar{x} = 0\). Hence, the \(\widehat{\beta} \bar{x}\) term in the expression for \(\widehat{\alpha}\) vanishes and we end up with \(\widehat{\alpha} = \bar{y}\). Just to double-check:

`mean(kids$kid_score)`

`[1] 86.79724`

In this transformed regression, the slope has the same interpretation as before, but now the intercept has a meaningful interpretation as well.
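The identity \(\widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}\) is also easy to verify numerically. A minimal sketch, using simulated data in place of `kids` so it runs standalone (the variables play the roles of `mom_iq` and `kid_score` but the values are hypothetical):

```
set.seed(1234)
x <- rnorm(200, mean = 100, sd = 15)   # stands in for mom_iq
y <- 25 + 0.6 * x + rnorm(200, sd = 18)  # stands in for kid_score
b <- coef(lm(y ~ x))
# The least squares intercept equals ybar - slope * xbar
all.equal(unname(b[1]), mean(y) - unname(b[2]) * mean(x))
# After centering x, the intercept is exactly ybar
a_centered <- coef(lm(y ~ I(x - mean(x))))[1]
all.equal(unname(a_centered), mean(y))
```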

### Parts 2 and 3

Consulting the help file for `geom_smooth()`, we see that dropping `method = 'lm'` causes `ggplot2` to plot a smoother (a non-parametric regression curve) using `stats::loess()` or `mgcv::gam()`, depending on how many observations are in our dataset.

```
myplot <- kids |>
  ggplot(aes(x = mom_age, y = kid_score)) +
  geom_point()
myplot
```

```
myplot +
  geom_smooth(method = 'lm')
```

```
myplot +
  geom_smooth()
```
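Since our dataset has fewer than 1000 observations, the default smoother above is loess, so requesting it explicitly should reproduce the same curve. A small self-contained sketch (using simulated stand-in data with hypothetical values rather than `kids`):

```
library(ggplot2)
set.seed(42)
# Hypothetical stand-in for the kids data
dat <- data.frame(mom_age = runif(200, 17, 29))
dat$kid_score <- 60 + 1.2 * dat$mom_age + rnorm(200, sd = 18)
# With n < 1000, geom_smooth() defaults to loess, so naming the
# method explicitly gives the same curve as the default call:
p <- ggplot(dat, aes(x = mom_age, y = kid_score)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = y ~ x)
```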