Linear Regression in R

Francis J. DiTraglia

University of Oxford

What is this?

  • Basics of running least-squares regression in R.
  • This lecture does not discuss:
    • Heteroskedasticity / autocorrelation / clustering
    • Causality
  • Plenty of time for these in later lectures!
  • Packages used below: janitor, broom, and car (install!)

Child Test Score Dataset

  • kid.score child test score at age 3
  • mom.age age of mother at birth of child
  • mom.hs mother completed high school? (1 = yes)
  • mom.iq mother’s IQ score

Note

Read this after the lecture if you’re rusty on interpreting interactions in a simple linear regression model. It walks you through many examples with this dataset.

Child Test Score Dataset

library(tidyverse)
kids <- read_csv('https://ditraglia.com/data/child_test_data.csv')
kids
# A tibble: 434 × 4
   kid.score mom.hs mom.iq mom.age
       <dbl>  <dbl>  <dbl>   <dbl>
 1        65      1  121.       27
 2        98      1   89.4      25
 3        85      1  115.       27
 4        83      1   99.4      25
 5       115      1   92.7      27
 6        98      0  108.       18
 7        69      1  139.       20
 8       106      1  125.       23
 9       102      1   81.6      24
10        95      1   95.1      19
# ℹ 424 more rows

Clean up with janitor

library(janitor)
kids <- clean_names(kids) 
kids
# A tibble: 434 × 4
   kid_score mom_hs mom_iq mom_age
       <dbl>  <dbl>  <dbl>   <dbl>
 1        65      1  121.       27
 2        98      1   89.4      25
 3        85      1  115.       27
 4        83      1   99.4      25
 5       115      1   92.7      27
 6        98      0  108.       18
 7        69      1  139.       20
 8       106      1  125.       23
 9       102      1   81.6      24
10        95      1   95.1      19
# ℹ 424 more rows

Note

You could do this manually using rename() from dplyr.
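For instance, here is a minimal sketch of the manual version, run on the freshly loaded data whose column names still contain dots (kids_raw is just an illustrative name):

# rename(new_name = old_name): the manual equivalent of clean_names()
kids_raw <- read_csv('https://ditraglia.com/data/child_test_data.csv')
rename(kids_raw, kid_score = kid.score, mom_hs = mom.hs,
       mom_iq = mom.iq, mom_age = mom.age)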

Linear Regression with lm()

  • Solve ordinary / weighted least-squares problem, etc.
  • Short for linear model
  • lm([FORMULA], [DATAFRAME])
  • Formula Syntax:
    • ~ to separate LHS from RHS: \(Y\) from \(X\)
    • + to separate RHS variables: \(X_1\) from \(X_2\)
    • More details later

Regressions to predict kid_score

# Regress kid_score on mom_iq and intercept 
lm(kid_score ~ mom_iq, kids)

Call:
lm(formula = kid_score ~ mom_iq, data = kids)

Coefficients:
(Intercept)       mom_iq  
      25.80         0.61  
# Regress kid_score on mom_iq, mom_age and intercept 
lm(kid_score ~ mom_iq + mom_age, kids)

Call:
lm(formula = kid_score ~ mom_iq + mom_age, data = kids)

Coefficients:
(Intercept)       mom_iq      mom_age  
    17.5962       0.6036       0.3881  

Plotting the Regression Line

kids |> 
  ggplot(aes(x = mom_iq, y = kid_score)) +
  geom_point() +
  geom_smooth(method = 'lm') + # defaults to y ~ x
  xlab("Mother's IQ Score") + 
  ylab("Child's Test Score (Age 3)")

Getting More from lm()

reg1 <- lm(kid_score ~ mom_iq, kids)
str(reg1) # Recall: str() shows what's inside any R object 
List of 12
 $ coefficients : Named num [1:2] 25.8 0.61
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "mom_iq"
 $ residuals    : Named num [1:434] -34.68 17.69 -11.22 -3.46 32.63 ...
  ..- attr(*, "names")= chr [1:434] "1" "2" "3" "4" ...
 $ effects      : Named num [1:434] -1808.22 190.39 -8.77 -1.96 33.73 ...
  ..- attr(*, "names")= chr [1:434] "(Intercept)" "mom_iq" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:434] 99.7 80.3 96.2 86.5 82.4 ...
  ..- attr(*, "names")= chr [1:434] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:434, 1:2] -20.833 0.048 0.048 0.048 0.048 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:434] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "mom_iq"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.05 1.04
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 432
 $ xlevels      : Named list()
 $ call         : language lm(formula = kid_score ~ mom_iq, data = kids)
 $ terms        :Classes 'terms', 'formula'  language kid_score ~ mom_iq
  .. ..- attr(*, "variables")= language list(kid_score, mom_iq)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "kid_score" "mom_iq"
  .. .. .. ..$ : chr "mom_iq"
  .. ..- attr(*, "term.labels")= chr "mom_iq"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(kid_score, mom_iq)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "kid_score" "mom_iq"
 $ model        :'data.frame':  434 obs. of  2 variables:
  ..$ kid_score: num [1:434] 65 98 85 83 115 98 69 106 102 95 ...
  ..$ mom_iq   : num [1:434] 121.1 89.4 115.4 99.4 92.7 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language kid_score ~ mom_iq
  .. .. ..- attr(*, "variables")= language list(kid_score, mom_iq)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "kid_score" "mom_iq"
  .. .. .. .. ..$ : chr "mom_iq"
  .. .. ..- attr(*, "term.labels")= chr "mom_iq"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(kid_score, mom_iq)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "kid_score" "mom_iq"
 - attr(*, "class")= chr "lm"

Coefficients / Residuals / Fitted Values

coef(reg1)
(Intercept)      mom_iq 
 25.7997778   0.6099746 
resid(reg1)[1:14] # only display first 14 to fit on slide 
         1          2          3          4          5          6          7 
-34.678390  17.691747 -11.217173  -3.461529  32.627697   6.382845 -41.521041 
         8          9         10         11         12         13         14 
  3.864881  26.414387  11.208068  11.170506 -25.661788   3.935176 -17.406597 
fitted.values(reg1)[1:14] # only display first 14 to fit on slide 
        1         2         3         4         5         6         7         8 
 99.67839  80.30825  96.21717  86.46153  82.37230  91.61716 110.52104 102.13512 
        9        10        11        12        13        14 
 75.58561  83.79193  79.82949  83.66179  80.06482  95.40660 

💪 Exercise A - (10 min)

  1. Interpret the regression coefficients from the following regression. Is there any way to change the model so the intercept has a more natural interpretation?
lm(kid_score ~ mom_iq, kids)
  2. Use ggplot2 to make a scatter plot with mom_age on the horizontal axis and kid_score on the vertical axis.
  3. Use geom_smooth() to add the regression line kid_score ~ mom_age to your plot. What happens if you drop method = 'lm'?
  4. Plot a histogram of the residuals from the regression kid_score ~ mom_iq using ggplot with a bin width of 5. Is anything noteworthy?
  5. Calculate the residuals “by hand” by subtracting the fitted values of reg1 from the column kid_score in kids. Check that this gives the same result as resid().
  6. As long as you include an intercept in your regression, the residuals will sum to zero. Verify numerically using the residuals from the preceding part.
  7. Regression residuals are uncorrelated with any predictors included in the regression. Verify numerically using the residuals you computed in part 5.

Summarizing the output of lm()

summary(reg1)

Call:
lm(formula = kid_score ~ mom_iq, data = kids)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.753 -12.074   2.217  11.710  47.691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 25.79978    5.91741    4.36 1.63e-05 ***
mom_iq       0.60997    0.05852   10.42  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.27 on 432 degrees of freedom
Multiple R-squared:  0.201, Adjusted R-squared:  0.1991 
F-statistic: 108.6 on 1 and 432 DF,  p-value: < 2.2e-16

Warning

SEs / p-values assume iid, homoskedastic, normal errors.

Looking inside of summary()

str(summary(reg1))
List of 11
 $ call         : language lm(formula = kid_score ~ mom_iq, data = kids)
 $ terms        :Classes 'terms', 'formula'  language kid_score ~ mom_iq
  .. ..- attr(*, "variables")= language list(kid_score, mom_iq)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "kid_score" "mom_iq"
  .. .. .. ..$ : chr "mom_iq"
  .. ..- attr(*, "term.labels")= chr "mom_iq"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(kid_score, mom_iq)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "kid_score" "mom_iq"
 $ residuals    : Named num [1:434] -34.68 17.69 -11.22 -3.46 32.63 ...
  ..- attr(*, "names")= chr [1:434] "1" "2" "3" "4" ...
 $ coefficients : num [1:2, 1:4] 25.7998 0.61 5.9174 0.0585 4.36 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "mom_iq"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "mom_iq"
 $ sigma        : num 18.3
 $ df           : int [1:3] 2 432 2
 $ r.squared    : num 0.201
 $ adj.r.squared: num 0.199
 $ fstatistic   : Named num [1:3] 109 1 432
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 1.05e-01 -1.03e-03 -1.03e-03 1.03e-05
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "mom_iq"
  .. ..$ : chr [1:2] "(Intercept)" "mom_iq"
 - attr(*, "class")= chr "summary.lm"

Extracting results from summary()

summary(reg1)$r.squared
[1] 0.2009512
  • We could similarly extract the F-stat, p-values etc.
  • But it’s a huge pain: you need to know the exact names and positions (see below)
  • A better way: the broom package from the tidymodels suite of ML packages
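To see the pain concretely, here is what pulling a single standard error and the F-statistic out of summary() looks like by hand:

# Fragile: relies on knowing the row/column names inside summary(reg1)
summary(reg1)$coefficients['mom_iq', 'Std. Error']
summary(reg1)$fstatistic['value']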

Tidying up with broom

  • tidy() returns a tibble with estimates, SEs etc
library(broom)
tidy(reg1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   25.8      5.92        4.36 1.63e- 5
2 mom_iq         0.610    0.0585     10.4  7.66e-23
  • glance() returns a tibble with measures of “model fit”
glance(reg1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.201         0.199  18.3      109. 7.66e-23     1 -1876. 3757. 3769.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

broom plays nicely with dplyr

reg1 |> 
  tidy() |> 
  filter(term == 'mom_iq') |> 
  pull(p.value) 
[1] 7.66195e-23

augment() merges fitted values, etc.

augment(reg1, kids)
# A tibble: 434 × 10
   kid_score mom_hs mom_iq mom_age .fitted .resid    .hat .sigma   .cooksd
       <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
 1        65      1  121.       27    99.7 -34.7  0.00688   18.2 0.0126   
 2        98      1   89.4      25    80.3  17.7  0.00347   18.3 0.00164  
 3        85      1  115.       27    96.2 -11.2  0.00475   18.3 0.000905 
 4        83      1   99.4      25    86.5  -3.46 0.00231   18.3 0.0000416
 5       115      1   92.7      27    82.4  32.6  0.00284   18.2 0.00456  
 6        98      0  108.       18    91.6   6.38 0.00295   18.3 0.000181 
 7        69      1  139.       20   111.  -41.5  0.0178    18.2 0.0478   
 8       106      1  125.       23   102.    3.86 0.00879   18.3 0.000200 
 9       102      1   81.6      24    75.6  26.4  0.00577   18.2 0.00611  
10        95      1   95.1      19    83.8  11.2  0.00255   18.3 0.000483 
# ℹ 424 more rows
# ℹ 1 more variable: .std.resid <dbl>

Note

More information: helpfiles for tidy.lm(), glance.lm() and augment.lm().

💪 Exercise B - (5 min)

  1. Regress mom_iq on kid_score and mom_hs. Store your results in an object called reg_reverse and summarize with tidy() and glance().
  2. Extract the regression estimate, standard error, t-statistic, and p-value for the predictor kid_score only.
  3. Plot the fitted values on the x-axis and mom_iq on the y-axis. Add a 45-degree line with geom_abline().
  4. Regress mom_iq on the fitted values from reg_reverse. Display the estimated regression coefficients and R-squared. Explain your results.

Dummy Variables with lm()

  • A “dummy variable” is a variable that is either zero or one.
  • You could manually create one or more dummy variables to encode categorical information.
  • But this is a waste of time and a waste of memory.
  • It may also be unclear: what does sex = 1 mean?
  • Advice: store categorical variables with meaningful names, and let R create the dummies for you “behind the scenes.”

Recall: Dummy Variable Trap

  • Categorical variable: no meaningful order e.g. eye color, sex
  • Can’t include an intercept and a coefficient for every level that a categorical variable takes on.
  • E.g. can’t have an intercept, a Male dummy and a Female dummy
  • Why not? \(\text{Male} = (1 - \text{Female})\), so the columns of \(X\) are linearly dependent. This means \(X'X\) isn’t invertible, so there is no unique OLS estimator.
  • Easy to fix: drop something, either the intercept or one category (the “omitted” category); see the sketch below
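A minimal sketch of the trap using the kids data (kids_trap and mom_no_hs are just illustrative names): since mom_no_hs equals 1 - mom_hs, the regressors are perfectly collinear and lm() reports NA for whichever column it drops.

# Intercept, mom_hs and mom_no_hs are linearly dependent, so one coefficient is NA
kids_trap <- mutate(kids, mom_no_hs = 1 - mom_hs)
lm(kid_score ~ mom_hs + mom_no_hs, kids_trap)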

R Constructs the Dummy For Us

Recode mom_hs for clarity:

kids <- kids |> 
  mutate(mom_education = if_else(mom_hs == 1, 'HS', 'NoHS'))
lm(kid_score ~ mom_education, kids)

Call:
lm(formula = kid_score ~ mom_education, data = kids)

Coefficients:
      (Intercept)  mom_educationNoHS  
            89.32             -11.77  

Note

Helpfully, R tells us how it encoded the dummy; unhelpfully, it chose HS as the omitted category without asking us. (By default R omits the first category in alphabetical order.)

Specifying the Omitted Category

Using fct_relevel() from forcats, part of tidyverse

kids <- kids |> 
  mutate(mom_education = fct_relevel(mom_education, 'NoHS'))
lm(kid_score ~ mom_education, kids)

Call:
lm(formula = kid_score ~ mom_education, data = kids)

Coefficients:
    (Intercept)  mom_educationHS  
          77.55            11.77  

Note

Another option, discussed below, includes both categories, but drops the intercept.

Fun with Formulas

  • y ~ x is an R formula; use to specify a statistical model
  • Within a formula, certain characters have a special meaning
  • You’ve already met ~ and +
  • The rest are ., -, 1, :, *, ^, and I().
  • ^ and * are more exotic; I’ll mostly skip them here (brief sketch below)
  • See my R Formula Cheatsheet for full details.
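Since * does show up in a couple of later slides, here is a brief sketch of what * and ^ expand to; the regressions themselves are just illustrations:

# kid_score ~ mom_iq * mom_age is shorthand for mom_iq + mom_age + mom_iq:mom_age
lm(kid_score ~ mom_iq * mom_age, kids)
# (mom_iq + mom_age)^2 expands to the same three terms
lm(kid_score ~ (mom_iq + mom_age)^2, kids)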

“Everything Else” - The Dot .

Regress kid_score on all other variables in kids

lm(kid_score ~ ., kids)

Call:
lm(formula = kid_score ~ ., data = kids)

Coefficients:
    (Intercept)           mom_hs           mom_iq          mom_age  
        20.9847           5.6472           0.5625           0.2248  
mom_educationHS  
             NA  

Note

The coefficient on mom_educationHS is NA because it is perfectly collinear with mom_hs, so R drops it: the dummy variable trap again!

Removing Predictors with -

Regress kid_score on everything except mom_hs

lm(kid_score ~ . - mom_hs, kids)

Call:
lm(formula = kid_score ~ . - mom_hs, data = kids)

Coefficients:
    (Intercept)           mom_iq          mom_age  mom_educationHS  
        20.9847           0.5625           0.2248           5.6472  

The Intercept: 1

Included by default, but you can remove it

lm(kid_score ~ . - 1, kids)

Call:
lm(formula = kid_score ~ . - 1, data = kids)

Coefficients:
           mom_hs             mom_iq            mom_age  mom_educationNoHS  
          26.6318             0.5625             0.2248            20.9847  
  mom_educationHS  
               NA  

Or even regress on only an intercept

lm(kid_score ~ 1, kids)

Call:
lm(formula = kid_score ~ 1, data = kids)

Coefficients:
(Intercept)  
       86.8  

Why remove the intercept?

  • Usually bad to run a regression without an intercept
  • With dummy variables, excluding the intercept gives an alternative parameterization of the model:
lm(kid_score ~ mom_education - 1, kids)

Call:
lm(formula = kid_score ~ mom_education - 1, data = kids)

Coefficients:
mom_educationNoHS    mom_educationHS  
            77.55              89.32  

Note

Previous parameterization: mean for omitted category and difference of means.
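A quick way to check this, if you like, is to compute the group means directly and compare them with the two coefficients above:

# Group means of kid_score by mom_education should match the no-intercept coefficients
kids |> 
  group_by(mom_education) |> 
  summarize(mean_kid_score = mean(kid_score))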

Transforming Outcomes / Predictors

One option is to create new variables in your dataframe:

new_kids <- kids |> 
  mutate(log_kid_score = log(kid_score), 
         mom_age_sq = mom_age^2)
lm(log_kid_score ~ mom_age + mom_age_sq, new_kids)

Call:
lm(formula = log_kid_score ~ mom_age + mom_age_sq, data = new_kids)

Coefficients:
(Intercept)      mom_age   mom_age_sq  
  4.4586761   -0.0109575    0.0004214  

A Cleaner Approach

Rather than constructing new variables, change the formula:

lm(log(kid_score) ~ mom_age + I(mom_age^2), kids)

Call:
lm(formula = log(kid_score) ~ mom_age + I(mom_age^2), data = kids)

Coefficients:
 (Intercept)       mom_age  I(mom_age^2)  
   4.4586761    -0.0109575     0.0004214  

Note

Since ^ has a special meaning in R formulas, we have to “escape” it by wrapping it in I(), which means “as-is.” We don’t need to escape log(), but wrapping it in I() would still work.

Adding Interactions with :

To include \(X \times W\) as a regressor use x:w

lm(kid_score ~ mom_age:mom_iq, kids)

Call:
lm(formula = kid_score ~ mom_age:mom_iq, data = kids)

Coefficients:
   (Intercept)  mom_age:mom_iq  
      48.04678         0.01698  

Note

We could alternatively use I(x*w) to “escape” the *, which has a special meaning.

Warning

It almost never makes sense to include an interaction without the main effects.

💪 Exercise C - (8 min)

  1. What does it mean to regress \(Y\) on an intercept only? Explain briefly.
  2. Above we regressed kid_score on mom_age and I(mom_age^2). Try omitting the I() wrapper around age-squared. What happens?
  3. Run these regressions and interpret the results:
    1. kid_score on mom_iq and mom_education
    2. kid_score on mom_iq, mom_education and their interaction.
  4. Above we used ggplot2 to plot the regression line kid_score ~ mom_iq. Re-run the associated code but add color = mom_education as an extra argument to aes() and se = FALSE as an extra argument to geom_smooth(). What happens?

predict()

  • Use predict() to predict new observations from a fitted model
  • Works with a variety of models besides linear regression.
  • For today: predict from a fitted lm() object.
  • Associated help file: ?predict.lm()
  • Syntax: predict(object, newdata)
    • object is a fitted lm() object
    • newdata is a data frame of “new” observations

predict(object, newdata)

  • newdata must be a data frame
  • The columns of newdata must match the names of the regressors from object
reg2 <- lm(kid_score ~ mom_iq * mom_education, kids) 
new_kids <- data.frame(mom_iq = c(100, 120, 80), 
                       mom_education = c('NoHS', 'HS', 'HS'))
new_kids
  mom_iq mom_education
1    100          NoHS
2    120            HS
3     80            HS
predict(reg2, new_kids)
       1        2        3 
85.40690 97.93995 78.55537 

predict() without newdata

  • If you omit newdata, then predict() is equivalent to fitted.values()
  • I.e. it sets newdata to the data frame used to fit the regression.
all.equal(predict(reg1), fitted.values(reg1))
[1] TRUE

Predicting with augment()

  • augment() from broom can also be used to predict new observations.
  • Specify newdata exactly as we did for predict() above.
  • For more details, consult ?broom::augment.lm()
  • Here .fitted refers to the predicted values:
augment(reg2, newdata = new_kids)
# A tibble: 3 × 3
  mom_iq mom_education .fitted
   <dbl> <chr>           <dbl>
1    100 NoHS             85.4
2    120 HS               97.9
3     80 HS               78.6
  • It’s probably worth renaming for clarity:
augment(reg2, newdata = new_kids) |> 
  rename(kid_score_pred = .fitted)
# A tibble: 3 × 3
  mom_iq mom_education kid_score_pred
   <dbl> <chr>                  <dbl>
1    100 NoHS                    85.4
2    120 HS                      97.9
3     80 HS                      78.6

Testing a Linear Restriction

  • Say you want to test \(H_0\colon \mathbf{R}\boldsymbol{\beta} = \mathbf{q}\) against \(H_1\colon \mathbf{R}\boldsymbol{\beta} \neq \mathbf{q}\)
  • For example:
    • \(\beta_1 = \beta_2\) or
    • \(\beta_1 + \beta_2 + \beta_3 = 1\) or
    • \(\beta_1 = 0\) and \(\beta_2 = \beta_3\).
  • The SEs from the regression output only permit tests of a single coefficient: \(\beta_j = b\).
  • linearHypothesis() from car provides an F-test; see the sketch below for the syntax.
  • See this blog post if you’ve forgotten what an F-test is.
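As a sketch of the syntax for restrictions like these (reg_two is just an illustrative name, re-using the two-predictor regression from earlier):

library(car)
reg_two <- lm(kid_score ~ mom_iq + mom_age, kids)
linearHypothesis(reg_two, 'mom_iq = mom_age')      # beta_1 = beta_2
linearHypothesis(reg_two, 'mom_iq + mom_age = 1')  # beta_1 + beta_2 = 1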

Some Warnings

  • Null hypothesis significance testing (NHST) is rarely the right tool for the job in applied work.
  • Have you seen the memo on p-values?
  • Statistical Inference - Defense Against the Dark Arts
  • See also: \(F\)-tests, \(R^2\) and Other Distractions
  • Today: assume homoskedasticity.

Testing \(\beta_0 = \beta_1 = 1\)

library(car)
reg1

Call:
lm(formula = kid_score ~ mom_iq, data = kids)

Coefficients:
(Intercept)       mom_iq  
      25.80         0.61  
linearHypothesis(reg1, c('mom_iq = 1', '(Intercept) = 1'))
Linear hypothesis test

Hypothesis:
mom_iq = 1
(Intercept) = 1

Model 1: restricted model
Model 2: kid_score ~ mom_iq

  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    434 246503                                 
2    432 144137  2    102366 153.4 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variants of the Test

  • By default, linearHypothesis() implements the “finite-sample” F-test.
  • For the “asymptotic” \(\chi^2\) version:
linearHypothesis(reg1, c('mom_iq = 1', '(Intercept) = 1'), test = 'Chisq')
Linear hypothesis test

Hypothesis:
mom_iq = 1
(Intercept) = 1

Model 1: restricted model
Model 2: kid_score ~ mom_iq

  Res.Df    RSS Df Sum of Sq  Chisq Pr(>Chisq)    
1    434 246503                                   
2    432 144137  2    102366 306.81  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

💪 Exercise D - (8 min)

  1. Test the joint null hypothesis that the slope and intercept of the predictive relationship between kid_score and mom_iq are the same for kids whose mothers graduated from high school and those whose mothers did not. Does the p-value change much if you use the asymptotic version of the test rather than the finite-sample F?
  2. Let n be the number of rows in kids. Generate two random vectors as follows:
    • x is a vector of n independent standard normal observations.
    • z equals mom_iq plus independent standard normal noise.
    Carry out a new regression, reg_augmented, that adds the predictors x and z to your earlier regression of kid_score on mom_iq, mom_education and their interaction. Finally, test the null hypothesis that x and z are irrelevant for predicting kid_score after taking into account mom_iq and mom_education. Interpret your findings. Do the results of the test make sense?

Making Nice Regression Tables

  • Last time: datasummary_skim() from modelsummary.
  • Today: modelsummary() function for regression tables.
  • You ran these in your exercise above; I’ll give them names:
reg_pooled <- lm(kid_score ~ mom_iq, kids)
reg_hs_dummy <- lm(kid_score ~ mom_iq + mom_hs, kids)
reg_interact <- lm(kid_score ~ mom_iq * mom_hs, kids)
  • Please Observe
    • Use clear, meaningful labels.
    • Use an appropriate number of decimal places.
    • Don’t report millions of “goodness of fit” measures.

library(modelsummary)
modelsummary(reg_pooled)
                      (1)
(Intercept)        25.800
                  (5.917)
mom_iq              0.610
                  (0.059)
Num.Obs.              434
R2                  0.201
R2 Adj.             0.199
AIC                3757.2
BIC                3769.4
Log.Lik.        −1875.608
F                 108.643
RMSE                18.22

Clean up results

modelsummary(reg_pooled, gof_omit = 'Log.Lik|R2 Adj.|AIC|BIC|F', fmt = 2, 
             title = 'Regression results for kids dataset', 
             notes = 'Source: all data were fabricated by the authors.') 
Regression results for kids dataset
                 (1)
(Intercept)    25.80
              (5.92)
mom_iq          0.61
              (0.06)
Num.Obs.         434
R2             0.201
RMSE           18.22
Source: all data were fabricated by the authors.

Results of Several Regressions

kids_regressions <- list(reg_pooled, reg_hs_dummy, reg_interact)
modelsummary(kids_regressions, gof_omit = 'Log.Lik|R2 Adj.|AIC|BIC|F', fmt = 2)
                     (1)      (2)      (3)
(Intercept)        25.80    25.73   −11.48
                  (5.92)   (5.88)  (13.76)
mom_iq              0.61     0.56     0.97
                  (0.06)   (0.06)   (0.15)
mom_hs                       5.95    51.27
                           (2.21)  (15.34)
mom_iq × mom_hs                      −0.48
                                    (0.16)
Num.Obs.             434      434      434
R2                 0.201    0.214    0.230
RMSE               18.22    18.07    17.89
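Finally, a sketch of one way to follow the “clear, meaningful labels” advice above, using modelsummary’s coef_rename argument (the labels themselves are just illustrative):

modelsummary(kids_regressions, fmt = 2, gof_omit = 'Log.Lik|R2 Adj.|AIC|BIC|F',
             coef_rename = c('mom_iq' = "Mother's IQ",
                             'mom_hs' = 'Mother finished HS',
                             'mom_iq:mom_hs' = "Mother's IQ x finished HS"))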