Logistic Regression

Francis J. DiTraglia

University of Oxford

What is this?

  • Sometimes need to predict a binary outcome
    • E.g. “two truths and a lie”
    • Which statement is true? How sure are you?
    • \(Y = 1\) if correct, \(X \in \{0, 1, 2, ..., 10\}\) “certainty”
  • Linear regression can give predictions outside \([0, 1]\) (quick check below)
  • Logistic Regression (aka Logit)
    • \(\mathbb{P}(Y=1|X) = G(X'\beta)\), where \(G=\) std logistic CDF
    • Today: focus on intuition and implementation in R.
    • See my lecture videos for technical details.
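
A quick check of the “outside \([0, 1]\)” point above, with made-up intercept and slope values (purely illustrative, not estimates): the linear index wanders outside the unit interval, while plogis() always returns a value strictly between 0 and 1.

alpha <- -1; beta <- 0.8      # illustrative values chosen for this sketch
x <- c(-5, 0, 5)
alpha + beta * x              # linear predictions: -5, -1, 3 (outside [0, 1])
plogis(alpha + beta * x)      # logit predictions: all strictly inside (0, 1)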

Two Ways to Express Logit Regression

  • \(\mathbb{P}(Y=1|X)=\) nonlinear function of data/params \[ \mathbb{P}(Y=1|X) = \texttt{plogis}(X'\beta) \]

  • \(\log \text{Odds}(Y=1|X) =\) linear function of data/params (numerical check below) \[ \log \left[\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)}\right] = X'\beta. \]

  • Odds and log odds are intuitive if you give them a chance!
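
To see that these two displays describe the same model, here is a small numerical sketch with arbitrary parameter values (my own choice, purely for illustration): the log odds implied by plogis() recover the linear index exactly.

alpha <- 0.5; beta <- 1; x <- 2    # arbitrary illustrative values
p <- plogis(alpha + beta * x)      # P(Y = 1 | X = x)
log(p / (1 - p))                   # log odds: equals alpha + beta * x = 2.5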

What are odds?

  • Probabilities are not more intuitive than odds: you’ve simply been brainwashed to believe that they are!

  • Let’s take the red pill… \[ \text{Odds}(A) \equiv \frac{\mathbb{P}(A)}{\mathbb{P}(A^c)} = \frac{\mathbb{P}(A)}{1 - \mathbb{P}(A)} \]

  • Pick a door at random; behind the three doors are {🚗, 🐐, 🐐}

    • \(\mathbb{P}\)(🚗) = (#🚗) / (#Items) = 1/3
    • \(\text{Odds}\)(🚗) = (#🚗) / (#🐐) = 1/2
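
The same arithmetic as a one-liner in R:

p_car <- 1 / 3         # P(car): one car behind three doors
p_car / (1 - p_car)    # Odds(car) = (# cars) / (# goats) = 0.5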

💪 Exercise A - (3 min)

One in a hundred women has breast cancer \((B)\). If you have breast cancer, there is a 95% chance that you will test positive \((T)\); if you do not have breast cancer \((B^c)\), there is a 2% chance that you will nonetheless test positive \((T)\). We know nothing about Alice other than the fact that she tested positive. How likely is it that she has breast cancer?

The Usual Approach

  • \(T\equiv \{\text{Tests Positive}\}, \, B\equiv \{\text{Has Breast Cancer}\}\) \[ \begin{aligned} \mathbb{P}(B | T) &= \frac{\mathbb{P}(T|B)\mathbb{P}(B)}{\mathbb{P}(T)} = \frac{\mathbb{P}(T|B)\mathbb{P}(B)}{\mathbb{P}(T|B)\mathbb{P}(B) + \mathbb{P}(T|B^c)\mathbb{P}(B^c)}\\ &= \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.02 \times 0.99} \approx 0.32. \end{aligned} \]

  • \(\mathbb{P}(B|T)=\) nonlinear function of \(\mathbb{P}(B)\), \(\mathbb{P}(T|B)\), \(\mathbb{P}(T|B^c)\).

  • How does changing the base rate, sensitivity, or specificity change the posterior probability? Hard to do in your head! (Quick R check below.)
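
For reference, the Bayes’ Theorem calculation above as a single line of R:

# base rate 1%, sensitivity 95%, false positive rate 2%
(0.95 * 0.01) / (0.95 * 0.01 + 0.02 * 0.99)  # roughly 0.32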

Odds Make it Easy

Odds cancel the complicated Bayes’ Theorem denominator:

\[ \frac{\mathbb{P}(B|T)}{\mathbb{P}(B^c|T)} = \frac{\displaystyle \frac{\mathbb{P}(T|B)\mathbb{P}(B)}{\mathbb{P}(T)}}{\displaystyle \frac{\mathbb{P}(T|B^c)\mathbb{P}(B^c)}{\mathbb{P}(T)}} = \frac{\mathbb{P}(B)}{\mathbb{P}(B^c)} \times \frac{\mathbb{P}(T|B)}{\mathbb{P}(T|B^c)} \] Or, to put it another way: \[ \left(\text{Posterior Odds}\right)= \left( \text{Prior Odds}\right) \times \left( \text{Likelihood Ratio}\right) \]

Odds Make it Easy

  • Prior Odds = 1/99
    • 1 woman with breast cancer for every 99 without
  • Likelihood Ratio = 95/2
    • Of 100 women with breast cancer, 95 test positive
    • Of 100 women without breast cancer, 2 test positive
  • Posterior Odds: (1/99) \(\times\) (95 / 2) \(\approx\) 50/100 = 1/2
    • Same as Odds(🚗) when drawing from {🚗, 🐐, 🐐}
    • Agrees with exact calculation: \(\mathbb{P}(B|T) \approx 0.32\).
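
The same answer via odds, in R. The last line uses plogis() purely as a log-odds-to-probability converter; it matches the exact Bayes’ Theorem calculation from the previous slide.

prior_odds <- 1 / 99               # 1 woman with cancer per 99 without
LR <- 95 / 2                       # sensitivity / false positive rate
posterior_odds <- prior_odds * LR  # roughly 0.48, i.e. about 1/2
plogis(log(posterior_odds))        # as a probability: roughly 0.32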

Odds: Multiplicative Scale

\[ \left(\text{Posterior Odds}\right)= \left( \text{Prior Odds}\right) \times \left( \text{Likelihood Ratio}\right) \]

  • Scaling prior odds or LR scales the posterior odds by same.
  • E.g. 1 woman with breast cancer for every 999 without
    • Prior odds become 1/999
    • Divide previous result by 10: Odds = 1/20.
  • E.g. sensitivity increases to 99%, specificity falls to 95%
    • Likelihood ratio becomes 99/5 \(\approx 20\)
    • Odds \(\approx\) 20/100 = 1/5

Log Odds: Additive Scale

  • \(\log(\text{Posterior Odds}) = \log(\text{Prior Odds}) + \log(\text{LR})\)
  • Just add: easy back-of-the-envelope calculations.
  • Prior odds of various diseases
    • Common = 1/10, Rare = 1/100, Very Rare = 1/1000
    • Log 10 scale: -1, -2, -3
  • LR of various diagnostic tests:
    • OK test = 10, Good test = 100, Excellent test = 1000
    • Log 10 scale: 1, 2, 3
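
A back-of-the-envelope example using the log-10 bookkeeping above, with the round numbers from the slide: a rare disease plus a good test gives \(-2 + 2 = 0\) on the log-10 scale, i.e. posterior odds of 1.

log10_prior <- -2             # rare disease: prior odds of 1/100
log10_LR <- 2                 # good test: likelihood ratio of 100
10^(log10_prior + log10_LR)   # posterior odds = 1: as likely as not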

💪 Exercise B - (8 min)

  1. Probabilities are between 0 and 1. What about odds? What about log odds?
  2. If the probability of an event is \(1/2\), what are the odds? What about the log odds? What does this tell us about the qualitative interpretation of odds and log odds?
  3. I haven’t given you a closed-form expression for plogis(). Use the log-odds representation of logit regression to work it out, then create a function called myplogis(). Compare to the “real” plogis() function on a grid of 400 points between -4 and 4.

Probabilities vs. Odds vs. Log Odds

  • Probabilities
    • Highly nonlinear, bounded below by 0 and above by 1
    • \(\mathbb{P}(A) = 1/2 \implies A\) and \(A^c\) are equally likely.
  • Odds
    • Multiplicative, bounded below by 0, unbounded above
    • Odds of 1 \(\implies A\) and \(A^c\) are equally likely
  • Log Odds
    • Additive, unbounded above and below
    • Log odds of 0 \(\implies A\) and \(A^c\) are equally likely.
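
A quick numerical check of all three scales. The base R function qlogis() is the inverse of plogis(): it maps a probability to the corresponding log odds.

p <- c(0.1, 0.5, 0.9)
p / (1 - p)    # odds: 1/9, 1, 9
qlogis(p)      # log odds: roughly -2.2, 0, 2.2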

Logit Partial Effects Are Complicated!

How does changing \(x\) change \(p(x) \equiv \mathbb{P}(Y=1|X=x)\)? \[ \frac{d}{dx} \texttt{plogis}(\alpha + \beta x) = \beta \times \texttt{dlogis}(\alpha + \beta x) \]

  • dlogis() > 0 since it’s a density, so \(\frac{d}{dx}\) has the same sign as \(\beta\).
  • Magnitude depends on \(\alpha\), \(\beta\), and the \(x\) where we evaluate.
  • On the probability scale, logit partial effects are highly nonlinear.
  • The CDF plogis() eventually “flattens out”, and with it \(d/dx\) (small illustration below).
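
A small illustration with made-up parameter values (\(\alpha = 0.5\), \(\beta = 1\), chosen arbitrarily): the partial effect is positive everywhere, matching the sign of \(\beta\), but it shrinks rapidly as we move into the flat parts of the CDF.

alpha <- 0.5; beta <- 1            # illustrative values, not estimates
x <- c(-4, 0, 4)
beta * dlogis(alpha + beta * x)    # roughly 0.028, 0.235, 0.011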

Everything’s easier with odds!

  • Partly the magic of odds; partly the magic of the plogis() function.
  • Recall that these are equivalent:
    • \(\mathbb{P}(Y=1|X = x) = \texttt{plogis}(\alpha + \beta x)\)
    • \(\log [\text{Odds}(Y=1|X=x)] = \alpha + \beta x\)
  • Therefore: \(\text{Odds}(Y=1|X=x) = \exp(\alpha + \beta x)\)
  • Increasing \(x\) by 1 unit:
    • Adds \(\beta\) to the log odds
    • Multiplies the odds by \(e^\beta\)
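
A numerical check of the last bullet, again with arbitrary parameter values: moving from \(x\) to \(x + 1\) multiplies the odds by \(e^\beta\).

alpha <- 0.5; beta <- 1; x <- 2              # illustrative values
odds <- function(x) exp(alpha + beta * x)    # Odds(Y = 1 | X = x)
odds(x + 1) / odds(x)                        # equals exp(beta), roughly 2.718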

💪 Exercise C - (8 min)

  1. Consider two people: one has \(X = x_2\) and the other has \(X = x_1\). Under the simple logistic regression model from above, what is the odds ratio for these two people? In other words, what is the ratio of the odds that \(Y=1\) for person two relative to those for person one?
  2. Perhaps you’ve heard of the “divide by four rule.” It goes like this: if you divide a logistic regression coefficient by four, you’ll obtain the maximum possible effect of changing the associated regressor by one unit on the predicted probability that \(Y = 1\). Where does this rule come from?
  3. The quantity \(\beta \times \texttt{dlogis}(\alpha + \beta x)\) is usually called the partial effect of \(x\). I write \(x\) to denote a realization of the random variable \(X\), so \(x\) is constant whereas \(X\) is random. In contrast, the average partial effect is defined as \(E[\beta \times \texttt{dlogis}(\alpha + \beta X)]\) and the partial effect at the average is \(\beta \times \texttt{dlogis}(\alpha + \beta E[X])\). Do these two coincide in general? Why or why not? What question does each of them answer?

Logit Regression Simulation

  • Direct Approach
    1. Simulate regressors \(X_i\) and choose parameters \(\beta\).
    2. Set \(p_i \equiv \texttt{plogis}(X_i'\beta)\).
    3. Simulate \(Y_i|X_i \sim \text{independent Bernoulli}(p_i)\).
  • Latent Variable Approach
    1. Simulate regressors \(X_i\) and choose parameters \(\beta\).
    2. Simulate \(\epsilon_i\sim \text{iid Logistic}(0,1)\)
    3. Set \(y_i^* = X_i'\beta + \epsilon_i\).
    4. Set \(y_i = 1\) if \(y_i^* > 0\) and \(y_i = 0\) if \(y_i^* \leq 0\).

Example: Latent Variable Approach

library(tidyverse)
set.seed(1234)
n <- 500
x <- rnorm(n, mean = 1.5, sd = 2) # Generate X
ystar <- 0.5 + 1 * x + rlogis(n) # Generate "latent variable"
y <- 1 * (ystar > 0) # Transform to 0/1
mydat <- tibble(x, y)
mydat
# A tibble: 500 × 2
        x     y
    <dbl> <dbl>
 1 -0.914     1
 2  2.05      1
 3  3.67      1
 4 -3.19      0
 5  2.36      1
 6  2.51      1
 7  0.351     1
 8  0.407     0
 9  0.371     1
10 -0.280     1
# ℹ 490 more rows

💪 Exercise D - (5 min)

My simulation code from above generated data from a logistic regression model using the “latent variable” approach. Write code to accomplish the same result using the direct approach.

glm(formula, family, data)

  • Base R function to estimate generalized linear models
  • formula and data work just like in lm()
  • family describes error distribution & link function
    • family = binomial(link = 'logit')
    • family = binomial(link = 'probit')
    • family = poisson(link = 'log')
  • Compatible with tidy(), glance() and augment()

Running a Logistic Regression in R

lreg <- glm(y ~ x, family = binomial(link = 'logit'), mydat)
summary(lreg)

Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = mydat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.3630     0.1344   2.700  0.00693 ** 
x             0.9638     0.1004   9.596  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 555.65  on 499  degrees of freedom
Residual deviance: 381.74  on 498  degrees of freedom
AIC: 385.74

Number of Fisher Scoring iterations: 6

💪 Exercise E - (10 min)

  1. Construct approximate 95% confidence intervals for the parameters \(\beta_0\) and \(\beta_1\) based on the logistic regression output from above. Do your confidence intervals include the true parameter values that we used to simulate the data?
  2. Interpret the estimated slope coefficient from lreg.
  3. Try using coef() with lreg. What do you get? Does it work as expected? Now try the broom functions tidy() and glance() along with the function modelsummary() from the modelsummary package. What do you get?
  4. As discussed above, \(\beta\) is not the derivative of \(\texttt{plogis}(\alpha + \beta x)\) with respect to \(x\).
    1. Use the “divide by 4” rule to calculate the maximum possible partial effect of \(x\) on the predicted probability that \(Y = 1\) using the results of lreg.
    2. Calculate the partial effect of \(x\) on the predicted probability that \(Y = 1\) evaluated at the sample mean value of \(X\) (the partial effect at the average).
    3. Calculate the average partial effect of \(x\) over the observed sample.
    4. Compare your answers to (a), (b), and (c).

Predicted Probabilities

  • predict() works with glm() almost as it does for lm()
  • To get predicted probabilities set type = 'response'
# Estimate of P(Y = 1 | X = 0)
predict(lreg, newdata = data.frame(x = 0), type = 'response')
        1 
0.5897566 
# Estimate of P(Y = 1 | X = mean(x))
predict(lreg, newdata = data.frame(x = mean(x)), type = 'response')
        1 
0.8596206 
# Estimate of P(Y = 1 | X = observed x-values)
p_hat <- predict(lreg, type = 'response')
head(p_hat)
         1          2          3          4          5          6 
0.37330981 0.91240407 0.98013788 0.06222354 0.93312689 0.94180674 
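
If you omit type = 'response', predict() for a glm() fit returns predictions on the link scale, i.e. the log odds; passing them through plogis() recovers the fitted probabilities. A quick check, assuming lreg and p_hat from above are still in memory:

log_odds_hat <- predict(lreg)   # default type = 'link': predicted log odds
head(plogis(log_odds_hat))      # matches head(p_hat) from above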

augment() with glm() objects

augment(lreg, mydat, type.predict = 'response')
# A tibble: 500 × 8
        x     y .fitted .resid    .hat .sigma   .cooksd .std.resid
    <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
 1 -0.914     1  0.373   1.40  0.00797  0.874 0.00680        1.41 
 2  2.05      1  0.912   0.428 0.00349  0.876 0.000169       0.429
 3  3.67      1  0.980   0.200 0.00240  0.876 0.0000245      0.201
 4 -3.19      0  0.0622 -0.358 0.00859  0.876 0.000290      -0.360
 5  2.36      1  0.933   0.372 0.00341  0.876 0.000123       0.373
 6  2.51      1  0.942   0.346 0.00334  0.876 0.000104       0.347
 7  0.351     1  0.668   0.898 0.00364  0.875 0.000909       0.899
 8  0.407     0  0.680  -1.51  0.00356  0.874 0.00381       -1.51 
 9  0.371     1  0.673   0.890 0.00361  0.875 0.000884       0.892
10 -0.280     1  0.523   1.14  0.00528  0.875 0.00243        1.14 
# ℹ 490 more rows

Plotting Logit Regression

Notice the new argument to stat_smooth()

ggplot(mydat, aes(x, y)) +
  stat_smooth(method='glm', method.args = list(family = "binomial")) +
  geom_point() 

Jittering to Improve Legibility

ggplot(mydat, aes(x, y)) +
  stat_smooth(method='glm', method.args = list(family = "binomial")) +
  geom_jitter(width = 0.5, # noise in x-coordinate
              height = 0.1) # noise in y-coordinate 

💪 Exercise F - (\(\infty\) min)

Data from the “two truths and a lie” experiment described at the beginning of these slides are available from two-truths-and-a-lie-2022-cleaned.csv in the data directory of my website: https://ditraglia.com/data/.

  1. Read the dataset directly into R from the web, storing it in a tibble called two_truths.
  2. Run a logistic regression that predicts guessed_right based on certainty. Make a nicely-formatted table of regression results using modelsummary() and comment briefly on your findings.
  3. Use ggplot() to depict the regression from part 2, adding jittering to make the raw data clearly visible. You may need to adjust the width and height parameters of geom_jitter().