\(\mathbb{P}(Y=1|X)=\) nonlinear function of data/params \[
P(Y=1|X) = \texttt{plogis}(X'\beta)
\]
\(\log \text{Odds}(Y=1|X) =\) linear function of data/params \[
\log \left[\frac{P(Y=1|X)}{P(Y=0|X)}\right] = X'\beta.
\]
Odds and log odds are intuitive if you give them a chance!
What are odds?
Probabilities are not more intuitive than odds: you’ve simply been brainwashed to believe that they are!
Let’s take the red pill…\[
\text{Odds}(A) \equiv \frac{\mathbb{P}(A)}{\mathbb{P}(A^c)} = \frac{\mathbb{P}(A)}{1 - \mathbb{P}(A)}
\]
Pick a door at random; behind the three doors are {🚗, 🐐, 🐐}
\(\mathbb{P}\)(🚗) = (#🚗) / (#Items) = 1/3
\(\text{Odds}\)(🚗) = (#🚗) / (#🐐) = 1/2
💪 Exercise A - (3 min)
One in a hundred women has breast cancer \((B)\). If you have breast cancer, there is a 95% chance that you will test positive \((T)\); if you do not have breast cancer \((B^c)\), there is a 2% chance that you will nonetheless test positive \((T)\). We know nothing about Alice other than the fact that she tested positive. How likely is it that she has breast cancer?
Rule of thumb for a diagnostic test's likelihood ratio, \(\mathbb{P}(T|B)/\mathbb{P}(T|B^c)\): OK test = 10, good test = 100, excellent test = 1000.
On a log base-10 scale: 1, 2, 3.
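Once you have attempted Exercise A, here is a minimal R sketch of one way to check your answer using the odds form of Bayes' rule (posterior odds = prior odds × likelihood ratio); the numbers come straight from the exercise:
# Odds form of Bayes' rule: posterior odds = prior odds x likelihood ratio
prior_odds <- 0.01 / 0.99          # Odds(B) = P(B) / P(B^c)
likelihood_ratio <- 0.95 / 0.02    # P(T | B) / P(T | B^c) = 47.5
posterior_odds <- prior_odds * likelihood_ratio
posterior_odds / (1 + posterior_odds)  # P(B | T), roughly 0.32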
💪 Exercise B - (8 min)
Probabilities are between 0 and 1. What about odds? What about log odds?
If the probability of an event is \(1/2\), what are the odds? What about the log odds? What does this tell us about the qualitative interpretation of odds and log odds?
I haven’t given you a closed-form expression for plogis(). Use the log-odds representation of logit regression to work it out, then create a function called myplogis(). Compare to the “real” plogis() function on a grid of 400 points between -4 and 4.
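For the last part, here is a minimal sketch of one possible solution: solving \(\log[p/(1-p)] = z\) for \(p\) gives \(p = 1/(1 + e^{-z})\).
# Invert the log-odds: log(p / (1 - p)) = z  <=>  p = 1 / (1 + exp(-z))
myplogis <- function(z) {
  1 / (1 + exp(-z))
}

# Compare to the built-in plogis() on a grid of 400 points between -4 and 4
z_grid <- seq(from = -4, to = 4, length.out = 400)
all.equal(myplogis(z_grid), plogis(z_grid))  # TRUE, up to numerical error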
Probabilities vs. Odds vs. Log Odds
Probabilities
Highly nonlinear; bounded below by 0 and above by 1
\(\mathbb{P}(A) = 1/2 \implies A\) and \(A^c\) are equally likely.
Odds
Multiplicative, bounded below by 0, unbounded above
Odds of 1 \(\implies A\) and \(A^c\) are equally likely
Log Odds
Additive, unbounded above and below
Log odds of 0 \(\implies A\) and \(A^c\) are equally likely.
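A quick sketch illustrating the three scales in R; qlogis() is the built-in log-odds function, i.e. the inverse of plogis():
p <- c(0.1, 0.5, 0.9)
odds <- p / (1 - p)      # 0.111, 1, 9: reciprocal pairs around odds of 1
log_odds <- qlogis(p)    # -2.197, 0, 2.197: symmetric around 0
cbind(p, odds, log_odds)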
Consider two people: one has \(X = x_2\) and the other has \(X = x_1\). Under the simple logistic regression model from above, what is the odds ratio for these two people? In other words, what is the ratio of the odds that \(Y=1\) for person two relative to those for person one?
Perhaps you’ve heard of the “divide by four rule.” It goes like this: if you divide a logistic regression coefficient by four, you’ll obtain the maximum possible effect of changing the associated regressor by one unit on the predicted probability that \(Y = 1\). Where does this rule come from?
The quantity \(\beta \times \texttt{dlogis}(\alpha + \beta x)\) is usually called the partial effect of \(x\). I write \(x\) to denote a realization of the random variable \(X\), so \(x\) is constant whereas \(X\) is random. In contrast, the average partial effect is defined as \(E[\beta \times \texttt{dlogis}(\alpha + \beta X)]\), while the partial effect at the average is \(\beta \times \texttt{dlogis}(\alpha + \beta E[X])\). Do these two coincide in general? Why or why not? What question does each of them answer?
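Some sketches related to the three questions above, assuming the simple logistic regression model \(\mathbb{P}(Y=1|X=x) = \texttt{plogis}(\alpha + \beta x)\). Since \(\text{Odds}(Y=1|X=x) = \exp(\alpha + \beta x)\), the odds ratio comparing person two to person one is \[
\frac{\text{Odds}(Y=1|X=x_2)}{\text{Odds}(Y=1|X=x_1)} = \frac{\exp(\alpha + \beta x_2)}{\exp(\alpha + \beta x_1)} = \exp\left\{\beta(x_2 - x_1)\right\}.
\]
For the divide-by-four rule, note that \(\partial\, \texttt{plogis}(\alpha + \beta x)/\partial x = \beta \times \texttt{dlogis}(\alpha + \beta x)\), and the logistic density \(\texttt{dlogis}(z) = e^{-z}/(1 + e^{-z})^2\) attains its maximum of \(1/4\) at \(z = 0\), so the partial effect can never exceed \(|\beta|/4\) in absolute value. Finally, a small R sketch contrasting the average partial effect with the partial effect at the average; the parameter values and regressor draws below are made up purely for illustration:
set.seed(1234)
alpha <- 0.5                             # assumed intercept
beta  <- 1                               # assumed slope
x     <- rnorm(500, mean = 1.5, sd = 2)  # hypothetical regressor draws

ape <- mean(beta * dlogis(alpha + beta * x))  # average partial effect
pea <- beta * dlogis(alpha + beta * mean(x))  # partial effect at the average
c(APE = ape, PEA = pea)  # these generally differ: dlogis() is nonlinear in x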
Logit Regression Simulation
Direct Approach
Simulate regressors \(X_i\) and choose parameters \(\beta\); compute \(p_i = \texttt{plogis}(X_i'\beta)\) and draw \(Y_i \sim \text{Bernoulli}(p_i)\).
My simulation code from above generated data from a logistic regression model using the “latent variable” approach. Write code to accomplish the same result using the direct approach.
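Here is a minimal sketch of the direct approach; the sample size and parameter values are assumptions chosen only for illustration, not necessarily those used in the original simulation:
set.seed(4321)
n     <- 500
alpha <- 0.5   # assumed intercept
beta  <- 1     # assumed slope
x     <- rnorm(n, mean = 1.5, sd = 2)

# Direct approach: compute p_i = plogis(alpha + beta * x_i), then draw
# Y_i ~ Bernoulli(p_i). (The latent variable approach would instead set
# y <- 1 * (alpha + beta * x + rlogis(n) > 0); both give the same model.)
p <- plogis(alpha + beta * x)
y <- rbinom(n, size = 1, prob = p)
mydat <- data.frame(x, y)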
The family argument of glm() describes the error distribution & link function
family = binomial(link = 'logit')
family = binomial(link = 'probit')
family = poisson(link = 'log')
Compatible with tidy(), glance() and augment()
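As a quick illustration of the broom compatibility, here is a hedged sketch using a toy fit with made-up data; any glm object works the same way:
library(broom)

toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, size = 1, prob = plogis(toy$x))
fit <- glm(y ~ x, family = binomial(link = 'logit'), data = toy)

tidy(fit)     # one row per coefficient: estimate, std. error, statistic, p-value
glance(fit)   # one-row model summary: deviance, AIC, BIC, etc.
augment(fit)  # original data plus fitted values, residuals, etc.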
Running a Logistic Regression in R
lreg <- glm(y ~ x, family = binomial(link = 'logit'), data = mydat)
summary(lreg)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = mydat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3630 0.1344 2.700 0.00693 **
x 0.9638 0.1004 9.596 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 555.65 on 499 degrees of freedom
Residual deviance: 381.74 on 498 degrees of freedom
AIC: 385.74
Number of Fisher Scoring iterations: 6
💪 Exercise D - (10 min)
Construct approximate 95% confidence intervals for the parameters \(\beta_0\) and \(\beta_1\) based on the logistic regression output from above. Do your confidence intervals include the true parameter values that we used to simulate the data?
Interpret the estimated slope coefficient from lreg.
Try using coef() with lreg. What do you get? Does it work as expected? Now try the broom functions tidy() and glance() along with the function modelsummary() from the modelsummary package. What do you get?
As discussed above, \(\beta\) is not the derivative of \(\texttt{plogis}(\alpha + \beta x)\) with respect to \(x\).
Use the “divide by 4” rule to calculate the maximum possible partial effect of \(x\) on the predicted probability that \(Y = 1\) using the results of lreg.
Calculate the partial effect of \(x\) on the predicted probability that \(Y = 1\), evaluated at the sample mean value of \(X\) (the partial effect at the average).
Calculate the average partial effect of \(x\) over the observed sample.
Compare your answers to (a), (b), and (c).
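A sketch of one possible approach to parts of this exercise, assuming the lreg and mydat objects from above; the confidence intervals use the usual normal approximation, estimate ± 1.96 × standard error:
# Approximate 95% confidence intervals: estimate +/- qnorm(0.975) * SE
est <- coef(lreg)
se  <- sqrt(diag(vcov(lreg)))
cbind(lower = est - qnorm(0.975) * se,
      upper = est + qnorm(0.975) * se)

# (a) Divide-by-4 rule: maximum possible partial effect of x
coef(lreg)['x'] / 4

# (b) Partial effect at the average, evaluated at the sample mean of x
coef(lreg)['x'] * dlogis(coef(lreg)['(Intercept)'] + coef(lreg)['x'] * mean(mydat$x))

# (c) Average partial effect over the observed sample
mean(coef(lreg)['x'] * dlogis(predict(lreg, type = 'link')))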
Predicted Probabilities
predict() works with glm() almost as it does for lm()
To get predicted probabilities, set type = 'response'
# Estimate of P(Y = 1 | X = 0)
predict(lreg, newdata = data.frame(x = 0), type = 'response')
1
0.5897566
# Estimate of P(Y = 1 | X = mean(x))
predict(lreg, newdata = data.frame(x = mean(x)), type = 'response')
1
0.8596206
# Estimate of P(Y = 1 | X = observed x-values)
p_hat <- predict(lreg, type = 'response')
head(p_hat)
ggplot(mydat, aes(x, y)) +
  stat_smooth(method = 'glm', method.args = list(family = "binomial")) +
  geom_jitter(width = 0.5,   # noise in x-coordinate
              height = 0.1)  # noise in y-coordinate
💪 Exercise E - (\(\infty\) min)
Data from the “two truths and a lie” experiment described at the beginning of these slides are available from two-truths-and-a-lie-2022-cleaned.csv in the data directory of my website: https://ditraglia.com/data/.
Read the dataset directly into R from the web, storing it in a tibble called two_truths.
Run a logistic regression that predicts guessed_right based on certainty. Make a nicely-formatted table of regression results using modelsummary() and comment briefly on your findings.
Use ggplot() to depict the regression from part 2, adding jittering to make the raw data clearly visible. You may need to adjust the width and height parameters of geom_jitter().
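A sketch of one way to approach this exercise. I am assuming the file can be read directly from the URL given above and that the columns guessed_right and certainty appear exactly as described; the object name tt_reg is my own choice, and the jitter widths may need adjusting:
library(tidyverse)
library(modelsummary)

# Part 1: read the data directly from the web
two_truths <- read_csv('https://ditraglia.com/data/two-truths-and-a-lie-2022-cleaned.csv')

# Part 2: logistic regression of guessed_right on certainty
# (if guessed_right is stored as TRUE/FALSE or "yes"/"no", convert it to 0/1 first)
tt_reg <- glm(guessed_right ~ certainty, family = binomial(link = 'logit'),
              data = two_truths)
modelsummary(tt_reg)

# Part 3: plot the fitted logistic regression with jittered raw data
ggplot(two_truths, aes(x = certainty, y = guessed_right)) +
  stat_smooth(method = 'glm', method.args = list(family = 'binomial')) +
  geom_jitter(width = 0.2, height = 0.1)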