Statistical Inference

Defense Against the Dark Arts

Francis J. DiTraglia

University of Oxford

Epigraph

If your only tool is a hammer, then every problem looks like a nail. – Anonymous

If philosophy is outlawed, only outlaws will do philosophy. – Andrew Gelman

What is this?

Warmup

💪 Exercise A - (5 min)

Two researchers carried out independent studies to answer the same research question. The first reports an effect estimate and standard error of 25 ± 10. The second reports 10 ± 10.

  1. Are the results of the first study statistically significant at traditional significance thresholds?
  2. What about the results of the second study?
  3. Is there a statistically significant difference between the results of the studies?

Don’t compare significance!

  • Just because one effect is significantly different from zero and another isn’t, that doesn’t mean the two effects are significantly different from each other.
  • To compare effects, compute the standard error of the difference.
  • This error is alarmingly common: don’t fall prey to it!
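As a minimal sketch, the comparison can be checked numerically with the figures from Exercise A. It assumes the two estimates are independent and approximately normal, so the variance of their difference is the sum of their variances. The variable names are illustrative.

```r
# Estimates and standard errors from Exercise A
est <- c(study1 = 25, study2 = 10)
se  <- c(study1 = 10, study2 = 10)

# z-statistics and two-sided p-values for each study against zero
z_each <- est / se
p_each <- 2 * pnorm(-abs(z_each))

# Difference between the studies: under independence the variances add
diff_est <- est[["study1"]] - est[["study2"]]
diff_se  <- sqrt(sum(se^2))
z_diff   <- diff_est / diff_se
p_diff   <- 2 * pnorm(-abs(z_diff))

round(c(z_each, difference = z_diff), 2)
round(c(p_each, difference = p_diff), 3)
```

The first study has a z-statistic of 2.5 and the second of 1, but the z-statistic for their difference is only about 1.06.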

💪 Exercise B - (4 min)

True or False. If false, explain.

  1. A small p-value indicates the presence of a large effect.
  2. I tested the null that my treatment has no effect against the one-sided alternative that it is effective. My p-value was 0.01. Hence there is a 99% chance that the treatment is effective and a 1% chance that it is ineffective or harmful.

The Logic of Hypothesis Testing

Questions addressed with Statistics:

  1. What should I believe?
  2. How should I act?
  3. What counts as evidence for my theory?
  • Bayesian Approach centers on question of rational belief (1)
    • Both (2) and (3) follow under model of rational choice.
  • Frequentist Approach generally uncomfortable with (1).
    • Neyman-Pearson: (3) is impossible, statistics is about (2).
    • Fisher-Mayo: (2) is limiting, (3) is crucial to science.
    • Common practice: weird combo of Fisher/Neyman

Deductive Logic

  • Valid Argument: true premises \(\implies\) true conclusions.
    • “Valid arguments are risk-free arguments”
  • Modus Ponens: If A, then B. A, therefore B.
    • All ravens are black. Mordecai is a raven. Therefore Mordecai is black.
  • Modus Tollens: If A, then B. Not B, therefore not A.
    • All ravens are black. Django is purple. Therefore Django is not a raven.

Inductive Logic

Inductive logic studies risky arguments. A risky argument can be a very good one, and yet its conclusion can be false, even when all the premises are true. Most of our arguments are risky.

  • Sample to Population
    • X is true in my sample. My sample is representative of the population, so X is probably true in the population.

Fisher’s Disjunction

  • I tested \(H_0\) and obtained a p-value of 0.01.
  • If \(H_0\) is true, this should only happen 1% of the time.

Either an exceptionally rare chance has occurred, or the theory [\(H_0\)] is not true. – Fisher

  • Resembles modus tollens but it’s a risky argument!

Warning

Alternatively: assumptions used to compute p are wrong and nothing rare happened!

How this is usually interpreted.

If \(H_0\) is true, then this result (statistical significance) would probably not occur. This result has occurred. Then \(H_0\) is probably not true.

A logically equivalent argument:

If a person is an American [\(H_0\) is true], then he is probably not a member of Congress. (TRUE, RIGHT?) This person is a member of Congress. Therefore, he is probably not an American [\(H_0\) is probably not true].

How exactly are the data exceptional?

A spectacular vindication of the principle that each individual coin spun individually (he spins one) is as likely to come down heads as tails and therefore it should cause no surprise each individual time it does.

Rosencrantz & Guildenstern are Dead

P(HHHHHHHH)= P(TTTTHHTH) = P(HTHTHTHT) = 1/256

Inferring from an unlikely result to the presence of a significant effect presupposes that the observed result is much more likely under an implicitly conceived alternative than under the null. – Sprenger (2016)

\(H_0\colon\) Alice doesn’t have breast cancer.

One in a hundred women has breast cancer. If you have breast cancer (\(H_1\)), there is a 95% chance that you will test positive; if you do not have breast cancer \((H_0)\), there is a 2% chance that you will nonetheless test positive. We know nothing about Alice other than the fact that she tested positive. How likely is it that she has breast cancer?

  • p-value = \(\mathbb{P}(\text{Data}|H_0) = 0.02 < 0.05\) so reject \(H_0\).
  • \(\text{Odds}(H_0|\text{Data}) \approx 2 \implies \mathbb{P}(H_0|\text{Data}) \approx 2/3\).
  • Only when \(H_0\) isn’t a priori too likely does a low p-value give strong evidence against it.
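The posterior odds quoted above follow from Bayes' rule in odds form: posterior odds equal prior odds times the likelihood ratio. A short sketch using only the numbers given in the example (the variable names are illustrative):

```r
prior_h1 <- 0.01          # P(H1): one in a hundred women has breast cancer
prior_h0 <- 1 - prior_h1  # P(H0)
p_pos_h1 <- 0.95          # P(positive test | H1)
p_pos_h0 <- 0.02          # P(positive test | H0)

# Posterior odds of H0 = prior odds of H0 x likelihood ratio
post_odds_h0 <- (prior_h0 / prior_h1) * (p_pos_h0 / p_pos_h1)
post_prob_h0 <- post_odds_h0 / (1 + post_odds_h0)

post_odds_h0  # roughly 2
post_prob_h0  # roughly 2/3
```

Setting `prior_h1` closer to 0.5 shows how the same test result becomes much stronger evidence against \(H_0\) when the null is less likely a priori.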

What researchers really want

  • Want to talk about how likely \(H_0\) is in the light of data.
  • Bayesian approach allows this; not so easy for Frequentists
  • Neyman-Pearson
    • Inductive logic is an impossible dream: give up!
    • Instead: inductive behavior
    • Use data to decide which of \(H_0\) and \(H_1\) to accept and which to reject
    • Design a rule so we are likely to reject false and unlikely to reject true hypotheses

Neyman-Pearson framework

  • Two kinds of errors:
    • Type I: rejecting true \(H_0\)
    • Type II: failing to reject false \(H_0\) (failing to accept true \(H_1\))
  • Traditionally: fix Type I error rate at \(\alpha\) and minimize Type II error rate subject to this constraint.
  • If \(H_0\), \(H_1\) and \(\alpha\) are chosen in light of costs/benefits, this is a simple decision-theoretic approach

Decision Theory

Classical statistics is directed towards the use of sample information … in making inferences about \(\theta\). These classical inferences are for the most part without regard to the use to which they are put. In decision theory, on the other hand, an attempt is made to combine the sample information with other relevant aspects of the problem in order to make the best decision.

Decision Theory Basics

  • Need to decide what action \(a \in \mathscr{A}\) to take.
  • Unfortunately the state of nature \(\theta \in \Theta\) is unknown.
  • Incur loss \(L(\theta, a)\) if state of nature is \(\theta\) and we choose \(a\).
  • Observe data \(X\) from a distribution that depends on \(\theta\).
  • Decision rule \(\delta(x)\) is a function that tells us which action to take if we observe data \(x\).
  • Roughly speaking: try to choose a decision rule so that we will minimize the average loss incurred.

Skepticism about losses from Fisher

In the field of pure research no assessment of the cost of wrong conclusions … can conceivably be more than a pretence, and in any case such an assessment would be inadmissible and irrelevant in judging the state of the scientific evidence.

In other words: Fisher thinks Neyman is missing the point of science. It’s not generally about solving decision problems, even if such problems do genuinely arise in some areas.

A More Pedestrian Problem

  • Neyman-Pearson approach sounds like decision theory, but the way it is used in practice is purely conventional.
  • \(\alpha = 0.05\) is totally arbitrary (so is power of 80%)
  • Type I error assumed “worse” than Type II, but \(H_0\) is usually chosen purely for mathematical convenience:
    • \(H_0:\) This drug to treat terminal cancer has no effect.
    • FDA only approves treatment if we reject \(H_0\).
    • Which error is worse for a patient with terminal cancer?

Statistical Power

💪 Exercise C - (6 min)

  1. Suppose that \(Z \sim \text{N}(0,1)\) and \(\kappa\) and \(c\) are constants. Write a line of R code to compute each of the following:
    1. \(\mathbb{P}(Z + \kappa < -c)\)
    2. \(\mathbb{P}(Z + \kappa > c)\)
    3. \(\mathbb{P}(|Z + \kappa| > c)\)
  2. Suppose \(\widehat{\theta} \sim \text{N}(\theta, \text{SE}^2)\) and consider a test of \(H_0\colon \theta = \theta_0\). If the null is false what is the distribution of the test statistic \(T \equiv (\widehat{\theta} - \theta_0)/\text{SE}\)?

Statistical Power

  • Size: probability of rejecting \(H_0\) given that it is true.
    • Also called Type I error rate, significance level
  • Power: probability of rejecting \(H_0\) given that it is false.
    • 1 - (Type II Error rate)
  • Size is easier to compute: with a point null there’s only one way for \(H_0\) to be true, but infinitely many ways for it to be false!
  • More precisely: power against a particular alternative
    • Given \(\alpha\) and a decision rule, if the true parameter value is \(\theta \neq \theta_0\), what is the probability of rejecting \(H_0\colon \theta = \theta_0\)?

Power Calculation

Suppose that \(\widehat{\theta} \sim \text{N}(\theta, \text{SE}^2)\) and define the shorthand \[ T \equiv \frac{\widehat{\theta} - \theta_0}{\text{SE}}, \quad \kappa \equiv (\theta - \theta_0) / \text{SE}, \quad c_p \equiv \texttt{qnorm}(1 - p) \] so that \(T = Z + \kappa\) where \(Z \sim \text{N}(0,1)\).

One-sided test of \(H_0\colon \theta = \theta_0\) versus \(H_1\colon \theta > \theta_0\) \[ \mathbb{P}(T>c_{\alpha}) = \mathbb{P}(Z + \kappa > c_\alpha) = 1 - \texttt{pnorm}(c_\alpha - \kappa) \]

Two-sided test of \(H_0\colon \theta = \theta_0\) versus \(H_1 \colon \theta \neq \theta_0\) \[ \mathbb{P}(|T|>c_{\alpha/2}) = \mathbb{P}(|Z+\kappa|>c_{\alpha/2}) = \texttt{pnorm}(-c_{\alpha/2} - \kappa) + 1 - \texttt{pnorm}(c_{\alpha/2}-\kappa) \]
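The two power formulas above translate directly into R. This is a sketch: the function names `power_one_sided()` and `power_two_sided()` are made up for illustration and are not part of base R.

```r
# Power of the one-sided test (H1: theta > theta0) at level alpha,
# where kappa = (theta - theta0) / SE
power_one_sided <- function(kappa, alpha = 0.05) {
  c_alpha <- qnorm(1 - alpha)
  1 - pnorm(c_alpha - kappa)
}

# Power of the two-sided test (H1: theta != theta0) at level alpha
power_two_sided <- function(kappa, alpha = 0.05) {
  c_half <- qnorm(1 - alpha / 2)
  pnorm(-c_half - kappa) + 1 - pnorm(c_half - kappa)
}

power_one_sided(0)    # kappa = 0 means the null is true, so this is alpha
power_two_sided(0)    # likewise alpha
power_one_sided(2.5)
power_two_sided(2.5)
```

When \(\kappa = 0\) both functions return \(\alpha\), which is a useful sanity check: the "power" at the null is the size of the test.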

💪 Exercise D - (10 min)

\(X_1, \dots, X_n \sim \text{N}(\mu, \sigma^2)\); estimate \(\mu\) using \(\bar{X}_n\)

  1. How does \(\text{SE}(\bar{X}_n)\) depend on \(n\) and \(\sigma\)?
  2. Let \(H_0\colon \mu = 0\). What is \(\kappa\) in this example?
  3. Continuing from 2, plot the power of a one-sided test with \(\alpha = 0.05\), \(n = 100\) and \(\sigma^2 = 25\) as a function of \(\mu\).
  4. Suppose that \(\mu = \sigma/5\). Plot the power of a one-sided test with \(\alpha = 0.05\) as a function of \(n\).

Power Calculations

  • Traditional Approach
    • Choose null \(\mu_0\), alternative hypothesis, and \(\alpha\)
    • Suppose true mean is \(\mu\) and population variance is \(\sigma^2\)
    • How large a sample size do I need to get Power = 80%?
  • Minimum Detectable Effect
    • Choose null \(\mu_0\), alternative hypothesis, and \(\alpha\)
    • Suppose I use a sample size of \(n\).
    • What is the smallest value of \(\mu\) such that Power = 80%?
  • Base R function power.t.test() does both for t-tests.
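A sketch of both calculations with `power.t.test()`, which solves for whichever of `n`, `delta`, `sd`, `power` or `sig.level` is left unspecified. The numbers here (a true mean of 1 against a null of 0, standard deviation 5, a one-sample one-sided test) are illustrative assumptions, not values from a real study.

```r
# Traditional approach: sample size needed to reach 80% power
n_needed <- power.t.test(delta = 1, sd = 5, sig.level = 0.05, power = 0.8,
                         type = "one.sample", alternative = "one.sided")
n_needed$n

# Minimum detectable effect: smallest delta with 80% power when n = 100
mde <- power.t.test(n = 100, sd = 5, sig.level = 0.05, power = 0.8,
                    type = "one.sample", alternative = "one.sided")
mde$delta
```

Here `delta` is the difference between the true mean and the null value, so with \(\mu_0 = 0\) it is simply \(\mu\).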

Everything is significant!

  • Implausible that any effect in social science is exactly zero.
  • Previous example: \(\mu_0 = 0\) and \(\kappa = \sqrt{n} \mu/\sigma\)
  • Even if \(\mu\) is tiny, power approaches one as \(n\) grows.
  • In large enough samples we will certainly reject the null!
  • Advice: prefer confidence intervals, think about magnitudes
  • Economic / Practical rather than Statistical Significance
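To see how quickly this happens, the sketch below evaluates the power of the one-sided test with \(\kappa = \sqrt{n}\mu/\sigma\) over a range of sample sizes. The standardized effect \(\mu/\sigma = 0.01\) is a hypothetical "tiny" value chosen for illustration.

```r
alpha <- 0.05
mu_over_sigma <- 0.01   # hypothetical tiny standardized effect
n <- 10^(2:6)           # sample sizes from 100 to 1,000,000

kappa <- sqrt(n) * mu_over_sigma
pow <- 1 - pnorm(qnorm(1 - alpha) - kappa)

data.frame(n = n, kappa = kappa, power = round(pow, 3))
```

With these assumptions power barely exceeds \(\alpha\) at \(n = 100\) but is essentially one at \(n = 10^6\), even though the effect itself has not changed.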

Of Beauty, Sex, and Power

Kanazawa (2007): Beautiful people have more daughters

  • National Longitudinal Study of Adolescent Health
  • Sample size of 2,972 individuals with at least one child
  • Observe attractiveness rating (1-5) and sex of first-born.
  • Highest attractiveness: 56% girls; All others (1-4): 48% girls

Such a large difference is implausible.

There is a large literature on variation in the sex ratio of human births, and the effects that have been found have been on the order of 1 percentage point (for example, the probability of a girl birth shifting from 48.5 percent to 49.5 percent). Variation attributable to factors such as race, parental age, birth order, maternal weight, partnership status and season of birth is estimated at from less than 0.3 percentage points to about 2 percentage points, with larger changes (as high as 3 percentage points) arising under economic conditions of poverty and famine.

The Statistical Significance Filter

  • Filedrawer Bias: researchers are less likely to report findings that are not statistically significant.
  • Publication Bias: journals are less likely to publish papers with statistically insignificant results.
  • Why Most Published Findings are False.
  • But there’s a more nuanced way of looking at this…

Kanazawa (2007) has no power

  • An effect of half a percentage point would be very large.
  • Ballpark: equal sized groups, symmetric deviation from 0.5
  • At Kanazawa’s sample size, power is around 0.06
power.prop.test(n = 2972 / 2, p1 = 0.4975, p2 = 0.5025, strict = TRUE)

     Two-sample comparison of proportions power calculation 

              n = 1486
             p1 = 0.4975
             p2 = 0.5025
      sig.level = 0.05
          power = 0.05855136
    alternative = two.sided

NOTE: n is number in *each* group

Type M and Type S Errors

If your test has low power and you reject the null:

  • Type M Error
    • The magnitude of estimated effect is greatly exaggerated
  • Type S Error
    • Good chance that the sign of estimated effect is wrong.

Source: Andrew Gelman
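A simulation sketch of both error types in a low-power setting resembling the ballpark calculation for Kanazawa (2007). It assumes a true difference of half a percentage point, two groups of 1,486, and a normal approximation to the sampling distribution of the estimated difference. The variable names and the number of simulation draws are illustrative choices.

```r
set.seed(1234)
true_effect <- 0.005                # assumed true difference in proportions
se <- sqrt(2 * 0.5 * 0.5 / 1486)    # approximate SE of the estimated difference
n_sims <- 1e5

# Draw estimates from the (approximate) sampling distribution
estimate <- rnorm(n_sims, mean = true_effect, sd = se)

# Two-sided test of no difference at the 5% level
significant <- abs(estimate / se) > qnorm(0.975)

power_sim <- mean(significant)
# Type S: share of significant estimates with the wrong sign
type_s <- mean(estimate[significant] < 0)
# Type M: average exaggeration factor among significant estimates
type_m <- mean(abs(estimate[significant])) / true_effect

c(power = power_sim, type_s = type_s, type_m = type_m)
```

Under these assumptions an estimate can only be significant if it exceeds about 1.96 standard errors in absolute value, which is roughly seven times the assumed true effect, so every significant estimate is a large exaggeration and a sizeable share have the wrong sign.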