Introduction to Statistical Inference

Suppose that our goal is to figure out whether a coin is fair. We toss the coin 50 times, and it comes up heads 20 times. Does that mean the coin is unfair?

Not necessarily. It might be that we just got a little unlucky and got 20 instead of 25.

Let’s ask ourselves:

Assuming the coin is fair, what is the probability of obtaining a result at least as extreme as 20 heads?

In other words, how often would we expect 20 heads or fewer, or 30 heads or more, just by chance, even if the coin is fair?

We can compute this using

pbinom(q = 20, size = 50, prob = 0.5) + (1 - pbinom(q = 29, size = 50, prob = 0.5))
## [1] 0.2026388

So, about once every 5 times, we’d expect a result that’s at least as extreme as 20. Perhaps that’s not enough to conclude the coin is not fair.

Let’s visualize this with a histogram:

library(ggplot2)  # needed for ggplot, geom_bar, geom_vline

tosses <- rbinom(n = 10000, size = 50, prob = 0.5)
tosses.df <- data.frame(n = tosses)
ggplot(tosses.df) + geom_bar(mapping = aes(x = n, y = ..prop..)) + geom_vline(xintercept = 20, color = "red") + geom_vline(xintercept = 30, color = "red")

We can get the probability of values that are at least as extreme as 20 by adding up the columns outside the red lines.
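
We could also estimate this probability by simulation, just as we will do for the 500-toss case below. The estimate should come out close to the 0.2026 we computed with pbinom (sim.tosses is just a new variable name, so that we don't overwrite tosses):

sim.tosses <- rbinom(n = 10000, size = 50, prob = 0.5)   # 10000 simulated experiments of 50 tosses each
mean(sim.tosses <= 20 | sim.tosses >= 30)                # proportion at least as extreme as 20 heads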

What if things were the same, but with 500 tosses and 200 heads? Here is the histogram.

tosses <- rbinom(n = 10000, size = 500, prob = 0.5)
tosses.df <- data.frame(n = tosses)
ggplot(tosses.df) + geom_bar(mapping = aes(x = n, y = ..prop..)) + geom_vline(xintercept = 200, color = "red" ) + geom_vline(xintercept = 300, color = "red")

Clearly, the probability of getting a result as extreme as 200 heads with 500 tosses and a fair coin is essentially 0. But we can verify that:

pbinom(q = 200, size = 500, prob = 0.5) + (1 - pbinom(q = 300, size = 500, prob = 0.5))
## [1] 7.395812e-06

We can also confirm this via simulation. (With only 1000 simulated experiments and a true probability of about 7 in a million, we should not expect to see even one result this extreme, so the estimate below comes out to exactly 0.)

tosses <- rbinom(n = 1000, size = 500, prob = 0.5)
mean(tosses <= 200 | tosses >= 300)
## [1] 0

The p-value

What we just computed is known as the p-value: the probability of getting results at least as extreme as what we observe, if the null hypothesis is true.
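
In the coin example above, if \(X\) is the number of heads in 50 tosses of a fair coin, the p-value is \(P(X \le 20) + P(X \ge 30) \approx 0.20\), which is exactly what we computed with pbinom.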

In our case, the null hypothesis is that the coin is fair.

Small p-values provide evidence against the Null Hypothesis – a small p-value indicates that what we see is unlikely if the Null Hypothesis were true.

N.B.: the p-value is ABSOLUTELY NOT the probability that the null hypothesis is true.

(So what is the probability that the coin is fair, in our experiment?)

(Glib answer: 0 or 1, depending on whether the coin is fair or not. We’ll discuss this later again.)

The Null Hypothesis

The Null Hypothesis is the hypothesis that nothing interesting is going on – e.g., that the coin is fair, or that the new drug that we administered to a group of patients is no better than a placebo.

Project 1 variable selection

Suppose you added a variable, and classification performance on the validation set of 1000 examples improved from 65.0% to 65.5%. Is this a genuine improvement?

Null hypothesis: there is no improvement. (The probability of getting an example correct is still 65%)

Model: \(Prob(CR = x) = dbinom(x = x, size = 1000, prob = .65)\), where \(CR\) is the number of correctly classified validation examples out of 1000 (assuming the Null Hypothesis is true).

The probability of exactly 655 correct classifications (655 “heads”) out of 1000 trials:

dbinom(x = 655, size = 1000, prob = .65)
## [1] 0.02510964

The p-value is then the probability of observing an outcome that’s at least as far from 650 (the expected number correct under the Null Hypothesis, since 1000 × 0.65 = 650) as 655 is. In other words, it is Prob(x >= 655 or x <= 645):

pbinom(q = 645, size = 1000, prob = .65) + (1-pbinom(q = 654, size = 1000, prob = .65))
## [1] 0.7654586
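
As with the coin, we could also estimate this p-value by simulation; the estimate should come out close to the 0.765 we computed with pbinom (sim.correct is just an illustrative variable name):

sim.correct <- rbinom(n = 10000, size = 1000, prob = 0.65)   # simulated numbers correct under the Null Hypothesis
mean(sim.correct >= 655 | sim.correct <= 645)                # proportion at least as extreme as 655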

(Note: we made a number of simplifying assumptions. In particular, we assumed that we know that the performance of the previous model was exactly 65%, when actually we just had an estimate of the performance. We also assumed that running both classifiers on the validation set is like tossing a coin 1000 times, twice. But it’s not: someone who is misclassified by one classifier is fairly likely to be misclassified by the other classifier, too. This means that we can predict the outcome of the 5-th toss in the second experiment by looking at the outcome of the 5-th toss in the first experiment.)

Darwin’s Finches

Darwin’s Finches exhibit remarkable diversity, partly because the birds were isolated on different islands of the Galapagos archipelago.

Here, we will consider a dataset of the beak depth of a population of Darwin’s Finches in 1976 (before a drought) and in 1978 (after a drought). It was hypothesized that the drought, which influenced the kind of food that was available, would influence the beak depth of the birds that survived the drought.

Peter and Rosemary Grant, of Princeton University, studied the issue.

First, let’s look at the data:

# Need to run install.packages("Sleuth3") first
library(Sleuth3)
finches <- case0201
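
To get a quick sense of the data (the data frame should contain a Year column and a Depth column, which is what we use below), we might look at the first few rows:

head(finches)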

Let’s plot the beak depths in 1976 and 1978:

finches$Year <- as.character(finches$Year)
ggplot(finches, aes(x=Depth, fill = Year, y = ..prop..)) + 
    geom_bar(position = "dodge")

We might also display the same data as overlapping histograms:

ggplot(finches, aes(x=Depth, fill = Year, y = ..density..)) + 
    geom_histogram(position = "identity", alpha = 0.5, bins = 15)


Let’s also compute the means and standard deviations for the population:

library(dplyr)  # needed for %>%, group_by, summarize

finches %>% group_by(Year) %>% 
  summarize(mean = mean(Depth), sd = sd(Depth), count = n())
## # A tibble: 2 x 4
##   Year   mean    sd count
##   <chr> <dbl> <dbl> <int>
## 1 1976   9.47 1.04     89
## 2 1978  10.1  0.906    89

So, are the beaks different or not?

Null hypothesis: the beak depths are not different. The difference we observe is due to chance.

We’ll approximate the distributions of the depths in 1976 and 1978 using normal distributions with the means and sd’s that we computed. Let’s generate fake data that looks like our data.

fake.1976 <- rnorm(n = 89, mean = 9.47, sd = 1.04)
fake.1978 <- rnorm(n = 89, mean = 10.1, sd = 0.906)
depth <- c(fake.1976, fake.1978)
year <- c(rep("1976", 89), rep("1978", 89))
fake.df <- data.frame(Depth = depth, Year = year)
fake.df$Year <- as.character(fake.df$Year)
fake.df %>% group_by(Year) %>% 
  summarize(mean = mean(Depth), sd = sd(Depth), count = n())
## # A tibble: 2 x 4
##   Year   mean    sd count
##   <chr> <dbl> <dbl> <int>
## 1 1976   9.45 0.951    89
## 2 1978  10.1  0.847    89
ggplot(fake.df, aes(x=Depth,  fill = Year))  +
    geom_histogram(position = "dodge", bins = 30)

Now, let’s imagine the means are actually the same. How often would we get histograms that look this different just by chance?

Our null hypothesis is that the means are actually the same – we’ll estimate the common mean as mean(finches$Depth). The difference we observe is about 0.63. Let’s see how often we’d see a difference this large or larger just by chance.
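
(As a quick check on that 0.63 figure, we can compute the observed difference directly; from the group means above it should be about 10.1 - 9.47 = 0.63.)

mean(finches$Depth[finches$Year == "1978"]) - mean(finches$Depth[finches$Year == "1976"])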

# Simulate one experiment under the null hypothesis: both years are drawn from
# distributions with the same mean, and we record the difference in sample means.
ExpDepthDiff <- function(){
  fake.1976 <- rnorm(n = 89, mean = mean(finches$Depth), sd = 1.04)
  fake.1978 <- rnorm(n = 89, mean = mean(finches$Depth), sd = 0.906)
  depth <- c(fake.1976, fake.1978)
  year <- c(rep("1976", 89), rep("1978", 89))
  fake.df <- data.frame(Depth = depth, Year = year)
  fake.df$Year <- as.character(fake.df$Year)
  
  stats <- fake.df %>% group_by(Year) %>% summarize(mean = mean(Depth), sd = sd(Depth), count = n())
  return(stats$mean[1]-stats$mean[2])
}

diffs <- replicate(10000, ExpDepthDiff())
diffs.df <- data.frame(diff = diffs)
ggplot(diffs.df) + geom_histogram(mapping = aes(x = diff), bins = 30) + geom_vline(xintercept = -0.63, color = "red") + geom_vline(xintercept = 0.63, color = "red")

So, in our simulations, how often do we see a difference larger than the one we actually observed?

mean(abs(diffs) > 0.63)
## [1] 0

In 10,000 simulations under the null hypothesis, we never saw a difference as large as 0.63, so the estimated p-value is smaller than about 1/10000. We have very strong evidence against the null hypothesis that the two populations have the same average beak depth.