Let’s load the finches data again

library(Sleuth3)
library(tidyverse)   # for ggplot2 (plots) and dplyr (group_by, filter, %>%)
finches <- case0201

Approximation using the Normal Distribution

Let’s go back to the Binomial distribution. We’ll sample fake data and plot the histogram for a Binomial, and then do the same with a Normal distribution that has the same mean and standard deviation.

set.seed(0)
exp10 <- rbinom(n = 500000, size = 1000, prob = 0.7)
exp10.df <- data.frame(n.heads = exp10)
ggplot(exp10.df) + geom_bar(mapping = aes(x=n.heads, y = ..prop..)) + scale_x_continuous(limits = c(620, 780))

exp.norm <- rnorm(n = 50000, mean = 700, sd = 14.9)
exp.norm.df <- data.frame(r = round(exp.norm))
ggplot(exp.norm.df) + geom_bar(mapping = aes(x=r, y = ..prop..)) +  scale_x_continuous(limits = c(620, 780))
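
The Normal parameters used above come from the Binomial: the mean is \(np\) and the standard deviation is \(\sqrt{np(1-p)}\).

1000 * 0.7                    # mean: 700
sqrt(1000 * 0.7 * (1 - 0.7))  # sd: about 14.5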

Since the histograms are very similar, the cumulative distributions are similar as well:

pbinom(q = 693, size = 1000, prob = 0.7)
## [1] 0.3255677
pnorm(q = 693, mean = 700, sd = 14.9)
## [1] 0.3192494

This means that we can answer questions about the Binomial distribution using pnorm, as long as size is reasonably large and prob is not too close to 0 or 1 (so that both size × prob and size × (1 − prob) are fairly large).

So, what is the probability of observing at most 690 heads when tossing a coin 1000 times, with prob = 0.7?

pnorm(q = 690, mean = 700, sd = 14.9)
## [1] 0.2510654
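
We can double-check this against the exact Binomial probability:

pbinom(q = 690, size = 1000, prob = 0.7)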

Assuming that a coin is fair and we toss it 1000 times, what is the probability of getting a number of heads that’s as extreme or more extreme than 550?

\(sd = \sqrt{0.5\times(1-0.5)\times 1000} = 15.8\)

s <-  15.8
m <- 500
pnorm(450, mean = m, sd = s) + (1 - pnorm(549, mean = m, sd = s))
## [1] 0.001740072

Let’s double-check this against the exact Binomial probabilities:

pbinom(450, size = 1000, prob = 0.5) + (1 - pbinom(549, size = 1000, prob = 0.5))
## [1] 0.001730536

A symmetry

Note that since the distribution is symmetric, the probability of getting 450 or fewer heads is the same as the probability of getting 450 or fewer tails (i.e., 550 or more heads). So we could write the above much more compactly as

2 * pbinom(450, size = 1000, prob = 0.5)
## [1] 0.001730536

An approximation

Just like we approximated the Binomial distribution using a Gaussian, we can approximate the distribution of the (scaled) difference between the means of two samples drawn from Gaussian distributions (the Gaussian assumption is itself usually an approximation) using the t-distribution. Specifically, we can compute

\[T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^{2}/N_1 + s_2^{2}/N_2}}\]

Here, \(N_1\) and \(N_2\) are the sample sizes, and \(s_1\) and \(s_2\) are the sample standard deviations.

\(T\) would be distributed according to the \(t(\nu)\) distribution, where \(\nu\) is the number of degrees of freedom. (If the two groups were assumed to have equal variances, we would use \(\nu = N_1 + N_2 - 2\); here we estimate \(\nu\) with the formula below.) \(t(5), t(6), ...\) are all different distributions. They all have mean 0. (Why is this important to know?)
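
For instance, we can compare a tail probability across a few choices of \(\nu\); as the degrees of freedom grow, the t-distribution approaches the standard Normal.

pt(-2, df = 5)
pt(-2, df = 50)
pnorm(-2)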

The number of degrees of freedom \(\nu\) is approximated using

\[\nu = \frac{\left(\frac{s_1^{2}}{N_1} + \frac{s_2^{2}}{N_2}\right)^{2}}{\frac{s_1^{4}}{N_1^{2}(N_1-1)} + \frac{s_2^{4}}{N_2^{2}(N_2-1)}}\]

a <- finches %>% group_by(Year) %>% summarize(mean = mean(Depth), sd = sd(Depth), count = n())
N1 <- a$count[1]   # 1976 sample size
N2 <- a$count[2]   # 1978 sample size
mu1 <- a$mean[1]
mu2 <- a$mean[2]
sd1 <- a$sd[1]
sd2 <- a$sd[2]
t <- (mu1 - mu2)/sqrt(sd1^2/N1 + sd2^2/N2)
nu <- ((sd1^2/N1 + sd2^2/N2)^2)/(sd1^4/(N1^2 * (N1-1)) + sd2^4/(N2^2 * (N2-1)))
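
We can print the t-statistic and the estimated degrees of freedom; they should match the t and df that t.test reports below.

c(t.stat = t, df = nu)

The two-sided p-value is then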

pt(t, df = nu) + (1 - pt(-t, df = nu))
## [1] 8.739145e-06
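
Since the t-distribution is symmetric around 0 (the same kind of symmetry we used for the fair coin above), we could equivalently write this as

2 * pt(-abs(t), df = nu)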

We can also use the built-in t.test function.

t.test(filter(finches, Year == "1976")$Depth, filter(finches, Year == "1978")$Depth, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  filter(finches, Year == "1976")$Depth and filter(finches, Year == "1978")$Depth
## t = -4.5833, df = 172.98, p-value = 8.739e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9564436 -0.3806350
## sample estimates:
## mean of x mean of y 
##  9.469663 10.138202
a <- t.test(filter(finches, Year == "1976")$Depth, filter(finches, Year == "1978")$Depth, alternative = "two.sided", var.equal = FALSE)
a$p.value
## [1] 8.739145e-06
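
The other quantities in the printout are stored in the same object:

a$statistic   # the t-statistic
a$parameter   # the (Welch) degrees of freedom
a$conf.int    # the 95% confidence interval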

What is t-distributed?

Again, what is t-distributed is the t-statistic \(T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^{2}/N_1 + s_2^{2}/N_2}}\). That is, if we repeatedly caught 89 finches in each of the two years and measured their beak depths, we’d get a different \(T\) every time. This \(T\) would be t-distributed.
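
To make this concrete, here is a minimal simulation sketch (the Normal(10, 1) populations and the 10,000 repetitions are made-up choices for illustration, not part of the finch data): we repeatedly draw two samples of 89, compute \(T\), and check that its tail behaviour matches the t-distribution.

set.seed(1)
sim.T <- replicate(10000, {
  x1 <- rnorm(n = 89, mean = 10, sd = 1)  # both groups share the same population mean
  x2 <- rnorm(n = 89, mean = 10, sd = 1)
  (mean(x1) - mean(x2))/sqrt(var(x1)/89 + var(x2)/89)
})
mean(sim.T <= -2)         # proportion of simulated T values at or below -2
pt(-2, df = 89 + 89 - 2)  # the t-distribution's probability of the same event

The two numbers should be close, which is exactly what it means for \(T\) to be t-distributed.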