Let’s load the finches data again:

library(Sleuth3)
library(dplyr)   # provides %>%, group_by(), summarize(), and filter() used below
finches <- case0201

The sampling distribution of the sample mean

Suppose a population is normally distributed according to \(\mathcal{N}(\mu, \sigma^2)\). We said before that the mean of a sample of size \(n\) is then normally distributed according to

\[\bar{X}\sim\mathcal{N}(\mu, \sigma^2/n)\]

From that, we concluded that the standardized mean follows the standard normal distribution:

\[\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim \mathcal{N}(0, 1)\]

But that is only useful if we know \(\sigma\), which we usually do not. What if we have to estimate the standard deviation from the sample?

Estimating it is easy: the sample standard deviation \(s\) is the square root of the sample variance,

\[s^2 = \frac{\sum_i (X_i-\bar{X})^2}{n-1}\]
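As a quick check, the formula can be computed by hand and compared with R's built-in `var()` and `sd()`. This is only a sketch: the vector `x` below is made-up illustration data, not part of the finches dataset.

```r
# Hypothetical sample, for illustration only
x <- c(9.1, 10.4, 8.7, 9.9, 10.2, 9.5)
n <- length(x)

# Sample variance with the n - 1 denominator
s2 <- sum((x - mean(x))^2) / (n - 1)
s  <- sqrt(s2)   # sample standard deviation

# Both comparisons should be TRUE: var() and sd() also divide by n - 1
all.equal(s2, var(x))
all.equal(s, sd(x))
```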

But now, with \(s\) in place of \(\sigma\),

\[\frac{\bar{X}-\mu}{s/\sqrt{n}}\sim t(n-1)\]

\(t(n-1)\) is the t-distribution with \(n-1\) degrees of freedom. “Degrees of freedom” is a parameter of the distribution, just as \(\mu\) and \(\sigma\) are parameters of the normal distribution. For reasonably large \(n\), the t-distribution looks very much like the standard normal distribution \(\mathcal{N}(0, 1)\). (How can we deduce this from what we already know?)
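One way to see the convergence numerically (a sketch, not a proof) is to compare quantiles of the t-distribution with the matching standard normal quantile as the degrees of freedom grow. The choice of the 97.5% quantile and of the particular `dfs` values is arbitrary.

```r
# 97.5% quantiles of t(df) for increasing degrees of freedom
dfs <- c(2, 5, 10, 30, 100, 1000)
t_quantiles <- qt(0.975, df = dfs)

# The corresponding standard normal quantile (about 1.96)
z_quantile <- qnorm(0.975)

# The gap between the two shrinks toward zero as df grows
data.frame(df = dfs, t_quantile = t_quantiles, gap = t_quantiles - z_quantile)
```

The t quantiles are always larger than the normal one, because the t-distribution has heavier tails, but the difference becomes negligible for large samples.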

Problem

We measured the heights of 100 randomly selected male students. The sample mean is 175 cm, and the sample standard deviation is 10 cm. Can we reject the null hypothesis that the average height in the population is 177 cm?
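One possible way to work this out in R is sketched below. It assumes a two-sided test at the 5% level, which the problem statement does not specify; the variable names are chosen here for illustration.

```r
n    <- 100   # sample size
xbar <- 175   # sample mean (cm)
s    <- 10    # sample standard deviation (cm)
mu0  <- 177   # mean under the null hypothesis (cm)

# t-statistic: (175 - 177) / (10 / sqrt(100)) = -2
t_stat <- (xbar - mu0) / (s / sqrt(n))

# Two-sided p-value from the t(n - 1) distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
p_value
```

The t-statistic is -2, and the p-value comes out just under 0.05, so under these assumptions the null hypothesis would be rejected, though only narrowly.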

Another approximation

We can test for a difference between the means of two populations whose individuals are normally distributed (this is itself an approximation, usually) using the t-distribution. The test statistic is

\[T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^{2}/n_1 + s_2^{2}/n_2}}\]

Here, \(n_1\) and \(n_2\) are the sample sizes, and \(s_1\) and \(s_2\) are the sample standard deviations.

If the two population means are equal, \(T\) approximately follows the \(t(\nu)\) distribution.

The number of degrees of freedom \(\nu\) is approximated using the Welch–Satterthwaite formula:

\[\nu = \frac{\left(\frac{s_1^{2}}{n_1} + \frac{s_2^{2}}{n_2}\right)^{2}}{\frac{s_1^{4}}{n_1^{2}(n_1-1)} + \frac{s_2^{4}}{n_2^{2}(n_2-1)}}\]

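# Per-year mean, standard deviation, and sample size of beak depth
a <- finches %>% group_by(Year) %>%
  summarize(mean = mean(Depth), sd = sd(Depth), count = n())
# Row 1 is 1976, row 2 is 1978
N1 <- a$count[1]
N2 <- a$count[2]
mu1 <- a$mean[1]
mu2 <- a$mean[2]
sd1 <- a$sd[1]
sd2 <- a$sd[2]
# Welch t-statistic (named t_stat to avoid masking base R's t() function)
t_stat <- (mu1 - mu2) / sqrt(sd1^2/N1 + sd2^2/N2)
# Welch-Satterthwaite degrees of freedom
nu <- (sd1^2/N1 + sd2^2/N2)^2 / (sd1^4/(N1^2 * (N1-1)) + sd2^4/(N2^2 * (N2-1)))

# Two-sided p-value; using -abs() makes this correct whatever the sign of t_stat
2 * pt(-abs(t_stat), df = nu)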
## [1] 8.739145e-06

We can also use the built-in t.test function.

t.test(filter(finches, Year == "1976")$Depth, filter(finches, Year == "1978")$Depth, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  filter(finches, Year == "1976")$Depth and filter(finches, Year == "1978")$Depth
## t = -4.5833, df = 172.98, p-value = 8.739e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9564436 -0.3806350
## sample estimates:
## mean of x mean of y 
##  9.469663 10.138202

# The p-value can also be extracted from the object that t.test returns
a <- t.test(filter(finches, Year == "1976")$Depth, filter(finches, Year == "1978")$Depth, alternative = "two.sided", var.equal = FALSE)
a$p.value
## [1] 8.739145e-06

What is t-distributed?

Again, what is t-distributed is the t-statistic \(T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^{2}/n_1 + s_2^{2}/n_2}}\). That is, if we repeatedly caught 89 finches in each of the two years and measured their beaks, we’d get a different \(T\) every time. If the true mean beak depth were the same in both years, these values of \(T\) would (approximately) follow the \(t(\nu)\) distribution.
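The thought experiment of repeated sampling can be sketched as a simulation. The population parameters `mu` and `sigma` below are hypothetical values chosen for illustration, not estimates from the finches data, and both groups are drawn from the same normal population so that the null hypothesis holds.

```r
set.seed(1)

# Hypothetical population parameters; both "years" share the same mean
mu <- 10
sigma <- 1
n <- 89

# One replicate: draw two samples of size n and compute the t-statistic
one_T <- function() {
  x1 <- rnorm(n, mean = mu, sd = sigma)
  x2 <- rnorm(n, mean = mu, sd = sigma)
  (mean(x1) - mean(x2)) / sqrt(var(x1) / n + var(x2) / n)
}

T_sim <- replicate(10000, one_T())

# Compare simulated quantiles with those of t(2n - 2), a close stand-in for
# t(nu) here because the sample sizes and population variances are equal
probs <- c(0.025, 0.5, 0.975)
rbind(simulated = quantile(T_sim, probs),
      theoretical = qt(probs, df = 2 * n - 2))
```

If the approximation is good, the two rows should be close, and roughly 5% of the simulated statistics should fall outside the theoretical 2.5% and 97.5% quantiles.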