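Let’s load the finches data again:
library(Sleuth3)  # provides the case0201 finch beak-depth data
library(dplyr)    # provides %>%, group_by, summarize, and filter used below
finches <- case0201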
Suppose a population is normally distributed according to \(\mathcal{N}(\mu, \sigma^2)\). We said before that the mean of a sample of size \(n\) will be normally distributed according to
\[\bar{X}\sim\mathcal{N}(\mu, \sigma^2/n)\]
From that, we concluded that the standardized mean follows the standard normal distribution:
\[\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim \mathcal{N}(0, 1)\]
But that is only useful if we know \(\sigma\), which we usually do not. What if we have to estimate the standard deviation from the sample?
Estimating it is easy: we compute the sample variance \(s^2\), whose square root \(s\) is the sample standard deviation:
\[s^2 = \frac{\sum_i (X_i-\bar{X})^2}{n-1}\]
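As a quick check of this formula, the sketch below computes \(s^2\) by hand and compares it with R's built-in var and sd, which use the same \(n-1\) denominator. The sample values are made up for illustration and are not taken from the finches data.

```r
# Hypothetical sample, for illustration only
x <- c(9.1, 10.4, 9.8, 10.0, 9.5)
n <- length(x)

# Sample variance with the n - 1 denominator
s2_manual <- sum((x - mean(x))^2) / (n - 1)

# Compare with the built-in functions
all.equal(s2_manual, var(x))
all.equal(sqrt(s2_manual), sd(x))
```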
But now, with \(s\) in place of \(\sigma\),
\[\frac{\bar{X}-\mu}{s/\sqrt{n}}\sim t(n-1)\]
\(t(n-1)\) is the t-distribution with \(n-1\) degrees of freedom. “Degrees of freedom” is a parameter, just as \(\mu\) and \(\sigma\) are parameters of the normal distribution. For reasonably large \(n\), the t-distribution looks very much like the standard normal distribution \(\mathcal{N}(0, 1)\) (how can we deduce this from what we already know?)
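One way to see this numerically is to compare quantiles of the two distributions. The sketch below uses the base R functions qt and qnorm. The degrees of freedom listed are arbitrary choices for illustration.

```r
# 97.5% quantile of the standard normal distribution
z <- qnorm(0.975)

# The same quantile of t-distributions with increasing degrees of freedom
dfs <- c(2, 5, 10, 30, 100, 1000)
t_quantiles <- qt(0.975, df = dfs)

# The gap to the normal quantile shrinks as the degrees of freedom grow
data.frame(df = dfs, t_quantile = t_quantiles, gap = t_quantiles - z)
```

The gap stays positive because the t-distribution has heavier tails than the normal, and it shrinks toward zero as the degrees of freedom increase. This is consistent with the idea that \(s\) becomes a better estimate of \(\sigma\) as \(n\) grows.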
We measured the heights of 100 randomly selected male students. The mean of the sample is 175 cm, and the standard deviation of the sample is 10 cm. Can we reject the null hypothesis that the average height is 177 cm?
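A possible way to work through this question is sketched below. It assumes a two-sided alternative and a 5% significance level, neither of which is stated in the question. The variable names are chosen here for illustration.

```r
# Summary statistics given in the question
n <- 100
x_bar <- 175
s <- 10
mu0 <- 177   # mean under the null hypothesis

# One-sample t-statistic and its degrees of freedom
t_stat <- (x_bar - mu0) / (s / sqrt(n))
deg_free <- n - 1

# Two-sided p-value (assumption: two-sided alternative)
p_value <- 2 * pt(-abs(t_stat), df = deg_free)

t_stat
p_value
p_value < 0.05   # assumption: 5% significance level
```

The t-statistic is \(-2\), and the p-value comes out just under 0.05, so under these assumptions we would reject the null hypothesis, though only narrowly.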
We can test the difference between two sample means from populations whose individuals are normally distributed (this is itself usually an approximation) using the following statistic:
\[T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^{2}/N_1 + s_2^{2}/N_2}}\]
Here, \(N_1\) and \(N_2\) are the sample sizes, and \(s_1\) and \(s_2\) are the sample standard deviations.
If the two population means are equal, \(T\) approximately follows the \(t(\nu)\) distribution.
The number of degrees of freedom \(\nu\) is approximated using
\[\nu = \frac{\left(\frac{s_1^{2}}{N_1} + \frac{s_2^{2}}{N_2}\right)^{2}}{\frac{s_1^{4}}{N_1^{2}(N_1-1)} + \frac{s_2^{4}}{N_2^{2}(N_2-1)}}\]
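# Per-year mean, standard deviation, and sample size of beak depth
a <- finches %>%
  group_by(Year) %>%
  summarize(mean = mean(Depth), sd = sd(Depth), count = n())
N1 <- a$count[1]
N2 <- a$count[2]
mu1 <- a$mean[1]
mu2 <- a$mean[2]
sd1 <- a$sd[1]
sd2 <- a$sd[2]
# t-statistic for the difference in means (named t_stat to avoid masking base::t)
t_stat <- (mu1 - mu2) / sqrt(sd1^2 / N1 + sd2^2 / N2)
# Approximate degrees of freedom
nu <- (sd1^2 / N1 + sd2^2 / N2)^2 / (sd1^4 / (N1^2 * (N1 - 1)) + sd2^4 / (N2^2 * (N2 - 1)))
# Two-sided p-value; -abs() makes this correct for either sign of t_stat
2 * pt(-abs(t_stat), df = nu)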
## [1] 8.739145e-06
We can also use the built-in t.test function:
t.test(filter(finches, Year == "1976")$Depth, filter(finches, Year == "1978")$Depth, alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: filter(finches, Year == "1976")$Depth and filter(finches, Year == "1978")$Depth
## t = -4.5833, df = 172.98, p-value = 8.739e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9564436 -0.3806350
## sample estimates:
## mean of x mean of y
## 9.469663 10.138202
a <- t.test(filter(finches, Year == "1976")$Depth, filter(finches, Year == "1978")$Depth, alternative = "two.sided", var.equal = FALSE)
a$p.value
## [1] 8.739145e-06
Again, what is t-distributed is the t-statistic \(T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^{2}/N_1 + s_2^{2}/N_2}}\). That is, if the two population means were equal and we repeatedly caught 89 finches in each of the two years and measured their beaks, we’d get a different \(T\) every time. These values of \(T\) would be (approximately) t-distributed.
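The repeated-sampling idea can be sketched with a small simulation. This is only an illustration: the populations are assumed to be normal with a common mean, and the values of mu, sigma1, and sigma2 are hypothetical, not estimated from the finches data. The helper simulate_once is likewise introduced here for the example.

```r
set.seed(1)

# Hypothetical populations: same mean, different spreads (illustrative values)
n1 <- 89
n2 <- 89
mu <- 9.5
sigma1 <- 1.0
sigma2 <- 0.9

# Draw two samples with equal population means; return the t-statistic and p-value
simulate_once <- function() {
  x1 <- rnorm(n1, mean = mu, sd = sigma1)
  x2 <- rnorm(n2, mean = mu, sd = sigma2)
  res <- t.test(x1, x2, alternative = "two.sided", var.equal = FALSE)
  c(t = unname(res$statistic), p = res$p.value)
}

# Repeat the "experiment" many times; result is a 2 x 5000 matrix
sims <- replicate(5000, simulate_once())

# T differs from repetition to repetition
summary(sims["t", ])

# If T is approximately t-distributed, about 5% of p-values fall below 0.05
mean(sims["p", ] < 0.05)
```

Each repetition gives a different \(T\), and the share of p-values below 0.05 should land close to 5%, which is what we would expect if \(T\) follows the t-distribution used to compute those p-values.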