Let’s review coin tosses. Here, we’ll generate \(100\) coin tosses where the weight of the coin is 0.7 (i.e., we expect the coin to come up Heads (1) 7 times out of 10).
tosses <- rbinom(n = 100, size = 1, prob = 0.7)
tosses
## [1] 1 0 0 0 1 1 0 0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 1
## [36] 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0
## [71] 1 0 1 1 0 0 1 1 1 0 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1
We can confirm that the coin indeed came up heads about 70% of the time:
mean(tosses)
## [1] 0.68
What is the probability mass function here?
| Tails | Heads |
|---|---|
| 0.3 | 0.7 |
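We can also read these probabilities off R directly. A single toss is a Bernoulli trial, which is just a Binomial with `size = 1`, so `dbinom` (covered in more detail below) gives the PMF:

```r
# PMF of a single toss: P(Tails) = P(0), P(Heads) = P(1)
dbinom(x = c(0, 1), size = 1, prob = 0.7)
## [1] 0.3 0.7
```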
Let’s now try something different: we’ll toss the coin 10 times at a time and record how many times it came up Heads. Then we’ll repeat that experiment 100 times.
The result will be 100 records of the number of heads out of 10 tosses:
rbinom(n = 100, size = 10, prob = 0.7)
## [1] 8 9 5 7 5 8 8 9 8 7 9 8 7 6 6 6 6 9 6 8 6 9 8
## [24] 6 5 8 5 6 6 6 8 5 6 8 9 8 6 9 5 7 8 5 7 7 8 5
## [47] 8 8 9 6 8 6 7 8 9 6 5 7 6 5 6 6 5 6 9 6 9 7 6
## [70] 6 7 7 10 8 8 5 9 6 7 6 6 7 8 8 7 9 10 6 10 3 8 8
## [93] 6 4 5 8 8 7 8 7
Let’s now make a histogram out of this. We’ll use 500,000 repetitions of the 10-toss experiment so that the histogram looks “smoother”:
set.seed(0)
exp10 <- rbinom(n = 500000, size = 10, prob = 0.7)
exp10.df <- data.frame(n.heads = exp10)
ggplot(exp10.df) + geom_bar(mapping = aes(x = n.heads, y = ..prop..)) + scale_x_continuous(breaks = 0:10)
We can take this histogram as representing (approximately) the probability mass function of the Binomial distribution with probability \(0.70\) and 10 trials.
In other words, here are the probabilities of obtaining \(k\) heads for each \(k\) between \(0\) and \(10\):
binom.pmf.df <- data.frame(n.heads = 0:10,
binom.pmf = dbinom(x = 0:10, size = 10, prob = 0.7))
binom.pmf.df
## n.heads binom.pmf
## 1 0 0.0000059049
## 2 1 0.0001377810
## 3 2 0.0014467005
## 4 3 0.0090016920
## 5 4 0.0367569090
## 6 5 0.1029193452
## 7 6 0.2001209490
## 8 7 0.2668279320
## 9 8 0.2334744405
## 10 9 0.1210608210
## 11 10 0.0282475249
We didn’t need to read the numbers off the histogram (which was only an approximation anyway): R can compute the PMF using a built-in formula. But it’s nice that the two match up!
Let’s look more closely at `dbinom` for a second. `dbinom` computes the probability that you’ll get a particular number of heads with `size` trials:
dbinom(x = 2, size = 10, prob = 0.7)
## [1] 0.001446701
You can give `dbinom` several inputs by passing a vector as the `x` argument:
dbinom(x = c(2, 3), size = 10, prob = 0.7)
## [1] 0.001446701 0.009001692
Here, we have 11 possible outcomes. All probabilities will always add up to 1. Think of it this way: the probability of 0 or 1 or 2 or … or 10 heads is 1 (since one of those is sure to happen). To get that probability, we could add up the probabilities of the individual events. In other words:
\(P(\text{nH = 0 or nH = 1 or ... or nH = 10}) = P(\text{nH = 0}) + P(\text{nH = 1}) + ... + P(\text{nH = 10}) = 1\).
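We can check this directly in R by summing the PMF over all 11 possible outcomes:

```r
# The probabilities of 0 through 10 heads must sum to 1
sum(dbinom(x = 0:10, size = 10, prob = 0.7))
## [1] 1
```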
Here’s another application of this idea. What is the probability of observing a coin coming up heads once or more?
We have \[P(\text{nHeads} \geq 1) + P(\text{nHeads} = 0) = 1\]
since one of those events will definitely happen and the events are mutually exclusive. So we have
\[P(\text{nHeads} \geq 1) = 1 - P(\text{nHeads} = 0)\]
and
1 - dbinom(x = 0, size = 10, prob = 0.7)
## [1] 0.9999941
We can use the same idea to compute the probability that the coin will come up heads more than once:
\(P(\text{nH = 2}) + ... + P(\text{nH = 10}) = 1 - P(\text{nH = 0}) - P(\text{nH = 1})\)
1 - dbinom(x = 0, size = 10, prob = 0.7) - dbinom(x = 1, size = 10, prob = 0.7)
## [1] 0.9998563
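Equivalently, we could sum the probabilities on the left-hand side directly and check that we get the same answer:

```r
# P(nH = 2) + ... + P(nH = 10), computed by direct summation
sum(dbinom(x = 2:10, size = 10, prob = 0.7))
## [1] 0.9998563
```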
The cumulative mass function of a distribution is the probability that the outcome will be \(q\) or less. You can compute the cumulative mass function for the binomial distribution using `pbinom`.
For example, here is the probability of the coin’s coming up heads at most once:
pbinom(q = 1, size = 10, prob = 0.7)
## [1] 0.0001436859
We can check that this works the same way with `dbinom`:
dbinom(x = 0, size = 10, prob = 0.7) + dbinom(x = 1, size = 10, prob = 0.7)
## [1] 0.0001436859
We indeed got the same number.
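This identity holds in general: `pbinom` at \(q\) is the sum of `dbinom` over \(0, 1, ..., q\). For example, with \(q = 5\):

```r
pbinom(q = 5, size = 10, prob = 0.7)
## [1] 0.1502683
sum(dbinom(x = 0:5, size = 10, prob = 0.7))
## [1] 0.1502683
```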
We could compute the cumulative mass function for the Bernoulli distribution too – it would just be kind of boring! (Why?)
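Here’s a sketch of why: a Bernoulli trial has only two outcomes, so its cumulative mass function only has two interesting values, one at \(0\) and one at \(1\):

```r
# CMF of a single toss: P(X <= 0) = P(Tails) = 0.3, and P(X <= 1) = 1
pbinom(q = c(0, 1), size = 1, prob = 0.7)
## [1] 0.3 1.0
```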
As we saw in the previous lecture, a Binomial distribution is “bell-shaped” if the number of trials `size` is large:
set.seed(0)
exp10 <- rbinom(n = 50000, size = 1000, prob = 0.7)
exp10.df <- data.frame(n.heads = exp10)
ggplot(exp10.df) + geom_bar(mapping = aes(x=n.heads, y = ..prop..)) + scale_x_continuous(limits = c(620, 780))
This is the shape of a Gaussian (or Normal) distribution. Specifically, it’s (approximately) the shape of a normal distribution with mean \(700\) and standard deviation \(14.9\). (We’ll see how to figure out those numbers later.)
Here is how we can generate data from the normal distribution:
rnorm(n = 1, mean = 700, sd = 14.9)
## [1] 705.1339
rnorm(n = 1, mean = 700, sd = 14.9)
## [1] 713.5617
rnorm(n = 1, mean = 700, sd = 14.9)
## [1] 719.4835
rnorm(n = 1, mean = 700, sd = 14.9)
## [1] 672.043
As you can see, we are now generating real numbers rather than just integers. The Gaussian distribution is a continuous distribution – that means the random process generates real numbers.
This means that the probability of generating any particular number is 0 – there are infinitely many real numbers. (But we can still compute something like a PMF that tells us how likely a number is to be generated relative to another number – the function that does that is the probability density function, or PDF.)
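R’s `dnorm` computes this density. For instance, comparing the density at 700 to the density at 680 tells us how much more likely the generator is to produce values near 700 than values near 680:

```r
# Ratio of densities of N(700, 14.9) at two points: relative likelihood
dnorm(x = 700, mean = 700, sd = 14.9) / dnorm(x = 680, mean = 700, sd = 14.9)
```

The ratio comes out to roughly 2.5, so values near the mean are about two and a half times as likely as values 20 units away.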
Let’s generate (almost) the same histogram as before, but using the Gaussian random number generator:
exp.norm <- rnorm(n = 50000, mean = 700, sd = 14.9)
exp.norm.df <- data.frame(r = round(exp.norm))
ggplot(exp.norm.df) + geom_bar(mapping = aes(x=r, y = ..prop..)) + scale_x_continuous(limits = c(620, 780))
We will now use the cumulative distribution function for the normal distribution to verify that the probability of getting a number smaller than 700 is 0.5:
pnorm(q = 700, mean = 700, sd = 14.9)
## [1] 0.5
The same is (approximately) true about the Binomial distribution, which looks very similar:
pbinom(q = 700, size = 1000, prob = 0.7)
## [1] 0.5119275
We can verify that the distributions are similar by checking the cumulative distribution function values elsewhere:
pnorm(q = 680, mean = 700, sd = 14.9)
## [1] 0.08975231
pbinom(q = 680, size = 1000, prob = 0.7)
## [1] 0.08978433
Finally, we can check that this is actually true in our simulations:
mean(exp10 <= 680)
## [1] 0.0882
mean(exp.norm <= 680)
## [1] 0.0884
Pretty close!
The Gaussian distribution (or Bell Curve) is defined by its mean – the average of the random numbers it generates – and its standard deviation, a measure of how “spread out” the numbers are.
The standard deviation of a set of numbers \(x\) is \[sd(x_1, x_2, x_3, ..., x_n) = \sqrt{\frac{ (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n}}.\]
Here, \(\bar{x} = \frac{x_1 + x_2 + ... + x_n}{n}\) is the mean of the x’s.
The standard deviation is the square root of the average squared distance from a datapoint to the mean. If we just measured the average distance to the mean (instead of the square root of the average squared distance), we’d get a similar quantity that’s easier to understand.
In R, we can use sd
to compute the standard deviation:
sd(c(1, 2, 10, 2, 1)) # small
## [1] 3.834058
sd(c(100, 2, 10, 2, 1)) # large
## [1] 43.19722
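To connect `sd` to the formula above: R actually divides by \(n - 1\) rather than \(n\) (for reasons touched on below), so we can reproduce its output by hand like this:

```r
x <- c(1, 2, 10, 2, 1)
# R's sd divides the sum of squared deviations by n - 1, not n
sqrt(sum((x - mean(x))^2) / (length(x) - 1))
## [1] 3.834058
```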
Let’s generate some numbers from a Gaussian distribution:
nums <- rnorm(n = 200, mean = 20, sd = 15)
We can now estimate the mean and standard deviation:
mean(nums)
## [1] 20.64939
sd(nums)
## [1] 15.7887
(Actually, \(\sqrt{\frac{n}{n-1}}\, sd(nums)\) is a better estimate for the standard deviation, where \(sd\) is the formula above. We won’t get into detail here, but the issue is that we don’t know the true mean, so we estimate it using the mean of the 200 numbers, which won’t be exact. If we just computed the \(sd\) formula above, we’d systematically underestimate the true standard deviation. Note that R’s built-in `sd` already applies this correction by dividing by \(n - 1\).)
We can estimate many distributions using the Gaussian distribution. You already saw this in action – we can estimate the Binomial distribution using the Gaussian distribution.
set.seed(0)
exp10 <- rbinom(n = 50000, size = 1000, prob = 0.7)
exp10.df <- data.frame(n.heads = exp10)
mean(exp10)
## [1] 700.05
sd(exp10)*sqrt(50000/49999)
## [1] 14.50686
The Gaussian distribution with mean 700 and standard deviation 14.5 would be a good approximation here, empirically.
For `size` (the number of tosses) large enough (say, over 50) and `prob` reasonably close to 0.5, we can approximate the binomial distribution using the normal distribution with mean \(\text{size} \times \text{prob}\) and standard deviation \(\sqrt{\text{size} \times \text{prob} \times (1-\text{prob})}\).
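We can verify these formulas against the empirical values above (with `size = 1000` and `prob = 0.7`):

```r
1000 * 0.7              # mean of the approximating normal distribution
## [1] 700
sqrt(1000 * 0.7 * 0.3)  # its standard deviation
## [1] 14.49138
```

These match the simulated mean of 700.05 and standard deviation of about 14.5.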
This is one application of the Central Limit Theorem (we will not get into this much further.)