Lecture Outline for Week 11

I. Random Numbers
- humans are notoriously bad judges of randomness
  - which looks more random: 01010101 or 1100101?
- properties of a random number sequence
  - it is impossible to determine the next number of the sequence from the previous numbers
  - it will pass statistical tests on its distribution
  - any relationship among the numbers is unnoticeable
- a computer is deterministic, so true random numbers are impossible to generate
- pseudorandom numbers
  - numbers that appear random to most applications
- a common pseudorandom algorithm: the linear congruential method
  - let x_k be the kth number of the sequence
  - x_(k+1) = (a * x_k + c) mod m
  - the x_k values will be between 0 and m - 1
  - the period of a sequence is the number of different values generated before the sequence repeats
  - the period of the linear congruential method is at most m
  - generally m is much larger than the desired range for our random numbers, so we must scale x_k by the necessary factor to bring it into the desired range
  - for example: we would like a generator with a period in the millions if we are generating dice rolls, so that we do not see the sequence of rolls repeat
- choosing a, c, and m must be done carefully
  - Knuth's guidelines (1981)
    - m must be very large (millions at least)
    - a should be one digit less than m, and the decimal representation of a should end with ...b21, where b is even
  - these guidelines avoid some notoriously bad cases
- in the mid-1980s it was shown that the linear congruential method is poor in the sense that it is easy to predict the next number of the sequence
  - other, more complex pseudorandom number generators have been created since then
- if a sequence always starts with the same number, we will generate the same sequence
  - we must "seed" the random number generator: pick a different value for x_1 each time
  - a common method is to use the current time
- it is often nice to have multiple random number generators
  - the statistical properties of the numbers are often lost if you use every third number, or only the low-order bits, of a single generator

II. Random Numbers in C
- C has several random number generators in stdlib.h
- rand()
  - a bad linear congruential method, but it exists in all ANSI C implementations
  - the low-order bits of rand() cycle very frequently
- random()
  - does not use the linear congruential method
  - all of the bits may be used, unlike with rand()
  - may not exist in all C implementations
- lrand48()
  - another random number package
  - may not exist in all C implementations
- you should use random() instead of rand()
  - random() returns a uniform random number between 0 and 2^31 - 1 (LONG_MAX on systems with 32-bit longs)
  - srandom(seed) seeds the generator with seed
    - srandom(time(NULL)) seeds the generator with the current time
  - initstate() initializes the state of the generator
    - not for everyone! you should have a good understanding of the mathematics before you attempt this
  - setstate() lets the user switch between states of the generator
    - so that we can implement multiple random number sequences
III. Random Number Distributions
- a distribution describes how the random numbers are spread across the range
- uniform discrete distribution
  - every number has an equal probability of occurring
  - if we are generating N different values, the probability that a specific number is the next value of the sequence is 1/N
  - all of the C random number generators listed above generate uniform discrete distributions
  - notation: Pr[x = a] is the probability that random number x has value a
- continuous uniform distribution
  - pick a random real number in the range [a, b]
  - since there are infinitely many possible values, Pr[x = a] -> 0
  - instead, we consider the probability that x lies within some range [c, d]
  - Pr[x in [c, d]] = (d - c) / (b - a), assuming a <= c <= d <= b
  - computers are discrete, so it is impossible to generate a truly continuous random number
  - we can approximate one by letting the kth random number be x_k / m, where x_k and m are defined above
  - this gives a value in the interval [0, 1)
- there are many different statistical distributions
  - uniform, binomial, Poisson, exponential, geometric, etc.
  - for a model, we find the probability distribution that best fits the random event
  - if none fit, we can generate our own distribution from the experimental evidence
- how do we create a non-uniform distribution from a discrete uniform random number generator?
  - first, compute the cumulative probability distribution F
    - F(a) is the probability that x <= a
  - to find x with the non-uniform distribution
    - generate y with uniform probability over (0, 1]
    - let x = F^(-1)(y)

IV. Statistical Tools
- the expected value of a random variable: EV(x) = SUM over all i of a_i * Pr[x = a_i]
  - the expected value is also called the mean or average
- the variance of x measures how much x tends to differ from the mean: Var(x) = SUM over all i of (a_i - EV(x))^2 * Pr[x = a_i]
- the standard deviation is another measure of the variation of x: sigma = sqrt(Var(x))
- why the squares and square roots?
  - this approximates the distance a random variable typically is from the mean
  - remember geometry: the distance from (x, y) to (a, b) is sqrt((x - a)^2 + (y - b)^2)
- median
  - the middle value of a distribution
  - for a random number, it is the smallest value whose cumulative probability is >= 0.5

V. Statistics for Simulation Output
- the output of a probabilistic simulation is also a random variable
- we would like to analyze the output for its mean and deviation, but we do not know the true probabilities with which values occur
- sample mean: X = (1/N) * SUM of the outputs x_j, where N is the number of trials
- sample variance: S^2 = (1/(N-1)) * SUM of (x_j - X)^2 over all outputs x_j
- how close is the sample mean to the true mean?
  - we find a confidence interval for the true mean
  - example: "we have P confidence that the mean is in this range"
  - example: polling data
- central limit theorem
  - if we run enough (N) trials, the average of the output will follow a normal distribution with mean r and variance s^2 / N, where r is the true mean of the output and s^2 is its true variance (s is the true deviation)
- a random value from a normal distribution has roughly a 95% probability of lying within 2 standard deviations of the mean
- so, if we run enough trials, the true mean of the output is 95% likely to lie within the range X +/- 2 * sqrt(S^2 / N), where X and S^2 are defined above