Lecture Outline for Week 11

I. Random Numbers
- humans are notoriously bad judges of randomness
  - which looks more random: 01010101 or 1100101?
- properties of a random number sequence
  - it is impossible to determine the next number of the sequence from the previous numbers
  - it will pass statistical tests on its distribution
  - any relationship among the numbers is unnoticeable
- a computer is deterministic, so true random numbers are impossible to generate
- pseudorandom numbers
  - numbers that appear random to most applications
- a common pseudorandom algorithm: the linear congruential method
  - let x_k be the kth number of the sequence
  - x_(k+1) = (a * x_k + c) mod m
  - the x_k values will be between 0 and m - 1
  - the period of a sequence is the number of different values generated before the sequence repeats
  - the period of the linear congruential method is at most m
  - generally m is much larger than the desired range for our random numbers, so we must scale x_k by the necessary factor to bring it into the desired range
  - for example: we would like a generator with a period in the millions if we are generating dice rolls, so that we do not see the sequence of rolls repeat
- choosing a, c, and m must be done carefully
  - Knuth's guidelines (1981)
    - m must be very large (millions at least)
    - a should be one digit less than m, and the decimal representation of a should end with ...b21, where b is even
  - these guidelines avoid some notoriously bad cases
- in the mid-1980s it was shown that the linear congruential method is poor in the sense that it is easy to predict the next number of the sequence
  - other, more complex pseudorandom number generators have been created since then
- if a sequence always starts with the same number, we will generate the same sequence
  - we must "seed" the random number generator: pick a different value for x_1 each time
  - a common method is to use the current time
- it is often nice to have multiple random number generators
  - the statistical properties of the numbers are often lost if you use every third number, or only the low-order bits, of a single generator

II. Random Numbers in C
- C has several random number generators in stdlib.h
- rand()
  - a bad linear congruential method, but it exists in all ANSI C implementations
  - the low-order bits of rand() cycle very frequently
- random()
  - does not use the linear congruential method
  - all of the bits may be used, unlike with rand()
  - may not exist in all C implementations
- lrand48()
  - another random number package
  - may not exist in all C implementations
- you should use random() instead of rand()
  - random() returns a uniform random number between 0 and 2^31 - 1 (LONG_MAX on systems with 32-bit longs)
  - srandom(seed) seeds the generator with seed
    - srandom(time(NULL)) seeds the generator with the current time
  - initstate() initializes the state of the generator
    - not for everyone! you should have a good understanding of the mathematics before you attempt this
  - setstate() lets the user switch between states of the generator
    - so that we can implement multiple random number sequences
III. Random Number Distributions
- a distribution describes how the random numbers are spread across the range
- uniform discrete distribution
  - every number has an equal probability of occurring
  - if we are generating N different values, the probability that a specific number is the next value of the sequence is 1/N
  - all of the C random number generators listed above generate uniform discrete distributions
  - notation: Pr[x = a] is the probability that random number x has value a
- continuous uniform distribution
  - pick a random real number in the range [a, b]
  - since there are infinitely many possible values, Pr[x = a] -> 0
  - instead, we consider the probability that x lies within some range [c, d]
  - Pr[x in [c, d]] = (d - c) / (b - a), assuming a <= c <= d <= b
  - computers are discrete, so it is impossible to generate a truly continuous random number
  - we can approximate one by letting the kth random number be x_k / m, where x_k and m are defined above
  - this gives a value in the interval [0, 1)
- there are many different statistical distributions
  - uniform, binomial, Poisson, exponential, geometric, etc.
  - for a model, we find the probability distribution that best fits the random event
  - if none fit, we can generate our own distribution from the experimental evidence
- how do we create a non-uniform distribution from a discrete uniform random number generator?
  - first, compute the cumulative probability distribution F
    - F(a) is the probability that x <= a
  - to find x with the non-uniform distribution
    - generate y with uniform probability over (0, 1]
    - let x = F^(-1)(y)

IV. Statistical Tools
- the expected value of a random variable: EV(x) = SUM over all i of a_i * Pr[x = a_i]
  - the expected value is also called the mean or average
- the variance of x measures how much x tends to differ from the mean: Var(x) = SUM over all i of (a_i - EV(x))^2 * Pr[x = a_i]
- the standard deviation is another measure of the variation of x: sigma = sqrt(Var(x))
- why the squares and square roots?
  - this approximates the distance a random variable typically is from the mean
  - remember geometry: the distance from (x, y) to (a, b) is sqrt((x - a)^2 + (y - b)^2)
- median
  - the middle value of a distribution
  - for a random number, it is the smallest value whose cumulative probability is >= 0.5

V. Statistics for Simulation Output
- the output of a probabilistic simulation is also a random variable
- we would like to analyze the output for its mean and deviation, but we do not know the true probabilities with which values occur
- sample mean: X = (1/N) * SUM of the outputs x_j, where N is the number of trials
- sample variance: S^2 = (1/(N-1)) * SUM of (x_j - X)^2 over all outputs x_j
- how close is the sample mean to the true mean?
  - we find a confidence interval for the true mean
  - example: "we have P confidence that the mean is in this range"
  - example: polling data
- central limit theorem
  - if we run enough (N) trials, the average of the output will follow a normal distribution with mean r and variance s^2 / N, where r is the true mean of the output and s^2 is its true variance (s is the true deviation)
- a random value from a normal distribution has roughly a 95% probability of lying within 2 standard deviations of the mean
- so, if we run enough trials, the true mean of the output is 95% likely to lie within the range X +/- 2 * sqrt(S^2 / N), where X and S^2 are defined above