Lecture Outline for Week 3

I. Floating point numbers
   - computers are discrete, not continuous
   - need real numbers for many types of computations
     - image processing, weather forecasting, modelling, etc.
   - how to represent a real number
     - use scientific notation: x = m x 10^e
     - m is called the mantissa
     - e is called the exponent
     - n will represent the number of digits in the mantissa
   - a floating point number is normalized if the mantissa is shifted so that
     the first digit after the decimal point is non-zero and there are no
     digits before the decimal point
   - a normalized floating point number is of the form
       x = +/- .d1d2...dn x 10^e , in base 10
       x = +/- .d1d2...dn x 2^e  , in base 2
   - IEEE floating point standard for single precision (bit-level sketch after
     section III)
     - 1 sign bit
     - 8-bit exponent
     - 23-bit mantissa
         x = +/- .d1d2...dn x 2^(e-127)
     - the bias of 127 means we don't need a sign bit for the exponent, so we
       save space
     - also, the first digit of a normalized binary mantissa is always 1, so
       we don't have to store it
   - for double precision floating point, the extra bits are added to the
     mantissa
     - why not to the exponent?  2^128 is already about 3.4x10^38

II. Floating Point Arithmetic
   - how to add and subtract
     - shift the smaller exponent to be equal to the larger
     - shift the mantissa of the smaller number the same number of digits to
       the right
     - add the new mantissas to n+1 places
       - the (n+1)st digit is the "guard digit"
     - round to n digits
     Example: .245876x10^1 + .134117x10^(-1) = .247217x10^1
   - how to multiply
     - fully multiply the two n-digit mantissas to get a 2n-digit mantissa
     - add the exponents
     - normalize and round to n places
     Example: (.1344x10^2) x (.7188x10^1) = .9661x10^2
   - problems with addition
     - (A + B) + C != A + (B + C)
     Example: A = .505x10^1, B = .400x10^(-2), C = .400x10^(-2)
       - with n = 3, (A + B) + C = .505x10^1 but A + (B + C) = .506x10^1,
         since B and C are each rounded away when added to A alone
     - to preserve all the accuracy possible, add the two smallest first
       (demo after section III)
   - problems with subtraction
     - can lose significant digits if we subtract values that are close
       together
     Example: .76545421x10^1 - .76544200x10^1 = .12210000x10^(-3)
       - the trailing zeros supplied by normalization carry no information
     - in a large computation this loss can be significant
     - want to avoid this situation
   - Important Fact: a formula written in two mathematically identical ways
     can yield different computed results
     Example: compute 1 - cos x when x is close to 0 (sketch after section III)
     Example: solve ax^2 + bx + c = 0 (sketch after section III)

III. How accurate are floating point numbers?
   - can plot each number which can be exactly represented on the real number
     line
   - suppose we need to store an arbitrary number x
     - IEEE standard: represent x by the closest floating point number to x
     - call this number fl(x)
   - if x falls exactly halfway between two floating point numbers, the tie
     must be broken one way or the other
     - a tie is only possible if x = (x' + x")/2, where x' and x" are the
       floating point numbers immediately below and above x
     - suppose x = .d1d2...d(n-1)dnd(n+1)... x 10^e
     - at a tie, d(n+1) = 5 and di = 0 for i > n+1
   - round up or down?
     - if we only round one way, we get "biased rounding"
     - want an even distribution of rounding up vs. down
     - rule of thumb: if dn is odd, round up, else round down (so the retained
       last digit always ends up even)
   - what is the maximum absolute rounding error?
     - x" - x' = 10^(e-n)
     - so |x - fl(x)| <= (1/2)x10^(e-n)
   - since n is fixed, the spacing between floating point numbers depends on e
     - thus floating point numbers are not distributed uniformly
     - a logarithmic distribution (demo below)
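A bit-level look at section I's single precision layout. This is a minimal
Python sketch (the helper name float32_parts is ours); it packs a value into
IEEE single precision with the standard struct module and pulls out the sign
bit, the 8-bit biased exponent, and the 23 stored mantissa bits. Note the
hardware writes the significand as 1.f x 2^(e-127); the notes' .d1d2... form
is the same bit pattern with the exponent shifted by one.

    import struct

    def float32_parts(x):
        # pack x as IEEE single precision, reinterpret the 4 bytes as a 32-bit int
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        sign     = bits >> 31             # 1 sign bit
        exponent = (bits >> 23) & 0xFF    # 8-bit exponent, stored with bias 127
        mantissa = bits & 0x7FFFFF        # 23 mantissa bits (leading 1 not stored)
        return sign, exponent, mantissa

    s, e, m = float32_parts(-6.25)        # -6.25 = -1.1001 (base 2) x 2^2
    print(s, e - 127, format(m, "023b"))  # 1 2 10010000000000000000000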
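Section II's addition example can be checked directly. A small sketch using
Python's decimal module to emulate 3-digit rounded arithmetic (prec = 3 plays
the role of the notes' n):

    from decimal import Decimal, getcontext

    getcontext().prec = 3      # keep only 3 significant digits per operation
    A = Decimal(".505E1")
    B = Decimal(".400E-2")
    C = Decimal(".400E-2")

    print((A + B) + C)         # 5.05 -- B and C are each rounded away against A
    print(A + (B + C))         # 5.06 -- adding the two smallest first keeps them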
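For the Important Fact, the two ways of computing 1 - cos x near 0 really do
disagree. The identity 1 - cos x = 2 sin^2(x/2) is exact mathematics, so the
difference in output below is pure round off (double precision here, rather
than the notes' decimal model):

    import math

    x = 1.0e-8
    print(1.0 - math.cos(x))         # 0.0 -- cos(x) rounds to exactly 1.0, every digit cancels
    print(2.0 * math.sin(x / 2)**2)  # about 5e-17 -- same formula mathematically, no cancellation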
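Likewise for ax^2 + bx + c = 0: when b^2 >> 4ac, the usual quadratic formula
cancels in one of the two roots. The standard rewrite (a well-known fix, not
spelled out in the notes) computes the large-magnitude root first and recovers
the small one from the product of roots c/a. A sketch:

    import math

    def roots_naive(a, b, c):
        d = math.sqrt(b * b - 4 * a * c)
        # for b > 0 the first root subtracts two nearly equal numbers
        return (-b + d) / (2 * a), (-b - d) / (2 * a)

    def roots_stable(a, b, c):
        d = math.sqrt(b * b - 4 * a * c)
        q = -(b + math.copysign(d, b)) / 2  # b and copysign(d, b) share a sign: no cancellation
        return q / a, c / q                 # second root from (root1)(root2) = c/a

    a, b, c = 1.0, 1.0e8, 1.0               # true roots: about -1e8 and -1e-8
    print(roots_naive(a, b, c))             # small root comes out near -7.45e-9: wrong
    print(roots_stable(a, b, c))            # small root comes out as -1e-8: right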
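Finally, section III's logarithmic distribution is easy to see: Python 3.9+
exposes the gap from a float to the next one (math.ulp), and the gap grows in
proportion to the magnitude:

    import math

    for x in (1.0, 1.0e3, 1.0e6, 1.0e9):
        print(x, math.ulp(x))  # spacing to the next double grows with the exponent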
IV. Significant digits
   - significant digits refers to the number of digits which can be used with
     confidence
   Example: loss of significant digits from subtraction
     - the true value of pi - 3.14 is .159265...x10^(-2), which rounds to
       .1593x10^(-2)
     - but a 4-digit machine stores fl(pi) = .3142x10^1 and computes
       .3142x10^1 - .3140x10^1 = .2000x10^(-2)
   - we have to truncate numbers, and this is ok as long as the truncated
     digits are never used
   - if we subtract 2 close numbers, these digits may be used in the next
     calculation
   - can we define significant digits as equal digits?
     Example: x = .32120, x* = .32129
       (the first 4 digits agree, the 5th does not)
   - or as equal rounded digits?
     Example: x = .32124, x* = .32116
       (only the first 3 digits agree, yet both round to .3212, and the error
       is about the same as in the previous example)
   - strict definition: if x* is an approximation to x, then x* is accurate to
     r significant digits (base 10) if
       |x - x*| / 10^([log_10 |x|]) <= 5x10^(-r)
     where [y] denotes the greatest integer <= y

V. Types of error
   - round off error: from rounding/truncating a real number to fit in the
     float/double
   - truncation error: from truncating an infinite computation
     Example: evaluate sin(x) by its infinite series, keeping only finitely
     many terms
   - approximation error: from using an approximate calculation or from using
     an iterative method

VI. Relative error
   - absolute error, or true error, is |x - x*|
   - many times absolute error is not a good indication of accuracy
     - an error of .1 is terrible if x = .2 but negligible if x = 10^6
   - absolute relative error: |x - x*| / |x|, provided x != 0
   - relative round off error: |x - fl(x)| / |x| <= 5x10^(-n)
     (numerical check after section VII)
     - u = unit round off error = 5x10^(-n)
     - Proof: |x - fl(x)| <= (1/2)x10^(e-n), and a normalized x satisfies
       |x| >= .1x10^e = 10^(e-1), so
       |x - fl(x)| / |x| <= (1/2)x10^(e-n) / 10^(e-1) = 5x10^(-n)
   - for error analysis, let delta = the relative round off error
     - delta = (fl(x) - x) / x, thus |delta| <= u
     - fl(x) = x(1 + delta)
   - rule of thumb: relative error will overestimate significant digits by at
     most 1 digit
     Example: x = pi, x* = 3.1428, relative error ~ 3.8x10^(-4)
       - the relative error suggests 4 significant digits, but by the strict
         definition x* is accurate to only 3 (pi = 3.14159...)

VII. Round off error
   - define floating point arithmetic
     - (+), (-), (*), (/) are the computer versions of the math operators
     - if x and y are exact, x (+) y = (x + y)(1 + delta), |delta| <= u
   - backward error analysis (worked sketch below)
     Example: compute z = x1y1 + x2y2
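A numerical check of section VI's bound |x - fl(x)| / |x| <= 5x10^(-n). The
sketch below models n-digit decimal rounding with Python's decimal module
(prec = 6 plays the role of n) and measures delta = (fl(x) - x) / x; every
|delta| stays within u = 5x10^(-6):

    from decimal import Decimal, getcontext
    import math

    getcontext().prec = 6          # n = 6 decimal digits
    u = 5e-6                       # unit round off error, 5 x 10^(-n)

    for x in (math.pi, 2 / 3, 123456.789, 1.0000049):
        flx = float(+Decimal(x))   # unary + rounds to the context precision
        delta = (flx - x) / x      # so fl(x) = x(1 + delta)
        print(f"x = {x:<12.10g} delta = {delta:+.2e}  |delta| <= u: {abs(delta) <= u}")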
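And a worked version of section VII's backward error analysis. Since each
machine operation satisfies x (+) y = (x + y)(1 + delta), the computed
z = fl( fl(x1 y1) + fl(x2 y2) ) can be rewritten as
x1(1+d1)(1+d3) y1 + x2(1+d2)(1+d3) y2: the computed answer is the exact answer
for slightly perturbed inputs. A sketch (the fl helper, which rounds to n
significant decimal digits, is our stand-in for n-digit machine arithmetic):

    import math

    def fl(x, n=4):
        # round x to n significant decimal digits (model of n-digit arithmetic)
        if x == 0.0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1  # choose e so the mantissa is in [.1, 1)
        return round(x / 10**e, n) * 10**e

    x1, y1, x2, y2 = 1.2345, 6.7891, -3.1416, 2.7183

    p1, p2 = fl(x1 * y1), fl(x2 * y2)  # each product rounded: p_i = x_i y_i (1 + d_i)
    z = fl(p1 + p2)                    # rounded sum: z = (p1 + p2)(1 + d3)

    d1 = p1 / (x1 * y1) - 1
    d2 = p2 / (x2 * y2) - 1
    d3 = z / (p1 + p2) - 1

    # backward view: z is the exact dot product of the perturbed inputs below
    x1t = x1 * (1 + d1) * (1 + d3)
    x2t = x2 * (1 + d2) * (1 + d3)
    print(z, x1t * y1 + x2t * y2)      # the two agree (up to double precision round off)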