Lecture Outline for Week 3

I. Floating point numbers
   - computers are discrete, not continuous
   - need real numbers for many types of computations
     - image processing, weather forecasting, modelling, etc.
   - how to represent a real number
     - use scientific notation: x = m x 10^e
     - m is called the mantissa
     - e is called the exponent
     - n will represent the number of digits in the mantissa
   - a floating point number is normalized if the mantissa is shifted so that
     the first digit after the decimal point is non-zero and there are no
     digits before the decimal point
   - a normalized floating point number is of the form
       x = +/- .d1d2...dn x 10^e , in base 10
       x = +/- .d1d2...dn x 2^e  , in base 2
   - IEEE floating point standard for single precision (bit-level sketch after
     section III)
     - 1 sign bit
     - 8-bit exponent
     - 23-bit mantissa
         x = +/- .d1d2...dn x 2^(e-127)
     - the bias of 127 means we don't need a sign bit for the exponent, so we
       save space
     - also, the first digit of a normalized binary mantissa is always 1, so
       we don't have to store it
   - for double precision floating point, the extra bits are added to the
     mantissa
     - why not to the exponent?  2^128 is already about 3.4x10^38

II. Floating Point Arithmetic
   - how to add and subtract
     - shift the smaller exponent to be equal to the larger
     - shift the mantissa of the smaller number the same number of digits to
       the right
     - add the new mantissas to n+1 places
       - the (n+1)st digit is the "guard digit"
     - round to n digits
     Example: .245876x10^1 + .134117x10^(-1) = .247217x10^1
   - how to multiply
     - fully multiply the two n-digit mantissas to get a 2n-digit mantissa
     - add the exponents
     - normalize and round to n places
     Example: (.1344x10^2) x (.7188x10^1) = .9661x10^2
   - problems with addition
     - (A + B) + C != A + (B + C)
     Example: A = .505x10^1, B = .400x10^(-2), C = .400x10^(-2)
       - with n = 3, (A + B) + C = .505x10^1 but A + (B + C) = .506x10^1,
         since B and C are each rounded away when added to A alone
     - to preserve all the accuracy possible, add the two smallest first
       (demo after section III)
   - problems with subtraction
     - can lose significant digits if we subtract values that are close
       together
     Example: .76545421x10^1 - .76544200x10^1 = .12210000x10^(-3)
       - the trailing zeros supplied by normalization carry no information
     - in a large computation this loss can be significant
     - want to avoid this situation
   - Important Fact: a formula written in two mathematically identical ways
     can yield different computed results
     Example: compute 1 - cos x when x is close to 0 (sketch after section III)
     Example: solve ax^2 + bx + c = 0 (sketch after section III)

III. How accurate are floating point numbers?
   - can plot each number which can be exactly represented on the real number
     line
   - suppose we need to store an arbitrary number x
     - IEEE standard: represent x by the closest floating point number to x
     - call this number fl(x)
   - if x falls exactly halfway between two floating point numbers, the tie
     must be broken one way or the other
     - a tie is only possible if x = (x' + x")/2, where x' and x" are the
       floating point numbers immediately below and above x
     - suppose x = .d1d2...d(n-1)dnd(n+1)... x 10^e
     - at a tie, d(n+1) = 5 and di = 0 for i > n+1
   - round up or down?
     - if we only round one way, we get "biased rounding"
     - want an even distribution of rounding up vs. down
     - rule of thumb: if dn is odd, round up, else round down (so the retained
       last digit always ends up even)
   - what is the maximum absolute rounding error?
     - x" - x' = 10^(e-n)
     - so |x - fl(x)| <= (1/2)x10^(e-n)
   - since n is fixed, the spacing between floating point numbers depends on e
     - thus floating point numbers are not distributed uniformly
     - a logarithmic distribution (demo below)
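A bit-level look at section I's single precision layout. This is a minimal
Python sketch (the helper name float32_parts is ours); it packs a value into
IEEE single precision with the standard struct module and pulls out the sign
bit, the 8-bit biased exponent, and the 23 stored mantissa bits. Note the
hardware writes the significand as 1.f x 2^(e-127); the notes' .d1d2... form
is the same bit pattern with the exponent shifted by one.

    import struct

    def float32_parts(x):
        # pack x as IEEE single precision, reinterpret the 4 bytes as a 32-bit int
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        sign     = bits >> 31             # 1 sign bit
        exponent = (bits >> 23) & 0xFF    # 8-bit exponent, stored with bias 127
        mantissa = bits & 0x7FFFFF        # 23 mantissa bits (leading 1 not stored)
        return sign, exponent, mantissa

    s, e, m = float32_parts(-6.25)        # -6.25 = -1.1001 (base 2) x 2^2
    print(s, e - 127, format(m, "023b"))  # 1 2 10010000000000000000000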
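Section II's addition example can be checked directly. A small sketch using
Python's decimal module to emulate 3-digit rounded arithmetic (prec = 3 plays
the role of the notes' n):

    from decimal import Decimal, getcontext

    getcontext().prec = 3      # keep only 3 significant digits per operation
    A = Decimal(".505E1")
    B = Decimal(".400E-2")
    C = Decimal(".400E-2")

    print((A + B) + C)         # 5.05 -- B and C are each rounded away against A
    print(A + (B + C))         # 5.06 -- adding the two smallest first keeps them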
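For the Important Fact, the two ways of computing 1 - cos x near 0 really do
disagree. The identity 1 - cos x = 2 sin^2(x/2) is exact mathematics, so the
difference in output below is pure round off (double precision here, rather
than the notes' decimal model):

    import math

    x = 1.0e-8
    print(1.0 - math.cos(x))         # 0.0 -- cos(x) rounds to exactly 1.0, every digit cancels
    print(2.0 * math.sin(x / 2)**2)  # about 5e-17 -- same formula mathematically, no cancellation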
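Likewise for ax^2 + bx + c = 0: when b^2 >> 4ac, the usual quadratic formula
cancels in one of the two roots. The standard rewrite (a well-known fix, not
spelled out in the notes) computes the large-magnitude root first and recovers
the small one from the product of roots c/a. A sketch:

    import math

    def roots_naive(a, b, c):
        d = math.sqrt(b * b - 4 * a * c)
        # for b > 0 the first root subtracts two nearly equal numbers
        return (-b + d) / (2 * a), (-b - d) / (2 * a)

    def roots_stable(a, b, c):
        d = math.sqrt(b * b - 4 * a * c)
        q = -(b + math.copysign(d, b)) / 2  # b and copysign(d, b) share a sign: no cancellation
        return q / a, c / q                 # second root from (root1)(root2) = c/a

    a, b, c = 1.0, 1.0e8, 1.0               # true roots: about -1e8 and -1e-8
    print(roots_naive(a, b, c))             # small root comes out near -7.45e-9: wrong
    print(roots_stable(a, b, c))            # small root comes out as -1e-8: right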
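Finally, section III's logarithmic distribution is easy to see: Python 3.9+
exposes the gap from a float to the next one (math.ulp), and the gap grows in
proportion to the magnitude:

    import math

    for x in (1.0, 1.0e3, 1.0e6, 1.0e9):
        print(x, math.ulp(x))  # spacing to the next double grows with the exponent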
IV. Significant digits
   - significant digits refers to the number of digits which can be used with
     confidence
   Example: loss of significant digits from subtraction
     - the true value of pi - 3.14 is .159265...x10^(-2), which rounds to
       .1593x10^(-2)
     - but a 4-digit machine stores fl(pi) = .3142x10^1 and computes
       .3142x10^1 - .3140x10^1 = .2000x10^(-2)
   - we have to truncate numbers, and this is ok as long as the truncated
     digits are never used
   - if we subtract 2 close numbers, these digits may be used in the next
     calculation
   - can we define significant digits as equal digits?
     Example: x = .32120, x* = .32129
       (the first 4 digits agree, the 5th does not)
   - or as equal rounded digits?
     Example: x = .32124, x* = .32116
       (only the first 3 digits agree, yet both round to .3212, and the error
       is about the same as in the previous example)
   - strict definition: if x* is an approximation to x, then x* is accurate to
     r significant digits (base 10) if
       |x - x*| / 10^([log_10 |x|]) <= 5x10^(-r)
     where [y] denotes the greatest integer <= y

V. Types of error
   - round off error: from rounding/truncating a real number to fit in the
     float/double
   - truncation error: from truncating an infinite computation
     Example: evaluate sin(x) by its infinite series, keeping only finitely
     many terms
   - approximation error: from using an approximate calculation or from using
     an iterative method

VI. Relative error
   - absolute error, or true error, is |x - x*|
   - many times absolute error is not a good indication of accuracy
     - an error of .1 is terrible if x = .2 but negligible if x = 10^6
   - absolute relative error: |x - x*| / |x|, provided x != 0
   - relative round off error: |x - fl(x)| / |x| <= 5x10^(-n)
     (numerical check after section VII)
     - u = unit round off error = 5x10^(-n)
     - Proof: |x - fl(x)| <= (1/2)x10^(e-n), and a normalized x satisfies
       |x| >= .1x10^e = 10^(e-1), so
       |x - fl(x)| / |x| <= (1/2)x10^(e-n) / 10^(e-1) = 5x10^(-n)
   - for error analysis, let delta = the relative round off error
     - delta = (fl(x) - x) / x, thus |delta| <= u
     - fl(x) = x(1 + delta)
   - rule of thumb: relative error will overestimate significant digits by at
     most 1 digit
     Example: x = pi, x* = 3.1428, relative error ~ 3.8x10^(-4)
       - the relative error suggests 4 significant digits, but by the strict
         definition x* is accurate to only 3 (pi = 3.14159...)

VII. Round off error
   - define floating point arithmetic
     - (+), (-), (*), (/) are the computer versions of the math operators
     - if x and y are exact, x (+) y = (x + y)(1 + delta), |delta| <= u
   - backward error analysis (worked sketch below)
     Example: compute z = x1y1 + x2y2
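A numerical check of section VI's bound |x - fl(x)| / |x| <= 5x10^(-n). The
sketch below models n-digit decimal rounding with Python's decimal module
(prec = 6 plays the role of n) and measures delta = (fl(x) - x) / x; every
|delta| stays within u = 5x10^(-6):

    from decimal import Decimal, getcontext
    import math

    getcontext().prec = 6          # n = 6 decimal digits
    u = 5e-6                       # unit round off error, 5 x 10^(-n)

    for x in (math.pi, 2 / 3, 123456.789, 1.0000049):
        flx = float(+Decimal(x))   # unary + rounds to the context precision
        delta = (flx - x) / x      # so fl(x) = x(1 + delta)
        print(f"x = {x:<12.10g} delta = {delta:+.2e}  |delta| <= u: {abs(delta) <= u}")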
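And a worked version of section VII's backward error analysis. Since each
machine operation satisfies x (+) y = (x + y)(1 + delta), the computed
z = fl( fl(x1 y1) + fl(x2 y2) ) can be rewritten as
x1(1+d1)(1+d3) y1 + x2(1+d2)(1+d3) y2: the computed answer is the exact answer
for slightly perturbed inputs. A sketch (the fl helper, which rounds to n
significant decimal digits, is our stand-in for n-digit machine arithmetic):

    import math

    def fl(x, n=4):
        # round x to n significant decimal digits (model of n-digit arithmetic)
        if x == 0.0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1  # choose e so the mantissa is in [.1, 1)
        return round(x / 10**e, n) * 10**e

    x1, y1, x2, y2 = 1.2345, 6.7891, -3.1416, 2.7183

    p1, p2 = fl(x1 * y1), fl(x2 * y2)  # each product rounded: p_i = x_i y_i (1 + d_i)
    z = fl(p1 + p2)                    # rounded sum: z = (p1 + p2)(1 + d3)

    d1 = p1 / (x1 * y1) - 1
    d2 = p2 / (x2 * y2) - 1
    d3 = z / (p1 + p2) - 1

    # backward view: z is the exact dot product of the perturbed inputs below
    x1t = x1 * (1 + d1) * (1 + d3)
    x2t = x2 * (1 + d2) * (1 + d3)
    print(z, x1t * y1 + x2t * y2)      # the two agree (up to double precision round off)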