================================= WEEK 5 ==================================== Arithmetic Operations on Floating-Point Numbers ----------------------------------------------- * For all of these examples, we will use a base-10, 4-digit mantissa and 1-digit exponent floating-point system. * Most FP calculations deal with normalized numbers: - inputs of the calculation must be normalized numbers - outputs are normalized * Routines dealing with FPA are usually written in machine-dependent languages, taking advantages of the particularities of the involved machines. * Note that FPA is approximate calculation, thus not exact; therefore we use different symbols for it: (+), (-), (*) (or (x)), (/) are the FPA version of the "true" arithmetic operators +, -, * (or x), and /, respectively. * Addition is performed by first "aligning" the exponents and then rounding the final result. Specifically, we do the the following: - shift smaller exponent to be equal to larger - shift mantissa of smaller the same number of digits to the right - add new mantissa to that of the unchanged mantissa - round final result * Precise algorithm in Knuth, Vol II, Section 4.2 Examples: 0.3472e+2 0.3472 e+2 (+) 0.1120e-1 ---> (+) 0.0001120e+2 ---------------- 0.3473 e+2 0.2458 e+1 0.2458 e+1 (+) 0.1361 e-1 ---> (+) 0.001361 e+1 ------------------ 0.247161 e+1 0.2472 e+1 Notice that adding very large and very small numbers can lead to loss of information (as in our first example), and that in extreme cases, the sum might remain equal to one of the operands (e.g., 0.3472e+2 + 0.1120e-4 = 0.3472e+2 after rounding). * Subtraction is performed similarly to sum. A different problem arises in this case, for example: 0.4573e+2 (-) 0.4571e+2 = 0.0002e+2 = 0.2000e-1 Because of normalization, the last three digits of the answer are *not* significant. Other example (assuming 8-digit mantissa): 0.76545421 e+1 (-) 0.76544200 e+1 ---------------- 0.00001221 e+1 This result is normalized as 0.1221000 e-3. The four last zero's are no longer significant! * Multiplication and division are performed by adding (or subtracting) the exponents and multiplying (or dividing) the mantissas, using a double-precision register for intermediate results and then rounding and normalizing the result. No issues arise in these cases except for possible overflow or underflow (which cannot be avoided). That is, do the following: - fully multiply two n-digit mantissas to get a 2n-digit mantissa - add exponents - normalize and round to n places * Precise algorithm in Knuth, Vol II, Section 4.2 Example: (.1344x10^2) (x) (.7188x10^1) = .9661x10^2 * Note that addition is no longer associative for floating-point numbers; i.e., (A + B) + C is not equal to A + (B + C). For example, ((((100 (+) 0.03) (+) 0.03) (+) 0.03) (+) 0.03) = 100 because 0.1000e+3 (+) 0.3000e-1 = 0.1000e+3. On the other hand, (100 (+) (0.03 (+) (0.03 (+) (0.03 (+) 0.03)))) = 100.1 because the addition of the 0.03's gives 0.12 = 0.1200e0 and 0.1000e+3 (+) 0.1200e0 = 0.1001e+3 = 100.1 To preserve all accuracy possible, add the two smallest first. * Similarly, subtractive cancellation may not have a huge impact on a single calculation, but when many operations are performed and the answer for each one depends on the results of the previous operations, such errors can quickly become magnified and completely ruin the answer. For example, if we try to evaluate the value of e^{-x} using the formula 1 (-) |x| (+) |x|^2/2! (-) |x|^3/3! (+) ... each term's value is close enough to the previous partial sum that we are performing many successive subtractions in a row, and each one of those subtractions makes us lose many significant figures because of normalization, so that fairly quickly, the answer is completely off... * Moral of the story: when performing even simple operations with floating-point numbers, care needs to be taken. Formulas should be rewritten if necessary to avoid subtracting "close" numbers and to perform additions in the correct order (adding smaller numbers first). Root-Finding Methods -------------------- * Root-finding methods are numerical methods where we'll see how to apply the notions of error that we just learned. * A "root" of a function f(x) over the real numbers is a value r such that f(r) = 0. Given f(x), we want to find a root r. That is, we solve f(x)=0 for x. * Example: solve f(x) = x^2 - A = 0 fo A > 0. * There are many methods for finding roots, we will just examine 3 of them. * Why do we care? a root of f(x) = x^2-2 gives us the value of sqrt(2) a root of f(x) = 1/x-7 gives us the value of 1/7 etc. * we assume that we can evaluate f(x). Bracketing methods ------------------ * Given a function f(x), an interval [x_0,x_1] whithin which a root is known to lie, x_0 <= x <= x_1, and f is continuous. * General algorithm: (0) start with the interval [x_0,x_1] (1) reduce the size of [x_0,x_1] to [x'_0,x'_1] s.t. x_0 <= x'_0, x'_1 <= x_1, x'_0 < x'_1, and f(x'_0)*f(x'_1) < 0. Go to (1) with x_0 <- x'_0 and x_1 <- x'_1. Go to (2) if f(x'_0)* f(x'_1) = 0. (2) At leat one of x'_0 and x'_1 is root; pick it and output it. * Note: - f(x'_0)*f(x'_1) < 0 iff f crosses the x axis between f(x'_0) and f(x'_1). Therefore there is a root between x'_0 and x'_1. Tht is, if a continuous function changes sign across an interval, it must take value 0 somewhere in that interval. - The interval [x'_0,x'_1] is called a bracket.