Covariance -- Appendix B: Disciplined Exceptions
-or-
Error Management is Risk Management

Exception handling is another concept that has been abused to get easy software extensions. "Oh! We have forgotten a case here. Let's make it an exception!" This technique looks very promising: you discover an additional layer of complexity in your application problem and you map it to an additional layer of your solution, namely the exception handling mechanism. The notion of "exception" sounds so promising and seems to fit so many cases in application problems: there are male and female customers -- and kids that don't pay (exception); there are first class and second class wagons in a train -- and the restaurant wagon (exception); we can drive this to the extreme: "Every natural number is the successor of another natural number -- zero is an exception."

Unfortunately this technique doesn't scale in the least: what do you do if you discover yet another layer in the application domain? Of course no sane programmer would abuse the exception handling mechanism in such an exaggerated way; when exceptions are introduced, the examples are usually more reasonable. But one problem remains: the meaning of "exception" is not well defined; no current method provides clear guidelines for what the exception mechanism should and should not be used for. This message outlines such guidelines, and does so in a very strict way.
This means that the developer's creativity in using the exception mechanism is heavily restricted. That sounds like a disadvantage, but in fact has many advantages: first, the resulting designs become much simpler to understand because they embody a clear separation of concerns; second, the resulting designs are scalable: adding exceptions to the exceptions (as will happen in most software evolution or maintenance) is no problem; and third, the developer can concentrate his creativity on the really challenging part: treating application exceptions reasonably (this requires good analysis) and managing the risk of remaining errors in his software.

Application Exceptions versus Remaining Errors

OOSC1 introduces what comes nearest to a method for using exceptions well. (The relevant part of the book can be found in Meyer's paper "Disciplined Exceptions".) However, I think we can arrive at still sharper principles. The basis of my method is the observation that the Eiffel language equates every raised exception with a failed assertion, which is the origin of that exception. If we take this point of view seriously, we find that each raised exception is due to an error in the software! (This includes the special case where the error consists in a false assumption about the software's environment and is therefore not visible in the software alone.) As a consequence of this observation I will use the term "run-time error" instead of "exception" and "error handling" instead of "exception handling". This change in terminology alone already helps to make clear what the "error handling mechanisms" are about!

In short, the guideline I propose is to use the normal tool box of OO and Design by Contract to treat all kinds of application situations, and to reserve the error handling mechanism for a purpose that normal programming can't accomplish: the management of errors in the software itself.
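The guideline can be sketched with the "kids that don't pay" case from the introduction. This is only an illustration in Python (not the Eiffel notation the text refers to); the names `Customer`, `ticket_price` and `CHILD_AGE_LIMIT` and the age limit itself are assumptions made up for the sketch. The application "exception" becomes an ordinary case distinction, while assertions are reserved for conditions whose violation would be a bug in the software.

```python
from dataclasses import dataclass

# Hypothetical limit; the text only says that kids don't pay.
CHILD_AGE_LIMIT = 6


@dataclass(frozen=True)
class Customer:
    age: int


def ticket_price(customer: Customer, base_price: int) -> int:
    """Return the price a customer pays for a ticket."""
    # Preconditions: a violation is a bug in the calling software,
    # so it surfaces as a run-time error (a failed assertion).
    assert customer.age >= 0, "precondition violated: negative age"
    assert base_price >= 0, "precondition violated: negative base price"

    # Application "exception": children travel free. From the software's
    # point of view this is just another case, treated by normal code.
    if customer.age < CHILD_AGE_LIMIT:
        return 0
    return base_price
```

Note that Python drops `assert` statements when run with `-O`; a real project would have to decide how its contracts are checked in deployed builds.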
When the use of exceptions is restricted so strictly, developers must be assured that they can indeed treat all application exceptions with standard programming techniques. Fortunately this is simple when taking the right point of view on the software requirements. For example, the case that a system of linear equations is unsolvable can be seen as exceptional, but if we think of its set of solutions, it is quite normal that this set can be empty.

Sometimes it is difficult to distinguish between an application exception and an implementation (or analysis) error, especially since simple OO analysis methods do not support such a distinction well. For example, you might find as a precondition of an operation `reserve_seat' that the flight number must be valid. Now the question is: who is responsible for ensuring that precondition? Is it another part of the software that calls `reserve_seat'? Then we can assume that the precondition always holds; otherwise we'll have a run-time error due to a bug in the software. Or is it simply impossible for the user to enter an invalid number? For example because he chooses the number from a list, or because he enters the number by inserting an electronic ticket which is guaranteed to contain a valid number. In any such case we can equally suppose that the precondition will hold; otherwise we'll have a run-time error due to a bug in the analysis.

The last case, where the user or some other unreliable system is responsible for the validity of the flight number, must be considered more thoroughly in analysis. What should the software do in that case? Probably give the user a chance to enter the number again. (But what if he copied the number correctly from the ticket, and the number on the ticket itself was invalid?) It might also be useful to tell the user why the number is invalid; perhaps the flight has already taken off?
Anyway, under such circumstances the condition `flight_number.is_valid' is not a software precondition, but a condition that distinguishes the normal flow of the use case from the non-normal flow. From the user's point of view, an invalid flight number is an exception; from the software's point of view it is just another case to consider.

While all such application exceptions will finally be implemented by case distinctions in the software (or melted cases, ...), the analysis process itself may profit from introducing several levels, where each level can assume conditions that were treated on previous levels. Michael Jackson (??) gives the example of a compiler where syntactic analysis and type checking are two levels that are described independently in the requirements and specification phase and that may finally be implemented in an interlocked way.

In general we can say that any "error" or exceptional behaviour for which the requirements describe a certain reaction will be treated via case distinctions and similar means in the implementation. The error handling mechanism of the programming language, on the other hand, is only used to handle exceptions for which no reaction has been specified and that "should actually never occur".

This principle accommodates the fact that software is a discrete world: things either can happen or cannot happen; whereas reality is a continuous world: some things can happen with very low probability. In requirements analysis we decide which things we want to assume to be impossible; and while programming, run-time error handling ensures that the damage is limited when "the impossible happens". That's why I distinguish between implementation errors/bugs and analysis errors/bugs: the former are purely logical errors, e.g. you write CHECK x /= Void END when in fact `x' could be Void; the latter can also be logical errors, but more important are the "approximation errors".
According to Jackson we always have to approximate the real world when we map it to the discrete world of software. Even if we don't make any development or analysis error in this process, there will be some cases left which we simply cannot treat because they are too improbable. One has to stop somewhere, as Jackson says.

The practical difference is that a perfect software development will have no logical errors left when it is completed. Although significant projects cannot reach that goal with reasonable effort given the current state of technology, good projects will get very near to that ideal. For the "approximation errors", however, we can never be "100% error free": we can only minimise the probability of remaining errors. After all, software is finite and thus can have only finitely many errors, all of which we can in theory (perhaps some day in practice) find and correct, however many they are. But the real world is infinite and we can never perfectly approximate all of it with software. So much for that little excursion.

Debugging and Risk Management: How to treat Remaining Errors

How to correctly treat all the exceptional cases that occur in software analysis, and how to make a good design from the requirements, is the topic of a general software development method. No special language feature can help there. On the other hand, we do need a special language feature to take care of errors in the software. Of course, we could emulate assertion checking with normal IF statements, but that would not do justice to the dual nature of specifications and code. Documentation and readability alone already dictate that assertions and error handling code be clearly separated from the rest.

But what should the error handling mechanism do? In most programming languages the error handling constructs are basically some new kind of conditional construct: IF exception THEN actions END.
This approach is of course very general, but it doesn't offer the advantages of a real error handling mechanism either. The only thing it does is "jumping out of some routines, passing error information downwards", but the same thing can also be done using error states in some objects and other conventional programming idioms; see general design method.

** Bad error handling methods suppose that there is some action to
** perform when an error has occurred, but this is just contrary to what
** we have seen in the last section: approximation errors cover exactly
** the cases where no reaction has been specified, and logical errors
** indicate that some assumption is false, and by their nature we
** don't know which one (otherwise we could have fixed it in the
** first place), so we can't know what to do either.

The only sensible thing to do when an error has occurred is to stop the system (or a large part of it) and possibly restart it. We have to remind ourselves that the purpose of assertions is to prevent incorrect results; consequently we can only stop the system when we find incorrect behaviour. This is the first obligation of a good error handling mechanism.

The second obligation is to gather information that allows us to find out what the error is. If this information is good, all of debugging can happen via strong assertions, complete test cases and crashes: the stack trace points directly to the error. This debugging functionality can be provided without extra effort from the programmer: one doesn't need to write any error handling code to get the crash and stack trace; the run-time system can do it automatically.

For the second purpose of error handling (risk management), however, the programmer must make some effort, because he must encode some decisions that follow from the design and the requirements. Bertrand Meyer describes error handling code as "patching up the environment" so that an operation can be retried or cancelled without harm to the rest of the system.
Furthermore, error handling code should "restore the invariant". And all this must be done without any precondition, since any of the earlier assertions in the program might be false. Although this sounds reasonable at first, it is not at all. It is simply impossible! A class may be able to restore its internal invariant, but it cannot restore its representation invariant without any precondition. Look for example at the invariant of the class RESIZEABLE_ARRAY_EXTENDABLE:

    size = 0 implies storage = Void
    size /= 0 implies size <= storage.size
    size /= 0 implies storage.size < 4*size

If an exception happens, we can simply do `size := 0; storage := Void' to reestablish that invariant. But in doing this, we destroy the representation invariant, since it corresponds to deleting all elements from the data structure. Thereby we break the data structure's client in a way just contrary to the purpose of error handling: instead of signalling an error and avoiding a wrong output, we swallow the error and provoke a wrong output. Bravo, bravissimo!

In place of Meyer's rules for error handling code I therefore propose the following two. The purpose of error handling code is always

1. Either to restrict the crash of the system to a crash of a component,
2. Or to translate between different notions of "error" in different components.

The first point requires that there is still something sensible to do when the component has failed; this "sensible thing" has to be found in the context of the application domain and the requirements. The translation in the second point converts what one component of the system considers an error into what another component considers merely an exceptional case. The latter component will then just catch the error and call the code that treats the exceptional case.

Example for the first case: if the spelling correction in a text processor crashes, it can simply be deactivated and the program continues to run.
Thinking about the application context (especially the user, since it's an interactive application) is important here: otherwise we could just say that a defective spelling correction simply supposes every word to be correct. But this would give the user a false sense of security; a sensible de-escalation must clearly indicate to the user that the spelling correction is now inactive.

Example for the second case: for an optimisation problem you have a relatively simple algorithm which calculates a not too suboptimal solution and another relatively simple algorithm which calculates the optimal solution, but which is not guaranteed to stay within a reasonable amount of memory; in other words, it may run out of memory (even if there's "much memory" available). Of course you want to write a little routine that calls the second algorithm to calculate a solution and then calls the first algorithm in case the second one fails. This routine translates the error of the second algorithm ("out of memory") to a simple case of the problem ("optimal solution not easy to find").

Translation error handlers are also very important when interfacing to bad software where run-time errors are wrongly used to signal exceptional cases. Bad software also often suffers from the fact that its authors didn't know about preconditions, and a lot of errors disappear when one introduces Design by Contract.

Another very typical example, which can be viewed as either case one or case two, is the roll-back (or cancelling) of (data base) transactions. On the one hand, we can say that any error during the transaction induces the abortion of the transaction (and only the transaction) -- that's a crash restriction. On the other hand, we translate any implementation or other internal error to an externally visible action: the roll-back of the transaction.
(Note: the German software engineering vocabulary is more precise than the words "internal" and "external" here: we may speak of "fachliche Gründe" (domain reasons) for anything that relates to the application domain and the requirements (a user clicking "cancel" or a business rule that induces the same action) and "technische Gründe" (technical reasons) for anything that relates only to our technical solution (programming error, out of memory, ...).)

Programming languages let error handling code distinguish between different kinds of errors. This raises the question of how this possibility should be used, and alas conventional methods don't have much good to say here. In some text book examples we can even find the following disastrous pattern. (Java semantics, imaginary syntax.)

    catch FOO_ERROR do
        ACTION1
    catch BAR_ERROR do
        ACTION2
    else
        -- ignore all other errors
    end

This is absolutely contrary to the principle of honour "Prefer crash over wrong output or uncontrolled behaviour". What we must do instead is "delegate all other errors". Incidentally this is already the semantics of "rescue" clauses: all run-time errors that are not explicitly cancelled keep walking up the call stack and lead to a crash if they aren't cancelled on any level.

As a consequence of the principle of honour we should program error handlers in a defensive way: cancel only the errors which are really harmless and let the others pass. This is especially true for error translation: we must cancel only the single kind of error which signals the exception we want to translate; all other errors must be left as they are. De-escalation error handlers can be more universal: they can cancel any run-time errors that will not affect the functioning of other components, and by default in an object-oriented system we can assume that for all kinds of errors. (For exceptions to this rule, see the keyword Modular Protection below.)
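The two kinds of handlers described above might look roughly as follows. This is a minimal sketch in Python rather than in the Eiffel or Java notation used in the text; all names (`solve_optimal`, `solve_heuristic`, `solve`, `SpellChecker`) are hypothetical, and the "optimal" algorithm merely simulates running out of memory on larger inputs. The translation handler cancels exactly one kind of error, the de-escalation handler cancels any error of its component but records it and makes the deactivation visible to its caller.

```python
import logging

logger = logging.getLogger("error_handling_sketch")


def solve_optimal(problem):
    """Stand-in for the exact algorithm; simulates exhausting memory."""
    if len(problem) > 3:
        raise MemoryError("search space too large")
    return sorted(problem)


def solve_heuristic(problem):
    """Stand-in for the simple algorithm with a suboptimal result."""
    return list(problem)


def solve(problem):
    """Translation handler: 'out of memory' becomes an ordinary case."""
    try:
        return solve_optimal(problem)
    except MemoryError:
        # Cancel only the single error we want to translate.
        return solve_heuristic(problem)
    # No catch-all clause: every other error keeps walking up the stack.


class SpellChecker:
    """De-escalation handler around a possibly defective component."""

    def __init__(self, check_word):
        self._check_word = check_word
        self.active = True

    def check(self, word):
        """Return True/False, or None once the checker is inactive."""
        if not self.active:
            return None
        try:
            return self._check_word(word)
        except Exception:
            # Keep the stack trace for the maintainer, then restrict the
            # crash to this component instead of pretending all is well.
            logger.exception("spell checker crashed; deactivating it")
            self.active = False
            return None
```

Returning `None` instead of `True` after a failure is the point of the second handler: the caller can tell the user that spelling correction is switched off rather than silently treating every word as correct.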
Aspects of continued debugging

Error handling code thus serves to make programming and analysis errors less fatal when they trigger failures in the field. While doing this, it remains our responsibility to exterminate those errors ultimately. We should therefore still collect information about the errors and their context, even if they no longer lead to a complete crash. Since this collection of information is very similar to the one that happens on a crash (stack trace etc.), it should also be done by the run-time system.

Case distinction: "translation error handlers" don't need any debugging mechanism, since the error is no longer considered an error. Other errors are still considered defects of the software and should be monitored in order to eliminate them. For logical errors (see section "Application Exceptions versus Remaining Errors") we should collect the same information as on a crash and save it so that it can reach the developer (or maintainer) one day.

For approximation errors, however, such treatment may collect too much information, since we may have intentionally accepted that such errors occur from time to time. We are already aware that our analysis is not perfect (because it cannot be) and therefore we're not interested in a detailed report of run-time errors that occurred due to our approximation. However, it can still be interesting to collect data on how often such an approximation leads to a run-time error. Usually such errors require manual treatment by a human, and our assumption during the analysis was that this would happen only in rare cases. Statistics on the frequency and type of such errors can help to validate that assumption and to change the software if necessary.

Expecting the unexpected

In a completely specified and well-tested piece of software, the only errors that can happen are in the run-time system (libraries, operating system, hardware), or the "out of memory" error.
This level of quality can and must be reached at least for reusable libraries and any software that admits complete specifications. Assertions and informative crashes help to reach that goal (if the test cases are sufficient in number and quality).

The "out of memory" error is a relative of the "out of time" error, which usually doesn't manifest as an error, but rather as a non-reacting program. For many applications we can (and should) estimate the needed time and space, and (given bounds on the input size) we can often guarantee that memory will suffice and responses will be timely. But for data structure libraries, as a counterexample, we cannot provide such guarantees, since it will always be possible for the client to create a data structure that is larger than available memory. In such cases it is important that only the data structure affected by that overflow gives up functioning, so that the client's error handling code can stop one component of the system and keep the rest running.

In this context the concept of Modular Protection is important, for two reasons: to find bugs in the software, it is important that those bugs manifest as errors in the same module (at the latest they must be caught by the contract at the interface between modules); and it restricts the effect of bugs in already deployed software. Memory errors especially can easily spread into arbitrary modules; that's why automatic memory management is so important for serious programming languages! And that's another advantage of the object-oriented method: by default all instances of a class are independent; if some instances are part of a crashing component, the others will still continue to run as they should. That's why one has to take special care when using singleton objects, for example to manage unique resources. In such a context, one has to write error handling code which prevents errors in one user of the singleton from spreading to all the others.
(Of course, this is only necessary if the users of the singleton are in different "crash zones"; however, libraries must always assume this, because clients may use the same library in a very serious part of their program (application core) and in many not so serious parts (graphical interface, web interface, plug-ins, optional components, ...). Then the library must ensure that errors cannot spread between those parts.)

But back to the sources of errors: obviously we never know exactly which errors in the software can lead to a run-time error, but usually we should have an idea about it and most often we can think of some examples. Logical errors can be anywhere in the software; only if we keep a journal during debugging can we get an idea of the probabilities of remaining errors and the parts where they could be. (Of course this requires very disciplined and structured testing, otherwise we'll suspect errors in the parts we tested most and will be surprised by errors in the parts we tested less.) Also, we can create a list of possible approximation errors while working with the customer. If the software engineer asks: "What if in situation X we get Y?" and the client replies: "Actually that should never happen." then the semantics of "actually" gives us a candidate for an approximation error...

In any case, errors are by definition unexpected and therefore it makes absolutely no sense to force the declaration of possible "exceptions" in routines as the Java language does. That's absolutely opposed to the sense of error handling! But well, perhaps the next programming language will be more sensible.

A detour on "self-correcting software": I recently read about a guy who had invented a way to check the output of some algorithm for correctness and even to correct that output; that is, he had invented a new algorithm that would calculate a correct result starting from an almost correct one.
Students of error handling might now suspect that one day error handling will also be able to work like that: spot and correct errors in the results. But that's not at all what error management is about! What the guy has invented is just a way to create a new algorithm from a given one, and that can be programmed with completely conventional methods:

    Result := calculate_old_way
    if not correct(Result) then
        Result := calculate_new_way(Result)
    end

We have to keep in mind that error handling treats a technical concept of error, a bug that has been created accidentally during development. Error handling doesn't treat the many kinds of errors that might exist in the application domain. Recognition errors, spelling errors, typing errors, scanning errors, ... All those can be treated with conventional methods in a good software design.

Unfortunately, people often think it would be advantageous if the programming language provided specialised features to model the concepts that occur in the problem. But the contrary is true! A programmer needs to understand his tool completely and he needs to understand the application problem. The simpler the tool, the more energy can be spent on the real problem. Complex problems can only be solved with simple tools. In a simple programming language all the necessary concepts can be created and they can be reused through libraries. The user of a simple programming language has the power to create his world and to use others' worlds. Simply.

Another bad example is the crash management of Microsoft Word: each time the text processor crashes it will restart automatically and present the user with a list of files containing the last saved version of each open file, the last auto-saved version and a version saved just at the time of the crash.
The user must then choose which of these versions to restore, and the program supports this decision by displaying, for each version of each file, a list of internal errors in that file; but the program doesn't say what those errors mean to the user, and it seems that in general they can be ignored. This is all horribly bad: the poor user has done nothing wrong, but suddenly he has to wait while the program is restarted and he is faced with all that useless information and those decisions. Why can't the program clean up, all by itself, the mess that it created all by itself? Why can't it just restart and reload the file that it just saved during the crash? Or, since it actually could do this, why can't it just not crash and keep the file open? Why shouldn't that have the same result?

Many programmers support the argument that software is becoming more and more complex and thus becomes more and more like human beings: it sometimes makes errors, but it is "robust" against such errors and allows interactive correction. Now, part of that argument is surely true: software is becoming more complex. But the conclusion is based on an analogy that doesn't hold. Clearly software will not become perfect and will not soon be able to act like human beings. And clearly software may indeed make errors when it tries to predict the user's intentions, but this must not be confused with a crash! A crash always has a cause (that is, a bug) and that cause can be eliminated. Good software doesn't crash. Never.

Error handling is for the really rare cases and only for software where even in such a rare case some basic requirements must be fulfilled. For a text processing program it is no problem if the program crashes and loses all data since the last save, as long as such crashes happen very rarely (e.g. every 200 years on daily usage, that is once a year for one user out of 200).
Instead of writing error handling code that insolently bothers the user, programmers should kill the bugs in their code and reduce crashes to a level where they're no longer statistically noticeable. (I wonder how many of Word's crashes are due to "segmentation faults", a symptom that cannot occur in garbage collected software, see Modular Protection above. Perhaps the introduction of Design by Contract alone would kill most of Word's bugs, but that's not our story.) It is just so sad that such a trend-setting program shows such a bad weakness. That's not good for the industry.

Acknowledgments: Many thanks to Michael Jackson (not the singer) and Bertrand Meyer for their inspiring works.