{29.Nov'03} {15.Dez'03} {19.Dez'03} Modularity without Abstraction is no more than Syntactic Sugar Abstract: To (re)use some piece of code without looking at its source, you need an efficient description of what it does. Without such abstract descriptions (called specifications) you cannot achieve the independence of several parts of a program (or system). Specifications can be easily expressed without much ado -- if one knows how to treat abstraction properly. Knowing the distinction between abbreviations and abstractions can make you a better programmer. Design by Contract increases productivity by an order of magnitude. ** Introduction ** It has been quite agreed on that Modularity is the number one mechanism for the Efficient Construction of Large Software. And Modularity for most people means, that different parts of a program can be understood and modified in a more or less independent manner. And more independence is better. In this note I want to consider that there are two kinds of Modularity supported by programming languages. The distinction can be made using existing and well-known programming languages, so don't be afraid that all my discourse comes down to the proposal of a new language construct. I think we have enough of them! This note is about mastering languages better. What I have to say is not specific to a language construct called Module, but applies to any program unit with a scope, the smallest of which is usually the routine followed by Abstract Data Types and other stuff. In this note I consider the object-oriented case where Modules and ADTs are unified in the concept of Class. In other languages the argumentation holds for Modules and ADTs separately, or whatever are the units of a program in a particular language. Consequently, I'll consider only routines and classes in my examples. ** Syntactic Modularity ** The most concrete representation of information hiding in programming languages is the hiding of certain identifiers inside of program units. There have been many disputes on which identifiers exactly should be hidden from which program units for which clients and under which circumstances. In my opinion all these disputes are void, because they concentrate on the wrong side of the program unit's interface: what the clients are interested in, are the exported identifiers, not the hidden ones. And the reason of hiding identifiers is to a large extent to emphasise the others, the exported ones. If the exported identifiers are well designed and documented, there is no problem in exporting an identifier to much, whether accidentally, or because the language rules are not subtle enough. A client programmer will not use that identifier anyway. And if the supplier programmer wants to be sure, he can add a comment: "Don't use this." Of course it would be better, if he could express that in the programming language. But a client programmer using the identifier nevertheless is calling for his own trouble... Equally often, there have been disputes over whether a language should be permissive or restrictive, allow much or forbid much (which means in the scoping issue: hide much). The theory explained here will help us to decide on some of those issues. For example, we will see whether it is useful that clients can assign values to class attributes or not. In this note I will argue for some restriction, but first I want to consider, that the proponents of permissiveness had always a principal argument on their side: Of course, programmers do want something from their language, they want benefits, not prohibitions. So if a language prohibits something, there must be a benefit. If a program unit hides some identifiers, where is the benefit? I already considered it beneficial, that hiding represents a recommendation to use the exported identifiers. This is the first benefit of Syntactic Modularity. Another benefit is simply the ability to reuse the hidden names! I remember an early article about library design (then the author called it "Principles of Package Design"), in which one of the recommendations was to choose especially ugly names for library internal identifiers so that the risk was low, that a client programmer would accidentally choose the same name. (The language FORTRAN made this especially funny with its restriction of identifier names to eight characters and no support for qualified names. In modern languages hidden identifiers can be reused completely freely without even thinking about it, and imported identifiers can be used in a qualified way, so that they don't conflict with the client's names. (OO languages make the qualification via the type of the call target, since all imported identifiers are class features to be called on objects.)) This separation of names is very useful, because it allows quite much independent modification of program units without any extra efforts: to profit from it, programmers have to learn nothing special and the concepts -- for example, routine locals -- are very obvious and only hard to learn, if one had prior exposure to worse languages. Furthermore it also allows separate compilation and the use of libraries, again, without any extra effort on the side of the programmer. You see, Syntactic Modularity has a lot of advantages and that's why it is supported by all modern programming languages, and it's one of the reasons those languages are better than many ancient ones. However, Semantic Modularity has infinitely more to offer... ** Semantic Modularity ** What I mean by Semantic Modularity will probably be easy to understand for people acquainted with the mathematics of programming. Its about specifications, implementations and the latter refining the former. But in fact, one doesn't need mathematical training to understand Semantic Modularity. It's just about explanations and descriptions. It is clear that one can only reuse a Module (routine, class, ...) if it is well documented or if one reads its source code. The premise of this article is that in the general case, we cannot rely on the source code. This is not for the reason that the source is not available, since the present theory applies principally to inner-project Modularity and to the design of medium to huge systems. It is rather for the reason that the source is often too complicated to serve as its own documentation. Advocates of Extreme Programming say that every piece of code should be so simple that it can serve as its own documentation. I say that every piece of code should be as simple as possible, and nevertheless a separate description can often still be simpler! And the user of a Module needs the simplest description possible. In the following paragraphs I will consider that only a good documentation of a Module can make the Module really independent from other Modules and I'll also show how this documentation looks like. Usually the documentation will be in the source file forming one document with the executable code, but nevertheless it is distinct from the executable code. Requirements of precision, which I'll explain in a moment, demand that the "documentation language" is "somewhat near" to the programming language, but nevertheless it mustn't be as restricted as a programming language necessarily is (because the latter must be efficiently executable). Although each source file contains code and documentation it logically consists of two descriptions: one is for the computer and tells him what to do (I call this "code"), the other is for the user of the Module and tells him what the Module does (this is the "specification"). Note that I used the words "what .. do" in both definitions: for code and specification. Traditionally it is said that only the specification describes the "what" and the code describes the "how". However, such a distinction is much to imprecise to be helpful. Let's spend a minute to make the distinction more clear. It is a good idea to define the two terms by their _purpose_ and then to derive their properties needed to fulfil the purpose. Stealing from Carroll Morgan (1994) we can say: The purpose of code is to be efficiently executable, while the purpose of a specification is to be efficiently understandable. (Please excuse me, that the following three paragraphs are not really clear. I'll have to rewrite them. In case of trouble, skip to the next paragraph.) In software engineering it is important that every description is complete regarding its scope. Parts that are left undescribed are unusable. (Well, you may just try out how it works, but then you'll maybe rely on accidental properties which can change as soon as you use the program unit in another context (another platform, another version, use of other operations, ...).) Apart from being complete, the descriptions should also be short and simple. Some people will want to add "descriptions should be abstract," but this is already a consequence of the former: to be short, simple and complete, our descriptions will have no choice being abstract. And not abstract in a theoretical or mathematical sense, but abstract in mentioning just the essential aspects and leaving all else aside. This is a good moment to cite Tony Hoare who defines in his "Notes on Data Structuring" abstraction to be specific to a purpose, "the matter at hand". Descriptions can be written in different languages, and usually at least one the languages used by a programmer is a programming language. In that case completeness is not a point, since an incomplete program would be incorrect anyway. In fact, the program code is the description that defines completeness and the other descriptions have to keep up with it. Shortness and simplicity, however, the other two qualities of descriptions do of course apply to program code, too. ** Abstraction is Simplicity ** I have already stated the above criteria in recent postings to comp.lang.eiffel where you can also find some further examples. In the preceding section I wrote on the the description of the module interface which should be as simple as possible. But it is an even more important part of True Modularity that the interface itself is as simple as possible. This simplicity is so important for the user's of modules (whose own tasks are already complex), that it is really worth sacrificing some other goals. Especially information hiding does often mean, hiding some internal concepts that could be useful to some clients (especially to do optimisations), but which are not needed for the general use of the Module. Nowadays everyone knows, that there are only a few things worse than premature optimisation, so we should be ready to hide (those concepts from the client). I will take data structures as an example for interface design. The availability of practical mathematical models (like sets) for such Modules makes the difference between a collection of routine and a really good abstraction easy to see. Iterators, for example, can be a bad abstraction, since they don't hide the explicit concept of a "position" in a data structure, which is arbitrary in many cases. In those cases the desired effect can more simply be reached with a routine of the data structure class itself, such as `filter' or `copy'. In many cases such a routine will take another routine as argument, for example `for_all' and `exits' take a boolean function (aka a predicate) as argument and check whether this function yields true for all or at least one element of the data structure. We can call such routines "higher-order routines" because at least one of their arguments (and therefore the data they treat) is a routine itself. Another common example is the routine `do_all' which applies a command to all elements of the data structure. Of course, iterators can also be useful, first for the implementation of those higher-order routines (which is a library-internal job and not visible to the client), and second for applications which do need more than what is provided by the higher-order routines. However, the complexity of iterators (and loops) is too much for the many common applications where higher-order routines suffice. Many data structure libraries provided the higher-order routines in a later version or in a way not really highlighted in the manual, so that in fact many programmers do still write stupid loops where one routine call would do the job. Another lack of simplicity is a feature `capacity' as provided by the classes of most data-structure libraries. The mathematical model of sets or bags or sequences don't have such a `capacity', and application programmers don't need it. Consequently, due to the lack of a simple model, every operation on a data structure must be explained by saying how it changes the size, the capacity, and the contents of the structure. And what makes it difficult is that those things are deeply connected: if you change one thing, the others will perhaps also have to change, and it must be told, how they will change; or maybe the value of one variable doesn't permit the change of another (e.g., size can't be greater than capacity), and then also it must be said what happens. All this requires a huge documentation effort and if one considers how simple an implementation of VECTOR with an underlying array is, most people will prefer the XP (extreme programming) variant and consider the code as specification. In an array-based implementation the specification of the `add' routine, for example, would have O(1) running time if "size /= capacity" and O(size) else. Furthermore the specification would have to say by how much the capacity is changed. I once saw a library whose arrayed collection did indeed the stupid thing to copy the whole array just to increase the capacity by one. Such a specification makes the `add' routine unusable for repeated insertions, clients must somehow "block" their operations to increment the capacity manually beforehand or to call specialised operations like `add_many'. This can considerably complicate the client's code, if the insertions happen to be in something more difficult than a trivial loop, where the number of insertions is known before. It is certainly a bad interface, which makes trivial things not easy and non-trivial things even more difficult. ** Abbreviations vs. Abstractions ** The great Edsger Dijkstra has once criticised programmers for using routines as pure abbreviations of the code they contain and not as abstractions which introduce higher-level program parts. In the modern method of programming, routines can serve as both, but only as abstractions they make a real difference. I'm very happy to have a good example at hand: take integer square roots, which are more formally defined by the floor of the square root, which is in OO: "x.square_root.floor". You notice that "square_root" is not actually a feature of INTEGER, so you finally write: i : INTEGER ... i.to_real.square_root.floor This is a rather long queue, so you introduce an abbreviation: integer_square_root( i : INTEGER ) : INTEGER is require i >= 0 do -- ensure Result := i.to_real.square_root.floor end The comment "-- ensure" tells us that we want to consider the code of this routine as its specification. Furthermore the routine has no header comment, since the specification provides already enough formal documentation and the routine name suffices as an informal reminder. Remember that abstraction is about useful redundancy, and not about stupid repetition. Maybe some reader will oppose that abbreviation, since it isn't actually much shorter than the original code and it needs six additional lines of code (and three more with the useless repetition prescribed by some methods). Furthermore, having the direct formal version in the code is more precise than just "integer_square_root". On the other hand, just "integer_square_root" is easier to read and it will also work as an abbreviation in the reader's mind. As such it can help avoid little errors, such as accidentally writing "i.to_double.square_root.floor" or "is.to_real.square_root.ceiling" or even "i.to_real.abs.square_root.floor". But that's not all! While programming you may make use of the property that the square of "i.integer_square_root" is never larger than "i" itself. You can document this observation as an additional postcondition to the routine: ensure -- consequence: Result * Result <= i The comment "-- consequence" reminds us that the stated assertion follows logically from the definition of "Result", we haven't strengthened the postcondition by adding it. Distinguishing between definitions and consequences helps understanding, because one can show different useful views which are equivalent and everyone can pick out the variant most useful to him. One can implement one view and then use another view in building a client --- the specifier has already made sure, that the views are consistent. Furthermore you notice that the "Result" is also the largest number which fulfils "Result * Result <= i". Since multiplication is monotonic we can write that as "i < (Result+1) * (Result+1)". Perhaps in your program these are the only two assumptions you ever make about the "integer_square_root". This is no wonder, since these two assertions already completely specify the routine. As a professional programmer you will therefore change your routine to: integer_square_root( i : INTEGER ) : INTEGER is require i >= 0 do -- ensure Result := i.to_real.square_root.floor ensure -- alternative definition: Result * Result <= i i < (Result+1) * (Result+1) end The comment "-- alternative definition" tells us that each of the two definitions implies the other. The equation implied by the assignment on the one hand, and the two inequalities on the other hand. Now both implementer of the routine as well as the user can decide which of the two definitions they will implement or rely on, respectively. That's an advantage. (Dijkstra would also tell you to prove the equivalence of the two definitions. I rather recommend, you look it up in a mathematics text book, that also helps you to stay near to conventional definitions, which in turn means you can reuse more already proven laws. By the way, that's why I defined the integer_square_root using floor, not ceiling.) (Remark on "equation implied by the assignment": Each assignment of the form "var := expression" has the canonical postcondition "var = expression", if "var" doesn't occur in "expression". If "var" is local this is easy to check; for class attributes one needs to take aliasing into account. "salary := boss.salary - 300" will not have the canonical postcondition, if "Current" is her own boss.) What started as a simple abbreviation has turned out to be more useful than thought. In my opinion: Abbreviations have the same importance as definitions in mathematics. But that's only where the story starts! All the development I presented so far could have happened in a real project: you start with the simple abbreviation and than add more useful information. (Believe it or not, that's really how I program.) Later, when testing your program, you will of course notice that it is to slow. I say "of course", because when programming very abstractly, you often know that it is not efficient enough -- you just don't know where the inefficiency is. Okay, you run the profiler. Let's suppose the hardware your program runs on has no proper floating point unit and integer_square_root is the culprit. (Yeah, any scientific example must be unrealistic somewhere, and besides, such hardware might really come upon you, in embedded software development, for example. And before someone objects: I really believe we should use OO technology also for embedded software: anyway we're going to speed up the algorithm now.) Okay, what we do is just changing a little bit: integer_square_root( i : INTEGER ) : INTEGER is require i >= 0 do ... ensure Result = i.to_real.square_root.floor -- alternative definition: Result * Result <= i i < (Result+1) * (Result+1) end It is easy to write an exponential-time algorithm working on with integer multiplication: just start from "Result := 0", keeping "Result * Result <= i" invariant and increasing "Result" by one until "i < (Result+1) * (Result+1)". This algorithm can then be transformed into a linear-time one, as is shown in at least one book on the so-called Programming Methodology: Dijkstra, or Gries, or Carroll, see below. Perhaps its even in two or all of them, alas, I can't look it up at the moment. But actually, it's so easy, we can do it right here. The idea is to increase "Result" not by one, but by some number "j" starting from "i" and being halved as necessary. The program then becomes: integer_square_root( i : INTEGER ) : INTEGER is require i >= 0 local j : INTEGER do from j := i invariant 0 <= Result and Result * Result <= i -- A 1 <= j and j <= i -- B variant i - Result + j -- Either we decrease `j' or we increase `Result'. until i < (Result + 1) * (Result + 1) -- C loop if i < (Result + j) * (Result + j) -- D then check j > 1 end -- from "not C and A", i.e. -- (Result+1)^2 <= i < (Result+j)^2 j := j \\ 2 else Result := Result + j end end ensure Result := i.to_real.square_root.floor -- alternative definition: Result * Result <= i i < (Result+1) * (Result+1) end Given the loop invariant and exit condition, the loop obviously establishes the second definition of the integer square root, given that it ever should terminate. But that is easy, too: "j" cannot go below "1" and Result can not go above "i", yet one of them approaches its bound in each iteration. That's all of the proof! With a little more effort, one can show that the running time is really linear in the number of bits of "i". (Note that we have not only proved the algorithm correct, but we proved the program. The result applies directly to the executable code.) Now imagine you were given that program without the assertions. And perhaps the variables a bit more transformed, so that the apparent redundancy of "(Result + ?)" disappears, and probably with some more instructions, since without "Programming Methodology" one usually doesn't arrive at the shortest possible code. Then, could you tell what the program did? Even if it was called "integer_square_root" (and not just "isqr" or something), how would you know whether it's the "floor" or the "ceil" version? How would you know whether it uses "abs" or has a precondition? And even if you'd knew all this: how would you know whether it really does this or perhaps has a little bug right between the numbers 8 and 9 where the result is suddenly changing? Not to ask about running time... Abstractions have the same importance as theorems in mathematics. ** Divide and Conquer ** Every programmer knows Divide and Conquer Algorithms: they divide the input data in several parts, treat those parts separately and then join again. And every programmer knows that this technique makes algorithms asymptotically faster: they may not be faster for small inputs, but they are faster by an arbitrary factor, if the input is sufficiently large. Therefore programmers consider the asymptotically fast algorithms to be on a better level of (performance) quality, even if a competing algorithm has been hand-optimised for the smallest constant factor possible -- still, that only wins for small inputs. Only a few programmers are familiar with Design by Contract and split up their programs into different parts, implementing each part relying on the specification of the other parts. Most programmers know that specifications often just repeat the code and therefore cost a lot of effort, which doesn't easily pay off. In a system with three parts, you would write three specifications (perhaps each only just a bit simpler than the corresponding code) and to code each of the three parts we'd rely on the specification of the two other parts. That may hardly be a win. But in a large system the simplicity of all the Module specifications adds up so that you still can understand a Module implementation and all specifications it depends on, while without specifications, all the code would simply not fit into your brain any more. Consequently, Design by Contract, Divide and Conquer on the design level, has an asymptotic advantage over other techniques. Every other language feature does only yield a constant factor in programmer productivity, this can beat Design by Contract for small programs. (Anyway, most language features only make programs shorter, and often make them harder to read. Given how fast people can type (and that the computer can help) while no-one helps you understanding a program, those features are questionable anyway.) But even the simplest programming language with proper Design by Contract can beat any other programming language on large projects. Design by Contract is no silver bullet -- no more than the classical Divide and Conquer algorithms are. But Design by Contract is an essential technique which increases programmer productivity by an order of magnitude. Programming by Refinement is the technology of the future and it isn't yet ripe for practical use. But there's already one important thing we can learn from it: It's perfectly fine to consider code as its own specification, as long as there isn't any simpler way to specify the behaviour of that code. With this in mind, we can use Design by Contract efficiently even for small problems. I would like to conclude this note with a reference to a good book on Design by Contract, but I don't know of any. (Bertrand Meyer has announced one long ago, but given his other projects, I might complete mine faster.) So I can only leave the reader with my usual recommendation: drink hot tea, read good code and care for abstraction. References: Dijkstra and Feijen (sp?), ??: "The Method of Programming" David Gries, 1980: "The Science of Programming", Springer Verlag Carroll Morgan, 1994: "Programming from Specifications", 2nd ed., Prentice Hall. Bertrand Meyer, ?? "Principles of Package Design" Tony Hoare, 1972: "Notes on Data Structuring", in Dahl, Dijkstra, Hoare "Structured Programming", Academic Press. Robert Will, 2003: "Resizeable Arrays in Optimal Time and Space - An Example for the Constructive Approach to Programming" http://matrix.fem.tu-ilmenau.de/~robert/resizeables