Robert Will, 12 August 2003
This document is a draft for future publication and copyrighted by
the author.
Abstract: A reference implementation of a new data structure algorithm is presented in an object-oriented language with structured programming. The implementation is noteworthy because it accomplishes two normally opposed goals:Finally I corrected two little errors in the original description.
- The implementation comes from a "real" library and does not implement a "pure" data structure, but a module (class) that is directly useful to software developers (not only developers of higher-level algorithms and data-structures). The implemented structures are a simple collection and an abstract list.
- The implementation was done in a systematic and literate way, meaning that (a) the program was derived semi-formally from a set of basic, but precise decisions, and (b) the program is presented in a text book like form: read once straight through and you understand.
Download this note together with source code and specification files. The links to the program sources below are to text files, until I have found a way to convert them to HTML with syntax highlighting. Until then you may use any of the free (as in beer) Eiffel IDEs or the free (as in speech) Eiffel mode for Emacs. Furthermore there is a printable version of the source (PDF) (or ps.gz), both made with a2ps.
A recent (1999) research paper describes a new basic data structure - resizable arrays - that can be used as a building block for other, more complicated structures. The paper shows applications of resizable arrays to stacks, queues, priority queues and deques. Since these are really fundamental structures and the resizable arrays make them strictly better (using much less space, with only negligible constant overhead in time), resizable arrays are already revolutionary for that reason alone. However, general programming does use efficient algorithms and data structures much less than it uses simple collection classes, one of the most popular being the VECTOR of Java and C++. (I'm using the Eiffel convention of writing class names in caps.) And exactly this popular collection structure can be considerably improved by the resizable array algorithm.
The aim of this note is to present the first published implementation of that "revolutionary" algorithm. I do that because unlike many other results of modern algorithm research it is directly applicable to programming practice and should spread as quickly as possible. Furthermore my implementation is a good example of programming methodology: how to develop a program, and to a little extend also on how to use inheritance well.
The next section and largest part of this note gives a rationale for the specifications of the implemented data structures, which are not "standard" ones, but especially tailored for software engineering. The implementation itself is mostly described in the source code below, with a short introduction to the cooperation of the classes in the Implementation section. The implementation language, Eiffel, which was designed as a language for software engineering performs extremely well supporting a clear and efficient implementation. Indeed the code is written in a literal fashion and together with the contained documentation makes up the heart of this note - the following sections only provide some context and - at the end - some conclusions.
The resizable arrays were implemented as part of DESSY - A library of data structures for software engineering. DESSY is the first data structure library that does not have its origin in the area of efficient algorithms and data structures, but rather in software engineering, especially software specification. To explain the difference between those two approaches let me cite a typical introductory example to algorithm libraries: an implementation of Dijkstra's algorithm (of finding shortest paths in a graph) is given, using a library version of the PQ data structure. But how many software projects do actually implement Dijkstra's algorithm or even a more complicated one?
On the other hand, software projects often have to cope with complex functional requirements and need a clear and precise way to specify them. For example, let's specify "the best grade of all CS graduates in the class of 2003". Where
Then the requested best grade is simply minimum(final_grade[cs_students ∩ students_2003]) (where the brackets rel[set] denote the relational image { x | (y, x) ∈ rel ∧ y ∈ set }). Software engineers will recognise such formulas from the specification languages B and Z. However that specification is already deterministic enough to be implemented directly. In the self-explaining Eiffel syntax we could write (the overloaded operators "+", "*" and "-" denoting set union, intersection and subtraction in the usual way):
local cs_students, students_2003 : SET[STUDENT] final_grade : MAP[STUDENT, GRADE] best_grade : GRADE do best_grade := final_grade.image(cs_students * students_2003).minimum end
This is an executable specification right in your implementation language! This way DESSY brings you higher abstraction with no extra language, no extra tools and no extra learning. And the concern for efficiency is separate: the data structure implementations will care for it. And in the case that the program doesn't deliver the needed performance and profiling reveals that the above one-liner belongs to the program's bottle-neck code, you can still keep the specification as comment and as checkable assertion:
best_grade : GRADE is
-- The best grade of all CS graduates in the class of 2003.
do
... more efficient version goes here ...
ensure
Result = final_grade.image(cs_students * students_2003).minimum
end -- best_grade
Note: best_grade is an Eiffel function that calculates its result in the automatic variable Result which is in turn used to express the postcondition in the ensure-clause.
By now the reader will agree that DESSY's design goals are rather different from conventional libraries. A much deeper rationale and explanation of the design of DESSY's specification will be given in an upcoming report.
DESSY uses inheritance to structure the interface of its different data structure specifications. In this section I present some of the important class specifications / interfaces of DESSY, while the next section focuses specifically on the successors of VECTORs.
Classes in DESSY are currently separated into four categories.
Client programmer will only need to know the model and operational classes and also only those they use. Automatically generated interfaces in Eiffel contain already the specifications of all inherited features. Operational and model classes will usually deferred classes, while instances must of course all be effective so instances can be created. (Deferred classes are called "abstract" in Java and C++, but since that concept was invented by Eiffel and the word "abstract" is already overloaded enough, I stick to the more precise terminology. Effective is the contrary of deferred.) By Eiffel tradition, deferred classes are marked with a star (*) in diagrams.
DESSY chooses a default instance for each data structure specification, so client programmers do not need to know about the instance classes. The current DESSY convention is to use the class categories as stereotypes in diagrams: <<model>>, <<operational>> and <<helper>>. Instance classes have no stereotype.
The difference between the model and operational specifications is very informal but still very useful. Typical abstractions include sets as already used above - which concentrate on the notion of element uniqueness - and sequences - which concentrate on the notion of an explicit ordering of elements, i.e. there is a first, a second, ... a last element in the sequence. In the absence of such mathematical abstractions, data structure specification are subject to some arbitrariness and hard-to-remember operations. Consider the example of an ARRAY class, which is indexed from lower to upper and an operation slice(i, j) of that array which returns a segment of the array (or a sub-array). There was a discussion which should be the bounds lower and upper of the returned array. Possibilities are:
The second one makes sense, if the array is viewed as a mapping (relation) from the set {lower, lower+1, .. upper} to the arrays domain. slice(i, j) will then return the restriction of the relation's domain to the set {i, i+1, .. j}. The first and last one make sense, if the array is viewed as a list (where lower = 1 and both cases are the same). In many cases arrays are just used as a simple collection and such a slice operation is not useful at all, only distracting programmers and raising discussions on its proper semantics. Similar problems arise when grow and shrink operations are considered. (How should new array elements be initialised? In which applications are "default values" sensible?)
Incidentally this example illustrates the two data structure specifications that are implemented here with resizable arrays, namely lists with specification class LIST, and the simple collection in class ARRAY_EXTENDABLE (whose name will become clear in a moment). DESSY does also provide a structure for the mapping interpretation, namely ARRAY_TABLE (which is a special case of the general TABLE class representing mutable maps).
So for what, then, do we need operational classes?
The factoring out of common operations has several advantages in turn:
Some classes that represent such out-factored abstractions are presented in the next section.
Note: Conformant inheritance between two classes means that the subtyping relationship is established between the types defined by those classes. The conformant heirs are part of a class' interface so clients know where they can pass instances of the class to. Non-conformant inheritance, on the other hand, is a private affair of a class, it's independent of the interface.
DESSY's top-most classes are TRAVERSABLE, SETABLE, and EXTENDABLE. TRAVERSABLE is the top of the hierarchy defining an interface to which all other classes conform. It defines some ubiquitous features and serves also as a standard of compatibility: many routines work with TRAVERSABLEs. In the diagram only the most important features are shown (but there are not many more) and inherited features are not repeated. SETABLE and EXTENDABLE add some basic mutating operations to TRAVERSABLE's interface. EXTENDABLE adds add and delete which can be thought of as operations of "dynamic" data structures, while SETABLE adds set_item fitting "static" data structures. Most (mutable) data structure implement both interfaces, but SETs, for example, are only EXTENDABLE, and FIXED_TABLES as shown in the diagram, are only SETABLE. Immutable classes, such as STRING, conform of course only to TRAVERSABLE.
The diagram is a client-view of some DESSY classes: the upper part shows the three top-most classes together with their iterators, the lower part shows some example heirs, LIST and ARRAY_EXTENDABLE are the two classes implemented in this note, ARRAY_TABLE is the third abstraction born out of the traditional array, that has found its place in DESSY.
Before I close in on the arrays, let me make on remark on the diagram. UML class boxes have a section for attributes and a section for "methods", and obviously may diagram doesn't conform to that convention. The diagram is a client-view of the library and what does a client care if an argumentless query is a function or an attribute? Instead I adopted the Eiffel convention to separate a class' features into queries and commands. The former are either attributes or functions, the latter are procedures. The difference is important, because Eiffel adopts the structured programming principle that all functions are free of client-visible side-effects. Consequently such a separation is the only sensible way to draw up an interface diagram. And if UML provides no notation, we have to do it ad-hoc.
Arrays and linked lists are the traditional basic data structures in imperative programming. And arrays have always been preferred by programmers, because they are simpler (actually built-in in the language) and considered more efficient. Arrays have first been provided in a static fashion (bounds fixed at compile-time) as in Pascal and FORTRAN, then in a dynamic fashion (bounds fixed at allocation-time) as in C, and finally in a resizable fashion, like the Perl built-in arrays, or the VECTOR classes of C++ and Java.
Nowadays VECTORs are the most-popular data structure in said languages, where the use of the low-level arrays is even considered bad style in many communities. (Although this is also based on the fact, that the low-level array of C++ and Java is the only structure that is no ADT, and this fact is just an unnecessary inconsistency in the language. DESSY builds up on the class BOB_NATIVE_ARRAY which is implemented by the run-time-system as a typed chunk of memory (with a size), and which can be used as a normal ADT - it even has pre- and postconditions.)
Despite their popularity, however, VECTORs have a couple of drawbacks:
Now, it's good to hear that the resizable arrays do make the capacity stuff indisputably obsolete, because they are efficient in the first place! The new algorithms offer optimum efficiency and DESSY attempts to pay with that efficiency a new level of abstraction. Perhaps I should copy and paste that phrase to a poster and hang it over my bed, for it captures the essence of my work: not only to be faster and leaner, but to make programming easier and to make better programs possible.
Finally, I can present the automatically generated specifications of LIST and ARRAY_EXTENDABLE. (ARRAY_TABLE will be released soon, together with all other already drafted classes from DESSY.)
Both implementations use the same main design idea: the data structure is implemented twice, once with a standard algorithm and once with the resizable array. This approach has several advantages:
The latter improvement does of course defeat the competition tests. But since the shared routines are much simpler than the low-level ones so that a lot of errors can still be found.
All code has been written in a completely structured programming way: no jump-like constructs, all functions side-effect free, no call-by-reference, ubiquitous use of routines, invariants for all non-trivial loops. As a consequence the code is very readable, but still efficient. Since the code is not optimised in any way, it can of course be made more efficient. Structured programming keeps the code clean enough to make this possible.
Source Code:
ARRAY_EXTENDABLE,
ARRAY_EXTENDABLE_ITERATOR,
DOUBLING_ARRAY_EXTENDABLE,
RESIZABLE_ARRAY_EXTENDABLE.
ARRAY_EXTENDABLE and its iterator are the only classes visible to clients. These two define the interface of a simple collection, which is exactly a union of SETABLE and EXTENDABLE. The iterator and the command add are implemented using the deferred features put, at, grow and shrink, the feature clear from EXTENDABLE rests deferred. As a consequence, the classes RESIZABLE_ARRAY_EXTENDABLE and DOUBLING_ARRAY_EXTENDABLE have really only those 5 features to implement which is basically the same as in the paper. All the other features are inherited.
Source Code:
LIST,
LIST_ITERATOR,
ARRAY_LIST,
CIRCULAR_ARRAY_LIST,
RESIZABLE_ARRAY_LIST.
Here the pattern is similar: LIST and LIST_ITERATOR define the interface, ARRAY_LIST and LIST_ITERATOR implement most of the stuff and the deferred routines are: put, at, (shrink|grow)_(front|back)_by and set_size. The classes CIRCULAR_ARRAY_LIST and RESIZABLE_ARRAY_LIST have even some of their design factored out: in both cases the routine set_size is responsible for maintaining the size_invariant, while the growers and shrinkers maintain the offset.
Side-note: In contrast with the simple collection the list implementations have grow and shrink functions for arbitrary values, not only shrink-by-one. This is a minor optimisation that doesn't really make the code more complex. Although it may seem at first sight, this has nothing to do with the following fact: repeated insertions or deletions that happen at one end only cost asymptotically the same when performed all-at-once by a specialised add_many command. This is why the ARRAY_EXTENDABLE's add_many stays with the standard implementation from EXTENDABLE which is just a sequence of adds. On the other hand, insertions in the middle of a list l take O(l.length) time, and when adding i elements with a specialised function, the time is still O(i + l.length) = O(i.max(l.length)), which is much better than the quadratic O(i * l.length). That's why LIST has the insert_segment function and ARRAY_LIST doesn't implement it with a sequence of inserts. The determining operation for the time to insert in the middle is the moving of the elements before or after, not the growing or shrinking. This means that the operations insert_segment, delete_segment and keep_segment would still only need linear time even if growing and shrinking was in steps of one.
The last thing to know when reading the code, is that routine contracts (pre- and postconditions) are inherited and not restated in the heir. (Although a proper editor will show them for you.)
The usual way to use a low-level data structure such as a resizable array is as a supplier for higher-level modules (classes). A collection or list implementation would have a resizable array as attribute and use it just like the native arrays are used as storage. This allows easily to use the resizable arrays for the implementation of several independent structures and also to keep those independent from the resizable array itself. However, in my library I had only one structure to implement with each kind of resizable array, and structures like the ARRAY_EXTENDABLE don't offer many features anyway, so the relation of actual code with forwarding-only code would not be very good. (This also indicates, that in case the design will be changed to include a new abstraction à la RESIZABLE_NATIVE_ARRAY this change will be very easy to do.)
Still the classes contain "in-class" interfaces, that is, a layer of secret routines that is used like an own little abstraction. This makes the classes easier to understand and it can serve as a break-up point in later refactorings. To summarise, I don't think that the inheritance-based approach is superior to a client-based approach (which is in any case more general, because it makes interfaces explicit), but for present classes it is just the right solution.
As the cited paper explains, both algorithms interact very badly with the buddie algorithm for dynamic memory allocation as it is used, e.g., in someBSD. However, the problem doesn't occur with other allocation algorithms, e.g., Doug Lea's malloc which is widely used. Anyway, I think that in cases of bad interaction, an adaption of the resizable array technically belongs to the lower level: the run-time system should offer resizable arrays, just like it offers BOB_NATIVE_ARRAY now. This ensures that the data structure library will really stay portable and the separation of concerns is maintained, keeping all components simple.
The version of the algorithm used for LIST (which differs from the paper as explained in the code) is my own invention which I made even before reading the cited paper. I also had the idea for a completely non-copying version, but I didn't come up with the invariant and the superblock concept presented in the reference.
As an extra gimmick the LIST class linked to above, contains an implementation of binary search which is an small and self-contained example of the constructive approach to programming and an homage to Edsger W. Dijkstra.
Robert Sedgewick's Algorithms (then the second edition) in its Pascal version was the first computer science book I ever read. This book formed my opinion on what computer science is, for quite some time. It is all the more too bad, that professor Sedgewick has chosen three languages à la mode which are all behind the modern concept of design by contract and don't even support generic typing, which is so important for abstract data structures. Eiffel is a practice-proven language that supports both of these and additionally a systematic and structured approach to programming. Eiffel is the ideal language for a book on algorithms, and if I ever have time, I'll rewrite Sedgewick's examples in the language that they deserve.