Data Structures for Software Engineering
Dessy: Simpler, Faster
Extended Abstract
DESSY is a new design of a library of data structures which differs
considerably from the state of the art and the libraries widely used
in practice. DESSY focuses strongly on the needs day-to-day
programming of data processing application (in contrast to highly
optimised algorithmic solutions). DESSY provides all ubiquitously
needed functionality and easily allows addition according to the need
of each project. With its simple design DESSY explicitly supports
software design and specification. The comprehensible description and
structuring of the libraries classes is supported by the use of some
simple mathematical concepts and an extensive use of object-oriented
techniques (including contracts as a matter of course). DESSY uses
up-to-date algorithms for its examplary implementation (among them the
new memory-efficient resizable arrays). The code and internal
documentation is of high clarity and meant to be studied by others.
Clarity on the stage and efficiency behind the scenes.
The design towards those goals has lead to some of DESSY's uncommon
properties:
-
A ubiquitous use of point-free (or holistic) operations:
instead of writing loops with iterators, the DESSY way is to use
operations that work on the whole data structure at once. DESSY's
simple (yet powerful) iterators can be used to implemented more
complicated operations (and they are used internally to implement some
of DESSY's operations).
-
DESSY's classes are divided into operational and
model-based ones. The operational structures correspond to
simple patterns of usage, like DESSY's most basic class TRAVERSABLE
which just offers features like for_all, do_all,
... and is used as argument of many of other classes' routines.
Model-based classes on the other hand are based on a mathematical
model, such as a set or a table, and offer any operation that is
sensible on that structure. This way the programmer is freed of
memorising all the operations and at the same time it allows for real
implementation independence. (Usually at the cost of some efficiency,
this is discussed below.)
-
The model-based classes are divided into mutable and
immutable ones. Most programmers know only mutable data
structures, such as tables (dictionaries) and list, which allow their
contents to be updated. On the other hand, many important data types
are immutable, for example integers are never updated (only integer
variables can be updated by explicit assignment) and every operation
yields new integer values. Likewise, DESSY provides immutable
structures, like SETs, MAPs and SEQUENCEs (which correspond to the
mutable MEST, TABLE and LIST). Those classes describe values that can
be combined to yield new values (e.g., SEQUENCES can be concatenated,
all of them can be filtered and element-wise transformed). Immutable
classes have diverse applications: they yield more abstract code than
their mutable counterparts, they can be used for specification
purposes and they allow persistent implementations: two structures
that have almost the same contents can share memory and need barely
more space together than one of them needs alone.
-
Many other libraries do just implement the most common algorithms
(such as linked lists, hash tables, resizable arrays using the
doubling technique) and offer for each the operations that the
implementation most naturally provides. DESSY on the other hand,
relies on the application programmer's needs to design the class
specifications. Therefore it sometimes needs unconventional
implementations. However, none of the implementations is ad hoc, all
are theoretically analysed algorithms, they are asymptotically optimal
in most of the provided operations. For example, DESSY has no singly
linked list because all the list and sequence data structures are
symmetric (regarding first and last, front and back) in their
interface. And there is no doubly linked list for reasons of
persistence in the immutable case and element access by index in the
mutable case. Instead DESSY's resizable arrays (which implement
ARRAY_EXTENDABLE and LIST) and the functional deque (which implements
SEQUENCE) are relatively uncommon algorithms that fit best the needs
of those structures. (They are uncommon, because they were invented
after computer science had decided about its standard set of most
important algorithms, which unfortunately still contains linked lists
and a lot of complicated balanced trees.)
-
Other libraries provide a class VECTOR (ARRAY in EiffelBase) which is
basically a Resizeable Array with added features. Unfortunately this
class represents several models at the same time: Like the traditional
Array it is a mutable mapping from an integer interval to some domain.
Furthmore it provides operations like insert at any place which rather
belong to the List model. And finally it is most often used as a
simple container with just one add and delete operation and simple
traversals. The result of that mixing of models and features is that
that VECTOR is not good on any of its purposes. In Dessy, VECTOR is
seperated in three parts: First, some descendants of FIXED_TABLE,
which represent traditional Arrays for several ordinal domain types
(INTEGER, CHARACTER, descendants of ENUM); second, an implementation
of LIST using Arrays, and third a class ARRAY_EXTENDABLE which
represents a simple container. Its name stems from the fact that it
does only implement the basic interfaces TRAVERSABLE and EXTENDABLE.
This way ARRAY_EXTENDABLE offers a very simple interface, a very
efficient implementation, and it doesn't depend on any object equality
mechanisms which would otherwise have to be designed by the
application programmer. (Note: of course there is usually a standard
equality function, but to decide whether that fits your application
purpose you have to think about the design anyway. But often, you
won't ever call that function, so ARRAY_EXTENDABLE spares you that
thought.)
-
The separation of specification and implementation (which is supported
by the fact that many implementations really comply to the same
specification instead of providing specialised operations) allows a
simple and efficient means of program optimisations: just try out
which implementation is the fastest (or leanest) for your application,
you don't need to rewrite any code except the choice of the structure
implementation.
-
The algorithms in Dessy are intensionally not optimised.
Optimisation always goes towards a specialised goal, a special
application, special time/memory tradeoffs, special usage patterns, a
special underlying machine, ... Dessy is meant to be a general
library, not a special one. Instead Dessy provides features that
usually perform in asymptotically optimal time. This means that no
operation is slower than would be intuitively expected by the client
programmer, which means that Dessy needs no warnings of the kind
"don't call this when ..." or "only call that when ...", which means
that Dessy provides a true higher level of abstraction, because the
application programmer need not care about efficiency for a long time
(and for most applications never at all). And if the performance of a
Dessy algorithm or its implementation doesn't suffice the needed
optimisations depend on the application and can usually not predicted.
(Or they can predicted, but that optimisation would surely hurt
another application.)