February 2004
The design combines the two principles of
My original inspiration were the structures of the Z and B specification languages (long live Abrial). But those are based on classical mathematics, while the structures described here (as well as Hehner's) are meant to be self-contained: the source itself contains the minimal theory needed to reason about the structures.
The idea is that:
Having separate specifications and implementations (whose consistency is automatically checked) includes positive redundancy in the code: there are several views for clients and implementors which make the code easier to understand, and (due to the automatic checking) make the life of bugs very, very hard.
Inheritance, on the other hand, reduces negative redundancy: any kind of repetition is factored out to a higher level, this makes the code shorter for better maintenance and it makes the commonalities more explicit which helps understanding and abstraction.
Data Structures serve three different purposes in programs which correspond to the three basic activities of programmers:
The first duty of a software developer is to describe what his program his going to to. This must happen in terms of the software's application domain, but as the software process proceeds we turn to describe the software's behaviour in terms of its data structures. This description is the specification of the software, and special languages have been developed to write such specifications. Having abstract data structures which can be regarded from many points of view is a distinguishing feature of such specification languages. For example, in the Z and B languages Sequences are Functions from 0..(N-1) to the elements and Functions in turn are right-unique Relations. So we can, for example, use relational composition to specify with one simple expression the concept of sequence permutations. Using the right abstract data structures a programming language can be used to write such specifications and many of them will even be (efficiently) executable.
Next, any program we develop should best split in parts and these parts have to communicate somehow. Only the smallest functions in a program will work with single items of data, most functions treat a lot of data in one go, and this data must go in and out the function somehow. The format in which this data is to be transferred should be independent from both the consumer and the producer, and that's where we need abstract data structures. Then data is produced and consumed in an abstract way and the concrete data structure can be chosen separately for each producer-consumer-pair (instead of being dictated by either one) — that's separation of concerns, just what a good software design needs. Abstract data structures also separate the concerns of efficiency (choice of the data structure) from functionality (use of the data structure, see "specification").
Algorithms are the last point in my list, although many libraries
of data structures are just built for that purpose. (Also research on
data structures concentrates here.) Many formal specifications are
not efficiently executable and then we must derive an algorithm
to perform the specification. This derivation (and only this
part of the programming process) is a purely formal activity, whose
correctness can be demonstrated with a (formal) proof. (At least a
sketch of that proof should be part of the algorithm's (internal)
documentation — the maintainer will need it.) Contrary to data
structures for specification and communication which have to provide a
lot of operations none of which all too inefficient (to permit several
points of view), data structures made for algorithm development only
have to provide very view operations (just those needed by the
algorithm) and these should be as efficient as possible. The average
programmer will implement such algorithms very rarely and that's why
this standard is biased towards the purposes of Specification and
Communication. However, "algorithmic data structures" use the same
mathematical semantics for their operations, so both algorithmic and
general- The purpose of abstract data structures is first to make programs
clearer and only second to make them faster.
Although this short presentation suggests that the activities of
Description, Design, and Derivation are performed in a sequential
fashion during software development, this is not so: the activities
are intertwined and most often they can really not be separated from
each other. Each design has to be described, each program is a
description, too, and specifications and algorithms have to be
designed to make them really clear and useful. The same holds for the
use of abstract data structures: a function specified very abstractly
can be an efficient implemtation if we choose the right concrete data
structure.
The three first criteria for the library standard are: Simplicity,
Simplicity, and Simplicity. Only a simple standard can really be used
by all programmers (including beginners and experts in other fields
for whom programming is only a tool). Only a simple standard allows
extensions without drowning in complexity. And only a simple standard
can integrate all aspects of the problem without artifically excluding
some. A good standard captures the essentials of the field, no more,
no less. (I wrote a separate note that discusses the interaction of simplicity and efficiency.)
The simplicity of this standard is more profound than other notions
of simplicity since all the semantics of the operations is given
explicitly, formally and completely. There's nothing left implicit,
so there's no hidden complexity.
The many data structures of this standard have many similar
operations and it has to be decided how they should be named.
Inheritance is one possibiltiy to share names between different
structures, if the corresponding operations have (the same signature
and) common semantic laws. Furthermore we have the possibility to use
similar names for similar things (names that follow a pattern)
or to use Module namespaces to have one and the same name denote
formally independent, but informally similar things. In order to
narrow the design space, this standard proposes the semantic name
principle:
Every systematic sharing of names should have a formal semantic
basis, which is made explicit in the standard. By consequence,
every sharing of names can happen by inheritance.
In fact this means that this standard rejects the use of Module
name spaces and large-scale naming patterns to share names. Each
group of similar functions has either common semantic laws, then they
share a name via inheritance; or the similarities are rather weak,
then they get only similar names. This case occurs so rarely that we
need no naming patterns. Operations have either different names and
semantics (like add insert put or remove subtract
delete_key) or they already have different signatures (like
insert_element insert_segment).
The Semantic Name Principle is actually about much more than names:
by allowing us to concentrate all the specifications (and
implementations) of one concept in one place. It is just this
principle which avoids us a lot of negative redundancy. If
you have already seen one of the many incomplete libraries of data
structures out there, you'll probably have noticed that it contains
huge piles of duplicated code. Well, you can imagine how much code a
complete library would have. The Semantic Name Principle
saves us from this code flood. After all, aren't functional
programming languages good at eliminating bad redundancy?
The Order of Arguments was chosen so that (1) infix uses are
as meaningful as possible (that is meaningful infix uses should be
possible ;-), and (2) in other cases let the collection argument(s) be
the first. The first rule extends to such cases as f <: xs ++
ys >: b where you can put the parenthesis' as you like, the
meaning will always stay the same — and the constituents of the
resulting sequence will be in there in just the order they are
written. The second rule was mainly choses to have a rule at all, so
that users don't have to learn the order of each separate function's
arguments. However, having the collection first is the most simple
choice and it ressembles a bit the object-oriented style.
If we take into account all the names that have been used in other
libraries and all the different semantics they have, we'll get quite a
mess. Therefore I have taken a step back and taken the simplest names
possible. The most striking examples of this renaming is the new name
apply for the function formerly known as map because
that crashes with the name of a data structure (and the name fits
better for the data structure), and the abandoning of the "list
terminology" which utterly inconsistent with abstract data
structures. I always confused CAR and CDR anyway.
Binary functions, especially if they have algebraic properties like
associativity, commutativity, should be given operator names.
However, names that have no tradition are not very readable. And
operators that are composed of ASCII-Characters are not very
beautiful. That's why some functions have two names. The following
table also contains proposed Unicode symbols, first because I hope
Unicode-supporting Editors will appear soon, and second because they
are already useful in documentation generated as HTML or PDF.
Notes:
Finally, we have to bear in mind, that many of the symbol-named
operators can be continued on Collections (see the discussion of the Prelude functions) and
we have to reserve verbal names for this purpose. (In Unicode we can
however take "big versions" of the operators to denote their
continuations.)
The immutable data structures of a purely functional languages are
like mathematical values, for example the integers or mathematical
sets. The behaviour of such classes can be completely specified via
simple laws like in algebra. That's why we call them algebraic data
structures. The algebraic laws make the data structure specifications
autonomous: we can understand them without the need to understand
anything else. Furthermore the laws can be used directly in
programming. Finally sometimes the laws for a function are a
canonical implementation for that function. In such cases we will
state the laws as a function definition: this is executable code, but
at the same time it is a specification function: the function
may actually be implemented in any other way, but it must behave in
exactly the same way as the specificative definition.
The specification of the data structures is centered around the
three (abstract) constructors empty, single, and
<+> (pronounced "join"). These form an algebra which
is a monoid for all collections. In formulae we have:
Using the three constructors with these laws and one single
"deconstructor" split we can already specify a lot of other
functions completely, i.e. by giving an abstract implementation of the
functions. split observes the following laws to which
subclasses of Collection will add more:
The ⇒ in the last expression denotes logical implication. The
expression states that an element occurring in one of
<+>'s arguments, must also occur in the result. This
property does indeed constrain split because has is
defined in the following way:
In usual algebraic specifications, every expression is assumed to
denote a different value (as if all the functions were data
constructors), unless there is an explicit axiom (like
`join_associative` above), that states equality. However, for our
Collection that doesn't work, since the subclasses will introduce more
axioms.
That's why I give the axiom `dont_loose`: it gives a lower bound on
equality, stating that (a <+> b) cannot be equal to a Collection that
doesn't have `x`. Don't ask me, how that fits into the theory of
abstract data types. I think it's practical here.
The functions that are defined using split are the
following: is_empty size count has / first last but_first
but_last front back has_index (!!) index_of / remove /
fold reduce apply filter fold1 reduce1 zip_with zip. All
those functions are completely defined with laws that hold in all
subclasses of Collection. (For the definitions see the standard.) The variation of their
semantics depends entirely on the axioms for empty single
<+> split that the subclasses will add.
Note that the function (!!) has a meaning
distinct from the Haskell Prelude, such that first coll =
"1st element of coll" = coll !! 1. While
in Computer Science it is often practical to count from zero, for
applications it is more consistent to count from one.
It is remarkable that the functions hence defined are not only
quite readable (not worse than algebraic specifications ;-)), but
there are also efficient. That's because the result of
split is underspecified just at the right point: each data
structure implementation can return two parts of the structure as it
pleases. This has the consequence that many operations are
structure- In this respect, our split technique is comparable to
iterators in imperative data structures. Of course, functional data
structures cannot have imperative iterators, since those are able to
mutate the structure. But all the other advantages of
iterators are retained by the split approach. A splitted
structure represents the "position" just between the two parts
returned by split. This is indeed the nearest we can get to an
interator and seen in the previous paragraph we have no need to get
nearer.
One consequence of our three-fold algebra are the new fold
and reduce functions which take at their interface one
function for each of the abstract constructors and which traverse in
their implementation each data structure according to its concrete
constructors.
Example uses of the functions in the standard:
As you can see even the traditional folding functions can be made
adapting their traversal direction.
The specification languages Z and B, which have
largely influenced the design of Relations in this proposal, view Maps
(functions) as a subset of Relations (Sets of Tupels). This is very
natural in classical mathematics, but not in the data structure
business. (Hehner's practical theory doesn't do so, either.) Here we
have separate Maps and Relations and the class
AssociativeCollection only serves to factor out for of their
common operations — domain range image inverse —
which can all be specified using set comprehension: (Until that is
available in Haskell, we use filter and apply in the
real code.)
These functions could as well have been given outside of a type
class, but I envision that some implementations of Maps and Relations
can implement them faster than the obvious linear time. For instance,
it is part of the "democracy" principle that neither side of a
relation dominates the other. Consequently there will be
implementations that do inverse in constant time, just as
democratic sequences do with reverse.
Also note a correspondance of the function image with the
well-known apply when we have functions as data (that is,
Maps).
Relational (and Functional) Composition is one of the most
expressive operations in the Z and B languages: there is
only one operator defined for Relations, and since Functions are a
special case, it will automatically adapt to functions. (Composition
is closed on functions, i.e. a function composed with a function
always gives a functions.) In our programming language, however, we
have to cope with several different types: Relations and Maps which
are not subtypes of one-another, and the program-functions. This
gives the following nine possibilities of how we can compose things:
Assumed the types:
And set being the ordinary conversion function.
The operators /./ /. on Maps and their correspondents on
Relations /°/ /° do actually have the same type and could
already be defined in AssociativeCollection. However, since
the type of ./ and °/ is so fundamentally different,
I preferred to separate all symbols into a consistent set. By the
way, it is also the operation °/ with its result of type
a → set b which makes it impossible to define just one
overloaded operator:
Futhermore one could criticise that Maps have to be converted to
compose them with Relations (we don't have an extra symbol for that
case), while the symbols /./ and /. could be
replaced in the same way by writing at map . fun and at
map ./ map, respectively. However, I think that the proposed set
with six new operators and the existing . is the
simplest one we can offer.
Perhaps confusing: A Function is defined via f =
λ x → y (from left to right) and the type a
→ b has the same "direction", but the Function is applied
via y = f x (right to left). Consequently, functional
composition "threads" values from right to left: (f . g) x = f (g
x) which gives the types as seen above. For Functions this easy
to keep in mind, but for Maps and Relations we have the additional
possibility to view them as Collections of Pairs and in all the
methods of Collections the Pairs have the form (x, y): again
from left to right. From that point of view it would certainly be
more intuitive to have composition types like (a → b) →
(b → c) → (a → c), but unfortunately this would
break with too many conventions. (TODO: be more precise here). So
one has to keep in mind, that Functions, Maps, and Relations are all
applied and composed "against the sense of their type".
The axioms specified in Collection are already all we know
about Sequences: every two Sequences that cannot be proven
equal with the Monoid axioms are different. A way to say that
formally is by specifying (++) (Sequence's synonym
for <+>):
I have used the functions first but_first which are
already specified (like any other query) using split.
Incidentally most of the famous "list" operations are already
specified in Collection, so Sequence doesn't add
many more. Three of them are add_first, add_last
and reverse. (In the meanwhile the first two have already
been renamed to <: and >:, which are much more
intuitive together with the View data constructors :< and
:>.)
Obsolete rationale for the names add_first
add_last: in the next section I'll reveal that
CummutativeCollections (i.e. Sets and Bags) have an operation
add and Sequence have operations
insert_element and insert_segment to insert a single
and a sequence respectively at any position of a sequence. This would
actually suggest to call the operations insert_first and
insert_last for consistency, but I think these names are just
one syllable too long. And we can think of a good explanation why the
functions are called add_*: instead of inserting the new
elements into the sequence, they are added on the
outside. Well, if that doesn't convince you, change the name.
In earlier versions of the specification the functions
were called cons and snoc which I think are very
funny names. But on 20 february 2004, midnight I realised that
the semantic link between cons front and snoc back
is just too weak: consistency literally cries for names that contain
the words "first" and "last". (Not "front" and "back" since those
would suggest adding sequences, but that is already done by
(++).) Well, giving up cons and snoc is a
bit sad, but I am deeply convinced by Eric Hehner's statement
("Unified Algebra") that we should name things as simple as possible
and not with historic terms.
As explained in my note on Democratic Sequences, the Sequences of this standard are not
biased towards one or the other end, neither to their middle or any
other point. There exist implementations that perform any point-wise
Sequence operation (including (++) front back) in logarithmic
time. (In Dessy such an implementation is the default.)
Furthermore there are implementations of Sequence that
implement the following functions in constant (amortised) time:
empty single add_first add_last / first last but_first
but_last. The Sequence class therefore makes Stacks,
Queues and Deques superfluous (and also the classical "lists"):
instead of writing separate specifications and implementations for
them, we can use the Sequence specification together with the
specialised implementations. This makes life easier for library
users, especially beginners: they don't have to learn the Stack, Queue
and Deque specifications and understand the formal relation to
Sequence. (Experienced programmers will only appreciate that
after some habituation, since they already know all this
specifications, and have to appreciate that they can actually forget
them without harm.) (In Dessy these specialised implementations are
deque sdeque queue.)
Unlike the orderings of Sequences which depends on the order (or
place) of insertion, Ordered Collections have an implicit ordering of
elements which is determined by a relation on the elements.
Consequently all collection have (either an implicit or an explicit)
ordering. The advantages of that fact are explained in an earlier document and the earlier specification.
The implicit ordering is formally specified like this:
The implicit ordering can be made explicit by the following
conversion functions with the obvious semantics:
It is good explicit style to use always one of these
functions when converting from an Ordered Collection to
Sequence. Indeed I am not yet in how many places the
function ascending should be called implicitly. This
question arises always when we generalise a function of
Sequence to the Ordered Collections. But we have already
made a commitment to the implicit style when we defined the semantics
for fold (see the earlier design documents). For example we
can define first = reduce1 const and last = reduce1 (flip
const) which automatically gives on Ordered Collection first
= min and last = max and not vice versa.
The different subclasses realise that ordering in different ways,
with commutative and non-commutative join operators. Thanks to this
law some queries get a stronger meaning and we commemorate that by the
introduction of the following synonyms:
And the introduction of the functions smaller larger
smaller_equal larger_equal equal.
Actually, minimum and maximum are not only
synonyms, since are already defined in class Collection, thus
they function also on Sequences where they are different from
first and last. By the way, the operations
minimum but_min and maximum but_max are the essence
of priority queues and the three subclasses of
OrderedCollection describe indeed three different semantics
for priority queues. PQ implementations can thus be used with this
standard in the same way as queues etc. with Sequences.
Operations that "join" two balanced data
structures often take O(N*log N) time (with N = size a `max` size b).
For example
But if we "align" two ordered structures with each other, we would
just have to do an "element-wise" operation — we could do it in
linear time. In order to do that, defined the following function:
It's specification is that merge_with f a b will call
f (Just x) (Just x) for each x that is in both
collections, f (Just x) Nothing for each x that is
only in a (and vice versa for b) and then it joins
the resulting collections. (f Nothing Nothing is guaranteed
never to be called.) With such an operation in linear time, we can
easily implement (\/) (/\) (\\) (\+/) (\-/) all in linear
time! Nevertheless this first approach had a lot of disadvantages:
Actually the last two problems could be solved by given the
function an additional parameter (the join operator to use) and using
f Nothing Nothing as the unit of that join operator
(i.e. empty). In fact, the following function does exactly
that:
The specification of merge_with dis eq (*) is that
eq gets fed with all elements that are in both Collections,
dis gets pairs of disjoint Subcollections, and finally
(*) is used to join the results. This version has still the
problem that I can't think of a sensible formal specification that is
simpler than the implementation, but at least the implementation is
very simple! Surprisingly we can even give a default implementation
that works already optimal for many concrete data structures (since it
is based on the abstract functions smaller larger_equal).
It is interesting that the default implementation of
merge_with is in fact just the "hedge union" algorithm on
ordered trees! Every tree data structure can profit from that
algorithm and the benefits are really qualitative, since this
algorithm can be used to implement a lot of other functions. For
example, if Set union is fast, we can just write add set x = set
\/ single x and get the optimal O(log N) running time! (Since
the exact time bound of merge_with is O((N * log M) `min` (M
* log N)), or something like that.) This is a real qualitative win
for the user of a data structure, since he doesn't have to choose
between two operations: to add several elements to a set should one
use several adds or rather prepare a set of the new elements
and \/? For which number of "several" elements should one
switch from add to \/? Well, with an efficient
adaptive algorithm it is all the same!
Ordered Collections and ByMaps: Maps are actually not
ordered by the elements (in the Collection sense), but ordered by the
keys. But since the elements are pairs and pairs are ordered by the
first component first (this is the key for Maps) and since Maps have
only one element per key, ordering by keys implies ordering by
elements. The element with the minimum key is at the same time the
minimum element. However, in some respects we must treat Map's
ordering different from that of Sets and Bags. For example on Maps
maximum a < minimum b does not imply that the Maps are
disjoint (joinable in constant-time), since a and b
could have the same key and only a different image. This difference
means that a lot of functions on the ordering are different for
Sets/Bags and Maps, instead of in the class OrderedCollection
many function have to live in CommutativeCollection or
OrderedMap, respectively. But actually the functions are so
similar (redundant!) that common specification should be possible.
I'm thinking of a class OrderedByCollection... This will be
redesigned together with the ByMaps (below).
Some of the collections form a commutative monoid, that is they
observe the law
This law is observed by both OrderedBag and
OrderedSet. Sets have the additional law
The law union_idempotent completes a completely algebraic
specification of set union that doesn't need to refer to another
formal system (as an element relation and logical operators). We can
do the same for bags:
The usefulness of sets is undebatable, they occur in many
programs. Bags are not as ubiquitous, maybe their most popular employ
is in the specification of sort bag xs == bag (sort xs).
There is another point to consider: Sometimes we provide functions on
Sequences that preserve an (implicit) ordering of the sequence.
Foremost example is the insert from the standard module
List, which preserves the invariant: ordered xs
==> ordered (insert x xs). Maybe we would like
to have a data structure like Sequences but which ensures the ordering
as an fixed invariant: no operation can violate the ordering. Well,
the Ordered Bag is exactly such a data structure.
I get straight to the point: the join operator on Ordered Maps is
(<+) (pronounced "functional overwrite"), its
specification is the following:
I already wrote about the functions domain range inverse
image and the composition operators in the section on Associative
Collections. To understand the details of composition on Maps, let's
have a look at the relation between Finite Maps and (Program)
Functions. Any Map can be trivially translated to a function
(incidentally the query at does this when partially applied),
but not vice versa, because (program) functions don't carry their
domain with them. We have already seen that Map operations can
conveniently be specified by giving separately their effect on the
domain and on the "function". Indeed we can view a Map as a function,
accompanied with its domain. That's the incentive for the useful
definition:
The function-data duality gives also rise to several types of
functional composition operators.
(/./) and (./) are the chiefly useful
functions: they create new maps (which have the structure of the
second argument, only with some keys removed and values replaced).
The function (/.) is only a shorthand provided for
consistency. The inherited function apply is not very useful
on maps (quite arbitrary actually), but we can define a similar
function that preserves the structure of the domain (the name
map_apply is still tentative) and we can use it to
specfication- In fact, only the implementation of map_apply
will reasonably be overwritten by a concrete data structure. The
other implementations here are then asymptotically optimal, too.
We have the composition operators already mentioned above. (Note
that the symbols are different, that's intentional, I'm just trying
out, which ones are better.)
Usual data structure libraries have two seemingly unrelated
problems:
The first problem can be solved by implementing Sets on top of Maps
(with () as the domain range) loosing only a constant factor
in space (and perhaps time). Can we have abstraction in the code and
nevertheless by more efficient? The second problem does first also
seem like one of efficiency (since keys are stored twice), but it is
also a problem of usability: for example users have to give a key each
time they insert something, although the key is already implied by the
element.
Often the key is a field of a record, but since those fields are
accessed by functions, we can take any function on the elements. For
example, we can store persons by name no matter whether name
is a field in record Person or whether it is calculated by
concatenating first name and last name. (This corresponds to the
"Attributes are Queries" principle in the object-oriented (Eiffel)
method.) In many applications (expecially data-oriented ones) this
pattern occurs all over the place: students are accessed by
matriculation numbers, lectures are acccessed by title, symbol table
entries by numeric (or string) id, and in each of the cases, the key
is part of (or derivable from) the elements.
One way to make data structures comfortable is just to use Sets and
a weaker equality function that compares only keys, i.e. one
attribute, instead of the elements. However, this has a lot of
disadvantages:
What we need instead, is a data structure, let's call it ByMap
k a which has operations has_key at delete_key like a
map, but is an instance of Collection (ByMap k) a —
this way keys don't get into the way. This Collection would have the
same semantics as a table of a relational data base: a row (an
element) consists of key and non-key fields, but there is only at most
one element for each value of the key fields. Given the importance
and usefulness of relational data bases in the real life, I think that
such ByMaps have a great future! In functional as well as imperative
programming.
The rest of this section considers typing issues of ByMaps as in this module. The idea is to encode the "key
function" in the type of ByMaps so you don't confuse ByMaps of
different "kind". (Or what would be the result of joining a Map of
"students by name" with one of "students by matriculation number"?)
In an object- The interface for ByMaps is then:
Well, the functional solution needs type classes, and as you can
see from the code, it is not so simple. It stresses the type system
quite a lot, and I suspect an implementation is impossible without
using scoped type variables (a non-standard feature of GHC). Anyway,
GHC 5.00 doesn't compile my current draft, I'll have to try later...
ByMaps are revolutionary new and have to find their place. Perhaps
they basculate all the hierarchy and our view on practical data
structures. Perhaps they only find their place besides the other
structures. I espect something in between, that has to be seen...
Seeing all the possibilities that the abstract data structures
allow, you might think of the equally many possibilities of lazy
evaluation and ask how the two interact. Wouldn't it be too cool, if
all the presented Collections supported infinite structures, so that
we could express a lot of algorithms very nicely? However, we will
see that the combination of laziness any abstract data structures
offers more restrictions than new benefits. This sections explores
those restrictions and proposes a design that retains the benefits.
I begin with a very simple topic: it should be clear that all
ordered structures are strict in their elements, that is, for every
query result that is evaluated (In Dessy even the size
although that could be changed...), all the elements are also
evaluated. This concerns the elements of Sets and Bags, and the keys
of Maps. On the other hand, the elements of Sequences and the range
of Maps are lazy: One can do arbitrary operations on the structures
(if they don't look at the elements, of course!) without evaluating
the elements. In this respect Maps and Sequences can work like the
Haskell's lazy arrays: a strict domain and a lazy range.
It is evident, that many operations cannot be performed on infinite
structures at all. Apparent examples are the functions size
minimum maximum like any other application of fold that returns a
strict value which depends on all the elements of the structure.
Furthermore many operations cannot be performed efficiently on
infinite structures. Example are take and drop
which have to inspect their argument element-by-element, for fear that
looking too far ahead drags them into the black hole of infinite
running time. These observations already suffice to state the first
two consequences for the specification design:
The exclusion of the Ordered Collections from the benefiters of
infinity mighty first seem as an argument against the decision to make
all all Collections ordered. But even without the ordering functions
we would not get very far: for example, the simple and essential query
coll `has` element will either inspect the whole structure,
or it will need the structure to be ordered. Either of which is
impossible for infinite structures.
The following principle brings us nearer to the causes of our
restrictions.
Algebras with infinite elements have less laws. By consequence,
they offer less abstraction.
Explication of the consequence: if you have many, many laws, you
can take out some of them and declare them axioms, the rest becomes
conclusions. But you also can select other laws as axioms and get the
same conclusions. For example, sequences can be defined with the
basic construcors empty single (<+>), but also with
empty cons. If we want to specify operations that work also
with infinite structures, we must be aware where in the data structure
the infiniteness lurks.
Again, the functions take and drop are good
examples, together with their Collection versions front and
back. If we visualise a sequence in our minds, the
front operations can either be seen as taking the front or as
dropping the back. We can take imaginary scissors and cut the
sequence at the desired point. We can formalise this intuitive idea
in several ways, giving different specifications and implementations
for the functions front and back. However,
take and drop don't admit this flexibility: they
have to be aware that the tail of the Sequence can be infinite and
therefore they only have one formal way to write them (namely the one
of the Prelude). This absence of different viewpoints, of the
abstraction between specification and implementation, makes the two
functions rather operational: They are more defined by what
they do (taking, dropping) than be what they return (front or
back). It is no accident that the functions have verbs as names,
while their more abstract counterparts have nouns as names!
This strong dependency on a certain operational viewpoint (instead
of abstract diversity) suggests that we can capture all benefits of
lazyness in one single concrete data structure!
When writing the specifications for this we'll notice that all the
laws stated until now are only valid for finite structures. Indeed
the Stream specification says exactly which of the class members work
also on infinite Sequences. Furthermore, all functions that generate
infinite Sequences, directly have result type Stream, so that we don't
accidentally use an balanced implementation which can't evaluate those
expressions. On the other hand, Stream consuming functions are
defined on Sequences in general, they might be useful on other
sequence implementations, too. (However, I estimated that chance so
small that I put them in the Stream module nevertheless, suggesting
that one should usually prefer the general Sequence functions over the
(inefficient) stream consumers. For example front back over
take drop.)
The preceding section stated that all Collection implementations
but Streams cannot cope with infinity and can therefore not be used to
evaluate expressions like take 5 $ map (^2) [1..]. But that
doesn't mean, that all benefits of laziness are lost with those
implementations. To the contrary: the performance aspect of laziness
is retained by most implementations and this does really scale up!
For example we can define has_key map k = domain map `has` k
and evaluate it without creating the entire intermediate set.
(Attention! At present I think that this statement is wrong: my
implementations don't have this property and I doubt that such
implementations are possible. It's a pity.)
In the last section we have seen the abstract data constructors and
their axioms, and we listed the names of many observer functions
(whose trivial specifications we not mentioned). The interesting
higher- The Collection class contains no insertion function,
because insertion is different on different structures. Precisely we
have the following functions on Sequences:
And on two kinds of Ordered Collections:
In short, we can see that insert_* on Sequences
needs an explicit index argument like put on
OrderedMaps needs a key. On the other hand add on
OrderedSet and OrderedBag just "drops the element
in" and has it find its place. With an efficient implementation of
ordered join <+> or <+ respectively, these
default definitions are already asymptotically optimal. The same
holds for the operations on Sequence given efficient (++)
front back. By the way, on the level of
OrderedCollection we observe the relation put coll =
uncurry (add coll), which suggests to unify those two
functions to one defined in OrderedCollection. However, at
time being, I don't see an advantage of this, since the use of the
functions is albeit quite different.
It is interesting to note that the function remove is
already part of the class Collection — that's because
its semantics does uniformly apply to all structures: the function
removes all occurences of an element from the structure. (An
alternative name would be "erase".)
The reader might suspect that a function that removes only one
occurrence of an element from the structure would be equally useful,
but that function has a problem on sequences: the choice which
occurrence to remove would be a bit arbitrary. As a consequence,
class sequence has a separate function to remove an element by
position. In analogy to insert_* it takes an index as
argument. And a similar function works on maps:
The function add of CommutativeCollection has no
direct counterpart: on sets the inherited function remove
does already suffice, only on bags is a function subtract
useful:
The distinction between specification and implementation helps us
to explain the conflict between "equivalence" and "strutural equality"
that has lead other library designers to include a concept of
observability which divides the data structures into such with
observable elements and such with unobservable ones. In the proposed
standard, however, the elements of a data structure are being supposed
to be of an abstract data type, i.e. they also have separate
specification and implementation. (Well, spec and impl can be the
same, then we write deriving Eq in the data declaration and
the problem doesn't exist.) From this point of view the structural
equality is an internal affair of the implementation of each data type
and not seen by the users of the data type. Those can only use the
(==) function and that's what they consider as equality.
Each abstract data type with its definition of (==) must
observe the law "x == y
⇒ f x == f y" for each
of the functions f that have acces to the internal
representation of the data type and are part of its interface. As a
consequence the law will also hold for all other functions. Without
this law, we wouldn't be able to make any sensible use of the abstract
data type; with the law, we don't need to consider observability in
data structures. This might be surprising, but it's a hard fact. It
might seem hard to include such a strong premise in a standard, but
without the premise we would only get chaos — no real
abstraction possible.
Some data types might not have any sensible definition for
(==), but this is needed to include them in any kind of
collection: otherwise we couldn't even give the simplest specification
like single x == single x. It is clear that objects without
an equality operation (like especially functions) cannot be stored in
OrderedCollections at all, not even in unordered Sets if we had such.
For the future it might be the best solution to have classes
OperationalCollection, OperationalSequence, and
OperationalOrderedMap, which are superclasses of the existing
classes and differ only in having no equality operation.
By the way, the need of (==) for specifications is also
the reason that we need multi-parameter type classes.
The standard includes no unordered commutative collections
because of:
These arguments are not very strong, but the counter-side has even
weaker arguments. In fact, this shows that this point is not so
important at all.
Since the proposed Abstract Data Structures are algebraic by
their very nature, it doesn't make any sense to include imperative or
stateful (mutable) structures. These are formally completely
independent from the algebraic (immutable) structures and a standard
treating them can be separate as well. Only in the use of names
should those standards agree and this issue should be kept in mind.
(Functions that translate from mutable to immutable structures belong
to the mutable world, since they are aware of the state...)
Abstract Names
Semantic Name Principle
Concrete Names
verbal name ASCII symbol Unicode symbol
negate -
min max

elem notElem # /# ∈ ∉
subset strict_subset
<=# <# ⊂ ⊆
&& || ∧ ∨
and
are from
Carrol Morgan's Programming from Specifications. Hehner's
Practical Theory of Programming instead uses ∧ and
∨ since both Booleans and Numbers form lattices (partial
orders).
The algebraic basis
The diagram shows the class hierarchy proposed for the standard.
The design may seem unconventional at first, but it is completely
justified by simple semantic arguments. The classes
OrderedSet, OrderedMap, and Sequence
correspond exactly to the known notions of set, (finite) map, and
sequence. All the rest is simply a factoring out of common
properties.
General Collections
prop_join_associative x y z = (x <+> y) <+> z
== x <+> (y <+> z)
prop_empty_unit x = (x <+> empty) == x
&& x == (x <+> empty)
data Splitted coll a = Empty
| Single a
| (coll a) :+: (coll a)
split :: Collection coll a ⇒
coll a → Splitted coll a
prop_empty_is_Empty = split empty == Empty
prop_single_is_Single x = split (single x) == Single x
prop_split_inner_nonempty coll = case split coll of
(Empty :+: _) → False
(_ :+: Empty) → False
(_) → True
prop_dont_loose a b x = (a `has` x || b `has` x)
⇒ (a <+> b) `has` x
has :: Collection coll a ⇒
coll a → a → Bool
has coll x = case split coll of
(Empty) → False
(Single y) → x == y
(s :+: l) → has s x || has l x
-- empty, single x, l :+: r
fold :: b → (a → b) → (b → b → b) → coll a → b
fold1 :: (a → b) → (b → b → b) → coll a → b
reduce :: b → (b → b → b) → coll a → b
reduce1 :: (b → b → b) → coll a → b
apply f = fold empty (single . f) (<+>)
filter p = fold empty (\x -> if p x then single x else empty) (<+>)
foldr f e coll = fold id f (.) coll e
sum = reduce 0 (+)
maximum = reduce1 max
AssociativeCollection
domain coll = [ x | (x, _) ← coll ]
range coll = [ y | (_, y) ← coll ]
inverse coll = [ (y, x) | (x, y) ← coll ]
image coll set = [ y | (x, y) ← rel, x' ← set, x==x' ]
prop_image_apply map set = map `image` set == at map `apply` set
prop_apply_image map set = f `apply` set == fun_map f set `image` set
fun :: a → b
map :: OrderedMap map a b ⇒
map (a, b)
rel :: OrderedSet set (a, b) ⇒
set (a, b)
class Composable a b c x y z
where
(.) :: a x y → b z x → c z y
Sequence
prop_first_append a b = first (a ++ b)
== if is_empty a then first b else first a
prop_but_first_append a b = but_first (a ++ b)
== if is_empty a then but_first b
else but_first a ++ b
Ordered Collections
prop_split_ordered a = case split a of
(s :+: l) → and [ x ≤ y | x ← s, y ← l ]
_ → True
ascending, descending :: (OrderedCollection coll a, Sequence seq a) ⇒
coll a → seq a
minimum = first but_min = but_first min_view = front_view smallest = front
maximum = last but_max = but_last max_view = back_view largest = back
Excursion on the virtues of hedge union
(<+>) = foldr add
(\-/) = foldr substract
a `subset` b = all (b `has`) a
a == b = is_empty a && is_empty b
|| ( not (is_empty a) && not (is_empty b)
&& first a == first b && but_first a == but_first b
)
merge_with :: (Maybe a → Maybe a → coll' b) →
coll a → coll a → coll' b
merge_with :: (coll a → coll a → b) → -- dis
(a → b) → -- eq
(b → b → b) → -- (*)
coll a → coll a → b
Commutative Collections: Set and Bag
prop_join_commutative :: CommutativeCollection coll a ⇒
coll a → coll a → Bool
prop_join_commutative a b = (a <+> b) == (b <+> a)
prop_union_idempotent :: a → Bool
prop_union_idempotent x = single x ∪ single x == single x
-- where
(∪) :: OrderedSet set a ⇒
set a → set a → set a
(∪) = <+>
prop_bag_add a b x = (a \+/ b) `count` x
== a `count` x + b `count` x
prop_bag_sub a b x = (a \-/ b) `count` x
== (a `count` x) -. (b `count` x)
where i -. j = 0 `max` (i - j)
Functions as Data: the Finite Map
prop_fov_domain a b = domain (a<+b) == domain a ∪ domain b
prop_fov_at a b x = (a<+b) `at` x == if domain b `has` x then b `at` x
else a `at` x
fun_map :: (OrderedSet set a, OrderedMap map (a, b)) ⇒
(set a) → (a → b) → map (a, b)
fun_map dom f = fold empty (λx → (x, f x)) (<+) dom
(/./) :: map (a, b) → map (c, a) → map (c, b)
(./) :: (a → b) → map (c, a) → map (c, b)
(/.) :: map (a, b) → (c → a) → (c → b)
map /. f = (at map) . f
(.) :: (a → b) → (c → a) → (c → b)
apply :: ((a, b) → (c, d))
→ map (a, b) → map (c, d)
map_apply :: ((a, b) → c)
→ map (a, b) → map (a, c)
map_apply f = apply ( λ(a, b) → (a, f (a, b)) )
f ./ map = map_apply (f . snd) map
map /./ map' = (at map) ./ map'
Relations
(°) :: rel (a, b) → rel (c, a) → rel (c, b)
(.°) :: (a → b) → rel (c, a) → rel (c, b)
(°.) :: rel (a, b) → (c → a) → (c → set b)
ByMaps: Data Structures meet Software Engineering
class KEY_FUNCTION[K, A] feature
key ( element : A ) : K
end
class BY_MAP[ K, A, FUN <- KEY_FUNCTION[K, A] ]
inherit COLLECTION[A]
feature
has_key ( key : K ) : BOOLEAN
at ( key : K ) : A
...
join( other : BY_MAP[K, A, FUN] ) : BY_MAP[K, A, FUN]
...
end
Summary
Collection empty single (<+>)
Associativity of (<+>), Unitness of empty
- Sequence (++)
Explicit Ordering: no loss of information
- Ordered Collection
Implicit Ordering
+ Ordered Map <+
+ Commutative Collection
a <+> b == b <+> a
- Ordered Sets
∪, ∩,\\
Idempotency of set union: a ∪ a = a
- Ordered Bags
\+/, \-/
Laziness and infinite structures
Strictness in the elements
Infinity
Laziness Abstraction Principle
data Stream a = Nil | Cons a (Stream a)
instance (Eq a) ⇒ Collection Stream a where ...
instance (Eq a) ⇒ Sequence Stream a where ...
Performace
Modifiying Collections
Inserting
insert_element :: Sequence seq a ⇒
seq a → Int → a → seq a
insert_segment :: Sequence seq a ⇒
seq a → Int → seq a → seq a
prop_insert_element_pre s i _ = has_gap s i
insert_element s i x
= front s (i - 1) ++ single x ++ back s n
where n = size s - (i - 1)
prop_insert_element_post s i x = (insert_element s i x) !! i == x
prop_insert_segment_pre s i _ = has_gap s i
insert_segment s i s'
= front s (i - 1) ++ s' ++ back s n
where n = size s - (i - 1)
add :: CommutativeCollection coll a ⇒
coll a → a → coll a
add coll a = coll <+> single a
put :: OrderedMap map a b ⇒
map (a, b) → a → b → map (a, b)
put map x y = map <+ single (x, y)
Removing
remove :: Collection coll a ⇒ coll a → a → coll a
remove coll x = filter (/=x) coll
delete :: Sequence seq a ⇒
seq a → Int → seq a
prop_delete_pre s i = has_gap s i
delete s i = front s (i - 1) ++ back s n
where n = size s - i
delete_key :: OrderedMap map a ⇒
map a → Int → map a
delete_key map k = filter ((/=k).snd) map
subtract :: OrderedBag bag a ⇒
bag a → Int → bag a
subtract bag x = bag \-/ single x
Three things that are not here