February 2004

Design Principles of a Hierarchy of Functional Data Structures

The design combines the two principles of

  1. Direct Algebraic Specification, as done for the bunches and integers in A Pratical Theory of Programming by Eric Hehner (2nd ed. 2004). (To embed this in Haskell I use Ralf Hinze's Abstract Leaf Trees: A fresh look at binary search trees, JFP 2001.)
  2. Inheritance to share common specifications, as done in my work. (A bit is explained in http://www.stud.tu-ilmenau.de/~robertw/eiffel/cov/a.txt. See also the contract refinement in "Object-Oriented Software Construction", 1st edition, Bertrand Meyer, 1988.)

My original inspiration were the structures of the Z and B specification languages (long live Abrial). But those are based on classical mathematics, while the structures described here (as well as Hehner's) are meant to be self-contained: the source itself contains the minimal theory needed to reason about the structures.

The idea is that:

  1. A high level of abstraction can only be reached when separating specifications and implementations.
  2. Specifications can be written formally in the programming language itself, either with algebraic laws (QuickCheck properties), or by giving an alternative implementation (which is probably less efficient and hopefully much more readable).
  3. Many data structures have similar specification, either in form of common functions with the same specification, or in form of common axioms for some functions (while other axioms for the same functions are different).

Having separate specifications and implementations (whose consistency is automatically checked) includes positive redundancy in the code: there are several views for clients and implementors which make the code easier to understand, and (due to the automatic checking) make the life of bugs very, very hard.

Inheritance, on the other hand, reduces negative redundancy: any kind of repetition is factored out to a higher level, this makes the code shorter for better maintenance and it makes the commonalities more explicit which helps understanding and abstraction.

Three Purposes

Data Structures serve three different purposes in programs which correspond to the three basic activities of programmers:

The first duty of a software developer is to describe what his program his going to to. This must happen in terms of the software's application domain, but as the software process proceeds we turn to describe the software's behaviour in terms of its data structures. This description is the specification of the software, and special languages have been developed to write such specifications. Having abstract data structures which can be regarded from many points of view is a distinguishing feature of such specification languages. For example, in the Z and B languages Sequences are Functions from 0..(N-1) to the elements and Functions in turn are right-unique Relations. So we can, for example, use relational composition to specify with one simple expression the concept of sequence permutations. Using the right abstract data structures a programming language can be used to write such specifications and many of them will even be (efficiently) executable.

Next, any program we develop should best split in parts and these parts have to communicate somehow. Only the smallest functions in a program will work with single items of data, most functions treat a lot of data in one go, and this data must go in and out the function somehow. The format in which this data is to be transferred should be independent from both the consumer and the producer, and that's where we need abstract data structures. Then data is produced and consumed in an abstract way and the concrete data structure can be chosen separately for each producer-consumer-pair (instead of being dictated by either one) — that's separation of concerns, just what a good software design needs. Abstract data structures also separate the concerns of efficiency (choice of the data structure) from functionality (use of the data structure, see "specification").

Algorithms are the last point in my list, although many libraries of data structures are just built for that purpose. (Also research on data structures concentrates here.) Many formal specifications are not efficiently executable and then we must derive an algorithm to perform the specification. This derivation (and only this part of the programming process) is a purely formal activity, whose correctness can be demonstrated with a (formal) proof. (At least a sketch of that proof should be part of the algorithm's (internal) documentation — the maintainer will need it.) Contrary to data structures for specification and communication which have to provide a lot of operations none of which all too inefficient (to permit several points of view), data structures made for algorithm development only have to provide very view operations (just those needed by the algorithm) and these should be as efficient as possible. The average programmer will implement such algorithms very rarely and that's why this standard is biased towards the purposes of Specification and Communication. However, "algorithmic data structures" use the same mathematical semantics for their operations, so both algorithmic and general-purpose implementations can implement the same standard interfaces.

The purpose of abstract data structures is first to make programs clearer and only second to make them faster.

Although this short presentation suggests that the activities of Description, Design, and Derivation are performed in a sequential fashion during software development, this is not so: the activities are intertwined and most often they can really not be separated from each other. Each design has to be described, each program is a description, too, and specifications and algorithms have to be designed to make them really clear and useful. The same holds for the use of abstract data structures: a function specified very abstractly can be an efficient implemtation if we choose the right concrete data structure.

The three first criteria for the library standard are: Simplicity, Simplicity, and Simplicity. Only a simple standard can really be used by all programmers (including beginners and experts in other fields for whom programming is only a tool). Only a simple standard allows extensions without drowning in complexity. And only a simple standard can integrate all aspects of the problem without artifically excluding some. A good standard captures the essentials of the field, no more, no less. (I wrote a separate note that discusses the interaction of simplicity and efficiency.) The simplicity of this standard is more profound than other notions of simplicity since all the semantics of the operations is given explicitly, formally and completely. There's nothing left implicit, so there's no hidden complexity.

Abstract Names

The many data structures of this standard have many similar operations and it has to be decided how they should be named. Inheritance is one possibiltiy to share names between different structures, if the corresponding operations have (the same signature and) common semantic laws. Furthermore we have the possibility to use similar names for similar things (names that follow a pattern) or to use Module namespaces to have one and the same name denote formally independent, but informally similar things. In order to narrow the design space, this standard proposes the semantic name principle:

Semantic Name Principle

Every systematic sharing of names should have a formal semantic basis, which is made explicit in the standard. By consequence, every sharing of names can happen by inheritance.

In fact this means that this standard rejects the use of Module name spaces and large-scale naming patterns to share names. Each group of similar functions has either common semantic laws, then they share a name via inheritance; or the similarities are rather weak, then they get only similar names. This case occurs so rarely that we need no naming patterns. Operations have either different names and semantics (like add insert put or remove subtract delete_key) or they already have different signatures (like insert_element insert_segment).

The Semantic Name Principle is actually about much more than names: by allowing us to concentrate all the specifications (and implementations) of one concept in one place. It is just this principle which avoids us a lot of negative redundancy. If you have already seen one of the many incomplete libraries of data structures out there, you'll probably have noticed that it contains huge piles of duplicated code. Well, you can imagine how much code a complete library would have. The Semantic Name Principle saves us from this code flood. After all, aren't functional programming languages good at eliminating bad redundancy?

The Order of Arguments was chosen so that (1) infix uses are as meaningful as possible (that is meaningful infix uses should be possible ;-), and (2) in other cases let the collection argument(s) be the first. The first rule extends to such cases as f <: xs ++ ys >: b where you can put the parenthesis' as you like, the meaning will always stay the same — and the constituents of the resulting sequence will be in there in just the order they are written. The second rule was mainly choses to have a rule at all, so that users don't have to learn the order of each separate function's arguments. However, having the collection first is the most simple choice and it ressembles a bit the object-oriented style.

Concrete Names

If we take into account all the names that have been used in other libraries and all the different semantics they have, we'll get quite a mess. Therefore I have taken a step back and taken the simplest names possible. The most striking examples of this renaming is the new name apply for the function formerly known as map because that crashes with the name of a data structure (and the name fits better for the data structure), and the abandoning of the "list terminology" which utterly inconsistent with abstract data structures. I always confused CAR and CDR anyway.

Binary functions, especially if they have algebraic properties like associativity, commutativity, should be given operator names. However, names that have no tradition are not very readable. And operators that are composed of ASCII-Characters are not very beautiful. That's why some functions have two names. The following table also contains proposed Unicode symbols, first because I hope Unicode-supporting Editors will appear soon, and second because they are already useful in documentation generated as HTML or PDF.

verbal nameASCII symbolUnicode symbol
negate- 
min max   
elem notElem# /#∈ ∉
subset strict_subset <=# <#⊂ ⊆
&& ||∧ ∨

Notes:

  1. The symbols and are from Carrol Morgan's Programming from Specifications. Hehner's Practical Theory of Programming instead uses and since both Booleans and Numbers form lattices (partial orders).
  2. Besides elem which has type a → coll a → Bool, we have also the flipped version has (of type a → coll a → Bool). This is for consistency, since all other queries have the Collection as a first argument. I think for such an important query we can permit a synonym. (Otherwise we'll drop it.)
  3. The fact that the ASCII symbol <=# is longer than <# suggests to use the formal names subset_equal and subset instead. I prefer the above names nevertheless, assuming that is the more fundamental operation and consequently should have the shorter name.
  4. If we add the operations and , then only as Unicode, otherwise it pollutes the namespace to much.
  5. Alternative symbols for # are
    1. : which also means in some other languages (Z, B, Hehner's notation), but which is rather indisputable given :'s position in Haskell'98.
    2. ? which is very unconventional, but also very unused until now, and not too unintuitive either, since membership and subset are some kind of canonical questions. (Besides it has some charm to prepend all truth-value returning operators with a question mark, however this conflicts with all the relational operators (of classes Eq and Ord) that have already good symbols without question marks.)

Finally, we have to bear in mind, that many of the symbol-named operators can be continued on Collections (see the discussion of the Prelude functions) and we have to reserve verbal names for this purpose. (In Unicode we can however take "big versions" of the operators to denote their continuations.)

The algebraic basis

The immutable data structures of a purely functional languages are like mathematical values, for example the integers or mathematical sets. The behaviour of such classes can be completely specified via simple laws like in algebra. That's why we call them algebraic data structures. The algebraic laws make the data structure specifications autonomous: we can understand them without the need to understand anything else. Furthermore the laws can be used directly in programming. Finally sometimes the laws for a function are a canonical implementation for that function. In such cases we will state the laws as a function definition: this is executable code, but at the same time it is a specification function: the function may actually be implemented in any other way, but it must behave in exactly the same way as the specificative definition.

The diagram shows the class hierarchy proposed for the standard. The design may seem unconventional at first, but it is completely justified by simple semantic arguments. The classes OrderedSet, OrderedMap, and Sequence correspond exactly to the known notions of set, (finite) map, and sequence. All the rest is simply a factoring out of common properties.


General Collections

The specification of the data structures is centered around the three (abstract) constructors empty, single, and <+> (pronounced "join"). These form an algebra which is a monoid for all collections. In formulae we have:

prop_join_associative x y z = (x <+> y) <+> z
                            == x <+> (y <+> z)

prop_empty_unit x = (x <+> empty) == x
                  && x == (x <+> empty)

Using the three constructors with these laws and one single "deconstructor" split we can already specify a lot of other functions completely, i.e. by giving an abstract implementation of the functions. split observes the following laws to which subclasses of Collection will add more:

data Splitted coll a = Empty
		     | Single a
		     | (coll a) :+: (coll a)

split :: Collection coll a ⇒
         coll a → Splitted coll a

prop_empty_is_Empty  =  split empty == Empty
		 
prop_single_is_Single x  =  split (single x) == Single x


prop_split_inner_nonempty coll = case split coll of
                                 (Empty :+: _) → False
		                 (_ :+: Empty) → False
		                 (_)           → True

prop_dont_loose a b x = (a `has` x || b `has` x)
		      ⇒ (a <+> b) `has` x

The ⇒ in the last expression denotes logical implication. The expression states that an element occurring in one of <+>'s arguments, must also occur in the result. This property does indeed constrain split because has is defined in the following way:

has :: Collection coll a ⇒
       coll a → a → Bool
has coll x = case split coll of
             (Empty)    → False
	     (Single y) → x == y
	     (s :+: l)  → has s x || has l x

In usual algebraic specifications, every expression is assumed to denote a different value (as if all the functions were data constructors), unless there is an explicit axiom (like `join_associative` above), that states equality. However, for our Collection that doesn't work, since the subclasses will introduce more axioms.

That's why I give the axiom `dont_loose`: it gives a lower bound on equality, stating that (a <+> b) cannot be equal to a Collection that doesn't have `x`. Don't ask me, how that fits into the theory of abstract data types. I think it's practical here.

The functions that are defined using split are the following: is_empty size count has / first last but_first but_last front back has_index (!!) index_of / remove / fold reduce apply filter fold1 reduce1 zip_with zip. All those functions are completely defined with laws that hold in all subclasses of Collection. (For the definitions see the standard.) The variation of their semantics depends entirely on the axioms for empty single <+> split that the subclasses will add.

Note that the function (!!) has a meaning distinct from the Haskell Prelude, such that first coll = "1st element of coll" = coll !! 1. While in Computer Science it is often practical to count from zero, for applications it is more consistent to count from one.

It is remarkable that the functions hence defined are not only quite readable (not worse than algebraic specifications ;-)), but there are also efficient. That's because the result of split is underspecified just at the right point: each data structure implementation can return two parts of the structure as it pleases. This has the consequence that many operations are structure-preserving even in the abstract default implementation! For example, when using apply the resulting structure has the same layout as the argument, so no rebalancing is needed. In fact, a data structure that only defines split and unsplit is already asymptotical optimal for all other operations in Collection!

In this respect, our split technique is comparable to iterators in imperative data structures. Of course, functional data structures cannot have imperative iterators, since those are able to mutate the structure. But all the other advantages of iterators are retained by the split approach. A splitted structure represents the "position" just between the two parts returned by split. This is indeed the nearest we can get to an interator and seen in the previous paragraph we have no need to get nearer.

One consequence of our three-fold algebra are the new fold and reduce functions which take at their interface one function for each of the abstract constructors and which traverse in their implementation each data structure according to its concrete constructors.

--       empty, single x,    l :+: r          
fold    :: b → (a → b) → (b → b → b) → coll a → b
fold1   ::      (a → b) → (b → b → b) → coll a → b
reduce  :: b             → (b → b → b) → coll a → b
reduce1 ::                  (b → b → b) → coll a → b

Example uses of the functions in the standard:

apply f  = fold empty (single . f) (<+>)
filter p = fold empty (\x -> if p x then single x else empty) (<+>)

foldr f e coll = fold id f (.) coll e

sum     = reduce 0 (+)
maximum = reduce1 max

As you can see even the traditional folding functions can be made adapting their traversal direction.

AssociativeCollection

The specification languages Z and B, which have largely influenced the design of Relations in this proposal, view Maps (functions) as a subset of Relations (Sets of Tupels). This is very natural in classical mathematics, but not in the data structure business. (Hehner's practical theory doesn't do so, either.) Here we have separate Maps and Relations and the class AssociativeCollection only serves to factor out for of their common operations — domain range image inverse — which can all be specified using set comprehension: (Until that is available in Haskell, we use filter and apply in the real code.)

domain coll = [ x | (x, _) ← coll ]
range  coll = [ y | (_, y) ← coll ]
inverse coll = [ (y, x) | (x, y) ← coll ]

image coll set = [ y | (x, y) ← rel, x' ← set, x==x' ]

These functions could as well have been given outside of a type class, but I envision that some implementations of Maps and Relations can implement them faster than the obvious linear time. For instance, it is part of the "democracy" principle that neither side of a relation dominates the other. Consequently there will be implementations that do inverse in constant time, just as democratic sequences do with reverse.

Also note a correspondance of the function image with the well-known apply when we have functions as data (that is, Maps).

prop_image_apply map set  =  map `image` set == at map `apply` set
prop_apply_image map set  =  f `apply` set == fun_map f set `image` set

Relational (and Functional) Composition is one of the most expressive operations in the Z and B languages: there is only one operator defined for Relations, and since Functions are a special case, it will automatically adapt to functions. (Composition is closed on functions, i.e. a function composed with a function always gives a functions.) In our programming language, however, we have to cope with several different types: Relations and Maps which are not subtypes of one-another, and the program-functions. This gives the following nine possibilities of how we can compose things:

Assumed the types:

fun :: a → b
map :: OrderedMap map a b ⇒
       map (a, b)
rel :: OrderedSet set (a, b) ⇒
       set (a, b)

And set being the ordinary conversion function.

The operators /./ /. on Maps and their correspondents on Relations /°/ /° do actually have the same type and could already be defined in AssociativeCollection. However, since the type of ./ and °/ is so fundamentally different, I preferred to separate all symbols into a consistent set. By the way, it is also the operation °/ with its result of type a → set b which makes it impossible to define just one overloaded operator:

class Composable a b c x y z
    where
    (.) :: a x y → b z x → c z y 

Futhermore one could criticise that Maps have to be converted to compose them with Relations (we don't have an extra symbol for that case), while the symbols /./ and /. could be replaced in the same way by writing at map . fun and at map ./ map, respectively. However, I think that the proposed set with six new operators and the existing . is the simplest one we can offer.

Perhaps confusing: A Function is defined via f = λ x → y (from left to right) and the type a → b has the same "direction", but the Function is applied via y = f x (right to left). Consequently, functional composition "threads" values from right to left: (f . g) x = f (g x) which gives the types as seen above. For Functions this easy to keep in mind, but for Maps and Relations we have the additional possibility to view them as Collections of Pairs and in all the methods of Collections the Pairs have the form (x, y): again from left to right. From that point of view it would certainly be more intuitive to have composition types like (a → b) → (b → c) → (a → c), but unfortunately this would break with too many conventions. (TODO: be more precise here). So one has to keep in mind, that Functions, Maps, and Relations are all applied and composed "against the sense of their type".

Sequence

The axioms specified in Collection are already all we know about Sequences: every two Sequences that cannot be proven equal with the Monoid axioms are different. A way to say that formally is by specifying (++) (Sequence's synonym for <+>):

prop_first_append a b = first (a ++ b) 
                        == if is_empty a then first b else first a

prop_but_first_append a b = but_first (a ++ b) 
                            == if is_empty a then but_first b 
	  		       else but_first a ++ b

I have used the functions first but_first which are already specified (like any other query) using split. Incidentally most of the famous "list" operations are already specified in Collection, so Sequence doesn't add many more. Three of them are add_first, add_last and reverse. (In the meanwhile the first two have already been renamed to <: and >:, which are much more intuitive together with the View data constructors :< and :>.)

Obsolete rationale for the names add_first add_last: in the next section I'll reveal that CummutativeCollections (i.e. Sets and Bags) have an operation add and Sequence have operations insert_element and insert_segment to insert a single and a sequence respectively at any position of a sequence. This would actually suggest to call the operations insert_first and insert_last for consistency, but I think these names are just one syllable too long. And we can think of a good explanation why the functions are called add_*: instead of inserting the new elements into the sequence, they are added on the outside. Well, if that doesn't convince you, change the name.

In earlier versions of the specification the functions were called cons and snoc which I think are very funny names. But on 20 february 2004, midnight I realised that the semantic link between cons front and snoc back is just too weak: consistency literally cries for names that contain the words "first" and "last". (Not "front" and "back" since those would suggest adding sequences, but that is already done by (++).) Well, giving up cons and snoc is a bit sad, but I am deeply convinced by Eric Hehner's statement ("Unified Algebra") that we should name things as simple as possible and not with historic terms.

As explained in my note on Democratic Sequences, the Sequences of this standard are not biased towards one or the other end, neither to their middle or any other point. There exist implementations that perform any point-wise Sequence operation (including (++) front back) in logarithmic time. (In Dessy such an implementation is the default.)

Furthermore there are implementations of Sequence that implement the following functions in constant (amortised) time: empty single add_first add_last / first last but_first but_last. The Sequence class therefore makes Stacks, Queues and Deques superfluous (and also the classical "lists"): instead of writing separate specifications and implementations for them, we can use the Sequence specification together with the specialised implementations. This makes life easier for library users, especially beginners: they don't have to learn the Stack, Queue and Deque specifications and understand the formal relation to Sequence. (Experienced programmers will only appreciate that after some habituation, since they already know all this specifications, and have to appreciate that they can actually forget them without harm.) (In Dessy these specialised implementations are deque sdeque queue.)

Ordered Collections

Unlike the orderings of Sequences which depends on the order (or place) of insertion, Ordered Collections have an implicit ordering of elements which is determined by a relation on the elements. Consequently all collection have (either an implicit or an explicit) ordering. The advantages of that fact are explained in an earlier document and the earlier specification.

The implicit ordering is formally specified like this:

prop_split_ordered a = case split a of
                      (s :+: l) → and [ x ≤ y | x ← s, y ← l ]
                      _         → True

The implicit ordering can be made explicit by the following conversion functions with the obvious semantics:

ascending, descending :: (OrderedCollection coll a, Sequence seq a) ⇒
                         coll a → seq a

It is good explicit style to use always one of these functions when converting from an Ordered Collection to Sequence. Indeed I am not yet in how many places the function ascending should be called implicitly. This question arises always when we generalise a function of Sequence to the Ordered Collections. But we have already made a commitment to the implicit style when we defined the semantics for fold (see the earlier design documents). For example we can define first = reduce1 const and last = reduce1 (flip const) which automatically gives on Ordered Collection first = min and last = max and not vice versa.

The different subclasses realise that ordering in different ways, with commutative and non-commutative join operators. Thanks to this law some queries get a stronger meaning and we commemorate that by the introduction of the following synonyms:

minimum = first   but_min = but_first   min_view = front_view   smallest = front
maximum = last    but_max = but_last    max_view = back_view    largest  = back

And the introduction of the functions smaller larger smaller_equal larger_equal equal.

Actually, minimum and maximum are not only synonyms, since are already defined in class Collection, thus they function also on Sequences where they are different from first and last. By the way, the operations minimum but_min and maximum but_max are the essence of priority queues and the three subclasses of OrderedCollection describe indeed three different semantics for priority queues. PQ implementations can thus be used with this standard in the same way as queues etc. with Sequences.

Excursion on the virtues of hedge union

Operations that "join" two balanced data structures often take O(N*log N) time (with N = size a `max` size b). For example

(<+>) = foldr add
(\-/) = foldr substract

a `subset` b = all (b `has`) a

a == b = is_empty a && is_empty b
       || ( not (is_empty a) && not (is_empty b)
          && first a == first b && but_first a == but_first b
	  )

But if we "align" two ordered structures with each other, we would just have to do an "element-wise" operation — we could do it in linear time. In order to do that, defined the following function:

merge_with :: (Maybe a → Maybe a → coll' b) →
	      coll a → coll a → coll' b

It's specification is that merge_with f a b will call f (Just x) (Just x) for each x that is in both collections, f (Just x) Nothing for each x that is only in a (and vice versa for b) and then it joins the resulting collections. (f Nothing Nothing is guaranteed never to be called.) With such an operation in linear time, we can easily implement (\/) (/\) (\\) (\+/) (\-/) all in linear time! Nevertheless this first approach had a lot of disadvantages:

Actually the last two problems could be solved by given the function an additional parameter (the join operator to use) and using f Nothing Nothing as the unit of that join operator (i.e. empty). In fact, the following function does exactly that:

merge_with  :: (coll a → coll a → b) →  -- dis
               (a → b) →                 -- eq 
	       (b → b → b) →            -- (*)
               coll a → coll a → b

The specification of merge_with dis eq (*) is that eq gets fed with all elements that are in both Collections, dis gets pairs of disjoint Subcollections, and finally (*) is used to join the results. This version has still the problem that I can't think of a sensible formal specification that is simpler than the implementation, but at least the implementation is very simple! Surprisingly we can even give a default implementation that works already optimal for many concrete data structures (since it is based on the abstract functions smaller larger_equal).

It is interesting that the default implementation of merge_with is in fact just the "hedge union" algorithm on ordered trees! Every tree data structure can profit from that algorithm and the benefits are really qualitative, since this algorithm can be used to implement a lot of other functions. For example, if Set union is fast, we can just write add set x = set \/ single x and get the optimal O(log N) running time! (Since the exact time bound of merge_with is O((N * log M) `min` (M * log N)), or something like that.) This is a real qualitative win for the user of a data structure, since he doesn't have to choose between two operations: to add several elements to a set should one use several adds or rather prepare a set of the new elements and \/? For which number of "several" elements should one switch from add to \/? Well, with an efficient adaptive algorithm it is all the same!

Ordered Collections and ByMaps: Maps are actually not ordered by the elements (in the Collection sense), but ordered by the keys. But since the elements are pairs and pairs are ordered by the first component first (this is the key for Maps) and since Maps have only one element per key, ordering by keys implies ordering by elements. The element with the minimum key is at the same time the minimum element. However, in some respects we must treat Map's ordering different from that of Sets and Bags. For example on Maps maximum a < minimum b does not imply that the Maps are disjoint (joinable in constant-time), since a and b could have the same key and only a different image. This difference means that a lot of functions on the ordering are different for Sets/Bags and Maps, instead of in the class OrderedCollection many function have to live in CommutativeCollection or OrderedMap, respectively. But actually the functions are so similar (redundant!) that common specification should be possible. I'm thinking of a class OrderedByCollection... This will be redesigned together with the ByMaps (below).

Commutative Collections: Set and Bag

Some of the collections form a commutative monoid, that is they observe the law

prop_join_commutative :: CommutativeCollection coll a ⇒
                         coll a → coll a → Bool
prop_join_commutative a b  =  (a <+> b) == (b <+> a)

This law is observed by both OrderedBag and OrderedSet. Sets have the additional law

prop_union_idempotent :: a → Bool
prop_union_idempotent x  =  single x ∪ single x == single x

-- where
(∪) :: OrderedSet set a ⇒
           set a → set a → set a
(∪) = <+>

The law union_idempotent completes a completely algebraic specification of set union that doesn't need to refer to another formal system (as an element relation and logical operators). We can do the same for bags:

prop_bag_add a b x  =  (a \+/ b) `count` x 
                    == a `count` x  +  b `count` x

prop_bag_sub a b x  =  (a \-/ b) `count` x 
                    == (a `count` x) -. (b `count` x)
    where i -. j = 0 `max` (i - j)

The usefulness of sets is undebatable, they occur in many programs. Bags are not as ubiquitous, maybe their most popular employ is in the specification of sort bag xs == bag (sort xs). There is another point to consider: Sometimes we provide functions on Sequences that preserve an (implicit) ordering of the sequence. Foremost example is the insert from the standard module List, which preserves the invariant: ordered xs ==> ordered (insert x xs). Maybe we would like to have a data structure like Sequences but which ensures the ordering as an fixed invariant: no operation can violate the ordering. Well, the Ordered Bag is exactly such a data structure.

Functions as Data: the Finite Map

I get straight to the point: the join operator on Ordered Maps is (<+) (pronounced "functional overwrite"), its specification is the following:

prop_fov_domain a b = domain (a<+b) == domain a ∪ domain b

prop_fov_at a b x = (a<+b) `at` x == if domain b `has` x then b `at` x
                                                         else a `at` x

I already wrote about the functions domain range inverse image and the composition operators in the section on Associative Collections. To understand the details of composition on Maps, let's have a look at the relation between Finite Maps and (Program) Functions. Any Map can be trivially translated to a function (incidentally the query at does this when partially applied), but not vice versa, because (program) functions don't carry their domain with them. We have already seen that Map operations can conveniently be specified by giving separately their effect on the domain and on the "function". Indeed we can view a Map as a function, accompanied with its domain. That's the incentive for the useful definition:

fun_map :: (OrderedSet set a, OrderedMap map (a, b)) ⇒
           (set a) → (a → b) → map (a, b)
fun_map dom f = fold empty (λx → (x, f x)) (<+) dom 	   

The function-data duality gives also rise to several types of functional composition operators.

(/./) :: map (a, b) → map (c, a) → map (c, b)
(./)  :: (a → b)   → map (c, a) → map (c, b)

(/.)  :: map (a, b) → (c → a)   → (c → b)
map /. f = (at map) . f

(.)   :: (a → b)   → (c → a)   → (c → b)

apply :: ((a, b) → (c, d)) 
         → map (a, b) → map (c, d)

(/./) and (./) are the chiefly useful functions: they create new maps (which have the structure of the second argument, only with some keys removed and values replaced). The function (/.) is only a shorthand provided for consistency. The inherited function apply is not very useful on maps (quite arbitrary actually), but we can define a similar function that preserves the structure of the domain (the name map_apply is still tentative) and we can use it to specfication-implement the other functions.

map_apply :: ((a, b) → c) 
             → map (a, b) → map (a, c)
map_apply f = apply ( λ(a, b) → (a, f (a, b)) )

f    ./ map = map_apply (f . snd)  map

map /./ map' = (at map) ./ map'

In fact, only the implementation of map_apply will reasonably be overwritten by a concrete data structure. The other implementations here are then asymptotically optimal, too.

Relations

We have the composition operators already mentioned above. (Note that the symbols are different, that's intentional, I'm just trying out, which ones are better.)

(°)  :: rel (a, b) → rel (c, a) → rel (c, b)
(.°) :: (a → b)   → rel (c, a) → rel (c, b)
(°.) :: rel (a, b) → (c → a)   → (c → set b)

ByMaps: Data Structures meet Software Engineering

Usual data structure libraries have two seemingly unrelated problems:

  1. The implementations of Sets and Maps have high redundancy.
  2. When using Maps the key is often already "somewhere in the elements" and storing it separately in the Map, again, is redundant.

The first problem can be solved by implementing Sets on top of Maps (with () as the domain range) loosing only a constant factor in space (and perhaps time). Can we have abstraction in the code and nevertheless by more efficient? The second problem does first also seem like one of efficiency (since keys are stored twice), but it is also a problem of usability: for example users have to give a key each time they insert something, although the key is already implied by the element.

Often the key is a field of a record, but since those fields are accessed by functions, we can take any function on the elements. For example, we can store persons by name no matter whether name is a field in record Person or whether it is calculated by concatenating first name and last name. (This corresponds to the "Attributes are Queries" principle in the object-oriented (Eiffel) method.) In many applications (expecially data-oriented ones) this pattern occurs all over the place: students are accessed by matriculation numbers, lectures are acccessed by title, symbol table entries by numeric (or string) id, and in each of the cases, the key is part of (or derivable from) the elements.

One way to make data structures comfortable is just to use Sets and a weaker equality function that compares only keys, i.e. one attribute, instead of the elements. However, this has a lot of disadvantages:

What we need instead, is a data structure, let's call it ByMap k a which has operations has_key at delete_key like a map, but is an instance of Collection (ByMap k) a — this way keys don't get into the way. This Collection would have the same semantics as a table of a relational data base: a row (an element) consists of key and non-key fields, but there is only at most one element for each value of the key fields. Given the importance and usefulness of relational data bases in the real life, I think that such ByMaps have a great future! In functional as well as imperative programming.

The rest of this section considers typing issues of ByMaps as in this module. The idea is to encode the "key function" in the type of ByMaps so you don't confuse ByMaps of different "kind". (Or what would be the result of joining a Map of "students by name" with one of "students by matriculation number"?) In an object-oriented language we could achive this will a little interface class:

class KEY_FUNCTION[K, A] feature
   key ( element : A ) : K 
end

The interface for ByMaps is then:

class BY_MAP[ K, A, FUN <- KEY_FUNCTION[K, A] ] 
inherit COLLECTION[A]
feature
   has_key ( key : K ) : BOOLEAN
   at ( key : K ) : A
   ...
   join( other : BY_MAP[K, A, FUN] ) : BY_MAP[K, A, FUN]
   ...
end

Well, the functional solution needs type classes, and as you can see from the code, it is not so simple. It stresses the type system quite a lot, and I suspect an implementation is impossible without using scoped type variables (a non-standard feature of GHC). Anyway, GHC 5.00 doesn't compile my current draft, I'll have to try later...

Summary

Collectionempty single (<+>) Associativity of (<+>), Unitness of empty
- Sequence(++) Explicit Ordering: no loss of information
- Ordered Collection  Implicit Ordering
   + Ordered Map<+ 
   + Commutative Collection  a <+> b == b <+> a
     - Ordered Sets , ,\\ Idempotency of set union: a ∪ a = a
     - Ordered Bags \+/, \-/ 

ByMaps are revolutionary new and have to find their place. Perhaps they basculate all the hierarchy and our view on practical data structures. Perhaps they only find their place besides the other structures. I espect something in between, that has to be seen...

Laziness and infinite structures

Seeing all the possibilities that the abstract data structures allow, you might think of the equally many possibilities of lazy evaluation and ask how the two interact. Wouldn't it be too cool, if all the presented Collections supported infinite structures, so that we could express a lot of algorithms very nicely? However, we will see that the combination of laziness any abstract data structures offers more restrictions than new benefits. This sections explores those restrictions and proposes a design that retains the benefits.

Strictness in the elements

I begin with a very simple topic: it should be clear that all ordered structures are strict in their elements, that is, for every query result that is evaluated (In Dessy even the size although that could be changed...), all the elements are also evaluated. This concerns the elements of Sets and Bags, and the keys of Maps. On the other hand, the elements of Sequences and the range of Maps are lazy: One can do arbitrary operations on the structures (if they don't look at the elements, of course!) without evaluating the elements. In this respect Maps and Sequences can work like the Haskell's lazy arrays: a strict domain and a lazy range.

Infinity

It is evident, that many operations cannot be performed on infinite structures at all. Apparent examples are the functions size minimum maximum like any other application of fold that returns a strict value which depends on all the elements of the structure. Furthermore many operations cannot be performed efficiently on infinite structures. Example are take and drop which have to inspect their argument element-by-element, for fear that looking too far ahead drags them into the black hole of infinite running time. These observations already suffice to state the first two consequences for the specification design:

  1. Since minimum and maximum are unavailable on infinite structures, all our Ordered Collections will always be finite. Consequently, only Sequences are eligible for infinity.
  2. Since some kind of balancing is needed to construct efficient data structures, this rules out a lot of Sequence implementations.

The exclusion of the Ordered Collections from the benefiters of infinity mighty first seem as an argument against the decision to make all all Collections ordered. But even without the ordering functions we would not get very far: for example, the simple and essential query coll `has` element will either inspect the whole structure, or it will need the structure to be ordered. Either of which is impossible for infinite structures.

The following principle brings us nearer to the causes of our restrictions.

Laziness Abstraction Principle

Algebras with infinite elements have less laws. By consequence, they offer less abstraction.

Explication of the consequence: if you have many, many laws, you can take out some of them and declare them axioms, the rest becomes conclusions. But you also can select other laws as axioms and get the same conclusions. For example, sequences can be defined with the basic construcors empty single (<+>), but also with empty cons. If we want to specify operations that work also with infinite structures, we must be aware where in the data structure the infiniteness lurks.

Again, the functions take and drop are good examples, together with their Collection versions front and back. If we visualise a sequence in our minds, the front operations can either be seen as taking the front or as dropping the back. We can take imaginary scissors and cut the sequence at the desired point. We can formalise this intuitive idea in several ways, giving different specifications and implementations for the functions front and back. However, take and drop don't admit this flexibility: they have to be aware that the tail of the Sequence can be infinite and therefore they only have one formal way to write them (namely the one of the Prelude). This absence of different viewpoints, of the abstraction between specification and implementation, makes the two functions rather operational: They are more defined by what they do (taking, dropping) than be what they return (front or back). It is no accident that the functions have verbs as names, while their more abstract counterparts have nouns as names!

This strong dependency on a certain operational viewpoint (instead of abstract diversity) suggests that we can capture all benefits of lazyness in one single concrete data structure!

data Stream a = Nil | Cons a (Stream a)

instance (Eq a) ⇒ Collection Stream a where ...
instance (Eq a) ⇒ Sequence Stream a where ...

When writing the specifications for this we'll notice that all the laws stated until now are only valid for finite structures. Indeed the Stream specification says exactly which of the class members work also on infinite Sequences. Furthermore, all functions that generate infinite Sequences, directly have result type Stream, so that we don't accidentally use an balanced implementation which can't evaluate those expressions. On the other hand, Stream consuming functions are defined on Sequences in general, they might be useful on other sequence implementations, too. (However, I estimated that chance so small that I put them in the Stream module nevertheless, suggesting that one should usually prefer the general Sequence functions over the (inefficient) stream consumers. For example front back over take drop.)

Performace

The preceding section stated that all Collection implementations but Streams cannot cope with infinity and can therefore not be used to evaluate expressions like take 5 $ map (^2) [1..]. But that doesn't mean, that all benefits of laziness are lost with those implementations. To the contrary: the performance aspect of laziness is retained by most implementations and this does really scale up! For example we can define has_key map k = domain map `has` k and evaluate it without creating the entire intermediate set. (Attention! At present I think that this statement is wrong: my implementations don't have this property and I doubt that such implementations are possible. It's a pity.)

Modifiying Collections

In the last section we have seen the abstract data constructors and their axioms, and we listed the names of many observer functions (whose trivial specifications we not mentioned). The interesting higher-order-observers fold reduce zip_with merge_with are discussed elsewhere. So we are left, with the point-wise modification operations which are briefly discussed in this section.

Inserting

The Collection class contains no insertion function, because insertion is different on different structures. Precisely we have the following functions on Sequences:

insert_element :: Sequence seq a ⇒ 
                  seq a → Int → a → seq a
insert_segment :: Sequence seq a ⇒ 
                  seq a → Int → seq a → seq a

prop_insert_element_pre s i _ =  has_gap s i
insert_element s i x
    = front s (i - 1) ++ single x ++ back s n
    where n = size s - (i - 1)
prop_insert_element_post s i x = (insert_element s i x) !! i == x

prop_insert_segment_pre s i _ =  has_gap s i
insert_segment s i s'
    = front s (i - 1) ++ s' ++ back s n
    where n = size s - (i - 1)

And on two kinds of Ordered Collections:

add :: CommutativeCollection coll a ⇒ 
       coll a → a → coll a
add coll a = coll <+> single a

put :: OrderedMap map a b ⇒ 
       map (a, b) → a → b → map (a, b)
put map x y = map <+ single (x, y)

In short, we can see that insert_* on Sequences needs an explicit index argument like put on OrderedMaps needs a key. On the other hand add on OrderedSet and OrderedBag just "drops the element in" and has it find its place. With an efficient implementation of ordered join <+> or <+ respectively, these default definitions are already asymptotically optimal. The same holds for the operations on Sequence given efficient (++) front back. By the way, on the level of OrderedCollection we observe the relation put coll = uncurry (add coll), which suggests to unify those two functions to one defined in OrderedCollection. However, at time being, I don't see an advantage of this, since the use of the functions is albeit quite different.

Removing

It is interesting to note that the function remove is already part of the class Collection — that's because its semantics does uniformly apply to all structures: the function removes all occurences of an element from the structure. (An alternative name would be "erase".)

remove :: Collection coll a ⇒ coll a → a → coll a
remove coll x = filter (/=x) coll

The reader might suspect that a function that removes only one occurrence of an element from the structure would be equally useful, but that function has a problem on sequences: the choice which occurrence to remove would be a bit arbitrary. As a consequence, class sequence has a separate function to remove an element by position. In analogy to insert_* it takes an index as argument. And a similar function works on maps:

delete :: Sequence seq a ⇒ 
          seq a → Int → seq a
prop_delete_pre s i =  has_gap s i
delete s i = front s (i - 1) ++ back s n
    where n = size s - i

delete_key :: OrderedMap map a ⇒ 
              map a → Int → map a
delete_key map k = filter ((/=k).snd) map

The function add of CommutativeCollection has no direct counterpart: on sets the inherited function remove does already suffice, only on bags is a function subtract useful:

subtract :: OrderedBag bag a ⇒ 
              bag a → Int → bag a
subtract bag x = bag \-/ single x

Three things that are not here

  1. No notion of observability.
  2. No unordered associative collections.
  3. No stateful data structures.

The distinction between specification and implementation helps us to explain the conflict between "equivalence" and "strutural equality" that has lead other library designers to include a concept of observability which divides the data structures into such with observable elements and such with unobservable ones. In the proposed standard, however, the elements of a data structure are being supposed to be of an abstract data type, i.e. they also have separate specification and implementation. (Well, spec and impl can be the same, then we write deriving Eq in the data declaration and the problem doesn't exist.) From this point of view the structural equality is an internal affair of the implementation of each data type and not seen by the users of the data type. Those can only use the (==) function and that's what they consider as equality.

Each abstract data type with its definition of (==) must observe the law "x == y   ⇒  f x == f y" for each of the functions f that have acces to the internal representation of the data type and are part of its interface. As a consequence the law will also hold for all other functions. Without this law, we wouldn't be able to make any sensible use of the abstract data type; with the law, we don't need to consider observability in data structures. This might be surprising, but it's a hard fact. It might seem hard to include such a strong premise in a standard, but without the premise we would only get chaos — no real abstraction possible.

Some data types might not have any sensible definition for (==), but this is needed to include them in any kind of collection: otherwise we couldn't even give the simplest specification like single x == single x. It is clear that objects without an equality operation (like especially functions) cannot be stored in OrderedCollections at all, not even in unordered Sets if we had such. For the future it might be the best solution to have classes OperationalCollection, OperationalSequence, and OperationalOrderedMap, which are superclasses of the existing classes and differ only in having no equality operation.

By the way, the need of (==) for specifications is also the reason that we need multi-parameter type classes.

The standard includes no unordered commutative collections because of:

  1. Simplicity: less classes, less taxonomy.
  2. Semantics: fold and reduce can both suppose order.
  3. Implementation: The typical functional implementations order their elements anyway. (For untypical implementations add unordered classes outside of the standard.)

These arguments are not very strong, but the counter-side has even weaker arguments. In fact, this shows that this point is not so important at all.

Since the proposed Abstract Data Structures are algebraic by their very nature, it doesn't make any sense to include imperative or stateful (mutable) structures. These are formally completely independent from the algebraic (immutable) structures and a standard treating them can be separate as well. Only in the use of names should those standards agree and this issue should be kept in mind. (Functions that translate from mutable to immutable structures belong to the mutable world, since they are aware of the state...)