Algebraic Collections: A Standard for Functional Data Structures

As the title of this page suggests, there are two different things to be found here: the first is an interrelated ↓specification of some abstract algebraic data structures. These type classes are meant to replace a large part of Haskell's standard library, namely the part that consists of « one hundred functions working on "lists" ». The other part is a reference ↓implementation of the data structures specified in the standard.

Download Documentation and Code

News: We now have Structures with no constraints on the member elements (and no MPTC, either): OperationalCollection OperationalSequence!
There are more examples, and you can use GHC with -package dessy. (See make.sh and dessy.pkg.)

Still Incomplete: the ByMap.

Instead of a FAQ: See my posts on the Haskell mailing list.

The Specification

There are too many reasons to list, why Haskell desperately needs such a standard. But what are the advantages of this proposal?

Since I'm already at it, I'll also list some disadvantages:

A description of my proposal is found in the following documents:

  1. Design Principles for the Specifications.
  2. The Module Collections contains the complete formal specification of the abstract data structures. The Modules Stream and String are part of that Module (reexported, an internal structure like that of the Prelude), while the Module CollectionTools is auxiliary similar in function to the Module List of the Haskell standard.
  3. The Module AlgebraBasic contains an older version of the spec and some interesting design rationale, see expecially the long comment at the top of the file.
  4. An earlier version of this page does also contain some design rationale (especially on naming conventions).
  5. A page on Democratic Sequences contains some anti-list propaganda.

There are two versions of the standard which I have dubbed Collections'98 and Collections-with-Haskell (Collections-H). The first is compatible with Haskell'98 plus multiple-parameter type classes (MPTC). The second is a library and language standard: it includes a re-designed Prelude and the necessary changes to the programming language Haskell, which gives the variant Haskell-with-Collections for which I kindly provide a ↓formal specification, too.

The two versions differ only slighty, they are both described here and the complete formal specification is in the module Collections. This module can be used in any Haskell-implementation with MTPC, you just need to write:

import Prelude ()
import Collections

The Collections redefines many identifiers from Prelude and it re-exports all others. To make really use of the Collections, you also need some implementations of them. I expect that future Haskell systems will bring a lot of such implementations with them. (By the way, we could reserve a branch Collections in the hierarchical libraries, so that for example Collections.Dessy is the implementation described below.) For now, to get implementations too, just write:

import Prelude ()
import Dessy

Haskell-with-Collections

After one has worked with the Collections library for some time, concrete data structures will probably not be used anymore in one's code. Then its time to switch to an updated variant of Haskell, that gets rid of the functional stacks (aka "lists") — the data structure that deprives the language from many of its predecessors' advantages! The changes needed in Haskell are entirely analogous to the support of numeric classes that is already in the language. They consist basically of two things:

  1. Collection manifests have polymorphic types.
    1 :: (Num a) ⇒ a
    [1..3] :: (Num a, Sequence seq a) ⇒ seq a
    
  2. There are standard instances to disambiguate expressions.
    print 1 -- Automatically supposes 1 :: Integer
    print [1..3] -- Automatically supposes [1..3] :: DefaultSequence Integer
    

The generalisation holds for (a) manifest sequences, (b) enumerations, and (c) sequence comprehension. Manifests of other abstract structures can be created using simple conversion functions that are already part of the standard (compilers can fuse the with the manifest so that no overhead exists):

set [1..5] \/ set [3..8]  ==  set [1..8]
map [ "France" --> "Paris"
    , "Poland" --> "Warsaw" ] `at` "France"  ==  "Paris"
bag [1, 2, 1]  ==  bag [1, 1, 2]

By the way, the maplet constructor --> uses no magic, it is defined by x --> y = (x, y).

We also generalise "list" pattern matching to Sequence (and Collection) pattern matching. And we generalise default method bindings and instance method bindings as to allow more sharing of code. But this a feature separated from the rest and mainly good for library implementors, not users. (For reference you may call the new feature "delayed method definition", see its ↓formal definition.)

Transition towards ubiquitous Collections

The recipe is easy: program using Collections'98 can be written without one single use or reference to the ancient list type. And such programs are automatically correct in the new version of Haskell — the Collections will automatically compensate for the changes. (Such programs will contain some applications of conversion function that simply do nothing in Collections-wH.)

Of course, the removal of the ancient built-in list data type will make virtually any Haskell98-program uncompilable under the new language. But removing the type signatures does already bring some up-to-date! In fact, all programs that don't use the list-specific data constructors directly will automatically use sequences. Uses of the ancient "list" data constructors have to be replaced by empty add_first on right-hand sides and front_view on the left-hand side et voilà — the program uses sequences. In the presence of ↓views in the programming language, we can even have pattern matching at the front, at the back and in the middle. Doesn't that make up for the loss of some arbitrary built-in syntax?

Many programs and libraries can be translated this way from "lists" to Sequences, or even general Collections. However, "lists" have also often be used in place where actually a Set was needed (see the function nub) or an Ordered Bag (see insert and merge) or even Map (see association lists, especially lookup). While it may first seem advantageous to have so many functions working on the same data type, it is in fact an unstructured, inefficient bulk of functions. Functions are not grouped by the invariants they preserver, these invariants are not even explicit. That's why some application and library function should not be translated to use Sequences, but to another Collection — often this involves some kind of a redesign, but the effort is worth it.

Finally, some rather algorithmical functions do still on intrinsic properties of the ancient lists. These can undergo a syntactic transformation to use the data type Stream, but perhaps Collections offer a better way to achieve the function's purpose, and probably many of those functions will also become obsolete, because a more abstract approach makes disappear the problem they solve.

Collection Fusion

GHC provides the nice mechanism of "list fusion" to eliminate intermediate data structures. In GHC it only works for "lists", here I generalised to any Collection. The idea is simple: any function that produces Collections uses the basic constructors empty single (<+>), and the fold function (as the canonical consumer) replaces each of the basic constructor with a caller-supplied function. Thus, all we need to do is to replace in the producing function the calls to the constructors with calls to the caller-suppled consumer functions. Wo do that by giving each producer three additional arguments and applying it to empty single (<+>). Than we just add the tranformation rule

fold e f g (fun empty single (<+>)) = fun e f g

that's all! The GHC documentation has a list of "good consumers" and "good producers", but in fact we have the following:

GHC has to make some extra effort to translate function definitions of e.g. apply and filter (which are recursive in the classic Prelude) to definitions via fold. We don't need to do that, because we use fold already ubiquitously.

Functions that are defined via fold and return lists can be good consumers and producers. filter, apply and join (as concat, union etc.) are most notable examples: since they use the Collection constructors only as arguments of fold, a fusioned version of them will still be a variant of fold. That makes them especially handy for manual use:

fold e f g . apply f' = fold e (f.f') g
fold e f g . filter p = fold e f' g
                      where f' x = if p x then f x else e

For the various special cases of fold on the left we get the specialised laws:

apply f . apply f' = apply (f.f')
filter p . apply f = apply f . filter (p.f)
reduce e g . apply f = fold e f g

filter p . filter p' = filter (λx → p' x && p x)

(The cases "apply/filter" and "reduce/filter" are uninteresting for human use, since the result is no different than simply expanding definitions and use the general law.) GHC does all but the second simplification. That one is for manual calculation only, since it doesn't really simplify. Note the change of order between p and p' which is important when laziness is involved: p' is the predicate checked first. Here you can see how to use the laws:

-- The sum of the ages of all non-adult subscribers...
  sum $ filter (<18) $ apply age $ filter is_subscribed $ folks
                                           -- "filter/apply" exchange
= sum $ apply age $ filter p folks         -- and "filter/filter" fusion
    where p x = is_subscribed x && age x<18 )
                                           -- "sum = reduce 0 (+)"
= fold 0 age (+) $ filter p folks          -- and "reduce/apply" fusion 
    where p x = is_subscribed x && age x<18 )

= fold 0 f (+) folks                       -- "fold/filter" fusion 
    where f x | is_subscribed x && a<18 = a
              | otherwise                = 0
	      where a = age x

Note that with "lists", functions based on foldr1 can also be good consumers, but we can't do that for fold1 (and reduce1). (Or at least I don't see, how.) The reason it that the producer may include several calls to empty and we don't have any functions to insert for empty. In "lists" this is no problem, since [] occurs exactly once in each concrete data structure, but in the abstract collections empty is a unit of <+> and can occur arbitrarily often. When folding normally, this is no problem, since e must also be a unit of g, but with fold1 we just don't have that unit.

Fusion away the ambiguous

The fusioning can be used to cope with problems of ambigous types (which I mentioned above as a disadvantage of Collection type classes). To date I have encountered two classes of problems: one with unresolved constrains, the other with ambiguous instances. The first occurs, e.g. when using apply: since it's argument and result collection instance are the same we may get a problem if the argument collection instance cannot cope with the result element type (e.g. a Patricia tree can't have String elements, or an Associative Collection with elements (a, b) cannot have a general c as element type: it takes only tuples).

We can easily remedy this problem by defining a function apply' which has different argument and result collection instances, but then we have the second problem: the concrete type of apply''s result is ambiguous. Our solution in both cases is the same: we fusion apply with its consumer or producer. (There must be at least one of them, otherwise the Collection type would be visible in the type signature and we wouldn't have the problem.) The Collections Module offers some small examples of this and here is a large one (from examples/generic_conformance.hs):

unparagraph = concat_apply (++"\n\n")
print_classes = putStr $ unparagraph $ apply show $ range classes

The problem is (as so often) the result type of range, which is ambigous. The collection type is kept by apply show and finally consumed by the concat in unparagraph. So to solve the problem, we have to fusion away almost everything:

  unparagraph $ apply show $ range classes
= concat_apply (++"\n\n") $ apply show $ apply' snd $ classes
= concat_apply ((++"\n\n").show.snd) classes

And being at it, we can still do an optimisation: using closures to make (++) unquadratic on Strings (which are unfortunately still implemented with the ancient lists, otherwise that wouldn't be a problem).

= concat_apply ( \ (_, x) → shows x "\n\n" ) classes
= fold "" ( \ (_, x) → shows x "\n\n" ) (++) classes
= fold id ( \ (_, x) → shows x . ("\n\n"++) ) (.) classes ""
= foldr ( \ (_, x) → shows x . ("\n\n"++) ) "" classes 

The last line is only for beauty: we use foldr which we can now think of as the "closure fold" (or "fold with accumulator" depending on the point of view).

Formal Definition of Language Changes

Relative to the Haskell98-Report.

3.7 Sequences (was: Lists)

Since "lists" are no longer built-in, all but one productions are removed. We only retain
      aexp → "[" exp1"," ..."," expk "]"      (manifest sequence, k ≥ 0).
And its translations is
      single exp1 ++ .. ++ single expk.
By consequence the expression has the type
      Sequence seq T ⇒ seq T where exp1, ..., expk :: T.

Note that we also allow empty manifest sequences, since [] is no longer a special lexem.

3.10 Arithmetic Sequences

The syntax and translation are retained as in the report. The semantics follows the change in section 6.3.4.

3.11 Sequence (was: List) Comprehensions

Simply replace references to "list type" by "Sequence type".

(1) Semantic changes are implied by the generalisation of concatMap (see below) and Pattern Matching. (2) An alternative specification would be to write "Collection type" and replace concatMap by joinMap.

3.17 Pattern Matching

We generalise the syntax for manifest lists to the empty case and its semantics to Sequences. Furthermore, if all the patterns are literals (i.e. they don't contain any variables), we even generalise to Collections.

apat → "[" pat1"," ..."," patk "]" (sequence pattern, k≥0)

The report contains no definition for the semantics of the list pattern. We just add the following translation rules to those already there:

(t)
case v of { [p1, ..., pn] → e; _ → e' }
= case size v of { 
     n → case v !! 1 of {                    
             p1 -> ... case v !! n of { pn -> e ; _ -> e' } ... 
             _  -> e' }     
     _ -> e' } 
(u)
case v of { [p1, ..., pn] → e; _ → e' }
= case v == [p1, ..., pn] of
     { True → e; _ → e' }

If no of the p1, ..., pn contains a variable.

When diverging expressions are involved in the first rule, the order of matching becomes important. We do the match from left to right as is already assumed in the standard. The following example though remains valid:

case [1, undefined] of { [2, 3] -> True; _ -> False } = False
case [undefined, 1] of { [2, 3] -> True; _ -> False } = undefined

4.1.2 Syntax of Types

Remove the productions atype → "[" type "]" (list type) and gtycon → "[]" (list constructor) together with their explanations and mentionings of [] in the examples.

4.3.1 Class Declarations

The generalisations in this and the following section permit more redundant function definitions to be factored from instance declarations to default methods in classes. This happens entirely straight-forward as in object-oriented languages. This change is independent from the other ones in this proposal. I'll call it "delayed method definition".

In the third kind of declarations permitted in the cdecls we change the phrase "Lastly, the cdecls may contain a default class method for any of the vi" by adding "or any of the class methods of any of its (transitive) superclasses". Such default methods may be given, even if there is already a default method for that method in one (or several of) the superclasses. The semantics of this are given in the next section.

In the presence of multi-parameter type classes (which we assume), such default definition for a superclass function can be ambigous, since one class can be superclass for several of the parameters of a MPTC. The most obvious example for this is:

class (Eq a, Eq (coll a)) ⇒ Collection coll a

Here, it may be convenient to define a default Eq instance for Collections (while it is rather improbable that one wants to define the Collection's element's Eq instance here). Anyway, a simple function binding for (==) will be ambigous: which type variable's (==) is to be defined? To solve this conflict we add one additional constraint: "When giving a default for a member function of a class that is superclass for more than one of the defined class' type variables, then each such default binding must be immediately preceded by the type declaration of the member whose default is to be defined." In our example this gives the following:

class (Eq a, Eq (coll a)) ⇒ Collection coll a
    where
    (==) :: coll a → coll a → Bool
    a == b =  size a == size b && eq a b
        where eq (a:<as) (b:<bs) = a == b && eq as bs

    (==) :: a → a → Bool
    -- don't want to redefine this...

4.3.2 Instance Declarations

We have to specify what happens, when there are several default methods for a function, but this is very simple: "we say that a one default method M overwrites another default method M' for the same funtion, when M' is defined in a superclass of where M is defined. Since the superclass relationship is acyclic, the existence of at least one default method guarantees the existence of at least one default method which is not overwritten. If there is at most one non-overwritten method than this one chosen. If not then there must be a defintion in the instance declaration (which will overwrite all defaults)."

This definition also iteracts nicely with the possibility of instances that don't implement all members of the instantiated class. If the type that becomes a class instance (say, of C1) later becomes instance of another class C2, he may get from C2 the definitions of the lacking members C1. For example:

class Eq a where ...as usual...

class Ord a where
    ...as usual plus...
    a == b = compare a b == EQ
    
instance Eq SomeType -- don't write any definitions

instance Ord SomeType where
    compare a b = one definition suffices for two instances

This possibility may lead to a programming style where all superclass instances Cx of an instance Cx ⇒C T are left empty, and all their lacking members are defined in that "last" instance. Indeed such a style is often much more readable, since the semantics of the superclass definition often already supposes that the type will also become instance of another class. (E.g. when defining <+> for a set implementation, it already has \/ semantics.) We will therefore allow the same generalisation also for instance declarations.

We replace the phrase "The declarations d may contain bindings only for the class methods of C." By: "The declarations d may contain bindings for the class methods of C and any of its superclasses, provided that the superclass instance is not implied by the current instance declaration's context." The last proviso forbids us defining, for example, (==) on type a, in the instance declaration:

instance Eq a ⇒ Collection Foo a

Likewise, the definition of (==) on type a would be forbidden in the declaration:

instance Ord a ⇒ OrderedCollection Foo a

Since the instance Eq a is implied by the context Ord a. (TODO: Check whether in general this is umambigous and computable.)

To disambiguate function definitions for different superclass instances the same type-signature rule as for class declarations applies.

4.3.4 Ambiguous Types, and Defaults for Overloaded (was: Numeric) Operations

The scheme is simply generalised to include the Collection classes, too. However, there is no implicit default declaration, rather the declaration is "imported" from the modules that implement Collections. A module that imports several different default declarations must specify its own.

If the mechanism is used extensively, it should probably be elaborated, i.e. to combine default rules for independent classes. But for the moment it think the above suffices.

6.1.3 Lists

The section is removed. (The similar type Stream is part of Collections, but it is defined entirely "in the language".)

6.3.4 The Enum Class

New types as follows. The semantics remains unchanged.

class  Enum a  where
    succ, pred     :: a → a
    toEnum         :: Int → a
    fromEnum       :: a → Int
    enumFrom       :: a → Stream a            -- [n..]
    enumFromThen   :: a → a → Stream a       -- [n,n'..]
    enumFromTo     :: a → a → Stream a       -- [n..m]
    enumFromThenTo :: a → a → a → Stream a  -- [n,n'..m]

Rationale: enumFrom and enumFromThen must return Streams, since other Sequence implementations don't support infinite structures (more precisely: are not required to support, so we can't rely on that they do). We adapt the other two types for consistency. I think its easier for the user to throw in a conversion function, then to wonder about differing types...

(new section) Backwards Compatibility

It is clear that most Haskell'98 program using the list syntax will not be valid programs in Haskell-wC. Such programs have to be compiled using Haskell'98 mode, but they can still work together with Modules of the new language: Compilers can just implicitly translate all references to the ancient lists to the new Stream data type — more is not needed neither for import nor for export of modules of different language versions.

Changes to the Prelude

Since the module Collection proposed here is very long, I'm unsure about whether it should be a part of Prelude or a separate module. Anyway the Collection hierarchy is as essential to programming as the Numeric hierarchy and so it should be imported to all modules in the same implicit way as the Prelude.

The following Prelude functions are superseded by functions in Collection:

Old FunctionNew FunctionComment
null head last tail init is_empty first last but_first but_last Basic Queries of Collection
maximum minimum
take drop splitAt /
length (!!) elem notElem
front back /
size (!!) has
More Queries of Collection
map filter fold* apply filter fold reduce Higher-Order-Functions of Collection
zip zipWith unzip zip zip_with cross cross_with unzip Yet more functions on Collection
(++) length (++) length Functions on Sequences
iterate repeat cycle Stream Producers
takeWhile dropWhile span break Stream Consumers
lookup at of Map

Generalisations of operators to sequences are specified together with the operator, using reduce instead of foldr foldl.

(Sub)ModuleOperatorGeneralisation
Collection(<+) (++) (∪) (∩) join concat union intersect /
join_apply concat_apply
Prelude(&&) (||) and or all any
PrelNum(+) (*)sum product
PrelMonad(=<<) (<<) sequence sequence_ mapM mapM_

join_apply and concat_apply are needed for the Collection or Sequence Comprehension, whatever we choose.

The following functions on lists don't yet have a correspondent in Collections. Either we'll create one, or the function will move to module CollectionTools (or similar, heritor of current Module List): foldl foldl1 scanl scanl1 foldr foldr1 scanr scanr1.

All other functions involving "lists" in Prelude and other modules of the standard will simply be converted according to the principles stated above: turned into functions on the appropriate Collection class or Streams.

What becomes of the module List?

Some of its functions become obsolete. Others are not worth having a standardised name. And the Rest migrates to the module CollectionTools, where already some of the now less useful Prelude functions reside.

Obsolete are: nub union insert intersect because of sets; and (\\) because of Bag's (\-/).

Unworthy are: elemIndex elemIndices find findIndex findIndices / delete deleteBy (which should be called delete_first_occurrence anyway) / generic{Length Take Drop SplitAt Index Replicate} (better use fromIntegral explicitly).

By the way, we define unworthiness by the criteria of When should we give a thing a name?

A positive example is the function concat_apply: as a defined term, it occurs in the specification of list comprehension. Furthermore its specification concat_apply f = concat . apply f creates an intermediate list, which the implementation concat_apply f = fold empty f (++) does not. The unworthy functions above observe neither of the two criteria. Furthermore, operations on Sequences that "go backwards" from elements to indicies, can be expressed by viewing sequences as (Maps or) Relations as done in the Z and B specification languages.

The following functions from module List are generalised to Collections:

And the following on Sequences: group group_by / intersperse transpose / is_prefix is_suffix / sort sort_by insert_by

If we have views

Some programs contain patterns like (x:y:xs). Since abstract Collections don't have data constructors we can match against, this is no longer possible. (We have to use the functions front_view and back_view explicitly.) However, the proposed language extension of Views provides just that possibility: pattern matching on abstract structures. Given that views were already proposed in 1987 (before Haskell was born) and standardised in 1996 (again before Haskell'98) it seem miraculous why views have not yet been implemented, not even by Haskell systems which otherwise contain a lot of non-standard extensions. Haskell programmers have for too long been obsessed by concrete data structures (which are so easy to define), that they couldn't make use of a mechanism that supports data structural abstraction. Maybe the availability of standard abstract data structures will provide the killer application of views.

The language extension Haskell-wC proposed above is entirely compatible with views. Views can even compensate wonderfully for the special list syntax that the proposal abandons! In a Haskell implementation that supports Views, we can replace the split function and its result type by a view, and also have views for the front and the back. The following views should be part of the Collections standard:

view Collection coll a ⇒ Splitted coll a of coll a = Empty
                                                    | Single a
                                                    | coll a :+: coll a
    where
    splitted coll | s == 0    = Empty
                  | s == 1    = first coll
                  | otherwise = a :+: b
		  where
		  s = size coll
		  (a, b) = split' coll

view Collection coll a ⇒ Front coll a of coll a = EmptyFront
                                                 | a :< coll a 
    where
    front coll | is_empty coll = EmptyFront
               | otherwise     = first coll :< but_first coll

view Collection coll a ⇒ Back coll a of coll a = EmptyBack
                                                | coll a :> a
    where
    back coll | is_empty coll = EmptyBack
              | otherwise     = last coll :> but_last coll

The constructors EmptyFront and EmptyBack will not be used in practice, they're not even exported, since one can always write Empty instead.

As you can see, the views Front and Back can simply be implemented like the functions front_view and back_view which they replace, but the implementation of Splitted is rather crude: instead of the simple, single query split, each instance must now implement size, first, and split'. (Because we want to get rid of the data type Splitted which would only be used in this single place.) This is not so bad, because the first two queries will be overwritten by most implementations anyway (to be constant-time). split' is a preconditioned (partial) function, but no problem: it is only used in this one place, only meant for implementors of the interface, not for its users. (Not exported.)

Probably the following view should also be part of Prelude in Haskell-with-Views:

view (Integral a) ⇒ Natural a of a = Zero | Succ a
     where
       natural 0             = Zero
       natural n | n > 0     = Succ (n-1)
                 | otherwise = error "Natural: Natural view excludes negative values"

I mention this, because this view can replace the controversial (n+k) patterns, without the need for a series of NaturalK views as mentioned in the paper defining views. Just use recursive patterns and their example function becomes:

fib Zero              = 1
fib (Succ Zero)       = 1
fib (Succ m@(Succ n)) = fib m + fib n

The Reference Implementation

Note: All the code provided here has not been tested, only been proven correct.

Dessy is not a reference implementation in the sense of "reference for the semantics", since the semantics is already formally defined in the proposed standard. Instead, the goals of Dessy are:

  1. Show that the specification can really be productively implemented, that is, without too much effort. More specifically:
  2. Provide the possibility to "try out" the proposed interface in normal programs. In this sense it is like a prototype one can play with to experience the more practical aspects of the specification.
  3. Last not least is also a reference for the algorithms it uses. Other data structure implementations one can download are often rather restricted and implement doubtful interfaces. Dessy's implementation are at the same time very readable (and portable) and they implement a realistic interface with real practical value. Instead of being littered with optimisations, Dessy's implementations provide a clarity that allows others to understand the algorithms and reuse them in various contexts, even more optimised implementations.

In other words: The design goal of Dessy is more to show (that and) how it can be done, than to do it. The simplest possible implementation with asymptotically optimal performance. You may look at the code to see what I mean: It is meant to be readable like a text book, but still have all qualities for use in practice. Here are the principal kinds of implementations provided by Dessy:

Sequences Ordered Collections
Queues 
Balanced Trees
 Tries

Except for the remarkable absence of priority queues (which I had no desire to implement) this is a quite complete collection of implementations of algebraical data structures. Hashtables which are so popular in the imperative world don't have a place here, since they are intrinsically based on updateable states. Hashing itself can however be used with some kind of tries.

The Queues are the smallest part of Dessy and not very spectacular. However, they are a good introduction into Dessy's programming style. The Balanced Trees are based on a new kind of framework which separates the concerns of balancing and ordering using some internal layers of abstraction. This framework is a unique contribution of Dessy. Tries are trees that use the internal structure of elements or keys to handle balancing and ordering in one go. Nevertheless this class of trees can profit from internal structuring: most of the code is still shared with other tree implementations.

Sequences

The single concrete data structure contained in the Collections standard is the Stream an instance of Sequences. Streams are the only data structure that benefits laziness and otherwise they are very constraint: their set of efficient operations is that of a Stack — a data structure fundamental to algorithmic programming, but not very useful in application programming and specification.

As I argued in my note on Democratic Sequences, a Sequence implementation should at least be symmetric in both ends, since this is possible without any loss in (asymptotical) performance. The Deque is just such a data structure: one can add and drop elements to and from both ends. This is a great benefit to the application programmer who can concentrate on the correctness of his program and the usefulness of his specification without having to take care to use the efficient end of the "list" only. For completeness of the repertoire (expecially with respect to DData ;-), Dessy also contains a real-time Queue: with a minimal overhead this structure has the same worst-case performance as Stream, only that snoc is fast, not cons. (In fact, the Stream is a Stack, only that we usually don't use it as such.)

The most democratic of all Sequence implementations are balanced trees: only balanced trees provide non-biased worst-case bounds on all of their operations. And they provide maximal sharing of memory between results and arguments of modifying operations. For example, if you apply unlines to a Sequence of Strings which are implemented using balanced trees, the resulting big String will share a lot of memory with the argument Strings. The balanced Sequence implementation of Strings has a lot in common with the Rope implementation known in the imperative world: Ropes are also a persistent (immutable, algebraic) data structure and they are much more flexible than the usual character arrays used in imperative languages. However, they are not balanced what makes some operations on them degrade performance (e.g., creating a String by inserting Characters sequentially at the front or back), the programmer has to take take to use special functions (e.g., directly creating a Rope from a character array) and those functions are sometimes not flexible enough (e.g., in a program where the characters of a String are produced in different places). Balanced Sequences don't have these problems, each of their operations has optimal worst-case bounds (even real-time bounds, i.e. no amortising) and this makes Balanced Sequences the preferred implementation of Strings.

Write-Only Sequences: In some applications Sequences are created in one part of the program and consumed all at once in another part. Since such a consumption happens usually via fold (e.g. if it's conversion to another data type), and is therefore an O(N) operation in any case. To speed up the creation of such Sequences we can construct the canonical algebraic data type for our use, that is, one data constructor for each of the abstract constructors we call. Then we implement split and we're done. Well, if you need such an implementation, you can easily make one, or you just use our unbalanced LeafTree: he provides empty single ++ in constant time and minimal space (one cell) and <: >: in constant time and almost minimal space (two cells). So easy!

Balanced Ordered Structures

The following trees are instances of Sequence and via the higher-order type constructors BalancedSet and BalancedMap they are also instances of OrderedSet, OrderedMap and OrderedBag, respectively. This way each of the concrete data structures can implement each of the abstract data structures!

Concrete Structure Constant-time Operation
WeightBalanced LeafTreesize
WeightBalanced NodeTreesize first
WeightBalanced NodeTreesize first last
AnderssonTree, AVLTreefirst

Notes:

Tries

To term Trie usually applies to Ordered Collections whose elements are Sequences. The Sequence elements are stored in the nodes along the paths in the Trie, such that Sequences with the same prefix can share memory in the Trie. At the moment Dessy does not provide such a data structure, but like DDate we have Patricia Trees which use about the same principle: Integers are viewn as Sequences of Bits (all of the same length). If we translate all integer values to a non-negative range (by adding minBound, with explicit overflow), then the lexicographical ordering is the same as the natural order of numbers — we get a Structure that is automagically balanced and correctly ordered.

Patricia-Tres can be generalised to any element type by using Hashing — only that this needs a non-trivial Hashing-Framework and it doesn't give a sensible ordering. That's not for me for the moment.

Intermediate Classes

For the implementation of Ordered Structures many functions have the same implementation on classes of different concrete data structures. Those functions can all be implemented using some more primitive functions which are efficient (i.e. constant-time) on all the concrete structures of the class. Basically all the functions of Ordered Collections (the class and all its subclasses) can be implemented using this mechanism. (With the exception of <+> merge_with as explained in the Collection's design principles.)

Class Constant-time Operation Instance
SearchTree first (and, by consequence (<.)) (WeightBalanced NodeTree), AnderssonTree, AVLTree
RangeSearchTree first last (and, by consequence (<.) (.<) (<>.)) (WeightBalanced DoubleNodeTree)
Trie (<.) (.<) (<>.) Patricia

In Haskell with delayed method definition we can easily introduce intermediate type-classes which contain those implementations. The concrete data structures will then simply become instances of the more specialised class and get specialised default implementations for free. (For the moment, all those default functions are hanging around in the Collections Module: surely they are not meant to be part of the standard, but for the moment I had no other place for that code.)

On Abstraction and Efficiency

In usual implementation of Abstract Data Structures using Balanced Trees, each function on the abstract structure must be implemented taking the form of the concrete structure into account and each time the developer has to ensure all the concrete structure's invariants. At lot of redundancy and a hiding place for bugs! In Dessy's tree framework all the balancing is encapsulated in the implementation of <+> — consequently, almost all of the code is completely freed of balancing issues. (Note the crucial difference to "all of the code is almost freed of balancing issues" — we're really completely freed and can write functions as if there were no balancing.)

The encapsulation of balancing code was already known since some time, only that Dessy is the only implementation that hides its (really simple) balancing algorithms behind (internal) abstract data types. The real revolution is that Dessy uses a similar scheme for the implementation of Ordered Collections: <+> merges two Collections to establish the ordering, all other functions can just assume that ordering.

In that way each abstract data type (although they all have the same interface) has a stronger invariant and the type system helps us to ensure the correctness. But unfortunately the work is not done by the type system alone: we also have to pay a small run-time overhead, since any construction of Collections goes via the code that preserves the invariants.

For some functions it is worth implementing them directly on the underlying structure, especially if that can be done in constant-time (as for first last size on some trees). If the underlying structure is already an instance of Collections, this optimisation is even very easy to do: we just lift the operation up, calling the function on the lower level. This is done ubiquitously for the functions first last size (except in the Cached wrapping which just does the contrary!), and it can also be done for some special cases of fold: filter does not disturb ordering, it can be lifted through ordered collections; and apply does not disturb the balancing (it preserves layout), so it can be lifted through balancers.

So-called unsafe operations

In functional programming functions are usually total, they have no precondition. Since no functional language today allows to express and (run-time) check preconditions, partial functions are considered dangerous by many functional programmers, that's why they often prefix their names with the word "unsafe".

In data structures some of the efficiency loss of abstraction can be compensated for by the use of partial functions. For example, if we knew that the argument function of apply does not change the order of the elements (e.g. as (+1)), we don't need to rearrange the structure. Thus we could have an operation unsafe_apply on Ordered Collections which is implemented in the same lifty way as apply on balanced structures. The same applies to a function unsafe_from_list which creates an Ordered Collection from an already sorted list.

Dessy dispenses with such functions, since the overhead for the invariant-preserving abstraction layers is so small: a balancer will do nothing for already balanced nodes and the ordered join is constant-time for sets with disjoint range (min..max). The latter property also guarantees us O(N) running time for the creating of Ordered Collections from sorted lists — without any additional function! This way, the user need not worry, and the efficiency gain is even noticeable for "almost sorted" lists.