CSC207 Software Design
Lectures
Persistence

Motivation

How to save and reload program state?

Two possible meanings:

Load initial configuration (e.g. preferences)

Save/reload state of running program

First usually handled by reading and writing configuration files

Used to use some ad hoc text format

These days use XML

Even though it's harder for human beings to edit and read

Second is called persistence

Sometimes also called checkpointing

What's Involved

Persistence is really three problems rolled into one

Problem #1: how to save state

Input is one or more references to data structures

Which may be linked in arbitrary ways

Output is bytes in some well-defined format

Concentrate on human-readable output for now

Binary output usually more efficient

But machine-specific

And harder to read and debug

Problem #2: how to restore state

Read a persisted structure back in

Deal with multiple references to the same object correctly

Don't create three things if the original contained only one

Problem #3: extensibility

How to handle user-defined types

Without requiring users to re-write persistence framework

Examples below use Python

Exercise for the reader: re-do them in Java
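The aliasing issue in Problem #2 can be seen with a small experiment. This is only an illustrative Python 3 sketch: the json module is used as a stand-in for any text format that has no notion of object identity, and the variable names are made up for the example.

```python
import json

shared = [1, 2]
original = [shared, shared, shared]   # three references to ONE list

# Naive round trip through a format that knows nothing about identity.
copy = json.loads(json.dumps(original))

original_is_aliased = original[0] is original[1]   # same object before saving
copy_is_aliased = copy[0] is copy[1]               # separate objects after reloading

copy[0].append(3)   # changes only one of the three reloaded lists
```

After the round trip the values compare equal, but the structure differs: the reload created three lists where the original contained only one.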

Persisting a Tree

Start by ignoring aliasing

Four cases:

Atomic values (numbers, strings, ...)

Built-in containers (lists, dictionaries, ...)

User-defined types

Non-data (functions, classes, open files, ...)

Simplify further by ignoring user-defined types and non-data

Handle each remaining case by switching on object type

from types import *

AtomicTypes = {
    IntType    : None,
    FloatType  : None,
    StringType : None,
    NoneType   : None
}

def persistObject(dest, obj):
    # Dispatch on the exact type of the object being written.
    if type(obj) in AtomicTypes:
        dest.write(repr(obj))
    elif type(obj) is ListType:
        persistList(dest, obj)
    elif type(obj) is DictType:
        persistDict(dest, obj)
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

Persisting Lists

def persistList(dest, obj):
    sep = ''
    dest.write('[')
    for element in obj:
        dest.write(sep)
        sep = ', '
        persistObject(dest, element)
    dest.write(']')

Persisting Dictionaries

def persistDict(dest, obj):
    sep = ''
    dest.write('{')
    for (key, value) in obj.items():
        dest.write(sep)
        sep = ', '
        persistObject(dest, key)
        dest.write(': ')
        persistObject(dest, value)
    dest.write('}')
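
The three functions above can be tried end to end. What follows is a hedged Python 3 port, not the original slide code: it assumes the built-in types int, float, str, list and dict in place of the Python 2 types module, and writes into an io.StringIO buffer instead of a file.

```python
import io

# Assumption: built-in type objects replace IntType, FloatType, etc.
AtomicTypes = (int, float, str, type(None))

def persistObject(dest, obj):
    if type(obj) in AtomicTypes:
        dest.write(repr(obj))
    elif type(obj) is list:
        persistList(dest, obj)
    elif type(obj) is dict:
        persistDict(dest, obj)
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

def persistList(dest, obj):
    sep = ''
    dest.write('[')
    for element in obj:
        dest.write(sep)
        sep = ', '
        persistObject(dest, element)
    dest.write(']')

def persistDict(dest, obj):
    sep = ''
    dest.write('{')
    for (key, value) in obj.items():
        dest.write(sep)
        sep = ', '
        persistObject(dest, key)
        dest.write(': ')
        persistObject(dest, value)
    dest.write('}')

buffer = io.StringIO()
persistObject(buffer, [11, [12, [13]], 'fourteen', {15: [16]}])
text = buffer.getvalue()
```

Here text ends up looking like Python's own representation of the data, while an unsupported value such as a tuple raises ValueError.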

Storing Type Information

This code produces something that looks like Python's own representation

Which is complicated to parse

Types implied by formatting

[] for lists, decimal point for floats, etc.

Reading data back in much simpler if types are explicit

Which they will have to be anyway if we want to handle user-defined classes

Solution: store atomic values as type name plus value

Store collections as type name plus length, then values

Persisting Atomic Values

Note: indent the output to make it more readable

Types = {
    IntType    : 'int',
    FloatType  : 'float',
    ...etc...
}

AtomicTypes = {
    ...as before...
}

def persistObject(dest, obj, depth=0):
    t = type(obj)
    if t in AtomicTypes:
        persistAtomic(dest, obj, depth)
    elif t is ListType:
        persistList(dest, obj, depth)
    elif t is DictType:
        persistDict(dest, obj, depth)
    else:
        raise ValueError("Bad type: " + repr(t))

Persisting Atomic Values and Lists

Code for dictionaries is similar

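def persistAtomic(dest, obj, depth):
    # One line per value: indentation, type name, then the value itself.
    for x in [' ' * depth, Types[type(obj)], ' ', repr(obj), '\n']:
        dest.write(x)

def persistList(dest, obj, depth):
    # Header line gives the length; elements follow, one level deeper.
    for x in [' ' * depth, Types[type(obj)], ' ', repr(len(obj)), '\n']:
        dest.write(x)
    for element in obj:
        persistObject(dest, element, depth+1)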

Sample Output

python  : [11, [12, [13]], 'fourteen', {15: [16]}]
persist :
list 4
 int 11
 list 2
  int 12
  list 1
   int 13
 string 'fourteen'
 dict 1
  int 15
  list 1
   int 16
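
The sample output above can be reproduced with a Python 3 sketch of the typed writer. Several details are assumptions rather than slide content: built-in type objects replace the types module, the name 'none' for NoneType is invented, and persistDict follows the layout implied by the sample (a count line, then each key followed by its value, all one level deeper).

```python
import io

# Type names as they appear in the output ('none' is an assumed name).
Types = {int: 'int', float: 'float', str: 'string',
         type(None): 'none', list: 'list', dict: 'dict'}
AtomicTypes = (int, float, str, type(None))

def writeLine(dest, depth, obj, detail):
    # Hypothetical helper: indentation, type name, then value or count.
    dest.write(' ' * depth + Types[type(obj)] + ' ' + detail + '\n')

def persistObject(dest, obj, depth=0):
    t = type(obj)
    if t in AtomicTypes:
        writeLine(dest, depth, obj, repr(obj))
    elif t is list:
        writeLine(dest, depth, obj, repr(len(obj)))
        for element in obj:
            persistObject(dest, element, depth + 1)
    elif t is dict:
        writeLine(dest, depth, obj, repr(len(obj)))
        for (key, value) in obj.items():
            persistObject(dest, key, depth + 1)
            persistObject(dest, value, depth + 1)
    else:
        raise ValueError("Bad type: " + repr(t))

buffer = io.StringIO()
persistObject(buffer, [11, [12, [13]], 'fourteen', {15: [16]}])
text = buffer.getvalue()
```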

Reading Data Back In

Reader is slightly more complicated

But only slightly

First token on each line explains how to handle rest of data

Parse atomic values according to type

Read count for collections, then read that many items

Again, write one handler function for each data type

Makes code easy to extend

Alternative: write a class, and have users extend it with new methods

Main Handler

Handlers = {
    'int'   : intHandler,
    'float' : floatHandler,
    'list'  : listHandler,
    ...etc...
}

def reloadObject(src):
    line = src.readline()
    if not line:
        return None
    line = line.lstrip() # remove indentation
    typeName, value = line.split(' ', 1) # split once: values may contain spaces
    handler = Handlers[typeName]
    return handler(src, value.rstrip())

Handlers

def intHandler(src, value):
    return int(value)

def stringHandler(src, value):
    assert len(value) >= 2
    return value[1:-1] # remove the quotes!

def listHandler(src, value):
    num = int(value)
    result = []
    for i in range(num):
        result.append(reloadObject(src))
    return result

Dictionaries left as an exercise for the reader
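One possible shape for the complete reader, including a dictionary handler, is sketched below in Python 3. It assumes the layout of the earlier sample output: a dict line carries the number of entries, and each entry is a key line followed by its value. The string handler keeps the slides' quote-stripping approach, so it does not undo escape sequences.

```python
import io

def reloadObject(src):
    line = src.readline()
    if not line:
        return None
    # Drop indentation and the newline, then split type name from the rest.
    typeName, value = line.strip().split(' ', 1)
    return Handlers[typeName](src, value)

def intHandler(src, value):
    return int(value)

def floatHandler(src, value):
    return float(value)

def stringHandler(src, value):
    assert len(value) >= 2
    return value[1:-1]          # remove the quotes (escapes are not decoded)

def listHandler(src, value):
    return [reloadObject(src) for i in range(int(value))]

def dictHandler(src, value):
    # Assumed layout: each entry is a key followed by its value.
    result = {}
    for i in range(int(value)):
        key = reloadObject(src)
        result[key] = reloadObject(src)
    return result

Handlers = {
    'int': intHandler,
    'float': floatHandler,
    'string': stringHandler,
    'list': listHandler,
    'dict': dictHandler,
}

sample = ("list 4\n int 11\n list 2\n  int 12\n  list 1\n   int 13\n"
          " string 'fourteen'\n dict 1\n  int 15\n  list 1\n   int 16\n")
data = reloadObject(io.StringIO(sample))
```

Supporting a new type means writing one more handler and adding one entry to Handlers; reloadObject itself does not change.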

What About Circularity?

What if an object can be reached in two (or more) ways?

Could just store redundant information when writing

But then reading wouldn't re-create the original data structure

And writing will recurse infinitely if the graph is circular

Solution: give objects IDs

Assign each object a unique ID when it is first seen

Store in a temporary dictionary

If an object has already been seen, just print its ID again

Format is now:

New object: ID, type, value or count

Known object: ID

Constructing IDs

Could use some objects directly as keys into the "already seen" dictionary

But mutable objects (e.g. lists and dictionaries) can't be keys in Python

Python's built-in id() function returns an ID that is unique among all objects alive at the same time

E.g. its address in memory

Can distinguish between occurrences of equal (but not identical) values
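A short Python 3 sketch of both points: lists cannot be used as dictionary keys, while id() works as a key and tells equal values apart from identical ones. The variable names are illustrative only.

```python
a = [1, 2]
b = [1, 2]      # equal to a, but a separate object
c = a           # a second name for the same object

equal_but_distinct = (a == b) and (id(a) != id(b))
same_object = id(a) == id(c)

# A list is unhashable, so it cannot be a dictionary key.
try:
    seen = {a: True}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Keying on id() works; storing the object as the value keeps it alive,
# so its ID cannot be reused while the memo exists.
memo = {id(a): a}
```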

Main Decision Function

def persistObject(dest, obj, memo):
    if id(obj) in memo:
        # Known object: write only its ID.
        dest.write(repr(id(obj)) + '\n')
    else:
        memo[id(obj)] = obj # before recursing!
        if type(obj) in AtomicTypes:
            persistAtomic(dest, obj, memo)
        elif type(obj) is ListType:
            persistList(dest, obj, memo)
        elif type(obj) is DictType:
            persistDict(dest, obj, memo)
        else:
            raise ValueError("Bad type: " + repr(type(obj)))

def persistList(dest, obj, memo):
    # New object: ID, type name, and element count, then the elements.
    for x in [repr(id(obj)), ' ', 'list', ' ', repr(len(obj)), '\n']:
        dest.write(x)
    for element in obj:
        persistObject(dest, element, memo)

Sample Output

# x = []
# x.append(x)
8820504 list 1
8820504
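
The slides show only the writing side of the ID scheme. Below is a hedged Python 3 sketch of a writer together with a matching reader, limited to atomic values and lists. The reader, the 'none' type name, and the seen dictionary are assumptions made for this example. The key step is that the reader registers a new list under its ID before reading the list's elements, mirroring the "before recursing" rule in the writer.

```python
import io

AtomicNames = {int: 'int', float: 'float', str: 'string', type(None): 'none'}
AtomicReaders = {'int': int, 'float': float,
                 'string': lambda text: text[1:-1],   # strip the quotes
                 'none': lambda text: None}

def persistObject(dest, obj, memo):
    if id(obj) in memo:
        dest.write('%d\n' % id(obj))             # known object: ID only
        return
    memo[id(obj)] = obj                          # record before recursing!
    t = type(obj)
    if t in AtomicNames:
        dest.write('%d %s %r\n' % (id(obj), AtomicNames[t], obj))
    elif t is list:
        dest.write('%d list %d\n' % (id(obj), len(obj)))
        for element in obj:
            persistObject(dest, element, memo)
    else:
        raise ValueError("Bad type: " + repr(t))

def reloadObject(src, seen):
    fields = src.readline().rstrip('\n').split(' ', 2)
    objId = int(fields[0])
    if len(fields) == 1:
        return seen[objId]                       # known object: look it up
    typeName, value = fields[1], fields[2]
    if typeName == 'list':
        result = []
        seen[objId] = result                     # register before reading children
        for i in range(int(value)):
            result.append(reloadObject(src, seen))
    else:
        result = AtomicReaders[typeName](value)
        seen[objId] = result
    return result

x = []
x.append(x)                                      # a list that contains itself
buffer = io.StringIO()
persistObject(buffer, x, {})
y = reloadObject(io.StringIO(buffer.getvalue()), {})
```

The reloaded y is a new list whose single element is y itself, so the circular structure survives the round trip.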

Handling User-Defined Types

How to handle types that aren't built in?

Must allow users to control what is persisted

No point persisting an open file handle

Three options:

Register a handler

Require classes to implement an interface

Use introspection

Option 1: Handlers

class C:
    ...my class...

def C_Handler(dest, obj):
    ...write instance of C to dest...

Handlers[C] = C_Handler

def persistObject(dest, obj):
    if type(obj) is InstanceType:
        Handlers[obj.__class__](dest, obj)
    ...handle built-in types...
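
A concrete version of the handler approach might look like the following Python 3 sketch. The Point class, its output format, and the fallback for built-in values are all invented for illustration. Python 3 has no InstanceType, so the sketch looks up type(obj) in the registry directly.

```python
import io

class Point:
    """Hypothetical user-defined class."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

Handlers = {}    # maps a class to the function that writes its instances

def pointHandler(dest, obj):
    # The class author decides exactly what is written.
    dest.write('Point %r %r\n' % (obj.x, obj.y))

Handlers[Point] = pointHandler

def persistObject(dest, obj):
    handler = Handlers.get(type(obj))
    if handler is not None:
        handler(dest, obj)
    elif type(obj) in (int, float, str):
        dest.write('%s %r\n' % (type(obj).__name__, obj))
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

buffer = io.StringIO()
persistObject(buffer, Point(3, 4))
persistObject(buffer, 7)
text = buffer.getvalue()
```

The framework never needs to know about Point; registering the handler is the only coupling between the two.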

Option 2: Implement an Interface

class C:
    ...body of class...
    def persist(self, dest, memo):
        dest.write(class identifier)
        dest.write(unique object ID)
        for each fragment of state:
            if type(fragment) in Persist.Builtin:
                Persist.persist(dest, fragment)
            elif id(fragment) in memo:
                Persist.writeMemo(dest, fragment)
            else:
                fragment.persist(dest, memo)
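
The pseudocode above could be made concrete along these lines. This is a Python 3 sketch under assumptions: the Node class, the output format, and the free function persistObject (standing in for the Persist helpers) are all hypothetical. The scratch attribute shows the class choosing not to persist part of its state.

```python
import io

BuiltinNames = {int: 'int', float: 'float', str: 'string', type(None): 'none'}

def persistObject(dest, obj, memo):
    if type(obj) in BuiltinNames:
        dest.write('%s %r\n' % (BuiltinNames[type(obj)], obj))
    elif id(obj) in memo:
        dest.write('%d\n' % id(obj))         # already written: ID only
    elif hasattr(obj, 'persist'):
        memo[id(obj)] = obj                  # record before recursing
        obj.persist(dest, memo)
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

class Node:
    """Hypothetical class that implements the persist interface."""
    def __init__(self, label, next=None):
        self.label = label
        self.next = next
        self.scratch = object()              # transient: deliberately not saved

    def persist(self, dest, memo):
        dest.write('%d Node\n' % id(self))   # object ID and class identifier
        for fragment in (self.label, self.next):
            persistObject(dest, fragment, memo)

a = Node('a')
a.next = a                                   # a node that points to itself
buffer = io.StringIO()
persistObject(buffer, a, {})
lines = buffer.getvalue().splitlines()
```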

Option 3: Introspection

def persistObject(dest, obj):
    if type(obj) is InstanceType:
        c = obj.__class__
        persistObject(dest, c.__name__)
        members = obj.__dict__
        for (name, value) in members.items():
            persistObject(dest, name)
            persistObject(dest, value)
    ...handle built-in types...
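
A runnable Python 3 variation on the introspection idea is sketched here. It assumes vars(obj) as the way to reach the instance dictionary, adds a member count and indentation that are not in the slide, and uses a made-up Point class. Treating anything with a __dict__ as an instance is crude (functions and classes have one too), so this is only a starting point.

```python
import io

def persistObject(dest, obj, depth=0):
    indent = ' ' * depth
    if type(obj) in (int, float, str, type(None)):
        dest.write('%s%s %r\n' % (indent, type(obj).__name__, obj))
    elif hasattr(obj, '__dict__'):
        members = vars(obj)                  # the instance's attribute dictionary
        dest.write('%s%s %d\n' % (indent, type(obj).__name__, len(members)))
        for (name, value) in members.items():
            persistObject(dest, name, depth + 1)
            persistObject(dest, value, depth + 1)
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

class Point:
    """Hypothetical class with no persistence code of its own."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

buffer = io.StringIO()
persistObject(buffer, Point(3, 4))
text = buffer.getvalue()
```

Unlike the other two options, the class author writes nothing, but also gets no say in which members are saved.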

Python's pickle Module

Handles (just about) everything

Uses introspection

In Python 2, the cPickle module is up to a thousand times faster (Python 3's pickle uses its C implementation automatically when available)
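
A brief Python 3 sketch of pickle dealing with the two hard cases from earlier slides, shared references and circular structures. The data values are made up for the example.

```python
import pickle

shared = [1, 2]
data = {'first': shared, 'second': shared}   # two references to one list

cycle = []
cycle.append(cycle)                          # a list that contains itself

restored = pickle.loads(pickle.dumps(data))
restored_cycle = pickle.loads(pickle.dumps(cycle))

sharing_preserved = restored['first'] is restored['second']
cycle_preserved = restored_cycle[0] is restored_cycle
```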

Persistence in Java

Concepts are the same no matter what the language

Implementation details:

Different containers

(Possibly) different set of primitive values

No direct equivalent to id() in Java (System.identityHashCode() is not guaranteed to be unique)

Must come up with a numbering scheme

Possible (and often useful) to write multi-language persistence framework

Persist from Java, reload in Python, or vice versa

Have to think hard about which structures to map to which

You will deal with these issues in your database courses


$Id: persist.html,v 1.1.1.1 2004/01/04 05:02:31 reid Exp $