Motivation
How to save and reload program state?
Two possible meanings:
Load initial configuration (e.g. preferences)
Save/reload state of running program
First usually handled by reading and writing configuration files
Used to use some ad hoc text format
These days use XML
Even though it's harder for human beings to edit and read
Second is called persistence
Sometimes also called checkpointing
What's Involved
Persistence is really three problems rolled into one
Problem #1: how to save state
Input is one or more references to data structures
Which may be linked in arbitrary ways
Output is bytes in some well-defined format
Concentrate on human-readable output for now
Binary output usually more efficient
But machine-specific
And harder to read and debug
Problem #2: how to restore state
Read a persisted structure back in
Deal with multiple references to the same object correctly
Don't create three things if the original contained only one
Problem #3: extensibility
How to handle user-defined types
Without requiring users to re-write persistence framework
Examples below use Python
Exercise for the reader: re-do them in Java
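To make Problem #2 concrete, here is a minimal Python 3 sketch (the variable names are illustrative and not part of the original slides). It round-trips a list through its text representation with `repr` and `ast.literal_eval`: the values survive, but the fact that both slots referred to the same list does not.

```python
import ast

shared = [1, 2]
original = [shared, shared]          # two references to ONE list

text = repr(original)                # plain text, no identity information
restored = ast.literal_eval(text)

print(restored == original)          # values are equal
print(original[0] is original[1])    # one object in the original
print(restored[0] is restored[1])    # two separate objects after reload
```

This is the "don't create several things if the original contained only one" problem in miniature.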
Persisting a Tree
Start by ignoring aliasing
Four cases:
Atomic values (numbers, strings, ...)
Built-in containers (lists, dictionaries, ...)
User-defined types
Non-data (functions, classes, open files, ...)
Simplify further by ignoring user-defined types and non-data
Handle each remaining case by switching on object type
from types import *

AtomicTypes = {
    IntType    : None,
    FloatType  : None,
    StringType : None,
    NoneType   : None
}

def persistObject(dest, obj):
    if type(obj) in AtomicTypes:
        dest.write(`obj`)
    elif type(obj) is ListType:
        persistList(dest, obj)
    elif type(obj) is DictType:
        persistDict(dest, obj)
    else:
        raise ValueError("Bad type: " + `type(obj)`)
Persisting Lists
def persistList(dest, obj):
    sep = ''
    dest.write('[')
    for element in obj:
        dest.write(sep)
        sep = ', '
        persistObject(dest, element)
    dest.write(']')
Persisting Dictionaries
def persistDict(dest, obj):
    sep = ''
    dest.write('{')
    for (key, value) in obj.items():
        dest.write(sep)
        sep = ', '
        persistObject(dest, key)
        dest.write(': ')
        persistObject(dest, value)
    dest.write('}')
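The slide code targets Python 2 (the `types` module constants and backtick `repr`). As a hedged sketch, the same three functions might look like this in Python 3. The snake_case names and the `persist_to_string` helper are assumptions made for this example, not part of the original.

```python
import io

ATOMIC_TYPES = (int, float, str, type(None))

def persist_object(dest, obj):
    """Write obj to dest, switching on its exact type."""
    if type(obj) in ATOMIC_TYPES:
        dest.write(repr(obj))
    elif type(obj) is list:
        persist_list(dest, obj)
    elif type(obj) is dict:
        persist_dict(dest, obj)
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

def persist_list(dest, obj):
    sep = ''
    dest.write('[')
    for element in obj:
        dest.write(sep)
        sep = ', '
        persist_object(dest, element)
    dest.write(']')

def persist_dict(dest, obj):
    sep = ''
    dest.write('{')
    for key, value in obj.items():
        dest.write(sep)
        sep = ', '
        persist_object(dest, key)
        dest.write(': ')
        persist_object(dest, value)
    dest.write('}')

def persist_to_string(obj):
    """Hypothetical convenience wrapper: persist into an in-memory buffer."""
    dest = io.StringIO()
    persist_object(dest, obj)
    return dest.getvalue()

print(persist_to_string([11, [12, [13]], 'fourteen', {15: [16]}]))
```

Anything outside the handled types (a set, a function, an open file) raises `ValueError`, matching the "ignore user-defined types and non-data" simplification.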
Storing Type Information
This code produces something that looks like Python's own representation
Which is complicated to parse
Types implied by formatting
[] for lists, decimal point for floats, etc.
Reading data back in much simpler if types are explicit
Which they will have to be anyway if we want to handle user-defined classes
Solution: store atomic values as type name plus value
Store collections as type name plus length, then values
Persisting Atomic Values
Note: indent the output to make it more readable
Types = {
    IntType   : 'int',
    FloatType : 'float',
    ...etc...
}

AtomicTypes = {
    ...as before...
}

def persistObject(dest, obj, depth=0):
    t = type(obj)
    if t in AtomicTypes:
        persistAtomic(dest, obj, depth)
    elif t is ListType:
        persistList(dest, obj, depth)
    elif t is DictType:
        persistDict(dest, obj, depth)
    else:
        raise ValueError("Bad type: " + `t`)
Persisting Atomic Values and Lists
Code for dictionaries is similar
def persistAtomic(dest, obj, depth):
    for x in [' ' * depth, Types[type(obj)], ' ', `obj`, '\n']:
        dest.write(x)

def persistList(dest, obj, depth):
    for x in [' ' * depth, Types[type(obj)], ' ', `len(obj)`, '\n']:
        dest.write(x)
    for element in obj:
        persistObject(dest, element, depth+1)
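A Python 3 sketch of the typed, indented writer, including the dictionary case that the slides describe only as "similar". The dictionary layout used here (entry count, then alternating key and value) is an assumption, as are the snake_case names and the `write_header` helper.

```python
import io

TYPE_NAMES = {int: 'int', float: 'float', str: 'string',
              list: 'list', dict: 'dict'}
ATOMIC_TYPES = (int, float, str)

def write_header(dest, obj, depth, detail):
    """Hypothetical helper: one line of indentation, type name, detail."""
    dest.write(' ' * depth + TYPE_NAMES[type(obj)] + ' ' + detail + '\n')

def persist_object(dest, obj, depth=0):
    t = type(obj)
    if t in ATOMIC_TYPES:
        persist_atomic(dest, obj, depth)
    elif t is list:
        persist_list(dest, obj, depth)
    elif t is dict:
        persist_dict(dest, obj, depth)
    else:
        raise ValueError("Bad type: " + repr(t))

def persist_atomic(dest, obj, depth):
    write_header(dest, obj, depth, repr(obj))

def persist_list(dest, obj, depth):
    write_header(dest, obj, depth, repr(len(obj)))
    for element in obj:
        persist_object(dest, element, depth + 1)

def persist_dict(dest, obj, depth):
    # Assumed layout: number of entries, then key, value, key, value, ...
    write_header(dest, obj, depth, repr(len(obj)))
    for key, value in obj.items():
        persist_object(dest, key, depth + 1)
        persist_object(dest, value, depth + 1)

dest = io.StringIO()
persist_object(dest, [11, [12], {15: [16]}])
print(dest.getvalue(), end='')
```

Because every collection states its length up front, a reader never has to look for a closing bracket.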
Sample Output
python : [11, [12, [13]], 'fourteen', {15: [16]}]
persist :
list 4
 int 11
 list 2
  int 12
  list 1
   int 13
 string 'fourteen'
 dict 1
  int 15
  list 1
   int 16
Reading Data Back In
Reader is slightly more complicated
But only slightly
First token on each line explains how to handle rest of data
Parse atomic values according to type
Read count for collections, then read that many items
Again, write one handler function for each data type
Makes code easy to extend
Alternative: write a class, and have users extend it with new methods
Main Handler
Handlers = {
    'int'   : intHandler,
    'float' : floatHandler,
    'list'  : listHandler,
    ...etc...
}

def reloadObject(src):
    line = src.readline()
    if not line:
        return None
    line = line.lstrip() # remove indentation
    # split once only, so string values containing spaces stay in one piece
    typeName, value = line.split(' ', 1)
    handler = Handlers[typeName]
    return handler(src, value.rstrip())
Handlers
def intHandler(src, value):
    return int(value)

def stringHandler(src, value):
    assert len(value) >= 2
    return value[1:-1] # remove the quotes!

def listHandler(src, value):
    num = int(value)
    result = []
    for i in range(num):
        result.append(reloadObject(src))
    return result
Dictionaries left as an exercise for the reader
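One possible answer to the exercise, as a Python 3 sketch. It assumes dictionaries were written as an entry count followed by alternating key and value (consistent with the sample output), and that strings contain no escaped quotes. The snake_case names and the `TEXT` sample are illustrative.

```python
import io

def reload_object(src):
    line = src.readline()
    if not line:
        return None
    # Split once only, so string values containing spaces stay intact.
    type_name, value = line.lstrip().split(' ', 1)
    return HANDLERS[type_name](src, value.rstrip())

def int_handler(src, value):
    return int(value)

def float_handler(src, value):
    return float(value)

def string_handler(src, value):
    assert len(value) >= 2
    return value[1:-1]              # strip the surrounding quotes

def list_handler(src, value):
    return [reload_object(src) for _ in range(int(value))]

def dict_handler(src, value):
    result = {}
    for _ in range(int(value)):
        key = reload_object(src)    # read the key first, then its value
        result[key] = reload_object(src)
    return result

HANDLERS = {
    'int': int_handler,
    'float': float_handler,
    'string': string_handler,
    'list': list_handler,
    'dict': dict_handler,
}

TEXT = """\
list 4
 int 11
 list 2
  int 12
  list 1
   int 13
 string 'fourteen'
 dict 1
  int 15
  list 1
   int 16
"""

restored = reload_object(io.StringIO(TEXT))
print(restored)
```

The key is read in its own statement because in `d[f()] = g()` Python evaluates the right-hand side first, which would swap keys and values.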
What About Circularity?
What if an object can be reached in two (or more) ways?
Could just store redundant information when writing
But then reading wouldn't re-create the original data structure
And writing will recurse infinitely if the graph is circular
Solution: give objects IDs
Assign each object a unique ID when it is first seen
Store in a temporary dictionary
If an object has already been seen, just print its ID again
Format is now:
New object: ID, type, value or count
Known object: ID
Constructing IDs
Could use some objects directly as keys into the "already seen" dictionary
But mutable objects (e.g. lists and dictionaries) can't be keys in Python
Python's built-in id() function returns a unique ID for every object
E.g. its address in memory
Can distinguish between occurrences of equal (but not identical) values
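A short Python 3 illustration of these points (variable names are illustrative): equal lists have different IDs, aliases share one ID, and a list cannot be a dictionary key while its ID can.

```python
a = [1, 2]
b = [1, 2]      # equal to a, but a different object
c = a           # another name for the same object

print(a == b, a is b)        # equal values, distinct objects
print(id(a) == id(c))        # same object, same ID

try:
    seen = {a: True}         # lists are mutable, hence unhashable
except TypeError as err:
    print("cannot use a list as a key:", err)

seen = {id(a): a}            # key on the ID instead
print(id(c) in seen, id(b) in seen)
```

Storing the object itself as the dictionary value keeps it alive, so its ID cannot be reused by another object while the dictionary exists.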
Main Decision Function
def persistObject(dest, obj, memo):
    if id(obj) in memo:
        dest.write(`id(obj)` + '\n')
    else:
        memo[id(obj)] = obj # before recursing!
        if type(obj) in AtomicTypes:
            persistAtomic(dest, obj, memo)
        elif type(obj) is ListType:
            persistList(dest, obj, memo)
        elif type(obj) is DictType:
            persistDict(dest, obj, memo)
        else:
            raise ValueError("Bad type: " + `type(obj)`)

def persistList(dest, obj, memo):
    for x in [`id(obj)`, ' ', 'list', ' ', `len(obj)`, '\n']:
        dest.write(x)
    for element in obj:
        persistObject(dest, element, memo)
Sample Output
# x = []
# x.append(x)
8820504 list 1
8820504
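The slides show only the writing side of the ID scheme. The following Python 3 sketch pairs a writer with a matching reader to show that sharing and cycles can be rebuilt. The reader is an assumption of this example: it keeps a `seen` dictionary from stored ID to rebuilt object, and registers each list before reading its elements. Only ints, floats, strings, and lists are handled, and atomic values are parsed with `ast.literal_eval`.

```python
import ast
import io

ATOMIC_NAMES = {int: 'int', float: 'float', str: 'string'}

def persist_object(dest, obj, memo):
    key = id(obj)
    if key in memo:
        dest.write(f"{key}\n")            # known object: ID only
        return
    memo[key] = obj                       # record before recursing!
    if type(obj) in ATOMIC_NAMES:
        dest.write(f"{key} {ATOMIC_NAMES[type(obj)]} {obj!r}\n")
    elif type(obj) is list:
        dest.write(f"{key} list {len(obj)}\n")
        for element in obj:
            persist_object(dest, element, memo)
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

def reload_object(src, seen):
    fields = src.readline().rstrip('\n').split(' ', 2)
    obj_id = fields[0]
    if len(fields) == 1:                  # ID only: already rebuilt
        return seen[obj_id]
    type_name, value = fields[1], fields[2]
    if type_name == 'list':
        result = []
        seen[obj_id] = result             # register before recursing!
        for _ in range(int(value)):
            result.append(reload_object(src, seen))
    else:
        result = ast.literal_eval(value)  # int, float, or quoted string
        seen[obj_id] = result
    return result

x = []
x.append(x)                               # a list that contains itself
shared = [1, 'two']
data = [x, shared, shared]

dest = io.StringIO()
persist_object(dest, data, {})
restored = reload_object(io.StringIO(dest.getvalue()), {})

print(restored[0][0] is restored[0])      # cycle preserved
print(restored[1] is restored[2])         # sharing preserved
```

Registering the empty list in `seen` before filling it mirrors the "before recursing!" step in the writer; without it, a self-reference would be looked up before it exists.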
Handling Defined Types
How to handle types that aren't built in?
Must allow users to control what is persisted
No point persisting an open file handle
Three options:
Register a handler
Require classes to implement an interface
Use introspection
Option 1: Handlers
class C:
    ...my class...

def C_Handler(dest, obj):
    ...write instance of C to dest...

Handlers[C] = C_Handler

def persistObject(dest, obj):
    if type(obj) is InstanceType:
        Handlers[obj.__class__](dest, obj)
    ...handle built-in types...
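A runnable Python 3 sketch of the handler-registry idea. Python 3 has no `InstanceType`, so this version simply looks up `type(obj)` in the registry. The `Point` class, its handler, and the one-line output format are all hypothetical.

```python
import io

HANDLERS = {}    # maps a class to the function that writes its instances

class Point:     # hypothetical user-defined class
    def __init__(self, x, y):
        self.x = x
        self.y = y

def point_handler(dest, obj):
    # The class author decides exactly what gets persisted.
    dest.write(f"Point {obj.x!r} {obj.y!r}\n")

HANDLERS[Point] = point_handler

def persist_object(dest, obj):
    handler = HANDLERS.get(type(obj))
    if handler is not None:
        handler(dest, obj)
    elif type(obj) in (int, float, str):
        dest.write(repr(obj) + '\n')
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

dest = io.StringIO()
persist_object(dest, Point(3, 4))
print(dest.getvalue(), end='')
```

The framework stays unchanged when a new class appears; only the registry grows.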
Option 2: Implement an Interface
class C:
    ...body of class...
    def persist(self, dest, memo):
        dest.write(class identifier)
        dest.write(unique object ID)
        for each fragment of state:
            if type(fragment) in Persist.Builtin:
                Persist.persist(fragment)
            elif id(fragment) in memo:
                Persist.writeMemo(dest)
            else:
                fragment.persist(dest, memo)
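A Python 3 sketch of the interface approach, under several assumptions: the `Node` class, the `persist_fragment` helper (standing in for the `Persist` module in the pseudocode), and the `ref` line for already-seen objects are all invented for illustration.

```python
import io

BUILTIN = (int, float, str)

def persist_fragment(dest, fragment, memo):
    """Hypothetical stand-in for the framework's part of the work."""
    if type(fragment) in BUILTIN:
        dest.write(f" {type(fragment).__name__} {fragment!r}\n")
    elif id(fragment) in memo:
        dest.write(f" ref {id(fragment)}\n")      # already written
    else:
        fragment.persist(dest, memo)              # object writes itself

class Node:
    def __init__(self, label, peer=None):
        self.label = label
        self.peer = peer

    def persist(self, dest, memo):
        memo[id(self)] = self                     # before recursing
        dest.write(f"Node {id(self)}\n")
        persist_fragment(dest, self.label, memo)
        if self.peer is not None:
            persist_fragment(dest, self.peer, memo)

a = Node('a')
b = Node('b', a)
a.peer = b                                        # a and b refer to each other

dest = io.StringIO()
a.persist(dest, {})
print(dest.getvalue(), end='')
```

Each class chooses which fragments of its state to write, which is how it can leave out things like open file handles.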
Option 3: Introspection
def persistObject(dest, obj):
    if type(obj) is InstanceType:
        c = obj.__class__
        persistObject(dest, c.__name__)
        members = obj.__dict__
        for (name, value) in members.items():
            persistObject(dest, name)
            persistObject(dest, value)
    ...handle built-in types...
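A Python 3 sketch of the introspection approach. It uses `vars(obj)` to read the instance dictionary and treats anything that has a `__dict__` as a user-defined instance, which is a simplifying assumption (functions and modules also have one). The `Point` class and the output layout are illustrative.

```python
import io

def persist_object(dest, obj, depth=0):
    pad = ' ' * depth
    if type(obj) in (int, float, str):
        dest.write(f"{pad}{type(obj).__name__} {obj!r}\n")
    elif hasattr(obj, '__dict__'):
        # Assumed layout: class name, member count, then name/value pairs.
        members = vars(obj)
        dest.write(f"{pad}{type(obj).__name__} {len(members)}\n")
        for name, value in members.items():
            persist_object(dest, name, depth + 1)
            persist_object(dest, value, depth + 1)
    else:
        raise ValueError("Bad type: " + repr(type(obj)))

class Point:                      # hypothetical user-defined class
    def __init__(self, x, y):
        self.x = x
        self.y = y

dest = io.StringIO()
persist_object(dest, Point(3, 4.5))
print(dest.getvalue(), end='')
```

The class needs no persistence code of its own, at the cost of persisting every attribute whether or not that makes sense.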
Python's pickle Module
Handles (just about) everything
Uses introspection
cPickle module is up to a thousand times faster
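A small Python 3 example of `pickle` handling shared and circular references (the variable names are illustrative). In Python 3 there is no separate `cPickle` module; `pickle` uses its C implementation automatically when it is available.

```python
import datetime
import pickle

shared = [1, 2]
loop = []
loop.append(loop)                          # a list that contains itself

data = {
    'pair': [shared, shared],              # two references to one list
    'loop': loop,
    'when': datetime.date(2004, 1, 4),     # an instance of a library class
}

blob = pickle.dumps(data)                  # bytes, not human-readable
restored = pickle.loads(blob)

print(restored['pair'][0] is restored['pair'][1])   # sharing preserved
print(restored['loop'][0] is restored['loop'])      # cycle preserved
print(restored['when'] == data['when'])
```

Only unpickle data from a trusted source: loading a pickle can execute arbitrary code.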
Persistence in Java
Concepts are the same no matter what the language
Implementation details:
Different containers
(Possibly) different set of primitive values
No exact equivalent to id() in Java (System.identityHashCode() is not guaranteed to be unique)
Must come up with a numbering scheme
Possible (and often useful) to write multi-language persistence framework
Persist from Java, reload in Python, or vice versa
Have to think hard about which structures to map to which
You will deal with these issues in your database courses
$Id: persist.html,v 1.1.1.1 2004/01/04 05:02:31 reid Exp $