CSC207 Software Design
Lectures
The Document Object Model

What It Is

Document Object Model

Cross-language API for representing XML documents as trees

Easier to manipulate than strings or streams

But may require a lot of memory for large documents

Several implementations in Java

This course uses org.jdom

Not "official", but easiest to use

Tree Structure

[DOM Tree Structure]

The Same DOM Tree Presented Differently

[DOM Tree Structure Alternate Rep]

Rules

Every document's root is an object of type Document

This has a single child of type Element

The root element of the document

Its children may be:

Other elements

Text objects

Other things that we won't worry about

Note: whitespace is preserved

Like the carriage returns in the previous slide

But comments aren't
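
The whitespace rule is easy to check by hand. The sketch below uses Python's standard xml.dom.minidom rather than JDOM (an assumption made so the example needs nothing beyond the standard library), and the helper name describe_children is hypothetical. It lists the children of the root element and shows that the line breaks between tags turn up as text nodes.

```python
import xml.dom.minidom

# Same shape as the small documents used in these slides.
SAMPLE = "<html>\n<p>Just a paragraph.</p>\n</html>"

def describe_children(xml_text):
    """Return one short description per child of the root element."""
    doc = xml.dom.minidom.parseString(xml_text)   # the Document object
    root = doc.documentElement                    # its single root element
    kinds = []
    for child in root.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            kinds.append("element " + child.nodeName)
        elif child.nodeType == child.TEXT_NODE:
            # repr() makes whitespace-only text visible
            kinds.append("text " + repr(child.data))
    return kinds

if __name__ == "__main__":
    for kind in describe_children(SAMPLE):
        print(kind)
```

The two whitespace-only text nodes on either side of the p element are the line breaks in the source.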

Show Top-Level Elements

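import java.io.File;
import java.util.Iterator;

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public static void main(String[] args) {
    for (int i=0; i<args.length; ++i) {
        try {
            // Build document tree from the named file
            SAXBuilder builder = new SAXBuilder();
            Document doc = builder.build(new File(args[i]));

            // Show top-level elements (element children only, no text)
            Element root = doc.getRootElement();
            Iterator ic = root.getChildren().iterator();
            while (ic.hasNext()) {
                Element elt = (Element)ic.next();
                System.out.println(elt.getName());
            }
        }
        catch (Exception e) {
            // Report the problem and move on to the next file
            System.err.println(e);
        }
    }
}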

What's In This Program

Use SAXBuilder, which drives a SAX parser, to read the file and create the DOM tree

We would explore it if we had time

Get the root element from the document

Iterate through its children

getChildren() returns a List of Element children

Use getContent() to get all children (including text and others)

getName() returns the tag of an element

Input and Output

[DOM Example]

Showing Structure Recursively

public static void descend(Element current, int depth) {
    for (int i=0; i<depth; ++i) {
        System.out.print(" ");
    }
    System.out.println(current.getName());
    Iterator ic = current.getChildren().iterator();
    while (ic.hasNext()) {
        descend((Element)ic.next(), depth+1);
    }
}

Recursive Structure

<?xml version="1.0" ?>
<person userid="bwk">
<surname>Kernighan</surname>
<forename>Brian W.</forename>
<books>
<book isbn="020161586X">The Practice
 of Programming</book>
<book isbn="020103669X">Software
 Tools</book>
</books>
</person>

person
 surname
 forename
 books
  book
  book

The Visitor Pattern

Often want to operate on a tree recursively

Count elements, search for text that matches a pattern, etc.

The mechanics of recursing through the tree are the same every time

So build a generic visitor that knows how to traverse the tree

Give it do-nothing methods that are invoked at specific times during traversal

Users derive from this class and override the methods they're interested in
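
As a rough illustration of these points, here is a minimal sketch of the same idea in Python on top of xml.dom.minidom. The class and method names (DomVisitor, ElementCounter, pre_root, at_element and so on) are chosen for this example only. The base class owns the traversal and the subclass overrides just the hooks it cares about.

```python
import xml.dom.minidom

class DomVisitor:
    """Generic traversal with do-nothing hooks for subclasses to override."""

    def visit(self, root):
        self.depth = 0
        self.pre_root(root)
        self.at_element(root)
        self._recurse(root)
        self.post_root(root)

    # Hooks: deliberately empty in the base class.
    def pre_root(self, root):
        pass

    def post_root(self, root):
        pass

    def at_element(self, elt):
        pass

    def at_text(self, text):
        pass

    def _recurse(self, elt):
        """Walk the children of elt, calling the hooks along the way."""
        self.depth += 1
        for node in elt.childNodes:
            if node.nodeType == node.ELEMENT_NODE:
                self.at_element(node)
                self._recurse(node)
            elif node.nodeType == node.TEXT_NODE:
                self.at_text(node)
        self.depth -= 1

class ElementCounter(DomVisitor):
    """Example user class: counts elements and ignores everything else."""

    def pre_root(self, root):
        self.count = 0

    def at_element(self, elt):
        self.count += 1

def count_elements(xml_text):
    doc = xml.dom.minidom.parseString(xml_text)
    counter = ElementCounter()
    counter.visit(doc.documentElement)
    return counter.count

if __name__ == "__main__":
    print(count_elements("<books><book/><book/></books>"))
```

ElementCounter contains no traversal code at all, which is the point of the pattern.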

A DOM Visitor

public abstract class DomVisitor {

    public DomVisitor()
    {}

    public void visit(Element root) {
        fDepth = 0;
        preRoot(root);
        atElement(root);
        recurse(root);
        postRoot(root);
    }

    protected void preRoot(Element root)
    {}

    protected void postRoot(Element root)
    {}

    protected void atElement(Element elt)
    {}

    protected void atText(Text text)
    {}

    ...implementation...
}

DOM Visitor Internals

public abstract class DomVisitor {

    ...interface...

    protected void recurse(Element elt) {
        fDepth += 1;
        Iterator ic = elt.getContent().iterator();
        while (ic.hasNext()) {
            Object node = ic.next();
            if (node instanceof Element) {
                Element child = (Element)node;
                atElement(child);
                recurse(child);
            }
            else if (node instanceof Text) {
                atText((Text)node);
            }
        }
        fDepth -= 1;
    }

    protected int       fDepth;
}

Tracing the Visitor's Execution

public class TracingVisitor extends DomVisitor {

    public TracingVisitor(PrintStream out) {
        fOut = out;
    }

    protected void preRoot(Element root) {
        fOut.println(indent() + "preRoot");
    }

    protected void postRoot(Element root) {
        fOut.println(indent() + "postRoot");
    }

    protected void atElement(Element elt) {
        fOut.println(indent() + "atElement " + elt.getName());
    }

    protected void atText(Text text) {
        fOut.println(indent() + "atText");
    }

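    protected String indent() {
        // Build a string of fDepth spaces
        StringBuilder spaces = new StringBuilder();
        for (int i = 0; i < fDepth; ++i) {
            spaces.append(' ');
        }
        return spaces.toString();
    }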

    protected PrintStream fOut;
}

A Typical Trace

<?xml version="1.0" ?>
<html>
<p>Just a paragraph.</p>
</html>

preRoot
atElement html
 atText
 atElement p
  atText
 atText
postRoot

Attributes

Elements can have attributes of the form name="value"

Any given attribute can appear at most once in an element

Some attributes are mandatory, others optional

Value must always be quoted

Even though old HTML parsers didn't require it

Access attributes using:

Attribute elt.getAttribute(String name)

List elt.getAttributes()
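
The rules above can be tried out with a short sketch. It uses Python's xml.dom.minidom instead of JDOM (an assumption, to keep the example standard-library only), and the helper names attribute_map and parses are hypothetical. The first helper reads a root element's attributes. The second shows that the parser rejects a repeated attribute or an unquoted value.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def attribute_map(xml_text):
    """Return the root element's attributes as a plain dict."""
    root = xml.dom.minidom.parseString(xml_text).documentElement
    attrs = root.attributes              # a NamedNodeMap
    result = {}
    for i in range(attrs.length):
        attr = attrs.item(i)
        result[attr.name] = attr.value
    return result

def parses(xml_text):
    """Return True if the text is accepted by the parser."""
    try:
        xml.dom.minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False

if __name__ == "__main__":
    print(attribute_map('<person userid="bwk" role="author"/>'))
    print(parses('<p align="left" align="right"/>'))   # repeated attribute
    print(parses('<p align=left/>'))                   # unquoted value
```

A single attribute can also be fetched by name with getAttribute("userid") on the element.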

Building an Attribute Inventory

Want to find out which attributes can appear with which elements

Create a DOM visitor that inspects each element's attributes

Result is a map in which

Keys are element names (e.g. "h1")

Values are sets of attribute names (e.g. "align")

Do not record the attribute values

Exercise: extend this visitor to inventory them as well
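
To make the shape of the result concrete, here is a minimal Python sketch of such an inventory, written as a plain recursive function over xml.dom.minidom nodes rather than as a visitor subclass. That is a simplification for this example, and the function name inventory is hypothetical. It builds the map from element names to sets of attribute names and ignores the attribute values.

```python
import xml.dom.minidom

def inventory(elt, seen=None):
    """Map each element name to the set of attribute names seen on it."""
    if seen is None:
        seen = {}
    # Create the set on first sight of this element name.
    names = seen.setdefault(elt.tagName, set())
    attrs = elt.attributes
    for i in range(attrs.length):
        names.add(attrs.item(i).name)      # record the name, not the value
    for child in elt.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            inventory(child, seen)
    return seen

if __name__ == "__main__":
    text = '<doc><p align="left" role="lead">First.</p><p align="center">Second.</p></doc>'
    root = xml.dom.minidom.parseString(text).documentElement
    for name, attrs in inventory(root).items():
        print(name, sorted(attrs))
```

An element with no attributes still gets an entry, with an empty set.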

The Inventory Visitor

public class Inventory extends DomVisitor {
    public Inventory() {
        fSeen = new HashMap();
    }

    protected void preRoot(Element root) {
        fSeen.clear();
    }

    protected void atElement(Element elt) {
        String eltName = elt.getName();
        Set seen = (Set)fSeen.get(eltName);
        if (seen == null) {
            seen = new HashSet();
            fSeen.put(eltName, seen);
        }
        Iterator ia = elt.getAttributes().iterator();
        while (ia.hasNext()) {
            String attrName = ((Attribute)ia.next()).getName();
            seen.add(attrName);
        }
    }

    protected Map       fSeen;
}

Input and Output

<doc>
<p align="left"
   role="lead">First.</p>
<p align="center">Second.</p>
<p align="right"
   font="em">Third.</p>
</doc>

doc
p
        align
        role
        font

Trimming the Tree

Can add or remove nodes in DOM tree

Be careful about deleting items in a list while iterating over that list

Like cutting the branch you are standing on

Pattern: delete or move on

When an item is deleted, the items after it shift down by one index

So on each pass through the loop, either delete the current item or increment the loop index, never both
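
The "delete or move on" pattern can be sketched in Python as follows. The example assumes xml.dom.minidom, and the function name strip_whitespace_text is hypothetical. The index advances only when the current child is kept, because a removal has already moved the next child into position i.

```python
import xml.dom.minidom

def strip_whitespace_text(elt):
    """Remove whitespace-only text nodes below elt, in place."""
    i = 0
    while i < len(elt.childNodes):
        node = elt.childNodes[i]
        if node.nodeType == node.TEXT_NODE and node.data.strip() == "":
            # Delete: the following children shift down, so leave i alone.
            elt.removeChild(node)
        else:
            # Move on: descend into kept elements, then advance.
            if node.nodeType == node.ELEMENT_NODE:
                strip_whitespace_text(node)
            i += 1

if __name__ == "__main__":
    doc = xml.dom.minidom.parseString("<html>\n<p>Just a paragraph.</p>\n</html>")
    strip_whitespace_text(doc.documentElement)
    print(doc.documentElement.toxml())
```

Incrementing i after a removal as well would skip the child that had just shifted into the vacated slot.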

Removing Whitespace-Only Text

protected void atElement(Element elt) {
    List content = elt.getContent();
    int i = 0;
    while (i < content.size()) {
        Object node = content.get(i);
        boolean keep = true;
        if (node instanceof Text) {
            Text text = (Text)node;
            if (text.getText().trim().length() == 0) {
                keep = false;
            }
        }
        if (keep) {
            i += 1;
        }
        else {
            content.remove(i);
        }
    }
}

Python

Like JDOM, Python's DOM library is derived from the W3C standard

Uses idiomatic Python instead of trying to be 100% compatible with standard

In fact, Python has two DOM libraries

minidom (xml.dom.minidom) doesn't implement everything in the standard

But it's fast

import sys, xml.dom.minidom

def showTree(node, indent=0):
    print('  ' * indent + node.nodeName)
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            showTree(child, indent+1)

for filename in sys.argv[1:]:
    doc = xml.dom.minidom.parse(filename)
    root = doc.documentElement
    showTree(root)

Another Way to Handle XML

Both Python and Java have another way to manipulate XML called SAX

The Simple API for XML (a streaming, event-driven interface)

Instead of creating a tree in memory, it calls methods each time the parser finds something interesting

Start of element

Block of text

End of element

Errors

Neither better nor worse than DOM

Needs less memory, since only a fraction of the document is stored at a time

Users have to keep track of context themselves
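
What "keeping track of context" means in practice can be seen in a small sketch. The handler below is hypothetical (the names TextCollector and collect_text are invented here) and uses Python's standard xml.sax. Because the parser only reports events, the handler itself has to remember whether it is currently inside an element of interest so that it knows which text to keep.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Collect the text inside every element with a given name."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.found = []        # text of each completed target element
        self.buffer = None     # None means "not inside a target element"

    def startElement(self, name, attrs):
        if name == self.target:
            self.buffer = []   # entering a target: start collecting

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.target and self.buffer is not None:
            self.found.append("".join(self.buffer))
            self.buffer = None  # leaving the target: stop collecting

def collect_text(xml_text, target):
    handler = TextCollector(target)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.found

if __name__ == "__main__":
    text = "<books><book>Software Tools</book><book>The Practice of Programming</book></books>"
    print(collect_text(text, "book"))
```

This sketch assumes target elements are not nested inside one another. Handling nesting would need a stack instead of a single buffer.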

from xml.sax import parse, ContentHandler

class Handler(ContentHandler):

    def __init__(self):
        ContentHandler.__init__(self)
        self.depth = 0

    def startElement(self, name, attrs):
        # Print the tag name and its attributes on one indented line
        line = '  ' * self.depth + name
        for (key, value) in attrs.items():
            line += '  ' + key + '=' + value
        print(line)
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1

if __name__ == "__main__" :
    import sys
    for filename in sys.argv[1:]:
        # Binary mode lets the parser apply the file's declared encoding
        with open(filename, "rb") as source:
            handler = Handler()
            parse(source, handler)

$Id: dom.html,v 1.1.1.1 2004/01/04 05:02:31 reid Exp $