What It Is
Document Object Model
Cross-language API for representing XML documents as trees
Easier to manipulate than strings or streams
But may require a lot of memory for large documents
Several implementations in Java
This course uses org.jdom
Not "official", but easiest to use
Tree Structure
The Same DOM Tree Presented Differently
Rules
Every document's root is an object of type Document
This has a single child of type Element
The root element of the document
Its children may be:
Other elements
Text objects
Other things that we won't worry about
Note: whitespace is preserved
Like the carriage returns in the previous slide
But comments aren't
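A minimal sketch of these rules, using Python's standard xml.dom.minidom (the library covered in the Python section of these notes) rather than JDOM, so class and method names differ from the Java code. The sample document and the `describe` helper are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document; note the newlines between the tags.
SAMPLE = "<person>\n<surname>Kernighan</surname>\n<forename>Brian W.</forename>\n</person>"

def describe(xml_text):
    """Return (root tag, descriptions of the root element's children)."""
    doc = minidom.parseString(xml_text)   # the Document at the top of the tree
    root = doc.documentElement            # its single root Element
    children = []
    for child in root.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            children.append("element " + child.tagName)
        elif child.nodeType == child.TEXT_NODE:
            # Whitespace between tags shows up as its own text node
            children.append("text " + repr(child.data))
    return root.tagName, children

if __name__ == "__main__":
    tag, children = describe(SAMPLE)
    print(tag)
    for item in children:
        print("  " + item)
```

The two elements come back interleaved with three whitespace-only text nodes, which is why later slides bother to filter or trim text.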
Show Top-Level Elements
public static void main(String[] args) {
  for (int i=0; i<args.length; ++i) {
    try {
      // Build document tree
      SAXBuilder builder = new SAXBuilder();
      Document doc = builder.build(args[i]);
      // Show top-level elements
      Element root = doc.getRootElement();
      Iterator ic = root.getChildren().iterator();
      while (ic.hasNext()) {
        Element elt = (Element)ic.next();
        System.out.println(elt.getName());
      }
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }
}
What's In This Program
Use SAXBuilder, which relies on a package called SAX, to read the file and create the DOM tree
We would explore it if we had time
Get the root element from the document
Iterate through its children
getChildren() returns a List of Element children
Use getContent() to get all children (including text and others)
getName() returns the tag of an element
Input and Output
Showing Structure Recursively
public static void descend(Element current, int depth) {
  // Indent one space per level of nesting
  for (int i=0; i<depth; ++i) {
    System.out.print(" ");
  }
  System.out.println(current.getName());
  // Recurse into child elements only (text is skipped)
  Iterator ic = current.getChildren().iterator();
  while (ic.hasNext()) {
    descend((Element)ic.next(), depth+1);
  }
}
Recursive Structure
Input:
<?xml version="1.0" ?>
<person userid="bwk">
  <surname>Kernighan</surname>
  <forename>Brian W.</forename>
  <books>
    <book isbn="020161586X">The Practice of Programming</book>
    <book isbn="020103669X">Software Tools</book>
  </books>
</person>
Output:
person
 surname
 forename
 books
  book
  book
The Visitor Pattern
Often want to operate on a tree recursively
Count elements, search for text that matches a pattern, etc.
The mechanics of recursing through the tree are the same every time
So build a generic visitor that knows how to traverse the tree
Give it do-nothing methods that are invoked at specific times during traversal
Users derive from this class and override the methods they're interested in
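A minimal sketch of the idea in Python, using the standard xml.dom.minidom instead of JDOM. The names `Visitor`, `ElementCounter`, `at_element`, `at_text`, and `count_elements` are invented for this illustration; this is not the DomVisitor class developed on the following slides.

```python
from xml.dom import minidom

class Visitor:
    """Generic traversal; subclasses override only the hooks they need."""

    def visit(self, element):
        self.at_element(element)
        for child in element.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.visit(child)
            elif child.nodeType == child.TEXT_NODE:
                self.at_text(child)

    def at_element(self, element):
        pass  # do-nothing hook

    def at_text(self, text):
        pass  # do-nothing hook

class ElementCounter(Visitor):
    """Counts elements by overriding a single hook."""

    def __init__(self):
        self.count = 0

    def at_element(self, element):
        self.count += 1

def count_elements(xml_text):
    """Parse xml_text and return how many elements it contains."""
    counter = ElementCounter()
    counter.visit(minidom.parseString(xml_text).documentElement)
    return counter.count
```

The traversal logic lives in one place; a search or a statistics pass would be another small subclass with different hooks overridden.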
A DOM Visitor
public abstract class DomVisitor {
  public DomVisitor()
  {}
  public void visit(Element root) {
    fDepth = 0;
    preRoot(root);
    atElement(root);
    recurse(root);
    postRoot(root);
  }
  protected void preRoot(Element root)
  {}
  protected void postRoot(Element root)
  {}
  protected void atElement(Element elt)
  {}
  protected void atText(Text text)
  {}
  ...implementation...
}
DOM Visitor Internals
public abstract class DomVisitor {
  ...interface...
  protected void recurse(Element elt) {
    fDepth += 1;
    Iterator ic = elt.getContent().iterator();
    while (ic.hasNext()) {
      Object node = ic.next();
      if (node instanceof Element) {
        Element child = (Element)node;
        atElement(child);
        recurse(child);
      }
      else if (node instanceof Text) {
        atText((Text)node);
      }
    }
    fDepth -= 1;
  }
  protected int fDepth;
}
Tracing the Visitor's Execution
public class TracingVisitor extends DomVisitor {
  public TracingVisitor(PrintStream out) {
    fOut = out;
  }
  protected void preRoot(Element root) {
    fOut.println(indent() + "preRoot");
  }
  protected void postRoot(Element root) {
    fOut.println(indent() + "postRoot");
  }
  protected void atElement(Element elt) {
    fOut.println(indent() + "atElement " + elt.getName());
  }
  protected void atText(Text text) {
    fOut.println(indent() + "atText");
  }
  protected String indent() {
    ...return string of fDepth spaces...
  }
  protected PrintStream fOut;
}
A Typical Trace
Document:
<?xml version="1.0" ?>
<html>
  <p>Just a paragraph.</p>
</html>
Trace:
preRoot
atElement html
 atText
 atElement p
  atText
 atText
postRoot
Attributes
Elements can have attributes of the form name="value"
Any given attribute can appear at most once
Some attributes are mandatory, others optional
Value must always be quoted
Even though old HTML parsers didn't require it
Access attributes using:
Attribute elt.getAttribute(String name)
List elt.getAttributes()
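For comparison, a small sketch of attribute access in Python's standard xml.dom.minidom, whose calls differ from the JDOM methods listed above. The helper names `attributes_of` and `one_attribute` and the sample element are assumptions made for this example.

```python
from xml.dom import minidom

def attributes_of(xml_text):
    """Return the root element's attributes as a dict of name -> value."""
    root = minidom.parseString(xml_text).documentElement
    return dict(root.attributes.items())

def one_attribute(xml_text, name):
    """Return one attribute's value, or None if the element lacks it."""
    root = minidom.parseString(xml_text).documentElement
    # minidom's getAttribute returns "" for a missing attribute,
    # so check with hasAttribute first.
    return root.getAttribute(name) if root.hasAttribute(name) else None

if __name__ == "__main__":
    sample = '<book isbn="020161586X">The Practice of Programming</book>'
    print(attributes_of(sample))
    print(one_attribute(sample, "isbn"))
```

Because an attribute can appear at most once per element, a plain dictionary is enough to hold them.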
Building an Attribute Inventory
Want to find out which attributes can appear with which elements
Create a DOM visitor that inspects each element's attributes
Result is a map in which
Keys are element names (e.g. "h1")
Values are sets of attribute names (e.g. "align")
Do not record the attribute values
Exercise: extend this visitor to inventory them as well
The Inventory Visitor
public class Inventory extends DomVisitor {
  public Inventory() {
    fSeen = new HashMap();
  }
  protected void preRoot(Element root) {
    fSeen.clear();
  }
  protected void atElement(Element elt) {
    String eltName = elt.getName();
    Set seen = (Set)fSeen.get(eltName);
    if (seen == null) {
      seen = new HashSet();
      fSeen.put(eltName, seen);
    }
    Iterator ia = elt.getAttributes().iterator();
    while (ia.hasNext()) {
      String attrName = ((Attribute)ia.next()).getName();
      seen.add(attrName);
    }
  }
  protected Map fSeen;
}
Input and Output
Input:
<doc>
  <p align="left" role="lead">First.</p>
  <p align="center">Second.</p>
  <p align="right" font="em">Third.</p>
</doc>
Output:
doc
p
  align
  role
  font
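The same inventory can be sketched in Python with the standard xml.dom.minidom, using a dictionary of sets in place of the Java Map of Sets. The function name `inventory` is an assumption for this example, and the result is printed in sorted order only so that runs are reproducible.

```python
from xml.dom import minidom

SAMPLE = """<doc>
<p align="left" role="lead">First.</p>
<p align="center">Second.</p>
<p align="right" font="em">Third.</p>
</doc>"""

def inventory(element, seen=None):
    """Map each element name to the set of attribute names seen with it."""
    if seen is None:
        seen = {}
    # Create the set on first sight of this element name, then add to it
    names = seen.setdefault(element.tagName, set())
    names.update(element.attributes.keys())
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            inventory(child, seen)
    return seen

if __name__ == "__main__":
    result = inventory(minidom.parseString(SAMPLE).documentElement)
    for name in sorted(result):
        print(name)
        for attr in sorted(result[name]):
            print("  " + attr)
```

Attribute values are deliberately ignored here, as in the Java version.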
Trimming the Tree
Can add or remove nodes in DOM tree
Be careful about deleting items in a list while iterating over that list
Like cutting the branch you are standing on
Pattern: delete or move on
When an item is deleted, the items after it shift down by one index
So either delete or increment loop index
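The pattern is independent of XML, so here is a sketch on a plain Python list. Both function names are invented for this example; the second one is a deliberately buggy contrast that always advances the index.

```python
def remove_matching(items, should_delete):
    """Delete matching items in place using delete-or-move-on."""
    i = 0
    while i < len(items):
        if should_delete(items[i]):
            del items[i]   # later items shift down; index i now holds a new item
        else:
            i += 1         # advance only when nothing was deleted
    return items

def remove_matching_naive(items, should_delete):
    """Buggy contrast: advancing after a deletion skips the next item."""
    i = 0
    while i < len(items):
        if should_delete(items[i]):
            del items[i]
        i += 1
    return items

if __name__ == "__main__":
    is_two = lambda x: x == 2
    print(remove_matching([1, 2, 2, 3, 2], is_two))
    print(remove_matching_naive([1, 2, 2, 3, 2], is_two))
```

The naive version leaves one of the two adjacent matches behind, because the second match slides into the slot that the loop has just stepped past.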
Removing Whitespace-Only Text
protected void atElement(Element elt) {
  List content = elt.getContent();
  int i = 0;
  while (i < content.size()) {
    Object node = content.get(i);
    boolean keep = true;
    if (node instanceof Text) {
      Text text = (Text)node;
      if (text.getText().trim().length() == 0) {
        keep = false;
      }
    }
    if (keep) {
      i += 1;
    }
    else {
      content.remove(i);
    }
  }
}
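A comparable sketch for Python's standard xml.dom.minidom, applying the same delete-or-move-on loop to an element's childNodes. The function name `strip_whitespace_text` is an assumption for this example, and unlike the Java method above it recurses by itself rather than relying on a visitor.

```python
from xml.dom import minidom

def strip_whitespace_text(element):
    """Remove whitespace-only text nodes below element, in place."""
    i = 0
    while i < len(element.childNodes):
        child = element.childNodes[i]
        if child.nodeType == child.TEXT_NODE and child.data.strip() == "":
            element.removeChild(child)        # delete: do not advance
        else:
            if child.nodeType == child.ELEMENT_NODE:
                strip_whitespace_text(child)  # clean the subtree too
            i += 1                            # move on
    return element

if __name__ == "__main__":
    doc = minidom.parseString("<html>\n<p>Just a paragraph.</p>\n</html>")
    strip_whitespace_text(doc.documentElement)
    print(doc.documentElement.toxml())
```

Text that contains anything other than whitespace is kept untouched, including its leading and trailing spaces.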
Python
Like JDOM, Python's DOM library is derived from the W3C standard
Uses idiomatic Python instead of trying to be 100% compatible with standard
In fact, Python has two DOM libraries
minidom doesn't have everything
But it's fast
import sys, xml.dom.minidom

def showTree(node, indent=0):
    # Print this node's name, then recurse into its element children
    print(' ' * indent + node.nodeName)
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            showTree(child, indent+1)

for filename in sys.argv[1:]:
    doc = xml.dom.minidom.parse(filename)
    root = doc.documentElement
    showTree(root)
Another Way to Handle XML
Both Python and Java have another way to manipulate XML called SAX
The Simple API for XML, a streaming (event-driven) interface
Instead of creating a tree in memory, it calls methods each time the parser finds something interesting
Start of element
Block of text
End of element
Errors
Neither better nor worse than DOM
Needs less memory, since only a fraction of the document is stored at a time
Users have to keep track of context themselves
from xml.sax import parse, ContentHandler

class Handler(ContentHandler):

    def __init__(self):
        ContentHandler.__init__(self)
        self.depth = 0

    def startElement(self, name, attrs):
        # Print the tag and its attributes on one line, indented by depth
        print(' ' * self.depth + name, end='')
        for (key, value) in attrs.items():
            print(' ' + key + '=' + value, end='')
        print()
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1

if __name__ == "__main__":
    import sys
    for filename in sys.argv[1:]:
        # Open in binary mode so the parser can detect the encoding itself
        with open(filename, "rb") as source:
            handler = Handler()
            parse(source, handler)
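The handler above only tracks depth. A sketch of fuller context tracking, using the same standard xml.sax module, keeps a stack of open element names so each piece of text can be tied to its position in the document. The class name `PathCollector`, the `collect` helper, and the slash-separated path format are all assumptions made for this example.

```python
from xml.sax import parseString, ContentHandler

class PathCollector(ContentHandler):
    """Record each non-blank run of text with the path of open elements."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []    # names of the elements currently open
        self.buffer = []   # text chunks seen since the last tag
        self.found = []    # (path, text) pairs

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        # A parser may deliver one run of text in several chunks
        self.buffer.append(content)

    def _flush(self):
        text = "".join(self.buffer).strip()
        self.buffer = []
        if text:
            self.found.append(("/".join(self.stack), text))

def collect(xml_bytes):
    """Parse XML given as bytes and return the list of (path, text) pairs."""
    handler = PathCollector()
    parseString(xml_bytes, handler)
    return handler.found
```

With DOM the parent of a text node is one method call away; with SAX the stack is the only record of where the parser currently is.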
$Id: dom.html,v 1.1.1.1 2004/01/04 05:02:31 reid Exp $