Simplifying the DOM Model
A DOM tree is composed of nodes that result from parsing an XML document. A node is the in-memory representation of one part of the XML document. The standard W3C DOM defines several types of nodes. For example, a text node represents a block of text in the XML document, an element node represents an element of the XML document, and an attribute node represents the name and value of an attribute placed within an element.
Figure 1: W3C DOM tree representation of a simple XML document.
The DOM is a tree because every node except the root (the document node) has a parent. For example, attribute nodes are always associated with an element node, while any text enclosed between an element's open tag and close tag is mapped to a text node, which is a child of that element node. So representing even a very simple XML document may require multiple node types. For example, Figure 1 shows a W3C DOM tree representation of the following XML document.
Figure 2: Simplified DOM tree representation of a simple XML document.
As you can see in Figure 1, because the W3C DOM uses a document node to encapsulate the entire XML document, the DOM representation uses three different types of nodes. The SimpleDOMParser makes the model less complex by abstracting all DOM node types into a single type: SimpleElement. A SimpleElement captures the crucial aspects of an XML element, such as the tag name, the element's attributes, and any enclosed text or XML. In addition, the SimpleDOMParser doesn't use any special node type to represent the top-level document. The result is a greatly simplified DOM tree containing only SimpleElement nodes. Figure 2 shows the simplified DOM tree for the preceding XML document.
Listing 1 shows the complete source code for the SimpleElement class.
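To make the single-node-type idea concrete, here is a minimal sketch of what such a class could look like. It is not the article's Listing 1: the class name SimpleElementSketch and every member name below are hypothetical, chosen only for illustration. The sketch assumes the code is pasted into a VB6 class module named SimpleElementSketch.

```vb
' Class module: SimpleElementSketch (hypothetical, not the article's Listing 1)
Option Explicit

Private mTagName As String
Private mText As String
Private mAttributes As Collection   ' attribute values keyed by attribute name
Private mChildren As Collection     ' child SimpleElementSketch objects

Private Sub Class_Initialize()
    Set mAttributes = New Collection
    Set mChildren = New Collection
End Sub

Public Property Get TagName() As String
    TagName = mTagName
End Property

Public Property Let TagName(ByVal newName As String)
    mTagName = newName
End Property

Public Property Get Text() As String
    Text = mText
End Property

Public Property Let Text(ByVal newText As String)
    mText = newText
End Property

Public Sub SetAttribute(ByVal attrName As String, ByVal attrValue As String)
    ' Drop any value already stored under this name, then add the new one.
    ' Note: Collection keys are case-insensitive, unlike XML attribute names.
    On Error Resume Next
    mAttributes.Remove attrName
    On Error GoTo 0
    mAttributes.Add attrValue, attrName
End Sub

Public Function GetAttribute(ByVal attrName As String) As String
    ' Returns an empty string when the attribute is absent.
    On Error Resume Next
    GetAttribute = mAttributes(attrName)
    On Error GoTo 0
End Function

Public Sub AddChild(ByVal newChild As SimpleElementSketch)
    mChildren.Add newChild
End Sub

Public Property Get ChildCount() As Long
    ChildCount = mChildren.Count
End Property

Public Property Get Child(ByVal index As Long) As SimpleElementSketch
    ' 1-based, like every VB Collection
    Set Child = mChildren(index)
End Property
```

Because text, attributes, and children all live on the same object, code that walks the tree never needs to test a node's type before using it, which is the main simplification over the W3C model.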
Defining XML Parsing Primitives
To process an XML document into the simplified DOM tree model presented in the previous section, you must define several basic parsing routines. Using those routines, the parser can then easily extract tags or text blocks from the input XML document.
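As a rough illustration of such primitives, the following sketch pulls the next tag or the next text block out of a string while advancing a position index. The names SketchNextTag and SketchNextText are hypothetical and are not the article's routines. The sketch also assumes that no ">" character appears inside a tag, so it does not handle comments or CDATA sections correctly.

```vb
Option Explicit

' Hypothetical helper: returns the next "<...>" tag at or after position
' index and moves index just past it. Returns "" when no tag remains.
Public Function SketchNextTag(ByVal s As String, ByRef index As Long) As String
    Dim openPos As Long
    Dim closePos As Long

    openPos = InStr(index, s, "<")
    If openPos = 0 Then
        index = Len(s) + 1              ' nothing left to scan
        Exit Function
    End If

    closePos = InStr(openPos, s, ">")
    If closePos = 0 Then
        Err.Raise 50000, "SketchNextTag", "Unterminated tag."
    End If

    SketchNextTag = Mid$(s, openPos, closePos - openPos + 1)
    index = closePos + 1
End Function

' Hypothetical helper: returns the text between index and the next "<"
' (or the end of the string) and moves index to that "<".
Public Function SketchNextText(ByVal s As String, ByRef index As Long) As String
    Dim openPos As Long

    If index > Len(s) Then Exit Function
    openPos = InStr(index, s, "<")
    If openPos = 0 Then openPos = Len(s) + 1
    SketchNextText = Mid$(s, index, openPos - index)
    index = openPos
End Function
```

Called in sequence on the string "<a>hi</a>" with the index starting at 1, these helpers return "<a>", then "hi", then "</a>". Passing the index by reference lets each call resume where the previous one stopped.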
The VB implementation first reads the entire document from a file into a string and then parses the tags from that string. This implementation differs from the Java and .NET versions (see Sidebar "Build a Lightweight XML DOM Parser with C#"), primarily because VB doesn't have native support for streams. However, the basic parsing process is similar among all the implementations.
To use the SimpleDOMParser, you create an instance and call its Parse function, passing the path to the XML file. The Parse function reads the XML file into a string, creates a stack, and then proceeds to read tags and content sequentially from the input string via the getNextTag() function.
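Public Function Parse(aFilename As String) _
   As SimpleElement
   Dim fnum As Integer
   Dim s As String
   Dim index As Long
   Dim tag As String
   Dim stack As stack
   Dim rootElement As SimpleElement
   On Error GoTo ErrParse
   ' read the file
   fnum = FreeFile
   Open aFilename For Binary As #fnum
   s = Space$(LOF(fnum))
   Get #fnum, , s
   Close #fnum
   ' create a new stack
   Set stack = New stack
   index = 1
   ' read and parse each tag
   Do While index < Len(s)
      tag = getNextTag(s, index)
      If Len(tag) > 0 Then
         If Not isCloseTag(tag) And _
            Not isEmptyTag(tag) And _
            Not isCDATA(tag) Then
            ' open tag: read the content it encloses
            Call getTagContent(s, index)
         End If
         ' ... (the remaining tag handling is not shown in this excerpt)
      End If
   Loop
   ' an element still on the stack was never closed
   If stack.Count <> 0 Then
      Err.Raise 50000, "SimpleDOMParser.Parse", _
         "The source XML document is not well-formed."
   End If
   Set Parse = rootElement
   Exit Function
ErrParse:
   ' re-raise the error to the caller
   Err.Raise Err.Number, "Parse", Err.Description
End Function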
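The final stack check in Parse relies on a common technique: push each open tag, pop on each close tag, and treat anything left over as a sign that the document is not well-formed. The sketch below illustrates that idea in isolation, together with one plausible way to classify tags. All names are hypothetical, a plain Collection stands in for the article's stack class, and the tag tests are assumptions rather than the article's isCloseTag, isEmptyTag, and isCDATA routines.

```vb
Option Explicit

Public Function SketchIsCloseTag(ByVal tag As String) As Boolean
    SketchIsCloseTag = (Left$(tag, 2) = "</")
End Function

Public Function SketchIsEmptyTag(ByVal tag As String) As Boolean
    SketchIsEmptyTag = (Right$(tag, 2) = "/>")
End Function

Public Function SketchIsCDATA(ByVal tag As String) As Boolean
    SketchIsCDATA = (Left$(tag, 9) = "<![CDATA[")
End Function

' Extracts the element name from "<name attr='x'>" or "</name>".
Private Function SketchTagName(ByVal tag As String) As String
    Dim body As String
    Dim spacePos As Long

    body = Mid$(tag, 2, Len(tag) - 2)          ' drop "<" and ">"
    If Left$(body, 1) = "/" Then body = Mid$(body, 2)
    body = Trim$(body)
    spacePos = InStr(body, " ")
    If spacePos > 0 Then body = Left$(body, spacePos - 1)
    SketchTagName = body
End Function

' Returns True when every open tag in s has a matching close tag.
' Assumes no ">" occurs inside a tag, comment, or CDATA section.
Public Function SketchIsBalanced(ByVal s As String) As Boolean
    Dim pending As Collection      ' used as a stack: last item is the top
    Dim index As Long
    Dim openPos As Long
    Dim closePos As Long
    Dim tag As String

    Set pending = New Collection
    index = 1
    Do While index <= Len(s)
        openPos = InStr(index, s, "<")
        If openPos = 0 Then Exit Do
        closePos = InStr(openPos, s, ">")
        If closePos = 0 Then Exit Function     ' unterminated tag: False
        tag = Mid$(s, openPos, closePos - openPos + 1)
        index = closePos + 1

        If SketchIsCloseTag(tag) Then
            If pending.Count = 0 Then Exit Function
            If pending(pending.Count) <> SketchTagName(tag) Then Exit Function
            pending.Remove pending.Count       ' pop the matched open tag
        ElseIf Not SketchIsEmptyTag(tag) And Left$(tag, 2) <> "<?" _
               And Left$(tag, 2) <> "<!" Then
            pending.Add SketchTagName(tag)     ' push the open tag
        End If
    Loop
    SketchIsBalanced = (pending.Count = 0)
End Function
```

With this sketch, SketchIsBalanced("<a><b/></a>") returns True, while SketchIsBalanced("<a><b></a>") returns False because the close tag does not match the element on top of the stack.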