

Build Your Own Lightweight XML DOM Parser : Page 2

Microsoft's MSXML parser is rich in functionality, but in some cases a full-featured parser is too large for resource-limited environments. Don't count XML out yet though; you can write your own lightweight VB XML parser in fewer than 400 lines of code!





Simplifying the DOM Model
A DOM tree is composed of nodes that result from parsing an XML document. Each node is the in-memory representation of one construct in that document. The standard W3C DOM defines several types of nodes. For example, a text node represents a block of text in the XML document, an element node represents an element of the XML document, and an attribute node represents the name and value of an attribute placed within an element.

Figure 1: W3C DOM tree representation of a simple XML document.

The DOM is a tree because every node except the root (document) node has a parent. For example, an attribute node is always associated with an element node, while any text enclosed between an element's open tag and close tag is mapped to a text node, which is a child of that element node. So representing even a very simple XML document may require multiple node types. For example, Figure 1 shows a W3C DOM tree representation of the following XML document.

Figure 2: Simplified DOM tree representation of a simple XML document.

As you can see in Figure 1, because the W3C DOM uses a document node to encapsulate the entire XML document, the DOM representation uses three different types of nodes. The SimpleDOMParser reduces this complexity by collapsing all DOM node types into a single type: SimpleElement. A SimpleElement captures the crucial aspects of an XML element: the tag name, the element's attributes, and any enclosed text or XML. In addition, the SimpleDOMParser doesn't use a special node type to represent the top-level document. The result is a greatly simplified DOM tree containing only SimpleElement nodes. Figure 2 shows the simplified DOM tree for the preceding XML document.

Listing 1 shows the complete source code for the SimpleElement class.
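For readers without Listing 1 at hand, the single-node-type idea can be illustrated with a rough class-module sketch. This is not the code from Listing 1: the member names (TagName, Text, SetAttribute, GetAttribute, AddChild, Children) and the use of VB Collection objects for storage are assumptions made for this example only.

```vb
' SimpleElement.cls -- hypothetical sketch, NOT the article's Listing 1.
' One class stands in for the element, attribute, and text node types.
Option Explicit

Private mTagName As String
Private mText As String
Private mAttributes As Collection   ' attribute values, keyed by attribute name
Private mChildren As Collection     ' child SimpleElement objects, in document order

Private Sub Class_Initialize()
    Set mAttributes = New Collection
    Set mChildren = New Collection
End Sub

Public Property Get TagName() As String
    TagName = mTagName
End Property

Public Property Let TagName(ByVal newValue As String)
    mTagName = newValue
End Property

Public Property Get Text() As String
    Text = mText
End Property

Public Property Let Text(ByVal newValue As String)
    mText = newValue
End Property

' Stores or overwrites an attribute.
' Caveat: Collection keys are case-insensitive, while XML names are not,
' so "id" and "ID" would collide in this sketch.
Public Sub SetAttribute(ByVal attrName As String, ByVal attrValue As String)
    On Error Resume Next            ' ignore "key not found" on the Remove
    mAttributes.Remove attrName
    On Error GoTo 0
    mAttributes.Add attrValue, attrName
End Sub

' Returns the attribute value, or "" when the attribute is absent.
Public Function GetAttribute(ByVal attrName As String) As String
    On Error Resume Next            ' a missing key leaves the result as ""
    GetAttribute = mAttributes.Item(attrName)
End Function

Public Sub AddChild(ByVal child As SimpleElement)
    mChildren.Add child
End Sub

Public Property Get Children() As Collection
    Set Children = mChildren
End Property
```

Because attributes and text are stored on the element itself, walking the tree only ever means iterating over Children, which is the simplification Figure 2 depicts.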

Defining XML Parsing Primitives
To process an XML document into the simplified DOM tree model presented in the previous section, you must define several basic parsing routines. Using those routines, the parser can then easily extract tags or text blocks from the input XML document.

The VB implementation first reads the entire document from a file into a string and then parses the tags from that string. This implementation differs from the Java and .NET versions (see Sidebar "Build a Lightweight XML DOM Parser with C#"), primarily because VB doesn't have native support for streams. However, the basic parsing process is similar among all the implementations.

To use the SimpleDOMParser, you create an instance and call its Parse function, passing the path to the XML file. The Parse function reads the XML file into a string, creates a stack, and then proceeds to read tags and content sequentially from the input string via the getNextTag() function.

    Public Function Parse(aFilename As String) _
          As SimpleElement
       Dim s As String
       Dim index As Long
       Dim tag As String
       Dim fnum As Integer

       On Error GoTo ErrParse

       ' read the entire file into a string
       fnum = FreeFile
       Open aFilename For Binary As #fnum
       s = Space$(LOF(fnum))
       Get #fnum, , s
       Close #fnum

       ' create a new stack
       Set stack = New stack
       index = 1

       ' read and parse each tag
       Do While index < Len(s)
          tag = getNextTag(s, index)
          If Len(tag) > 0 Then
             Call parseTag(tag)
             ' only an ordinary open tag can be followed by content
             If Not isCloseTag(tag) And _
                Not isEmptyTag(tag) And _
                Not isCDATA(tag) Then
                Call getTagContent(s, index)
             End If
          End If
       Loop

       ' anything left on the stack means a tag was never closed
       If stack.Count <> 0 Then
          Err.Raise 50000, "SimpleDOMParser.Parse", _
             "The source XML document is not well-formed."
       End If
       Set Parse = rootElement
    ExitParse:
       Exit Function
    ErrParse:
       Err.Raise Err.Number, "Parse", Err.Description
       Resume ExitParse
    End Function
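Parse relies on several helper routines whose bodies do not appear on this page. The sketch below shows one way getNextTag, isCloseTag, isEmptyTag, and isCDATA could be written. The routine names and the (s, index) calling pattern come from the Parse listing; everything inside the bodies is an assumption, not the author's implementation. In particular, this version does not special-case comments or processing instructions, so a comment containing ">" would be cut short.

```vb
' Hypothetical helper sketches; they would live in the same module as Parse.
Option Explicit

' Returns the next tag at or after index (for example "<a>" or "</a>") and
' advances index to the character just past it. Returns "" when no complete
' tag remains, and moves index beyond the end so the caller's loop stops.
Private Function getNextTag(s As String, index As Long) As String
    Dim startPos As Long
    Dim endPos As Long

    startPos = InStr(index, s, "<")
    If startPos = 0 Then
        index = Len(s) + 1
        Exit Function
    End If

    If Mid$(s, startPos, 9) = "<![CDATA[" Then
        ' a CDATA section may contain ">", so look for its real terminator
        endPos = InStr(startPos, s, "]]>")
        If endPos > 0 Then endPos = endPos + 2
    Else
        endPos = InStr(startPos, s, ">")
    End If

    If endPos = 0 Then
        ' unterminated tag: give up rather than loop forever
        index = Len(s) + 1
        Exit Function
    End If

    getNextTag = Mid$(s, startPos, endPos - startPos + 1)
    index = endPos + 1
End Function

' True for a closing tag such as "</item>".
Private Function isCloseTag(tag As String) As Boolean
    isCloseTag = (Left$(tag, 2) = "</")
End Function

' True for a self-closing tag such as "<item/>".
Private Function isEmptyTag(tag As String) As Boolean
    isEmptyTag = (Right$(tag, 2) = "/>")
End Function

' True for a CDATA section such as "<![CDATA[ ... ]]>".
Private Function isCDATA(tag As String) As Boolean
    isCDATA = (Left$(tag, 9) = "<![CDATA[")
End Function
```

Calling the parser then follows the pattern the article describes: create an instance and pass a file path to Parse. The path here is a placeholder.

```vb
Dim parser As SimpleDOMParser
Dim root As SimpleElement

Set parser = New SimpleDOMParser
Set root = parser.Parse("C:\data\sample.xml")   ' hypothetical path
```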
