Build Your Own Lightweight XML DOM Parser : Page 4

Microsoft's MSXML parser is rich in functionality, but in some cases a full-featured parser is too large for resource-limited environments. Don't count XML out yet though; you can write your own lightweight VB XML parser in fewer than 400 lines of code!

XML Parsing Strategy and SimpleDOMParser Implementation
Unlike a regular text document, a well-formed XML document has some unique characteristics that can facilitate the parsing work:
  • All tags in an XML document match. Every open tag must have a matching close tag except when the tag itself is both an open and close tag, such as <parser />, which is the short form for <parser></parser>. Tag and attribute names are case-sensitive.
  • All tags in an XML document must be properly nested. XML tags may not be cross-nested. For example, a document containing <t1><t2>...</t1></t2> is malformed, because the closing </t1> tag occurs before the closing </t2> tag.
With these rules in mind, the SimpleDOMParser parsing strategy follows the pattern shown in the following pseudo code:

While Not EOF(Input XML Document)
  Tag = Next tag from the document
  LastOpenTag = Top tag in Stack

  If Tag is an open tag
    Add Tag as the child of LastOpenTag
    Push Tag onto Stack
  ElseIf Tag is an empty tag (ends with "/>")
    // Already closed, so it is not pushed
    Add Tag as the child of LastOpenTag
  Else
    // Tag is a close tag
    If Tag is the matching close tag of LastOpenTag
      Pop Stack
    Else
      // Invalid tag nesting
      Report error
    End If
  End If
End While
Return the DocumentElement tag
The centerpiece of this algorithm is the tag stack, which keeps track of the open tags that have been taken from the input document but have not been matched by their close tags. The top item on the stack is always the last open tag encountered.

Except for the first tag, each new item read from the document is either ignorable content (such as an XML declaration, a DOCTYPE, or a comment), a CDATA block, or a child tag of the last open tag. In the last case, the parser adds the new tag as a child of the last open tag and then pushes it onto the stack, where it becomes the new last open tag. If the new tag is an empty tag, the parser adds it as a child of the last open tag, but doesn't push it onto the stack. Finally, if the input tag is a close tag, it has to match the last open tag.
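To make the tag categories above concrete, here is a minimal sketch of a tokenizer that labels each piece of markup. It is written in Python rather than VB and is not taken from the sample code; the names TOKEN, classify, and tokens are hypothetical, and the pattern assumes a DOCTYPE without an internal subset and attribute values that contain no ">" character.

```python
import re

# Hypothetical token pattern (an assumption, not the article's VB code).
# Alternatives are tried in order, so comments and CDATA win over plain tags.
TOKEN = re.compile(
    r"<!--.*?-->"              # comment
    r"|<!\[CDATA\[.*?\]\]>"    # CDATA block
    r"|<\?.*?\?>"              # XML declaration or processing instruction
    r"|<![^>]*>"               # DOCTYPE (simple form only)
    r"|<[^>]+>",               # open, close, or empty tag
    re.DOTALL,
)


def classify(token: str) -> str:
    """Return the category of a single markup token."""
    if token.startswith("<![CDATA["):
        return "cdata"
    if token.startswith("<!") or token.startswith("<?"):
        return "ignorable"      # comment, DOCTYPE, or XML declaration
    if token.startswith("</"):
        return "close"
    if token.endswith("/>"):
        return "empty"
    return "open"


def tokens(xml: str):
    """List (category, token) pairs in document order."""
    return [(classify(m.group()), m.group()) for m in TOKEN.finditer(xml)]


if __name__ == "__main__":
    sample = '<?xml version="1.0"?><!-- note --><a><b/><![CDATA[x > y]]></a>'
    for kind, tok in tokens(sample):
        print(kind, tok)
```

Only the "open" category leads to a push onto the stack; "empty" and "cdata" tokens attach to the last open tag, and "ignorable" tokens are skipped.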

A non-matching close tag indicates an XML syntax error based on the proper-nesting rule. When the close tag matches the last open tag, the parser removes the last open tag from the stack because parsing for that tag is complete. This process continues until the end of the input string, at which point the stack must be empty; otherwise, the document is malformed. Here's the logic:

  • If the new tag is a CDATA block, the parser adds the content to the last element found.
  • If the new tag is an opening tag, the parser adds it as a child of the top element on the stack and pushes it onto the stack using the pushElement() method.
  • If the new tag is an empty tag (an XML tag that ends with "/>"), the parser adds it as a child of the top element on the stack but, because it's already closed, doesn't push the tag onto the stack.
  • If the new tag is a closing tag (such as </endTag>), the parser compares the tag name with the tag name of the top tag on the stack. If the names match, it pops the stack; otherwise, it raises an error, because that means the document contains malformed XML.
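The four rules above can be sketched as a short stack-based parser. This is an illustrative Python version under stated assumptions, not the SimpleDOMParser source: the Element class, the TOKEN pattern, and parse are hypothetical names, attributes are skipped, attribute values are assumed to contain no ">" character, and plain text between tags is attached to the enclosing element (a choice made for this sketch).

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for the article's SimpleElement."""
    name: str
    text: str = ""
    children: List["Element"] = field(default_factory=list)


# Group 1: CDATA content. Groups 2-4: close-tag slash, tag name, empty-tag slash.
TOKEN = re.compile(
    r"<!\[CDATA\[(.*?)\]\]>"            # CDATA block
    r"|<!--.*?-->"                      # comment (ignored)
    r"|<[?!][^>]*>"                     # XML declaration or DOCTYPE (ignored)
    r"|<(/?)([^\s/>]+)[^>]*?(/?)>",     # open, close, or empty tag
    re.DOTALL,
)


def parse(xml: str) -> Element:
    root: Optional[Element] = None
    stack: List[Element] = []           # open tags not yet matched by a close tag
    pos = 0
    for m in TOKEN.finditer(xml):
        # Assumption of this sketch: text between tags belongs to the last open tag.
        text = xml[pos:m.start()].strip()
        pos = m.end()
        if text and stack:
            stack[-1].text += text
        cdata, closing, name, empty = m.groups()
        if name is None:                # CDATA, comment, or declaration
            if cdata is not None and stack:
                stack[-1].text += cdata
            continue
        if closing:                     # close tag must match the last open tag
            if not stack or stack[-1].name != name:
                raise ValueError(f"unexpected close tag </{name}>")
            stack.pop()
            continue
        element = Element(name)
        if stack:
            stack[-1].children.append(element)
        elif root is None:
            root = element              # first tag becomes the document element
        else:
            raise ValueError("more than one document element")
        if not empty:                   # empty tags are already closed
            stack.append(element)
    if stack:
        raise ValueError(f"unclosed tag <{stack[-1].name}>")
    if root is None:
        raise ValueError("no document element found")
    return root


if __name__ == "__main__":
    doc = parse("<order><item>pen</item><paid/></order>")
    print(doc.name, [child.name for child in doc.children])
```

A Python list serves as the stack here, with append and pop standing in for the push and pop operations the article describes. Cross-nested input such as `<t1><t2></t1></t2>` raises ValueError at the first mismatched close tag.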
The SimpleDOMParser.Parse function returns a single SimpleElement: the document element. Starting from that element, a recursive routine can visit each element in the tree and process the content in whatever way you wish. The main form in the sample code simply reads a document and displays the results in a text field.
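As a rough illustration of that recursive walk, the following Python sketch builds a small tree by hand and renders it as indented lines. The Element class and the describe function are hypothetical and only approximate what a SimpleElement might expose (a name, some text, and a list of children).

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Element:
    """Hypothetical stand-in for the article's SimpleElement."""
    name: str
    text: str = ""
    children: List["Element"] = field(default_factory=list)


def describe(element: Element, depth: int = 0) -> Iterator[str]:
    """Yield one indented line per element, visiting parents before children."""
    line = "  " * depth + f"<{element.name}>"
    if element.text:
        line += f" {element.text}"
    yield line
    for child in element.children:
        # Recurse one level deeper for each child element.
        yield from describe(child, depth + 1)


if __name__ == "__main__":
    doc = Element("order", children=[
        Element("item", "pen"),
        Element("note", children=[Element("flag")]),
    ])
    print("\n".join(describe(doc)))
```

The same pattern works for any per-element processing: replace the line-building step with whatever the application needs to do at each node.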

Although the SimpleDOMParser implemented in this sample code has limited functionality, it may still be quite useful for many simple applications. For example, a client application might use it to transfer data in XML format to a backend server application. Because it is extremely lightweight, the SimpleDOMParser is very attractive in an environment where resources are restricted. In addition, the implementation of the SimpleDOMParser is straightforward. Although the current implementation stores only elements, not DOCTYPE or XML declarations, you can easily modify it to handle other XML content.

Guang Yang is the founder and chief architect for Sunwest Technologies, a consulting firm that specializes in Web, e-commerce, and enterprise application development. Guang has been a Java developer since 1996. You can reach him by email at gyang@SunwestTek.com.