XML is becoming increasingly popular in the developer community as a tool for passing, manipulating, storing, and organizing information. If you are one of the many developers planning to use XML, you must carefully select and master an XML parser. The parser, one of XML’s core technologies, is your interface to an XML document, exposing its contents through a well-specified API. Confirm that the parser you select has the functionality and performance your application requires. A poor choice can result in excessive hardware requirements, poor system performance, lost developer productivity, and stability issues.
I tested a selection of Java-based XML parsers, and this article presents the results while discussing the performance issues you should consider when selecting a parser. Your software’s performance hinges on choosing the right one.
Because XML is a standardized format, it offers more developer and product support than proprietary formats, parsers, and configuration and storage schemes. Your XML project also will be easier to manage if you keep it simple. If possible, write interface code in only one or two languages (e.g., Java or C++), using as few APIs as possible (e.g., DOM, SAX, and perhaps JAXP).
Minimizing technologies sounds good in theory, but it can’t be done without effective tools. What makes a tool effective? Depending on the project, the following attributes can make the difference:
- Stable specifications
- Commercial vendor support
- Adequate performance
- Adequate API features
SAX vs. DOM
At present, two major API specifications define how XML parsers work: SAX and DOM. The DOM specification defines a tree-based approach to navigating an XML document. In other words, a DOM parser processes XML data and creates an object-oriented hierarchical representation of the document that you can navigate at run-time.
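As a minimal sketch of this tree-based approach, the following uses the JAXP DocumentBuilder API bundled with the JDK to parse a document and navigate it at run-time (the document content and class name are illustrative):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomDemo {
    /** Parse an XML string into an in-memory DOM tree. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<order id=\"7\"><item>widget</item></order>");
        Element root = doc.getDocumentElement();
        // The whole document is now navigable at run-time.
        System.out.println(root.getTagName());              // root element name
        System.out.println(root.getAttribute("id"));        // attribute access
        System.out.println(root.getElementsByTagName("item")
                               .item(0).getTextContent());  // child text
    }
}
```

Once the parse returns, the entire document lives in memory as an object hierarchy, which is precisely the source of both DOM’s convenience and its resource cost.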
The SAX specification defines an event-based approach whereby parsers scan through XML data, calling handler functions whenever certain parts of the document (e.g., text nodes or processing instructions) are found.
In SAX’s event-based system, the parser doesn’t create any internal representation of the document. Instead, the parser calls handler functions when certain events (defined by the SAX specification) take place. These events include the start and end of the document, finding a text node, finding child elements, and hitting a malformed element.
SAX development is more challenging because the API requires you to write callback functions that handle the events. The resulting design can also be less intuitive and modular. Using a SAX parser may require you to store information in your own internal document representation if you need to rescan or analyze it; SAX provides no container for the document like the DOM tree structure.
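The callback style looks like this in practice. Here is a minimal sketch using the org.xml.sax DefaultHandler class from the JDK; the event labels and helper method are my own:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        events.add("start:" + qName);   // fired once per opening tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("text:" + text);  // fired for text nodes
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add("end:" + qName);     // fired once per closing tag
    }

    /** Parse the given XML string and return the event sequence observed. */
    public static List<String> collectEvents(String xml) throws Exception {
        SaxDemo handler = new SaxDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectEvents("<order><item>widget</item></order>"));
    }
}
```

Note that the parser retains nothing between callbacks; the events list above is exactly the kind of homegrown representation you must build yourself if you need to revisit the data.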
Is having two completely different ways to parse XML data a problem? Not really; the two APIs simply take very different approaches to processing the information. The W3C DOM specification provides a rich and intuitive structure for housing the XML data, but it can be quite resource-intensive, given that the entire XML document is typically stored in memory. You can manipulate the DOM at run-time and stream the updated data as XML, or transform it into your own format if you need to.
The strength of the SAX specification is that it can scan and parse gigabytes worth of XML documents without hitting resource limits, because it does not try to create the DOM representation in memory. Instead, it raises events that you can handle as you see fit. Because of this design, the SAX implementation is generally faster and requires fewer resources. On the other hand, SAX code is frequently complex, and the lack of a document representation leaves you with the challenge of manipulating, serializing, and traversing the XML document.
Putting the Parser to the Test
To determine the right parser for you, prioritize the importance of functionality, speed, memory requirements, and class footprint size. A few types of tests can help you evaluate them, although the performance of some depends on the specific nature and design of your software. These tests include parsing large and small XML documents, traversing and navigating the processed DOM, constructing a DOM from scratch, and evaluating the resource requirements of the parser.
You can tell quite a bit about a parser by using one or two simple XML documents. If your software will have to deal with many small files, see if the parser has some initialization overhead that slows down repeated parsing. For very large files, confirm that the parser can interpret the file in sufficient time with reasonable resource requirements. For the latter case, very large XML documents may require using a SAX parser that does not store the document in memory. You might also consider reading in parts of the document (using an appropriate DTD that allows for a partial document) and manipulating the document fragments in memory, one at a time.
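One way to probe initialization overhead is to compare creating a fresh parser on every parse against reusing a single instance. A minimal sketch using the JAXP API (the document, iteration counts, and method names are illustrative; absolute timings will vary by machine and parser):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class ParseTiming {
    static final String SMALL_DOC = "<doc><a>1</a><b>2</b></doc>";

    /** Time n parses, creating a fresh factory and parser each iteration. */
    public static long freshParserEachTime(int n) throws Exception {
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SMALL_DOC)));
        }
        return System.currentTimeMillis() - start;
    }

    /** Time n parses reusing one parser instance. */
    public static long reusedParser(int n) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            builder.reset();  // clear parser state before reuse
            builder.parse(new InputSource(new StringReader(SMALL_DOC)));
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("fresh parser each time: " + freshParserEachTime(5000) + " ms");
        System.out.println("one reused parser:      " + reusedParser(5000) + " ms");
    }
}
```

If the gap between the two numbers is large, parser setup cost will dominate any workload built around many small files.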
In addition, new DOM parsing solutions may handle massive XML documents more effectively. Remember that the DOM API specifies only how to interact with the document, not how it must be stored. Persistent DOM (PDOM) implementations with index-based searches and retrieval are in the works, but I have not yet tested any of these.
You should also evaluate how well the parser traverses an in-memory DOM after the XML data has been parsed. If you need to search or scan a post-parse DOM through the API, you can rule out SAX, unless you are willing to build your own document model from your callback functions. For W3C DOM-compliant parsers, test the speed of scanning through the constructed DOM to see how expensive tree traversal can be.
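A root-to-all-children scan of the kind described above can be sketched as a simple depth-first walk over the W3C DOM interfaces (the counting logic is just an illustrative workload):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomWalk {
    /** Depth-first walk from a node to all descendants, counting element nodes. */
    public static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null;
                 child = child.getNextSibling()) {
            count += countElements(child);  // recurse into each subtree
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<a><b/><c><d/></c></a>")));
        System.out.println(countElements(doc));  // four elements: a, b, c, d
    }
}
```

Timing a walk like this over your real documents reveals how expensive each parser's node accessors (getFirstChild, getNextSibling) are in practice.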
Some XML parsers come with a serialization feature and are able to convert a document tree to XML data. This capability is not in all parsers, but the performance of parsers that support this ability is often proportional to the time required to navigate a given document tree using the API. Again, because SAX does not support an internal representation of the document, you would have to provide your own document and serialization functionality.
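For parsers that lack a built-in serializer, the JAXP identity transform offers a parser-independent way to turn a DOM tree back into XML text. A minimal sketch (the helper name and document content are illustrative):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomSerialize {
    /** Serialize a DOM tree to XML text via the JAXP identity transform. */
    public static String toXml(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<order><item>widget</item></order>")));
        System.out.println(toXml(doc));  // round-trips the document as XML text
    }
}
```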
The available XML parsers vary in performance. My testing is not a definitive benchmark, and it barely scratches the surface of the parsers’ capabilities, but it is a useful starting point. I used the XmlTest application to test a selection of Java-based XML parsers:
- Sun’s Project X parser, included with the JAXP release
- Oracle’s v2 XML parser
- the Xerces-J parser, shared by both IBM and Apache
XmlTest runs five tests against each parser:
- Read and parse a small DTD-enforced XML file (approximately 25 elements)
- Read and parse a large DTD-enforced XML file (approximately 50,000 elements)
- Navigate the DOM created by Test #2 from root to all children
- Build a large DOM (approximately 60,000 elements) from scratch
- Build an arbitrarily large DOM from scratch using createElement(…) and similar function calls. Continue until a java.lang.OutOfMemoryError is raised. Measure the time it takes to hit the “memory wall” within a default (32MB heap) JRE and count how many elements are created before the unrecoverable error is raised.
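Tests #4 and #5 boil down to building a DOM from scratch with createElement(…). A minimal sketch of that construction loop using the JAXP API; the element names are illustrative, and this version stops at a fixed count rather than running until the JVM exhausts its heap:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDom {
    /** Build a DOM with n child elements from scratch, as in Tests #4 and #5. */
    public static Document build(int n) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        for (int i = 0; i < n; i++) {
            Element child = doc.createElement("item");
            child.appendChild(doc.createTextNode("value-" + i));
            root.appendChild(child);
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Test #5 would loop until an OutOfMemoryError; here we stop at 60,000.
        Document doc = build(60_000);
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```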
Table 1: Test Results
What Have We Learned?
From these results, one can draw some initial conclusions. First, the results clearly vary quite a bit across parsers for very similar code; only minimal changes were made to comply with each parser’s specific interface. XP obviously accomplishes one of its goals: high performance. However, this may be partly explained by missing features; for example, XP lacks DTD validation, which creates overhead for the other parsers.
SAX clearly beats DOM for run-time parsing, although its lack of an internal DOM representation will cause some difficulties for developers under certain situations. These differences are most apparent when the document gets very large. Although these tests do not show it, SAX parsers typically are faster for very large documents where the DOM model hits virtual memory or consumes all available memory.
These tests also indicate that Sun’s parser was much more efficient at DOM construction than at reading and parsing. Although Sun excelled in Tests #4 and #5, it came in last place in Tests #1 and #2.
I have no doubt that some tweaking of each parser’s default behavior could improve its results. That would be the next phase in your evaluation: test the parsers against project-specific requirements, such as XSL transformations and expected document sizes. Even the attribute types and the complexity and nesting of elements can affect the parsers differently. Some parsers are more efficient with heavy whitespace, while attribute-rich elements bog down others.
Know Your Needs
If you need to parse and process huge XML documents, SAX implementations obviously offer some benefits over DOM-based ones. Also ask yourself whether an improved design would remove the need for such large XML documents; perhaps pre-filtering the data in a database that can stream XML would suit your needs. By going with SAX, you may restrict your options for document manipulation and XSLT, and you may require your team to write code to internally manage, store, and rewrite the document. SAX is best suited to sequential-scan applications, where you want to go through the XML document quickly from start to finish. Sometimes you won’t need the overhead of a full-blown DOM, and a SAX parser will be sufficient for building a lightweight, compact internal data structure.
At the same time, DOM has great advantages, including its simplicity, powerful access to the document, popularity, and well-defined specification. It also pairs nicely with XSLT and other document-transformation solutions you may require. DOM implementations are currently biased towards in-memory storage of the document, but this may change as PDOM implementations become more popular. Programming DOM code becomes even easier with a JDOM wrapper for Java, which encapsulates SAX/DOM manipulation behind a much simpler interface.
A large number of parser options are available. Picking the right one can be tricky, but a few tests will help to point you in the right direction. The JAXP plug-in XML parser framework could make it much easier for you to swap and evaluate XML parsers without significantly breaking your code. Also, using newsgroups to gauge other developers’ feedback can save you some time. I can’t recommend a specific parser as the right tool because I don’t know your situation. The one for you depends on the needs of your application.
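The JAXP plug-in mechanism works by resolving the factory implementation class at run-time, so swapping parsers requires no code changes, only configuration. A minimal sketch (the Xerces class name in the comment is illustrative of the kind of value you would set):

```java
import javax.xml.parsers.DocumentBuilderFactory;

public class JaxpSwap {
    /** Report which parser implementation JAXP resolved at run-time. */
    public static String currentFactory() {
        // JAXP consults the javax.xml.parsers.DocumentBuilderFactory system
        // property (and then a services lookup) to pick an implementation,
        // so application code never names a vendor class directly.
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // To plug in a different parser, you would set something like
        // (class name illustrative, and the JAR must be on the classpath):
        // System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
        //         "org.apache.xerces.jaxp.DocumentBuilderFactoryImpl");
        System.out.println("active factory: " + currentFactory());
    }
}
```

Because the selection point is a single property, the same test harness can be rerun against each candidate parser without touching the application code.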