XML Parsers: DOM and SAX Put to the Test

XML is becoming increasingly popular in the developer community as a tool for passing, manipulating, storing, and organizing information. If you are one of the many developers planning to use XML, you must carefully select and master the XML parser. The parser, one of XML’s core technologies, is your interface to an XML document, exposing its contents through a well-specified API. Confirm that the parser you select has the functionality and performance that your application requires. A poor choice can result in excessive hardware requirements, poor system performance and developer productivity, and stability issues.

I tested a selection of Java-based XML parsers, and this article presents the results while discussing the performance issues you should consider when selecting a parser. Your software’s performance hinges on your choosing the right one.

Performance Issues
Because XML is a standardized format, it offers more developer and product support than proprietary formats, parsers, and configuration and storage schemes. Your XML project also will be easier to manage if you keep it simple. If possible, write interface code in only one or two languages (e.g., Java or C++), using as few APIs as possible (DOM, SAX, XML, and perhaps JAXP).

Minimizing technologies sounds good in theory, but it can’t be done without effective tools. What makes a tool effective? Depending on the project, the following attributes can matter:

  • Stable specifications
  • Commercial vendor support
  • Adequate performance
  • Adequate API features

SAX vs. DOM
At present, two major API specifications define how XML parsers work: SAX and DOM. The DOM specification defines a tree-based approach to navigating an XML document. In other words, a DOM parser processes XML data and creates an object-oriented hierarchical representation of the document that you can navigate at run-time.

The SAX specification defines an event-based approach whereby parsers scan through XML data, calling handler functions whenever certain parts of the document (e.g., text nodes or processing instructions) are found.

How do the tree-based and event-based APIs differ? The tree-based W3C DOM parser creates an internal tree based on the hierarchical structure of the XML data. You can navigate and manipulate this tree from your software, and it stays in memory until you release it. DOM uses functions that return parent and child nodes, giving you full access to the XML data and providing the ability to interrogate and manipulate these nodes. DOM manipulation is straightforward and the API does not take long to understand, particularly if you have some JavaScript DOM experience.
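As a rough illustration of the tree-based style, the sketch below parses a small document and then moves between parent and child nodes. It assumes the JAXP factory classes bundled with current JDKs rather than any of the specific parsers discussed here, and the class name, method names, and sample XML are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch, not part of any parser's API.
class DomNavigationSketch {

    /** Parses XML text into an in-memory DOM tree. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Walks the direct children of an element and returns the names of the child elements. */
    static List<String> childElementNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<order id=\"7\"><item>pen</item><note>rush</note></order>");
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName() + " has children " + childElementNames(root));

        // Navigate from a child back up to its parent.
        Node first = root.getFirstChild();
        System.out.println("parent of " + first.getNodeName() + " is "
                + first.getParentNode().getNodeName());

        // Manipulate the tree in place: the new element stays in memory with the rest.
        Element extra = doc.createElement("item");
        extra.appendChild(doc.createTextNode("ink"));
        root.appendChild(extra);
        System.out.println("after append: " + childElementNames(root));
    }
}
```

The whole document is held in memory for as long as the Document reference is reachable, which is what makes this kind of back-and-forth navigation possible.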

In SAX’s event-based system, the parser doesn’t create any internal representation of the document. Instead, the parser calls handler functions when certain events (defined by the SAX specification) take place. These events include the start and end of the document, finding a text node, finding child elements, and hitting a malformed element.
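To make the event model concrete, here is a minimal sketch that records each callback in the order the parser fires it. It assumes the JAXP SAXParserFactory and the SAX DefaultHandler helper class available in current JDKs; the class name and the trace format are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; a sketch of SAX callbacks, not a complete handler.
class SaxEventsSketch {

    /** Parses the XML text and returns one line per SAX event, in firing order. */
    static List<String> traceEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void endDocument() { events.add("endDocument"); }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("startElement " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Skip whitespace-only text so the trace stays readable.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("characters " + text);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : traceEvents("<order><item>pen</item></order>")) {
            System.out.println(event);
        }
    }
}
```

Nothing is retained once a callback returns unless the handler stores it, and a malformed document surfaces as a SAXException thrown out of parse.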

SAX development is more challenging, because the API requires development of callback functions that handle the events. The design itself also can sometimes be less intuitive and modular. Using a SAX parser may require you to store information in your own internal document representation if you need to rescan or analyze the information, because SAX provides no container for the document like the DOM tree structure.
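One way to picture "your own internal document representation" is a handler that keeps only the pieces the application needs. The sketch below collects the text of every element with a given name into a plain list. It assumes the JAXP SAX classes in current JDKs and assumes the target elements are not nested inside each other; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; a sketch of keeping a lightweight model from SAX events.
class SaxToModelSketch {

    /** Returns the text content of every element named elementName, in document order. */
    static List<String> collectText(String xml, final String elementName) throws Exception {
        final List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a matching element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals(elementName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text node in several chunks, so append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(elementName) && current != null) {
                    values.add(current.toString());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><item>pen</item><note>rush</note><item>ink</item></order>";
        System.out.println(collectText(xml, "item"));
    }
}
```

The state tracking in the handler is the extra design work that the paragraph above refers to; everything outside the matching elements is discarded as it streams past.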

Is having two completely different ways to parse XML data a problem? Not really. The two APIs take very different approaches to processing the information. The W3C DOM specification provides a very rich and intuitive structure for housing the XML data, but can be quite resource-intensive given that the entire XML document is typically stored in memory. You can manipulate the DOM at run-time and stream the updated data as XML, or transform it to your own format if you require.

The strength of the SAX specification is that it can scan and parse gigabytes’ worth of XML documents without hitting resource limits, because it does not try to create the DOM representation in memory. Instead, it raises events that you can handle as you see fit. Because of this design, a SAX implementation is generally faster and requires fewer resources. On the other hand, SAX code is frequently complex, and the lack of a document representation leaves you with the challenge of manipulating, serializing, and traversing the XML document.

Putting the Parser to the Test
To determine the right parser for you, prioritize the importance of functionality, speed, memory requirements, and class footprint size. A few types of tests can help you evaluate them, although the performance of some depends on the specific nature and design of your software. These tests include parsing large and small XML documents, traversing and navigating the processed DOM, constructing a DOM from scratch, and evaluating the resource requirements of the parser.

You can tell quite a bit about a parser by using one or two simple XML documents. If your software will have to deal with many small files, see if the parser has some initialization overhead that slows down repeated parsing. For very large files, confirm that the parser can interpret the file in sufficient time with reasonable resource requirements. For the latter case, very large XML documents may require using a SAX parser that does not store the document in memory. You might also consider reading in parts of the document (using an appropriate DTD that allows for a partial document) and manipulating the document fragments in memory, one at a time.
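A simple way to look for initialization overhead is to parse the same small document several times and compare the first run with the later ones. The following is only a sketch under assumptions: it uses the JAXP DocumentBuilderFactory from current JDKs, creates a fresh factory and builder on every pass so that setup cost is included in each timing, and uses a made-up class name and sample document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical class name; a rough timing sketch, not a rigorous benchmark.
class RepeatedParseTiming {

    /** Parses the same document `runs` times and returns the elapsed nanoseconds of each run. */
    static long[] timeParses(String xml, int runs) throws Exception {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            // Fresh factory and builder each time, so per-parse setup cost is measured too.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) throws Exception {
        long[] nanos = timeParses("<config><entry key=\"a\">1</entry></config>", 20);
        long laterTotal = 0;
        for (int i = 1; i < nanos.length; i++) {
            laterTotal += nanos[i];
        }
        System.out.printf("first parse: %.3f ms%n", nanos[0] / 1e6);
        System.out.printf("mean of later parses: %.3f ms%n", laterTotal / (nanos.length - 1) / 1e6);
    }
}
```

A large gap between the first figure and the later mean suggests one-time startup cost (class loading, factory lookup); a consistently high later mean suggests per-parse overhead that will matter when handling many small files.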

In addition, new DOM parsing solutions may handle massive XML documents more effectively. Remember that the DOM API specifies only how to interact with the document, not how it must be stored. Persistent DOM (PDOM) implementations with index-based searches and retrieval are in the works, but I have not yet tested any of these.

You should also evaluate how well the parser traverses an in-memory DOM after XML data has been parsed. If you require the ability to search or scan through a post-parsed DOM using the API, you can rule out SAX, unless you are willing to create your own document model from your callback functions. For W3C DOM-compliant parsers, test the speed of scanning through the constructed DOM to see how expensive traversal of the tree can be.
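A traversal test can be as small as a recursive walk that visits every node and counts the elements. The sketch below assumes the JAXP DOM classes in current JDKs and a synthetic, flat document generated in memory; the class name and the document shape are illustrative choices, not the files used in the tests described later.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch of timing a full DOM walk.
class DomTraversalTiming {

    /** Parses XML text into a DOM tree. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Recursively visits every node under (and including) `node` and counts the elements. */
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Build a synthetic document: one root with 50,000 children.
        StringBuilder xml = new StringBuilder("<root>");
        for (int i = 0; i < 50_000; i++) {
            xml.append("<item n=\"").append(i).append("\"/>");
        }
        xml.append("</root>");
        Document doc = parse(xml.toString());

        long start = System.nanoTime();
        int count = countElements(doc);
        long elapsed = System.nanoTime() - start;
        System.out.printf("visited %d elements in %.3f ms%n", count, elapsed / 1e6);
    }
}
```

Because the walk uses only getFirstChild and getNextSibling, the same code should run against any W3C DOM-compliant parser, which keeps the comparison fair.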

Some XML parsers come with a serialization feature and are able to convert a document tree to XML data. This capability is not in all parsers, but the performance of parsers that support this ability is often proportional to the time required to navigate a given document tree using the API. Again, because SAX does not support an internal representation of the document, you would have to provide your own document and serialization functionality.
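For a sense of what serialization looks like in code, the sketch below writes a DOM tree back out as XML text. Note the swap: instead of a parser-specific serializer, it assumes the identity Transformer from the javax.xml.transform package in current JDKs. The class name, method names, and sample document are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name; a sketch that serializes via an identity transform.
class DomSerializeSketch {

    /** Builds a tiny sample tree entirely through DOM calls. */
    static Document buildSample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element order = doc.createElement("order");
        Element item = doc.createElement("item");
        item.appendChild(doc.createTextNode("pen"));
        order.appendChild(item);
        doc.appendChild(order);
        return doc;
    }

    /** Converts a DOM tree to XML text without the XML declaration. */
    static String serialize(Document doc) throws Exception {
        // A Transformer created with no stylesheet copies its input to its output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(buildSample()));
    }
}
```

Serializing has to visit every node, which is why its cost tends to track the cost of navigating the same tree.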

Parsing Benchmarks
The available XML parsers vary in performance. Raw performance is not a definitive measure of a parser, and these tests barely scratch the surface of all parser capabilities. I used the XmlTest application to test a selection of Java-based XML parsers:

  • Sun’s Project X parser, included with the JAXP release
  • Oracle’s v2 XML parser
  • the Xerces-J parser, shared by both IBM and Apache
  • XP
All of the parsers have both SAX and DOM support except for the XP parser, which is SAX-based.

Test Framework Design
Figure 1: Test Framework Design

 
Figure 1 shows the architecture for my test framework. The XmlTest application took an argument that specified which parser to instantiate and test. This ensured that each parser started with a clean Java run-time environment (JRE). The following tests were performed:
  1. Read and parse a small DTD-enforced XML file (approximately 25 elements)
  2. Read and parse a large DTD-enforced XML file (approximately 50,000 elements)
  3. Navigate the DOM created by Test #2 from root to all children
  4. Build a large DOM (approximately 60,000 elements) from scratch
  5. Build an arbitrarily large DOM from scratch using createElement(…) and similar function calls, continuing until a java.lang.OutOfMemoryError is raised. Measure the time it takes to hit the “memory wall” within a default (32MB heap) JRE and count how many elements are created before the unrecoverable error is raised.
Table 1 shows the results of the tests. Again, these are not meant to be definitive. The tests are grouped by DOM-based and SAX-based parsers. Some tests were not performed on SAX parsers (indicated by the “-” designation). All tests except for Test #5 were run as follows: one dry run to remove any caching effects and then five repetitions of the test. The results were averaged to produce the test scores. Test #5 was averaged by running the same test framework five times to confirm the results. It was impossible to repeat the test within one JRE session because a java.lang.OutOfMemoryError is not recoverable, leaving the finally {…} clause to report the test results and exit.
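The measurement procedure described above (one untimed dry run, then five timed repetitions averaged) can be sketched for a build-from-scratch test along the lines of Test #4. This is not the XmlTest source, which is not shown in this article. It assumes the JAXP DocumentBuilderFactory in current JDKs, and the class name, element names, and flat document shape are invented for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name; a sketch of the dry-run-plus-repetitions procedure.
class BuildDomBenchmark {

    /** Builds a DOM with one root element and `children` child elements under it. */
    static Document buildDom(int children) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        for (int i = 0; i < children; i++) {
            Element child = doc.createElement("item");
            child.setAttribute("n", Integer.toString(i));
            root.appendChild(child);
        }
        return doc;
    }

    /** Runs one untimed dry run, then returns the mean time of `reps` timed runs, in seconds. */
    static double averageSeconds(int children, int reps) throws Exception {
        buildDom(children); // dry run to reduce class-loading and caching effects
        long totalNanos = 0;
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            buildDom(children);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) reps / 1e9;
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("build large: %.3f s (mean of 5 runs)%n", averageSeconds(60_000, 5));
    }
}
```

Launching the program once per parser, as the framework does, keeps one parser's loaded classes and heap state from influencing the next parser's numbers.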

Test #       1           2           3          4            5           5
Parser  API  Small Read  Large Read  Large Nav  Build Large  Build Huge  Max Size
             (s)         (s)         (s)        (s)          (s)         (elements)
Sun     DOM  0.022       3.732       0.21       0.496        12.33       440,358
Oracle  DOM  0.014       2.976       0.06       0.926        8.23        281,308
Xerces  DOM  0.042       2.482       0.078      0.81         10.11       389,044
Sun     SAX  0.018       0.7         -          -            -           -
Oracle  SAX  0.01        0.546       -          -            -           -
Xerces  SAX  0.036       1.3         -          -            -           -
XP      SAX  0.016       0.458       -          -            -           -

Table 1: Test Results

What Have We Learned?
From these results, one could draw some initial conclusions. First, the results clearly vary quite a bit for very similar code across all parsers. Only minimal changes were made to comply with the specific interface of each parser. XP obviously accomplishes one of its goals: high performance. However, this may be explained through some missing features such as lack of DTD validation, which creates overhead for the other parsers.

SAX clearly beats DOM for run-time parsing, although its lack of an internal DOM representation will cause some difficulties for developers under certain situations. These differences are most apparent when the document gets very large. Although these tests do not show it, SAX parsers typically are faster for very large documents where the DOM model hits virtual memory or consumes all available memory.

These tests also seemed to indicate that Sun was much more efficient during construction than during the read-and-parse stage. Although Sun excelled in Tests #4 and #5, it came in last place for Tests #1 and #2.

I have no doubt that some tweaking of each parser’s default behavior could improve its results. That would be the next phase in your evaluation. Test the parsers with project-specific requirements that enter into the equation, such as XSL transformations and document sizes. Even the attribute types and complexity/nesting of elements can affect the parsers differently. Some parsers are more efficient with heavy white spacing, while attribute-rich elements bog down others.

Know Your Needs
If you need to parse and process huge XML documents, SAX implementations obviously offer some benefits over DOM-based ones. Also ask yourself if an improved design would remove the need for such large XML documents. Perhaps pre-filtering in a database that can stream XML would suit your needs. By going with SAX, you may restrict your options for document manipulation and XSLT and require your team to write code to internally manage, store, and rewrite the document. SAX is best suited to sequential-scan applications, for which you want to quickly go through the XML document start-to-finish. However, sometimes you won’t need the overhead of a full-blown DOM, and a SAX parser will be sufficient for creating a lightweight and compact internal data structure.

At the same time, DOM has great advantages, including its simplicity, powerful access to the document, popularity, and well-defined specification. It also pairs nicely with XSLT and other document-transformation solutions you may require. DOM implementations are currently biased towards in-memory storage of the document, but this may change as PDOM implementations become more popular. Programming DOM code becomes even easier with a JDOM wrapper for Java, which encapsulates SAX/DOM manipulation behind a much simpler interface.

A large number of parser options are available. Picking the right one can be tricky, but a few tests will help to point you in the right direction. The JAXP plug-in XML parser framework could make it much easier for you to swap and evaluate XML parsers without significantly breaking your code. Also, using newsgroups to gauge other developers’ feedback can save you some time. I can’t recommend a specific parser as the right tool because I don’t know your situation. The one for you depends on the needs of your application.
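As a sketch of how the JAXP plug-in mechanism keeps application code parser-neutral, the example below asks the JAXP factories which implementation they resolved. It assumes the javax.xml.parsers package in current JDKs; the class and method names are hypothetical. The implementation can be switched from outside the code with the javax.xml.parsers.DocumentBuilderFactory system property, provided the named parser is on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical class name; a sketch of JAXP's pluggable factory lookup.
class JaxpSwapSketch {

    /** Returns the class name of the DOM factory implementation that JAXP resolved. */
    static String domFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Returns the class name of the SAX factory implementation that JAXP resolved. */
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Application code only ever names the abstract JAXP factories. To try another
        // parser, set the system property at launch, for example (requires Xerces on
        // the classpath):
        //   java -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl JaxpSwapSketch
        System.out.println("DOM factory: " + domFactoryClass());
        System.out.println("SAX factory: " + saxFactoryClass());
    }
}
```

Because the parsing code never references a vendor class directly, the same test harness can be rerun against each candidate parser by changing only the launch command.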

devx-admin
