Choose Your Java XML Parser

The XML parser world is a dynamic one. As standards change, the parsers change as well, and XML parsers are becoming more sophisticated. For most programming projects, the parser, at minimum, must support DOM Level 2, SAX 2, XSLT, and Namespaces. All the parsers discussed here provide these capabilities; however, there are distinct differences in performance, reliability, and conformance to standards. In this article, I’ll compare the latest parsers from Sun, Oracle, and the Apache Software Foundation.

Apache Parser
The Apache parser version 1.2.3 (commonly known as Xerces) is an open-source effort based on IBM’s XML4J parser. Xerces has full support for the W3C Document Object Model (DOM) Level 1 and the Simple API for XML (SAX) 1.0 and 2.0; however, it currently has only limited support for XML Schemas and DOM Level 2 (version 1). Add the xerces.jar file to your CLASSPATH to use the parser. You can use Xalan, also available from Apache’s Web site, for XSLT processing. You can configure both the DOM and SAX parsers. Xerces uses the SAX2 methods getFeature() and setFeature() to query and set various parser features. For example, to create a validating DOM parser instance, you would write:

DOMParser domp = new DOMParser();
try {
    // Validation is switched on through the standard SAX2 feature identifier
    domp.setFeature("http://xml.org/sax/features/validation", true);
} catch (SAXException ex) {
    System.out.println(ex);
}
Other modifiable features include support for Schemas and namespaces.
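
As a rough illustration of the getFeature()/setFeature() pattern, the sketch below turns on two standard SAX2 features and reads them back. It is an assumption-laden stand-in rather than the Xerces setup described above: it obtains an XMLReader through the JAXP factory bundled with current JDKs so that it runs without xerces.jar, and the class name FeatureDemo and its helper methods are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class FeatureDemo {
    // Standard SAX2 feature identifiers (not specific to any one parser).
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // Assumption: the JDK's default JAXP parser stands in for Xerces here.
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    // Turns a feature on and returns the value read back from the parser.
    static boolean enable(XMLReader reader, String feature) {
        try {
            reader.setFeature(feature, true);
            return reader.getFeature(feature);
        } catch (Exception ex) {
            // Unrecognized or unsupported features are reported here
            System.out.println(ex);
            return false;
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println("validation: " + enable(reader, VALIDATION));
        System.out.println("namespaces: " + enable(reader, NAMESPACES));
    }
}
```

Reading the value back with getFeature() is a cheap way to confirm that a given parser actually honored the request instead of silently ignoring it.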

The following example shows a minimal program that counts the number of tags in an XML file using the DOM. The second import line specifically refers to the Xerces parser. The main method creates a new DOMParser instance and then invokes its parse() method. If the parse operation succeeds, you can retrieve a Document object through which you can access and manipulate the DOM tree using standard DOM API calls. This simple example retrieves the “servlet” nodes and prints out the number of nodes retrieved.

import org.w3c.dom.*;
import org.apache.xerces.parsers.DOMParser;

public class DOM
{
    public static void main(String[] args)
    {
        try {
            DOMParser parser = new DOMParser();
            parser.parse(args[0]);
            Document doc = parser.getDocument();
            NodeList nodes = doc.getElementsByTagName("servlet");
            System.out.println("There are " + nodes.getLength() +
                " elements.");
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
}

You can use SAX to accomplish the same task. SAX is event-oriented. In the following example, the SAX class inherits from DefaultHandler, which has default implementations for all the SAX event handlers, and overrides two methods: startElement() and endDocument(). The parser calls the startElement() method each time it encounters a new element in the XML file. In the overridden startElement() method, the code checks for the “servlet” tag and increments the tagCount counter variable. When the parser reaches the end of the XML file, it calls the endDocument() method. The code prints out the counter variable at that point. Set the ContentHandler and the ErrorHandler properties of the SAXParser instance in the main() method, and then use the parse() method to start the actual parsing.

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.apache.xerces.parsers.SAXParser;

public class SAX extends DefaultHandler
{    
      int tagCount = 0;

      public void startElement(String uri, String localName,
         String rawName, Attributes attributes)
      {
            if (rawName.equals("servlet")) {
               tagCount++;
            }
      }

      public void endDocument()
      {
            System.out.println("There are " + tagCount +
                " elements.");
      }

      public static void main(String[] args)
      {
            try {
                  SAX SAXHandler = new SAX();

                  SAXParser parser = new SAXParser();
                  parser.setContentHandler(SAXHandler);
                  parser.setErrorHandler(SAXHandler);
                  parser.parse(args[0]);
            } catch (Exception ex) {
                  System.out.println(ex);
            }
      }
}

With Xerces installed and your CLASSPATH set, you can compile and run the above programs as follows:

javac DOM.java
javac SAX.java
C:\xml\code>java DOM web.xml
There are 6 elements.
C:\xml\code>java SAX web.xml
There are 6 elements.

Oracle Parser
I used the beta 2.1.0 version of the parser included with the Oracle XML tools for this article. The Oracle parser implements DOM Level 1, and SAX 1.0 and 2.0. It has a partial implementation of DOM Level 2 and includes APIs for XSLT. It supports Schemas through the oracle.xml.parser.schema package.

The DOM class shown earlier requires relatively few changes to compile and run with the Oracle parser. Change the second import line so it refers to the Oracle parser. The parse() method expects a string containing a URL, not a physical path and filename. Alternatively, you can pass the parse() method a URL object.

import org.w3c.dom.*;
import oracle.xml.parser.v2.DOMParser;

public class DOM
{
    public static void main(String[] args)
    {
        try {
            DOMParser parser = new DOMParser();
            // The Oracle parser expects a URL string, not a plain file path
            String url = "file://C|/xml/code/" + args[0];
            parser.parse(url);
            Document doc = parser.getDocument();
            NodeList nodes = doc.getElementsByTagName("servlet");
            System.out.println("There are " + nodes.getLength() +
                " elements.");
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
}
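
The hard-coded “file://C|/xml/code/” prefix ties the listing to one directory. One possible way to build the URL from whatever path the user supplies is sketched below. It assumes a JDK that provides File.toURI() (newer than the releases current when these parsers shipped), and the FileUrl class and toUrlString() method are hypothetical names used only for illustration.

```java
import java.io.File;
import java.net.MalformedURLException;

class FileUrl {
    // Converts a relative or absolute local path into a file: URL string.
    static String toUrlString(String path) {
        try {
            return new File(path).toURI().toURL().toString();
        } catch (MalformedURLException ex) {
            throw new IllegalArgumentException(path, ex);
        }
    }

    public static void main(String[] args) {
        // Falls back to web.xml when no argument is given
        String path = args.length > 0 ? args[0] : "web.xml";
        System.out.println(toUrlString(path));
    }
}
```

The resulting string could then be handed to a parse() method that expects a URL, such as the one in the listing above.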

You must make two changes for the SAX version. The import statements would be:

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import oracle.xml.parser.v2.SAXParser;

The initiation of the parser would become:

SAXParser parser = new SAXParser();
parser.setContentHandler(SAXHandler);
parser.setErrorHandler(SAXHandler);
String url = "file://C|/xml/code/" + args[0];
parser.parse(url);

Sun Parser
Sun packages its XML APIs as the Java API for XML Processing (JAXP). I used an early release of the latest version (1.1). Like Oracle, Sun incorporates support for DOM, SAX, Schema, and XSL into its parser. The XML parser is based on the Project X parser from Sun, and the XSLT processor is actually Xalan from Apache. Using factory classes, JAXP allows you to plug in any conforming XML or XSL parser, thus creating a standard mechanism for Java applications to interact with XML parsers. The parser supports SAX 2.0, DOM Level 2, and XSLT 1.0. You need to add three jar files (jaxp.jar, crimson.jar, and xalan.jar) to your CLASSPATH.

Add the following three import lines to use the DOM API with JAXP. Note that the JAXP-specific classes begin with “javax.”

import org.w3c.dom.*;
import javax.xml.parsers.*;
import java.io.*;

Sun’s strategy of using factory classes makes the process of initiating the parser a bit different. Here are the necessary lines:

DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(args[0]);

The first line creates a new DocumentBuilderFactory instance. By default, the DocumentBuilderFactory uses the built-in XML parser that comes with JAXP, but you can change the parser by setting the system property javax.xml.parsers.DocumentBuilderFactory. You can pre-configure the parser using factory class methods like setValidating() and setNamespaceAware(). After you have the factory, you create a DocumentBuilder and invoke its parse() method to parse a document.
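
Putting those factory steps together, a minimal sketch of the tag-counting program under JAXP might look like the following. The JaxpDomCount class, its count() method, and the in-memory sample document are assumptions made so the example runs on its own; the earlier listings read web.xml from disk instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class JaxpDomCount {
    // Parses XML text and counts the elements with the given tag name.
    static int count(String xml, String tag) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);  // configure the factory first...
            dbf.setValidating(false);     // ...no DTD validation in this sketch
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document standing in for web.xml
        String xml = "<web-app><servlet/><servlet/><servlet-mapping/></web-app>";
        // Shows which implementation the factory lookup selected
        System.out.println("Factory in use: "
            + DocumentBuilderFactory.newInstance().getClass().getName());
        System.out.println("There are " + count(xml, "servlet") + " elements.");
    }
}
```

Because the code names only JAXP and DOM interfaces, swapping parsers should come down to changing the javax.xml.parsers.DocumentBuilderFactory system property rather than editing imports.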

You also use a factory class to create a SAX parser. Use the javax.xml.parsers.SAXParserFactory system property to set the default parser. The import lines would change to the following:

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.*;

There are several ways of initiating the parsing. One way would be to use the factory to get the parser directly like this:

SAX SAXHandler = new SAX();
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sparser = spf.newSAXParser();
sparser.parse(args[0], SAXHandler);

Note that the preceding example passes the SAXHandler object as a parameter to the parse() method. This way the parser “knows” where the event handlers for the various SAX events are.
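
For reference, a self-contained version of that handler-passing pattern could be sketched as follows. The JaxpSaxCount class, its count() helper, and the in-memory sample document are hypothetical; they only let the example run without a file on disk.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class JaxpSaxCount extends DefaultHandler {
    private final String tag;
    int tagCount = 0;

    JaxpSaxCount(String tag) {
        this.tag = tag;
    }

    // Called by the parser once for every start tag it encounters.
    @Override
    public void startElement(String uri, String localName,
            String rawName, Attributes attributes) {
        if (rawName.equals(tag)) {
            tagCount++;
        }
    }

    // Parses XML text, passing the handler straight to parse().
    static int count(String xml, String tag) {
        try {
            JaxpSaxCount handler = new JaxpSaxCount(tag);
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.tagCount;
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document standing in for web.xml
        String xml = "<web-app><servlet/><servlet/><servlet-mapping/></web-app>";
        System.out.println("There are " + count(xml, "servlet") + " elements.");
    }
}
```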

You can also interact indirectly with a SAX parser through the SAX2-compliant XMLReader interface. For example:

SAX SAXHandler = new SAX();
SAXParserFactory spf = SAXParserFactory.newInstance();
XMLReader xmlReader = null;
try {
    SAXParser saxParser = spf.newSAXParser();
    xmlReader = saxParser.getXMLReader();
} catch (Exception ex) {
    System.out.println(ex);
}
xmlReader.setContentHandler(SAXHandler);
xmlReader.setErrorHandler(SAXHandler);
try {
    xmlReader.parse(args[0]);
} catch (Exception ex) {
    // parse() can throw IOException as well as SAXException
    System.out.println(ex);
}

Conformance and Performance
An XML parser is an essential tool for any serious programming effort involving XML documents. The parser is the link between the XML document representing data and the application code. Conformance to the W3C standards varies widely between the parsers, but it’s beyond the scope of this article to assess how well each parser conforms to the standards.

Using a very informal method, I timed each parser as it went through a 92KB XML file. The system was running Windows NT 4 SP6 on a 300 MHz Pentium machine with 64MB of memory. I used Java’s System.currentTimeMillis() method to see how long it took the program to perform the parse() method. The results ranked Oracle’s parser as the fastest, followed by JAXP and Xerces. Again, this was a very unscientific test, so your results may vary.
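
A timing harness along these lines can be sketched in a few lines of Java. This is not the program used for the measurements reported here: the ParseTimer class, the averageMillis() helper, the choice of five runs, and the use of the JAXP DocumentBuilder are all assumptions made for illustration.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

class ParseTimer {
    // Runs the task `runs` times (runs must be positive) and returns
    // the average elapsed wall-clock time in milliseconds.
    static long averageMillis(Runnable task, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.currentTimeMillis();
            task.run();
            total += System.currentTimeMillis() - start;
        }
        return total / runs;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ParseTimer file.xml");
            return;
        }
        final File file = new File(args[0]);
        final DocumentBuilder db =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        long avg = averageMillis(() -> {
            try {
                db.parse(file);  // only the parse() call is timed
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
        }, 5);
        System.out.println("Average parse time: " + avg + " ms");
    }
}
```

Wall-clock timing like this includes JVM warm-up and disk caching effects, which is one reason such numbers should be read as rough indications only.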

Additionally, I tried the XML parser from Microsoft using a VB program. The parser was very fast. You should seriously consider the Microsoft XML parser for Windows-only applications, as its performance is far superior to all the Java parsers discussed in this article.

Table 1: Average parser load times
Parser               Avg. Parse Time (in milliseconds)
Oracle Parser        1094
Sun Parser (JAXP)    1344
Xerces               1719
Microsoft Parser     80

As XML technologies mature, we probably will see less direct interaction with the parser. Also, as parsers move toward full compliance with the standards, it will be relatively easy to change parsers, especially when using factory classes. Performance, reliability, and the programmers’ familiarity with a particular implementation are going to be the most important factors in deciding which parser to use.
