StAX: DOM Ease with SAX Efficiency

With so many XML technologies, deciding what to use and when to use it can be bewildering. Many chose to build on top of existing DOM or SAX implementations rather than StAX (the Streaming API for XML). However, with StAX JSR-173 in the pipeline, this may change. StAX is a parser-independent, streaming, pull-based Java API for reading and writing XML data. It is a memory-efficient, simple, and convenient way to process XML while retaining control over the parsing and writing process.

Most parsers fall into two broad categories: tree based (e.g., DOM) or event based (e.g., SAX). Although StAX is more closely aligned with the latter, it bridges the gap between the two. In SAX, data is pushed via events to application code handlers. In StAX, the application “pulls” the data from the XML data stream at its convenience. Application code can filter, skip tags, or stop parsing at any time. The application, not the parser, is in control, which enables a more intuitive way to process data.

This article gives you a look under the hood of this useful Java API and then demonstrates how to read and write XML documents efficiently using StAX.

A Brief Recap on XML Parsing

In tree-based or DOM parsers, the entire XML content is read and assembled into an in-memory, hierarchical object graph. Graphs are convenient when applications need to traverse the document multiple times or manipulate the DOM tree. The downside is that they can be inefficient. The object model can take up more memory than the raw XML itself. This precludes loading large documents into memory. SAX, on the other hand, is memory efficient. It reads the XML and pushes pieces of the document to application handlers using events. The parser takes control of the process, which makes it fast but also a bit awkward to use and debug.

SAX Push vs. StAX Pull

The following code examples demonstrate the respective push and pull approaches of SAX and StAX.

SAX

Application code registers a callback, which the SAX parser invokes as it reads the XML:

import java.io.FileInputStream;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

FileInputStream fis = new FileInputStream(file);
XMLReader saxXmlReader = XMLReaderFactory.createXMLReader();

// Create callback handler
DefaultHandler handler = new DefaultHandler() {
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    // do something with element
  }
};

// Register handler
saxXmlReader.setContentHandler(handler);
saxXmlReader.setErrorHandler(handler);

// Control passed to parser...
saxXmlReader.parse(new InputSource(fis));

StAX

Application code controls parsing directly by iterating over the document using the StAX stream reader:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

FileInputStream fis = new FileInputStream(file);
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader staxXmlReader = factory.createXMLStreamReader(fis);

// The application drives the loop and asks for each event in turn
for (int event = staxXmlReader.next();
     event != XMLStreamConstants.END_DOCUMENT;
     event = staxXmlReader.next()) {
  if (event == XMLStreamConstants.START_ELEMENT) {
    String element = staxXmlReader.getLocalName();
    // do something with element
  }
}

Like SAX, StAX employs a streaming approach. It holds only a small part of the document in memory at any one time. Consequently, it is extremely efficient and a good choice for dealing with large documents.
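Because the application drives the loop, it can also stop reading as soon as it has what it needs. The following is a minimal sketch of that idea rather than code from the original example set; the class name FirstTitleFinder and the inline sample XML are made up for illustration, and the code assumes that the title element contains text only.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FirstTitleFinder {

    // Returns the text of the first <title> element, or null if there is none.
    // Parsing stops as soon as the element has been read.
    public static String findFirstTitle(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // Assumption: <title> holds text only, so getElementText() is safe
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<feed><title>Simple Atom Feed File</title>"
                   + "<entry><title>StAX parsing is simple</title></entry></feed>";
        System.out.println(findFirstTitle(xml)); // prints "Simple Atom Feed File"
    }
}
```

The rest of the document is never read, which is something a SAX handler can only approximate by throwing an exception out of a callback.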

StAX in Detail

The StAX XMLStreamReader is the main interface for interacting with StAX. It presents a cursor-style interface. (An event-object-based iterator API, XMLEventReader, is also available if you require it.) With the XMLStreamReader, an application iterates over the document by invoking next() until it has read all the data. Each call to next() advances the StAX reader to the next item in the XML stream, whether it be an element, namespace, DTD, or start or end document. The next() return code indicates which type of event has been read. The possible event types are defined as constants on the XMLStreamConstants interface.
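For comparison, the iterator-style API in javax.xml.stream hands back event objects instead of integer codes. The following is a small sketch of the same kind of loop written against XMLEventReader; the class name EventReaderDemo and the sample XML are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class EventReaderDemo {

    // Collects the local names of all start elements, in document order.
    public static List<String> startElementNames(String xml) throws XMLStreamException {
        XMLEventReader reader =
            XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // Each event is a self-contained object rather than a cursor position
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        // prints [feed, title, entry]
        System.out.println(startElementNames("<feed><title>t</title><entry/></feed>"));
    }
}
```

Event objects are convenient to store or pass around, at the cost of an allocation per event; the cursor API avoids that, which is one reason the rest of this article uses XMLStreamReader.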

A common StAX idiom in application code is to read events in a loop using the XMLStreamReader and delegate control to other components based on the event type, using a switch or if statement:

for (int event = staxXmlReader.next();
     event != XMLStreamConstants.END_DOCUMENT;
     event = staxXmlReader.next()) {
  switch (event) {
    case XMLStreamConstants.START_ELEMENT:
      System.out.println("Start element " + staxXmlReader.getLocalName());
      break;
    case XMLStreamConstants.CHARACTERS:
      // Text arrives as its own event. getElementText() also works on a
      // text-only start element, but it consumes the matching END_ELEMENT.
      if (!staxXmlReader.isWhiteSpace()) {
        System.out.println("Element text " + staxXmlReader.getText());
      }
      break;
    case XMLStreamConstants.END_ELEMENT:
      System.out.println("End element " + staxXmlReader.getLocalName());
      break;
    default:
      // START_DOCUMENT is the reader's initial state, so next() never returns
      // it, and getLocalName() is not valid on that event anyway.
      break;
  }
}

On each call, the application code can choose either to process the event or to continue. In this fashion, the application can easily skip unwanted elements. However, some reader methods can be used only when the reader is positioned on certain event types. For example, calls to get attribute details such as XMLStreamReader.getAttributeValue() work only when the reader is currently positioned on a start element, not on an end element or the end of the document.
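Skipping a whole subtree and reading attributes only from a start element might be sketched as follows. The skipElement helper and the SkipDemo class are hypothetical names introduced here, not part of the StAX API or of the original article's code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SkipDemo {

    // Moves the reader from a START_ELEMENT to its matching END_ELEMENT,
    // discarding everything in between.
    static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        if (!reader.isStartElement()) {
            throw new IllegalStateException("Reader must be on a start element");
        }
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Collects the href of every <link>, ignoring anything inside <author>.
    public static List<String> linkHrefs(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> hrefs = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("author".equals(name)) {
                    skipElement(reader);
                } else if ("link".equals(name)) {
                    // Safe: the reader is known to be on a START_ELEMENT here
                    hrefs.add(reader.getAttributeValue(null, "href"));
                }
            }
        } finally {
            reader.close();
        }
        return hrefs;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<feed><author><link href='ignored'/></author>"
                   + "<link href='http://example.org/'/></feed>";
        System.out.println(linkHrefs(xml)); // prints [http://example.org/]
    }
}
```

Passing null as the namespace URI to getAttributeValue() means the namespace is not checked, which is usually what you want for unprefixed attributes such as href.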

Patterns for Using StAX

If your XML is anything more than trivial, you’ll find that putting all that parsing logic inside one large event loop can quickly become unmanageable and hard to maintain. A better way to do this is to group logically related units of parsing work into discrete components that can be called from within the main event loop.

Take the following ATOM XML feed file as an example:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Simple Atom Feed File</title>
  <subtitle>Using StAX to read feed files</subtitle>
  <link href="http://example.org/"/>
  <updated>2006-01-01T18:30:02Z</updated>
  <author>
    <name>Feed Author</name>
    <email>[email protected]</email>
  </author>
  <entry>
    <title>StAX parsing is simple</title>
    <link href="http://www.devx.com"/>
    <updated>2006-01-01T18:30:02Z</updated>
    <summary>Learn how to use StAX</summary>
  </entry>
</feed>

To make life easy, create a small piece of infrastructure. Start with a ComponentParser interface that defines the contract between the main StAX event loop and parsing components:

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public interface ComponentParser {
  void parse(XMLStreamReader staxXmlReader) throws XMLStreamException;
}

This allows parsing components to be dealt with in a common way through the interface.

Define two concrete parsers: one to parse ATOM author elements and one to parse ATOM entry elements. Ensure that they implement the ComponentParser interface.

The following is the AuthorParser class:

public class AuthorParser implements ComponentParser {
  public void parse(XMLStreamReader staxXmlReader) throws XMLStreamException {
    // read name
    StaxUtil.moveReaderToElement("name", staxXmlReader);
    String name = staxXmlReader.getElementText();

    // read email
    StaxUtil.moveReaderToElement("email", staxXmlReader);
    String email = staxXmlReader.getElementText();

    // Do something with author data...
  }
}

The following is the EntryParser class:

public class EntryParser implements ComponentParser {
  public void parse(XMLStreamReader staxXmlReader) throws XMLStreamException {
    // read title
    StaxUtil.moveReaderToElement("title", staxXmlReader);
    String title = staxXmlReader.getElementText();

    // read link
    StaxUtil.moveReaderToElement("link", staxXmlReader);
    // read href attribute by name rather than by position
    String linkHref = staxXmlReader.getAttributeValue(null, "href");

    // read updated
    StaxUtil.moveReaderToElement("updated", staxXmlReader);
    String updated = staxXmlReader.getElementText();

    // read summary
    StaxUtil.moveReaderToElement("summary", staxXmlReader);
    String summary = staxXmlReader.getElementText();

    // Do something with the data read from StAX..
  }
}

The StaxUtil class is just a helper class for reading from the StAX reader until the code finds the target element. Note that you should take care to (1) read elements in the correct order, (2) not read past the end of the stream, and (3) not read data that belongs to other ComponentParsers.
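The article does not list StaxUtil itself, so the following is only one plausible shape for moveReaderToElement, written as a sketch under that assumption. It always advances at least one event, and it throws rather than silently running off the end of the stream, which addresses point (2) above but not points (1) and (3). The sample XML and the email address in main are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public final class StaxUtil {

    private StaxUtil() {
    }

    // Hypothetical implementation: advances the reader until it is positioned
    // on a START_ELEMENT whose local name equals target.
    public static void moveReaderToElement(String target, XMLStreamReader reader)
            throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && target.equals(reader.getLocalName())) {
                return;
            }
        }
        throw new XMLStreamException(
            "Element <" + target + "> not found before end of document");
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<author><name>Feed Author</name>"
                   + "<email>author@example.org</email></author>";
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        moveReaderToElement("email", reader);
        System.out.println(reader.getElementText()); // prints "author@example.org"
        reader.close();
    }
}
```

Note that this version happily walks past the end of the current component, so a missing name element inside one author would match the next name element anywhere later in the document. A stricter helper could track element depth and stop at the enclosing end element.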

In the main event loop, modify the code to farm out parsing work to ComponentParsers based on the XML element name. ComponentParsers can be pre-registered with the main class prior to parsing. The advantage of this pattern is that it keeps the main event loop code simple and devoid of any understanding of the ATOM XML format. ComponentParsers still pull data from StAX, but they are neatly separated and can be reused (e.g., in recurring elements in the XML hierarchy). You can now apply the loop to parse any XML file, provided you registered the appropriate ComponentParsers. The following is the main event loop using a component parser registry:

public class StaxParser implements ComponentParser {
  private Map<String, ComponentParser> delegates;
  …

  public void parse(XMLStreamReader staxXmlReader) throws XMLStreamException {
    for (int event = staxXmlReader.next();
         event != XMLStreamConstants.END_DOCUMENT;
         event = staxXmlReader.next()) {
      if (event == XMLStreamConstants.START_ELEMENT) {
        String element = staxXmlReader.getLocalName();
        // If a ComponentParser is registered that can handle
        // this element, delegate to it
        ComponentParser parser = delegates.get(element);
        if (parser != null) {
          parser.parse(staxXmlReader);
        }
      }
    }
  }
}

Here’s how you would put it all together in a test case:

InputStream in = this.getClass().getResourceAsStream("atom.xml");

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader staxXmlReader = factory.createXMLStreamReader(in);

StaxParser parser = new StaxParser();
parser.registerParser("author", new AuthorParser());
parser.registerParser("entry", new EntryParser());

parser.parse(staxXmlReader);
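The registration code is elided in the listing above, so the following self-contained sketch shows one way the registry could work end to end. Everything here is an assumption for illustration: the RegistryDemo class, its registerParser method backed by a HashMap, the nested copy of the ComponentParser contract, and the lambda used as an entry parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RegistryDemo {

    // Same contract as ComponentParser, nested to keep the sketch self-contained.
    interface ComponentParser {
        void parse(XMLStreamReader reader) throws XMLStreamException;
    }

    // Hypothetical registry: element local name -> parser for that element
    private final Map<String, ComponentParser> delegates = new HashMap<>();

    public void registerParser(String elementName, ComponentParser parser) {
        delegates.put(elementName, parser);
    }

    // Main event loop: knows nothing about the document format.
    public void parse(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                ComponentParser parser = delegates.get(reader.getLocalName());
                if (parser != null) {
                    parser.parse(reader);
                }
            }
        }
    }

    // Collects the title of every <entry>; the feed-level <title> is ignored
    // because no parser is registered for it.
    public static List<String> entryTitles(String xml) throws XMLStreamException {
        List<String> titles = new ArrayList<>();
        RegistryDemo registry = new RegistryDemo();
        registry.registerParser("entry", r -> {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(r.getLocalName())) {
                    titles.add(r.getElementText());
                    return;
                }
            }
        });
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            registry.parse(reader);
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<feed><title>Simple Atom Feed File</title>"
                   + "<entry><title>StAX parsing is simple</title></entry></feed>";
        System.out.println(entryTitles(xml)); // prints [StAX parsing is simple]
    }
}
```

Keying the registry on the local name keeps the loop format-agnostic, but it also means that two elements with the same name in different namespaces would share a parser; keying on a QName would be a natural refinement.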

StAX Output

No discussion of StAX is complete without mentioning StAX output. StAX is bi-directional in that it supports both read and write. The StAX XMLStreamWriter interface provides a simple, low-level API to output XML data.

The following is an example of using StAX to generate an XML ATOM feed document:

File file = new File("atomoutput.xml");
FileOutputStream out = new FileOutputStream(file);
// Atom timestamps use the RFC 3339 format, e.g. 2006-01-01T18:30:02Z
String now = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX").format(new Date());

XMLOutputFactory factory = XMLOutputFactory.newInstance();
// Use the same encoding for the stream as for the XML declaration
XMLStreamWriter staxWriter = factory.createXMLStreamWriter(out, "UTF-8");

staxWriter.writeStartDocument("UTF-8", "1.0");

// feed
staxWriter.writeStartElement("feed");
staxWriter.writeNamespace("", "http://www.w3.org/2005/Atom");

// title
StaxUtil.writeElement(staxWriter, "title", "Simple Atom Feed File");

// subtitle
StaxUtil.writeElement(staxWriter, "subtitle", "Using StAX to read feed files");

// link
staxWriter.writeStartElement("link");
staxWriter.writeAttribute("href", "http://example.org/");
staxWriter.writeEndElement();

// updated
StaxUtil.writeElement(staxWriter, "updated", now);

// author...

// entry..

staxWriter.writeEndElement(); // end feed
staxWriter.writeEndDocument();
staxWriter.flush();
staxWriter.close();
// XMLStreamWriter.close() does not close the underlying stream
out.close();

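Once the author and entry elements are written in the same way, the resultant XML file matches the ATOM feed file previously shown in the “Patterns for Using StAX” section in structure and content, apart from the updated timestamp and whitespace.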
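The StaxUtil.writeElement helper used in the writer example is not listed in the article. A minimal sketch of what such a helper might look like, together with a small feed written to a string, is shown below. The AtomWriterSketch class and its method names are assumptions for illustration, and writeDefaultNamespace() is used here in place of writeNamespace() with an empty prefix.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class AtomWriterSketch {

    // Hypothetical stand-in for StaxUtil.writeElement: writes <name>text</name>.
    static void writeElement(XMLStreamWriter writer, String name, String text)
            throws XMLStreamException {
        writer.writeStartElement(name);
        writer.writeCharacters(text); // escapes &, < and > in the text
        writer.writeEndElement();
    }

    // Builds a minimal feed in memory; the timestamp is passed in by the caller.
    public static String buildFeed(String title, String updated) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument("1.0");
        writer.writeStartElement("feed");
        writer.writeDefaultNamespace("http://www.w3.org/2005/Atom");
        writeElement(writer, "title", title);
        writeElement(writer, "updated", updated);
        writer.writeEndElement(); // end feed
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(buildFeed("Simple Atom Feed File", "2006-01-01T18:30:02Z"));
    }
}
```

The writer emits no indentation or line breaks of its own, so the output is a single line; any pretty-printing has to be written explicitly with writeCharacters() or handled by a separate tool.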

StAX Parsers

Several JSR-173-compliant parsers are available, including the following:

  1. Woodstox
  2. StAX Reference Implementation
  3. Oracle StAX Pull Parser
  4. BEA

You’ll find the entire StAX JSR-173 on the JCP Web site.

Just How Fast Is StAX?

In addition to being easy to use, StAX is also very fast. Sun has released a whitepaper (PDF) that compares its performance with several other parsers.

The Future of StAX

StAX is ideally suited for no-nonsense, efficient XML input and output. The pull paradigm promotes a more intuitive parsing approach whereby application components can aggregate logically related parsing operations and pull what they want from the stream one element after another. Developers must still maintain appropriate state throughout the parsing process, but they retain overall control. StAX, like SAX, works well for large documents and when parts of the document can be dealt with in small chunks independently of other chunks. And best of all, it’s fast.
