Merge XML documents with StAX

Deriving new XML documents from input documents is where the Streaming API for XML (StAX) shines. This tip explores how client applications can utilize the event-based API to efficiently merge two incoming XML documents into one.

First published by IBM at http://www-106.ibm.com/developerworks/xml/library/x-tipstx5/

In my previous tip, "Write XML documents with StAX", I showed how to use the low-level, cursor-based StAX API to create XML documents programmatically. In this tip, I use the high-level, event-based API to build a program that merges two incoming XML documents into one.

Processing several XML documents simultaneously can be a significant challenge. SAX parsers, for example, deliver parsing events to the client application through callbacks. Because the SAX parser controls this process, the client application has no practical way to synchronize the different input sources. Programmers therefore usually resort to a DOM parser for multi-document processing. The penalty, however, is excessive resource usage: the node trees of all input documents must reside completely in memory.

StAX does not suffer from these drawbacks. As its name indicates, it is targeted at streaming applications such as the merging of two documents. The following example shows how this is done. Assume that you want to merge two documents containing lists of products. Each document consists of a <products> element that contains one or more <product> elements sorted alphabetically by their pid attribute. Listing 1 is an example of such a document:

Listing 1. Product list

<products>
   <product pid="01"/>
   <product pid="05"/>
   <product pid="09"/>
</products>
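
To make the contrast with SAX concrete, here is a minimal sketch (not part of the original listings) in which the application, rather than the parser, decides which input to advance next. It simply alternates between two product lists and records the pid values it sees. The class PullSketch and the methods nextPid() and alternate() are hypothetical names chosen for illustration, and the sketch assumes the StAX implementation bundled with a current JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only
class PullSketch {

   private static final QName PRODUCT = new QName("product");
   private static final QName PID = new QName("pid");

   // Pull events until the next <product> start tag and return its pid,
   // or null when the input is exhausted
   static String nextPid(XMLEventReader reader) throws XMLStreamException {
      while (reader.hasNext()) {
         XMLEvent event = reader.nextEvent();
         if (event.isStartElement()
               && PRODUCT.equals(event.asStartElement().getName())) {
            Attribute pid = event.asStartElement().getAttributeByName(PID);
            if (pid != null)
               return pid.getValue();
         }
      }
      return null;
   }

   // Advance two documents in turn; the caller decides which reader moves
   static List<String> alternate(String xml1, String xml2)
         throws XMLStreamException {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      XMLEventReader r1 = factory.createXMLEventReader(new StringReader(xml1));
      XMLEventReader r2 = factory.createXMLEventReader(new StringReader(xml2));
      List<String> seen = new ArrayList<>();
      String pid1 = nextPid(r1);
      String pid2 = nextPid(r2);
      while (pid1 != null || pid2 != null) {
         if (pid1 != null) {
            seen.add("1:" + pid1);
            pid1 = nextPid(r1);
         }
         if (pid2 != null) {
            seen.add("2:" + pid2);
            pid2 = nextPid(r2);
         }
      }
      r1.close();
      r2.close();
      return seen;
   }
}
```

Neither document is ever held in memory as a whole; each reader only moves forward when the calling code asks it to.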

In Listing 2, I use a classical merge algorithm to merge the lists from both documents. Depending on the comparison between the merge criteria from the documents, I either copy events from document 1 to the output document or from document 2 to the output document. This is done by the readToNextElement() method. This method contains some extra logic for detecting the end of the product list. Special treatment is also required for the beginning of the document and for the end of the document.
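
Stripped of all XML handling, the merge logic can be sketched on two plain sorted lists of pid strings. This is only an illustration of the comparison loop under the assumption that both inputs are already sorted and contain no null entries; the class SortedMergeSketch and its merge() method are hypothetical names that do not appear in the listing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper class, for illustration only
class SortedMergeSketch {

   // Merge two sorted lists; on equal keys the first list wins the tie
   static List<String> merge(List<String> first, List<String> second) {
      Iterator<String> it1 = first.iterator();
      Iterator<String> it2 = second.iterator();
      // null marks an exhausted input
      String pid1 = it1.hasNext() ? it1.next() : null;
      String pid2 = it2.hasNext() ? it2.next() : null;
      List<String> out = new ArrayList<>();
      while (pid1 != null || pid2 != null) {
         if (pid2 == null || (pid1 != null && pid1.compareTo(pid2) <= 0)) {
            // Continue in list 1
            out.add(pid1);
            pid1 = it1.hasNext() ? it1.next() : null;
         } else {
            // Continue in list 2
            out.add(pid2);
            pid2 = it2.hasNext() ? it2.next() : null;
         }
      }
      return out;
   }
}
```

The XML version follows the same pattern, except that "taking the next item" means copying all events up to the next <product> element.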

Listing 2. Merging documents

import java.io.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class Merger {

   private static final QName prodName = new QName("product");
   private static final QName pidName = new QName("pid");

   public static void main(String[] args)
      throws FileNotFoundException, XMLStreamException {
         
      // Use the reference implementation for the XML input factory
      System.setProperty(
         "javax.xml.stream.XMLInputFactory",
         "com.bea.xml.stream.MXParserFactory");
      // Create the XML input factory
      XMLInputFactory factory = XMLInputFactory.newInstance();
      // Create XML event reader 1
      XMLEventReader r1 = 
         factory.createXMLEventReader(new FileReader("prodList1.xml"));
      // Create XML event reader 2
      XMLEventReader r2 = 
         factory.createXMLEventReader(new FileReader("prodList2.xml"));

      // Create the output factory
      XMLOutputFactory xmlof = XMLOutputFactory.newInstance();
      // Create XML event writer
      XMLEventWriter xmlw = xmlof.createXMLEventWriter(System.out);

      // Read to first <product> element in document 1
      // and output to result document
      String pid1 = readToNextElement(r1, xmlw, false);
      // Read to first <product> element in document 2
      // without writing to result document
      String pid2 = readToNextElement(r2, null, false);
      // Loop over both XML input streams
      while (pid1 != null || pid2 != null) {
         // Compare merge criteria
         if (pid2 == null || (pid1 != null && pid1.compareTo(pid2) <= 0))
            // Continue in document 1
            pid1 = readToNextElement(r1, xmlw, pid2 == null);
         else
            // Continue in document 2
            pid2 = readToNextElement(r2, xmlw, pid1 == null);
      }
      xmlw.close();
   }

   /**
    * @param reader - the document reader
    * @param writer - the document writer
    * @param processEnd - forces the document end to be written
    * @return - the next merge criterion value
    * @throws XMLStreamException
    */
   private static String readToNextElement(XMLEventReader reader,
         XMLEventWriter writer, boolean processEnd) throws XMLStreamException {
      // Nesting level
      int level = 0;
      while (true) {
         // Read event to be written to result document
         XMLEvent event = reader.nextEvent();
         // Avoid double processing of document end
         if (!processEnd)
            switch (event.getEventType()) {
               case XMLEvent.START_ELEMENT :
                  ++level;
                  break;
               case XMLEvent.END_ELEMENT :
                  if (--level < 0)
                     return null;
                  break;
            }
         // Output event
         if (writer != null)
            writer.add(event);
         // Look at next event
         event = reader.peek();
         switch (event.getEventType()) {
            case XMLEvent.START_ELEMENT :
               // Start element - stop at <product> element
               QName name = event.asStartElement().getName();
               if (name.equals(prodName)) {
                  return event
                     .asStartElement()
                     .getAttributeByName(pidName)
                     .getValue();
               }
               break;
            case XMLEvent.END_DOCUMENT :
               // Stop at end of document
               return null;
         }
      }
   }
}

As you can see, the event-based API is ideally suited for deriving a document from other documents. With the low-level, cursor-based API, you would need a different method call for each event type, but with the event-based API you simply pass generic events to the event writer's add() method.
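
The difference can be sketched by copying a small document both ways. This is an illustrative comparison, not code from the original tip: the class CopySketch and its two methods are hypothetical names, and the cursor-based variant deliberately handles only elements, attributes, and text (namespaces, comments, and processing instructions are left out).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.*;

// Hypothetical helper class, for illustration only
class CopySketch {

   // Cursor-based API: each event type needs its own writer call
   static String copyWithCursor(String xml) throws XMLStreamException {
      XMLStreamReader in = XMLInputFactory.newInstance()
         .createXMLStreamReader(new StringReader(xml));
      StringWriter out = new StringWriter();
      XMLStreamWriter w = XMLOutputFactory.newInstance()
         .createXMLStreamWriter(out);
      while (in.hasNext()) {
         switch (in.next()) {
            case XMLStreamConstants.START_ELEMENT :
               w.writeStartElement(in.getLocalName());
               for (int i = 0; i < in.getAttributeCount(); i++)
                  w.writeAttribute(
                     in.getAttributeLocalName(i),
                     in.getAttributeValue(i));
               break;
            case XMLStreamConstants.CHARACTERS :
               w.writeCharacters(in.getText());
               break;
            case XMLStreamConstants.END_ELEMENT :
               w.writeEndElement();
               break;
            default :
               // Other event types are ignored in this sketch
               break;
         }
      }
      w.close();
      return out.toString();
   }

   // Event-based API: events are passed through unchanged
   static String copyWithEvents(String xml) throws XMLStreamException {
      XMLEventReader in = XMLInputFactory.newInstance()
         .createXMLEventReader(new StringReader(xml));
      StringWriter out = new StringWriter();
      XMLEventWriter w = XMLOutputFactory.newInstance()
         .createXMLEventWriter(out);
      // add(XMLEventReader) copies every remaining event to the writer
      w.add(in);
      w.close();
      return out.toString();
   }
}
```

The exact serialization (for example, whether an XML declaration is emitted) depends on the StAX implementation in use.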

Summary
This tip has demonstrated the use of the event-based API of StAX for pipelined XML applications, such as the merging of documents. As of Nov 3, 2003, StAX has passed the Final JSR-0173 Approval Ballot. It will make a valuable addition to every Java programmer's toolbox.

Berthold Daum is a consultant and writer based in Lützelbach, Germany. For information on his recent books, System Architecture with XML and Modeling Business Objects with XML Schema (both from Morgan Kaufmann), see http://www.bdaum.de. You can contact Berthold at berthold.daum@bdaum.de.