Stage One of the Solution: StAX to Chunks
One solution that meets these requirements is to use each XML technology where it is strongest: StAX to parse the incoming XML and split it into chunks of manageable size, and XMLBeans to load those chunks into objects for processing. The data flow diagram in Figure 2 illustrates this approach.
|Figure 2. Data Flow Diagram of XML Technology Strengths Approach|
StAX provides two different APIs for reading XML: an XMLStreamReader and an XMLEventReader. The XMLStreamReader provides the lowest-level access to the XML, but it breaks the XML data into fine-grained particles that the application must handle individually. The XMLEventReader provides a slightly higher level of abstraction by encapsulating the XML particles in events such as element start and element end. You could do the chunking implementation with either API, but I think you'll agree that the XMLEventReader makes the process easier.
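To make the difference between the two reading APIs concrete, here is a minimal sketch that counts start elements with each one. The class name ReaderStyles and both method names are hypothetical and chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class ReaderStyles {

    // Cursor style: next() returns an int code and the reader itself holds the current state.
    static int countWithStreamReader(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    // Event style: nextEvent() returns a self-contained XMLEvent object.
    static int countWithEventReader(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.nextEvent().isStartElement()) {
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

Because each XMLEvent is an object in its own right, it can be stored or handed to another component, which is what makes the event API convenient for chunking.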
To create an XMLEventReader, you use the XMLInputFactory class. The factory can be configured using a set of properties described in the StAX documentation. However, in this example, you will just use the default properties. Once the reader is created, you can extract a single, complete element from it by iterating over the XML events and checking if the event is a start element. If it is, you check whether the element is the one you would like to extract. (Listing 2 shows this simple loop.)
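The loop described above might look roughly like the following sketch, which uses the default factory properties. It is not the article's Listing 2; the class name StartElementFinder and the method findStart are hypothetical, and matching on the local name only (ignoring the namespace) is a simplifying assumption.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
class StartElementFinder {

    /**
     * Consumes events until a start element with the given local name is read.
     * Returns that start element, or null if the input ends first.
     */
    static StartElement findStart(XMLEventReader reader, String localName)
            throws XMLStreamException {
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                if (localName.equals(start.getName().getLocalPart())) {
                    return start;
                }
            }
        }
        return null;
    }
}
```

A reader for this method can be obtained with XMLInputFactory.newInstance().createXMLEventReader(...), passing either a java.io.Reader or an java.io.InputStream.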
You will need to continue to loop over the XML events until an end element event occurs with the same name as the start element. At that point you know that you have a complete element (in this case, an Employee). But you have processed only StAX XMLEvent objects, which need to be converted back into XML before you can process them with XMLBeans.
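As a sketch of that second loop, the following collects every event of one element into a list. The class name ElementCollector is hypothetical. The depth counter is an added assumption that goes slightly beyond the text: it guards against the case where an element of the same name is nested inside the one being extracted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
class ElementCollector {

    /**
     * Given a start element that was just read from the reader, returns all events
     * from that start element through its matching end element, inclusive.
     */
    static List<XMLEvent> collect(XMLEventReader reader, StartElement start)
            throws XMLStreamException {
        QName name = start.getName();
        List<XMLEvent> events = new ArrayList<>();
        events.add(start);
        int depth = 1; // number of open elements with this name
        while (depth > 0 && reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            events.add(event);
            if (event.isStartElement() && event.asStartElement().getName().equals(name)) {
                depth++;
            } else if (event.isEndElement() && event.asEndElement().getName().equals(name)) {
                depth--;
            }
        }
        if (depth != 0) {
            throw new XMLStreamException("No end element found for " + name);
        }
        return events;
    }
}
```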
This is where the second half of StAX comes in, the XML writing APIs: XMLStreamWriter and XMLEventWriter. Again, just as with the readers, the XMLStreamWriter provides low-level particle writing and the XMLEventWriter will write an entire event at once. The StAX API was designed in such a way that events produced by the XMLEventReader can be directly consumed by the XMLEventWriter. This makes the process of serializing the XML chunk nice and easy.
The XMLEventWriter is created in a similar fashion to the reader, using an XMLOutputFactory. Because the events you will be writing are from the middle of larger XML documents, they likely will be using namespace prefixes that were not originally defined on the elements themselves. To handle this, you have to set the property javax.xml.stream.isRepairingNamespaces to true to allow the writer to add namespace declarations for prefixes if they are missing (see Listing 3). Once the writer is created, you can feed it all the XMLEvent objects that occur in the desired element (in this case, the Employee).
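Putting the reader and writer together, a chunk extractor could be sketched as follows. This is not the article's Listing 3: the class name ChunkSerializer and the method nextChunk are hypothetical, elements are matched by local name only, and a fresh XMLEventWriter backed by a java.io.StringWriter is created for each chunk purely to keep the example short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
class ChunkSerializer {

    private static final XMLOutputFactory OUTPUT_FACTORY = createFactory();

    private static XMLOutputFactory createFactory() {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        // Same property as javax.xml.stream.isRepairingNamespaces: lets the writer
        // declare prefixes that were defined on an ancestor outside the chunk.
        factory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        return factory;
    }

    /** Returns the next element with the given local name as XML text, or null if none remains. */
    static String nextChunk(XMLEventReader reader, String localName) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = OUTPUT_FACTORY.createXMLEventWriter(out);
        int depth = 0; // greater than zero while inside the wanted element
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(localName)) {
                depth++;
            }
            if (depth > 0) {
                writer.add(event); // events from the reader go straight into the writer
            }
            if (depth > 0 && event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(localName)
                    && --depth == 0) {
                writer.flush();
                writer.close();
                return out.toString();
            }
        }
        writer.close();
        return null;
    }
}
```

Calling nextChunk repeatedly on the same reader yields one chunk per call until it returns null. Each returned string is intended to be a standalone XML fragment that can then be handed to XMLBeans.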
As the XMLEventWriter consumes the events, it converts the events back into valid XML. It writes this XML to the stream that was provided when the writer was created, which could point to a temporary file on disk that is used to store the chunk until it is ready to be processed.
Another approach that can offer better performance is keeping the XML in memory by writing to a string buffer. The XMLOutputFactory that you previously used to create the XMLEventWriter takes a reference to the java.io.Writer that will back the event writer. To avoid recreating the event writer for each chunk, you ideally would just reset that backing writer after each chunk is complete. The bad news is that java.io.StringWriter, the standard string-buffer implementation of the abstract java.io.Writer class, has no reset operation of its own (so it cannot conveniently be reused). The good news is that it is relatively simple to write your own.
Listing 4 shows a ResettableStringWriter implementation. This writer contains a java.lang.StringBuilder that is used to store the data written until it is requested through the Reset() operation, which returns the data while setting the buffer back to a size of zero. This custom string writer allows you to use the same XMLEventWriter throughout the lifetime of the application and keeps the XML in memory rather than forcing a round trip to disk for each chunk.
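A writer along these lines might look like the following minimal sketch. It is an illustration under stated assumptions rather than the article's Listing 4: the method is spelled reset() to follow Java naming conventions, and the size check discussed later is left out.

```java
import java.io.Writer;

// Minimal sketch of a reusable in-memory writer; not the article's Listing 4.
class ResettableStringWriter extends Writer {

    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void write(char[] cbuf, int off, int len) {
        buffer.append(cbuf, off, len);
    }

    @Override
    public void flush() {
        // Nothing to flush: all data already lives in the in-memory buffer.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Returns everything written since the last reset and empties the buffer. */
    public String reset() {
        String data = buffer.toString();
        buffer.setLength(0);
        return data;
    }
}
```

When this writer backs an XMLEventWriter, call flush() on the event writer before reset(), so that any output the event writer is still holding reaches the buffer first.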
One important point to consider is the amount of data you want to keep in the memory buffer while reading the XML. Because the XML may not have been validated yet, it is possible that the desired start element, Employee, is found but then no end element is ever found. This would cause the application to continually read and write new events until it reaches the end of the input file. To prevent memory exhaustion, the ResettableStringWriter provides a simple check to ensure that the size of the buffer remains below a reasonable limit. If this limit is exceeded, an exception will be raised and your application can respond appropriately. A possible workaround for this problem is performing validation earlier in the process, either through a validating StAX parser or another stage in the data flow.
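The size check could be sketched as follows. The class name BoundedResettableStringWriter, the maxChars constructor parameter, and the choice of java.io.IOException as the exception type are all assumptions made for illustration; the text above does not specify them.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical variant of the resettable writer with an upper bound on buffered data.
class BoundedResettableStringWriter extends Writer {

    private final StringBuilder buffer = new StringBuilder();
    private final int maxChars; // assumed limit, counted in characters

    BoundedResettableStringWriter(int maxChars) {
        this.maxChars = maxChars;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Use long arithmetic so the comparison cannot overflow.
        if ((long) buffer.length() + len > maxChars) {
            throw new IOException("Chunk exceeds limit of " + maxChars + " characters");
        }
        buffer.append(cbuf, off, len);
    }

    @Override
    public void flush() {
        // Nothing to flush: all data already lives in the in-memory buffer.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Returns everything written since the last reset and empties the buffer. */
    public String reset() {
        String data = buffer.toString();
        buffer.setLength(0);
        return data;
    }
}
```

A write that would exceed the limit is rejected as a whole, so the buffer keeps only the data accepted before the failure. The application can then discard that partial chunk with reset() and decide how to proceed.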