Iteration 2: Divide-and-Conquer JAXP XSLT Transformation
In order to process huge XML input files without running out of memory, you need to subdivide the input XML document into manageable chunks. In this case, each Record element is a manageable chunk and can be transformed in isolation from any other Record element. So you first change your XSLT transformation slightly so that it transforms just an individual record (see Listing 5).
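Listing 5 itself is not reproduced here, but a per-record stylesheet generally takes the following shape; the PolicyRecord output element below is a hypothetical stand-in, not part of the article's actual transform:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- match a single Record rather than the whole document root -->
  <xsl:template match="/Record">
    <!-- PolicyRecord is a hypothetical output element -->
    <PolicyRecord>
      <xsl:apply-templates/>
    </PolicyRecord>
  </xsl:template>
</xsl:stylesheet>
```

The key change from Iteration 1 is simply that the top-level template matches a Record element instead of the InsurancePolicyData document root.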
Next, the brilliant creators of the dom4j open source library come to the rescue. They have developed a way to easily parse huge XML input files incrementally. All you need to do is give dom4j an XPath expression (in this case, /InsurancePolicyData/Record) that subdivides the input document, and the library will load only one portion of the XML document's sub-tree into memory at a time:
// read the input file incrementally
SAXReader reader = new SAXReader();
// route each Record element to the element handler from Listing 7
reader.addHandler("/InsurancePolicyData/Record", elementHandler);
The SingleThreadSplitXSLT.java file (see Listing 6) is an updated Java command-line program that parses the input XML document incrementally. That program registers an event-listener class, termed an element handler (see Listing 7), to perform the per-record XSLT transformation.
The element handler class has a few tricks up its sleeve to get the job done. First of all, it caches the compiled XSLT transformation, as this same transformation is going to be applied hundreds or perhaps even millions of times:
transformer = cachedXSLT.newTransformer();
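The caching idea can be sketched with JDK-only JAXP: the stylesheet is compiled once into a thread-safe Templates object, and only a lightweight Transformer is created per record. The inline mini-stylesheet below is an illustrative stand-in, not Listing 5's actual transform:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CachedTransform {
    // Hypothetical mini-stylesheet standing in for the per-record XSLT.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/Record'><out><xsl:value-of select='.'/></out></xsl:template>"
      + "</xsl:stylesheet>";

    // Parse and compile the stylesheet exactly once; Templates is reusable.
    static Templates compile() throws Exception {
        return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(XSLT)));
    }

    // Per-record work: only a cheap Transformer is created each time.
    static String transformRecord(Templates cachedXSLT, String recordXml) throws Exception {
        Transformer transformer = cachedXSLT.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(recordXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Templates cachedXSLT = compile();
        for (int i = 0; i < 3; i++) {
            System.out.println(transformRecord(cachedXSLT, "<Record>r" + i + "</Record>"));
        }
    }
}
```

Because Templates is immutable and thread-safe, this same cached object can later be shared across worker threads, which matters for the concurrent iteration that follows.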
There is no reason for the XSLT file to be loaded from a file into memory and parsed each time a record is transformed. Next, you need to make sure the XML declaration for each individual record transformed is omitted:
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
This declaration is valid only once in an XML file and must appear at the top of the output file. The final trick in the element handler class is to copy each individual record into its own XML document. Once the copy is made, you can tell dom4j to release the memory used to parse that portion of the huge XML input file via the detach() method call:
Element record = path.getCurrent();
Document newDocument = DocumentHelper.createDocument();
newDocument.add(record.createCopy());
record.detach(); // release the memory used for this record's sub-tree
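Putting these pieces together, the element handler can be sketched roughly as follows. The real implementation is Listing 7; the class name and constructor here are illustrative, and the dom4j jar must be on the classpath:

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.DocumentSource;

/** Sketch of a per-record element handler; the real code is in Listing 7. */
public class RecordHandlerSketch implements ElementHandler {

    private final Templates cachedXSLT; // stylesheet compiled once, up front
    private final Result output;        // where transformed records are written

    public RecordHandlerSketch(Templates cachedXSLT, Result output) {
        this.cachedXSLT = cachedXSLT;
        this.output = output;
    }

    public void onStart(ElementPath path) {
        // nothing to do when a Record element opens
    }

    public void onEnd(ElementPath path) {
        try {
            Element record = path.getCurrent();

            // copy the finished record into its own small document
            Document newDocument = DocumentHelper.createDocument();
            newDocument.add(record.createCopy());

            // free the memory dom4j used to parse this portion of the input
            record.detach();

            // cheap per-record step: a Transformer from the cached Templates
            Transformer transformer = cachedXSLT.newTransformer();
            // the XML declaration may appear only once, so omit it per record
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.transform(new DocumentSource(newDocument), output);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because onEnd() detaches each record as soon as it is copied, the in-memory tree never grows beyond a single record plus parser overhead, regardless of input file size.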
So you process only one portion of the XML input file at a time, and do this as many times as needed to process all records in the input file. The file size and contents are identical between Iteration 1 of the program and this iteration, but this version can process the entire 275 MB input file in less than 20 MB of memory, a huge improvement over the first version. Unfortunately, you do lose some efficiency: the new program requires about another 20 seconds to execute. Since Iteration 1 of the program did not copy portions of the input document and executed only one transformation rather than many thousands (one per input record), the increased run time is understandable.
This iteration also consumes only one CPU thread. Modern server architectures typically have many CPUs on the same server. Those CPUs may have multiple cores. And each core may run multiple threads. So can you decrease the overall run time of this program by making use of the Java standard concurrency library? It's time to find out.