uring the past 20 years I have had the privilege of working with well over 1,000 different software developers. Some of the best developers in the industry mentored me when I began my career. They taught me software development from the ground up, beginning with how to run and enhance automated unit tests for operating systems. (Yes, people knew how to create and execute automated unit tests well before Kent Beck developed the elegant xUnit framework.)
I gradually started leading teams of junior software developers and tried to teach them the lessons I had learned. I since have mentored many development teams from the United States, India, and China, and I’ve come to a sobering realization: the biggest lesson of all is often forgotten in today’s fast-paced and often outsourced software industry. Getting a software program working is just Step 1 in a series of steps to produce good software.
To keep this vital lesson fresh in your mind and further your software engineering prowess, this article walks you through Step 1 of a challenging and all-too-common programming problem: transforming XML documents between disparate systems.
The Problem: Efficiently Processing Huge XML Transformations
I often face the need to transform XML documents. Sometimes these XML documents get very large as one system performs a data extract of perhaps hundreds of thousands of transactions and then sends them on to a second system for further processing. Rarely does the sending system view data in the same way as the receiving system, therefore an XML transformation almost always is required to integrate the two systems.
Let’s apply this problem to a programming exercise. There are many ways to transform XML, and more than a few software companies make nice livings out of providing such services. But as good software developers, we should always try to use industry standards where possible. XSLT is the industry standard for this problem, and Java has a standard way to invoke XSLT, via JAXP.
The Input.xml document (see Listing 1) in this exercise contains a simplified partial extract of an XML file from the sending system. Notice two important points about how this file is structured. First of all, each transaction relayed between the two systems is delimited by a Record element under the root element named InsurancePolicyData. That way the sending system can place any number of transactions in the same XML export file. Second, the sending system has a pretty obscure way of writing out its data: in label/value pairs rather than in more traditional XML elements and text within an element.
The receiving system wants the XML to be structured as shown in the Output.xml document (see Listing 2) so it can process it more efficiently. XML transformations always require time, processing resources, and some sort of mapping and/or coding process to complete. So always check if system 1 or system 2 can agree upon a standardized XML schema to avoid the transformation process. But in many cases these days the two systems are provided by third-party vendors you cannot control. And more often than not the systems are written by two different vendors with two very different sets of goals. So begins your quest to write an efficient standards-based XML transformation program.
Iteration 1: A Standard “Big Bang” JAXP XSLT Transformation
If you are accustomed to developing object-oriented programs in Java, you’ll need some time to get used to XSLT. But you can accomplish sophisticated XML transformations with a pretty small amount of XSLT code, and you can also modify the transformation very easily as the sending and receiving systems enhance their XML schemas over time. A Java code transformation takes a lot longer to initially create and modify.
The WholeFileTransformation.xslt file (see Listing 3) contains a simplified partial XSLT to transform the Input.xml document into the Output.xml document. The only trick in this particular XSTL file is creating an XML element from the ColumnName attribute contained within the input XML file. The XSLT file is itself an XML file and must conform to all XML specifications. In order to dynamically create a named XML element, you must escape the greater-than and less-than symbols surrounding the name of the element. The following line will send a less-than () symbol to the output document:
So you combine this with the name of the element, end it with the escaped greater-than symbol, and you have a dynamically created XML element start tag.
Next, you create a Java code program to run a JAXP standard XSLT transformation. The SingleThreadXSLT.java file (see Listing 4) is a very simple Java class that is called from the command line. It contains three input parameters: the names of the input XML, XSLT, and output XML files. It will then run the XSLT transformation on the input XML file to produce the output XML file.
This transformation program runs very well on a small input XML file and takes less than a second once the Java virtual machine is loaded. It also runs in very little memory compared with what the base Java virtual machine consumes, around 12 MB in total.
Unfortunately, as you increase the size of the input XML file, you greatly increase the size of the Java virtual machine required to transform the file. A 275 MB input file nearly runs out of memory in a virtual machine sized to 1 GB. A standard JAXP transformation appears to load the entire input document into a DOM structure in memory before even starting the transformation process. So this approach certainly won’t work to process multiple gigabyte-sized input files. It also utilizes only one CPU thread per program run, causing the 275 MB input file to take a little over 200 seconds to execute.
Iteration 2: Divide-and-Conquer JAXP XSLT Transformation
In order to process huge XML input files without running out of memory, you need to subdivide the input XML document into manageable chunks. In this case, each Record element is a manageable chunk and can be transformed in isolation of any other Record element. So you first change your XSLT transformation slightly to transform just an individual record (see Listing 5).
Next, the brilliant creators of the dom4j open source library come to the rescue. They have developed a way to easily parse huge XML input files incrementally. All you need to do is give dom4j an xPath expression (in this case, /InsurancePolicyData/Record) that subdivides the input document, and the library will load only one portion of the XML document sub-tree into memory at a time:
// read the input file incrementallySAXReader reader = new SAXReader();reader.addHandler("/InsurancePolicyData/Record", new MultiThreadSplitFileElementHandler(xsltFile, outputWriter, executor));
The SingleThreadSplitXSLT.java file (see Listing 6) is an updated Java command-line program that incrementally parses the input XML document. That program links in an event listener class, termed an element handler (see Listing 7), to do the partial XSLT transformation.
The element handler class has a few tricks up its sleeve to get the job done. First of all, it caches the XSLT transformation as this same transformation is going to be applied hundreds or perhaps even millions of times:
transformer = cachedXSLT.newTransformer();
There is no reason for the XSLT file to be loaded from a file into memory and parsed each time a record is transformed. Next, you need to make sure the XML declaration for each individual record transformed is omitted:
transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes");
This declaration is valid only once in an XML file and must be at the top of the output file. The final trick in the element handler class is to copy each individual record into its own XML document. Once the copy is made, you can tell dom4j to delete the memory used to parse that portion of the huge XML input file via the detach() method call:
Element record = path.getCurrent();Document newDocument = DocumentHelper.createDocument();newDocument.add(record.createCopy() );record.detach();
So you process only one portion of the XML input file at a time, and do this as many times as needed to process all records in the input file. The file size and contents are identical between Iteration 1 of the program and this iteration, but this version can process the entire 275 MB input file in less than 20 MB of memory?a huge improvement over the first version. Unfortunately, you do lose some efficiency as the new program requires about another 20 seconds to execute. Since Iteration 1 of the program did not copy portions of the input document and executed only one transformation rather than many thousands (one per input record), the increased run time is understandable.
This iteration also consumes only one CPU thread. Modern server architectures typically have many CPUs on the same server. Those CPUs may have multiple cores. And each core may run multiple threads. So can you decrease the overall run time of this program by making use of the Java standard concurrency library? It’s time to find out.
Iteration 3: Multithreaded JAXP XSLT Transformation
Starting with JDK 5 (1.5) and above, Java introduced a really useful utility library for concurrent programming. With a few lines of code you can create a thread pool and give that thread pool discrete tasks to execute in the background. Thread pools are very important for bulk background processing, as the overhead to create and destroy threads is so large that doing so on a routine basis can eliminate any positive effect of multi-threading your program. You want to minimize thread creation wherever possible and the concurrency library does this for you automatically. The simplest way to do this is to create a fixed-size thread pool as follows:
ExecutorService executor = Executors.newFixedThreadPool(totalThreads);
This type of thread pool creates the number of threads given as input once, up front, and all those threads just wait around until given a task to run. After a task is finished executing, the thread does not terminate, it just sleeps until it is assigned another task to run. The executor object handles queuing the tasks for you. So all you need to do is create a task, which is simply a Java class extending either the Runnable or Callable interface, and then tell the executor to execute that task when a thread in the pool is available:
executor.execute(new XSLTRunnable( transformer, outputWriter, newDocument));
Unfortunately, you still need to put a lot of thought into how you can restructure your program to take advantage of that thread pool. Good candidates for thread pool tasks are sequences of processing that repeat many times during the execution of one particular program feature. The best candidates are processing actions that can be executed in any order. Remember that tasks run on a thread pool are always in the background and as such you cannot guarantee the order in which they execute.
In this programming problem, you have three actions performed time and time again:
- Parsing the input file for each record, copying each record into an XML document for individual XSLT transformation
- Running the XSLT transformation
- Appending the output of the XSLT transformation to the output file
I don’t think the dom4j library would support multithreaded parsing. The library parses the input file as a sequential file via a SAX parser. So cross #1 off the above list; it will be performed as a single thread in the main program. That leaves #2 and #3, both of which can be done in parallel. But when you do that, do you need to maintain an order in the output file that is equivalent to the order defined in the input file? Not for this exercise since the output file can be reordered, as long as all input records are properly transformed. If you need to maintain order, you still can but it requires a more complex program.
You will create a thread pool task to do a combination of #2 and #3. Hopefully the single-threaded #1 action is much faster than the combination of #2 and #3, or you won’t get much of a benefit from multi-threading the program. Before going through all the work of restructuring the program, however, verify that assumption! I did.
The MultiThreadSplitXSLT.java file (see Listing 8) shows the updated main program class, which takes an additional input parameter: the number of threads to allocate in the fixed-size thread pool. It creates the thread pool, starts an incremental parse and transformation of the large input XML file, waits for it to complete, and ends the output file and program.
The MultiThreadSplitFileElementHandler.java file (see Listing 9) shows an updated dom4j element handler that creates a thread pool task for as much work as possible. The element handler creates the new partial XML document to transform and passes it to a thread pool task to transform it and write it to the output file. The only tricky part of this file is making sure you create a new transformer for each thread pool task:
Transformer transformer = cachedXSLT.newTransformer();
Transformers cannot transform multiple documents at the same time. You must create a new transformer for each task you create. Any time you pass an object into a task executed in parallel, you must take great care to synchronize the object with any other tasks executed at the same time. If you create a new transformer object for each task, then you do not need to worry about synchronization. Two tasks will never reference the same object, so you do not need to synchronize access to the object.
The XSLTRunnable.java file (see Listing 10) is a new class used to execute an XSLT in the thread pool. It extends the Runnable interface so when a thread from the thread pool picks up the task, it executes the run method. This method performs a JAXP XSLT transformation on the input document, but then it synchronizes the outputWriter object before it appends the results of the transformation.
Remember the rule above: any object passed into a task executed in parallel must be safe to be accessed in parallel. If you create a new object each time you send it to a task, that object generally should be safe to use without synchronization (to know for certain, you must make sure anything that object accesses in turn is either distinct among any other threads executing or is synchronized properly).
When creating a thread pool task, you pass it three parameters:
XSLTRunnable(Transformer transformer, Writer outputWriter, Document document)
You created a new transformer object for each task so you do not need to synchronize access to that object as discussed above. The document is also derived from a full copy of each individual Record element from the input file. So that object is distinct for each task, and you have no problem with synchronization there either.
Unfortunately the outputWriter is an output stream object to the one and only output file. You cannot create copies of this object, nor can you simply ignore synchronization issues. If you did, the output file would be scrambled with partial records interleaved with other partial records. You must write out a completely transformed record in its entirety before allowing any other thread to write out its completely transformed record.
Executing this version of the program on a single CPU machine reduced the runtime processing of a 275 MB file to 160 seconds, which initially surprised me. That is a better runtime than the Iteration 1 transformation, even though this program is doing quite a bit more work to accomplish the same result. On a single CPU machine the parallel processing is allowing the program to do in-memory transformations at the same time as I/O operations. In the Iteration 1 approach, the program cannot fully utilize its only CPU while it reads input and writes output files.
Step 1: Got It Working!
Now the real fun starts when you execute the multithreaded program on a multiple CPU server. I was able to run the program on a server with four dual-core, single-thread CPUs and the run time decreased to just over 30 seconds. Memory consumption was a modest 20 MB, a very small fraction of the Iteration 1 big bang transformation program.
The output file was identical in size to the big bang transformation, but the contents were not. You did not take the time to order the output in this article so that is to be expected. However, the XML generated was in tact for all records in the file; they were just ordered differently. So you can declare victory, Step 1 is complete!
However, only novice software developers stop at Step 1. Disciplined developers quickly move on to estimate the lifetime of the code they just got to work. The longer the code is expected to live, the more effort they spend on the code. Each step has a goal in mind that increases the usefulness of the code to extend its lifetime:
- Get the code working
- Get the code understandable ? Goal: allow others to maintain and enhance the code
- Get the code “debuggable” ? Goal: minimize support efforts and production downtime
- Get the code efficient ? Goal: improve the efficiency of the code so that it has acceptable response times and is scalable to support large volumes and/or users
- Get the code automation testable ? Goal: manage the increasing size and complexity of a system; combat system entropy
- Get the code reusable ? Goal: have more than one product offering use the same code; ease the maintenance burden and come out with more products more quickly
You got through Step 1; hopefully you found the programming exercise challenging and you are ready to move on to Step 2!