Solving the same problems over and over again can be quite tiring for a software engineer, yet the object persistence wheel has been reinvented more times than I’d like to count. Thankfully, the industry is converging on XML because it can represent object relationships very well, and it is architecture and language-agnostic by nature. XML is now the backbone of most client/server applications, from XHTML to SOAP web services to RESTful services, preferences, persistence, and configuration.
However, even with the advent of XML, mapping from objects such as Java class instances to XML is not always trivial. In particular, using a “contract first” approach that defines an XML schema, namespaces, and XML data types can be arduous. Object-to-XML mapping (OXM) libraries such as JAXB, XMLBeans, and XStream have made OXM easier, and they’ve helped to define APIs that serve as the foundation of serialization for tools such as Spring Web Services (Spring-WS). These libraries work by generating classes or using Java annotations to map the objects to a defined XML schema automatically. So the application developer simply instantiates an object, populates the data, and tells the library to marshal the object to well-formed XML. The process then works in reverse when the application unmarshals the XML back into objects by feeding the XML into the library.
A common architectural pattern for applications using OXM involves defining a generic marshaller and unmarshaller interface. This approach hides the actual marshalling technology behind the interface to enable multiple implementations, easier mock testing, consistent exception handling, and future flexibility. However, these interfaces, combined with the hierarchical nature of XML, require that the entire object to be marshalled live in memory, which becomes a major hindrance when the objects grow too large. For example, how do you marshal a web server access log that may contain tens of thousands of requests, or how do you marshal a security audit log for multiple users on the system simultaneously without exhausting memory?
A better solution (albeit with a few limitations) is the callback approach, which solves many of the problems found with existing solutions for large object XML serialization. Using a callback-based API allows an application to “stream” objects in and out of XML while still gaining all the benefits of OXM, such as namespace handling, element-to-field mappings, and data type conversions. This article explains how the callback approach avoids the drawbacks of the two usual alternatives: direct low-level XML handling and custom OXM marshallers.
Existing Solutions vs. Generic, Reusable Marshalling
One of the most common solutions to marshalling large object graphs to XML is to use a low-level library such as StAX (the streaming API for XML) directly, or even to resort to string concatenation (if you’re looking for trouble). The StAX event-based API allows an application to generate individual events for each component of the XML document, and stream these events directly to an output stream such as a file or network socket. While this solution works, it carries a lot of baggage. The XML generation would need to be rewritten for each type of object to be transformed, and the implementation would be very tightly coupled to the XML technology selected. With many of these low-level libraries, the application is responsible for data conversion into and out of the XML type system. Finally, the amount of code required may be very large, thereby creating a debugging and maintenance challenge.
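To make that baggage concrete, here is a minimal sketch of writing a single order directly with the StAX cursor API (`XMLStreamWriter`). The `OrderItem` class and the element and attribute names are hypothetical, chosen only for illustration. Every element, attribute, and type conversion is spelled out by hand, and none of it carries over to a different object type.

```java
import java.io.StringWriter;
import java.util.Arrays;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class DirectStaxOrderWriter {

    /** Hypothetical domain type used only for this illustration. */
    static final class OrderItem {
        final String sku;
        final int quantity;

        OrderItem(String sku, int quantity) {
            this.sku = sku;
            this.quantity = quantity;
        }
    }

    /** Writes one order by hand; items are consumed one at a time from the iterable. */
    static String write(String orderId, Iterable<OrderItem> items) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("order");
        xml.writeAttribute("id", orderId);
        for (OrderItem item : items) {
            xml.writeStartElement("item");
            xml.writeAttribute("sku", item.sku);
            // The application converts Java types to XML text itself.
            xml.writeCharacters(Integer.toString(item.quantity));
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.flush();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write("A-1", Arrays.asList(new OrderItem("X", 2), new OrderItem("Y", 1))));
    }
}
```

A second domain type, such as a user history, would need its own hand-written method of the same shape, which is exactly the duplication the rest of the article tries to avoid.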
Another approach is to write a custom OXM marshaller with a standard interface for each object that needs to be transformed into XML. This technique is similar to an implementation I described in a previous DevX article, “Use the Best of StAX and XMLBeans to Stream XML Object Binding.” A custom OXM marshaller has the advantage of abstracting the transformation from the application and leveraging the type mapping support of a library such as XMLBeans or JAXB. However, the pattern is still insufficient because it requires a new marshaller implementation for each object type. At the same time, segmenting the XML, as described in my previous article, can cause performance degradation when processing extremely large documents.
The ideal solution is a library that supports marshalling any object to XML, requires little code, is maintainable, and has reasonable performance. The new marshalling pattern that this library would use must satisfy a number of core requirements. It must be:
- Generic: The marshaller must support marshalling any object that the underlying marshaller implementation can understand. For example, with JAXB, this would be any annotated object. For XStream, this would be any JavaBean.
- Reusable: The marshaller must be thread safe and reusable so it can be used in a stateless system such as an enterprise application or servlet. This allows for easier configuration and testing using a dependency-injection framework such as Spring.
- Scalable: The marshaller must support objects of any size when marshalling and unmarshalling.
- Reasonably high performing: The marshaller should not be much slower than using the OXM library directly. For example, if JAXB is the underlying implementation, the generic, streaming marshaller should perform about as well as using JAXB directly to write the object.
Consider a possible scenario where this solution could be applied. Suppose a company has to archive orders in XML. The orders currently exist in a database. Each order may contain hundreds of items. The history of users who created the orders must also be archived, that is, a record of all the actions they performed on the system. This information exists in data access objects (DAOs), which can return the information in pages (e.g., order items 1 to 10, 11 to 20, etc.). Loading the entire order or the entire user history into memory isn’t feasible given the hardware specifications, especially if this process needs to be done in parallel. Ideally, a single OXM library would be used to transform both the orders and the user histories to XML in an efficient manner.
Using these requirements and this scenario, the following section explains how I designed and implemented a generic, reusable, marshalling library.
Streaming Object Marshalling APIs
Leveraging the well-known marshalling APIs defined by frameworks such as Spring-WS, I started with the marshaller and unmarshaller interfaces shown in Listing 1. These interfaces provide a common method for transforming an object into and out of XML while hiding the underlying implementation and providing consistent exception handling. The next step is to extend these interfaces to support objects of any size.
One option is to pass lists to the marshal method, but that doesn’t meet the requirement of scalability if all of the objects must be placed in the list first. This can also cause problems because the number of lists required could vary based on the object being serialized, and therefore doesn’t easily support a generic API. A callback interface can be used to request the objects to be written as they are needed. This creates a ‘stream’ of objects that are pulled directly from the source to the marshaller, or from the unmarshaller directly to the sink. Listing 2 shows the simple marshaller source and unmarshaller sink APIs.
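As a rough illustration of the callback idea, the sketch below shows what a source and sink pair could look like, together with a source that pulls pages from a DAO so that only one page is held in memory at a time. All of the names here (`ObjectStreamSource`, `ObjectStreamSink`, `PagedDao`, `PagedSource`) are assumptions made for this example and are not necessarily the ones used in the article's listings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Pulled by the marshaller: returns the next object to write, or null when exhausted. */
interface ObjectStreamSource {
    Object next();
}

/** Pushed to by the unmarshaller: receives each object as soon as it has been read. */
interface ObjectStreamSink {
    void accept(Object object);
}

/** Hypothetical paging DAO: returns at most pageSize objects starting at offset. */
interface PagedDao {
    List<?> findPage(int offset, int pageSize);
}

/** Source that loads one page at a time, on demand. */
final class PagedSource implements ObjectStreamSource {
    private final PagedDao dao;
    private final int pageSize;
    private Iterator<?> page = Collections.emptyIterator();
    private int offset = 0;
    private boolean lastPageSeen = false;

    PagedSource(PagedDao dao, int pageSize) {
        this.dao = dao;
        this.pageSize = pageSize;
    }

    @Override
    public Object next() {
        if (!page.hasNext() && !lastPageSeen) {
            List<?> items = dao.findPage(offset, pageSize);
            offset += items.size();
            lastPageSeen = items.size() < pageSize; // a short page means no more data
            page = items.iterator();
        }
        return page.hasNext() ? page.next() : null;
    }
}

final class PagedSourceDemo {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("item-1", "item-2", "item-3", "item-4", "item-5");
        // In-memory stand-in for a database-backed DAO.
        PagedDao dao = (offset, size) -> rows.subList(
                Math.min(offset, rows.size()), Math.min(offset + size, rows.size()));
        ObjectStreamSource source = new PagedSource(dao, 2);
        for (Object item = source.next(); item != null; item = source.next()) {
            System.out.println(item);
        }
    }
}
```

Returning null to signal exhaustion keeps the interface to a single method; an iterator-style `hasNext`/`next` pair would work equally well.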
To let the marshaller know when to start pulling objects from the source, or which source/sink to use if multiple object streams are used, the streamed object must be described to the marshaller. This is done using a stream definition as shown in Listing 3. The definition simply provides a lookup based on the current element being written, and can return an object stream source. The objects returned from the source compose the children of the original element. The same design works in reverse for unmarshalling as the stream definition is interrogated for object stream sinks.
The final step is to modify the marshal and unmarshal methods to take the stream definition as a parameter (see Listing 4). The new API, using the stream definition to locate the object streams, can be used as a generic OXM API to support objects of any size very efficiently. With the interfaces in place, a working implementation is the next objective.
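The stream definition and the extended marshal and unmarshal signatures described above might be sketched as follows. This is an assumption-laden outline, not the article's actual listings: the interface names, the `MarshallingException` type, and the map-backed `MapStreamDefinition` are all hypothetical. The only fixed points taken from the text are that the definition is looked up by the current element and that it is passed to the marshal and unmarshal methods.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.transform.Result;
import javax.xml.transform.Source;

interface ObjectStreamSource { Object next(); }              // null when exhausted
interface ObjectStreamSink { void accept(Object object); }

/** Hypothetical unchecked exception giving callers one consistent failure type. */
class MarshallingException extends RuntimeException {
    MarshallingException(String message, Throwable cause) { super(message, cause); }
}

/** Consulted for each start element; a null result means "handle this element normally". */
interface StreamDefinition {
    ObjectStreamSource sourceFor(QName element);
    ObjectStreamSink sinkFor(QName element);
}

/** The base marshal method with the stream definition added as a parameter. */
interface StreamMarshaller {
    void marshal(Object graph, Result result, StreamDefinition definition) throws MarshallingException;
}

/** The base unmarshal method with the stream definition added as a parameter. */
interface StreamUnmarshaller {
    Object unmarshal(Source source, StreamDefinition definition) throws MarshallingException;
}

/** Simple map-backed definition keyed by qualified element name. */
final class MapStreamDefinition implements StreamDefinition {
    private final Map<QName, ObjectStreamSource> sources = new HashMap<>();
    private final Map<QName, ObjectStreamSink> sinks = new HashMap<>();

    MapStreamDefinition withSource(QName element, ObjectStreamSource source) {
        sources.put(element, source);
        return this;
    }

    MapStreamDefinition withSink(QName element, ObjectStreamSink sink) {
        sinks.put(element, sink);
        return this;
    }

    @Override
    public ObjectStreamSource sourceFor(QName element) { return sources.get(element); }

    @Override
    public ObjectStreamSink sinkFor(QName element) { return sinks.get(element); }
}
```

Keying the lookup on `QName` rather than a plain string keeps the definition namespace-aware, which matters when the same local name appears in more than one schema.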
Implementation with JAXB on StAX
Using the interfaces described in the previous section, I could implement the solution using existing XML marshalling tools. I decided to use JAXB on top of the StAX library because of the stability and flexibility that these two tools provide. At the same time, my general marshaller was already implemented with JAXB, so the new streaming marshaller could leverage some of that implementation. Also, to reduce the number of implementation classes, the JAXB implementation recognizes both the StreamMarshaller and StreamUnmarshaller interfaces. See the full implementation in Listing 5.
The key to the implementation is detecting when an element being written requires children from a stream source. To accomplish this, the implementation creates a dynamic proxy to the StAX XMLEventWriter instance, which watches for add method calls that are adding a start element event. In generic terms, the stream marshaller is looking for the specific XML open element tag that will contain streamed objects. For each start element, the stream definition is consulted to see if there is a stream source available.
If the stream definition returns a stream source, the current marshalling activity is suspended, and a new marshalling loop begins in which the objects from the source are marshalled to the event writer until all are exhausted. Once exhausted, the original marshalling activity resumes. This streaming technique allows the stream source to load the objects to be marshalled on demand and therefore limits memory usage. Due to a reentrancy limitation in the JAXB marshaller, a second marshaller is used in the inner loop, but other implementations may support the use of a single marshaller.
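The proxy and the inner marshalling loop can be sketched with nothing but the JDK's StAX and `java.lang.reflect.Proxy` classes. This is a simplified illustration, not the article's implementation: the `ChildMarshaller` interface is a hypothetical stand-in for the second JAXB marshaller (in real code, JAXB's `Marshaller.marshal(Object, XMLEventWriter)` would play that role), a `Function<QName, ObjectStreamSource>` stands in for the stream definition lookup, and the outer marshaller is simulated by adding events by hand.

```java
import java.io.StringWriter;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Iterator;
import java.util.function.Function;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

interface ObjectStreamSource { Object next(); } // null when exhausted

/** Hypothetical stand-in for the inner (second) OXM marshaller. */
interface ChildMarshaller {
    void marshal(Object child, XMLEventWriter out) throws XMLStreamException;
}

final class StreamingWriterProxy {
    /** Wraps the real writer so that start elements can trigger a streamed child loop. */
    static XMLEventWriter wrap(XMLEventWriter target,
                               Function<QName, ObjectStreamSource> definition,
                               ChildMarshaller children) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result;
            try {
                result = method.invoke(target, args); // write the event first
            } catch (InvocationTargetException e) {
                throw e.getCause();
            }
            boolean startElementAdded = method.getName().equals("add")
                    && args != null && args.length == 1
                    && args[0] instanceof XMLEvent && ((XMLEvent) args[0]).isStartElement();
            if (startElementAdded) {
                QName name = ((XMLEvent) args[0]).asStartElement().getName();
                ObjectStreamSource source = definition.apply(name);
                if (source != null) {
                    // Outer marshalling is suspended here until the source is drained.
                    for (Object child = source.next(); child != null; child = source.next()) {
                        children.marshal(child, target);
                    }
                }
            }
            return result;
        };
        return (XMLEventWriter) Proxy.newProxyInstance(
                StreamingWriterProxy.class.getClassLoader(),
                new Class<?>[] {XMLEventWriter.class}, handler);
    }
}

final class StreamingMarshalDemo {
    static String run(Iterator<String> events) throws XMLStreamException {
        XMLEventFactory ef = XMLEventFactory.newInstance();
        StringWriter out = new StringWriter();
        XMLEventWriter raw = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        QName list = new QName("ListOfEvents");
        ObjectStreamSource source = () -> events.hasNext() ? events.next() : null;
        ChildMarshaller child = (obj, w) -> {
            w.add(ef.createStartElement("", "", "event"));
            w.add(ef.createCharacters(obj.toString()));
            w.add(ef.createEndElement("", "", "event"));
        };
        XMLEventWriter writer = StreamingWriterProxy.wrap(
                raw, name -> name.equals(list) ? source : null, child);
        // Stand-in for the outer OXM marshaller writing the (nearly empty) parent object.
        writer.add(ef.createStartElement("", "", "auditTrail"));
        writer.add(ef.createStartElement("", "", "ListOfEvents"));
        writer.add(ef.createEndElement("", "", "ListOfEvents"));
        writer.add(ef.createEndElement("", "", "auditTrail"));
        writer.flush();
        writer.close();
        return out.toString();
    }
}
```

Because the children are written to the unwrapped writer, a streamed child cannot itself trigger another stream, so this sketch handles only one level of streaming.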
The unmarshalling process works in essentially the same way. If a start element is found, the stream definition is consulted. If a stream sink is returned from the definition, the current unmarshalling activity is suspended, and a new unmarshalling loop begins in which the child objects are unmarshalled and given to the sink until an end element is found. Again, the large list of objects in the XML document is read and processed one object at a time, so the objects are never all loaded into memory together.
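A stripped-down version of that read loop, using only the JDK's `XMLEventReader`, might look like the following. It is a sketch under stated assumptions: `ChildUnmarshaller` is a hypothetical stand-in for the inner OXM unmarshaller (with JAXB, `Unmarshaller.unmarshal(XMLEventReader)` would fill that role), a `Function<QName, ObjectStreamSink>` stands in for the stream definition, and the binding of the parent object is omitted entirely.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

interface ObjectStreamSink { void accept(Object object); }

/** Hypothetical stand-in for the inner OXM unmarshaller; consumes exactly one child element. */
interface ChildUnmarshaller {
    Object unmarshal(XMLEventReader in) throws XMLStreamException;
}

final class StreamingReadLoop {
    static void read(XMLEventReader in,
                     Function<QName, ObjectStreamSink> definition,
                     ChildUnmarshaller children) throws XMLStreamException {
        while (in.hasNext()) {
            XMLEvent event = in.nextEvent();
            if (!event.isStartElement()) {
                continue; // a real unmarshaller would bind this event to the parent object
            }
            ObjectStreamSink sink = definition.apply(event.asStartElement().getName());
            if (sink == null) {
                continue; // no stream registered: carry on normally
            }
            // Inner loop: hand each child to the sink until the container's end element.
            while (in.hasNext()) {
                XMLEvent next = in.peek();
                if (next.isEndElement()) {
                    in.nextEvent();
                    break;
                }
                if (next.isStartElement()) {
                    sink.accept(children.unmarshal(in));
                } else {
                    in.nextEvent(); // skip whitespace and comments between children
                }
            }
        }
    }
}

final class StreamingReadDemo {
    static List<Object> collect(String xml) throws XMLStreamException {
        List<Object> received = new ArrayList<>();
        XMLEventReader in = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        QName list = new QName("ListOfEvents");
        // Toy child unmarshaller: treats each child as a text-only element.
        ChildUnmarshaller textOnly = reader -> {
            reader.nextEvent();              // consume the child's start element
            return reader.getElementText();  // reads text and the matching end element
        };
        StreamingReadLoop.read(in, name -> name.equals(list) ? received::add : null, textOnly);
        in.close();
        return received;
    }
}
```

In this demo the sink simply collects into a list to keep the example observable; a production sink would save each object and let it go, which is what keeps memory use flat.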
Putting it All Together
To put it all together, let’s look at a simple example using an audit trail logging system. The schema, and consequently the domain model, for the audit trail is shown in Figure 1. Depending on the user activity, the audit trail can contain thousands of log entries, so it isn’t advisable to try to load them all into memory before serialization. Listing 6 presents the implementation of a stream definition for the audit trail in which the ListOfEvents element will trigger the use of a stream source/sink. (The implementation of the data access objects is outside the scope of this article. However, it is safe to assume that they simply load or save objects to a database.)
Figure 1. The Schema for the Audit Trail: The XML schema model for the audit trail example domain objects.
The application can now simply use the streaming marshaller with the stream definition to marshal an entire audit trail to XML without worrying about memory exhaustion (see Listing 7). The only audit-trail-specific code is the simple stream definition and the anonymous source/sink objects. The JAXB marshaller implementation remains generic and can be reused in a threaded environment for any supported domain object.
While this streaming marshaller pattern is sufficient for the majority of cases, you can create a few extensions to make it even more useful. The implementation presented so far supports only a single nested object stream. By splitting the child marshalling loop out into a separate operation that internally creates the child JAXB marshaller, any number of nested object streams could be supported.
Writing a no-operation source or sink would allow the application to ignore sections of a document that were not relevant. For example, an application could process a payroll XML document and skip the expense reports section by using a no-op stream sink. Then it would process only the billable time section. The no-op source or sink implementations can be generic and reusable by any stream definition.
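A no-op source and sink need very little code. The sketch below assumes the same hypothetical `ObjectStreamSource` and `ObjectStreamSink` shapes used for illustration elsewhere, and the payroll element names (`ExpenseReports`, `BillableTime`) are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface ObjectStreamSource { Object next(); } // null when exhausted
interface ObjectStreamSink { void accept(Object object); }

/** Generic no-op implementations that any stream definition can share. */
final class NoOpStreams {
    /** Supplies no children, so the element is written empty. */
    static final ObjectStreamSource EMPTY_SOURCE = () -> null;
    /** Accepts every child and discards it. */
    static final ObjectStreamSink DISCARDING_SINK = object -> { };

    private NoOpStreams() { }
}

/** Hypothetical payroll lookup: skip expense reports, keep billable time. */
final class PayrollSinks {
    final List<Object> billableTime = new ArrayList<>();
    private final Map<String, ObjectStreamSink> sinks = new HashMap<>();

    PayrollSinks() {
        sinks.put("ExpenseReports", NoOpStreams.DISCARDING_SINK);
        sinks.put("BillableTime", billableTime::add);
    }

    /** Returns null for elements that should be unmarshalled normally. */
    ObjectStreamSink sinkFor(String elementName) {
        return sinks.get(elementName);
    }
}
```

With this design the skipped children are presumably still parsed and unmarshalled before being dropped, so a no-op sink saves memory and application work rather than parsing time.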
Based on the implementation presented, if the stream definition returns null for a given start element, the marshalling process will continue as normal. By leveraging this functionality, an application can decide to allow the marshaller to read or write objects normally, even if they usually are streamed. For example, based on the previous audit trail example, if the application knows that very few events are in the current system, it could simply load the events into the audit trail and return null in the stream definition. You may notice that by returning null from the stream definition the streaming marshaller behaves exactly the same as the non-streaming marshaller implementation.
Limitations and Benefits
Of course, any pattern has tradeoffs, and this one is no exception. The callback model used by the stream source and sink objects can sometimes be more difficult to implement in an application because the marshaller controls the reading or writing process. This callback model is reminiscent of the SAX (Simple API for XML) model, which many developers consider more awkward to program against than pull-style APIs such as StAX.
During unmarshalling, the stream sink will receive callbacks with the child objects before the parent object has been fully unmarshalled due to the hierarchical nature of XML. This could be problematic in situations where information from the parent object is required before the children can be processed. For example, what do you associate audit events to in the database when the unmarshaller returns them before the actual audit trail parent object? One workaround is to use a no-op sink as described earlier to read the XML document once, extracting only the parent object, and then read the XML document again using the proper sinks. This will not deliver the best performance but it does give access to the parent object before the children are returned.
But even given these limitations, the benefits of the generic, reusable implementation, automatic XML type handling, and limited memory usage still make the approach a winner. So the next time you have to serialize large objects to XML, consider this pattern before dropping to the low-level APIs. You’ll save yourself a lot of headaches.