Manipulate XML Content the Ximple Way

The latest Java version of the Virtual Token Descriptor for XML (VTD-XML) can function as a slicer, an editor, and an incremental modifier to intelligently manipulate XML document content. This article shows you how to use it, introduces the concept of “document-centric” XML processing, and discusses its implications for service-oriented architecture (SOA) and the future of enterprise IT.

Previous articles on DevX (see Related Resources) presented VTD-XML as a general-purpose, ultra high-performance XML parser well-suited for processing large XML documents using XPath. In parsing mode, VTD-XML derives its memory efficiency and high performance from non-extractive parsing. Internally, VTD-XML retains the XML document intact in memory and un-decoded, using offsets and lengths to describe tokens in the XML document. By resorting entirely to primitive data types (such as 64-bit integers), VTD-XML achieves unrivaled performance and memory efficiency by eliminating unnecessary object creation and garbage collection costs (which are largely responsible for the poor performance of DOM and SAX parsing).
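To make the idea of describing tokens by offsets and lengths concrete, here is a minimal sketch in plain Java. It is an illustration only, not VTD-XML's actual record layout, which carries additional fields that are not reproduced here. The class name TokenSketch, its methods, and the sample document are all hypothetical. The sketch packs the offset into the lower 32 bits and the length into the upper 32 bits of a long, the same convention the article's later example code uses to unpack the value returned by getElementFragment(), and it creates a String only when the application asks for one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a non-extractive token; not VTD-XML's real record format.
class TokenSketch {
    // Pack the offset (lower 32 bits) and the length (upper 32 bits) into one long.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    static int offsetOf(long token) {
        return (int) token;
    }

    static int lengthOf(long token) {
        return (int) (token >> 32);
    }

    // Decode the token into a String only on demand; the document bytes stay untouched.
    static String text(byte[] doc, long token) {
        return new String(doc, offsetOf(token), lengthOf(token), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the attribute value starts at byte 12 and is 9 bytes long.
        byte[] doc = "<root attr=\"old value\"/>".getBytes(StandardCharsets.UTF_8);
        long token = pack(12, 9);
        System.out.println(text(doc, token)); // prints: old value
    }
}
```

Because the token is a primitive value, a whole document can be described by an array of longs, with no per-token object allocation.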

Nevertheless, memory usage and CPU efficiency may be only a small part of the inherent benefits that non-extractive parsing offers. An arguably more significant implication—one that sets it apart from other XML parsing techniques—lies in its unique ability to manipulate XML document content at the byte level. Below are three distinct, yet related, sets of capabilities available in version 2.2 of VTD-XML.

XML slicer—You can use a pair of integers (offset and length) to address a segment of XML content so your application can slice the segment from the original document and move it to another location in the same or a different document. The VTDNav class exposes two methods that allow you to address an element fragment: getElementFragment(), which returns a 64-bit integer representing the offset and length value of the current element, and getElementFragmentNs() (in the latest version), which returns an ElementFragmentNs object representing a “namespace-compensated” element fragment (more detail on this later).
Incremental XML modifier—You can modify an XML document incrementally through the XMLModifier, which defines three types of “modify” operations: inserting new content into any location (at any offset) in the document, deleting content (by specifying the offset and length), and replacing old content with new content—which effectively is a deletion and insertion at the same location. To compose a new document containing all the changes, you need to call the XMLModifier’s output(…) method.
XML editor— You can directly edit the in-memory copy of the XML text using VTDNav’s overWrite(…) method, provided that the original tokens you’re overwriting are wide enough to hold the new byte content.

Editor vs. Incremental Modifier
While non-extractive parsing enables both the editing mode and the incremental modifier mode of VTD-XML, there are subtle differences between the two. Using VTD-XML as an incremental modifier (by calling various XMLModifier methods) doesn’t modify the in-memory copy of the XML document; instead, you compose a new document based on the original document and the operations you specify. To generate the new document, you must call the XMLModifier’s output(…) method.

In contrast, when using VTD-XML as an editor, you directly modify the in-memory XML text. In other words, if the modification is successful, your application logic can immediately access the new data—there’s no need to reparse.
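The incremental-modifier idea can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions, not the XMLModifier implementation: the class IncrementalSketch, its methods, and the sample document are hypothetical. It records insert, remove, and replace operations against byte offsets of the original document, and only composes a new byte array when output() is called; the original bytes are never touched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of incremental modification; not XMLModifier's implementation.
class IncrementalSketch {
    // One pending change: drop `length` bytes at `offset`, then write `insert` there.
    private static final class Op {
        final int offset;
        final int length;
        final byte[] insert;

        Op(int offset, int length, byte[] insert) {
            this.offset = offset;
            this.length = length;
            this.insert = insert;
        }
    }

    private final byte[] original;
    private final List<Op> ops = new ArrayList<>();

    IncrementalSketch(byte[] original) {
        this.original = original;
    }

    void insert(int offset, String content) {
        ops.add(new Op(offset, 0, content.getBytes(StandardCharsets.UTF_8)));
    }

    void remove(int offset, int length) {
        ops.add(new Op(offset, length, new byte[0]));
    }

    void replace(int offset, int length, String content) {
        ops.add(new Op(offset, length, content.getBytes(StandardCharsets.UTF_8)));
    }

    // Compose a new document from the original bytes plus the recorded operations.
    byte[] output() {
        List<Op> sorted = new ArrayList<>(ops);
        sorted.sort(Comparator.comparingInt(op -> op.offset));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        for (Op op : sorted) {
            if (op.offset < pos) {
                throw new IllegalStateException("overlapping operations");
            }
            out.write(original, pos, op.offset - pos);  // copy the untouched bytes
            out.write(op.insert, 0, op.insert.length);  // write the new content
            pos = op.offset + op.length;                // skip the deleted bytes
        }
        out.write(original, pos, original.length - pos);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the attribute value occupies bytes 12 to 20.
        byte[] doc = "<root attr=\"old value\"/>".getBytes(StandardCharsets.UTF_8);
        IncrementalSketch sketch = new IncrementalSketch(doc);
        sketch.replace(12, 9, "a much longer value");
        System.out.println(new String(sketch.output(), StandardCharsets.UTF_8));
        System.out.println(new String(doc, StandardCharsets.UTF_8)); // the original is unchanged
    }
}
```

Note that the replacement may be longer or shorter than the bytes it replaces, which is exactly what in-place editing cannot offer.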

Consider the following XML document named test.xml:

To change the attribute value of “attr” to “new value,” you can use the following Java code:

   import com.ximpleware.*;

   public class changeAttrVal {
      public static void main(String args[]) throws Exception {
         VTDGen vg = new VTDGen();
         XMLModifier xm = new XMLModifier();
         // parse without namespace awareness
         if (vg.parseFile("test.xml", false)) {
            VTDNav vn = vg.getNav();
            xm.bind(vn);
            // index of the attribute value token, or -1 if "attr" is absent
            int i = vn.getAttrVal("attr");
            if (i != -1)
               xm.updateToken(i, "new value");
            // compose the modified document and write it to a new file
            xm.output("new_test.xml");
         }
      }
   }

The last line of the preceding code outputs the modified XML document with the changed attribute value to the file new_test.xml, as shown below:

You could achieve the same result in VTD-XML’s editing mode with this Java code:

   import com.ximpleware.*;
   import java.io.*;

   public class changeAttrVal2 {
      public static void main(String args[]) throws Exception {
         VTDGen vg = new VTDGen();
         if (vg.parseFile("test.xml", false)) {
            VTDNav vn = vg.getNav();
            int i = vn.getAttrVal("attr");
            if (i != -1) {
               // overwrite the token in place; the new bytes must fit in the old token
               vn.overWrite(i, "new value".getBytes());
               // print the new string here
               System.out.println(
                  "print the new attr value ===> " + vn.toString(i));
            }
            // write the edited in-memory document to a file
            FileOutputStream fos = new FileOutputStream("new_test2.xml");
            fos.write(vn.getXML().getBytes());
            fos.close();
         }
      }
   }

In contrast to the output from XMLModifier, this version retains a few extra whitespace characters as part of the attribute value. This is because VTDNav’s overWrite() method first fills the “window” (the space occupied by the content) of the attribute value with the new byte content, then fills the remaining part of the window with whitespace, guaranteeing that the new token has the same length as the old token in the new XML file. Note, however, that the example can print the new attribute value immediately after calling overWrite(), without generating a new copy of the document.
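As a rough illustration of the window-filling behavior described above, the following plain-Java sketch overwrites a byte range in place and pads the rest of the window with spaces. It is a sketch under stated assumptions, not the VTDNav code: the class OverwriteSketch, its method signature (which takes an explicit offset and length rather than a token index), and the sample document are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of in-place editing with padding; not VTDNav's implementation.
class OverwriteSketch {
    // Overwrite doc[offset, offset + length) in place.
    // Returns false, leaving the document unchanged, if the new content is too wide.
    static boolean overWrite(byte[] doc, int offset, int length, byte[] content) {
        if (offset < 0 || length < 0 || offset + length > doc.length) {
            throw new IllegalArgumentException("window lies outside the document");
        }
        if (content.length > length) {
            return false;
        }
        // Fill the start of the window with the new bytes...
        System.arraycopy(content, 0, doc, offset, content.length);
        // ...and pad the remainder with spaces so the token keeps its width.
        Arrays.fill(doc, offset + content.length, offset + length, (byte) ' ');
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the attribute value occupies bytes 12 to 20.
        byte[] doc = "<root attr=\"old value\"/>".getBytes(StandardCharsets.UTF_8);
        boolean ok = overWrite(doc, 12, 9, "new".getBytes(StandardCharsets.UTF_8));
        // The value is now "new" followed by six spaces; the document length is unchanged.
        System.out.println(ok + " " + new String(doc, StandardCharsets.UTF_8));
    }
}
```

Because every other byte keeps its position, any offsets computed before the edit remain valid afterward, which is why no reparse is needed.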

Using Namespace-Compensated Element Fragments
For documents that don’t use namespaces, you call VTDNav’s getElementFragment() to retrieve the offset and length of the element fragment, which in itself is a valid XML document. But for XML files that include namespaces, describing an element fragment by only offset and length is usually insufficient, because it may miss namespace declarations in ancestor nodes. VTD-XML 2.2 allows you to obtain an ElementFragmentNs instance by calling VTDNav’s new getElementFragmentNs() method. You can think of an ElementFragmentNs object as a namespace-aware, well-formed element fragment, consisting of the fragment itself plus any namespace declarations in its ancestor nodes. Consider the following XML document:

                           uuid:           093a2da1-q345-739r-ba5d-pqff98fe8j7d                              2001-11-29T13:20:00.000-05:00

Using only the offset and length, you get a “naked” fragment for the m:reservation element as shown below. Notice that it is not well-formed namespace-wise:

                 uuid:           093a2da1-q345-739r-ba5d-pqff98fe8j7d                             2001-11-29T13:20:00.000-05:00

In comparison, a namespace-compensated fragment for the m:reservation element contains an additional namespace declaration (as defined in the root element):

                uuid:           093a2da1-q345-739r-ba5d-pqff98fe8j7d                             2001-11-29T13:20:00.000-05:00

An interesting property is that the ElementFragmentNs instance of a root element is precisely the document itself. Version 2.2 added a couple of overloaded methods to the XMLModifier class that let you insert an ElementFragmentNs object into the document. These two methods are insertAfterElement(ElementFragmentNs efn) and insertBeforeElement(ElementFragmentNs efs).
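The notion of namespace compensation can be sketched with a few lines of plain Java. This is a deliberately naive, string-based illustration, not how ElementFragmentNs works internally: the class NsCompensateSketch, its compensate() method, and the urn:example:reservation namespace URI are hypothetical, and the sketch assumes the fragment starts with a start tag whose attribute values contain no '>' character. It copies each namespace declaration found on ancestor elements into the fragment's start tag, unless the fragment's own start tag already declares that prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical, naive sketch of namespace compensation; not ElementFragmentNs itself.
class NsCompensateSketch {
    // ancestorNs maps a prefix ("" for the default namespace) to its namespace URI.
    static String compensate(String fragment, Map<String, String> ancestorNs) {
        // Find the end of the element name in the fragment's start tag.
        int nameEnd = 1; // the fragment is assumed to start with '<'
        while (nameEnd < fragment.length()
                && !Character.isWhitespace(fragment.charAt(nameEnd))
                && fragment.charAt(nameEnd) != '>'
                && fragment.charAt(nameEnd) != '/') {
            nameEnd++;
        }
        String startTag = fragment.substring(0, fragment.indexOf('>') + 1);

        StringBuilder decls = new StringBuilder();
        for (Map.Entry<String, String> e : ancestorNs.entrySet()) {
            String attrName = e.getKey().isEmpty() ? "xmlns" : "xmlns:" + e.getKey();
            boolean declared = Pattern
                    .compile("\\s" + Pattern.quote(attrName) + "\\s*=")
                    .matcher(startTag)
                    .find();
            if (!declared) {
                // The fragment relies on an ancestor's declaration, so carry it along.
                decls.append(' ').append(attrName)
                     .append("=\"").append(e.getValue()).append('"');
            }
        }
        return fragment.substring(0, nameEnd) + decls + fragment.substring(nameEnd);
    }

    public static void main(String[] args) {
        Map<String, String> ancestors = new LinkedHashMap<>();
        ancestors.put("env", "http://www.w3.org/2003/05/soap-envelope");
        String naked = "<m:reservation xmlns:m=\"urn:example:reservation\""
                + " env:mustUnderstand=\"true\"><m:reference>uuid:1</m:reference></m:reservation>";
        // The result declares the env prefix, so it can stand alone as a document.
        System.out.println(compensate(naked, ancestors));
    }
}
```

A production implementation would work from the parser's token information rather than from string searches, but the effect on the fragment is the same: the result can be moved into another document without dangling prefixes.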

“Document-Centric” XML Processing


Figure 1: XML Processing: Object-oriented processing forces object creation, while document-centric XML processing does not.

Traditional XML processing models (such as DOM, SAX, and JAXB) were designed around the notion of objects. The XML text, as a mere form of object serialization, was relegated to the status of a second-class citizen. You base your applications on DOM nodes, strings, and various business objects, but rarely on the physical documents. However, it has become clear that this object-oriented approach to XML processing makes little sense, because it causes performance hits from virtually all directions. Not only are object creation and garbage collection inherently memory and CPU intensive, but applications also incur the cost of re-serialization with even the smallest changes to the original text (see Figure 1).

In contrast, VTD-XML’s non-extractive parsing starts from the XML itself, the persistent data format. Whether you’re parsing, performing XPath queries, modifying content, or slicing element fragments, you no longer work directly with objects by default. Instead, you need to create and work with objects only when it makes sense to do so. More often than not, you can treat documents purely as syntax, and think in bytes, byte arrays, integers, offsets, lengths, fragments, and namespace-compensated fragments. The first-class citizen in this paradigm is the XML text itself; object-centric notions of XML processing, such as serialization and de-serialization (or marshalling and unmarshalling), are often displaced, if not replaced, by more document-centric notions of parsing and composition (see Figure 1). When you approach XML programming in this manner, you’ll find that your XML programming experience gets simpler. And not surprisingly, the simpler, more intuitive way to think about XML processing is also the most efficient and powerful.

Implications for SOA and Future of Enterprise IT
But there are other practical limitations of the object-oriented development approach. The failure of CORBA showed that the object-oriented approach is fundamentally ill suited for building Internet-scale, loosely-coupled distributed systems. Just as with the XML processing approaches mentioned earlier, CORBA’s attempt to achieve interoperability at the API level relegated the wire format (the persistent data) to afterthought status, leading to a distributed system design that is brittle, tightly coupled, and stiflingly complex. Those painful technical lessons of CORBA have ultimately led us to SOA, which achieves loose-coupling and simplicity by explicitly exposing the XML messages (the wire format) as the public contract of your services. In other words, when building loosely coupled services, think first in messages.

Document-centric XML processing fits into and enhances the technical foundation of SOA. Simply put, by thinking in messages, you gain not just loose coupling and simplicity, but efficiency as well. It is quite harmful to think of your services in objects. Consider an SOA intermediary/broker application that aggregates multiple services. Pretty much all it does is to splice together fragments from multiple documents to compose a single, large document and shove it upstream. Objects aren’t really necessary in that scenario. As another example, consider a service’s dissemination point (the exact opposite) where large XML documents get split into several smaller ones, each of which gets forwarded to a downstream recipient for further processing. Again, there’s no need to allocate objects to perform that task. As more services become available, you’ll discover that composite services/applications are mostly about slicing, editing, modifying, splicing, and splitting documents. In such an environment, traditional, object-oriented design patterns become increasingly less applicable, whereas the simpler, more intuitive way of dealing with documents directly is highly efficient.
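The intermediary scenario can be sketched in plain Java as byte-level splicing. This is an illustration under stated assumptions rather than a VTD-XML program: the class SpliceSketch, its methods, and the sample documents are hypothetical, and the fragment offsets are supplied by hand where a real application would obtain them from a parser (for example, from getElementFragment()). Each fragment is a long with the offset in the lower 32 bits and the length in the upper 32 bits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of composing one document from fragments of several others.
class SpliceSketch {
    // Describe a fragment: offset in the lower 32 bits, length in the upper 32 bits.
    static long fragment(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    // Copy one fragment's bytes straight from its source document.
    static void writeFragment(ByteArrayOutputStream out, byte[] doc, long fragment) {
        out.write(doc, (int) fragment, (int) (fragment >> 32));
    }

    // Wrap fragments[i], taken from docs[i], in a new root element.
    static byte[] splice(String rootName, byte[][] docs, long[] fragments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] open = ("<" + rootName + ">").getBytes(StandardCharsets.UTF_8);
        byte[] close = ("</" + rootName + ">").getBytes(StandardCharsets.UTF_8);
        out.write(open, 0, open.length);
        for (int i = 0; i < docs.length; i++) {
            writeFragment(out, docs[i], fragments[i]);
        }
        out.write(close, 0, close.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two hypothetical upstream responses; each inner element starts at byte 3, 8 bytes long.
        byte[] docA = "<a><x>1</x></a>".getBytes(StandardCharsets.UTF_8);
        byte[] docB = "<b><y>2</y></b>".getBytes(StandardCharsets.UTF_8);
        byte[] merged = splice("merged",
                new byte[][] { docA, docB },
                new long[] { fragment(3, 8), fragment(3, 8) });
        // prints: <merged><x>1</x><y>2</y></merged>
        System.out.println(new String(merged, StandardCharsets.UTF_8));
    }
}
```

Splitting a large document for downstream recipients is the mirror image: each recipient receives a copy of one byte range, and no intermediate object model is built in either direction.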

In many ways, the programming experience for building SOA applications is a lot like building the Internet itself, requiring you to think of building applications in networking terms, such as those of the Open Systems Interconnection (OSI) model. Your services, especially those residing in the middle tier, represent a new breed of networked information devices possessing the following characteristics:

Application layer devices—These devices behave more or less like routers, switches, and filters sitting in the wiring closet in your datacenter, except that they speak XML, rather than IP. They are services in themselves, and can assume a variety of names, such as XML router/switch, SOA intermediary, message/service broker, etc.
Integrated native XML databases—From time to time you’ll want to persist and index messages in transit natively to serve other requests at a later time.
XPath and XQuery—Query languages such as XPath and XQuery are used extensively, not just to achieve loose coupling, but also for routing, switching, filtering, querying, and transformation.

XML has finally emerged as the most powerful universal wire format. It is easy to learn, human-readable, and interoperable, it enables loose coupling, and it also leads CORBA, DCOM, and RMI by a mile, performance-wise.

But SOA is only a part of the larger IT transformation of “Copernican” proportion, in which the focal point of enterprise architecture is irreversibly shifting away from “application” toward “data.” Increasingly the “smarts” are being moved out of applications and into data, because making data smart helps ensure interoperability with present and future applications, both within and external to your organization. Therefore, it seems to me that the correct model is to let objects (an “application-level” concept as applications consist of interacting objects according to OO design doctrine) revolve around persistent data (which outlives objects and applications) instead of the other way around. In other words, applications and objects come and go, but your smart data lives on forever. Object-centric notions of serialization and de-serialization are the wrong way to think about XML documents, not just in the context of SOA, but in the larger context of IT architecture as well. To design the next-generation, services-oriented, “smart-data” enterprise you’ll need to pick up the XML slicer and turn yourselves into “data” artists.

Code Examples
The rest of this article shows you how to combine VTD-XML’s slicing, modifying, and editing capabilities to intelligently manipulate XML document content. Each example consists of a brief description, the input XML documents (side-by-side with the output documents so you can see what the application does), and the Java code. You can download the complete code for the examples in Java, C#, or C. There are many more examples in the code download than could fit in this article.

Example 1: Remove Element Fragments
This example removes all the fragments evaluating to the XPath expression /root/b. In all these examples, the table shows the XML for both the input and output.

Input XML	Output XML
text text text text text text text text text	text text text text text text

Here’s the Java code:

   import com.ximpleware.*;

   public class removeFragments {
      public static void main(String[] args) throws Exception {
         VTDGen vg = new VTDGen();
         AutoPilot ap = new AutoPilot();
         XMLModifier xm = new XMLModifier();
         ap.selectXPath("/root/b");
         if (vg.parseFile("old.xml", false)) {
            VTDNav vn = vg.getNav();
            ap.bind(vn);
            xm.bind(vn);
            // mark every element matching /root/b for removal
            while (ap.evalXPath() != -1) {
               xm.remove();
            }
            xm.output("new.xml");
         }
      }
   }

A second class, arrangeFragments, slices the /root/a, /root/b, and /root/c element fragments out of the document and writes them to a new file, grouped in that order:

   import com.ximpleware.*;
   import java.io.*;

   public class arrangeFragments {
      // l holds the fragment's offset (lower 32 bits) and length (upper 32 bits)
      public static void writeFragment(OutputStream os, long l, byte[] ba)
         throws IOException {
         int offset = (int) l;
         int len = (int) (l >> 32);
         os.write('\n');
         os.write(ba, offset, len);
      }

      public static void main(String[] args) throws Exception {
         VTDGen vg = new VTDGen();
         AutoPilot ap0 = new AutoPilot();
         AutoPilot ap1 = new AutoPilot();
         AutoPilot ap2 = new AutoPilot();
         ap0.selectXPath("/root/a");
         ap1.selectXPath("/root/b");
         ap2.selectXPath("/root/c");
         if (vg.parseFile("old.xml", false)) {
            VTDNav vn = vg.getNav();
            ap0.bind(vn);
            ap1.bind(vn);
            ap2.bind(vn);

            FileOutputStream fos = new FileOutputStream("new.xml");
            // opening tag of the new root element (assumed to be <root>,
            // to match the XPath expressions above)
            fos.write("<root>".getBytes());
            byte[] ba = vn.getXML().getBytes();

            // write all "a" fragments, then all "b", then all "c"
            while (ap0.evalXPath() != -1) {
               long l = vn.getElementFragment();
               writeFragment(fos, l, ba);
            }
            ap0.resetXPath();

            while (ap1.evalXPath() != -1) {
               long l = vn.getElementFragment();
               writeFragment(fos, l, ba);
            }
            ap1.resetXPath();

            while (ap2.evalXPath() != -1) {
               long l = vn.getElementFragment();
               writeFragment(fos, l, ba);
            }
            ap2.resetXPath();

            fos.write('\n');
            // closing tag of the new root element (same assumption as above)
            fos.write("</root>".getBytes());
            fos.close();
         }
      }
   }

Example 2: Inserting Namespace-Compensated Element Fragments
This example extracts a namespace-compensated element fragment (m:reservation) from the input XML #2 (highlighted section), and inserts it both before and after the a element from the input XML #1. The output XML shows the results, with the inserted content highlighted.

Input XML #1

Input XML #2

Output XML

text
text

       “http://www.w3.org/2003/05/
   soap-envelope” xmlns=”abc”>

Example 3: Make an XML Template
An XML template is analogous to the tax forms you can pick up from the IRS. Those forms have descriptions of fields, but the fields themselves are empty and wide enough for you to fill in data. An XML template is basically an XML document with all the element tags, attribute names, and unfilled fields wide enough to fill in the data. Your application can parse the template, fill in the data, or do something else. Note that the data you fill in with this example becomes available to calling applications immediately, without the need to reparse.

This example takes a CD catalog document and replaces the text nodes with empty fields of defined lengths. Notice that this technique is similar to designing a SQL table, where you have to specify the width of the table columns.

Input XML	Output XML
Empire Burlesque Bob Dylan USA Columbia 10.90 1985 Still Got the Blues Gary More UK Virgin Records 10.20 1990 Hide Your Heart Bonnie Tyler UK CBSRecords 9.90 1988 Greatest Hits Dolly Parton USA RCA 9.90 1982

Here’s the Java code that drives the example:

   import com.ximpleware.*;

   public class makeTemplate {
      // Blank replacement text for the six child elements of each CD element;
      // the number of spaces in each string sets the width of that field.
      public static byte[] ba0, ba1, ba2, ba3, ba4, ba5;

      public static void main(String[] args) throws Exception {
         ba0 = " ".getBytes();
         ba1 = " ".getBytes();
         ba2 = " ".getBytes();
         ba3 = " ".getBytes();
         ba4 = " ".getBytes();
         ba5 = " ".getBytes();
         VTDGen vg = new VTDGen();
         BookMark bm = new BookMark();
         AutoPilot ap = new AutoPilot();
         XMLModifier xm = new XMLModifier();
         ap.selectXPath("/CATALOG/CD");
         if (vg.parseFile("old_cd.xml", false)) {
            VTDNav vn = vg.getNav();
            ap.bind(vn);
            xm.bind(vn);
            int i;
            // Insert blank fields that can be filled in later
            while ((i = ap.evalXPath()) != -1) {
               convert(vn, xm);
            }
            xm.output("cd_Template.xml");
         }
      }

      // Replace the text of each child of the current CD element with a blank field,
      // then move the cursor back to the CD element.
      public static void convert(VTDNav vn, XMLModifier xm) throws Exception {
         int i = -1;
         vn.toElement(VTDNav.FIRST_CHILD);
         i = vn.getText();
         xm.updateToken(i, ba0);

         vn.toElement(VTDNav.NEXT_SIBLING);
         i = vn.getText();
         xm.updateToken(i, ba1);

         vn.toElement(VTDNav.NEXT_SIBLING);
         i = vn.getText();
         xm.updateToken(i, ba2);

         vn.toElement(VTDNav.NEXT_SIBLING);
         i = vn.getText();
         xm.updateToken(i, ba3);

         vn.toElement(VTDNav.NEXT_SIBLING);
         i = vn.getText();
         xm.updateToken(i, ba4);

         vn.toElement(VTDNav.NEXT_SIBLING);
         i = vn.getText();
         xm.updateToken(i, ba5);

         vn.toElement(VTDNav.PARENT);
      }
   }

Example 4: Erasing Fields
This last example erases the fields of the second “CD” element, highlighted in the output XML. Notice that this approach works much better than removing the second element, because you can later fill in new values.

Input XML	Output XML
EmpireBurlesque BobDylan USA Columbia 10.9 1985 Still Got the Blues GaryMore UK Virgin 10.2 1990	EmpireBurlesque BobDylan USA Columbia 10.9 1985

And here’s the code:

   import com.ximpleware.*;
   import java.io.*;

   public class erase {
      public static void main(String[] args) throws Exception {
         VTDGen vg = new VTDGen();
         AutoPilot ap = new AutoPilot();
         // select the text nodes of every child of the CD whose PRICE is 10.2
         ap.selectXPath("/CATALOG/CD[PRICE=10.2]/*/text()");
         if (vg.parseFile("old_cd.xml", false)) {
            VTDNav vn = vg.getNav();
            ap.bind(vn);
            int i;
            byte[] ba = "".getBytes();
            while ((i = ap.evalXPath()) != -1) {
               // overwriting with empty content blanks the field but keeps its width
               vn.overWrite(i, ba);
            }
            // write the edited in-memory document to a file
            FileOutputStream fos = new FileOutputStream("new_cd.xml");
            fos.write(vn.getXML().getBytes());
            fos.close();
         }
      }
   }

At this point, you’ve seen how VTD-XML functions as a slicer, an editor, and an incremental modifier to help manipulate your XML documents intelligently. Most of the features discussed in this article are available to you only through VTD-XML. As XML assumes an increasingly larger role in enterprise computing, it’s also simplifying the enterprise IT infrastructure by breaking down silos. Many of XML’s much-discussed performance issues are not caused by XML per se, but instead by excessive, unmoderated use of OO design methodology. Non-extractive, “document-centric” XML processing takes a balanced approach and aims to fundamentally resolve those problems. But that is just a new beginning; the best is yet to come. By judiciously combining the various new tools and features, you can discover new approaches and develop new ways of thinking that are tailored toward a services-oriented, “smart-data” world.

Charlie Frank

Charlie has over a decade of experience in website administration and technology management. As the site admin, he oversees all technical aspects of running a high-traffic online platform, ensuring optimal performance, security, and user experience.
