Better, Faster XML Processing with VTD-XML

TD-XML is a new, open-source, non-validating, non-extractive XML processing API written in Java. Different from current XML processing technologies, VTD-XML is designed to be random-access capable without incurring excessive resource overhead. One key optimization of VTD-XML is non-extractive tokenization. Internally, VTD-XML retains the XML message in memory intact and un-decoded, and tokens represents tokens using starting offset and length exclusively.

VTD-XML’s tokenization is based on the Virtual Token Descriptor (VTD) core binary encoding specification. A VTD record is a 64-bit integer that encodes the token length, starting offset, type and nesting depth of a token in XML.

Because VTD records are constant in length, one can bulk-allocate memory buffers to store those records, and therefore avoid creating the large numbers of string/node objects typically associated with other XML processing technologies. As a result, VTD-XML achieves reductions in both memory usage and object creation cost, thus leading to significantly higher processing performance. On a 1.5 Ghz Athlon machine, VTD-XML delivers random access at a performance level of 25~35MB/sec, outperforming most SAX parsers with null content handlers. An in-memory VTD-XML document typically consumes only 1.3 to 1.5 times the size of the XML document itself.

For software developers, VTD-XML provides several benefits. For example, when you start working on a project involving XML, you have to pick a processing model. You’ve probably been told that DOM is slow and consumes too much memory, particularly for large documents. But you also find SAX difficult to use, especially for XML documents with complex structures. VTD-XML provides a new option that doesn’t force you to trade processing performance for usability. In fact, VTD-XML’s random-access capability is critical in providing the best possible performance. Although SAX is fast, because of its forward-only nature, raw SAX performance is usually not indicative of real-world performance. In some situations, you end up doing lots of buffering to extract the data you need, while in others, you have to repeat SAX parsing on the same document multiple times. No matter what you do, SAX programming usually results in ugly and unmaintainable code, while the performance benefit over DOM isn’t always significant. Using VTD-XML, you should be able to simultaneously achieve ease-of-use and high-performance. And its performance benefit over DOM is substantial.

Understanding VTD-XML
You can use two criteria to determine whether VTD-XML can help with your next XML project. First, the current version of VTD-XML doesn’t support entity declarations in DTDs. It only recognizes five built-in entities (&s;, ', <, >, and ). So if you’re dealing with SOAP, RDF, FIXML, or RSS, VTD-XML should handle the job very well. Second, VTD-XML’s internal parsed representation of XML is slightly larger than the XML itself, so you have to make sure that you have sufficient RAM. Keep in mind that to provide true, random access to the entire document, keeping that document entirely in memory is pretty much unavoidable. If both criteria are met, you’ll find VTD-XML to be the most efficient XML processing API.

At the top level, the Java API of VTD-XML consists of three essential components:

  • VTDGen (VTD generator) encapsulates the parsing routine that produces the internal parsed representation of XML.
  • VTDNav (VTD navigator) is a cursor-based API that allows for DOM-like random access to the hierarchical structure of XML.
  • Autopilot is the class that allows for document-order element traversal similar to Xerces’ NodeIterator.

To use VTD-XML to process an XML document, whether from disk, or via HTTP, the first step is to find out its length, allocate a chunk of memory big enough to hold the document, and then read the entire document into memory. Next, you create an instance of VTDGen and assign the byte array to it using setDoc(). Finally, you call parse(boolean ns) to generate the parsed XML representation. When ns is set to true, subsequent document navigation is namespace aware. If parsing succeeds, you can retrieve an instance of VTDNav by calling getNav().

Navigating the Document Hierarchy
At the onset of navigation, the VTDNav instance’s cursor points at the root element (equivalent to the VTD record of the starting tag) of the XML document. To move the cursor manually to different positions in the hierarchy, you use one of the overloaded versions of toElement(). The simplest form is toElement(int direction), which takes an integer as the input to indicate the direction in which the cursor moves. Defined as class variables of VTDNav, the six possible values of this integer are: ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING, and PREV_SIBLING. Each has its respective acronym: R, P, FC, LC, NS, and PS. The method toElement() returns a Boolean value indicating the status of the operation?true when the cursor moved successfully. If you try to move the cursor to a non-existent location (e.g. the first child of a childless element), the cursor does not move and toElement() returns false.

The method getAttrVal(String attrName) retrieves the attribute value of the element at the cursor position. Likewise, getText() retrieves the text content of the cursor element. If the namespace is turned on during parsing, you can also use toElementNS() and getAttrValNS() to navigate the document hierarchy in a namespace-aware fashion.

Autopilot is the other mode of navigation is by using. An Autopilot instance acts like a magic hand that automatically moves the cursor through the node hierarchy in document order. To use it, you first call the constructor, which accepts an instance variable of VTDNav as the input. Next, you call the selectElement() or selectElementNS() to specify the descendent elements to be sifted out. Afterwards, each call to iterate() moves the cursor to the next matching element.

Comparison with Existing XML Processing APIs
Now that you’ve seen the basic strategy behind VTD-XML, it’s worth focusing on its unique properties as compared to other similar XML APIs, such as DOM and XMLCursor.

VTD-XML’s hierarchy consists exclusively of element nodes. This is very different from DOM, which treats every node, whether it is an attribute node or a text node, as a part of the hierarchy. Second, in VTD-XML, there is one and only one cursor for every instance of VTDNav. You can move the cursor back and forth in the hierarchy, but you may not duplicate it. However, you can temporarily save the location of the cursor on a global stack. VTDNav has two stack access methods. Calling push() saves the cursor state; while calling pop() restores it. Suppose that you’re somewhere in the element hierarchy and you wanted to save the current location, move to a different part of the document, and then continue at the saved point. To do that, you first push() the location onto the stack. Then, after moving the cursor to a different part of document, you can very quickly jump back to the saved location by popping it off the stack.

The most unique aspect of VTD-XML, one that distinguishes it from any other XML processing API, is its “non-extractive” tokenization based on Virtual Token Descriptor. As mentioned earlier, non-extractive parsing is the key to achieving optimal processing and memory efficiency in VTD-XML. VTD-XML manifests this non-extractiveness in the following ways.

Figure 1. Extractive vs. Non-extractive Parsing: The figure shows the basic differences between extractive and non-extractive approaches to parsing, making it easy to see why non-extractive parsing is more efficient and faster in most cases.

First, many member methods of VTDNav, such as getAttrVal(), getCurrentIndex(), and getText() return an integer. This integer is in fact a VTD record index that describes the token as requested by the calling functions. After parsing, VTD-XML produces a linear buffer filled with VTD records. Because VTD records are all have the same length, you can access any record in the buffer if you know its index value. Also notice that VTD records are not objects, and therefore are not addressable using pointers. When a VTDNav function doesn’t evaluate to any meaningful value, it returns -1?which you can think of as more or less equivalent to a NULL pointer in DOM.

Second, because the parsing process doesn’t create any string objects (after all, tokenization is done virtually), VTD-XML implements its own set of comparison functions that directly operate on VTD records. For example, VTDNav’s matchElement() method tests if the element name, which effectively is the VTD record of the cursor, matches a given string. Similarly, VTDNav’s matchTokenString(), matchRawTokenString(), and matchNormalizedTokenString() methods perform a direct comparison between a string and a VTD record, although each has a different flavor. Why is this a good thing? Because you simply don’t have to?and have every incentive not to?pull tokens out into string objects, which are expensive to create, especially when you create lots of them. Even worse, those strings will eventually be garbage-collected. Bypassing excessive object creation is the main reason VTD-XML significantly outperforms DOM and SAX.By the same token, VTD-XML also implements its own set of string-to-numeric data conversion functions that operate directly on VTD records. VTDNav has these four member methods: parseInt(), parseLong(), parseFloat() and parseDouble(). Those functions take a VTD record index value and convert it directly into a numeric data type. Figure 1 shows the difference between extractive and non-extractive parsing for string to integer conversion.

When you do need strings for certain tasks, for example, formatting SQL queries, you can use VTDNav’s toString(), toRawString(), and toNormalizedString() methods. All three methods accept the index value of a VTD record and convert the record’s value into a string. Still, for maximum performance, please avoid creating string objects whenever possible.

Finally, a nice by-product of VTD-XML’s non-extractive tokenization is a feature called “incremental update.” Because a VTD record marks the region in the XML message in which a token resides, when one wants to change the content of that token, he only needs to update the content in that same region. Likewise, adding new content into the XML message can literally be as simple as sticking the bytes into the right location of the message. In contrast, to accomplish the same thing using DOM or SAX often requires taking the message apart, then putting everything back. You can find more information about how DOM and SAX take apart XML documents here.

A Sample Project
So far, you have had a high-level overview of VTD-XML. Now it’s time to put it into action and demonstrate some of its important concepts and features. The following examples use a ProductDetails.xml file, whose content is shown below.


The sample code parses the file using VTDGen, navigates the hierarchy using VTDNav, and then performs a simple element node iteration using AutoPilot. Also you will see how to incrementally update the content of this XML file.

For this project, create a Java class named, which only has a main() method. To handle all possible exceptions, the code uses a single try/catch block wrapped around the application logic.

try {       // application logic goes here   }   catch (ParseException e){     System.out.println(        " XML file parsing error 
"+e);   }   catch (NavException e){     System.out.println(        " Exception during navigation "+e);   }   catch ( e)   {     System.out.println(" IO exception condition"+e);   }   

The following code snippet reads the file content into a buffer, and then generates a parsed representation for ProductDetails.xml.

 // parse XML   File f = new File("ProductDetails.xml");    FileInputStream fis =  new FileInputStream(f);   // find out file length and allocate buffer   byte[] b = new byte[(int) f.length()];    // read file content into the buffer;       //create an instance of VTDGen   VTDGen vg = new VTDGen();    // assign the buffer to VTDGen   vg.setDoc(b);    // parse with namespace awareness turned off   vg.parse(false); 

Next, it obtains an instance of VTDNav from VTDGen and moves the cursor to a child element named product whose id attribute equals “3.”

   // manual navigation   // get the navigator   VTDNav vn = vg.getNav();         // verify the root element name   if (vn.matchElement("Products")){               // move the cursor to first child       // named "product"      if (vn.toElement(         VTDNav.FIRST_CHILD,"Product")) {         // stepping across every sibling          // named "Product"         do {            int i = vn.getAttrVal("ID");            if (vn.matchRawTokenString(i,"3")) {               System.out.println("element name ==>" +                   vn.toString(vn.getCurrentIndex()));               System.out.println(                  "element fragement is ==>");               // then print out element fragment                // using getElementFragment();               long l = vn.getElementFragment();                int len = (int) (l>>32);               int offset = (int) l;               System.out.println(new String(                  b,offset, len));            }         }         while(vn.toElement(            VTDNav.NEXT_SIBLING,"Product"));      }      else          System.out.println("no child named Product");   }

The output from executing the code snippet above is shown below.

 element name ==> Productelement fragment is ==>      

Next, the cursor gets reset to the root, then the code creates an instance of AutoPilot to iterate through all elements of the name “Desc.”

// iterate over element using AutoPilot   vn.toElement(VTDNav.ROOT);   AutoPilot ap = new AutoPilot(vn);   ap.selectElement("Desc"); //select "Desc"   while(ap.iterate())   {      long l = vn.getElementFragment();      int len = (int) (l>>32);      int offset = (int) l;      System.out.println(new String(b,offset,len));   }

And its corresponding output looks like this:

Finishing Up
The final section of the code demonstrates how to incrementally update the content of XML. For the element named “Product” whose “ID” equals “3,” the code changes its value to “4.” Then it removes the child element “Desc” and writes the updated content into ProductDetailsUpdated.xml.

// incremental update     vn.toElement(VTDNav.ROOT);    vn.toElement(VTDNav.FIRST_CHILD, "Product");    int  i1=0; //vtd index for "3"    long l2=0; // offset length for the child element        //find the element "Product     do {        int i = vn.getAttrVal("ID");        if (vn.matchRawTokenString(i,"3"))        {                i1 = i;                vn.toElement(VTDNav.FIRST_CHILD);                l2 = vn.getElementFragment();            }    }    while (vn.toElement(VTDNav.NEXT_SIBLING,       "Product"));        int tokenOffset = vn.getTokenOffset(i1);     int tokenLength = vn.getTokenLength(i1);    int fragOffset = (int) l2;    int fragLength = (int) (l2 >> 32);        File fo = new File("ProductDetailsUpdated.xml");    FileOutputStream fos = new FileOutputStream(fo);    // write everything before "    fos.write(b, 0, tokenOffset);     // overwrite 3 with 4    fos.write("4".getBytes());        // write everything before the element     // fragment for "Desc"    fos.write(b, tokenOffset+tokenLength,         fragOffset-(tokenOffset+tokenLength+1));     // skip the fragment, then write the rest    fos.write(b, fragOffset+fragLength +1,        b.length - 1 - (fragOffset+fragLength+1));

Here’s the content of the final ProductDetailsUpdated.xml file:


You can download the source code for the entire project. VTD-XML is hosted on SourceForge, which has an in-depth discussion on the internals of the algorithm, in case you are curious.

Future Roadmap
The VTD-XML project has plans for a few enhancements in future releases of VTD-XML. For one, the parsed representation of VTD-XML is inherently persistent, meaning that you can save it on disk, or transport it along with the XML file to reduce the overhead of repetitive parsing. A future version will add that capability. Another useful planned feature is implementing XPath, which should further enhance VTD-XML’s usability.

To summarize, VTD-XML is a new XML processing API whose primary design goal is to achieve high performance parsing and low memory usage without compromising ease of use. VTD-XML also offers unique incremental update capability. As XML becomes increasingly ubiquitous, its combination of features and attributes together, I feel that VTD-XML has the potential to play a significant role in enabling a range of applications.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin


Recent Articles: