Improve XPath Efficiency with VTD-XML

Improve XPath Efficiency with VTD-XML

or the past several years, XPath has been steadily gaining popularity as an effective tool when developing XML applications. XPath was originally viewed as an adjunct element for the W3C’s XSLT and XPointer specifications, but developers found its simplicity appealing. With XPath, instead of manually navigating the hierarchical data structure, you can use compact, “file-system”-like expressions to address any node or set of nodes in XML documents. However, most existing XPath engines work with DOM trees or similar object models, which are slow to build and modify?and consume excessive amounts of memory. This presents a dilemma for anyone looking to take advantage of XPath for SOA applications that are either performance sensitive or routinely deal with large XML documents.

My last two articles with DevX (see the Related Resources) introduced VTD-XML as a next-generation XML processing model that goes beyond DOM and SAX in performance, memory usage, and ease of use. VTD-XML is simultaneously:

  • Memory-efficient: VTD-XML typically requires only somewhere between 1.3 ~ 1.5 times the size of the XML document itself?meaning it’s far more memory-efficient than DOM?and works very well with large XML documents.
  • High-performance: VTD-XML typically outperforms DOM parsers by 5 ~ 10 times, and it typically outperforms SAX parsers with null content handlers by about 100 percent
  • Easy to use: Applications written in VTD-XML are more compact and readable than those written in DOM or SAX.

What is VTD-XML’s not-so-secret sauce? Unlike traditional XML parsers which create string-based tokens as the first step of parsing, VTD-XML uses linear buffers internally to store 64-bit integers containing the starting offsets and lengths of XML tokens, while keeping the un-decoded XML document intact in memory. All VTD-XML’s benefits are the result?one way or the other?of this “non-extractive” tokenization. At the API level, VTD-XML consists of the following core classes:

  • VTDGen encapsulates the main parsing, index writing, and index loading functions.
  • VTDNav exports a cursor-based API that contains the methods that navigate the XML hierarchy.
  • AutoPilot supports document-order element traversal?similar to Xerces’ NodeIterator.

VTD-XML’s XPath Implementation
VTD-XML’s XPath implementation, introduced with version 1.0, supports the full W3C XPath 1.0 spec. It builds upon VTDNav’s concept of cursor-based navigation. The AutoPilot class exports all the XPath-related methods. As described in one of the earlier articles, to manually navigate an XML document’s hierarchical structure, you obtain a VTDNav instance, and repeatedly call the toElement() method to move the cursor to various parts of the document. Using XPath you can either move the cursor manually or tell AutoPilot to move it to qualified nodes in the document automatically.

Table 1 shows AutoPilot’s XPath-related methods.

Table 1. AutoPilot’s XPath-related Methods: The table lists AutoPilot’s XPath-related methods along with a short description of each.
Method Description
declareXPathNameSpace(…) Binds a namespace prefix (used in the XPath expression) to a URL.
selectXPath(…) Compiles an XPath expression into an internal representation.
evalXPath(…) Moves the cursor to a qualified node in the node set.
evalXPathToBoolean(…), evalXPathToNumber(…), evalXpathToString(…) These three methods evalute an XPath expression to a Boolean, a double and a string, respectively.
resetXPath() Resets the internal state so the XPath can be re-used.
getExprString() Call this method to verify the correctness of the compiled expression.

VTD-XML’s XPath implementation also introduces two exception classes:

  • XPathParseException?Thrown when there is a syntax error in the XPath expression.
  • XPathEvalException?Thrown when an exception condition occurs during XPath evaluation.

Non-Blocking Node Set Evaluation
The most important innovation in VTD-XML’s XPath implementation is its stateless, non-blocking evaluation of the node set. Unlike other XPath engines (such as JAXEN, XALAN, etc) that return the entire node set all at once, VTD-XML returns a single qualified node as soon as the XPath engine finds a match. After receiving the first match, you repeatedly call the key method in the AutoPilot class called evalXPath(), which moves the cursor to the next node (as long as at least one matching node remains in the node set) every time it is called. To illustrate the differences between those two styles of XPath evaluation, compare the two short snippets written in JAXEN and VTD-XML respectively that execute the same application logic on the catalog.xml in Listing 1.

Here’s the JAXEN-based code. Note that JAXEN returns the entire matching node set for the provided XPath query:

   import javax.xml.parsers.*;   import*;   import org.w3c.dom.*;   import org.xml.sax.InputSource;   import org.jaxen.*;   import org.jaxen.dom.*;   import java.util.*;    public class jaxenXPath   {       public static void main(String args[]) throws Exception      {         DocumentBuilderFactory factory =            DocumentBuilderFactory.newInstance();         DocumentBuilder parser = factory.newDocumentBuilder();         Document d = parser.parse("catalog.xml");         XPath expression = new            org.jaxen.dom.DOMXPath("/CATALOG/CD/TITLE/text()");            // Return the entire node set all at once         List result = expression.selectNodes(d);         int sz = result.size();         for (int i=0;i "+              ((Node)result.get(i)).getNodeValue());         }      }   } 

In contrast, here’s the same application written for VTD-XML:

   import com.ximpleware.*;    public class vtdXPath   {      public static void main(String args[]) throws Exception      {         VTDGen vg =new VTDGen();         int i;         AutoPilot ap = new AutoPilot();          ap.selectXPath("/CATALOG/CD/TITLE/text()");         if (vg.parseFile("catalog.xml", false))          {            VTDNav vn = vg.getNav();            ap.bind(vn);            //XPath eval returns one node at a time            while((i=ap.evalXPath())!=-1)            {              System.out.println(" text ==> "+                 vn.toString(i));             }             ap.resetXPath();         }      }    }

In the second example above, VTD-XML’s XPath implementation returns the first matching node immediately, and then returns additional matching nodes in the while loop, letting the application begin handling matching nodes immediately, rather than waiting until the XPath implementation finds all the matching nodes.

The code sample shown above deals with only one XPath expression; to deal with multiple XPath expressions, you must instantiate multiple AutoPilot objects, each of which corresponds to a different XPath expression. When you call selectXPath(…) multiple times, the AutoPilot object remembers only the last XPath expression. After an AutoPilot object selects an XPath expression, you call evalXPath() which moves the embedded VTDNav instance cursor to the next available qualified node, and returns its VTD record index. When no more matching nodes remain in the node set, evalXPath() returns -1.

Querying Multiple Documents
Another common scenario consists of executing the same XPath query against multiple XML documents. In this case, you can reuse the XPath query for each document, through AutoPilot’s delayed binding feature. Here’s the procedure: After executing a query on one XML document, you bind (by calling bind(…)) the same AutoPilot object to a new VTDNav object that corresponds to a different XML document. You also need to call resetXPath(), which resets the internal XPath state so it is ready for reuse.

Listing 2 shows an example that applies two XPath expressions to two different XML documents.

The Benefits
When comparing with other XPath engines, VTD-XML’s non-blocking XPath evaluation approach has a number of benefits. Without sacrificing any inherent benefits of XPath, VTD-XML offers unsurpassed flexibility and puts more control in your hands. Consider the use case in which you want to use XPath to process a large XML document and the node set contains thousands of nodes. With other XPath engines, your application logic will have to wait for the entire node set to be returned before beginning to process matches. But with VTD-XML, your application logic can begin processing matches as soon as the XPath implementation finds and returns the first node.

If your application processes only a small number of nodes in the node set, or you have prior knowledge of the XML document (e.g. you know there is only one occurrence of the matching node at the beginning of a large XML file), this saves the time you’d have to wait while the XPath implementation finishes searching the entire document. In this case, other types of XPath evaluation unnecessarily block the program flow and seem more like the wrong approach!

As another example, you can use VTD-XML’s non-blocking XPath to walk the nodes, which simply isn’t possible with other types of XPath engines.

When combined with the other benefits and features of VTD-XML, the comparison with DOM and SAX gets lopsided in a hurry. Regardless of the file sizes, whether it is parsing, updating or XPath evaluation (shown later), VTD-XML delivers the kind of across-the-board performance improvements never before thought possible by many. For transactional SOA applications dealing with small-to-medium-size XML files, VTD-XML easily blows away DOM?and the benefit increases for large XML files. When compared with SAX, you literally move from long, ugly, and difficult-to-maintain SAX code to clean, short VTD-XML code, while enjoying the full power of VTD-XML’s random access and improving performance between 10 and 50 times along the way.

If you regularly deal with large XML documents or XPath expressions returning huge node sets, VTD-XML is a must-have!

Common Usage Patterns
There are a few common patterns to follow in order to base your application on VTD-XML.

Embed evalXPath(..) in while Statements
The most common pattern is to embed the call to evalXPath() in the control statement of a while loop. So far the examples you’ve seen code all use this pattern to navigate to the qualified nodes in the document. In this pattern, the while loop checks the return value of evalXPath(); when it equals -1, there are no more matching nodes in the node set. Otherwise, the return value equals the VTD record index of the matching node, to which the cursor of the VTDNav object (bound to the AutoPilot object) also points. The following two short code fragments illustrate this?although syntactically slightly different, they produce equivalent results:

   int i;    ap.bind(vn);    ap.selectXPath(...);    while( (i=ap.evalXPath())!=-1){         }       int i;    ap.bind(vn);    ap.selectXPath(...);    while(ap.evalXPath()!>=0){       i = vn.getCurrentIndex();    } 

Combining XPath with Manual Navigation
But sometimes you want to navigate the VTDNav cursor a bit further within the while loop. To do that, the rule is that the code segment nested in the loop must not alter the node position, meaning that you can either navigate where you like, and then manually move the cursor back to the original location before the end of the loop, or you can use the VTDNav object’s push() and pop() methods, which push and pop cursor locations onto a stack, to comply with the rule. I’ll show you both patterns.

Using the catalog.xml file in Listing 1, this first example navigates manually within the while loop to the first child node of each matching element, extracts the text, and then moves the cursor back to the starting position before the next iteration:

   import com.ximpleware.*;    public class vtdXPath2   {      public static void main(String args[]) throws Exception      {         VTDGen vg =new VTDGen();         int i;         AutoPilot ap = new AutoPilot();          ap.selectXPath("/CATALOG/CD[PRICE < 10]");         if (vg.parseFile("catalog.xml", false))         {            VTDNav vn = vg.getNav();            ap.bind(vn);               //XPath eval returns one node at a time            while((i=ap.evalXPath())!=-1)            {                // get to the first child               if (vn.toElement(VTDNav.FIRST_CHILD, "TITLE"))               {                   int j = vn.getText();                   if (j!=-1)                      System.out.println(" text node ==>"+vn.toString(j));                   vn.toElement(VTDNav.PARENT); // move the cursor back                }             }             ap.resetXPath();         }      }   } 

This second example (which is slightly less efficient) uses push() and pop() to reset the cursor position:

   import com.ximpleware.*;    public class vtdXPath3    {      public static void main(String args[]) throws Exception      {         VTDGen vg =new VTDGen();         int i;         AutoPilot ap = new AutoPilot();          ap.selectXPath("/CATALOG/CD[PRICE < 10]");         if (vg.parseFile("catalog.xml", false))         {            VTDNav vn = vg.getNav();            ap.bind(vn);               //XPath eval returns one node at a time            while((i=ap.evalXPath())!=-1)            {                // push the current cursor position               vn.push();                // get to the first child               if (vn.toElement(VTDNav.FIRST_CHILD, "TITLE"))               {                   int j = vn.getText();                   if (j!=-1)                      System.out.println(" text node ==>"+vn.toString(j));                }                // restore the cursor position               vn.pop();             }             ap.resetXPath();         }      }   } 

Nested XPath Support
You aren’t limited to manual navigation within the while loop. For more complicated navigation, you can nest XPath queries. The following example is equivalent to the examples in the preceding section, but has been rewritten to use nested XPath queries. Note that the rule that ensures cursor position consistency still applies:

   import com.ximpleware.*;    public class vtdXPath4   {      public static void main(String args[]) throws Exception      {         VTDGen vg =new VTDGen();         int i;         AutoPilot ap = new AutoPilot();          ap.selectXPath("/CATALOG/CD[PRICE < 10]");          AutoPilot ap2 = new AutoPilot();          ap2.selectXPath("TITLE/text()");          if (vg.parseFile("catalog.xml", false))         {            VTDNav vn = vg.getNav();            ap.bind(vn);             ap2.bind(vn);               //XPath eval returns one node at a time            while((i=ap.evalXPath())!=-1)            {                vn.push();                // get to the first child                int j;               while ((j=ap2.evalXPath())!=-1)               {                   System.out.println(" text node ==>" + vn.toString(j));                }                // don't forget this next statement; the next iteration will reuse               // ap2's XPath!!!                ap2.resetXPath();                vn.pop();             }             ap.resetXPath();         }      }   } 

Ad-Hoc Tree-walker Mode
As mentioned earlier, one way to use evalXPath() is just to move the cursor to the desired destination(s) in the XML document:

   import com.ximpleware.*;    public class vtdXPath5   {      public static void main(String args[]) throws Exception      {         VTDGen vg =new VTDGen();         int i;         AutoPilot ap = new AutoPilot();          ap.selectXPath("/CATALOG/CD[PRICE < 10]");          if (vg.parseFile("catalog.xml", false))         {            VTDNav vn = vg.getNav();            ap.bind(vn);            //XPath eval returns one node at a time            if (ap.evalXPath()>=0)            {                // do whatever you like here with vn 
Figure 1. XPath Evaluation Performance Comparison: The charts in this figure show the average latencies of several different query executions for JAXEN, Xalan, and VTD-XML against three documents of similar structure, but differing in size.
} } } }

XPath Evaluation Performance Benchmarks
Finally, to round out the performance claims, the figures in this section briefly show the results of comparing VTD-XML’s XPath performance with that of Xalan and JAXEN, both of which work with the Xerces DOM. The benchmark programs pre-compile a set of XPath expressions and then repetitively execute the queries over the same set of documents. The charts in Figure 1 show the average latencies of several different query executions for JAXEN, Xalan, and VTD-XML against three documents of similar structure, but differing in size.

You can read a complete description and discussion of the test set up here, but the figures shown here should give you a very good idea of the overall results.

In this article, you’ve seen an introduction and overview of VTD-XML’s XPath implementation. VTD-XML empowers you to write clean, readable code and achieve peerless performance, efficiency, and flexibility. In fact, VTD-XML is changing some of the most deeply entrenched perceptions and assumptions about XML programming. Consequently, the infrastructure built on top of XML, such as Web services and SOA, is on the verge of positive changes as well.

Share the Post:
XDR solutions

The Benefits of Using XDR Solutions

Cybercriminals constantly adapt their strategies, developing newer, more powerful, and intelligent ways to attack your network. Since security professionals must innovate as well, more conventional endpoint detection solutions have evolved

AI is revolutionizing fraud detection

How AI is Revolutionizing Fraud Detection

Artificial intelligence – commonly known as AI – means a form of technology with multiple uses. As a result, it has become extremely valuable to a number of businesses across

AI innovation

Companies Leading AI Innovation in 2023

Artificial intelligence (AI) has been transforming industries and revolutionizing business operations. AI’s potential to enhance efficiency and productivity has become crucial to many businesses. As we move into 2023, several

data fivetran pricing

Fivetran Pricing Explained

One of the biggest trends of the 21st century is the massive surge in analytics. Analytics is the process of utilizing data to drive future decision-making. With so much of

kubernetes logging

Kubernetes Logging: What You Need to Know

Kubernetes from Google is one of the most popular open-source and free container management solutions made to make managing and deploying applications easier. It has a solid architecture that makes

ransomware cyber attack

Why Is Ransomware Such a Major Threat?

One of the most significant cyber threats faced by modern organizations is a ransomware attack. Ransomware attacks have grown in both sophistication and frequency over the past few years, forcing

data dictionary

Tools You Need to Make a Data Dictionary

Data dictionaries are crucial for organizations of all sizes that deal with large amounts of data. they are centralized repositories of all the data in organizations, including metadata such as