Better, Faster XML Processing with VTD-XML : Page 3

VTD-XML is a new open source XML processing API that offers an alternative to SAX and DOM without forcing you to trade processing performance for usability. Find out why this Java-based, non-validating parser is faster than DOM and better than SAX.





Comparison with Existing XML Processing APIs
Now that you've seen the basic strategy behind VTD-XML, it's worth looking at the properties that set it apart from similar XML APIs, such as DOM and XMLCursor.

First, VTD-XML's hierarchy consists exclusively of element nodes. This is very different from DOM, which treats every node, whether it is an attribute node or a text node, as part of the hierarchy. Second, in VTD-XML, there is one and only one cursor for every instance of VTDNav. You can move the cursor back and forth in the hierarchy, but you may not duplicate it. However, you can temporarily save the location of the cursor on a global stack. VTDNav has two stack access methods: calling push() saves the cursor state, and calling pop() restores it. Suppose that you're somewhere in the element hierarchy and you want to save the current location, move to a different part of the document, and then continue from the saved point. To do that, you first push() the location onto the stack. Then, after moving the cursor to a different part of the document, you can very quickly jump back to the saved location by popping it off the stack.
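The save-and-restore pattern described above can be sketched in plain Java. This is only an illustration of the idea, not the VTD-XML implementation: the class `CursorStackSketch`, its flat array of element names, and the `moveTo()` and `current()` helpers are all hypothetical, and only the push()/pop() names are borrowed from the article.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical single-cursor navigator; a sketch, not VTD-XML's VTDNav. */
final class CursorStackSketch {
    private final String[] elements;                        // element names in document order (simplified, flat)
    private int cursor = 0;                                  // the one and only cursor
    private final Deque<Integer> saved = new ArrayDeque<>(); // stack of saved cursor locations

    CursorStackSketch(String... elements) {
        this.elements = elements;
    }

    /** Moves the cursor to another element. */
    void moveTo(int index) {
        if (index < 0 || index >= elements.length) {
            throw new IndexOutOfBoundsException("no element at " + index);
        }
        cursor = index;
    }

    String current() {
        return elements[cursor];
    }

    /** Saves the current cursor location. */
    void push() {
        saved.push(cursor);
    }

    /** Restores the most recently saved location; returns false if nothing was saved. */
    boolean pop() {
        if (saved.isEmpty()) {
            return false;
        }
        cursor = saved.pop();
        return true;
    }

    public static void main(String[] args) {
        CursorStackSketch nav = new CursorStackSketch("order", "customer", "item", "price");
        nav.moveTo(1);                      // cursor on "customer"
        nav.push();                         // remember this location
        nav.moveTo(3);                      // wander off to "price"
        System.out.println(nav.current());  // prints "price"
        nav.pop();                          // jump straight back
        System.out.println(nav.current());  // prints "customer"
    }
}
```

Because the saved state is just an integer, saving and restoring a location costs very little, which is consistent with the article's point that jumping back is quick.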

The most distinctive aspect of VTD-XML, the one that sets it apart from other XML processing APIs, is its "non-extractive" tokenization based on the Virtual Token Descriptor (VTD). As mentioned earlier, non-extractive parsing is the key to VTD-XML's processing and memory efficiency. This non-extractiveness shows up in the following ways.

Figure 1. Extractive vs. Non-extractive Parsing: The figure shows the basic differences between extractive and non-extractive approaches to parsing, making it easy to see why non-extractive parsing is more efficient and faster in most cases.

First, many member methods of VTDNav, such as getAttrVal(), getCurrentIndex(), and getText(), return an integer. This integer is in fact a VTD record index that describes the token requested by the caller. After parsing, VTD-XML produces a linear buffer filled with VTD records. Because VTD records all have the same length, you can access any record in the buffer if you know its index value. Also notice that VTD records are not objects, and therefore are not addressable using pointers. When a VTDNav function doesn't evaluate to any meaningful value, it returns -1, which you can think of as more or less equivalent to a NULL pointer in DOM.
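To make the idea of index-addressable, fixed-length records concrete, here is a minimal sketch in plain Java. The bit layout (4 bits of type, 30 bits of offset, 30 bits of length packed into one long), the token type constants, and the `textOf()` helper are assumptions made for illustration; they are not taken from the VTD-XML specification or source code.

```java
/** Hypothetical fixed-length token records stored in a long[]; not the real VTD layout. */
final class VtdRecordSketch {
    static final int ELEMENT = 0;  // assumed token type codes
    static final int TEXT = 1;
    private static final long MASK30 = (1L << 30) - 1;

    /** Packs type, offset, and length into one 64-bit record (assumed layout). */
    static long pack(int type, int offset, int length) {
        return ((long) type << 60) | ((long) offset << 30) | length;
    }

    static int type(long record) {
        return (int) (record >>> 60);
    }

    static int offset(long record) {
        return (int) ((record >>> 30) & MASK30);
    }

    static int length(long record) {
        return (int) (record & MASK30);
    }

    /** Returns the index of the text record directly after an element record, or -1 if there is none. */
    static int textOf(long[] records, int elementIndex) {
        int next = elementIndex + 1;
        return (next < records.length && type(records[next]) == TEXT) ? next : -1;
    }

    public static void main(String[] args) {
        String xml = "<a>42</a><b/>";
        long[] records = {
            pack(ELEMENT, 1, 1),   // element name "a" at offset 1
            pack(TEXT, 3, 2),      // text "42" at offset 3
            pack(ELEMENT, 10, 1),  // element name "b" at offset 10
        };
        int i = textOf(records, 0);
        long r = records[i];
        // A substring is created here only to display the token.
        System.out.println(xml.substring(offset(r), offset(r) + length(r))); // prints "42"
        System.out.println(textOf(records, 2));                              // prints "-1"
    }
}
```

Since every record occupies exactly one array slot, looking one up by index is a single array access, and -1 serves as the "nothing here" result because it can never be a valid index.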

Second, because the parsing process doesn't create any string objects (after all, tokenization is done virtually), VTD-XML implements its own set of comparison functions that operate directly on VTD records. For example, VTDNav's matchElement() method tests whether the element name, which is effectively the VTD record at the cursor, matches a given string. Similarly, VTDNav's matchTokenString(), matchRawTokenString(), and matchNormalizedTokenString() methods perform a direct comparison between a string and a VTD record, although each has a different flavor. Why is this a good thing? Because you don't have to pull tokens out into string objects, and you have every incentive not to: string objects are expensive to create, especially in large numbers. Even worse, those strings will eventually have to be garbage-collected. Bypassing excessive object creation is the main reason VTD-XML significantly outperforms DOM and SAX.

By the same token, VTD-XML also implements its own set of string-to-numeric data conversion functions that operate directly on VTD records. VTDNav has these four member methods: parseInt(), parseLong(), parseFloat(), and parseDouble(). Those functions take a VTD record index value and convert the token directly into a numeric data type. Figure 1 shows the difference between extractive and non-extractive parsing for string-to-integer conversion.
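The following sketch shows, in plain Java, what comparing and converting a token in place can look like. It does not use the VTD-XML library: the class `NonExtractiveOps` and the `(doc, offset, length)` parameters are hypothetical stand-ins for a VTD record, and the code assumes ASCII content and ignores integer overflow to stay short.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical in-place token operations; a sketch of the idea, not VTDNav's code. */
final class NonExtractiveOps {

    /** True if the bytes doc[offset, offset + length) spell s. No String is created. Assumes ASCII. */
    static boolean matchToken(byte[] doc, int offset, int length, String s) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (doc[offset + i] != (byte) s.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Converts the decimal token doc[offset, offset + length) to an int without extracting it. */
    static int parseInt(byte[] doc, int offset, int length) {
        if (length == 0) {
            throw new NumberFormatException("empty token");
        }
        int i = offset;
        int end = offset + length;
        boolean negative = doc[i] == '-';
        if (negative) {
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("sign without digits");
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = doc[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at byte " + i);
            }
            value = value * 10 + digit;  // overflow is not checked in this sketch
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] doc = "<qty>128</qty>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(matchToken(doc, 1, 3, "qty")); // prints "true"
        System.out.println(parseInt(doc, 5, 3));          // prints "128"
    }
}
```

Both methods read the original document bytes and allocate nothing on the success path, which is the property the article credits for VTD-XML's speed.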

When you do need strings for certain tasks, for example, formatting SQL queries, you can use VTDNav's toString(), toRawString(), and toNormalizedString() methods. All three methods accept the index value of a VTD record and convert the record's value into a string. Still, for maximum performance, please avoid creating string objects whenever possible.

Finally, a nice by-product of VTD-XML's non-extractive tokenization is a feature called "incremental update." Because a VTD record marks the region of the XML message in which a token resides, changing the content of that token requires updating only that region. Likewise, adding new content to the XML message can be as simple as inserting the bytes at the right location in the message. In contrast, accomplishing the same thing with DOM or SAX often requires taking the message apart and then putting everything back together. You can find more information about how DOM and SAX take apart XML documents here.
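As a rough illustration of incremental update, the sketch below replaces one token's bytes in a document and leaves everything else untouched. It is plain Java, not the VTD-XML API: `IncrementalUpdateSketch` and `replaceToken()` are hypothetical names, and the offset and length are assumed to come from a token record such as a VTD record.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical byte-level update; a sketch of the idea, not VTD-XML's own mechanism. */
final class IncrementalUpdateSketch {

    /**
     * Returns a copy of doc in which the bytes [offset, offset + length) are
     * replaced by replacement. Passing length 0 inserts without removing anything.
     */
    static byte[] replaceToken(byte[] doc, int offset, int length, byte[] replacement) {
        byte[] out = new byte[doc.length - length + replacement.length];
        System.arraycopy(doc, 0, out, 0, offset);                          // bytes before the token
        System.arraycopy(replacement, 0, out, offset, replacement.length); // new content
        System.arraycopy(doc, offset + length, out,
                offset + replacement.length, doc.length - offset - length); // bytes after the token
        return out;
    }

    public static void main(String[] args) {
        byte[] doc = "<price>10</price>".getBytes(StandardCharsets.US_ASCII);
        // The text token "10" starts at byte 7 and is 2 bytes long.
        byte[] updated = replaceToken(doc, 7, 2, "12.50".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(updated, StandardCharsets.US_ASCII)); // prints "<price>12.50</price>"
    }
}
```

One caveat with this simple approach: when the replacement has a different length, any previously computed offsets that point past the edit no longer line up, so they would need to be adjusted or the message re-parsed before further navigation.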
