A Step in the Right Direction: VTD-XML Improves XML Processing
Find out how this next generation XML processing API goes beyond DOM and SAX in performance, memory usage, and ease-of-use.  

If you are among those enterprise developers who routinely process large XML files, with sizes ranging from tens to hundreds of megabytes, you have most likely used one of two types of XML parsers:
  • DOM (Document Object Model): This is a tree-based XML-processing API specification. Because DOM creates in-memory data structures that precisely model the data represented in XML and allow random access, it is generally considered an easy and natural way of working with XML. However, building a DOM tree is not only slow, it also consumes between five and ten times a document's size in memory. Depending on the file size and structural complexity of the document, building a DOM tree can take tens of seconds, and that is before any actual processing work can be done. In addition, most 32-bit operating systems can address only two to four gigabytes of physical memory, which limits the size of the DOM tree that can be held at any given time.
  • SAX/Pull: These are both designed to tackle the memory and processing inefficiency of DOM, and both are essentially simple, low-level tokenizers. Both claim to be faster and more memory efficient, but SAX/Pull programming can require tremendous implementation effort and produce bulky, unmaintainable code, particularly when the data access pattern is complex (e.g., revisiting previously visited nodes). Another big disadvantage is their forward-only nature and lack of random access.
For power, flexibility, and ease-of-use you'd like to use DOM; for CPU and memory efficiency, you'd like to use SAX. What is needed is a way to get the best of both—because performing any complex data processing task for large XML files with either, even on well-equipped servers, is going to be slow at best.
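The trade-off described above can be seen on a tiny document using only the parsers bundled with the JDK. This is a minimal sketch, not a benchmark: the class name `DomVsSaxSketch`, the sample XML, and the helper methods are all made up for illustration. DOM lets the code read the last element and then jump back to the first; SAX delivers elements strictly in document order, so anything needed later has to be buffered by the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DomVsSaxSketch {
    // Hypothetical sample document, small enough to reason about by eye.
    static final String XML =
        "<orders><order id=\"1\"/><order id=\"2\"/><order id=\"3\"/></orders>";

    static InputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    // DOM: the whole tree is built first, then nodes can be visited in any order.
    static String lastThenFirstWithDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(input());
        NodeList orders = doc.getElementsByTagName("order");
        String last = orders.item(orders.getLength() - 1)
                .getAttributes().getNamedItem("id").getNodeValue();
        String first = orders.item(0)
                .getAttributes().getNamedItem("id").getNodeValue();
        return last + "," + first;
    }

    // SAX: events arrive once, in document order; the handler must keep
    // its own copy of anything it may need to look at again.
    static List<String> idsWithSax() throws Exception {
        List<String> ids = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(input(),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if ("order".equals(qName)) {
                        ids.add(atts.getValue("id"));
                    }
                }
            });
        return ids;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lastThenFirstWithDom()); // prints 3,1
        System.out.println(idsWithSax());           // prints [1, 2, 3]
    }
}
```

Note that the SAX version only gets "random access" by building its own list, which is exactly the kind of hand-written bookkeeping that grows into bulky code on complex documents.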


VTD-XML to the Rescue
VTD-XML is a next-generation, open source XML processing API that offers significantly better and more advanced processing capabilities than DOM, SAX, or Pull. Take a quick look at some of the technical highlights of VTD-XML:
  • Random Access: VTD-XML is designed to be random-access capable and natively supports XPath.
  • Performance: VTD-XML is typically five to ten times faster than DOM and one and a half to two times the speed of SAX with a null content handler. On an Athlon 3400+ machine, the expected throughput is 50 to 60 MB/sec, easily making it the fastest XML parser in the world.
  • Memory Usage: The memory that VTD-XML consumes is typically 1.3 to 1.5 times the size of the XML document, roughly a 3x to 5x reduction compared with DOM.
  • A Simple and Intuitive API: VTD-XML also features an easy-to-understand, cursor-based API that is significantly simpler than DOM's node-based API.
You may wonder how VTD-XML achieves both high performance and low memory usage without sacrificing random access. The basic concept is simple: VTD-XML tokenizes XML by recording offsets and lengths according to a binary encoding specification called Virtual Token Descriptor (VTD), while retaining the XML document as-is in memory (the document itself accounts for the 1.0 in the 1.3x memory figure). VTD records are 64-bit integers that encode the lengths, offsets, nesting depths, and types of XML tokens.
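To make the idea of a 64-bit token record concrete, here is a minimal sketch of packing a token's type, depth, length, and offset into a single `long` and reading them back. The field widths used here (4 bits for type, 8 for depth, 20 for length, 32 for offset) and the class name `VtdRecordSketch` are assumptions chosen for illustration; they are not taken from the VTD specification, which defines its own layout.

```java
public class VtdRecordSketch {
    // ASSUMED layout, for illustration only (not the official VTD layout):
    // bits 63-60: token type | 59-52: depth | 51-32: length | 31-0: offset

    static long pack(int type, int depth, int length, long offset) {
        return ((long) (type & 0xF) << 60)
             | ((long) (depth & 0xFF) << 52)
             | ((long) (length & 0xFFFFF) << 32)
             | (offset & 0xFFFFFFFFL);
    }

    static int type(long rec)    { return (int) ((rec >>> 60) & 0xF); }
    static int depth(long rec)   { return (int) ((rec >>> 52) & 0xFF); }
    static int length(long rec)  { return (int) ((rec >>> 32) & 0xFFFFF); }
    static long offset(long rec) { return rec & 0xFFFFFFFFL; }

    // The token text is never copied at parse time; it is sliced out of
    // the original document only when someone asks for it.
    static String text(String doc, long rec) {
        int start = (int) offset(rec);
        return doc.substring(start, start + length(rec));
    }

    public static void main(String[] args) {
        String doc = "<a><b>hi</b></a>";
        // Hypothetical type code 5 stands for "character data" in this sketch.
        long rec = pack(5, 1, 2, 6);
        System.out.println(text(doc, rec)); // prints hi
        System.out.println(depth(rec));     // prints 1
    }
}
```

Because each record is a plain integer that points back into the untouched document, no per-token string or node object has to be created during parsing.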

VTD plays a critical role in reducing overall memory usage for the following reasons:

  • Avoiding Per-object Memory Overhead: Per-object allocation typically incurs a small amount of memory overhead in many modern, object-oriented, VM-based languages. In JDK 1.4.2, there is an 8-byte overhead associated with every object allocation. For an array, that overhead goes up to 16 bytes. A VTD record is immune to Java's per-object overhead because it is an integer, not an object.
  • Using Arrays Whenever Possible: The biggest memory-saving factor is that both VTD record types are constant in length and can be stored in array-like memory chunks. For example, by allocating a large array for 4096 VTD records, you incur the per-array overhead of 16 bytes only once across 4096 records, and the per-record overhead is dramatically reduced to almost nothing.
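The arithmetic behind the two points above can be worked through directly. The sketch below uses only the figures quoted in the text (8 bytes of overhead per object, 16 bytes per array, 8 bytes per 64-bit record) and compares 4096 records held as individual objects with the same records held in one `long[]`. The class and method names are hypothetical, and the object-based figure deliberately ignores the references needed to reach each object, so it understates the real cost.

```java
public class VtdArrayOverheadSketch {
    // Figures quoted in the text for JDK 1.4.2.
    static final int OBJECT_HEADER_BYTES = 8;
    static final int ARRAY_HEADER_BYTES = 16;
    static final int RECORD_BYTES = 8; // one 64-bit VTD record

    // One heap object per record: every record pays the object header.
    // Simplification: the references pointing at the objects are not counted.
    static long bytesAsObjects(int records) {
        return (long) records * (OBJECT_HEADER_BYTES + RECORD_BYTES);
    }

    // One long[] for all records: the array header is paid once.
    static long bytesAsLongArray(int records) {
        return ARRAY_HEADER_BYTES + (long) records * RECORD_BYTES;
    }

    // Array header amortized across the records it holds.
    static double overheadPerRecord(int records) {
        return (double) ARRAY_HEADER_BYTES / records;
    }

    public static void main(String[] args) {
        int n = 4096;
        System.out.println(bytesAsObjects(n));    // prints 65536
        System.out.println(bytesAsLongArray(n));  // prints 32784
        System.out.println(overheadPerRecord(n)); // prints 0.00390625
    }
}
```

Under these assumptions the array form needs about half the memory of the object form, and the fixed overhead works out to well under a hundredth of a byte per record.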
The articles page provides detailed descriptions of the internals of VTD-XML. The latest version of VTD-XML is also available for download.
