A Step in the Right Direction: VTD-XML Improves XML Processing

If you are among those enterprise developers who routinely process large XML files, from tens to hundreds of megabytes in size, you have most likely used one of two types of XML parsers:

  • DOM (Document Object Model): This is a tree-based XML processing API specification. Because DOM creates in-memory data structures that precisely model the data represented in XML and allows random access, it is generally considered an easy and natural way of working with XML. However, building a DOM tree is not only slow, it also consumes between five and ten times the document’s size in memory. Depending on the file size and structural complexity of the document, building a DOM tree can take tens of seconds, and that is before any actual processing work can be done. In addition, most 32-bit operating systems can address only two to four gigabytes of physical memory, which limits the size of the DOM tree at any given time.
  • SAX/Pull: Both are designed to tackle the memory and processing inefficiency of DOM, and both are essentially simple, low-level tokenizers. Both claim to be faster and more memory efficient, but SAX/Pull programming can result in tremendous implementation effort and bulky, unmaintainable code, particularly when the data access pattern is complex (e.g., revisiting previously visited nodes). Another big disadvantage is their forward-only nature and lack of random access.
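To make the SAX trade-off concrete, here is a minimal sketch using only the JDK's built-in SAX parser. The class name, the helper method, and the tiny purchase-order document are hypothetical and chosen for illustration. Because SAX delivers events in document order and has no notion of a parent node, the handler has to keep its own stack just to answer "is this item directly under items?".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxStateSketch {
    /** Counts item elements with the given partNum that sit directly under items. */
    static int countItems(String xml, final String partNum) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            // SAX is forward-only, so the handler must remember where it is.
            private final Deque<String> path = new ArrayDeque<String>();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("item".equals(qName) && "items".equals(path.peek())
                        && partNum.equals(atts.getValue("partNum"))) {
                    count[0]++;
                }
                path.push(qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                path.pop();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // Hypothetical sample document, not one of the benchmark files.
        String xml = "<purchaseOrder><items>"
            + "<item partNum='872-AA'/><item partNum='926-AA'/>"
            + "</items></purchaseOrder>";
        System.out.println(countItems(xml, "872-AA"));
    }
}
```

Even this simple query needs explicit bookkeeping; anything that must look back at an earlier node would require buffering that node by hand or parsing the document again.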

For power, flexibility, and ease of use, you’d like to use DOM; for CPU and memory efficiency, you’d like to use SAX. What is needed is a way to get the best of both, because performing any complex data processing task on large XML files with either one, even on well-equipped servers, is going to be slow at best.

VTD-XML to the Rescue
VTD-XML is a next-generation, open source XML processing API that offers significantly better and more advanced processing capabilities than DOM, SAX, or Pull. Take a quick look at some of the technical highlights of VTD-XML:

  • Random Access: VTD-XML is designed to be random-access capable and natively supports XPath.
  • Performance: VTD-XML is typically five to ten times faster than DOM and one and a half to two times faster than SAX with a null content handler. On an Athlon64 3400+ machine, the expected throughput is 50 to 60 MB/sec, easily making it the fastest XML parser in the world.
  • Memory Usage: The memory that VTD-XML consumes is typically 1.3 to 1.5 times the size of the XML document, roughly three to five times less than DOM.
  • A Simple and Intuitive API: VTD-XML also features an easy-to-understand, cursor-based API that is significantly simpler than DOM’s node-based API.

You may wonder how VTD-XML achieves both high performance and low memory usage without sacrificing random access. The basic concept is simple: VTD-XML tokenizes XML by recording offsets and lengths according to a binary encoding specification called Virtual Token Descriptor (VTD), while retaining the XML document as is in memory (the document itself accounts for the 1.0 in VTD-XML’s 1.3x memory footprint). VTD records are 64-bit integers that encode the lengths, offsets, nesting depths, and types of XML tokens.
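The idea of describing a token with a single 64-bit integer can be sketched as follows. The field widths and bit positions below are assumptions picked for illustration; they are not taken from the VTD specification, and the class and method names are hypothetical.

```java
class VtdRecordSketch {
    // Illustrative layout only (NOT the real VTD bit layout):
    // bits 0-29 offset, bits 30-49 length, bits 50-57 depth, bits 58-61 type.
    static long pack(int type, int depth, int length, int offset) {
        return ((long) (type & 0xF) << 58)
             | ((long) (depth & 0xFF) << 50)
             | ((long) (length & 0xFFFFF) << 30)
             | (offset & 0x3FFFFFFFL);
    }

    static int offset(long rec) { return (int) (rec & 0x3FFFFFFFL); }
    static int length(long rec) { return (int) ((rec >>> 30) & 0xFFFFF); }
    static int depth(long rec)  { return (int) ((rec >>> 50) & 0xFF); }
    static int type(long rec)   { return (int) ((rec >>> 58) & 0xF); }

    /** Recovers the token text from the untouched document and one record. */
    static String text(String doc, long rec) {
        return doc.substring(offset(rec), offset(rec) + length(rec));
    }

    public static void main(String[] args) {
        String doc = "<a><b>text</b></a>";
        // Hypothetical token type 0 = element name; "b" starts at offset 4, depth 1.
        long rec = pack(0, 1, 1, 4);
        System.out.println(text(doc, rec) + " at depth " + depth(rec));
    }
}
```

The record stores no string at all; the token's text is read back from the original document on demand, which is why the document has to stay in memory.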

VTD plays a critical role in reducing overall memory usage for the following reasons:

  • Avoiding Per-object Memory Overhead: Per-object allocation typically incurs a small amount of memory overhead in many modern, object-oriented, VM-based languages. For JDK 1.4.2, there is an 8-byte overhead associated with every object allocation. For an array, that overhead goes up to 16 bytes. A VTD record is immune to Java’s per-object overhead because it is an integer, not an object.
  • Using Arrays Whenever Possible: The biggest memory-saving factor is that both VTD record types are constant in length and can be stored in array-like memory chunks. For example, by allocating a large array for 4096 VTD records, you incur the per-array overhead of 16 bytes only once across 4096 records, and the per-record overhead is dramatically reduced to almost nothing.
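The arithmetic behind these two points can be worked through in a few lines. This is a back-of-the-envelope sketch that uses only the header sizes quoted above (8 bytes per object, 16 bytes per array); it deliberately ignores alignment padding and the references needed to hold separate objects, and the class and method names are hypothetical.

```java
class OverheadSketch {
    static final int OBJECT_HEADER = 8;  // bytes per object, figure quoted in the text
    static final int ARRAY_HEADER = 16;  // bytes per array, figure quoted in the text
    static final int RECORD_BYTES = 8;   // one 64-bit VTD record

    /** Bytes needed if every record were its own object. */
    static long asObjects(int records) {
        return (long) records * (OBJECT_HEADER + RECORD_BYTES);
    }

    /** Bytes needed if all records live in one long[] chunk. */
    static long asArray(int records) {
        return ARRAY_HEADER + (long) records * RECORD_BYTES;
    }

    public static void main(String[] args) {
        int n = 4096;
        System.out.println("as objects: " + asObjects(n) + " bytes");
        System.out.println("as array:   " + asArray(n) + " bytes");
        System.out.println("array overhead per record: "
            + (double) ARRAY_HEADER / n + " bytes");
    }
}
```

For 4096 records the object form needs 65,536 bytes under these assumptions, while the array form needs 32,784 bytes, so the fixed 16-byte header amortizes to well under a hundredth of a byte per record.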

The VTD-XML articles page provides detailed descriptions of the internals of VTD-XML. The latest version of VTD-XML is also available for download.

Figure 1. DOM vs. VTD-XML: Results of the memory usage comparison.

VTD-XML’s Memory and Performance
After parsing, VTD-XML doesn’t create a lot of objects in memory; instead, it allocates large memory blocks to store VTD tokens and XML structural information. To show the effectiveness of this approach, I compared the memory usage of VTD-XML and Xerces DOM (bundled with the JDK) for XML documents between one and 400 megabytes in size. The benchmark programs (shown below) were compiled and run using JDK 1.5 on an Athlon64 3400+ box with 1GB of memory running Windows XP:

VTD-XML Code Measuring Memory Usage

import com.ximpleware.*;
import java.io.*;

public class benchmark_mem {
    public static void main(String[] args) {
        File f = new File(args[0]);
        try {
            Runtime rt = Runtime.getRuntime();
            // The baseline is taken before the document is loaded, because
            // VTD-XML keeps the original XML bytes in memory next to its records.
            long startMem = rt.totalMemory() - rt.freeMemory();
            byte[] ba = new byte[(int) f.length()];
            DataInputStream dis = new DataInputStream(new FileInputStream(f));
            try {
                dis.readFully(ba);  // a plain read() may return before the buffer is full
            } finally {
                dis.close();
            }
            VTDGen vg = new VTDGen();
            vg.setDoc(ba);
            vg.parse(true);         // true = namespace-aware parsing
            long endMem = rt.totalMemory() - rt.freeMemory();
            System.out.println("Memory Use: " + ((float) endMem - startMem) / (1 << 20) + " MB.");
        } catch (Exception e) {
            System.out.println("exception ==> " + e);
        }
    }
}

DOM Code Measuring Memory Usage

import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.Document;

public class benchmarkDOM_mem {
    public static void main(String[] args) {
        File f = new File(args[0]);
        try {
            Runtime rt = Runtime.getRuntime();
            byte[] ba = new byte[(int) f.length()];
            DataInputStream dis = new DataInputStream(new FileInputStream(f));
            try {
                dis.readFully(ba);
            } finally {
                dis.close();
            }
            // The baseline is taken after the bytes are loaded, so only the
            // DOM tree itself is counted.
            long startMem = rt.totalMemory() - rt.freeMemory();
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setExpandEntityReferences(false);
            DocumentBuilder parser = factory.newDocumentBuilder();
            // Keep a reference so the tree is still reachable when memory is sampled.
            Document d = parser.parse(new ByteArrayInputStream(ba));
            long endMem = rt.totalMemory() - rt.freeMemory();
            System.out.println("Memory Use: " + ((float) endMem - startMem) / (1 << 20) + " MB.");
        } catch (Exception e) {
            System.out.println("exception ==> " + e);
        }
    }
}

Table 1. VTD-XML vs. Xerces DOM: Comparing memory usage.

Table 2 (shown below) lists the results of the memory consumption measurements. The test code was run on HotSpot’s server JVM (standard with JDK 1.5) with a maximum heap size of 800 megabytes. With DOM, parsing po_huge.xml exhausts the JVM’s memory and results in an OutOfMemoryError; VTD-XML, on the other hand, passed the test without any problems.

File (size)               Description                  VTD-XML memory   VTD-XML factor   DOM memory      DOM factor
blog.xml (1.3 MB)         RSS feed from InfoWorld      1.64 MB          1.29x            5.35 MB         4.2x
bioinfo.xml (4.4 MB)      Bioinformatics data file     6.05 MB          1.42x            27.08 MB        6.3x
address.xml (15.24 MB)    Address book data            26.39 MB         1.73x            109.83 MB       7.2x
cd.xml (25.57 MB)         CD catalog file              48.67 MB         1.90x            211.48 MB       8.3x
po.xml (70.3 MB)          Purchase order               122.57 MB        1.68x            514.03 MB       7.05x
po_huge.xml (405.60 MB)   Super-sized purchase order   686.0 MB         1.69x            Out of memory   Out of memory

Table 2. Memory Usage Comparison: Between VTD-XML and DOM.

Figure 2. Parsing Throughput Comparison: The results for VTD-XML, DOM, and SAX.

Parsing Performance
To compare the parsing performance of VTD-XML with Xerces DOM, I ran the benchmark code (shown below) on the same set of test files used for the memory test above, using the HotSpot server JVM, which does an excellent job of JIT (Just In Time) compilation from byte code to native code. To measure the peak performance of each parser, the sample code first went through a warm-up stage to ensure that the parsing routines were running as native code. The test files were first read into byte arrays so that the results exclude any timing variation due to disk I/O.

VTD-XML Sample Code Measuring Parsing Performance

   // vg is a VTDGen, ba holds the document bytes, fl is the file length in
   // bytes, and total is the number of parses per timed run.
   // Warm-up: parse repeatedly for 30 seconds so the JIT compiles the hot paths.
   l = System.currentTimeMillis();
   while (System.currentTimeMillis() - l < 30000) {
     vg.setDoc(ba);
     vg.parse(true);
   }
   for (int j = 0; j < 20; j++) {
     l = System.currentTimeMillis();
     for (int i = 0; i < total; i++) {
       vg.setDoc(ba);
       vg.parse(true);
     }
     long l2 = System.currentTimeMillis();
     System.out.println("l2 - l " + (l2 - l) + " ms");
     System.out.println(" average parsing time ==> " + ((double) (l2 - l) / total));
     System.out.println(" performance ==> " +
         (((double) fl * 1000 * total) / ((l2 - l) * (1 << 20))));
   }

DOM Sample Code Measuring Parsing Performance

   // parser is a DocumentBuilder; ba, fl, and total are as above.
   // Warm-up: same 30-second loop as for VTD-XML.
   l = System.currentTimeMillis();
   while (System.currentTimeMillis() - l < 30000) {
     ByteArrayInputStream bais = new ByteArrayInputStream(ba);
     parser.parse(bais);
   }
   for (int j = 0; j < 10; j++) {
     l = System.currentTimeMillis();
     for (int i = 0; i < total; i++) {
       ByteArrayInputStream bais = new ByteArrayInputStream(ba);
       parser.parse(bais);
     }
     long l2 = System.currentTimeMillis();
     System.out.println(" average parsing time ==> " + ((float) (l2 - l) / total));
     System.out.println(" performance ==> " +
         (((double) fl * 1000 * total) / ((l2 - l) * (1 << 20))));
   }

Table 3. VTD-XML vs. Xerces DOM: Comparing parsing performance.
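The throughput figure printed by the benchmark loops above is just bytes parsed divided by elapsed time, rescaled to megabytes per second. Pulling that expression into a small helper makes the units easier to check; the class and method names here are hypothetical, while the formula is the one used in the listings.

```java
class ThroughputSketch {
    /**
     * Same expression as in the benchmark loops:
     * (fileBytes * 1000 * runs) / (elapsedMillis * 2^20), i.e. MB per second.
     */
    static double megabytesPerSecond(long fileBytes, int runs, long elapsedMillis) {
        return ((double) fileBytes * 1000 * runs) / (elapsedMillis * (double) (1 << 20));
    }

    /** Average wall-clock time of one parse, in milliseconds. */
    static double averageMillis(int runs, long elapsedMillis) {
        return (double) elapsedMillis / runs;
    }

    public static void main(String[] args) {
        // Hypothetical run: a 10 MB document parsed 5 times in one second.
        long fileBytes = 10L << 20;
        System.out.println(megabytesPerSecond(fileBytes, 5, 1000) + " MB/sec");
        System.out.println(averageMillis(5, 1000) + " ms per parse");
    }
}
```

Note that a megabyte here is 2^20 bytes, so throughput computed from rounded file sizes in decimal megabytes will differ slightly from the reported figures.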

The parsing performance results are listed in Table 4. As another reference point, I also measured the performance of Xerces SAX with a null content handler:

File (size)               VTD-XML time   VTD-XML MB/sec   DOM time        DOM MB/sec      SAX time    SAX MB/sec
blog.xml (1.3 MB)         17.98 ms       70.8             89.1 ms         14.3            22.65 ms    56.18
bioinfo.xml (4.4 MB)      68.8 ms        62.04            406.2 ms        10.5            103.0 ms    41.44
address.xml (15.24 MB)    293.8 ms       51.87            1703.2 ms       8.95            556.2 ms    27.40
cd.xml (25.57 MB)         493.8 ms       51.8             2825.0 ms       9.13            759.4 ms    33.68
po.xml (70.3 MB)          1337 ms        54.50            7015 ms         10.35           2743.8 ms   34.46
po_huge.xml (405.60 MB)   6.87 s         59.76            Out of memory   Out of memory   10.756 s    37.687

Table 4. Parsing performance summary.

As Table 4 demonstrates, VTD-XML is not only much faster than DOM, it also significantly outperforms SAX with Null content handler. This is, again, the result of VTD-XML’s superior memory allocation strategy.

Navigation Performance
Next, I benchmarked the navigation performance of VTD-XML and Xerces DOM. Because navigation behavior is specific to the file structure and tag names of the test documents, the test code (shown below) first parses po.xml and po_huge.xml, then traverses the document hierarchy corresponding to the XPath expression /purchaseOrder/items/item[@partNum='872-AA'], which navigates across the entire document from beginning to end. Comparing the code written for DOM and VTD-XML, you can see that the VTD-XML API is much simpler than DOM’s.

VTD-XML Code Measuring Navigation Performance

for (int j = 0; j < 20; j++) {
   l = System.currentTimeMillis();
   for (int i = 0; i < total; i++) {
     count = 0;
     vn.toElement(VTDNav.ROOT);                 // reset the cursor to the root
     if (vn.matchElement("purchaseOrder")) {
       if (vn.toElement(VTDNav.FIRST_CHILD, "items")) {
         do {
           if (vn.toElement(VTDNav.FIRST_CHILD)) {
             do {
               // getAttrVal returns -1 when the attribute is absent
               temp = vn.getAttrVal("partNum");
               if (temp != -1 && vn.matchTokenString(temp, "872-AA")) {
                 count++;
               }
             } while (vn.toElement(VTDNav.NEXT_SIBLING));
             vn.toElement(VTDNav.PARENT);       // move the cursor back up to items
           }
         } while (vn.toElement(VTDNav.NEXT_SIBLING, "items"));
       }
     }
   }
   long l2 = System.currentTimeMillis();
   System.out.println("l2 - l " + (l2 - l) + " ms");
   System.out.println(" average nav time ==> " + ((double) (l2 - l) / total));
}

DOM Code Measuring Navigation Performance

for (int j = 0; j < 20; j++) {
   l = System.currentTimeMillis();
   for (int i = 0; i < total; i++) {
     Element current = d.getDocumentElement();
     count = 0;
     if (current.getNodeName().equals("purchaseOrder")) {
       // Walk the children of the root, skipping text and comment nodes.
       for (Node n = current.getFirstChild(); n != null; n = n.getNextSibling()) {
         if (n.getNodeType() == Node.ELEMENT_NODE
             && n.getNodeName().equals("items")) {
           // A for loop avoids a NullPointerException when items has no children.
           for (Node n1 = n.getFirstChild(); n1 != null; n1 = n1.getNextSibling()) {
             if (n1.getNodeType() == Node.ELEMENT_NODE
                 && n1.getNodeName().equals("item")) {
               Element e = (Element) n1;
               if (e.getAttribute("partNum").equals("872-AA")) {
                 count++;
               }
             }
           }
         }
       }
     }
   }
   long l2 = System.currentTimeMillis();
   System.out.println("l2 - l " + (l2 - l) + " ms");
   System.out.println(" average nav time ==> " + ((double) (l2 - l) / total));
}

Table 5. VTD-XML vs. Xerces DOM: Comparing navigation performances.
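For readers who want to see what the hand-written loops above are selecting, the same query can be expressed as an XPath and evaluated with the JDK's built-in javax.xml.xpath API. This sketch is only a reference for the query's meaning: it runs on a DOM tree rather than on VTD-XML, and the class name and the tiny sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XPathSketch {
    /** Returns how many nodes the XPath expression selects in the given XML. */
    static int countMatches(String xml, String expr) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, d, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical miniature of a purchase order, not the benchmark file.
        String xml = "<purchaseOrder>"
            + "<items><item partNum='872-AA'/><item partNum='926-AA'/></items>"
            + "<items><item partNum='872-AA'/></items>"
            + "</purchaseOrder>";
        System.out.println(
            countMatches(xml, "/purchaseOrder/items/item[@partNum='872-AA']"));
    }
}
```

The count it reports for a document should agree with the count variable accumulated by both navigation loops, which makes it a handy cross-check when adapting the benchmark to other files.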

The timing results, shown in Table 6, demonstrate that VTD-XML’s random access performance is quite similar to DOM’s. Even for XML documents exceeding 400 MB in size (for which DOM simply runs out of gas), VTD-XML doesn’t miss a beat in its ability to jump between different nodes in the document hierarchy, and the navigation latency scales linearly with the size of the document.

File (size)               VTD-XML navigation time   DOM navigation time
po.xml (70.3 MB)          241.4 ms                  303.9 ms
po_huge.xml (405.60 MB)   1303.2 ms                 N/A (out of memory)

Table 6. Navigation performance summary.

What Does All This Mean For You?
How will using VTD-XML change the way you do things? VTD-XML will first affect your choice of processing model. VTD-XML exposes the key weakness of SAX: its lack of random access. So, unless the XML document is too big to load into memory, you no longer have any incentive to go with SAX. Written with VTD-XML, your program code should be shorter, cleaner, and less bug-prone. And because VTD-XML’s benefits apply to XML documents of different sizes and complexity, you no longer have to switch between the radically different parsing styles of DOM and SAX. This makes it easier to accomplish anything that XML, by design, allows you to do. A representative example XML file is available for download if you want to test this yourself.

Switching to VTD-XML is also like getting an instant hardware upgrade. If you have been thinking about adding more boxes to keep up with the growing amount of XML, writing your applications with VTD-XML may be all you need. Whether it is batch processing or real-time transactions, VTD-XML should help you squeeze out every drop of efficiency and make your applications run more smoothly and responsively.
