Working With Large Files and Data Sets

Question:

I am working on a program that uses XML for data archival. This data archive creates XML files in the multi-megabyte range (5 MB – 500 MB and greater). The problem is that the Microsoft XML parser loads the entire file into memory (using virtual memory) until the system crashes. What is the best way to use large files and data sets with XML?

Answer:

This issue has become somewhat contentious for developers working with the Microsoft DOM, and I suspect it’s near the top of the “things to fix” list in their bug queue. The problem of working with large documents is common; unfortunately, the answers aren’t. However, I can suggest a couple of different routes.

Take a good hard look at your schema, and see if you can decompose your XML into some form of an object model. For example, you may have a structure that looks something like this:

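  <!-- Leaf element names other than name and salary are assumed; the tags were lost from this listing. -->
  <company>
    <name>MyCompany, Inc.</name>
    <symbol>MINE</symbol>
    <id>128FD</id>
    <divisions>
      <division>
        <name>Accounting</name>
        <employees>
          <employee>
            <lastName>Ferd</lastName>
            <firstName>Anton</firstName>
            <title>Sr.VP, Finance</title>
            <salary>125000</salary>
          </employee>
          <employee>
            <lastName>Canot</lastName>
            <firstName>Sylvia</firstName>
            <title>Accountant</title>
            <salary>65000</salary>
          </employee>
        </employees>
      </division>
      <division>
        <name>Research</name>
        <employees>
          <employee>
            <lastName>Newton</lastName>
            <firstName>Issac</firstName>
            <title>Researcher</title>
            <salary>135000</salary>
          </employee>
          <employee>
            <lastName>Gillian</lastName>
            <firstName>Jill</firstName>
            <title>Programmer</title>
            <salary>55000</salary>
          </employee>
        </employees>
      </division>
    </divisions>
  </company>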

For brevity’s sake, I haven’t shown the other 4,923 employees in this XML structure, but you can readily imagine that they are there, and that this much data will cause the parser to crash. However, this particular structure can readily be decomposed into an object model with three object classes (company, division, and employee) and two collection classes (divisions and employees). An object class can in turn be thought of as a container of both properties (leaf nodes) and collections (group nodes), while a collection class contains only object classes as children (although it can obviously be a grandparent to one or more collections).

The advantage of such a decomposition is that you can use a collection node as an alias into subordinate XML documents. For example, you could modify the previous structure so that the employees collection nodes point to specific files:

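  <!-- Master document. The src file names are placeholders; the original values were lost from this listing. -->
  <company>
    <name>MyCompany, Inc.</name>
    <symbol>MINE</symbol>
    <id>128FD</id>
    <divisions>
      <division>
        <name>Accounting</name>
        <employees src="accounting_employees.xml"/>
      </division>
      <division>
        <name>Research</name>
        <employees src="research_employees.xml"/>
      </division>
    </divisions>
  </company>

  <!-- accounting_employees.xml -->
  <employees>
    <employee>
      <lastName>Ferd</lastName>
      <firstName>Anton</firstName>
      <title>Sr.VP, Finance</title>
      <salary>125000</salary>
    </employee>
    <employee>
      <lastName>Canot</lastName>
      <firstName>Sylvia</firstName>
      <title>Accountant</title>
      <salary>65000</salary>
    </employee>
  </employees>

  <!-- research_employees.xml -->
  <employees>
    <employee>
      <lastName>Newton</lastName>
      <firstName>Issac</firstName>
      <title>Researcher</title>
      <salary>135000</salary>
    </employee>
    <employee>
      <lastName>Gillian</lastName>
      <firstName>Jill</firstName>
      <title>Programmer</title>
      <salary>55000</salary>
    </employee>
  </employees>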

You could similarly decompose the divisions collection into separate files, which in turn would point to the decomposed employees (I don’t really want to think about what I just wrote).
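The decomposition itself can be automated. The following is a minimal sketch in Python (using the standard xml.etree.ElementTree module rather than the Visual Basic and MSXML used in this answer) that moves each employees collection into its own file and leaves an empty node with a src attribute behind. The function name split_employees, the employees_N.xml naming scheme, and the company.xml output name are all assumptions for illustration, and the sketch assumes the archive can be parsed once on a machine with enough memory.

```python
import os
import xml.etree.ElementTree as ET


def split_employees(archive_path, out_dir):
    """Move every <employees> collection into its own file, leaving a src link behind."""
    tree = ET.parse(archive_path)  # assumes the archive can be parsed once
    # Take a snapshot of the divisions so the tree can be edited while looping.
    divisions = list(tree.getroot().iter("division"))
    for index, division in enumerate(divisions):
        employees = division.find("employees")
        if employees is None or len(employees) == 0:
            continue  # nothing to move
        file_name = f"employees_{index}.xml"  # hypothetical naming scheme
        # Swap the populated collection for an empty alias node.
        stub = ET.Element("employees", {"src": file_name})
        stub.tail, employees.tail = employees.tail, None
        position = list(division).index(employees)
        division.remove(employees)
        division.insert(position, stub)
        # Write the detached collection as a standalone document.
        ET.ElementTree(employees).write(
            os.path.join(out_dir, file_name), encoding="utf-8"
        )
    # Write the slimmed-down master document (hypothetical file name).
    tree.write(os.path.join(out_dir, "company.xml"), encoding="utf-8")
```

Calling split_employees("archive.xml", "out") would leave a small company.xml in the out directory plus one file per populated employees collection. In practice it is usually better to write the separate files at the point where the archive is generated, so the monolithic document never has to exist.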

Once you have your resources linked, you do have to do a little more work to manipulate them. For example, if you were trying to do a company-wide survey to find out who makes more than $100,000 a year, you couldn’t just use the .selectNodes() function. However, you could emulate it with some work. The ExtendedQuery function shows one approach for doing just that (this code is written in Visual Basic, but you can convert it to VBScript by removing the As XXX clauses):

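Function ExtendedQuery(primaryNode As IXMLDOMNode, queryStr As String) As DOMDocument
    ' primaryNode is typed as IXMLDOMNode so a whole document can be passed in.
    Dim colXML As DOMDocument
    Dim tempXML As DOMDocument
    Dim superNode As Object
    Dim subNode As IXMLDOMNode
    Dim queryStrArray As Variant
    Dim queryMain As String
    Dim querySub As String
    Dim src As Variant   ' getAttribute returns Null when the attribute is missing

    ' Split the query at the pound sign: the left part runs against the
    ' master document, the right part against each linked file.
    queryStrArray = Split(queryStr, "#")
    queryMain = queryStrArray(0)
    querySub = queryStrArray(1)

    Set colXML = CreateObject("Microsoft.XMLDOM")
    Set tempXML = CreateObject("Microsoft.XMLDOM")
    tempXML.async = False
    ' The root element name was lost from the original listing;
    ' "results" is a placeholder.
    colXML.loadXML "<results/>"

    For Each superNode In primaryNode.selectNodes(queryMain)
        src = superNode.getAttribute("src")
        If Not IsNull(src) Then
            If src <> "" Then
                ' Load the linked collection and copy every matching node.
                tempXML.Load src
                For Each subNode In tempXML.documentElement.selectNodes(querySub)
                    colXML.documentElement.appendChild subNode.cloneNode(True)
                Next
            End If
        End If
    Next

    ' Visual Basic returns a value by assigning it to the function name.
    Set ExtendedQuery = colXML
End Function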

This function works by taking a slightly modified XSL query string: a pound sign (#) is inserted after the node name where the link occurs. For example, to get those workers who make more than $100,000, you’d use the expression:

set newDoc=ExtendedQuery(xmlDoc,"//employees#employee[salary $gt$ 100000]")

The function splits the query string into two parts, and uses the left part to query the primary node and return all pointers that satisfy the query. Then it retrieves the filename of each collection, loads that collection into memory, applies the secondary query to it, and copies all valid nodes (and sub-nodes) into an interim XML document. If you just wanted to get those employees in research that satisfy the criterion, you’d use the same query syntax you normally would, with the # exception:
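For readers without MSXML at hand, the same split-and-merge logic can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not a port of the original: the names extended_query, keep, and base_dir, and the results root element, are hypothetical. ElementTree supports only a subset of XPath, so searches on an element start with .// instead of //, and the numeric salary test is passed as a Python callable because the path syntax has no greater-than comparison.

```python
import copy
import os
import xml.etree.ElementTree as ET


def extended_query(primary, query, keep=lambda node: True, base_dir="."):
    """Run the part before '#' on the master node and the part after it on each linked file."""
    query_main, query_sub = query.split("#", 1)
    results = ET.Element("results")  # hypothetical root for the interim document
    for link in primary.findall(query_main):
        src = link.get("src")
        if not src:
            continue  # this collection is not linked to a file
        # Load the subordinate document that the collection node points to.
        linked_root = ET.parse(os.path.join(base_dir, src)).getroot()
        for node in linked_root.findall(query_sub):
            if keep(node):
                # Copy the node and its children, as cloneNode(true) does.
                results.append(copy.deepcopy(node))
    return results
```

With a parsed master document in master_root, a call such as extended_query(master_root, ".//employees#employee", keep=lambda e: int(e.findtext("salary", "0")) > 100000) would collect the highly paid employees from every linked file.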

set newDoc=ExtendedQuery(xmlDoc,"//division[name = 'Research']/employees#employee[salary $gt$ 100000]")

It’s worth noting some important but subtle points here. The query returns an entirely new document that maintains its own internal pointers, not simply a collection of XML node pointers (which is essentially what a nodelist is). In other words, the document that’s returned does not contain any references to the document that called it in the first place.

Because the result is built as a new document, this approach will also be slower than a plain selectNodes call, both because a number of XML files have to be loaded and because the nodes themselves are copied, rather than just the pointers to them.

Also, the previous code handles only one level of indirection. I leave it as an exercise to extend this to multiple levels, although it’s not terribly complex (hint: use recursion). There is a trade-off here, however: you are trading memory for time. The deeper you go, the longer the query will take, on an exponential basis. Too many levels of indirection will slow the query to a crawl.
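One way the recursion hint might play out is sketched below in Python (standard xml.etree.ElementTree, not the Visual Basic of the original function). The function name extended_query_deep and the results root element are assumptions. Each # in the query marks one hop into a linked file: the part before the first # is run on the current node, and the remainder is handed to the recursive call on each linked document.

```python
import copy
import os
import xml.etree.ElementTree as ET


def extended_query_deep(node, query, base_dir=".", results=None):
    """Follow one src link per '#' in the query, collecting copies of the final matches."""
    if results is None:
        results = ET.Element("results")  # hypothetical root element
    # head is the path for this level; rest holds the remaining levels, if any.
    head, _, rest = query.partition("#")
    for match in node.findall(head):
        if not rest:
            # Last segment: this is a real result, so copy it.
            results.append(copy.deepcopy(match))
        elif match.get("src"):
            # More segments remain: load the linked file and recurse into it.
            linked_path = os.path.join(base_dir, match.get("src"))
            linked_root = ET.parse(linked_path).getroot()
            extended_query_deep(linked_root, rest, base_dir, results)
    return results
```

A two-level query would then look like ".//divisions#division[name='Research']/employees#employee", assuming the divisions node links to a file whose division elements in turn link to employee files. Every extra level multiplies the number of files that have to be opened, which is the time cost described above.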

Additionally, you haven’t completely eliminated the possibility of maxing out memory. If you perform a general query requesting all employees (“//employees#*”) and you have 5000 employees, you will still have an XML tree that will push the bounds of memory.
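One way to soften this, which goes beyond what the answer itself describes, is to hand matches to the caller one at a time instead of collecting them into an interim document. The Python sketch below (standard xml.etree.ElementTree; the name iter_linked is an assumption) is a generator, so only one subordinate file needs to be held in memory at any moment, provided the caller does not keep references to the nodes it has already processed.

```python
import os
import xml.etree.ElementTree as ET


def iter_linked(primary, query, base_dir="."):
    """Yield matches from each linked file in turn instead of building one big result tree."""
    query_main, query_sub = query.split("#", 1)
    for link in primary.findall(query_main):
        src = link.get("src")
        if not src:
            continue  # collection without a linked file
        # Only this one subordinate document is loaded at this point.
        linked_root = ET.parse(os.path.join(base_dir, src)).getroot()
        yield from linked_root.findall(query_sub)
```

A loop such as for employee in iter_linked(master_root, ".//employees#*") can then total salaries or write a report without ever holding all 5000 employees at once.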

One final piece of advice on this topic: The src does not necessarily need to be an XML file explicitly. It could be a parameterized out-of-proc (exe) server, or an Active Server Page (ASP) talking via ActiveX Data Objects (ADO) to a SQL Server database, so long as the output of the process is a valid XML stream. One variation of this schema is to create an intermediate index XML file that would contain a link between a shortcut name and a file reference. That way, you wouldn’t even need to explicitly hardcode the links in the master document. This is a common strategy when working with SGML (Standard Generalized Markup Language) documents that support inline interpolation of elements.
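The index-file variation might look something like the Python sketch below (standard xml.etree.ElementTree). The index layout, with link elements carrying name and file attributes, and the function names load_index and resolve_src are all assumptions made for illustration; the answer does not specify a format.

```python
import xml.etree.ElementTree as ET


def load_index(index_path):
    """Read a hypothetical index file into a shortcut-name -> file-reference mapping."""
    root = ET.parse(index_path).getroot()
    # Assumed layout: <index><link name="research" file="research.xml"/>...</index>
    return {entry.get("name"): entry.get("file") for entry in root.findall("link")}


def resolve_src(node, index):
    """Return the file for a collection node; src may be a shortcut name or a direct reference."""
    src = node.get("src")
    # Fall back to the raw value when src is not a known shortcut.
    return index.get(src, src)
```

The master document can then carry src="research" instead of a file name, and moving a collection to a new location only requires editing the index.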
