I am working on a program that uses XML for data archival. This data archive creates XML files in the multi-megabyte range (5 MB - 500 MB and greater). The problem is that the Microsoft XML parser loads the entire file into memory (using virtual memory) until the system crashes. What is the best way to use large files and data sets with XML?
This issue has become somewhat contentious for developers working with the Microsoft DOM, and I suspect it's near the top of their "things to fix" queue. The problem of working with large documents is common; unfortunately, the answers aren't. However, I can suggest a couple of different routes.
Take a good hard look at your schema, and see if you can decompose your XML into some form of an object model. For example, you may have a structure that looks something like this:
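The column's original listing isn't reproduced here, but based on the company/division/employee model described next, the structure might look something like this (element names are illustrative):

```xml
<company>
  <name>Acme Corp</name>
  <divisions>
    <division>
      <name>Research</name>
      <employees>
        <employee>
          <name>Ann Smith</name>
          <salary>120000</salary>
        </employee>
        <!-- ...the other 4923 employees... -->
      </employees>
    </division>
    <!-- ...more divisions... -->
  </divisions>
</company>
```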
For brevity's sake, I haven't shown the other 4923 employees in this XML structure, but you can readily imagine that they are there, and this large amount of data will cause the parser to crash. However, this particular structure can readily be decomposed into an object model with three object classes (company, division, and employee) and two collection classes (divisions and employees). An object class can in turn be thought of as a container of both properties (leaf nodes) and collections (group nodes), while a collection class contains only object classes as children (although obviously it can be a grandparent to one or more collections).
The advantage to such a decomposition is that you can use a collection node as an alias into subordinate XML documents. For example, you could modify the previous structures so that the employees collection nodes point to specific files:
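Again, the original listing isn't shown here; the modified structure might look something like the following, with the filename being a hypothetical example:

```xml
<division>
  <name>Research</name>
  <employees src="researchEmployees.xml"/>
</division>
```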
You could similarly decompose the divisions into separate files, which in turn would contain the decomposed employees (I don't really want to think about what I just wrote).
Once you have your resources linked, you do have to do a little more work in manipulating them. For example, if you were trying to do a company-wide survey to find out who makes more than $100,000 a year, you couldn't just use the .selectNodes() function. However, you could emulate it with some work. The ExtendedQuery function shows one approach for doing just that (this code is written in Visual Basic, but you can convert it to VBS by removing the As XXX keywords):
Function ExtendedQuery(primaryNode As IXMLDOMElement, queryStr As String) As DOMDocument
    Dim colXML As New DOMDocument, tempXML As New DOMDocument
    Dim queryStrArray As Variant, queryMain As String, querySub As String, src As String
    Dim superNode As IXMLDOMElement, subNode As IXMLDOMElement
    ' Split the query at the pound sign into primary and secondary parts
    queryStrArray = Split(queryStr, "#")
    queryMain = queryStrArray(0)
    querySub = queryStrArray(1)
    colXML.loadXML "<results/>"
    For Each superNode In primaryNode.selectNodes(queryMain)
        src = superNode.getAttribute("src")   ' file holding the subordinate collection
        If Not IsNull(src) Then
            If src <> "" Then
                tempXML.Load src              ' pull the linked document into memory
                For Each subNode In tempXML.documentElement.selectNodes(querySub)
                    colXML.documentElement.appendChild subNode.cloneNode(True)
                Next
            End If
        End If
    Next
    Set ExtendedQuery = colXML
End Function
This function works by taking a slightly modified XSL query string: a pound sign (#) is inserted after the node name where the link occurs. For example, to get those workers who make more than $100,000, you'd use the expression:
set newDoc=ExtendedQuery(xmlDoc,"//employees#employee[salary $gt$ 100000]")
The function splits the query string into two parts, and uses the left part to query the primary node and return all pointers that satisfy the query. Then it retrieves the filename of each collection, loads that collection into memory, applies the secondary query to it, and copies all valid nodes (and sub-nodes) into an interim XML document. If you just wanted to get those employees in research that satisfy the criterion, you'd use the same query syntax you normally would, with the # exception:
set newDoc=ExtendedQuery(xmlDoc,"//division[name = 'Research']/employees#employee[salary $gt$ 100000]")
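For readers who want to experiment with this indirection technique outside of VB and MSXML, here is a rough analogue sketched in Python using the standard library's ElementTree parser. The file layout, element names, and the use of a Python predicate function in place of the second half of the query string are all illustrative assumptions, not part of the original column.

```python
import copy
import os
import tempfile
import xml.etree.ElementTree as ET

def extended_query(primary_root, collection_path, predicate):
    """Select collection nodes, follow each node's 'src' attribute to a
    subordinate XML file, and copy every child satisfying the predicate
    into a brand-new result tree (deep copies, not shared pointers)."""
    results = ET.Element("results")
    for coll in primary_root.findall(collection_path):
        src = coll.get("src")
        if src:
            sub_root = ET.parse(src).getroot()
            for node in sub_root:
                if predicate(node):
                    results.append(copy.deepcopy(node))
    return results

# Demo: a master document whose employees node points at a subordinate file.
workdir = tempfile.mkdtemp()
emp_file = os.path.join(workdir, "research_employees.xml")
with open(emp_file, "w") as f:
    f.write("<employees>"
            "<employee><name>Ann</name><salary>120000</salary></employee>"
            "<employee><name>Bob</name><salary>80000</salary></employee>"
            "</employees>")

master = ET.fromstring(
    '<company><division><name>Research</name>'
    '<employees src="{}"/></division></company>'.format(emp_file))

high_earners = extended_query(
    master, ".//employees", lambda e: int(e.findtext("salary")) > 100000)
print([e.findtext("name") for e in high_earners])  # ['Ann']
```

As in the VB version, the result is a freestanding document: the qualifying nodes are copied into it, so it holds no references back to the master or subordinate trees.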
It's worth noting some important but subtle points here. The query returns an entirely new document that maintains its own internal pointers, not simply a collection of XML node pointers (which is essentially what a nodelist is). In other words, the document that's returned does not contain any references to the document that called it in the first place.
Because you're dealing with a new entity, this approach will also be slower than querying with selectNodes, both because a number of XML files need to be loaded and because the nodes themselves are copied, rather than just their pointers.
Also, the previous code applies only to one level of indirection; I leave it as an exercise to extend this to multiple levels, although it's not terribly complex (hint: use recursion). There is a trade-off here, however. What you are doing is trading memory management for time: the deeper you go, the longer the query will take, on an exponential basis. Too many levels of indirection will slow the query to a crawl.
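One way to follow that recursion hint, again sketched in Python with ElementTree (the element and attribute names, the depth guard, and the file layout are all assumptions for illustration): each node carrying a src link is swapped for the root of the document it names, and the loaded document is resolved the same way.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def resolve_links(node, depth=0, max_depth=5):
    """Replace each descendant that carries a 'src' attribute with the root
    of the file it names, recursing into the loaded document as well. The
    depth guard keeps runaway indirection chains from loading forever."""
    if depth > max_depth:
        return node
    for i, child in enumerate(list(node)):
        src = child.get("src")
        if src:
            loaded = ET.parse(src).getroot()
            node.remove(child)
            node.insert(i, resolve_links(loaded, depth + 1, max_depth))
        else:
            resolve_links(child, depth, max_depth)
    return node

# Demo: two levels of indirection (company -> division file -> employees file).
workdir = tempfile.mkdtemp()
emp_file = os.path.join(workdir, "employees.xml")
div_file = os.path.join(workdir, "division.xml")
with open(emp_file, "w") as f:
    f.write("<employees><employee><name>Ann</name></employee></employees>")
with open(div_file, "w") as f:
    f.write('<division><name>Research</name><employees src="{}"/></division>'
            .format(emp_file))

master = ET.fromstring('<company><division src="{}"/></company>'.format(div_file))
resolve_links(master)
print(master.find(".//employee/name").text)  # Ann
```

Note that fully resolving every link rebuilds the whole tree in memory, which is exactly the situation you were trying to avoid; in practice you would resolve only the branch a query actually needs.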
Additionally, you haven't completely eliminated the possibility of maxing out memory. If you perform a general query requesting all employees ("//employees#*") and you have 5000 employees, you will still have an XML tree that will push the bounds of memory.
One final piece of advice on this topic: the src does not necessarily need to point to an XML file explicitly. It could be a parameterized out-of-proc (EXE) server, or an Active Server Page (ASP) talking via ActiveX Data Objects (ADO) to a SQL Server database, so long as the output of the process is a valid XML stream. One variation on this scheme is to create an intermediate index XML file that contains a link between a shortcut name and a file reference; that way, you wouldn't even need to explicitly hardcode the links in the master document. This is a common strategy when working with SGML (Standard Generalized Markup Language) documents that support inline interpolation of elements.
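Such an index file might look something like this (the shortcut names, filenames, and ASP URL are hypothetical):

```xml
<!-- index.xml: maps shortcut names to the resources that produce XML -->
<index>
  <link id="researchEmployees" src="researchEmployees.xml"/>
  <link id="salesEmployees" src="getEmployees.asp?division=sales"/>
</index>
```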