Tip of the Day
Language: XML
Expertise: Beginner
Sep 23, 1999

Working With Large Files and Data Sets

Question:

I am working on a program that uses XML for data archival. This data archive creates XML files in the multi-megabyte range (5 MB - 500 MB and greater). The problem is that the Microsoft XML parser loads the entire file into memory (using virtual memory) until the system crashes. What is the best way to use large files and data sets with XML?

Answer:

This issue has become somewhat contentious for developers working with the Microsoft DOM, and I suspect it sits near the top of Microsoft's queue of things to fix. The problem of working with large documents is common; unfortunately, the answers aren't. However, I can suggest a couple of different routes.

Take a good hard look at your schema, and see if you can decompose your XML into some form of an object model. For example, you may have a structure that looks something like this:


<company>
  <name>MyCompany, Inc.</name>
  <abbrev>MINE</abbrev>
  <id>128FD</id>
  <divisions>
    <division>
      <name>Accounting</name>
      <employees>
        <employee>
          <lastname>Ferd</lastname>
          <firstname>Anton</firstname>
          <title>Sr.VP, Finance</title>
          <salary>125000</salary>
        </employee>
        <employee>
          <lastname>Canot</lastname>
          <firstname>Sylvia</firstname>
          <title>Accountant</title>
          <salary>65000</salary>
        </employee>
      </employees>
    </division>
    <division>
      <name>Research</name>
      <employees>
        <employee>
          <lastname>Newton</lastname>
          <firstname>Isaac</firstname>
          <title>Researcher</title>
          <salary>135000</salary>
        </employee>
        <employee>
          <lastname>Gillian</lastname>
          <firstname>Jill</firstname>
          <title>Programmer</title>
          <salary>55000</salary>
        </employee>
      </employees>
    </division>
  </divisions>
</company>

For brevity's sake, I haven't shown the other 4,923 employees in this XML structure, but you can readily imagine that they are there, and that this volume of data will cause the parser to exhaust memory. However, this particular structure can readily be decomposed into an object model with three object classes (company, division, and employee) and two collection classes (divisions and employees). An object class can in turn be thought of as a container of both properties (leaf nodes) and collections (group nodes), while a collection class contains only object classes as children (although it can, of course, be a grandparent to one or more collections).
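To make the decomposition concrete, here is a minimal sketch of that object model in Python. The class and property names mirror the XML example above; the code itself is an illustration of the decomposition, not part of the MSXML-based solution discussed here.

```python
# Object classes: Company, Division, Employee (holders of leaf properties).
# Collection classes: the 'divisions' and 'employees' lists, which contain
# only object instances as children.

class Employee:
    def __init__(self, lastname, firstname, title, salary):
        self.lastname = lastname      # leaf node
        self.firstname = firstname    # leaf node
        self.title = title            # leaf node
        self.salary = salary          # leaf node

class Division:
    def __init__(self, name):
        self.name = name              # leaf node
        self.employees = []           # collection node

class Company:
    def __init__(self, name):
        self.name = name              # leaf node
        self.divisions = []           # collection node

accounting = Division("Accounting")
accounting.employees.append(Employee("Ferd", "Anton", "Sr.VP, Finance", 125000))
accounting.employees.append(Employee("Canot", "Sylvia", "Accountant", 65000))

company = Company("MyCompany, Inc.")
company.divisions.append(accounting)

# Walk the model the same way you would walk the XML tree.
high_earners = [e.lastname
                for d in company.divisions
                for e in d.employees
                if e.salary > 100000]
```

Walking the object graph this way corresponds directly to running an XSL pattern query over the equivalent XML tree.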

The advantage to such a decomposition is that you can use a collection node as an alias into subordinate XML documents. For example, you could modify the previous structures so that the employees collection nodes point to specific files:


<company>
  <name>MyCompany, Inc.</name>
  <abbrev>MINE</abbrev>
  <id>128FD</id>
  <divisions>
    <division>
      <name>Accounting</name>
      <employees src="accounting.xml"/>
    </division>
    <division>
      <name>Research</name>
      <employees src="research.xml"/>
    </division>
  </divisions>
</company>



         
             
<employees>
  <employee>
    <lastname>Ferd</lastname>
    <firstname>Anton</firstname>
    <title>Sr.VP, Finance</title>
    <salary>125000</salary>
  </employee>
  <employee>
    <lastname>Canot</lastname>
    <firstname>Sylvia</firstname>
    <title>Accountant</title>
    <salary>65000</salary>
  </employee>
</employees>

<employees>
  <employee>
    <lastname>Newton</lastname>
    <firstname>Isaac</firstname>
    <title>Researcher</title>
    <salary>135000</salary>
  </employee>
  <employee>
    <lastname>Gillian</lastname>
    <firstname>Jill</firstname>
    <title>Programmer</title>
    <salary>55000</salary>
  </employee>
</employees>

You could similarly decompose the division nodes into separate files, which in turn would contain the decomposed employees (I don't really want to think about what I just wrote).

Once you have your resources linked, you do have to do a little more work to manipulate them. For example, if you were trying to run a company-wide survey to find out who makes more than $100,000 a year, you couldn't just use the .selectNodes() function. However, you could emulate it with some work. The ExtendedQuery function shows one approach for doing just that (this code is written in Visual Basic, but you can convert it to VBScript by removing the As XXX clauses):

Function ExtendedQuery(primaryNode As IXMLDOMElement, queryStr As String) As DOMDocument
   Dim colXML As DOMDocument
   Dim tempXML As DOMDocument
   Dim superNode As IXMLDOMElement
   Dim subNode As IXMLDOMNode
   Dim queryStrArray As Variant
   Dim queryMain As String
   Dim querySub As String
   Dim src As Variant   ' Variant, because getAttribute returns Null if the attribute is absent

   queryStrArray = Split(queryStr, "#")
   queryMain = queryStrArray(0)
   querySub = queryStrArray(1)
   Set colXML = CreateObject("Microsoft.XMLDOM")
   Set tempXML = CreateObject("Microsoft.XMLDOM")
   tempXML.async = False
   ' Give the result document a root element to which matches can be appended.
   colXML.loadXML "<results/>"
   For Each superNode In primaryNode.selectNodes(queryMain)
       src = superNode.getAttribute("src")
       If Not IsNull(src) Then
          If src <> "" Then
              tempXML.load src
              For Each subNode In tempXML.documentElement.selectNodes(querySub)
                  colXML.documentElement.appendChild subNode.cloneNode(True)
              Next
          End If
       End If
   Next
   ' VB returns a value by assigning to the function name.
   Set ExtendedQuery = colXML
End Function

This function works by taking a slightly modified XSL query string: a pound sign (#) is inserted after the node name where the link occurs. For example, to get those workers who make more than $100,000, you'd use the expression:

set newDoc=ExtendedQuery(xmlDoc,"//employees#employee[salary $gt$ 100000]")

The function splits the query string into two parts, and uses the left part to query the primary node and return all pointers that satisfy it. It then retrieves the filename of each collection, loads that collection into memory, applies the secondary query to it, and copies all matching nodes (and sub-nodes) into an interim XML document. If you just wanted those employees in Research that satisfy the criterion, you'd use the same query syntax you normally would, with the addition of the #:
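The same split-query idea can be expressed outside the COM world. The sketch below reimplements it with Python's standard ElementTree module; because ElementTree's path language lacks numeric predicates, the sub-query is passed as a callable filter rather than a string, and the document loader is injectable (a file loader in practice). The function name and this predicate convention are my own, not the article's.

```python
import copy
import xml.etree.ElementTree as ET

def extended_query(primary_root, main_path, sub_tag, predicate, loader):
    """Split-query emulation: 'main_path' selects the collection stubs in
    the master document, each stub's 'src' attribute names a subordinate
    XML document, and 'predicate' filters the 'sub_tag' elements found
    there.  Returns a brand-new tree containing deep copies of matches."""
    results = ET.Element("results")
    for stub in primary_root.findall(main_path):
        src = stub.get("src")
        if src:
            sub_root = loader(src)
            for node in sub_root.findall(sub_tag):
                if predicate(node):
                    # Deep-copy, so the result document owns its nodes outright.
                    results.append(copy.deepcopy(node))
    return ET.ElementTree(results)

# Master document with src stubs, plus the subordinate documents
# (kept in a dict here so the example is self-contained).
master = ET.fromstring(
    "<company><divisions>"
    "<division><name>Accounting</name><employees src='accounting.xml'/></division>"
    "<division><name>Research</name><employees src='research.xml'/></division>"
    "</divisions></company>")

files = {
    "accounting.xml": "<employees>"
        "<employee><lastname>Ferd</lastname><salary>125000</salary></employee>"
        "<employee><lastname>Canot</lastname><salary>65000</salary></employee>"
        "</employees>",
    "research.xml": "<employees>"
        "<employee><lastname>Newton</lastname><salary>135000</salary></employee>"
        "<employee><lastname>Gillian</lastname><salary>55000</salary></employee>"
        "</employees>",
}

doc = extended_query(
    master, ".//employees", "employee",
    lambda e: int(e.findtext("salary")) > 100000,
    loader=lambda src: ET.fromstring(files[src]))

names = [e.findtext("lastname") for e in doc.getroot()]
# names == ['Ferd', 'Newton']
```

Injecting the loader also hints at the point made later: the src need not be a literal file, as long as whatever it names yields a parseable XML stream.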

set newDoc=ExtendedQuery(xmlDoc,"//division[name =
       'Research']/employees#employee[salary $gt$ 100000]")

It's worth noting some important but subtle points here. The query returns an entirely new document that maintains its own internal pointers, not simply a collection of XML node pointers (which is essentially what a nodelist is). In other words, the document that's returned does not contain any references to the document that called it in the first place.

Because you're dealing with a new entity, this approach will also be slower than querying with selectNodes(), both because a number of XML files need to be loaded and because the nodes themselves are copied, rather than just their pointers.

Also, the previous code handles only one level of indirection. I leave it as an exercise to extend it to multiple levels, although doing so isn't terribly complex (hint: use recursion). There is a trade-off here, however. What you are doing is trading memory for time: the deeper you go, the more documents must be loaded, so query time grows rapidly, potentially exponentially, with depth. Too many levels of indirection will slow the query to a crawl.
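One way the recursion hint might be realized is sketched below, continuing the Python illustration from above. The 'paths' list gives one find-path per level of indirection; all but the last are expected to land on stub elements carrying a 'src' attribute. This is my own sketch, not the article's code.

```python
import copy
import xml.etree.ElementTree as ET

def query_recursive(root, paths, predicate, loader):
    """Follow a chain of 'src' indirections.  At each level but the last,
    matched elements are stubs whose 'src' attribute names the next
    document to load; at the last level, 'predicate' filters the hits."""
    matches = []
    head, rest = paths[0], paths[1:]
    for node in root.findall(head):
        if not rest:
            if predicate(node):
                matches.append(copy.deepcopy(node))
        else:
            src = node.get("src")
            if src:
                # Recurse one level deeper; every extra level multiplies
                # the number of documents that must be loaded.
                matches.extend(
                    query_recursive(loader(src), rest, predicate, loader))
    return matches

# Two levels of indirection: master -> divisions file -> employees file.
files = {
    "divisions.xml": "<divisions>"
        "<division><name>Accounting</name><employees src='acct.xml'/></division>"
        "</divisions>",
    "acct.xml": "<employees>"
        "<employee><lastname>Ferd</lastname><salary>125000</salary></employee>"
        "</employees>",
}
master = ET.fromstring("<company><divisions src='divisions.xml'/></company>")

hits = query_recursive(
    master, [".//divisions", ".//employees", "employee"],
    lambda e: int(e.findtext("salary")) > 100000,
    lambda src: ET.fromstring(files[src]))
```

The load count multiplies at each level, which is exactly the exponential cost warned about above.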

Additionally, you haven't completely eliminated the possibility of maxing out memory. If you perform a general query requesting all employees ("//employees#*") and you have 5,000 employees, you will still end up with an XML tree that pushes the bounds of memory.

One final piece of advice on this topic: the src does not necessarily need to point to a static XML file. It could be a parameterized out-of-process (EXE) server, or an Active Server Page (ASP) talking via ActiveX Data Objects (ADO) to a SQL Server database, so long as the output of the process is a valid XML stream. One variation of this scheme is to create an intermediate index XML file containing a mapping between shortcut names and file references; that way, you wouldn't even need to hardcode the links in the master document. This is a common strategy when working with SGML (Standard Generalized Markup Language) documents that support inline interpolation of elements.
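Such an index might look like the following sketch (again in Python for illustration; the index's element and attribute names, and the example URL, are invented, since the article does not specify a format):

```python
import xml.etree.ElementTree as ET

# A hypothetical index document mapping shortcut names to sources.
# Note the second entry points at a dynamic source (an ASP page),
# not a static file, which works as long as it emits valid XML.
index_xml = """
<index>
  <entry name="acct" src="accounting.xml"/>
  <entry name="rsch" src="http://example.com/employees.asp?div=research"/>
</index>
"""

def resolve(index_root, shortcut):
    """Return the file (or URL) a shortcut name maps to, or None."""
    for entry in index_root.findall("entry"):
        if entry.get("name") == shortcut:
            return entry.get("src")
    return None

index = ET.fromstring(index_xml)
```

With this in place, the master document could store shortcut names such as "acct" and leave the actual file locations to the index.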

DevX Pro
 