

Maintain Large Databases with XML Servers

An XML server can be very useful for managing a very large XML database. Such a server produces XML in response to an initial query of some sort, borrowing the RDBMS concept of normalization.


One of the more common questions asked here at "Ask the XML Pro" has to do with the dilemma of maintaining large databases of XML data. Specifically, suppose that you had an XML document that was both broad (had a large number of objects, as might be expected from a database query on a large dataset), and deep (each object contained a fair number of properties, subordinate objects, and the like). This database-like document presents any number of challenges when it comes to parsers. Most XML parsers work on one of two strategies—XML documents are either loaded completely into memory (as is the case with Internet Explorer) or are processed incrementally (as is the case with most Java parsers). Unfortunately, neither strategy handles the case where the number of upper-tier XML nodes reaches more than a few thousand elements and the data each node contains is relatively rich.

For example, consider an XML structure that consists of a highly truncated database with 10,000 game character records (see Listing 1). A good rule of thumb when dealing with an XML document is to figure that a single node typically takes up approximately 1/2 KB of memory (a very rough estimate that depends on the actual content, but it helps illustrate the point). At roughly 20 nodes per record, each record occupies about 10 KB, so the sample document in memory probably occupies 10,000 records x 10 KB per record, or roughly 100 MB of memory. Even for a server with a fairly large memory allowance, 100 MB is not an easy chunk to swallow, and its impact on smaller systems could be devastating. More to the point, searching a structure that large is painfully slow.
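As a sanity check on that arithmetic (the 20-nodes-per-record figure is an assumption for illustration; the 1/2 KB-per-node rule is the article's):

```python
NODES_PER_RECORD = 20    # assumed: ~20 element/attribute nodes per record
BYTES_PER_NODE = 512     # the 1/2 KB-per-node rule of thumb
RECORDS = 10_000

per_record_kb = NODES_PER_RECORD * BYTES_PER_NODE / 1024      # ~10 KB per record
total_mb = RECORDS * NODES_PER_RECORD * BYTES_PER_NODE / (1024 * 1024)
print(per_record_kb, round(total_mb))  # → 10.0 98
```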

However, one way you can reduce the memory footprint of this database without losing the significant benefits of XML is to build an XML server. An XML server is simply a device that produces XML in response to an initial query of some sort. One technique for building an XML server borrows a concept from the world of relational databases: normalization.

Normalize Your Data
For those of you coming from the database world, normalization should be a familiar concept. It involves turning one large database table into a number of smaller tables around one or more keys. Good normalization is more art than science, but in general most complex data can be fairly readily partitioned into smaller parts.

The primary key in an XML structure, the one used to identify specific objects, must be unique. Typically for XML structures, this uniqueness comes in the form of an ID of some sort. For the sample characters document, the ID is an attribute called "id" attached to the <character> node. Because most parsers can parse attributes faster than elements, and because an ID is less a describing property of the object than a name for it, you should keep IDs and ID references as attributes within your document structure for best efficiency.

Once the ID is identified, you should start to divide your objects up into distinct sub-objects, keeping in mind that you should break them along the lines of both functional definitions and efficient retrieval. For example, the game processor is likely to want to pull up character matches based upon a few primary properties, such as name, species, gender, or vocation, so it makes sense to keep these properties in the central XML document. Secondary properties (level, sublevel, orderIndex, etc.) would in turn go into a second database, description (which may include CDATA blocks and could conceivably be quite large) would go in a third, while the game play attributes (such as strength, memory, etc.) would go into a fourth. These divisions would turn your single XML document into a set of four subdocuments (see Listing 2).
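A minimal sketch of that split, using Python's standard library on a hypothetical two-record slice of the data (the field groupings mirror the text; the SOURCE sample and field names are illustrative):

```python
import copy
import xml.etree.ElementTree as ET

# A hypothetical two-record slice of the character database.
SOURCE = """
<characters>
  <character id="c1">
    <characterName>Aleria</characterName>
    <species>Elf</species>
    <vocation>Warrior</vocation>
    <level>5</level>
    <description>Tall and quiet.</description>
  </character>
  <character id="c2">
    <characterName>Borin</characterName>
    <species>Dwarf</species>
    <vocation>Mage</vocation>
    <level>2</level>
    <description>Short and loud.</description>
  </character>
</characters>
"""

PRIMARY = {"characterName", "species", "vocation"}  # stays in the index file
SECONDARY = {"level"}                               # goes to secondaries.xml
DESCRIPTIVE = {"description"}                       # goes to descriptions.xml

def normalize(xml_text):
    """Split one wide document into an index tree plus idref-keyed subordinate trees."""
    src = ET.fromstring(xml_text)
    index = ET.Element("characters")
    secondaries = ET.Element("secondaries")
    descriptions = ET.Element("descriptions")
    for char in src.findall("character"):
        cid = char.get("id")
        idx = ET.SubElement(index, "character", id=cid)
        sec = ET.SubElement(secondaries, "character", idref=cid)
        desc = ET.SubElement(descriptions, "character", idref=cid)
        for prop in char:
            target = (idx if prop.tag in PRIMARY else
                      sec if prop.tag in SECONDARY else
                      desc if prop.tag in DESCRIPTIVE else None)
            if target is not None:
                target.append(copy.deepcopy(prop))
    return index, secondaries, descriptions
```

Each subordinate record carries an idref attribute pointing back to the id in the index tree, which is exactly the join key the server uses later to reassemble objects.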

The idref attribute that each of these subtrees has in common maps to the ID in the primary document (world.xml). By breaking the data in this fashion, you turn the large 100 MB file into smaller 20- or 25 MB files—still large, but no longer unmanageable. Furthermore, you can fine-tune your primary file to contain just those attributes that you anticipate will be commonly requested, although a trade-off has to be made here—the more information you throw into the primary (or index) file, the less benefit this technique has.

Given these arrangements, the simplest way to reconstruct the files is to store references to the primary and secondary files in a single XML database, here called characterDef.xml. By storing file references this way, you create a sort of class definition for a collection of objects in an indirect fashion. You can actually create different types of collections just by changing the definition files to point to appropriate XML databases (or, perhaps more interestingly, to ASP or JSP scripts that generate parametric XML).

<!-- characterDef.xml -->
<characterDef>
    <index src="world.xml" objectPath="//character" group="characters"/>
    <link src="secondaries.xml"/>
    <link src="attributes.xml"/>
    <link src="descriptions.xml"/>
</characterDef>

The objectPath attribute, connected to the index document, gives an XPath expression to describe the set of objects you want to load, and can include qualifications via predicate expressions (i.e., bracketed expressions such as those found in an expression like "//character[characterName='Aleria']" ). The group attribute, on the other hand, specifies the name of the containing node that holds all of the objects you generate. It doesn't have to be the name of the containing element in the source; it's just a generic root node for the set you're about to generate.
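Predicate selection of this kind isn't tied to any one parser; even Python's standard library supports a limited form of the same bracketed-predicate syntax (the sample document here is hypothetical):

```python
import xml.etree.ElementTree as ET

world = ET.fromstring("""
<world>
  <character id="c1"><characterName>Aleria</characterName><species>Elf</species></character>
  <character id="c2"><characterName>Borin</characterName><species>Dwarf</species></character>
</world>
""")

# Roughly analogous to //character[characterName='Aleria']
matches = world.findall(".//character[characterName='Aleria']")
print([m.get("id") for m in matches])  # → ['c1']
```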

Microsoft Releases a New XSL Parser
Up until now, I haven't indicated which parser to use here; the fundamental concept is much the same regardless. But because of the long gap between Microsoft's release of their original XSL parser and the most recent XSL Transformations recommendation, I feared I would end up having to write this article targeting that older parser. However, as of January 25th, Microsoft released a new XML technology preview that includes a much more robust XSL parser, one that is close to being completely compliant with the XSL-T specification (most of the non-compliant areas are in very secondary portions of the specification, mainly having to do with formatting). Thus, the code I am going to show you here is "pure" XSL-T, with the exception of some DOM routines to handle areas that are easier to deal with in scripted code. You can download the new XML parser from http://msdn.microsoft.com/xml.

The way to approach building sub-trees from the whole data space is to take advantage of XSL. The XSL processor is fast; it's optimized for XML operations that are much more efficient than trying to do the same things through the XML DOM. I wanted to set up the routines so that you could parametrically specify a subset of characters (or any objects, for that matter), but I'll continue to use game characters for reference. You can make use of an XSL structure that works quite well for filtering, the identity transform:

<!-- filterAndCollect.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:param name="group">characters</xsl:param>
<xsl:param name="query" select="//character"/>

<xsl:template match="/">
    <xsl:element name="{$group}">
        <xsl:apply-templates select="$query"/>
    </xsl:element>
</xsl:template>

<xsl:template match="*|@*|text()">
    <xsl:copy><xsl:apply-templates select="*|@*|text()"/></xsl:copy>
</xsl:template>

</xsl:stylesheet>

This little XSL script uses two XSL parameters: "group" and "query". The "group" parameter contains the name of the root node that holds all of the characters while "query" contains the query that you use to retrieve just those game characters that satisfy a given condition (for example, "give me all Elven Warriors," which would be expressed in XPath notation as "//character[species='Elf' and vocation='Warrior']"). Because you assume that the full set of elements to be retrieved is contained in the characterDef.xml file (as "//character" in this case), you just pass the expression within the predicate as the query—for example, query is "species='Elf' and vocation='Warrior'".

Parameters are probably the coolest feature in the remarkably robust implementation of the parser because they turn your XSL scripts into little transformation functions for which you don't need to know the interior structures—they're much more encapsulated than with the older version of the parser.

You match the root node with the first template (match="/"), then immediately create a new element that has the same name as the group parameter. The dollar sign indicates that the expression is a parameter, variable, or externally defined object. The { } brackets, to be used only within attributes, evaluate the parameter and pass it as an argument to the <xsl:element>'s name attribute. For the default case, this creates an element of the form <characters></characters>. This fairly simple act was almost impossible to do in the older implementation.

The second expression is a little more confusing. Notice that you define the group and query parameters in very different ways. The group parameter is defined by setting the text of the node to "characters". The query node, on the other hand, gets set by selecting the expression "//character" as an attribute. This latter form actually performs a "select nodes" on the to-be-passed XML document and passes the result as a node list into the XSL structure. In other words, you can parametrically pass a pre-queried set of nodes into the XSL transform from the outside.

The final template is the standard identity transform template: it creates a duplicate of whatever XML is passed to it. Thus, you have changed the name of the root node in the process (something that's very difficult to do in DOM) and filtered the initial data store so that only the desired nodes end up in the final document. This makes the work you need to do on the reduced set much easier to handle, as you could have filtered potentially thousands of object nodes down to a few dozen (or even down to one, or zero, if nothing matches).

This little transform is very handy—I find it indispensable in any number of projects that I'm working on. A little design note: I personally have taken to labeling my XSL documents as verbs. A transformation does something, and is the declarative equivalent of a function or method in a procedural language like Visual Basic. Thus, I can talk about an XSL transformation in abstract terms, without specifically knowing the internal structure of either the transformer or the transformee—I filterAndCollect the World collection. This design preference may be a little thing, but it helps you make the transition from a procedural to a declarative paradigm.

At this stage, you have all that you need to create the output from a specific query filter. While the next stage could have been coded as an XSL script as well, I wanted to show some of the external characteristics of the new XSL world and how it relates to DOM. The biggest new piece is the creation of XSL processors.

XSL processors were designed to solve the central problem the older XSL parser ran into: it didn't scale. Every time transformNode was called, the parser had to load the XSL document, create the templating associations, process the text, then spit out the output. In procedural terms, the XSL document was interpreted every time it was needed, which, combined with the overhead of interpreting script, made for an extremely slow process.

XSL processors solve that problem. You load an XSL template object into memory, assign it an XSL document as a property, then generate an XSL processor by calling the createProcessor() method (the template is a processor factory that compiles the XSL into a distinct object). Once you have the processor, you can use it repeatedly with different XML input by assigning it an input stream and an output stream. The input stream is most likely going to be an XML document or node, but could conceivably be a recordset, a request object from an ASP source that's just received an XML block via XML HTTP, or any other XML source. The output stream is anything that exposes an IStream interface, such as a TextStream object, a recordset, an ASP Response object, or another XML document. You apply the XSL processor with the transform() method, which doesn't return any output directly (output is directed to the output stream).

While I didn't use the scalability aspects of the processors in the current code, I did use another useful feature. You can use the processor to add parameters to your XSL document—such parameters overwrite any identically named parameters previously defined. Using the .addParameter method, you can pass strings, numbers, single DOM nodes, or node lists.
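The template/processor split and the parameter-override behavior can be sketched in Python terms. The class and method names below are illustrative, not the MSXML API, and the transform() here simply reimplements the filter-and-collect behavior described earlier:

```python
import copy
import xml.etree.ElementTree as ET

class TransformTemplate:
    """Parse-once holder for a transform's defaults; hands out processors."""
    def __init__(self, group, query):
        self.defaults = {"group": group, "query": query}

    def create_processor(self):
        return TransformProcessor(self.defaults)

class TransformProcessor:
    """One reusable 'compiled' transform; parameters may be overridden."""
    def __init__(self, defaults):
        self.params = dict(defaults)

    def add_parameter(self, name, value):
        self.params[name] = value   # overwrites any identically named parameter

    def transform(self, source_root):
        # Copy every node matching the query under a root named by group.
        out = ET.Element(self.params["group"])
        for node in source_root.findall(self.params["query"]):
            out.append(copy.deepcopy(node))
        return out

template = TransformTemplate("characters", ".//character")
processor = template.create_processor()
processor.add_parameter("query", ".//character[species='Elf']")
```

The expensive work (parsing, setting defaults) happens once in the template; each processor it hands out can then be pointed at a different input, which is where the scalability gain comes from.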

You pass both strings and lists in the routine getFilteredDocument (see Listing 3). This program shows how powerful and pervasive XML can be as a technology. The routine takes the name of the object definition file (such as characterDef.xml) as the parameter jclDocName (a nod to mainframe Job Control Language). You can then pass a query filter and a subfilter in as secondary parameters. The object definition file supplies the name of the index file and the three linked files, as well as the base object path (in this case '//character', as defined previously) and the group name.

Before you start using the newer XSL namespace, you need to tell the XML documents that they're using the newer parser:

xmlDoc.setProperty "SelectionLanguage","XPath"

If you set the "SelectionLanguage" to "XSLPattern" (or leave it unset), the document reverts to using the older parser's pattern language and namespace (www.w3.org/TR/WD-xsl), which will definitely complain about the newer syntax. As a general principle, you should specify that you're using the new parser for all of your user-defined XML objects; it's easy to forget and then spend several hours trying to figure out why your XSL won't compile properly.

The .getFilteredDocument() routine is divided into four parts: setting up the requisite files and parameters, performing the primary query, attaching the rest of the nodes to each resultant object, and performing the secondary query. The primary query is a query that operates only on the properties contained in the index file (here, world.xml). These properties are roughly analogous to indexed keys in a relational database—they will be referenced frequently, so they are kept in an easily accessible form. In the case of the game characters, the properties include the character and player names, the species, gender, vocation, and game level. Your game application is probably far more likely to want to retrieve a list of "Elven Rangers" than it would a list of all characters wearing plate mail. (Obviously, if you are designing a more conventional database, your common properties might be things like last name and e-mail address, with performance evaluation scores being considered secondary).

You can use XPath notation to reference these common properties, but if you attempt to retrieve a property that isn't in the index file at this stage, you'll get back an empty set. The objects contain only those properties in the index file, and asking for all characters with a strength above .4 simply will not find anything to compare against.

The third part of the .getFilteredDocument routine takes the result retrieved from the second part and attaches the rest of the nodes. It does this by calling the appendNodes function. This function streams through the list of objects from the second part, retrieves the ID, performs a search for the idref with the same name in each of the subordinate <link> files defined in the object definition file, and grafts a copy of the node onto the object node:

function appendNodes(fileName,sourceDoc,sourcePath)
    set subDoc=createObject("MSXML2.FreeThreadedDOMDocument")
    subDoc.async=false
    subDoc.load fileName
    for each node in sourceDoc.selectNodes(sourcePath)
        set appNode=subDoc.selectSingleNode _
            ("//*[@idref='" & node.getAttribute("id") & "']").cloneNode(true)
        node.appendChild appNode
        appNode.removeAttribute "idref"
    next
    set appendNodes=sourceDoc
end function

To keep the file from getting unwieldy, the idref is removed from each of the child objects when the child is appended to the parent object. Because you're still dealing with large files, anything you can do to keep redundancy down will help streamline your production.
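For readers outside the MSXML world, the same idref join can be sketched with Python's standard library (the document shapes are assumed from the listings above; append_nodes is a rename of the VBScript routine):

```python
import copy
import xml.etree.ElementTree as ET

def append_nodes(sub_root, source_root, source_path=".//character"):
    """Graft a copy of each idref-matched secondary record onto the
    matching object in the index tree, dropping the redundant idref."""
    # Index the secondary records by idref once, rather than rescanning.
    by_idref = {n.get("idref"): n for n in sub_root.findall(".//*[@idref]")}
    for node in source_root.findall(source_path):
        match = by_idref.get(node.get("id"))
        if match is not None:
            grafted = copy.deepcopy(match)
            del grafted.attrib["idref"]
            node.append(grafted)
    return source_root
```

Building the idref dictionary up front avoids searching the secondary file once per object, which matters when the files run to tens of megabytes.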

Earlier, the only queries that could be made were on the indexed elements, or those contained in the primary file. Once the secondary elements have been added, however, a full search can be done on the remaining XML elements. While this particular scheme isn't especially optimal, it is possible (though beyond the scope of this article) to parse the query string so that the primary and secondary keys are each separated into different sections.

The last section of .getFilteredDocument is almost identical to the first query section, and in fact uses exactly the same XSL filter, but works with the resulting document from the third section. This will typically be a smaller (conceivably much smaller) subset of the total set of objects, so searching on them becomes much easier. At this stage, you can do a complete search on all elements, as in the main() function:

function main()
    set xmlDoc=getFilteredDocument("characterDef.xml", _
        "gender='Female' and vocation='Mage' and level>3", _
        "contains(description,'brown')")
    set xslDoc=createObject("MSXML2.FreeThreadedDOMDocument")
    xslDoc.async=false
    xslDoc.load "showCharacters.xsl"
    document.write xmlDoc.transformNode(xslDoc)
end function

The main() routine serves the same purpose as one in Java or VB—it indicates the routine to be called that handles all other processing. In this case, you call getFilteredDocument(), passing the object definition file, characterDef.xml, the primary query and the secondary query. The primary query, "gender='Female' and vocation='Mage' and level>3", tells the routine to return all female mages of at least fourth level, and the secondary query, "contains(description,'brown')", retrieves all 4+ level female mages that have the word "brown" somewhere in their description field.

Display the Output
To prove that you can use the simpler transformNode function with the new XML parser, load in the showCharacters.xsl stylesheet and do a straight transformNode call in the main() function to generate the output from the result set just passed. This handles the creation of a "sheet" that describes the character in detail (see Listing 4).

I'd like to point out two tricks here, although they are not immediately germane to the issue of creating a server. All of the element node names are given in the form thisIsAName (first letter lowercase, spaces removed), which is a fairly well-accepted naming convention. Because I didn't want to go to the bother of creating a schema for the output (although in general that is a good idea), there was no way to provide friendlier names without adding to the size of the XML document. However, primarily because I wanted to play with the string functions of the new parser, I wrote a small declarative routine that converts the first character of a name from lowercase to uppercase:

<xsl:value-of select="concat(
    translate( substring( name(.),1,1),
        'abcdefghijklmnopqrstuvwxyz',
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    substring( name(.),2,string-length(name(.))))"/>

The value-of tag, when working in conjunction with one or more functions declared in the select statement, will evaluate the function and return its result into the output stream. Concat joins two strings together, translate maps characters from the first given string (the lowercase letters here) to the second string (uppercase characters), and substring returns the string from the first position given to the last. Finally, string-length returns the total length of the string, while name returns the node-name of the node in question—here the current node.
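A pure-Python restatement of that pipeline may make the nesting easier to follow (capitalize_node_name is my name for it, not anything in the stylesheet):

```python
import string

def capitalize_node_name(name):
    # concat( translate( substring(name,1,1), lowercase, uppercase ),
    #         substring(name,2) )
    first = name[:1].translate(str.maketrans(string.ascii_lowercase,
                                             string.ascii_uppercase))
    return first + name[1:]

print(capitalize_node_name("characterName"))  # → CharacterName
```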

The second trick is a little more subtle, but no less powerful. The initial "game" stored all of the game characters' attributes (strength, analysis, etc.) as numbers between 0 and 1. While these values are easier to work with in terms of processing, they are not necessarily intuitive to the user. You can rely on a simple trick from the early JavaScript days: create a graph by taking a number of HTML DIVs, placing them in a table, and varying their widths. By putting a background in each DIV (in this case a gradient JPG), you get the illusion of a table with graded columns.

You can incorporate this principle with an extra fillip—you can use the attribute bracketed notation to determine the width based upon the attribute's value, then multiply that by 200 and add 20 so that the text will always appear against a dark background:

<div style="color:yellow;font-weight:bold;background-color:red;
    width:{. * 200 + 20}px">
<xsl:value-of select=".*100"/></div>
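The scaling arithmetic, restated: the 0..1 attribute value maps to a width between 20 and 220 pixels, so even a zero value leaves a visible dark strip behind the text.

```python
def bar_width(value):
    # value is a 0..1 game attribute; per the text: multiply by 200, add 20
    return value * 200 + 20

print(bar_width(0.5))  # → 120.0
```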

Being able to evaluate expressions within an attribute string, or using the xsl:value-of element, significantly enhances what you can do with XSL without having to rely on blocks of scripted code. Indeed, the ultimate advantage of this is that, at least internally, the XSL scripts contained herein would work equally well on a Linux machine or a Macintosh—they are completely compliant with the XSL standards.

More About XML Servers
Of course, to call this system a server is a little misleading, but not all that much. As I mentioned, an XML server is simply a device that produces XML based upon an initial query of some sort. The same notions that I've talked about here could be further abstracted. You could abstract sub-elements of the total XML space by placing them within distinct file collections. Moreover, the beauty of XML is that it is agnostic about its source. It's not much of a stretch to see how one of the tables in the character definition file could point to an ODBC database that in turn has an XSL query front end and produces XML as a result set, regardless of the mechanism to retrieve this information. Similarly, the <link> elements could point to a program or driver that can be queried using XPath and generate state information through XML.

Obviously, you can improve on this design. One of the biggest limitations of this solution is that you need to provide both a primary and a secondary query. However, extracting this information from a single query is not all that complicated. You could use a regular expression to extract the keywords (look for a pattern such as "(\w+)\s*=" together with the "and" keyword), determine which expressions involve keywords from which part of the database, and split the query accordingly. Things get more complicated when you introduce functional keywords, of course, which is one reason I'm not covering this topic in depth.
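A sketch of that keyword extraction with Python's re module (the key set and the sample query are assumptions for illustration, and the pattern is widened slightly to catch < and > comparisons):

```python
import re

# Fields assumed to live in the index file (the "primary" keys).
PRIMARY_KEYS = {"characterName", "playerName", "species", "gender", "vocation", "level"}

query = "gender='Female' and vocation='Mage' and strength>0.5"

# Field names appear on the left of a comparison operator.
fields = re.findall(r"(\w+)\s*[=<>]", query)
primary = [f for f in fields if f in PRIMARY_KEYS]
secondary = [f for f in fields if f not in PRIMARY_KEYS]
print(primary, secondary)  # → ['gender', 'vocation'] ['strength']
```

Once the fields are classified, the clauses mentioning only primary keys go to the first-pass query against the index file, and the rest wait for the reassembled document.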

Still, the concepts provided here form the basis of many pure XML servers that are beginning to appear on the market today. Because of the neutral nature of XML, so long as you have a mechanism to retrieve XML parametrically, building such a server is a lot easier than you might think.

Click here to download the code for this article.

Note: The code files make extensive use of the new XML Parser, downloadable from the Microsoft Web site at http://msdn.microsoft.com/xml. If you have not done so, you will need to install it (it's still a beta, but extremely robust).

Kurt Cagle is a writer and developer specializing in XML, XSLT and distributed Internet application technologies, and is author of Sybex's XML Developers' Handbook and coauthor of the upcoming Wrox Professional XSLT book. He lives in Olympia, Washington with his wife and two daughters, and welcomes any comments, questions or suggestions for articles. He can be reached at cagle@olywa.net