XML Storage in SQL Server: Strategies and Fallout

liché but true: To someone with a hammer, everything looks like a nail. We developers seem to be especially vulnerable to this problem because we like technology for its own sake. We’ll get a whole bagful of hammers when SQL Server 2005 (formerly codenamed Yukon) comes out this year, and one of the most important is native XML support.

Increasingly, developers don’t have a choice between relational and XML data. Each has its strengths and each has its place. Reports and business intelligence applications work primarily with relational data, but venders and customers need data in XML. And both kinds of data are increasingly integrated: change requests refer to customer email; trouble-tickets reference white papers. SQL Server 2005 can help you manage all these different kinds of data under a single platform.

What’s new?
Even without any built-in support for XML, a database system typically offers more robust management functionality than a file system. Features such as security, backup, and mirroring make the database a good home for documents.

But XML isn’t completely new to the Microsoft database world: SQL Server 2000 shipped with some XML functionality, and SQLXML continues to improve client-side functionality. But most of the XML support in that version has to do with mapping between relational and markup data. The OPENXML keyword is used to shred documents into a tabular format for storage. And the FOR XML clause can be used to publish relational data as XML on demand.

What You Need
To really follow along, you’ll need SQL Server 2005. MSDN subscribers can download beta 2 from MSDN. Everyone else can get the Express Edition from Microsoft’s SQL Server Developers Center.
Figure 1. SQL Server 2000: SQLXML and T-SQL language enhancements enable mapping between markup and tabular data but offer little support for working with XML natively.

It’s a nice piece of indirection: Developers don’t (necessarily) have to learn database libraries and DBAs don’t have to worry (too much) about the intricacies of XML. But the indirection is only one-way: SQL Server 2000 offers very little in the way of native XML support. You can store your document in a string, but you can’t index it, or shred it on the fly?not without a lot custom development (see Figure 1).

SQL Server 2005 adds symmetry to the storage-type indirection: Now that XML is built-in, you can manage XML data with database tools such as indexes, constraints, and queries. Even better, you can use XML views to expose markup data as rows and columns (see Figure 2).

Relational Storage: Still the Default

Figure 2. SQL Server 2005: SQL Server 2005 lets you expose natively stored XML in a tabular format.

The relational database isn’t very sexy, but even after 25 years of commercial availability, it’s still the best way to store structured data. Structured data has at least two characteristics:

  • Dense?Every entity has roughly the same set of known properties.
  • Regular?Entities can reference each other?think foreign keys?but in a predictable manner.

Here’s a more intuitive test: If your data can easily fit into a spreadsheet, it’s probably structured. Of course, the concept of structure in data exists on a continuum, so extremes are relatively rare. Even contact information can be sparsely populated?not everyone has a first name, a last name, and a middle initial, so these fields are occasionally NULL?but relational is usually the best way to store it. Here’s some contact information in a tabular format:

Email GivenName FamilyName Street Unit City State PostalCode Country
[email protected] NULL Cher 1234 American Way 15d YourCity CA 54321 USA
[email protected] Billy Pilgrim 5432 American Way NULL AnyTown CO 12345 USA

Here’s the XML equivalent:

     [email protected]">           Cher     
123 American Way 15d YourCity CA 54321 USA
[email protected] Billy Pilgrim
543 American Way AnyTown NY 12345 USA

Each representation has its advantages: The table is more succinct; XML is better at dealing with Cher’s single name. But overall, the table is the best way to represent this kind of data for a variety of reasons:

  • Query performance?The relational model is much simpler than XML, which makes searching easier and faster.
  • Storage space?XML is notoriously space-inefficient?maybe not a problem for a config file, but size can be an issue for large databases.
  • Locking?The XML type is lockable like any other type. So, if many users are updating the same set of documents, they can run into decreasing performance and even deadlocks.
  • Legacy?A lot of technological infrastructure supports the relational model, such as reporting software and business intelligence systems.

So even if you import data in an XML format, it’s typically best to shred it up and store it in a relational format. And this is where SQL Server 2000 is already pretty good. Listing 1 shows a XML file, shredded using SQL Server 2000’s OPENXML rowset. Listing 2 shows a way to publish XML query results using functionality new in SQL Server 2005.

The Right Tool for the Right Job
But there’s a whole universe of data that doesn’t map very well to rows and columns. If the relational model handles data that’s dense and regular, XML takes care of just about everything else.

First, XML is very good at representing very hierarchical data. Navigating a hierarchy of arbitrary depth is a real hassle in SQL Server 2000. As an example of hierarchical data, consider the blueprint of a house. Various structural elements support each other in complex ways: The roof sits on beams, posts, and walls, each of which is supported by one or more structural elements below (see Listing 3). Users may want to run queries, but relationships between members are arbitrary so this kind of data doesn’t map well to tables and columns.

XML is also gaining popularity for representing unstructured data. These are often prose documents, and markup helps to store them in a display-neutral fashion, so you can easily transform your document into HTML for a Web site or into a PDF document to send to the printers. Marking up prose can also add semantic transparency, making document searches more precise. Listing 4 shows a SQL script for querying unstructured data.

The example below stores an email message as XML. Not only are the standard to, from, and subject fields marked up, but the text of the message itself contains custom tags as well. These custom tags make searching through text a less ambiguous exercise.

   [email protected]   [email protected]   Eyes on the Prize         Remember the California rollout  happens Monday 12/14/2004.      Management.      

SQL Server gives you the best of both worlds for storing hierarchical and unstructured data. Now, you can query and index documents that would previously be searchable only through full text indexing.

Implementation Details: Four more Nails
Depending on your situation, XML can also be the best choice for storing even highly structured data. Here are a few practical reasons to consider storing data in a field of type XML:

  • Repeated shredding or publishing?On-demand transformations carry a performance penalty. If you have to shred or publish the same document over and over again, consider storing it natively as XML. You can always expose it to relational consumers with an XML view.
  • Rapidly changing data structures?When modeled correctly, XML lives up to its name: It’s extensible. Developers can add new pieces of data?even new hierarchies?to a schema without compromising existing software. Extensibility is an extra advantage when prototyping, or when working with rapidly changing problem domains such as bioinformatics.
  • Atomic data?Sometimes, you’ll have XML data that’s never consumed except as a whole. Think of this as logical atomicity?if you never access the parts individually, you might as well store it in one big chunk.
  • Debugging?Especially for new releases, it can be a good idea to tuck away a copy of your XML imports. The data may be redundant, but keeping the original makes tracking down problems a whole lot easier.

Microsoft’s database support for XML has come a long way with SQL Server 2005. So consider these more as guidelines than as rules. There are enough nails out there; you don’t need to look for any more.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

Recent Articles: