Chapter 9: ADO.NET XML Data Serialization
XML is the key element responsible for the greatly improved interoperability of the Microsoft ADO.NET object model when compared to Microsoft ActiveX Data Objects (ADO). In ADO, XML was merely a nondefault I/O format used to persist the contents of a disconnected recordset. The participation of XML in the building and in the inner workings of ADO.NET is much deeper. The aspects of ADO.NET in which the interaction and integration with XML is strongest fall into two categories: object serialization and remoting, and a dual programming interface.
In ADO.NET, you have several options for saving objects to, and restoring objects from, XML documents. In effect, this capability belongs to one object only, the DataSet object, but it can be extended to other container objects with minimal coding. Saving objects like DataTable and DataView to XML is essentially a special case of the DataSet object serialization.
As we saw in Chapter 8, ADO.NET and XML classes provide for a unified, intermediate API that is made available to programmers through a dual, synchronized programming interface: the XmlDataDocument class. You can access and update data using either the hierarchical node-based approach of XML or the relational approach of column-based tabular data sets. At any time, you can switch from a DataSet representation of the data to an XML Document Object Model (XML DOM) representation, and vice versa. Data is synchronized, and any change you enter in either model is immediately reflected and visible in the other.
In this chapter, we’ll explore the XML features built around the DataSet object and other ADO.NET objects for data serialization and deserialization. You’ll learn how to persist and restore data contents, how to deal with schema information, and even how schema information is automatically inferred from the XML source.
In ADO.NET, serialization of an object is performed either through the public ISerializable interface or through public methods that expose the object’s internal serialization mechanism. As .NET Framework objects, ADO.NET objects can plug into the standard .NET Framework serialization mechanism and output their contents to standard and user-defined formatters. The .NET Framework provides a couple of built-in formatters: the binary formatter and the Simple Object Access Protocol (SOAP) formatter. A .NET Framework object makes itself serializable by implementing the methods of the ISerializable interface (specifically, the GetObjectData method, plus a particular flavor of the constructor). According to this definition, both the DataSet and the DataTable objects are serializable.
In addition to the official serialization interface, the DataSet object supplies an alternative, and more direct, series of methods to serialize and deserialize itself, but in a class-defined XML format only. To serialize using the standard method, you create instances of the formatter object of choice (binary, SOAP, or whatever) and let the formatter access the source data through the methods of the ISerializable interface. The formatter obtains raw data that it then packs into the expected output stream.
In the alternative serialization model, the DataSet object itself starts and controls the serialization and deserialization process through a group of extra methods. The DataTable object does not offer public methods to support such an alternative and embedded serialization interface, nor does the DataView object.
In the end, both the official and the embedded serialization engines share the same set of methods. The overall architecture of DataSet and DataTable serialization is graphically rendered in Figure 9-1.
Figure 9-1. Both the DataSet object and the DataTable object implement the ISerializable interface for classic .NET Framework serialization. The DataSet object also publicly exposes the internal API used to support classic serialization. (Image unavailable)
All the methods that the DataSet object uses internally to support the .NET Framework serialization process are publicly exposed to applications through a group of methods, one pair of which clearly stands out: ReadXml and WriteXml. The DataTable object, on the other hand, does not publish the same methods, although this feature can be easily obtained with a little code. (I’ll demonstrate this in the section “Serializing Filtered Views.”)
As you can see in the architecture depicted in Figure 9-1, both objects always pass XML data to .NET Framework formatters. This means that there is no .NET Framework-provided way to serialize ADO.NET objects in binary formats. We’ll return to this topic in the section “Custom Binary Serialization.”
Table 9-1 The DataSet Object’s Embedded Serialization API
|Method|Description|
|GetXml|Returns an XML representation of the data currently stored in the DataSet object. No schema information is included.|
|GetXmlSchema|Returns a string that represents the XML schema information for the data currently stored in the object.|
|ReadXml|Populates the DataSet object with the specified XML data read from a stream or a file. During the process, schema information is read or inferred from the data.|
|ReadXmlSchema|Loads the specified XML schema information into the current DataSet object.|
|WriteXml|Writes out the XML data, and optionally the schema, that represents the DataSet object to a storage medium, that is, a stream or a file.|
|WriteXmlSchema|Writes out a string that represents the XML schema information for the DataSet object. Can write to a stream or a file.|
Note that GetXml returns a string that contains XML data. As such, it requires more overhead than simply using WriteXml to write XML to a file. You should not use GetXml and GetXmlSchema unless you really need to obtain the DataSet representation or schema as distinct strings for in-memory manipulation. The GetXmlSchema method returns the DataSet object’s XML Schema Definition (XSD) schema; there is no way to obtain the DataSet object’s XML-Data Reduced (XDR) schema.
As Table 9-1 shows, when you’re working with DataSet and XML, you can manage data and schema information as distinct entities. You can take the XML schema out of the object and use it as a string. Alternatively, you could write the schema to a disk file or load it into an empty DataSet object. Alongside the methods listed in Table 9-1, the DataSet object also features two XML-related properties: Namespace and Prefix. Namespace specifies the XML namespace used to scope XML attributes and elements when you read them into a DataSet object. The prefix to alias the namespace is stored in the Prefix property. The namespace can’t be set if the DataSet object already contains data.
In this chapter, we’ll focus on the stateless representation of the DataSet object, with just a glimpse at the stateful representation, the DiffGram format. In Chapter 10, we’ll delve into the DiffGram’s structure and goals.
The XML representation of a DataSet object can be written to a file, a stream, a TextWriter object, or an XmlWriter object using the WriteXml method. To obtain a string, pass a StringWriter object. The output can include, or not include, XSD schema information. The actual behavior of the WriteXml method can be controlled by passing the optional XmlWriteMode parameter. The values in the XmlWriteMode enumeration determine the output’s layout. The overloads of the method are shown in the following listing:
WriteXml provides four additional overloads with the same structure as this code but with no explicit XmlWriteMode argument.
The stateless representation of the DataSet object takes a snapshot of the current status of the object. In addition to data, the representation includes tables, relations, and constraints definitions. The rows in the tables are written only in their current versions, unless you use the DiffGram format, which would make this a stateful representation. The following schema shows the ADO.NET normal form, that is, the XML stateless representation of a DataSet object:
The root tag is named after the DataSet object. If the DataSet object has no name, the string NewDataSet is used. The name of the DataSet object can be set at any time through the DataSetName property or via the constructor upon instantiation. Each table in the DataSet object is represented as a block of rows. Each row is a subtree rooted in a node with the name of the table. You can control the name of a DataTable object via the TableName property. By default, the first unnamed table added to a DataSet object is named Table. A trailing index is appended if a table with that name already exists. The following listing shows the XML data of a DataSet object named NorthwindInfo:
Basically, the XML representation of a DataSet object contains rows of data grouped under a root node. Each row is rendered with a subtree in which child nodes represent columns. The contents of each column are stored as the text of the node. The link between a row and the parent table is established through the name of the row node. In the preceding listing, the
Modes of Writing
Table 9-2 summarizes the writing options available for use with WriteXml through the XmlWriteMode enumeration.
Table 9-2 The XmlWriteMode Enumeration
|Write Mode|Description|
|DiffGram|Writes the contents of the DataSet object as a DiffGram, including original and current values.|
|IgnoreSchema|Writes the contents of the DataSet object as XML data without a schema.|
|WriteSchema|Writes the contents of the DataSet object, including an in-line XSD schema. The schema can’t be inserted as XDR, nor can it be added as a reference.|
IgnoreSchema is the default option. The following code demonstrates the typical way to serialize a DataSet object to an XML file:
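A minimal sketch of that pattern, assuming a small in-memory DataSet. The class name, the SaveToXml helper, and the sample data are illustrative and not taken from the book's sample files:

```csharp
using System.Data;
using System.IO;

public static class DataSetXmlWriterSketch
{
    // Hypothetical helper: writes the DataSet to a file, with or without
    // an in-line XSD schema, by choosing the XmlWriteMode value.
    public static void SaveToXml(DataSet ds, string path, bool includeSchema)
    {
        XmlWriteMode mode = includeSchema
            ? XmlWriteMode.WriteSchema
            : XmlWriteMode.IgnoreSchema;

        using (StreamWriter writer = new StreamWriter(path))
        {
            ds.WriteXml(writer, mode);
        }
    }

    // Builds a tiny sample DataSet (illustrative data only).
    public static DataSet BuildSample()
    {
        DataSet ds = new DataSet("NorthwindInfo");
        DataTable employees = ds.Tables.Add("Employees");
        employees.Columns.Add("employeeid", typeof(int));
        employees.Columns.Add("lastname", typeof(string));
        employees.Rows.Add(1, "Davolio");
        return ds;
    }
}
```

Passing false for includeSchema produces the same layout as calling WriteXml without an XmlWriteMode argument.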
In terms of functionality, calling the GetXml method and then writing its contents to a data store is identical to calling WriteXml with XmlWriteMode set to IgnoreSchema. Using GetXml can be comfortable, but in terms of raw overhead, calling WriteXml on a StringWriter object is slightly more efficient, as shown here:
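As a rough illustration of the StringWriter route, assuming you only need the schema-less XML as a string (the helper name ToXmlString is hypothetical):

```csharp
using System.Data;
using System.IO;

public static class DataSetStringSketch
{
    // Hypothetical helper: returns the schema-less XML of the DataSet
    // as a string by writing to an in-memory StringWriter.
    public static string ToXmlString(DataSet ds)
    {
        using (StringWriter writer = new StringWriter())
        {
            ds.WriteXml(writer, XmlWriteMode.IgnoreSchema);
            return writer.ToString();
        }
    }
}
```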
The same considerations apply to GetXmlSchema and WriteXmlSchema.
Preserving Schema and Type Information
The stateless XML format is a flat format. Unless you explicitly add schema information, the XML output is weakly typed. There is no information about tables and columns, and the original content of each column is normalized to a string. If you need a higher level of type and schema fidelity, start by adding an in-line XSD schema.
In general, a few factors can influence the final structure of the XML document that WriteXml creates for you. In addition to the overall XML format (DiffGram or a plain hierarchical representation of the current contents), important factors include the presence of schema information, nested relations, and how table columns are mapped to XML elements.
To optimize the resulting XML code, the WriteXml method drops column fields with null values. Dropping the null column fields doesn’t affect the usability of the DataSet object: you can successfully rebuild the object from XML, and data-bound controls can easily manage null values. This feature can become a problem, however, if you send the DataSet object’s XML output to a non-.NET platform. Other parsers, unaware that null values are omitted for brevity, might fail to parse the document. If you want to represent null values in the XML output, replace the null values (System.DBNull type) with other neutral values (for example, blank spaces).
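One way to apply that replacement is sketched below. It assumes that an empty string is an acceptable neutral value and touches only string columns; the helper name CopyWithEmptyStrings is hypothetical, and other column types would need a neutral value of your own choosing:

```csharp
using System.Data;

public static class NullFillSketch
{
    // Hypothetical helper: works on a copy so the original rows keep
    // their DBNull values and their row states.
    public static DataSet CopyWithEmptyStrings(DataSet source)
    {
        DataSet copy = source.Copy();
        foreach (DataTable table in copy.Tables)
        {
            foreach (DataRow row in table.Rows)
            {
                foreach (DataColumn col in table.Columns)
                {
                    // Only string columns get the neutral value (an assumption).
                    if (row.IsNull(col) && col.DataType == typeof(string))
                    {
                        row[col] = string.Empty;
                    }
                }
            }
        }
        return copy;
    }
}
```

Serializing the copy rather than the original keeps the substitution out of the live data.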
The schema contains information about the constituent columns of each DataTable object. (Column information includes name, type, any expression, and all the contents of the ExtendedProperties collection.)
The schema is always written as an in-line XSD. As mentioned, there is no way for you to write the schema as XDR, as a document type definition (DTD), or even as an added reference to an external file. The following listing shows the schema source for a DataSet object named NorthwindInfo that consists of two tables: Employees and Territories. The Employees table has three columns: employeeid, lastname, and firstname. The Territories table includes employeeid and territoryid columns. (These elements appear in boldface in this listing.)
The schema can be slightly more complex if relations exist between two or more pairs of tables. The msdata namespace contains ad hoc attributes that are used to annotate the schema with ADO.NET-specific information, mostly about indexes, table relationships, and constraints.
In-Line Schemas and Validation
Chapter 3 hinted at why the XmlValidatingReader class is paradoxically unable to validate the XML code that WriteXml generates for a DataSet object with an in-line schema, as shown here:
In the final XML layout, schema information is placed at the same level as the table nodes, but includes information about the common root (DataSetName, in the preceding code) as well as the tables (Table1 and Table2). Because the validating parser is a forward-only reader, it can match the schema only for nodes placed after the schema block. The idea is that the parser first reads the schema and then checks the compliance of the remainder of the tree with the just-read information, as shown in Figure 9-2.
Figure 9-2. How the .NET Framework validating reader parses a serialized DataSet object with an in-line schema. (Image unavailable)
Due to the structure of the XML document being generated, what comes after the schema does not match the schema! Figure 9-3 shows that the validating parser we built in Chapter 3 around the XmlValidatingReader class does not recognize (I’d say, by design) a serialized DataSet object when an in-line schema is incorporated.
Figure 9-3. The validating parser built in Chapter 3 does not validate an XML DataSet object with an in-line schema. (Image unavailable)
Is there a way to serialize the DataSet object so that its XML representation remains parsable when an in-line schema is included? The workaround is fairly simple.
Serializing to Valid XML
As you can see in Figure 9-2, the rub lies in the fact that the in-line schema is written in the middle of the document it is called to describe. This fact, in addition to the forward-only nature of the parser, irreversibly alters the parser’s perception of what the real document schema is. The solution is simple: move the schema out of the DataSet XML serialization output, and group both nodes under a new common root, as shown here:
Here’s a code snippet that shows how to implement this solution:
If you don’t use an XML writer, the WriteXmlSchema method writes the XML declaration in the middle of the document, making the document wholly unparsable. You can also mark this workaround with your own credentials using a custom namespace, as shown here:
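The workaround can be sketched as follows, under the assumption that a single XmlTextWriter is shared by both calls so that no second XML declaration is emitted. The root name Wrapper, the helper name, and the prefix and namespace arguments are illustrative choices rather than names taken from the book's sample:

```csharp
using System.Data;
using System.Text;
using System.Xml;

public static class ValidDataSetWriterSketch
{
    // Hypothetical helper: writes the schema and the data as siblings
    // under a new common root, with the schema placed first.
    public static void WriteWithLeadingSchema(
        DataSet ds, string path, string prefix, string ns)
    {
        using (XmlTextWriter writer = new XmlTextWriter(path, Encoding.UTF8))
        {
            writer.Formatting = Formatting.Indented;
            writer.WriteStartDocument();

            // New root, optionally qualified by a custom namespace.
            writer.WriteStartElement(prefix, "Wrapper", ns);

            ds.WriteXmlSchema(writer);   // schema block comes first
            ds.WriteXml(writer);         // data follows, with no in-line schema

            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
    }
}
```

A call such as WriteWithLeadingSchema(ds, "validdataset.xml", "de", "urn:example-wrapper") would produce a document whose root has exactly two children: the schema and the serialized DataSet.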
Figure 9-4 shows the new document displayed in Microsoft Internet Explorer.
Figure 9-4. The DataSet object’s XML output after modification. (Image unavailable)
Figure 9-5 shows that this new XML file (validdataset.xml) is successfully validated by the XmlValidatingReader class. The validating parser raises a warning about the new root node; this feature was covered in Chapter 3.
Figure 9-5. The validating parser raises a warning but accepts the updated XML file. (Image unavailable)
A reasonable concern you might have is about the DataSet object’s ability to read back such a modified XML stream. No worries! The ReadXml method is still perfectly able to read and process the modified schema, as shown here:
Although paradoxical, this behavior (whether it’s by design or a bug) does not deserve much hype. At first glance, this behavior seems to limit true cross-platform interoperability, but after a more thoughtful look, you can’t help but realize that very few XML parsers today support in-line XML schemas. In other words, what appears to be a clamorous and incapacitating bug is actually a rather innocuous behavior that today has a very limited impact on real applications. Real-world cross-platform data exchange, in fact, must be done using distinct files for schema and data.
Customizing Column Mapping
Each row in a DataTable object originates an XML subtree whose structure depends on the value assigned to the DataColumn object’s ColumnMapping property. Table 9-3 lists the allowable column mappings.
Table 9-3 The MappingType Enumeration
|Mapping|Description|
|Attribute|The column is mapped to an XML attribute on the row node.|
|Element|The column is mapped to an XML node element. The default setting.|
|Hidden|The column is not included in the XML output unless the DiffGram format is used.|
|SimpleContent|The column is mapped to simple text. (Only for tables containing exactly one column.)|
The column data depends on the row node. If ColumnMapping is set to Element, the column value is rendered as a child node, as shown here:
If ColumnMapping is set to Attribute, the column data becomes an attribute on the row node, as shown here:
By setting ColumnMapping to Hidden, you can filter the column out of the XML representation. The two preceding settings are maintained in the DiffGram format. A column marked Hidden is also serialized in the DiffGram format, but with a special attribute indicating that it was originally marked hidden for serialization. The reason is that the DiffGram format is meant to provide a stateful and high-fidelity representation of the DataSet object.
Finally, the SimpleContent value renders the column content as the text of the row node, as shown here:
For this reason, this setting is applicable only to tables that have a single column.
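To compare the mappings side by side, here is a small sketch that builds a one-row table and renders it with a chosen mapping on the employeeid column. The class name, the Render helper, and the sample data are assumptions made for illustration:

```csharp
using System.Data;

public static class ColumnMappingSketch
{
    // Hypothetical helper: applies the mapping to the employeeid column
    // and returns the schema-less XML. Pass Element, Attribute, or Hidden;
    // SimpleContent is not valid here because lastname stays an element.
    public static string Render(MappingType mapping)
    {
        DataSet ds = new DataSet("NorthwindInfo");
        DataTable employees = ds.Tables.Add("Employees");
        employees.Columns.Add("employeeid", typeof(int));
        employees.Columns.Add("lastname", typeof(string));

        employees.Columns["employeeid"].ColumnMapping = mapping;
        employees.Rows.Add(1, "Davolio");

        return ds.GetXml();
    }
}
```

With MappingType.Attribute, the value moves onto the row node as an attribute; with MappingType.Hidden, it disappears from the output altogether.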
Persisting Extended Properties
Many ADO.NET classes, including DataSet, DataTable, and DataColumn, use the ExtendedProperties property to enable users to add custom information. Think of the ExtendedProperties property as a kind of generic cargo variable similar to the Tag property of many ActiveX controls. You populate it with name/value pairs and manage the contents using the typical and familiar programming interface of collections. For example, you can use the DataTable object’s ExtendedProperties collection to store the SQL command that should be used to refresh the table itself.
The set of extended properties is lost at serialization time, unless you choose to add schema information. The WriteXml method adds extended properties to the schema using an ad hoc attribute prefixed with the msprop namespace prefix. Consider the following code:
When the tables are serialized, the Command slot is rendered as follows:
ExtendedProperties holds a collection of objects and can accept values of any type, but you might run into trouble if you store values other than strings there. When the object is serialized, any extended property is serialized as a string. In particular, the string is what the object’s ToString method returns. This can pose problems when the DataSet object is deserialized.
Not all types can be successfully and seamlessly rebuilt from a string. For example, consider the Color class. If you call ToString on a Color object (say, Blue), you get something like Color [Blue]. However, no constructor on the Color class can rebuild a valid object from such a string. For this reason, pay careful attention to the nonstring types you store in the ExtendedProperties collection.
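A short sketch of the safe case, a string-valued extended property, follows. The property name Command matches the text; the class name, helper name, and SQL string are placeholders:

```csharp
using System.Data;

public static class ExtendedPropertiesSketch
{
    // Hypothetical helper: stores a string in ExtendedProperties and
    // returns the XSD schema, where the property surfaces as an
    // msprop-prefixed attribute on the table element.
    public static string SchemaWithCommand()
    {
        DataSet ds = new DataSet("NorthwindInfo");
        DataTable employees = ds.Tables.Add("Employees");
        employees.Columns.Add("employeeid", typeof(int));

        // Strings survive the ToString-based serialization unchanged.
        employees.ExtendedProperties["Command"] =
            "SELECT employeeid FROM Employees";

        return ds.GetXmlSchema();
    }
}
```

For nonstring values, converting to and from a string yourself before storing them avoids depending on what ToString happens to return.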
More often than not, a relation entails table constraints. In ADO.NET, you have two types of constraints: foreign-key constraints and unique constraints. A foreign-key constraint denotes an action that occurs on the columns involved in the relation when a row is either deleted or updated. A unique constraint denotes a restriction on the parent column whereby duplicate values are not allowed. How are relations rendered in XML?
If no schema information is required, relations are simply ignored. When a schema is not explicitly required, the XML representation of the DataSet object is a plain snapshot of the currently stored data; any ancillary information is ignored. There are two ways to accurately represent a DataRelation relation within an XML schema: you can use the
The msdata:Relationship Annotation
The msdata:Relationship annotation is a Microsoft XSD extension that ADO.NET and XML programmers can use to explicitly specify a parent/child relationship between non-nested tables in a schema. This annotation is ideal for expressing the content of a DataRelation object. In turn, the content of an msdata:Relationship annotation is transformed into a DataRelation object when ReadXml processes the XML file.
Let’s consider the following relation:
The following listing shows how to serialize this relation to XML:
This syntax is simple and effective, but it has one little drawback: it is targeted only at describing a relation. When you serialize a DataSet object to XML, you might want to obtain a hierarchical representation of the data, if a parent/child relationship is present. For example, which of the following XML documents do you find more expressive? The sequential layout shown here is the default:
The following layout provides a hierarchical view of the data, with all the territories’ rows nested below the logical parent row:
As an annotation, msdata:Relationship can’t express this schema-specific information. Another piece of information is still needed. For this reason, the WriteXml method uses the
The XSD keyref Element
In XSD, the keyref element allows you to establish links between elements within a document in much the same way a parent/child relationship does. The WriteXml method uses keyref to express a relation within a DataSet object, as shown here:
The name attribute is set to the name of the DataRelation object. By design, the refer attribute points to the name of a key or unique element defined in the same schema. For a DataRelation object, refer points to an automatically generated unique element that represents the parent table, as shown in the following code. The child table of a DataRelation object, on the other hand, is represented by the contents of the keyref element.
The keyref element’s contents consist of two mandatory subelements, selector and field, both of which contain an XPath expression. The selector subelement specifies the node-set across which the values selected by the expression in field must be unique. Put more simply, selector denotes the parent or the child table, and field indicates the parent or the child column. The final XML representation of our sample DataRelation object is shown here:
This code is functionally equivalent to the msdata:Relationship annotation, but it is completely expressed using the XSD syntax.
Nested Data and Nested Types
The XSD syntax is also important for expressing relations in XML using nested subtrees. Neither msdata:Relationship nor keyref is adequate to express the relation when nested tables are required. Nested relations are expressed using nested types in the XML schema.
In the following code, the Territories type is defined within the Employees type, thus matching the hierarchical relationship between the corresponding tables:
By using keyref and nested types, you have a single syntax, the XML Schema language, to render in XML the contents of any ADO.NET DataRelation object. The Nested property of the DataRelation object specifies whether the relation must be rendered hierarchically (that is, with child rows nested under the parent) or sequentially (that is, with all rows treated as children of the root node).
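The effect of the Nested property can be seen with a small sketch. The table and column names follow the chapter's Employees and Territories example; the relation name Emp2Terr, the class name, and the sample rows are assumptions:

```csharp
using System.Data;

public static class NestedRelationSketch
{
    // Hypothetical helper: builds two related tables and sets the
    // relation's Nested property as requested.
    public static DataSet Build(bool nested)
    {
        DataSet ds = new DataSet("NorthwindInfo");

        DataTable employees = ds.Tables.Add("Employees");
        employees.Columns.Add("employeeid", typeof(int));
        employees.Columns.Add("lastname", typeof(string));

        DataTable territories = ds.Tables.Add("Territories");
        territories.Columns.Add("employeeid", typeof(int));
        territories.Columns.Add("territoryid", typeof(string));

        employees.Rows.Add(1, "Davolio");
        territories.Rows.Add(1, "06897");

        // Adding the relation also creates the unique and foreign-key
        // constraints that surface in the schema.
        DataRelation rel = ds.Relations.Add(
            "Emp2Terr",
            employees.Columns["employeeid"],
            territories.Columns["employeeid"]);
        rel.Nested = nested;

        return ds;
    }
}
```

Calling Build(true).GetXml() places each Territories row inside its parent Employees row, whereas Build(false).GetXml() lists all rows directly under the root. In both cases GetXmlSchema reports the relation through a keyref element.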
When reading an XML stream to build a DataSet object, the ReadXml method treats the
In the meantime, let’s see how to extend the DataTable and DataView objects with the equivalent of a WriteXml method.
The DataSet class includes internal methods that can be used to persist an individual DataTable object to XML. Unfortunately, these methods are not publicly available. Saving the contents of a stand-alone DataTable object to XML is not particularly difficult, however, and requires only one small trick.
The idea is that you create a temporary, empty DataSet object, add the table to it, and then serialize the DataSet object to XML. Here’s some sample code:
This code is excerpted from a sample class library that provides static methods to save DataTable and DataView objects to XML. Each method has several overloads and mimics as much as possible the DataSet object’s WriteXml method. In the preceding sample code, the input DataTable object is incorporated in a temporary DataSet object that is then saved to a disk file. The following code creates the temporary DataSet object and adds the DataTable object to it:
Note that a DataTable object can’t be linked to more than one DataSet object at a time. If a given DataTable object has a parent object, its DataSet property is not null. If the property is not null, the temporary DataSet object must be linked to an in-memory copy of the table.
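The temporary-DataSet trick, including the copy made for tables that already have a parent, might look like the sketch below. The method names echo the WriteDataTable method mentioned in the text, but the bodies, the class name, and the root name DataTable are assumptions rather than the library's actual code:

```csharp
using System.Data;

public static class DataTableXmlSketch
{
    // Hypothetical counterpart of WriteDataTable: wraps the table in a
    // temporary DataSet and lets WriteXml do the work.
    public static void WriteDataTable(
        DataTable table, string path, XmlWriteMode mode)
    {
        DataSet temp = CreateTempDataSet(table);
        temp.WriteXml(path, mode);
    }

    public static DataSet CreateTempDataSet(DataTable table)
    {
        // The root name is an arbitrary choice for this sketch.
        DataSet temp = new DataSet("DataTable");

        // A table can belong to only one DataSet at a time, so a table
        // that already has a parent is added as an in-memory copy.
        DataTable toAdd = (table.DataSet == null) ? table : table.Copy();
        temp.Tables.Add(toAdd);
        return temp;
    }
}
```

Note that an orphan table passed to this sketch ends up attached to the temporary DataSet; always calling Copy would avoid that side effect at the cost of duplicating the rows.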
The class library that contains the various WriteDataTable overloads is available in this book’s sample files and is named AdoNetXmlSerializer. A client application uses the library as follows:
Figure 9-6 shows the sample application in action.
Figure 9-6. An application that passes some data to a DataTable object and then persists it to XML. (Image unavailable)
So much for DataTable objects. Let’s see what you can do to serialize to XML the contents of an in-memory, possibly filtered, view.
The view is implemented by maintaining a separate array with the indexes of the original rows that match the criteria set on the view. By default, the table view is unfiltered and contains all the records included in the table. By configuring the RowFilter and RowStateFilter properties, you can narrow the set of rows that fit into a particular view. Using the Sort property, you can apply a sort expression to the rows in the view. Figure 9-7 illustrates the internal architecture of the DataView object.
Figure 9-7. A DataView object maintains an index of the table rows that match the criteria. (Image unavailable)
When any of the filter properties is set, the DataView object gets from the underlying DataTable object an updated index of the rows that match the criteria. The index is a simple array of positions. No row objects are physically copied or referenced at this time.
Linking Tables and Views
The link between the DataTable object and the DataView object is typically established at creation time through the constructor, as shown here:
However, you could also create a new view and associate it with a table at a later time using the DataView object’s Table property, as in the following example:
You can also obtain a DataView object from any table. In fact, the DefaultView property of a DataTable object simply returns a DataView object initialized to work on that table, as shown here:
Originally, the view is unfiltered, and the index array contains as many elements as there are rows in the table.
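The three ways of linking a view to a table can be sketched together. The class name, helper names, and sample data are illustrative:

```csharp
using System.Data;

public static class DataViewLinkSketch
{
    // Illustrative two-row table.
    public static DataTable BuildTable()
    {
        DataTable table = new DataTable("Employees");
        table.Columns.Add("employeeid", typeof(int));
        table.Rows.Add(1);
        table.Rows.Add(2);
        return table;
    }

    // Returns the row counts seen by each view; all are unfiltered.
    public static int[] CountRows(DataTable table)
    {
        // 1. Link at creation time through the constructor.
        DataView viaConstructor = new DataView(table);

        // 2. Create the view first and link it later.
        DataView viaProperty = new DataView();
        viaProperty.Table = table;

        // 3. Use the view that the table itself exposes.
        DataView viaDefault = table.DefaultView;

        return new int[]
        {
            viaConstructor.Count, viaProperty.Count, viaDefault.Count
        };
    }
}
```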
Getting Views of Rows
The contents of a DataView object can be scrolled through a variety of programming interfaces, including collections, lists, and enumerators. The GetEnumerator method in particular ensures that you can walk your way through the records in the view using the familiar foreach statement.
The following code shows how to access all the rows that fit into the view:
When client applications access a particular row in the view, the class expects to find it in an internal rows cache. If the rows cache is not empty, the specified row is returned to the caller via an intermediate DataRowView object. The DataRowView object is a wrapper for the DataRow object that contains the actual data. You access row data through the Row property. If the rows cache is empty, the DataView class fills it with an array of DataRowView objects, each of which references an original DataRow object. The rows cache can be empty either because it has not yet been used or because the sort expression or the filter string has been changed in the meantime.
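A minimal sketch of walking a filtered, sorted view through DataRowView objects follows. The helper name NamesInView and the lastname column are assumptions; the filter expression is whatever your application supplies:

```csharp
using System.Collections.Generic;
using System.Data;

public static class DataViewEnumSketch
{
    // Hypothetical helper: returns the lastname values of the rows that
    // fit into the view, in the view's sort order.
    public static List<string> NamesInView(DataTable table, string filter)
    {
        DataView view = new DataView(table);
        view.RowFilter = filter;
        view.Sort = "lastname";

        List<string> names = new List<string>();
        foreach (DataRowView rowView in view)
        {
            // The Row property exposes the underlying DataRow.
            DataRow row = rowView.Row;
            names.Add((string)row["lastname"]);
        }
        return names;
    }
}
```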
Serializing DataView Objects
The AdoNetXmlSerializer class also provides overloaded methods to serialize a DataView object. You build a copy of the original DataTable object with all the rows (and only those rows) that match the view, as shown here:
You create a temporary DataTable object and then serialize it to XML using the previously defined methods. The structure of the internal CreateTempTable routine is fairly simple, as shown here:
The ImportRow method creates a new row object in the context of the table. Like many other ADO.NET objects, the DataRow object can’t be referenced by two container objects at the same time. Using ImportRow is logically equivalent to cloning the row and then adding the clone as a reference to the table. Figure 9-8 shows a DataView object saved to XML.
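A sketch of what such a routine could look like, assuming that Clone is used to copy the table structure. The name CreateTempTable comes from the text, while the body and the class name are a guess at a reasonable implementation:

```csharp
using System.Data;

public static class DataViewXmlSketch
{
    // Builds a stand-alone table that contains only the rows currently
    // visible through the view.
    public static DataTable CreateTempTable(DataView view)
    {
        // Clone copies the schema (columns, constraints) but no rows.
        DataTable temp = view.Table.Clone();

        foreach (DataRowView rowView in view)
        {
            // ImportRow copies the row into the new table, leaving the
            // original row attached to its own table.
            temp.ImportRow(rowView.Row);
        }
        return temp;
    }
}
```

The resulting table has no parent DataSet, so it can be handed directly to a temporary DataSet for serialization.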
Figure 9-8. Saving a DataView object to XML. (Image unavailable)
The big difference between methods like WriteXml and .NET Framework data formatters is that in the former case, the object itself controls its own serialization process. When .NET Framework data formatters are involved, any object can behave in one of two ways. The object can declare itself as serializable (using the Serializable attribute) and passively let the formatter extrapolate any significant information that needs to be serialized. This type of object serialization uses .NET Framework reflection to list all the properties that make up the state of an object.
The second behavior entails the object implementing the ISerializable interface, thus passing the formatters the data to be serialized. After this step, however, the object no longer controls the process. A class that is not marked with the Serializable attribute can’t be serialized, and this holds even if it implements the ISerializable interface. Among ADO.NET classes, only DataSet and DataTable implement the ISerializable interface (and carry the Serializable attribute that the formatters require). For example, you can’t serialize to any .NET Framework formatters a DataColumn or a DataRow object.
A formatter object is merely a class that implements the IFormatter interface to support the serialization of a graph of objects. The SoapFormatter and BinaryFormatter classes also implement the IRemotingFormatter interface to support remote procedure calls across AppDomains. No technical reasons prevent you from implementing custom formatters. In most cases, however, you only need to tweak the serialization process of a given class instead of creating an extension to the general serialization mechanism. Quite often, this objective can be reached simply by implementing the ISerializable interface.
The following code shows what’s needed to serialize a DataTable object using a binary formatter:
The Serialize method causes the formatter to flush the contents of an object to a binary stream. The Deserialize method does the reverse: it reads from a previously created binary stream, rebuilds the object, and returns it to the caller, as shown here:
When you run this code, something surprising happens. Have you ever tried to serialize a DataTable object, or a DataSet object, using the binary formatter? If so, you certainly got a binary file, but with a ton of XML in it. Unfortunately, XML data in serialized binary files only makes them huge, without the portability and readability advantages that XML normally offers. As a result, deserializing such files might take a while to complete, usually a few seconds.
There is an architectural reason for this odd behavior. The DataTable and DataSet classes implement the ISerializable interface, thus making themselves responsible for the data being serialized. The ISerializable interface consists of a single method, GetObjectData, whose output the formatter takes and flushes into the output stream.
Can you guess what happens next? By design, the DataTable and DataSet classes describe themselves to serializers using an XML DiffGram document. The binary formatter takes this rather long string and appends it to the stream. In this way, DataSet and DataTable objects are always remoted and transferred using XML, which is great. Unfortunately, if you are searching for a more compact representation of persisted tables, the ordinary .NET Framework run-time serialization for ADO.NET objects is not for you. Let’s see how to work around it.
- Create a custom class, and mark it as serializable (or, alternatively, implement the ISerializable interface).
- Copy the key properties of the DataTable object to the members of the class. Which members you actually map is up to you. However, the list must certainly include the column names and types, plus the rows.
- Serialize this new class to the binary formatter, and when deserialization occurs, use the restored information to build a new instance of the DataTable object.
Let’s analyze these steps in more detail.
Creating a Serializable Ghost Class
Assuming that you need to persist only columns and rows of a DataTable object, a ghost class can be quickly created. In the following example, this ghost class is named GhostDataTable:
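A ghost class along these lines could be sketched as shown below. The member names `colNames`, `colTypes`, and `dataRows` follow the names used in the text; making them public fields initialized inline is an assumption of this sketch.

```csharp
using System;
using System.Collections;

// Marked as serializable so that the formatter can persist it
// field by field, without any XML self-description.
[Serializable]
public class GhostDataTable
{
    public ArrayList colNames = new ArrayList();  // column names (strings)
    public ArrayList colTypes = new ArrayList();  // column type names (strings)
    public ArrayList dataRows = new ArrayList();  // one object[] per row
}
```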
This class consists of three serializable ArrayList objects that contain column names, column types, and data rows.
The serialization process now involves the GhostDataTable class rather than the DataTable object, as shown here:
The key event here is how the DataTable object is mapped to the GhostDataTable class. The mapping takes place in the folds of the CreateTableGraph routine.
Mapping Table Information
The CreateTableGraph routine populates the colNames array with column names and the colTypes array with the names of the data types, as shown in the following code. The dataRows array is filled with an array that represents all the values in the row.
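The mapping, together with the serialization call that uses it, might be sketched as follows. `CreateTableGraph` and the three array names come from the text; the `GhostSerializer` class, the `Serialize` wrapper, and the file-based output are assumptions, and the sketch presumes the table contains no deleted rows (reading `ItemArray` on a deleted row throws).

```csharp
using System;
using System.Collections;
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class GhostDataTable
{
    public ArrayList colNames = new ArrayList();
    public ArrayList colTypes = new ArrayList();
    public ArrayList dataRows = new ArrayList();
}

public static class GhostSerializer
{
    // Copies column names, column type names, and row values
    // from the DataTable into the ghost class.
    public static GhostDataTable CreateTableGraph(DataTable dt)
    {
        GhostDataTable ghost = new GhostDataTable();
        foreach (DataColumn col in dt.Columns)
        {
            ghost.colNames.Add(col.ColumnName);
            ghost.colTypes.Add(col.DataType.FullName);
        }
        foreach (DataRow row in dt.Rows)
        {
            // ItemArray returns all the values of the row as an object[].
            ghost.dataRows.Add(row.ItemArray);
        }
        return ghost;
    }

    // Serializes the ghost class instead of the DataTable itself.
    // Assumes a runtime on which BinaryFormatter is available.
    public static void Serialize(DataTable dt, string outputFile)
    {
        BinaryFormatter bf = new BinaryFormatter();
        using (FileStream fs = new FileStream(outputFile, FileMode.Create))
        {
            bf.Serialize(fs, CreateTableGraph(dt));
        }
    }
}
```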
The DataRow object’s ItemArray property is an array of objects. It turns out to be particularly handy, as it lets you handle the contents of the entire row as a single, monolithic piece of data. Internally, the get accessor of ItemArray is implemented as a simple loop that reads and stores one column after the next. The set accessor is even more valuable, because it automatically groups all the changes in a pair of BeginEdit/EndEdit calls and fires column-changed events as appropriate.
Sizing Up Serialized Data
The sample application shown in Figure 9-9 demonstrates that a DataTable object serialized using a ghost class can be up to 80 percent smaller than an identical object serialized the standard way.
Figure 9-9 The difference between ordinary and custom binary serialization. (Image unavailable)
In particular, consider the DataTable object resulting from the following query:
The table contains five columns and 2155 records. It would take up half a megabyte if serialized to the binary formatter as a DataTable object. By using an intermediate ghost class, the size of the output is 83 percent less. Looking at things the other way round, the output of the standard serialization process is about 490 percent larger than the output you obtain using the ghost class.
Of course, not all cases give you such an impressive result. In all the tests I ran on the Northwind database, however, I got an average 60 percent reduction. The more the table content consists of numbers, the more space you save. The more BLOB fields you have, the less space you save. Try running the following query, in which photo is the BLOB field that contains an employee’s picture:
The ratio of savings here is only 25 percent and represents the bottom end of the Northwind test results. Interestingly, if you add only a couple of traditional fields to the query, the ratio increases to 28 percent. The application shown in Figure 9-9 (included in this book’s sample files) is a useful tool for fine-tuning the structure of the table and the queries for better serialization results.
Once the binary data has been deserialized, you hold an instance of the ghost class that must be transformed back into a usable DataTable object. Here’s how the sample application accomplishes this:
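The rebuilding step can be sketched as shown below, starting from a ghost instance that has already been deserialized. `GhostRebuilder` and `RebuildTable` are hypothetical names, the call to `AcceptChanges` is an assumption of this sketch, and `Type.GetType` is assumed to resolve the stored type names, which holds for core types such as `System.Int32` or `System.String` but not necessarily for types defined in other assemblies.

```csharp
using System;
using System.Collections;
using System.Data;

[Serializable]
public class GhostDataTable
{
    public ArrayList colNames = new ArrayList();
    public ArrayList colTypes = new ArrayList();
    public ArrayList dataRows = new ArrayList();
}

public static class GhostRebuilder
{
    // Builds a new DataTable from the arrays stored in the ghost class.
    public static DataTable RebuildTable(GhostDataTable ghost)
    {
        DataTable dt = new DataTable();

        // Re-create the columns from the saved names and type names.
        for (int i = 0; i < ghost.colNames.Count; i++)
        {
            Type colType = Type.GetType((string) ghost.colTypes[i]);
            dt.Columns.Add(new DataColumn((string) ghost.colNames[i], colType));
        }

        // Re-create the rows; ItemArray sets all the values in one shot.
        foreach (object[] values in ghost.dataRows)
        {
            DataRow row = dt.NewRow();
            row.ItemArray = values;
            dt.Rows.Add(row);
        }

        // Assumption: mark the restored rows as unchanged.
        dt.AcceptChanges();
        return dt;
    }
}
```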
The information stored in the ghost arrays is used to add columns and rows to a newly created DataTable object. Figure 9-9 demonstrates the perfect equivalence of the objects obtained by deserializing a DataTable and a ghost class.
The ghost class used in the preceding sample code serializes the minimal amount of information necessary to rebuild the DataTable object. You should add new properties to track other DataColumn or DataRow properties that are significant in your own application. Note that you can’t simply serialize the DataColumn and DataRow objects as a whole because neither of them is marked as serializable.
The ReadXml method fills a DataSet object by reading from a variety of sources, including disk files, .NET Framework streams, or instances of XmlReader objects. In general, the ReadXml method can process any type of XML file, but of course the nontabular and rather irregularly shaped structure of XML files might create some problems and originate unexpected results when the files are rendered in terms of rows and columns.
In addition, the ReadXml method is extremely flexible and lets you load data according to a particular schema or even infer the schema from the data.
The ReadXml method creates the relational schema for the DataSet object according to the read mode specified and regardless of whether a schema already exists in the DataSet object. The following code snippet is typical code you would use to load a DataSet object from XML:
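A typical load might be sketched like this. The `XmlLoader` class, the `Load` method, and the `modeUsed` out parameter are illustrative assumptions; `ReadXml(TextReader)` and its `XmlReadMode` return value are part of the DataSet API.

```csharp
using System.Data;
using System.IO;

public static class XmlLoader
{
    // Loads an XML file into a new DataSet and reports
    // which read mode ReadXml actually applied.
    public static DataSet Load(string fileName, out XmlReadMode modeUsed)
    {
        DataSet ds = new DataSet();
        using (StreamReader sr = new StreamReader(fileName))
        {
            // No mode is specified, so XmlReadMode.Auto is in effect.
            modeUsed = ds.ReadXml(sr);
        }
        return ds;
    }
}
```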
The return value of the ReadXml method is an XmlReadMode value that indicates the modality used to read the data. This information is particularly important when no reading mode is specified or when the automatic default mode is set. In either case, you don’t really know how the schema for the target DataSet object has been generated.
Modes of Reading
Table 9-4 summarizes the reading options available for use with the ReadXml method; allowable options are grouped in the XmlReadMode enumeration.
Table 9-4 XmlReadMode Enumeration Values
|Read mode|Description|
|---|---|
|Auto|Default option; indicates the most appropriate way of reading by looking at the source data.|
|DiffGram|Reads a DiffGram and adds the data to the current schema. If no schema exists, an exception is thrown. Information that doesn’t match the existing schema is discarded.|
|Fragment|Reads and adds XML fragments until the end of the stream is reached.|
|IgnoreSchema|Ignores any in-line schema that might be available and relies on the DataSet object’s existing schema. If no schema exists, no data is loaded. Information that doesn’t match the existing schema is discarded.|
|InferSchema|Ignores any in-line schema and infers the schema from the XML data. If the DataSet object already contains a schema, the current schema is extended. An exception is thrown in the case of conflicting table namespaces and column data types.|
|ReadSchema|Reads any in-line schema and loads both data and schema. An existing schema is extended with new columns and tables, but an exception is thrown if a given table already exists in the DataSet object.|
The default read mode is XmlReadMode.Auto. When this mode is set, or when no read mode has been explicitly set, the ReadXml method examines the XML source and chooses the most appropriate option.
The first possibility checked is whether the XML data is a DiffGram. If it is, the XmlReadMode.DiffGram mode is used. If the XML data is not a DiffGram but references an XDR or an XSD schema, the InferSchema mode is used. ReadSchema is used only if the document contains an in-line schema. In both the InferSchema and ReadSchema cases, the ReadXml method checks first for an XDR (referenced or in-line) schema and then for an XSD schema. If the DataSet object already has a schema, the read mode is set to IgnoreSchema. Finally, if no schema information can be found, the InferSchema mode is used.
Reading XML Data
Although ReadXml supports various types of sources (streams, files, and text readers), the underlying routine used in all cases reads data using an XML reader. The following pseudocode illustrates the internal architecture of the ReadXml overloads:
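The pattern can be illustrated with a small wrapper in which every overload funnels into the one that takes an XML reader. This is a sketch of the idea only, not the framework's actual source; the `ReadXmlFacade` class and its method signatures are assumptions.

```csharp
using System.Data;
using System.IO;
using System.Xml;

public static class ReadXmlFacade
{
    // Each source type is wrapped in an XmlTextReader...
    public static XmlReadMode ReadXml(DataSet ds, Stream stream)
    {
        return ReadXml(ds, new XmlTextReader(stream));
    }

    public static XmlReadMode ReadXml(DataSet ds, TextReader reader)
    {
        return ReadXml(ds, new XmlTextReader(reader));
    }

    public static XmlReadMode ReadXml(DataSet ds, string fileName)
    {
        return ReadXml(ds, new XmlTextReader(fileName));
    }

    // ...and all of them end up in the overload that works on an XML reader.
    public static XmlReadMode ReadXml(DataSet ds, XmlReader reader)
    {
        return ds.ReadXml(reader);
    }
}
```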
The XML source is read one node after the next until the end is reached. The information read is transformed into a DataRow object that is added to a DataTable object. Of course, the layout of both the DataTable object and the DataRow object is determined based on the schema read or inferred.
Merging DataSet Objects
When loading the contents of XML sources into a DataSet object, the ReadXml method does not merge new and existing rows whose primary key information matches. To merge an existing DataSet object with a DataSet object just loaded from an XML source, you must proceed in a particular way.
First you create a new DataSet object and fill it with the XML data. Next you merge the two objects by calling the Merge method on either object, as shown in the following code. The Merge method is used to merge two DataSet objects that have largely similar schemas.
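A sketch of the two-step procedure follows. The `XmlMerger` class and `MergeFromXml` method are hypothetical names. As a design choice of this sketch, the intermediate DataSet is created with `Clone` and filled in `IgnoreSchema` mode so that its column types and primary keys match those of the target; loading into a blank DataSet would infer string columns, which may not merge cleanly with typed columns.

```csharp
using System.Data;
using System.IO;

public static class XmlMerger
{
    // Loads XML into a separate DataSet and merges it into the target.
    // Rows whose primary keys match existing rows overwrite them.
    public static void MergeFromXml(DataSet target, TextReader xmlSource)
    {
        // Same schema as the target (tables, columns, keys), but no data.
        DataSet loaded = target.Clone();

        // Rely on the cloned schema rather than on inference.
        loaded.ReadXml(xmlSource, XmlReadMode.IgnoreSchema);

        // The target is modified; the loaded DataSet is left untouched.
        target.Merge(loaded);
    }
}
```

Swapping the roles of the two objects (calling `Merge` on the loaded DataSet and passing the existing one) is equally legitimate; which one acts as the target depends on the data the application needs to end up with.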
The target DataSet object is the object on which the merge occurs. The source DataSet object provides the information to merge but is not affected by the operation. Determining which DataSet object must be the target and which will be the source is up to you and depends on the data your application needs to obtain. During the merging, the rows that get overwritten are those with matching primary keys.
An alternative way to merge existing DataSet objects with contents read from XML is through the DiffGram format. Loading a DiffGram using ReadXml will automatically merge rows that have matching primary keys. When using the XmlReadMode.DiffGram format, the target DataSet object must have the same schema as the DiffGram; otherwise, the merge operation fails and an exception is thrown.
Reading In-Line Schemas
The XmlReadMode.ReadSchema option works only with in-line schemas and does not recognize external references to schema files. The ReadSchema mode causes the ReadXml method to add new tables to the DataSet object, but if any tables defined in the in-line schema already exist in the DataSet object, an exception is thrown. You can’t use the ReadSchema option to change the schema of an existing table.
If the DataSet object does not contain a schema (that is, the DataSet object is empty) and there is no in-line schema, no data is read or loaded. ReadXml can read only in-line schemas defined using the XDR or XSD schema. DTD documents are not supported.
Reading External Schemas
An XML source that imports XDR or XSD schema information from an external resource can’t be handled through ReadSchema. External references are resolved through the InferSchema option by inferring the schema from the external file.
The InferSchema option is generally quite slow because it has to determine the structure by reading the source. With externally referenced schemas, however, the procedure is considerably faster. The ReadXml method simply reads the schema information from the given URL in the same way as the ReadXmlSchema method does, and no true inferential process is started.
By design, external schema resolution is implemented in the InferSchema reading mode rather than in ReadSchema. When called to operate in automatic mode on a file that references an external schema, the ReadXml method returns InferSchema. In turn, ReadSchema does not work if called to work on external schemas.
The ReadSchema and InferSchema options are complementary. The former reads only in-line schema and ignores external references. The latter does the reverse, ignoring any in-line schema that might be present in the source.
When the XmlReadMode.Fragment option is set, the DataSet object is loaded from an XML fragment. An XML fragment is a valid piece of XML that identifies elements, attributes, and documents. The XML fragment for an element is the markup text that fully qualifies the XML element (node, CDATA, processing instruction, or comment). The fragment for an attribute is the Value attribute; the fragment for a document is the entire content set.
When the XML data is a fragment, the root level rules for well-formed XML documents are not applied. Fragments that match the existing schema are appended to the appropriate tables, and fragments that do not match the schema are discarded. ReadXml reads from the current position to the end of the stream. The XmlReadMode.Fragment option should not be used to populate an empty, and subsequently schemaless, DataSet object.
In addition to the ReadXml method, you can use the DataSet object’s InferXmlSchema method to load the schema from a specified XML file into the DataSet object. You can control, to some extent, the XML elements processed during the schema inference operation. The signature of the InferXmlSchema method allows you to specify an array of namespaces whose elements will be excluded from inference, as shown here:
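The call can be sketched as follows. `InferXmlSchema` has overloads that take a file name, a stream, a text reader, or an XML reader, each paired with a string array of namespaces to ignore; the `SchemaInferrer` wrapper and its parameter names are assumptions of this sketch.

```csharp
using System.Data;
using System.IO;

public static class SchemaInferrer
{
    // Infers tables and columns from the XML source without loading any data.
    // Elements in the listed namespaces are excluded from inference.
    public static DataSet InferSchemaOnly(TextReader xmlSource, string[] excludedNamespaces)
    {
        DataSet ds = new DataSet();
        ds.InferXmlSchema(xmlSource, excludedNamespaces);
        return ds;
    }
}
```

After the call, the DataSet contains the inferred structure but no rows; a subsequent ReadXml call in IgnoreSchema mode can then load the data against that structure.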
The InferXmlSchema method creates an XML DOM representation of the XML source data and then walks its way through the nodes, creating tables and columns as appropriate.
A Sample Application
To demonstrate the various effects of ReadXml and other reading modes, I’ve created a sample application and a few sample XML documents. Using the application is straightforward. You select an XML file, and the code attempts to load it into a DataSet object using the XmlReadMode option you specify. The results are shown in a DataGrid control. As shown in Figure 9-10, the bottom text box displays the schema of the DataSet object as read or inferred by the reading method.
Figure 9-10 ReadXml correctly recognizes an XML document in ADO.NET normal form. (Image unavailable)
In Figure 9-10, the selected XML document is expressed in the ADO.NET normal form (that is, the default schema generated by WriteXml), and the ReadXml method handles it correctly.
Not all XML sources smoothly fill out a DataSet object, however. Let’s consider what happens with the following XML document:
This document is not in ADO.NET normal form even though it contains information that can easily fit in a table of data. As you can see in Figure 9-11, the .NET Framework inference algorithm identifies three distinct tables in this document: class, days, and day. Although acceptable, this is probably not what one would expect.
Figure 9-11 The schema that ReadXml infers from the specified and nonstandard XML file. (Image unavailable)
I would read this information as a single table, day, contained in a DataSet object. My interpretation is a logical rather than an algorithmic reading of the data, however. The final schema consists of three connected tables, shown in Figure 9-12, of which the first two tables simply contain a foreign key field that normalizes the entire data structure.
Figure 9-12 How Microsoft Visual Studio .NET renders the XML schema inferred by ReadXml. (Image unavailable)
Choosing the Correct Reading Mode
If you save the contents of a DataSet object to XML and then read it back via ReadXml, pay attention to the reading mode you choose. Each reading mode has its own set of features and to the extent that it is possible, you should exploit those features.
Although it is fairly easy to use, the XmlReadMode.Auto mode is certainly not the most effective way to read XML data into a DataSet object. Avoid using this mode as much as possible, and instead use a more direct, and data-specific, option.
Binding XML to Data-Bound Controls
XML data sources are not in the official list of allowable data sources for the .NET Framework data-bound client and server controls. Many .NET Framework classes can be used as data sources, not just those dealing with database contents. In general, any object that exposes the ICollection interface is a potential source for data binding. As a result, you can bind a Microsoft Windows Forms data-bound control or a Web Forms data-bound control to any of the following data structures:
- In-memory .NET Framework collection classes, including arrays, dictionaries, sorted and linked lists, hash tables, stacks, and queues
- User-defined data structures, as long as the structure exposes ICollection or one of its child interfaces, such as IList
- Database-oriented classes such as DataTable and DataSet
- Views of data represented by the DataView class
You can’t directly bind XML documents, however, unless you load XML data in one of the aforementioned classes. Typically, you load XML data into a DataTable or a DataSet object. This operation can be accomplished in a couple of ways. You can load the XML document into a DataSet object using the ReadXml method. Alternatively, you can load the XML document into an instance of the XmlDataDocument class and access the internally created DataSet object.
Our target XML schema for the CSV document would be the following:
Of course, this is not the only schema you can choose. I have chosen it because it is both compact and readable. If you decide to use another schema, the code for the reader should be changed accordingly. The target XML schema is a crucial aspect, as it specifies how the Read method should be implemented. Figure 9-13 illustrates the behavior of the Read method.
Figure 9-13 The process of returning an XML schema for a CSV file. (Image unavailable)
The reader tracks the current node and sets internal variables to influence the next node to be returned. For example, when returning an Element node, the reader annotates that there’s an open node to close. Given this extremely simple schema, a Boolean member is enough to implement this behavior. In fact, no embedded nodes are allowed in a CSV file. In more complex scenarios, you might want to use a stack object.
The Read Method
When a new node is returned, the reader updates the node’s depth and state. In addition, the reader stores fresh information in node-specific properties such as Name, NodeType, and Value, as shown here:
For example, when the start tag of a new element is returned, the following code runs:
When traversing a document using an XML reader, the ReadXml method visits attributes in a loop and reads attribute values using ReadAttributeValue.
Attributes are not read through calls made to the Read method. A reader provides ad hoc methods to access attributes either randomly or sequentially. When one of these methods is called (say, MoveToNextAttribute), the reader calls an internal method that refreshes the state so that Name and NodeType can now point to the correct content, as shown here:
A node is associated with a line of text read from the CSV file. Each token of information becomes an attribute, and attributes are stored in a collection of name/value pairs. (This part of the architecture was described in detail in Chapter 2.) The m_parentNode property tracks the name of the element acting as the parent of the current attribute. Basically, it represents the node to move to when MoveToElement is called. Again, in this rather simple scenario, a string is sufficient to identify the parent node of an attribute. For more complex XML layouts, you might need to use a custom class.
Reading Attributes Using ReadXml
The ReadXml method accesses all the attributes of an element using a loop like this:
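The shape of such a loop can be sketched with the public XmlReader API. This illustrates the access pattern only and is not the loader's actual code; the `AttributeWalker` class and the choice of a Hashtable to collect the values are assumptions.

```csharp
using System.Collections;
using System.Text;
using System.Xml;

public static class AttributeWalker
{
    // Visits all attributes of the current element and collects their values.
    // The reader is expected to be positioned on an element node.
    public static Hashtable ReadAttributes(XmlReader reader)
    {
        Hashtable values = new Hashtable();

        // From an element node, MoveToNextAttribute moves to the first attribute.
        while (reader.MoveToNextAttribute())
        {
            string name = reader.Name;
            StringBuilder text = new StringBuilder();

            // The value is read as a sequence of child nodes of the attribute.
            while (reader.ReadAttributeValue())
            {
                if (reader.NodeType == XmlNodeType.Text)
                    text.Append(reader.Value);
            }
            values[name] = text.ToString();
        }

        // Return to the element that owns the attributes.
        reader.MoveToElement();
        return values;
    }
}
```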
To load XML data into a DataSet object, the ReadXml method uses an XML loader class that basically reads the source and builds an XmlDocument object. This document is then parsed, and DataRow and DataTable objects are created and added to the target DataSet object. While building the temporary XmlDocument object, the loader scrolls attributes using MoveToNextAttribute and reads values using ReadAttributeValue.
ReadAttributeValue does not really return the value of the current attribute. This method, in fact, simply returns a Boolean value indicating whether there’s more to read about the attribute. By using ReadAttributeValue, however, you can read through the text and entity reference nodes that make up the attribute value. Let’s say that this is a more general way to read the content of an attribute; certainly, it is the method that ReadXml uses indirectly. To let ReadXml read the value of an attribute, you must provide a significant implementation for ReadAttributeValue. In particular, if the current node is an attribute, your implementation should set the new node type to XmlNodeType.Text, increase the depth by 1, and return true.
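The state change described here can be sketched in isolation. The `AttributeValueState` class below is a hypothetical stand-in for the state a custom reader would hold (a real reader would derive from XmlReader and override `ReadAttributeValue`); the field names and initial values are assumptions.

```csharp
using System.Xml;

// Hypothetical stand-in for the node state tracked by a custom reader.
public class AttributeValueState
{
    public XmlNodeType NodeType = XmlNodeType.Attribute;
    public int Depth = 1;

    // Mirrors the behavior described in the text: when positioned on an
    // attribute, switch to a Text node one level deeper and report success.
    public bool ReadAttributeValue()
    {
        if (NodeType == XmlNodeType.Attribute)
        {
            NodeType = XmlNodeType.Text;
            Depth++;
            return true;
        }

        // Already on the Text node (or elsewhere): nothing more to read.
        return false;
    }
}
```

Returning false on the second call is what lets a caller's `while (reader.ReadAttributeValue())` loop terminate after the single Text node has been consumed.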
ReadAttributeValue parses the attribute value into one or more Text, EntityReference, or EndEntity nodes. This means that the XML loader won’t be able to read the value unless you explicitly set the node type to Text. (We don’t support references in our sample CSV reader.) At this point, the loader will ask the reader for the value of a node of type Text. Our implementation of the Value property does not distinguish between node types, but assumes that Read and other move methods (for example, MoveToNextAttribute) have already stored the correct value in Value. This is just what happens. In fact, the attribute value is read and stored in Value right after positioning on the attribute, before ReadAttributeValue is called. In other cases, you might want to check the node type in the Value property’s get accessor prior to returning a value.
In general, understanding the role of ReadAttributeValue and integrating this method with the rest of the code is key to writing effective custom readers. Nevertheless, as you saw in Chapter 2, if you don’t care about ReadXml support, you can write XML readers even simpler than this. But the specialness of an XML reader is precisely that you can use it with any method that accepts an XML reader! So dropping the support for the DataSet object’s ReadXml method would be a significant loss.
How ReadXml works with custom readers is in no way different from the way it works with system-provided XML readers. However, understanding how ReadXml works with XML readers can help you to build effective and functional custom XML readers.
There are several ways to persist a DataSet object’s contents. You can create a snapshot of the currently stored data using a standard layout referred to here as the ADO.NET normal form. This data format can include schema information or not. Saving to the ADO.NET normal form does not preserve the state of the DataSet object and discards any information about the previous state of each row. If you want stateful persistence, resort to the DiffGram XML format. DiffGrams are the subject of Chapter 10.
In this chapter, we also examined how ADO.NET objects integrate with the standard .NET Framework run-time serialization mechanism. DataSet and DataTable objects always expose themselves to data formatters as XML DiffGrams, thus resulting in larger output files. We looked at a technique for reducing the size of the serialized data by as much as 80 percent.
In ADO.NET, the deserialization process is tightly coupled with the inference engine, which basically attempts to algorithmically extract the layout of the XML stream. When loading XML into a DataSet object, the inference engine is involved more frequently than not. Because it is not a lightweight piece of code, you should always opt for a clear and effective reading mode and use the inference engine only when absolutely necessary.
As mentioned, in the next chapter we’ll tackle a very special XML serialization format: the DiffGram. Among other things, the DiffGram format is the format used to deliver DataSet objects to other platforms through Web services. It is also ideal for setting up intermittent applications, that is, applications that can work both connected to and disconnected from the system.
It’s more difficult to locate a book that provides thorough coverage of object serialization. Chapter 11 in Programming Microsoft Visual Basic .NET, Core Reference, by Francesco Balena (Microsoft Press, 2002), is an excellent and self-contained reference. If you want a shorter but complete overview, have a look at the following online article: http://msdn.microsoft.com/library/en-us/dnadvnet/html/vbnet09252001.asp.
Reproduced from Applied XML Programming for Microsoft .NET by permission of Microsoft Press. ISBN 0735618011, copyright 2002. All rights reserved.