Loading DataSet Objects from XML
The contents of an ADO.NET DataSet object can be loaded from an XML stream or document (for example, from an XML stream previously created using the WriteXml method). To fill a DataSet object with XML data, you use the ReadXml method of the class.
The ReadXml method fills a DataSet object by reading from a variety of sources, including disk files, .NET Framework streams, and instances of XmlReader objects. In general, the ReadXml method can process any type of XML file, but the nontabular and often irregular structure of XML files can produce unexpected results when the content is rendered in terms of rows and columns.
In addition, the ReadXml method is extremely flexible and lets you load data according to a particular schema or even infer the schema from the data.
Building DataSet Objects
The ReadXml method has several overloads, all of which are similar. They take the XML source plus an optional XmlReadMode value as arguments, as shown here:
public XmlReadMode ReadXml(Stream, XmlReadMode);
public XmlReadMode ReadXml(string, XmlReadMode);
public XmlReadMode ReadXml(TextReader, XmlReadMode);
public XmlReadMode ReadXml(XmlReader, XmlReadMode);
The ReadXml method creates the relational schema for the DataSet object according to the read mode specified and regardless of whether a schema already exists in the DataSet object. The following code snippet is typical code you would use to load a DataSet object from XML:
StreamReader sr = new StreamReader(fileName);
DataSet ds = new DataSet();
ds.ReadXml(sr);
sr.Close();
The return value of the ReadXml method is an XmlReadMode value that indicates the mode actually used to read the data. This information is particularly important when no reading mode is specified or when the automatic default mode is set. In either case, you don't otherwise know how the schema for the target DataSet object has been generated.
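As a minimal sketch of inspecting that return value, the following program loads two in-memory documents with no explicit mode and prints the mode ReadXml reports. The class name ReadModeDemo, the helper names, and the sample XML are made up for illustration; only the DataSet calls come from the framework.

```csharp
using System;
using System.Data;
using System.IO;

static class ReadModeDemo
{
    // Loads the XML text into a fresh, schemaless DataSet and returns the mode ReadXml chose.
    public static XmlReadMode LoadAndReport(string xml)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            return ds.ReadXml(reader);   // no mode given, so XmlReadMode.Auto applies
        }
    }

    // Serializes a one-row DataSet together with its in-line XSD schema.
    public static string XmlWithInlineSchema()
    {
        DataSet ds = new DataSet("items");
        DataTable t = ds.Tables.Add("item");
        t.Columns.Add("id", typeof(int));
        t.Rows.Add(1);
        StringWriter writer = new StringWriter();
        ds.WriteXml(writer, XmlWriteMode.WriteSchema);
        return writer.ToString();
    }

    static void Main()
    {
        string plain = "<items><item><id>1</id></item></items>";
        Console.WriteLine(LoadAndReport(plain));                 // InferSchema
        Console.WriteLine(LoadAndReport(XmlWithInlineSchema())); // ReadSchema
    }
}
```

Both calls start from an empty DataSet, so the outcome depends only on whether the document carries an in-line schema.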
Modes of Reading
Table 9-4 summarizes the reading options available for use with the ReadXml method; allowable options are grouped in the XmlReadMode enumeration.
Table 9-4 XmlReadMode Enumeration Values
|Read Mode|Description|
|Auto|Default option; chooses the most appropriate way of reading by looking at the source data.|
|DiffGram|Reads a DiffGram and adds the data to the current schema. If no schema exists, an exception is thrown. Information that doesn't match the existing schema is discarded.|
|Fragment|Reads and adds XML fragments until the end of the stream is reached.|
|IgnoreSchema|Ignores any in-line schema that might be available and relies on the DataSet object's existing schema. If no schema exists, no data is loaded. Information that doesn't match the existing schema is discarded.|
|InferSchema|Ignores any in-line schema and infers the schema from the XML data. If the DataSet object already contains a schema, the current schema is extended. An exception is thrown in the case of conflicting table namespaces and column data types.|
|ReadSchema|Reads any in-line schema and loads both data and schema. An existing schema is extended with new columns and tables, but an exception is thrown if a given table already exists in the DataSet object.|
The default read mode is XmlReadMode.Auto. When this mode is set, or when no read mode has been explicitly set, the ReadXml method examines the XML source and chooses the most appropriate option.
The first possibility checked is whether the XML data is a DiffGram. If it is, the XmlReadMode.DiffGram mode is used. If the XML data is not a DiffGram but references an XDR or an XSD schema, the InferSchema mode is used. ReadSchema is used only if the document contains an in-line schema. In both the InferSchema and ReadSchema cases, the ReadXml method checks first for an XDR (referenced or in-line) schema and then for an XSD schema. If the DataSet object already has a schema, the read mode is set to IgnoreSchema. Finally, if no schema information can be found, the InferSchema mode is used.
Reading XML Data
Although ReadXml supports various types of sources (streams, files, and text readers), the underlying routine used in all cases reads data using an XML reader. The following pseudocode illustrates the internal architecture of the ReadXml overloads:
public XmlReadMode ReadXml(Stream stream)
{
    return ReadXml(new XmlTextReader(stream));
}
public XmlReadMode ReadXml(TextReader reader)
{
    return ReadXml(new XmlTextReader(reader));
}
public XmlReadMode ReadXml(string fileName)
{
    return ReadXml(new XmlTextReader(fileName));
}
The XML source is read one node after the next until the end is reached. The information read is transformed into a DataRow object that is added to a DataTable object. Of course, the layout of both the DataTable object and the DataRow object is determined based on the schema read or inferred.
Merging DataSet Objects
When loading the contents of XML sources into a DataSet object, the ReadXml method does not merge new and existing rows whose primary key information matches. To merge an existing DataSet object with a DataSet object just loaded from an XML source, you must proceed in a particular way.
First you create a new DataSet object and fill it with the XML data. Next you merge the two objects by calling the Merge method on either object, as shown in the following code. The Merge method is used to merge two DataSet objects that have largely similar schemas.
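The two steps might look like the following sketch. It assumes both DataSet objects share one schema (obtained here with Clone) and that the table has a primary key; the class name MergeDemo, the item table, and the sample rows are hypothetical.

```csharp
using System;
using System.Data;
using System.IO;

static class MergeDemo
{
    // Builds the "existing" DataSet: one table keyed on the id column.
    public static DataSet CreateExisting()
    {
        DataSet ds = new DataSet("items");
        DataTable t = ds.Tables.Add("item");
        DataColumn id = t.Columns.Add("id", typeof(string));
        t.Columns.Add("name", typeof(string));
        t.PrimaryKey = new DataColumn[] { id };
        t.Rows.Add("1", "old");
        ds.AcceptChanges();
        return ds;
    }

    public static DataSet MergeFromXml(DataSet existing, string xml)
    {
        // Step 1: fill a second DataSet with the XML data, reusing the existing schema.
        DataSet loaded = existing.Clone();          // schema only, no rows
        using (StringReader reader = new StringReader(xml))
        {
            loaded.ReadXml(reader, XmlReadMode.IgnoreSchema);
        }
        loaded.AcceptChanges();

        // Step 2: merge. Here 'existing' is the target and 'loaded' is the source.
        existing.Merge(loaded);
        return existing;
    }

    static void Main()
    {
        string xml =
            "<items>" +
            "<item><id>1</id><name>new</name></item>" +
            "<item><id>2</id><name>added</name></item>" +
            "</items>";
        DataTable t = MergeFromXml(CreateExisting(), xml).Tables["item"];
        Console.WriteLine(t.Rows.Count);             // 2
        Console.WriteLine(t.Rows.Find("1")["name"]); // new
    }
}
```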
The target DataSet object is the object on which the merge occurs. The source DataSet object provides the information to merge but is not affected by the operation. Determining which DataSet object must be the target and which will be the source is up to you and depends on the data your application needs to obtain. During the merging, the rows that get overwritten are those with matching primary keys.
An alternative way to merge existing DataSet objects with contents read from XML is through the DiffGram format. Loading a DiffGram using ReadXml will automatically merge rows that have matching primary keys. When using the XmlReadMode.DiffGram format, the target DataSet object must have the same schema as the DiffGram; otherwise, the merge operation fails and an exception is thrown.
Reading Schema Information
The XmlReadMode.IgnoreSchema option causes the ReadXml method to ignore any referenced or in-line schema. The data is loaded into the existing DataSet schema, and any data that does not fit is discarded. If no schema exists in the DataSet object, no data will be loaded. Of course, an empty DataSet object has no schema information, as shown in the following listing. If the XML source is in the DiffGram format, the IgnoreSchema option has the same effect as XmlReadMode.DiffGram.
// No schema in the DataSet, no data will be loaded
DataSet ds = new DataSet();
StreamReader sr = new StreamReader(fileName);
ds.ReadXml(sr, XmlReadMode.IgnoreSchema);
sr.Close();
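The discarding behavior can be checked with a small in-memory sketch. It assumes a hypothetical item table whose schema defines id but not note; the class name IgnoreSchemaDemo and the sample XML are made up for illustration.

```csharp
using System;
using System.Data;
using System.IO;

static class IgnoreSchemaDemo
{
    const string Xml =
        "<items><item><id>1</id><note>extra</note></item></items>";

    // Reads Xml into the given DataSet with IgnoreSchema and returns the DataSet.
    public static DataSet Load(DataSet target)
    {
        using (StringReader reader = new StringReader(Xml))
        {
            target.ReadXml(reader, XmlReadMode.IgnoreSchema);
        }
        return target;
    }

    // A DataSet whose schema knows about item.id but not item.note.
    public static DataSet WithPartialSchema()
    {
        DataSet ds = new DataSet("items");
        ds.Tables.Add("item").Columns.Add("id", typeof(string));
        return ds;
    }

    static void Main()
    {
        DataSet empty = Load(new DataSet());
        Console.WriteLine(empty.Tables.Count);            // 0: no schema, nothing loaded

        DataTable item = Load(WithPartialSchema()).Tables["item"];
        Console.WriteLine(item.Rows.Count);               // 1
        Console.WriteLine(item.Columns.Contains("note")); // False: the note element was discarded
    }
}
```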
Reading In-Line Schemas
The XmlReadMode.ReadSchema option works only with in-line schemas and does not recognize external references to schema files. The ReadSchema mode causes the ReadXml method to add new tables to the DataSet object, but if any tables defined in the in-line schema already exist in the DataSet object, an exception is thrown. You can't use the ReadSchema option to change the schema of an existing table.
If the DataSet object does not contain a schema (that is, the DataSet object is empty) and there is no in-line schema, no data is read or loaded. ReadXml can read only in-line schemas defined using the XDR or XSD schema language. DTD documents are not supported.
Reading External Schemas
An XML source that imports XDR or XSD schema information from an external resource can't be handled through ReadSchema. External references are resolved through the InferSchema option by inferring the schema from the external file.
The InferSchema option is generally quite slow because it has to determine the structure by reading the source. With externally referenced schemas, however, the procedure is considerably faster. The ReadXml method simply reads the schema information from the given URL in the same way as the ReadXmlSchema method does; no true inferential process is started.
By design, external schema resolution is implemented in the InferSchema reading mode rather than in ReadSchema. When called to operate in automatic mode on a file that references an external schema, the ReadXml method returns InferSchema. In turn, ReadSchema does not work if called to work on external schemas.
The ReadSchema and InferSchema options are complementary. The former reads only in-line schema and ignores external references. The latter does the reverse, ignoring any in-line schema that might be present in the source.
When the XmlReadMode.Fragment option is set, the DataSet object is loaded from an XML fragment. An XML fragment is a valid piece of XML that identifies elements, attributes, and documents. The XML fragment for an element is the markup text that fully qualifies the XML element (node, CDATA, processing instruction, or comment). The fragment for an attribute is the Value attribute; the fragment for a document is the entire content set.
When the XML data is a fragment, the root level rules for well-formed XML documents are not applied. Fragments that match the existing schema are appended to the appropriate tables, and fragments that do not match the schema are discarded. ReadXml reads from the current position to the end of the stream. The XmlReadMode.Fragment option should not be used to populate an empty, and subsequently schemaless, DataSet object.
Inferring Schema Information
When the ReadXml method works with the XmlReadMode.InferSchema option set, the data is loaded only after the schema has been completely read from an external source or after the schema has been inferred. Existing schemas are extended by adding new tables or by adding new columns to existing tables, as appropriate.
In addition to the ReadXml method, you can use the DataSet object's InferXmlSchema method to load the schema from a specified XML file into the DataSet object. You can control, to some extent, the XML elements processed during the schema inference operation. The signature of the InferXmlSchema method allows you to specify an array of namespaces whose elements will be excluded from inference, as shown here:
void InferXmlSchema(String fileName, String[] rgNamespace);
The InferXmlSchema method creates an XML DOM representation of the XML source data and then walks its way through the nodes, creating tables and columns as appropriate.
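A minimal sketch of the namespace exclusion follows. It uses the TextReader overload of InferXmlSchema rather than the file-name overload shown above; the class name InferDemo, the namespace URI urn:example-ignore, and the sample XML are invented for the example.

```csharp
using System;
using System.Data;
using System.IO;

static class InferDemo
{
    const string Xml =
        "<items xmlns:x='urn:example-ignore'>" +
        "<item x:flag='1'><id>1</id></item>" +
        "</items>";

    // Infers a schema from Xml, skipping content in the listed namespaces.
    public static DataSet Infer(string[] excludedNamespaces)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(Xml))
        {
            ds.InferXmlSchema(reader, excludedNamespaces);
        }
        return ds;
    }

    static void Main()
    {
        DataTable item = Infer(new string[] { "urn:example-ignore" }).Tables["item"];
        Console.WriteLine(item.Columns.Contains("id"));   // True
        Console.WriteLine(item.Columns.Contains("flag")); // False: excluded namespace
        Console.WriteLine(item.Rows.Count);               // 0: only the schema is loaded
    }
}
```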
A Sample Application
To demonstrate the various effects of ReadXml and other reading modes, I've created a sample application and a few sample XML documents. Using the application is straightforward. You select an XML file, and the code attempts to load it into a DataSet object using the XmlReadMode option you specify. The results are shown in a DataGrid control. As shown in Figure 9-10, the bottom text box displays the schema of the DataSet object as read or inferred by the reading method.
Figure 9-10 ReadXml correctly recognizes an XML document in ADO.NET normal form. (Image unavailable)
In Figure 9-10, the selected XML document is expressed in the ADO.NET normal form (that is, the default schema generated by WriteXml), and the ReadXml method handles it correctly.
Not all XML sources smoothly fill out a DataSet object, however. Let's consider what happens with the following XML document:
<?xml version="1.0" ?>
<class title="Programming XML.NET" company="Wintellect" author="DinoE">
  <days total="4" expandable="true">
    <day id="1">XML Core Classes</day>
    <day id="2">XML-related Technologies</day>
    <day id="3">XML and ADO.NET</day>
    <day id="4">Remoting and Web services</day>
    <day id="5" optional="true">Miscellaneous and Samples</day>
  </days>
</class>
This document is not in ADO.NET normal form even though it contains information that can easily fit in a table of data. As you can see in Figure 9-11, the .NET Framework inference algorithm identifies three distinct tables in this document: class, days, and day. Although acceptable, this is probably not what one would expect.
Figure 9-11 The schema that ReadXml infers from the specified and nonstandard XML file. (Image unavailable)
I would read this information as a single table, day, contained in a DataSet object. My interpretation is a logical rather than an algorithmic reading of the data, however. The final schema consists of three connected tables, shown in Figure 9-12, of which the first two tables simply contain a foreign key field that normalizes the entire data structure.
Figure 9-12 How Microsoft Visual Studio .NET renders the XML schema inferred by ReadXml. (Image unavailable)
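To inspect the inferred tables without the sample application, a sketch along these lines can be used. It loads an abbreviated copy of the document (two day elements instead of five) from a string and lists each table with its columns; the class name InferenceShapeDemo is made up.

```csharp
using System;
using System.Data;
using System.IO;

static class InferenceShapeDemo
{
    // Abbreviated version of the sample document: same shape, fewer day elements.
    const string Xml =
        "<class title='Programming XML.NET' company='Wintellect' author='DinoE'>" +
        "<days total='4' expandable='true'>" +
        "<day id='1'>XML Core Classes</day>" +
        "<day id='2'>XML-related Technologies</day>" +
        "</days>" +
        "</class>";

    public static DataSet Load()
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(Xml))
        {
            ds.ReadXml(reader, XmlReadMode.InferSchema);
        }
        return ds;
    }

    static void Main()
    {
        // Print one line per inferred table, followed by its column names.
        foreach (DataTable t in Load().Tables)
        {
            Console.Write(t.TableName + ":");
            foreach (DataColumn c in t.Columns)
                Console.Write(" " + c.ColumnName);
            Console.WriteLine();
        }
    }
}
```

The output lists the class, days, and day tables, including the key columns that the inference engine adds to relate them.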
Choosing the Correct Reading Mode
If you save the contents of a DataSet object to XML and then read it back via ReadXml, pay attention to the reading mode you choose. Each reading mode has its own set of features and to the extent that it is possible, you should exploit those features.
Although it is fairly easy to use, the XmlReadMode.Auto mode is certainly not the most effective way to read XML data into a DataSet object. Avoid using this mode as much as possible, and instead use a more direct, and data-specific, option.
Binding XML to Data-Bound Controls:
XML data sources are not in the official list of allowable data sources for the .NET Framework data-bound client and server controls. Many .NET Framework classes can be used as data sources, not just those dealing with database contents. In general, any object that exposes the ICollection interface is a potential source for data binding. As a result, you can bind a Microsoft Windows Forms data-bound control or a Web Forms data-bound control to any of the following data structures:
- In-memory .NET Framework collection classes, including arrays, dictionaries, sorted and linked lists, hash tables, stacks, and queues
- User-defined data structures, as long as the structure exposes ICollection or one of its child interfaces, such as IList
- Database-oriented classes such as DataTable and DataSet
- Views of data represented by the DataView class
You can't directly bind XML documents, however, unless you load XML data in one of the aforementioned classes. Typically, you load XML data into a DataTable or a DataSet object. This operation can be accomplished in a couple of ways. You can load the XML document into a DataSet object using the ReadXml method. Alternatively, you can load the XML document into an instance of the XmlDataDocument class and access the internally created DataSet object.
Loading from Custom Readers
In Chapter 2, we built a custom XML reader for loading CSV files into a DataTable object. As mentioned, however, that reader is not fully functional and does not work through ReadXml. Let's see how to rewrite the class to make it render the CSV content as a well-formed XML document.
Our target XML schema for the CSV document would be the following:
Of course, this is not the only schema you can choose. I have chosen it because it is both compact and readable. If you decide to use another schema, the code for the reader should be changed accordingly. The target XML schema is a crucial aspect, as it specifies how the Read method should be implemented. Figure 9-13 illustrates the behavior of the Read method.
Figure 9-13 The process of returning an XML schema for a CSV file. (Image unavailable)
The reader tracks the current node and sets internal variables to influence the next node to be returned. For example, when returning an Element node, the reader annotates that there's an open node to close. Given this extremely simple schema, a Boolean member is enough to implement this behavior. In fact, no embedded nodes are allowed in a CSV file. In more complex scenarios, you might want to use a stack object.
The Read Method
When a new node is returned, the reader updates the node's depth and state. In addition, the reader stores fresh information in node-specific properties such as Name, NodeType, and Value, as shown here:
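public override bool Read()
{
    if (m_readState == ReadState.Initial)
    {
        // m_hasColumnHeaders is an assumed flag; the original condition is not shown
        if (m_hasColumnHeaders)
        {
            string m_headerLine = m_fileStream.ReadLine();
            m_headerValues = m_headerLine.Split(',');
        }
        m_readState = ReadState.Interactive;
        return true;
    }
    if (m_readState != ReadState.Interactive)
        return false;

    // Return an end tag if there's one opened
    // (SetupEndElement is an assumed helper name that also resets m_mustCloseRow)
    if (m_mustCloseRow) { SetupEndElement(); return true; }

    // Return an end tag if the document must be closed
    if (m_mustCloseDocument) { m_readState = ReadState.EndOfFile; return true; }

    // Open a new tag
    m_currentLine = m_fileStream.ReadLine();
    if (m_currentLine != null)
        m_readState = ReadState.Interactive;
    else { m_mustCloseDocument = true; return Read(); }

    // Populate the internal structure representing the current element
    m_tokenValues.Clear();
    string[] tokens = m_currentLine.Split(',');
    for (int i=0; i<tokens.Length; i++)
    {
        string key = "";
        if (m_hasColumnHeaders) key = m_headerValues[i].ToString();
        else key = CsvColumnPrefix + i.ToString();
        m_tokenValues.Add(key, tokens[i]);
    }
    SetupElement();
    return true;
}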
For example, when the start tag of a new element is returned, the following code runs:
private void SetupElement()
{
    m_isRoot = false;
    m_mustCloseRow = true;
    m_mustCloseDocument = false;
    m_name = CsvRowName;
    m_nodeType = XmlNodeType.Element;
    m_depth = 1;
    m_value = null;
    // Reset the attribute index
    m_currentAttributeIndex = -1;
}
When traversing a document using an XML reader, the ReadXml method visits attributes in a loop and reads attribute values using ReadAttributeValue.
Attributes are not read through calls made to the Read method. A reader provides ad hoc methods to access attributes either randomly or sequentially. When one of these methods is called (say, MoveToNextAttribute), the reader calls an internal method that refreshes the state so that Name and NodeType can now point to the correct content, as shown here:
private void SetupAttribute()
{
    m_nodeType = XmlNodeType.Attribute;
    m_name = m_tokenValues.Keys[m_currentAttributeIndex];
    m_value = m_tokenValues[m_currentAttributeIndex].ToString();
    if (m_parentNode == "")
        m_parentNode = m_name;
}
A node is associated with a line of text read from the CSV file. Each token of information becomes an attribute, and attributes are stored in a collection of name/value pairs. (This part of the architecture was described in detail in Chapter 2.) The m_parentNode property tracks the name of the element acting as the parent of the current attribute. Basically, it represents the node to move to when MoveToElement is called. Again, in this rather simple scenario, a string is sufficient to identify the parent node of an attribute. For more complex XML layouts, you might need to use a custom class.
Reading Attributes Using ReadXml
The ReadXml method accesses all the attributes of an element using a loop like this:
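A minimal sketch of such a loop, written against the public XmlReader API rather than the internal loader code, might look like this. The class name AttributeLoopDemo and the row element with its col1 and col2 attributes are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class AttributeLoopDemo
{
    // Collects name=value pairs for every attribute of the first element in the XML text.
    public static List<string> ReadAttributes(string xml)
    {
        List<string> result = new List<string>();
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xml)))
        {
            reader.MoveToContent();                 // position on the first element
            while (reader.MoveToNextAttribute())
            {
                string name = reader.Name;
                StringBuilder value = new StringBuilder();
                // Each call positions the reader on one piece of the attribute value
                while (reader.ReadAttributeValue())
                {
                    if (reader.NodeType == XmlNodeType.Text)
                        value.Append(reader.Value);
                }
                result.Add(name + "=" + value);
            }
            reader.MoveToElement();                 // back to the owning element
        }
        return result;
    }

    static void Main()
    {
        foreach (string pair in ReadAttributes("<row col1='a' col2='b' />"))
            Console.WriteLine(pair);                // col1=a, then col2=b
    }
}
```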
To load XML data into a DataSet object, the ReadXml method uses an XML loader class that basically reads the source and builds an XmlDocument object. This document is then parsed, and DataRow and DataTable objects are created and added to the target DataSet object. While building the temporary XmlDocument object, the loader scrolls attributes using MoveToNextAttribute and reads values using ReadAttributeValue.
ReadAttributeValue does not really return the value of the current attribute. This method, in fact, simply returns a Boolean value indicating whether there's more to read about the attribute. By using ReadAttributeValue, however, you can read through the text and entity reference nodes that make up the attribute value. Let's say that this is a more general way to read the content of an attribute; certainly, it is the method that ReadXml uses indirectly. To let ReadXml read the value of an attribute, you must provide a significant implementation for ReadAttributeValue. In particular, if the current node is an attribute, your implementation should set the new node type to XmlNodeType.Text, increase the depth by 1, and return true.
public override bool ReadAttributeValue()
{
    if (m_nodeType == XmlNodeType.Attribute)
    {
        m_nodeType = XmlNodeType.Text;
        m_depth++;
        return true;
    }
    return false;
}
ReadAttributeValue parses the attribute value into one or more Text, EntityReference, or EndEntity nodes. This means that the XML loader won't be able to read the value unless you explicitly set the node type to Text. (We don't support references in our sample CSV reader.) At this point, the loader will ask the reader for the value of a node of type Text. Our implementation of the Value property does not distinguish between node types, but assumes that Read and other move methods (for example, MoveToNextAttribute) have already stored the correct value in Value. This is just what happens. In fact, the attribute value is read and stored in Value right after positioning on the attribute, before ReadAttributeValue is called. In other cases, you might want to check the node type in the Value property's get accessor prior to returning a value.
In general, understanding the role of ReadAttributeValue and integrating this method with the rest of the code is key to writing effective custom readers. Nevertheless, as you saw in Chapter 2, if you don't care about ReadXml support, you can write XML readers even simpler than this. But the real value of an XML reader is precisely that you can use it with any method that accepts an XML reader. So dropping the support for the DataSet object's ReadXml method would be a significant loss.
The way ReadXml works with custom readers is in no way different from the way it works with system-provided XML readers. However, understanding how ReadXml works with XML readers can help you to build effective and functional custom XML readers.
In ADO.NET, XML is much more than a simple output format for serializing data. You can use XML to streamline the entire contents of a DataSet object, but you can also choose the actual XML schema and control the structure of the resulting XML document.
There are several ways to persist a DataSet object's contents. You can create a snapshot of the currently stored data using a standard layout referred to here as the ADO.NET normal form. This data format can include schema information or not. Saving to the ADO.NET normal form does not preserve the state of the DataSet object and discards any information about the previous state of each row. If you want stateful persistence, resort to the DiffGram XML format. DiffGrams are the subject of Chapter 10.
In this chapter, we also examined how ADO.NET objects integrate with the standard .NET Framework run-time serialization mechanism. DataSet and DataTable objects always expose themselves to data formatters as XML DiffGrams, thus resulting in larger output files. We looked at a technique for reducing the size of the serialized data by as much as a factor of five.
In ADO.NET, the deserialization process is tightly coupled with the inference engine, which basically attempts to algorithmically extract the layout of the XML stream. When loading XML into a DataSet object, the inference engine is involved more frequently than not. Because it is not a lightweight piece of code, you should always opt for a clear and effective reading mode and use the inference engine only when absolutely necessary.
As mentioned, in the next chapter we'll tackle a very special XML serialization format: the DiffGram. Among other things, the DiffGram format is the format used to deliver DataSet objects to other platforms through Web services. It is also ideal for setting up intermittent applications, that is, applications that can work both connected to and disconnected from the system.
Object serialization and ADO.NET are the key topics of this chapter. You'll find a lot of books out there covering ADO.NET from various perspectives. I recommend Microsoft ADO.NET, Core Reference, by David Sceppa (Microsoft Press, 2002).
It's more difficult to locate a book that provides thorough coverage of object serialization. Chapter 11 in Programming Microsoft Visual Basic .NET, Core Reference, by Francesco Balena (Microsoft Press, 2002), is an excellent and self-contained reference. If you want a shorter but complete overview, have a look at the following online article: http://msdn.microsoft.com/library/en-us/dnadvnet/html/vbnet09252001.asp.
Reproduced from Applied XML Programming for Microsoft .NET by permission of Microsoft Press. ISBN 0735618011, copyright 2002. All rights reserved.