Working with Microsoft Office Word 2003’s XML

The .doc file format that is still present in Word 2003 is essentially a proprietary binary format, and extracting information from .doc files is difficult. By saving documents in the new XML format, you can retrieve information trapped inside Word 2003 documents by using little more than XPath queries.

New features included in Word 2003 also allow you to have users enter data into an XML document without needing to know anything about XML. Essentially, you can annotate a document with an XML schema and then protect the document, only allowing the user to add or edit information in specific locations throughout the document. This way, when the user saves the document, the data is written directly to an XML document, allowing it to be easily consumed by another application or a database.

Another cool idea for using XML with Word 2003 documents is transforming the XML into other formats. As of this writing, Microsoft provides an XSLT style sheet that takes a Word 2003 XML document and transforms it into an HTML document for viewing in a Web browser. Of course, my first reaction to this was “What good is that? I can save a document as HTML, right?” Then I realized that by designing my own XSLT I have complete control over this transformation, unlike the “Save as HTML” functionality from previous versions.

But these ideas are outside the topic of this article, which is focused on the ability to manipulate a Word 2003 document (saved as XML) from within code. Before Word 2003, all you could effectively do was to either use automation or to be really handy with the RTF format (and open the RTF using Word). With the ability of Word 2003 to both save as and read from XML, you can create sophisticated Word 2003 documents by processing and manipulating XML.

If you’re not sure why you might try something like this, here are a few ideas:

  • You can create documents from data within an application, such as form letters.
  • You can send Word 2003 documents to a client workstation over the Internet as XML and have it correctly interpreted at the client workstation as a Word 2003 document.
  • You can return Word 2003 documents from Web services.

So, to get a better feel for how this may benefit your own applications, let’s walk through the creation of a Word 2003 template, save it as XML, and then manipulate the document (using data provided by a user) to produce a final document for use in the application.

Creating a Schema

The first step in this process is to create a schema for the data that you can insert into the Word 2003 document template. Although you don’t actually need to have a schema, it’s a bit easier to work with the document if you apply a schema to it. Without the schema, you’d have to use a feature like bookmarks, which are rendered like the following XML snippet:

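   <aml:annotation aml:id="0"
      w:type="Word.Bookmark.Start"
      w:name="ContactName"/>
   <w:r>
     <w:t>[ContactName]</w:t>
   </w:r>
   <aml:annotation aml:id="0"
      w:type="Word.Bookmark.End"/>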

Notice how the bookmark, named ContactName in this example, is delimited by two empty annotation elements. The only things that distinguish these elements are the type attribute values of Word.Bookmark.Start and Word.Bookmark.End. This is slightly more complex than applying a schema to the document, which produces the XML in the following snippet:

   <ns0:ContactName>
     <w:r>
       <w:t>[ContactName]</w:t>
     </w:r>
   </ns0:ContactName>

Making a Word 2003 Template
With the schema out of the way, let's see how to apply it to a Word 2003 document. Start by creating or opening a document in Word 2003 with the desired boilerplate text. You may wish to highlight or somehow mark the locations for XML placeholders in your document so you can find them easily when it comes time to edit the document. My convention is to write the node names into the text of the document, and surround them with square brackets (e.g., [ContactName]). These become the placeholders for the schema elements in the document.

To apply a schema, open the Tools menu and select the Templates and Add-ins... option. This opens the dialog box where you can manage the XML schemas that can be applied to Word 2003 documents. Select the XML Schema page to view the current list of attached schemas. If the list is blank, or the desired schema is not listed, click the Add Schema... button. After adding a schema, you are prompted to provide an alias for the schema, simply to make it easier to reference because the namespace is usually long and difficult to read. Once you've added your schema and provided the alias, it appears in the list on the XML Schema page. Enable the checkbox next to the desired schema, and then close the dialog box.

Once you press the OK button on the Templates dialog box, Word 2003 automatically displays the XML Structure task pane. If it doesn't, you can press Ctrl+F1 to make the task pane appear, and then select the XML Structure page from the drop-down list at the top of the pane.

Now that a schema has been attached, you can apply the elements from the schema to the document. Depending upon how your schema is constructed, you may or may not see any elements in the lower part of the XML Structure task pane. In the example schema, because there is no parent element, all of the nodes initially appear in the list.

Figure 1: A Word 2003 document that has been marked up with the schema from Listing 1 looks like this.
To apply the elements, select an area of your document (it doesn't have to contain any text) and then choose one of the available elements to apply. When selecting the first element to apply, Word 2003 prompts you to define how you wish to apply this first element, either to the entire document or only to what you have selected. I've gotten into the habit of always applying the elements to the selection, as that seems to be what I'd want in most situations anyway. Continue to highlight text and apply the elements as desired.

After making your selections and applying the schema, you may or may not see much of a difference in your document. This depends on whether or not you have selected the Show XML tags in the document checkbox in the XML Structure task pane. With this option selected, you'll see the start and end tags graphically represented in your Word 2003 document, as shown in Figure 1.

Now that you've applied the schema to your document, save it as an XML file so that you can parse it with your application code. To do this, start by choosing the Save As... option from the File menu. In the Save As Type drop-down list, choose XML Document (*.xml). You will then see some additional controls to the right of the drop-down list that are specific to the XML format, as shown in Figure 2.

None of the checkboxes should be selected for this example, as you do not want to apply a transform or save only the data without the tags. This ensures that all of the information you have entered into your document is written out to XML.

Figure 2: Choose the XML format from the lower portion of the Save As... dialog box showing the XML options.
Tips for Saving as XML
To make things a little cleaner in the XML output, you will want to ensure that you either spell everything correctly (not very likely if you use my naming convention for the placeholder text) or that you ignore any spelling errors flagged by Word 2003. If you leave in something that the Word 2003 spelling checker doesn't like, the resultant XML looks similar to the following snippet:
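   <ns0:ContactName>
     <w:p>
       <w:r>
         <w:t>[</w:t>
       </w:r>
       <w:proofErr w:type="spellStart"/>
       <w:r>
         <w:t>ConyactName</w:t>
       </w:r>
       <w:proofErr w:type="spellEnd"/>
       <w:r>
         <w:t>]</w:t>
       </w:r>
     </w:p>
   </ns0:ContactName>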

As you can see, with the proofing errors, this changes the expected XML, because Word 2003 has embedded some proofErr elements. Once you handle the spelling errors (e.g., right-click the error in the document and choose "Ignore All"), the XML appears as shown in this snippet:

   <ns0:ContactName>
     <w:p>
       <w:r>
         <w:t>[ContactName]</w:t>
       </w:r>
     </w:p>
   </ns0:ContactName>

Also, be aware of where your paragraph marks appear in relation to your applied schema elements. In the snippet shown above, the [ContactName] text appears on a line all by itself. This places a paragraph element (the w:p element) completely within the ContactName element.

If, on the other hand, you placed ContactName on the same line as some other text or another element, the paragraph element won't appear within the ContactName element but outside of it. Because my document contains both of these examples, the code will have to handle both situations appropriately.

Opening the XML File
Now that you've saved the document as XML, you can see the document on your hard drive with its XML extension. When you double-click it, it opens up within Word 2003, not in your associated program for XML files (which is, by default, Microsoft Internet Explorer). This is because there is a processing instruction at the top of the XML document that declares the ProgID to use when opening this XML file, as shown in this snippet:

   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
   <?mso-application progid="Word.Document"?>

If you comment out the second line of this document and then save it, you no longer launch Word 2003 when double-clicking the XML file. I found this useful during testing so that I could quickly view the XML produced by saving the Word 2003 document as XML.
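If you would rather not edit the file by hand, the same effect can be sketched in code. This is a minimal sketch, not part of the article's listings: it removes the mso-application processing instruction instead of commenting it out, and the StripWordProgId function name is hypothetical.

```vbnet
Imports System
Imports System.Xml

Module ProgIdDemo
    ' Hypothetical helper: removes the mso-application processing
    ' instruction (if present) and returns the remaining XML.
    Public Function StripWordProgId(ByVal sXml As String) As String
        Dim oDoc As New XmlDocument()
        oDoc.LoadXml(sXml)

        ' The instruction is a direct child of the document node
        Dim oPI As XmlNode = oDoc.SelectSingleNode( _
            "/processing-instruction('mso-application')")
        If Not oPI Is Nothing Then
            oDoc.RemoveChild(oPI)
        End If

        Return oDoc.OuterXml
    End Function

    Sub Main()
        ' Tiny stand-in for a saved Word 2003 XML file
        Dim sXml As String = _
            "<?xml version=""1.0""?>" & _
            "<?mso-application progid=""Word.Document""?>" & _
            "<doc/>"
        Console.WriteLine(StripWordProgId(sXml))
    End Sub
End Module
```

Run it against a copy of the file, because the output of the merge program should keep the instruction so that it still opens in Word 2003.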

Creating the Output
Now that the template has been defined and annotated as desired, you can write a small program to read data from an XML file and merge this data with the template. For this example, I've used a console application (as I don't need a GUI) and chose Visual Basic .NET as the language.

First, look at the XML data in Listing 2 that I'll merge with the document. It contains a single record from the Northwind database on SQL Server.

To make the example easier, I've saved this as a file called NWData.xml. In the real world, I'd probably capture the desired data in a Web page or Windows application and then retrieve the data from a database instead of a disk file.

There are more elements in this XML file compared to what I've applied in the Word 2003 document. That means I'll have to be certain to skip these elements when processing the file; perhaps they'll be added to other document templates in the future.

The code (the complete listing is shown in Listing 3) uses the XmlDocument class from the .NET Framework to do the bulk of the work. The code starts by loading both the data file and the Word 2003 template file into separate XML DOM objects. The Word 2003 document (saved as XML) is loaded through a method of a class instantiated as the oProcess object.

   Dim oProcess As New WordXMLTest
   Dim sDocPath As String
   Dim sDataPath As String
   Dim sSaveFile As String

   sDocPath = "sample2.xml"
   sDataPath = "NWData.xml"
   sSaveFile = "OutFile.xml"

   Try
     'load the WordXML into a DOM
     oProcess.LoadFile(sDocPath)

     'load data into DOM
     xmlDataDoc.Load(sDataPath)

Next, select the nodes from the data document with a simple XPath query, and iterate through them with a For-Next loop. Note that this code assumes that only a single customer record exists in the XML file. If there are multiple customers, add another outer loop to iterate through each customer record.

   'iterate through data nodes
   xmlNodes = xmlDataDoc.SelectNodes( _
              "/results/customers/*")

   'replace Word doc area with data
   If Not xmlNodes Is Nothing Then
     For i = 0 To xmlNodes.Count - 1
       xmlNode = xmlNodes(i)
       sNodeName = xmlNode.Name
       sNewText = xmlNode.InnerText

       oProcess.ProcessNodes( _
            sNodeName, sNewText)
     Next
   End If
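The outer loop mentioned above might look something like the following. This is only a sketch under assumptions: the two sample records are made up to match the /results/customers/* shape used by the query, and the ListFields function is a hypothetical stand-in that collects the fields instead of calling ProcessNodes.

```vbnet
Imports System
Imports System.Text
Imports System.Xml

Module MultiCustomerDemo
    ' Hypothetical helper: walks every customers record, then every
    ' field inside it, and returns one "name=value" line per field.
    Public Function ListFields(ByVal sData As String) As String
        Dim xmlDataDoc As New XmlDocument()
        xmlDataDoc.LoadXml(sData)

        Dim oOut As New StringBuilder()
        Dim xmlCustomer As XmlNode
        Dim xmlField As XmlNode

        ' Outer loop: one pass per customer record
        For Each xmlCustomer In xmlDataDoc.SelectNodes("/results/customers")
            ' A real merge would start from a fresh copy of the template here
            ' Inner loop: the same work as the single-record loop
            For Each xmlField In xmlCustomer.SelectNodes("*")
                oOut.Append(xmlField.Name & "=" & xmlField.InnerText)
                oOut.Append(Environment.NewLine)
            Next
        Next

        Return oOut.ToString()
    End Function

    Sub Main()
        ' Made-up data: two records shaped like /results/customers/*
        Dim sData As String = _
            "<results>" & _
            "<customers><CustomerID>ALFKI</CustomerID>" & _
            "<ContactName>Maria Anders</ContactName></customers>" & _
            "<customers><CustomerID>ANATR</CustomerID>" & _
            "<ContactName>Ana Trujillo</ContactName></customers>" & _
            "</results>"
        Console.Write(ListFields(sData))
    End Sub
End Module
```

Because each customer needs its own output document, the template would have to be reloaded (and saved under a different file name) once per pass of the outer loop.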

For the ProcessNodes method, the desired node name and new text are passed as parameters. A separate method is used because in my template, I have the ContactName element in two locations within the document. I want to ensure that both of these locations are replaced with the same name.

So, in the ProcessNodes method, the specified node name is used to create XPath queries to retrieve lists of matching nodes. Then each query is executed with the SelectNodes method on the Word 2003 XML DOM object, oXMLWordDoc.

   Public Sub ProcessNodes( _
          ByVal sNodeName As String, _
          ByVal sNodeValue As String)
     'replace the node(s) in the document
     'with the specified value
     Dim oNodeList As XmlNodeList

     'get nodes that have
     'embedded paragraph marks
     oNodeList = _
       oXMLWordDoc.SelectNodes( _
         "//ns0:" + sNodeName + "//w:p", _
         oNSMgr)

The interesting part of the code is the pair of XPath queries, which together catch all of the nodes with the specified node name. Some of the schema elements contain an entire paragraph (the w:p element sits inside the element), while others are embedded within a paragraph alongside other text, so there is one query for each situation.

     If Not oNodeList Is Nothing Then
       FillNodes(oNodeList, sNodeValue)
     End If

     'get nodes that do NOT have
     'embedded paragraph marks
     oNodeList = _
       oXMLWordDoc.SelectNodes( _
         "//ns0:" + sNodeName, oNSMgr)

     If Not oNodeList Is Nothing Then
       FillNodes(oNodeList, sNodeValue)
     End If
   End Sub
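To see why both queries are needed, it can help to run them against a tiny fragment that contains both placements. The following is a minimal sketch, not the article's code: the urn:example:customer namespace is a made-up stand-in for your schema's target namespace, and the fragment is far smaller than a real Word 2003 document.

```vbnet
Imports System
Imports System.Xml

Module XPathCasesDemo
    Public Const W_NS As String = _
        "http://schemas.microsoft.com/office/word/2003/wordml"
    ' Made-up target namespace; use the one from your own schema
    Public Const DATA_NS As String = "urn:example:customer"

    ' Builds a fragment with ContactName around a paragraph
    ' and ContactName inside a paragraph.
    Public Function BuildSample() As XmlDocument
        Dim oDoc As New XmlDocument()
        oDoc.LoadXml( _
            "<w:body xmlns:w=""" & W_NS & """ xmlns:ns0=""" & DATA_NS & """>" & _
            "<ns0:ContactName><w:p><w:r><w:t>[ContactName]</w:t></w:r></w:p>" & _
            "</ns0:ContactName>" & _
            "<w:p><w:r><w:t>Dear </w:t></w:r>" & _
            "<ns0:ContactName><w:r><w:t>[ContactName]</w:t></w:r>" & _
            "</ns0:ContactName></w:p>" & _
            "</w:body>")
        Return oDoc
    End Function

    ' Maps the two prefixes used by the queries
    Public Function BuildManager(ByVal oDoc As XmlDocument) As XmlNamespaceManager
        Dim oNSMgr As New XmlNamespaceManager(oDoc.NameTable)
        oNSMgr.AddNamespace("w", W_NS)
        oNSMgr.AddNamespace("ns0", DATA_NS)
        Return oNSMgr
    End Function

    Sub Main()
        Dim oDoc As XmlDocument = BuildSample()
        Dim oNSMgr As XmlNamespaceManager = BuildManager(oDoc)

        ' Query 1: paragraphs that sit inside the schema element
        Dim oParas As XmlNodeList = _
            oDoc.SelectNodes("//ns0:ContactName//w:p", oNSMgr)
        Console.WriteLine("w:p inside element: " & oParas.Count.ToString())

        ' Query 2: every schema element, wherever it sits
        Dim oElems As XmlNodeList = _
            oDoc.SelectNodes("//ns0:ContactName", oNSMgr)
        Console.WriteLine("elements found: " & oElems.Count.ToString())

        ' Only the inline element has w:r/w:t as a direct child path
        Dim oElem As XmlNode
        For Each oElem In oElems
            Dim bDirect As Boolean = _
                Not (oElem.SelectSingleNode("w:r/w:t", oNSMgr) Is Nothing)
            Console.WriteLine("direct w:r/w:t: " & bDirect.ToString())
        Next
    End Sub
End Module
```

The first query finds one paragraph and the second finds two elements. Only the inline element has w:r/w:t directly beneath it, which is why the element that wraps a paragraph has to be reached through its w:p node instead.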

Because the queries use namespace prefixes, each SelectNodes call must be passed an XmlNamespaceManager object (part of .NET's System.Xml namespace) that maps those prefixes to namespace URIs. Without it, the query throws an XPathException. The XmlNamespaceManager object, stored in a public field of the WordXMLTest class, is populated within the New method (the constructor), so it is ready as soon as the WordXMLTest class is instantiated.

The namespace URIs come directly from the Word 2003 XML file and may vary depending upon the target namespace declared in your schema and what Word 2003 assigns as a prefix to your schema.
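A constructor along these lines would do the job. Treat it as a sketch under stated assumptions, not as the code from Listing 3: the class is named WordXMLTestSketch to keep it apart from the real class, the urn:example:customer URI is made up, and the RegisterFromDocument helper is hypothetical. The helper also assumes that the prefix is declared on the root element of the loaded document.

```vbnet
Imports System
Imports System.Xml

' Sketch only; the article's real class is WordXMLTest
Class WordXMLTestSketch
    Public oXMLWordDoc As New XmlDocument()
    Public oNSMgr As New XmlNamespaceManager(oXMLWordDoc.NameTable)

    Public Sub New()
        ' Namespace of the Word 2003 XML elements (w:p, w:r, w:t)
        oNSMgr.AddNamespace("w", _
            "http://schemas.microsoft.com/office/word/2003/wordml")
    End Sub

    ' Hypothetical helper: copies a prefix binding from the root
    ' element of the loaded document, so the schema's URI does not
    ' have to be hard-coded.
    Public Function RegisterFromDocument(ByVal sPrefix As String) As Boolean
        Dim sUri As String = _
            oXMLWordDoc.DocumentElement.GetNamespaceOfPrefix(sPrefix)
        If sUri Is Nothing OrElse sUri.Length = 0 Then
            Return False
        End If
        oNSMgr.AddNamespace(sPrefix, sUri)
        Return True
    End Function
End Class

Module NamespaceDemo
    Sub Main()
        Dim oProcess As New WordXMLTestSketch()

        ' Tiny stand-in for a saved Word 2003 XML document
        oProcess.oXMLWordDoc.LoadXml( _
            "<w:wordDocument" & _
            " xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml""" & _
            " xmlns:ns0=""urn:example:customer"">" & _
            "<ns0:ContactName/></w:wordDocument>")

        If oProcess.RegisterFromDocument("ns0") Then
            Console.WriteLine(oProcess.oNSMgr.LookupNamespace("ns0"))
        End If
    End Sub
End Module
```

Reading the URI from the document is one way to cope with the fact that the schema namespace and its prefix can differ from one template to the next. Hard-coding both AddNamespace calls works just as well when you control the template.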

The FillNodes method referenced in the ProcessNodes method receives a node list object and a new node value as parameters. It changes the contents of the specified nodes on the oXMLWordDoc object.

   Private Sub FillNodes( _
        ByVal oNodeList As XmlNodeList, _
        ByVal sNodeValue As String)
     Dim i As Integer
     Dim oXMLNode, oInnerNode As XmlNode

     For i = 0 To oNodeList.Count - 1
       oXMLNode = oNodeList(i)
       oInnerNode = _
          oXMLNode.SelectSingleNode( _
          "w:r/w:t", oNSMgr)
       If Not oInnerNode Is Nothing Then
         oInnerNode.InnerText = sNodeValue
       End If
     Next
   End Sub

The replacement actually occurs on the text between the <w:t> and </w:t> tags that appear within the specified node object. This ensures that no formatting is lost, as font and paragraph properties are specified in the elements that surround the <w:t> element.
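A small experiment illustrates the point. In this sketch (not taken from the article's listings) a bold run holds the placeholder, only the text of the w:t element is replaced, and the w:b formatting element is then checked. The FillBoldRun and BuildManager names and the sample contact name are made up for the illustration.

```vbnet
Imports System
Imports System.Xml

Module FormattingDemo
    Public Const W_NS As String = _
        "http://schemas.microsoft.com/office/word/2003/wordml"

    ' Maps the w prefix for the queries below
    Public Function BuildManager(ByVal oDoc As XmlDocument) As XmlNamespaceManager
        Dim oNSMgr As New XmlNamespaceManager(oDoc.NameTable)
        oNSMgr.AddNamespace("w", W_NS)
        Return oNSMgr
    End Function

    ' Loads a bold run containing the placeholder and replaces
    ' only the text inside w:t.
    Public Function FillBoldRun(ByVal sValue As String) As XmlDocument
        Dim oDoc As New XmlDocument()
        oDoc.LoadXml( _
            "<w:p xmlns:w=""" & W_NS & """>" & _
            "<w:r><w:rPr><w:b/></w:rPr><w:t>[ContactName]</w:t></w:r>" & _
            "</w:p>")

        Dim oText As XmlNode = _
            oDoc.SelectSingleNode("/w:p/w:r/w:t", BuildManager(oDoc))
        oText.InnerText = sValue
        Return oDoc
    End Function

    Sub Main()
        Dim oDoc As XmlDocument = FillBoldRun("Maria Anders")
        Dim oNSMgr As XmlNamespaceManager = BuildManager(oDoc)

        Dim oText As XmlNode = oDoc.SelectSingleNode("/w:p/w:r/w:t", oNSMgr)
        Dim oBold As XmlNode = oDoc.SelectSingleNode("/w:p/w:r/w:rPr/w:b", oNSMgr)
        Dim bKept As Boolean = Not (oBold Is Nothing)

        Console.WriteLine("text: " & oText.InnerText)
        Console.WriteLine("bold kept: " & bKept.ToString())
    End Sub
End Module
```

Setting InnerText on the w:r element instead would wipe out the w:rPr child along with the old text, which is the loss of formatting that FillNodes avoids.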

The last bit is to take the modified XML and save it to disk with a different file name so that it can be viewed. This is done by calling the Save method on the Word 2003 XML DOM object:

     'write out the new Doc file.
     oProcess.save(sSaveFile)
   . . .
   Class WordXMLTest
     Public oXMLWordDoc As _
        New XmlDocument
     Public oNSMgr As _
        New XmlNamespaceManager( _
            oXMLWordDoc.NameTable)

     Public Sub save( _
        ByVal sFileName As String)

       oXMLWordDoc.Save(sFileName)
     End Sub

The Final Output
After running the program, you should now be able to double-click the output file and see the output in Word 2003, as shown in Figure 3.
