
Export Customized XML from Microsoft Word with VB.NET


Word automation has traditionally been the province of VB Classic developers, but it’s alive and well in VB.NET; it’s just a little different. Word automation is the process of using the classes and methods exposed by Word to create new Word documents or to alter or manipulate existing ones. In this article, you’ll see how to get started with Word automation in VB.NET by exploring a process for transforming Word documents into customizable XML. The technique shown here doesn’t rely on Word 2003’s XML capabilities, so you can use it with any version of Word that supports automation. Most of the techniques you’ll see apply generally to any application that needs to automate Word from within .NET.

Here’s the process in a nutshell: You add a reference to the Word automation library to your project and use that reference to create a Word application object that can open Word files and export the document’s contents to an XML document. For this project, the application exports the content in such a way that each Word style gets translated to an appropriate XML element. By default, the application uses the names of Word styles (sometimes in slightly modified form) applied to paragraphs as the element names for the document. As written, the application follows the sequence of page breaks in the Word document itself. It doesn’t take section breaks within a document into account, but that would be relatively easy to add. The application preserves style formatting, but ignores empty paragraphs (those containing only whitespace such as spaces, tabs, carriage returns, and linefeeds).

Here’s an example. Suppose you have a Word document that looks like the sample.doc file shown in Figure 1.

Reference the Appropriate Word Library
Start a new Windows Application project in Visual Studio .NET, and name it WordAutomation. After creating the project, right-click the References item in the Solution Explorer, and select Add Reference. Click the COM tab on the Add Reference dialog and then select the Microsoft Word MajorVersion.minorversion item (where MajorVersion.minorversion stands for the major and minor version number of the Word release you’re targeting). Doing that creates several references called Word, VBIDE, stdole, and Microsoft.Office.Core. The only one you need to worry about for this project is the Word reference.

Getting Word Content
The first step is to use Word automation to open and close Word documents and iterate through them to extract content. As an example, you’ll see how to load the sample.doc file that accompanies the downloadable code for this article, and display the text of each paragraph formatted as XML in a TextBox on your default form. The sample Form1 form has a TextBox to hold the filename of the Word file to process, a Browse button that lets you select a file, a multi-line TextBox to display the results (see Figure 2), and a Process button that processes the selected Word file, turning it into a valid XML document.

Figure 2: The Sample Form for the WordAutomation Project. The form lets you select a Word file, processes the file into XML, and displays the results in the multi-line TextBox.

To open the Word file, first create a Word.Application object. The sample project creates this when it instantiates the class by defining a class-level variable.

   Private wordApp As New Word.Application

When the user selects a file and clicks the Process button, the Click event-handler code opens the selected file by calling the Word application’s Documents.Open method.

   doc = wordApp.Documents.Open( _
      CType(Me.txtFilename.Text, Object), 1, 1, 0)

The method creates a new instance of the Word.Document class, which represents a Word document.

Note that you can’t create a Word.DocumentClass instance directly. Instead, you get the reference indirectly through the Word.Application object’s Documents collection, in this case by calling the Open method to open the selected file. The Open method returns a Word.Document object, which you can then use to manipulate the document’s content.

Author’s Note: In earlier Word versions, using the Application and Document classes directly caused problems, so you may need to use the ApplicationClass and DocumentClass classes instead. For example, you may experience an irritating conflict between Close() methods. For a Word.Document instance, you’ll get an error stating that “‘Close’ is ambiguous across the inherited interfaces ‘Word._Document’ and ‘Word.DocumentEvents_Event’.” One solution is to create your document references as DocumentClass instances rather than Document instances. Another workaround is to cast the Document or Application instances to a more specific DocumentClass or ApplicationClass instance before issuing the ambiguous call. For example, assuming that myDoc is a Document object, to issue a Close call you could write:

   CType(myDoc, DocumentClass).Close()

The line of code above casts the Document reference myDoc to a DocumentClass reference, which avoids any ambiguity with the call to Close().
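Because the article's code starts a Word process of its own, it helps to see the whole open-and-close lifecycle in one place. The following is a minimal sketch, not the sample project's code: the OpenAndClose helper is a hypothetical name, the Word namespace is assumed to match the COM reference described earlier, and the cast to the Word._Document interface is an alternative to the DocumentClass cast shown above for sidestepping the ambiguous Close call.

```vbnet
Imports System.Runtime.InteropServices

Module WordLifecycleSketch
   ' Hypothetical helper. Assumes the project references the Word COM
   ' library under the namespace "Word", as the article's code does.
   Public Sub OpenAndClose(ByVal path As String)
      Dim app As New Word.Application
      Dim doc As Word.Document = Nothing
      Try
         ' Same positional arguments as the article's call:
         ' FileName, ConfirmConversions, ReadOnly, AddToRecentFiles.
         doc = app.Documents.Open(CType(path, Object), 1, 1, 0)
         ' ... read the document's paragraphs here ...
      Finally
         If Not doc Is Nothing Then
            ' Casting to the _Document interface avoids the ambiguity
            ' between the Close method and the Close event.
            CType(doc, Word._Document).Close(SaveChanges:=False)
         End If
         ' Quit has the same method/event ambiguity, so cast here as well.
         CType(app, Word._Application).Quit()
         Marshal.ReleaseComObject(app)
      End Try
   End Sub
End Module
```

Putting the cleanup in a Finally block means that an exception while reading the document should not leave a hidden Word process running.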

Finding and Replacing Character Styles
Word has two different types of styles. Paragraph styles control such paragraph properties as indentation, spacing before and after the paragraph, and the default font characteristics applied to text in the paragraph. In addition, Word supports character styles, which are styles applied to a range of characters. For example, when you format characters as bold or italic, you’ve applied a character style. For this application, assume you want the paragraph style to become the name of the XML tag surrounding the text of the paragraph, but you also want to maintain bold and italic styles within the text.

Rather than writing the code to iterate through all the Word characters looking for bold and italic characters yourself, it’s much simpler to get Word to do it for you. Word has powerful find and replace methods, and you have access to them through your object references. In the sample code, the two methods boldToHTML and italicToHTML handle replacing ranges of bold and italic characters with HTML tag equivalents. In other words, bold text becomes <b>bold text</b> and italic text becomes <i>italic text</i>. Using Word’s find-and-replace functionality, you can perform replacement processes on the entire document very quickly.

   Public Sub boldToHTML(ByVal doc As Word.Document)
      With doc.Content.Find
         .ClearFormatting()
         .Font.Bold = 1
         .Replacement.ClearFormatting()
         .Replacement.Font.Bold = 0
         .Text = "*"
         ' An empty FindText searches on formatting alone;
         ' ^& stands for the text that was found.
         .Execute(FindText:="", _
            ReplaceWith:="<b>^&</b>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub

   Public Sub italicToHTML(ByVal doc As Word.Document)
      With doc.Content.Find
         .ClearFormatting()
         .Font.Italic = 1
         .Replacement.ClearFormatting()
         ' Turn italic (not bold) off in the replacement text
         .Replacement.Font.Italic = 0
         .Text = "*"
         .Execute(FindText:="", _
            ReplaceWith:="<i>^&</i>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub

You will probably want to add other, similar find-and-replace methods depending on your needs. For example, you might need to find underlined text and replace it with <u> tags.
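As one illustration of such an additional method, here is a hedged sketch for underlined text that follows the same pattern as the bold and italic methods. The underlineToHTML name and the choice of <u> as the replacement tag are assumptions, and the search matches only single underlines; other underline kinds (double, dotted, and so on) would each need a pass of their own.

```vbnet
Module UnderlineSketch
   ' Hypothetical companion to boldToHTML and italicToHTML. Assumes the
   ' project references the Word COM library under the namespace "Word".
   Public Sub underlineToHTML(ByVal doc As Word.Document)
      With doc.Content.Find
         .ClearFormatting()
         ' Match text carrying a single underline only.
         .Font.Underline = Word.WdUnderline.wdUnderlineSingle
         .Replacement.ClearFormatting()
         ' Remove the underline from the replaced text.
         .Replacement.Font.Underline = Word.WdUnderline.wdUnderlineNone
         ' An empty FindText searches on formatting alone;
         ' ^& stands for the text that was found.
         .Execute(FindText:="", _
            ReplaceWith:="<u>^&</u>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub
End Module
```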

One problem with this approach is that as people write documents, they often turn bold or some other character formatting on, write some text, and then press return to end a paragraph. They start typing the next line and realize that the formatting is still in effect, so they turn it off. The document looks fine, but unfortunately, this common series of actions means the carriage return at the end of the text is itself bold. When you execute the find-and-replace methods shown above, you’ll get results like this:

   <b>This is bold text
   </b>

That causes problems when you translate documents to XML, because it results in improperly nested tags, such as:

   <p><b>This is bold text</p></b>

The solution is to make a single pass through the document replacing all paragraph marks (carriage returns) with unformatted paragraph marks. The clearCRFormatting method shown below solves the problem.

   Public Sub clearCRFormatting(ByVal doc _
      As Word.Document)
      With doc.Content.Find
         .ClearFormatting()
         .Replacement.ClearFormatting()
         .Replacement.Font.Bold = 0
         .Replacement.Font.Italic = 0
         .Replacement.Font.Underline = 0
         .Execute(findtext:="^p", ReplaceWith:="^p", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub

The sample project calls all three find-and-replace methods immediately after loading the document (see Listing 2). Next, it starts collecting the paragraph content.

Word uses a hierarchical collection model. At the document level, each document has a collection of Paragraph objects, so it’s easy to iterate through them:

   ' loop through the paragraphs
   For Each p As Word.Paragraph In doc.Paragraphs
      ' do something with each paragraph
   Next

For this application, the “do something with each paragraph” consists of several tasks:

  • Determine the page number where the paragraph would appear if the document were printed. Whenever the page changes, you want to insert a <page> element to reflect the original page structure of the document in the XML file.
  • Retrieve the Word style name for the paragraph.
  • Create an XML element for that style. This step itself consists of several separate tasks, such as determining whether to use the Word style name directly or “map” it to a different element name using a “style mapping.” For example, you might want to translate the “Normal” style to HTML-like <p> tags. By default, the application uses the Word style name, with minor adjustments to meet XML element-naming rules.

  • Append the XML element containing the contents of the paragraph to an XML output document.
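The steps above can be sketched without Word at all, which makes the XML side easy to try on its own. The sketch below assumes each paragraph has already been reduced to a page number, a style name, and its text. The ParagraphInfo structure and BuildXml function are hypothetical names rather than part of the sample project, and unlike the sample's addPage method, the sketch uses the Word page number directly as the page id.

```vbnet
Imports System
Imports System.Xml

Module DocToXmlSketch
   ' Hypothetical stand-in for the values read from one Word paragraph.
   Public Structure ParagraphInfo
      Public Page As Integer
      Public StyleName As String
      Public Text As String

      Public Sub New(ByVal aPage As Integer, ByVal aStyle As String, _
         ByVal aText As String)
         Page = aPage
         StyleName = aStyle
         Text = aText
      End Sub
   End Structure

   Public Function BuildXml(ByVal paragraphs() As ParagraphInfo) As XmlDocument
      Dim xmlDoc As New XmlDocument
      Dim root As XmlElement = xmlDoc.CreateElement("document")
      xmlDoc.AppendChild(root)

      Dim pageNode As XmlElement = Nothing
      Dim currentPage As Integer = 0

      For Each p As ParagraphInfo In paragraphs
         ' Ignore paragraphs that contain only whitespace.
         If p.Text.Trim().Length > 0 Then
            ' Start a new <page> element when the page number goes up.
            If pageNode Is Nothing OrElse p.Page > currentPage Then
               pageNode = xmlDoc.CreateElement("page")
               pageNode.SetAttribute("id", p.Page.ToString())
               root.AppendChild(pageNode)
               currentPage = p.Page
            End If
            ' Turn the style name into an element name (spaces to underscores).
            Dim el As XmlElement = _
               xmlDoc.CreateElement(p.StyleName.Replace(" "c, "_"c))
            ' Append the paragraph content to the current page.
            el.InnerText = p.Text.Trim()
            pageNode.AppendChild(el)
         End If
      Next
      Return xmlDoc
   End Function

   Sub Main()
      Dim sample() As ParagraphInfo = { _
         New ParagraphInfo(1, "Heading 1", "Title"), _
         New ParagraphInfo(1, "Normal", "   "), _
         New ParagraphInfo(2, "Normal", "Body text")}
      Console.WriteLine(BuildXml(sample).OuterXml)
   End Sub
End Module
```

With the three sample paragraphs, the whitespace-only paragraph is skipped and the output contains two page elements, one holding a Heading_1 element and the other a Normal element.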

The sample application wraps all these steps up into a method called docToXml, which takes the Word.Document instance as a parameter and returns an XML document object containing the completed elements and the Word document’s content.

Determining Page Numbers
A Word document is an abstract representation of a printed document. Both the content and the print setup affect page numbers. For example, you can print a Word document on non-standard paper, or insert new paragraphs at the beginning of a document. When you make such changes, Word rearranges the content to reflect the new content or output medium. The result is that paragraphs don’t “belong” to a specific page. Therefore, if you want to maintain the current Word page arrangement, you need to capture the page number as you loop through the document grabbing paragraph content. This application maintains the relationship between the paragraphs and Word’s pages by creating each paragraph element as a child of a <page id="#"> element, where “#” is the page number.

To retrieve the page number on which any range of text lies, you first select the text and then use the Word.Selection.Information property to get the page number for a specific part of the selection (because selection ranges might cross pages).

   ' check the page number
   If CType(wordApp.Selection.Information( _
      Word.WdInformation.wdActiveEndPageNumber), _
      Integer) > currentPageNumber Then
      pageNode = Me.addPage(xmlDoc)
      currentPageNumber = Integer.Parse( _
         pageNode.Attributes("id").Value)
   End If
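The Range object exposes an Information property as well, so one possible variation is to ask the paragraph's own range for its page number instead of selecting the paragraph first. The sketch below is an assumption-laden alternative rather than the sample project's approach: EndPageOf is a hypothetical helper name, and you would want to confirm that it reports the same page numbers as the Selection-based check on your own documents.

```vbnet
Module PageNumberSketch
   ' Hypothetical helper. Assumes the project references the Word COM
   ' library under the namespace "Word", as the article's code does.
   Public Function EndPageOf(ByVal p As Word.Paragraph) As Integer
      ' wdActiveEndPageNumber reports the page on which the range ends,
      ' so no selection is needed.
      Return CType(p.Range.Information( _
         Word.WdInformation.wdActiveEndPageNumber), Integer)
   End Function
End Module
```

Inside the paragraph loop, a call such as EndPageOf(p) > currentPageNumber would then take the place of the Selection-based comparison.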

The sample application initially sets up a new XML document with one <page> element (page 1) and sets the currentPageNumber variable to 1. For each paragraph, the code retrieves the page number for the end of the paragraph, and compares it to the currentPageNumber value. When the paragraph page differs from the currentPageNumber, the application adds a new <page> element using the addPage method, which returns a reference to the new page element.

   Private Function addPage(ByVal xmlDoc As _
      XmlDocument) As XmlElement
      Dim page As XmlElement = _
         xmlDoc.CreateElement("page")
      Dim natt As XmlAttribute = _
         xmlDoc.CreateAttribute("id")
      If xmlDoc.SelectNodes("document//page") _
         Is Nothing Then
         natt.Value = "1"
      Else
         natt.Value = (xmlDoc.SelectNodes( _
            "document//page").Count + 1).ToString
      End If
      page.Attributes.Append(natt)
      xmlDoc.SelectSingleNode( _
         "/document").AppendChild(page)
      Return page
   End Function

This scheme can’t maintain a perfect page-to-paragraph relationship; if a paragraph begins on one page but ends on another, this scheme stores the entire paragraph on the higher page.

Some documents contain “hard” (manual) page breaks. Word stores each of these as a single form-feed character (decimal 12, hex 0C). It turns out that these hard page breaks appear as the first character in the paragraph following the hard page break, so you must test for that as well. The code below strips the hard page break character from the beginning of the paragraph text and starts a new page element.

   ' get the para text
   Dim s As String = p.Range.Text

   ' check to see if there's a hard page break at the
   ' start of this para
   If s.Length > 0 AndAlso Asc(s.Chars(0)) = &HC Then
      ' drop the page break character and start a new page
      s = s.Substring(1)
      pageNode = Me.addPage(xmlDoc)
      currentPageNumber = Integer.Parse( _
         pageNode.Attributes("id").Value)
   End If

Mapping Style Names to Element Names
Each Word Paragraph object has a Style property that returns a Style object. So as you iterate through the paragraphs, you want to obtain the Style object and retrieve its name. It turns out that Style objects don’t have a Name property; they have a NameLocal property instead, which corresponds to the name you see when you select a style from Word’s dropdown style list, and that’s exactly what you need. Because the Paragraph.Style property is typed as Object, you must cast it to a Word.Style object to use the NameLocal property in your code.

   Dim stylename As String = CType(p.Style, _
      Word.Style).NameLocal

Now that you have the style name for this paragraph, you want to map it to an XML element. There are two considerations. First, the Word style names can contain spaces, while XML element names cannot; therefore, you must either remove or replace the spaces before applying the name to an XML element.

Second, you may not want to map the Word style names directly to XML element names. For example, you might want to map Word’s Normal style to a <p> element in the XML document. To do that, you need to write a bit of lookup code to map style names to element names. The sample application contains a StyleMapping class that performs the lookup (see Listing 3). For convenience, the StyleMapping class also contains a fixupName method that handles replacing any spaces in the Word style name with underscores.
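Spaces are not the only characters that can make a style name unusable as an element name; commas, parentheses, or a leading digit cause the same problem. As a hedged sketch of one way to cover those cases, the hypothetical ToElementName function below replaces spaces with underscores, as fixupName does, and then passes the result through the .NET Framework's XmlConvert.EncodeLocalName method, which escapes any remaining invalid characters as _xHHHH_ sequences.

```vbnet
Imports System
Imports System.Xml

Module ElementNameSketch
   ' Hypothetical variant of fixupName; not part of the sample project.
   Public Function ToElementName(ByVal styleName As String) As String
      ' Same first step as fixupName: spaces become underscores.
      Dim name As String = styleName.Trim().Replace(" "c, "_"c)
      ' Escape whatever is still illegal in an XML name.
      Return XmlConvert.EncodeLocalName(name)
   End Function

   Sub Main()
      Console.WriteLine(ToElementName("Heading 1"))   ' Heading_1
      ' The comma is escaped, so the result is still a legal name.
      Dim escaped As String = ToElementName("Body Text, Indented")
      XmlConvert.VerifyNCName(escaped)   ' throws if the name is invalid
      Console.WriteLine(escaped)
   End Sub
End Module
```

The escaped names are not pretty, so for styles you care about it is probably still better to add an explicit entry to the map file.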

To create an instance of the StyleMapping class, pass it the name of an XML-formatted map file. Map files consist of a root tag, which contains any number of <item> tags. Each <item> tag has style and tag attributes that hold the Word style name and the corresponding name of the XML tag that will hold a paragraph of that style.

For example, a map file can instruct the application to map the Word style “Heading 1” to an <h1> element and to map the Normal style to a <p> element.

As written, the application always attempts to look up the style name for every paragraph by calling the StyleMapping.GetStyleToElementMapping method. If that method finds an element with a matching style attribute, it returns the value of the tag attribute; otherwise it “fixes up” the Word style name by calling the private fixupName method and returns the result.

   ' definition in docToXml method
   Dim styleMapper As New StyleMapping( _
      System.IO.Path.Combine(Application.StartupPath, _
      "stylemapping.xml"))

   ' for each paragraph, map the Word style to
   ' an XML element name
   Dim elementName As String = _
      styleMapper.GetStyleToElementMapping(stylename)

   ' In the StyleMapping class
   Public Function GetStyleToElementMapping( _
      ByVal aStylename As String) As String
      Dim el As XmlElement = getMapNode(aStylename)
      Dim tagname As String = String.Empty
      If Not el Is Nothing Then
         If el.HasAttribute("tag") Then
            tagname = el.GetAttribute("tag")
         End If
      End If
      If tagname = String.Empty Then
         tagname = fixupName(aStylename)
      End If
      Return tagname
   End Function

   Private Function getMapNode( _
      ByVal aStylename As String) As XmlElement
      Dim n As XmlNode = _
         xml.SelectSingleNode("//item[@style='" + _
         aStylename + "']")
      If Not n Is Nothing Then
         Return CType(n, XmlElement)
      Else
         Return Nothing
      End If
   End Function

   Private Function fixupName(ByVal aStylename _
      As String) As String
      Return aStylename.Replace(" "c, "_"c)
   End Function
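To see the lookup work end to end, here is a small self-contained sketch. The map content is hypothetical: the item elements and their style and tag attributes follow the article's description, but the root element name stylemap and the Lookup function are assumptions made for illustration. The sketch compares attribute values in code instead of building an XPath string, so a style name that contains an apostrophe does not break the query.

```vbnet
Imports System
Imports System.Xml

Module StyleMapSketch
   ' Hypothetical lookup: returns the mapped tag, or the style name
   ' with spaces replaced by underscores when no mapping exists.
   Public Function Lookup(ByVal map As XmlDocument, _
      ByVal styleName As String) As String
      For Each item As XmlElement In map.SelectNodes("//item")
         If item.GetAttribute("style") = styleName _
            AndAlso item.HasAttribute("tag") Then
            Return item.GetAttribute("tag")
         End If
      Next
      Return styleName.Replace(" "c, "_"c)
   End Function

   Sub Main()
      ' Hypothetical map file content; the root name is an assumption.
      Dim mapXml As String = _
         "<stylemap>" & _
         "<item style=""Heading 1"" tag=""h1"" />" & _
         "<item style=""Normal"" tag=""p"" />" & _
         "</stylemap>"
      Dim map As New XmlDocument
      map.LoadXml(mapXml)

      Console.WriteLine(Lookup(map, "Heading 1"))   ' h1
      Console.WriteLine(Lookup(map, "Body Text"))   ' Body_Text
   End Sub
End Module
```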

After obtaining a mapped name, you can create a new XmlElement and append it to the most recent page element.

   Dim N As XmlElement = _
      xmlDoc.CreateElement(elementName)
   N.InnerText = s
   pageNode.AppendChild(N)
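One detail worth checking in your own output is how the inline bold and italic markers survive this step. Assuming the find-and-replace methods inserted literal <b> and <i> markers into the paragraph text, setting InnerText escapes them, whereas InnerXml parses them into child elements. The short sketch below, which is an illustration and not part of the sample project, shows the difference.

```vbnet
Imports System
Imports System.Xml

Module InnerTextSketch
   Sub Main()
      Dim xmlDoc As New XmlDocument
      Dim para As XmlElement = xmlDoc.CreateElement("Normal")
      xmlDoc.AppendChild(para)

      ' InnerText treats the string as plain text and escapes the markup.
      para.InnerText = "This is <b>bold</b> text"
      Console.WriteLine(para.OuterXml)
      ' <Normal>This is &lt;b&gt;bold&lt;/b&gt; text</Normal>

      ' InnerXml parses the string, so <b> becomes a real child element.
      para.InnerXml = "This is <b>bold</b> text"
      Console.WriteLine(para.OuterXml)
      ' <Normal>This is <b>bold</b> text</Normal>
   End Sub
End Module
```

InnerXml throws an XmlException when the text contains a bare ampersand, a stray angle bracket, or improperly nested tags, so if you switch to it you would need to escape the paragraph text first and rely on clearCRFormatting to keep the tags balanced.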

When the docToXml function has processed all the paragraphs, it returns the completed XML document. The Process button Click event handler code then displays it in the multi-line TextBox (see Figure 3).

Figure 3: The Completed Transformation. After processing the simple sample.doc file, the multi-line TextBox displays the content transformed to XML.

Customizing the Doc-to-XML Sample Application
The solution shown here is very generic and also extremely simple. It doesn’t take advanced Word features such as bullets, numbered paragraphs, document sections, or custom character formatting into account, but adding support for such features is not difficult.

For example, you can check each Word paragraph to see if it has a bullet or number and add items to the mapping file to map bulleted paragraphs to specific XML element names, such as <li> or another list element name. You can also alter the format of the mapping file to hold more information, perhaps to add specific HTML style attributes to each element (although it’s usually best to add formatting with an XSLT transform). I’ve added code to the StyleMapping class so that you can add and remove mappings programmatically. Although the sample code doesn’t use these methods, you’ll find two commented-out lines at the top of the docToXml method that create a new StyleMapping document and add a mapping dynamically.
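For the bullet-and-number check mentioned above, one possible starting point is the ListFormat object that Word exposes on a paragraph's range. The sketch below is only an outline under stated assumptions: ListKeyFor is a hypothetical helper, and the "Bulleted" and "Numbered" strings are made-up keys that you would add to the map file as if they were style names.

```vbnet
Module ListTypeSketch
   ' Hypothetical helper. Assumes the project references the Word COM
   ' library under the namespace "Word", as the article's code does.
   ' Returns a made-up mapping key, or Nothing for non-list paragraphs.
   Public Function ListKeyFor(ByVal p As Word.Paragraph) As String
      Select Case p.Range.ListFormat.ListType
         Case Word.WdListType.wdListNoNumbering
            Return Nothing
         Case Word.WdListType.wdListBullet
            Return "Bulleted"
         Case Else
            ' Simple, outline, and mixed numbering all land here.
            Return "Numbered"
      End Select
   End Function
End Module
```

In the paragraph loop, a non-Nothing key could be passed to the style mapper in place of the paragraph's style name, so that list items get their own element name.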

Hard-coding the way the StyleMapping class interprets the contents of the mapping file isn’t very flexible. A more flexible method would be to follow the model the .NET Framework uses to process configuration files by creating section handlers to handle various sections of the mapping file. Doing that would allow you to add custom processing for a specific file at run time.

You can (and, unless you’re planning to turn the XML back into a Word file, probably should) add code to ensure that Word’s control/formatting characters aren’t inserted into the XML file, or add code to remove them later. For example, this article showed one way to remove the manual page break character using string manipulation. You can also use Word’s find-and-replace feature, as discussed in the section “Finding and Replacing Character Styles,” to remove the control characters. You can also perform post-processing on the XML file itself, but if you do that, you should be aware that the XmlDocument encodes control/formatting character values as it adds text, so you’ll need to search for the encoded representation of the characters, which starts with an ampersand and a number sign followed by the character value (such as &#8212;).

With a little alteration, this entire concept is useful for implementing document-checking rules or extracting specific paragraphs or field values from existing Word documents. For example, you might want to check Word documents and ensure that the styles adhere to company standards. Or you might want to use these automation techniques to extract only specific field values from a Word document, using an XML file to specify which fields the process should extract.
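Returning to the control-character cleanup discussed above, a simple way to do it in code is to filter each paragraph's text before it goes into the XML document. The sketch below is one hedged possibility: StripControlChars is a hypothetical helper, and the decision to keep tabs, linefeeds, and carriage returns (which are legal in XML) is an assumption you may want to change.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module ControlCharSketch
   ' Hypothetical helper: removes control characters other than
   ' tab, linefeed, and carriage return.
   Public Function StripControlChars(ByVal text As String) As String
      Dim sb As New StringBuilder(text.Length)
      For Each c As Char In text
         If Not Char.IsControl(c) _
            OrElse c = ControlChars.Tab _
            OrElse c = ControlChars.Lf _
            OrElse c = ControlChars.Cr Then
            sb.Append(c)
         End If
      Next
      Return sb.ToString()
   End Function

   Sub Main()
      ' Character 12 is Word's manual page break; character 7 is used
      ' here as a second example of a control character.
      Dim raw As String = ChrW(12) & "Page two starts here" & ChrW(7)
      Console.WriteLine(StripControlChars(raw))   ' Page two starts here
   End Sub
End Module
```

Filtering the text before assigning it to the element avoids having to search the finished XML for encoded character references afterward.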

Whatever you need to do with your Word files, this article should help get you started and give you ideas for controlling the process from outside your code.

