Export Customized XML from Microsoft Word with VB.NET

Word automation has traditionally been the province of VB Classic developers, but it’s alive and well in VB.NET; it’s just a little different. Word automation is the process of using the classes and methods exposed by Word to create new Word documents or to alter or manipulate existing ones. In this article, you’ll see how to get started with Word automation in VB.NET by exploring a process for transforming Word documents into customizable XML. The technique shown here doesn’t rely on Word 2003’s XML capabilities, so you can use it with any version of Word that supports automation. Most of the techniques you’ll see apply generally to any application that needs to automate Word from within .NET.

Here’s the process in a nutshell: You add a reference to the Word automation library to your project and use that reference to create a Word application object that can open Word files and export the document’s contents to an XML document. For this project, the application exports the content in such a way that each Word style gets translated to an appropriate XML element. By default, the application uses the names of Word styles (sometimes in slightly modified form) applied to paragraphs as the element names for the document. As written, the application follows the sequence of page breaks in the Word document itself. It doesn’t take section breaks within a document into account, but that would be relatively easy to add. The application preserves style formatting, but ignores empty paragraphs (those containing only whitespace such as spaces, tabs, carriage returns, and linefeeds).

Here’s an example. Suppose you have a Word document that looks like the sample.doc file shown in Figure 1.

Reference the Appropriate Word Library
Start a new Windows Application project in Visual Studio.NET, and name it WordAutomation. After creating the project, right-click on the References item in the Solution Explorer, and select Add Reference. Click the COM tab on the Add Reference dialog and then select the Microsoft Word MajorVersion.minorversion item (where MajorVersion.minorversion stands for the major and minor version number of the Word release you’re targeting). Doing that creates several references called Word, VBIDE, stdole, and Microsoft.Office.Core. The only one you need to worry about for this project is the Word reference.

Getting Word Content
The first step is to use Word automation to open and close Word documents and iterate through them to extract content. As an example, you’ll see how to load the sample.doc file that accompanies the downloadable code for this article, and display the text of each paragraph formatted as XML in a TextBox on your default form. The sample Form1 form has a TextBox to hold the filename of the Word file to process, a Browse button that lets you select a file, a multi-line TextBox to display the results (see Figure 2), and a Process button that processes the selected Word file, turning it into a valid XML document.

Figure 2: The Sample Form for the WordAutomation Project. The form lets you select a Word file, processes the file into XML, and displays the results in the multi-line TextBox.

To open the Word file, first create a Word.Application object. The sample project creates this when it instantiates the class by defining a class-level variable.

   Private wordApp As New Word.Application

When the user selects a file and clicks the Process button, the Click event-handler code opens the selected file by calling the Word application’s Documents.Open method.

   doc = wordApp.Documents.Open( _
      CType(Me.txtFilename.Text, Object), 1, 1, 0)

The method creates a new instance of the Word.Document class, which represents a Word document.

Note that you can’t create a Word.DocumentClass instance directly. Instead, you get the reference indirectly through the Word.Application object’s Documents collection; in this case by calling the Open method to open the selected file. The Open method returns a Word.Document object, which you can then use to manipulate the document’s content.

Author’s Note: In earlier Word versions, using the Application and Document classes directly caused problems, so you may need to use the ApplicationClass and DocumentClass classes instead. For example, you may experience an irritating conflict between Close() methods. For a Word.Document instance, you’ll get an error stating that “‘Close’ is ambiguous across the inherited interfaces ‘Word._Document’ and ‘Word.DocumentEvents_Event’.” One solution is to create your document references as DocumentClass instances rather than Document instances. Another workaround is to cast the Document or Application instances to a more specific DocumentClass or ApplicationClass instance before issuing the ambiguous call. For example, assuming that myDoc is a Document object, to issue a Close call you could write:

   CType(myDoc, DocumentClass).Close()

The line of code above casts the Document reference myDoc to a DocumentClass reference, which avoids any ambiguity with the call to Close().
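As a rough illustration of the open-and-close cycle described above, the sketch below opens a file read-only, reports its paragraph count, and then closes the document and shuts Word down. It assumes the project references the Word library under the Word namespace, as described earlier; the module name, the command-line argument, and the cleanup in the Finally block are illustrative choices rather than part of the sample project. It casts to the Word._Document and Word._Application interfaces, an alternative to the DocumentClass cast shown above for avoiding the ambiguous Close and Quit calls.

```vbnet
Imports System
Imports System.Runtime.InteropServices

Module OpenCloseSketch
   ' Assumption: args(0) holds the full path to a Word file.
   Sub Main(ByVal args As String())
      Dim wordApp As New Word.Application
      Dim doc As Word.Document = Nothing
      Try
         ' Same argument pattern as the article's call:
         ' FileName, ConfirmConversions, ReadOnly, AddToRecentFiles
         doc = wordApp.Documents.Open( _
            CType(args(0), Object), 1, 1, 0)
         Console.WriteLine("Paragraphs: " & _
            doc.Paragraphs.Count.ToString())
      Finally
         ' Close without saving; the interface cast avoids the
         ' ambiguity between the method and the event named Close.
         If Not doc Is Nothing Then
            CType(doc, Word._Document).Close(False)
         End If
         CType(wordApp, Word._Application).Quit()
         ' Release the COM wrapper so the Word process can exit.
         Marshal.ReleaseComObject(wordApp)
      End Try
   End Sub
End Module
```

Putting the cleanup in a Finally block means Word is closed even when the file cannot be opened, which helps avoid orphaned WINWORD.EXE processes during development.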

Finding and Replacing Character Styles
Word has two different types of styles. Paragraph styles control such paragraph properties as indentation, spacing before and after the paragraph, and the default font characteristics applied to text in the paragraph. In addition, Word supports character styles, which are styles applied to a range of characters. For example, when you format characters as bold or italic, you’ve applied a character style. For this application, assume you want the paragraph style to become the name of the XML tag surrounding the text of the paragraph, but you also want to maintain bold and italic styles within the text.

Rather than writing the code to iterate through all the Word characters looking for bold and italic characters yourself, it’s much simpler to get Word to do it for you. Word has powerful find and replace methods, and you have access to them through your object references. In the sample code, the two private methods boldToHTML and italicToHTML handle replacing ranges of bold and italic characters with HTML tag equivalents. In other words, bold text becomes <b>bold text</b> and italic text becomes <i>italic text</i>. Using Word’s find-and-replace functionality, you can perform replacement processes on the entire document very quickly.

   Public Sub boldToHTML(ByVal doc As Word.Document)
      With doc.Content.Find
         ' find any run of bold text
         .ClearFormatting()
         .Font.Bold = 1
         ' the replacement text is not bold
         .Replacement.ClearFormatting()
         .Replacement.Font.Bold = 0
         .Text = "*"
         ' ^& stands for the text that was found
         .Execute(findtext:="", _
            ReplaceWith:="<b>^&</b>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub

   Public Sub italicToHTML(ByVal doc As Word.Document)
      With doc.Content.Find
         ' find any run of italic text
         .ClearFormatting()
         .Font.Italic = 1
         ' the replacement text is not italic
         .Replacement.ClearFormatting()
         .Replacement.Font.Italic = 0
         .Text = "*"
         .Execute(findtext:="", _
            ReplaceWith:="<i>^&</i>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub

You will probably want to add other, similar find-and-replace methods depending on your needs. For example, you might need to find underlined text and replace it with <u> tags.
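One possible shape for such an additional method is sketched below. It follows the same pattern as the bold and italic methods; the name underlineToHTML is hypothetical, and the sketch assumes that only single underlines matter (Word has several other underline types, which would need their own passes). Like the article’s methods, it compiles only in a project that references the Word library.

```vbnet
Module UnderlineSketch
   ' Hypothetical companion to boldToHTML and italicToHTML:
   ' wraps single-underlined text in <u> tags.
   Public Sub underlineToHTML(ByVal doc As Word.Document)
      With doc.Content.Find
         ' search for text with a single underline
         .ClearFormatting()
         .Font.Underline = Word.WdUnderline.wdUnderlineSingle
         ' the replacement text carries no underline
         .Replacement.ClearFormatting()
         .Replacement.Font.Underline = _
            Word.WdUnderline.wdUnderlineNone
         .Text = "*"
         ' ^& stands for the text that was found
         .Execute(FindText:="", _
            ReplaceWith:="<u>^&</u>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub
End Module
```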

One problem with this approach is that as people write documents, they often turn bold or some other character formatting on, write some text, and then press return to end a paragraph. They start typing the next line and realize that the formatting is still in effect, so they turn it off. The document looks fine, but unfortunately, this common series of actions means the carriage return at the end of the text is itself bold. When you execute the find-and-replace methods shown above, you’ll get results like this:

   <b>This is bold text
   </b>

That causes problems when you translate documents to XML, because it results in improperly nested tags, such as:

   <p><b>This is bold text</p></b>

The solution is to make a single pass through the document replacing all paragraph marks (carriage returns) with unformatted paragraph marks. The clearCRFormatting method shown below solves the problem.

   Public Sub clearCRFormatting(ByVal doc _
      As Word.Document)
      With doc.Content.Find
         .ClearFormatting()
         ' replace each paragraph mark with an unformatted one
         .Replacement.ClearFormatting()
         .Replacement.Font.Bold = 0
         .Replacement.Font.Italic = 0
         .Replacement.Font.Underline = 0
         .Execute(findtext:="^p", ReplaceWith:="^p", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
      End With
   End Sub

The sample project calls all three find-and-replace methods immediately after loading the document (see Listing 2). Next, it starts collecting the paragraph content.

Word uses a hierarchical collection model. At the document level, each document has a collection of Paragraph objects, so it’s easy to iterate through them:

   ' loop through the paragraphs
   For Each p As Word.Paragraph In doc.Paragraphs
      ' do something with each paragraph
   Next

For this application, the “do something with each paragraph” consists of several tasks:

  • Determine the page number where the paragraph would appear if the document were printed. Whenever the page changes, you want to insert a <page> element to reflect the original page structure of the document in the XML file.
  • Retrieve the Word style name for the paragraph.
  • Create an XML element for that style. This step itself consists of several separate tasks, such as determining whether to use the Word style name directly or “map” it to a different element name using a “style mapping.” For example, you might want to translate the “Normal” style to HTML-like <p> tags. By default, the application uses the Word style name, with minor adjustments to meet XML element-naming rules.

  • Append the XML element containing the contents of the paragraph to an XML output document.

The sample application wraps all these steps up into a method called docToXml, which takes the Word.Document instance as a parameter and returns an XML document object containing the completed elements and the Word document’s content.

Determining Page Numbers
A Word document is an abstract representation of a printed document. Both the content and the print setup affect page numbers. For example, you can print a Word document on non-standard paper, or insert new paragraphs at the beginning of a document. When you make such changes, Word rearranges the content to reflect the new content or output medium. The result is that paragraphs don’t “belong” to a specific page. Therefore, if you want to maintain the current Word page arrangement, you need to capture the page number as you loop through the document grabbing paragraph content. This application maintains the relationship between the paragraphs and Word’s pages by creating each paragraph element as a child of a <page id="#"> element, where “#” is the page number.

To retrieve the page number on which any range of text lies, you first select the text and then use the Word.Selection.Information method to get the page number for a specific part of the selection (because selection ranges might cross pages).

   ' check the pagenumber
   If CType(wordApp.Selection.Information( _
      Word.WdInformation.wdActiveEndPageNumber), _
      Integer) > currentPageNumber Then
      pageNode = Me.addPage(xmlDoc)
      currentPageNumber = Integer.Parse( _
         pageNode.Attributes("id").Value)
   End If

The sample application initially sets up a new XML document with one <page> element (page 1) and sets the currentPageNumber variable to 1. For each paragraph, the code retrieves the page number for the end of the paragraph, and compares it to the currentPageNumber value. When the paragraph page differs from the currentPageNumber, the application adds a new <page> element using the addPage method, which returns a reference to the new page element.

   Private Function addPage(ByVal xmlDoc As _
      XmlDocument) As XmlElement
      Dim page As XmlElement = _
         xmlDoc.CreateElement("page")
      Dim natt As XmlAttribute = _
         xmlDoc.CreateAttribute("id")
      ' number the new page one higher than the
      ' count of existing page elements
      If xmlDoc.SelectNodes("document//page") _
         Is Nothing Then
         natt.Value = "1"
      Else
         natt.Value = (xmlDoc.SelectNodes( _
            "document//page").Count + 1).ToString
      End If
      page.Attributes.Append(natt)
      xmlDoc.SelectSingleNode( _
         "/document").AppendChild(page)
      Return page
   End Function

This scheme can’t maintain a perfect page-to-paragraph relationship; if a paragraph begins on one page but ends on another, this scheme stores the entire paragraph on the higher page.
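To make the resulting structure concrete, here is a small console sketch that builds the same kind of document outside of Word. The root element name document and the id attribute come from the addPage code above; the module name, the AddPage variant, and the sample paragraph are assumptions made for illustration only.

```vbnet
Imports System
Imports System.Xml

Module PageElementSketch
   ' Appends a <page id="n"> element under the <document> root,
   ' where n is one more than the number of existing pages.
   Function AddPage(ByVal xmlDoc As XmlDocument) As XmlElement
      Dim page As XmlElement = xmlDoc.CreateElement("page")
      Dim nextId As Integer = _
         xmlDoc.SelectNodes("/document/page").Count + 1
      page.SetAttribute("id", nextId.ToString())
      xmlDoc.DocumentElement.AppendChild(page)
      Return page
   End Function

   Sub Main()
      Dim xmlDoc As New XmlDocument()
      xmlDoc.LoadXml("<document/>")

      ' page 1 holds one paragraph element
      Dim pageNode As XmlElement = AddPage(xmlDoc)
      Dim para As XmlElement = xmlDoc.CreateElement("Normal")
      para.InnerText = "First paragraph"
      pageNode.AppendChild(para)

      ' a page change adds a second, still empty, page element
      pageNode = AddPage(xmlDoc)

      Console.WriteLine(xmlDoc.OuterXml)
   End Sub
End Module
```

Because SelectNodes returns an empty list rather than Nothing when there are no matches, the count-plus-one calculation also covers the first page.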

Some documents contain “hard” (manual) page breaks. Word stores these as a single form-feed character (decimal 12, hex 0C). It turns out that these hard page breaks appear as the first character in the paragraph following the hard page break, so you must test for that as well. The code below strips the hard page break character from the beginning of the paragraph text.

   ' get the para text
   Dim s As String = p.Range.Text

   ' check to see if there's a hard page break at the
   ' start of this para
   If Asc(s.Chars(0)) = &HC Then
      s = s.Substring(1, s.Length - 1)
      pageNode = Me.addPage(xmlDoc)
      currentPageNumber = Integer.Parse( _
         pageNode.Attributes("id").Value)
   End If
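The same test can be tried without Word. The sketch below works on a plain string and also shows one way to detect the whitespace-only paragraphs that the application ignores. The helper names StartsWithPageBreak and IsBlank are hypothetical and do not appear in the sample project.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module PageBreakSketch
   ' True when the text begins with a manual page break
   ' (form feed, character code 12).
   Function StartsWithPageBreak(ByVal text As String) As Boolean
      Return text.Length > 0 AndAlso _
         text.Chars(0) = ControlChars.FormFeed
   End Function

   ' True when the text holds only whitespace such as spaces,
   ' tabs, carriage returns, and linefeeds.
   Function IsBlank(ByVal text As String) As Boolean
      Return text.Trim().Length = 0
   End Function

   Sub Main()
      ' simulated paragraph text: page break, content,
      ' and the trailing paragraph mark
      Dim s As String = ControlChars.FormFeed & _
         "Chapter 2" & ControlChars.Cr

      If StartsWithPageBreak(s) Then
         s = s.Substring(1)
         Console.WriteLine("Hard page break found")
      End If

      Console.WriteLine("Blank paragraph: " & IsBlank(s).ToString())
   End Sub
End Module
```

Note the order of the checks: Trim treats the form feed as whitespace, so the page-break test needs to run before any blank-paragraph test, or a break followed by an empty paragraph would be silently discarded.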

Mapping Style Names to Element Names
Each Word paragraph object has a Style property that returns a Style object. So as you iterate through the paragraphs, you want to obtain the Style object and retrieve its name. It turns out that Style objects don’t have a Name property; they have a NameLocal property instead, which corresponds to the name you see when you select a style from Word’s dropdown style list, and that’s exactly what you need. Because the Style property returns an Object, you must cast it to a Word.Style object to use the NameLocal property in your code.

   Dim stylename As String = CType(p.Style, _
      Word.Style).NameLocal

Now that you have the style name for this paragraph, you want to map it to an XML element. There are two considerations. First, the Word style names can contain spaces, while XML element names cannot; therefore, you must either remove or replace the spaces before applying the name to an XML element.

Second, you may not want to map the Word style names directly to XML element names. For example, you might want to map Word’s Normal style to a <p> element in the XML document. To do that, you need to write a bit of lookup code to map style names to element names. The sample application contains a StyleMapping class that performs the lookup (see Listing 3). For convenience, the StyleMapping class also contains a fixupName method that handles replacing any spaces in the Word style name with underscores.

To create an instance of the StyleMapping class, pass it the name of an XML-formatted map file. Map files consist of a root tag, which contains any number of <item> tags. Each <item> tag has style and tag attributes that hold the Word style name and the corresponding name of the XML tag that will hold a paragraph of that style.
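A map file along these lines might look like the following fragment. The item element and its style and tag attributes match what the lookup code expects; the root element name stylemap is a placeholder chosen for this sketch, since the lookup uses the XPath //item and so does not depend on the root name.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Root element name is a placeholder for illustration -->
<stylemap>
   <item style="Heading 1" tag="h1" />
   <item style="Normal" tag="p" />
</stylemap>
```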


For example, the preceding map file instructs the application to map the Word style “Heading 1” to an <h1> element and to map the Normal style to a <p> element.

As written, the application always attempts to look up the style name for every paragraph by calling the StyleMapping.GetStyleToElementMapping method. If that method finds an element with a matching style attribute, it returns the value of the tag attribute; otherwise it “fixes up” the Word style name by calling the private fixupName method and returns the result.

   ' definition in docToXml method
   Dim styleMapper As New StyleMapping( _
      Application.StartupPath & "\stylemapping.xml")

   ' for each paragraph, map the Word style to
   ' an XML element name
   Dim elementName As String = _
      styleMapper.GetStyleToElementMapping(stylename)

   ' In the StyleMapping class
   Public Function GetStyleToElementMapping( _
      ByVal aStylename As String) As String
      Dim el As XmlElement = getMapNode(aStylename)
      Dim tagname As String = String.Empty
      If Not el Is Nothing Then
         If el.HasAttribute("tag") Then
            tagname = el.GetAttribute("tag")
         End If
      End If
      ' no mapping found: fall back to the style name
      If tagname = String.Empty Then
         tagname = fixupName(aStylename)
      End If
      Return tagname
   End Function

   Private Function getMapNode( _
      ByVal aStylename As String) As XmlElement
      Dim n As XmlNode = _
         xml.SelectSingleNode("//item[@style='" + _
         aStylename + "']")
      If Not n Is Nothing Then
         Return CType(n, XmlElement)
      Else
         Return Nothing
      End If
   End Function

   Private Function fixupName(ByVal aStylename _
      As String) As String
      ' XML element names cannot contain spaces
      Return aStylename.Replace(" "c, "_"c)
   End Function
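Replacing spaces covers the most common case, but a style name can break XML naming rules in other ways, for example by starting with a digit. The sketch below shows one possible stricter fix-up that uses the framework’s XmlConvert class to check the candidate name and encode it when it is still invalid. The function name ToElementName is hypothetical, and this is an optional variation rather than what the sample’s fixupName method does.

```vbnet
Imports System
Imports System.Xml

Module ElementNameSketch
   ' Returns a name that is safe to pass to CreateElement.
   Function ToElementName(ByVal styleName As String) As String
      ' same first step as fixupName: spaces become underscores
      Dim candidate As String = _
         styleName.Trim().Replace(" "c, "_"c)
      Try
         ' throws XmlException when the name is not a valid NCName
         XmlConvert.VerifyNCName(candidate)
         Return candidate
      Catch ex As XmlException
         ' escape the offending characters as _xHHHH_ sequences
         Return XmlConvert.EncodeLocalName(candidate)
      End Try
   End Function

   Sub Main()
      Console.WriteLine(ToElementName("Heading 1"))   ' Heading_1
      Console.WriteLine(ToElementName("1st Level"))   ' encoded form
   End Sub
End Module
```

Encoded names are less readable than hand-mapped ones, so an explicit entry in the map file remains the better choice for any style you expect to see often.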

After obtaining a mapped name, you can create a new XmlElement and append it to the most recent page element.

   Dim N As XmlElement = _
      xmlDoc.CreateElement(elementName)
   N.InnerText = s
   pageNode.AppendChild(N)

When the docToXml function has processed all the paragraphs, it returns the completed XML document. The Process button Click event handler code then displays it in the multi-line TextBox (see Figure 3).
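The overall flow can be approximated without Word by feeding the same logic a few in-memory paragraphs. In the sketch below, the Para structure, its sample data, and the BuildXml function are hypothetical stand-ins for Word’s paragraphs and the docToXml method; only the document and page element names and the skip-empty, page-change, and name fix-up rules follow the article.

```vbnet
Imports System
Imports System.Xml

Module DocToXmlSketch
   ' Hypothetical stand-in for a Word paragraph.
   Structure Para
      Public StyleName As String
      Public Text As String
      Public PageNumber As Integer
      Sub New(ByVal aStyle As String, ByVal aText As String, _
         ByVal aPage As Integer)
         StyleName = aStyle
         Text = aText
         PageNumber = aPage
      End Sub
   End Structure

   Function BuildXml(ByVal paras As Para()) As XmlDocument
      Dim xmlDoc As New XmlDocument()
      ' start with a single page element for page 1
      xmlDoc.LoadXml("<document><page id=""1"" /></document>")
      Dim pageNode As XmlElement = _
         CType(xmlDoc.DocumentElement.FirstChild, XmlElement)
      Dim currentPage As Integer = 1

      For Each p As Para In paras
         ' ignore paragraphs that contain only whitespace
         If p.Text.Trim().Length > 0 Then
            ' start a new page element when the page changes
            If p.PageNumber > currentPage Then
               currentPage += 1
               pageNode = xmlDoc.CreateElement("page")
               pageNode.SetAttribute("id", currentPage.ToString())
               xmlDoc.DocumentElement.AppendChild(pageNode)
            End If
            ' style name becomes the element name
            Dim el As XmlElement = xmlDoc.CreateElement( _
               p.StyleName.Replace(" "c, "_"c))
            el.InnerText = p.Text.Trim()
            pageNode.AppendChild(el)
         End If
      Next
      Return xmlDoc
   End Function

   Sub Main()
      Dim paras As Para() = { _
         New Para("Heading 1", "Introduction", 1), _
         New Para("Normal", "   ", 1), _
         New Para("Normal", "Body text on page two.", 2)}
      Console.WriteLine(BuildXml(paras).OuterXml)
   End Sub
End Module
```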

Figure 3: The Completed Transformation. After processing the simple sample.doc file, the multi-line TextBox displays the content transformed to XML.

Customizing the Doc-to-XML Sample Application
The solution shown here is very generic and also extremely simple. It doesn’t take advanced Word features such as bullets, numbered paragraphs, document sections, or custom character formatting into account, but adding support for such features is not difficult.

For example, you can check each Word paragraph to see if it has a bullet or number and add items to the mapping file to map bulleted paragraphs to specific XML element names, such as <li>. You can also alter the format of the mapping file to hold more information, perhaps to add specific HTML style attributes to each element (although it’s usually best to add formatting with an XSLT transform). I’ve added code to the StyleMapping class so that you can add and remove mappings programmatically. Although the sample code doesn’t use these methods, you’ll find two commented-out lines at the top of the docToXml method that create a new StyleMapping document and add a mapping dynamically.

Hard-coding the way the StyleMapping class interprets the contents of the mapping file isn’t very flexible. A more flexible method would be to follow the model the .NET framework uses to process configuration files by creating section handlers to handle various sections of the mapping file. Doing that would allow you to add custom processing for a specific file at run time.

You can, and (unless you’re planning to turn the XML back into a Word file) probably should, add code to ensure that Word’s control/formatting characters aren’t inserted into the XML file, or add code to remove them later. For example, this article showed one way to remove the manual page break character using string replacement. You can use Word’s find and replace feature as discussed in the section “Finding and Replacing Character Styles” to remove the control characters. You can also perform post-processing on the XML file itself, but if you do that, you should be aware that the XmlDocument encodes control/formatting character values as it adds text, so you’ll need to search for the encoded representation of the characters, which will start with an ampersand and a number sign followed by the character value and a semicolon (that is, &#nnn;, where nnn is the character value).

With a little alteration, this entire concept is useful for implementing document-checking rules or extracting specific paragraphs or field values from existing Word documents. For example, you might want to check Word documents and ensure that the styles adhere to company standards. Or you might want to use these automation techniques to extract only specific field values from a Word document, using an XML file to specify which fields the process should extract.

Whatever you need to do with your Word files, this article should help get you started and give you ideas for controlling the process from outside your code.



    ©2023 Copyright DevX - All Rights Reserved. Registration or use of this site constitutes acceptance of our Terms of Service and Privacy Policy.