Word automation has traditionally been the province of VB Classic developers, but it’s alive and well in VB.NET; it’s just a little different. Word automation is the process of using the classes and methods exposed by Word to create new Word documents or to alter or manipulate existing ones. In this article, you’ll see how to get started with Word automation in VB.NET by exploring a process for transforming Word documents into customizable XML. The technique shown here doesn’t rely on Word 2003’s XML capabilities, so you can use it with any version of Word that supports automation. Most of the techniques you’ll see apply generally to any application that needs to automate Word from within .NET.
Here’s the process in a nutshell: You add a reference to the Word automation library to your project and use that reference to create a Word application object that can open Word files and export the document’s contents to an XML document. For this project, the application exports the content in such a way that each Word style gets translated to an appropriate XML element. By default, the application uses the names of Word styles (sometimes in slightly modified form) applied to paragraphs as the element names for the document. As written, the application follows the sequence of page breaks in the Word document itself. It doesn’t take section breaks within a document into account, but that would be relatively easy to add. The application preserves style formatting, but ignores empty paragraphs (those containing only whitespace such as spaces, tabs, carriage returns, and linefeeds).
Here’s an example. Suppose you have a Word document that looks like the sample.doc file shown in Figure 1.
Reference the Appropriate Word Library
Start a new Windows Application project in Visual Studio .NET, and name it WordAutomation. After creating the project, right-click the References item in the Solution Explorer, and select Add Reference. Click the COM tab on the Add Reference dialog and then select the Microsoft Word MajorVersion.minorversion item (where MajorVersion.minorversion stands for the major and minor version number of the Word release you’re targeting). Doing that creates several references called Word, VBIDE, stdole, and Microsoft.Office.Core. The only one you need to worry about for this project is the Word reference.
Getting Word Content
To open the Word file, first create a Word.Application object. The sample project creates this when it instantiates the class by defining a class-level variable.
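A minimal sketch of such a class-level variable, assuming the COM reference exposes the Word namespace as described above (with the primary interop assemblies the namespace is Microsoft.Office.Interop.Word, which can be aliased with Imports Word = Microsoft.Office.Interop.Word). The form name MainForm and the cleanup in OnClosed are illustrative assumptions, not taken from the sample project.

```vbnet
Imports System
Imports System.Windows.Forms

Public Class MainForm   ' hypothetical form name
    Inherits Form

    ' Created once, when the form itself is instantiated.
    Private wordApp As New Word.Application()

    Protected Overrides Sub OnClosed(ByVal e As EventArgs)
        ' Shut down the Word instance so it does not linger in the background.
        ' The cast to _Application avoids a clash between the Quit method
        ' and the Quit event on the Application interface.
        CType(wordApp, Word._Application).Quit()
        MyBase.OnClosed(e)
    End Sub
End Class
```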
When the user selects a file and clicks the Process button, the Click event-handler code opens the selected file by calling the Word application’s Documents.Open method.
The method creates a new instance of the Word.Document class, which represents a Word document.
Note that you can’t create a Word.DocumentClass instance directly. Instead, you get the reference indirectly through the Word.Application object’s Documents collection; in this case by calling the Open method to open the selected file. The Open method returns a Word.Document object, which you can then use to manipulate the document’s content.
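As a rough sketch of that call, assuming a Word.Application instance already exists, opening a file might look like the following. The module and function names are hypothetical. The file name goes through an Object variable because some versions of the interop library declare the parameter as ByRef Object.

```vbnet
Imports System

Public Module OpenDocumentSketch
    ' Opens the file at filePath in the given Word instance and returns
    ' the resulting Word.Document reference.
    Public Function OpenDocument(ByVal wordApp As Word.Application, _
                                 ByVal filePath As String) As Word.Document
        ' Pass an Object variable so the call compiles whether the
        ' interop library declares FileName as ByRef or ByVal.
        Dim fileName As Object = filePath
        ' The remaining Open parameters are optional in VB.NET.
        Return wordApp.Documents.Open(fileName)
    End Function
End Module
```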
Author’s Note: In earlier Word versions, using the Application and Document classes directly caused problems, so you may need to use the ApplicationClass and DocumentClass classes instead. For example, you may experience an irritating conflict between Close() methods. For a Word.Document instance, you’ll get an error stating that “‘Close’ is ambiguous across the inherited interfaces ‘Word._Document’ and ‘Word.DocumentEvents_Event’.” One solution is to create your document references as DocumentClass instances rather than Document instances. Another workaround is to cast the Document or Application instances to a more specific DocumentClass or ApplicationClass instance before issuing the ambiguous call. For example, assuming that myDoc is a Document object, to issue a Close call you could write:
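A minimal sketch of that cast, wrapped in a hypothetical helper so it stands alone. It assumes a reference to an interop library that exposes DocumentClass. Where that class is not available (for example, when interop types are embedded), casting to the Word._Document interface is a common alternative that resolves the same ambiguity.

```vbnet
Public Module CloseSketch
    Public Sub CloseDocument(ByVal myDoc As Word.Document)
        ' Cast to the concrete class so the compiler binds to the Close
        ' method rather than the Close event from DocumentEvents_Event.
        CType(myDoc, Word.DocumentClass).Close()
    End Sub
End Module
```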
The line of code above casts the Document reference myDoc to a DocumentClass reference, which avoids any ambiguity with the call to Close().
Finding and Replacing Character Styles
Rather than writing the code to iterate through all the Word characters looking for bold and italic characters yourself, it’s much simpler to get Word to do it for you. Word has powerful find and replace methods, and you have access to them through your object references. In the sample code, the two private methods boldToHTML and italicToHTML handle replacing ranges of bold and italic characters with HTML tag equivalents. In other words, bold text becomes <b>bold text</b> and italic text becomes <i>italic text</i>. Using Word’s find-and-replace functionality, you can perform replacement processes on the entire document very quickly.
You will probably want to add other, similar find-and-replace methods depending on your needs. For example, you might need to find underlined text and wrap it in the corresponding tags.
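The following is a sketch of how a bold-to-tag replacement could be written with Word’s Find object. It is not the sample project’s boldToHTML listing, and the module and method names are hypothetical. It searches for any bold run (empty search text with formatting enabled) and uses Word’s ^& replacement code, which stands for the matched text.

```vbnet
Public Module BoldReplaceSketch
    ' Wraps every bold run in the document in <b>...</b> and removes
    ' the bold formatting from the replaced text.
    Public Sub BoldToHtml(ByVal doc As Word.Document)
        Dim rng As Word.Range = doc.Content
        Dim replaceAll As Object = Word.WdReplace.wdReplaceAll

        With rng.Find
            .ClearFormatting()
            .Replacement.ClearFormatting()
            .Text = ""                          ' match on formatting only
            .Font.Bold = 1                      ' find bold text
            .Replacement.Text = "<b>^&</b>"     ' ^& = the text that was found
            .Replacement.Font.Bold = 0          ' result is no longer bold
            .Forward = True
            .Wrap = Word.WdFindWrap.wdFindStop
            .Format = True
            .MatchWildcards = False
            .Execute(Replace:=replaceAll)
        End With
    End Sub
End Module
```

An italic version would follow the same pattern with Font.Italic and <i> tags.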
One problem with this approach is that as people write documents, they often turn bold or some other character formatting on, write some text, and then press Return to end a paragraph. They start typing the next line and realize that the formatting is still in effect, so they turn it off. The document looks fine, but unfortunately, this common series of actions means the carriage return at the end of the text is itself bold. When you execute the find-and-replace methods shown above, you’ll get results like this:
That causes problems when you translate documents to XML, because it results in improperly nested tags, such as:
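A small illustration of the nesting problem, using hypothetical element names and the standard XmlDocument parser. When the closing tag lands after the paragraph boundary, the fragment is no longer well-formed and cannot be loaded as XML.

```vbnet
Imports System.Xml

Public Module NestingSketch
    ' Returns True when the fragment parses as well-formed XML.
    Public Function IsWellFormed(ByVal fragment As String) As Boolean
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml(fragment)
            Return True
        Catch ex As XmlException
            Return False
        End Try
    End Function
End Module
```

With this helper, "<page><Normal><b>Warning</Normal><Normal></b>Next paragraph</Normal></page>" is rejected, while the same content with </b> placed before </Normal> is accepted.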
The solution is to make a single pass through the document replacing all paragraph marks (carriage returns) with unformatted paragraph marks. The clearCRFormatting method shown below solves the problem.
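One way such a pass could be written is sketched below. It is an assumption about the approach rather than the sample project’s clearCRFormatting listing, and the names are hypothetical. It replaces each paragraph mark (^p) with a paragraph mark whose bold and italic settings are turned off.

```vbnet
Public Module ParagraphMarkSketch
    ' Replaces every paragraph mark with a non-bold, non-italic one.
    Public Sub ClearParagraphMarkFormatting(ByVal doc As Word.Document)
        Dim rng As Word.Range = doc.Content
        Dim replaceAll As Object = Word.WdReplace.wdReplaceAll

        With rng.Find
            .ClearFormatting()
            .Replacement.ClearFormatting()
            .Text = "^p"                     ' Word's code for a paragraph mark
            .Replacement.Text = "^p"
            .Replacement.Font.Bold = 0
            .Replacement.Font.Italic = 0
            .Forward = True
            .Wrap = Word.WdFindWrap.wdFindStop
            .Format = True
            .MatchWildcards = False
            .Execute(Replace:=replaceAll)
        End With
    End Sub
End Module
```

Running this before the bold and italic replacements keeps the closing tags inside the paragraph they belong to.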
The sample project calls all three find-and-replace methods immediately after loading the document (see Listing 2). Next, it starts collecting the paragraph content.
Word uses a hierarchical collection model. At the document level, each document has a collection of Paragraph objects, so it’s easy to iterate through them:
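A minimal sketch of that loop, assuming doc is an open Word.Document. The module and method names are hypothetical.

```vbnet
Public Module ParagraphLoopSketch
    Public Sub ProcessParagraphs(ByVal doc As Word.Document)
        For Each para As Word.Paragraph In doc.Paragraphs
            ' The paragraph text, including its trailing paragraph mark.
            Dim paragraphText As String = para.Range.Text
            ' do something with each paragraph
        Next
    End Sub
End Module
```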
For this application, the “do something with each paragraph” consists of several tasks:
The sample application wraps all these steps up into a method called docToXml, which takes the Word.Document instance as a parameter and returns an XML document object containing the completed elements and the Word document’s content.
Determining Page Numbers
To retrieve the page number on which any range of text lies, you first select the text and then use the Word.Selection.Information property to get the page number for a specific part of the selection (because selection ranges might cross pages).
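A sketch of that lookup for a single paragraph, with hypothetical module and function names. It asks for wdActiveEndPageNumber, the page on which the active end of the selection falls.

```vbnet
Imports System

Public Module PageNumberSketch
    ' Returns the page number on which the end of the paragraph lies.
    Public Function GetPageNumber(ByVal wordApp As Word.Application, _
                                  ByVal para As Word.Paragraph) As Integer
        ' Information is read from the current selection, so select first.
        para.Range.Select()
        Dim info As Object = wordApp.Selection.Information( _
            Word.WdInformation.wdActiveEndPageNumber)
        Return CInt(info)
    End Function
End Module
```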
The sample application initially sets up a new XML document with one
This scheme can’t maintain a perfect page-to-paragraph relationship; if a paragraph begins on one page but ends on another, this scheme stores the entire paragraph on the higher page.
Some documents contain “hard” page breaks, which the author inserted manually. Word stores each of these as a single form feed character (decimal 12). It turns out that these hard page breaks appear as the first character in the paragraph following the hard page break, so you must test for that as well. The code below strips the hard page break character from the beginning of the paragraph text.
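A minimal sketch of that stripping step, assuming the manual page break is the form feed character (decimal 12), which is how Word represents it in a range’s text. The module and function names are hypothetical.

```vbnet
Imports System

Public Module PageBreakSketch
    ' Assumption: a manual page break is a form feed, ChrW(12).
    Private ReadOnly ManualPageBreak As Char = ChrW(12)

    ' Removes a single leading manual page break, if one is present.
    Public Function StripLeadingPageBreak(ByVal paragraphText As String) As String
        If String.IsNullOrEmpty(paragraphText) Then
            Return String.Empty
        End If
        If paragraphText.Chars(0) = ManualPageBreak Then
            Return paragraphText.Substring(1)
        End If
        Return paragraphText
    End Function
End Module
```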
Mapping Style Names to Element Names
Now that you have the style name for this paragraph, you want to map it to an XML element. There are two considerations. First, the Word style names can contain spaces, while XML element names cannot; therefore, you must either remove or replace the spaces before applying the name to an XML element.
Second, you may not want to map the Word style names directly to XML element names. For example, you might want to map Word’s Normal style to a differently named element in the XML document. To do that, you need to write a bit of lookup code to map style names to element names. The sample application contains a StyleMapping class that performs the lookup (see Listing 3). For convenience, the StyleMapping class also contains a fixupName method that handles replacing any spaces in the Word style name with underscores.
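The space replacement can be sketched in one short function. This is an illustration of the idea, not the StyleMapping class’s own listing, and the names are hypothetical.

```vbnet
Imports System

Public Module NameFixupSketch
    ' Replaces spaces in a Word style name with underscores so the
    ' result can be used as an XML element name.
    Public Function FixupName(ByVal styleName As String) As String
        Return styleName.Trim().Replace(" ", "_")
    End Function
End Module
```

For example, FixupName("Heading 1") yields "Heading_1". Style names can contain other characters that are not legal in element names; the framework’s XmlConvert.EncodeLocalName method is one option for handling those cases.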
To instantiate an instance of the StyleMapping class, pass it the name of an XML-formatted map file. Map files consist of a root
For example, a map file can instruct the application to map the Word style “Heading 1” to one element name and to map the Normal style to another.
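The lookup itself might be sketched as follows. The map format used here (a map root with item children carrying style and element attributes) and the target names h1 and p are assumptions made for illustration, not the sample application’s actual format. Falling back to the style name with underscores when no mapping exists is also an assumption.

```vbnet
Imports System
Imports System.Xml

Public Module StyleMapSketch
    ' Hypothetical map format:
    '   <map>
    '     <item style="Heading 1" element="h1"/>
    '     <item style="Normal" element="p"/>
    '   </map>
    Public Function LookupElementName(ByVal map As XmlDocument, _
                                      ByVal styleName As String) As String
        For Each node As XmlNode In map.DocumentElement.ChildNodes
            Dim item As XmlElement = TryCast(node, XmlElement)
            If Not item Is Nothing AndAlso item.GetAttribute("style") = styleName Then
                Return item.GetAttribute("element")
            End If
        Next
        ' No mapping found: fall back to the style name with spaces replaced.
        Return styleName.Replace(" ", "_")
    End Function
End Module
```

Comparing attribute values in a loop, rather than building an XPath expression from the style name, avoids quoting problems when a style name contains an apostrophe.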
As written, the application always attempts to look up the style name for every paragraph by calling the StyleMapping.GetStyleToElementMapping method. If that method finds an
After obtaining a mapped name, you can create a new XmlElement and append it to the most recent page element.
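A sketch of that step using the standard System.Xml classes. The helper name is hypothetical, and the caller is assumed to hold a reference to the current page element.

```vbnet
Imports System
Imports System.Xml

Public Module AppendSketch
    ' Creates an element with the mapped name, stores the paragraph text
    ' in it, and appends it to the given page element.
    Public Function AppendParagraph(ByVal pageElement As XmlElement, _
                                    ByVal elementName As String, _
                                    ByVal paragraphText As String) As XmlElement
        Dim el As XmlElement = pageElement.OwnerDocument.CreateElement(elementName)
        ' InnerText stores the text literally; markup characters are
        ' escaped when the document is serialized.
        el.InnerText = paragraphText
        pageElement.AppendChild(el)
        Return el
    End Function
End Module
```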
When the docToXml function has processed all the paragraphs, it returns the completed XML document. The Process button Click event handler code then displays it in the multi-line TextBox (see Figure 3).
Customizing the Doc-to-XML Sample Application
For example, you can check each Word paragraph to see if it has a bullet or number and add items to the mapping file to map bulleted paragraphs to specific XML element names, such as
Hard-coding the way the StyleMapping class interprets the contents of the mapping file isn’t very flexible. A more flexible method would be to follow the model the .NET framework uses to process configuration files by creating section handlers to handle various sections of the mapping file. Doing that would allow you to add custom processing for a specific file at run time.
You can, and (unless you’re planning to turn the XML back into a Word file) probably should, add code to ensure that Word’s control/formatting characters aren’t inserted into the XML file, or add code to remove them later. For example, this article showed one way to remove the manual page break character using string replacement. You can also use Word’s find-and-replace feature, as discussed in the section “Finding and Replacing Character Styles,” to remove the control characters. Alternatively, you can perform post-processing on the XML file itself, but if you do that, be aware that the XmlDocument encodes control/formatting character values as it adds text, so you’ll need to search for the encoded representation of the characters, which starts with an ampersand and a number sign followed by the character value.
With a little alteration, this entire concept is useful for implementing document-checking rules or extracting specific paragraphs or field values from existing Word documents. For example, you might want to check Word documents and ensure that the styles adhere to company standards. Or you might want to use these automation techniques to extract only specific field values from a Word document, using an XML file to specify which fields the process should extract.
Whatever you need to do with your Word files, this article should help get you started and give you ideas for controlling the process from outside your code.