Word automation has traditionally been the province of VB Classic developers, but it’s alive and well in VB.NET; it’s just a little different. Word automation is the process of using the classes and methods exposed by Word to create new Word documents or to alter or manipulate existing ones. In this article, you’ll see how to get started with Word automation in VB.NET by exploring a process for transforming Word documents into customizable XML. The technique shown here doesn’t rely on Word 2003’s XML capabilities, so you can use it with any version of Word that supports automation. Most of the techniques you’ll see apply generally to any application that needs to automate Word from within .NET.
Here’s the process in a nutshell: You add a reference to the Word automation library to your project and use that reference to create a Word application object that can open Word files and export the document’s contents to an XML document. For this project, the application exports the content in such a way that each Word style gets translated to an appropriate XML element. By default, the application uses the names of Word styles (sometimes in slightly modified form) applied to paragraphs as the element names for the document. As written, the application follows the sequence of page breaks in the Word document itself. It doesn’t take section breaks within a document into account, but that would be relatively easy to add. The application preserves style formatting, but ignores empty paragraphs (those containing only whitespace such as spaces, tabs, carriage returns, and linefeeds).
Here’s an example. Suppose you have a Word document that looks like the sample.doc file shown in Figure 1.
Reference the Appropriate Word Library
Start a new Windows Application project in Visual Studio .NET, and name it WordAutomation. After creating the project, right-click on the References item in the Solution Explorer, and select Add Reference. Click the COM tab on the Add Reference dialog and then select the Microsoft Word MajorVersion.minorversion item (where MajorVersion.minorversion stands for the major and minor version number of the Word release you’re targeting). Doing that creates several references called Word, VBIDE, stdole, and Microsoft.Office.Core. The only one you need to worry about for this project is the Word reference.
Getting Word Content
To open the Word file, first create a Word.Application object. The sample project creates this when it instantiates the class by defining a class-level variable.
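A minimal sketch of such a class-level variable follows. The names Form1 and wordApp are assumptions rather than names taken from the sample project, and the Word namespace is assumed to come from the COM reference added earlier.

```vb
' Assumes the COM reference described above, which exposes the Word namespace.
' With the Office primary interop assemblies the namespace is
' Microsoft.Office.Interop.Word, so an alias keeps the rest of the code unchanged:
' Imports Word = Microsoft.Office.Interop.Word

Public Class Form1
    Inherits System.Windows.Forms.Form

    ' One Word instance, created when the form is instantiated and
    ' shared by every method on the form (hypothetical name).
    Private wordApp As New Word.Application()
End Class
```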
When the user selects a file and clicks the Process button, the Click event-handler code opens the selected file by calling the Word application’s Documents.Open method.
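The call might be wrapped as shown below. This is a sketch, not the sample project's handler: OpenDocument and its parameter names are hypothetical. The file name is copied into an Object variable because the interop signature of Documents.Open takes its arguments ByRef As Object.

```vb
Module OpenDocumentSketch
    ' Opens the file at the given path in an existing Word instance and
    ' returns the resulting document. All other Open arguments are optional.
    Function OpenDocument(ByVal wordApp As Word.Application, ByVal path As String) As Word.Document
        Dim fileName As Object = path
        Return wordApp.Documents.Open(fileName)
    End Function
End Module
```

A Click handler would then pass in the class-level Word.Application instance and the path the user selected.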
The method creates a new instance of the Word.Document class, which represents a Word document. Note that you can’t create a Word.DocumentClass instance directly. Instead, you get the reference indirectly through the Word.Application object’s Documents collection; in this case by calling the Open method to open the selected file. The Open method returns a Word.Document object, which you can then use to manipulate the document’s content.
Author’s Note: In earlier Word versions, using the Application and Document classes directly caused problems, so you may need to use the ApplicationClass and DocumentClass classes instead. For example, you may experience an irritating conflict between Close() methods. For a Word.Document instance, you’ll get an error stating that “‘Close’ is ambiguous across the inherited interfaces ‘Word._Document’ and ‘Word.DocumentEvents_Event’.” One solution is to create your document references as DocumentClass instances rather than Document instances. Another workaround is to cast the Document or Application instances to a more specific DocumentClass or ApplicationClass instance before issuing the ambiguous call. For example, assuming that myDoc is a Document object, to issue a Close call you could write:
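The original listing is not reproduced here. Based on the description, a call of this kind might look like the following sketch:

```vb
CType(myDoc, Word.DocumentClass).Close()
```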
The line of code above casts the Document reference myDoc to a DocumentClass reference, which avoids any ambiguity with the call to Close().
Finding and Replacing Character Styles
Rather than writing the code to iterate through all the Word characters looking for bold and italic characters yourself, it’s much simpler to get Word to do it for you. Word has powerful find and replace methods, and you have access to them through your object references. In the sample code, the two private methods boldToHTML and italicToHTML handle replacing ranges of bold and italic characters with HTML tag equivalents. In other words, bold text becomes <b>bold text</b> and italic text becomes <i>italic text</i>. Using Word’s find-and-replace functionality, you can perform replacement processes on the entire document very quickly.
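The sketch below shows one way such methods could be written with Word's Find object; it is not the sample project's code. The module name and the shared helper WrapFormattedRuns are hypothetical. The replacement string uses Word's ^& code, which stands for the text that was found.

```vb
Module CharacterStyleSketch
    ' Wraps every bold run in <b>...</b> and turns the bold formatting off.
    Sub BoldToHtml(ByVal doc As Word.Document)
        WrapFormattedRuns(doc, True, "<b>^&</b>")
    End Sub

    ' Wraps every italic run in <i>...</i> and turns the italic formatting off.
    Sub ItalicToHtml(ByVal doc As Word.Document)
        WrapFormattedRuns(doc, False, "<i>^&</i>")
    End Sub

    ' Hypothetical helper: searches the whole document for runs carrying the
    ' requested formatting and replaces each run with the tagged equivalent.
    Private Sub WrapFormattedRuns(ByVal doc As Word.Document, ByVal bold As Boolean, ByVal replacement As String)
        Dim rng As Word.Range = doc.Content
        Dim replaceAll As Object = Word.WdReplace.wdReplaceAll
        With rng.Find
            .ClearFormatting()
            .Replacement.ClearFormatting()
            If bold Then
                .Font.Bold = 1
                .Replacement.Font.Bold = 0
            Else
                .Font.Italic = 1
                .Replacement.Font.Italic = 0
            End If
            .Text = ""                    ' match on formatting only
            .Replacement.Text = replacement
            .Format = True
            .Forward = True
            .Wrap = Word.WdFindWrap.wdFindStop
            .Execute(Replace:=replaceAll)
        End With
    End Sub
End Module
```

Because the replacement also clears the formatting, a second pass over the same document finds nothing left to wrap.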
You will probably want to add other, similar find-and-replace methods depending on your needs. For example, you might need to find underlined text and replace it with <u> tags.
One problem with this approach is that as people write documents, they often turn bold or some other character formatting on, write some text, and then press return to end a paragraph. They start typing the next line and realize that the formatting is still in effect, so they turn it off. The document looks fine, but unfortunately, this common series of actions means the carriage return at the end of the text is itself bold. When you execute the find-and-replace methods shown above, you’ll get results like this:
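As an illustration with made-up text (not the article's original example), a bold paragraph whose paragraph mark is also bold comes out with the closing tag on the wrong side of the paragraph break:

```text
<b>This paragraph is bold.
</b>This paragraph is not.
```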
That causes problems when you translate documents to XML, because it results in improperly nested tags, such as:
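With made-up text, and assuming two Normal-style paragraphs of which only the first is bold (including its paragraph mark), the exported markup would nest along these lines:

```xml
<Normal><b>This paragraph is bold.</Normal>
<Normal></b>This paragraph is not.</Normal>
```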
The solution is to make a single pass through the document replacing all paragraph marks (carriage returns) with unformatted paragraph marks. The clearCRFormatting method shown below solves the problem.
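A sketch of such a method, written under the assumption that only bold and italic need to be cleared, might look like this. It is not the sample project's listing, and it is worth trying on a copy of a document first, because replacing paragraph marks can affect paragraph-level formatting.

```vb
Module ParagraphMarkSketch
    ' Replaces every paragraph mark with a paragraph mark that carries no
    ' bold or italic formatting, so that formatted runs end before the mark.
    Sub ClearCRFormatting(ByVal doc As Word.Document)
        Dim rng As Word.Range = doc.Content
        Dim replaceAll As Object = Word.WdReplace.wdReplaceAll
        With rng.Find
            .ClearFormatting()
            .Replacement.ClearFormatting()
            .Text = "^p"                  ' Word's code for a paragraph mark
            .Replacement.Text = "^p"
            .Replacement.Font.Bold = 0
            .Replacement.Font.Italic = 0
            .Format = True
            .Forward = True
            .Wrap = Word.WdFindWrap.wdFindStop
            .Execute(Replace:=replaceAll)
        End With
    End Sub
End Module
```

Running this pass before the bold and italic replacements keeps each tagged run inside a single paragraph.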
The sample project calls all three find-and-replace methods immediately after loading the document (see Listing 2). Next, it starts collecting the paragraph content. Word uses a hierarchical collection model. At the document level, each document has a collection of Paragraph objects, so it’s easy to iterate through them:
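The loop might be sketched as follows. The module and method names are hypothetical, and the body simply prints each paragraph's style name and text as a stand-in for real processing.

```vb
Imports System

Module ParagraphLoopSketch
    Sub VisitParagraphs(ByVal doc As Word.Document)
        For Each para As Word.Paragraph In doc.Paragraphs
            Dim text As String = para.Range.Text

            ' Skip paragraphs that contain only whitespace.
            If text.Trim().Length = 0 Then Continue For

            ' do something with each paragraph
            Dim styleName As String = CType(para.Style, Word.Style).NameLocal
            Console.WriteLine(styleName & ": " & text.Trim())
        Next
    End Sub
End Module
```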
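For this application, the “do something with each paragraph” consists of several tasks: skipping empty paragraphs, determining the page on which the paragraph lies, stripping any hard page break character, mapping the paragraph’s style name to an XML element name, and appending a new element containing the paragraph’s text to the current page element.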
The sample application wraps all these steps up into a method called docToXml, which takes the Word.Document instance as a parameter and returns an XML document object containing the completed elements and the Word document’s content.
Determining Page Numbers
To retrieve the page number on which any range of text lies, you first select the text and then use the Word.Selection.Information method to get the page number for a specific part of the selection (because selection ranges might cross pages).
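A minimal sketch of that lookup follows. GetPageNumber is a hypothetical helper; it asks for wdActiveEndPageNumber, the page on which the end of the selection falls, which is one of several values the WdInformation enumeration offers.

```vb
Module PageNumberSketch
    ' Selects the paragraph and returns the page number on which the
    ' active end of the selection lies.
    Function GetPageNumber(ByVal wordApp As Word.Application, ByVal para As Word.Paragraph) As Integer
        para.Range.Select()
        Return CInt(wordApp.Selection.Information(Word.WdInformation.wdActiveEndPageNumber))
    End Function
End Module
```

Comparing the returned value with the page number of the previous paragraph tells the caller when to start a new page element.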
The sample application initially sets up a new XML document with one
This scheme can’t maintain a perfect page-to-paragraph relationship; if a paragraph begins on one page but ends on another, this scheme stores the entire paragraph on the higher page.
Some documents contain “hard” (manual) page breaks. Word stores each of these as a single form feed character (decimal 12). It turns out that a hard page break appears as the first character of the paragraph that follows it, so you must test for that as well. The code below strips the hard page break character from the beginning of the paragraph text.
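One way to do this with plain string handling is sketched below. It assumes the manual page break arrives as a form feed, ChrW(12), which is the character Word uses for manual page breaks; the module and function names are hypothetical.

```vb
Imports Microsoft.VisualBasic

Module PageBreakSketch
    ' Returns the paragraph text without a leading manual page break.
    ' Text that does not start with a form feed is returned unchanged.
    Function StripLeadingPageBreak(ByVal text As String) As String
        If text.Length > 0 AndAlso text.Chars(0) = ChrW(12) Then
            Return text.Substring(1)
        End If
        Return text
    End Function
End Module
```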
Mapping Style Names to Element Names
Now that you have the style name for this paragraph, you want to map it to an XML element. There are two considerations. First, the Word style names can contain spaces, while XML element names cannot; therefore, you must either remove or replace the spaces before applying the name to an XML element. Second, you may not want to map the Word style names directly to XML element names. For example, you might want to map Word’s Normal style to a differently named element in the XML document. To do that, you need to write a bit of lookup code to map style names to element names. The sample application contains a StyleMapping class that performs the lookup (see Listing 3). For convenience, the StyleMapping class also contains a fixupName method that handles replacing any spaces in the Word style name with underscores.
To instantiate an instance of the StyleMapping class, pass it the name of an XML-formatted map file. Map files consist of a root
For example, the preceding map file instructs the application to map the Word style “Heading 1” to one element and the Normal style to another. As written, the application always attempts to look up the style name for every paragraph by calling the StyleMapping.GetStyleToElementMapping method. If that method finds an
After obtaining a mapped name, you can create a new XmlElement and append it to the most recent page element.
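The name fix-up, lookup, and append steps can be sketched with the System.Xml classes as shown below. This is not the StyleMapping class from Listing 3: the dictionary stands in for the map file, whose format is not reproduced here, and the module, function, and element names are hypothetical. Falling back to the fixed-up style name when no mapping exists is an assumption based on the default behavior described at the start of the article.

```vb
Imports System.Collections.Generic
Imports System.Xml

Module ElementSketch
    ' Replaces spaces so a Word style name can serve as an XML element name.
    Function FixupName(ByVal styleName As String) As String
        Return styleName.Replace(" ", "_")
    End Function

    ' Returns the mapped element name, or the fixed-up style name
    ' when the map has no entry for the style.
    Function GetElementName(ByVal map As Dictionary(Of String, String), ByVal styleName As String) As String
        Dim mapped As String = Nothing
        If map.TryGetValue(styleName, mapped) Then Return mapped
        Return FixupName(styleName)
    End Function

    ' Creates an element with the given name and text and appends it
    ' to the supplied page element.
    Sub AppendParagraph(ByVal pageElement As XmlElement, ByVal elementName As String, ByVal text As String)
        Dim element As XmlElement = pageElement.OwnerDocument.CreateElement(elementName)
        element.InnerText = text
        pageElement.AppendChild(element)
    End Sub
End Module
```

Replacing spaces is enough for style names such as “Heading 1”, but XML names have other restrictions (for example, they cannot begin with a digit), so unusual style names may need extra handling.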
When the docToXml function has processed all the paragraphs, it returns the completed XML document. The Process button Click event handler code then displays it in the multi-line TextBox (see Figure 3).
Customizing the Doc-to-XML Sample Application
You can extend the sample application in several ways. For example, you can check each Word paragraph to see if it has a bullet or number and add items to the mapping file to map bulleted paragraphs to specific XML element names.
Hard-coding the way the StyleMapping class interprets the contents of the mapping file isn’t very flexible. A more flexible method would be to follow the model the .NET framework uses to process configuration files by creating section handlers to handle various sections of the mapping file. Doing that would allow you to add custom processing for a specific file at run time.
You can, and (unless you’re planning to turn the XML back into a Word file) probably should, add code to ensure that Word’s control/formatting characters aren’t inserted into the XML file, or add code to remove them later. For example, this article showed one way to remove the manual page break character using string replacement. You can use Word’s find and replace feature as discussed in the section “Finding and Replacing Character Styles” to remove the control characters. You can also perform post-processing on the XML file itself, but if you do that, you should be aware that the XmlDocument encodes control/formatting character values as it adds text, so you’ll need to search for the encoded representation of the characters, which will start with an ampersand and a number sign followed by the character value.
With a little alteration, this entire concept is useful for implementing document-checking rules or extracting specific paragraphs or field values from existing Word documents. For example, you might want to check Word documents and ensure that the styles adhere to company standards. Or you might want to use these automation techniques to extract only specific field values from a Word document, using an XML file to specify which fields the process should extract.
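Returning to the control-character cleanup mentioned above, one option is to filter the paragraph text before it is added to the XML document. The sketch below is an assumption about how that could be done, not code from the sample project. It keeps tabs and drops every other control character; the module and function names are hypothetical.

```vb
Imports System.Text
Imports Microsoft.VisualBasic

Module ControlCharSketch
    ' Returns the text with all control characters removed except tabs.
    Function RemoveControlChars(ByVal text As String) As String
        Dim builder As New StringBuilder(text.Length)
        For Each c As Char In text
            If Not Char.IsControl(c) OrElse c = ControlChars.Tab Then
                builder.Append(c)
            End If
        Next
        Return builder.ToString()
    End Function
End Module
```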
Whatever you need to do with your Word files, this article should help get you started and give you ideas for controlling the process from outside your code.