Export Customized XML from Microsoft Word with VB.NET : Page 6

Learn to use Word automation from .NET to turn hard-to-process Word documents into customizable XML




Customizing the Doc-to-XML Sample Application
The solution shown here is generic and deliberately simple. It doesn't take advanced Word features such as bullets, numbered paragraphs, document sections, or custom character formatting into account, but adding support for them is not difficult. For example, you can check each Word paragraph to see whether it has a bullet or number, and add items to the mapping file that map bulleted or numbered paragraphs to specific XML element names, such as <li> or <numbered>. You can also alter the format of the mapping file to hold more information, perhaps to add specific HTML style attributes to each element (although it's usually best to add formatting with an XSLT transform).

I've added code to the StyleMapping class so that you can add and remove mappings programmatically. Although the sample code doesn't use these methods, you'll find two commented-out lines at the top of the docToXml method that create a new StyleMapping document and add a mapping dynamically.
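As a rough illustration of the bullet and number check described above, the following sketch inspects a paragraph's list formatting through the Word object model. It assumes a project reference to the Microsoft.Office.Interop.Word primary interop assembly; the module and function names are hypothetical and are not part of the sample application, and the element names simply reuse the <li> and <numbered> examples from the text.

```vb
Imports Word = Microsoft.Office.Interop.Word

Module ListMappingSketch
    ' Hypothetical helper, not part of the article's sample code.
    ' Returns an XML element name for list paragraphs, or Nothing when the
    ' paragraph is not bulleted or numbered, so that the caller can fall
    ' back to the normal style-based mapping.
    Function ListElementName(ByVal para As Word.Paragraph) As String
        Select Case para.Range.ListFormat.ListType
            Case Word.WdListType.wdListBullet, _
                 Word.WdListType.wdListPictureBullet
                Return "li"
            Case Word.WdListType.wdListSimpleNumbering, _
                 Word.WdListType.wdListOutlineNumbering, _
                 Word.WdListType.wdListMixedNumbering
                Return "numbered"
            Case Else
                Return Nothing
        End Select
    End Function
End Module
```

A caller could try this function first for each paragraph and consult the style mapping only when it returns Nothing. In a fuller version, the two element names would probably come from the mapping file rather than being hard-coded.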

Hard-coding the way the StyleMapping class interprets the contents of the mapping file isn't very flexible. A more flexible approach would follow the model the .NET Framework uses to process configuration files: create section handlers that each handle one section of the mapping file. Doing that would let you add custom processing for a specific file at run time.

You can, and (unless you're planning to turn the XML back into a Word file) probably should, add code to ensure that Word's control and formatting characters aren't inserted into the XML file, or add code to remove them later. For example, this article showed one way to remove the manual page break character using string replacement. You can also use Word's find-and-replace feature, as discussed in the section "Finding and Replacing Character Styles," to remove the control characters. Alternatively, you can post-process the XML file itself, but be aware that the XmlDocument encodes control and formatting character values as it adds text, so you'll need to search for the encoded representation of each character, which starts with an ampersand and a number sign followed by the character value and a semicolon (such as &#151;).

With a little alteration, this entire concept is also useful for implementing document-checking rules or for extracting specific paragraphs or field values from existing Word documents. For example, you might want to check Word documents to ensure that their styles adhere to company standards. Or you might want to use these automation techniques to extract only specific field values from a Word document, using an XML file to specify which fields the process should extract.
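To make the XML post-processing option mentioned above more concrete, here is a minimal sketch that strips selected encoded characters from XML text with a regular expression. The module and function names are hypothetical, and the sketch assumes the characters may appear in either decimal (&#151;) or hexadecimal (&#x97;) form, so it handles both. It uses only System.Text.RegularExpressions and requires a VB version that supports multi-line lambdas.

```vb
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module CharRefCleanup
    ' Matches numeric character references: group 1 captures hex digits,
    ' group 2 captures decimal digits.
    Private ReadOnly CharRef As New Regex("&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

    ' Hypothetical helper: removes every numeric character reference whose
    ' character value appears in unwanted and leaves all others untouched.
    Function StripCharRefs(ByVal xml As String, _
                           ByVal ParamArray unwanted() As Integer) As String
        Return CharRef.Replace(xml,
            Function(m As Match) As String
                Dim code As Integer
                Dim parsed As Boolean
                If m.Groups(1).Success Then
                    parsed = Integer.TryParse(m.Groups(1).Value, _
                        NumberStyles.HexNumber, CultureInfo.InvariantCulture, code)
                Else
                    parsed = Integer.TryParse(m.Groups(2).Value, _
                        NumberStyles.Integer, CultureInfo.InvariantCulture, code)
                End If
                If parsed AndAlso Array.IndexOf(unwanted, code) >= 0 Then
                    Return String.Empty
                End If
                Return m.Value
            End Function)
    End Function

    Sub Main()
        ' 12 is the form feed character, which Word uses for a manual page break.
        Dim sample As String = "<p>Intro&#12;Next page &#151; done</p>"
        Console.WriteLine(StripCharRefs(sample, 12))
        ' Prints: <p>IntroNext page &#151; done</p>
    End Sub
End Module
```

Passing the unwanted character values as parameters keeps the cleanup rules out of the code path itself. The same list could just as easily be read from the mapping file.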

Whatever you need to do with your Word files, this article should help get you started and give you ideas for controlling the process from outside your code.

A. Russell Jones is the Executive Editor at DevX. Reach him via email.