

Export Customized XML from Microsoft Word with VB.NET : Page 3

Learn to use Word automation from .NET to turn hard-to-process Word documents into customizable XML





Finding and Replacing Character Styles
Word has two different types of styles. Paragraph styles control paragraph properties such as indentation, spacing before and after the paragraph, and the default font characteristics applied to text in the paragraph. In addition, Word supports character styles, which apply to a range of characters rather than to a whole paragraph. Formatting characters as bold or italic works at the same character level (strictly speaking, Word treats this as direct character formatting, but for the purposes of this article it behaves like a character style). For this application, assume you want the paragraph style to become the name of the XML tag surrounding the text of the paragraph, but you also want to maintain bold and italic formatting within the text.

Rather than writing code yourself to iterate through all the Word characters looking for bold and italic text, it's much simpler to get Word to do it for you. Word has powerful find-and-replace methods, and you have access to them through your object references. In the sample code, the two methods boldToHTML and italicToHTML handle replacing ranges of bold and italic characters with HTML tag equivalents. In other words, bold text becomes <b>bold text</b> and italic text becomes <i>italic text</i>. Using Word's find-and-replace functionality, you can perform replacement processes on the entire document very quickly.

Public Sub boldToHTML(ByVal doc As Word.Document)
    With doc.Content.Find
        ' Search for bold text; the replacement text is not bold
        .ClearFormatting()
        .Font.Bold = 1
        .Replacement.ClearFormatting()
        .Replacement.Font.Bold = 0
        .Text = "*"
        ' ^& in the replacement stands for the matched text
        .Execute(findtext:="", _
            ReplaceWith:="<b>^&</b>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
    End With
End Sub

Public Sub italicToHTML(ByVal doc As Word.Document)
    With doc.Content.Find
        ' Search for italic text; the replacement text is not italic
        .ClearFormatting()
        .Font.Italic = 1
        .Replacement.ClearFormatting()
        .Replacement.Font.Italic = 0
        .Text = "*"
        ' ^& in the replacement stands for the matched text
        .Execute(findtext:="", _
            ReplaceWith:="<i>^&</i>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
    End With
End Sub

You will probably want to add other, similar find-and-replace methods depending on your needs. For example, you might need to find underlined text and replace it with <u> tags.
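An underline variant might look like the sketch below. It follows the same pattern as the two methods above; the method name underlineToHTML, the module wrapper, and the `Imports` alias are assumptions for illustration, and the sketch matches only single underlines (Word's other underline types would each need their own pass).

```vbnet
Imports Word = Microsoft.Office.Interop.Word

Module UnderlineSketch
    ' Hypothetical companion to boldToHTML and italicToHTML:
    ' wraps single-underlined runs in <u> tags and removes the underline.
    Public Sub underlineToHTML(ByVal doc As Word.Document)
        With doc.Content.Find
            .ClearFormatting()
            .Font.Underline = Word.WdUnderline.wdUnderlineSingle
            .Replacement.ClearFormatting()
            .Replacement.Font.Underline = Word.WdUnderline.wdUnderlineNone
            ' ^& in the replacement stands for the matched text
            .Execute(FindText:="", _
                ReplaceWith:="<u>^&</u>", _
                Format:=True, _
                Replace:=Word.WdReplace.wdReplaceAll)
        End With
    End Sub
End Module
```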

One problem with this approach is that as people write documents, they often turn bold or some other character formatting on, write some text, and then press Return to end a paragraph. They start typing the next line and realize that the formatting is still in effect, so they turn it off. The document looks fine, but unfortunately, this common series of actions means the carriage return at the end of the text is itself bold. When you execute the find-and-replace methods shown above, you'll get results like this:

This is <b>bold text
</b>

That causes problems when you translate documents to XML, because it results in improperly nested tags, such as:

<p>This is <b>bold text</p></b>
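To see why that nesting matters, the short sketch below feeds both forms to the .NET XML parser. It is only an illustration; the helper name IsWellFormed and the module wrapper are made up for this example and are not part of the sample project.

```vbnet
Imports System
Imports System.Xml

Module NestingCheck
    ' Returns True when the fragment parses as well-formed XML.
    Function IsWellFormed(ByVal fragment As String) As Boolean
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml(fragment)
            Return True
        Catch ex As XmlException
            ' Mismatched start and end tags end up here
            Return False
        End Try
    End Function

    Sub Main()
        ' The closing </p> arrives while <b> is still open
        Console.WriteLine(IsWellFormed("<p>This is <b>bold text</p></b>")) ' False
        ' Properly nested version
        Console.WriteLine(IsWellFormed("<p>This is <b>bold text</b></p>")) ' True
    End Sub
End Module
```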

The solution is to make a single pass through the document replacing all paragraph marks (carriage returns) with unformatted paragraph marks. The clearCRFormatting method shown below solves the problem.

Public Sub clearCRFormatting(ByVal doc As Word.Document)
    With doc.Content.Find
        .ClearFormatting()
        .Replacement.ClearFormatting()
        ' Replace every paragraph mark with an unformatted one
        .Replacement.Font.Bold = 0
        .Replacement.Font.Italic = 0
        .Replacement.Font.Underline = 0
        .Execute(findtext:="^p", ReplaceWith:="^p", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
    End With
End Sub

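The sample project calls all three find-and-replace methods immediately after loading the document (see Listing 2). The order matters: clearCRFormatting has to run before boldToHTML and italicToHTML, so that the paragraph marks are already unformatted when the tags are inserted. Next, the project starts collecting the paragraph content.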

Word uses a hierarchical collection model. At the document level, each document has a collection of Paragraph objects, so it's easy to iterate through them:

' loop through the paragraphs
For Each p As Word.Paragraph In doc.Paragraphs
    ' do something with each paragraph
Next

For this application, the "do something with each paragraph" consists of several tasks:

  • Determine the page number where the paragraph would appear if the document were printed. Whenever the page changes, you want to insert a <page id="#"> element to reflect the original page structure of the document in the XML file.
  • Retrieve the Word style name for the paragraph.
  • Create an XML element for that style. This step itself consists of several separate tasks, such as determining whether to use the Word style name directly or "map" it to a different element name using a "style mapping." For example, you might want to translate the "Normal" style to HTML-like <p> tags. By default, the application uses the Word style name, with minor adjustments to meet XML element-naming rules.
  • Append the XML element containing the contents of the paragraph to an XML output document.

The sample application wraps all these steps up into a method called docToXml, which takes the Word.Document instance as a parameter and returns an XML document object containing the completed elements and the Word document's content.
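
The steps in the list above can be sketched roughly as follows. This is not the sample project's docToXml. It is a simplified illustration that works on values already read from Word (page number, style name, text) so that it runs without Word installed. The names ParaInfo, ToElementName, and BuildXml, the root element name "document", the underscore-based name adjustment, and the choice to make each <page> element a container for its paragraphs are all assumptions. It also assumes the paragraph text is already a well-formed XML fragment (inline <b> and <i> tags properly nested, and any literal & or < escaped).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports System.Xml

Module DocToXmlSketch
    ' Hypothetical record standing in for data read from one Word paragraph.
    Structure ParaInfo
        Public Page As Integer
        Public StyleName As String
        Public Text As String
        Public Sub New(ByVal pageNumber As Integer, ByVal style As String, ByVal content As String)
            Me.Page = pageNumber
            Me.StyleName = style
            Me.Text = content
        End Sub
    End Structure

    ' Use the mapped name when one exists; otherwise adjust the style name
    ' so that it is usable as an XML element name (assumed rule: underscores).
    Function ToElementName(ByVal styleName As String, _
            ByVal styleMap As Dictionary(Of String, String)) As String
        Dim mapped As String = Nothing
        If styleMap.TryGetValue(styleName, mapped) Then Return mapped

        Dim sb As New StringBuilder()
        For Each c As Char In styleName
            If Char.IsLetterOrDigit(c) OrElse c = "_"c OrElse c = "-"c OrElse c = "."c Then
                sb.Append(c)
            Else
                sb.Append("_"c)
            End If
        Next
        ' An element name must start with a letter or an underscore
        If sb.Length = 0 OrElse Not (Char.IsLetter(sb(0)) OrElse sb(0) = "_"c) Then
            sb.Insert(0, "_"c)
        End If
        Return sb.ToString()
    End Function

    Function BuildXml(ByVal paragraphs As IEnumerable(Of ParaInfo), _
            ByVal styleMap As Dictionary(Of String, String)) As XmlDocument
        Dim xml As New XmlDocument()
        Dim root As XmlElement = xml.CreateElement("document")
        xml.AppendChild(root)

        Dim currentPage As Integer = -1
        Dim pageElement As XmlElement = Nothing
        For Each p As ParaInfo In paragraphs
            ' Start a new <page id="#"> element whenever the page changes
            If p.Page <> currentPage Then
                pageElement = xml.CreateElement("page")
                pageElement.SetAttribute("id", p.Page.ToString())
                root.AppendChild(pageElement)
                currentPage = p.Page
            End If
            Dim el As XmlElement = xml.CreateElement(ToElementName(p.StyleName, styleMap))
            ' InnerXml keeps inline tags such as <b>; it throws if the text is not well-formed
            el.InnerXml = p.Text
            pageElement.AppendChild(el)
        Next
        Return xml
    End Function

    Sub Main()
        Dim styleMap As New Dictionary(Of String, String)()
        styleMap.Add("Normal", "p")

        Dim paragraphs As New List(Of ParaInfo)()
        paragraphs.Add(New ParaInfo(1, "Heading 1", "Title"))
        paragraphs.Add(New ParaInfo(1, "Normal", "This is <b>bold text</b>"))
        paragraphs.Add(New ParaInfo(2, "Normal", "Next page"))

        Console.WriteLine(BuildXml(paragraphs, styleMap).OuterXml)
    End Sub
End Module
```

With the sample data in Main, the "Normal" style is mapped to <p>, "Heading 1" falls back to the adjusted name Heading_1, and the third paragraph opens a second <page> element. In a real run you would fill the list from doc.Paragraphs rather than by hand.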
