Finding and Replacing Character Styles
Word has two different types of styles. Paragraph styles control such paragraph properties as indentation, spacing before and after the paragraph, and the default font characteristics applied to text in the paragraph. In addition, Word supports character styles, which are styles applied to a range of characters. For example, when you format characters as bold or italic, you are applying formatting at the character level. For this application, assume you want the paragraph style to become the name of the XML tag surrounding the text of the paragraph, but you also want to maintain bold and italic styles within the text.
Rather than writing code yourself to iterate through every character in the document looking for bold and italic text, it's much simpler to get Word to do it for you. Word has powerful find-and-replace methods, and you have access to them through your object references. In the sample code, the two methods boldToHTML and italicToHTML handle replacing ranges of bold and italic characters with HTML tag equivalents. In other words, bold text becomes <b>bold text</b> and italic text becomes <i>italic text</i>. Using Word's find-and-replace functionality, you can perform replacement processes on the entire document very quickly.
Public Sub boldToHTML(ByVal doc As Word.Document)
    With doc.Content.Find
        ' Search for bold text only
        .ClearFormatting()
        .Font.Bold = 1
        ' Turn bold off in the replacement text
        .Replacement.ClearFormatting()
        .Replacement.Font.Bold = 0
        .Text = "*"
        ' ^& inserts the matched text between the tags
        .Execute(findtext:="", _
            ReplaceWith:="<b>^&</b>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
    End With
End Sub
Public Sub italicToHTML(ByVal doc As Word.Document)
    With doc.Content.Find
        ' Search for italic text only
        .ClearFormatting()
        .Font.Italic = 1
        ' Turn italic off in the replacement text
        .Replacement.ClearFormatting()
        .Replacement.Font.Italic = 0
        .Text = "*"
        ' ^& inserts the matched text between the tags
        .Execute(findtext:="", _
            ReplaceWith:="<i>^&</i>", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
    End With
End Sub
You will probably want to add other, similar find-and-replace methods depending on your needs. For example, you might need to find underlined text and replace it with <u> tags.
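As an illustration, an underline version might follow the same pattern as the two methods above. This is a minimal sketch, not part of the sample project: the name underlineToHTML and the module around it are hypothetical, and it assumes the same Word alias for the Microsoft.Office.Interop.Word namespace that the other listings use. As written, it targets text formatted with a single underline; other underline styles would need their own WdUnderline value.

```vbnet
Imports Word = Microsoft.Office.Interop.Word

Public Module UnderlineReplacementSketch
    ' Hypothetical companion to boldToHTML and italicToHTML.
    Public Sub underlineToHTML(ByVal doc As Word.Document)
        With doc.Content.Find
            ' Search for text with a single underline only
            .ClearFormatting()
            .Font.Underline = Word.WdUnderline.wdUnderlineSingle
            ' Remove the underline from the replacement text
            .Replacement.ClearFormatting()
            .Replacement.Font.Underline = Word.WdUnderline.wdUnderlineNone
            ' ^& inserts the matched text between the tags
            .Execute(FindText:="", _
                ReplaceWith:="<u>^&</u>", _
                Format:=True, _
                Replace:=Word.WdReplace.wdReplaceAll)
        End With
    End Sub
End Module
```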
One problem with this approach is that as people write documents, they often turn on bold or some other character formatting, write some text, and then press Enter to end the paragraph. They start typing the next line, realize that the formatting is still in effect, and turn it off. The document looks fine, but unfortunately, this common series of actions means the paragraph mark (carriage return) at the end of the text is itself bold. When you execute the find-and-replace methods shown above, you'll get results like this:
This is <b>bold text
</b>
That causes problems when you translate documents to XML, because it results in improperly nested tags, such as:
<p>This is <b>bold text</p></b>
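The solution is to make a single pass through the document, replacing all paragraph marks (carriage returns) with unformatted paragraph marks. For the fix to take effect, this pass has to run before boldToHTML and italicToHTML insert their tags. The clearCRFormatting method shown below solves the problem.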
Public Sub clearCRFormatting(ByVal doc As Word.Document)
    With doc.Content.Find
        .ClearFormatting()
        .Replacement.ClearFormatting()
        .Replacement.Font.Bold = 0
        .Replacement.Font.Italic = 0
        .Replacement.Font.Underline = 0
        .Execute(findtext:="^p", ReplaceWith:="^p", _
            Format:=True, _
            Replace:=Word.WdReplace.wdReplaceAll)
    End With
End Sub
The sample project calls all three find-and-replace methods immediately after loading the document (see Listing 2). Next, it starts collecting the paragraph content.
Word uses a hierarchical collection model. At the document level, each document has a collection of Paragraph objects, so it's easy to iterate through them:
' loop through the paragraphs
For Each p As Word.Paragraph In doc.Paragraphs
    ' do something with each paragraph
Next
For this application, the "do something with each paragraph" consists of several tasks:
- Determine the page number where the paragraph would appear if the document were printed. Whenever the page changes, you want to insert a <page id="#"> element to reflect the original page structure of the document in the XML file.
- Retrieve the Word style name for the paragraph.
- Create an XML element for that style. This step itself consists of several separate tasks, such as determining whether to use the Word style name directly or "map" it to a different element name using a "style mapping." For example, you might want to translate the "Normal" style to HTML-like <p> tags. By default, the application uses the Word style name, with minor adjustments to meet XML element-naming rules.
- Append the XML element containing the contents of the paragraph to an XML output document.
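The style-mapping step in the list above can be sketched without touching Word at all. The following is an illustration under stated assumptions, not the sample application's code: the module, the elementNameForStyle function, and the single "Normal" to "p" mapping are all hypothetical, and the cleanup rule is deliberately conservative (it keeps only ASCII letters, digits, underscores, periods, and hyphens, so localized style names with other letters would lose them).

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Module StyleMappingSketch
    ' Hypothetical style map: Word style name -> XML element name.
    Private ReadOnly styleMap As Dictionary(Of String, String) = buildStyleMap()

    Private Function buildStyleMap() As Dictionary(Of String, String)
        Dim map As New Dictionary(Of String, String)
        map.Add("Normal", "p")
        Return map
    End Function

    ' Returns the mapped element name if one exists; otherwise
    ' adjusts the style name so it can serve as an element name.
    Public Function elementNameForStyle(ByVal styleName As String) As String
        Dim mapped As String = Nothing
        If styleMap.TryGetValue(styleName, mapped) Then
            Return mapped
        End If

        ' Replace each run of disallowed characters (spaces, slashes,
        ' and so on) with a single underscore.
        Dim name As String = Regex.Replace(styleName, "[^A-Za-z0-9_.-]+", "_")

        ' An element name cannot start with a digit, period, or hyphen,
        ' so prefix an underscore in that case (or if nothing is left).
        If name.Length = 0 OrElse _
           Not (Char.IsLetter(name(0)) OrElse name(0) = "_"c) Then
            name = "_" & name
        End If
        Return name
    End Function
End Module
```

With this sketch, "Normal" maps to p, "Heading 1" becomes Heading_1, and "1st Level" becomes _1st_Level.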
The sample application wraps all these steps up into a method called docToXml, which takes the Word.Document instance as a parameter and returns an XML document object containing the completed elements and the Word document's content.
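To make the overall shape concrete, here is a rough outline of how such a method could be put together. It is a sketch under several assumptions and is not the article's docToXml listing: the function name paragraphsToXml, the "document" root element, and the choice to nest paragraphs inside each page element are all hypothetical. It also substitutes XmlConvert.EncodeLocalName for the application's own style-name adjustments, and falls back to escaped text when a paragraph's content is not well-formed XML.

```vbnet
Imports System.Xml
Imports Word = Microsoft.Office.Interop.Word

Public Module DocToXmlSketch
    ' Hypothetical outline only; the real docToXml may differ.
    Public Function paragraphsToXml(ByVal doc As Word.Document) As XmlDocument
        Dim xml As New XmlDocument()
        Dim root As XmlElement = xml.CreateElement("document")
        xml.AppendChild(root)

        Dim currentPage As Integer = 0
        Dim pageElement As XmlElement = Nothing

        For Each p As Word.Paragraph In doc.Paragraphs
            ' Page on which the paragraph ends in the printed layout
            Dim pageNumber As Integer = CInt(p.Range.Information( _
                Word.WdInformation.wdActiveEndPageNumber))

            ' Start a new <page id="#"> element whenever the page changes
            If pageElement Is Nothing OrElse pageNumber <> currentPage Then
                currentPage = pageNumber
                pageElement = xml.CreateElement("page")
                pageElement.SetAttribute("id", currentPage.ToString())
                root.AppendChild(pageElement)
            End If

            ' Use the Word style name, encoded as a valid XML name
            Dim styleName As String = CType(p.Style, Word.Style).NameLocal
            Dim element As XmlElement = _
                xml.CreateElement(XmlConvert.EncodeLocalName(styleName))

            ' Drop the trailing paragraph mark from the text
            Dim text As String = p.Range.Text.TrimEnd(ControlChars.Cr)
            Try
                ' Keeps the <b> and <i> tags inserted earlier as markup
                element.InnerXml = text
            Catch ex As XmlException
                ' Not well-formed (for example, a stray & or <): store as text
                element.InnerText = text
            End Try
            pageElement.AppendChild(element)
        Next

        Return xml
    End Function
End Module
```

Setting InnerXml rather than InnerText is what preserves the inline tags as child elements; the Try block is a guard for paragraphs whose literal text would otherwise break the XML.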