Export Customized XML from Microsoft Word with VB.NET : Page 4

Learn to use Word automation from .NET to turn hard-to-process Word documents into customizable XML





Determining Page Numbers
A Word document is an abstract representation of a printed document. Both the content and the print setup affect page numbers. For example, you can print a Word document on non-standard paper, or insert new paragraphs at the beginning of a document. When you make such changes, Word rearranges the content to reflect the new content or output medium. The result is that paragraphs don't "belong" to a specific page. Therefore, if you want to maintain the current Word page arrangement, you need to capture the page number as you loop through the document grabbing paragraph content. This application maintains the relationship between the paragraphs and Word's pages by creating each paragraph element as a child of a <page id="#"></page> element, where "#" is the page number.

To retrieve the page number on which any range of text lies, first select the text, and then read the Word.Selection.Information property. Because a selection can cross pages, you ask for the page number of a specific part of the selection; passing wdActiveEndPageNumber returns the page on which the selection ends.

' check the page number
If CType(wordApp.Selection.Information( _
   Word.WdInformation.wdActiveEndPageNumber), _
   Integer) > currentPageNumber Then
   pageNode = Me.addPage(xmlDoc)
   currentPageNumber = Integer.Parse( _
      pageNode.Attributes("id").Value)
End If

The sample application initially sets up a new XML document with one <page> element (page 1) and sets the currentPageNumber variable to 1. For each paragraph, the code retrieves the page number for the end of the paragraph and compares it to the currentPageNumber value. When the paragraph's page number is greater than currentPageNumber, the application adds a new <page> element using the addPage method, which returns a reference to the new page element.

Private Function addPage(ByVal xmlDoc As _
   XmlDocument) As XmlElement
   Dim page As XmlElement = _
      xmlDoc.CreateElement("page")
   Dim natt As XmlAttribute = _
      xmlDoc.CreateAttribute("id")
   ' SelectNodes returns an empty list (never Nothing)
   ' when no pages exist yet, so Count + 1 gives "1"
   ' for the first page
   natt.Value = (xmlDoc.SelectNodes( _
      "document//page").Count + 1).ToString
   page.Attributes.Append(natt)
   xmlDoc.SelectSingleNode( _
      "/document").AppendChild(page)
   Return page
End Function
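The page-tracking loop described above can be tried without Word by feeding it made-up page numbers. The following is a minimal sketch, not the sample application's code: the BuildDocument function, the "paragraph" element name, and the endPages array (a stand-in for the values Word would report) are all assumptions made for illustration. It also uses a While loop rather than a single If, so that a jump of more than one page still produces consecutive page elements.

```vbnet
Imports System
Imports System.Xml

Module PageGroupingSketch

    ' Appends a new <page id="n"> element under <document>, numbered
    ' one higher than the count of existing page elements.
    Function AddPage(ByVal xmlDoc As XmlDocument) As XmlElement
        Dim page As XmlElement = xmlDoc.CreateElement("page")
        Dim nextId As Integer = xmlDoc.SelectNodes("/document/page").Count + 1
        page.SetAttribute("id", nextId.ToString())
        xmlDoc.DocumentElement.AppendChild(page)
        Return page
    End Function

    ' Hypothetical helper: groups paragraph texts under page elements,
    ' given the page number on which each paragraph ends.
    Function BuildDocument(ByVal texts As String(), _
                           ByVal endPages As Integer()) As XmlDocument
        Dim xmlDoc As New XmlDocument()
        xmlDoc.LoadXml("<document/>")
        Dim pageNode As XmlElement = AddPage(xmlDoc)   ' page 1
        Dim currentPageNumber As Integer = 1

        For i As Integer = 0 To texts.Length - 1
            ' Add pages until the tree catches up with the reported page.
            While endPages(i) > currentPageNumber
                pageNode = AddPage(xmlDoc)
                currentPageNumber = Integer.Parse(pageNode.GetAttribute("id"))
            End While
            ' "paragraph" is an assumed element name.
            Dim para As XmlElement = xmlDoc.CreateElement("paragraph")
            para.InnerText = texts(i)
            pageNode.AppendChild(para)
        Next
        Return xmlDoc
    End Function

    Sub Main()
        Dim texts As String() = {"First", "Second", "Third"}
        Dim endPages As Integer() = {1, 1, 2}
        Console.WriteLine(BuildDocument(texts, endPages).OuterXml)
    End Sub

End Module
```

With the sample data in Main, the first two paragraphs land under page 1 and the third under page 2.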

This scheme can't maintain a perfect page-to-paragraph relationship. Because it uses the page number at the end of each paragraph, a paragraph that begins on one page but ends on the next is stored entirely under the later page.

Some documents contain "hard" (manual) page breaks. Word stores each one as a single form feed character (decimal 12, hexadecimal 0C). A hard page break appears as the first character of the paragraph that follows it, so you must test for that character as well. The code below strips the hard page break character from the beginning of the paragraph text and starts a new page element.

' get the para text
Dim s As String = p.Range.Text
' check to see if there's a hard page break at the
' start of this para
If Asc(s.Chars(0)) = &HC Then
   s = s.Substring(1)
   pageNode = Me.addPage(xmlDoc)
   currentPageNumber = Integer.Parse( _
      pageNode.Attributes("id").Value)
End If
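The stripping step can also be isolated into a small helper that is easy to test without Word. This is a sketch under assumptions: the StripLeadingPageBreak name and its ByRef flag are hypothetical, and it treats the break as the form feed character (decimal 12, the &HC value tested in the code above), using the ControlChars.FormFeed constant from the Microsoft.VisualBasic namespace. Unlike the listing above, it also guards against an empty string.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module HardBreakSketch

    ' Form feed, decimal 12 (&HC).
    Private Const PageBreakChar As Char = ControlChars.FormFeed

    ' Hypothetical helper: removes one leading page-break character and
    ' reports through hadBreak whether one was found.
    Function StripLeadingPageBreak(ByVal text As String, _
                                   ByRef hadBreak As Boolean) As String
        hadBreak = Not String.IsNullOrEmpty(text) AndAlso _
                   text.Chars(0) = PageBreakChar
        If hadBreak Then Return text.Substring(1)
        Return text
    End Function

    Sub Main()
        Dim hadBreak As Boolean
        Dim cleaned As String = _
            StripLeadingPageBreak(ControlChars.FormFeed & "New page text", hadBreak)
        Console.WriteLine(hadBreak)
        Console.WriteLine(cleaned)
    End Sub

End Module
```

A caller would add a new page element whenever hadBreak comes back True, and then store the cleaned text as the paragraph content.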
