Developers have worked with Office documents for years, primarily word-processing documents, presentations, and spreadsheets. While the various Microsoft Office applications have provided built-in support for creating and modifying Office documents, the creators and consumers of these documents have primarily been humans.
Now, however, the push to improve business workflows means that software applications must be able to operate on Office documents as well. The main requirements arising from this need are:
- Document interoperability: The ability to work on documents using tools other than Office.
- Document manipulation and generation: The ability to create and modify documents programmatically.
Microsoft Office documents based on the older binary formats supported the preceding requirements to some extent through the COM object model; however, that approach was complex and, because it required automating the single-threaded Office client applications, not scalable. Microsoft’s Office Open XML (OOXML) format, the default format for Office 2007 documents, overcomes both issues.
What is OOXML?
OOXML is an XML file format specification for representing word-processing documents, presentations, and spreadsheets. Microsoft created the original specification, which was later approved as the ECMA-376 and ISO/IEC 29500 standards. OOXML uses familiar technologies such as XML and ZIP. Document content resides in a file package that conforms to the Open Packaging Conventions (OPC). An OOXML file package contains a few XML files as well as other required resources such as image files, video files, and so forth.
The key concepts of OOXML are:
- An OOXML document is a ZIP-based package of files.
- A package is composed of various parts, including Main Document, Image, Video, Slide, Workbook, and Document Properties parts. Each part is represented by a file in the zipped package.
- Relationships determine how the collection of parts comes together to form a document.
- Content types define the types of parts that can be stored in a package.
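The concepts in this list can be seen directly with the .NET packaging API. The following is a minimal sketch, not part of the original sample code: it assumes a reference to the WindowsBase assembly, and the default file name is only a placeholder for any OOXML file you have on hand.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // defined in the WindowsBase assembly

class PackageInspector
{
    static void Main(string[] args)
    {
        // Placeholder path (an assumption); pass any .docx, .xlsx, or .pptx file.
        string path = args.Length > 0 ? args[0] : "BasicWordDoc.docx";

        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            // Package-level relationships point at the entry parts,
            // such as the main document part.
            foreach (PackageRelationship rel in package.GetRelationships())
            {
                Console.WriteLine("Relationship {0}: {1} -> {2}",
                    rel.Id, rel.RelationshipType, rel.TargetUri);
            }

            // Every part has a URI inside the ZIP file and a content type.
            foreach (PackagePart part in package.GetParts())
            {
                Console.WriteLine("Part {0} ({1})", part.Uri, part.ContentType);
            }
        }
    }
}
```

Running this against a Word document should list the main document part along with its content type, which makes the package, part, relationship, and content-type vocabulary concrete.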
Benefits of OOXML
Because OOXML documents are based on standard, open, platform-independent formats, the ability to interoperate with Office documents has increased significantly compared to the earlier binary formats. Here are some of the benefits of the OOXML format:
- Document Assembly: It’s easy to create documents, because individual parts can be created separately, and assembled when required.
- Document Archiving: You can save space by storing a single instance of the common parts from a large number of similar documents as well as the unique content for each document, assembling the individual documents in their entirety again when required.
- Searching Documents: Because content is stored in XML format, it’s much easier to search using common tools.
- Business Process Efficiency: Using an XML-based document format helps when automating decisions based on document content.
The OOXML SDK
The Open XML SDK is a .NET class library that builds on the standard XML and packaging APIs for working with OOXML documents. The SDK provides typed access to both Open Packaging Conventions packages and the XML content in those packages. Currently, two versions of the SDK are available:
- Version 1: Provides strongly typed access to packages.
- Version 2: Provides strongly typed access to both packages and their contents (currently available as an April 2009 CTP release).
Developing Applications with the Open XML SDK
This article uses the Open XML SDK version 2.0 CTP release. Download and install the SDK using the instructions on the download link page.
Set Up the Environment
After installing the SDK on your development machine, create a new project and add a reference to the SDK’s DocumentFormat.OpenXml assembly by right-clicking References and choosing Add Reference. Click the .NET tab, and then select the DocumentFormat.OpenXml assembly. If you don’t see the assembly in the .NET tab, you can add it manually from your installation folder.
The Open XML SDK uses the .NET packaging APIs (the System.IO.Packaging namespace) internally. Those APIs live in the WindowsBase assembly, so you also need to add a reference to WindowsBase.
In your code, you’ll want to add these three main namespaces:
- DocumentFormat.OpenXml: General Open XML related functions.
- DocumentFormat.OpenXml.Packaging: APIs related to packaging.
- DocumentFormat.OpenXml.Wordprocessing: APIs for working on Microsoft Word (.docx) documents. The assembly also exposes namespaces for other Office client applications such as Excel and PowerPoint.
Creating a Basic Word Document
To get started, you’ll create a simple Word document that illustrates the basic structure of Word documents. Remember that an Open XML package is composed of parts (Main Document Part, Image Part, Video Part, Document Properties Part, etc.). Each part is represented by at least one file.
The WordprocessingDocument class represents a Word package. You can use the class to create a new package or open an existing package:
    WordprocessingDocument doc = WordprocessingDocument.Create(
        @"BasicWordDoc.docx", WordprocessingDocumentType.Document);
The Main Document part contains the text of the document, and it’s the only required part. You can add a main document part to a package using the following code:
    MainDocumentPart mainPart = doc.AddMainDocumentPart();
    mainPart.Document = new Document();
The various parts of the document itself are arranged hierarchically as follows:
    MainDocumentPart > Document > Body > Paragraph > Run > Text
The SDK exposes classes for each of these components, which makes it simple to generate a Word document programmatically. For example, to create a document containing the text “Hello World,” you walk up the hierarchy, creating the appropriate objects at each step, passing each object to the constructor of its parent object:
    Text textFirstLine = new Text("Hello World");
    Run run = new Run(textFirstLine);
    Paragraph para = new Paragraph(run);
    Body body = new Body(para);
After creating the object hierarchy, you can append the Body object to the document created earlier:
    mainPart.Document.Append(body);
At this point, you can save the main document part and close the package:
    mainPart.Document.Save();
    doc.Close();
Here’s the complete code for a C# console application that generates the “Hello World” example:
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    class Program
    {
        static void Main(string[] args)
        {
            /* Create the package and main document part */
            WordprocessingDocument doc = WordprocessingDocument.Create(
                @"BasicWordDoc.docx", WordprocessingDocumentType.Document);
            MainDocumentPart mainPart = doc.AddMainDocumentPart();
            mainPart.Document = new Document();

            /* Create the contents */
            Text textFirstLine = new Text("Hello World");
            Run run = new Run(textFirstLine);
            Paragraph para = new Paragraph(run);
            Body body = new Body(para);
            mainPart.Document.Append(body);

            /* Save the results and close */
            mainPart.Document.Save();
            doc.Close();
        }
    }
If you execute the application, it will create a new document. If you open that with Word, you’ll see the document shown in Figure 1.
Figure 1. Basic Document: This basic Word document was created programmatically.
As you can see, the process to create a basic document is straightforward; however, most documents are more complicated, containing styled text and other formatting.
Creating Styled Documents
This next example generates a slightly more complex Word document with two paragraphs, each containing text in a different font. Here’s the process:
- Create the word-processing document package and add the main document part to it, as in the previous example.
- Create the first paragraph containing the text “Hello World,” but don’t add it to the body yet:

    Text textFirstLine1 = new Text("Hello World");
    Run run1 = new Run(textFirstLine1);
    Paragraph para1 = new Paragraph(run1);

- Create a second paragraph containing the text “Hello Open XML Community”:

    Text textFirstLine2 = new Text("Hello Open XML Community");
    Run run2 = new Run(textFirstLine2);
    Paragraph para2 = new Paragraph(run2);

- You apply font formatting to a Run. A Run can contain a RunProperties object, which controls the formatting applied to the text contained in that Run. Runs that don’t have RunProperties use Word’s default font. You can create RunProperties objects independently and then apply them to any Run. The following code creates a RunProperties object that causes the Run to display its text using the Arial Black font:

    RunProperties runProp = new RunProperties();
    RunFonts runFont = new RunFonts();
    runFont.Ascii = "Arial Black";
    runProp.Append(runFont);

- Apply the RunProperties object you just created to the Run run2 you created earlier:

    run2.PrependChild(runProp);

- Create a new Body instance and append both paragraphs:

    Body body = new Body();
    body.Append(para1);
    body.Append(para2);

- Add the body to the document. Save and close the package:

    mainPart.Document.Append(body);
    mainPart.Document.Save();
    doc.Close();

When you execute this program, it will create a document containing two paragraphs with different fonts (see Figure 2).
Figure 2. Styled Text: This two-paragraph document contains formatted text.
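The same RunProperties mechanism carries other character formatting as well. The following helper is a rough sketch rather than part of the original sample: the class and method names are made up for illustration, and it assumes the Bold and FontSize classes as they appear in the released 2.0 SDK (check them against the CTP build you have installed).

```csharp
using DocumentFormat.OpenXml.Wordprocessing;

static class RunFormatting
{
    // Hypothetical helper: builds a paragraph whose single run is
    // bold, 14-point Arial Black.
    public static Paragraph CreateStyledParagraph(string text)
    {
        RunProperties runProp = new RunProperties();
        runProp.Append(new RunFonts { Ascii = "Arial Black" });
        runProp.Append(new Bold());
        // Font size is expressed in half-points, so "28" means 14 pt.
        runProp.Append(new FontSize { Val = "28" });

        Run run = new Run(new Text(text));
        // RunProperties must be the first child of the Run.
        run.PrependChild(runProp);

        return new Paragraph(run);
    }
}
```

You could then call body.Append(RunFormatting.CreateStyledParagraph("Hello Open XML Community")) in place of building para2 by hand.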
Search and Replace Text in a Word Document
Creating new documents addresses only one aspect of working with OOXML documents. This next example opens an existing document, searches for some text in that document, and replaces it with other text. This is typical of scenarios where you want to generate a large number of documents based on a small template: You’d read the template, replace some portion of the template with custom content, and then save the altered document. (You’ll see a large-template scenario in the next section.)
First, open the document using the WordprocessingDocument class and get the MainDocumentPart. The document in this example is named SearchAndReplace.docx:
    WordprocessingDocument doc = WordprocessingDocument.Open(
        @"SearchAndReplace.docx", true);
    MainDocumentPart mainPart = doc.MainDocumentPart;
Read the entire document contents using the GetStream method:
    string docText;
    using (StreamReader sr = new StreamReader(
        doc.MainDocumentPart.GetStream()))
    {
        docText = sr.ReadToEnd();
    }
At the end of this process, the docText variable contains all the XML for the document text. Next, replace contents in the docText variable as needed. For this example the template contains the text "The current version of [sdk] is [VersionNumber]."
The task is to replace the [sdk] and [VersionNumber] placeholders with actual values. You can use standard .NET string-manipulation code to make the replacement, so I won’t show it here. After replacing the text, write the complete text back to the Main Document part using the following code:
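For readers who want a starting point for the replacement step, here is one possible sketch. The class name, method name, and sample values are illustrative assumptions. It escapes each value before inserting it, because docText is raw XML and an unescaped & or < would corrupt the document.

```csharp
using System;
using System.Collections.Generic;
using System.Security; // SecurityElement.Escape

static class PlaceholderReplacer
{
    // Replaces each [key] placeholder in the document XML with its value.
    public static string Fill(string docText, IDictionary<string, string> values)
    {
        foreach (KeyValuePair<string, string> pair in values)
        {
            // Escape &, <, > and quotes so the result is still valid XML.
            string safeValue = SecurityElement.Escape(pair.Value);
            docText = docText.Replace("[" + pair.Key + "]", safeValue);
        }
        return docText;
    }

    static void Main()
    {
        // Stand-in for the XML read from the main document part.
        string docText =
            "<w:t>The current version of [sdk] is [VersionNumber].</w:t>";

        Dictionary<string, string> values = new Dictionary<string, string>
        {
            { "sdk", "Open XML SDK" },
            { "VersionNumber", "2.0 CTP" }
        };

        // Prints: <w:t>The current version of Open XML SDK is 2.0 CTP.</w:t>
        Console.WriteLine(Fill(docText, values));
    }
}
```

Be aware that Word sometimes splits text such as [VersionNumber] across several runs (for example, when proofing marks are inserted), in which case the placeholder does not appear as one contiguous string in the XML and a plain string replacement will miss it.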
    using (StreamWriter sw = new StreamWriter(
        doc.MainDocumentPart.GetStream(FileMode.Create)))
    {
        sw.Write(docText);
    }
Template-Driven Document Generation using Word Content Controls
The example in the last section read the entire contents of a short document template into memory, and then performed a search and replace operation. That’s fine for small templates, but when you have large multi-page templates, that approach will create memory and performance issues. Instead, you can use Content Controls, which help create templates, support structured editing, and also provide placeholders for various kinds of content in documents.
The primary content controls available are:
- Plain Text
- Rich Text
- Picture
- Date Picker
- Combo Box
- Drop-Down List
Apart from the intrinsic benefits that structured documents and content-type restrictions offer, you also benefit from the way OOXML stores data rendered in content controls.
OOXML stores content control data in a custom XML file in the document package. Individual controls are mapped to elements in the custom XML file. When you open such a document, it late-binds to the content control data in the file. While the document is open, any changes you make to content in the controls are reflected in the XML data, and vice versa.
The fact that content control data is stored separately and mapped to controls at runtime makes it a good candidate for generating template-based documents.
This example covers three main topics:
- Creating a template based on content controls
- Using the Word 2007 Content Control Toolkit to map controls to custom XML elements
- Updating the custom XML data programmatically, and generating documents based on the template
The next sections explain each topic in more detail.
Creating a Template
Open Word, create a new document, and switch to the Developer tab on the ribbon.
Author’s Note: If the Developer tab is not visible (it’s not by default), you can enable it by opening Word Options. To do that, click the Office button at the top left of your Word window and click the Word Options button at the bottom. In the “Popular” group, click the “Show Developer tab in the ribbon” option. Close the Word Options dialog, and the Developer tab will appear.
The Developer ribbon has a button group called “Controls” that lets you insert various kinds of controls into the document. This example uses the same template as the “Search and Replace” section earlier in this article. This time, however, you’ll create it using content controls. Again, the template example contains the text "The current version of [sdk] is [VersionNumber]."
Add two plain text controls for the SDK name and version number. After adding the controls, the template will look similar to Figure 3, depending on the control names you provided.
Figure 3. Template with Content Controls: In Word, the sample document containing the content controls should look similar to this.
Word 2007 Content Control Toolkit
The Word 2007 Content Control Toolkit provides a visual interface that helps when mapping custom XML elements to content controls, a process much easier than writing XPath queries. Download the Word 2007 Content Control Toolkit from CodePlex and install it, then start the tool and open the document template you created in the preceding section.
In the “Content Controls” pane on the left, you will see the details of the two controls in the template, including their names and types. In the “Custom XML parts” pane on the right, click on the link “Click here to create a new one” to create a new custom XML file. Switch to Edit View and add two elements that will store data for the two controls. The element names do not have to match the control names. For example, my XML file looks like this:
Switch to “Bind View” and drag the elements you created to the left pane and drop them on the controls. The drag/drop process establishes the bindings. After you’ve established the bindings, the left pane will look similar to Figure 4.
Figure 4. Bound Content Controls: Here’s how the Content Controls pane in the Word 2007 Content Control Toolkit looks after binding the SDKName and VersionNumber controls to specific XML elements.
Save the template, and then inspect the document package by changing the extension from docx to zip. You will find a new customXml folder containing your custom XML file with the data. If you change the contents of this XML file and then reopen the document (remembering to change the zip extension back to docx), you will find that the content controls now display the updated content. Similarly, if you change the control content in Word, save the file, rename it, and re-inspect the custom XML file, you’ll see that the changes have been persisted there.
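If you would rather not rename the file, you can also dump the custom XML parts with the SDK itself. This is a small sketch under assumptions: Template.docx is a hypothetical file name standing in for the template you saved, and the class name is invented for the example.

```csharp
using System;
using System.IO;
using DocumentFormat.OpenXml.Packaging;

class CustomXmlDump
{
    static void Main(string[] args)
    {
        // Hypothetical file name; use the template you saved from the toolkit.
        string fileName = args.Length > 0 ? args[0] : "Template.docx";

        // The second argument (false) opens the package read-only.
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open(fileName, false))
        {
            foreach (CustomXmlPart part in doc.MainDocumentPart.CustomXmlParts)
            {
                using (StreamReader reader = new StreamReader(part.GetStream()))
                {
                    // Print the part's location in the package, then its XML.
                    Console.WriteLine(part.Uri);
                    Console.WriteLine(reader.ReadToEnd());
                }
            }
        }
    }
}
```

The output should show the same XML that you would find in the customXml folder of the renamed ZIP file.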
With the template and bindings in place, you now have the opportunity to generate a large number of documents based on the template.
Update Custom XML Data Programmatically
Open the package using the WordprocessingDocument class’s Open method:
WordprocessingDocument wordDoc = WordprocessingDocument.Open(fileName, true);
Next, create the custom XML file containing the data for this document, and store it in the package.
Store the custom XML in memory and add placeholders for the actual data. For this sample the custom XML with placeholders is a string containing:
!Name! !Version!
You’ll replace the !Name! and !Version! placeholders with actual data for each document. This example uses the Regex utility, but you can use any code or technology you like to create your custom XML. When your custom XML is ready, replace the existing custom XML with the new one by deleting the existing one and adding the new one using the following code:
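    MainDocumentPart mainPart = wordDoc.MainDocumentPart;

    // Remove the existing custom XML parts, then add a fresh one.
    mainPart.DeleteParts(mainPart.CustomXmlParts);
    CustomXmlPart customXmlPart = mainPart.AddNewPart<CustomXmlPart>();

    // Dispose the writer so the XML is flushed into the part.
    using (StreamWriter ts = new StreamWriter(customXmlPart.GetStream()))
    {
        ts.Write(customXML);
    }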
You can now save this document under a different name, and repeat the process as needed, using different data for each document, giving you a fast way to create large numbers of custom documents.
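Putting these steps together, a batch generator might look roughly like the sketch below. Several things here are assumptions rather than the article's own code: the element names in the custom XML are hypothetical and must match whatever you bound in your template, the class and method names are invented, the placeholders are filled with String.Replace instead of Regex, and the part is added with AddCustomXmlPart as found in the released 2.0 SDK.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Security; // SecurityElement.Escape
using DocumentFormat.OpenXml.Packaging;

static class DocumentGenerator
{
    // Hypothetical custom XML; the element names must match the ones
    // bound to the content controls in your own template.
    const string CustomXmlTemplate =
        "<root><SDKName>!Name!</SDKName><Version>!Version!</Version></root>";

    public static void Generate(string templatePath, string outputPath,
                                string name, string version)
    {
        // Work on a copy so the template itself is never modified.
        File.Copy(templatePath, outputPath, true);

        // Fill the placeholders, escaping values so the XML stays valid.
        string customXml = CustomXmlTemplate
            .Replace("!Name!", SecurityElement.Escape(name))
            .Replace("!Version!", SecurityElement.Escape(version));

        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open(outputPath, true))
        {
            MainDocumentPart mainPart = wordDoc.MainDocumentPart;

            // Copy the sequence first so deleting does not disturb enumeration.
            mainPart.DeleteParts(
                new List<CustomXmlPart>(mainPart.CustomXmlParts));

            CustomXmlPart part =
                mainPart.AddCustomXmlPart(CustomXmlPartType.CustomXml);
            using (StreamWriter writer = new StreamWriter(
                part.GetStream(FileMode.Create, FileAccess.Write)))
            {
                writer.Write(customXml);
            }
        }
    }
}
```

A caller would loop over its data source and invoke Generate once per record with a distinct output path. After the first run, open one generated file in Word to verify that the content controls still resolve against the new custom XML part.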
The main difference between this and the earlier Search and Replace approach is that this technique focuses only on the dynamic data, while the other approach required parsing the entire document.
This approach is also much more efficient than Word’s Mail Merge functionality, which is typically used to create large numbers of small documents from a template and a data store.
This article used only a small fraction of the Open XML SDK, but if you’ve ever tried to create or manipulate Word files programmatically using earlier technologies, you can probably already tell that this is a simpler and much more robust way. In addition to the scenarios shown here, the Open XML SDK also lets you operate on comments and tracked changes stored in Word documents, and it contains APIs that operate on Microsoft Excel and PowerPoint documents.