Validating an XML document entails confirming that the document is both well-formed and conforms to a specific set of rules specified with a Document Type Definition (DTD), an XML Schema, or, as introduced in this article, a CAM template. DTD was the earliest specification. DTDs provided useful but limited capabilities, letting you validate XML document structure but very little in the way of semantics. Next came XML Schema, which offered more flexibility and capability, improved support for structure, and good (but not great) support for semantics. Schematron, RelaxNG, and others have attempted to improve the semantic support, but none have caught on in a big way. Now a new (really new) technology called Content Assembly Mechanism (CAM) is being developed under the aegis of OASIS, a well-respected standards body.
CAM is more than just another schema language, though. It was designed to better meet the needs of business exchange requirements and interoperability. CAM provides a powerful mechanism for validating XML both structurally and semantically, in a concise, easy-to-use, easy-to-maintain format. It provides a context mechanism—a way to dynamically adjust what should be considered a valid XML instance based upon other parts of the XML itself or external parameters.
CAM is an exciting technology with much promise, but it is a nascent technology, which can be both good and bad. Things move fast with CAM development, thus you may notice frequent “at the time of writing” disclaimers in this article. However, the chances are good that the development team will act upon some of the problems discussed here and fix them before you ever have a chance to encounter them!
So, at the time of writing this article, CAM’s documentation is sketchy: There is a formal specification, a white paper, a PowerPoint presentation, and a few web pages offering a brief introduction to the editor and to the API. There is no definitive guide or tutorial; this article functions as “CAM: The Missing Manual,” expanding upon the CAM documentation, covering both the how and the why of applying the specification and its idiosyncrasies to real-world usage.
|Author’s Note: While working on examples for this article I had to combat a variety of implementation bugs. But the development team is extremely responsive in fixing issues: Very early on I delivered a list of two dozen bugs and had a new release within 24 hours!|
What You Need
- Basic familiarity with XPath. CAM uses XPath extensively for defining business rules. See the W3Schools’ XPath Tutorial for a great refresher.
- Basic familiarity with XML Schema. While ostensibly this article is about a successor to XML Schema, it relies extensively on contrasts with XML Schema as the most effective way to communicate new approaches. See the W3Schools’ XML Schema Tutorial for a great refresher.
Dictating Valid XML
An XML document is a hierarchical composition of elements, a “generic framework for storing any amount of text or any data whose structure can be represented as a tree”. An XML document needs only to be well-formed, meaning it must have but a single root and its elements and attributes must conform to the simple XML syntax rules. However, XML has little utility until you map it into a specific problem domain, such as mathematics, book-writing, or financial transactions. Such mapping removes documents from the abstract realm of XML and places them into a specific XML dialect for your particular problem. Any document in your dialect must, by definition, be valid according to your dialect semantics; otherwise it is rejected as invalid and cannot be processed.
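The distinction between well-formed and valid can be sketched in a few lines of Python. This is only an illustration using the standard library: the `address` and `address_street` element names and the single "dialect" rule are assumptions made up for the example, not part of any real schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (single root, balanced tags)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def is_valid_address(xml_text: str) -> bool:
    """Toy dialect rule (hypothetical): the root must be <address> and must
    contain a non-empty <address_street> child."""
    if not is_well_formed(xml_text):
        return False
    root = ET.fromstring(xml_text)
    street = root.find("address_street")
    has_street = street is not None and bool((street.text or "").strip())
    return root.tag == "address" and has_street


good = "<address><address_street>221B Baker Street</address_street></address>"
other_dialect = "<recipe><step>Boil water</step></recipe>"
broken = "<address><address_street>221B Baker Street</address>"
```

Here `other_dialect` is well-formed XML but is rejected by the dialect rule, while `broken` fails even the well-formedness check because its tags do not balance.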
Consider this portion of a customer address:
221B Baker Street. . .
To validate this XML fragment in XML Schema you would typically have a structure such as:
. . . . . .
These constraints indicate that an
XML Schema is a grammar-based system, in that you define a grammar for both semantics and structure against which an XML instance must conform. Schematron, on the other hand, is a rule-based system where you specify both semantics and structure using rules (see An Introduction to Schematron). That is, not only do you use a rule that specifies an address_street is a string, but you also use a rule to specify that
|Author’s Note: See XML Schema Language Comparison for more in-depth information.|
In contrast, CAM is a hybrid system that separates structure from semantics (low coupling) and specifies semantics with rules. For example, the address example in CAM might look like this:
%street number and name%. . .
|Author’s Note: The only part of the placeholder that has semantic content is the percent signs themselves. Everything between them is completely ignored by the CAM processor; it is for you and consumers of your XML dialect. The Structure view in Figure 3, for example, uses just a generic description (%string%) for many placeholders. You might take a different approach, though, and be more specific using, for instance, %city-name% for the
|Figure 1. WYSIWYG Example: Microsoft Word users much prefer to see the rendering of the document in the left pane rather than the right, but both represent the same thing and both may be edited to alter the document.|
XML Schema is also not WYSIWYG, although some excellent tools such as XmlSpy or Liquid XML Studio help put a WYSIWYG front-end on it. Consider this XML Schema example defining a cost to be in the range 1-999 with 2 decimal places permitted:
The equivalent CAM syntax shown below separates the rules from the structure, with the rules referring back to the appropriate structure elements. The rules map obviously and intuitively to the English description:
Benefits of CAM
Table 1 summarizes the key strengths of CAM compared to XML Schema and DTDs. Each line item in the table is covered in detail later in this article or in Part II.
|#||Item||DTD||XML Schema||CAM||Example / Notes|
|1||Separates structure and business rules||no (limited business rules)||no||yes|
|2||Current-node fixed validation||no||yes||yes|
|3||Current-node conditional validation||no||limited (using pattern facets; see XML Schema Spec Part 2, section 4.3.4)
|4||Cross-node conditional validation||no||limited (using identity-constraint definitions; see XML Schema Spec Part 1, section 3.11)
|5||Context mechanism||no||yes||yes||Interpret validity differently depending on whether condition A or condition B is satisfied.|
|6||Structure variability||no||no||yes||For orders exceeding 25kg, customers must also select a freight handler to transport the goods.|
|7||Parameterized invocation||no||no||yes||Orders from Canada must meet criteria x, y, and z, while orders from New Zealand must meet criteria a, b, and c.|
|9||Define own datatypes||no||yes
Using derived types
|10||Written in same syntax as documents||no||yes
Using named types
Using XPath selector for rules and include files for structure
|12||Tools/editors||many||many||1||“Any color as long as it’s black”|
|13||Graphical designer||many||many||none||With XML Schema, designers mask the complexities of the structure.|
|14||WYSIWYG||with external framework||with external framework||inherent||Statement of business rules and implementation of them are almost identical; truly a textual WYSIWYG. On top of that, editor also provides three different auto-generated documentation modes.|
|15||Adoption||mature||mature||nascent||Mature can be better for stability, support, and overhead; nascent can be better for starting new projects cleanly with new technology.|
|16||APIs||Java, .NET, Ruby, Perl, …||Java, .NET, Ruby, Perl, …||Java|
|Author’s Note: This article is based on a comparison to XML Schema 1.0; version 1.1 is in the works and it will use some of the same types of XPath expressiveness that CAM already has.|
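To make the structure variability and parameterized invocation rows of Table 1 more concrete, here is a minimal Python sketch of the idea. It is not CAM syntax and does not use the CAM engine; the element names (`weightKg`, `freightHandler`, `province`) and the `country` parameter are hypothetical, chosen only to mirror the examples in the table.

```python
import xml.etree.ElementTree as ET


def validate_order(xml_text: str, country: str) -> list:
    """Return a list of error messages; an empty list means the order passes."""
    order = ET.fromstring(xml_text)
    errors = []

    # Structure variability: what the document must contain depends on
    # another value inside the same document.
    weight_kg = float(order.findtext("weightKg") or "0")
    if weight_kg > 25 and order.find("freightHandler") is None:
        errors.append("orders over 25kg require a freightHandler element")

    # Parameterized invocation: an external parameter selects extra rules.
    if country == "CA" and order.find("province") is None:
        errors.append("Canadian orders require a province element")

    return errors


light = "<order><weightKg>3</weightKg></order>"
heavy = "<order><weightKg>40</weightKg></order>"
heavy_ok = "<order><weightKg>40</weightKg><freightHandler>Acme</freightHandler></order>"
```

The same `heavy_ok` document is acceptable when validated with `country="NZ"` but not with `country="CA"`, which is the sense in which validity depends on context rather than on the document alone.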
Introduction to the CAM Editor
To download the latest CAM editor, follow this link and select Download in the left-hand panel; you then get a choice of downloading the CAM template editor or the JCam engine. For the bulk of this article you need only the CAM template editor (the JCam engine performs CAM validation programmatically).
To get started with the CAM editor, you may create a template from scratch, from an existing XML file, or from an existing schema (XSD) file. You’ll find the ease of creating a CAM template is exactly the reverse order; that is, you gain the most leverage from an XSD file, some from an XML file, and of course none when starting from scratch. So to get started, use the canonical Purchase Order schema from the W3C. Store this file locally as po.xsd. In the editor, select File > New Template from Schema…, and supply, in separate fields, the directory and the file name where you stored the file (see Figure 2). The application freezes for a few seconds while it processes the file; when it comes back it fills out the Root Element field.
|Figure 2. New Template From Schema Dialog: Select the directory and file name pointing to your XSD file, then select the root element from that schema to have the CAM editor generate a base CAM template for you.|
The comment element is simply the name of the first node (in alphabetical order) among all
Purchase order schema for Example.com. Copyright 2000 Example.com. All rights reserved. ...
You actually want the purchaseOrder element as the root, so switch the Root Element in the dialog to purchaseOrder, and then click OK to generate the template. The application prompts you (well, forces you) to save the template before proceeding. After doing that, the template opens in the CAM Template Editor (see Figure 3).
|Figure 3. The CAM Editor: After generating a template from the po.xsd schema, the editor shows both a Structure and a Rules view. The labels explain the iconic and textual conventions of the Structure view.|
Each tabbed container in the editor is referred to as a view. The Structure view shows the tree-structured hierarchy of the XML. Figure 3 shows that a purchase order has an orderDate attribute along with four child nodes: shipTo, billTo, comment, and items. The items node may contain multiple item child nodes. The CAM editor closely mirrors the underlying XML CAM template file (PurchaseOrder/purchaseOrder_from_schema.cam). As shown below, the
%string% %string% %string% %string% %54321.00% %string% %string% %string% %string% %54321.00% %string%
%string% %1% %54321.00% %string% %YYYY-MM-DDZ%
That is partly because the CAM file maintains a clean separation between form (
You can see the complete CAM template file in Listing 2.
The Rules view (highlighted in Figure 3) shows all the business rules comprising the semantics of the template. Unlike structure, rules are stored differently in the file than in the Rules view. Table 2 reproduces the rules as shown in the Rules view for a close-up look. Without going into all the details of these rules, what you can glean from them is:
- Rules may be conditional or absolute. For example, the orderDate format requirement changes depending on its length.
- Items and conditions are specified via XPath. XPath is used extensively in CAM, providing tremendous flexibility and resolution. XML Schema 1.0, by contrast, uses XPath only for the advanced xs:unique and xs:key concepts.
- Rules may apply to as broad or as narrow a range of elements as you need. By its very nature, XPath supports selection of whatever part of a document you need: one element, one attribute, all elements of a given name, all elements in a certain position in the tree, etc.
- Rules are compact, concise, and intuitive. In fact, as you’ll see, writing CAM rules is practically the same thing as writing your application requirements.
|Condition||Item||Action|
|string-length(.) < 11||//purchaseOrder/@orderDate||setDateMask(YYYY-MM-DD)|
|string-length(.) > 10||//purchaseOrder/@orderDate||setDateMask(YYYY-MM-DDZ)|
|string-length(.) < 11||//item/shipDate||setDateMask(YYYY-MM-DD)|
|string-length(.) > 10||//item/shipDate||setDateMask(YYYY-MM-DDZ)|
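The shape of these rules (a condition, a path that selects nodes, and an action to apply) can be approximated in plain Python. The sketch below is not how the CAM engine works internally. It assumes that the trailing Z in the second mask is a literal time-zone marker, replaces the date masks with simple regular expressions, and covers only the shipDate rules because the standard library's limited XPath support cannot select attributes such as @orderDate.

```python
import re
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two date masks (assumption: the real mask
# semantics are richer than these patterns).
DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
DATE_Z = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}Z")

# Each rule: (condition on the node's text, ElementTree path, pattern to enforce)
RULES = [
    (lambda text: len(text) < 11, ".//item/shipDate", DATE),
    (lambda text: len(text) > 10, ".//item/shipDate", DATE_Z),
]


def check(root):
    """Return the text of every selected node that fails its applicable rule."""
    failures = []
    for condition, path, pattern in RULES:
        for node in root.findall(path):
            text = (node.text or "").strip()
            if condition(text) and not pattern.fullmatch(text):
                failures.append(text)
    return failures


sample = ET.fromstring(
    "<purchaseOrder><items>"
    "<item><shipDate>1999-05-21</shipDate></item>"
    "<item><shipDate>1999-05-21Z</shipDate></item>"
    "<item><shipDate>21/05/1999</shipDate></item>"
    "</items></purchaseOrder>"
)
print(check(sample))  # ['21/05/1999']
```

Because each rule carries its own path, the same condition and action pair can be pointed at one element, at every element of a given name, or at a whole subtree simply by changing the path.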
An Example CAM Validation
With a template in hand, you may now validate an XML file against the template. The W3C site, besides providing the sample purchase order schema, kindly provides a sample purchase order instance (PurchaseOrder/po.xml)—but the download contains one typographic error. Figure 4 highlights the error. If you attempt to open or validate a malformed XML file, the CAM editor displays a stack dump and an error message (also shown in Figure 4), and refuses to load the file.
|Figure 4. Malformed XML: The figure shows why the original po.xml file is not well-formed; loading it into the CAM editor results in the ugly error popup shown.|
|Figure 5. The XML View: When you open an XML file it is rendered in an XML view showing the tree structure of the document with icons to collapse and expand the tree portions.|
After you correct the error by swapping the exclamation point and the left angle bracket (the corrected file is PurchaseOrder/po_corrected.xml) you can load the XML file using the CAM editor’s XML > Open XML menu item. The CAM Editor displays the file in an XML view, rendering it in a style similar to the structure view (see Figure 5). The same elements are present as in the template, but now appear with actual values rather than placeholders (the descriptive terms surrounded by percent signs).
To validate the document, select Run > Run JCam. You’ll see the Run JCam dialog shown in Figure 6. By default JCam selects the loaded XML file, and should identify its structure ID as purchaseOrder (the root of this structure). Click Finish to close the dialog and run the validation; the results appear in the Run Results view at the bottom of the main window. Notice that the validation indicates two errors, although only one is in view in Figure 6. If you look closely you’ll see that nodes with errors have a tiny yellow or red error icon attached to them and their ancestors. In this case, because the error occurs on the
|Figure 6. Performing a Validation: Validation results appear in the Run Results view. Each element or attribute that fails validation has an attached error symbol; its ancestors have a warning symbol.|
This XML file validates with no errors in any XML Schema editor. Why does it fail here? The error in the Run Results view indicates the zip code is not valid according to the CAM template. The template is looking for a floating-point number with 2 decimal places whereas zip codes in the US, of course, are 5-digit or 9-digit integers. The CAM template rule for the zip code came from the XSD specification, which states simply that a zip code is a decimal. You can see this in Listing 1: Look for the zip field within the USAddress complex type. The CAM template generator could do only as good a job as its input allowed (a mild case of GIGO). While you might disagree, I submit that the XSD specification is too forgiving; the datatype should have been an integer rather than a decimal. The next section discusses how to correct this error in the CAM Editor.
As you follow along with the examples in this article or explore on your own, you might run into a template that is not behaving as expected. There are a couple of things to check in that event.
- Invoke the Tools > Validate CAM Template menu item to look for any issues from the editor’s perspective.
- If you press Finish in the Run JCam dialog box and nothing seems to happen, press Cancel to close the dialog, and then take a look at the Console view for any error messages. If, for example, you neglected to specify an XML file to validate, the dialog does not disable the Finish button but rather lets you press it, and reports the misleading error “template is null” in the Console view. (Other conditions may cause that error as well.) If the Console view is not visible, nothing appears to happen.
Creating Business Rules
In the Structure view select the
|Figure 7. Editing a Constraint Rule: To fix the setNumberMask predicate attached to the //shipTo/zip elements, select the element in the Structure view, open its context menu, and select Edit Rule to open the Edit Constraint Rule dialog. Click the Number Mask field for help in specifying the mask.|
Click on the number mask field, which opens another dialog to edit the mask. For now, just modify the field from ######.## to #####; that is, replace the original mask with just five octothorpes. Close both dialogs. In the main editor window you’ll see the updated rule. Re-execute the validation. The //shipTo/zip error should be gone, leaving only an error on //billTo/zip. This is clearly the same error, so you can fix it the same way. But because the //billTo/zip value should always behave identically to the //shipTo/zip value, it would be much cleaner to have a common rule for both rather than separate rules. The Common Rules section in Part II of this article discusses how to do this in more detail.
After updating the rule you also need to update the placeholder (item 1 in Figure 7). If you compare that to Figure 6, you can see that the value changed from %54321.00% to %54321%, which is more representative of a zip code. In this particular example, where the element’s placeholder and the associated rule are closely related, it is reasonable to suppose that they should automatically track each other in some fashion. But in many cases the relationship is not nearly as straightforward. Elements and rules have a many-to-many relationship: You could have multiple rules applied to a single element or a single rule applied to multiple elements.
To update the element’s placeholder as in Figure 7, open the context menu on the //shipTo/zip field in the Structure view and select Edit Text. In the dialog change %54321.00% to %54321%.
The placeholder serves a dual role. The CAM processor uses it solely to determine if an element’s content is fixed or not, determined by the presence of the percent signs surrounding the placeholder. (Notice that you re-ran the validation and the //shipTo/zip field validated before updating the element’s placeholder, confirming that the value between the percent signs is ignored by the CAM processor.)
The value between the percent signs is for human consumption, and should accurately and concisely convey what the element contains. Often the context has already done most of the work for you: the element name is “zip”, which is immediately recognized in the US as a string containing 5 or 9 digits (10 characters when the ZIP+4 hyphen is included). By setting the placeholder to %54321% you are telling consumers of the template that you want only five-digit zip codes.
Stress-Testing the Validation
Now you have updated the placeholder and the rule. But are these two changes sufficient to properly validate a five-digit zip code? To check this you need to feed different test cases to the CAM processor. The simplest way is to open the XML view containing the data that you are validating, change the //shipTo/zip value, and re-validate. You edit nodes in the XML view just as in the Structure view: open the context menu and select Edit Text. Determine the smallest set of values that yield good coverage of all possible values (that is, determine appropriate equivalence classes of data) and feed each one to the validator. Table 3 provides one such list. There are two result columns because, as you may have surmised, what you have done so far does not properly validate values in the zip field. The two items marked in red in the second column produced an incorrect result. In this case, both passed validation when they should have both failed.
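The equivalence-class approach is easy to sketch outside the editor. In the Python below, `passes_numeric_check` is a naive stand-in for a five-digit numeric mask (an assumption for illustration, not the CAM engine's algorithm), and the test values are hypothetical representatives rather than the actual contents of Table 3.

```python
def passes_numeric_check(value: str) -> bool:
    """Naive stand-in for a five-digit numeric mask: accept any integer
    whose magnitude fits in five digits."""
    try:
        number = int(value)
    except ValueError:
        return False
    return abs(number) <= 99999


# One representative per equivalence class: (value, should it be a valid zip?)
CASES = [
    ("12345", True),    # ordinary five-digit zip
    ("00001", True),    # leading zeroes are significant
    ("1", False),       # too short
    ("-1234", False),   # negative number
    ("123456", False),  # too long
    ("abcde", False),   # not digits at all
    ("", False),        # empty
]

# Values where the check disagrees with the expected outcome.
mismatches = [value for value, expected in CASES if passes_numeric_check(value) != expected]
print(mismatches)  # ['1', '-1234']
```

A small table of representatives like this is quicker to re-run after each rule change than editing values one at a time, and it makes false passes stand out immediately.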
These two tests passed for the same reason: The mask is numeric, and both tests are valid numbers. So you need to back up a step. Even though a zip code contains only numbers, it is really a string masquerading as a number. While numerically, 00001 and 1 are the same, in the domain of zip codes, 00001 represents a valid zip code, while 1 does not. Therefore, instead of setting a numeric mask, use a textual mask. Open the Edit Constraint Rule dialog for //shipTo/zip and change the action from setNumberMask to setStringMask. Click on the String Mask field to open the mask editor. Type five zeroes or press the “Digit [0-9]” button five times, then exit both dialogs. If you now re-validate each test case in Table 3, you’ll find that they all produce correct results, as shown in column three.
Changing the rule from checking for numbers to checking for strings let the processor fail the negative value, and changing the mask character from “#” (indicating a digit where leading zeroes may be absent) to “0” (indicating a digit where leading zeroes are required) allowed the processor to fail the 1 value. The value would pass if you changed it to 00001. The list of valid mask characters is documented in the formal CAM specification under section 3.4.3: CAM Content Mask Syntax. Table 4 is an adaptation from that section, with the text revised for clarity.
|Character||Description|
|String masks|
|X||Any character; mandatory|
|A||Mandatory alphanumeric character or space|
|a||Optional alphanumeric character or space|
|?||Any single character|
|*||Zero or more characters|
|U||A character to be converted to upper case|
|L||A character to be converted to lower case|
|0||A digit; trailing and leading zeros displayed; leading minus sign permitted|
|#||A digit; trailing and leading zeros suppressed; leading minus sign permitted|
|' '||Single quotes escape a character block to denote mandatory character/s|
|Number masks|
|0||A digit; trailing and leading zeros displayed; leading minus sign permitted|
|#||A digit; trailing and leading zeros suppressed; leading minus sign permitted|
|.||Literal decimal point|
|J||As the first character of a mask, invokes alternate Java formatting methods to handle mask processing (the literal J is ignored when passed to Java)|
|Date masks|
|DD||Day number in a month|
|DDD||Day number in a year|
|DDDD||Relative day number(?) in a month|
|MM||Month number in a year|
|MMM…||Month name, e.g. January (field is padded or truncated to the number of M’s, 3-10 permitted)|
|W||Day number in a week|
|WWW…||Day name (field is padded or truncated to the number of W’s, 3-10 permitted)|
|/||Literal virgule; a date separator|
|-||Literal hyphen; alternate date separator|
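One way to build intuition for string masks is to translate a mask into a regular expression. The Python sketch below handles only three mask characters, and the meanings it assigns to them are simplified assumptions based on the descriptions in this article (for example, it treats 0 as exactly one required digit). It is not the CAM engine's implementation, and the function names are made up for the example.

```python
import re

# Assumed, simplified meanings for a small subset of the string mask characters.
MASK_TO_REGEX = {
    "0": r"[0-9]",         # a digit that must be present
    "X": r".",             # any character, mandatory
    "A": r"[A-Za-z0-9 ]",  # mandatory alphanumeric character or space
}


def mask_to_pattern(mask: str) -> re.Pattern:
    """Translate a mask built only from the subset above into a compiled regex."""
    parts = []
    for char in mask:
        if char not in MASK_TO_REGEX:
            raise ValueError(f"mask character {char!r} is not covered by this sketch")
        parts.append(MASK_TO_REGEX[char])
    return re.compile("".join(parts))


def matches_mask(value: str, mask: str) -> bool:
    """Return True if the whole value matches the mask, position by position."""
    return mask_to_pattern(mask).fullmatch(value) is not None
```

With the five-zero mask, `matches_mask("00001", "00000")` succeeds while `matches_mask("1", "00000")` and `matches_mask("-1234", "00000")` both fail, because every position must hold a digit. That positional matching is what a purely numeric comparison cannot express.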
If you are looking for a tool set with full, clear documentation, and one that has had virtually all the bugs ironed out, you must regrettably look elsewhere. But if you do not mind a few rough edges on a gem of great value, I believe you will find CAM to be a great tool for your arsenal. Finally, given the zeal of the developers, it is quite possible that the behavior of the latest version of the CAM editor and the CAM engine may vary from what I describe here, using version 1.6.2.
This concludes Part I, but you have seen only a glimpse of how intuitive and easy it is to design with CAM. In the next part of this article you’ll see much more of CAM’s expressive power. Additionally, you’ll see much more in-depth discussion of practical techniques for developing templates and rules including: leveraging common structure and common rules; conditionalizing validation based on either internal or external factors; detailed comparison to XSD regarding datatypes, compositors, and cardinality; and finally, some pitfalls to avoid.