Taking XML Validation to the Next Level: Introducing CAM

Validating an XML document entails confirming that the document is both well-formed and conforms to a specific set of rules specified with a Document Type Definition (DTD), an XML Schema, or (as introduced in this article) a CAM template. DTD was the earliest specification. DTDs provided useful but limited capabilities, letting you validate XML document structure but very little in the way of semantics. Next came XML Schema, which offered more flexibility and capability, improved support for structure, and good (but not great) support for semantics. Schematron, RelaxNG, and others have attempted to improve the semantic support, but none has caught on in a big way. Now a new (really new) technology called Content Assembly Mechanism (CAM) is being developed under the aegis of OASIS, a well-respected standards body.

CAM is more than just another schema language, though. It was designed to better meet the needs of business exchange requirements and interoperability. CAM provides a powerful mechanism for validating XML both structurally and semantically, in a concise, easy-to-use, easy-to-maintain format. It provides a context mechanism—a way to dynamically adjust what should be considered a valid XML instance based upon other parts of the XML itself or external parameters.

CAM is an exciting technology with much promise, but it is a nascent technology, which can be both good and bad. Things move fast with CAM development, thus you may notice frequent “at the time of writing” disclaimers in this article. However, the chances are good that the development team will act upon some of the problems discussed here and fix them before you ever have a chance to encounter them!

So, at the time of writing this article, CAM’s documentation is sketchy: There is a formal specification, a white paper, a PowerPoint presentation, and a few web pages offering a brief introduction to the editor and to the API. There is no definitive guide or tutorial; this article functions as “CAM: The Missing Manual,” expanding upon the CAM documentation, covering both the how and the why of applying the specification and its idiosyncrasies to real-world usage.

Author’s Note: While working on examples for this article I had to combat a variety of implementation bugs. But the development team is extremely responsive in fixing issues: Very early on I delivered a list of two dozen bugs and had a new release within 24 hours!

What You Need

  • Basic familiarity with XPath. CAM uses XPath extensively for defining business rules. See the W3Schools’ XPath Tutorial for a great refresher.
  • Basic familiarity with XML Schema. While ostensibly this article is about a successor to XML Schema, it relies extensively on contrasts with XML Schema as the most effective way to communicate new approaches. See the W3Schools’ XML Schema Tutorial for a great refresher.

Dictating Valid XML

An XML document is a hierarchical composition of elements, a “generic framework for storing any amount of text or any data whose structure can be represented as a tree”. An XML document needs only to be well-formed, meaning it must have but a single root and its elements and attributes must conform to the simple XML syntax rules. However, XML has little utility until you map it into a specific problem domain, such as mathematics, book-writing, or financial transactions. Such mapping removes documents from the abstract realm of XML and places them into a specific XML dialect for your particular problem. Any document in your dialect must, by definition, be valid according to your dialect semantics; otherwise it is rejected as invalid and cannot be processed.
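To make the distinction between well-formed and valid concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The dialect check (a required root element name) is a hypothetical stand-in for a real DTD, schema, or CAM template; it does not reflect how any of those technologies work internally.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML (single root, balanced tags)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_valid_for_dialect(text: str, required_root: str) -> bool:
    """Toy validity check: well-formed AND the root has the expected name.

    A real dialect imposes many more rules; this only illustrates that
    validity is an extra layer on top of well-formedness.
    """
    if not is_well_formed(text):
        return False
    return ET.fromstring(text).tag == required_root
```

A document with two roots or an unclosed tag fails the first check; a well-formed document with the wrong root passes the first check but fails the second.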

Consider this portion of a customer address:

   
221B Baker Street . . .

To validate this XML fragment in XML Schema you would typically have a structure such as:

          . . .              . . .   

These constraints indicate that an address_street element must exist, be contained within an address element, and must contain a string. For an address, a simple string value may be appropriate, but for other fields you would generally use something more specific, either a specialized string (a derived, restricted string), a date, an integer, or other defined type.

XML Schema is a grammar-based system, in that you define a grammar for both semantics and structure to which an XML instance must conform. Schematron, on the other hand, is a rule-based system where you specify both semantics and structure using rules (see An Introduction to Schematron). That is, not only do you use a rule that specifies an address_street is a string, but you also use a rule to specify that it must appear within an address element. Both XML Schema and Schematron fundamentally intertwine semantics and structure. In programming terms, the coupling is high, which is not desirable.

Author’s Note: See XML Schema Language Comparison for more in-depth information.

In contrast, CAM is a hybrid system that separates structure from semantics (low coupling) and specifies semantics with rules. For example, the address example in CAM might look like this:

          
%street number and name% . . .

The structure section of the CAM template defines the hierarchical structure of the XML document in a fashion that virtually duplicates an example XML instance, substituting placeholders (demarcated with percent signs) for actual data. So the preceding CAM template indicates that an XML instance would replace the %street number and name% placeholder with an actual street address.

Author’s Note: The only part of the placeholder that has semantic content is the percent signs themselves. Everything between them is completely ignored by the CAM processor; it is for you and consumers of your XML dialect. The Structure view in Figure 3, for example, uses just a generic description (%string%) for many placeholders. You might take a different approach, though, and be more specific, using, for instance, %city-name% for the city element, %2-letter state abbreviation% for the state element, etc.
 
Figure 1. WYSIWYG Example: Microsoft Word users much prefer to see the rendering of the document in the left pane rather than the right, but both represent the same thing and both may be edited to alter the document.

The structure section does embody some semantics (those that define which elements contain which other elements and in what order); however, unlike Schematron, you do not need to laboriously write rules to define the structure. CAM specifies structure in a true WYSIWYG fashion, whereas for Schematron you have to write the “code.” This is analogous to using Microsoft Word in its natural, WYSIWYG form vs. writing the RTF text to generate a Word document: writing RTF is tedious, difficult, and error prone (see Figure 1).

XML Schema is also not WYSIWYG, although some excellent tools such as XmlSpy or Liquid XML Studio help put a WYSIWYG front-end on it. Consider this XML Schema example defining a cost to be in the range 1-999 with 2 decimal places permitted:

                                                                  

The equivalent CAM syntax shown below separates the rules from the structure, with the rules referring back to the appropriate structure elements. The rules map obviously and intuitively to the English description:

      

The rules section of the CAM template defines all the semantics other than those implicitly embodied in the structure section, including datatypes, restrictions, cardinality, conditions, and more.
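The idea of keeping rules apart from structure, with each rule pointing back at the nodes it governs, can be sketched in Python. This is only an illustration of the separation, not CAM syntax or the JCam API: the element name cost, the helper names, and the rule list are all hypothetical, and the paths use the limited XPath subset supported by xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation


def in_range(low, high):
    """Build a check that the text is a number between low and high."""
    def check(text):
        try:
            value = Decimal(text)
        except (InvalidOperation, TypeError):
            return False
        return value.is_finite() and Decimal(low) <= value <= Decimal(high)
    return check


def max_decimals(places):
    """Build a check that the text has at most `places` decimal digits."""
    def check(text):
        try:
            value = Decimal(text)
        except (InvalidOperation, TypeError):
            return False
        return value.is_finite() and -value.as_tuple().exponent <= places
    return check


# Rules live in their own list and refer to structure by path.
RULES = [
    (".//cost", in_range(1, 999)),
    (".//cost", max_decimals(2)),
]


def validate(xml_text, rules):
    """Apply every rule to every node its path selects; collect failures."""
    root = ET.fromstring(xml_text)
    errors = []
    for path, check in rules:
        for node in root.findall(path):
            if not check(node.text):
                errors.append((path, node.text))
    return errors
```

Because the rules are data rather than part of the structure definition, tightening the range or adding a second rule for the same element does not touch the description of the document tree.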

Benefits of CAM

Table 1 summarizes the key strengths of CAM compared to XML Schema and DTDs. Each line item in the table is covered in detail later in this article or in Part II.

Table 1. Vital Validation Features: The technology(ies) that have the best support for each feature are highlighted in green. CAM clearly has, by far, the strongest repertoire of the three technologies.
# | Item | DTD | XML Schema | CAM | Example / Notes
1 | Separates structure and business rules | no (limited business rules) | no | yes |
2 | Current-node fixed validation | no | yes | yes | An element holds an integer between 0 and 100.
3 | Current-node conditional validation | no | limited (using pattern facets; see XML Schema Spec Part 2, section 4.3.4) | yes | A value must be either 5 or 10 digits.
4 | Cross-node conditional validation | no | limited (using identity-constraint definitions; see XML Schema Spec Part 1, section 3.11) | yes | One element must be no if a related state element is AK, FL, NV, SD, TX, WA, WY, NH, or TN; otherwise it must be yes.
5 | Context mechanism | no | yes | yes | Interpret validity differently depending on whether condition A or condition B is satisfied.
6 | Structure variability | no | no | yes | For orders exceeding 25kg, customers must also select a freight handler to transport the goods.
7 | Parameterized invocation | no | no | yes | Orders from Canada must meet criteria x, y, and z, while orders from New Zealand must meet criteria a, b, and c.
  | Datatypes | 10 | 44+ | 44+ |
8 | Namespace aware | no | yes | yes |
9 | Define own datatypes | no | yes (using derived types) | yes (using constraints) | A value must be an eight-character string.
10 | Written in same syntax as documents | no | yes (XML) | yes (XML) |
11 | Code reuse | limited | yes (using named types) | yes (using XPath selector for rules and include files for structure) | Two address elements contain all the same children and some validation rules.
12 | Tools/editors | many | many | 1 | “Any color as long as it’s black”
13 | Graphical designer | many | many | none | With XML Schema, designers mask the complexities of the structure.
14 | WYSIWYG | with external framework | with external framework | inherent | Statement of business rules and implementation of them are almost identical; truly a textual WYSIWYG. On top of that, the editor also provides three different auto-generated documentation modes.
15 | Adoption | mature | mature | nascent | Mature can be better for stability, support, and overhead; nascent can be better for starting new projects cleanly with new technology.
16 | APIs | Java, .NET, Ruby, Perl, … | Java, .NET, Ruby, Perl, … | Java |
17 | Open standard | yes | yes | yes |
Author’s Note: This article is based on a comparison to XML Schema 1.0; version 1.1 is in the works and it will use some of the same types of XPath expressiveness that CAM already has.

Introduction to the CAM Editor

To get the latest CAM editor, follow the download link, select Download in the left-hand panel, and choose either the CAM template editor or the JCam engine. For the bulk of this article you need only the CAM template editor (the JCam engine performs CAM validation programmatically).

To get started with the CAM editor, you may create a template from scratch, from an existing XML file, or from an existing schema (XSD) file. The ease of creating a CAM template runs in exactly the reverse order; that is, you gain the most leverage from an XSD file, some from an XML file, and of course none when starting from scratch. So to get started, use the canonical Purchase Order schema from the W3C. Store this file locally as po.xsd. In the editor, select File > New Template from Schema…, and supply the directory and the file name, in separate fields, for the file you stored (see Figure 2). The application freezes for a few seconds while it processes the file; when it comes back it fills out the Root Element field.

 
Figure 2. New Template From Schema Dialog: Select the directory and file name pointing to your XSD file, then select the root element from that schema to have the CAM editor generate a base CAM template for you.

The Root Element field initially shows comment, which is simply the name of the first node (in alphabetical order) among all top-level element nodes in the po.xsd file. This file happens to contain two such nodes, comment and purchaseOrder, shown in bold in the schema excerpt below (you can see the full schema in Listing 1).

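    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

      <xsd:annotation>
        <xsd:documentation xml:lang="en">
          Purchase order schema for Example.com.
          Copyright 2000 Example.com. All rights reserved.
        </xsd:documentation>
      </xsd:annotation>

      <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>

      <xsd:element name="comment" type="xsd:string"/>

      ...
    </xsd:schema>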
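The default choice described above can be reproduced with a few lines of Python. This is a sketch of the observed behavior only (collect the top-level element declarations and sort them by name); the editor's actual selection logic is not documented here, and the function name candidate_roots is hypothetical.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def candidate_roots(xsd_text: str) -> list[str]:
    """Return the names of top-level element declarations, sorted A to Z."""
    schema = ET.fromstring(xsd_text)
    names = [
        child.get("name")
        for child in schema  # direct children only, i.e. global declarations
        if child.tag == f"{{{XSD_NS}}}element" and child.get("name")
    ]
    return sorted(names)
```

For a schema that declares purchaseOrder and comment at the top level, the first entry of the sorted list is comment, which matches what the dialog pre-selects.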

You actually want the purchaseOrder element as the root, so switch the Root Element in the dialog to purchaseOrder, and then click OK to generate the template. The application prompts you (well, forces you) to save the template before proceeding. After doing that, the template opens in the CAM Template Editor (see Figure 3).

 
Figure 3. The CAM Editor: After generating a template from the po.xsd schema, the editor shows both a Structure and a Rules view. The labels explain the iconic and textual conventions of the Structure view.

Each tabbed container in the editor is referred to as a view. The Structure view shows the tree-structured hierarchy of the XML. Figure 3 shows that a purchase order has an orderDate attribute along with four child nodes: shipTo, billTo, comment, and items. The items node may contain multiple item child nodes. The CAM editor closely mirrors the underlying XML CAM template file (PurchaseOrder/purchaseOrder_from_schema.cam). As shown below, the structure section in the file shows virtually the same information line by line as the Structure view in Figure 3:

                                             %string%             %string%             %string%             %string%             %54321.00%                                   %string%             %string%             %string%             %string%             %54321.00%                      %string%                                       %string%               %1%               %54321.00%               %string%               %YYYY-MM-DDZ%                                             

That is partly because the CAM file maintains a clean separation between form (structure) and function (business rules). In contrast, XSD files intermingle structure with the business rules (thus incurring higher maintenance costs). Here’s the top-level skeleton of a complete CAM file, showing the two main elements:

                     

You can see the complete CAM template file in Listing 2.

The Rules view (highlighted in Figure 3) shows all the business rules comprising the semantics of the template. Unlike structure, rules are stored differently in the file than in the Rules view. Table 2 reproduces the rules as shown in the Rules view for a close-up look. Without going into all the details of these rules, what you can glean from them is:

  • Rules may be conditional or absolute. For example, the orderDate format requirement changes depending on its length.
  • Items and conditions are specified via XPath. XPath is used extensively in CAM, providing tremendous flexibility and resolution. XML Schema 1.0, by contrast, uses XPath only for the advanced xs:unique and xs:key concepts.
  • Rules may apply to as broad or as narrow a range of elements as you need. By its very nature, XPath supports selection of whatever part of a document you need: one element, one attribute, all elements of a given name, all elements in a certain position in the tree, etc.
  • Rules are compact, concise, and intuitive. In fact, as you’ll see, writing CAM rules is practically the same thing as writing your application requirements.

Table 2. Business Rules in the Editor: Converting the XML Schema for the purchase order to CAM automatically generates these rules, which serve as a starting point.
Condition | Item | Action
(none) | //purchaseOrder/@orderDate | makeOptional()
string-length(.) < 11 | //purchaseOrder/@orderDate | setDateMask(YYYY-MM-DD)
string-length(.) > 10 | //purchaseOrder/@orderDate | setDateMask(YYYY-MM-DDZ)
(none) | //shipTo/@country | makeOptional()
(none) | //shipTo/@country | datatype(NMTOKEN)
(none) | //shipTo/zip | setNumberMask(######.##)
(none) | //billTo/@country | makeOptional()
(none) | //billTo/@country | datatype(NMTOKEN)
(none) | //billTo/zip | setNumberMask(######.##)
(none) | //purchaseOrder/comment | makeOptional()
(none) | //items/item | makeRepeatable()
(none) | //items/item | makeOptional()
(none) | //item/quantity | setNumberMask(######)
(none) | //item/quantity | setNumberRange(1-999999)
(none) | //item/USPrice | setNumberMask(######.##)
(none) | //item/comment | makeOptional()
(none) | //item/shipDate | makeOptional()
string-length(.) < 11 | //item/shipDate | setDateMask(YYYY-MM-DD)
string-length(.) > 10 | //item/shipDate | setDateMask(YYYY-MM-DDZ)
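The conditional date rules in Table 2 amount to "pick the mask based on the length of the value, then check the value against that mask." A rough Python sketch of that logic follows. It is not the CAM processor's implementation; in particular, it assumes the trailing Z in YYYY-MM-DDZ is a literal Z (UTC designator), whereas the real processor may accept other time zone forms. The function names are hypothetical.

```python
from datetime import datetime


def matches_date_mask(value: str, mask: str) -> bool:
    """Check a value against one of the two date masks from Table 2.

    Assumption: the trailing Z in YYYY-MM-DDZ is treated as a literal 'Z'.
    """
    if mask == "YYYY-MM-DD":
        date_part = value
    elif mask == "YYYY-MM-DDZ":
        date_part, suffix = value[:10], value[10:]
        if suffix != "Z":
            return False
    else:
        raise ValueError(f"unsupported mask: {mask}")
    if len(date_part) != 10:  # insist on zero-padded month and day
        return False
    try:
        datetime.strptime(date_part, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_date(value: str) -> bool:
    """Conditional rule: the string length decides which mask applies."""
    mask = "YYYY-MM-DD" if len(value) < 11 else "YYYY-MM-DDZ"
    return matches_date_mask(value, mask)
```

The two conditions (string-length < 11 and string-length > 10) are complementary, so exactly one mask applies to any given value.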

An Example CAM Validation

With a template in hand, you may now validate an XML file against the template. The W3C site, besides providing the sample purchase order schema, kindly provides a sample purchase order instance (PurchaseOrder/po.xml)—but the download contains one typographic error. Figure 4 highlights the error. If you attempt to open or validate a malformed XML file, the CAM editor displays a stack dump and an error message (also shown in Figure 4), and refuses to load the file.

 
Figure 4. Malformed XML: The figure shows why the original po.xml file is not well-formed; loading it into the CAM editor results in the ugly error popup shown.
 
Figure 5. The XML View: When you open an XML file it is rendered in an XML view showing the tree structure of the document with icons to collapse and expand the tree portions.

After you correct the error by swapping the exclamation point and the left angle bracket (the corrected file is PurchaseOrder/po_corrected.xml) you can load the XML file using the CAM editor’s XML > Open XML menu item. The CAM Editor displays the file in an XML view, rendering it in a style similar to the structure view (see Figure 5). The same elements are present as in the template, but now appear with actual values rather than placeholders (the descriptive terms surrounded by percent signs).
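If you would rather find such a typo without wading through a stack dump, any XML parser can report where the document stops being well-formed. Here is a small Python sketch using xml.etree.ElementTree; the function name locate_error is hypothetical, and this is independent of the CAM editor.

```python
import xml.etree.ElementTree as ET


def locate_error(text: str):
    """Return (line, column, message) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None
```

Calling locate_error on a fragment where a comment opens with !<-- instead of <!-- returns the line on which the swapped characters occur, while the corrected fragment returns None.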

To validate the document select Run > Run JCam. You’ll see the Run JCam dialog shown in Figure 6. By default JCam selects the loaded XML file, and should identify its structure ID as purchaseOrder (the root of this structure). Click Finish to close the dialog and run the validation; the results appear in the Run Results view at the bottom of the main window. Notice that the validation indicates two errors, although only one is in view in Figure 6. If you look closely you’ll see that nodes with errors have a tiny yellow or red error icon attached to them and their ancestors. In this case, because the error occurs on the zip element, its parent shipTo element also displays an error icon, as does the root element. Similarly, you can deduce that the second error is buried within the billTo element.

 
Figure 6. Performing a Validation: Validation results appear in the Run Results view. Each element or attribute that fails validation has an attached error symbol; its ancestors have a warning symbol.
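The way icons bubble up the tree can be expressed compactly. The following Python sketch takes a list of failing node paths and works out which nodes would carry an error symbol and which a warning symbol, following the convention in the Figure 6 caption. The path strings and function names are hypothetical; this is not how the editor is implemented.

```python
def ancestors(path: str) -> list[str]:
    """List the ancestor paths of a node, nearest parent first."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def flagged_nodes(error_paths):
    """Map every node that should carry an icon to 'error' or 'warning'."""
    flags = {}
    for path in error_paths:
        for parent in ancestors(path):
            flags.setdefault(parent, "warning")
    for path in error_paths:
        flags[path] = "error"  # a failing node always shows the error icon
    return flags
```

With both zip elements failing, the root and both address elements are flagged as warnings, which is why you can spot the second error even when its node is scrolled out of view.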

This XML file validates with no errors in any XML Schema editor. Why does it fail here? The error in the Run Results view indicates the zip code is not valid according to the CAM template. The template is looking for a floating point number with 2 decimal places whereas zip codes in the US, of course, are 5-digit or 9-digit integers. The CAM template rule for the zip code came from the XSD specification, which states simply that a zip code is a decimal. You can see this in Listing 1: Look for the zip field within the USAddress complex type. The CAM template generator could do only as good a job as its input allowed (a mild case of GIGO). While you might disagree, I submit that the XSD specification is too forgiving; the datatype should have been an integer rather than a decimal. The next section discusses how to correct this error in the CAM Editor.

As you follow along with the examples in this article or explore on your own, you might run into a template that is not behaving as expected. There are a couple of things to check in that event.

  • Invoke the Tools > Validate CAM Template menu item to look for any issues from the editor’s perspective.
  • If you press Finish in the Run JCam dialog box and nothing seems to happen, press Cancel to close the dialog, and then take a look at the Console view for any error messages. If, for example, you neglected to specify an XML file to validate, the dialog does not disable the Finish button but rather lets you press it, and reports the misleading error “template is null” in the Console view. (Other conditions may cause that error as well.) If the Console view is not visible, nothing appears to happen.

Creating Business Rules

In the Structure view select the zip element under the shipTo element. The rules attached to this element appear in the ItemRules view. In this case, there is only a single rule, using the setNumberMask predicate. Open the context menu for this rule by right-clicking on the rule in the category column, and then selecting Edit Rule. The Edit Constraint Rule dialog box opens (see Figure 7).

 
Figure 7. Editing a Constraint Rule: To fix the setNumberMask predicate attached to the //shipTo/zip elements, select the element in the Structure view, open its context menu, and select Edit Rule to open the Edit Constraint Rule dialog. Click the Number Mask field for help in specifying the mask.

Click on the number mask field, which opens another dialog to edit the mask. For now, just modify the field from ######.## to #####; that is, replace the original mask with just five octothorpes. Close both dialogs. In the main editor window you’ll see the updated rule. Re-execute the validation. The //shipTo/zip error should be gone, leaving only an error on //billTo/zip. This is clearly the same error, so you can fix it the same way. But because the //billTo/zip value should always behave identically to the //shipTo/zip value, it would be much cleaner to have a common rule for both rather than separate rules. The Common Rules section in Part II of this article discusses how to do this in more detail.

After updating the rule you also need to update the placeholder (item 1 in Figure 7). If you compare that to Figure 6, you can see that the value changed from %54321.00% to %54321%, which is more representative of a zip code. In this particular example, where the element’s placeholder and the associated rule are closely related, it is reasonable to suppose that they should automatically track each other in some fashion. But in many cases the relationship is not nearly as straightforward. Elements and rules have a many-to-many relationship: You could have multiple rules applied to a single element or a single rule applied to multiple elements.

To update the element’s placeholder as in Figure 7, open the context menu on the //shipTo/zip field in the Structure view and select Edit Text. In the dialog change %54321.00% to %54321%.

The placeholder serves a dual role. The CAM processor uses it solely to determine if an element’s content is fixed or not, determined by the presence of the percent signs surrounding the placeholder. (Notice that you re-ran the validation and the //shipTo/zip field validated before updating the element’s placeholder, confirming that the value between the percent signs is ignored by the CAM processor.)

The value between the percent signs is for human consumption, and should accurately and concisely convey what the element contains. Often the context has already done most of the work for you: the element name is “zip”, which is immediately recognized in the US as a string containing 5, 9, or 10 digits. By setting the placeholder to %54321% you are telling consumers of the template that you want only five-digit zip codes.
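The processor's side of that dual role can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the JCam implementation: it assumes that template content without surrounding percent signs is fixed and must match the instance exactly, while a placeholder accepts any value at the structure level and leaves the real checking to the business rules. Both function names are hypothetical.

```python
def is_placeholder(text: str) -> bool:
    """True if the template text is wrapped in percent signs."""
    return len(text) >= 2 and text.startswith("%") and text.endswith("%")


def structure_allows(template_text: str, instance_text: str) -> bool:
    """Structure-level check only; business rules are applied separately.

    Assumption: content without percent signs is fixed and must match
    exactly, while a placeholder accepts any value at this stage.
    """
    if is_placeholder(template_text):
        return True  # the description between the % signs is ignored
    return template_text == instance_text
```

Under this sketch %54321.00% and %54321% behave identically, which is consistent with the zip field validating before the placeholder text was updated.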

Stress-Testing the Validation

Now you have updated the placeholder and the rule. But are these two changes sufficient to properly validate a five-digit zip code? To check this you need to feed different test cases to the CAM processor. The simplest way is to open the XML view containing the data that you are validating, change the //shipTo/zip value, and re-validate. You edit nodes in the XML view just as in the Structure view: open the context menu and select Edit Text. Determine the smallest set of values that yield good coverage of all possible values (that is, determine appropriate equivalence classes of data) and feed each one to the validator. Table 3 provides one such list. There are two result columns because, as you may have surmised, what you have done so far does not properly validate values in the zip field. The two items marked in red in the second column (the values 1 and -12345) produced an incorrect result. In this case, both passed validation when they should have both failed.

Table 3. Zip Code Test Cases: This table shows results of several equivalence classes of values using the numeric mask ##### compared to using the string mask 00000. Results marked in green are correct; red results are incorrect.
//shipTo/zip | setNumberMask(#####) | setStringMask(00000)
90952 | Pass | Pass
90952.1 | Fail | Fail
123456 | Fail | Fail
90952-1234 | Fail | Fail
1 | Pass | Fail
(blank entry) | Fail | Fail
90952a | Fail | Fail
-12345 | Pass | Fail
(123) | Fail | Fail

These two tests passed for the same reason: The mask is numeric, and both tests are valid numbers. So you need to back up a step. Even though a zip code contains only digits, it is really a string masquerading as a number. While numerically, 00001 and 1 are the same, in the domain of zip codes, 00001 represents a valid zip code, while 1 does not. Therefore, instead of setting a numeric mask use a textual mask. Open the Edit Constraint Rule dialog for //shipTo/zip and change the action from setNumberMask to setStringMask. Click on the String Mask field to open the mask editor. Type five zeroes or press the “Digit [0-9]” button five times, then exit both dialogs. If you now re-validate each test case in Table 3, you’ll find that they all produce correct results, as shown in column three.
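The difference between the two columns of Table 3 is easy to reproduce outside the editor. The Python sketch below uses two regular expressions as loose stand-ins for the two rules; they were chosen only because they give the same pass/fail results as Table 3, and they are not the CAM mask algorithm. The names are hypothetical.

```python
import re

# Loose stand-ins for the two rules, chosen to reproduce Table 3.
NUMBER_MASK_5 = re.compile(r"-?[0-9]{1,5}")  # setNumberMask(#####)
STRING_MASK_5 = re.compile(r"[0-9]{5}")      # setStringMask(00000)


def passes_number_mask(value: str) -> bool:
    """Numeric view: any integer of up to five digits, sign allowed."""
    return NUMBER_MASK_5.fullmatch(value) is not None


def passes_string_mask(value: str) -> bool:
    """String view: exactly five digit characters, nothing else."""
    return STRING_MASK_5.fullmatch(value) is not None


# The equivalence classes from Table 3 ("" is the blank entry).
TEST_VALUES = ["90952", "90952.1", "123456", "90952-1234", "1",
               "", "90952a", "-12345", "(123)"]
```

Running both checks over TEST_VALUES shows the numeric stand-in accepting 1 and -12345, while the string stand-in accepts only 90952 (and would accept 00001).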

Changing the rule from checking for numbers to checking for strings let the processor fail the negative value, and changing the mask character from “#” (indicating a digit where leading zeroes may be absent) to “0” (indicating a digit where leading zeroes are required) allowed the processor to fail the 1 value. The value would pass if you changed it to 00001. The list of valid mask characters is documented in the formal CAM specification under section 3.4.3: CAM Content Mask Syntax. Table 4 is an adaptation from that section, with the text revised for clarity.

Table 4. Mask Characters: When a rule action requires a mask, these characters have special meaning.
Character Description
String Masks 
X Any character; mandatory
A Mandatory alphanumeric character or space
a Optional alphanumeric character or space
? Any single character
* Zero or more characters
U A character to be converted to upper case
^ Uppercase; optional
L A character to be converted to lower case
_ Lowercase; optional
0 A digit; trailing and leading zeros displayed; leading minus sign permitted
# A digit; trailing and leading zeros suppressed; leading minus sign permitted
‘ ‘ Single quotes escape a character block to denote mandatory character/s
Number Masks 
0 A digit; trailing and leading zeros displayed; leading minus sign permitted
# A digit; trailing and leading zeros suppressed; leading minus sign permitted
. Literal decimal point
J As the first character of a mask, invokes alternate Java formatting methods to handle mask processing (the literal J is ignored when passed to Java)
Date Masks 
DD Day number in a month
DDD Day number in a year
DDDD Relative day number(?) in a month
MM Month number in a year
MMM… Month name, e.g. January (field is padded or truncated to the number of M’s, 3-10 permitted)
YY Two-digit year
YYYY Four-digit year
W Day number in a week
WWW… Day name (field is padded or truncated to the number of W’s, 3-10 permitted)
/ Literal virgule; a date separator
- Literal hyphen; alternate date separator

If you are looking for a tool set with full, clear documentation, and one that has had virtually all the bugs ironed out, you must regrettably look elsewhere. But if you do not mind a few rough edges on a gem of great value, I believe you will find CAM to be a great tool for your arsenal. Finally, given the zeal of the developers, it is quite possible that the behavior of the latest version of the CAM editor and the CAM engine may vary from what I describe here, using version 1.6.2.

This concludes Part I, but you have seen only a glimpse of how intuitive and easy it is to design with CAM. In the next part of this article you’ll see much more of CAM’s expressive power. Additionally, you’ll see much more in-depth discussion of practical techniques for developing templates and rules including: leveraging common structure and common rules; conditionalizing validation based on either internal or external factors; detailed comparison to XSD regarding datatypes, compositors, and cardinality; and finally, some pitfalls to avoid.
