Taking XML Validation to the Next Level: XSD Schema vs. CAM

In the first article in this series, you saw the rudiments of template-building and validation with CAM. This part of the series presents a much more detailed comparison between XML Schema and CAM.

Schema Validation

Because XML Schema is well-established, the tools for designing schema are considerably richer. CAM does not currently, for example, offer any graphical tools, whereas XML Schema has a wide variety. Probably the most well-known, XmlSpy, is an outstanding platform for XML development; however this article uses Liquid XML Studio for the examples and figures, because it’s available in a free community edition. Figure 1 shows a graphical representation of the purchase order schema (see Listing 1) rendered by Liquid XML Studio. Note that it’s far easier to grasp the structure of the schema from the graphical rendering than by reading the XSD file!

Figure 1. Purchase Order Schema: This graphical rendering provides a much more intuitive view of the schema file than viewing the file contents directly.

Note that both //shipTo/zip and //billTo/zip are specified as decimal types, which (as you saw in the CAM discussion in Part 1) is not adequate to ensure a proper zip code. The following example shows how to fix the schema and run the series of test cases to verify the fix.

Figure 2. Validating an XML File in Liquid XML Studio: To validate the XML file press the validate button (1). Press the schema selector (2). Press the add button (3) in the document-to-schema mappings. Press the add button (4) in the entry form to select your XSD file. Provide a name (5), close out of the dialogs, and your schema will be visible at (6).

Open the po.xml file in Liquid XML Studio (refer to Figure 2). Press F7 or the on-screen validate button (item 1 in Figure 2) to validate the file.

Author’s Note: Remember that you must first correct an error in the XML file: the closing bracket fix described in Part 1.

With the corrected file loaded, the application should report only that the document is well-formed. It does not validate against the schema, because you need to tell it to do so (item 6 in the figure is blank). Open the schema browser by pressing the schema selector button (item 2 in the figure). This opens the “XML Document to XML Schema mappings” window. Press the Add button (item 3) to open the Edit Schema Library Entry window where you will identify your schema file. You need to provide the file path and to give a name to the schema. Press the Add button (item 4) to open a standard file chooser and find the po.xsd file. That fills in the file path. Type a descriptive name of your choice (item 5), and then close the dialog boxes to return to the main window. The schema you selected should now appear in the properties pane (item 6). If not, re-open the schema browser and make sure the new schema has a check next to it.

Next, just as with the CAM template in Part 1, work through the test cases to see how well this schema can validate a zip code. With the zip defined as a simple decimal, most of the test cases fail. Changing to an integer instead of a decimal improves the results a bit, but not much. To fully fix the schema you need to mirror the change made on the CAM side, which is to treat the zip code as a 5-character string containing all digits. The simplest way is to change the base data type to a string and set the pattern facet to “\d{5}”, a regular expression indicating precisely five digits between 0 and 9. To do this in Liquid XML Studio, open the XSD file, select the zip element in the graphical view (see the arrow in Figure 3), and then adjust the type and the pattern in the properties pane as indicated in the figure.

Figure 3. Correcting the Zip Element: To properly validate a 5-digit zip code with XML Schema, change the datatype from decimal to string and use the pattern facet to restrict the string to 5 digits.

Making those adjustments automatically generates the schema code highlighted in the bottom pane, casting the zip as a subtype of string constrained by the specified facet. After this change is in place all the zip code tests should produce the correct results, giving you an XML Schema solution equivalent to the CAM solution from Part 1.
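
To see why the string-plus-pattern approach succeeds where a numeric type falls short, here is a minimal Python sketch. It is not part of Liquid XML Studio or of the downloadable code: the function names are made up for illustration, and Python's Decimal is only a rough stand-in for the schema's decimal check. The sample values are likewise illustrative rather than the test cases from Part 1.

```python
import re
from decimal import Decimal, InvalidOperation

# Same expression as the pattern facet; re.ASCII limits \d to the digits 0-9.
ZIP_PATTERN = re.compile(r"\d{5}", re.ASCII)

def is_valid_zip(value):
    """A pattern facet must match the whole value, hence fullmatch."""
    return ZIP_PATTERN.fullmatch(value) is not None

def is_decimal(value):
    """Rough stand-in (an assumption) for the original decimal datatype check."""
    try:
        Decimal(value)
        return True
    except InvalidOperation:
        return False

# Compare the two checks on a few illustrative values.
for candidate in ["12345", "01234", "1234", "123456", "12.5", "1234a"]:
    print(candidate, "decimal:", is_decimal(candidate), "zip:", is_valid_zip(candidate))
```

The decimal check accepts values such as 1234 and 12.5 that are not zip codes at all, while the five-digit pattern rejects them and still accepts a zip code with a leading zero.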

Mapping XSD to CAM

You’ve already seen how an XSD decimal datatype maps to CAM, but to migrate existing schemas you should understand how the other XSD datatypes get mapped. Figure 4 shows a contrived schema that simply contains representative samples of many common XSD datatypes. These are grouped into logical chunks, separating numbers, strings, and date/time elements. You can see a full list and relationships of all datatypes in the built-in datatypes section of the official W3C Schema specification.

Figure 4. The DataTypeSamples Schema: This schema sandbox represents many of the XML Schema datatypes.

In the CAM editor, create a new CAM template from the DataTypeSamples schema (you can find these in the downloadable code in the DataTypeSamples folder, DataTypeSamples.xsd and DataTypeSamples.cam). To recap, the Structure view shows the placeholders generated for the datatypes while the Rules view shows the validation semantics (see Figure 5).

Figure 5. The DataTypeSamples CAM Template: Converting the assorted datatypes from XML Schema to CAM yields this display in the CAM editor.

Table 1 reorganizes the information from the Structure and Rules views to be more digestible. The Category and Data Type columns list each datatype from the original schema (see Figure 4) along with its category. The Placeholder column shows how this maps to the structure portion of the CAM template, and the final two columns show the business context. As an example in interpreting this table, consider the decimalElement (row 3) that uses the decimal datatype you saw in the earlier zip code examples. The placeholder for a decimal datatype is %54321.00%. The business rule (######.##) indicates that it must contain only digits with an optional leading minus sign for the integer portion of the number, followed by a decimal point, followed by precisely two digits in the fractional portion of the number. While zero and the octothorp (#) are seemingly the most straightforward mask elements, there are some subtleties involved in knowing how they restrict your values, as you’ll see next.

Table 1. Datatype Mapping: CAM generates the placeholder, condition, and business rule shown for each XML Schema data type listed. Depending on the data type, CAM generates one rule or more than one rule, and may apply a rule with a mask or a rule with a datatype, as discussed in the text.
#   Category  Data Type                Placeholder           Condition              Business Rule
1   -         booleanElement           %false%               -                      restrictValues('true'|'false')
2   number    byteElement              %type = byte%         -                      datatype(byte)
3   number    decimalElement           %54321.00%            -                      setNumberMask(######.##)
4   number    doubleElement            %type = double%       -                      datatype(double)
5   number    floatElement             %54321.00%            -                      setNumberMask(######.####)
6   number    intElement               %12345%               -                      setNumberMask(######)
7   number    longElement              %type = long%         -                      datatype(long)
8   number    negativeIntegerElement   %type =               -                      datatype(negativeInteger)
9   number    shortElement             %type = short%        -                      datatype(short)
10  number    unsignedIntElement       %type = unsignedInt%  -                      datatype(unsignedInt)
11  string    stringElement            %string%              -                      -
12  string    tokenElement             %Token%               -                      datatype(token)
13  string    normalizedStringElement  %string%              -                      datatype(normalizedString)
14  datetime  dateElement              %YYYY-MM-DDZ%         string-length(.) < 11  setDateMask(YYYY-MM-DD)
15  datetime  dateElement              %YYYY-MM-DDZ%         string-length(.) > 10  setDateMask(YYYY-MM-DDZ)
16  datetime  dateTimeElement          %YYYY-MM-DD           string-length(.) < 26  setDateMask(YYYY-MM-DD'T'HH:MI:SSZ)
17  datetime  dateTimeElement          %YYYY-MM-DD           string-length(.) > 25  setDateMask(YYYY-MM-DD'T'HH:MI:SS.SZ)
18  datetime  timeElement              %HH:MI:SS.SZ%         string-length(.) < 13  setDateMask(HH:MI:SS.SSS)
19  datetime  timeElement              %HH:MI:SS.SZ%         string-length(.) > 12  setDateMask(HH:MI:SS.SSSZ)
20  datetime  durationElement          %P1%                  -                      restrictValues('P1'|'Y2'|'M3'|'DT1'|'H1'|'0M'|'0S')

For now, there are several important points to glean from Table 1:

  • Generated rules are illustrative, not normative. That is, there are usually several different rules you may define to achieve approximately the same thing. The CAM processor in this implementation has made certain choices, but you should consider them guidelines rather than gospel. For example, some datatypes have associated rules with masks (e.g., floatElement) while others have associated rules with the datatype predicate (e.g., doubleElement). (The datatype predicate literally indicates that the value must match that datatype as Java understands it, because CAMed is a Java application.) This CAM processor chose to create its rules in that fashion, but you could do the reverse if you wish: use something like setNumberMask(######.############) for doubleElement or datatype(float) for floatElement.
Author’s Note: My recommendation would be to use datatype() for both floats and doubles unless you want to restrict the fractional portion of the number to a specific number of digits (the next section discusses this point further).
  • A single rule may specify multiple possible element values. Use the restrictValues predicate, as shown for the booleanElement (row 1), or the durationElement (row 20), to specify one or more values. The booleanElement, for example, must contain either true or false to pass validation. Note though, that all possible values are constants and must be specified in advance.
  • A single element may have multiple associated rules. This allows you to specify multiple formats as opposed to just multiple values. The dateElement, dateTimeElement, and timeElements all show examples of these. Multiple rules may conflict with each other as long as their conditions are mutually exclusive. For the dateElement, the first rule (row 14) does not allow a final Z while the second rule (row 15) requires it. But that works fine because of the condition attached to each. Note that any date in the format YYYY-MM-DD has exactly 10 characters; adding the final Z makes it 11 characters. The conditions derive directly from these observations: when the dateElement has fewer than 11 characters, the date must be in the YYYY-MM-DD format; when it exceeds 10 characters, it must be in the YYYY-MM-DDZ format.
  • Reprise: Generated rules are illustrative, not normative. This important point bears repeating with respect to the timeElement. This element has only two rules associated with it but you should consider that as just a starting point. For example, here’s an expanded list of possible time formats that would allow your template to accept a more flexible range of time values:
       (a) HH:MI:SS.SSSZ   (b) HH:MI:SS.SSS   (c) HH:MI:SS.SSZ   (d) HH:MI:SS.SS   (e) HH:MI:SS.SZ   (f) HH:MI:SS.S   (g) HH:MI:SSZ

Adding Additional Rules

To add more rules, you need to identify a condition for each one. For convenience, here’s a copy of the list of time formats shown earlier:

   (a) HH:MI:SS.SSSZ   (b) HH:MI:SS.SSS   (c) HH:MI:SS.SSZ   (d) HH:MI:SS.SS   (e) HH:MI:SS.SZ   (f) HH:MI:SS.S   (g) HH:MI:SSZ

The two rules generated by default (items a and b) use the string-length function to check for the presence or absence of the final “Z”, just as with dateElement. But because some formats in the list are the same length, you can’t rely solely on string length; you must introduce more complicated clauses and additional functions. For example, items (b) and (c) are both the same length, so you need to change the condition on the single rule for (b) from this:

   string-length(.) < 13

to this:

   string-length(.) < 13 and not(ends-with(., 'Z'))

Then you can add this rule for (c):

   string-length(.) < 13 and ends-with(., 'Z')

Figure 6. Adding a Rule to the timeElement: Open the context menu (1) on the timeElement, and select Add New Rule. You’ll see the Add New Constraint wizard (2). Set the action to setDateMask, and then click the Date Mask field to open the date mask editor (3). On the main wizard, change “Conditional?” from “No” to “Yes” to expose the Condition field. Click on the Condition field to open the condition editor (4).

To define this new rule, access the Add New Constraint wizard by opening the context menu on the timeElement in the Structure view (not the XML view) and selecting Add New Rule. The wizard lets you define both the mask and the Boolean conditions (see Figure 6). In the top part of the wizard you specify the target nodes using XPath and the action to apply to those nodes. In the bottom portion, change "Conditional?" from "No" to "Yes." When you do that, the wizard exposes additional fields. Clicking on the condition field opens another wizard to specify the condition, also an XPath expression.

Author's Note: For convenience, the wizard includes a list of XPath functions in a drop-down selector but, at the time of this writing, the list is incomplete; however, you can simply type the condition manually.
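
Before writing conditions for all seven time formats, it helps to confirm that string length plus the presence of a trailing "Z" is enough to tell them apart. The following Python sketch does only that. It assumes each mask character stands for exactly one character of the value (consistent with the 12-character HH:MI:SS.SSS format and the "< 13" condition above), and the helper names are invented for illustration; none of this is CAM code.

```python
# The seven time formats (a) through (g) listed above.
TIME_MASKS = [
    "HH:MI:SS.SSSZ", "HH:MI:SS.SSS", "HH:MI:SS.SSZ", "HH:MI:SS.SS",
    "HH:MI:SS.SZ", "HH:MI:SS.S", "HH:MI:SSZ",
]

def condition_key(text):
    """Return the (length, ends with Z) pair that a rule condition would test."""
    return (len(text), text.endswith("Z"))

# If two masks shared a key, the dictionary would end up with fewer than seven entries.
MASK_BY_KEY = {condition_key(mask): mask for mask in TIME_MASKS}

def select_mask(value):
    """Pick the mask whose condition the value satisfies, or None if no rule applies."""
    return MASK_BY_KEY.get(condition_key(value))

for sample in ["12:30:45.123Z", "12:30:45.12Z", "12:30:45.123", "12:30:45Z", "12:30"]:
    print(sample, "->", select_mask(sample))
```

Because all seven keys are distinct, each format can get a condition built from string-length and ends-with alone. A value that matches no key (such as the last sample) selects no mask here, which is exactly the situation the rule conditions in a real template must not leave open.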

Here's some perspective on placeholders:

  • Placeholders may be misleading. A more positive spin on this statement is that a generated placeholder is just a starting point for describing the content of an element. Like generated rules, placeholders are emphatically not normative. For example, the intElement uses the placeholder 12345, implying a five-digit whole number. But an intElement may contain one, five, or nine digits. And it may be negative. The 12345 is intended to suggest that the value is simply a whole number, positive or negative, with any number of digits. Similarly, the floatElement, with a placeholder of 54321.00, seems to indicate a number with five integer digits on the left and two fractional digits. So if an element has three decimal places, should that fail validation as a floatElement? Remember that the rule determines what is acceptable for an element; the placeholder is but a human-readable mnemonic. Therefore, you can think of the 54321.00 placeholder as an element containing a whole number component and a fractional component, with no specific claims about the magnitude of either. Depending on your preferences, you may consider this perspective misleading. A more general approach would be to use a placeholder such as doubleElement uses, "type=double" (or perhaps just "double").
  • Elements may have multiple rules, but only one placeholder. Refer again to the timeElement in Figure 5. It appears only once in the Structure view, but if you click on that node you will see two corresponding rules in the ItemRules view (rows 18 and 19 of Table 1). Curiously, the generated placeholder (HH:MI:SS.SZ) does not match either rule. With just those two rules you might specify a placeholder using regular expression notation, e.g. HH:MI:SS.SS(Z|S), to correspond to the two rules. If, however, you intend to implement the seven rules for timeElement shown earlier (or even more), a regular expression covering all of them would be unwieldy. In that case, you might opt for the more generic approach of just saying "time-value" or something similar.
  • An element's rules must cover its universe of discourse. Referring to the two rules for dateElement (rows 14 and 15 in Table 1), you might wonder why they indicate less than 11 and greater than 10 instead of equal to 10 and equal to 11. What happens to a value that is 12 characters? With the current rules, this value would trigger the date mask in row 15 of the table and the value would fail validation, as it should. If instead you were looking only for exactly 10 or 11 characters, a 12-character value would not trigger either of those rules, so the input document would not fail validation.
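
The coverage point in the last bullet can be sketched in a few lines of Python. This is an illustration only, not how the CAM processor is implemented: the regular expressions are assumed stand-ins that check just the shape of the two date masks, and the rule lists and the validate function are hypothetical names.

```python
import re

# Shape-only stand-ins (an assumption) for the masks YYYY-MM-DD and YYYY-MM-DDZ.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
DATE_Z = re.compile(r"\d{4}-\d{2}-\d{2}Z")

# Conditions as generated (rows 14 and 15 of Table 1): every length is covered.
RANGE_RULES = [(lambda v: len(v) < 11, DATE), (lambda v: len(v) > 10, DATE_Z)]

# Tempting alternative: exact lengths only, which leaves other lengths uncovered.
EXACT_RULES = [(lambda v: len(v) == 10, DATE), (lambda v: len(v) == 11, DATE_Z)]

def validate(value, rules):
    """Apply every rule whose condition holds; a value that triggers no rule passes."""
    return all(mask.fullmatch(value) is not None
               for condition, mask in rules if condition(value))

for sample in ["2009-05-12", "2009-05-12Z", "2009-05-12ZZ"]:
    print(sample, "range:", validate(sample, RANGE_RULES),
          "exact:", validate(sample, EXACT_RULES))
```

With the range conditions, the 12-character value triggers the second rule and fails its mask. With the exact-length conditions it triggers nothing and slips through as valid.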

Numeric Mask Subtleties

To further examine numeric masks applied with the setNumberMask predicate, take a look at the two provided sample files (DataTypeSamples/NumberMaskSamples.cam and DataTypeSamples/NumberMaskSamples.xml). The template is quite small, consisting of just two elements under the root. Here's the structure:

               %54321.00%       %54321.00%        

And here are the rules:


The preceding rules apply a mask with octothorps to one of the two elements and a mask with zeros to the other.

The NumberMaskSamples.xml file is nothing more than 17 separate test cases for the first element and the same test cases repeated for the second. Here's a portion of the XML instance:

        1.42     12.42     123.42     1234.42     . . .     1.42     12.42     123.42     1234.42     . . .   

These illustrate some interesting points about numeric masks. Table 2 shows the validation results for each test case value using both types of masks. You may run the validation yourself with the files provided. The XML file contains each number in the Value column inserted into both an octothorp-masked element and a zero-masked element. The main difference between octothorp and zero as mask elements is that the former allows zero suppression while the latter does not. Using octothorps, the values in rows 1-3 are valid, while only row 3 is valid with the zero mask because the integer portion has three digits, matching the mask.

Row 4 is curious because although the integer portion of the mask contains only three octothorps, values with four integer digits are still considered valid. That is, the octothorp mask makes no restriction on the number of digits to the left of the decimal point. For the fractional portion, however, it is a bit less forgiving. There are two octothorps in the mask. Values with fewer than two decimal digits validate (row 16) but values with more than two do not (row 14).

The specification does not mention that negative numbers are permitted with either mask character, but both mask characters do allow a leading minus sign. As you've seen, the zero mask is quite particular about digit count to the left of the decimal point. But does the minus sign count as a digit? This seems to be a spot where CAM cannot quite make up its mind, because the values in rows 6 and 7 are both valid.

Table 2. Numeric Mask Differences: Running the same set of tests with a mask using zero suppression (###.##) and a mask requiring all digits (000.00) shows the differences in their behavior. The shaded cells indicate apparent inconsistency in the mask behavior, as described in the comments.
#   Value     Mask ###.##  Mask 000.00  Comments
1   1.42      Pass         Fail         Leading spaces are OK with "#" but not with "0".
2   12.42     Pass         Fail         "#" allows fewer than the number of places in the mask to the left of the decimal by definition (due to zero suppression).
3   123.42    Pass         Pass         Digit count matches mask.
4   1234.42   Pass         Fail         "#" allows more than the number of places in the mask to the left of the decimal.
5   -1.42     Pass         Fail         Minus sign allowed.
6   -12.42    Pass         Pass         The minus sign should either be counted as a digit...
7   -123.42   Pass         Pass         ...or not counted as a digit, but not both!
8   -1234.42  Pass         Fail         -
9   - 2.23    Pass         Fail         Leading spaces are OK with "#".
10  - 3.23    Pass         Fail         -
11  - 4.23    Pass         Fail         -
12  -003.23   Pass         Pass         Leading zeroes are optional with "#" but required with "0".
13  0.23      Pass         Fail         -
14  0.234     Fail         Fail         "#" does not allow more than two digits to the right.
15  .23       Pass         Fail         -
16  .2        Pass         Fail         "#" does allow fewer than two digits to the right by definition (due to zero suppression).
17  .         Pass         Fail         -
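
One way to summarize the behavior recorded in Table 2 is to write down regular expressions that reproduce it. The Python sketch below does that for the values exactly as they are printed in the table. The two expressions are approximations reverse-engineered from the observed results, not CAM's actual mask implementation, and the names are invented for illustration.

```python
import re

# Approximation of the observed "###.##" behavior: optional minus, optional spaces,
# any number of integer digits, a decimal point, and at most two fractional digits.
HASH_MASK = re.compile(r"-?\s*\d*\.\d{0,2}")

# Approximation of the observed "000.00" behavior: three integer digits with an
# optional minus, OR a minus plus two digits (the row 6 quirk), then two decimals.
ZERO_MASK = re.compile(r"(?:-?\d{3}|-\d{2})\.\d{2}")

# (value as printed, passes ###.##, passes 000.00), copied from Table 2.
TABLE_2 = [
    ("1.42", True, False), ("12.42", True, False), ("123.42", True, True),
    ("1234.42", True, False), ("-1.42", True, False), ("-12.42", True, True),
    ("-123.42", True, True), ("-1234.42", True, False), ("- 2.23", True, False),
    ("- 3.23", True, False), ("- 4.23", True, False), ("-003.23", True, True),
    ("0.23", True, False), ("0.234", False, False), (".23", True, False),
    (".2", True, False), (".", True, False),
]

def check(value):
    """Return (passes hash approximation, passes zero approximation)."""
    return (HASH_MASK.fullmatch(value) is not None,
            ZERO_MASK.fullmatch(value) is not None)

# Rows where the approximations disagree with the recorded results.
mismatches = [value for value, hash_ok, zero_ok in TABLE_2
              if check(value) != (hash_ok, zero_ok)]
print("rows that disagree with Table 2:", mismatches)
```

Note that the zero-mask expression needs a special alternative just to accept -12.42, which is another way of seeing the inconsistency between rows 6 and 7.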

Common Rules

The purchase order example discussed earlier included both a billing address and a shipping address. Each has precisely the same set of child elements and each of those children has the same validation rules. The default template generation duplicates both the elements and the rules. Good code design, however, dictates that duplication be removed to avoid future maintenance issues. This section discusses removing duplicate rules and the following section discusses removing duplicate structure.

Earlier under Business Rules you changed the billing zip code rule from setNumberMask(######.##) to setStringMask(00000) to properly validate a 5-digit zip code. Now you can add a broader rule to encompass the shipping zip code as well.

First, open the context menu for the //shipTo/zip node in the Structure view and select Add New Rule (analogous to Figure 6). In the Rule XPath set of check boxes, uncheck the "Parent" box that is checked by default. This changes the XPath just above it from //shipTo/zip to //zip, which will match any zip node in the document. Set the action to setStringMask('00000'). For real-world complexity, add a condition that lets this rule apply to five-digit zip codes in preparation for another new rule that will handle nine-digit zip codes. The top frame of Figure 7 shows both rules.

Author's Note: Actually, Figure 7 shows duplicates of both rules: one operating on the //shipTo/zip node and another on the //zip node. Although you won't need duplicate rules when this example is complete, it's worth showing them here to make a point: the rules pertaining to //zip apply to both shipping zip codes as well as billing zip codes.

Figure 7. Varying Rule Scope: The top frame shows rules specific to zip elements within shipTo elements (that do not apply to billTo elements) plus rules that apply to any zip elements. Note that the latter also appear when you move focus to the zip element under billTo, shown in the lower frame.

When you have //shipTo/zip selected in the Structure view, you'll see its rules in the ItemRules view. Here you see four: those applying to the specific node and those applying to any zip node. Now change focus from //shipTo/zip in the Structure view to //billTo/zip (bottom frame of Figure 7) and observe that only the two rules that apply to all zip nodes appear in the ItemRules view. The XPath selector you define determines the scope of your rule. When you are satisfied with your understanding here go ahead and delete the duplicate rules specific to the //shipTo/zip node, leaving just the two rules that apply to all zip nodes (see the file PurchaseOrder/purchaseOrder_with_generic_zip_rules.cam in the downloadable code).

One excellent yet subtle feature of the CAM editor deserves mention here: When you select a node in the Structure view you see all rules applicable to that node just as if they were attached specifically to that node. But when you look at the list of all rules in the Rules view (as in Figure 5) you find that each rule appears only once, no matter how many nodes it applies to in the structure.
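
The difference in scope between the two XPath selectors is easy to check outside the editor. Here is a small Python sketch using the standard library's ElementTree. The document is a cut-down, hypothetical purchase order (the root name and zip values are assumptions, not taken from po.xml), and ElementTree supports only a subset of XPath, so this mirrors the selection logic rather than the CAM processor itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down purchase order with one zip under each address.
PO = """<purchaseOrder>
  <shipTo><zip>90952</zip></shipTo>
  <billTo><zip>95819</zip></billTo>
</purchaseOrder>"""

root = ET.fromstring(PO)

def matches(xpath):
    """Return the text of every node selected by a //-style path.

    ElementTree requires a relative path, so "//zip" is passed as ".//zip".
    """
    return [node.text for node in root.findall("." + xpath)]

specific = matches("//shipTo/zip")  # rule scoped to the shipping zip only
generic = matches("//zip")          # rule scoped to any zip in the document

print("//shipTo/zip selects:", specific)
print("//zip selects:", generic)
```

The narrower path selects only the shipping zip, while //zip selects both, which is why a single rule on //zip can replace the per-address duplicates.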

Now that you have learned how to remove duplicate rules, the next section discusses how to remove duplicate structural pieces or element sub-trees.

Common Elements

The purchase order example includes both a shipping address and a billing address. Each includes typical child elements: name, street, city, state, and zip. More to the point, both include precisely the same set of child elements, meaning these elements are candidates for removing duplication. The editor makes this easy to do in just two steps:

  1. Convert the children of one address node to an include file.
  2. Reference the same include file for the other address node.
Author's Note: The specific steps are illustrated in Figure 8 and described next, but the discussion describes what the editor should do rather than what it does do: these particular actions do not quite operate correctly in the current version of the editor. After you understand the simple steps, you'll be able to fix the code manually until the editor defect is fixed. Given the responsiveness of the development team, it may already be fixed by the time you read this.

Figure 8. Eliminating Structure Duplication: Identify the nodes that have identical subtrees. Convert one node's children into an included subtree using the context menu (1). For all others, replace the children with a reference to the same included subtree (2). The final result (3) looks just like the original except that the icons have changed color.

Here's an explanation of the steps shown in Figure 8:

  1. Select the shipTo node in the Structure view and open the context menu. Under the Include choice, select "Make Element Children an Include." The editor prompts you for a file name because it stores XML fragments in a separate include file. Next, the editor shows (middle frame) an annotation on the shipTo element indicating it is now an include file. More subtly, the color of the child element icons has changed from blue to magenta.
  2. Now select the billTo node, open its context menu, and under the Include choice, select "Replace Children with an Include," referencing the include file that you created in step 1.
  3. When the change is successful you'll see the child element icons change from blue to magenta.

The changes are essentially transparent within the editor; you work with the child nodes, applying rules, etc., just as if they were "real" children rather than references to an included file. If you look at the XML source of the CAM template (via the View menu or using an external editor) you will find that the child elements are gone, replaced by a single include element. You can find this intermediate template file in the downloadable code as PurchaseOrder/purchaseOrder_with_includes.cam.

The original CAM template included this code:

        %string%     %string%     %string%     %string%     %54321%           %string%     %string%     %string%     %string%     %54321.00%   

The new code, assuming an include file named po_address_include.xml, looks like this:

               po_address_include.xml                       po_address_include.xml        

The above elements use a file path relative to the location of the CAM template, so the po_address_include.xml file must be in the same directory as the CAM template itself. Alternatively, you could use an absolute file path.

The actual po_address_include.xml included file now contains this XML:

        %string%     %string%     %string%     %string%     %54321%   

Notice that the root element is shipTo, because that is the element from which you generated the include file. But remember that you converted the children of shipTo to an include file, not the shipTo element itself. Therefore, the name of the root element here is immaterial. Indeed, you have already proved that by replacing the children of the billTo element with this same include file. This is further affirmed by the presence of the ignoreRoot attribute in the include elements shown above. My suggestion, then, is to change the root in this include file to something more meaningful, as shown below:

%string% %string% %string% %string% %54321%

You have used the two actions on the Include menu that affect the child elements of a given element. There are another two actions that affect the selected element itself. If, for example, you had chosen "Make Element an Include" instead of "Make Element Children an Include," the code in the CAM template would have been:


In this case, note that the ignoreRoot attribute is absent; its default value is "no." Because this code does use the root element, you cannot rename it or delete its attributes; you would need to use the original version above if you wished to include it in this fashion.
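
The effect of ignoreRoot can be pictured with a short Python sketch. This is only a model of the behavior described above, not the CAM processor's code: the apply_include function, the root name "address", and the in-memory fragment are all hypothetical, chosen to mirror the five address children.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical include fragment; "address" is an invented root name.
INCLUDE_FRAGMENT = """<address>
  <name>%string%</name>
  <street>%string%</street>
  <city>%string%</city>
  <state>%string%</state>
  <zip>%54321%</zip>
</address>"""

def apply_include(parent, fragment_root, ignore_root):
    """Graft an include fragment under parent.

    With ignore_root, only the fragment's children are copied in, so the
    fragment's root name does not matter. Without it, the root itself is
    copied, name and attributes included.
    """
    if ignore_root:
        for child in fragment_root:
            parent.append(copy.deepcopy(child))
    else:
        parent.append(copy.deepcopy(fragment_root))

fragment = ET.fromstring(INCLUDE_FRAGMENT)

# Children-only includes: the same fragment serves both address elements.
ship_to = ET.Element("shipTo")
bill_to = ET.Element("billTo")
apply_include(ship_to, fragment, ignore_root=True)
apply_include(bill_to, fragment, ignore_root=True)

# Whole-element include: the fragment root lands in the template as-is.
purchase_order = ET.Element("purchaseOrder")
apply_include(purchase_order, fragment, ignore_root=False)

print([child.tag for child in ship_to])
print([child.tag for child in bill_to])
print([child.tag for child in purchase_order])
```

In the children-only case the root name never reaches the template, which is why renaming it is harmless; in the whole-element case the root name is what ends up in the structure, so it has to stay as it was.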

This use of an include file that requires the root element would seem to have little utility for removing duplicate code, because you would now need a separate file for shipTo and for billTo. It's true that this feature is not useful in this scenario, but it could be quite useful in other scenarios. For example, if you had a more complex structure that needed the same element, root included, in two places, you could leverage this capability.

Author's Note: Section 3.2.4, Imports, of the CAM specification, discusses how to use XPath to reference specific portions of an include file rather than the whole thing. However, this mechanism is not present in this implementation of the CAM processor. It would be quite handy, because you could then place all the XML fragments in a single include file. As it stands, each must be in its own file.

A Limitation With Mixed Content

One last important point to note is that CAM excels in structured XML processing but it has little support for mixed content. For example, in XSD you could define this schema:


And that would validate this XML:

      Dear fred:   Your 232 has shipped on 5/12/09.   

CAM does not support validating this type of content.
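
For readers unfamiliar with the term, mixed content simply means character data interleaved with child elements. The Python sketch below makes that visible with the standard library's ElementTree. The tag names (letter, name, orderid, shipdate) are hypothetical, chosen only to wrap the sample text above; they are not taken from the article's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup around the sample letter text.
LETTER = ("<letter>Dear <name>fred</name>: Your <orderid>232</orderid> "
          "has shipped on <shipdate>5/12/09</shipdate>.</letter>")

root = ET.fromstring(LETTER)

# Walk the content in document order: text before the first child, then each
# child followed by the text that trails it.
pieces = [("text", root.text)]
for child in root:
    pieces.append((child.tag, child.text))
    pieces.append(("text", child.tail))

for kind, content in pieces:
    print(kind, repr(content))
```

The free-standing text runs between the child elements are the part that a purely structural template has no natural slot for.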

Now that you have a better sense of the differences between XSD Schema and CAM, the next part of this article delves more deeply into CAM itself.
