Taking Data Validation to a Dynamic Level

Although XML Schema was intended to provide a better validation and definition layer for XML than Document Type Definitions (DTDs), certain underlying assumptions were carried over unintentionally. One was the notion that attributes could be constrained to enumerations (lists), but that those enumerations needed to be specified within the schema itself. As XML becomes more distributed, and as data structures become more complex and dependent on not just static but dynamic definitions, developers are beginning to recognize the need for a schema language that itself supports such dynamic definitions.

Because ISO Schematron can now utilize XPath 2.0 (including functions such as doc() and unparsed-text()), Schematron may prove to be the critical “distributed validation” language for business content. I’ll discuss Schematron in more detail later, but many of the issues for which it seems remarkably well suited became obvious to me while doing some consultative programming for Nordstrom, the high-end clothing retailer.

I was developing a web-deployed method to validate, display, and edit invoices coming from Nordstrom’s vendors. At the time, the EDI standard was being phased out in favor of some form of XML (ebXML was still a year away from being finalized). Being of an XMLish sort of mind, I pulled out the tools available to me, including the newly minted XML Schema, and began modeling the invoices in XML Schema Definition (XSD). Most invoices mapped fairly cleanly. Things got more problematic when it came to validating that a given store ID was correct.

This problem may seem easy, perhaps even obvious, to solve: merely enumerate the list of stores in the schema elements, right? However, Nordstrom was in a period of consolidation at that point, and stores were being closed on a weekly basis. Perhaps the requirement could be dropped in favor of comparing against a regular expression? That wouldn’t work either, because one of the key reasons for validating was ensuring that the stores in question were legitimate and, moreover, that they hadn’t been closed. In the end, the developed solution was a kludge: post-process the XML against a list retrieved directly from a database call on the server itself.
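
To make the kludge concrete, here’s a minimal sketch of that kind of post-processing pass in Python (a hypothetical reconstruction: the stores table, its columns, and the storeID element are stand-ins, not the actual system):

import sqlite3
import xml.etree.ElementTree as ET

def validate_store_ids(invoice_xml, db_path="stores.db"):
    # The XSD has already checked structure; store membership has to be
    # checked here because the authoritative list lives in a database.
    conn = sqlite3.connect(db_path)
    open_stores = {row[0] for row in
                   conn.execute("SELECT store_id FROM stores WHERE closed = 0")}
    conn.close()

    errors = []
    for store in ET.fromstring(invoice_xml).iter("storeID"):
        if store.text not in open_stores:
            errors.append("Unknown or closed store: %s" % store.text)
    return errors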

The same issue reared its ugly head more recently while I was working with schema-generated XForms, prompting a lot of thought about the nature of lists and their relationship to models. At their most fundamental, all lists have two things in common. First, a list is a sequence of objects. Second, and more importantly, each object in the list has a key, unique within the list, that identifies it. In an indexed (linear) array, that key is a number (usually zero- or one-based) that identifies the item by position. That key is unique only within the context of the array, and only so long as the array itself doesn’t change in the sequence leading up to that position.

For instance, consider this simple list of colors:

colors = ['red', 'orange', 'yellow', 'green', 'blue', 'violet']

The list can be exposed as a sequential list with numeric keys, conceptually the same as:

colors = {0 : 'red', 1 : 'orange', 2 : 'yellow', 3 : 'green', 4 : 'blue', 5 : 'violet'}

Such a system implies that you can effectively access the given resource by its index (or similarly update the content by assigning it to the indexed entry):

print colors[0]
=> red
colors[0] = 'scarlet'
print colors[0]
=> scarlet

An associative array, on the other hand, associates a formal name to a given term in the list:

colors = {'rd': 'red', 'or': 'orange', 'ye': 'yellow', 'gn': 'green', 'bu': 'blue', 'vi': 'violet'}

You can then reference a given element from the array through the associated name:

print colors['rd']
=> red
colors['rd'] = 'scarlet'
print colors['rd']
=> scarlet

In this case, the set of all keys in the associative array effectively makes up a taxonomy. Note that it is the keys, not the contents, of the array that matter most here, especially because there is no specific prohibition on the contents being anything at all, up to and including other arrays.

Associative Arrays
If you take a look at a traditional HTML page, in general the select element provides the classic example of an associative array in action:

<select name="color">
    <option value="rd">red</option>
    <option value="or">orange</option>
    <option value="ye">yellow</option>
    <option value="gn">green</option>
    <option value="bu">blue</option>
    <option value="vi">violet</option>
</select>

The content within each option text block is a “label” (the right side of an array item), while the value attribute holds the “name” of each item. Significantly, when an item is selected, the label for that item is displayed in the drop-down list, but the value of the control becomes the associated left-hand name of the option. A similar expression holds for XForms, though the relationship is a little more explicitly defined:

<xf:select1 ref="color">
    <xf:item>
        <xf:label>red</xf:label>
        <xf:value>rd</xf:value>
    </xf:item>
    <!-- ...one item per color... -->
</xf:select1>

These concepts are of course fundamental to most computer languages, but things get a little more complex when one considers a slightly different model in XForms:

<xf:model>
    <xf:instance id="data">
        <data xmlns="">
            <color>rd</color>
        </data>
    </xf:instance>
    <xf:instance id="colors">
        <colors xmlns="">
            <color name="rd" label="red"/>
            <color name="or" label="orange"/>
            <color name="ye" label="yellow"/>
            <color name="gn" label="green"/>
            <color name="bu" label="blue"/>
            <color name="vi" label="violet"/>
        </colors>
    </xf:instance>
</xf:model>

<xf:select1 ref="color">
    <xf:itemset nodeset="instance('colors')/color">
        <xf:label ref="@label"/>
        <xf:value ref="@name"/>
    </xf:itemset>
</xf:select1>

In this case the associative array (defined here as the names and labels on a color element) is contained in a data instance separate from the selection control. This location is a little strange if you’re coming from HTML, where the control holds the relevant array, but it’s pretty normal in XForms, where the underlying assumption is that the control provides only a view on an existing data structure; the associative array acts much like a schema, defining the constraints that determine which values are valid for that particular data property.

So far, so good; schema through an enumerated list is a pretty typical use of schemas (XSD schemas in particular). However, take the next step, which is shifting that particular constraint list so that it comes from an external XML file:

<xf:model>
    <xf:instance id="data">
        <data xmlns="">
            <color>rd</color>
        </data>
    </xf:instance>
    <xf:instance id="colors" src="colors.xml"/>
</xf:model>

Conceptually, this approach isn’t all that much of a jump, although notice what’s occurring in the bigger picture. If you look at this list of colors as being part of a schema, then that schema just became distributed: at least part of it exists outside the scope of the rest. Again, this isn’t all that unusual, because most schema languages include some form of modularization that breaks schemas into distributed pieces. The schema’s beginning to drift into the network, but it’s still constrained and static.

Or is it? Consider one additional step:

<xf:model id="datamodel">
    <xf:instance id="data">
        <data xmlns="">
            <color>rd</color>
        </data>
    </xf:instance>
    <xf:instance id="colors" src="colors.xq"/>
</xf:model>

In this case the source changed from being an XML document to being the result of a REST-invoked XQuery (substitute your own favorite server language here; the concepts are just as valid). Yet at the same time, something profound happens. The query assumes that some dynamic process has just taken place, which in turn means that the enumerations that the particular model can hold are no longer static. For instance, suppose that colors.xq reached into a Crayola crayon box (the one with 10 gazillion colors in it) and pulled out a random set of, say, 12 colors. Moreover, every day or so about 5 percent of the colors in the initial set are retired, and an equivalent amount is added. In essence, while you have an enumerated set of colors, you no longer have a static definition of that set—you cannot model it in XSD.
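
To make the time dependence concrete, here’s a hedged Python sketch of validating against such a service (the colors.xq endpoint URL and the color/@name response format are assumptions carried over from the example): membership is tested against whatever set the service returns at the moment of validation.

import urllib.request
import xml.etree.ElementTree as ET

COLORS_SERVICE = "http://example.com/colors.xq"  # assumed endpoint

def current_color_keys():
    # Fetch today's enumeration; the set changes as colors are retired
    # and added, so there is no static list to bake into an XSD schema.
    with urllib.request.urlopen(COLORS_SERVICE) as response:
        doc = ET.parse(response)
    return {color.get("name") for color in doc.iter("color")}

def is_valid_color(key):
    # Validation is now a function of time: the same key may pass today
    # and fail tomorrow.
    return key in current_color_keys()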

The example may seem contrived, but it is exactly the same problem as the Nordstrom store problem described previously, and it is in fact becoming typical in building web applications. You could argue at this point that the result should be just a blank text field and an unconstrained set, but that’s not the case. The set is known and at any point can be validated, but the set is also dynamic: such validation is a function of time. The color may be available, the store may be active and not yet closed, but the validation state has now acquired time as a parameter. The world works this way, and ignoring this fact for the sake of convenience will result in models (and applications) that fail in subtle but important ways.

Read-Only Web Services Scenarios
If this parameterization seems naggingly familiar, it is in fact the same issue that arises any time that you use web services. For the moment, consider only read-only web services based on the HTTP GET protocol. There are effectively two distinct scenarios where read-only web services are likely to occur.

The first is a situation where, for one reason or another, it is not feasible for the model in question to fit readily within the client environment (it’s hosted in a database, there are security permissions involved, and so forth). You can label such services “convenience services.” In theory, you could host the information locally, but it wouldn’t be efficient to do so. In this case, the data environment is fundamentally static; the same call made at two different times with otherwise identical parameters would retrieve the same content. You could theoretically create a schema for such a call, consisting of a specific (albeit potentially large) set of values. A good example is a postal code registry that maps postal codes to townships in an area. While the registry may change, it would do so seldom enough to be insignificant to the modeler.

The second situation, however, is considerably more interesting. This is the case where the service itself is working with a dynamic environment. For instance, take the archetypal web service, which retrieves the changes in an equity stock from the beginning of the day to the current time (plus or minus some reporting delta). To keep things focused on the key point, say the service provides a listing of a given set of stocks that have increased in value since the last reporting period. The taxonomy in this case is both functional and dynamic; if it were rendered as a set of radio buttons, one for each stock, the number of buttons and their contents would change every time the web service refreshed.

Nor is this limited to XForms. Indeed, one of the more insidious problems inherent in Asynchronous JavaScript and XML (AJAX) services in general is that it becomes increasingly difficult to validate an instance of a data model as that model becomes more diffuse and distributed. Consequently, most of the notions of validation defined for either object-oriented programming or XML no longer apply.

Now, this issue raises an important question: is validation necessary? Even in a completely trusted network, the answer is likely, “yes, some form of validation is necessary.” In such a network, XML (or related serialized content) still needs to be created at some point, and there is a possibility that the creation process for that XML is flawed; however, the validation involved there is more in line with comprehensive unit testing. After you seal the box and determine that the content in such a closed system is valid and consistent, the only source of potential errors would come from flaws in your model itself—something that, by definition, validation cannot solve (as such validation is part of the model).

However, the moment you introduce the possibility of XML content coming from outside the environment, validation becomes crucial. And because one of the principal roles of XML is as a messaging format among heterogeneous systems, it is likely that you will need some way to determine whether content entering the system is both internally consistent and legitimate.

A static schema language, however, can at best provide structural or base-type validation, and even there, as models become more complex, the likelihood that such a schema can properly validate content becomes something of a game of chance. It cannot validate taxonomic information that exists outside the model, especially in a dynamic context. Moreover, it cannot validate the authenticity of a message.

One potential solution is to set up a complex infrastructure of web services tied specifically into SOAP/WSDL interchange, establish a federation system for identity management, wrap everything in encrypted bundles, and essentially build a full handshaking mechanism across all the systems involved to turn the fundamentally unreliable network of the Internet into a closed, private, and totally reliable network. To a great extent this approach drove the creation of most of the WS-* initiatives.

Deceptively Simple Schematron
However, a far simpler (if somewhat less secure) approach is to give your validation mechanism the intelligence to talk to resources outside of the schema file itself. In a nutshell, this is the approach that ISO Schematron takes. The idea behind Schematron is deceptively simple. A Schematron document consists of a collection of rules (rendered in XML), each of which operates on a specific context defined by an XPath expression. Within each rule is a set of assertions, themselves XPath expressions and predicates, that test given conditions about the context. If a condition succeeds, nothing happens; if it fails, Schematron returns a message in a specific text or XHTML format. While it is possible to use custom parsers, the most typical Schematron process runs this way:

  1. Author the Schematron document.
  2. Transform the Schematron using a special Schematron XSLT, which in turn generates another XSLT (the Schematron filter).
  3. Transform the file to be validated against the Schematron filter to produce a report.
  4. Pass the report on to the user or another process.
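
As a rough illustration, the four steps above can be driven from Python with lxml (a sketch only: the skeleton stylesheet and file names are placeholders, and because lxml’s processor is XSLT 1.0, the 1.0 skeleton is assumed):

from lxml import etree

SVRL = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Step 2: transform the Schematron document (step 1) with the skeleton
# XSLT, which generates another XSLT (the Schematron filter).
skeleton = etree.XSLT(etree.parse("iso_svrl_for_xslt1.xsl"))
validator = etree.XSLT(skeleton(etree.parse("invoice-rules.sch")))

# Step 3: transform the file to be validated against the filter,
# producing an SVRL report.
report = validator(etree.parse("invoice.xml"))

# Step 4: pass the failed assertions on to the user or the next process.
for failed in report.getroot().iterfind(".//svrl:failed-assert", namespaces=SVRL):
    print(failed.findtext("svrl:text", namespaces=SVRL))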

You can use this Schematron approach with either XSLT 1.0 or 2.0 processors (though in general I’d recommend 2.0, simply because the capabilities of the language are much more sophisticated). Both expose one important function: the XSLT document() function (which isn’t part of XPath 1.0, per se).

The document() function takes two arguments. The first argument consists of either a URL string or a node-set (or sequence in 2.0) of URLs, while the second argument consists of a document context (which usually takes the current node reference as an argument value). The result is in turn one or more documents from those URLs. Note that if those URLs are themselves parametric GET-based web services, then you can use them to retrieve content from an external service to validate content from dynamic taxonomies.

For instance, suppose that you had a web service that takes a single parameter—colorkey—and returns a single XML node of the form (the result element name here is illustrative; what matters to the validation are the status and statusMessage attributes):

<result status="..." statusMessage="..."/>

If the color corresponding to the key was found, the service might return:

<result status="200" statusMessage="Color red found."/>

or, if it wasn’t:

<result status="404" statusMessage="Color key does not exist."/>
You can then write a Schematron schema that reads the existing resource, queries the server, and generates the appropriate error message when the assertion is disproved:

<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern id="confirmTaxonomies">
        <rule context="colorkey">
            <let name="keyValue" value="."/>
            <let name="colorDoc" value="document(concat('colors.xq?colorkey=',$keyValue),.)"/>
            <assert test="$colorDoc/*[@status=200]">
                <value-of select="$colorDoc/*/@statusMessage"/>
            </assert>
        </rule>
    </pattern>
</schema>

In this case the pattern contains a single rule matching the colorkey element. The rule defines a couple of variables (via let) for easier and clearer processing, then tests whether the @status of the incoming response is 200 (corresponding to an HTTP 200 “success” code). If it isn’t (that is, the web service returns an error), the validator outputs the specifics of the statusMessage to the output-processing stream.
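
For completeness, the service on the other end of that document() call might look something like this Python/Flask sketch (a hypothetical stand-in for the colors.xq XQuery; the palette and messages are invented, and the in-band status attribute matches the response format shown above):

from flask import Flask, Response, request

app = Flask(__name__)

# Hypothetical, mutable palette; in the scenario above this set changes
# over time as colors are retired and added.
PALETTE = {"rd": "red", "or": "orange", "ye": "yellow"}

@app.route("/colors.xq")
def lookup_color():
    key = request.args.get("colorkey", "")
    if key in PALETTE:
        xml = '<result status="200" statusMessage="Color %s found."/>' % PALETTE[key]
    else:
        xml = '<result status="404" statusMessage="Color key %s does not exist."/>' % key
    # Always an HTTP 200 so document() can load the result; the validation
    # status travels in-band in the @status attribute.
    return Response(xml, mimetype="application/xml")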

This approach, using something like Schematron as a declarative schema that nonetheless works well in a distributed context, deserves closer examination by those who work with XML data streams. In an increasingly connected world, validation itself also needs to “go global,” become more functional, and shift toward a processing model that recognizes that the days when you could describe XML content in a single static document are slipping away quickly.
