Validate Data With Regular Expressions and XSL

XSL query language can only perform queries based upon the complete text of a text node. Learn how to use regular expressions to let you search at a lower level of granularity than the XML text element.



Extensible Stylesheet Language (XSL) has slowly been gaining a reputation as the SQL of the hierarchical data world. Even with only the partial implementation that Microsoft's XSL offers, you can perform surprisingly complex queries. Yet the language has been hamstrung somewhat in that the current XSL query language can only perform queries based upon the complete text of a text node. There is no clean way to search all nodes and return nodes that include an expression within a larger block of text, or that can validate specific types of text.

Furthermore, there are times where you need to include parameters in your XML (or XSL) that change depending upon external conditions (the XSL targets different browsers, the return XML must work against specific records passed through ASP parameters, and so forth). XML does support entities, of course, but in the current implementation, the entities must be defined within a DTD (useful for documents, but worthless for programmatic parameters). Microsoft's XML Schema notation as it exists right now simply doesn't support entities.

In both of these cases (and several others), you need a mechanism that can search at a lower level of granularity than the XML text element. Fortunately, the problem of searching within strings of text is extremely common, and over the years a number of tools have arisen to solve that particular problem set. The most powerful (and most pervasive) are the constructs known as regular expressions.



Introducing Regular Expressions
Regular expressions are staples in the Unix world—indeed, whole languages are built around them (Perl, Python, and Tcl, just to name a few). But surprisingly they have only begun creeping into the Windows world fairly recently, primarily in the scripting languages, JavaScript and VBScript, although you can use them to good effect in languages like Visual Basic and Java.

With a regular expression, you build a template to match the expression that you're looking for, known as a pattern. Once the pattern is assigned to the regular expression, then you can test strings against the pattern, and do such things as determine if one string contains another (as well as where the string is), replace the contents of one string with another string, or return a list of all strings that match the pattern.
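These three operations—testing, locating, and replacing—can be sketched quickly in JavaScript, which shares the regular expression engine described in this article (the sample strings here are illustrative, not from the source):

```javascript
// Build a pattern, then reuse it against any number of strings.
var pattern = new RegExp("regular expressions");

// test() reports whether the string contains a match.
console.log(pattern.test("I use regular expressions daily")); // true
console.log(pattern.test("I use string functions daily"));    // false

// exec() also reports where in the string the match begins.
var match = pattern.exec("I use regular expressions daily");
console.log(match.index); // 6
```

The same `test` and `Replace` operations appear on the VBScript `RegExp` object used throughout the rest of the article.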

Before getting into the details of regular expressions, note that the examples given here are in Visual Basic, but can be adapted to VBScript or (with a little bit of work) JavaScript. The regular expression engine is built into the scripting language DLLs, but to use the same thing in Visual Basic, you will need to select the References item from the VB Project menu, then check the "Microsoft VBScript Regular Expressions" entry. As is typically the case, you will also need to have IE4 or IE5 installed to use these expressions from VB, although Netscape Navigator also provides nearly identical JavaScript support for regular expressions.

The RegExp object acts as the interface for regular expressions. The simplest regexes (as they are frequently referred to) match characters in a pattern. For example, suppose that you want to check to see if a document contains a specific term (for example "regular expressions"). This code loads the file into a variable, then passes the variable to a regular expression:


Public Function IsTermInDocument(filePath As String, _
                                 expr As String) As Boolean
    Dim fs As FileSystemObject
    Dim ts As TextStream
    Dim re As RegExp
    Dim text As String

    Set fs = New FileSystemObject
    Set ts = fs.OpenTextFile(filePath, ForReading)
    text = ts.ReadAll
    ts.Close
(1) Set re = New RegExp
(2) re.Pattern = expr
(3) IsTermInDocument = re.Test(text)
End Function

Debug.Print IsTermInDocument("c:\bin\myPage.htm", _
    "regular expression")
' This returns True if "regular expression" is in the
' document, False otherwise

This function will return a boolean value depending upon whether a given term is in a document. In this case, the object is instantiated (1), a pattern is assigned to the regex (2), then the text is tested against the pattern (3). If this were all that regexes were useful for, it would hardly be worth the effort—you could use InStr() or a similar VB function to do much the same thing.

However, typically, your search or validation requirements are considerably more sophisticated than just searching for a given string. For example, suppose that you want to ensure that a given field contains a well-formed zip code (well-formed in that it could be a legitimate code, although it may not be the valid code for a given location—this dichotomy between well-formed and valid expressions will appear throughout this article). Performing this test in Visual Basic gets ugly. You need to test that the expression has either 5 or 10 characters, and if the latter, that the sixth character is a dash. With a regex, on the other hand, it's a piece of cake:


(1) Set IsZipCode = New RegExp
(2) IsZipCode.Pattern = "^\d{5}(-\d{4})?$"
(3) If IsZipCode.Test("32545-2198") Then

One (albeit cryptic) pattern string, and any zip code candidate can be accepted or rejected in a moment. The expression itself isn't all that hard to decode, however, and basically requires that you are aware of the meaning of certain escape characters. In this case, the expression can be read from left to right, as follows:


^	there must be nothing in the string before the expression
\d	the next character must be a digit from 0-9
\d{5}	there must be exactly five digits
-\d{4}	four more digits must appear after a dash
(-\d{4})?	the dash and last four digits are optional
$	there must be nothing in the string after the expression
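The same pattern works unchanged in JavaScript, so its behavior against a few candidate strings (chosen here for illustration) is easy to demonstrate:

```javascript
// The zip code pattern from the table above.
var isZipCode = /^\d{5}(-\d{4})?$/;

console.log(isZipCode.test("32545"));      // true  - five digits, +4 omitted
console.log(isZipCode.test("32545-2198")); // true  - optional +4 present
console.log(isZipCode.test("3254"));       // false - too few digits
console.log(isZipCode.test("32545-21"));   // false - malformed +4 portion
```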

The beauty of this approach is that once you define a pattern, you can use it on any number of strings without having to rebuild the regex object. By defining a set of basic regexes, you can cut down on the number of validation scripts considerably within your code. I reduced nearly 2000 lines of JavaScript code into a couple hundred lines of code and regexes, and most of it consisted of code to deal with the few cases that regexes couldn't handle.

Writing regexes amounts to knowing how best to apply the sometimes arcane characters used within the expressions. Many punctuation marks have specialized meaning in regexes (some of which are covered in Table 1). The backslash character ("\") can be used to escape such characters so that you can perform a search on the character itself. For example, the regex ".+" will match a string with one or more characters, while "\.+" will match a string with one or more periods.

Similarly, alphanumeric characters can also be used as meta-characters (see Table 2). Unlike punctuation meta-characters, however, these require the backslash to indicate that the character is a meta-character rather than a literal. Thus, "d+" will match strings that contain one or more "d" characters ("dd", "abcd", "dddd ddd"), while "\d+" will match strings that contain one or more digits ("12", "abc123", but not "abcdd").
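The escaped-versus-literal distinction is easy to see in JavaScript, using the same example strings as above:

```javascript
// Alphanumerics: bare "d" is a literal, "\d" is the digit meta-character.
console.log(/d+/.test("abcd"));    // true  - contains a run of literal "d"s
console.log(/\d+/.test("abc123")); // true  - contains digits
console.log(/\d+/.test("abcdd"));  // false - no digits anywhere

// Punctuation: bare "." is the any-character meta-character,
// while "\." matches only a literal period.
console.log(/\.+/.test("a.b"));    // true  - contains a literal period
console.log(/\.+/.test("ab"));     // false - no period to match
```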

You can make an enormous number of expressions with just this set of characters (which are common to both the RegExp engine in Internet Explorer and Netscape Navigator). For example, suppose that you want to match a phone number. Typically, phone numbers can prove complicated to validate, because people tend to enter them in different ways:


(800)555-1212
1(800) 555-1212
1-800-555-1212
1.800.555.1212
etc.

Setting up a script to catch all of these variations can get complicated. A regular expression, on the other hand, is fairly simple:


Set IsPhoneNumber = New RegExp
IsPhoneNumber.Pattern = "^[01]?\s*[\(\.-]?(\d{3})[\)\.-]?\s*(\d{3})[\.-](\d{4})$"

This pattern accepts an optional leading 0 or 1, optional white space, parentheses, dashes, or periods around the area code, and then either dashes or periods separating the rest of the phone number.
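Running the same pattern in JavaScript against the variations listed above shows that all of them pass, while an incomplete number fails:

```javascript
// The phone number pattern from above, unchanged.
var isPhoneNumber = /^[01]?\s*[\(\.-]?(\d{3})[\)\.-]?\s*(\d{3})[\.-](\d{4})$/;

console.log(isPhoneNumber.test("(800)555-1212"));   // true
console.log(isPhoneNumber.test("1(800) 555-1212")); // true
console.log(isPhoneNumber.test("1-800-555-1212"));  // true
console.log(isPhoneNumber.test("1.800.555.1212"));  // true
console.log(isPhoneNumber.test("555-1212"));        // false - no area code
```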

Search and Replace Data
Of course, validating a phone number is one thing, but it would be useful if the phone number could be put into a single consistent format that could be retrieved later. Consider this XML code:


<phoneNumber>
<areacode>123</areacode>
<exchange>456</exchange>
<local>7890</local>
</phoneNumber>

This phone pattern includes three expressions, (\d{3}), (\d{3}), and (\d{4}), wrapped in parentheses, denoting the area code, exchange, and local number respectively. In regex notation, the regex parser automatically remembers parenthetic expressions, and you can retrieve them numerically using the predefined operators $1, $2, $3, and so on. You do this with the RegExp replace method:


?re.Replace("1(352)351-4159", _
    "<phoneNumber><areacode>$1</areacode>" & _
    "<exchange>$2</exchange><local>$3</local></phoneNumber>")

Specifically, the syntax is re.Replace(sourceString, replaceString), where sourceString is the data that you want to transform, and replaceString is the target string that you're using to replace the source. There is no limit to the number of times that you can include the operators $1 through $9 in the replace string, although you are limited to nine matched expressions.
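The same transformation in JavaScript, where the replace method lives on the string rather than on the regex object, produces the XML structure shown above:

```javascript
// Parenthesized groups become $1, $2, $3 in the replacement string.
var phone = /^[01]?\s*[\(\.-]?(\d{3})[\)\.-]?\s*(\d{3})[\.-](\d{4})$/;

var xml = "1(352)351-4159".replace(phone,
    "<phoneNumber><areacode>$1</areacode>" +
    "<exchange>$2</exchange><local>$3</local></phoneNumber>");

console.log(xml);
// <phoneNumber><areacode>352</areacode><exchange>351</exchange>
// <local>4159</local></phoneNumber>
```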

This use of replace may seem a little counterintuitive, given how VB and VBScript's replace is used. However, you can duplicate that functionality as well with a regular expression, as is demonstrated with this Replacex function:


Public Function Replacex(sourceStr As String, oldStr As String, _
        newStr As String, Optional ignoreCase As Boolean = False, _
        Optional isGlobal As Boolean = True) As String
    Dim re As New RegExp
    re.Pattern = oldStr
    re.Global = isGlobal
    re.IgnoreCase = ignoreCase
    Replacex = re.Replace(sourceStr, newStr)
End Function

The Replacex function expands the normal VBScript replace function with one that can pass regexes. Thus, you could say:


Debug.Print Replacex("This is a test","is","at")
--> "That at a test"

But you could also use a regular expression as an argument:


Debug.Print Replacex("This is a test","\ws","at")
--> "That at a tatt"

or even combine the replace functionality with the stored expressions:


Debug.Print Replacex("This is a test","(\ws)","at$1")
--> "Thatis atis a tatest"

Notice the use of the re.Global and re.IgnoreCase properties in the Replacex() function. Typically, a regex runs only until it finds the first instance that satisfies the test. However, for a replace function, you will usually want to make changes globally, so you should set Global to True in that particular instance. Similarly, regexes are case sensitive by default; you can make a regex case insensitive by setting IgnoreCase to True as well. Note that enabling either one of these properties will slow down the regex, which usually is not a problem for a small string, but can be highly detrimental with large documents.
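In JavaScript, the equivalents of Global and IgnoreCase are the "g" and "i" flags attached directly to the expression, as this comparison shows:

```javascript
// Without "g", only the first match is replaced.
console.log("This is a test".replace(/is/, "at"));  // "That is a test"
console.log("This is a test".replace(/is/g, "at")); // "That at a test"

// Without "i", matching is case sensitive.
console.log("Testing".replace(/t/g, "x"));  // "Tesxing" - capital T untouched
console.log("Testing".replace(/t/gi, "x")); // "xesxing" - both cases replaced
```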

Retrieve XML Nodes
By now, you may be starting to see the possibilities with regular expressions, especially in regards to some of the thornier issues of XML. In particular, one of the more vexing problems has to do with the incomplete implementation of entities in Microsoft's XML 2.0 parser. This actually breaks down into a couple problems. First, the XML parser requires that entities be defined within a DTD when the document initially loads. This can prove problematic, however, in circumstances where the entities themselves change meaning over time—for example, when setting the characteristics of an XML document or an XSL filter based upon some external criterion.

The second problem is that a DTD cannot be manipulated using XSL, and is difficult to work with even using scripting languages. Frequently it is useful to be able to set variables within XSL structures—expressions, such as a browser type or ASP parameter, that can be used within conditional statements.

Finally, there are times when it would be useful to get a list of elements from an XML document for which a given sub-element satisfies a criterion—for example, retrieving a list of all elements that have the word "XML" as part of the text.

Regular expressions offer solutions to these problems. Attacking the last problem, there are actually two slightly different domains to look at. In general, when I want to retrieve an XML element, that object may or may not be at the node I want to test. For example, consider a simple XML structure, a book catalog (see Listing 1). In all likelihood, your interest is not to retrieve the descriptions or titles of books—you want to retrieve the book nodes themselves. Fortunately, you can solve both problems with the same basic approach, as demonstrated in GetFilteredElements (see Listing 2).

GetFilteredElements works by taking either an XML document or a node within a document and converting it (and all of the children of that node) into a node list of type IXMLDOMNodeList. The filter is then applied to each node in turn, and if the text of that node satisfies the expression, then the node is marked with a temporary attribute called filter:filteredElementFound (in the unlikely event of a collision with this name, you should change the attribute to something else). Once all of the nodes have been examined, a new list consisting only of those nodes that contain the attribute is created, the attribute is stripped off, and the node list is returned.
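The core of that algorithm—flatten the tree to a list, test each node's text against the regex, and collect the survivors—can be sketched in JavaScript. Note that the node objects here are hypothetical stand-ins for illustration, not the actual MSXML IXMLDOMNodeList API, and the sample book data is invented:

```javascript
// A sketch of the GetFilteredElements idea. Each "node" here is a
// plain object with a text property, standing in for a DOM node.
function getFilteredElements(nodes, pattern) {
    var re = new RegExp(pattern, "i"); // case-insensitive, like the article's searches
    var result = [];
    for (var i = 0; i < nodes.length; i++) {
        // Keep any node whose text body satisfies the expression.
        if (re.test(nodes[i].text)) {
            result.push(nodes[i]);
        }
    }
    return result;
}

// Hypothetical catalog entries for demonstration.
var books = [
    { name: "book", text: "The XML Handbook. A thorough reference." },
    { name: "book", text: "Pair-o-Dice Lost. A gambling memoir." },
    { name: "book", text: "Inside XML. Covers parsers and schemas." }
];

console.log(getFilteredElements(books, "xml").length); // 2
```

The real function does the same walk over live DOM nodes, using the temporary attribute described above to mark matches before building the final node list.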

If you don't pass an XSL query, then the function examines only leaf nodes (those that have text or CDATA sections but don't contain any element children) and returns those whose text matches. On the other hand, if you do pass a query, the function retrieves the nodes for which either they or their subnodes satisfy the regular expression. For example, this code will retrieve the first book's title and description nodes, and the fourth book's description node only:


Dim bookXML As New DOMDocument
bookXML.Load "bookCatalog.xml"
Set nodelist = GetFilteredElements(bookXML, "xml")

On the other hand, this code retrieves the first and fourth book nodes, which you can then use to retrieve any sub-properties:


Set nodelist = GetFilteredElements(bookXML, "xml", "//book")

In general, you should specify an XSL query filter whenever possible, as it will typically extract a smaller subset of nodes that need to be tested.

This function has real application in an XML "database," since one of the most common arguments used against XML by SQL developers is that there is no equivalent to the LIKE keyword in SQL. However, by combining XML and regular expressions, you can make queries that are considerably more flexible than what most LIKE statements offer.

Validate Data Within XSL Transform
Regular expressions are ideal for performing data validation within an XSL transform. For example, suppose that you wanted to generate a table showing books that mentioned XML in the description or in the title. While you could create a DOM-based version, the advantage of XSL is that you can create complex HTML pages without having to do a lot of string building.

The <xsl:script> node in XSL lets you introduce scripts that can be used for evaluation purposes, or to insert text into the output stream (you can't currently output DOM nodes to the stream, however). The default scripting language for XSL is JavaScript, and it turns out that in JavaScript (though not in VBScript or VB), you can use a shorthand notation for creating a regular expression: enclosing the pattern in forward slashes (for example, /xml/). This XSL document, for example, will display the XML-centric books from the previous list:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
	<xsl:script language="JavaScript"><![CDATA[
		IsValidBookTopic=/xml/
	]]></xsl:script>
	<xsl:template match="/">
		<xsl:apply-templates select="//book" />	
	</xsl:template>
	<xsl:template match="book">
		<xsl:if expr="IsValidBookTopic.test(this.text)"> 
		<h1><xsl:value-of select="title"/></h1>
		<h2><xsl:value-of select="author"/></h2>
		<p><xsl:value-of select="description"/></p>
		</xsl:if>
	</xsl:template>
</xsl:stylesheet>

The regular expression is contained in the forward slashes (note that the expression is not enclosed in quotes), and when the document is instantiated, the script is rendered in memory as a RegExp object.

The power of this comes in the <xsl:if> expression further down the XSL document. While xsl:if can perform tests using the same comparison notation as the xsl:template match attribute, one little-known feature of the node is that it can also evaluate a boolean expression and act upon whether the result is true or false. Here, for example, the expression is IsValidBookTopic.test(this.text) where this refers to the node matched by the template. Because this expression will be true only when the book node has "XML" somewhere within its text body, only those books that feature the word "XML" will be formatted.

Of course, in general you will want to be able to change your criterion string. One strategy that I've found works well is to create parametric "entities" (borrowing from XML terminology, although these aren't entities in the DTD sense). In this case, you use the % character to denote that a given string expression is a parameter, and then assign it to an internal variable that you can then reference. For example, to display the books based on any subject, you could write a routine that looks like this SetXMLParameter() function:


Function SetXSLParameter(XslDoc As DOMDocument, ParamName As _
        String, ParamValue As Variant) As DOMDocument
    Dim ScriptNode As IXMLDOMElement
    Dim re As RegExp

    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = "%" + ParamName
    For Each ScriptNode In XslDoc.selectNodes("//xsl:script")
        ScriptNode.Text = re.Replace(ScriptNode.Text, CStr(ParamValue))
    Next
    Set SetXSLParameter = XslDoc
End Function

SetXSLParameter sets a parameter within the script node of an XSL document. While you could conceivably replace the expression throughout the XSL document rather than just in the script node, by making the change only in the script node, you gain flexibility in doing calculations or other processing based upon the data ahead of time. Note that the XSL document will be changed by this script, so if you plan to use the same XSL object again with a different filter, you may want to clone the script and work on the clone instead.

You would then only need to slightly modify the XSL script. Here is the catalog.xsl script using a parameter (%searchStr):


<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
	<xsl:script language="JavaScript"><![CDATA[
		IsValidBookTopic=/%searchStr/
	]]></xsl:script>
	<xsl:template match="/">
		<xsl:apply-templates select="//book" />	
	</xsl:template>
	<xsl:template match="book">
		<xsl:if expr="IsValidBookTopic.test(this.text)"> 
		<h1><xsl:value-of select="title"/></h1>
		<h2><xsl:value-of select="author"/></h2>
		<p><xsl:value-of select="description"/></p>
		</xsl:if>
	</xsl:template>
</xsl:stylesheet>

You could then output the results based upon some external value (a search string, for instance). To get the Pair-o-Dice Lost book only, you'd use this code (assuming that this is an ASP page):


Set xmlDoc = Server.CreateObject("Microsoft.XMLDOM")
Set xslDoc = Server.CreateObject("Microsoft.XMLDOM")
xmlDoc.load "catalog.xml"
xslDoc.load "catalog.xsl"
SetXSLParameter xslDoc, "searchStr", "pair-o-dice"
Response.Write xmlDoc.transformNode(xslDoc)

Note that you could use a similar technique for other parameterized entries, such as determining the browser type, passing ASP queryStr arguments, or writing Web-based components.

The combination of regular expressions and XSL makes for a powerful resource. From expanding search capabilities to validation and parameterization, regexes provide the fine-grained text processing that XSL lacks, while XSL more easily handles complex hierarchical structures (which are a major bane for most regular expressions). Not surprisingly, one proposal currently under consideration for XPath evaluation (a key component of XSL) is including regexes as an integral part of the standard. Thus, learning how to use this sophisticated and extremely versatile language will serve you not just now, but for the foreseeable future.





   
Kurt Cagle is the author or co-author of twelve books and several dozen articles on web technologies, XML and web services. He is the president of Cagle Communications (Olympia, WA), which specializes in the production of training materials for XML and Web Services education. He can be reached at kurt@kurtcagle.net.