The year 2007 saw the release of what was, for all intents and purposes, the second generation of the XML core technologies. The first generation appeared over the course of about three years: the XML standard itself in 1998, XPath and XSLT 1.0 in 1999, and XML Schema in 2001. XSLT 1.0 in particular was a game changer for the technology, as it took a radically different approach to programming: a language written using nothing but markup that matched XPath patterns and then passed the matching XML nodes to templates to create new content.
This approach was extraordinarily powerful — because of the recursive nature of the templates, an XSLT stylesheet could transform anything from nearly flat database records to very deep documents with equal ease, could use wildcard matches to create generalized templates, and could additionally invoke functions on the XPath for more specialized processing. This meant that XSLT stylesheets began to take on the status of a secret weapon for XML developers, programs that could, in a few hundred lines of code, outdo imperative code double or triple its size.
However, XSLT 1.0’s power was also its Achilles’ heel. As one pundit put it, the process of learning how to work with XSLT was very much akin to sawing off the top of your head, pulling out your brain, rotating it 90 degrees, then inserting it back in your head. For people steeped in C-like languages, XSLT was extraordinarily non-intuitive, and there were many operations, such as iterating over a sequence of numbers, that could only be accomplished by using recursion rather than iterator loops. Not surprisingly, especially when coupled with the sometimes cumbersome XML notation, adoption of XSLT remained very limited among traditional developers.
Yet even among XML developers who recognized the value of XSLT, the language had a reputation for being cumbersome, and a number of efforts emerged to improve the standard via a somewhat vague and amorphous extension mechanism, culminating in 2002 with the creation of an ad hoc group of functions defined under the banner of Extended XSLT (EXSLT). While this effort helped to identify (and standardize) a number of key functions (including additional math, string and regular expression handling, as well as functions that filled a significant hole in date processing), it also exposed a more serious problem, one that went to the core of the underlying XPath language.
It’s useful to understand the relationship between XPath and XSLT. The purpose of XPath is very much analogous to the role played by a SELECT statement in SQL: it identifies, relative to a given context in an XML document, a group of related nodes (elements, attributes, text and so forth) for further processing. The XPath language also has a secondary role in being able to perform some limited calculations on those nodes, though this capability exists primarily to support the underlying selection mechanism.
XSLT in turn made use of both the primary and secondary roles of XPath by providing a constructive language capability over the top of the XPath language. There are, in fact, a number of similarities between the XPath language and the Regular Expression (RegEx) language: one identifies nodes in a tree, while the other identifies text fragments in a text expression, but neither of them can actually do anything with this information by itself. They aren’t constructive. Languages such as Perl or JavaScript can take the resulting pattern matches and use them to build other things in the text by way of the matches. XSLT works in much the same way, something that tends to get lost in translation when people see it as an XML language.
This focus on XPath nodes in a nodeset, however, was proving to be too limited. There was no clean way of iterating over sets of words, for instance, or iterating over a sequence of numbers, or (and this was perhaps the most troublesome aspect) iterating across constructed nodes, because XPath assumed that there was effectively one and only one tree that it could pull information from, the one tree provided as the input (or passed as a parameter). The biggest consequence of this was that there was no way with this underlying data model to create a temporary tree and then process that tree in memory, short of outputting that tree and passing it on to a new XSLT process. You couldn’t combine two or more external inputs and then process that combined set, something that is especially critical for working with external look-up tables. In short, the model itself was broken, badly.
This led to a decision by the W3C to go back to the drawing board, especially as there had been an increasing call by XML database vendors to provide a more comprehensive standardized query language that would do for XML what SQL did for relational databases – create a unified query language that could be used to query large datasets, return results, and manipulate them in different ways to generate new output. In 2002, the XPath working group was given a mandate to produce a 2.0 version of the language, one that could be used by both a revised XSLT language and by the proposed XQuery language … and thus began a major rethinking about data models.
From Nodesets to Sequences
The new data model (which began emerging around 2004) no longer revolved around nodesets, which were seen as being too restrictive. Instead, primacy shifted to sequences: lists of things, with comparatively little restriction on what those things could be. A sequence could consist of a linear set of nodes, for instance, but those nodes were no longer required to all be part of the same document. The sequence could also consist of text strings, numbers, dates and times (which were now their own data type), or even more generic items, or could be combinations of all of these things.
Yet sequences were also, always, fundamentally linear. If you put a sequence inside of another sequence such as:
(a,b,(c,d,e))
that was equivalent to
(a,b,c,d,e)
Thus, you couldn’t build structures with nested sequences, because the whole role of a sequence is to be a, well, sequential list of items.
This seemingly mundane change would end up having huge implications. You could create operators such as the to operator, which builds a sequence of consecutive integers to iterate over, so that 1 to 5 yields (1, 2, 3, 4, 5).
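As a minimal sketch of the idea in XPath 2.0 (the numbers are purely illustrative), the to operator produces a sequence of integers that a for expression can then walk over:

```xpath
(: 1 to 5 is the sequence (1, 2, 3, 4, 5) :)
for $i in 1 to 5
return $i * $i
(: result: (1, 4, 9, 16, 25) :)
```

In XSLT 1.0 the same loop would typically have to be written as a recursive named template that counts up to a limit.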
It meant that you could take a comma-delimited string and split it into a sequence of individual strings with the tokenize() function.
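A short illustration of such a split, using the standard tokenize() function (the color names are made up for the example):

```xpath
tokenize("red,orange,yellow", ",")
(: result: ("red", "orange", "yellow") :)

(: the separator is a regular expression, so it can also absorb
   any whitespace that follows each comma :)
tokenize("red, orange,  yellow", ",\s*")
(: result: ("red", "orange", "yellow") :)
```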
Finally, and perhaps most significantly, you could pull in elements (or even whole documents) from multiple document sources, for example with the doc() function, and combine them into a single sequence.
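A sketch of what combining sources might look like. The file names and element names here (products.xml, categories.xml, product, category) are hypothetical and only serve the illustration:

```xpath
(: one sequence holding nodes from two different documents,
   kept in the order in which they are written :)
(doc("products.xml")//product, doc("categories.xml")//category)

(: using the second document as a look-up table for the first :)
for $p in doc("products.xml")//product
return doc("categories.xml")//category[@id = $p/@category-id]/name
```

Because the items of a sequence no longer have to belong to one tree, the look-up table can live in its own file and still be queried alongside the primary input.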
Most of these operations are familiar to people working with XQuery (because sequences are so intrinsic to XQuery development), and once you gain some mastery in working with sequences in this manner you can do a lot of things that are difficult and tedious to do in XSLT 1.0 in a couple of lines of code. Such sequences also share a fair amount of similarity with node-sets in terms of their interface -- $mysequence[2] will retrieve the second item in the sequence, $mysequence[last()] will retrieve the last item, and count($mysequence) will retrieve the number of items in the sequence.
However, there are also capabilities that sequences offer that can't be readily handled by nodesets in XSLT 1.0. For instance, you can slice sequences using the subsequence function -- subsequence($mysequence, 1, 4) will return the first four items in the sequence, subsequence($mysequence, 3, 4) will return four sequential items starting from item 3, and subsequence($mysequence, 2) will retrieve all items from the second item on.
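To make the positional and slicing calls concrete, here is a small worked example. It assumes that $mysequence has been bound to the six strings shown in the first comment:

```xpath
(: assumed binding: $mysequence = ("a", "b", "c", "d", "e", "f") :)
$mysequence[2]                   (: "b" :)
$mysequence[last()]              (: "f" :)
count($mysequence)               (: 6 :)
subsequence($mysequence, 1, 4)   (: ("a", "b", "c", "d") :)
subsequence($mysequence, 3, 4)   (: ("c", "d", "e", "f") :)
subsequence($mysequence, 2)      (: ("b", "c", "d", "e", "f") :)
```

Note that positions are 1-based, and that the third argument of subsequence is a length rather than an end position.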
Similarly, you can use the index-of function to match a given sequence item with its position or positions. If the item occurs once, say in fifth place, the function returns the single value 5. If more than one match exists, then a sequence of positions such as (3,4) is returned, making it possible to iterate over the matching items and generate a new element for each of them.
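A hedged illustration of index-of, using made-up sequences chosen so that the results match the values mentioned above:

```xpath
index-of(("a", "b", "c", "d", "e"), "e")
(: result: 5 :)

index-of(("x", "y", "z", "z", "w"), "z")
(: result: (3, 4) :)

(: iterating over every position at which the item was found :)
for $pos in index-of(("x", "y", "z", "z", "w"), "z")
return concat("match at position ", $pos)
(: result: ("match at position 3", "match at position 4") :)
```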
You can also use the string-join() function, which takes a sequence and concatenates its items into a single string, separated by a specified delimiter. For instance, a comma-delimited list could be generated from a sequence as
string-join(("red", "orange", "yellow", "green", "blue", "violet"), ", ")
=> "red, orange, yellow, green, blue, violet"
User-Defined Functions
XSLT 1.0 is built around the template model, which includes both matching templates (that use the @match attribute) and named templates (that use the @name attribute). Named templates definitely have utility, but the central problem with such named templates is that they have to be invoked from within an
This is the first line
This is the second line
This is part 1 of line 3 This is part 2 of line 3
Then an XSLT 2.0 stylesheet to transform it would look like the following:
1. The $elt-name variable extracts the local name of the element.
2. The $expanded-caps variable uses a simple regular expression to replace any upper case character with an underscore and that character (e.g., "ATest" becomes "_A_Test"). The replacement is global, so every match is replaced, and optional flags can extend the matching to both ignore case and work across line breaks.
3. Following that, the $spaced-name variable uses translate() to map underscores and dashes to spaces, and then applies normalize-space() to remove leading and trailing spaces and convert multiple contiguous spaces within the expression into a single space.
4. In $tokenized-seq, the results are then tokenized to convert them into a sequence, and the sequence in turn is broken into title text expressions before being rejoined with spaces in $final-name. The output of the function ignores any white space outside of the
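The four steps above can be sketched as an XSLT 2.0 stylesheet function. This is an illustration under stated assumptions rather than the original listing: the use of xsl:function, the function name my:title-case, its namespace URI and the exact regular expression are hypothetical, and only the variable names come from the steps described.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/ns/my"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical function: turns an element name into a readable title -->
  <xsl:function name="my:title-case" as="xs:string">
    <xsl:param name="elt" as="element()"/>
    <!-- Step 1: the local name of the element, e.g. "ATest" -->
    <xsl:variable name="elt-name" select="local-name($elt)"/>
    <!-- Step 2: prefix every upper case letter with "_", giving "_A_Test" -->
    <xsl:variable name="expanded-caps"
        select="replace($elt-name, '([A-Z])', '_$1')"/>
    <!-- Step 3: underscores and dashes become spaces, then spaces are normalized -->
    <xsl:variable name="spaced-name"
        select="normalize-space(translate($expanded-caps, '_-', '  '))"/>
    <!-- Step 4: split into words, capitalize each word, rejoin with spaces -->
    <xsl:variable name="tokenized-seq" select="tokenize($spaced-name, ' ')"/>
    <xsl:variable name="final-name"
        select="string-join(
                  for $w in $tokenized-seq
                  return concat(upper-case(substring($w, 1, 1)), substring($w, 2)),
                  ' ')"/>
    <xsl:sequence select="$final-name"/>
  </xsl:function>

  <!-- Example call: emit the readable title of each element in the input -->
  <xsl:template match="*">
    <xsl:value-of select="my:title-case(.)"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates select="*"/>
  </xsl:template>
</xsl:stylesheet>
```

With this sketch, an element named first-name would come out as "First Name" and one named ATest as "A Test". Unlike a named template, the function can be called directly from inside any XPath expression.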
Resource 1 resource1.xml
Resource 2 resource2.xml
Resource 3 resource3.xml
The following transformation takes the feed, retrieves the entry resources and stores them on the local file system:
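A minimal sketch of such a transformation, assuming that the element in question is XSLT 2.0's xsl:result-document. The feed structure (a feed element containing entry elements, each with a title and a link holding the file name) and the saved/ output folder are hypothetical, inferred only from the sample values shown above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed feed shape:
       <feed>
         <entry><title>Resource 1</title><link>resource1.xml</link></entry>
         ...
       </feed> -->
  <xsl:template match="/feed">
    <xsl:for-each select="entry">
      <!-- Resolve the entry's link against the location of the feed itself -->
      <xsl:variable name="source" select="resolve-uri(link, base-uri(.))"/>
      <!-- Write one output file per entry under a local "saved" folder -->
      <xsl:result-document href="saved/{link}" method="xml" indent="yes">
        <xsl:copy-of select="doc($source)"/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The output is written to a separate folder on purpose, since a transformation is not allowed to read from and write to the same URI. The sketch also assumes the feed was loaded from a location that gives it a base URI.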
There are a few limitations that you have to be careful with for this particular element -- once you start generating output elements in the main thread, you can't include an