Web 3.0 (aka the semantic web) is the transformation of the web into a data source for building new applications, which combine data in ways that the data source’s original author may not have anticipated. Although most web designers don’t think of their HTML web pages as data sources, semantic markup is starting to fuel a new generation of applications that extract the designers’ data and build something new from it (see Sidebar 1. How Did the Web Get Here?).
These new “ad-hoc” applications are called mashups, because they combine (or “mash” together) data in innovative ways to create new views of the web. Many semantic web standards and technologies contribute to the creation of mashups, including semantic markup, CSS, RSS, Atom, the Atom Publishing Protocol (APP), RDFa, GRDDL, and plain old XML.
Mashups all share three salient characteristics:
- They draw on sources of data directly on the web.
- They transform, combine, and re-transform this data to create innovative new outputs. Maps and timeline displays are typical mashup output formats.
- They can usually be done in a few hours. That means that the transformations are created rapidly in a high-productivity environment.
Regarding this last characteristic, most mashups would take a lot longer if you had to create relational database tables by hand using a traditional RDBMS Data Definition Language (DDL). Fortunately, modern systems can use the metadata in XML files to create in-memory data structures and indexes automatically for high-performance mashups. You would need the CREATE TABLE statement only in legacy systems.
High-Level Architecture of a Mashup
Figure 1 shows the high-level architecture of a mashup system.
Figure 1. The Architecture of a Mashup System
Note that you will need a tool to pull your raw data from the web into XML format. Many tools do this. For example, the eXist-db open-source database provides a module called HTTPClient that performs an HTTP GET on an input URL. To enable it, you need to change the database configuration file so that the module is loaded. The HTTPClient library will also transform malformed HTML into well-formed XML.
Another feature of the above architecture is the ability to store relevant data directly in a local XML store. Users who create mashups frequently work with the same data sets. If the application also performs a “store” operation on the incoming data, later requests can read the local copy instead of performing another HTTP GET. Many systems, including eXist-db, also index these files automatically so that even very large collections (hundreds of thousands of documents) have very fast retrieval times.
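The fetch-then-store pattern described above can be sketched in XQuery. This is only an illustration for eXist-db, not a definitive implementation: it assumes the HTTPClient extension module is enabled, that the `httpclient:get` and `xmldb:store` functions are available with the signatures used here, and that the current user may write to the collection. The collection path `/db/mashup-cache`, the function name `local:fetch`, and the example URL are hypothetical.

```xquery
xquery version "1.0";

(: Sketch for eXist-db only; both modules are eXist extensions, not standard XQuery. :)
import module namespace httpclient = "http://exist-db.org/xquery/httpclient";
import module namespace xmldb = "http://exist-db.org/xquery/xmldb";

(: Hypothetical cache collection; it must already exist and be writable. :)
declare variable $cache-collection := "/db/mashup-cache";

(: Return the cached copy if present; otherwise fetch, store, and return it. :)
declare function local:fetch($url as xs:string, $name as xs:string)
    as document-node()? {
  let $path := concat($cache-collection, "/", $name)
  return
    if (doc-available($path)) then
      (: Cache hit: no HTTP GET is needed. :)
      doc($path)
    else
      (: Cache miss: fetch the page; the second argument disables session
         persistence and the empty sequence means no extra request headers. :)
      let $response := httpclient:get(xs:anyURI($url), false(), ())
      (: The tidied page is assumed to be the first element inside the body. :)
      let $body := $response//httpclient:body/*[1]
      return
        if (empty($body)) then ()
        else doc(xmldb:store($cache-collection, $name, $body))
};

(: Hypothetical usage: the first call fetches, later calls read the local copy. :)
local:fetch("http://example.com/mypage.html", "mypage.xml")
```

A production version would also need to check the response status code and decide when a cached copy is too old to reuse; both are left out here to keep the sketch short.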
XQuery and Cloud Computing
If you are building mashups from data sets hosted in the cloud and use XQuery to select the data, both CPU time and I/O can drop substantially. This is because XQuery is one of the few languages designed to use indexes on unstructured documents and to be very precise about which data is extracted from them. If you are paying for cloud computing infrastructure, you may want to consider XQuery as your data selection language.
Data Selection with XPath
One of the first things a mashup must do is select the relevant data from incoming data pages, which can be done easily with XPath expressions (see Sidebar 2. XPath Built-in to XQuery). Using XPath expressions, it is easy to extract data from within an XHTML file even if that data is buried under ten levels of nested elements. For example:
//li
This notation says to start at the root of the file (the leading forward slash) and find any list item anywhere below it, regardless of how deeply it is nested (the double slash). So any web page that puts relevant data in list items can be queried quickly.
The exact XQuery expression for this would be similar to this:
let $mydata := doc("mypage.html")//li
You can also add qualifiers to the XPath expressions to find list items only within specific divs. For example, the following listing will select only the list items from within the main content area of an HTML file:
let $mydata := doc("mypage.html")//div[@class='content-main']//li
The square bracket notation (technically called a predicate) is like a SQL WHERE clause. It will return only list items that are nested somewhere under a div that has a class attribute equal to content-main. Note that the double slash at the end of the XPath indicates that the actual list item elements may be nested many layers inside the main content. If you replace the second double slash with a single slash, you will get only list items that are direct children of the main content div.
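The difference between the double slash and the single slash is easy to try out without an external file. The following is a minimal sketch that inlines a small, hypothetical page (the element content and class names other than content-main are invented for illustration) and counts what each path selects.

```xquery
xquery version "1.0";

(: Hypothetical sample page, inlined so the query runs without any external file.
   The li placed directly under the div is not valid XHTML; it is there only
   to illustrate the child step. :)
let $page :=
  <html>
    <body>
      <div class="nav">
        <ul><li>Home</li></ul>
      </div>
      <div class="content-main">
        <li>Direct child item</li>
        <div class="section">
          <ul><li>Deeply nested item</li></ul>
        </div>
      </div>
    </body>
  </html>

(: Every list item at any depth in the page. :)
let $all := $page//li

(: List items at any depth below the main content div. :)
let $nested := $page//div[@class = 'content-main']//li

(: Only list items that are children of the main content div. :)
let $direct := $page//div[@class = 'content-main']/li

return
  <counts all="{count($all)}" nested="{count($nested)}" direct="{count($direct)}"/>
```

Evaluated against this sample, the three counts are 3, 2, and 1: the predicate excludes the navigation item, and the single slash additionally excludes the item nested inside the inner div. Real XHTML pages usually declare the XHTML namespace, in which case the element names in the path need a matching namespace prefix or a default element namespace declaration.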
XQuery, SQL and XSLT: Birds of a Feather
Developers who are familiar with data selection languages might wonder if all the things they have learned about SQL transfer to XQuery. The answer is “yes, without a doubt.” Everything you can do with SQL you can also do with XQuery, including:
- Adding WHERE clauses
- Creating indexes for fast search
- Selecting distinct values
- Restricting selections to the first N items
- Doing joins
- Changing sort order
- Changing grouping
In many cases the syntax is nearly identical; in others only small changes are required. For example, adding an order by clause to an XQuery FLWOR expression changes the order of the result set, just as ORDER BY does in SQL.
Those familiar with XSLT will already be familiar with many parts of XQuery. All XPath knowledge also transfers with almost no changes. The biggest difference I noticed when I ported XSLT mashup code to XQuery is that the large queries ran much faster on large data sets stored in native XML databases. This surprised me at first, but I was reminded that many native XML databases use the same data structures (B+ trees) and indexing schemes that large RDBMSs use.
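Several of the SQL-style operations listed above can be seen together in one short query. The following is a minimal sketch over a small, hypothetical data set inlined in the query (the element names, attribute names, and values are invented for illustration); it filters with where, sorts with order by, keeps the first two results, and lists distinct values.

```xquery
xquery version "1.0";

(: Hypothetical data set, inlined so the query is self-contained. :)
let $events :=
  <events>
    <event city="Paris" year="2007">Conference A</event>
    <event city="Oslo" year="2005">Workshop B</event>
    <event city="Paris" year="2003">Meetup C</event>
    <event city="Rome" year="2008">Summit D</event>
  </events>

(: Comparable to SQL WHERE plus ORDER BY ... DESC. :)
let $recent :=
  for $e in $events/event
  where xs:integer($e/@year) ge 2005
  order by xs:integer($e/@year) descending
  return $e

return
  <report>
    <!-- Comparable to restricting a result to the first N rows (here N = 2). -->
    <top>{ subsequence($recent, 1, 2) }</top>
    <!-- Comparable to SELECT DISTINCT, sorted alphabetically. -->
    <cities>{
      for $c in distinct-values($events/event/@city)
      order by $c
      return <city>{ $c }</city>
    }</cities>
  </report>
```

With this data, the top element contains Summit D followed by Conference A, and the city list reads Oslo, Paris, Rome. Note that XQuery keywords are lowercase and case-sensitive, so the clause is written order by rather than ORDER BY.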
Limitations of Tools and Specifications
The tools used to perform mashups in XQuery today have many limitations. Although XQuery has many features of modern functional languages (see Sidebar 3. Programming 100 CPU Cores: Procedural Languages Lacking), the implementations offered by individual vendors or libraries may differ significantly. For example, many options are available for HTTP GET, such as setting timeouts or retrying after a given interval, but they must be handled on an ad-hoc basis because they are not part of the current XQuery specification. Hopefully, these will become part of the XQuery specification or of standard XQuery add-on libraries in the future.