Web 3.0 (aka the semantic web) is the transformation of the web into a data source for building new applications, which combine data in ways that the data source’s original author may not have anticipated. Although most web designers don’t think of their HTML web pages as data sources, the use of semantic markup is starting to fuel a new generation of applications that extract the designers’ data and build something new from it (see Sidebar 1. How Did the Web Get Here?).
These new “ad-hoc” applications are called mashups, because they combine (or “mash” together) data in innovative ways to create new views of the web. Many semantic web standards and technologies contribute to the creation of mashups, including semantic markup, CSS, RSS, Atom, the Atom Publishing Protocol (APP), RDFa, GRDDL, and plain old XML.
Mashups all share three salient characteristics:
- They draw on sources of data directly on the web.
- They transform, combine, and re-transform this data to create innovative new outputs. Maps and timeline displays are typical mashup output formats.
- They can usually be done in a few hours. That means that the transformations are created rapidly in a high-productivity environment.
Regarding this last characteristic, most mashups would take a lot longer if you had to create relational database tables by hand using a traditional RDBMS Data Definition Language (DDL). Fortunately, modern systems can use the metadata in XML files to automatically create in-memory data and indexes for high-performance mashups. You would need the CREATE TABLE statement only in legacy systems.
High-Level Architecture of a Mashup
Figure 1 shows the high-level architecture of a mashup system.
Figure 1. The Architecture of a Mashup System: Here is the high-level architecture of a mashup system.
Note that you will need some tools to pull your raw data from the web into XML format. Many tools do this. For example, the eXist-db open-source database provides a program called HTTPClient that performs HTTP GET functions on an input URL. To enable it, you need to change the configuration file to load this module into the database. The HTTPClient library will transform any malformed HTML into well-formed XML.
Another feature of the above architecture is the ability to store relevant data directly in a local XML store. Users who create mashups frequently work with the same data sets. If the application also performs a “store” operation on the incoming data, it does not need to repeat the HTTP GET the next time that data is requested. Many systems like the eXist-db database also index these files automatically, so that even very large collections (hundreds of thousands of documents) will have very fast retrieval times.
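The fetch-then-store pattern described above might be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes eXist-db with its httpclient and xmldb extension modules enabled, and the collection path, resource name, and URL are hypothetical. Function signatures can differ between eXist-db releases, so check the function documentation for your version.

```xquery
xquery version "1.0";

(: Assumption: eXist-db extension modules; not part of the XQuery specification :)
import module namespace httpclient = "http://exist-db.org/xquery/httpclient";
import module namespace xmldb = "http://exist-db.org/xquery/xmldb";

(: Hypothetical names used only for illustration :)
declare variable $cache-collection := "/db/mashup-cache";  (: must already exist :)
declare variable $resource-name := "mypage.xml";
declare variable $source-url := xs:anyURI("http://www.example.com/mypage.html");

let $cached-path := concat($cache-collection, "/", $resource-name)
return
  if (doc-available($cached-path)) then
    (: Local copy found: no HTTP GET needed :)
    doc($cached-path)
  else
    (: No local copy: fetch the page, then store the tidied XML for next time :)
    let $response := httpclient:get($source-url, false(), ())
    let $page := $response/httpclient:body/*
    return
      if ($response/@statusCode = "200" and exists($page)) then
        doc(xmldb:store($cache-collection, $resource-name, $page[1]))
      else
        ()  (: fetch failed; return nothing rather than caching an error :)
```

A real mashup would also need a policy for refreshing stale copies, which this sketch leaves out.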
XQuery and Cloud Computing
If you are doing mashups with data sets hosted in the cloud and using XQuery to select the data, both CPU time and I/O can drop substantially. This is because XQuery is one of the few languages designed to use indexes on unstructured documents and to be very precise about what data is extracted from these data sets. If you are paying for cloud computing infrastructure, you may want to consider XQuery as your data selection language.
Data Selection with XPath
One of the first things a mashup must do is select the relevant data from incoming data pages, which can be done easily with XPath expressions (see Sidebar 2. XPath Built-in to XQuery). Using XPath expressions, it is easy to extract data from within an XHTML file even if that data is buried under ten levels of nested elements. For example, the following expression selects every list item in a file:
//li
This notation says: start at the root of the file (the first forward slash) and find any list item anywhere, regardless of its depth in the file (the double slash). So any web page that puts relevant data in list items can be queried quickly.
The exact XQuery expression for this would be similar to this:
let $mydata := doc("mypage.html")//li
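To try this without a stored file, the same selection can be run against an inline fragment. The following is a small sketch in which the $page variable and its content are hypothetical stand-ins for doc("mypage.html"):

```xquery
xquery version "1.0";

(: Hypothetical page content standing in for doc("mypage.html") :)
let $page :=
  <html>
    <body>
      <div class="nav"><ul><li>Home</li></ul></div>
      <div class="content-main">
        <div><ul><li>First item</li><li>Second item</li></ul></div>
      </div>
    </body>
  </html>

(: Select every list item, however deeply it is nested :)
let $mydata := $page//li

(: Return the text of each list item as a string :)
for $item in $mydata
return string($item)
```

The query returns all three list items, including the one in the navigation div, because //li places no restriction on where the item sits in the page.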
You can also add qualifiers to the XPath expressions to find list items only within specific divs. For example, the following listing will select only the list items from within the main content area of an HTML file:
//div[@class='content-main']//li
The square bracket notation (technically called a predicate) is like a SQL WHERE clause. It will return only list items that are nested somewhere under a div that has a class attribute equal to content-main. Note that the second double slash in the XPath, the one before li, indicates that the actual list item elements may be nested many layers inside the main content. If you replace that second double slash with a single slash, you will get only list items that are direct children of the main content div.
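The difference between the double slash and the single slash is easiest to see side by side. In this sketch the page fragment and the variable names are hypothetical; the three counts compare descendant, direct-child, and explicit-path selection under the same predicate:

```xquery
xquery version "1.0";

(: Hypothetical page fragment used only for illustration :)
let $page :=
  <html>
    <body>
      <div class="nav"><ul><li>Home</li></ul></div>
      <div class="content-main">
        <ul><li>Top item</li></ul>
        <div class="box"><ul><li>Nested item</li></ul></div>
      </div>
    </body>
  </html>

(: Any li at any depth under the main content div :)
let $any-depth := $page//div[@class='content-main']//li
(: Only li elements that are direct children of the main content div :)
let $direct := $page//div[@class='content-main']/li
(: li elements reached through an explicit ul child :)
let $via-ul := $page//div[@class='content-main']/ul/li

return (count($any-depth), count($direct), count($via-ul))
(: yields 2, 0, 1 :)
```

The middle count is zero because every list item here sits inside a ul, so none is a direct child of the div. The navigation item is excluded from all three results by the predicate.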
XQuery, SQL and XSLT: Birds of a Feather
Developers who are familiar with data selection languages might wonder if all the things they have learned about SQL transfer to XQuery. The answer is “yes, without a doubt.” Everything you can do with SQL you can also do with XQuery, including:
- Adding WHERE clauses
- Creating indexes for fast search
- Selecting distinct values
- Restricting selections to the first N items
- Doing joins
- Changing sort order
- Changing grouping
In many cases the syntax is identical; in others only small changes are required. For example, simply adding an order by clause to an XQuery statement will change the order of the result set, exactly as ORDER BY does in SQL.
Those familiar with XSLT will already be familiar with many parts of XQuery. All XPath knowledge also will transfer with almost no changes. The biggest difference I noticed when I ported XSLT mashup code to XQuery is that the large queries ran much faster on large data sets stored in native XML databases. This surprised me at first, but I was reminded that many native XML databases use the same data structures (B+ trees) and indexing schemes that large RDBMSs use.
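Several of the operations in the list above can be sketched in a single query. The $cities and $events data and all element names below are invented for illustration, and the grouping step uses the distinct-values idiom on the assumption that only XQuery 1.0 features are available:

```xquery
xquery version "1.0";

(: Hypothetical sample data standing in for two web data sources :)
let $cities :=
  <cities>
    <city id="c1"><name>Oslo</name><country>NO</country></city>
    <city id="c2"><name>Bergen</name><country>NO</country></city>
    <city id="c3"><name>Lyon</name><country>FR</country></city>
  </cities>
let $events :=
  <events>
    <event city="c1"><title>XML Meetup</title></event>
    <event city="c3"><title>Data Workshop</title></event>
  </events>
return
  <results>
    <!-- WHERE clause plus sort order -->
    <sorted>{
      for $c in $cities/city
      where $c/country = "NO"
      order by $c/name descending
      return $c/name
    }</sorted>
    <!-- Distinct values, like SELECT DISTINCT -->
    <countries>{ distinct-values($cities/city/country) }</countries>
    <!-- First N items: here the first two names in document order -->
    <first-two>{ subsequence($cities/city/name, 1, 2) }</first-two>
    <!-- Join: match each event to its city by id -->
    <joined>{
      for $e in $events/event, $c in $cities/city
      where $e/@city = $c/@id
      return <event-in-city>{ string($e/title), "in", string($c/name) }</event-in-city>
    }</joined>
    <!-- Grouping: one group element per distinct country -->
    <grouped>{
      for $country in distinct-values($cities/city/country)
      return
        <group country="{$country}">{ $cities/city[country = $country]/name }</group>
    }</grouped>
  </results>
```

Index creation is not shown because it is configured in the database rather than written in the query itself.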
Limitations of Tools and Specifications
The tools being used to perform mashups in XQuery today have many limitations. Although XQuery does have many modern features of advanced functional languages (see Sidebar 3. Programming 100 CPU Cores: Procedural Languages Lacking), the implementations of individual vendors or libraries may be significantly different. For example, many options are available for HTTP GET, such as the ability to set timeouts or retry after a given interval, but they must be done on an ad-hoc basis and are not part of the current XQuery specification. Hopefully, these will be part of the XQuery specification or standard XQuery add-on libraries in the future.

