There is no question that the web is an unprecedented success. It is the single most adventurous and useful platform for information exchange ever conceived and built. The architectural choices that went into its design have lent it scalability, flexibility, and the ability to grow into new business models, application-level technologies, and varied uses.
Currently, the web is designed for use by people, not by software. Visitors to a site such as Amazon, which provides book and movie ratings, information about available used copies, related products, and so forth, can easily parse the content visually. It is much more difficult for software to act on a visitor’s behalf, because the information is tied up in the presentation structure. It is possible to write software that scrapes these kinds of pages, but when Amazon’s designers change the look or style of the pages, the scrapers are likely to break.
A solution to the problem of extracting information from varied content, and to other data integration problems, has been envisioned for the Web from the start. This vision is known as the Semantic Web, and it promotes the development of software systems capable of sharing, integrating, and supporting machine processing of the Web’s data. It is no longer a web of documents; it is a web of data.
People have been excited and skeptical about the Semantic Web since the beginning. Early hype, a long path to working systems, disagreements about goals and strategies, and all-around confusion have left the skeptics feeling smug and self-congratulatory.
Let the skeptics have their moment. In the meantime, Semantic Web technologies are continually and quietly enriching the existing web indirectly. The skeptics might be surprised by the companies already using these technologies to solve real problems today.
RDF Where There is None
An early complaint about the Semantic Web vision was that no one would enter quality metadata. The other great criticism was that no one would ever convert their data to the Resource Description Framework (RDF) model. While these seem like reasonable critiques, in practice they are not proving true.
First of all, sites like Delicious, Flickr, and other folksonomy-based sites demonstrate that when the bar is lowered and the value is demonstrated, people will happily contribute tags and other metadata. Delicious can filter out typos and bogus tags by looking at the most common terms for a page. The challenge to the proponents of Semantic Web technologies is to make it as easy to select terms from standard and shared vocabularies as it is to type arbitrary tags.
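To make the filtering idea concrete, here is a minimal sketch of keeping only the tags that a reasonable share of users applied to a page. It is an illustration of the general approach, not Delicious's actual algorithm; the function name, the threshold, and the sample data are all assumptions.

```python
from collections import Counter


def common_tags(taggings, min_share=0.2):
    """Return tags applied by at least `min_share` of the users who tagged a page.

    `taggings` holds one list of tags per user (hypothetical input shape).
    """
    users = len(taggings)
    if users == 0:
        return []
    # Count each tag at most once per user, ignoring case and stray whitespace.
    counts = Counter(
        tag
        for tags in taggings
        for tag in {t.strip().lower() for t in tags if t.strip()}
    )
    return sorted(tag for tag, n in counts.items() if n / users >= min_share)


# Made-up sample data: one typo ("rfd") and one bogus tag.
taggings = [
    ["semanticweb", "rdf"],
    ["SemanticWeb", "rdf", "w3c"],
    ["semanticweb", "rfd"],
    ["rdf", "buy-cheap-pills"],
    ["semanticweb", "rdf", "w3c"],
]
popular = common_tags(taggings, min_share=0.4)
print(popular)
```

Rare tags, including the typo and the bogus entry, fall below the threshold and drop out, while tags that several users agree on survive.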
Secondly, new technologies are eliminating the need to convert data to RDF directly. These include Gleaning Resource Descriptions from Dialects of Languages (GRDDL), RDFa, and SPARQL endpoints. GRDDL and RDFa allow RDF to be produced through standard transformations from existing XML and XHTML resources. Simple markup, no more complicated than current presentations, allows proper metadata to be mixed in the presentation structure and domain-specific hackery like microformats. With these tools in place and supported by certain content publishers, it will be trivial to extract publication metadata, licensing information, geotagging information, and the like from the pages you visit. It is also possible to link this extracted information to different data sources for further discovery.
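As a rough illustration of metadata mixed into presentation markup, the sketch below pulls property/value pairs out of a small XHTML fragment using only the Python standard library. It is a deliberate simplification, not a conforming RDFa processor: it ignores attributes such as about, rel, and typeof, as well as prefix resolution and nesting. The class name and the sample page are assumptions made up for this example.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, value) pairs from elements that carry a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # property whose text content is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # The value is given explicitly in the `content` attribute.
            self.pairs.append((prop, attrs["content"]))
        else:
            # Otherwise take the element's text as the value.
            self._pending, self._text = prop, []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing tag ends the pending value.
        if self._pending is not None:
            self.pairs.append((self._pending, "".join(self._text).strip()))
            self._pending = None


# Hypothetical page fragment with Dublin Core style properties.
page = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h1 property="dc:title">A Web of Data</h1>
  <span property="dc:creator">Jane Example</span>
  <meta property="dc:date" content="2008-01-01" />
</div>
"""

parser = PropertyExtractor()
parser.feed(page)
for prop, value in parser.pairs:
    print(prop, "=", value)
```

The same markup still renders as an ordinary heading and byline for a human reader; the annotations only add machine-readable hints.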
SPARQL endpoints allow RDF views into both RDF and non-RDF data. Some projects leverage other technologies, such as Mulgara Semantic Store, which uses D2RQ in its Relational Resolver to allow RDF queries to include results from non-RDF data sources. This kind of combination allows the RDF model to be populated with content from existing Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), and other relational systems. There is no need to convert the data and store it as RDF; it is generated on the fly.
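The idea of generating RDF on the fly from relational data can be sketched as follows. This is not how D2RQ or Mulgara map tables; it is a minimal illustration, assuming a hypothetical namespace, table, and naming scheme, of how rows can be exposed as subject/predicate/object triples at query time without storing any RDF.

```python
import sqlite3

BASE = "http://example.org/crm/"  # hypothetical namespace for this sketch


def triples_from_table(conn, table, key):
    """Yield (subject, predicate, object) triples for every row of `table`.

    Nothing is stored as RDF; triples are produced while reading the rows.
    `table` is interpolated into SQL, so it must come from trusted code.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key]}"
        for column, value in record.items():
            if column != key and value is not None:
                yield (subject, f"{BASE}{table}#{column}", value)


# A stand-in for an existing relational system (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp', 'Paris')")

triples = list(triples_from_table(conn, "customer", "id"))
for triple in triples:
    print(triple)
```

Each row becomes a subject URI and each non-key column becomes a predicate, so the relational data stays where it is and remains the single source of truth.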
Linking Open Data Project
As RDF data is made available publicly on the web and in the enterprise, it allows technologies to create relationships across data sources. The Linking Open Data project has gained tremendous momentum in the past year and is now connecting billions of triples’ worth of data together through billions of links.
As an example, consider the thousands of Wikipedia volunteers who curate the concepts and relationships that keep the site up-to-date and (presumably) accurate. These include facts such as that the Louvre is a museum in Paris, France. These terms and relationships are now converted monthly into RDF and are exposed at DBpedia.
It is now possible to take a term from Wikipedia, query DBpedia for metadata about this term, and convert the alternate names for the term and its geographic information into a Flickr query for pictures constrained to a specific location. Following our previous example of the Louvre, you can find a slew of high-quality, related pictures from Flickr this way.
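The chain described above, from a term to its metadata to a location-constrained photo search, might look something like the sketch below. No network calls are made: the triples are hand-written stand-ins for what a DBpedia lookup might return, the coordinates are approximate, and the query parameter names (text, lat, lon) are placeholders rather than a statement of Flickr's documented API.

```python
from urllib.parse import urlencode

# Hypothetical triples standing in for the result of a DBpedia query.
TRIPLES = [
    ("dbpedia:Louvre", "rdfs:label", "Louvre"),
    ("dbpedia:Louvre", "rdfs:label", "Musée du Louvre"),
    ("dbpedia:Louvre", "geo:lat", 48.8606),
    ("dbpedia:Louvre", "geo:long", 2.3376),
]


def objects(triples, subject, predicate):
    """Return every object recorded for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def photo_queries(triples, subject):
    """Build one location-constrained query string per known name of the subject."""
    lat = objects(triples, subject, "geo:lat")[0]
    lon = objects(triples, subject, "geo:long")[0]
    return [
        urlencode({"text": name, "lat": lat, "lon": lon})
        for name in objects(triples, subject, "rdfs:label")
    ]


queries = photo_queries(TRIPLES, "dbpedia:Louvre")
for query in queries:
    print(query)
```

The point of the sketch is the join: names and coordinates curated in one place become search constraints in another, with no manual re-entry of either.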
Now, imagine taking all of that information, tying together social networking sites (via OpenSocial and Friend-of-a-Friend (FOAF) profiles), Creative Commons information, geotagging information, Dublin Core publication metadata, the CIA Factbook, U.S. Census information, etc., and you see the emergence of a web of data.
Emerging Infrastructure
The specifications behind the Semantic Web provide the ability to encode, link, and reason about data. Historically, it has been impossible to characterize an unqualified URL as a document or a reference to a non-network-addressable resource. The W3C Technical Architecture Group (TAG) has recently reached a decision that these non-networked resources can be given URIs/URLs. An infrastructure that resolves these references can indicate the special status by returning an HTTP response code of 303 (See Other) instead of a 200 (OK).
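The 303 convention can be illustrated with a small in-memory resolver. This is a sketch under stated assumptions, not the behavior of any particular server: the paths, the lookup tables, and the function names are hypothetical. A URI that names a concept answers with 303 (See Other) and points at a document describing it, while the document itself answers with 200 (OK).

```python
from http import HTTPStatus

# Hypothetical lookup tables for this sketch.
CONCEPTS = {"/id/louvre": "/doc/louvre"}  # concept URI -> describing document
DOCUMENTS = {"/doc/louvre": "The Louvre is a museum in Paris, France."}


def resolve(path):
    """Return (status, headers, body) the way a simple resolver might."""
    if path in CONCEPTS:
        # The name refers to a thing, not a document: redirect to a description.
        return HTTPStatus.SEE_OTHER, {"Location": CONCEPTS[path]}, ""
    if path in DOCUMENTS:
        return HTTPStatus.OK, {}, DOCUMENTS[path]
    return HTTPStatus.NOT_FOUND, {}, ""


def describe(status):
    """Interpret a status code according to the 303 guidance."""
    if status == HTTPStatus.OK:
        return "a document"
    if status == HTTPStatus.SEE_OTHER:
        return "a non-document resource described elsewhere"
    return "undetermined"


status, headers, _ = resolve("/id/louvre")
print(describe(status), "->", headers.get("Location"))
```

A client that follows the Location header ends up at an ordinary document, but it has also learned that the original URI names the thing itself rather than a page about it.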
The Online Computer Library Center’s (OCLC) Persistent URL infrastructure was recently rearchitected for scalability and a series of new features, such as support for this 303 guidance. This new version lays down some key infrastructure for assigning good, resolvable names for terms and concepts, something that has been sorely missing in the Semantic Web technology stack. As such, the new system can define concepts to disambiguate RDF subjects. URIs can be given to proteins, people, legislation, places, etc. While historically you may have chosen a pseudo-canonical URL from a site such as Wikipedia, now it is possible to define a new canonical URL for the terms and subjects that are of interest to your organization.
Embeddable Semantic Web Applications
Thomson Reuters runs a free site, called OpenCalais, for identifying terms and concepts from within unstructured text. With plugins such as Gnosis for Firefox, it is possible to run the OpenCalais service directly on the pages you visit to identify people, places, organizations, industries, etc., even on sites that do not publish information with support for GRDDL, microformats, and RDFa. These extracted terms can then be linked back into other data sources to automate the process of extracting information as you surf the Web. This service is a step toward a larger vision. Thomson Reuters’ CEO has even caught Semantic Web skeptic Tim O’Reilly’s attention with his vision of where this is going.
Another Firefox plugin, Solvent from the Simile project, makes it easy for you to compose lightweight and shareable screen scrapers to extract content from arbitrary pages. This highlights that, while it is great when sites support Semantic Web technologies, the success of the vision does not require everyone to get on board. Automated and semi-automated extraction are key approaches to linking content in structured and unstructured forms.
Support By Open Source and Commercial Organizations
One of the major barriers to adoption of semantic technologies is the lack of support in software. There have always been quality parsing, producing, and querying APIs, but major software initiatives have in general taken a wait-and-see approach. This is increasingly becoming less of an issue as major open source initiatives such as Drupal and Mozilla have committed to supporting RDF and SPARQL.
Perhaps more valuable than adoption by Open Source projects is the long-anticipated support for the technologies by major commercial software players. This too has finally come to pass. Oracle was one of the first major vendors to adopt RDF and OWL in its database engines. It cleverly co-opted its existing Spatial Engine (with its network data model) to support the graph models of RDF. It is now possible to mix RDF and non-RDF data within the same database engine.
Industry giants Yahoo! and Microsoft have also been making announcements and acquisitions in this space. Google is promoting interoperability in the social networking world through OpenSocial while MySpace, eBay, Twitter, and Yahoo! are pursuing DataPortability initiatives.
New technology companies have emerged along the way with tools to help developers, knowledge workers, and other organizational stakeholders build software systems around these ideas. TopQuadrant’s TopBraid Suite, Franz’s AllegroGraph, the Talis Platform, Thetus Publisher, and OpenLink’s Virtuoso server are among the leaders of these emerging markets.
Companies like Zepheira, LLC, Semantic Arts, and Sandpiper Software are working with major corporations around the world to adopt these ideas within their organizations with training, strategic guidance, and implementation assistance.
The pain of failed Enterprise Application Integration (EAI) and Service-Oriented Architecture (SOA) initiatives is driving financial services, news media, insurance, and other conservative industries to look for new solutions to their IT needs. Those industries are considering the successes of the web and want to know how to adopt those ideas internally. The Cleveland Clinic is a leader in adopting Semantic Web technologies to improve its ability to meet the needs of its patients. The goals of the clinic are to lower its IT costs, add business functionality, and avoid the technology flux treadmill. The clinic’s goal is not to use semantic technologies per se; it’s to use them as viable solutions.
Learning About Semantic Technologies
Developers, managers, and executives can learn about Semantic Web technologies at major conferences. This year, relevant content has appeared at:
- Linked Data Planet
- Rich Web Experience
- Semantic Technology Conference
- NoFluffJustStuff tour
- Museums and the Web
Semantic Web technologies are here in many important ways, and you are most likely using them on a daily basis, even if only indirectly. The success of these technologies is not simply a question of everyone adopting the same models and the same terms; it is about a rich and vibrant ecosystem of data, documents, and software tied together in useful ways.