Closing the Loop
NLP and semantic web approaches are complementary because implicit metadata extracted from text and explicit metadata provided by publishers differ only in how they are obtained. Why not take the best of both worlds: use highly precise, human-controlled information when it is available, and fall back to automated methods where that information is partial or missing?
Yahoo's SearchMonkey platform allows for the integration of NLP and semantic web approaches. SearchMonkey provides motivation for publishers to open up metadata through a compelling use case, which is the ability to modify the presentation of search results through plug-ins to the search interface.
Content providers can either annotate their content themselves or turn to providers such as Dapper to semantify their pages. Third-party developers can create custom data services to extract information from the web page. A custom data service can be:
- An XSLT stylesheet
- An external service that performs an arbitrarily complex transformation
- A combination of information from the page and from an external API (for example, associating data with Flickr pages by calling the Flickr API)
Regardless of the source, all metadata in SearchMonkey is represented internally in a triple format and made available to applications in an XML serialization called dataRSS that is also RDFa compatible. SearchMonkey application developers can combine all metadata associated with a web page to develop the above-mentioned plug-ins to the search interface. Applications compete in that users of Yahoo! Search implicitly vote by choosing the applications they like to enable as part of their search configuration.
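To make the idea of a single triple representation concrete, here is a minimal sketch in Python of how statements from two sources about the same page could be merged and then grouped for display. It is not SearchMonkey's implementation: the URL, the property names, the values, and the helper functions `merge_sources` and `properties_of` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical data: none of these values come from a real page.
PAGE_URL = "http://example.com/restaurant"

# Triples are (subject, predicate, object) tuples, one list per source.
embedded_rdfa = [
    (PAGE_URL, "vcard:fn", "Example Bistro"),
    (PAGE_URL, "vcard:tel", "555-0100"),
]
custom_data_service = [
    (PAGE_URL, "review:rating", "4"),
    (PAGE_URL, "vcard:fn", "Example Bistro"),  # same statement as in the embedded markup
]


def merge_sources(*sources):
    """Combine triples from several sources, dropping exact duplicates but keeping order."""
    seen = set()
    merged = []
    for source in sources:
        for triple in source:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


def properties_of(triples, subject):
    """Group the predicate/object pairs of one subject into a dict of lists."""
    props = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            props[p].append(o)
    return dict(props)


merged = merge_sources(embedded_rdfa, custom_data_service)
print(properties_of(merged, PAGE_URL))
```

Because every source is reduced to the same three-part shape, the application does not need to know whether a statement came from embedded markup or from a custom data service.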
Semantic technology was crucial in realizing the open design of SearchMonkey. With many existing metadata platforms, such as Google Base, users are limited by the technology to a small number of object types (housing, jobs, products, etc.) when providing information. Similarly, the attributes of those objects are limited and prescribed by the platform provider (that is, Google).
In SearchMonkey, RDF made it possible to separate syntax from vocabulary: publishers are free to use any vocabulary, which opens up the system to support the long tail of web content. In contrast to XML technologies, vocabularies described in RDF or OWL can be easily extended by anyone without breaking interoperability, because applications can safely work on the part of the metadata they understand and ignore the rest. Further, in a distributed knowledge representation scenario, metadata is kept and maintained locally, so publishers are no longer locked into a centralized system. Relying on open standards also means that both developers and publishers can use compatible tools to manage their metadata.
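The "use what you understand, ignore the rest" behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SearchMonkey code: the `wine:cellarSize` property, the blank-node name `_:place`, and the function `split_by_understanding` are invented for the example.

```python
# Vocabulary prefixes this hypothetical application knows how to display.
UNDERSTOOD_PREFIXES = {"vcard"}


def split_by_understanding(triples, understood=UNDERSTOOD_PREFIXES):
    """Separate triples whose predicate prefix is understood from the rest."""
    known, ignored = [], []
    for triple in triples:
        predicate = triple[1]
        prefix = predicate.split(":", 1)[0]
        if prefix in understood:
            known.append(triple)
        else:
            ignored.append(triple)
    return known, ignored


page_triples = [
    ("_:place", "vcard:fn", "Example Bistro"),
    ("_:place", "vcard:tel", "555-0100"),
    ("_:place", "wine:cellarSize", "1200"),  # publisher-specific extension (hypothetical)
]

known, ignored = split_by_understanding(page_triples)
print(known)    # statements the application can render
print(ignored)  # statements it skips without failing
```

The publisher's extra property does not break the application; it is simply left for other applications that do understand that vocabulary.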
On occasion, semantic web technologies have been simplified to support the average web developer who has some familiarity with XML standards and a working knowledge of PHP. An example of such simplification was deciding which query language to use. SPARQL, the de facto standard in the semantic web, is clearly overly complicated for the average developer, and more complex than the task required. Thus, the choice was made for an XPath-like query language, similar to the Fresnel Selector Language (FSL).
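To give a feel for what a path-style query over triples involves, the following Python sketch evaluates a `Type/property/property` path against a small graph. It is a simplified illustration only: the exact semantics of the SearchMonkey query language are not reproduced here, and the graph contents and the functions `select` and `get` are assumptions made for the example.

```python
# Hypothetical graph: subject -> {predicate: [objects]}.
# An object that names another subject (here "_:addr") links to a nested resource.
GRAPH = {
    "_:place": {
        "rdf:type": ["vcard:VCard"],
        "vcard:fn": ["Example Bistro"],
        "vcard:adr": ["_:addr"],
    },
    "_:addr": {
        "vcard:street-address": ["1 Example Street"],
        "vcard:locality": ["Springfield"],
    },
}


def select(graph, path):
    """Evaluate a 'Type/property/property' path and return all matching values."""
    type_name, *steps = path.split("/")
    # Start from every subject declared to be of the requested type.
    current = [s for s, props in graph.items()
               if type_name in props.get("rdf:type", [])]
    # Follow one property per step; literals have no outgoing properties.
    for step in steps:
        next_values = []
        for node in current:
            next_values.extend(graph.get(node, {}).get(step, []))
        current = next_values
    return current


def get(graph, path):
    """Return the first match for the path, or None when nothing matches."""
    values = select(graph, path)
    return values[0] if values else None


print(get(GRAPH, "vcard:VCard/vcard:adr/vcard:street-address"))
```

A developer only has to read the path left to right, which is the usability argument made above; the cost is that joins, filters, and other features of a full query language such as SPARQL are not expressible.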
Listing 1 shows a typical snippet of RDF data that a SearchMonkey application would receive as input.
If you want the application to display the address of the restaurant, use this:
$ret['dict']['key'] = 'Address';
$ret['dict']['value'] = Data::get('com.yahoo.search.rdfa/vcard:VCard/vcard:adr/vcard:street-address');
After defining other properties of the display (image, links, further key-value pairs), the resulting user experience looks like Figure 1 (see circled listing).
Figure 1. Defined Properties: After defining your properties, your listing should appear like the listing that is circled in the figure.
While the semantic web has had many successes since its conception over 10 years ago, most of the results have been demonstrated in relatively narrow expert domains, where the inferential power of ontology languages brought a clear benefit, or within the enterprise, where the additional levels of control allowed sophisticated methods of metadata management to be introduced in a top-down manner. However, this seems to be the year when the semantic web is becoming a reality on the public web as publishers become familiar with technologies such as RDFa, and developers start to realize the first applications that make use of RDF data distributed across the web.
Building out the semantic web is as much a social process as a technical one; thus, even if the technologies are by and large ready, there is still an important role for the community to play. One example is fostering agreements around vocabularies. While there are vocabularies for many domains, the overall coverage of semantic web vocabularies is at best partial, and the quality varies. In practice, users still wonder where to find a vocabulary that fits their need, and how to choose among the existing ones. It is also certain that basic technologies will need to be revisited over time. Although using RDF-based technologies has become much easier, the arrival of large numbers of publishers will no doubt further lower the quality of metadata on the web, which means that developers will need tools for dealing with inconsistencies.
As the SearchMonkey query language example above shows, the trend is toward sacrificing some expressive power in return for usability and robustness in the face of mistakes.
In the long term, we also hope to see further convergence between the HTML Web (including embedded metadata) and the Web of Linked Data. For publishers, choosing between publishing embedded metadata (microformats, RDFa) and publishing linked data is a difficult and confusing process at the moment. Hopefully, a clearer understanding of the relationship between embedded metadata and linked data will emerge as these technologies are put to use at larger and larger scales.