Semantic search has attracted a lot of attention in the past year, largely due to the growth of the semantic web as a whole. The term itself is popular enough to be considered overused, and it is used in two senses. In the first sense, it refers to searching large semantic web datasets, which is a typical problem for semantic web search engines such as Swoogle, Sindice, SWSE, Falcon-S, and Watson. In the second sense, it refers to methods of searching web documents beyond the syntactic level of matching keywords. This article discusses semantic search in this second sense.
The current generation of search engines is severely limited in its understanding of the user’s intent—and the web’s content—and consequently in matching the needs for information with the vast supply of resources on the web. For Information Retrieval (IR) purposes, both queries and documents are typically treated at a word or gram level, with minimal language processing involved. The search engine is missing a semantic-level understanding of the query or the content and can only understand the content of a document by picking out the most commonly occurring or underlined words.
Even though search is considered a functional technology, there are limits to a syntax-based approach. The following list shows some examples of these limitations.
- It is almost impossible to return search results that relate to a secondary sense of a term, especially if a dominant sense exists. For example, try searching for George Bush the beer brewer rather than George Bush the President.
- The capabilities of computational advertising, which is largely also an IR problem (for example, retrieving matching ads from a fixed inventory), are clearly impacted because of the sparsity of advertisements.
- When no clear key exists, search engines are unable to perform queries on descriptions of objects. For example, try searching for the author of this article with the keywords ‘semantic web researcher working for yahoo.’
- Current search technology is unable to satisfy any complex queries requiring information integration such as analysis, prediction, scheduling, etc. An example of such integration-based tasks is opinion mining regarding products or services. While there have been some successes in opinion mining with pure sentiment analysis, it is often the case that users like to know what specific aspects of a product or service are being described in positive or negative terms and to have the search results appear aggregated and organized. Information integration is not possible without structured representations of content.
- Multimedia queries are also difficult to answer, as multimedia objects are typically described with only a few keywords (tagging) or sentences. This is typically too little text for the statistical methods of IR to be effective.
Semantic search—defined as IR with the capabilities to understand the user’s intent and the web’s content at a much deeper, conceptual level—can address these limitations.
Two Roads to Semantic Search
There are two approaches toward semantic search, and both have received attention in the past months. The first approach builds on the automatic analysis of text using Natural Language Processing (NLP). The second approach uses semantic web technologies, which aim to make the web more easily searchable by allowing publishers to expose their (meta)data.
Natural Language Processing
While NLP technology has been researched for decades, the processing it requires has prohibited large-scale applications. Moore’s law, however, is on the side of NLP. Companies such as PowerSet (which was recently purchased by Microsoft) and Hakia are rising quickly. Their systems extract entities from text, disambiguate them against large-scale background knowledge sources (PowerSet uses Freebase, Hakia has its own ontology), and then record the relationships as found in the text. PowerSet and Hakia also allow users to query using full questions, although most users still prefer to type in keywords, which their engines support as well.
|Author’s Note: Information extraction is easier with web sites that follow certain templates (typically, web sites generated from a database). In these cases, special methods (so-called wrappers) can be developed or automatically trained to extract the information with very high precision.|
Publishers can avoid the costs and quality issues associated with NLP if they are willing to expose their metadata. Most publishers are willing to invest some time and effort in exposing structured data directly if they see improvement to their traffic patterns.
Exposing metadata is the starting point for a semantic search approach associated with the semantic web. The semantic web targets the entire web, including documents, databases, and services. The most important recent development is in the area of embedding metadata directly into web documents. Compared to microformats, which are limited to describing the most common types of information (persons, events, etc.), the new RDFa standard (a candidate recommendation from the W3C) has the capability to encode metadata in HTML using any RDF or OWL vocabulary. The decoupling of syntax and semantics is crucial from the perspective of search engine providers; while support for RDFa needs to be implemented only once, each microformat requires a separate parsing of the HTML page. An alternative is GRDDL, which allows the publisher to attach an arbitrary transformation to the page that ‘converts’ the information to RDF. GRDDL is more complicated for search engine providers to support in that running arbitrary XSLT stylesheets on HTML pages carries certain risks.
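The point about implementing RDFa support only once can be illustrated with a minimal Python sketch. It is not a conforming RDFa parser: it ignores subjects, prefixes, `typeof`, and nesting, and only collects `property` attributes from flat markup. The sample page and the `ex:cuisine` property are hypothetical. What it shows is that the same extraction code handles any vocabulary, because the vocabulary lives in the attribute values rather than in the parsing logic.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, value) pairs from elements with a `property` attribute.

    Simplification: assumes flat markup, that is, no element with a
    `property` attribute contains another element of the same tag name.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # (tag, property name, list of text fragments)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # The value is given explicitly, as in <meta property=... content=...>.
            self.pairs.append((prop, attrs["content"]))
        else:
            # The value is the element's text; gather it until the end tag.
            self._pending = (tag, prop, [])

    def handle_data(self, data):
        if self._pending is not None:
            self._pending[2].append(data)

    def handle_endtag(self, tag):
        if self._pending is not None and tag == self._pending[0]:
            _, prop, parts = self._pending
            self.pairs.append((prop, "".join(parts).strip()))
            self._pending = None


def extract_properties(html):
    """Return all (property, value) pairs found in an HTML string."""
    collector = PropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


# Hypothetical page fragment mixing two vocabularies.
SAMPLE = """
<div typeof="vcard:VCard">
  <span property="vcard:fn">Example Restaurant</span>
  <span property="vcard:street-address">1 Example Street</span>
  <meta property="ex:cuisine" content="Thai">
</div>
"""

if __name__ == "__main__":
    for prop, value in extract_properties(SAMPLE):
        print(prop, "=", value)
```

A microformat, by contrast, would need its own rules for which class names to look for and how to interpret them, so each new format means new parsing code.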
|Editor’s Note: The author of this article works for Yahoo!, suppliers of the SearchMonkey technology.|
Closing the Loop
NLP and semantic web approaches are complementary because implicit metadata extracted from text and explicit metadata provided by publishers only differ in the way they are obtained. Why not take the best of both worlds and use highly precise, human controlled information when available, and fall back to automated methods where such information is partial or missing?
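One way to picture this fallback is a merge in which publisher-supplied values take precedence over automatically extracted ones. The Python sketch below is only an illustration under assumed data shapes: both inputs are plain dictionaries from property name to value, and the function name `merge_metadata` is hypothetical. It is not a description of how any particular search engine combines the two sources.

```python
def merge_metadata(explicit, extracted):
    """Combine two metadata dictionaries, preferring explicit values.

    Returns a dictionary mapping each property to a (value, source) pair,
    where source is "explicit" or "extracted".
    """
    # Start from the automatically extracted values.
    merged = {prop: (value, "extracted") for prop, value in extracted.items()}
    # Override with publisher-provided values wherever they are non-empty.
    for prop, value in explicit.items():
        if value:
            merged[prop] = (value, "explicit")
    return merged


if __name__ == "__main__":
    explicit = {"name": "Example Restaurant", "phone": ""}
    extracted = {"name": "Example Restaurnt", "phone": "555-0100", "city": "Springfield"}
    print(merge_metadata(explicit, extracted))
```

Keeping the source next to each value lets a consumer treat extracted values with more caution than explicit ones.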
Yahoo’s SearchMonkey platform allows for the integration of NLP and semantic web approaches. SearchMonkey provides motivation for publishers to open up metadata through a compelling use case, which is the ability to modify the presentation of search results through plug-ins to the search interface.
Content providers can either annotate the content themselves or turn to providers such as OpenCalais, Zemanta, or Dapper to semantify their pages. Third-party developers can create custom data services to extract information from the web page. A custom data service can be:
- An XSLT stylesheet
- An external service that performs an arbitrarily complex transformation
- A service that combines information from the page with information from an external API (for example, associating data with Flickr pages by calling the Flickr API)
Regardless of the source, all metadata in SearchMonkey is represented internally in a triple format and made available to applications in an XML serialization called dataRSS that is also RDFa compatible. SearchMonkey application developers can combine all metadata associated with a web page to develop the above mentioned plug-ins to the search interface. Applications compete in that users of Yahoo! Search implicitly vote by choosing the applications they like to enable as part of their search configuration.
Semantic technology was crucial in realizing the open design of SearchMonkey. With many existing metadata platforms—such as Google Base—users are limited by technology to a small number of object types (housing, jobs, products, etc.) when providing information. Similarly, the attributes of those objects are also limited and prescribed by the platform provider (that is, Google).
In SearchMonkey, RDF made it possible to separate syntax and vocabulary: publishers are free to use any vocabulary, which opens up the system to support the long tail of web content. In contrast to XML technologies, vocabularies described in RDF or OWL can be easily extended by anyone without breaking interoperability; applications can safely work on the part of the metadata they understand and ignore the rest. Further, in a distributed knowledge representation scenario, metadata is kept and maintained locally, and thus publishers are no longer locked into a centralized system. Relying on open standards means that both developers and publishers can rely on compatible tools to manage their metadata.
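The idea that an application works on the metadata it understands and ignores the rest can be sketched in a few lines of Python. The triples, the `ex:wheelchair-access` property, and the `render_known` function are hypothetical and only illustrate the principle; they are not part of SearchMonkey.

```python
# Properties this hypothetical application knows how to display.
LABELS = {
    "vcard:fn": "Name",
    "vcard:street-address": "Address",
}


def render_known(triples, labels):
    """Return (label, value) rows for the predicates listed in `labels`.

    Triples with any other predicate are skipped rather than treated as
    errors, so a publisher can extend the vocabulary without breaking
    this consumer.
    """
    rows = []
    for _subject, predicate, value in triples:
        label = labels.get(predicate)
        if label is not None:
            rows.append((label, value))
    return rows


# A publisher has added a property the application has never seen.
TRIPLES = [
    ("_:r", "vcard:fn", "Example Restaurant"),
    ("_:r", "ex:wheelchair-access", "yes"),
    ("_:r", "vcard:street-address", "1 Example Street"),
]

if __name__ == "__main__":
    for label, value in render_known(TRIPLES, LABELS):
        print(f"{label}: {value}")
```

Because each statement is an independent triple, dropping the unknown one leaves the remaining data intact. This is harder to guarantee with a fixed XML schema, where an unexpected element can make the whole document invalid.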
On occasion, semantic web technologies have been simplified to support the average web developer who has some familiarity with XML standards and a working knowledge of PHP. An example of such simplification was deciding which query language to use. SPARQL, the de facto standard in the semantic web, is clearly overly complicated for the average developer, and more complex than the task required. Thus, the choice was made for an XPath-like query language, similar to the Fresnel Selector Language (FSL). Listing 1 shows a typical snippet of RDF data that a SearchMonkey application would receive as input. If you want the application to display the address of the restaurant, use this:
$ret['dict']['key'] = 'Address';
$ret['dict']['value'] = Data::get('com.yahoo.search.rdfa/vcard:VCard/vcard:adr/vcard:street-address');
After defining other properties of the display (image, links, further key-value pairs), the resulting user experience looks like Figure 1 (see circled listing).
|Figure 1. Defined Properties: After defining your properties, your listing should appear like the listing that is circled in the figure.|
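To give a feel for how a path expression like the one in the snippet above might be evaluated, here is a minimal Python sketch over a set of triples. The interpretation is an assumption made for illustration: the first path segment names a data source, the second names an RDF type, and each remaining segment follows one property. The actual semantics of `Data::get` may differ, and the `get_path` function and the sample data are hypothetical.

```python
def get_path(sources, path):
    """Resolve 'source/Type/property/...' and return the first match, or None.

    `sources` maps a source name to a list of (subject, predicate, object)
    triples. This is an assumed reading of the path syntax, not the
    documented behavior of any real API.
    """
    source_name, type_name, *predicates = path.split("/")
    triples = sources.get(source_name, [])

    # Start from every node declared to be of the requested type.
    nodes = [s for s, p, o in triples if p == "rdf:type" and o == type_name]

    # Follow each property in turn, moving from subjects to objects.
    for predicate in predicates:
        nodes = [o for s, p, o in triples if p == predicate and s in nodes]

    return nodes[0] if nodes else None


# Hypothetical metadata for one restaurant page.
DATA = {
    "com.yahoo.search.rdfa": [
        ("_:card", "rdf:type", "vcard:VCard"),
        ("_:card", "vcard:fn", "Example Restaurant"),
        ("_:card", "vcard:adr", "_:addr"),
        ("_:addr", "vcard:street-address", "1 Example Street"),
    ]
}

if __name__ == "__main__":
    print(get_path(
        DATA,
        "com.yahoo.search.rdfa/vcard:VCard/vcard:adr/vcard:street-address",
    ))
```

A path that matches nothing returns `None` instead of raising an error, which fits the goal of staying robust when publisher metadata is incomplete.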
While the semantic web has had many successes since its conception over 10 years ago, most of the results have been demonstrated in relatively narrow expert domains, where the inferential power of ontology languages brought a clear benefit, or within the enterprise, where the additional levels of control allowed sophisticated methods of metadata management to be introduced in a top-down manner. However, this seems to be the year when the semantic web is becoming a reality on the public web as publishers become familiar with technologies such as RDFa, and developers start to realize the first applications that make use of RDF data distributed across the web.
Building out the semantic web is as much a social process as a technical one; thus, even if the technologies are by and large ready, there is still an important role for the community to play. One example is fostering agreements around vocabularies. While there are vocabularies for many domains, the overall coverage of semantic web vocabularies is at best partial, and the quality varies. In practice, users are still wondering where they would find a particular vocabulary to fit their need, and how they should choose among the existing ones. It is also certain that basic technologies will need to be revisited over time. Although using RDF-based technologies has become a lot easier, the arrival of large numbers of publishers will no doubt further lower the quality of metadata on the web, which means that developers will need tools for dealing with inconsistencies.
As the SearchMonkey query language example above shows, the trend is toward sacrificing some expressive power in return for usability and robustness in the face of mistakes. In the long term, we also hope to see further convergence between the HTML Web (including embedded metadata) and the Web of Linked Data. For publishers, choosing between publishing embedded metadata (microformats, RDFa) and publishing linked data is a difficult and confusing process at the moment. Hopefully, a clearer understanding of the relationship between embedded metadata and linked data will emerge as these technologies are put to use at larger and larger scales.