Semantic search has attracted a lot of attention in the past year, largely due to the growth of the semantic web as a whole. The term semantic search itself is popular enough to be considered overused. On the one hand, it refers to searching large semantic web datasets, a typical problem for semantic web search engines such as Swoogle and Watson. On the other hand, it refers to methods of searching web documents that go beyond the syntactic level of matching keywords. This article discusses semantic search in this second sense.
The current generation of search engines is severely limited in its understanding of the user's intent, and of the web's content, and consequently in matching information needs with the vast supply of resources on the web. For Information Retrieval (IR) purposes, both queries and documents are typically treated at the level of words or n-grams, with minimal language processing involved. The search engine has no semantic-level understanding of the query or the content; it can only approximate what a document is about by picking out its most commonly occurring or emphasized (for example, underlined) words.
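This word-level treatment can be made concrete with a toy sketch. The documents and scoring function below are invented for illustration; they show how a purely syntactic engine ranks by literal term overlap and can entirely miss a document that describes the same thing in different words.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real engines also stem and drop stopwords.
    return text.lower().split()

def score(query, document):
    # Bag-of-words overlap: count how many query terms literally occur in the document.
    doc_terms = Counter(tokenize(document))
    return sum(doc_terms[t] for t in tokenize(query))

docs = [
    "Peter is a researcher studying the semantic web at Yahoo",      # shares query words
    "He works on ontologies and metadata at a large internet company",  # same topic, no shared words
]

query = "semantic web researcher working for yahoo"
print([score(query, d) for d in docs])  # the second document scores zero
```

Both documents describe the same kind of person, but only the one that happens to repeat the query's surface forms is retrieved; the engine has no notion that ontologies and metadata are semantic web topics.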
Even though search is generally considered a mature, functioning technology, a syntax-based approach has clear limits. The following examples illustrate some of them.
- It is almost impossible to return search results that relate to the secondary sense of a term—especially if
a dominant sense exists—for example, try searching for George Bush the beer brewer as compared to the President.
- The capabilities of computational advertising, which is largely also an IR problem (for example, retrieving matching ads from a fixed inventory), are clearly impacted: the inventory of advertisements is sparse, so syntactic matching often fails to retrieve a relevant ad.
- Search engines are unable to answer queries that describe an object rather than naming it, when no clear keyword exists. For example, try searching for the author of this article with the keywords ‘semantic web researcher working for yahoo.’
- Current search technology is unable to satisfy any complex queries requiring information integration such as analysis,
prediction, scheduling, etc. An example of such integration-based tasks is opinion mining regarding products or
services. While there have been some successes in opinion mining with pure sentiment analysis, it is often the case
that users like to know what specific aspects of a product or service are being described in positive or negative
terms and to have the search results appear aggregated and organized. Information integration is not possible without
structured representations of content.
- Multimedia queries are also difficult to answer, as multimedia objects are typically described with only a few
keywords (tagging) or sentences. This is typically too little text for the statistical methods of IR to be effective.
Semantic search—defined as IR with the capabilities to understand the user's intent and the web's content at a
much deeper, conceptual level—can address these limitations.
Two Roads to Semantic Search
There are two approaches to semantic search, and both have received attention in recent months. The first builds on the automatic analysis of text using Natural Language Processing (NLP). The second uses semantic web technologies, which aim to make the web more easily searchable by letting publishers expose their (meta)data.
Natural Language Processing
While NLP technology has been researched for decades, the processing power it requires has prohibited large-scale applications. Moore's law, however, is on the side of NLP.
Companies such as PowerSet (which was recently purchased by Microsoft) and Hakia are rising quickly. Their systems extract entities from text, disambiguate them against large-scale background knowledge sources (PowerSet uses Freebase, Hakia has its own ontology), and then record the relationships as found in the text. PowerSet and Hakia also allow users to query using full questions, although most users still prefer to type in keywords, which their engines support as well.
|Author's Note: Information extraction is easier with web sites that follow certain templates (typically, web sites
generated from a database). In these cases, special methods (so-called wrappers) can be developed or automatically
trained to extract the information with very high precision.|
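A wrapper of the kind the note describes can be as simple as an extraction pattern tied to one site's template. The sketch below assumes a hypothetical product page layout; the regular expression plays the role of a wrapper written (or trained) for that one template, which is why it achieves high precision there and nowhere else.

```python
import re

# Hypothetical template: every product page on this imagined site renders
# its data in the same fixed HTML structure.
page = """
<div class="product">
  <span class="name">Acme Anvil</span>
  <span class="price">19.99</span>
</div>
"""

# The wrapper is a site-specific extraction rule keyed to that template.
WRAPPER = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">(?P<price>[^<]+)</span>'
)

match = WRAPPER.search(page)
record = {"name": match.group("name"), "price": float(match.group("price"))}
print(record)
```

Applied to any page generated from the same database template, the rule extracts clean records; applied to a differently structured page, it simply fails to match, which is the precision/coverage trade-off of the wrapper approach.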
Publishers can avoid the costs and quality issues associated with NLP if they are willing to expose their metadata directly. Most publishers are willing to invest some time and effort in exposing structured data if they see an improvement in their traffic.
Exposing metadata is the starting point for a semantic search approach associated with the semantic web. The semantic
web targets the entire web, including documents, databases, and services. The most important recent development is in
the area of embedding metadata directly into web documents. Compared to
microformats, which are limited to describing the most common types of information (persons, events, and so on), the new RDFa standard (a candidate recommendation from the W3C) can encode metadata in HTML using any RDF or OWL vocabulary. This decoupling of syntax and semantics is crucial from the perspective of search engine providers: support for RDFa needs to be implemented only once, whereas each microformat requires separate parsing of the HTML page. An alternative is GRDDL, which lets the publisher attach an arbitrary transformation to the page that 'converts' the information to RDF. GRDDL is more complicated for search engine providers to support, in that running arbitrary XSLT stylesheets on HTML pages carries certain risks.
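To make the idea of embedded metadata concrete, the sketch below shows a page annotated with RDFa-style attributes and extracts the property/content pairs using Python's standard html.parser. The page, its URL, and the author name are invented for illustration; the properties follow the Dublin Core vocabulary, and a real consumer would use a full RDFa parser rather than this minimal extractor.

```python
from html.parser import HTMLParser

# A hypothetical page annotated with RDFa: the Dublin Core vocabulary is
# declared once on the container, then reused via 'property' attributes.
PAGE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="/article/semantic-search">
  <span property="dc:title">A Look at Semantic Search</span>
  <span property="dc:creator">Jane Doe</span>
</div>
"""

class RdfaPropertyExtractor(HTMLParser):
    """Collects (property, text) pairs from elements carrying an RDFa 'property' attribute."""
    def __init__(self):
        super().__init__()
        self.properties = {}
        self._pending = None  # property name awaiting its text content

    def handle_starttag(self, tag, attrs):
        self._pending = dict(attrs).get("property")

    def handle_data(self, data):
        if self._pending and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None

extractor = RdfaPropertyExtractor()
extractor.feed(PAGE)
print(extractor.properties)
```

Because the vocabulary is named in the data itself rather than baked into the parser, the same extraction code works whether the publisher uses Dublin Core, FOAF, or a custom OWL vocabulary; this is the "implement once" property that distinguishes RDFa from per-microformat parsing.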
|Editor's Note: The author of this article works for Yahoo!, suppliers of the SearchMonkey technology.|