Semantic Search Arrives at the Web
The current generation of search engines is severely limited in its understanding of the user's intent and the web's content. Find out how semantic search can address these limitations.  

Semantic search has attracted a lot of attention in the past year, largely due to the growth of the semantic web as a whole. The term semantic search itself is popular enough to be considered overused. The term refers to searching large semantic web datasets, which is a typical problem for semantic web search engines such as Swoogle, Sindice, SWSE, Falcon-S, and Watson. The term also refers to methods of searching web documents beyond the syntactic level of matching keywords. This article discusses semantic search in this second sense.

The current generation of search engines is severely limited in its understanding of the user's intent and the web's content, and consequently in matching the needs for information with the vast supply of resources on the web. For Information Retrieval (IR) purposes, both queries and documents are typically treated at a word or n-gram level, with minimal language processing involved. The search engine is missing a semantic-level understanding of the query or the content and can only understand the content of a document by picking out the most commonly occurring or underlined words.
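To make the word-level treatment concrete, here is a minimal Python sketch of term matching. It is an illustration only and not how any particular search engine ranks results; the function names and the two sample documents are assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical sample documents, invented for illustration.
DOCUMENTS = {
    "president": "George Bush, the president, spoke today. President Bush answered questions.",
    "brewer": "George Bush, a brewer, opened a new brewery.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query, document):
    """Score a document by summing how often each query term occurs in it."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in tokenize(query))


def rank(query):
    """Return document keys ordered from highest to lowest keyword score."""
    return sorted(DOCUMENTS, key=lambda key: keyword_score(query, DOCUMENTS[key]), reverse=True)
```

With these sample documents, the query "george bush" scores the president page higher simply because the word "bush" occurs there more often. The scorer has no notion of which George Bush is meant.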

Even though search is considered a functional technology, there are limits to a syntax-based approach. The following list shows some examples of these limitations.

  • It is almost impossible to return search results that relate to the secondary sense of a term—especially if a dominant sense exists—for example, try searching for George Bush the beer brewer as compared to the President.
  • The capabilities of computational advertising, which is largely also an IR problem (for example, retrieving matching ads from a fixed inventory), are clearly impacted because of the sparsity of advertisements.
  • When no clear key exists, search engines are unable to perform queries on descriptions of objects. For example, try searching for the author of this article with the keywords ‘semantic web researcher working for yahoo.’
  • Current search technology is unable to satisfy any complex queries requiring information integration such as analysis, prediction, scheduling, etc. An example of such integration-based tasks is opinion mining regarding products or services. While there have been some successes in opinion mining with pure sentiment analysis, it is often the case that users like to know what specific aspects of a product or service are being described in positive or negative terms and to have the search results appear aggregated and organized. Information integration is not possible without structured representations of content.
  • Multimedia queries are also difficult to answer, as multimedia objects are typically described with only a few keywords (tagging) or sentences. This is typically too little text for the statistical methods of IR to be effective.
Semantic search—defined as IR with the capabilities to understand the user's intent and the web's content at a much deeper, conceptual level—can address these limitations.

Two Roads to Semantic Search
There are two approaches toward semantic search, and both have received attention in recent months. The first approach builds on the automatic analysis of text using Natural Language Processing (NLP). The second approach uses semantic web technologies, which aim to make the web more easily searchable by allowing publishers to expose their (meta)data.

Natural Language Processing
While NLP technology has been researched for decades, the processing it requires has prohibited large-scale applications. Moore's law, however, is on the side of NLP. Companies such as PowerSet (which was recently purchased by Microsoft) and Hakia are rising quickly. Their systems extract entities from text, disambiguate them against large-scale background knowledge sources (PowerSet uses Freebase, Hakia has its own ontology), and then record the relationships as found in the text. PowerSet and Hakia also allow users to query using full questions, although most users still prefer to type in keywords, which their engines support as well.
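The disambiguation step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used by PowerSet or Hakia: the knowledge base, its entity identifiers, and the context-overlap heuristic are all hypothetical.

```python
import re

# Hypothetical background knowledge: each surface name maps to candidate
# entities, each with a few context words that hint at that sense.
KNOWLEDGE_BASE = {
    "george bush": [
        {"id": "george_bush_president", "context": {"president", "election", "senate"}},
        {"id": "george_bush_brewer", "context": {"brewer", "beer", "brewery"}},
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(surface, text):
    """Pick the candidate entity whose context words overlap most with the text.

    Returns None when the surface name is not in the knowledge base.
    On a tie, the candidate listed first wins.
    """
    candidates = KNOWLEDGE_BASE.get(surface.lower(), [])
    if not candidates:
        return None
    words = set(tokenize(text))
    best = max(candidates, key=lambda candidate: len(candidate["context"] & words))
    return best["id"]
```

For example, `disambiguate("George Bush", "George Bush opened a brewery and sells beer")` selects the brewer entry because two of its context words appear in the sentence. A production system would draw on a far larger knowledge source and richer linguistic features.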

Author's Note: Information extraction is easier with web sites that follow certain templates (typically, web sites generated from a database). In these cases, special methods (so called wrappers) can be developed or automatically trained to extract the information with very high precision.
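A hand-written wrapper for a templated site might look like the following Python sketch. The HTML template, the class names, and the `extract_products` function are assumptions invented for illustration; a real wrapper would be written or trained against the actual template of the target site.

```python
import re

# Hypothetical template: every product on the site is rendered from a
# database row into exactly this markup.
ROW_PATTERN = re.compile(
    r'<div class="product">\s*'
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">(?P<price>[0-9]+\.[0-9]{2})</span>\s*'
    r'</div>'
)


def extract_products(html):
    """Return one record per product block that matches the template."""
    return [
        {"name": match.group("name").strip(), "price": float(match.group("price"))}
        for match in ROW_PATTERN.finditer(html)
    ]


SAMPLE_PAGE = """
<div class="product">
  <span class="name">Widget</span>
  <span class="price">9.99</span>
</div>
<div class="product">
  <span class="name">Gadget</span>
  <span class="price">24.50</span>
</div>
"""
```

Because the pattern mirrors the template exactly, it extracts nothing from markup that deviates from it. That is the source of the high precision, and also the reason a wrapper must be updated whenever the site changes its template.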





Semantic Web
Publishers can avoid the costs and quality issues associated with NLP if they are willing to expose their metadata. Most publishers are willing to invest some time and effort in exposing structured data directly if they see improvement to their traffic patterns.

Exposing metadata is the starting point for a semantic search approach associated with the semantic web. The semantic web targets the entire web, including documents, databases, and services. The most important recent development is in the area of embedding metadata directly into web documents. Compared to microformats, which are limited to describing the most common types of information (persons, events, etc.), the new RDFa standard (a candidate recommendation from the W3C) has the capability to encode metadata in HTML using any RDF or OWL vocabulary. The decoupling of syntax and semantics is crucial from the perspective of search engine providers; while support for RDFa needs to be implemented only once, each microformat requires a separate parsing of the HTML page. An alternative is GRDDL, which allows the publisher to attach an arbitrary transformation to the page that 'converts' the information to RDF. GRDDL is more complicated for search engine providers to support in that running arbitrary XSLT stylesheets on HTML pages carries certain risks.
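As a rough illustration of why RDFa support needs to be implemented only once, the Python sketch below collects `property` values from any page regardless of vocabulary, using only the standard library. It is a simplification and not a conforming RDFa processor: it ignores `about`, `typeof`, prefix resolution, and nested elements, and the class name, function name, and sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser


class RDFaPropertyParser(HTMLParser):
    """Collect (property, value) pairs from elements carrying a property attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # (property, tag, text parts) awaiting the end tag

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop is None:
            return
        if attributes.get("content") is not None:
            # An explicit content attribute overrides the element text.
            self.pairs.append((prop, attributes["content"]))
        else:
            self._pending = (prop, tag, [])

    def handle_data(self, data):
        if self._pending is not None:
            self._pending[2].append(data)

    def handle_endtag(self, tag):
        if self._pending is not None and tag == self._pending[1]:
            prop, _, parts = self._pending
            self.pairs.append((prop, "".join(parts).strip()))
            self._pending = None


def extract_properties(html):
    """Parse the HTML and return the property/value pairs in document order."""
    parser = RDFaPropertyParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


SAMPLE_PAGE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h1 property="dc:title">Semantic Search Arrives at the Web</h1>
  <span property="dc:creator" content="Example Author"></span>
</div>
"""
```

The same extractor works for `dc:title`, a FOAF property, or any other vocabulary term, because the syntax is fixed and only the vocabulary varies. A microformat parser, by contrast, has to know the specific class names of each format.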

Editor's Note: The author of this article works for Yahoo!, suppliers of the SearchMonkey technology.
