Semantic Search Arrives at the Web

Semantic search has attracted a lot of attention in the past year, largely due to the growth of the semantic web as a whole. The term itself has become popular enough to be considered overused, and it is used in two different senses. In the first sense, it refers to searching large semantic web datasets, which is the typical problem for semantic web search engines such as Swoogle, Sindice, SWSE, Falcon-S, and Watson. In the second sense, it refers to methods of searching web documents beyond the syntactic level of matching keywords. This article discusses semantic search in this second sense.

The current generation of search engines is severely limited in its understanding of the user’s intent—and the web’s content—and consequently in matching the needs for information with the vast supply of resources on the web. For Information Retrieval (IR) purposes, both queries and documents are typically treated at a word or gram level, with minimal language processing involved. The search engine is missing a semantic-level understanding of the query or the content and can only understand the content of a document by picking out the most commonly occurring or underlined words.

Even though search is considered a functional technology, there are limits to a syntax-based approach. The following list shows some examples of these limitations.

  • It is almost impossible to return search results that relate to the secondary sense of a term—especially if a dominant sense exists—for example, try searching for George Bush the beer brewer as compared to the President.
  • The capabilities of computational advertising, which is largely also an IR problem (for example, retrieving matching ads from a fixed inventory), are clearly impacted because of the sparsity of advertisements.
  • When no clear key exists, search engines are unable to perform queries on descriptions of objects. For example, try searching for the author of this article with the keywords ‘semantic web researcher working for yahoo.’
  • Current search technology is unable to satisfy any complex queries requiring information integration such as analysis, prediction, scheduling, etc. An example of such integration-based tasks is opinion mining regarding products or services. While there have been some successes in opinion mining with pure sentiment analysis, it is often the case that users like to know what specific aspects of a product or service are being described in positive or negative terms and to have the search results appear aggregated and organized. Information integration is not possible without structured representations of content.
  • Multimedia queries are also difficult to answer, as multimedia objects are typically described with only a few keywords (tagging) or sentences. This is typically too little text for the statistical methods of IR to be effective.

Semantic search—defined as IR with the capabilities to understand the user’s intent and the web’s content at a much deeper, conceptual level—can address these limitations.

Two Roads to Semantic Search
There are two approaches toward semantic search, and both have received attention in recent months. The first approach builds on the automatic analysis of text using Natural Language Processing (NLP). The second approach uses semantic web technologies, which aim to make the web more easily searchable by allowing publishers to expose their (meta)data.

Natural Language Processing
While NLP technology has been researched for decades, the processing it requires has prohibited large-scale applications. Moore’s law, however, is on the side of NLP. Companies such as PowerSet (which was recently purchased by Microsoft) and Hakia are rising quickly. Their systems extract entities from text, disambiguate them against large-scale background knowledge sources (PowerSet uses Freebase, Hakia has its own ontology), and then record the relationships as found in the text. PowerSet and Hakia also allow users to query using full questions, although most users still prefer to type in keywords, which their engines support as well.
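The disambiguation step can be illustrated with a deliberately small sketch. It is not how PowerSet or Hakia work internally; it only assumes a hypothetical background knowledge source (`KNOWLEDGE_BASE`, with invented identifiers and context words) and picks the candidate sense whose context words overlap most with the surrounding text.

```python
# Hypothetical background knowledge: one surface form, two candidate senses.
# The identifiers and context words are invented for illustration.
KNOWLEDGE_BASE = {
    "George Bush": [
        {"id": "person/president", "context": {"president", "election", "administration"}},
        {"id": "person/brewer", "context": {"beer", "brewer", "brewery"}},
    ]
}


def disambiguate(mention, text, kb=KNOWLEDGE_BASE):
    """Return the id of the sense whose context words best match the text.

    Returns None for unknown mentions. On a tie, the first listed sense wins,
    which mimics the 'dominant sense' problem of keyword search.
    """
    candidates = kb.get(mention, [])
    if not candidates:
        return None
    words = set(text.lower().split())
    best = max(candidates, key=lambda sense: len(sense["context"] & words))
    return best["id"]


sense = disambiguate("George Bush", "George Bush opened a brewery and sells beer")
print(sense)
```

A production system would use far richer evidence (entity types, link structure, statistics over large corpora), but the shape of the problem is the same: map a mention in text to one entry in a knowledge source.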

Author’s Note: Information extraction is easier with web sites that follow certain templates (typically, web sites generated from a database). In these cases, special methods (so-called wrappers) can be developed or automatically trained to extract the information with very high precision.

 

Semantic Web
Publishers can avoid the costs and quality issues associated with NLP if they are willing to expose their metadata. Most publishers are willing to invest some time and effort in exposing structured data directly if they see improvement to their traffic patterns.

Exposing metadata is the starting point for a semantic search approach associated with the semantic web. The semantic web targets the entire web, including documents, databases, and services. The most important recent development is in the area of embedding metadata directly into web documents. Compared to microformats, which are limited to describing the most common types of information (persons, events, etc.), the new RDFa standard (a candidate recommendation from the W3C) has the capability to encode metadata in HTML using any RDF or OWL vocabulary. The decoupling of syntax and semantics is crucial from the perspective of search engine providers; while support for RDFa needs to be implemented only once, each microformat requires a separate parsing of the HTML page. An alternative is GRDDL, which allows the publisher to attach an arbitrary transformation to the page that ‘converts’ the information to RDF. GRDDL is more complicated for search engine providers to support in that running arbitrary XSLT stylesheets on HTML pages carries certain risks.
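To make the idea of embedded metadata concrete, the following sketch reads a few RDFa-style attributes (`about`, `typeof`, `property`) from a hypothetical page fragment and turns them into triples. It is a toy collector built on Python's standard `html.parser`, not a conformant RDFa processor: it ignores prefix resolution, `rel`/`rev`, `content`, and the nesting rules of the specification. The page content and the `extract_triples` helper are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: a restaurant described with RDFa-style attributes.
PAGE = """
<div xmlns:vcard="http://www.w3.org/2006/vcard/ns#" about="#restaurant" typeof="vcard:VCard">
  <span property="vcard:fn">Example Bistro</span>
  <span property="vcard:street-address">1 Example Street</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collects (subject, predicate, object) triples from a few RDFa attributes."""

    def __init__(self):
        super().__init__()
        self.subject = None   # value of the most recent `about` attribute
        self.pending = None   # property whose text content we are waiting for
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "typeof" in attrs:
            self.triples.append((self.subject, "rdf:type", attrs["typeof"]))
        if "property" in attrs:
            self.pending = attrs["property"]

    def handle_data(self, data):
        # The element's text becomes the object of the pending property.
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


def extract_triples(html):
    collector = PropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.triples


for triple in extract_triples(PAGE):
    print(triple)
```

The point of the sketch is the decoupling described above: the collector knows only the attribute syntax, and nothing about the vCard vocabulary it happens to encounter.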

Editor’s Note: The author of this article works for Yahoo!, suppliers of the SearchMonkey technology.

Closing the Loop
NLP and semantic web approaches are complementary because implicit metadata extracted from text and explicit metadata provided by publishers only differ in the way they are obtained. Why not take the best of both worlds and use highly precise, human-controlled information when available, and fall back to automated methods where such information is partial or missing?
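The fallback strategy can be sketched in a few lines. The field names and the `merge_metadata` helper below are hypothetical; the sketch only assumes that both kinds of metadata have already been reduced to simple key-value pairs.

```python
def merge_metadata(explicit, extracted):
    """Prefer publisher-supplied values; use extracted values to fill the gaps."""
    merged = dict(extracted)
    # Only non-empty explicit values override what was extracted automatically.
    merged.update({key: value for key, value in explicit.items() if value})
    return merged


# Hypothetical inputs: the publisher supplied a name but left the phone number empty.
publisher_metadata = {"name": "Example Bistro", "phone": ""}
extracted_metadata = {"name": "example bistro - home", "phone": "555-0100", "cuisine": "French"}

print(merge_metadata(publisher_metadata, extracted_metadata))
```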

Yahoo’s SearchMonkey platform allows for the integration of NLP and semantic web approaches. SearchMonkey provides motivation for publishers to open up metadata through a compelling use case, which is the ability to modify the presentation of search results through plug-ins to the search interface.

Content providers can either annotate their content themselves or turn to providers such as OpenCalais, Zemanta, or Dapper to semantify their pages. Third-party developers can also create custom data services to extract information from web pages. A custom data service can take one of the following forms:

  • An XSLT stylesheet
  • An external service that performs an arbitrarily complex transformation
  • A combination of information from the page and from an external API (for example, associating data with Flickr pages by calling the Flickr API)

Regardless of the source, all metadata in SearchMonkey is represented internally in a triple format and made available to applications in an XML serialization called dataRSS that is also RDFa compatible. SearchMonkey application developers can combine all metadata associated with a web page to develop the plug-ins to the search interface mentioned above. Applications compete with one another in that users of Yahoo! Search implicitly vote by choosing which applications to enable as part of their search configuration.
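A uniform triple representation is what makes combining sources straightforward. The sketch below is not SearchMonkey's internal format or dataRSS; it simply assumes each source yields (subject, predicate, object) tuples and groups them per subject while remembering where each value came from. The source names and values are hypothetical.

```python
from collections import defaultdict


def combine(sources):
    """Group triples from several sources by subject and predicate.

    `sources` maps a source name to a list of (subject, predicate, object)
    tuples. The result maps subject -> predicate -> [(object, source), ...].
    """
    by_subject = defaultdict(lambda: defaultdict(list))
    for source, triples in sources.items():
        for subject, predicate, obj in triples:
            by_subject[subject][predicate].append((obj, source))
    # Convert to plain dicts so missing keys raise instead of being created.
    return {s: dict(predicates) for s, predicates in by_subject.items()}


# Hypothetical metadata for one page from two sources.
sources = {
    "embedded-rdfa": [("page", "dc:title", "Example Bistro")],
    "custom-service": [("page", "dc:title", "Bistro"), ("page", "vcard:tel", "555-0100")],
}
combined = combine(sources)
print(combined["page"]["dc:title"])
```

Keeping the source next to each value lets an application decide for itself which source to trust when values disagree.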

Semantic technology was crucial in realizing the open design of SearchMonkey. With many existing metadata platforms—such as Google Base—users are limited by technology to a small number of object types (housing, jobs, products, etc.) when providing information. Similarly, the attributes of those objects are also limited and prescribed by the platform provider (that is, Google).

In SearchMonkey, RDF made it possible to separate syntax and vocabulary: publishers are free to use any vocabulary, which opens up the system to support the long tail of web content. In contrast to XML technologies, vocabularies described in RDF or OWL can be easily extended by anyone without breaking interoperability; applications can safely work on the part of the metadata they understand and ignore the rest. Further, in a distributed knowledge representation scenario, metadata is kept and maintained locally, and thus publishers are no longer locked into a centralized system. Building on open standards also means that both developers and publishers can use compatible tools to manage their metadata.
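The "work on what you understand, ignore the rest" behavior is easy to picture with triples. In this sketch, the set of known prefixes and the sample data are assumptions; the application keeps the statements whose predicate belongs to a vocabulary it knows and silently skips everything else, so a publisher's extensions never break it.

```python
# Vocabularies this hypothetical application understands.
KNOWN_PREFIXES = {"vcard", "dc"}


def understood(triples, known=KNOWN_PREFIXES):
    """Keep only triples whose predicate uses a known vocabulary prefix."""
    return [t for t in triples if t[1].split(":", 1)[0] in known]


# Hypothetical page metadata mixing a known vocabulary with a publisher extension.
triples = [
    ("#restaurant", "vcard:fn", "Example Bistro"),
    ("#restaurant", "myvocab:dressCode", "casual"),  # unknown to the application
    ("#restaurant", "dc:description", "A small French bistro"),
]

for triple in understood(triples):
    print(triple)
```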

On occasion, semantic web technologies have been simplified to support the average web developer who has some familiarity with XML standards and a working knowledge of PHP. An example of such simplification was deciding which query language to use. SPARQL, the de facto standard in the semantic web, is clearly overly complicated for the average developer, and more complex than the task required. Thus, the choice was made for an XPath-like query language, similar to the Fresnel Selector Language (FSL). Listing 1 shows a typical snippet of RDF data that a SearchMonkey application would receive as input. If you want the application to display the address of the restaurant, use this:

    $ret['dict'][1]['key'] = 'Address';
    $ret['dict'][1]['value'] = Data::get('com.yahoo.search.rdfa/vcard:VCard/vcard:adr/vcard:street-address');

After defining other properties of the display (image, links, further key-value pairs), the resulting user experience looks like Figure 1 (see circled listing).

Figure 1. Defined Properties: After defining your properties, your listing should appear like the listing that is circled in the figure.
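One way to read the path passed to `Data::get` in the snippet above is as a walk over triples: pick a data source, start from resources of a given type, then follow one property per path segment. The Python sketch below implements that reading over hypothetical data. It is an assumption about the semantics made for illustration, not SearchMonkey's actual implementation, and the `get` function and sample triples are invented.

```python
def get(data, path):
    """Evaluate a path of the form source/Type/property/property/...

    `data` maps a source name to a list of (subject, predicate, object)
    triples. Returns the first matching value, or None if nothing matches.
    """
    source, type_name, *properties = path.split("/")
    triples = data.get(source, [])
    # Start from every subject declared to be of the requested type.
    nodes = [s for s, p, o in triples if p == "rdf:type" and o == type_name]
    # Follow each property in turn; the objects reached become the new nodes.
    for prop in properties:
        nodes = [o for s, p, o in triples if p == prop and s in nodes]
    return nodes[0] if nodes else None


# Hypothetical metadata for a restaurant page.
DATA = {
    "com.yahoo.search.rdfa": [
        ("_:card", "rdf:type", "vcard:VCard"),
        ("_:card", "vcard:adr", "_:addr"),
        ("_:addr", "vcard:street-address", "1 Example Street"),
    ]
}

print(get(DATA, "com.yahoo.search.rdfa/vcard:VCard/vcard:adr/vcard:street-address"))
```

Compared with a full graph query language, a path like this can only follow one chain of properties, which is the trade of expressive power for simplicity described above.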

Going Forward
While the semantic web has had many successes since its conception over 10 years ago, most of the results have been demonstrated in relatively narrow expert domains, where the inferential power of ontology languages brought a clear benefit, or within the enterprise, where the additional levels of control allowed sophisticated methods of metadata management to be introduced in a top-down manner. However, this seems to be the year when the semantic web is becoming a reality on the public web as publishers become familiar with technologies such as RDFa, and developers start to realize the first applications that make use of RDF data distributed across the web.

Building out the semantic web is as much a social process as a technical one; thus, even if the technologies are by and large ready, there is still an important role for the community to play. One example is fostering agreements around vocabularies. While there are vocabularies for many domains, the overall coverage of semantic web vocabularies is at best partial, and the quality varies. In practice, users are still wondering where to find a vocabulary that fits their need and how to choose among the existing ones. It is also certain that basic technologies will need to be revisited over time. Although using RDF-based technologies has become a lot easier, the arrival of large numbers of publishers will no doubt push the quality of metadata on the web further downward, which means that developers will need tools for dealing with inconsistencies.

As the example of the SearchMonkey query language above shows, the trend is toward sacrificing some expressive power in return for usability and robustness in the face of mistakes. In the long term, we also hope to see further convergence between the HTML Web (including embedded metadata) and the Web of Linked Data. For publishers, choosing between publishing embedded metadata (microformats or RDFa) and publishing linked data is a difficult and confusing process at the moment. Hopefully, a clearer understanding of the relationship between embedded metadata and linked data will emerge as these technologies are put to use at larger and larger scales.

devx-admin
