Getting Started with OpenCalais and SearchMonkey

Many digital documents reside on the web and other networks, but few of them have sufficient metadata to accurately identify their content. Adding metadata to a document has typically been the burden of the document author, and few authors take the time to create detailed metadata for their documents, relying instead on editing software to create metadata such as the author’s name and modification history. However, few document editors can create metadata describing the document’s content.

The method for creating metadata may be changing with the public release of OpenCalais. OpenCalais is a public web service that integrates with applications, such as authoring tools, to create metadata about unstructured English document content (with other languages to follow later). Even documents that are created without metadata can benefit from the OpenCalais web service.

The OpenCalais web service receives unstructured English content, identifies the named entities, facts, and events buried in the text, and extracts them into metadata. You can use this metadata extraction for auto-tagging weblogs and indexing documents.

Auto Tagging Blog Entries

 
Figure 1. The Tagaroo Plugin: The Tagaroo plugin provides a list of suggested tags based on your content.

Imagine writing your next blog post and, before you upload it, having your editor suggest a few images to post along with it. Using the OpenCalais Tagaroo plugin for WordPress, you can get a list of suggested tags to choose from and view relevant pictures to send with your post.

With Tagaroo installed, you first write your blog as usual, then Tagaroo—connecting to OpenCalais—provides a list of suggested tags based on the content (see Figure 1). Tagaroo then uses the tags you select to search Flickr for related images (not shown).

OpenCalais HTTP Interface
OpenCalais uses HTTP to post encoded content. To post content, you first encode it into an application/x-www-form-urlencoded POST request, using the following Java command:

URLEncoder.encode(parameter, "UTF-8");

You can then create a POST request in Java as shown here:

// Build the form body; each value must already be URL-encoded.
StringBuilder sb = new StringBuilder(content.length() + 1024);
sb.append("licenseID=").append(licenseID);
sb.append("&content=").append(content);
sb.append("&XML=").append(additionalParameters);

// Open the connection and declare the body type and length.
URLConnection connection = new URL(API_URL).openConnection();
connection.addRequestProperty("Content-Type", "application/x-www-form-urlencoded");
connection.addRequestProperty("Content-Length", String.valueOf(sb.length()));
connection.setDoOutput(true);

// Send the body.
OutputStream out = connection.getOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(out);
writer.write(sb.toString());
writer.flush();
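
The listing above appends its values directly, so it only produces a valid request if every value has been encoded first. As a minimal sketch of that step, the helper below encodes each value and assembles the body in one place. The class and method names are hypothetical; the field names (licenseID, content, XML) are copied from the listing above rather than checked against the OpenCalais documentation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class CalaisFormBody {
    // Encodes one form value. UTF-8 is always available on the JVM,
    // so the checked exception is not expected in practice.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Field names are taken from the article's listing (an assumption about the API).
    static String build(String licenseID, String content, String additionalParameters) {
        return "licenseID=" + enc(licenseID)
             + "&content=" + enc(content)
             + "&XML=" + enc(additionalParameters);
    }

    public static void main(String[] args) {
        System.out.println(build("my-key", "Halifax Comedy Festival & friends", "<params/>"));
    }
}
```

Because the encoded body contains only ASCII characters, its character count equals its byte count, which is what the Content-Length header needs.
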
Embedding Metadata
You can also use OpenCalais to add metadata to existing web pages. A web page would be much more useful for machine processing if the people and organizations mentioned in the page were already extracted into header micro-formats. Search engines such as Yahoo! could then use this information to show more useful and visually appealing search results.

In PHP, you can include the OpenCalais Marmoset package to generate the metadata for every page. Including a PHP Marmoset header and footer in the page content injects metadata that selected search engines can use to index the content of the document more accurately.

The additional parameters include the requested encoded response format. For example, to receive RDF/XML, you may pass parameters like this:

					

The successful response of the HTTP services is encoded in an XML string. It can be decoded as follows:

builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(connection.getInputStream());
return new StringReader(doc.getDocumentElement().getTextContent());
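
To try the decoding step without a live connection, the same calls can be run against a string. This is a sketch only: the class name is hypothetical, and the sample wrapper used in the usage note is an assumption about the response shape, not a captured OpenCalais response.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CalaisResponseDecoder {
    // Parses the outer XML wrapper and returns the text content of its root
    // element, which is where the escaped payload is expected to sit.
    static String unwrap(String responseXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(responseXml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Response is not well-formed XML", e);
        }
    }
}
```

For example, unwrap("<string>&lt;rdf:RDF/&gt;</string>") yields the unescaped inner document, which can then be handed to an RDF parser.
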

Every document is assigned a unique docId for the DocInfo resource in the RDF/XML response. The response also includes OpenCalais processing metadata in a DocInfoMeta. However, the interesting parts are the InstanceInfo resources. These resources contain the docId, subject, and context of a named entity, fact, or event. By extracting the subjects referenced by the docId, through the InstanceInfo resources, you can identify what is referenced in the unstructured content of the original document.
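
As a rough illustration of reading the InstanceInfo resources, the sketch below pulls (docId, subject) pairs out of an RDF/XML string using only the JDK's DOM parser. The predicate namespace and the assumption that docId and subject appear as child elements carrying an rdf:resource attribute are guesses about the serialization, not something this article confirms; a real application would more likely load the response into an RDF store and query it there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class InstanceInfoReader {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    // ASSUMPTION: namespace of the docId and subject predicates.
    static final String PRED_NS = "http://s.opencalais.com/1/pred/";

    // Returns one {docId, subject} pair for each resource that carries both.
    static List<String[]> docSubjectPairs(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rdfXml)));
            List<String[]> pairs = new ArrayList<>();
            NodeList subjects = doc.getElementsByTagNameNS(PRED_NS, "subject");
            for (int i = 0; i < subjects.getLength(); i++) {
                Element subject = (Element) subjects.item(i);
                if (!(subject.getParentNode() instanceof Element)) continue;
                Element instance = (Element) subject.getParentNode();
                NodeList docIds = instance.getElementsByTagNameNS(PRED_NS, "docId");
                if (docIds.getLength() == 0) continue; // no document reference here
                Element docId = (Element) docIds.item(0);
                pairs.add(new String[] {
                    docId.getAttributeNS(RDF_NS, "resource"),
                    subject.getAttributeNS(RDF_NS, "resource")
                });
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read RDF/XML", e);
        }
    }
}
```
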

Once decoded, interesting information can be extracted from the response metadata. You can use the subjects found in the document to auto-tag the content, search for other relevant information, index the document’s metadata for later use, or simply embed the metadata within the document.

Indexing Documents
OpenCalais can also be useful to content managers to create smart indexes. Instead of indexing by keywords, you can index by referenced subject. If you have a collection of unstructured documents, in a website for example, you can use OpenCalais to help manage and reference them together. By using the OpenCalais API, a website’s side navigation bar can suggest other related documents based on the conceptual subject, instead of word matching as is used by most indexes.
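
To make the idea of indexing by subject concrete, here is a small in-memory sketch of how a navigation bar might pick related documents. It is not part of OpenCalais or Sesame; the class is hypothetical and assumes you have already obtained (document, subject) pairs from the service. Documents are ranked by how many subjects they share with the current one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SubjectIndex {
    private final Map<String, Set<String>> docsBySubject = new HashMap<>();
    private final Map<String, Set<String>> subjectsByDoc = new HashMap<>();

    // Records that a document mentions a subject.
    void add(String docId, String subject) {
        docsBySubject.computeIfAbsent(subject, k -> new TreeSet<>()).add(docId);
        subjectsByDoc.computeIfAbsent(docId, k -> new TreeSet<>()).add(subject);
    }

    // Returns the other documents that share at least one subject with docId,
    // ordered by number of shared subjects (ties in alphabetical order).
    List<String> related(String docId) {
        Map<String, Integer> shared = new TreeMap<>();
        for (String subject : subjectsByDoc.getOrDefault(docId, Collections.emptySet())) {
            for (String other : docsBySubject.get(subject)) {
                if (!other.equals(docId)) {
                    shared.merge(other, 1, Integer::sum);
                }
            }
        }
        List<String> result = new ArrayList<>(shared.keySet());
        result.sort((a, b) -> shared.get(b) - shared.get(a)); // stable sort keeps tie order
        return result;
    }
}
```

A keyword index would treat two pages as related only if they share words; this index treats them as related when the service assigned them the same subject, even if the wording differs.
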

By taking the RDF/XML document returned by the OpenCalais HTTP interface and storing it in a Sesame RDF store, you can enable an application to find documents related to anything in the RDF store. To create a new Sesame native RDF store, or open an existing one, use the following code:

Sample OpenCalais Indexing Firefox Plugin
Even if you do not manage your own website, you can still begin indexing documents. Included in this article is an example Firefox plugin that uses OpenCalais to index visited web pages in an embedded Firefox browser. You must first download and save the xpi file, then open it in Mozilla Firefox to install it. After installing the plugin and restarting Firefox, you can start the plugin by selecting View > Sidebar > Related Pages. While this sidebar is open, any web pages you visit are indexed into an internal Sesame RDF store. At the same time, documents with any overlapping subjects are included in the list on the left.

To test the abilities of OpenCalais and this plugin, try navigating to a popular news item of the day from two different news sources and see if OpenCalais can recognise the common subjects between them. The plugin (as implemented) passes all text found in the web page, which usually includes a lot of noise (from sidebars and repeated website information). This can cause a large number of false positives, particularly from within the same website.

To modify the behaviour of the plugin, simply open it with a ZIP archive manager, such as WinZip, and extract the files into a directory to reveal the JavaScript and Java source code used.

repository = new SailRepository(new NativeStore(dir));
repository.initialize();

You can then store the RDF/XML response into the repository like this:

ValueFactory vf = repository.getValueFactory();
RepositoryConnection con = repository.getConnection();
try {
    con.add(reader, "", RDFFormat.RDFXML);
} finally {
    con.close();
}

Once you have a collection of documents indexed using OpenCalais, you can then query the repository for related documents. For example, to find all documents about the Halifax Comedy Festival, you can use the following SPARQL query:

PREFIX :
SELECT DISTINCT ?doc
WHERE { ?instance :docId ?doc .
        ?instance :subject ?subject .
        ?subject :name "Halifax Comedy Festival" }
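
If the subject name comes from user input rather than a fixed string, the query text has to be assembled with the name escaped. The sketch below shows one way to do that in plain Java. The class is hypothetical, and the namespace bound to the empty prefix is an assumption, since the query above does not show it.

```java
final class SubjectQuery {
    // ASSUMPTION: namespace for the docId, subject, and name predicates.
    static final String PRED_NS = "http://s.opencalais.com/1/pred/";

    // Escapes backslashes and double quotes so the value is safe inside
    // a double-quoted SPARQL string literal.
    static String escape(String literal) {
        return literal.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the "documents about a subject" query for an arbitrary name.
    static String documentsAbout(String subjectName) {
        return "PREFIX : <" + PRED_NS + ">\n"
             + "SELECT DISTINCT ?doc\n"
             + "WHERE { ?instance :docId ?doc .\n"
             + "        ?instance :subject ?subject .\n"
             + "        ?subject :name \"" + escape(subjectName) + "\" }";
    }

    public static void main(String[] args) {
        System.out.println(documentsAbout("Halifax Comedy Festival"));
    }
}
```

This handles only quotes and backslashes; names containing line breaks would need further escaping before being placed in the literal.
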

To find documents related to one another, you can use the following SPARQL query:

PREFIX :
SELECT DISTINCT ?docA ?docB
WHERE { ?instanceA :docId ?docA .
        ?instanceB :docId ?docB .
        ?instanceA :subject ?subject .
        ?instanceB :subject ?subject }

Getting Started with Yahoo!’s SearchMonkey
It is not uncommon for a website to direct users to a search engine’s page for the site, typically labeled “search this site.” The major search engines provide this feature (Yahoo! calls this process “Search Builder”). It has traditionally come at the cost of losing control over how information is shown in the results. However, Yahoo! is changing this through its new SearchMonkey platform.

 
Figure 2. SearchMonkey Application Dashboard: Get started with SearchMonkey at the dashboard.

SearchMonkey is Yahoo! Search’s new open platform to build result enhancements based on information about the given web page. Each result entry triggers a SearchMonkey application that creates a quick data mashup, pulling in information about that page from a variety of sources. These mashups may include micro-formats embedded in the page (possibly created by Marmoset), other embedded metadata such as RDFa, remote data feeds, or other remote custom data services.

If you want to enhance your site’s search results, you can use the SearchMonkey platform to change the way your site is displayed in Yahoo!’s results page. The first page of every DevX article, for example, shows the document’s author. To add this information to all search results, you need to create a Custom Data Service to extract the data from the page, then create a Presentation Application (result enhancement), and finally, if you have authenticated your site, request a change to the default presentation of results for your site.

To get started, point your browser to the SearchMonkey Application Dashboard (you will need to sign in with a Yahoo! ID). Click on “create new data service” (see Figure 2) and follow the steps until you get to the data extraction part. You then need to fill in an Extensible Stylesheet Language (XSL) file. For example, to extract the author’s name from the first page of a DevX article (or the last page of their bio), you can use the transformation as follows. It uses two XPath expressions to search for the element with the class of “articleAuthor” or “articlebio,” and extracts its text as the name of the article creator (dc:creator/foaf:name).
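
The XPath matching described here can be tried outside the SearchMonkey dashboard. The following Java sketch evaluates two expressions, one per class name, against a well-formed page and returns the first non-empty match. It only illustrates the matching logic and is not the XSL transformation itself; the class name is hypothetical, and because the element name is not given above, the expressions match any element with the right class.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class AuthorExtractor {
    // Returns the trimmed text of the first element whose class is
    // "articleAuthor", falling back to "articlebio"; empty if neither exists.
    static String author(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String[] expressions = {
            "//*[@class='articleAuthor']",
            "//*[@class='articlebio']"
        };
        try {
            for (String expression : expressions) {
                // Each evaluation needs a fresh InputSource because the reader is consumed.
                String text = xpath.evaluate(expression,
                        new InputSource(new StringReader(xhtml))).trim();
                if (!text.isEmpty()) {
                    return text;
                }
            }
            return "";
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath against the page", e);
        }
    }
}
```

The input must be well-formed XML, so real-world HTML would need to be tidied first.
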

                                                                                    

Once you have created the custom data services you want to use, you can start to create the result enhancement, or Presentation Application, by clicking on “Create a New Application” from the dashboard home page. The Appearance section contains a PHP script broken into sections corresponding to the parts of a result: title/summary, image, links, and description items. On the right side are links to insert the extracted data into the search result. Double-click on the “SMDEFAULT” value at the position where you want the data to appear, then click on the corresponding link on the right side. For example, to add an author name to the description of a result, double-click on “SMDEFAULT” next to $ret['dict'][0]['key'] and type "Author" (including the quotes). Then double-click on “SMDEFAULT” next to $ret['dict'][0]['value'] and click on (dc:creator/foaf:name); you may need to click on a (+) to expand the bottom branch. You should now have changed a section of the code to look something like this:

// Key Value pairs - up to 4
$ret['dict'][0]['key'] = "Author";
$ret['dict'][0]['value'] = Data::get('smid:NKz/dc:creator/foaf:name');
 
Figure 3. Search result: Test your enhancements in a Yahoo! Search result.

Test your enhancement in a Yahoo! search result (see Figure 3) by adding the enhancement to your Yahoo! preferences. Whenever you are logged in and searching on Yahoo!, your new enhancement will be triggered.

Yahoo! SearchMonkey is currently investigating ways for site owners to self-administer result enhancements. For now, to make this enhancement available to anonymous users, you must first share your application in the SearchMonkey gallery and authenticate your site by using Yahoo!’s new Site Explorer. Then, contact Yahoo! for approval of your search result enhancement.

Conclusion
Metadata in web pages and other documents is starting to play a bigger role in indexing and summarizing documents online. Here you have seen a few ways in which this information can be processed and used to enhance the user’s experience. With these new APIs, semantic web technologies are now much easier to use and more practical to embed in other applications. For more information on OpenCalais or SearchMonkey please see their corresponding documentation.
