

Getting Started with OpenCalais and SearchMonkey

OpenCalais and Yahoo!'s SearchMonkey both focus on document metadata. OpenCalais is a new toolkit that allows you to incorporate semantic functionality within your blog, content management system, website or application. Yahoo!'s SearchMonkey API allows you to connect data services from around the world and integrate them into the Yahoo! search engine.

Many digital documents reside on the web and other networks, but few of them carry enough metadata to identify their content accurately. Adding metadata has typically been the document author's burden, and few authors take the time to create detailed metadata, relying instead on editing software to record details such as the author's name and modification history. However, few document editors can create metadata that describes the document's content.

The method for creating metadata may be changing with the public release of OpenCalais. OpenCalais is a public web service that integrates with applications, such as authoring tools, to create metadata about unstructured English document content (with other languages to follow later). Even documents that are created without metadata can benefit from the OpenCalais web service.

The OpenCalais web service receives unstructured English content, identifies the named entities, facts, and events buried in the text, and extracts them into metadata. You can use this extracted metadata to auto-tag weblogs and to index documents.

Auto Tagging Blog Entries

Figure 1. The Tagaroo Plugin: The Tagaroo plugin provides a list of suggested tags based on your content.
Imagine writing your next blog post and, before you upload it, having your editor suggest a few images to post along with it. Using the OpenCalais Tagaroo plugin for WordPress, you can get a list of suggested tags to choose from and view relevant pictures to include with your post.

With Tagaroo installed, you first write your blog post as usual. Tagaroo then connects to OpenCalais and provides a list of suggested tags based on the content (see Figure 1), and it uses the tags you select to search Flickr for related images (not shown).

OpenCalais HTTP Interface
OpenCalais accepts content through an HTTP POST request. To post content, you first encode each parameter value for an application/x-www-form-urlencoded request body, using the following Java call:

URLEncoder.encode(parameter, "UTF-8");

You can then create a POST request in Java as shown here:

// licenseID, content, and additionalParameters are expected to be URL-encoded already.
StringBuilder sb = new StringBuilder(content.length() + 1024);
sb.append("licenseID=").append(licenseID);
sb.append("&content=").append(content);
sb.append("&XML=").append(additionalParameters);

URLConnection connection = new URL(API_URL).openConnection();
connection.addRequestProperty("Content-Type", "application/x-www-form-urlencoded");
connection.addRequestProperty("Content-Length", String.valueOf(sb.length()));
connection.setDoOutput(true);

// URL-encoded form data is pure ASCII, so the character count equals the byte count.
OutputStream out = connection.getOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
writer.write(sb.toString());
writer.flush();
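
The request body is a sequence of name=value pairs joined by &, and every value must be URL-encoded before it is appended. As a minimal sketch of that step, the helper below encodes a map of fields into one body string. The class name FormBodySketch, the method encodeForm, and the sample field values are hypothetical and used only for illustration; only URLEncoder.encode comes from the article.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any OpenCalais library.
final class FormBodySketch {

    /** Joins name=value pairs with '&', URL-encoding every name and value as UTF-8. */
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return body.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("licenseID", "my-key"); // placeholder, not a real key
        fields.put("content", "Tom & Jerry met in Paris.");
        // prints: licenseID=my-key&content=Tom+%26+Jerry+met+in+Paris.
        System.out.println(encodeForm(fields));
    }
}
```

Encoding each value separately matters because a raw & or = inside the content would otherwise be read as a field separator.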

Embedding Metadata
You can also use OpenCalais to add metadata to existing web pages. A web page would be much more useful for machine processing if the people and organizations mentioned in the page were already extracted into header microformats. Search engines such as Yahoo! could then use this information to show more useful and visually appealing search results.

In PHP, you can include the OpenCalais Marmoset package to generate the metadata for every page. When you include the PHP Marmoset header and footer in the page content, Marmoset injects metadata that selected search engines can use to index the document's content more accurately.

The additional parameters sent with the POST request include the requested response format. For example, to receive RDF/XML, you may pass parameters like this:

<c:params xmlns:c='http://s.opencalais.com/1/pred/' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <c:processingDirectives c:contentType='text/txt' c:outputFormat='xml/rdf'>
  </c:processingDirectives>
  <c:userDirectives c:allowDistribution='true' c:allowSearch='true' c:externalID='17cabs901' c:submitter='ABC'>
  </c:userDirectives>
  <c:externalMetadata>
  </c:externalMetadata>
</c:params>
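
Because the parameters are themselves an XML document, it can help to build them in code and confirm that the result is well-formed before posting it. The following is a minimal sketch under that assumption: it reuses the namespace, element, and attribute names from the example above, while the class ParamsXmlSketch and its methods are hypothetical helpers, not an OpenCalais API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any OpenCalais library.
final class ParamsXmlSketch {
    static final String PRED = "http://s.opencalais.com/1/pred/";

    /** Builds the parameter document with the given directive values. */
    static String paramsXml(String contentType, String outputFormat,
                            String externalId, String submitter) {
        return "<c:params xmlns:c='" + PRED + "'"
             + " xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
             + "<c:processingDirectives c:contentType='" + attr(contentType) + "'"
             + " c:outputFormat='" + attr(outputFormat) + "'/>"
             + "<c:userDirectives c:allowDistribution='true' c:allowSearch='true'"
             + " c:externalID='" + attr(externalId) + "'"
             + " c:submitter='" + attr(submitter) + "'/>"
             + "<c:externalMetadata/>"
             + "</c:params>";
    }

    /** Escapes the characters that are unsafe inside a single-quoted attribute. */
    private static String attr(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace("'", "&apos;");
    }

    /** Parses the XML (failing if it is malformed) and reads one c: attribute back. */
    static String readAttribute(String xml, String elementName, String attributeName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            NodeList matches = doc.getElementsByTagNameNS(PRED, elementName);
            if (matches.getLength() == 0) {
                return "";
            }
            return ((Element) matches.item(0)).getAttributeNS(PRED, attributeName);
        } catch (Exception e) {
            throw new IllegalStateException("Parameter XML is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String xml = paramsXml("text/txt", "xml/rdf", "17cabs901", "ABC");
        // prints: xml/rdf
        System.out.println(readAttribute(xml, "processingDirectives", "outputFormat"));
    }
}
```

Reading an attribute back through a namespace-aware parser is a cheap check that catches an unclosed element or an unescaped quote before the request is sent.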

A successful response from the HTTP service is an XML document whose root element carries the result as text. You can decode it as follows:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(connection.getInputStream());
// The payload is the text content of the root element.
return new StringReader(doc.getDocumentElement().getTextContent());
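
To see why reading the root element's text content is enough, consider a response whose payload is itself XML that has been escaped into the wrapper. The sketch below runs the same decoding step against an in-memory sample. The wrapper element name and the truncated payload are assumptions made for illustration and may not match the real service response; UnwrapResponseSketch is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of any OpenCalais library.
final class UnwrapResponseSketch {

    /** Returns the text content of the root element, with XML entities resolved. */
    static String unwrap(InputStream response) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .parse(response);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Response was not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed wrapper shape: escaped markup inside a single root element.
        String wrapped = "<string>&lt;rdf:RDF&gt;...&lt;/rdf:RDF&gt;</string>";
        InputStream in = new ByteArrayInputStream(wrapped.getBytes(StandardCharsets.UTF_8));
        // prints: <rdf:RDF>...</rdf:RDF>
        System.out.println(unwrap(in));
    }
}
```

The parser resolves the escaped entities, so the returned string is ready to be parsed a second time as RDF/XML.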

Every document is assigned a unique docId for the DocInfo resource in the RDF/XML response. The response also includes OpenCalais processing metadata in a DocInfoMeta. However, the interesting parts are the InstanceInfo resources. These resources contain the docId, subject, and context of a named entity, fact, or event. By extracting the subjects referenced by the docId, through the InstanceInfo resources, you can identify what is referenced in the unstructured content of the original document.
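
The lookup described above can be sketched as a small DOM walk: find every InstanceInfo resource, keep those whose docId matches the document, and collect their subjects. The sample RDF/XML in this example is a simplified, hypothetical response. The example.org URIs, the type URIs, and the exact element layout are assumptions for illustration, so adjust them to the response you actually receive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any OpenCalais library.
final class InstanceInfoSketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String PRED = "http://s.opencalais.com/1/pred/";

    // Simplified, assumed response: one DocInfo and three InstanceInfo resources.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf='" + RDF + "' xmlns:c='" + PRED + "'>"
      + "<rdf:Description rdf:about='http://example.org/doc/1'>"
      + "<rdf:type rdf:resource='http://s.opencalais.com/1/type/sys/DocInfo'/>"
      + "</rdf:Description>"
      + instance("1", "http://example.org/doc/1", "http://example.org/entity/acme")
      + instance("2", "http://example.org/doc/1", "http://example.org/entity/paris")
      + instance("3", "http://example.org/doc/2", "http://example.org/entity/other")
      + "</rdf:RDF>";

    private static String instance(String id, String docId, String subject) {
        return "<rdf:Description rdf:about='http://example.org/instance/" + id + "'>"
             + "<rdf:type rdf:resource='http://s.opencalais.com/1/type/sys/InstanceInfo'/>"
             + "<c:docId rdf:resource='" + docId + "'/>"
             + "<c:subject rdf:resource='" + subject + "'/>"
             + "</rdf:Description>";
    }

    /** Collects the distinct subjects of all InstanceInfo resources for one document. */
    static List<String> subjectsFor(String rdfXml, String docId) {
        List<String> subjects = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(rdfXml)));
            NodeList resources = doc.getElementsByTagNameNS(RDF, "Description");
            for (int i = 0; i < resources.getLength(); i++) {
                Element resource = (Element) resources.item(i);
                boolean isInstanceInfo =
                    resourceOf(resource, RDF, "type").endsWith("/InstanceInfo");
                if (isInstanceInfo && docId.equals(resourceOf(resource, PRED, "docId"))) {
                    String subject = resourceOf(resource, PRED, "subject");
                    if (!subject.isEmpty() && !subjects.contains(subject)) {
                        subjects.add(subject);
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse RDF/XML", e);
        }
        return subjects;
    }

    /** Returns the rdf:resource value of the first matching child, or "" if none. */
    private static String resourceOf(Element parent, String namespace, String localName) {
        NodeList children = parent.getElementsByTagNameNS(namespace, localName);
        if (children.getLength() == 0) {
            return "";
        }
        return ((Element) children.item(0)).getAttributeNS(RDF, "resource");
    }

    public static void main(String[] args) {
        // prints: [http://example.org/entity/acme, http://example.org/entity/paris]
        System.out.println(subjectsFor(SAMPLE, "http://example.org/doc/1"));
    }
}
```

A plain DOM walk keeps the sketch dependency-free. For larger responses, a dedicated RDF library would likely be a better fit than matching element names by hand.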

Once the response is decoded, you can extract the interesting information from its metadata. You can use the subjects found in the document to auto-tag the content, search for other relevant information, index the document's metadata for later use, or simply embed the metadata within the document.
