Many digital documents reside on the web and other networks, but few of them have sufficient metadata to accurately
identify their content. Adding metadata to a document has typically been the burden of the document author, and
few authors take the time to create detailed metadata for their documents, relying instead on editing software to
create metadata such as the author's name and modification history. However, few document editors can create metadata
describing the document's content.
The method for creating metadata may be changing with the public release of OpenCalais. OpenCalais is a public web service that integrates with applications, such as authoring tools, to create metadata about unstructured English document content (with other languages to follow later). Even documents that are created without metadata can benefit from the OpenCalais web service.
The OpenCalais web service receives unstructured English content, identifies the named entities, facts, and events
buried in the text, and extracts them into metadata. You can use this metadata extraction to auto-tag weblogs
and index documents.
Auto Tagging Blog Entries
Figure 1. The Tagaroo Plugin: The Tagaroo plugin provides a list of suggested tags based on your content.
Imagine writing your next blog post and, before you upload it, having your editor suggest a few images to post along with it. Using the OpenCalais Tagaroo plugin for WordPress, you can get a list of suggested tags to choose from and view relevant pictures to send with your post.
With Tagaroo installed, you first write your blog as usual, then Tagaroo—connecting to OpenCalais—provides a list of
suggested tags based on the content (see Figure 1). Tagaroo then uses the tags you select to search Flickr for related images (not shown).
OpenCalais HTTP Interface
OpenCalais uses HTTP to post encoded content. To post content, you first encode it into an application/x-www-form-urlencoded POST request, using
the following Java command:
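A minimal sketch of the encoding step, assuming the standard java.net.URLEncoder class. The FormBody class, its encode method, and the field names passed to it are hypothetical and serve only as illustration; check the OpenCalais documentation for the field names the service actually expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper: builds an application/x-www-form-urlencoded body.
final class FormBody {

    // Percent-encodes a single name or value as UTF-8.
    static String encodePart(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    // Joins the encoded name=value pairs with '&', in the map's iteration order.
    static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encodePart(field.getKey()))
              .append('=')
              .append(encodePart(field.getValue()));
        }
        return sb.toString();
    }
}
```

For example, a map holding the single entry content = "a b&c" would be encoded as content=a+b%26c, which keeps the ampersand inside the text from being read as a field separator.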
You can then create a POST request in Java as shown here:
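// Buffer for the form-encoded request body.
StringBuilder sb = new StringBuilder(content.length() + 1024);
URLConnection connection = new URL(API_URL).openConnection();
// URLConnection only allows writing a request body once output is enabled.
connection.setDoOutput(true);
OutputStream out = connection.getOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");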
You can also use OpenCalais to add metadata to existing web pages. A web page would be much more useful for machine processing if the people and organizations mentioned in the page were already extracted into header microformats. Search engines such as Yahoo! could then use this information to show more useful and visually appealing search results.
In PHP, you can include the OpenCalais Marmoset package to generate the metadata for every page. When you include a PHP
Marmoset header and footer in the page content, Marmoset injects metadata that selected search engines can use to
index the content of the document more accurately.
The additional parameters include the requested encoded response format. For example, to receive RDF/XML, you may pass parameters like this:
<c:userDirectives c:allowDistribution='true' c:allowSearch='true'
A successful response from the HTTP service is encoded as an XML string. It can be decoded as follows:
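// Parse the XML wrapper returned by the service.
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(connection.getInputStream());
// The root element's text content is the decoded payload.
return new StringReader(doc.getDocumentElement().getTextContent());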
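To see why the text content of the root element is what matters, the same decoding can be tried on an in-memory string. This is only a sketch: the ResponseUnwrapper class and the string wrapper element in the sample are assumptions made for illustration, not the service's documented response format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: unwraps a payload that was escaped inside an XML element.
final class ResponseUnwrapper {

    // Parses the wrapper and returns the unescaped text of its root element.
    static String unwrap(String responseXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(responseXml)));
            // The parser resolves &lt; and &gt; back into angle brackets here.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Response is not well-formed XML", e);
        }
    }
}
```

Given the wrapper `<string>&lt;rdf:RDF&gt;&lt;/rdf:RDF&gt;</string>`, unwrap returns `<rdf:RDF></rdf:RDF>`, which can then be parsed a second time as RDF/XML.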
Every document is assigned a unique docId for the DocInfo resource in the RDF/XML response. The response also includes OpenCalais processing metadata in a DocInfoMeta. However, the interesting parts are the InstanceInfo resources. These resources contain the docId, subject, and context of a named entity, fact, or event. By extracting the subjects referenced by the docId, through the InstanceInfo resources, you can identify what is referenced in the unstructured content of the original document.
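The extraction described above can be sketched with the JDK's namespace-aware DOM parser. Everything specific in this example is an assumption made for illustration: the SubjectExtractor class, the example.org URIs, and the simplified layout of the sample RDF/XML (one rdf:Description per resource, with type, docId, and subject given as rdf:resource attributes). A real response should be inspected before relying on this shape.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the subjects that InstanceInfo resources link to a docId.
final class SubjectExtractor {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Simplified, made-up sample: one DocInfo and three InstanceInfo resources.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:c='http://example.org/pred/'>"
        + "<rdf:Description rdf:about='http://example.org/doc/1'>"
        + "<rdf:type rdf:resource='http://example.org/type/DocInfo'/>"
        + "</rdf:Description>"
        + instance("http://example.org/doc/1", "http://example.org/person/ada")
        + instance("http://example.org/doc/1", "http://example.org/company/acme")
        + instance("http://example.org/doc/2", "http://example.org/person/bob")
        + "</rdf:RDF>";

    // Builds one sample InstanceInfo resource for the given docId and subject.
    static String instance(String docId, String subject) {
        return "<rdf:Description>"
            + "<rdf:type rdf:resource='http://example.org/type/InstanceInfo'/>"
            + "<c:docId rdf:resource='" + docId + "'/>"
            + "<c:subject rdf:resource='" + subject + "'/>"
            + "</rdf:Description>";
    }

    // Returns the rdf:resource of the first descendant with this local name, or "" if none.
    static String resourceOf(Element parent, String localName) {
        NodeList matches = parent.getElementsByTagNameNS("*", localName);
        if (matches.getLength() == 0) {
            return "";
        }
        return ((Element) matches.item(0)).getAttributeNS(RDF_NS, "resource");
    }

    // Collects, in document order, the subjects of InstanceInfo resources that reference docId.
    static Set<String> subjectsFor(String rdfXml, String docId) {
        Set<String> subjects = new LinkedHashSet<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(rdfXml)));
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                boolean isInstanceInfo =
                    resourceOf(description, "type").endsWith("InstanceInfo");
                if (isInstanceInfo && docId.equals(resourceOf(description, "docId"))) {
                    String subject = resourceOf(description, "subject");
                    if (!subject.isEmpty()) {
                        subjects.add(subject);
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RDF/XML", e);
        }
        return subjects;
    }
}
```

Calling SubjectExtractor.subjectsFor(SubjectExtractor.SAMPLE, "http://example.org/doc/1") yields the two subjects attached to that document and skips the one that belongs to doc/2. A set is used so that a subject mentioned several times in the text is reported once.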
Once decoded, interesting information can be extracted from the response metadata. You can use the subjects found in the document to auto-tag the content, search for other relevant information, index the document's metadata for later use, or simply embed the metadata within the document.