Many digital documents reside on the web and other networks, but few of them have sufficient metadata to accurately identify the content. Adding metadata to a document has typically been the burden of the document author, and few authors take the time to create detailed metadata for their documents, relying instead on editing software to create metadata such as the author’s name and modification history. However, few document editors can create metadata describing the document’s content.
The method for creating metadata may be changing with the public release of OpenCalais. OpenCalais is a public web service that integrates with applications, such as authoring tools, to create metadata about unstructured English document content (with other languages to follow later). Even documents that are created without metadata can benefit from the OpenCalais web service.
The OpenCalais web service receives unstructured English content, identifies the named entities, facts, and events buried in the text, and extracts them into metadata. You can use this metadata extraction to auto-tag weblogs and index documents.
Auto Tagging Blog Entries
Figure 1. The Tagaroo Plugin: The Tagaroo plugin provides a list of suggested tags based on your content.
Imagine writing your next blog post and, before you upload it, having your editor suggest tags and a few images to post along with it. Using the OpenCalais Tagaroo plugin for WordPress, you can get a list of suggested tags to choose from and view relevant pictures to send with your post.
With Tagaroo installed, you first write your blog as usual, then Tagaroo—connecting to OpenCalais—provides a list of suggested tags based on the content (see Figure 1). Tagaroo then uses the tags you select to search Flickr for related images (not shown).
OpenCalais HTTP Interface
OpenCalais accepts content through an HTTP POST. To post content, you first encode each parameter value for an application/x-www-form-urlencoded POST request, using the following Java call:
URLEncoder.encode(parameter, "UTF-8");
You can then create a POST request in Java as shown here:
// licenseID, content, and additionalParameters are assumed to be
// URL-encoded already (see URLEncoder.encode above).
StringBuilder sb = new StringBuilder(content.length() + 1024);
sb.append("licenseID=").append(licenseID);
sb.append("&content=").append(content);
sb.append("&XML=").append(additionalParameters);
URLConnection connection = new URL(API_URL).openConnection();
connection.addRequestProperty("Content-Type", "application/x-www-form-urlencoded");
connection.addRequestProperty("Content-Length", String.valueOf(sb.length()));
connection.setDoOutput(true);
OutputStream out = connection.getOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(out);
writer.write(sb.toString());
writer.flush();
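As a minimal sketch of the encoding step, the class below builds an application/x-www-form-urlencoded body from a map of parameters, encoding every name and value with URLEncoder. The class name FormBodySketch, its method names, and the sample values are hypothetical and chosen only for illustration; it performs no network call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a form-encoded request body without sending it.
final class FormBodySketch {

    /** Encodes one name or value for an application/x-www-form-urlencoded body. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Joins the parameters as name=value pairs separated by '&'. */
    static String buildBody(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("licenseID", "my-key"); // placeholder, not a real license key
        params.put("content", "Tom & Jerry visit Halifax");
        // prints: licenseID=my-key&content=Tom+%26+Jerry+visit+Halifax
        System.out.println(buildBody(params));
    }
}
```

Encoding each value separately keeps a literal '&' or '=' inside the content from being read as a parameter separator.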
The additional parameters include the requested response format, so you can, for example, ask for the response as RDF/XML.
A successful response from the HTTP service is an XML document that wraps the result as an encoded string. You can decode it as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(connection.getInputStream());
// The text content of the root element is the decoded result.
return new StringReader(doc.getDocumentElement().getTextContent());
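To see what the decoding step does, the following sketch runs the same DOM calls on a hand-written envelope instead of a live connection. The root element name string and the payload are assumptions made for the example, not actual OpenCalais output; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: unwraps a payload carried as escaped text in an XML element.
final class EnvelopeSketch {

    /** Parses the envelope and returns the root element's text with entities resolved. */
    static String unwrap(String envelopeXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(envelopeXml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the response envelope", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for a response; the element name is an assumption.
        String envelope = "<string>&lt;rdf:RDF&gt;&lt;/rdf:RDF&gt;</string>";
        // prints: <rdf:RDF></rdf:RDF>
        System.out.println(unwrap(envelope));
    }
}
```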
Every document is assigned a unique docId for the DocInfo resource in the RDF/XML response. The response also includes OpenCalais processing metadata in a DocInfoMeta. However, the interesting parts are the InstanceInfo resources. These resources contain the docId, subject, and context of a named entity, fact, or event. By extracting the subjects referenced by the docId, through the InstanceInfo resources, you can identify what is referenced in the unstructured content of the original document.
Once decoded, interesting information can be extracted from the response metadata. You can use the subjects found in the document to auto-tag the content, search for other relevant information, index the document’s metadata for later use, or simply embed the metadata within the document.
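The extraction described above can be sketched with the JDK's DOM parser. The sample document is a hand-written stand-in, not real OpenCalais output: the predicate namespace http://example.org/pred/ and all resource URIs are placeholders, and the code assumes each instance is an rdf:Description with docId and subject children that carry rdf:resource attributes. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: groups the subjects found in an RDF/XML string by document.
final class SubjectSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String PRED_NS = "http://example.org/pred/"; // placeholder namespace

    // Hand-written sample: two instances that point at the same document.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:c='" + PRED_NS + "'>"
        + "<rdf:Description rdf:about='http://example.org/instance/1'>"
        + "<c:docId rdf:resource='http://example.org/doc/1'/>"
        + "<c:subject rdf:resource='http://example.org/subject/festival'/>"
        + "</rdf:Description>"
        + "<rdf:Description rdf:about='http://example.org/instance/2'>"
        + "<c:docId rdf:resource='http://example.org/doc/1'/>"
        + "<c:subject rdf:resource='http://example.org/subject/halifax'/>"
        + "</rdf:Description>"
        + "</rdf:RDF>";

    /** Maps each docId to the subjects of the instances that reference it. */
    static Map<String, List<String>> subjectsByDoc(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rdfXml)));
            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element description = (Element) nodes.item(i);
                String docId = resourceOf(description, "docId");
                String subject = resourceOf(description, "subject");
                if (docId != null && subject != null) {
                    result.computeIfAbsent(docId, k -> new ArrayList<>()).add(subject);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the RDF/XML", e);
        }
    }

    /** Returns the rdf:resource of the first matching child, or null if there is none. */
    private static String resourceOf(Element parent, String localName) {
        NodeList children = parent.getElementsByTagNameNS(PRED_NS, localName);
        if (children.getLength() == 0) {
            return null;
        }
        String value = ((Element) children.item(0)).getAttributeNS(RDF_NS, "resource");
        return value.isEmpty() ? null : value;
    }

    public static void main(String[] args) {
        System.out.println(subjectsByDoc(SAMPLE));
    }
}
```

A dedicated RDF parser is the more robust choice for real responses, because RDF/XML allows several equivalent serializations that this element-based lookup does not handle.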
Indexing Documents
OpenCalais can also help content managers create smart indexes. Instead of indexing by keywords, you can index by referenced subject. If you have a collection of unstructured documents, on a website for example, you can use OpenCalais to help manage and cross-reference them. By using the OpenCalais API, a website’s side navigation bar can suggest other related documents based on the conceptual subject, instead of the word matching used by most indexes.
By taking the RDF/XML document returned by the OpenCalais HTTP interface and storing it in a Sesame RDF store, you can enable an application to find documents related to anything in the RDF store. To create a new Sesame native RDF store, or open an existing one, use the following code:
repository = new SailRepository(new NativeStore(dir));
repository.initialize();
You can then store the RDF/XML response into the repository like this:
ValueFactory vf = repository.getValueFactory();
RepositoryConnection con = repository.getConnection();
try {
    con.add(reader, "", RDFFormat.RDFXML);
} finally {
    con.close();
}
Once you have a collection of documents indexed using OpenCalais, you can then query the repository for related documents. For example, to find all documents about the Halifax Comedy Festival, you can use the following SPARQL query:
PREFIX :
SELECT DISTINCT ?doc
WHERE {
  ?instance :docId ?doc .
  ?instance :subject ?subject .
  ?subject :name "Halifax Comedy Festival"
}
To find documents related to one another, you can use the following SPARQL query:
PREFIX :
SELECT DISTINCT ?docA ?docB
WHERE {
  ?instanceA :docId ?docA .
  ?instanceB :docId ?docB .
  ?instanceA :subject ?subject .
  ?instanceB :subject ?subject
}
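To make the idea behind these queries concrete without an RDF store, here is a plain-Java sketch of the same subject-based matching over an in-memory map. It is an illustration of the concept, not a replacement for SPARQL. The class name, method names, and document identifiers are hypothetical, and the sketch reports each pair once and never pairs a document with itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the subject index held in the RDF store.
final class RelatedDocsSketch {

    /** Inverts document -> subjects into subject -> documents. */
    static Map<String, Set<String>> docsBySubject(Map<String, List<String>> subjectsByDoc) {
        Map<String, Set<String>> index = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : subjectsByDoc.entrySet()) {
            for (String subject : entry.getValue()) {
                index.computeIfAbsent(subject, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return index;
    }

    /** Returns "docA|docB" pairs (docA sorts before docB) that share at least one subject. */
    static Set<String> relatedPairs(Map<String, List<String>> subjectsByDoc) {
        Set<String> pairs = new TreeSet<>();
        for (Set<String> docs : docsBySubject(subjectsByDoc).values()) {
            List<String> sorted = new ArrayList<>(docs); // TreeSet iteration is sorted
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairs.add(sorted.get(i) + "|" + sorted.get(j));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> subjectsByDoc = new LinkedHashMap<>();
        subjectsByDoc.put("doc1", Arrays.asList("Halifax Comedy Festival", "Halifax"));
        subjectsByDoc.put("doc2", Arrays.asList("Halifax"));
        subjectsByDoc.put("doc3", Arrays.asList("Toronto"));
        // prints: [doc1]
        System.out.println(docsBySubject(subjectsByDoc).get("Halifax Comedy Festival"));
        // prints: [doc1|doc2]
        System.out.println(relatedPairs(subjectsByDoc));
    }
}
```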
Getting Started with Yahoo!’s SearchMonkey
It is not uncommon for a website to direct users to a search engine’s page for the site, often labeled “search this site.” The major search engines provide this feature (Yahoo! calls this process “Search Builder”). It has traditionally come at the cost of losing control over how information is shown in the results. However, Yahoo! is changing this through its new SearchMonkey platform.
Figure 2. SearchMonkey Application Dashboard: Get started with SearchMonkey at the dashboard.
SearchMonkey is Yahoo! Search’s new open platform to build result enhancements based on information about the given web page. Each result entry triggers a SearchMonkey application that creates a quick data mashup, pulling in information about that page from a variety of sources. These mashups may include micro-formats embedded in the page (possibly created by Marmoset), other embedded metadata such as RDFa, remote data feeds, or other remote custom data services.
If you want to enhance your site’s search results, you can use the SearchMonkey platform to change the way your site is displayed in Yahoo!’s results page. The first page of every DevX article, for example, names the document’s author. To add this information to all search results, you create a Custom Data Service to extract the data from the page, then create a Presentation Application (result enhancement). Finally, if you have authenticated your site, you can request to change the default presentation of results for your site.
To get started, point your browser to the SearchMonkey Application Dashboard (you will need to sign in with a Yahoo! ID). Click on “create new data service” (see Figure 2) and follow the steps until you get to the data extraction part. You then need to fill in an Extensible Stylesheet Language (XSL) file. For example, to extract the author’s name from the first page of a DevX article (or the last page of their bio), you can use the transformation as follows. It uses two XPath expressions to search for the
Once you have created the custom data services you want to use, you can start to create the result enhancement, or Presentation Application, by clicking on “Create a New Application” from the dashboard home page. The Appearance section contains a PHP script broken into sections corresponding to the parts of a possible result format: title/summary, image, links, and description items. On the right side are links that insert the extracted data into the search result. Simply double-click on the “SMDEFAULT” value for the position where you want the data to show, then click on the link on the right side. For example, to add an author name to the description of a result, double-click on “SMDEFAULT” next to $ret['dict'][0]['key'] and type "Author" (including the quotes). Then click on “SMDEFAULT” next to $ret['dict'][0]['value'] and click on (dc:creator/foaf:name); you may need to click on a (+) to expand the bottom branch. You should now have changed a section of the code to look something like this:
// Key Value pairs - up to 4
$ret['dict'][0]['key'] = "Author";
$ret['dict'][0]['value'] = Data::get('smid:NKz/dc:creator/foaf:name');
Figure 3. Search result: Test your enhancements in a Yahoo! Search result.
Test your enhancement in a Yahoo! search result (see Figure 3) by adding the enhancement to your Yahoo! preferences. Whenever you are logged in and searching on Yahoo!, your new enhancement will be triggered.
Yahoo! SearchMonkey is investigating ways for site owners to self-administer result enhancements. For now, to make this enhancement available to anonymous users, you must first share your application in the SearchMonkey gallery and authenticate your site by using Yahoo!’s new Site Explorer. Then, contact Yahoo! for approval of your search result enhancement.
Conclusion
Metadata in web pages and other documents is starting to play a bigger role in indexing and summarizing documents online. Here you have seen a few ways in which this information can be processed and used to enhance the user’s experience. With these new APIs, semantic web technologies are now much easier to use and more practical to embed in other applications. For more information on OpenCalais or SearchMonkey please see their corresponding documentation.

