Getting Started with OpenCalais and SearchMonkey : Page 2

OpenCalais and Yahoo!'s SearchMonkey both focus on document metadata. OpenCalais is a new toolkit that allows you to incorporate semantic functionality within your blog, content management system, website or application. Yahoo!'s SearchMonkey API allows you to connect data services from around the world and integrate them into the Yahoo! search engine.

Indexing Documents
OpenCalais can also help content managers create smart indexes. Instead of indexing by keywords, you can index by referenced subject. If you have a collection of unstructured documents, on a website for example, you can use OpenCalais to help manage and cross-reference them. By using the OpenCalais API, a website's side navigation bar can suggest other related documents based on their conceptual subject, instead of the word matching used by most indexes.
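To make the idea of indexing by subject concrete, here is a minimal in-memory sketch. It assumes the subjects for each document have already been extracted (by OpenCalais or otherwise); the class SubjectIndex and its methods are hypothetical names used for illustration and are not part of any OpenCalais API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory index keyed by subject rather than by keyword. */
final class SubjectIndex {
    // subject name -> URLs of the documents that reference it
    private final Map<String, Set<String>> docsBySubject = new HashMap<>();
    // document URL -> subjects it references
    private final Map<String, Set<String>> subjectsByDoc = new HashMap<>();

    /** Records the subjects that were extracted for one document. */
    void add(String docUrl, String... subjects) {
        Set<String> known = subjectsByDoc.computeIfAbsent(docUrl, k -> new LinkedHashSet<>());
        for (String subject : subjects) {
            known.add(subject);
            docsBySubject.computeIfAbsent(subject, k -> new LinkedHashSet<>()).add(docUrl);
        }
    }

    /** Returns the other documents sharing at least one subject, most overlap first. */
    List<String> relatedTo(String docUrl) {
        Map<String, Integer> shared = new HashMap<>();
        Set<String> subjects = subjectsByDoc.getOrDefault(docUrl, Collections.emptySet());
        for (String subject : subjects) {
            for (String other : docsBySubject.get(subject)) {
                if (!other.equals(docUrl)) {
                    shared.merge(other, 1, Integer::sum);
                }
            }
        }
        List<String> result = new ArrayList<>(shared.keySet());
        // Sort by descending overlap, then by URL so the order is deterministic.
        result.sort((a, b) -> {
            int byOverlap = Integer.compare(shared.get(b), shared.get(a));
            return byOverlap != 0 ? byOverlap : a.compareTo(b);
        });
        return result;
    }
}
```

A navigation bar could call relatedTo with the current page's URL and render the returned list. Ranking by the number of shared subjects is a design choice made for this sketch, not something the article prescribes.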

Sample OpenCalais Indexing Firefox Plugin
Even if you do not manage your own website, you can still begin indexing documents. Included in this article is an example Firefox plugin that uses OpenCalais to index visited web pages from within Firefox. You must first download and save the XPI file, then open it in Mozilla Firefox to install it. After installing the plugin and restarting Firefox, you can start the plugin by selecting View → Sidebar → Related Pages. While the sidebar is open, any web page you visit is indexed into an internal Sesame RDF store. At the same time, documents with overlapping subjects appear in the list on the left.

To test the ability of OpenCalais and this plugin, try navigating to a popular news item of the day from two different news sources and see if OpenCalais can recognise the common subjects between them. The plugin (as implemented) passes all text found in the web page, which usually includes a lot of noise (from sidebars and repeated website information). This can cause a large number of false positives, particularly among pages from the same website.

To modify the behaviour of the plugin, simply open it with a ZIP archive manager, such as WinZip, and extract the files into a directory to reveal the JavaScript and Java source code used.

By taking the RDF/XML document returned by the OpenCalais HTTP interface and storing it in a Sesame RDF store, you can enable an application to find documents related to anything in the RDF store. To create a new Sesame native RDF store, or open an existing one, use the following code:

// dir is a java.io.File pointing at the store's data directory
repository = new SailRepository(new NativeStore(dir));
repository.initialize();

You can then store the RDF/XML response into the repository like this:

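ValueFactory vf = repository.getValueFactory();
RepositoryConnection con = repository.getConnection();
try {
    // reader is a java.io.Reader over the RDF/XML response; "" is the base URI
    con.add(reader, "", RDFFormat.RDFXML);
} finally {
    // always release the connection, even if the import fails
    con.close();
}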

Once you have a collection of documents indexed using OpenCalais, you can then query the repository for related documents. For example, to find all documents about the Halifax Comedy Festival, you can use the following SPARQL query:

PREFIX :<http://s.opencalais.com/1/pred/>
SELECT DISTINCT ?doc
WHERE {
    ?instance :docId ?doc .
    ?instance :subject ?subject .
    ?subject :name "Halifax Comedy Festival"
}

To find documents related to one another, you can use the following SPARQL query:

PREFIX :<http://s.opencalais.com/1/pred/>
SELECT DISTINCT ?docA ?docB
WHERE {
    ?instanceA :docId ?docA .
    ?instanceB :docId ?docB .
    ?instanceA :subject ?subject .
    ?instanceB :subject ?subject
}
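To see what this self-join matches, the sketch below evaluates the same four triple patterns over a small in-memory list of statements. It is only an illustration under stated assumptions: the Triple class, the short predicate names "docId" and "subject", and the skipSelfPairs flag are hypothetical and do not come from Sesame or OpenCalais. Note that the query as written also pairs any document that has at least one subject with itself, and returns each related pair in both orders; in SPARQL, a FILTER (?docA != ?docB) clause would be one way to drop the self-pairs.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RelatedPairs {
    /** One RDF-like statement; the field names are illustrative only. */
    static final class Triple {
        final String node, predicate, value;
        Triple(String node, String predicate, String value) {
            this.node = node;
            this.predicate = predicate;
            this.value = value;
        }
    }

    /** True if the data contains the statement (node, predicate, value). */
    static boolean has(List<Triple> data, String node, String predicate, String value) {
        for (Triple t : data) {
            if (t.node.equals(node) && t.predicate.equals(predicate) && t.value.equals(value)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Mirrors the query: two instances, each with a docId, that share a subject.
     * The set stands in for SELECT DISTINCT.
     */
    static Set<List<String>> relatedPairs(List<Triple> data, boolean skipSelfPairs) {
        Set<List<String>> pairs = new LinkedHashSet<>();
        for (Triple a : data) {
            if (!a.predicate.equals("docId")) continue;       // ?instanceA :docId ?docA
            for (Triple b : data) {
                if (!b.predicate.equals("docId")) continue;   // ?instanceB :docId ?docB
                if (skipSelfPairs && a.value.equals(b.value)) continue;
                for (Triple s : data) {
                    // ?instanceA :subject ?subject . ?instanceB :subject ?subject
                    if (s.node.equals(a.node) && s.predicate.equals("subject")
                            && has(data, b.node, "subject", s.value)) {
                        pairs.add(Arrays.asList(a.value, b.value));
                    }
                }
            }
        }
        return pairs;
    }
}
```

The nested loops are deliberately naive; an RDF store such as Sesame evaluates the same patterns against its indexes.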

Getting Started with Yahoo!'s SearchMonkey
It is not uncommon for a website to direct users to a search engine's results page restricted to that website ("search this site"). Major search engines provide this feature (Yahoo! calls this process "Search Builder"). It traditionally comes at the cost of losing control over how information is shown in the results. However, Yahoo! is changing this through its new SearchMonkey platform.

Figure 2. SearchMonkey Application Dashboard: Get started with SearchMonkey at the dashboard.
SearchMonkey is Yahoo! Search's new open platform for building result enhancements based on information about the given web page. Each result entry triggers a SearchMonkey application that creates a quick data mashup, pulling in information about that page from a variety of sources. These mashups may draw on microformats embedded in the page (possibly created by Marmoset), other embedded metadata such as RDFa, remote data feeds, or other remote custom data services.

If you want to enhance your site's search results, you can use the SearchMonkey platform to change the way your site is displayed in Yahoo!'s results page. For example, the first page of every DevX article names the document's author. To add this information to all search results, you need to create a Custom Data Service to extract the data from the page, then create a Presentation Application (result enhancement). Finally, if you have authenticated your site, you can request to change the default presentation of results for your site.

To get started, point your browser to the SearchMonkey Application Dashboard (you will need to sign in with a Yahoo! ID). Click on "create new data service" (see Figure 2) and follow the steps until you get to the data extraction part. You then need to fill in an Extensible Stylesheet Language (XSL) file. For example, to extract the author's name from the first page of a DevX article (or from the bio on the last page), you can use the following transformation. It uses two XPath expressions to search for the <div> element with the class "articleAuthor" or "articlebio", and emits what they match as the name of the article creator (dc:creator/foaf:name).

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <adjunctcontainer>
      <adjunct id="smid:{$smid}" version="1.0">
        <item rel="dc:creator">
          <meta property="foaf:name">
            <xsl:value-of select="substring-after(//div[@class='articleAuthor'], 'by')"/>
            <xsl:value-of select="//div[@class='articlebio']//b"/>
          </meta>
        </item>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>
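Before pasting a transformation into the dashboard, it can help to try its XPath expressions locally. The following sketch uses the JDK's built-in javax.xml.xpath API to evaluate the two expressions from the stylesheet against small, well-formed sample fragments. The sample markup and the AuthorXPathDemo class are assumptions made for illustration; the real DevX pages, and the way SearchMonkey normalizes HTML before applying the XSL, may differ.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class AuthorXPathDemo {
    // Assumed markup for the first page of an article (hypothetical sample).
    static final String FIRST_PAGE =
        "<html><body><div class=\"articleAuthor\">by James Leigh</div></body></html>";
    // Assumed markup for the author bio on the last page (hypothetical sample).
    static final String LAST_PAGE =
        "<html><body><div class=\"articlebio\"><p><b>James Leigh</b> is a consultant.</p></div></body></html>";

    /** Evaluates an XPath 1.0 expression against an XML string and returns its string value. */
    static String eval(String expression, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String byline = "substring-after(//div[@class='articleAuthor'], 'by')";
        String bio = "//div[@class='articlebio']//b";
        // Brackets make leading or trailing whitespace visible in the output.
        System.out.println("[" + eval(byline, FIRST_PAGE) + "]");
        System.out.println("[" + eval(bio, LAST_PAGE) + "]");
        // When the element is absent, the expression yields an empty string.
        System.out.println("[" + eval(byline, LAST_PAGE) + "]");
    }
}
```

With this sample markup, substring-after keeps the space that follows "by"; wrapping the expression in normalize-space(...) is one way to trim it. Because an expression that matches nothing yields an empty string, the stylesheet can presumably emit both xsl:value-of elements and rely on only one of them producing text on any given page.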

Once you have created the custom data services you want to use, you can start to create the result enhancement, or Presentation Application, by clicking on "Create a New Application" from the dashboard home page. The Appearance section contains a PHP script broken into sections, each corresponding to a part of the result format: title/summary, image, links, and description items. On the right side are links that insert the extracted data into the search result. Simply double-click on the "SMDEFAULT" value at the position where you want the data to show, then click on the link on the right side. For example, to add an author name to the description of a result, double-click on "SMDEFAULT" next to "$ret['dict'][0]['key']" and type "Author" (including quotes). Then double-click on "SMDEFAULT" next to "$ret['dict'][0]['value']" and click on (dc:creator/foaf:name); you may need to click on a (+) to expand the bottom branch. You should now have changed a section of the code to look something like this:

// Key Value pairs - up to 4
$ret['dict'][0]['key'] = "Author";
$ret['dict'][0]['value'] = Data::get('smid:NKz/dc:creator/foaf:name');

Figure 3. Search result: Test your enhancements in a Yahoo! Search result.
Test your enhancement in a Yahoo! search result (see Figure 3) by adding the enhancement to your Yahoo! preferences. Whenever you are logged in and searching on Yahoo!, your new enhancement will be triggered.

The Yahoo! SearchMonkey team is investigating ways for site owners to self-administer result enhancements. For now, to make this enhancement available to anonymous users, you must first share your application in the SearchMonkey gallery and authenticate your site by using Yahoo!'s new Site Explorer. Then, contact Yahoo! for approval of your search result enhancement.

Metadata in web pages and other documents is starting to play a bigger role in indexing and summarizing documents online. Here you have seen a few ways in which this information can be processed and used to enhance the user's experience. With these new APIs, semantic web technologies are now much easier to use and more practical to embed in other applications. For more information on OpenCalais or SearchMonkey, please see their corresponding documentation.

James Leigh is an independent software consultant based in Toronto. He has experience modeling business problems and concepts in software, and specializes in performance and technology integration. James has a background in semantic web technologies and decentralized networks. He is an active member of the OpenRDF community and a developer of Sesame and Elmo.