Integrate Cocoon with Lucene for Full Text Search of Unstructured Data

Cocoon is a powerful tool for publishing content to multiple formats from XML. This second article of a two-part series describes how to integrate Cocoon with Lucene, a full text search engine written in Java. Lucene allows you to easily index and search unstructured information in a collection of documents. (For further details on Cocoon, see the first article, “Integrating Cocoon with PostgreSQL.”)

The combination of Cocoon and Lucene enables you to process both structured and unstructured data. That is valuable because, although structured data is easier to search and compare, most of the world’s data is unstructured: letters and other free-form documents.

The Setup
As a prerequisite, you must have a functioning Tomcat and Cocoon setup. The “Integrating Cocoon with PostgreSQL” article provides setup instructions for both. If you plan to use PostgreSQL as well, perform the steps described in the “Integrate PostgreSQL” section of that article. Otherwise, you need to perform only the steps described in the “Set Up Tomcat and Cocoon” section. I have written Cocoon applications that use both PostgreSQL and Lucene, but you may need only one of the two technologies alongside Cocoon.

Note: This article refers to $TOMCAT_HOME, $COCOON_HOME, and $SERVER_HOSTNAME. Those variables refer to the same values described in the previous article.

To integrate Cocoon 2.1.6 and Lucene 1.4.3, you first need to perform the following steps:

  1. Download the Lucene distribution. You should get lucene-1.4.3.tar.gz or a later version, if there is one.
  2. Unpack the Lucene distribution into a directory (preferably outside of the Tomcat directory tree to avoid confusion). This article subsequently refers to /path/to/lucene-1.4.3 as $LUCENE_HOME.
  3. Delete $COCOON_HOME/WEB-INF/lib/lucene-1.4.1.jar. By default, Cocoon 2.1.6 comes with Lucene 1.4.1. If you leave the old JAR in place, Java may keep loading the old version, and you will wonder why you can’t use the new features in Lucene 1.4.3 (or whatever newer version you drop in).
  4. Copy lucene-1.4.3.jar from $LUCENE_HOME into $COCOON_HOME/WEB-INF/lib/.

In order for Lucene to work, you need to initially index your set of documents. Creating the index is not a difficult process, but it does require a small amount of code. Lucene is a library, not an out-of-the-box search application. It provides an API that lets you index and search your files. You must tap into the API to write a short Java program that creates the index.
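To make the index-then-search idea concrete before touching Lucene itself, here is a toy, Lucene-free sketch of an inverted index in plain Java. The class ToyIndex, its methods, and the file names are all hypothetical and exist only for illustration; Lucene's real index is far more capable (scoring, fields, on-disk storage), so treat this as a mental model rather than a description of the sample code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (NOT Lucene): maps each term to the files that contain it.
final class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // "Indexing": split the text into lowercase terms and record the file under each term.
    void add(String filename, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(filename);
            }
        }
    }

    // "Searching": a single-term lookup; no scoring, no phrase or boolean queries.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a.txt", "Car rental revenue increased");   // made-up documents
        index.add("b.txt", "Retail revenue increased");
        System.out.println(index.search("revenue")); // [a.txt, b.txt]
        System.out.println(index.search("car"));     // [a.txt]
    }
}
```

The point of the sketch is the division of labor: the expensive work happens once at indexing time, so each later search is only a lookup.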

Download this article’s sample code, then perform the following actions:

  1. Type: “tar xzf sample_code_cocoon_lucene.tgz” (without the quotes)
  2. Move the sample_code_lucene directory, if you want to place it somewhere more convenient. Henceforth, the sample_code_lucene directory will be referred to as $SAMPLE_CODE_HOME. $SAMPLE_CODE_HOME is the full path to sample_code_lucene.
  3. Type: “cd $SAMPLE_CODE_HOME/cli” (without the quotes, and substituting the value of $SAMPLE_CODE_HOME for the placeholder)

The downloadable code includes two Java source files called MyIndexer.java and MySearcher.java. Compile them, generate the index, and then run a test search on the index with the following sequence of commands:

  1. javac -classpath $LUCENE_HOME/lucene-1.4.3.jar *.java
  2. java -classpath $LUCENE_HOME/lucene-1.4.3.jar:. MyIndexer
  3. java -classpath $LUCENE_HOME/lucene-1.4.3.jar:. MySearcher

If you are using a different version of Lucene, change lucene-1.4.3.jar in these commands accordingly. The first command compiles the two Java classes and produces no output. The second command creates the index in $SAMPLE_CODE_HOME/index; it produces no output either. The third command should generate output similar to Figure 1.

 
Figure 1. Screen Shot of Output from Test Search on the Index

The downloadable code includes a small collection of sample documents in text format. The documents are SEC filings by public companies. Take a look at the text files in the documents subdirectory of the unpacked downloadable code archive. You may also want to examine MyIndexer.java to see how to index a collection of documents. By writing different index code, you can index different types of documents. I chose to index text documents because it is simple to demonstrate and because much of the world’s data is in plain text. You could alter the indexFile method of the MyIndexer class to parse HTML or XML files and allow intelligent searching of HTML or XML without the markup polluting the search space. Presumably your new version of the indexFile method would strip the HTML/XML markup and leave only the bare content along with metadata in Lucene fields, because you would not want people matching on markup tags instead of on the content itself.
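As a rough illustration of the markup stripping described above, the following plain-Java helper removes anything that looks like a tag and collapses the leftover whitespace. MarkupStripper and stripMarkup are hypothetical names, not part of the sample code or of Lucene, and the regular expression is deliberately naive: it does not decode entities and does not treat comments, scripts, or CDATA sections specially. For real HTML you would more likely use a proper parser.

```java
import java.util.regex.Pattern;

// Naive markup stripper (hypothetical helper, not part of the sample code).
final class MarkupStripper {
    // Anything that looks like a tag: "<", then any run of non-">" characters, then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripMarkup(String markup) {
        // Replace each tag with a space so words on either side of a tag stay separated,
        // then collapse runs of whitespace into single spaces.
        String withoutTags = TAG.matcher(markup).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Annual Report</h1><p>Net <b>income</b> rose.</p></body></html>";
        System.out.println(stripMarkup(html)); // Annual Report Net income rose.
    }
}
```

The text returned by a helper like this is what you would hand to the indexer as the document content, keeping the tag names out of the search space.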

As you may have deduced from the three steps you just executed, once you have indexed the files, you can call the Lucene search methods to retrieve matches to queries. Try some of the queries shown in Figure 1, and try others as well. The search results you see in Figure 1 are in a simple format that is not XML: key and value pairs. MySearcher.java obtains the information from the Lucene API via the following section of code:

    System.out.println("score = " + hits.score(i));
    System.out.println("filename = " + document.get("filename"));
    System.out.println("companyName = " + document.get("companyName"));
    System.out.println("format = " + document.get("format"));
    System.out.println("formType = " + document.get("formType"));
    System.out.println("filedDate = " + document.get("filedDate"));

As you can see, Lucene exposes a hash table-like interface for retrieving fields, which makes it easy to obtain field values and display the information in key/value pairs. If you wrap the Lucene search results in XML, you can apply XSLT to the XML and manipulate the data within Cocoon using the natural method of transformations. The following section demonstrates this technique.
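As a small preview of the wrapping idea, here is a plain-Java sketch that turns a set of field name/value pairs into one XML fragment. The field names companyName and formType come from the output above; the wrapper element name hit, the class HitXmlWriter, and the sample company name are assumptions made for illustration and may not match what search_result.xsp actually emits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: wraps one search result's fields in an XML fragment.
final class HitXmlWriter {
    // Escape the characters that would otherwise break XML element content.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Each map key becomes an element name; keys are assumed to be valid XML names.
    static String toXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<hit>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append('<').append(field.getKey()).append('>')
               .append(escape(field.getValue()))
               .append("</").append(field.getKey()).append('>');
        }
        return xml.append("</hit>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("companyName", "Example & Co");          // made-up value
        fields.put("formType", "10-K");
        System.out.println(toXml(fields));
        // <hit><companyName>Example &amp; Co</companyName><formType>10-K</formType></hit>
    }
}
```

Escaping matters here because company names and other free-form field values can contain characters such as the ampersand, which would otherwise produce XML that Cocoon cannot parse.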

Lucene Search Results in XML
For a basic Web-based search interface, you need three pages:

  1. The search form where the user fills out the query terms
  2. The results page that displays the results of the search
  3. The document detail page that displays the document when the user clicks on a result item

The search form is simple: a standard Cocoon pipeline that displays a static page. You could easily tailor it to display personalized information like the user’s name, the weather in the user’s city, or other information. This tutorial doesn’t show that, but the point is that the search input page doesn’t by itself do anything pertaining to the search other than display a form to collect information on the query from the user.

Look at search_form.html in the content subdirectory of the sample code. The results page is where Lucene-related logic goes. It is generated by content/search_result.xsp for the actual data in XML format in conjunction with style/search_result.xsl for the stylesheet that transforms the XML data into HTML. The document detail page probably will not contain any Lucene-related code, although if you have unusual application requirements you might have some Lucene-related code. The detail page varies widely based on your presentation needs and preferences, and it is not part of the search logic, so it isn’t covered here.

Perform the following actions to set up a Web application that demonstrates Lucene integration with Cocoon:

  1. Open $SAMPLE_CODE_HOME/content/search_result.xsp in your editor (emacs or vi or whatever you like).
  2. Change the value of indexLocation from “/home/wchao/tmp/scratch/sample_code_lucene/index” to “$SAMPLE_CODE_HOME/index”. “$SAMPLE_CODE_HOME/index” means something like “/home/xyz/projects/sample_code_lucene/index” (i.e., do not literally insert “$SAMPLE_CODE_HOME”, but instead replace $SAMPLE_CODE_HOME with the full path of the sample_code_lucene directory).
  3. Type: “cd $SAMPLE_CODE_HOME” (without the quotes and substituting the value of $SAMPLE_CODE_HOME for the placeholder)
  4. Type: “cp -a my_lucene_app $COCOON_HOME” (without the quotes and substituting the value of $COCOON_HOME for the placeholder)

Now open your Web browser and go to http://$SERVER_HOSTNAME:8080/cocoon/my_lucene_app/search_form.html. You should see a screen similar to Figure 2.

 
Figure 2. Search_form Web Interface

If you type in the query “car”, you should see Avis Group Holdings. You can click on the highlighted name to view the document detail page for Avis Group Holdings. You should also try some of the queries previously entered in the command line search program (MySearcher).

To review how the Cocoon Web application works and how it makes use of Lucene, view the $COCOON_HOME/my_lucene_app/sitemap.xmap file. The search_form.html page is a simple HTML page that contains a form – no explanation needed. The search_result.html page is actually two files: content/search_result.xsp and style/search_result.xsl. First, search_result.xsp generates the XML data, and then search_result.xsl transforms the XML data into HTML. The following is a breakdown of the search_result.xsp file’s process:

  1. The user’s query is retrieved via request.getParameter("query").
  2. A new Searcher is instantiated. As the name implies, a Searcher lets you search through a collection of documents.
  3. A new Analyzer is instantiated. An Analyzer breaks text into the terms that Lucene actually matches on, for example by lowercasing words and dropping common stop words. The query should be analyzed in the same way the documents were analyzed at index time, so that query terms line up with indexed terms. Analyzers are useful because they let you match in different ways (soundex, suffix stripping, synonyms, etc.). If you need a new way of matching, you can just write an Analyzer.
  4. A Query is instantiated. You can think of a Query as a package containing the query terms after the Analyzer has processed them.
  5. The Searcher instance is invoked with the query.
  6. The Searcher instance returns a Hits collection. Each object in the Hits collection is a Document instance.
  7. You then iterate over the Hits collection, assigning the current Document instance to a variable in the loop.
  8. Each iteration in the Hits collection generates an XML fragment.
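To give a feel for the Analyzer in step 3, here is a toy, Lucene-free sketch of the kind of normalization an analyzer can perform: lowercasing, stop-word removal, and a very crude form of suffix stripping. ToyAnalyzer, its stop-word list, and its plural rule are all assumptions made for this example; they do not reproduce the behavior of any Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy text analysis (NOT a Lucene Analyzer): turns raw text into normalized terms.
final class ToyAnalyzer {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty splits and stop words
            }
            // Crude suffix stripping: drop one trailing "s" from words longer than 3 letters.
            // This mangles some words; real stemmers are much more careful.
            String term = (token.length() > 3 && token.endsWith("s"))
                    ? token.substring(0, token.length() - 1)
                    : token;
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Rental cars and trucks")); // [rental, car, truck]
        System.out.println(analyze("car"));                    // [car]
    }
}
```

Because the document text and the query pass through the same normalization, a query for "car" lines up with a document that only says "cars". That is the reason the same analysis is normally applied at both index time and query time.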
 
Figure 3. XML File Transformed into a Table

The XML fragment for an example document might look like this:

  1  .532  53  00283654.txt  Walmart  text  10-K  20040930

In search_result.xsl, you transform the XML file with multiple tags into the table that you see in Figure 3.
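Inside Cocoon, the sitemap applies the stylesheet for you, but it can help to see the same kind of transformation run standalone. The sketch below uses the JDK's built-in javax.xml.transform API to turn a small XML document of hits into an HTML table. The element names hits and hit, the class HitsToTable, and the inline stylesheet are assumptions made for illustration; the real search_result.xsl in the sample code may be structured quite differently.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Standalone XSLT sketch (hypothetical element names; not the sample stylesheet).
final class HitsToTable {
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        // One table for the whole result set...
        + "<xsl:template match=\"/hits\">"
        + "<table><xsl:apply-templates select=\"hit\"/></table>"
        + "</xsl:template>"
        // ...and one row per hit.
        + "<xsl:template match=\"hit\">"
        + "<tr><td><xsl:value-of select=\"companyName\"/></td>"
        + "<td><xsl:value-of select=\"formType\"/></td></tr>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<hits><hit><companyName>Walmart</companyName>"
                   + "<formType>10-K</formType></hit></hits>";
        System.out.println(transform(xml)); // prints a table with one row for the hit
    }
}
```

Keeping the data (XML) separate from the presentation (XSLT) is what lets you later swap in a different stylesheet for another output channel without touching the search logic.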

Lastly, when you click on a company name, you see the document detail page (an SEC filing for a company). The document detail page is implemented with a simple Cocoon pipeline in the sitemap.

Cocoon and Lucene: A Powerful Combination
Hopefully, you now agree that Cocoon and Lucene form a powerful combination that enables you to quickly and easily develop Web applications for searching unstructured information. By wrapping the Lucene return data in XML and then using Cocoon’s XML-handling capabilities, you tap into the power of XML for multi-channel publishing (Web and PDA/wireless, for example) of the huge amount of unstructured information in the world.
