Integrate Cocoon with Lucene for Full Text Search of Unstructured Data

By wrapping the Lucene search engine's return data in XML and then using Cocoon's XML-handling capabilities, you tap into the power of XML for multi-channel publishing of unstructured information.

Cocoon is a powerful tool for publishing content to multiple formats from XML. This second article of a two-part series describes how to integrate Cocoon with Lucene, a full text search engine written in Java. Lucene allows you to easily index and search unstructured information in a collection of documents. (For further details on Cocoon, see the first article, "Integrating Cocoon with PostgreSQL.")

The combination of Cocoon and Lucene enables you to process both structured and unstructured data. That matters because, although structured data is easier to search and compare, most of the world's data is unstructured: letters and other free-form documents.

The Setup
As a prerequisite, you must have a functioning Tomcat and Cocoon setup. The "Integrating Cocoon with PostgreSQL" article provides setup instructions for both. If you plan to use PostgreSQL as well, perform the steps described in the "Integrate PostgreSQL" section of that article. Otherwise, you need to perform only the steps described in the "Set Up Tomcat and Cocoon" section. I have written Cocoon applications that use both PostgreSQL and Lucene, but you may need only one of the two alongside Cocoon. This article assumes you need at least Lucene.

Note: This article refers to $TOMCAT_HOME, $COCOON_HOME, and $SERVER_HOSTNAME. Those variables refer to the same values described in the previous article.

To integrate Cocoon 2.1.6 and Lucene 1.4.3, you first need to perform the following steps:

  1. Download the Lucene distribution. You should get lucene-1.4.3.tar.gz or a later version, if there is one.
  2. Unpack the Lucene distribution into a directory (preferably outside of the Tomcat directory tree to avoid confusion). This article subsequently refers to /path/to/lucene-1.4.3 as $LUCENE_HOME.
  3. Delete $COCOON_HOME/WEB-INF/lib/lucene-1.4.1.jar, the Lucene version that ships with Cocoon 2.1.6 by default. Do not skip this step: if the old jar remains, Java may load it instead, and you will wonder why the new features in Lucene 1.4.3 (or whatever newer version you drop in) are unavailable.
  4. Copy lucene-1.4.3.jar from $LUCENE_HOME into $COCOON_HOME/WEB-INF/lib/.
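
A quick way to confirm that only one Lucene jar remains in the lib directory is a small check like the one below. It is a minimal sketch that uses only the JDK (Java 7 or later is assumed); the class name LuceneJarCheck and the method findLuceneJars are hypothetical and are not part of Cocoon, Lucene, or this article's sample code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Cocoon or Lucene): lists the Lucene jars
// in a lib directory so that a leftover older jar is easy to spot.
final class LuceneJarCheck {

    // Returns the names of files matching lucene-*.jar, sorted alphabetically.
    static List<String> findLuceneJars(File libDir) {
        List<String> jars = new ArrayList<>();
        String[] names = libDir.list();
        if (names == null) {
            return jars; // path is not a directory, or it is not readable
        }
        Arrays.sort(names);
        for (String name : names) {
            if (name.startsWith("lucene-") && name.endsWith(".jar")) {
                jars.add(name);
            }
        }
        return jars;
    }

    public static void main(String[] args) {
        // Pass the path to $COCOON_HOME/WEB-INF/lib as the only argument.
        File libDir = new File(args.length > 0 ? args[0] : ".");
        List<String> jars = findLuceneJars(libDir);
        System.out.println("Lucene jars found: " + jars);
        if (jars.size() > 1) {
            System.out.println("Warning: more than one Lucene jar is present.");
        }
    }
}
```

If the program reports more than one jar, delete the older one before restarting Tomcat.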

Before Lucene can search your documents, you must index them. Creating the index is not difficult, but it does require a small amount of code: Lucene is a library, not an out-of-the-box search application. It provides an API that lets you index and search your files, and you must use that API to write a short Java program that creates the index.
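
To make the index-first, search-later split concrete, here is a toy inverted index written with only the JDK (Java 8 or later is assumed). It is deliberately not Lucene's API or index format, and the class ToyIndex is hypothetical; it only illustrates why a separate indexing pass makes later lookups cheap.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: Lucene's real index is far more capable.
final class ToyIndex {
    // Maps each term to the names of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Indexing step: split the text into lowercase terms and record
    // the document name under each term.
    void add(String docName, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    // Search step: a single-term lookup against the prebuilt map.
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a.txt", "Annual report for fiscal year 2004");
        index.add("b.txt", "Quarterly report");
        System.out.println(index.search("report")); // prints [a.txt, b.txt]
        System.out.println(index.search("annual")); // prints [a.txt]
    }
}
```

The search never rereads the documents; it consults only the structure built during indexing. That is the same division of labor the indexing and searching programs in this article follow with Lucene.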

Download this article's sample code, then perform the following actions:

  1. Type: "tar xzf sample_code_cocoon_lucene.tgz" (without the quotes)
  2. Move the sample_code_lucene directory, if you want to place it somewhere more convenient. Henceforth, the sample_code_lucene directory will be referred to as $SAMPLE_CODE_HOME. $SAMPLE_CODE_HOME is the full path to sample_code_lucene.
  3. Type: "cd $SAMPLE_CODE_HOME/cli" (without the quotes, and substituting the value of $SAMPLE_CODE_HOME for the placeholder)

The downloadable code includes two Java source files called MyIndexer.java and MySearcher.java. Compile them, generate the index, and then run a test search on the index with the following sequence of commands:

  1. javac -classpath $LUCENE_HOME/lucene-1.4.3.jar *.java
  2. java -classpath $LUCENE_HOME/lucene-1.4.3.jar:. MyIndexer
  3. java -classpath $LUCENE_HOME/lucene-1.4.3.jar:. MySearcher

If you are using a different version of Lucene, substitute its jar name for lucene-1.4.3.jar. The first command compiles the two Java classes and produces no output. The second command creates the index in $SAMPLE_CODE_HOME/index, also without output. The third command should generate output similar to Figure 1.

Figure 1. Screen Shot of Output from Test Search on the Index

The downloadable code includes a small collection of sample documents in text format: SEC filings by public companies. Take a look at the text files in the documents subdirectory of the unpacked archive. You may also want to examine MyIndexer.java to see how to index a collection of documents.

By writing different index code, you can index different types of documents. I chose text documents because they are simple to demonstrate and because much of the world's data is in plain text. You could alter the indexFile method of the MyIndexer class to parse HTML or XML files, which would allow intelligent searching without the markup polluting the search space. Presumably your new version of indexFile would strip the markup and store only the bare content, along with metadata, in Lucene fields, because you would not want people matching on markup like <body> or <p> or <font>.
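
As a rough illustration of the markup-stripping idea, the sketch below removes tags with regular expressions before the text would be handed to an indexer. It is a naive, JDK-only approach under stated assumptions: MarkupStripper and stripMarkup are hypothetical names, not part of the sample code, and real-world HTML (script and style blocks, malformed tags, the full entity set) is better handled by a proper parser.

```java
// Hypothetical helper: a naive regex-based stripper, good enough for simple,
// well-formed markup but not a substitute for a real HTML or XML parser.
final class MarkupStripper {

    static String stripMarkup(String markup) {
        // Remove comments first so a '>' inside a comment cannot confuse
        // the tag pattern below.
        String text = markup.replaceAll("(?s)<!--.*?-->", " ");
        // Replace every tag with a space so adjacent words stay separated.
        text = text.replaceAll("(?s)<[^>]*>", " ");
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by the removed tags.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<body><p>AT&amp;T <font size=\"2\">annual</font> report</p></body>";
        System.out.println(stripMarkup(html)); // prints AT&T annual report
    }
}
```

An indexFile variant could pass each file's contents through a method like this and index the result, so that a query for "font" matches only documents that actually discuss fonts.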

As you may have deduced from the three commands you just executed, once you have indexed the files, you can call the Lucene search methods to retrieve matches to queries. Try some of the queries shown in Figure 1, and try others as well. The search results in Figure 1 are in a simple, non-XML format: key/value pairs. MySearcher.java obtains the information from the Lucene API via the following section of code:

System.out.println("score = " + hits.score(i));
System.out.println("filename = " + document.get("filename"));
System.out.println("companyName = " + document.get("companyName"));
System.out.println("format = " + document.get("format"));
System.out.println("formType = " + document.get("formType"));
System.out.println("filedDate = " + document.get("filedDate"));

As you can see, Lucene exposes a hash table-like interface for retrieving fields, which makes it easy to obtain field values and display them as key/value pairs. If you wrap the Lucene search results in XML, you can apply XSLT to that XML and manipulate the data within Cocoon through its natural mechanism, transformations. The following section demonstrates this technique.
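
Before moving on, it may help to see how little is involved in turning key/value pairs into XML. The following JDK-only sketch wraps a list of hits, each represented as a map of field names to values, in an XML document. The class HitsToXml and the element names results and hit are assumptions made for illustration; the format used in the article's own code may differ. The sketch also assumes that every field name is a valid XML element name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: serializes search hits (field name -> value) as XML.
final class HitsToXml {

    // Escapes the characters that are significant in XML text content.
    // A null value (a field missing from a document) becomes an empty string.
    static String escape(String s) {
        if (s == null) {
            return "";
        }
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Emits one <hit> element per map, with one child element per field.
    static String toXml(List<Map<String, String>> hits) {
        StringBuilder xml = new StringBuilder("<results>\n");
        for (Map<String, String> hit : hits) {
            xml.append("  <hit>\n");
            for (Map.Entry<String, String> field : hit.entrySet()) {
                xml.append("    <").append(field.getKey()).append('>')
                   .append(escape(field.getValue()))
                   .append("</").append(field.getKey()).append(">\n");
            }
            xml.append("  </hit>\n");
        }
        return xml.append("</results>\n").toString();
    }

    public static void main(String[] args) {
        // Sample values are made up; a real program would fill the map
        // from the score and field lookups shown above.
        Map<String, String> hit = new LinkedHashMap<>();
        hit.put("score", "0.5");
        hit.put("companyName", "AT&T <Corp>");
        List<Map<String, String>> hits = new ArrayList<>();
        hits.add(hit);
        System.out.print(toXml(hits));
    }
}
```

Escaping matters here because company names and other free-form values can contain characters such as & and <, which would otherwise make the output ill-formed and break any XSLT step downstream.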
