Lucene: Add Indexing and Search to Your Web Apps

Lucene is one of the bright lights in the world of open source software: an industrial-strength package that many companies use for many diverse purposes. I have been using it for three years to add indexing and search capabilities to my Java applications, although Lucene technology is now available for Python, C/C++, and Perl programmers as well.

This article provides a crash course for using Lucene and presents an open source project I’ve written that uses Lucene to index, search, cluster, and categorize documents (Word, PowerPoint, PDF, HTML, OpenOffice.org, and AbiWord).

Fast Start
Before going any further, download the accompanying code file so that you can read through the full source code for the examples. In addition to the Java and JSP source files for the examples, the download file also contains the third-party JAR library files that you will need.

Now, let’s begin. I usually use Lucene in Web applications running in the Tomcat servlet/JSP container. The next section discusses this use case, but first let’s examine two simple programs: one for building a Lucene index and the other for searching with this index. These examples just scratch the surface of uses for the Lucene APIs, but they suffice for getting you started on simple projects. Refer to the online Lucene documentation for more advanced uses.

An instance of the Lucene Document class is a container for fields (a field is a name and a value associated with that name). The following are the four types of fields available:

  1. Keyword: The text stored in the value part of this field is not tokenized; it is indexed as a single term and stored in the index.
  2. UnIndexed: The text stored in the value part of this field is stored in the index but is not indexed, so it cannot be searched.
  3. UnStored: The text stored in the value part of this field is analyzed with a specific tokenizer and indexed so that it can be searched, but it is not stored in the index.
  4. Text: The text stored in the value part of this field is analyzed with a specific tokenizer and indexed so that it can be searched, and it is also stored in the index.
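The four field types differ only in three yes/no properties, which can be summarized as a small table in code. The following is a toy model for illustration only; the names FieldKind and FieldKindDemo are hypothetical and are not part of the Lucene API.

```java
// Toy model of the four field types described above; not part of Lucene.
enum FieldKind {
    //        stored, indexed, tokenized
    KEYWORD  (true,   true,    false),
    UNINDEXED(true,   false,   false),
    UNSTORED (false,  true,    true),
    TEXT     (true,   true,    true);

    final boolean stored;     // can the original value be read back from the index?
    final boolean indexed;    // can the field be searched?
    final boolean tokenized;  // is the value run through an analyzer first?

    FieldKind(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }
}

class FieldKindDemo {
    // Formats one row of the summary table.
    static String describe(FieldKind k) {
        return k + ": stored=" + k.stored
                 + ", indexed=" + k.indexed
                 + ", tokenized=" + k.tokenized;
    }

    public static void main(String[] args) {
        for (FieldKind k : FieldKind.values()) {
            System.out.println(describe(k));
        }
    }
}
```

Reading the table row by row makes the trade-off visible: a field you only want to display needs stored, a field you only want to search needs indexed.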

Sometimes it is useful to store indexed text in the index itself using Text fields. For example, in most of my projects I like to show highlighted search words in the original text. Having the original text cached in the index makes this simple to do. For applications where search results need to show only files matching a query, using an UnStored field saves room in the index. I often use UnIndexed fields to store the original document type (e.g., file, Web URL, database query to get the indexed data, etc.) and the original data location (e.g., file path, URL, etc.).
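To show why having the stored text at hand makes highlighting simple, here is a minimal sketch that wraps query words in bold tags. It is plain Java and does not use the highlighting utilities from the Lucene sandbox; the class and method names are hypothetical, and a real Web page would also need to HTML-escape the text.

```java
import java.util.List;
import java.util.regex.Pattern;

class HighlightSketch {
    // Wraps each whole-word, case-insensitive occurrence of any query term
    // in <b>...</b>. Assumes storedText is the original text read back
    // from a stored field.
    static String highlight(String storedText, List<String> terms) {
        if (terms.isEmpty()) {
            return storedText;
        }
        // Build one alternation so that earlier replacements are never re-scanned.
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term));
        }
        Pattern p = Pattern.compile("\\b(" + alternation + ")\\b",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(storedText).replaceAll("<b>$1</b>");
    }
}
```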

Creating a Lucene Index
The example file MakeIndex.java in the directory “simple_example” shows the few lines of code required to create a new Lucene index. This demo program reads all text files in the directory “example_text_files” and adds them to the index. The following few lines of code in the example program use the Lucene class libraries.

Create a Lucene IndexWriter instance:

IndexWriter indexWriter = new IndexWriter("index", new StandardAnalyzer(), true);

This creates the index in the directory “index”. The standard analyzer tokenizes text and discards some common noise words. The third argument is a flag to indicate that Lucene should delete any existing indices in the “index” directory. You would set this flag to false if you wanted to add to an existing index.
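To make "tokenizes text and discards some common noise words" concrete, the following toy analyzer lowercases the input, splits it on non-alphanumeric characters, and drops a few stop words. It is only a sketch of the idea: the ToyAnalyzer class and its stop list are assumptions made for illustration, and the real StandardAnalyzer is more sophisticated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    // Small illustrative stop list; an assumption, not Lucene's actual list.
    static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "and", "the", "of", "to", "is", "in"));

    // Lowercases the text, splits on anything that is not a letter or digit,
    // and drops empty tokens and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }
}
```

The same analysis has to be applied to both the indexed text and the query text, which is why the search example later creates its own StandardAnalyzer.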

Create a new Document instance and add it to the index:

Document document = new Document();
document.add(Field.Text("text", new FileReader(fullPath)));
document.add(Field.UnIndexed("filepath", fullPath));
indexWriter.addDocument(document);

The second argument to Field.Text can be the string of text to index (in which case it is stored in the index) or a file reader (in which case the text is read and indexed, but not stored in the index).

Running the MakeIndex Program
When you run the MakeIndex program with the three sample text files that I put in the “example_text_files” directory, you see the following output:

Wrote file ./example_text_files/AI_Go_Consciousness.txt to index.
Wrote file ./example_text_files/Jumpstarting the Semantic.txt to index.
Wrote file ./example_text_files/Loving Lisp.txt to index.

The Lucene index is written to the “index” directory.

Searching an Existing Lucene Index
Create a new search instance and a standard text analyzer:

Searcher searcher = new IndexSearcher("index");
Analyzer analyzer = new StandardAnalyzer();

Note that you specify that the existing index is stored in the directory “index”. The MakeIndex program created this index.

The following code performs a query on a line of text the user enters and prints out the search results (assuming an input stream in):

  String line = in.readLine();
  Query query = QueryParser.parse(line, "text", analyzer);
  System.out.println("Searching for: " + query.toString("text"));
  Hits hits = searcher.search(query);
  System.out.println("Number of matching documents = " + hits.length());
  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println("File: " + doc.get("filepath") + ", score: " + hits.score(i));
  }

When you create a search query, you specify that the search should be performed on any text data in the document field "text". There is nothing special about the name "text"; it is just the field name that you specified in the MakeIndex program.
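Conceptually, indexing a field and then searching it amounts to building and consulting a map from terms to the documents that contain them. The sketch below is a toy in-memory version of that idea, not a description of how Lucene is implemented: the ToyIndex class is hypothetical, it matches documents containing any query word, and it does no scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyIndex {
    // term -> ids of the documents whose "text" field contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> file path; plays the role of the stored "filepath" field
    private final List<String> filepaths = new ArrayList<>();

    // Adds one document and returns its id.
    int add(String filepath, String text) {
        int id = filepaths.size();
        filepaths.add(filepath);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Returns the file paths of documents containing at least one query term.
    List<String> search(String query) {
        Set<Integer> ids = new TreeSet<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            ids.addAll(postings.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        List<String> result = new ArrayList<>();
        for (int id : ids) {
            result.add(filepaths.get(id));
        }
        return result;
    }
}
```

Note that the indexed text itself is never kept here, only the terms and the file path, which mirrors the combination of an unstored searchable field with a stored "filepath" field.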

Running the SearchText Program
The following text shows the input and output from the example program (the input is the text that follows each "Search query" prompt):

Search query (enter a blank query to stop) : AI Go
Searching for: ai go
Number of matching documents = 1
File: ./example_text_files/AI_Go_Consciousness.txt, score: 0.3521486
Search query (enter a blank query to stop) : Lisp cons
Searching for: lisp cons
Number of matching documents = 1
File: ./example_text_files/Loving Lisp.txt, score: 0.18363969
Search query (enter a blank query to stop) : 

Using Lucene in a Web Application

 
Figure 1. User Enters "Lisp" into the Search Field

The second example in this article shows a (reasonably) practical Web application. You can add any number of Word, PowerPoint, PDF, OpenOffice.org, HTML, and AbiWord documents to a directory of your choice. When the Web application starts up, it indexes all documents in this specified directory and lets users search these documents and download the documents found in the search process. Review Figure 1 and Figure 2 before implementing this example.

After entering search word(s), hit the return key to see the search results.

The same document has a different score in Word format than it does in OpenOffice.org format. This is because the Word text extraction code is not very precise, while the text extractor for OpenOffice.org documents is more precise. Unlike Microsoft Office documents that have undocumented and changing formats, the OpenOffice.org document formats are well-documented and simple to use. The lesson is: if you care about your data, stay away from proprietary data formats.

 
Figure 2. Results from User's "Lisp" Search Query

Implement a Lucene-Based Web Application
I used IntelliJ to build this simple Web application. The example directory web_app_example contains both IntelliJ project files and an ant build.xml file. Figure 3 shows the files used in the project:

 
Figure 3. IntelliJ Project View of Files for Example Web Application

Most of the code in this example deals with reading a variety of document formats. I borrowed the code from my KBtextmaster open source project, which contains a lot of natural language processing code that 99 percent of the readers of this article probably don't care about, so I pulled out just what you need here for extracting text from a variety of document types. The example uses this code without discussing it since I want to concentrate on using Lucene in this article.

The Java class StartupWorkServlet is run automatically when the JSP container (I use Tomcat 5.x) starts up (see the file web.xml if you don't already know how to do this). This class is derived from the Servlet class, but it does not process GET or POST requests. Its sole purpose is to index files in a specified directory when Tomcat starts up.

You can specify this directory by defining a file path as the value of the environment variable PORTAL_DATA_PATH. You should place sample documents in the directory $PORTAL_DATA_PATH/documents. A Lucene index will be built in $PORTAL_DATA_PATH/data/LUCENE. The value of PORTAL_DATA_PATH is determined in static initialization code in the class API (in my KBtextmaster project, this class contains the public API for the system; here it only is a place to get the root document path).
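The directory layout described above can be derived from the environment variable in a few lines. This is a minimal sketch, assuming a hypothetical PortalPaths helper; it is not the code of the API class in the download.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class PortalPaths {
    final Path root;       // value of PORTAL_DATA_PATH
    final Path documents;  // $PORTAL_DATA_PATH/documents
    final Path index;      // $PORTAL_DATA_PATH/data/LUCENE

    PortalPaths(String rootPath) {
        if (rootPath == null || rootPath.trim().isEmpty()) {
            throw new IllegalStateException("PORTAL_DATA_PATH is not set");
        }
        root = Paths.get(rootPath);
        documents = root.resolve("documents");
        index = root.resolve("data").resolve("LUCENE");
    }

    // Reads the root path from the environment, as the article describes.
    static PortalPaths fromEnvironment() {
        return new PortalPaths(System.getenv("PORTAL_DATA_PATH"));
    }
}
```

Failing fast when the variable is missing is a design choice made for this sketch; it avoids silently indexing an unintended directory.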

The code in StartupWorkServlet is almost identical to the code in the MakeIndex class in the previous example, except I use my GetDocumentText utility class to extract plain text from the documents to be indexed.

The code to perform the search is embedded directly in the JSP file index.jsp (I left out the import page directives for brevity):

<%@ page contentType="text/html;charset=UTF-8" language="java" %>  Example Lucene Search Web Application      

Please enter your search query:

<% if (request.getParameter("query") != null && request.getParameter("query").length() > 1) { %>

Results for '<%=request.getParameter("query")%>':

<%
    try {
      Searcher searcher = new IndexSearcher(API.getRootPath() + "/data/LUCENE/");
      Analyzer analyzer = new StandardAnalyzer();
      Query query = QueryParser.parse(request.getParameter("query"), "text", analyzer);
      System.out.println("Searching for: " + query.toString("text"));
      Hits hits = searcher.search(query);
      out.println("Number of matching documents = " + hits.length());
      for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        out.println("File: " + doc.get("filepath") + ", score: " + hits.score(i));
      }
      searcher.close();
    } catch (Exception ee) {
      out.println("Error: " + ee);
    }
  }
%>

The only other interesting part of this application is the code for downloading files. Notice in the previous code listing that the URL looked like this:

    download.jsp?download=

The following listing shows the contents of download.jsp:

<%
  String path = request.getParameter("download");
  if (path != null) {
    try {
      response.setContentType("application/binary");
      response.setHeader("Content-disposition",
                         "attachment; filename=\"" + path + "\"");
      ServletOutputStream sos = response.getOutputStream();
      com.knowledgebooks.utils.FileUtils.copyFilePathToOutputStream(
         new java.io.File(path), sos);
    } catch (Exception ee) {
      ee.printStackTrace();
    }
  }
%>

The static utility method FileUtils.copyFilePathToOutputStream simply copies the contents of a local file to the servlet output stream.
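A helper of this kind can be written in a few lines with the standard library. The sketch below is not the FileUtils code from the download; the DownloadSketch class is hypothetical. It also adds a guard that the example above does not have: because the file path comes straight from a request parameter, a production application would probably want to confirm that the requested file lies inside the documents directory before serving it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class DownloadSketch {
    // Resolves the requested path against the documents directory and
    // rejects anything that ends up outside it (for example "../secret.txt"
    // or an absolute path to another directory).
    static Path resolveInside(Path documentsDir, String requested) {
        Path base = documentsDir.toAbsolutePath().normalize();
        Path target = base.resolve(requested).normalize();
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException(
                "Path escapes the documents directory: " + requested);
        }
        return target;
    }

    // Copies the file's bytes to the stream and returns the number of bytes
    // written. In a servlet or JSP, 'out' would be response.getOutputStream().
    static long copyFileToStream(Path file, OutputStream out) throws IOException {
        return Files.copy(file, out);
    }
}
```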

Wrap-Up
The two code examples in this article used a precompiled JAR file containing the Lucene libraries. When I develop applications with Lucene, I prefer to add the Lucene source code directly to my IntelliJ or Eclipse projects.

Lucene is fantastically useful. I hope that this article serves as a fun introduction that motivates you to spend a few hours learning how to integrate Lucene into your own projects.

The Lucene project Web site has a "sandbox" area with several useful sub-projects such as utilities for highlighting search words in retrieved text. Check it out.
