Lucene is one of the bright lights in the world of open source software: an industrial-strength package that many companies use for a wide range of purposes. I have been using it for three years to add indexing and search capabilities to my Java applications, although Lucene technology is now available for Python, C/C++, and Perl programmers as well.
This article provides a crash course in using Lucene and presents an open source project I've written that uses Lucene to index, search, cluster, and categorize documents in several formats (Word, PowerPoint, PDF, HTML, OpenOffice.org, and AbiWord).
Before going any further, download the accompanying code file so that you can read through the full source code for the examples. In addition to the Java and JSP source files for the examples, the download file also contains the third-party JAR library files that you will need.
Now, let's begin. I usually use Lucene in Web applications running in the Tomcat servlet/JSP container. The next section discusses this use case, but first let's examine two simple programs: one for building a Lucene index and the other for searching with this index. These examples just scratch the surface of uses for the Lucene APIs, but they suffice for getting you started on simple projects. Refer to the online Lucene documentation for more advanced uses.
An instance of the Lucene Document class is a container for fields (a field is a name and a value associated with that name). The following are the four types of fields available:
Keyword: The text stored in the value part of this field is indexed as a single term, without being tokenized, and is stored in the index.
UnIndexed: The text stored in the value part of this field is stored in the index but is not indexed, so it cannot be searched.
UnStored: The text stored in the value part of this field is analyzed with a specific tokenizer and indexed so that it can be searched but is not stored in the index.
Text: The text stored in the value part of this field is analyzed with a specific tokenizer and indexed so that it can be searched, and it is also stored in the index.
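One way to keep the four field types straight is to reduce each one to three yes/no properties: is the value indexed (searchable), is it tokenized before indexing, and is the original value stored? The sketch below is a standalone model of that idea and does not call Lucene at all; the FieldKind enum and its flag names are invented here for illustration. (In the Lucene 1.x API, the real fields are created with static factory methods on the Field class, such as Field.Keyword(name, value) and Field.Text(name, value).)

```java
// Illustrative model only: FieldKind is a hypothetical type, NOT part of Lucene.
// It restates the four field types described above as three boolean flags.
enum FieldKind {
    //        indexed, tokenized, stored
    KEYWORD  (true,  false, true),
    UNINDEXED(false, false, true),
    UNSTORED (true,  true,  false),
    TEXT     (true,  true,  true);

    final boolean indexed;   // can a query match this field?
    final boolean tokenized; // is the value broken into terms before indexing?
    final boolean stored;    // can the original value be read back from a hit?

    FieldKind(boolean indexed, boolean tokenized, boolean stored) {
        this.indexed = indexed;
        this.tokenized = tokenized;
        this.stored = stored;
    }
}

class FieldKindDemo {
    public static void main(String[] args) {
        // Print one row per field type so the differences are easy to compare.
        for (FieldKind kind : FieldKind.values()) {
            System.out.printf("%-10s indexed=%-5b tokenized=%-5b stored=%b%n",
                    kind, kind.indexed, kind.tokenized, kind.stored);
        }
    }
}
```

Read this way, Keyword and Text differ only in tokenization, while UnStored and Text differ only in whether the original value is kept.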
Sometimes it is useful to store indexed text in the index itself using Text fields. For example, in most of my projects I like to show highlighted search words in the original text. Having the original text cached in the index makes this simple to do. For applications where search results need to show only files matching a query, using an UnStored field saves room in the index. I often use UnIndexed fields to store the original document type (e.g., file, Web URL, or database query to get the indexed data) and the original data location (e.g., file path or URL).
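To make the trade-off concrete, here is a toy in-memory model of what each choice leaves behind for a search hit. It is a sketch under simplifying assumptions, not Lucene code: the ToyDocument class, its method names, the crude tokenizer, and the sample path /docs/report.pdf are all hypothetical, and a real analyzer does much more than split on punctuation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy stand-in for an indexed document; NOT the Lucene Document class.
class ToyDocument {
    // Values a search hit can hand back, keyed by field name.
    final Map<String, String> stored = new LinkedHashMap<>();
    // Terms a query can match, keyed by field name.
    final Map<String, List<String>> terms = new LinkedHashMap<>();

    // Stored only: retrievable from a hit, invisible to queries.
    void addUnIndexed(String name, String value) {
        stored.put(name, value);
    }

    // Indexed only: searchable, but the original text is not kept.
    void addUnStored(String name, String value) {
        terms.put(name, tokenize(value));
    }

    // Indexed and stored: searchable, and the original text is kept
    // (which is what makes highlighting straightforward).
    void addText(String name, String value) {
        stored.put(name, value);
        terms.put(name, tokenize(value));
    }

    // True if the given word is one of the indexed terms of the field.
    boolean matches(String field, String word) {
        List<String> fieldTerms = terms.get(field);
        return fieldTerms != null && fieldTerms.contains(word.toLowerCase(Locale.ROOT));
    }

    // Crude tokenizer standing in for a real analyzer: lowercases the text
    // and splits on anything that is not an ASCII letter or digit.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}

class ToyDocumentDemo {
    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.addUnIndexed("type", "file");
        doc.addUnIndexed("location", "/docs/report.pdf"); // hypothetical path
        doc.addText("contents", "Quarterly sales report");

        System.out.println(doc.matches("contents", "sales"));  // true
        System.out.println(doc.matches("location", "report")); // false: stored, not indexed
        System.out.println(doc.stored.get("location"));        // original location is still available
    }
}
```

Swapping addText for addUnStored on the contents field keeps the document searchable but leaves nothing in stored to highlight, which is the space-for-convenience trade described above.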