Lucene: Add Indexing and Search to Your Web Apps : Page 3

Get a crash course in using Lucene, an open source Java library that enables you to add indexing and search capabilities to your Web applications and documents.


Using Lucene in a Web Application
 
Figure 1. User Enters "Lisp" into the Search Field

The second example in this article shows a (reasonably) practical Web application. You can add any number of Word, PowerPoint, PDF, OpenOffice.org, HTML, and AbiWord documents to a directory of your choice. When the Web application starts up, it indexes all documents in this specified directory and lets users search these documents and download the documents found in the search process. Review Figure 1 and Figure 2 before implementing this example.

After entering search word(s), hit the return key to see the search results.



Notice in the search results (Figure 2) that the same document has a different score in Word format than it does in OpenOffice.org format. This is because the Word text extraction code is not very precise, while the text extractor for OpenOffice.org documents is more precise. Unlike Microsoft Office documents, which have undocumented and changing formats, the OpenOffice.org document formats are well documented and simple to use. The lesson is: if you care about your data, stay away from proprietary data formats.

 
Figure 2. Results from User's "Lisp" Search Query

Implement a Lucene-Based Web Application
I used IntelliJ to build this simple Web application. The example directory web_app_example contains both IntelliJ project files and an Ant build.xml file. Figure 3 shows the files used in the project:

 
Figure 3. IntelliJ Project View of Files for Example Web Application

Most of the code in this example deals with reading a variety of document formats. I borrowed the code from my KBtextmaster open source project, which contains a lot of natural language processing code that 99 percent of the readers of this article probably don't care about, so I pulled out just what you need here for extracting text from a variety of document types. The example uses this code without discussing it since I want to concentrate on using Lucene in this article.

The Java class StartupWorkServlet runs automatically when the JSP container (I use Tomcat 5.x) starts up (see the file web.xml if you don't already know how to do this). This class is derived from the Servlet class, but it does not process GET or POST requests. Its sole purpose is to index the files in a specified directory when Tomcat starts up.

You can specify this directory by defining a file path as the value of the environment variable PORTAL_DATA_PATH. You should place sample documents in the directory $PORTAL_DATA_PATH/documents. A Lucene index will be built in $PORTAL_DATA_PATH/data/LUCENE. The value of PORTAL_DATA_PATH is determined in static initialization code in the class API (in my KBtextmaster project, this class contains the public API for the system; here it is only a place to get the root document path).
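The directory layout described above can be sketched in plain Java. This is only an illustration of the convention, not the actual API class from the example project: the class name PortalPaths and its method names are hypothetical, and the only facts taken from the article are the environment variable name and the two subdirectories.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the example project): derives the two
// directories described in the article from a root path.
final class PortalPaths {
    private final Path root;

    PortalPaths(String rootPath) {
        // Fail early if the root path is missing or blank.
        if (rootPath == null || rootPath.trim().isEmpty()) {
            throw new IllegalStateException("PORTAL_DATA_PATH is not set");
        }
        this.root = Paths.get(rootPath.trim());
    }

    // Reads the root from the PORTAL_DATA_PATH environment variable.
    static PortalPaths fromEnvironment() {
        return new PortalPaths(System.getenv("PORTAL_DATA_PATH"));
    }

    // Where the sample documents are expected: $PORTAL_DATA_PATH/documents
    Path documentsDir() {
        return root.resolve("documents");
    }

    // Where the Lucene index is built: $PORTAL_DATA_PATH/data/LUCENE
    Path indexDir() {
        return root.resolve("data").resolve("LUCENE");
    }
}
```

Calling PortalPaths.fromEnvironment() once at startup and failing fast when the variable is missing makes a misconfigured Tomcat environment easier to spot than a later "index not found" error.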

The code in StartupWorkServlet is almost identical to the code in the MakeIndex class in the previous example, except I use my GetDocumentText utility class to extract plain text from the documents to be indexed.
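Before any text is extracted, the startup code has to decide which files in the documents directory to hand to the indexer. The following is a minimal sketch of that selection step using only the JDK; it is not the code from StartupWorkServlet or GetDocumentText. The class name DocumentScanSketch is hypothetical, and the extension list is my assumption about how the document types named in this article map to file names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical sketch: picks the files in a directory that look indexable.
final class DocumentScanSketch {
    // Assumed extensions for Word, PowerPoint, PDF, OpenOffice.org,
    // HTML, and AbiWord documents; adjust to match your extractor.
    private static final Set<String> SUPPORTED = new HashSet<>(
            Arrays.asList("doc", "ppt", "pdf", "sxw", "odt", "html", "htm", "abw"));

    // True if the file name ends in one of the assumed extensions
    // (case-insensitive).
    static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(ext);
    }

    // Lists regular files directly inside documentsDir (no recursion)
    // whose names pass isSupported, in sorted order.
    static List<Path> listIndexableFiles(Path documentsDir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (Stream<Path> entries = Files.list(documentsDir)) {
            entries.filter(Files::isRegularFile)
                   .filter(p -> isSupported(p.getFileName().toString()))
                   .forEach(result::add);
        }
        Collections.sort(result);
        return result;
    }
}
```

Each returned path would then go through text extraction and be added to the Lucene index, as in the MakeIndex example.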

The code to perform the search is embedded directly in the JSP file index.jsp (I left out the page import directives for brevity):

<%@ page contentType="text/html;charset=UTF-8" language="java" %>
<html>
<head><title>Example Lucene Search Web Application</title></head>
<body>
<h3>Please enter your search query:</h3>
<form action="index.jsp" method="post">
  <input type="text" size="50" name="query" />
</form>
<%
  if (request.getParameter("query") != null &&
      request.getParameter("query").length() > 1) {
%>
<h3>Results for '<%=request.getParameter("query")%>':</h3>
<%
    try {
      Searcher searcher = new IndexSearcher(API.getRootPath() + "/data/LUCENE/");
      Analyzer analyzer = new StandardAnalyzer();
      Query query = QueryParser.parse(request.getParameter("query"), "text", analyzer);
      System.out.println("Searching for: " + query.toString("text"));
      Hits hits = searcher.search(query);
      out.println("<h3>Number of matching documents = " + hits.length()+"</h3>");
      for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        out.println("<p>File: <a href=\"download.jsp?download=" +
                    doc.get("filepath")+"\">" + doc.get("filepath") +
                    "</a>, score: " + hits.score(i)+"</p>");
      }
      searcher.close();
    } catch (Exception ee) {
      out.println("<b><p>Error: " + ee + "</p></b>");
    }
  }
%>
</body>
</html>
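The index.jsp listing writes the query text and each file path into the page as-is. That works for simple paths, but a path containing a space, an ampersand, or a quote would produce a broken link, and an unescaped query string can inject markup into the page. The following sketch shows one way to build the same download link more defensively with only the JDK. The class and method names (ResultLinkSketch, escapeHtml, downloadLink) are hypothetical and are not part of the example project.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the "download.jsp?download=..." link
// with URL encoding for the parameter and HTML escaping for the output.
final class ResultLinkSketch {
    // Replaces the five characters that are significant in HTML text
    // and attribute values with character references.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns an anchor element whose href carries the URL-encoded path
    // and whose visible text is the HTML-escaped path.
    static String downloadLink(String filePath) {
        try {
            String href = "download.jsp?download="
                    + URLEncoder.encode(filePath, "UTF-8");
            return "<a href=\"" + escapeHtml(href) + "\">"
                    + escapeHtml(filePath) + "</a>";
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

In index.jsp, the out.println call inside the loop could then emit downloadLink(doc.get("filepath")), and the results heading could print escapeHtml(request.getParameter("query")). The container decodes the parameter again, so download.jsp still receives the original path.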

The only other interesting part of this application is the code for downloading files. Notice in the index.jsp listing that each download URL looks like this:

download.jsp?download=<an absolute file path>

The following listing shows the contents of download.jsp:

<%
  String path = request.getParameter("download");
  if (path != null) {
    try {
      response.setContentType("application/binary");
      response.setHeader("Content-disposition",
                         "attachment; filename=\"" + path + "\"");
      ServletOutputStream sos = response.getOutputStream();
      com.knowledgebooks.utils.FileUtils.copyFilePathToOutputStream(
          new java.io.File(path), sos);
    } catch (Exception ee) {
      ee.printStackTrace();
    }
  }
%>

The static utility method FileUtils.copyFilePathToOutputStream simply copies the contents of a local file to the servlet output stream.
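The source of FileUtils is not shown in this article, so the following is only a sketch of what such a copy helper might look like, written against the JDK rather than the KBtextmaster code. It also adds a check that download.jsp does not perform: because the download parameter is an absolute path supplied by the browser, a request could name any file that Tomcat can read. The sketch refuses paths outside an allowed directory (for this application, presumably $PORTAL_DATA_PATH/documents). The names DownloadSketch, isInside, and copyFileToStream are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a guarded file-to-stream copy.
final class DownloadSketch {
    // True if 'requested' lies inside 'allowedDir' after ".." and "."
    // segments are normalized away. Note that this does not resolve
    // symbolic links; use Path.toRealPath() if that matters to you.
    static boolean isInside(Path allowedDir, Path requested) {
        Path base = allowedDir.toAbsolutePath().normalize();
        Path target = requested.toAbsolutePath().normalize();
        return target.startsWith(base);
    }

    // Copies the file to the stream and returns the number of bytes
    // written; throws IOException if the path is outside allowedDir
    // or is not a regular file.
    static long copyFileToStream(Path allowedDir, Path requested,
                                 OutputStream out) throws IOException {
        if (!isInside(allowedDir, requested) || !Files.isRegularFile(requested)) {
            throw new IOException("Refusing to serve: " + requested);
        }
        return Files.copy(requested, out);
    }
}
```

In download.jsp, the same idea would mean converting the download parameter to a Path, checking it against the documents directory before opening the response stream, and sending only the file name (requested.getFileName()) in the Content-disposition header instead of the full path.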


