Integrate Cocoon with Lucene for Full Text Search of Unstructured Data : Page 2

By wrapping the Lucene search engine's return data in XML and then using Cocoon's XML-handling capabilities, you tap into the power of XML for multi-channel publishing of unstructured information.


Lucene Search Results in XML
For a basic Web-based search interface, you need three pages:
  1. The search form where the user fills out the query terms
  2. The results page that displays the results of the search
  3. The document detail page that displays the document when the user clicks on a result item

The search form is simple: a standard Cocoon pipeline that displays a static page. You could tailor it to display personalized information such as the user's name or the weather in the user's city. This tutorial doesn't show that; the point is that the search form page does nothing search-related except collect the query from the user.

Look at search_form.html in the content subdirectory of the sample code. The results page is where the Lucene-related logic goes. Two files produce it: content/search_result.xsp generates the data as XML, and style/search_result.xsl is the stylesheet that transforms that XML into HTML. The document detail page normally contains no Lucene-related code, although unusual application requirements might call for some. Because the detail page varies widely with your presentation needs and is not part of the search logic, it isn't covered here.



Perform the following actions to set up a Web application that demonstrates Lucene integration with Cocoon:

  1. Open $SAMPLE_CODE_HOME/content/search_result.xsp in your editor (emacs or vi or whatever you like).
  2. Change the value of indexLocation from "/home/wchao/tmp/scratch/sample_code_lucene/index" to "$SAMPLE_CODE_HOME/index". Do not insert "$SAMPLE_CODE_HOME" literally; replace it with the full path of the sample_code_lucene directory, so that the value looks something like "/home/xyz/projects/sample_code_lucene/index".
  3. Type: "cd $SAMPLE_CODE_HOME" (without the quotes and substituting the value of $SAMPLE_CODE_HOME for the placeholder)
  4. Type: "cp -a my_lucene_app $COCOON_HOME" (without the quotes and substituting the value of $COCOON_HOME for the placeholder)

Now open your Web browser and go to http://$SERVER_HOSTNAME:8080/cocoon/my_lucene_app/search_form.html. You should see a screen similar to Figure 2.

 
Figure 2. Search_form Web Interface

If you type in the query "car", you should see Avis Group Holdings. You can click on the highlighted name to view the document detail page for Avis Group Holdings. You should also try some of the queries previously entered in the command line search program (MySearcher).

To review how the Cocoon Web application works and how it makes use of Lucene, view the $COCOON_HOME/my_lucene_app/sitemap.xmap file. The search_form.html page is a simple HTML page that contains a form and needs no explanation. The search_result.html page is actually produced from two files: content/search_result.xsp and style/search_result.xsl. First, search_result.xsp generates the XML data, and then search_result.xsl transforms the XML data into HTML. The following is a breakdown of the search_result.xsp file's process:

  1. The user's query is retrieved via request.getParameter("query").
  2. A new Searcher is instantiated. As the name implies, a Searcher lets you search through a collection of documents.
  3. A new Analyzer is instantiated. An Analyzer breaks text into the terms that Lucene compares against the indexed documents, so the choice of Analyzer determines how query terms match. Analyzers are useful because they let you match in different ways (soundex, suffix stripping, synonyms, etc.). If you need a new way of matching, you can write your own Analyzer.
  4. A Query is instantiated. You can think of a Query as a package containing the query terms and the Analyzer used to perform match operations.
  5. The Searcher instance is invoked with the query.
  6. The Searcher instance returns a Hits collection. Each object in the Hits collection is a Document instance.
  7. You then iterate over the Hits collection, assigning the current Document instance to a variable in the loop.
  8. Each iteration in the Hits collection generates an XML fragment.
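
Steps 7 and 8 can be sketched in plain Java. The sketch below is not the code from search_result.xsp and does not call Lucene: the Hit class is a hypothetical stand-in for the score and stored fields you would read from each hit, and the <search-results> wrapper element is an assumed name. It only illustrates the loop that turns each hit into a <document> fragment.

```java
import java.util.ArrayList;
import java.util.List;

class SearchResultXmlSketch {

    /** Hypothetical stand-in for one hit: a raw score plus two stored fields. */
    static final class Hit {
        final float score;
        final String filename;
        final String companyName;

        Hit(float score, String filename, String companyName) {
            this.score = score;
            this.filename = filename;
            this.companyName = companyName;
        }
    }

    /** Escapes the characters that would break XML element content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds the XML fragment for a single hit (step 8). */
    static String toFragment(int position, Hit hit) {
        // Assumption: the percentage is the raw score scaled to 0-100 and rounded.
        int percent = Math.round(hit.score * 100f);
        return "<document>"
                + "<position>" + position + "</position>"
                + "<score>" + hit.score + "</score>"
                + "<relevance-percent>" + percent + "</relevance-percent>"
                + "<filename>" + escape(hit.filename) + "</filename>"
                + "<company-name>" + escape(hit.companyName) + "</company-name>"
                + "</document>";
    }

    /** Iterates over the hits (step 7) and wraps the fragments in one root element. */
    static String toXml(List<Hit> hits) {
        StringBuilder xml = new StringBuilder("<search-results>");
        for (int i = 0; i < hits.size(); i++) {
            xml.append(toFragment(i + 1, hits.get(i))); // positions are 1-based
        }
        return xml.append("</search-results>").toString();
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(0.532f, "00283654.txt", "Walmart"));
        System.out.println(toXml(hits));
    }
}
```

Escaping the field values matters because company names can contain characters such as "&", which would otherwise make the generated XML malformed.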

 
Figure 3. XML File Transformed into a Table

The XML fragment for an example document might look like this:

<document>
  <position>1</position>
  <score>.532</score>
  <relevance-percent>53</relevance-percent>
  <filename>00283654.txt</filename>
  <company-name>Walmart</company-name>
  <format>text</format>
  <form-type>10-K</form-type>
  <filed-date>20040930</filed-date>
</document>

In search_result.xsl, you transform the XML file with multiple <document> tags into the table that you see in Figure 3.
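
To see what such a transformation involves outside of Cocoon, here is a minimal standalone sketch that uses the JDK's javax.xml.transform API. The stylesheet is not the real search_result.xsl; it is a simplified assumption that expects the <document> elements inside a hypothetical <search-results> root and emits one table row per document. In the sample application, Cocoon applies the stylesheet for you as part of the pipeline.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ResultTableSketch {

    // Simplified, assumed stylesheet: one table row per <document> element.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"html\" indent=\"no\"/>"
            + "<xsl:template match=\"/search-results\">"
            + "<table><tr><th>#</th><th>Company</th><th>Relevance</th></tr>"
            + "<xsl:apply-templates select=\"document\"/>"
            + "</table></xsl:template>"
            + "<xsl:template match=\"document\">"
            + "<tr><td><xsl:value-of select=\"position\"/></td>"
            + "<td><xsl:value-of select=\"company-name\"/></td>"
            + "<td><xsl:value-of select=\"relevance-percent\"/>%</td></tr>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML and returns the resulting HTML. */
    static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<search-results><document>"
                + "<position>1</position>"
                + "<relevance-percent>53</relevance-percent>"
                + "<company-name>Walmart</company-name>"
                + "</document></search-results>";
        System.out.println(toHtml(xml));
    }
}
```

Keeping the presentation in a stylesheet is what makes multi-channel publishing practical: the same XML could be sent through a different stylesheet for another output format without touching the search logic.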

Lastly, when you click on a company name, you see the document detail page (an SEC filing for a company). The document detail page is implemented with a simple Cocoon pipeline in the sitemap.

Cocoon and Lucene: A Powerful Combination
Hopefully, you now agree that Cocoon and Lucene form a powerful combination that enables you to quickly and easily develop Web applications for searching unstructured information. By wrapping the Lucene return data in XML and then using Cocoon's XML-handling capabilities, you tap into the power of XML for multi-channel publishing (Web and PDA/wireless, for example) of the huge amount of unstructured information in the world.



Wellie Chao has been active in the business of technology for many years, has been involved with software and hardware since 1984, and has been writing Web-based software in a variety of languages and on different platforms since 1994.