dcsimg
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Implementing Full Text Indexing with PostgreSQL : Page 3

Although PostgreSQL doesn't include full-text indexing and search capabilities by default, you can install and use the TSearch2 module that ships with PostgreSQL to index and search large amounts of text efficiently.


advertisement
Using TSearch2
You need to make several modifications to the table to use Tsearch2. First, add a vectors column. The vectors column carries the vector value of the column data you are going to be searching. In this case you want to calculate the vectors for the description column. To add the column to the table execute the following command.

   ALTER TABLE public.cmd_contact_task_entries
   ADD COLUMN vectors tsvector;
The command adds a new type—called tsvector—to PostgreSQL via Tsearch. After adding the column you also need to add a new index. The index to be created is a gist index. Gist is an acronym for Generalized Search Tree. You can find more information on the specifics of this type of index here. Tsearch uses the gist index to store the words of the documents by lexeme—a fundamental unit of language. A lexeme of a word is based on the lexicon (the collection of words) of a particular language such as English. For example, a lexeme of the word find is found, or finding.

   CREATE INDEX description_index 
   ON cmd_contact_task_entries 
   USING gist(vectors);
You have now created the basic structure for using full text indexing with the cmd_contact_task_entries table but you need to update the vectors column to have the correct information. You can do this with the following update statement:

   UPDATE cmd_contact_task_entries 
   SET vectors = to_tsvector(description);
After completing these steps, you can start processing queries. Here's an example query that looks for the same terms as the regex example at the beginning of this article. The query displays matched items in the TSearch headline format:

   SELECT entry_id,headline(description,q), 
   rank(vectors,q) 
   FROM cmd_contact_task_entries, 
   to_tsquery('apache|php') AS q 
   WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;
On our work system, the query returns 216 rows (versus only 7 using the regex query). The first five are displayed below:

   -[ RECORD 1]------------------------------------------ 
   entry_id | 2435
   headline | <b>php</b> <b>php</b>-devel <b>php</b>
   -imap <b>php</b>-ldap pine psmisc python 
   python-devel
   rank | 0.468559
   
   -[ RECORD 2 ]----------------------------------------- 
   entry_id | 4167
   headline | <b>php</b> for <b>apache</b> and we 
   attempted to fix the dependencies several times and 
   ended up
   rank | 0.468559
   
   -[ RECORD 3 ]----------------------------------------- 
   entry_id | 2786
   headline | <b>php</b> 4.3.2 and compiled 
   (but <b>apache</b> wasn't starting back up 
   properly....redoing it with
   rank | 0.40951
   
   -[ RECORD 4 ]----------------------------------------- 
   entry_id | 2203
   headline | <b>apache</b> 1.3. Installed LXP 0.9.
   <b>Apache</b> is now running with LXP but did not 
   get a chance
   rank | 0.3439
   
   -[ RECORD 5 ]----------------------------------------- 
   entry_id | 1504
   headline | <b>Apache</b> 1.3.20, <b>PHP</b> 4.0.5, 
   lxp 0.7.9, OpenSSL and
   rank | 0.3439
Notice that using the headline function in the search query causes the results to have the search terms emphasized with the <b> (bold) html tag. Also, the results are ranked to provide a weighted result.

You can make this more manageable by rewriting the query as a function. Because you'll need to return multiple rows, the function needs to be a SetOF function. When creating a SetOf function you need to create a custom type:

   CREATE TYPE desc_results 
   AS (entry_id INTEGER, headline TEXT, rank REAL);
After creating the custom type you create a function that calls that type:

   CREATE OR REPLACE FUNCTION dresults(text) 
   RETURNS SETOF desc_results AS '
   SELECT entry_id,headline(description,q),
   rank(vectors,q) 
   FROM cmd_contact_task_entries, 
   to_tsquery($1) AS q 
   WHERE vectors @@ q ORDER BY rank(vectors,q) 
   DESC;' LANGUAGE 'SQL';
Now, rather than writing the full search query, you can use a much simpler query such as:

   SELECT * FROM dresults('apache|php');
TSearch2 has many capabilities that were not discussed in this article. If after reading this, you're still thirsting for more information I suggest you review the excellent project documentation listed in the resources column.



Joshua D. Drake is President of Command Prompt, Inc. a dedicated PostgreSQL support and custom programming company. He is also the co-author of "Practical PostgreSQL" from O'Reilly and Associates.
Thanks for your registration, follow us on our social networks to keep up-to-date