ike many companies, over the years my company developed a home-grown billing/project system that allows our programmers to bill to a Web application. This system is quite capable; however one of the drawbacks to the system is that there’s no way to search the billable entries or customer notes.
Of course, we could use simple LIKE or regex queries via SQL but that would be resource-intensive and would not provide the detailed results we would like. For example, to search for all billing entries that had a description with the words ‘apache’ (to protect the customer name) and ‘php’ we could write a query like this:
SELECT description FROM cmd_contact_task_entries WHERE description ~ 'php' AND description ~ 'apache';
|Author’s Note: The tilde (~) in the query represents the regular expression operator in PostgreSQL.|
In our system the preceding query would return seven rows, but it would not rank the matching documents. Moreover, if we were looking for more detailed information we would have to create some fairly onerous queries. What this system really needs is full text indexing.
Although PostgreSQL is, in my opinion, the most capable open source database available, it does have its drawbacks. One such drawback is that PostgreSQL doesn’t include full text indexing as part of the core distribution. However, as with any good open source project, the PostgreSQL community has contributed many useful modules?and one such module is TSearch2.
TSearch2 is a maintained contrib module for both PostgreSQL 7.3 and 7.4. In short, TSearch2 is a new indexable data type that stores the position of words within a particular document?thus making the retrieval of those documents based on specified parameters both efficient and reliable.
There are three ways to get TSearch2 implemented in your PostgreSQL installation.
- Download and compile the source
- Download and install the contrib rpm (if you are running Linux)
- Purchase a binary distribution
Compiling PostgreSQL is outside the scope of this article. However, I will cover compiling and installing the TSearch2 module itself.
The TSearch2 module is included with the PostgreSQL source and can be found within the root of the tree at contrib/tsearch2.
After you have configured and installed PostgreSQL itself, you need to perform two more simple steps to get Tsearch2 installed. From the root of the PostgreSQL source tree, enter the following two commands:
cd contrib/tsearch2 make; make install
That’s it. The module is now installed. Of course you still have to configure PostgreSQL for its use. For the examples in this article, you’ll use a development database called Fingerless (long story) to test TSearch2.
To configure the database to use TSearch2 you need to execute the tsearch2.sql file against the fingerless database. To do that, enter the following psql command:
psql fingerless < tsearch2.sql
Executing the script produces some noisy output as it creates sample tables and indexes for testing Tsearch2. Don't worry about all this output. When the script completes it will have added the following tables to your database (see Table 1):
Table 1. A list of tables added by executing the tsearch2.sql file.
You will also have a detailed list of new functions and data types. For more information on what the script actually adds to your database please see the tsearch2.sql file or the Web site documentation.
The billing application keeps the main billing details in the table named public.cmd_contact_task_entries.
The table has the structure shown in Table 2:
Table 2: Structure of the public.cmd_contact_task_entries table.
|entry_id||integer||not null default nextval('"cmd_contact_task_e_entry_id_seq"'::text)|
|created||timestamp with time zone|
|billed||timestamp with time zone|
The table also has this index:
"cmd_contact_task_entries_pkey" primary key, btree (entry_id)
You need to make several modifications to the table to use Tsearch2. First, add a vectors column. The vectors column carries the vector value of the column data you are going to be searching. In this case you want to calculate the vectors for the description column. To add the column to the table execute the following command.
ALTER TABLE public.cmd_contact_task_entries ADD COLUMN vectors tsvector;
The command adds a new type?called tsvector?to PostgreSQL via Tsearch. After adding the column you also need to add a new index. The index to be created is a gist index. Gist is an acronym for Generalized Search Tree. You can find more information on the specifics of this type of index here. Tsearch uses the gist index to store the words of the documents by lexeme?a fundamental unit of language. A lexeme of a word is based on the lexicon (the collection of words) of a particular language such as English. For example, a lexeme of the word find is found, or finding.
CREATE INDEX description_index ON cmd_contact_task_entries USING gist(vectors);
You have now created the basic structure for using full text indexing with the cmd_contact_task_entries table but you need to update the vectors column to have the correct information. You can do this with the following update statement:
UPDATE cmd_contact_task_entries SET vectors = to_tsvector(description);
After completing these steps, you can start processing queries. Here's an example query that looks for the same terms as the regex example at the beginning of this article. The query displays matched items in the TSearch headline format:
SELECT entry_id,headline(description,q), rank(vectors,q) FROM cmd_contact_task_entries, to_tsquery('apache|php') AS q WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;
On our work system, the query returns 216 rows (versus only 7 using the regex query). The first five are displayed below:
-[ RECORD 1]------------------------------------------ entry_id | 2435 headline | php php-devel php -imap php-ldap pine psmisc python python-devel rank | 0.468559 -[ RECORD 2 ]----------------------------------------- entry_id | 4167 headline | php for apache and we attempted to fix the dependencies several times and ended up rank | 0.468559 -[ RECORD 3 ]----------------------------------------- entry_id | 2786 headline | php 4.3.2 and compiled (but apache wasn't starting back up properly....redoing it with rank | 0.40951 -[ RECORD 4 ]----------------------------------------- entry_id | 2203 headline | apache 1.3. Installed LXP 0.9. Apache is now running with LXP but did not get a chance rank | 0.3439 -[ RECORD 5 ]----------------------------------------- entry_id | 1504 headline | Apache 1.3.20, PHP 4.0.5, lxp 0.7.9, OpenSSL and rank | 0.3439
Notice that using the headline function in the search query causes the results to have the search terms emphasized with the (bold) html tag. Also, the results are ranked to provide a weighted result.
You can make this more manageable by rewriting the query as a function. Because you'll need to return multiple rows, the function needs to be a SetOF function. When creating a SetOf function you need to create a custom type:
CREATE TYPE desc_results AS (entry_id INTEGER, headline TEXT, rank REAL);
After creating the custom type you create a function that calls that type:
CREATE OR REPLACE FUNCTION dresults(text) RETURNS SETOF desc_results AS ' SELECT entry_id,headline(description,q), rank(vectors,q) FROM cmd_contact_task_entries, to_tsquery($1) AS q WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;' LANGUAGE 'SQL';
Now, rather than writing the full search query, you can use a much simpler query such as:
SELECT * FROM dresults('apache|php');
TSearch2 has many capabilities that were not discussed in this article. If after reading this, you're still thirsting for more information I suggest you review the excellent project documentation listed in the resources column.