Implementing Full Text Indexing with PostgreSQL

ike many companies, over the years my company developed a home-grown billing/project system that allows our programmers to bill to a Web application. This system is quite capable; however one of the drawbacks to the system is that there’s no way to search the billable entries or customer notes.

Of course, we could use simple LIKE or regex queries via SQL but that would be resource-intensive and would not provide the detailed results we would like. For example, to search for all billing entries that had a description with the words ‘apache’ (to protect the customer name) and ‘php’ we could write a query like this:

   SELECT description    FROM cmd_contact_task_entries    WHERE description ~ 'php'    AND description ~ 'apache';

Author’s Note: The tilde (~) in the query represents the regular expression operator in PostgreSQL.

In our system the preceding query would return seven rows, but it would not rank the matching documents. Moreover, if we were looking for more detailed information we would have to create some fairly onerous queries. What this system really needs is full text indexing.

Although PostgreSQL is, in my opinion, the most capable open source database available, it does have its drawbacks. One such drawback is that PostgreSQL doesn’t include full text indexing as part of the core distribution. However, as with any good open source project, the PostgreSQL community has contributed many useful modules?and one such module is TSearch2.

TSearch2 is a maintained contrib module for both PostgreSQL 7.3 and 7.4. In short, TSearch2 is a new indexable data type that stores the position of words within a particular document?thus making the retrieval of those documents based on specified parameters both efficient and reliable.

Installing TSearch2
There are three ways to get TSearch2 implemented in your PostgreSQL installation.

  1. Download and compile the source
  2. Download and install the contrib rpm (if you are running Linux)
  3. Purchase a binary distribution

Compiling PostgreSQL is outside the scope of this article. However, I will cover compiling and installing the TSearch2 module itself.

The TSearch2 module is included with the PostgreSQL source and can be found within the root of the tree at contrib/tsearch2.

After you have configured and installed PostgreSQL itself, you need to perform two more simple steps to get Tsearch2 installed. From the root of the PostgreSQL source tree, enter the following two commands:

   cd contrib/tsearch2   make; make install

That’s it. The module is now installed. Of course you still have to configure PostgreSQL for its use. For the examples in this article, you’ll use a development database called Fingerless (long story) to test TSearch2.

To configure the database to use TSearch2 you need to execute the tsearch2.sql file against the fingerless database. To do that, enter the following psql command:

   psql fingerless < tsearch2.sql

Executing the script produces some noisy output as it creates sample tables and indexes for testing Tsearch2. Don't worry about all this output. When the script completes it will have added the following tables to your database (see Table 1):

Table 1. A list of tables added by executing the tsearch2.sql file.

Schema Name Type Owner
public pg_ts_cfg table postgres
public pg_ts_cfgmap table postgres
public pg_ts_dict table postgres
public pg_ts_parser table postgres

You will also have a detailed list of new functions and data types. For more information on what the script actually adds to your database please see the tsearch2.sql file or the Web site documentation.

The billing application keeps the main billing details in the table named public.cmd_contact_task_entries.

The table has the structure shown in Table 2:

Table 2: Structure of the public.cmd_contact_task_entries table.

Column Type Modifiers
entry_id integer not null default nextval('"cmd_contact_task_e_entry_id_seq"'::text)
task_id integer  
creator_employee_id integer  
description text  
time_spent interval  
created timestamp with time zone  
billed timestamp with time zone  
hourly_rate numeric(6,2)  
internal_notes text default ''::text
contact_id integer  
note_id integer default 0

The table also has this index:

   "cmd_contact_task_entries_pkey"    primary key, btree (entry_id)

Using TSearch2
You need to make several modifications to the table to use Tsearch2. First, add a vectors column. The vectors column carries the vector value of the column data you are going to be searching. In this case you want to calculate the vectors for the description column. To add the column to the table execute the following command.

   ALTER TABLE public.cmd_contact_task_entries   ADD COLUMN vectors tsvector;

The command adds a new type?called tsvector?to PostgreSQL via Tsearch. After adding the column you also need to add a new index. The index to be created is a gist index. Gist is an acronym for Generalized Search Tree. You can find more information on the specifics of this type of index here. Tsearch uses the gist index to store the words of the documents by lexeme?a fundamental unit of language. A lexeme of a word is based on the lexicon (the collection of words) of a particular language such as English. For example, a lexeme of the word find is found, or finding.

   CREATE INDEX description_index    ON cmd_contact_task_entries    USING gist(vectors);

You have now created the basic structure for using full text indexing with the cmd_contact_task_entries table but you need to update the vectors column to have the correct information. You can do this with the following update statement:

   UPDATE cmd_contact_task_entries    SET vectors = to_tsvector(description);

After completing these steps, you can start processing queries. Here's an example query that looks for the same terms as the regex example at the beginning of this article. The query displays matched items in the TSearch headline format:

   SELECT entry_id,headline(description,q),    rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery('apache|php') AS q    WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;

On our work system, the query returns 216 rows (versus only 7 using the regex query). The first five are displayed below:

   -[ RECORD 1]------------------------------------------    entry_id | 2435   headline | php php-devel php   -imap php-ldap pine psmisc python    python-devel   rank | 0.468559      -[ RECORD 2 ]-----------------------------------------    entry_id | 4167   headline | php for apache and we    attempted to fix the dependencies several times and    ended up   rank | 0.468559      -[ RECORD 3 ]-----------------------------------------    entry_id | 2786   headline | php 4.3.2 and compiled    (but apache wasn't starting back up    properly....redoing it with   rank | 0.40951      -[ RECORD 4 ]-----------------------------------------    entry_id | 2203   headline | apache 1.3. Installed LXP 0.9.   Apache is now running with LXP but did not    get a chance   rank | 0.3439      -[ RECORD 5 ]-----------------------------------------    entry_id | 1504   headline | Apache 1.3.20, PHP 4.0.5,    lxp 0.7.9, OpenSSL and   rank | 0.3439

Notice that using the headline function in the search query causes the results to have the search terms emphasized with the (bold) html tag. Also, the results are ranked to provide a weighted result.

You can make this more manageable by rewriting the query as a function. Because you'll need to return multiple rows, the function needs to be a SetOF function. When creating a SetOf function you need to create a custom type:

   CREATE TYPE desc_results    AS (entry_id INTEGER, headline TEXT, rank REAL);

After creating the custom type you create a function that calls that type:

   CREATE OR REPLACE FUNCTION dresults(text)    RETURNS SETOF desc_results AS '   SELECT entry_id,headline(description,q),   rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery($1) AS q    WHERE vectors @@ q ORDER BY rank(vectors,q)    DESC;' LANGUAGE 'SQL';

Now, rather than writing the full search query, you can use a much simpler query such as:

   SELECT * FROM dresults('apache|php');

TSearch2 has many capabilities that were not discussed in this article. If after reading this, you're still thirsting for more information I suggest you review the excellent project documentation listed in the resources column.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

The Latest

chrome os developer mode

How to Turn on Chrome OS Developer Mode

Google’s Chrome OS is a popular operating system that is widely used on Chromebooks and other devices. While it is designed to be simple and user-friendly, there are times when users may want to access additional features and functionality. One way to do this is by turning on Chrome OS

homes in the real estate industry

Exploring the Latest Tech Trends Impacting the Real Estate Industry

The real estate industry is changing thanks to the newest technological advancements. These new developments — from blockchain and AI to virtual reality and 3D printing — are poised to change how we buy and sell homes. Real estate brokers, buyers, sellers, wholesale real estate professionals, fix and flippers, and beyond may

man on floor with data

DevX Quick Guide to Data Ingestion

One of the biggest trends of the 21st century is the massive surge in internet usage. With major innovations such as smart technology, social media, and online shopping sites, the internet has become an essential part of everyday life for a large portion of the population. Due to this internet