Home » Implementing Full Text Indexing with PostgreSQL

Implementing Full Text Indexing with PostgreSQL

ike many companies, over the years my company developed a home-grown billing/project system that allows our programmers to bill to a Web application. This system is quite capable; however one of the drawbacks to the system is that there’s no way to search the billable entries or customer notes.

Of course, we could use simple LIKE or regex queries via SQL but that would be resource-intensive and would not provide the detailed results we would like. For example, to search for all billing entries that had a description with the words ‘apache’ (to protect the customer name) and ‘php’ we could write a query like this:

   SELECT description    FROM cmd_contact_task_entries    WHERE description ~ 'php'    AND description ~ 'apache';

Author’s Note: The tilde (~) in the query represents the regular expression operator in PostgreSQL.

In our system the preceding query would return seven rows, but it would not rank the matching documents. Moreover, if we were looking for more detailed information we would have to create some fairly onerous queries. What this system really needs is full text indexing.

Although PostgreSQL is, in my opinion, the most capable open source database available, it does have its drawbacks. One such drawback is that PostgreSQL doesn’t include full text indexing as part of the core distribution. However, as with any good open source project, the PostgreSQL community has contributed many useful modules?and one such module is TSearch2.

TSearch2 is a maintained contrib module for both PostgreSQL 7.3 and 7.4. In short, TSearch2 is a new indexable data type that stores the position of words within a particular document?thus making the retrieval of those documents based on specified parameters both efficient and reliable.

Installing TSearch2
There are three ways to get TSearch2 implemented in your PostgreSQL installation.

Download and compile the source
Download and install the contrib rpm (if you are running Linux)
Purchase a binary distribution

Compiling PostgreSQL is outside the scope of this article. However, I will cover compiling and installing the TSearch2 module itself.

The TSearch2 module is included with the PostgreSQL source and can be found within the root of the tree at contrib/tsearch2.

After you have configured and installed PostgreSQL itself, you need to perform two more simple steps to get Tsearch2 installed. From the root of the PostgreSQL source tree, enter the following two commands:

   cd contrib/tsearch2   make; make install

That’s it. The module is now installed. Of course you still have to configure PostgreSQL for its use. For the examples in this article, you’ll use a development database called Fingerless (long story) to test TSearch2.

To configure the database to use TSearch2 you need to execute the tsearch2.sql file against the fingerless database. To do that, enter the following psql command:

   psql fingerless < tsearch2.sql

Executing the script produces some noisy output as it creates sample tables and indexes for testing Tsearch2. Don't worry about all this output. When the script completes it will have added the following tables to your database (see Table 1):

Table 1. A list of tables added by executing the tsearch2.sql file.

Schema	Name	Type	Owner
public	pg_ts_cfg	table	postgres
public	pg_ts_cfgmap	table	postgres
public	pg_ts_dict	table	postgres
public	pg_ts_parser	table	postgres

You will also have a detailed list of new functions and data types. For more information on what the script actually adds to your database please see the tsearch2.sql file or the Web site documentation.

The billing application keeps the main billing details in the table named public.cmd_contact_task_entries.

The table has the structure shown in Table 2:

Table 2: Structure of the public.cmd_contact_task_entries table.

Column	Type	Modifiers
entry_id	integer	not null default nextval('"cmd_contact_task_e_entry_id_seq"'::text)
task_id	integer
creator_employee_id	integer
description	text
time_spent	interval
created	timestamp with time zone
billed	timestamp with time zone
hourly_rate	numeric(6,2)
internal_notes	text	default ''::text
contact_id	integer
note_id	integer	default 0

The table also has this index:

   "cmd_contact_task_entries_pkey"    primary key, btree (entry_id)

Using TSearch2
You need to make several modifications to the table to use Tsearch2. First, add a vectors column. The vectors column carries the vector value of the column data you are going to be searching. In this case you want to calculate the vectors for the description column. To add the column to the table execute the following command.

   ALTER TABLE public.cmd_contact_task_entries   ADD COLUMN vectors tsvector;

The command adds a new type?called tsvector?to PostgreSQL via Tsearch. After adding the column you also need to add a new index. The index to be created is a gist index. Gist is an acronym for Generalized Search Tree. You can find more information on the specifics of this type of index here. Tsearch uses the gist index to store the words of the documents by lexeme?a fundamental unit of language. A lexeme of a word is based on the lexicon (the collection of words) of a particular language such as English. For example, a lexeme of the word find is found, or finding.

   CREATE INDEX description_index    ON cmd_contact_task_entries    USING gist(vectors);

You have now created the basic structure for using full text indexing with the cmd_contact_task_entries table but you need to update the vectors column to have the correct information. You can do this with the following update statement:

   UPDATE cmd_contact_task_entries    SET vectors = to_tsvector(description);

After completing these steps, you can start processing queries. Here's an example query that looks for the same terms as the regex example at the beginning of this article. The query displays matched items in the TSearch headline format:

   SELECT entry_id,headline(description,q),    rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery('apache|php') AS q    WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;

On our work system, the query returns 216 rows (versus only 7 using the regex query). The first five are displayed below:

   -[ RECORD 1]------------------------------------------    entry_id | 2435   headline | php php-devel php   -imap php-ldap pine psmisc python    python-devel   rank | 0.468559      -[ RECORD 2 ]-----------------------------------------    entry_id | 4167   headline | php for apache and we    attempted to fix the dependencies several times and    ended up   rank | 0.468559      -[ RECORD 3 ]-----------------------------------------    entry_id | 2786   headline | php 4.3.2 and compiled    (but apache wasn't starting back up    properly....redoing it with   rank | 0.40951      -[ RECORD 4 ]-----------------------------------------    entry_id | 2203   headline | apache 1.3. Installed LXP 0.9.   Apache is now running with LXP but did not    get a chance   rank | 0.3439      -[ RECORD 5 ]-----------------------------------------    entry_id | 1504   headline | Apache 1.3.20, PHP 4.0.5,    lxp 0.7.9, OpenSSL and   rank | 0.3439

Notice that using the headline function in the search query causes the results to have the search terms emphasized with the (bold) html tag. Also, the results are ranked to provide a weighted result.

You can make this more manageable by rewriting the query as a function. Because you'll need to return multiple rows, the function needs to be a SetOF function. When creating a SetOf function you need to create a custom type:

   CREATE TYPE desc_results    AS (entry_id INTEGER, headline TEXT, rank REAL);

After creating the custom type you create a function that calls that type:

   CREATE OR REPLACE FUNCTION dresults(text)    RETURNS SETOF desc_results AS '   SELECT entry_id,headline(description,q),   rank(vectors,q)    FROM cmd_contact_task_entries,    to_tsquery($1) AS q    WHERE vectors @@ q ORDER BY rank(vectors,q)    DESC;' LANGUAGE 'SQL';

Now, rather than writing the full search query, you can use a much simpler query such as:

   SELECT * FROM dresults('apache|php');

TSearch2 has many capabilities that were not discussed in this article. If after reading this, you're still thirsting for more information I suggest you review the excellent project documentation listed in the resources column.

About Our Editorial Process

At DevX, we’re dedicated to tech entrepreneurship. Our team closely follows industry shifts, new products, AI breakthroughs, technology trends, and funding announcements. Articles undergo thorough editing to ensure accuracy and clarity, reflecting DevX’s style and supporting entrepreneurs in the tech sphere.

See our full editorial policy.

About Our Journalist

Charlie Frank

Charlie has over a decade of experience in website administration and technology management. As the site admin, he oversees all technical aspects of running a high-traffic online platform, ensuring optimal performance, security, and user experience.
View Author

Walmart’s fintech startup to disrupt BNPL market
Cameron Wiggins April 26, 2024 5:38 PM

From humble cart to thriving business: The Nitro Bar’s success story
Rashan Dixon April 26, 2024 3:50 PM

Adapting business strategies amidst unpredictable economies
April Isaacs April 26, 2024 3:11 PM

Singaporean entrepreneur’s journey to grocery chain success
Cameron Wiggins April 26, 2024 3:06 PM

Rhode Island coffee startup finds success
April Isaacs April 26, 2024 1:39 PM

Erel Margalit’s enduring impact on Israeli tech industry
Noah Nguyen April 26, 2024 1:12 PM

Bilt Rewards transform US rental sector
Cameron Wiggins April 26, 2024 8:58 AM

Chinese e-commerce platform upsets US market
Noah Nguyen April 26, 2024 8:54 AM

Stripe refines business model for fintech competitiveness
Johannah Lopez April 26, 2024 7:13 AM

Addressing challenges in Israel’s tech sector
Rashan Dixon April 25, 2024 5:58 PM

Gender disparity challenged by women-centric coworking spaces
April Isaacs April 25, 2024 5:56 PM

From cart to empire: The Nitro Bar’s rise
Cameron Wiggins April 25, 2024 5:23 PM

FlexAI nets $30 million to simplify AI computing infrastructure
Noah Nguyen April 25, 2024 3:36 PM

Google fires 28 protesting employees, maintains professional ethics
Johannah Lopez April 25, 2024 1:49 PM

Growing interest in small business ownership among millennials
Johannah Lopez April 25, 2024 1:31 PM

India’s rise in space exploration with private corporations
Johannah Lopez April 25, 2024 8:42 AM

Walmart partners with fintech One for installment payment service
Johannah Lopez April 25, 2024 7:59 AM

Biden administration bans noncompete agreements nationwide
April Isaacs April 25, 2024 7:01 AM

Media mogul poised for $1 billion boost if shares surge
Cameron Wiggins April 24, 2024 5:19 PM

Bitcoin miners’ strategy stabilizes market pre-halving
Noah Nguyen April 24, 2024 5:12 PM

Digital banking tackles fraud with advanced methods
Johannah Lopez April 24, 2024 1:55 PM

Wealth accumulation: Path to personal freedom and societal impact
Cameron Wiggins April 24, 2024 1:25 PM

Japan’s finance minister warns of currency market intervention
Rashan Dixon April 24, 2024 1:14 PM

TabaPay set to acquire Synapse’s assets
April Isaacs April 24, 2024 11:58 AM

Baton Rouge Entrepreneurship Week 2024 announced
Cameron Wiggins April 24, 2024 11:36 AM

Show More