Browse DevX
Sign up for e-mail newsletters from DevX


Implementing Full Text Indexing with PostgreSQL : Page 3

Although PostgreSQL doesn't include full-text indexing and search capabilities by default, you can install and use the TSearch2 module that ships with PostgreSQL to index and search large amounts of text efficiently.




Building the Right Environment to Support AI, Machine Learning and Deep Learning

Using TSearch2
You need to make several modifications to the table to use Tsearch2. First, add a vectors column. The vectors column carries the vector value of the column data you are going to be searching. In this case you want to calculate the vectors for the description column. To add the column to the table execute the following command.

ALTER TABLE public.cmd_contact_task_entries ADD COLUMN vectors tsvector;

The command adds a new type—called tsvector—to PostgreSQL via Tsearch. After adding the column you also need to add a new index. The index to be created is a gist index. Gist is an acronym for Generalized Search Tree. You can find more information on the specifics of this type of index here. Tsearch uses the gist index to store the words of the documents by lexeme—a fundamental unit of language. A lexeme of a word is based on the lexicon (the collection of words) of a particular language such as English. For example, a lexeme of the word find is found, or finding.

CREATE INDEX description_index ON cmd_contact_task_entries USING gist(vectors);

You have now created the basic structure for using full text indexing with the cmd_contact_task_entries table but you need to update the vectors column to have the correct information. You can do this with the following update statement:

UPDATE cmd_contact_task_entries SET vectors = to_tsvector(description);

After completing these steps, you can start processing queries. Here's an example query that looks for the same terms as the regex example at the beginning of this article. The query displays matched items in the TSearch headline format:

SELECT entry_id,headline(description,q), rank(vectors,q) FROM cmd_contact_task_entries, to_tsquery('apache|php') AS q WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;

On our work system, the query returns 216 rows (versus only 7 using the regex query). The first five are displayed below:

-[ RECORD 1]------------------------------------------ entry_id | 2435 headline | <b>php</b> <b>php</b>-devel <b>php</b> -imap <b>php</b>-ldap pine psmisc python python-devel rank | 0.468559 -[ RECORD 2 ]----------------------------------------- entry_id | 4167 headline | <b>php</b> for <b>apache</b> and we attempted to fix the dependencies several times and ended up rank | 0.468559 -[ RECORD 3 ]----------------------------------------- entry_id | 2786 headline | <b>php</b> 4.3.2 and compiled (but <b>apache</b> wasn't starting back up properly....redoing it with rank | 0.40951 -[ RECORD 4 ]----------------------------------------- entry_id | 2203 headline | <b>apache</b> 1.3. Installed LXP 0.9. <b>Apache</b> is now running with LXP but did not get a chance rank | 0.3439 -[ RECORD 5 ]----------------------------------------- entry_id | 1504 headline | <b>Apache</b> 1.3.20, <b>PHP</b> 4.0.5, lxp 0.7.9, OpenSSL and rank | 0.3439

Notice that using the headline function in the search query causes the results to have the search terms emphasized with the <b> (bold) html tag. Also, the results are ranked to provide a weighted result.

You can make this more manageable by rewriting the query as a function. Because you'll need to return multiple rows, the function needs to be a SetOF function. When creating a SetOf function you need to create a custom type:

CREATE TYPE desc_results AS (entry_id INTEGER, headline TEXT, rank REAL);

After creating the custom type you create a function that calls that type:

CREATE OR REPLACE FUNCTION dresults(text) RETURNS SETOF desc_results AS ' SELECT entry_id,headline(description,q), rank(vectors,q) FROM cmd_contact_task_entries, to_tsquery($1) AS q WHERE vectors @@ q ORDER BY rank(vectors,q) DESC;' LANGUAGE 'SQL';

Now, rather than writing the full search query, you can use a much simpler query such as:

SELECT * FROM dresults('apache|php');

TSearch2 has many capabilities that were not discussed in this article. If after reading this, you're still thirsting for more information I suggest you review the excellent project documentation listed in the resources column.

Joshua D. Drake is President of Command Prompt, Inc. a dedicated PostgreSQL support and custom programming company. He is also the co-author of "Practical PostgreSQL" from O'Reilly and Associates.
Comment and Contribute






(Maximum characters: 1200). You have 1200 characters left.



Thanks for your registration, follow us on our social networks to keep up-to-date