Create a Robust Document Management System with Oracle Text

f you need a way to search through various documents with differing formats (PDF, Word documents, PowerPoint presentations, Excel spreadsheets)?including Web content?use Oracle Text. Oracle Text has a powerful text-search engine that allows users to search and manage documents and URLs as quickly and easily as any other data. If you’re an application developer, you can utilize Oracle Text searches on anything from simple comment fields to disparate documents, and you can even create a full-scale document management system. This article examines a content management example, using sample code and step-by-step instructions, to help you do just this.

The examples this article covers come from an actual content management system at my company. Currently, document owners manually enter metadata about each document into a content management intranet. For assigning categories, brands, and communication types to the document, manual data selection is sufficient.

The challenge comes when the owner needs to add keywords about the document to the content management system’s database. Most often, the owner opens the document, reads through it, and either copies and pastes the information into the administration screen or puts in the keywords he or she thinks best represent the document. As a result, the search results for documents on this site did not always return all the documents associated with particular keywords?since the owner did not enter all relevant data into the keywords field when adding the record.

By using the Oracle Text functionality, the content management team now has better results from user searches, as well as an easier method for document owners to manage keyword information about the document. In conjunction with the Oracle Text data, document owners can also add words that are synonymous with the document even when those words aren’t part of the text within the document.

Figure 1 depicts the architecture utilized in this example.

Figure 1: Architecture for Content Management System Utilizing Oracle Text

You can use some of these features to deliver a robust content management search engine. Just follow the steps.

Step 1: Preliminary Setup
These instructions assume you have Oracle Text installed. If you haven’t set it up yet, please refer to the Oracle installation guide. If you have access to OracleMetaLink, you can follow [Note:150316.1]. The following statement will perform a quick check to verify whether Oracle Text was set up when your database was installed (connected as CTXSYS user):

column library_name format a8 column file_spec format a60   select library_name,file_spec,dynamic,status from user_libraries; LIBRARY 	FILE_SPEC                                                    	D	STATUS--------	--------------------------------	-	-------DR$LIBX  	D:oracleora92inoractxx9.dll                             	Y	VALIDDR$LIB                                                                		N	VALID

After you have ensured Oracle Text is properly configured, connect as user SYSTEM. To follow the examples throughout this article, you need to create a user account and a directory where the documents that need indexing reside. You can customize the location of your document repository by changing the following path:

create user cmdemo identified by password;grant dba,connect,resource,create session to cmdemo;create or replace directory DIR_TESTCASE as 'G:ContentMgmt';grant read on directory DIR_TESTCASE to PUBLIC;

Step 2: Create Table and Import Data
As user CMDEMO, create the main table to house your content:

CREATE TABLE CMDEMO.CONTENT_INVENTORY (  CONTENT_INVENTORY_ID 	NUMBER NOT NULL,   FILE_NAME 		VARCHAR2(1024),   KEYWORDS 		VARCHAR2(2048),   URL 			SYS.HTTPURITYPE,   TEXT 			BLOB DEFAULT empty_blob(),   TEXT_LOADED 		CHAR(1) DEFAULT 'N' NOT NULL,     CONSTRAINT UNQ_CONTENT_INVENTORY       UNIQUE(CONTENT_INVENTORY_ID) USING INDEX    TABLESPACE USERS); 

Several fields within this routine are worth noting and understanding:

  • Notice that you set your BLOB field to default to empty_blob(). This avoids uninitialized blob errors.
  • The FILE_NAME field is the name of your binary documents (e.g., Word docs, PDFs, XLS, etc.).
  • The KEYWORDS field, as mentioned earlier, gives you the ability to add words that are synonymous with the document but are not part of the text within the document (an example of using this comes later).
  • The URL field allows you to harvest text from internal and external Web sites.
  • The TEXT field is where the binary documents are loaded into the database via your PL/SQL routine, which I will cover in just a moment.
  • The TEXT_LOADED field is optional in this example, but it is useful in determining whether the document has been loaded into the database yet.

Load several rows into your CONTENT_MANAGEMENT table. This may seem redundant, but you are loading different file types in each row for demonstration purposes:

--PDF DocumentINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('1', 'install.pdf', 'encompassing');--Word DocumentINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('2', 'Readme.doc', '');--Excel SpreadsheetINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('3', 'OO4O_ORA.XLS', '');--Powerpoint PresentationINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('4', ' SalesPresentation.ppt ', '');--Internal website (Intranet)INSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, URL, KEYWORDS)    VALUES ('5', sys.httpuritype.createuri('http://www.your-intranet.com'), '');--External websiteINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, URL, KEYWORDS)    VALUES ('6', sys.httpuritype.createuri('http://otn.oracle.com/index.html'), '');COMMIT;

Notice you gave an additional keyword in the KEYWORDS column on the first record. You will use this as additional metadata searching later in the example.

Now that you have some rows of data, you must load the files specified in FILE_NAME into the TEXT column of your CONTENT_INVENTORY table. You can load this data in a variety of methods, including:

  • DBMS_LOB.LOADFROMFILE ? the method you use in this article
  • SQL INSERT statements
  • OCI ? Oracle Call Interface
  • Ctxload executable
  • SQL*Loader

The following routine creates a cursor of your data in the CONTENT_MANAGEMENT table and loads the documents into the BLOB field (only for the records that have a FILE_NAME; you’ll see how to index records that have URLs defined separately in the next step):

CREATE OR REPLACE PROCEDURE loadLOBFromBFILE IS   Dest_loc       BLOB;   Src_loc        BFILE := BFILENAME('DIR_TESTCASE', '');   Amount         INTEGER := 0;   cursor ContentInv_cursor is     --create a recordset of the file names     --that we need to load from the filesystem into the      --Oracle database.     SELECT        CONTENT_INVENTORY_ID,        FILE_NAME       FROM CMDEMO.CONTENT_INVENTORY       WHERE FILE_NAME IS NOT NULL;   varContentInv_id CONTENT_INVENTORY.CONTENT_INVENTORY_ID%TYPE;   varFileName	   CONTENT_INVENTORY.FILE_NAME%TYPE;BEGIN open ContentInv_cursor; loop   varFileName := NULL;   varContentInv_id := NULL;   fetch ContentInv_cursor into varContentInv_id, varFileName;   exit when ContentInv_cursor%notfound;   Src_loc := BFILENAME('DIR_TESTCASE', varFileName);   IF DBMS_LOB.FILEEXISTS (Src_loc) = 1 THEN     SELECT text INTO Dest_loc FROM CONTENT_INVENTORY        WHERE CONTENT_INVENTORY_ID = varContentInv_id FOR UPDATE;     /* Opening the LOB is mandatory */     DBMS_LOB.OPEN(Src_loc, DBMS_LOB.LOB_READONLY);     DBMS_LOB.LOADFROMFILE(Dest_loc, Src_loc, dbms_lob.getlength(Src_loc));     /* Closing the LOB is mandatory */     DBMS_LOB.CLOSE(Src_loc);	    /* Optional update statement */     UPDATE CMDEMO.CONTENT_INVENTORY        SET TEXT_LOADED = 'Y'        WHERE CONTENT_INVENTORY_ID = varContentInv_id;     COMMIT;   END IF;end loop;EXCEPTION WHEN OTHERS THEN RAISE_APPLICATION_ERROR (-20001, 'Error occurred while loading LOBs.'); close ContentInv_cursor;END;/That created the procedure. Now, execute it and load the documents into the database:SQL> set serveroutput on;SQL> EXECUTE loadLOBFromBFILE;PL/SQL procedure successfully completed.

Step 3: Index Content (with Oracle Text)
Now comes the really exciting part. You will instruct Oracle to crawl your documents and index them. To index the binary documents you just loaded with your PL/SQL routine, execute the following code using Oracle Text’s BASIC_LEXER:

begin   ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER');   ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-'); end; / create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context  parameters ('lexer doc_lexer sync (on commit)');

Depending on your system’s speed, the preceding statement may take a minute to process. The new transactional parameter “sync (on commit)” in Oracle 10g enables you to have your index immediately refresh when changes occur to your base table. Previously, you most likely would have done this periodically via a DBMS_JOB.

To index your URLs, both internal and external sites, you use a different statement and different preferences:

drop index idxContentMgmtURL;create index idxContentMgmtURL on content_inventory(url) indextype is ctxsys.context;

That does it. You have now indexed the content in each of these files and the content from your internal and external Web sites.

Step 4: Performing Queries
To prove your indexes were created properly and to demonstrate a number of ways for querying your content management repository, run through a number of test cases:

SQL> --for readabilitySQL> SET LINESIZE 5000SQL> SET PAGESIZE 5000

The following simple example demonstrates returning documents (not URLs) that match one specific word. In this case, you are searching for the word “oracle” in any of your documents:

SQL> select content_inventory_id, file_name	from CMDEMO.content_inventory 	where contains(text,'oracle') > 0;CONTENT_INVENTORY_ID FILE_NAME-------------------- --------------                   3 OO4O_ORA.XLS                   2 Readme.doc                   1 install.pdf

For the sake of brevity, I left off the returned result sets for the following queries. The purpose here is to show queries for some scenarios you likely will encounter:

  1. This example returns URLs (not documents) that match one specific word:
    SQL> select content_inventory_id, url	from CMDEMO.content_inventory	where contains(url,'&WORD_TO_FIND') > 0;
  2. This example returns documents or URLs that match ANY one of your words. In this case, you are looking for documents and URLs that contain the word “sales” or “oracle”:
    SQL> select content_inventory_id, 	  decode(file_name,null,url,file_name) Location	from CMDEMO.content_inventory 	where contains(text,'sales OR oracle') > 0	   or contains(url,'sales OR oracle') > 0;
  3. This example returns documents or URLs that match ALL of your words. In this case, you are looking for documents and URLs that contain the word “sales” and “oracle”:
    SQL> select content_inventory_id, file_name	from CMDEMO.content_inventory 	where contains(text,'sales AND oracle') > 0	   or contains(url,'sales AND oracle') > 0;
  4. This example returns documents that contain the word “oracle” but not the word “cluster”:
    select content_inventory_id, file_name from CMDEMO.content_inventory where contains(text,'oracle') > 0   and contains(text,'cluster') = 0

As mentioned earlier, your business users also want the flexibility to enter keywords that are not part of the actual document or Web site being indexed. For this reason, you created the KEYWORDS field in your CONTENT_INVENTORY table. When you inserted your records into this source table, you gave one of the records a value in the KEYWORDS column. None of your source documents or URLs has the word “encompassing” in them. To return documents or URLs that match this specific keyword, use the following routine:

SQL> select content_inventory_id, 	  decode(file_name,null,url,file_name) Location	from CMDEMO.content_inventory 	where contains(text,'encompassing') > 0	   or contains(url,'encompassing') > 0	   or upper(keywords) like '%ENCOMPASSING%';

Step 5: Ad hoc Query
Suppose you had some data sitting in another table, schema, or even on a different database. You could still harness the power of the Oracle Text index you previously defined to perform searches on this disconnected data source. The policy-based services of “ctx_doc.policy_filter” allow text processing of ad hoc data. Notice in the example below that the MARKETING_DATA table does not have an Oracle Text index set on it. You pull the “text” column from this table and use your previously defined “idxContentMgmtBinary” index to filter the data:

declare  output 	clob;  input 	blob;  vfile_name	varchar2(255);  NUMBER_FOUND	NUMBER;begin  select text, file_name into input, vfile_name     from scott.marketing_data     where marketing_data_id = 1;  ctx_doc.policy_filter('idxContentMgmtBinary',input,output);  -- now query using the in memory CLOB data  select count(1) INTO NUMBER_FOUND from dual   where upper(output) like '%ORACLE%';  IF NUMBER_FOUND > 0 THEN    dbms_output.put_line(vfile_name || ' contains the query string specified.');  ELSE    dbms_output.put_line('No records matched your search criteria');  END IF;  dbms_lob.freetemporary(output);end;/

Provided the document sitting in your MARKETING_DATA table contains the query string you specified, you will see the following output:

MarketingInfo.pdf contains the query string specified.

Blow Your Users Away
You’ve seen how using Oracle Text can help you deliver highly sophisticated search techniques. Using this article’s examples, you showed your query results via SQL*Plus. Typically, you would present the document/URL returned by the example queries as a “clickable” link to the user. If you wanted to get fancy, you could present the documents with the query terms highlighted by using the CTX_DOC package.

For more information and instructions on utilizing Oracle Text, see the Oracle Text pages. As a next step, you could review the document management/content management needs your business users have and pilot the Oracle Text functionality with them. The business users at my company were blown away when we showed them how quickly we could put this together.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

The Latest

microsoft careers

Top Careers at Microsoft

Microsoft has gained its position as one of the top companies in the world, and Microsoft careers are flourishing. This multinational company is efficiently developing popular software and computers with other consumer electronics. It is a dream come true for so many people to acquire a high paid, high-prestige job

your company's audio

4 Areas of Your Company Where Your Audio Really Matters

Your company probably relies on audio more than you realize. Whether you’re creating a spoken text message to a colleague or giving a speech, you want your audio to shine. Otherwise, you could cause avoidable friction points and potentially hurt your brand reputation. For example, let’s say you create a

chrome os developer mode

How to Turn on Chrome OS Developer Mode

Google’s Chrome OS is a popular operating system that is widely used on Chromebooks and other devices. While it is designed to be simple and user-friendly, there are times when users may want to access additional features and functionality. One way to do this is by turning on Chrome OS