f you need a way to search through various documents with differing formats (PDF, Word documents, PowerPoint presentations, Excel spreadsheets)?including Web content?use Oracle Text. Oracle Text has a powerful text-search engine that allows users to search and manage documents and URLs as quickly and easily as any other data. If you’re an application developer, you can utilize Oracle Text searches on anything from simple comment fields to disparate documents, and you can even create a full-scale document management system. This article examines a content management example, using sample code and step-by-step instructions, to help you do just this.
The examples this article covers come from an actual content management system at my company. Currently, document owners manually enter metadata about each document into a content management intranet. For assigning categories, brands, and communication types to the document, manual data selection is sufficient.
The challenge comes when the owner needs to add keywords about the document to the content management system’s database. Most often, the owner opens the document, reads through it, and either copies and pastes the information into the administration screen or puts in the keywords he or she thinks best represent the document. As a result, the search results for documents on this site did not always return all the documents associated with particular keywords?since the owner did not enter all relevant data into the keywords field when adding the record.
By using the Oracle Text functionality, the content management team now has better results from user searches, as well as an easier method for document owners to manage keyword information about the document. In conjunction with the Oracle Text data, document owners can also add words that are synonymous with the document even when those words aren’t part of the text within the document.
Figure 1 depicts the architecture utilized in this example.
|Figure 1: Architecture for Content Management System Utilizing Oracle Text|
You can use some of these features to deliver a robust content management search engine. Just follow the steps.
Step 1: Preliminary Setup
These instructions assume you have Oracle Text installed. If you haven’t set it up yet, please refer to the Oracle installation guide. If you have access to OracleMetaLink, you can follow [Note:150316.1]. The following statement will perform a quick check to verify whether Oracle Text was set up when your database was installed (connected as CTXSYS user):
column library_name format a8 column file_spec format a60 select library_name,file_spec,dynamic,status from user_libraries; LIBRARY FILE_SPEC D STATUS-------- -------------------------------- - -------DR$LIBX D:oracleora92inoractxx9.dll Y VALIDDR$LIB N VALID
After you have ensured Oracle Text is properly configured, connect as user SYSTEM. To follow the examples throughout this article, you need to create a user account and a directory where the documents that need indexing reside. You can customize the location of your document repository by changing the following path:
create user cmdemo identified by password;grant dba,connect,resource,create session to cmdemo;create or replace directory DIR_TESTCASE as 'G:ContentMgmt';grant read on directory DIR_TESTCASE to PUBLIC;
Step 2: Create Table and Import Data
As user CMDEMO, create the main table to house your content:
CREATE TABLE CMDEMO.CONTENT_INVENTORY ( CONTENT_INVENTORY_ID NUMBER NOT NULL, FILE_NAME VARCHAR2(1024), KEYWORDS VARCHAR2(2048), URL SYS.HTTPURITYPE, TEXT BLOB DEFAULT empty_blob(), TEXT_LOADED CHAR(1) DEFAULT 'N' NOT NULL, CONSTRAINT UNQ_CONTENT_INVENTORY UNIQUE(CONTENT_INVENTORY_ID) USING INDEX TABLESPACE USERS);
Several fields within this routine are worth noting and understanding:
- Notice that you set your BLOB field to default to empty_blob(). This avoids uninitialized blob errors.
- The FILE_NAME field is the name of your binary documents (e.g., Word docs, PDFs, XLS, etc.).
- The KEYWORDS field, as mentioned earlier, gives you the ability to add words that are synonymous with the document but are not part of the text within the document (an example of using this comes later).
- The URL field allows you to harvest text from internal and external Web sites.
- The TEXT field is where the binary documents are loaded into the database via your PL/SQL routine, which I will cover in just a moment.
- The TEXT_LOADED field is optional in this example, but it is useful in determining whether the document has been loaded into the database yet.
Load several rows into your CONTENT_MANAGEMENT table. This may seem redundant, but you are loading different file types in each row for demonstration purposes:
--PDF DocumentINSERT INTO CMDEMO.CONTENT_INVENTORY (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS) VALUES ('1', 'install.pdf', 'encompassing');--Word DocumentINSERT INTO CMDEMO.CONTENT_INVENTORY (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS) VALUES ('2', 'Readme.doc', '');--Excel SpreadsheetINSERT INTO CMDEMO.CONTENT_INVENTORY (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS) VALUES ('3', 'OO4O_ORA.XLS', '');--Powerpoint PresentationINSERT INTO CMDEMO.CONTENT_INVENTORY (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS) VALUES ('4', ' SalesPresentation.ppt ', '');--Internal website (Intranet)INSERT INTO CMDEMO.CONTENT_INVENTORY (CONTENT_INVENTORY_ID, URL, KEYWORDS) VALUES ('5', sys.httpuritype.createuri('http://www.your-intranet.com'), '');--External websiteINSERT INTO CMDEMO.CONTENT_INVENTORY (CONTENT_INVENTORY_ID, URL, KEYWORDS) VALUES ('6', sys.httpuritype.createuri('http://otn.oracle.com/index.html'), '');COMMIT;
Notice you gave an additional keyword in the KEYWORDS column on the first record. You will use this as additional metadata searching later in the example.
Now that you have some rows of data, you must load the files specified in FILE_NAME into the TEXT column of your CONTENT_INVENTORY table. You can load this data in a variety of methods, including:
- DBMS_LOB.LOADFROMFILE ? the method you use in this article
- SQL INSERT statements
- OCI ? Oracle Call Interface
- Ctxload executable
The following routine creates a cursor of your data in the CONTENT_MANAGEMENT table and loads the documents into the BLOB field (only for the records that have a FILE_NAME; you’ll see how to index records that have URLs defined separately in the next step):
CREATE OR REPLACE PROCEDURE loadLOBFromBFILE IS Dest_loc BLOB; Src_loc BFILE := BFILENAME('DIR_TESTCASE', ''); Amount INTEGER := 0; cursor ContentInv_cursor is --create a recordset of the file names --that we need to load from the filesystem into the --Oracle database. SELECT CONTENT_INVENTORY_ID, FILE_NAME FROM CMDEMO.CONTENT_INVENTORY WHERE FILE_NAME IS NOT NULL; varContentInv_id CONTENT_INVENTORY.CONTENT_INVENTORY_ID%TYPE; varFileName CONTENT_INVENTORY.FILE_NAME%TYPE;BEGIN open ContentInv_cursor; loop varFileName := NULL; varContentInv_id := NULL; fetch ContentInv_cursor into varContentInv_id, varFileName; exit when ContentInv_cursor%notfound; Src_loc := BFILENAME('DIR_TESTCASE', varFileName); IF DBMS_LOB.FILEEXISTS (Src_loc) = 1 THEN SELECT text INTO Dest_loc FROM CONTENT_INVENTORY WHERE CONTENT_INVENTORY_ID = varContentInv_id FOR UPDATE; /* Opening the LOB is mandatory */ DBMS_LOB.OPEN(Src_loc, DBMS_LOB.LOB_READONLY); DBMS_LOB.LOADFROMFILE(Dest_loc, Src_loc, dbms_lob.getlength(Src_loc)); /* Closing the LOB is mandatory */ DBMS_LOB.CLOSE(Src_loc); /* Optional update statement */ UPDATE CMDEMO.CONTENT_INVENTORY SET TEXT_LOADED = 'Y' WHERE CONTENT_INVENTORY_ID = varContentInv_id; COMMIT; END IF;end loop;EXCEPTION WHEN OTHERS THEN RAISE_APPLICATION_ERROR (-20001, 'Error occurred while loading LOBs.'); close ContentInv_cursor;END;/That created the procedure. Now, execute it and load the documents into the database:SQL> set serveroutput on;SQL> EXECUTE loadLOBFromBFILE;PL/SQL procedure successfully completed.
Step 3: Index Content (with Oracle Text)
Now comes the really exciting part. You will instruct Oracle to crawl your documents and index them. To index the binary documents you just loaded with your PL/SQL routine, execute the following code using Oracle Text’s BASIC_LEXER:
begin ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER'); ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-'); end; / create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context parameters ('lexer doc_lexer sync (on commit)');
Depending on your system’s speed, the preceding statement may take a minute to process. The new transactional parameter “sync (on commit)” in Oracle 10g enables you to have your index immediately refresh when changes occur to your base table. Previously, you most likely would have done this periodically via a DBMS_JOB.
To index your URLs, both internal and external sites, you use a different statement and different preferences:
drop index idxContentMgmtURL;create index idxContentMgmtURL on content_inventory(url) indextype is ctxsys.context;
That does it. You have now indexed the content in each of these files and the content from your internal and external Web sites.
Step 4: Performing Queries
To prove your indexes were created properly and to demonstrate a number of ways for querying your content management repository, run through a number of test cases:
SQL> --for readabilitySQL> SET LINESIZE 5000SQL> SET PAGESIZE 5000
The following simple example demonstrates returning documents (not URLs) that match one specific word. In this case, you are searching for the word “oracle” in any of your documents:
SQL> select content_inventory_id, file_name from CMDEMO.content_inventory where contains(text,'oracle') > 0;CONTENT_INVENTORY_ID FILE_NAME-------------------- -------------- 3 OO4O_ORA.XLS 2 Readme.doc 1 install.pdf
For the sake of brevity, I left off the returned result sets for the following queries. The purpose here is to show queries for some scenarios you likely will encounter:
- This example returns URLs (not documents) that match one specific word:
SQL> select content_inventory_id, url from CMDEMO.content_inventory where contains(url,'&WORD_TO_FIND') > 0;
- This example returns documents or URLs that match ANY one of your words. In this case, you are looking for documents and URLs that contain the word “sales” or “oracle”:
SQL> select content_inventory_id, decode(file_name,null,url,file_name) Location from CMDEMO.content_inventory where contains(text,'sales OR oracle') > 0 or contains(url,'sales OR oracle') > 0;
- This example returns documents or URLs that match ALL of your words. In this case, you are looking for documents and URLs that contain the word “sales” and “oracle”:
SQL> select content_inventory_id, file_name from CMDEMO.content_inventory where contains(text,'sales AND oracle') > 0 or contains(url,'sales AND oracle') > 0;
- This example returns documents that contain the word “oracle” but not the word “cluster”:
select content_inventory_id, file_name from CMDEMO.content_inventory where contains(text,'oracle') > 0 and contains(text,'cluster') = 0
As mentioned earlier, your business users also want the flexibility to enter keywords that are not part of the actual document or Web site being indexed. For this reason, you created the KEYWORDS field in your CONTENT_INVENTORY table. When you inserted your records into this source table, you gave one of the records a value in the KEYWORDS column. None of your source documents or URLs has the word “encompassing” in them. To return documents or URLs that match this specific keyword, use the following routine:
SQL> select content_inventory_id, decode(file_name,null,url,file_name) Location from CMDEMO.content_inventory where contains(text,'encompassing') > 0 or contains(url,'encompassing') > 0 or upper(keywords) like '%ENCOMPASSING%';
Step 5: Ad hoc Query
Suppose you had some data sitting in another table, schema, or even on a different database. You could still harness the power of the Oracle Text index you previously defined to perform searches on this disconnected data source. The policy-based services of “ctx_doc.policy_filter” allow text processing of ad hoc data. Notice in the example below that the MARKETING_DATA table does not have an Oracle Text index set on it. You pull the “text” column from this table and use your previously defined “idxContentMgmtBinary” index to filter the data:
declare output clob; input blob; vfile_name varchar2(255); NUMBER_FOUND NUMBER;begin select text, file_name into input, vfile_name from scott.marketing_data where marketing_data_id = 1; ctx_doc.policy_filter('idxContentMgmtBinary',input,output); -- now query using the in memory CLOB data select count(1) INTO NUMBER_FOUND from dual where upper(output) like '%ORACLE%'; IF NUMBER_FOUND > 0 THEN dbms_output.put_line(vfile_name || ' contains the query string specified.'); ELSE dbms_output.put_line('No records matched your search criteria'); END IF; dbms_lob.freetemporary(output);end;/
Provided the document sitting in your MARKETING_DATA table contains the query string you specified, you will see the following output:
MarketingInfo.pdf contains the query string specified.
Blow Your Users Away
You’ve seen how using Oracle Text can help you deliver highly sophisticated search techniques. Using this article’s examples, you showed your query results via SQL*Plus. Typically, you would present the document/URL returned by the example queries as a “clickable” link to the user. If you wanted to get fancy, you could present the documents with the query terms highlighted by using the CTX_DOC package.
For more information and instructions on utilizing Oracle Text, see the Oracle Text pages. As a next step, you could review the document management/content management needs your business users have and pilot the Oracle Text functionality with them. The business users at my company were blown away when we showed them how quickly we could put this together.