Create a Robust Document Management System with Oracle Text

Create a Robust Document Management System with Oracle Text

f you need a way to search through various documents with differing formats (PDF, Word documents, PowerPoint presentations, Excel spreadsheets)?including Web content?use Oracle Text. Oracle Text has a powerful text-search engine that allows users to search and manage documents and URLs as quickly and easily as any other data. If you’re an application developer, you can utilize Oracle Text searches on anything from simple comment fields to disparate documents, and you can even create a full-scale document management system. This article examines a content management example, using sample code and step-by-step instructions, to help you do just this.

The examples this article covers come from an actual content management system at my company. Currently, document owners manually enter metadata about each document into a content management intranet. For assigning categories, brands, and communication types to the document, manual data selection is sufficient.

The challenge comes when the owner needs to add keywords about the document to the content management system’s database. Most often, the owner opens the document, reads through it, and either copies and pastes the information into the administration screen or puts in the keywords he or she thinks best represent the document. As a result, the search results for documents on this site did not always return all the documents associated with particular keywords?since the owner did not enter all relevant data into the keywords field when adding the record.

By using the Oracle Text functionality, the content management team now has better results from user searches, as well as an easier method for document owners to manage keyword information about the document. In conjunction with the Oracle Text data, document owners can also add words that are synonymous with the document even when those words aren’t part of the text within the document.

Figure 1 depicts the architecture utilized in this example.

Figure 1: Architecture for Content Management System Utilizing Oracle Text

You can use some of these features to deliver a robust content management search engine. Just follow the steps.

Step 1: Preliminary Setup
These instructions assume you have Oracle Text installed. If you haven’t set it up yet, please refer to the Oracle installation guide. If you have access to OracleMetaLink, you can follow [Note:150316.1]. The following statement will perform a quick check to verify whether Oracle Text was set up when your database was installed (connected as CTXSYS user):

column library_name format a8 column file_spec format a60   select library_name,file_spec,dynamic,status from user_libraries; LIBRARY 	FILE_SPEC                                                    	D	STATUS--------	--------------------------------	-	-------DR$LIBX  	D:oracleora92inoractxx9.dll                             	Y	VALIDDR$LIB                                                                		N	VALID

After you have ensured Oracle Text is properly configured, connect as user SYSTEM. To follow the examples throughout this article, you need to create a user account and a directory where the documents that need indexing reside. You can customize the location of your document repository by changing the following path:

create user cmdemo identified by password;grant dba,connect,resource,create session to cmdemo;create or replace directory DIR_TESTCASE as 'G:ContentMgmt';grant read on directory DIR_TESTCASE to PUBLIC;

Step 2: Create Table and Import Data
As user CMDEMO, create the main table to house your content:

CREATE TABLE CMDEMO.CONTENT_INVENTORY (  CONTENT_INVENTORY_ID 	NUMBER NOT NULL,   FILE_NAME 		VARCHAR2(1024),   KEYWORDS 		VARCHAR2(2048),   URL 			SYS.HTTPURITYPE,   TEXT 			BLOB DEFAULT empty_blob(),   TEXT_LOADED 		CHAR(1) DEFAULT 'N' NOT NULL,     CONSTRAINT UNQ_CONTENT_INVENTORY       UNIQUE(CONTENT_INVENTORY_ID) USING INDEX    TABLESPACE USERS); 

Several fields within this routine are worth noting and understanding:

  • Notice that you set your BLOB field to default to empty_blob(). This avoids uninitialized blob errors.
  • The FILE_NAME field is the name of your binary documents (e.g., Word docs, PDFs, XLS, etc.).
  • The KEYWORDS field, as mentioned earlier, gives you the ability to add words that are synonymous with the document but are not part of the text within the document (an example of using this comes later).
  • The URL field allows you to harvest text from internal and external Web sites.
  • The TEXT field is where the binary documents are loaded into the database via your PL/SQL routine, which I will cover in just a moment.
  • The TEXT_LOADED field is optional in this example, but it is useful in determining whether the document has been loaded into the database yet.

Load several rows into your CONTENT_MANAGEMENT table. This may seem redundant, but you are loading different file types in each row for demonstration purposes:

--PDF DocumentINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('1', 'install.pdf', 'encompassing');--Word DocumentINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('2', 'Readme.doc', '');--Excel SpreadsheetINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('3', 'OO4O_ORA.XLS', '');--Powerpoint PresentationINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, FILE_NAME, KEYWORDS)    VALUES ('4', ' SalesPresentation.ppt ', '');--Internal website (Intranet)INSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, URL, KEYWORDS)    VALUES ('5', sys.httpuritype.createuri('http://www.your-intranet.com'), '');--External websiteINSERT INTO CMDEMO.CONTENT_INVENTORY   (CONTENT_INVENTORY_ID, URL, KEYWORDS)    VALUES ('6', sys.httpuritype.createuri('http://otn.oracle.com/index.html'), '');COMMIT;

Notice you gave an additional keyword in the KEYWORDS column on the first record. You will use this as additional metadata searching later in the example.

Now that you have some rows of data, you must load the files specified in FILE_NAME into the TEXT column of your CONTENT_INVENTORY table. You can load this data in a variety of methods, including:

  • DBMS_LOB.LOADFROMFILE ? the method you use in this article
  • SQL INSERT statements
  • OCI ? Oracle Call Interface
  • Ctxload executable
  • SQL*Loader

The following routine creates a cursor of your data in the CONTENT_MANAGEMENT table and loads the documents into the BLOB field (only for the records that have a FILE_NAME; you’ll see how to index records that have URLs defined separately in the next step):

CREATE OR REPLACE PROCEDURE loadLOBFromBFILE IS   Dest_loc       BLOB;   Src_loc        BFILE := BFILENAME('DIR_TESTCASE', '');   Amount         INTEGER := 0;   cursor ContentInv_cursor is     --create a recordset of the file names     --that we need to load from the filesystem into the      --Oracle database.     SELECT        CONTENT_INVENTORY_ID,        FILE_NAME       FROM CMDEMO.CONTENT_INVENTORY       WHERE FILE_NAME IS NOT NULL;   varContentInv_id CONTENT_INVENTORY.CONTENT_INVENTORY_ID%TYPE;   varFileName	   CONTENT_INVENTORY.FILE_NAME%TYPE;BEGIN open ContentInv_cursor; loop   varFileName := NULL;   varContentInv_id := NULL;   fetch ContentInv_cursor into varContentInv_id, varFileName;   exit when ContentInv_cursor%notfound;   Src_loc := BFILENAME('DIR_TESTCASE', varFileName);   IF DBMS_LOB.FILEEXISTS (Src_loc) = 1 THEN     SELECT text INTO Dest_loc FROM CONTENT_INVENTORY        WHERE CONTENT_INVENTORY_ID = varContentInv_id FOR UPDATE;     /* Opening the LOB is mandatory */     DBMS_LOB.OPEN(Src_loc, DBMS_LOB.LOB_READONLY);     DBMS_LOB.LOADFROMFILE(Dest_loc, Src_loc, dbms_lob.getlength(Src_loc));     /* Closing the LOB is mandatory */     DBMS_LOB.CLOSE(Src_loc);	    /* Optional update statement */     UPDATE CMDEMO.CONTENT_INVENTORY        SET TEXT_LOADED = 'Y'        WHERE CONTENT_INVENTORY_ID = varContentInv_id;     COMMIT;   END IF;end loop;EXCEPTION WHEN OTHERS THEN RAISE_APPLICATION_ERROR (-20001, 'Error occurred while loading LOBs.'); close ContentInv_cursor;END;/That created the procedure. Now, execute it and load the documents into the database:SQL> set serveroutput on;SQL> EXECUTE loadLOBFromBFILE;PL/SQL procedure successfully completed.

Step 3: Index Content (with Oracle Text)
Now comes the really exciting part. You will instruct Oracle to crawl your documents and index them. To index the binary documents you just loaded with your PL/SQL routine, execute the following code using Oracle Text’s BASIC_LEXER:

begin   ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER');   ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-'); end; / create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context  parameters ('lexer doc_lexer sync (on commit)');

Depending on your system’s speed, the preceding statement may take a minute to process. The new transactional parameter “sync (on commit)” in Oracle 10g enables you to have your index immediately refresh when changes occur to your base table. Previously, you most likely would have done this periodically via a DBMS_JOB.

To index your URLs, both internal and external sites, you use a different statement and different preferences:

drop index idxContentMgmtURL;create index idxContentMgmtURL on content_inventory(url) indextype is ctxsys.context;

That does it. You have now indexed the content in each of these files and the content from your internal and external Web sites.

Step 4: Performing Queries
To prove your indexes were created properly and to demonstrate a number of ways for querying your content management repository, run through a number of test cases:

SQL> --for readabilitySQL> SET LINESIZE 5000SQL> SET PAGESIZE 5000

The following simple example demonstrates returning documents (not URLs) that match one specific word. In this case, you are searching for the word “oracle” in any of your documents:

SQL> select content_inventory_id, file_name	from CMDEMO.content_inventory 	where contains(text,'oracle') > 0;CONTENT_INVENTORY_ID FILE_NAME-------------------- --------------                   3 OO4O_ORA.XLS                   2 Readme.doc                   1 install.pdf

For the sake of brevity, I left off the returned result sets for the following queries. The purpose here is to show queries for some scenarios you likely will encounter:

  1. This example returns URLs (not documents) that match one specific word:
    SQL> select content_inventory_id, url	from CMDEMO.content_inventory	where contains(url,'&WORD_TO_FIND') > 0;
  2. This example returns documents or URLs that match ANY one of your words. In this case, you are looking for documents and URLs that contain the word “sales” or “oracle”:
    SQL> select content_inventory_id, 	  decode(file_name,null,url,file_name) Location	from CMDEMO.content_inventory 	where contains(text,'sales OR oracle') > 0	   or contains(url,'sales OR oracle') > 0;
  3. This example returns documents or URLs that match ALL of your words. In this case, you are looking for documents and URLs that contain the word “sales” and “oracle”:
    SQL> select content_inventory_id, file_name	from CMDEMO.content_inventory 	where contains(text,'sales AND oracle') > 0	   or contains(url,'sales AND oracle') > 0;
  4. This example returns documents that contain the word “oracle” but not the word “cluster”:
    select content_inventory_id, file_name from CMDEMO.content_inventory where contains(text,'oracle') > 0   and contains(text,'cluster') = 0

As mentioned earlier, your business users also want the flexibility to enter keywords that are not part of the actual document or Web site being indexed. For this reason, you created the KEYWORDS field in your CONTENT_INVENTORY table. When you inserted your records into this source table, you gave one of the records a value in the KEYWORDS column. None of your source documents or URLs has the word “encompassing” in them. To return documents or URLs that match this specific keyword, use the following routine:

SQL> select content_inventory_id, 	  decode(file_name,null,url,file_name) Location	from CMDEMO.content_inventory 	where contains(text,'encompassing') > 0	   or contains(url,'encompassing') > 0	   or upper(keywords) like '%ENCOMPASSING%';

Step 5: Ad hoc Query
Suppose you had some data sitting in another table, schema, or even on a different database. You could still harness the power of the Oracle Text index you previously defined to perform searches on this disconnected data source. The policy-based services of “ctx_doc.policy_filter” allow text processing of ad hoc data. Notice in the example below that the MARKETING_DATA table does not have an Oracle Text index set on it. You pull the “text” column from this table and use your previously defined “idxContentMgmtBinary” index to filter the data:

declare  output 	clob;  input 	blob;  vfile_name	varchar2(255);  NUMBER_FOUND	NUMBER;begin  select text, file_name into input, vfile_name     from scott.marketing_data     where marketing_data_id = 1;  ctx_doc.policy_filter('idxContentMgmtBinary',input,output);  -- now query using the in memory CLOB data  select count(1) INTO NUMBER_FOUND from dual   where upper(output) like '%ORACLE%';  IF NUMBER_FOUND > 0 THEN    dbms_output.put_line(vfile_name || ' contains the query string specified.');  ELSE    dbms_output.put_line('No records matched your search criteria');  END IF;  dbms_lob.freetemporary(output);end;/

Provided the document sitting in your MARKETING_DATA table contains the query string you specified, you will see the following output:

MarketingInfo.pdf contains the query string specified.

Blow Your Users Away
You’ve seen how using Oracle Text can help you deliver highly sophisticated search techniques. Using this article’s examples, you showed your query results via SQL*Plus. Typically, you would present the document/URL returned by the example queries as a “clickable” link to the user. If you wanted to get fancy, you could present the documents with the query terms highlighted by using the CTX_DOC package.

For more information and instructions on utilizing Oracle Text, see the Oracle Text pages. As a next step, you could review the document management/content management needs your business users have and pilot the Oracle Text functionality with them. The business users at my company were blown away when we showed them how quickly we could put this together.

devx-admin

devx-admin

Share the Post:
Clean Energy Adoption

Inside Michigan’s Clean Energy Revolution

Democratic state legislators in Michigan continue to discuss and debate clean energy legislation in the hopes of establishing a comprehensive clean energy strategy for the

Chips Act Revolution

European Chips Act: What is it?

In response to the intensifying worldwide technology competition, Europe has unveiled the long-awaited European Chips Act. This daring legislative proposal aims to fortify Europe’s semiconductor

Revolutionized Low-Code

You Should Use Low-Code Platforms for Apps

As the demand for rapid software development increases, low-code platforms have emerged as a popular choice among developers for their ability to build applications with

Global Layoffs

Tech Layoffs Are Getting Worse Globally

Since the start of 2023, the global technology sector has experienced a significant rise in layoffs, with over 236,000 workers being let go by 1,019

Clean Energy Adoption

Inside Michigan’s Clean Energy Revolution

Democratic state legislators in Michigan continue to discuss and debate clean energy legislation in the hopes of establishing a comprehensive clean energy strategy for the state. A Senate committee meeting

Chips Act Revolution

European Chips Act: What is it?

In response to the intensifying worldwide technology competition, Europe has unveiled the long-awaited European Chips Act. This daring legislative proposal aims to fortify Europe’s semiconductor supply chain and enhance its

Revolutionized Low-Code

You Should Use Low-Code Platforms for Apps

As the demand for rapid software development increases, low-code platforms have emerged as a popular choice among developers for their ability to build applications with minimal coding. These platforms not

Cybersecurity Strategy

Five Powerful Strategies to Bolster Your Cybersecurity

In today’s increasingly digital landscape, businesses of all sizes must prioritize cyber security measures to defend against potential dangers. Cyber security professionals suggest five simple technological strategies to help companies

Global Layoffs

Tech Layoffs Are Getting Worse Globally

Since the start of 2023, the global technology sector has experienced a significant rise in layoffs, with over 236,000 workers being let go by 1,019 tech firms, as per data

Huawei Electric Dazzle

Huawei Dazzles with Electric Vehicles and Wireless Earbuds

During a prominent unveiling event, Huawei, the Chinese telecommunications powerhouse, kept quiet about its enigmatic new 5G phone and alleged cutting-edge chip development. Instead, Huawei astounded the audience by presenting

Cybersecurity Banking Revolution

Digital Banking Needs Cybersecurity

The banking, financial, and insurance (BFSI) sectors are pioneers in digital transformation, using web applications and application programming interfaces (APIs) to provide seamless services to customers around the world. Rising

FinTech Leadership

Terry Clune’s Fintech Empire

Over the past 30 years, Terry Clune has built a remarkable business empire, with CluneTech at the helm. The CEO and Founder has successfully created eight fintech firms, attracting renowned

The Role Of AI Within A Web Design Agency?

In the digital age, the role of Artificial Intelligence (AI) in web design is rapidly evolving, transitioning from a futuristic concept to practical tools used in design, coding, content writing

Generative AI Revolution

Is Generative AI the Next Internet?

The increasing demand for Generative AI models has led to a surge in its adoption across diverse sectors, with healthcare, automotive, and financial services being among the top beneficiaries. These

Microsoft Laptop

The New Surface Laptop Studio 2 Is Nuts

The Surface Laptop Studio 2 is a dynamic and robust all-in-one laptop designed for creators and professionals alike. It features a 14.4″ touchscreen and a cutting-edge design that is over

5G Innovations

GPU-Accelerated 5G in Japan

NTT DOCOMO, a global telecommunications giant, is set to break new ground in the industry as it prepares to launch a GPU-accelerated 5G network in Japan. This innovative approach will

AI Ethics

AI Journalism: Balancing Integrity and Innovation

An op-ed, produced using Microsoft’s Bing Chat AI software, recently appeared in the St. Louis Post-Dispatch, discussing the potential concerns surrounding the employment of artificial intelligence (AI) in journalism. These

Savings Extravaganza

Big Deal Days Extravaganza

The highly awaited Big Deal Days event for October 2023 is nearly here, scheduled for the 10th and 11th. Similar to the previous year, this autumn sale has already created

Cisco Splunk Deal

Cisco Splunk Deal Sparks Tech Acquisition Frenzy

Cisco’s recent massive purchase of Splunk, an AI-powered cybersecurity firm, for $28 billion signals a potential boost in tech deals after a year of subdued mergers and acquisitions in the

Iran Drone Expansion

Iran’s Jet-Propelled Drone Reshapes Power Balance

Iran has recently unveiled a jet-propelled variant of its Shahed series drone, marking a significant advancement in the nation’s drone technology. The new drone is poised to reshape the regional

Solar Geoengineering

Did the Overshoot Commission Shoot Down Geoengineering?

The Overshoot Commission has recently released a comprehensive report that discusses the controversial topic of Solar Geoengineering, also known as Solar Radiation Modification (SRM). The Commission’s primary objective is to

Remote Learning

Revolutionizing Remote Learning for Success

School districts are preparing to reveal a substantial technological upgrade designed to significantly improve remote learning experiences for both educators and students amid the ongoing pandemic. This major investment, which

Revolutionary SABERS Transforming

SABERS Batteries Transforming Industries

Scientists John Connell and Yi Lin from NASA’s Solid-state Architecture Batteries for Enhanced Rechargeability and Safety (SABERS) project are working on experimental solid-state battery packs that could dramatically change the

Build a Website

How Much Does It Cost to Build a Website?

Are you wondering how much it costs to build a website? The approximated cost is based on several factors, including which add-ons and platforms you choose. For example, a self-hosted

Battery Investments

Battery Startups Attract Billion-Dollar Investments

In recent times, battery startups have experienced a significant boost in investments, with three businesses obtaining over $1 billion in funding within the last month. French company Verkor amassed $2.1

Copilot Revolution

Microsoft Copilot: A Suit of AI Features

Microsoft’s latest offering, Microsoft Copilot, aims to revolutionize the way we interact with technology. By integrating various AI capabilities, this all-in-one tool provides users with an improved experience that not

AI Girlfriend Craze

AI Girlfriend Craze Threatens Relationships

The surge in virtual AI girlfriends’ popularity is playing a role in the escalating issue of loneliness among young males, and this could have serious repercussions for America’s future. A

AIOps Innovations

Senser is Changing AIOps

Senser, an AIOps platform based in Tel Aviv, has introduced its groundbreaking AI-powered observability solution to support developers and operations teams in promptly pinpointing the root causes of service disruptions