Browse DevX
Sign up for e-mail newsletters from DevX


The Oracle 10g Data Pump API Speeds Up the ETL Process : Page 2

How can you keep slow extract, transform, and load operations from bogging down your production instance? Oracle 10g's Data Pump API offers a great solution.



DBMS_DATAPUMP, a supplied PL/SQL package, implements the DataPump API. Using DataPump directly via the API instead of using the command-line expdp and impdp provides several advantages, including the following:
  • You can integrate the load code with other PL/SQL ETL code.
  • You don't have to coordinate PL/SQL ETL procedures with O/S jobs.
  • Most interestingly, for this article, you don't need a dump file set. You can move data directly from your load database into your production database using the dbms_datapump API over a database link. No export job is required on the source system. The database link is supplied directly as a parameter to the import job on the target system, instead of a dump file set.

So, the strategy prescribed in this article will be to move data from the load instance to the production instance via DBMS_DATAPUMP over a database link.

For example, a job to refresh the production table ALL_SHIP_DATES_PLUS_BILL_HIST would go as follows:

  1. On the load instance, download the raw data that goes into ALL_SHIP_DATES_PLUS_BILL_HIST.
  2. On the load instance, transform the data into the right resultset to go into ALL_SHIP_DATES_PLUS_BILL_HIST.
  3. Put this data into a table named ALL_SHIP_DATES_PLUS_BILL_HIST on the load instance.
    [Note: If you were refreshing more than just this one table, you'd carry out steps 1-3 for all the other tables at this point before proceeding.]
  4. Copy the data from ALL_SHIP_DATES_PLUS_BILL_HIST on the load instance to ALL_SHIP_DATES_PLUS_BILL_HIST on the production instance, using DBMS_DATAPUMP with a database link.

At this point, you may be wondering, 'Why not just use SQL to move the data over the database link?' For example, if you were copying the contents of ALL_SHIP_DATES_PLUS_BILL_HIST from instance A to B, you could log on to instance B and issue these commands:

insert into ALL_SHIP_DATES_PLUS_BILL_HIST (select * from ALL_SHIP_DATES_PLUS_BILL_HIST@instanceA.world );

If you're moving more than a handful of tables' worth of data, DBMS_DATAPUMP is superior to the above approach for a number of reasons:

  • Ease of use and manageability: You don't need to specify the name of each table individually. You can supply this information by supplying a pattern, a schema, a tablespace, etc. This improves manageability as well as ease of programming: as new tables are added, make sure the table names fit the pattern or are in the appropriate schema or tablespace, and they will automatically be added to the Data Pump import.
  • Optimization: DBMS_DATAPUMP is optimized for batch performance, and it is self-tuning.
  • Indexes: You can bring over indexes as well as tables, eliminating the need to rebuild indexes after refreshing tables.
  • Interruptibility: DBMS_DATAPUMP exports and loads are interruptible and resumable. As mentioned in Sidebar 2: How Data Pump Works, information about each export or load is stored in a database table, so a stopped DataPump process can easily be resumed where it left off.
  • Logging, error handling, and troubleshooting: These capabilities are already extensively supported in DBMS_DATAPUMP. You'll have to write this code yourself if you implement your own load using straight selects over database links.
  • Ease of monitoring: You can attach to the running Data Pump import at any time and get its status, percent completion, any errors, etc.

Creating a Job

Data Pump imports and exports are called jobs, even though they have nothing to do with the Oracle Job Queue. The word job captures the batch, interruptible nature of a Data Pump import or export. Be aware, though, that DP jobs are not run by job queue processes, they are not created or handled via Oracle's job handling API, and they won't show up in DBA_JOBS or other Job Queue views.

The following three basic steps kick off a Data Pump job via the API, be it Import or Export:

  1. Create the job.
  2. Define parameters for the job.
  3. Start the job.

Once the job has started, you can optionally use the API to:

  • Monitor the job until it completes;
  • Detach from the job and reattach later;
  • Stop the job; and
  • Restart a previously stopped job.

Step 1: Create the Job
Use the dbms_datapump.open function to create the job. This function returns a number that serves as the job handle. You'll use the job handle in all subsequent operations to refer to the job. (If you lose the job handle, you can get it again by calling dbms_datapump.attach.)

When you create the job, you need to provide the following basic information:

  • Whether it's an import or export job
  • What the job mode will be (The modes are like those in original Export/Import, e.g., TABLE mode vs. SCHEMA mode vs. FULL mode vs. ...)
  • The name of the database link you'll be using, or NULL if none

You can optionally supply a job name, the version of the objects, and whether or not to compress metadata (see the documentation for full details). If you don't supply a job name, your job is given a system-generated name.

The following example declares a table-mode import:

my_handle number;
my_handle := dbms_datapump.open (operation => 'IMPORT',job_mode => 'TABLE', remote_link => 
my_db_link, job_name => my_job_name, version=>'LATEST' ) ; dbms_output.put_line ('Opened job ' || my_job_name || ' with handle: ' || my_handle); . . . end; / Opened job IMP_PS2_TS with handle: 53

Note that this handle is invalid out of context (i.e., if you get a handle, stop execution, and later try to plug the handle that's showing on your screen into a new block, it won't work). To attach to this job in a different context, you need to get a fresh handle using DBMS_DATAPUMP.ATTACH.

You can see what you've done so far by querying USER_DATAPUMP_JOBS or DBA_DATAPUMP_JOBS. Your job should show up with the name you've assigned it and the status DEFINING.

You'll also be able to see the master table defined for this job. The master table is in your schema, and it has the same name as the job. (Consequently, you cannot create a job with the same name as an existing schema object.)

Step 2: Define Parameters for the Job
To define parameters for the job, you must do two main things:
  1. Set any optional parameters.
  2. Name the objects in the job.

You set optional parameters by calling dbms_datapump.set_parameter. A particularly useful parameter for this task is the one that controls Data Pump Import's behavior when it tries to import a table that already exists. You can choose from TRUNCATE, REPLACE, APPEND, and SKIP. Since you're pulling pre-formatted data from the load instance into an existing table on the production instance, choose TRUNCATE or REPLACE.

The advantage of TRUNCATE is that it preserves grants, and you can set up DBMS_DATAPUMP to re-create those grants (although that can lead to messy cascades of invalidated procedures). The advantage of REPLACE is that you can easily change table structures by changing them on the load instance only, and then letting the night's data pump job propagate the changes to production. That sure beats manually maintaining identical table definitions on both instances.

The set_parameter procedure is very easy to use, as this command shows:

dbms_datapump.set_parameter (my_handle, name => 'TABLE_EXISTS_ACTION', value => 'TRUNCATE');

To demonstrate the second part of defining parameters, naming the objects that will be included in this job, pick the names of the tables that you want included from the above table-level import. You do this using the dbms_datapump.metadata_filter procedure. You tell it what type of objects you're naming using the object_type parameter, and then provide either an expression that the object names must match to be included in your dataset or a list of object names to include. This command retrieves just one table:

dbms_datapump.metadata_filter (handle => my_handle, name => 'NAME_LIST', value => 
'(''ALL_SHIP_DATES_PLUS_BILL_HIST'')', object_type => 'TABLE');

Step 3: Kick Off the Job
You start the job by calling dbms_datapump.start_job. Again, this is straightforward:

dbms_output.put_line ('Started job ' || my_job_name || ' with handle: ' || my_handle);

Thanks for your registration, follow us on our social networks to keep up-to-date