Consider the following scenario. You have a production system that needs a lot of data from other systems on hand during the day. The data sources you get this data from return results slowly, and you’ve got network latency that results in slow extracts. Additionally, a fair number of transformations have to be applied to the data, which also takes time. Thus, it’s time-consuming to extract, transform, and load this data from the other systems into your main instance. How can you keep these long, slow extract, transform, and load (ETL) operations from bogging down your production instance?
Of course, if you have the access and authority to push for changes in the source systems, you can speed up the ETL process by speeding up the extract at the source end, and even by providing the data in a format that needs less transformation. But in the real world, you often can’t waltz in and demand that another team optimize for delivery to your own project. Long, slow extracts and complex transformations are a fact of life, and it’s up to you to minimize the impact on your system.
The best solution (one that doesn’t require buying any outside products) is to create a separate load instance, run the ETL on this instance, and then ship the finished, freshly loaded tables to the production instance (see Sidebar 1: Solutions for Minimizing Impact of Slow ETL Operations on Your Production Instance for a few others). This article demonstrates how to implement this solution, which involves three steps:
- Set up a separate load instance.
- Carry out the ETL on the load instance.
- Ship the finished, freshly loaded tables from the load instance to the production instance.
Step 1 is a straightforward DBA matter. Step 2 requires you to know your data sources and your target system. The major implementation decision is: how shall I carry out Step 3? What’s the best way to ship the data from the load instance to the production instance?
|What You Need|
|You will need access to two Oracle instances running Oracle 10gR1 or 10gR2, and the CREATE DATABASE LINK privilege on at least one instance.|
Oracle Data Pump
Oracle 10g has a new feature that offers a great solution. Oracle Data Pump is Oracle’s next-generation version of Export and Import. (For those unfamiliar with these tools, Export extracts data and metadata from an Oracle database into a proprietary-format dump file; Import extracts data and metadata from an Export dump file and recreates the source objects in the target database. See Sidebar 2: How Data Pump Works.)
Oracle Data Pump can be used to move data between databases, to copy databases, or to back them up (although Oracle recommends you use RMAN for database copies and backups, not Data Pump). From a programmer’s point of view, there are three big differences between Data Pump and the original Export/Import:
- Data Pump is faster. A lot faster.
- Data Pump jobs are resumable, and you can attach/detach to them while they run.
- Data Pump can be controlled from within the database. You can kick off a Data Pump export or import job from a PL/SQL procedure, or use PL/SQL to monitor the progress of a Data Pump job from within the database.
Data Pump uses two access methods to import and export table data: Direct Path load/unload and External Tables. It chooses the access method behind the scenes on a per-table basis, and the two are mutually compatible: data unloaded with one method can be loaded with the other.
Most programmers know Data Pump through the two command-line utilities (expdp and impdp) that replace the original Export and Import (exp and imp). The expdp and impdp utilities have similar interfaces to exp and imp and can be used for the same tasks. Note, however, that Data Pump dump files are not compatible with original Export and Import, and vice versa. (For a full comparison between exp/imp and expdp/impdp, refer to Chapters 1 and 2 of the Oracle Database Utilities manual; login required.)
Data Pump is also accessible through a PL/SQL API. In fact, expdp and impdp are just front-ends to this API. The command-line utilities transform your command into calls to the Data Pump API, which is implemented in the DBMS_DATAPUMP supplied PL/SQL package, and to the DBMS_METADATA package, which handles all operations involving metadata (i.e., database object definitions, such as table structure, storage parameters, etc.).
The Data Pump API
DBMS_DATAPUMP, a supplied PL/SQL package, implements the Data Pump API. Using Data Pump directly via the API instead of through the command-line utilities expdp and impdp provides several advantages, including the following:
- You can integrate the load code with other PL/SQL ETL code.
- You don’t have to coordinate PL/SQL ETL procedures with O/S jobs.
- Most interestingly, for this article, you don’t need a dump file set. You can move data directly from your load database into your production database using the dbms_datapump API over a database link. No export job is required on the source system. The database link is supplied directly as a parameter to the import job on the target system, instead of a dump file set.
This article’s solution uses the last of these advantages: an import job that pulls data straight into production via DBMS_DATAPUMP over a database link.
For example, a job to refresh the production table ALL_SHIP_DATES_PLUS_BILL_HIST would go as follows:
- On the load instance, download the raw data that goes into ALL_SHIP_DATES_PLUS_BILL_HIST.
- On the load instance, transform the data into the right resultset to go into ALL_SHIP_DATES_PLUS_BILL_HIST.
- Put this data into a table named ALL_SHIP_DATES_PLUS_BILL_HIST on the load instance.
[Note: If you were refreshing more than just this one table, you’d carry out steps 1-3 for all the other tables at this point before proceeding.]
- Copy the data from ALL_SHIP_DATES_PLUS_BILL_HIST on the load instance to ALL_SHIP_DATES_PLUS_BILL_HIST on the production instance, using DBMS_DATAPUMP with a database link.
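The copy in that final step can be sketched with a few API calls. This is only a sketch, not the article's Listing 1: the database link name LOAD_LINK and the job name are placeholders, and error handling is omitted.

```sql
declare
  h number;
begin
  -- A table-mode import run on the production instance that reads the
  -- table directly from the load instance over a database link
  -- (LOAD_LINK is a placeholder link name).
  h := dbms_datapump.open (operation   => 'IMPORT',
                           job_mode    => 'TABLE',
                           remote_link => 'LOAD_LINK',
                           job_name    => 'REFRESH_SHIP_DATES');

  -- Truncate and reload the existing production table.
  dbms_datapump.set_parameter (h, name  => 'TABLE_EXISTS_ACTION',
                                  value => 'TRUNCATE');

  -- Restrict the job to the one table being refreshed.
  dbms_datapump.metadata_filter (h, name  => 'NAME_LIST',
                                    value => '(''ALL_SHIP_DATES_PLUS_BILL_HIST'')');

  dbms_datapump.start_job (h);
end;
/
```

Each of these calls is covered in detail in the steps that follow.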
At this point, you may be wondering, "Why not just use SQL to move the data over the database link?" For example, if you were copying the contents of ALL_SHIP_DATES_PLUS_BILL_HIST from instance A to instance B, you could log on to instance B and issue these commands:
delete from ALL_SHIP_DATES_PLUS_BILL_HIST;
insert into ALL_SHIP_DATES_PLUS_BILL_HIST
  (select * from ALL_SHIP_DATES_PLUS_BILL_HIST@instance_a);
commit;
If you’re moving more than a handful of tables’ worth of data,
DBMS_DATAPUMP is superior to the above approach for a number of reasons:
- Ease of use and manageability: You don’t need to specify the name of each table individually. You can supply this information by supplying a pattern, a schema, a tablespace, etc. This improves manageability as well as ease of programming: as new tables are added, make sure the table names fit the pattern or are in the appropriate schema or tablespace, and they will automatically be added to the Data Pump import.
- Performance: DBMS_DATAPUMP is optimized for batch performance, and it is self-tuning.
- Indexes: You can bring over indexes as well as tables, eliminating the need to rebuild indexes after refreshing tables.
- Resumability: DBMS_DATAPUMP exports and loads are interruptible and resumable. As mentioned in Sidebar 2: How Data Pump Works, information about each export or load is stored in a database table, so a stopped Data Pump process can easily be resumed where it left off.
- Logging, error handling, and troubleshooting: These capabilities are already extensively supported in
DBMS_DATAPUMP. You’ll have to write this code yourself if you implement your own load using straight selects over database links.
- Ease of monitoring: You can attach to the running Data Pump import at any time and get its status, percent completion, any errors, etc.
Creating a Job
Data Pump imports and exports are called jobs, even though they have nothing to do with the Oracle Job Queue. The word job captures the batch, interruptible nature of a Data Pump import or export. Be aware, though, that Data Pump jobs are not run by job queue processes, they are not created or handled via Oracle’s job-handling API, and they won’t show up in DBA_JOBS or other Job Queue views.
The following three basic steps kick off a Data Pump job via the API, be it Import or Export:
- Create the job.
- Define parameters for the job.
- Start the job.
Once the job has started, you can optionally use the API to:
- Monitor the job until it completes;
- Detach from the job and reattach later;
- Stop the job; and
- Restart a previously stopped job.
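Sketched as API calls, the detach/stop/restart cycle might look like this (illustrative only; MY_JOB is a placeholder job name, assumed to belong to the current user):

```sql
declare
  h number;
begin
  -- Reattach to a job started earlier, possibly in another session.
  h := dbms_datapump.attach (job_name => 'MY_JOB');

  -- Stop it in an orderly fashion; keeping the master table lets the
  -- job be restarted later from where it left off.
  dbms_datapump.stop_job (h, immediate => 0, keep_master => 1);
end;
/
```

A later session can call dbms_datapump.attach with the same job name to get a fresh handle, then dbms_datapump.start_job to resume.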
Step 1: Create the Job
Call the dbms_datapump.open function to create the job. This function returns a number that serves as the job handle. You’ll use the job handle in all subsequent operations to refer to the job. (If you lose the job handle, you can get it again by calling dbms_datapump.attach.)
When you create the job, you need to provide the following basic information:
- Whether it’s an import or export job
- What the job mode will be (The modes are like those in original Export/Import, e.g., FULL mode vs. TABLE mode)
- The name of the database link you’ll be using, or null if you’re not using one
You can optionally supply a job name, the version of the objects, and whether or not to compress metadata (see the documentation for full details). If you don’t supply a job name, your job is given a system-generated name.
The following example declares a table-mode import:
declare
  my_handle number;
begin
  my_handle := dbms_datapump.open (
      operation   => 'IMPORT',
      job_mode    => 'TABLE',
      remote_link => my_db_link,
      job_name    => my_job_name,
      version     => 'LATEST');
  dbms_output.put_line ('Opened job ' || my_job_name || ' with handle: ' || my_handle);
  . . .
end;
/

Opened job IMP_PS2_TS with handle: 53
Note that this handle is invalid out of context (i.e., if you get a handle, stop execution, and later try to plug the handle that’s showing on your screen into a new block, it won’t work). To attach to this job in a different context, you need to get a fresh handle using dbms_datapump.attach.
You can see what you’ve done so far by querying
DBA_DATAPUMP_JOBS. Your job should show up with the name you’ve assigned it and the status DEFINING.
You’ll also be able to see the master table defined for this job. The master table is in your schema, and it has the same name as the job. (Consequently, you cannot create a job with the same name as an existing schema object.)
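For example, assuming the job name IMP_PS2_TS from the earlier snippet, you could confirm both with queries like these (illustrative only):

```sql
-- The job itself, with its current state:
select job_name, operation, job_mode, state
  from dba_datapump_jobs;

-- The master table created for the job, named after it,
-- lives in the job creator's schema:
select table_name
  from user_tables
 where table_name = 'IMP_PS2_TS';
```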
Step 2: Define Parameters for the Job
To define parameters for the job, you must do two main things:
- Set any optional parameters.
- Name the objects in the job.
You set optional parameters by calling
dbms_datapump.set_parameter. A particularly useful parameter for this task is TABLE_EXISTS_ACTION, which controls Data Pump Import’s behavior when it tries to import a table that already exists. You can choose from TRUNCATE, REPLACE, APPEND, and SKIP. Since you’re pulling pre-formatted data from the load instance into an existing table on the production instance, choose TRUNCATE or REPLACE.
The advantage of
TRUNCATE is that it preserves grants, and you can set up
DBMS_DATAPUMP to re-create those grants (although that can lead to messy cascades of invalidated procedures). The advantage of
REPLACE is that you can easily change table structures by changing them on the load instance only, and then letting the night’s data pump job propagate the changes to production. That sure beats manually maintaining identical table definitions on both instances.
The set_parameter procedure is very easy to use, as this command shows:
dbms_datapump.set_parameter (my_handle, name => 'TABLE_EXISTS_ACTION', value => 'TRUNCATE');
To demonstrate the second part of defining parameters, naming the objects that will be included in this job, pick the names of the tables that you want included from the above table-level import. You do this using the
dbms_datapump.metadata_filter procedure. You tell it what type of objects you’re naming using the
object_type parameter, and then provide either an expression that the object names must match to be included in your dataset or a list of object names to include. This command retrieves just one table:
dbms_datapump.metadata_filter (handle => my_handle, name => 'NAME_LIST', value =>
'(''ALL_SHIP_DATES_PLUS_BILL_HIST'')', object_type => 'TABLE');
Step 3: Kick Off the Job
You start the job by calling dbms_datapump.start_job. Again, this is straightforward:
dbms_datapump.start_job (my_handle);
dbms_output.put_line ('Started job ' || my_job_name || ' with handle: ' || my_handle);
Putting It Together
Now you’re ready to put Steps 1, 2, and 3 together and kick off the job. See Listing 1 for the code.
You can keep an eye on the job while it’s running by querying
USER_DATAPUMP_JOBS. You can also see details about the sessions connected to a job by looking at the view
DBA_DATAPUMP_SESSIONS. As the job progresses, the job status should change from EXECUTING to COMPLETING, and the job should finally vanish from these views when it is done.
If you query the master table while the job is running, you’ll note that it changes as the job progresses, keeping track of job progress in case the job is stopped.
The next section looks at more detailed job monitoring.
Monitoring a Running Job
One of the big advantages of
DBMS_DATAPUMP is its sophisticated monitoring tools. You can connect to a running job and get a really rich bundle of information on its status. All of this information is accessed through the
DBMS_DATAPUMP.GET_STATUS procedure, covered in Sidebar 3: About GET_STATUS.
The code in Listing 2 uses the information returned by GET_STATUS to provide running updates on the job status.
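A minimal monitoring loop built on GET_STATUS might look roughly like this. This is a sketch only: the job name IMP_PS2_TS is the example job created earlier, and the status-object fields read here should be checked against your release’s documentation.

```sql
declare
  h         number;
  job_state varchar2(30);
  st        ku$_Status;
begin
  -- Attach to the running job by name.
  h := dbms_datapump.attach (job_name => 'IMP_PS2_TS');
  loop
    -- Wait up to 10 seconds for fresh status, then read the job state
    -- and the detailed status object.
    dbms_datapump.get_status (h,
        dbms_datapump.ku$_status_job_status, 10, job_state, st);
    dbms_output.put_line (job_state || ': ' ||
        st.job_status.percent_done || '% done');
    exit when job_state in ('COMPLETED', 'STOPPED');
  end loop;
  dbms_datapump.detach (h);
end;
/
```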
Fetching a Lot of Tables
Data Pump is a bit of overkill if you’re fetching only one table, of course. And one of the reasons to use Data Pump is that it’s so easy to specify a set of tables. Suppose that your load process needs to refresh all the tables in your schema whose names start with ALL_BILL. You can simply supply a matching expression to dbms_datapump.metadata_filter:
dbms_datapump.metadata_filter (handle => my_handle, name => 'NAME_EXPR', value =>
'LIKE ''ALL_BILL%''', object_type => 'TABLE');
Or you can fetch all the tables in a given schema or a given tablespace. For example, refresh all the tables that are in the
DATA_TO_LOAD tablespace on the source instance:
dbms_datapump.metadata_filter (handle => my_handle, name => 'TABLESPACE_EXPR', value =>
'=''DATA_TO_LOAD''', object_type => 'TABLE');
See Listing 3 and Listing 4 for the full code for these examples.
Note that the last example is pulling an entire tablespace. However, unlike transportable tablespaces (TTS), you don’t need to put the source tablespace in read-only mode, because the tables are being pulled one at a time.
DBMS_DATAPUMP also supports transportable tablespace export/import (see Chapter 27 of the Oracle Database PL/SQL Packages and Types Reference for details; login required).
You can also do a schema-mode import. In this case, you would call dbms_datapump.open with job_mode => 'SCHEMA', and use a SCHEMA_EXPR or SCHEMA_LIST metadata filter to specify the schema names you want to load.
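As a sketch, a schema-mode pull over the same kind of database link might look like this (the link name LOAD_LINK and the schema name BULKLOAD are placeholders):

```sql
declare
  h number;
begin
  h := dbms_datapump.open (operation   => 'IMPORT',
                           job_mode    => 'SCHEMA',
                           remote_link => 'LOAD_LINK');

  -- Name the schemas to load; SCHEMA_EXPR takes a SQL expression
  -- that the schema names must satisfy.
  dbms_datapump.metadata_filter (h, name  => 'SCHEMA_EXPR',
                                    value => 'IN (''BULKLOAD'')');

  dbms_datapump.start_job (h);
end;
/
```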
Large Amounts of Data
If you’re transferring a large amount of data and your source and target databases have limited bandwidth between them, then it might be faster to export the tablespaces on the source machine, move the file over, and import the tablespaces on your target machine. You can do this either with the dbms_datapump API or with expdp/impdp.
You’ll need to define a directory object on the source and target machines to hold the dump file sets:
CREATE DIRECTORY dpump_dir AS '/oracle/dpumpdir';
If you’re using ASM, you’ll need to create the directory as a reference to the appropriate disk group:
SQL> CREATE or REPLACE DIRECTORY dpump_dir as '+DISKGP1/DPUMP';
The user executing the Data Pump commands needs to have read and write access to these directories:
grant read, write on directory dpump_dir to bulkload;
grant read, write on directory dpump_log to bulkload;
If your source and target databases are on the same machine, then the fastest solution may be to set the Data Pump directory object to the same directory on both machines, use Data Pump Export on the source instance to export the tablespace set from the source, leave it in place, and use Data Pump Import on the target instance to pull the tablespace dump file set from the source.
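With the API, the file-based variant differs mainly in that you omit remote_link and name a dump file via dbms_datapump.add_file. Here is a sketch of the export side, reusing the dpump_dir directory object from above; the tablespace and file names are placeholders, and the exact filter usage should be verified against your release.

```sql
declare
  h number;
begin
  -- A tablespace-mode export writing to a dump file set rather than
  -- reading over a database link.
  h := dbms_datapump.open (operation => 'EXPORT',
                           job_mode  => 'TABLESPACE');

  -- Write the dump file into the dpump_dir directory object.
  dbms_datapump.add_file (h, filename  => 'load_tbs.dmp',
                             directory => 'DPUMP_DIR');

  -- Export everything in the DATA_TO_LOAD tablespace.
  dbms_datapump.metadata_filter (h, name  => 'TABLESPACE_EXPR',
                                    value => '= ''DATA_TO_LOAD''');

  dbms_datapump.start_job (h);
end;
/
```

The import side is symmetrical: open an IMPORT job, add the same file, and start it on the target instance.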
Oracle 10g provides another new way to copy a tablespace from one database to another. It isn’t part of the Data Pump API, but it’s worth a mention.
DBMS_STREAMS_TABLESPACE_ADM.PULL_SIMPLE_TABLESPACE will copy a single-datafile tablespace from one database to another over a database link. Like the preceding Data Pump examples, it can even do this cross-platform.
Resolve a Common Data Warehousing Issue
You can use the solution presented in this article to handle a common dilemma in data warehousing environments. For example, you might be building a data mart that combines information from a set of slow data warehouses in many different segments of the enterprise to help top-level management make decisions. You want your data mart to be as fast as possible, and to be up as much as possible. Now you know how to implement the best solution, using the DBMS_DATAPUMP API over a database link.