
Consider the following scenario. You have a production system that needs a lot of data from other systems on hand during the day. The source systems return results slowly, and network latency makes the extracts slower still. On top of that, the data needs a fair amount of transformation, which also takes time. In short, it's time-consuming to extract, transform, and load this data from the other systems into your main instance. How can you keep these long, slow extract, transform, and load (ETL) operations from bogging down your production instance?
Of course, if you have the access and authority to push for changes in the source systems, you can speed up the ETL process by speeding up the extract at the source end, and even by providing the data in a format that needs less transformation. But in the real world, you often can't waltz in and demand that another team optimize for delivery to your own project. Long, slow extracts and complex transformations are a fact of life, and it's up to you to minimize the impact on your system.
The best solution (and one that doesn't require you to buy any outside products) is to create a separate load instance, run the ETL on that instance, and then ship the finished, freshly loaded tables to the production instance (see Sidebar 1: Solutions for Minimizing Impact of Slow ETL Operations on Your Production Instance for a few alternatives). This article demonstrates how to implement this solution, which involves three steps:
- Set up a separate load instance.
- Carry out the ETL on the load instance.
- Ship the finished, freshly loaded tables from the load instance to the production instance.
Step 1 is a straightforward DBA matter. Step 2 requires you to know your data sources and your target system. The major implementation decision concerns Step 3: what's the best way to ship the data from the load instance to the production instance?
What You Need
You will need access to two Oracle instances running Oracle 10gR1 or 10gR2, and the CREATE DATABASE LINK privilege on at least one instance.
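If the two instances aren't already linked, a database link from one instance to the other is the usual starting point. Here is a minimal sketch; the link name, credentials, and TNS alias (load_link, etl_user, LOADDB) are placeholders for your own environment, not values from this article.

```sql
-- Hypothetical sketch: create a database link from the production instance
-- to the load instance. Replace the link name, credentials, and TNS alias
-- with values from your own environment.
CREATE DATABASE LINK load_link
  CONNECT TO etl_user IDENTIFIED BY etl_password
  USING 'LOADDB';

-- Quick sanity check that the link resolves and connects.
SELECT * FROM dual@load_link;
```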
Oracle Data Pump
Oracle 10g has a new feature that offers a great solution. Oracle Data Pump is Oracle's next-generation version of Export and Import. (For those unfamiliar with these tools, Export extracts data and metadata from an Oracle database into a proprietary-format dump file; Import extracts data and metadata from an Export dump file and recreates the source objects in the target database. See Sidebar 2: How Data Pump Works.)
Oracle Data Pump can be used to move data between databases, to copy databases, or to back them up (although Oracle recommends you use RMAN for database copies and backups, not Data Pump). From a programmer's point of view, there are three big differences between Data Pump and the original Export/Import:
- Data Pump is faster. A lot faster.
- Data Pump jobs are resumable, and you can attach/detach to them while they run.
- Data Pump can be controlled from within the database. You can kick off a Data Pump export or import job from a PL/SQL procedure, or use PL/SQL to monitor the progress of a Data Pump job from within the database.
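As an illustration of those last two points (this query is not from the article itself), you can watch a running Data Pump job from any SQL or PL/SQL session by querying the data dictionary; DBA_DATAPUMP_JOBS tracks active and idle jobs.

```sql
-- Illustrative monitoring query: list Data Pump jobs and their current state.
-- Assumes your user can see DBA_DATAPUMP_JOBS (otherwise use USER_DATAPUMP_JOBS).
SELECT owner_name, job_name, operation, job_mode, state, attached_sessions
  FROM dba_datapump_jobs
 ORDER BY owner_name, job_name;
```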
Data Pump uses two access methods to import and export table data: Direct Path load/unload and External Tables. It chooses the access method behind the scenes on a per-table basis, and the two are mutually compatible: data unloaded with one method can be loaded with the other.
Most programmers know Data Pump through the two command-line utilities (expdp and impdp) that replace the original Export and Import (exp and imp). The expdp and impdp utilities have interfaces similar to exp and imp and can be used for the same tasks. Note, however, that Data Pump dump files are not compatible with original Export and Import, and vice versa. (For a full comparison between exp/imp and expdp/impdp, refer to Chapters 1 and 2 of the Oracle Database Utilities manual; login required.)
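To give a flavor of the command line, a schema-level export and the matching import look roughly like the following. The schema name, directory object, and file names are made-up placeholders, not values from this article.

```
# Hypothetical example: export the ETL_OWNER schema to a dump file in the
# DPUMP_DIR directory object, then import it on the target instance.
expdp etl_user/etl_password DIRECTORY=dpump_dir DUMPFILE=etl_owner.dmp \
      LOGFILE=etl_owner_exp.log SCHEMAS=etl_owner

impdp app_user/app_password DIRECTORY=dpump_dir DUMPFILE=etl_owner.dmp \
      LOGFILE=etl_owner_imp.log
```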
Data Pump is also accessible through a PL/SQL API. In fact, expdp and impdp are just front-ends to this API. The command-line utilities transform your commands into calls to the Data Pump API, which is implemented as the DBMS_DATAPUMP supplied PL/SQL package, together with the DBMS_METADATA package, which handles all operations involving metadata (i.e., database object definitions such as table structure, storage parameters, etc.).
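As a minimal sketch of what driving the API from PL/SQL looks like (the job name, schema, and directory object below are placeholders, and a directory object named DPUMP_DIR is assumed to exist), a schema-mode export might read:

```sql
-- Minimal sketch of a schema-mode export driven entirely from PL/SQL.
-- The job name, schema, and directory object are placeholders.
DECLARE
  h         NUMBER;
  job_state VARCHAR2(30);
BEGIN
  -- Open a schema-mode export job and get a handle to it.
  h := DBMS_DATAPUMP.OPEN(operation => 'EXPORT',
                          job_mode  => 'SCHEMA',
                          job_name  => 'ETL_SCHEMA_EXP');

  -- Name the dump file and the directory object it will be written to.
  DBMS_DATAPUMP.ADD_FILE(handle    => h,
                         filename  => 'etl_owner.dmp',
                         directory => 'DPUMP_DIR');

  -- Restrict the job to a single schema.
  DBMS_DATAPUMP.METADATA_FILTER(handle => h,
                                name   => 'SCHEMA_EXPR',
                                value  => 'IN (''ETL_OWNER'')');

  -- Start the job and block until it finishes.
  DBMS_DATAPUMP.START_JOB(h);
  DBMS_DATAPUMP.WAIT_FOR_JOB(handle => h, job_state => job_state);

  DBMS_OUTPUT.PUT_LINE('Job finished with state: ' || job_state);
END;
/
```

The same handle-based calls work for imports (operation => 'IMPORT'), which is what makes it possible to drive the whole load-and-ship cycle from inside the database.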