hapter 21: Using SQL*Loader to?Load Data
In the scripts provided in Appendix A, a large number of insert commands are executed. In place of those inserts, you could create a file containing the data to be loaded and then use Oracle’s SQL*Loader utility to load the data. This chapter provides you with an overview of the use of SQL*Loader and its major capabilities. Two additional data-movement utilities, Export and Import, are covered in Chapter 40. SQL*Loader, Export, and Import are described in great detail in the Oracle9i Database Utilities provided with the standard Oracle documentation set.SQL*Loader loads data from external files into tables in the Oracle database. SQL*Loader requires two primary files: the datafile, which contains the information to be loaded, and the control file, which contains information on the format of the data, the records and fields within the file, the order in which they are to be loaded, and even, when needed, the names of the multiple files that will be used for data. You can also combine the control file information into the datafile itself, although the two are usually separated to make it easier to reuse the control file.When executed, SQL*Loader will automatically create a log file and a “bad” file. The log file records the status of the load, such as the number of rows processed and the number of rows committed. The “bad” file will contain all the rows that were rejected during the load due to data errors, such as nonunique values in primary key columns.Within the control file, you can specify additional commands to govern the load criteria. If these criteria are not met by a row, the row will be written to a “discard” file. The log, bad, and discard files will have the extensions .log, .bad, and .dsc, respectively. Control files are typically given the extension .ctl.SQL*Loader is a powerful utility for loading data, for several reasons:
- It is highly flexible, allowing you to manipulate the data as it is being loaded.
- You can use SQL*Loader to break a single large data set into multiple sets of data during commit processing, significantly reducing the size of the transactions processed by the load.
- You can use its Direct Path loading option to perform loads very quickly.
The Control File
The control file tells Oracle how to read and load the data. The control file tells SQL*Loader where to find the source data for the load and the tables into which to load the data, along with any other rules that must be applied during the load processing. These rules can include restrictions for discards (similar to where clauses for queries) and instructions for combining multiple physical rows in an input file into a single row during an insert. SQL*Loader will use the control file to create the insert commands executed for the data load.The control file is created at the operating-system level, using any text editor that enables you to save plain text files. Within the control file, commands do not have to obey any rigid formatting requirements, but standardizing your command syntax will make later maintenance of the control file simpler.The following listing shows a sample control file for loading data into the BOOKSHELF table:
LOAD DATAINFILE 'bookshelf.dat'INTO TABLE BOOKSHELF(Title POSITION(01:100) CHAR, Publisher POSITION(101:120) CHAR, CategoryName POSITION(121:140) CHAR, Rating POSITION(141:142) CHAR)
In this example, data is loaded from the file bookshelf.dat into the BOOKSHELF table. The bookshelf.dat file contains the data for all four of the BOOKSHELF columns, with whitespace padding out the unused characters in those fields. Thus, the Publisher column value always begins at space 101 in the file, even if the Title value is less than 100 characters. Although this formatting makes the input file larger, it may simplify the loading process. No length needs to be given for the fields, since the starting and ending positions within the input data stream effectively give the field length.The infile clause names the input file, and the into table clause specifies the table into which the data will be loaded. Each of the columns is listed, along with the position where its data resides in each physical record in the file. This format allows you to load data even if the source data’s column order does not match the order of columns in your table.To perform this load, the user executing the load must have INSERT privilege on the BOOKSHELF table.Loading Variable-Length Data
If the columns in your input file have variable lengths, you can use SQL*Loader commands to tell Oracle how to determine when a value ends. In the following example, a comma separates the input values:
LOAD DATAINFILE 'bookshelf.dat'BADFILE '/user/load/bookshelf.bad'TRUNCATEINTO TABLE BOOKSHELFFIELDS TERMINATED BY ","(Title, Publisher, CategoryName, Rating)
The fields terminated by “,” clause tells SQL*Loader that during the load, each column value will be terminated by a comma. Thus, the input file does not have to be 142 characters wide for each row, as was the case in the first load example. The lengths of the columns are not specified in the control file, since they will be determined during the load.In this example, the name of the bad file is specified by the badfile clause. In general, the name of the bad file is only given when you want to redirect the file to a different directory.This example also shows the use of the truncate clause within a control file. When this control file is executed by SQL*Loader, the BOOKSHELF table will be truncated before the start of the load. Since truncate commands cannot be rolled back, you should use care when using this option. In addition to truncate, you can use the following options:
- append???Adds rows to the table.
- insert???Adds rows to an empty table. If the table is not empty, the load will abort with an error.
- replace???Empties the table and then adds the new rows. The user must have DELETE privilege on the table.
Starting the Load
To execute the commands in the control file, you need to run SQL*Loader with the appropriate parameters. SQL*Loader is started via the SQLLDR command at the operating system prompt.
NOTE:???The SQL*Loader executable may consist of the name SQLLDR followed by a version number. Consult your platform-specific Oracle documentation for the exact name. For Oracle9i, the executable file should be named SQLLDR.When you execute SQLLDR, you need to specify the control file, username/password, and other critical load information, as shown in Table 21-1.
Table 21-1???SQL*Loader Options
The direct keyword, which invokes the Direct Path load option, is described in “Direct Path Loading” later in this chapter.The SILENT parameter tells SQLLDR to suppress certain informative data:
- HEADER suppresses the SQL*LOADER header.
- FEEDBACK suppresses the feedback at each commit point.
- ERRORS suppresses the logging (in the log file) of each record that caused an Oracle error, although the count is still logged.
- DISCARDS suppresses the logging (in the log file) of each record that was discarded, although the count is still logged.
- PARTITIONS disables the writing of the per-partition statistics to the log file.
- ALL suppresses all of the preceding.
NOTE:???Commands in the control file override any in the calling command line.Let’s load a sample set of data into the BOOKSHELF table, which has four columns (Title, Publisher, CategoryName, and Rating). The data to be loaded is in a file called bookshelf.txt, and consists of two records:
NOTE:???Each line is ended by a carriage return. Even though the first line’s last value is not as long as the column it is being loaded into, the row will stop at the carriage return.The data is separated by commas, and we don’t want to delete the data previously loaded into BOOKSHELF, so the control file will look like this:
Save that file as bookshelf.ctl, in the same directory as the input data file. Next, run SQLLDR and tell it to use the control file:
When the load completes, you should have one successfully loaded record and one failure. The successfully loaded record will be in the BOOKSHELF table:
A file named bookshelf.bad will be created, and will contain one record:
Why was that record rejected? Check the log file, bookshelf.log, which will say, in part:
Row 2, the “Another Title” row, was rejected because the value for the CategoryName column violated the foreign key constraintADULTPIC is not listed as a category in the CATEGORY table.Because the rows that failed are isolated into the bad file, you can use that file as the input for a later load once the data has been corrected.Logical and Physical Records
In Table 21-1, several of the keywords refer to “logical” rows. A logical row is a row that is inserted into the database. Depending on the structure of the input file, multiple physical rows may be combined to make a single logical row.For example, the input file may look like this:
in which case there would be a one-to-one relationship between that physical record and the logical record it creates. But the datafile may look like this instead:
Control File Syntax Notes
The full syntax for SQL*Loader control files is shown in the “SQLLDR” entry in the Alphabetical Reference, so it is not repeated here.Within the load clause, you can specify that the load is recoverable or unrecoverable. The unrecoverable clause only applies to Direct Path loading, and is described in “Tuning Data Loads” later in this chapter.In addition to using the concatenate clause, you can use the continueif clause to control the manner in which physical records are assembled into logical records. The this clause refers to the current physical record, while the next clause refers to the next physical record. For example, you could create a two-character continuation character at the start of each physical record. If that record should be concatenated to the preceding record, set that value equal to ‘**’. You could then use the continueif next (1:2)=’**’ clause to create a single logical record from the multiple physical records. The ‘**’ continuation character will not be part of the merged record.The syntax for the into table clause includes a when clause. The when clause, shown in the following listing, serves as a filter applied to rows prior to their insertion into the table. For example, you can specify
to load only books with ratings greater than 3 into the table. Any row that does not pass the when condition will be written to the discard file. Thus, the discard file contains rows that can be used for later loads, but that did not pass the current set of when conditions. You can use multiple when conditions, connected with and clauses.Use the trailing nullcols clause if you are loading variable-length records for which the last column does not always have a value. With this clause in effect, SQL*Loader will generate NULL values for those columns.As shown in an example earlier in this chapter, you can use the fields terminated by clause to load variable-length data. Rather than being terminated by a character, the fields can be terminated by whitespace or enclosed by characters or optionally enclosed by other characters.For example, the following entry loads AuthorName values and sets the values to uppercase during the insert. If the value is blank, a NULL is inserted:
ReturnDate?POSITION?(1:11)?DATE?”Mon-DD-YYYY”Within the into table clause, you can use the recnum keyword to assign a record number to each logical record as it is read from the datafile, and that value will be inserted into the assigned column of the table. The constant keyword allows you to assign a constant value to a column during the load. For character columns, enclose the constant value within single quotes. If you use the sysdate keyword, the selected column will be populated with the current system date and time.
CheckOutDate?SYSDATEIf you use the sequence option, SQL*Loader will maintain a sequence of values during the load. As records are processed, the sequence value will be increased by the increment you specify. If the rows fail during insert (and are sent to the bad file), those sequence values will not be reused. If you use the max keyword within the sequence option, the sequence values will use the current maximum value of the column as the starting point for the sequence. The following listing shows the use of the sequence option:
Seqnum_col??SEQUENCE(MAX,1)You can also specify the starting value and increment for a sequence to use when inserting. The following example inserts values starting with a value of 100, incrementing by 2. If a row is rejected during the insert, its sequence value is skipped.
Seqnum_col??SEQUENCE(100,2)If you store numbers in VARCHAR2 columns, avoid using the sequence option for those columns. For example, if your table already contains the values 1 through 10 in a VARCHAR2 column, then the maximum value within that column is 9the greatest character string. Using that as the basis for a sequence option will cause SQL*Loader to attempt to insert a record using 10 as the newly created valueand that may conflict with the existing record.SQL*Loader control files can support complex logic and business rules. For example, your input data for a column holding monetary values may have an implied decimal; 9990 would be inserted as 99.90. In SQL*Loader, you could insert this by performing the calculation during the data load:
See the “SQL*Loader Case Studies” of the Oracle9i Utilities Guide for additional SQL*Loader examples and sample control files.
Managing Data Loads
Loading large data volumes is a batch operation. Batch operations should not be performed concurrently with the small transactions prevalent in many database applications. If you have many concurrent users executing small transactions against a table, you should schedule your batch operations against that table to occur at a time when no users are accessing the table.Oracle maintains read consistency for users’ queries. If you execute the SQL*Loader job against the table at the same time that other users are querying the table, Oracle will internally maintain undo entries to enable those users to see their data as it existed when they first queried the data. To minimize the amount of work Oracle must perform to maintain read consistency (and to minimize the associated performance degradation caused by this overhead), schedule your long-running data load jobs to be performed when few other actions are occurring in the database. In particular, avoid contention with other accesses of the same table.Design your data load processing to be easy to maintain and reuse. Establish guidelines for the structure and format of the input datafiles. The more standardized the input data formats are, the simpler it will be to reuse old control files for the data loads. For repeated scheduled loads into the same table, your goal should be to reuse the same control file each time. Following each load, you will need to review and move the log, bad, data, and discard files so they do not accidentally get overwritten.Within the control file, use comments to indicate any special processing functions being performed. To create a comment within the control file, begin the line with two dashes, as shown in the following example:
Repeating Data Loads
Data loads do not always work exactly as planned. Many variables are involved in a data load, and not all of them will always be under your control. For example, the owner of the source data may change its data formatting, invalidating part of your control file. Business rules may change, forcing additional changes. Database structures and space availability may change, further affecting your ability to load the data.In an ideal case, a data load will either fully succeed or fully fail. However, in many cases, a data load will partially succeed, making the recovery process more difficult. If some of the records have been inserted into the table, then attempting to reinsert those records should result in a primary key violation. If you are generating the primary key value during the insert (via the sequence option), then those rows may not fail the second timeand will be inserted twice.To determine where a load failed, use the log file. The log file will record the commit points as well as the errors encountered. All of the rejected records should be in either the bad file or the discard file. You can minimize the recovery effort by forcing the load to fail if many errors are encountered. To force the load to abort before a large number of errors is encountered, use the errors keyword of the SQLLDR command. You can also use the discardmax keyword to limit the number of discarded records permitted before the load aborts.If you set errors to 0, the first error will cause the load to fail. What if that load fails after 100 records have been inserted? You will have two options: identify and delete the inserted records and reapply the whole load, or skip the successfully inserted records. You can use the skip keyword of SQLLDR to skip the first 100 records during its load processing. The load will then continue with record 101 (which, we hope, has been fixed prior to the reload attempt). If you cannot identify the rows that have just been loaded into the table, you will need to use the skip option during the restart process.The proper settings for errors and discardmax depend on the load. If you have full control over the data load process, and the data is properly “cleaned” before being extracted to a load file, you may have very little tolerance for errors and discards. On the other hand, if you do not have control over the source for the input datafile, you need to set errors and discardmax high enough to allow the load to complete. After the load has completed, you need to review the log file, correct the data in the bad file, and reload the data using the original bad file as the new input file. If rows have been incorrectly discarded, you need to do an additional load using the original discard file as the new input file.After modifying the errant CategoryName value, you can rerun the BOOKSHELF table load example using the original bookshelf.dat file. During the reload, you have two options when using the original input datafile:
- Skip the first row by specifying skip=1 in the SQLLDR command line.
- Attempt to load both rows, whereby the first row fails because it has already been loaded (and thus causes a primary key violation).
Tuning Data Loads
In addition to running the data load processes at off-peak hours, you can take other steps to improve the load performance. The following steps all impact your overall database environment, and must be coordinated with the database administrator. The tuning of a data load should not be allowed to have a negative impact on the database or on the business processes it supports.First, batch data loads may be timed to occur while the database is in NOARCHIVELOG mode. While in NOARCHIVELOG mode, the database does not keep an archive of its online redo log files prior to overwriting them. Eliminating the archiving process improves the performance of transactions. Since the data is being loaded from a file, you can re-create the loaded data at a later time by reloading the datafile rather than recovering it from an archived redo log file.However, there are significant potential issues with disabling NOARCHIVELOG mode. You will not be able to perform a point-in-time recovery of the database unless archiving is enabled. If there are non-batch transactions performed in the database, you will probably need to run the database in ARCHIVELOG mode all the time, including during your loads. Furthermore, switching between ARCHIVELOG and NOARCHIVELOG modes requires you to shut down the instance. If you switch the instance to NOARCHIVELOG mode, perform your data load, and then switch the instance back to ARCHIVELOG mode, you should perform a backup of the database (see Chapter 40) immediately following the restart.Instead of running the entire database in NOARCHIVELOG mode, you can disable archiving for your data load process by using the unrecoverable keyword within SQL*Loader. The unrecoverable option disables the writing of redo log entries for the transactions within the data load. You should only use this option if you will be able to re-create the transactions from the input files during a recovery. If you follow this strategy, you must have adequate space to store old input files in case they are needed for future recoveries. The unrecoverable option is only available for Direct Path loads, as described in the next section.Rather than control the redo log activity at the load process level, you can control it at the table or partition level. If you define an object as nologging, then block-level inserts performed by SQL*Loader Direct Path loading and the insert /*+ APPEND */ command will not generate redo log entries.If your operating environment has multiple processors, you can take advantage of the CPUs by parallelizing the data load. The parallel option of SQLLDR, as described in the next section, uses multiple concurrent data load processes to reduce the overall time required to load the data.In addition to these approaches, you should work with your database administrator to make sure the database environment and structures are properly tuned for data loads. Tuning efforts should include the following:
- Preallocate space for the table, to minimize dynamic extensions during the loads.
- Allocate sufficient memory resources to the shared memory areas, including the log buffer area.
- Streamline the data writing process by creating multiple database writer (DBWR) processes for the database.
- Remove any unnecessary triggers during the data loads. If possible, disable or remove the triggers prior to the load, and perform the trigger operations on the loaded data manually after it has been loaded.
- Remove or disable any unnecessary constraints on the table. You can use SQL*Loader to dynamically disable and re-enable constraints.
- Remove any indexes on the tables. If the data has been properly cleaned prior to the data load, then uniqueness checks and foreign key validations will not be necessary during the loads. Dropping indexes prior to data loads significantly improves performance.
Direct Path Loading
SQL*Loader, when inserting records, generates a large number of insert statements. To avoid the overhead associated with using a large number of inserts, you may use the Direct Path option in SQL*Loader. The Direct Path option creates preformatted data blocks and inserts those blocks into the table. As a result, the performance of your load can dramatically improve. To use the Direct Path option, you must not be performing any functions on the values being read from the input file.Any indexes on the table being loaded will be placed into a temporary DIRECT LOAD state (you can query the index status from USER_INDEXES). Oracle will move the old index values to a temporary index it creates and manages. Once the load has completed, the old index values will be merged with the new values to create the new index, and Oracle will drop the temporary index it created. When the index is once again valid, its status will change to VALID. To minimize the amount of space necessary for the temporary index, presort the data by the indexed columns. The name of the index for which the data is presorted should be specified via a sorted indexes clause in the control file.To use the direct path option, specify
as a keyword on the SQLLDR command line or include this option in the control file.If you use the Direct Path option, you can use the unrecoverable keyword to improve your data load performance. This instructs Oracle not to generate redo log entries for the load. If you need to recover the database at a later point, you will need to re-execute the data load in order to recover the table’s data. All conventional path loads are recoverable, and all Direct Path loads are recoverable by default.Direct Path loads are faster than conventional loads, and unrecoverable Direct Path loads are faster still. Since performing unrecoverable loads impacts your recovery operations, you need to weigh the costs of that impact against the performance benefit you will realize. If your hardware environment has additional resources available during the load, you can use the parallel Direct Path load option to divide the data load work among multiple processes. The parallel Direct Path operations may complete the load job faster than a single Direct Path load.Instead of using the parallel option, you could partition the table being loaded (see Chapter 18). Since SQL*Loader allows you to load a single partition, you could execute multiple concurrent SQL*Loader jobs to populate the separate partitions of a partitioned table. This method requires more database administration work (to configure and manage the partitions), but it gives you more flexibility in the parallelization and scheduling of the load jobs.As of Oracle9i, you can take advantage of multithreaded loading functionality for Direct Path loads to convert column arrays to stream buffers and perform stream buffer loading in parallel. Use the streamsize parameter and multithreading flag to enable this feature.Direct Path loading may impact the space required for the table’s data. Since Direct Path loading inserts blocks of data, it does not follow the usual methods for allocating space within a table. The blocks are inserted at the end of the table, after its high-water mark, which is the highest block into which the table’s data has ever been written. If you insert 100 blocks worth of data into a table and then delete all of the rows, the high-water mark for the table will still be set at 100. If you then perform a conventional SQL*Loader data load, the rows will be inserted into the already allocated blocks. If you instead perform a Direct Path load, Oracle will insert new blocks of data following block 100, potentially increasing the space allocation for the table. The only way to lower the high-water mark for a table is to truncate it (which deletes all rows and cannot be rolled back) or to drop and re-create it. You should work with your database administrator to identify space issues prior to starting your load.
Additional Oracle9i Enhancements
In addition to features noted earlier in this chapter, SQL*Loader features support for Unicode and expanded datatypes. As of Oracle9i, SQL*Loader can load integer and zoned/packed decimal datatypes across platforms with different byte ordering and accept EBCDIC-based zoned or packed decimal data encoded in IBM format. SQL*Loader also offers support for loading XML columns, loading object types with subtypes (see Chapter 30), and Unicode (UTF16 character set). SQL*Loader also provides native support for the new Oracle9i date, time, and interval-related datatypes (see Chapter 9).If a SQL*Loader job fails, you may be able to resume it where it failed using the resumable, resumable_name, and resumable_timeout options. For example, if the segment to which the loader job was writing could not extend, you can disable the load job, fix the space allocation problem, and resume the job. Your ability to perform these actions depends on the configuration of the database; work with your DBA to make sure the resumable features are enabled and adequate undo history is maintained for your purposes.As of Oracle9i, you can access external files as if they are tables inside the database. This “external table” feature, described in Chapter 25, allows you to potentially avoid loading large volumes of data into the database. The syntax for external table definitions very closely resembles that of the SQL*Loader control file. Although they are limited in some significant ways (you cannot perform DML on external tables, for example), you should consider external tables as alternatives to data loads. See Chapter 25 for implementation details.