Cloning Data Accurately and Randomly

Cloning Data Accurately and Randomly

ecently I encountered a couple of Oracle-tuning projects that required me to evaluate the performance of PL/SQL and SQL code. I had to generate a substantial amount of data from existing data in a table?say from 500 rows to 500,000 rows. In my case, each record represented a participant in a certain time period and any modification of the participant’s attribute triggered the creation of a new record for the same participant at a new time period.

As you can see, the number of participants and the frequency of modifications determined the data size. My task was to develop a mechanism for generating data for different modifications scenarios: 10 percent of participants getting modified per time period, then 30 percent, 50 percent, etc. In addition, modifications had to be done randomly in order to emulate real-life data.

This task of data cloning is quite common in any testing or tuning environment. Obviously, it requires some technique to automate it.

How do you easily generate a random sample of data from existing data in a table?

Develop code that clones your data:

  1. Use Oracle’s standard DBMS_RANDOM package to create a random number generator.
  2. Create a custom package that extracts sample data accurately and randomly.
Using the DBMS_RANDOM Package
Your first step is to create a random number generator. Prior to the release of Oracle 8.0, developers had to create their own. Starting with version 8.0 Oracle introduced the DBMS_RANDOM package, which provides a built-in random number generator. It is faster than generators written in PL/SQL because it calls Oracle’s internal random number generator. By contrast, custom PL/SQL generators have to execute multiple lines of math calculations to come up with a random number.

This package is not installed by default. It has to be created by a SYS user running the Catoctk.sql script located in the $ORACLE_HOMERdbmsAdmin directory. This script executes several scripts in succession to ensure that DBMS_RANDOM is available for use. This package is granted to PUBLIC by default.

Your first step is to initialize the generator using the INITIALIZE procedure with a seed. Here is the syntax specification for this procedure:

Oracle recommends you use at least a five-digit value to generate sufficiently random numbers. In my code I am going to use the UID function, which uniquely identifies the current user, and multiply it by 10,000 to ensure that at least a five-digit value is used as the seed. This integer is generated by the Oracle RDBMS whenever a user connects to the database.

Here is the syntax:

You only need to run this procedure once. You use the initialization section of the custom package for that purpose. The initialization section is a powerful mechanism: PL/SQL detects automatically when this code should be run. You do not have to execute any statements explicitly, and you can be sure they are run only once. (Because initialization resets your sequence of random numbers and starts it from the beginning, if you kept calling the initialization procedure, you’d repeat the same numbers over and over again.)

The seed value can be changed programmatically anytime by calling the SEED procedure. The specification is:

To acquire a random number after the generator has been initialized, call the RANDOM function, which returns a random-value integer.

Listing 1 shows the code for creating the simple Rand package. Let’s test it by executing following command five times in SQL*PLus:

SELECT Rand.RandomNum   FROM Dual;
Here is the output I get from the above command (yours will differ):
Good enough.Creating a Custom Package
In case you’re thinking of extracting sample data with Oracle’s SAMPLE clause, don’t bother. I’ve only had negative experiences using it. (Read the sidebar.)

Caveat for Extracting Sample Data

Oracle’s SAMPLE keyword allows you to extract a sample block of data from a table. Read on to find out why you needn’t bother…

Instead, you’re best off creating a custom package, as I did. ParticipantData uses the previously created Rand package (see Listing 1) to extract sample data accurately and randomly.

Here is the algorithm:

  1. Calculate the number of rows in the table (T) for criteria specified in the WHERE clause.
  2. Calculate the number of rows you need to retrieve (N) based on the percentage (P) passed to the procedure as a parameter using the following formula:

    N = (T * P) / 100

  3. Create a SELECT statement that dynamically assigns a random number to each row, sorts it by this number, and extracts the top N rows.
The key to this plan is the SELECT statement. Here is an example of extracting 500 rows randomly for the period number 43:
SELECT ParticipantSeq   FROM   (SELECT P.*,           Rand.RandomNum AS MyNum       FROM Participant P       WHERE StartPeriod = 43       ORDER BY MyNum)   WHERE ROWNUM 
To make this work, I created two procedures, ModifyRandomByPeriod and ModifyRandomPeriodList:
  • Procedure ModifyRandomByPeriod does all the work per period, specified by the p_StartPeriod parameter, by extracting a randomly specified percentage of the data and modifying the values in one of the fields of the Participant table.

  • Procedure ModifyRandomPeriodList accepts the list of comma-delimited periods, or ‘ALL,’ for the whole table as a parameter to extract and modify the data across multiple periods. It parses the list, creates an array (a PL/SQL table) of periods, and calls ModifyRandomByPeriod for each of them in a loop.
Listing 2 shows the code from the ParticipantData package that’s relevant to your task. In my case, there are 15,568 records in my Participant table for the “January 1999” period. What happens when I use the ModifyRandomByPeriod procedure and run it five times, setting the parameter p_Percentage to 10 percent?

The data in csr_RandomParticipants cursor, and its order, is completely random after each run. Moreover, the number of rows extracted (and later cloned) are 100 percent accurate:

My task is accomplished. So is yours.

As you can see, the technique described in this 10-Minute Solution is pretty generic and can be used to extract random data sample from any table or set of related tables.


Share the Post: