Creating a Custom Package
In case you're thinking of extracting sample data with Oracle's SAMPLE clause, don't bother. I've only had negative experiences using it. (Read the sidebar
|Caveat for Extracting Sample Data
|Oracle's SAMPLE keyword allows you to extract a sample block of data from a table. Read on to find out why you needn't bother... |
Instead, you're best off creating a custom package, as I did. ParticipantData uses the previously created Rand package (see Listing 1) to extract sample data accurately and randomly.
Here is the algorithm:
- Calculate the number of rows in the table (T) for criteria specified in the WHERE clause.
- Calculate the number of rows you need to retrieve (N) based on the percentage (P) passed to the procedure as a parameter using the following formula:
N = (T * P) / 100
- Create a SELECT statement that dynamically assigns a random number to each row, sorts it by this number, and extracts the top N rows.
The key to this plan is the SELECT statement. Here is an example of extracting 500 rows randomly for the period number 43:
Rand.RandomNum AS MyNum
FROM Participant P
WHERE StartPeriod = 43
ORDER BY MyNum)
WHERE ROWNUM <= 500;
To make this work, I created two procedures, ModifyRandomByPeriod and ModifyRandomPeriodList:
- Procedure ModifyRandomByPeriod does all the work per period, specified by the p_StartPeriod parameter, by extracting a randomly specified percentage of the data and modifying the values in one of the fields of the Participant table.
- Procedure ModifyRandomPeriodList accepts the list of comma-delimited periods, or 'ALL,' for the whole table as a parameter to extract and modify the data across multiple periods. It parses the list, creates an array (a PL/SQL table) of periods, and calls ModifyRandomByPeriod for each of them in a loop.
shows the code from the ParticipantData package that's relevant to your task. In my case, there are 15,568 records in my Participant table for the "January 1999" period. What happens when I use the ModifyRandomByPeriod procedure and run it five times, setting the parameter p_Percentage
to 10 percent?
The data in csr_RandomParticipants cursor, and its order, is completely random after each run. Moreover, the number of rows extracted (and later cloned) are 100 percent accurate:
My task is accomplished. So is yours.
As you can see, the technique described in this 10-Minute Solution is pretty generic and can be used to extract random data sample from any table or set of related tables.