Generating Realistic and Non-Random Sample Data : Page 2
Generating realistic sample data is often harder than you might think, and generating sample data that you can use for a successful data-mining demonstration requires even more planning. This article explores the process and presents guidelines for realistic sample data generation.
by Mark Frawley
Jan 25, 2005
Randomized vs. Real Data
The first seven guidelines help determine how you'll approach the algorithmic structure of the data generator. In the absence of a specialized toolset, you'll be tempted to use a random-number generator (RNG) to pick the values for each variable (see point #8, on the preceding page), because one is readily available and appears to offer an easy way to get the dynamism and apparent randomness of "real" data. But such approaches yield only approximations, and naïve attempts can degenerate to the point of being useless. Consider:
Loosely speaking, "random" means that all possible events (of count N) have an equal probability of occurrence (1/N), given enough trials. But casual appearances notwithstanding, this is generally not the nature of real-world business data. Each business and application has a unique probability and frequency distribution in its data. Real data tends to be sparse and unevenly distributed, characterized by hotspots rather than a smooth spread of values.
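To make the distinction concrete, here is a minimal sketch (in Python, with hypothetical product names and weights) contrasting a uniform random draw with a weighted draw that models the "hotspot" skew typical of real business data:

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the comparison is repeatable

# Hypothetical product list for illustration only
products = [f"SKU-{i:03d}" for i in range(10)]

# Uniform draw: every product is equally likely (1/N)
uniform = Counter(random.choice(products) for _ in range(10_000))

# Skewed draw: a few "hotspot" products dominate, as in real sales data
weights = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]  # assumed weights
skewed = Counter(random.choices(products, weights=weights, k=10_000))

print("uniform:", [uniform[p] for p in products])
print("skewed :", [skewed[p] for p in products])
```

The uniform counts cluster around 1,000 per product, while the weighted draw concentrates roughly half of all rows on the first SKU, which looks far more like real transactional data.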
The RNG available to you is likely to be deterministic, meaning that given the same starting state, the generator will produce the identical sequence of random numbers each time; such a generator is called a pseudo-RNG. While this would be unacceptable in applications such as cryptography, it is not usually an issue for the present purpose and may in fact be useful in controlling the repeatability of data generation runs. Holding the random component "constant" is helpful when tweaking other parameters of the generator to create the distribution or pattern required.
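The repeatability point can be demonstrated in a few lines. This sketch (in Python; the `generate_sales` function and its parameters are illustrative, not from the article) seeds a pseudo-RNG so that a generation run can be replayed exactly:

```python
import random

def generate_sales(seed, n=5):
    """Generate n pseudo-random sales amounts from a seeded RNG."""
    rng = random.Random(seed)  # independent, explicitly seeded pseudo-RNG
    return [round(rng.uniform(10.0, 500.0), 2) for _ in range(n)]

run1 = generate_sales(seed=1234)
run2 = generate_sales(seed=1234)  # same seed -> identical sequence
run3 = generate_sales(seed=5678)  # different seed -> different data

assert run1 == run2   # repeatable: the random component is held "constant"
assert run1 != run3   # a new seed yields a fresh sequence
```

Because the seed pins down the entire sequence, you can vary the generator's other parameters between runs and attribute any change in the output to those parameters alone.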
If ease of developing the data generator is paramount, pseudo-randomized data might be deemed an acceptable model of real data in spite of the previous points. In that case, be aware that if the number of trials is not large enough, the results will be unevenly distributed across the range of possibilities, producing "choppy" and unrealistic data rather than smoothly varying values, even at aggregated levels. While this may seem obvious, you'll see that it was discovered the hard way in the real-world case study described later.
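A quick way to see this "choppiness" is to measure how far bucket counts deviate from their expected value at different trial counts. The following sketch (Python; the `choppiness` metric is an illustrative assumption, not a standard statistic) compares a small run against a large one:

```python
import random

def choppiness(trials, buckets=10, seed=7):
    """Spread between fullest and emptiest bucket, relative to the
    expected count per bucket. Larger values = choppier data."""
    rng = random.Random(seed)
    counts = [0] * buckets
    for _ in range(trials):
        counts[rng.randrange(buckets)] += 1
    expected = trials / buckets
    return (max(counts) - min(counts)) / expected

print(f"100 trials:     spread = {choppiness(100):.2f}")
print(f"100,000 trials: spread = {choppiness(100_000):.2f}")
```

With only 100 trials the buckets diverge noticeably from one another; at 100,000 trials the relative spread shrinks toward zero, which is why undersized generation runs look unrealistic even after aggregation.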
Finally, you will need some convenient way to browse the generated data to check for realism and conformance to any imposed patterns. Ideally, for the BI scenario assumed here, you should have a multidimensional cube browser available.
A Case Study
Here's a representative case that illustrates the previous points. The case study uses Microsoft SQL Server for the relational star schema (see Figure 1), Analysis Services for the cube (see Figure 2), and Panorama NovaView for the cube browser. First, I'll show the data specification and then examine an implementation.
Figure 1. Database Schema: The figure shows the database schema for a sample BI application demonstration.
Figure 2. Cube Elements: The figure shows the Cube Elements as seen in Analysis Services Cube Editor.