Generating realistic sample data is often harder than you might think, and generating sample data that you can use for a successful data-mining demonstration requires even more planning. This article explores the process and presents guidelines for realistic sample data generation.
by Mark Frawley
Jan 25, 2005
Page 1 of 4
ave you ever needed to populate a database with realistic, but generated (as opposed to actual) data, in volumes that demand automation? Such a need arises in many circumstances, such as:
Stress-testing on new applications where historical data does not yet exist.
Data Scrubbing, where historical, representative data is available but the identity of customers and other sensitive data must be obscured for some reason, such as a demonstration open to the public.
Training and Documentation, where representative data would improve comprehension.
Sales and Marketing Proof-of-Concept, where realistic data, especially if tailored to a prospect's industry, would be much more compelling than the usual one-size-fits-all.
Sometimes, you need data that is not only realistic, but also "tells a story," This is typical in a business intelligence (BI) or data mining context where you want to superimpose pre-determined patterns on the data in the data warehouse or datamart, so that the patterns can then be "discovered" in the course of the exercise. After all, pattern discovery is one of the main reasons for applying these technologies.
When faced with the task of generating "realistic" data in volume for any of these reasons, most of us non-statistician developers would probably equate "realistic" with "random". We would remember that our programming environment of choice, even Excel, offers a random number generator (RNG). Can't be that hard, you might figure. Unfortunately, you'd be wrong. The fact is, it's surprisingly hard to do these things effectively, especially if you require not just "realistic" but pre-patterned data. This article focuses on some of the issues involved, taking the general perspective of a relational star-schema datamart feeding a multidimensional cube. While the points raised in this article are valid regardless of which technology you choose to apply, this is the easiest configuration in which to visualize the generated data to check its conformance to goals. I've assumed that you have a basic knowledge of dimensional data modeling and OLAP browsing.
The most important step in developing a data generator is to first create a specification for the data you'll need. Doing that helps to reveal the complexity involved before you write a line of code, and may well show that you may have to modify requirements to match time and resource constraints. The following are minimum data-creation guidelines, with examples:
The overall volume of data required: An average of 100 transactions per business day for a years' worth of business days.
The variables to be generated, typically numeric measures or relationships: Revenue, Cost, Customer of Transaction, Product of Transaction.
The essential business rules and other constraints absolutely required for any semblance of realism: Fee = 5% of transaction's value; transactions can only exist on business days.
If applicable, the patterns to be superimposed on the generated data for later "discovery". These would likely be expressed as business scenarios: While all products are profitable in the aggregate, they are not consistently profitable when broken down by customer and region. The generated data should reflect this.
Any special constraints or requirements of the database schema you will be populating: In a star schema with transaction/day grain, where transactions can exist for more than one day, the data generator will have to take this into accountcarrying some fraction of transactions from the prior day forward to the next day, creating new rows for them, while also marking the remaining transactions from the prior day "completed". This must be done in realistic proportions, and the proportions may need to vary over time to reflect the superimposed pattern for discovery.
Source of referential informationthe dimensions. Can you identify this source or will you need to generate it too, or at least scrub (i.e. obscuring actual customer names) the data? Customers, products, geographic regions, salespeople, countries.
The implementation must run fast, because you're likely to need many trials to tune it to produce acceptable results.
The range or set of values each variable needs to produce adequate realism: For example, 100 customers, 20 products, 3 geographic regions and business dates between 5/1/2004 and 12/31/2004 should be represented in the generated data. Revenue should range between $100 and $10000 per transaction, with an average value of $2500.