Generating realistic sample data is often harder than you might think, and generating sample data that you can use for a successful data-mining demonstration requires even more planning. This article explores the process and presents guidelines for realistic sample data generation.
by Mark Frawley
January 25, 2005
ave you ever needed to populate a database with realistic, but generated (as opposed to actual) data, in volumes that demand automation? Such a need arises in many circumstances, such as:
Stress-testing on new applications where historical data does not yet exist.
Data Scrubbing, where historical, representative data is available but the identity of customers and other sensitive data must be obscured for some reason, such as a demonstration open to the public.
Training and Documentation, where representative data would improve comprehension.
Sales and Marketing Proof-of-Concept, where realistic data, especially if tailored to a prospect's industry, would be much more compelling than the usual one-size-fits-all.
Sometimes, you need data that is not only realistic, but also "tells a story," This is typical in a business intelligence (BI) or data mining context where you want to superimpose pre-determined patterns on the data in the data warehouse or datamart, so that the patterns can then be "discovered" in the course of the exercise. After all, pattern discovery is one of the main reasons for applying these technologies.
It's quick, easy and you get access to all the articles on DevX.
This registration/login is to allow you to read articles on devx.com. Already a member?
To become a member of DevX.com create your Member Profile by completing the form below. Membership is free!