advertisement
Premier Club Log In/Registration
  Include Code  Search Tips
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   SKILLBUILDING  |   TIP BANK  |   SOURCEBANK  |   FORUMS  |   NEWSLETTERS
Browse DevX
Partners & Affiliates
advertisement
advertisement
advertisement
Average Rating: 5/5 | Rate this item | 1 user has rated this item.
Email this articleEmail this article
 
Generating Realistic and Non-Random Sample Data
Generating realistic sample data is often harder than you might think, and generating sample data that you can use for a successful data-mining demonstration requires even more planning. This article explores the process and presents guidelines for realistic sample data generation. 

advertisement
ave you ever needed to populate a database with realistic, but generated (as opposed to actual) data, in volumes that demand automation? Such a need arises in many circumstances, such as:

  • Stress-testing on new applications where historical data does not yet exist.
  • Data Scrubbing, where historical, representative data is available but the identity of customers and other sensitive data must be obscured for some reason, such as a demonstration open to the public.
  • Training and Documentation, where representative data would improve comprehension.
  • Sales and Marketing Proof-of-Concept, where realistic data, especially if tailored to a prospect's industry, would be much more compelling than the usual one-size-fits-all.
Sometimes, you need data that is not only realistic, but also "tells a story," This is typical in a business intelligence (BI) or data mining context where you want to superimpose pre-determined patterns on the data in the data warehouse or datamart, so that the patterns can then be "discovered" in the course of the exercise. After all, pattern discovery is one of the main reasons for applying these technologies.

When faced with the task of generating "realistic" data in volume for any of these reasons, most of us non-statistician developers would probably equate "realistic" with "random". We would remember that our programming environment of choice, even Excel, offers a random number generator (RNG). Can't be that hard, you might figure. Unfortunately, you'd be wrong. The fact is, it's surprisingly hard to do these things effectively, especially if you require not just "realistic" but pre-patterned data. This article focuses on some of the issues involved, taking the general perspective of a relational star-schema datamart feeding a multidimensional cube. While the points raised in this article are valid regardless of which technology you choose to apply, this is the easiest configuration in which to visualize the generated data to check its conformance to goals. I've assumed that you have a basic knowledge of dimensional data modeling and OLAP browsing.

Preliminary Observations
The most important step in developing a data generator is to first create a specification for the data you'll need. Doing that helps to reveal the complexity involved before you write a line of code, and may well show that you may have to modify requirements to match time and resource constraints. The following are minimum data-creation guidelines, with examples:

  1. The overall volume of data required:
    An average of 100 transactions per business day for a years' worth of business days.
  2. The variables to be generated, typically numeric measures or relationships:
    Revenue, Cost, Customer of Transaction, Product of Transaction.
  3. The essential business rules and other constraints absolutely required for any semblance of realism:
    Fee = 5% of transaction's value; transactions can only exist on business days.
  4. If applicable, the patterns to be superimposed on the generated data for later "discovery". These would likely be expressed as business scenarios:
    While all products are profitable in the aggregate, they are not consistently profitable when broken down by customer and region. The generated data should reflect this.
  5. Any special constraints or requirements of the database schema you will be populating:
    In a star schema with transaction/day grain, where transactions can exist for more than one day, the data generator will have to take this into account—carrying some fraction of transactions from the prior day forward to the next day, creating new rows for them, while also marking the remaining transactions from the prior day "completed". This must be done in realistic proportions, and the proportions may need to vary over time to reflect the superimposed pattern for discovery.
  6. Source of referential information—the dimensions. Can you identify this source or will you need to generate it too, or at least scrub (i.e. obscuring actual customer names) the data?
    Customers, products, geographic regions, salespeople, countries.
  7. The implementation must run fast, because you're likely to need many trials to tune it to produce acceptable results.
  8.  
  9. The range or set of values each variable needs to produce adequate realism:
    For example, 100 customers, 20 products, 3 geographic regions and business dates between 5/1/2004 and 12/31/2004 should be represented in the generated data. Revenue should range between $100 and $10000 per transaction, with an average value of $2500.
Does the task still seem simple?

  Next Page: Randomized vs. Real Data
Page 1: IntroductionPage 3: The Requirements
Page 2: Randomized vs. Real DataPage 4: The Implementation
Untitled
advertisement
Advertising Info  |   Member Services  |   Permissions  |   Contact Us  |   Help  |   Feedback  |   Site Map  |   Network Map  |   About


JupiterOnlineMedia

internet.comearthweb.comDevx.commediabistro.comGraphics.com

Search:

Jupitermedia Corporation has two divisions: Jupiterimages and JupiterOnlineMedia

Jupitermedia Corporate Info


Legal Notices, Licensing, Reprints, & Permissions, Privacy Policy.

Advertise | Newsletters | Tech Jobs | Shopping | E-mail Offers

Solutions
Whitepapers and eBooks
Intel PDF: Virtualization Delivers Data Center Efficiency
Intel eBook: Managing the Evolving Data Center
Microsoft Article: BitLocker Brings Encryption to Windows Server 2008
Symantec eBook: The Guide to E-Mail Archiving and Management
Microsoft Article: RODCs Transform Branch Office Security
Go Parallel Article: James Reinders on the Intel Parallel Studio Beta Program
Avaya Article: Advancing the State of the Art in Customer Service
Adobe Acrobat Connect Pro: Web Conferencing and eLearning Whitepapers
Avaya Article: Avaya AE Services Provide Rapid Telephony Integration with Facebook
Go Parallel Article: Getting Started with TBB on Windows
HP eBook: Storage Networking , Part 1
MORE WHITEPAPERS, EBOOKS, AND ARTICLES
Webcasts
Intel Seminar: Efficiencies in Hardware/Software Virtualization
HP Webcast: Disaster Recovery Planning
Go Parallel Video: Performance and Threading Tools for Game Developers
HP Video: StorageWorks EVA4400 and Oracle
HP Webcast: Storage Is Changing Fast - Be Ready or Be Left Behind
MORE WEBCASTS, PODCASTS, AND VIDEOS
Downloads and eKits
IBM TCO eKIT: Your IT Budget is Under Attack, Get in Control
IBM Energy Efficiency eKIT: Learn How to Reduce Costs
30-Day Trial: SPAMfighter Exchange Module
Red Gate Download: SQL Toolbelt and free High-Performance SQL Code eBook
Iron Speed Designer Application Generator
MORE DOWNLOADS, EKITS, AND FREE TRIALS
Tutorials and Demos
Microsoft Article: Silverlight Streaming--Free Video Hosting for All
Featured Algorithm: Intel Threading Building Blocks - parallel_reduce
HP Demo: StorageWorks EVA4400
MORE TUTORIALS, DEMOS AND STEP-BY-STEP GUIDES