Generating Realistic and Non-Random Sample Data
Generating realistic sample data is often harder than you might think, and generating sample data that you can use for a successful data-mining demonstration requires even more planning. This article explores the process and presents guidelines for realistic sample data generation.
by Mark Frawley
Jan 25, 2005
Page 3 of 4
We needed to make a presentation to a potential client showing the benefits of applying BI to their business. The data had to be recognizable and reasonable to the client, and it had to contain an embedded scenario that would let us show off the BI technology by "discovering" it.
The steady-state business situation to be simulated was as follows:
A fictional wholesale bank BankABC does business in certain products. Transactions represent customers' activities in these products. Transactions have a face value, and can be outstanding for multiple days, but incur a fixed overhead cost for each day they are outstanding. On the day a transaction is completed, the bank collects a fee which is 1% of the face value. Profit consists of this fee minus the accumulated overhead cost. Various back offices around the world process the transactions. In the aggregate, profit is positive and trending upward.
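The profit rule above can be sketched in a few lines. This is an illustration only (the article's actual implementation is T-SQL scripts); the $400/day overhead figure is an assumption, since the scenario specifies a fixed daily cost without naming it:

```python
# Illustrative profit model for one transaction (overhead figure assumed).
FEE_RATE = 0.01           # bank collects 1% of face value on completion
DAILY_OVERHEAD = 400.0    # fixed cost per day outstanding (assumed value)

def transaction_profit(face_value: float, days_outstanding: int) -> float:
    """Profit = completion fee minus accumulated daily overhead."""
    fee = FEE_RATE * face_value
    cost = DAILY_OVERHEAD * days_outstanding
    return fee - cost

# A $250K transaction outstanding 3 days: fee 2500.0, cost 1200.0, profit 1300.0
```

Under these assumed numbers, an average transaction stays profitable only while it clears within about six days, which is exactly the lever the imposed pattern will pull.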
The "pattern" to be imposed on this steady state was as follows:
As of a certain date, BankABC begins offering a new product. Because bank employees are insufficiently trained in the new product, all transactions tend to remain outstanding on average for a significantly longer period than previously, such that profitability in all products begins to decline, eventually becoming negative even in the aggregate. BI technology is then used both to detect the decline in profitability and to determine that increasing average processing time is its cause. As a result of this analysis, the bank institutes a crash course of training in the new product for its staff. After instituting this change, further analysis in the ensuing months shows that profitability for all products becomes positive again, and overall profitability resumes its upward trend.
This perfectly reasonable, even over-simplified, business scenario was nevertheless non-trivial to simulate realistically. Note that while the scenario specifies certain things unambiguously, many essential details are left unstated (which is typical of such scenarios). I needed not only to design the mechanics, but also to make many choices to "fill in the gaps" and to decide where shortcuts would not undermine the presentation.
Creating a specification using the eight guidelines mentioned previously resulted in the following:
Data Volume: An average of 60 transactions per business day for nine months, from 7/1/2004 to 3/31/2005.
Variables: Revenue, cost, elapsed days, and the customer and product associated with each transaction.
Business rules: No data on weekends.
Patterns to be superimposed: As per the scenario; there are two products initially, with the new product (Documentary Collections) introduced on 9/1/2004; the scenario should fully play out over the allowed date range.
Requirements of the database schema: Rows must be created in the fact table for each transaction for every day it is active, thus each day's data is partially dependent on the prior day's data.
Referential data: 24 customers, 9 back offices, 8 countries, 4 regions, 14 employees, and 3 products, with all data invented but realistic-looking. These unrealistically small cardinalities were deemed acceptable in this context.
The implementation must run fast: The small cardinalities chosen ensure this. It will be implemented as T-SQL scripts.
The range for each variable: The ranges appear in the code listings discussed later in this article. The average face value per transaction should be about $250K.
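The generation loop implied by these guidelines can be sketched as follows. Again, this is illustrative Python rather than the article's T-SQL; the daily transaction-count spread, the face-value standard deviation, and the processing-time distribution are all assumptions, and the sketch keeps the "no data on weekends" rule by simply skipping weekend days:

```python
import random
from datetime import date, timedelta

random.seed(42)  # reproducible sample data

START, END = date(2004, 7, 1), date(2005, 3, 31)
AVG_TXNS_PER_DAY = 60
AVG_FACE_VALUE = 250_000

def business_days(start: date, end: date):
    """Yield weekdays only -- the 'no data on weekends' business rule."""
    d = start
    while d <= end:
        if d.weekday() < 5:          # Monday..Friday
            yield d
        d += timedelta(days=1)

def generate_fact_rows():
    """One fact row per transaction for every business day it is outstanding,
    so each day's fact data depends partly on transactions opened earlier."""
    rows = []
    txn_id = 0
    for open_day in business_days(START, END):
        # Assumed spread around the average of 60 transactions per day.
        n_new = random.randint(AVG_TXNS_PER_DAY - 10, AVG_TXNS_PER_DAY + 10)
        for _ in range(n_new):
            txn_id += 1
            face = max(10_000, random.gauss(AVG_FACE_VALUE, 50_000))
            days_open = max(1, round(random.gauss(3, 1)))  # assumed distribution
            for offset in range(days_open):
                active_day = open_day + timedelta(days=offset)
                if active_day.weekday() < 5:   # no fact rows on weekends
                    rows.append((active_day, txn_id, face))
    return rows
```

A real run would also vary the processing-time distribution by date to impose the scenario's pattern, and would join each transaction to the customer, product, and back-office reference tables; the sketch shows only the schema requirement that a multi-day transaction contributes a row to the fact table for each day it is active.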