Have you ever needed to populate a database with realistic, but generated (as opposed to actual) data, in volumes that demand automation? Such a need arises in many circumstances, such as:
- Stress-testing new applications where historical data does not yet exist.
- Data Scrubbing, where historical, representative data is available but the identity of customers and other sensitive data must be obscured for some reason, such as a demonstration open to the public.
- Training and Documentation, where representative data would improve comprehension.
- Sales and Marketing Proof-of-Concept, where realistic data, especially if tailored to a prospect’s industry, would be much more compelling than the usual one-size-fits-all.
Sometimes, you need data that is not only realistic, but also “tells a story.” This is typical in a business intelligence (BI) or data mining context, where you want to superimpose pre-determined patterns on the data in the data warehouse or datamart so that the patterns can then be “discovered” in the course of the exercise. After all, pattern discovery is one of the main reasons for applying these technologies.
When faced with the task of generating “realistic” data in volume for any of these reasons, most of us non-statistician developers would probably equate “realistic” with “random”. We would remember that our programming environment of choice, even Excel, offers a random number generator (RNG). Can’t be that hard, you might figure. Unfortunately, you’d be wrong. The fact is, it’s surprisingly hard to do these things effectively, especially if you require not just “realistic” but pre-patterned data. This article focuses on some of the issues involved, taking the general perspective of a relational star-schema datamart feeding a multidimensional cube. While the points raised in this article are valid regardless of which technology you choose to apply, this is the easiest configuration in which to visualize the generated data to check its conformance to goals. I’ve assumed that you have a basic knowledge of dimensional data modeling and OLAP browsing.
The most important step in developing a data generator is to first create a specification for the data you’ll need. Doing so helps reveal the complexity involved before you write a line of code, and may well show that you’ll have to modify requirements to match time and resource constraints. The following are minimum data-creation guidelines, with examples:
- The overall volume of data required:
An average of 100 transactions per business day for a year’s worth of business days.
- The variables to be generated, typically numeric measures or relationships:
Revenue, Cost, Customer of Transaction, Product of Transaction.
- The essential business rules and other constraints absolutely required for any semblance of realism:
Fee = 5% of transaction’s value; transactions can only exist on business days.
- If applicable, the patterns to be superimposed on the generated data for later “discovery”. These would likely be expressed as business scenarios:
While all products are profitable in the aggregate, they are not consistently profitable when broken down by customer and region. The generated data should reflect this.
- Any special constraints or requirements of the database schema you will be populating:
In a star schema with transaction/day grain, where transactions can exist for more than one day, the data generator will have to take this into account, carrying some fraction of transactions from the prior day forward to the next day, creating new rows for them, while also marking the remaining transactions from the prior day “completed”. This must be done in realistic proportions, and the proportions may need to vary over time to reflect the superimposed pattern for discovery.
- Source of referential information (the dimensions). Can you identify this source, or will you need to generate it too, or at least scrub the data (i.e., obscure actual customer names)?
Customers, products, geographic regions, salespeople, countries.
- The implementation must run fast, because you’re likely to need many trials to tune it to produce acceptable results.
- The range or set of values each variable needs to produce adequate realism:
For example, 100 customers, 20 products, 3 geographic regions, and business dates between 5/1/2004 and 12/31/2004 should be represented in the generated data. Revenue should range between $100 and $10,000 per transaction, with an average value of $2,500.
Does the task still seem simple?
Randomized vs. Real Data
The first seven guidelines help determine how you’ll approach the algorithmic structure of the data generator. In the absence of a specialized toolset, you’ll be tempted to use an RNG to pick the values for each variable (see the eighth guideline above), because it’s readily available and appears to offer an easy way to get the dynamism and apparent randomness of “real” data. But simple approaches will be approximations, and naïve attempts may degenerate to the point of not being useful. Consider:
- Loosely speaking, “random” means that all possible events (of count N) have an equal probability of occurrence (1/N), given enough trials. But casual appearances notwithstanding, this is generally not the nature of real-world business data. Each business and application has a unique probability and frequency distribution in its data. Real data tends to be sparse, characterized by hotspots and not evenly distributed.
- The RNG available is likely to be deterministic, meaning that given the same starting state, the generator will produce the identical sequence of random numbers each time (such a generator is called a pseudo-RNG). While this would be unacceptable in applications such as cryptography, it is not usually an issue for the present purpose and may in fact be useful in controlling the repeatability of data generation runs. This can be helpful when tweaking other parameters of the generator to create the distribution or pattern required, by holding the random component “constant”.
- If ease of developing the data generator is paramount, pseudo-randomized data might be deemed an acceptable model of real data in spite of the previous points. In this case, be aware that if the number of trials is not large enough, the results will be unevenly distributed across the range of possibilities, producing “choppy” and unrealistic, rather than smoothly varying, generated data even at aggregated levels. While this may seem obvious, you’ll see that it was discovered the hard way in the real-world case study described later.
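A rough sketch of the first two points (in Python, purely for illustration; the customer labels and weights are my own assumptions, not from the article) contrasts a uniform 1/N draw with a skewed, hotspot-heavy draw, with a fixed seed to make runs repeatable:

```python
import random
from collections import Counter

random.seed(42)  # fixed seed: a pseudo-RNG replays the same sequence,
                 # which holds the random component "constant" between tuning runs

customers = [f"C{i:02d}" for i in range(10)]

# Uniform draw: every customer equally likely (1/N) -- the naive model of "realistic"
uniform = Counter(random.choice(customers) for _ in range(10_000))

# Skewed draw: a few "hotspot" customers dominate, closer to real business data
weights = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
skewed = Counter(random.choices(customers, weights=weights, k=10_000))

print(uniform.most_common(3))
print(skewed.most_common(3))
```

With enough trials the uniform counts flatten out at roughly 1,000 per customer, while the skewed draw concentrates about half of all activity on one customer; with too few trials, even the uniform draw looks “choppy”, which is the third point above.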
Finally, you will need some convenient way to browse the generated data to check for realism and conformance to any imposed patterns. Ideally, you should have a multidimensional cube browser available under the assumed scenario.
A Case Study
Here’s a representative case that illustrates the previous points. The case study uses Microsoft SQL Server for the relational star schema (see Figure 1), Analysis Services for the cube (see Figure 2), and Panorama NovaView for the cube browser. First, I’ll show the data specification and then examine an implementation.
We needed to make a presentation to a potential client showing the benefits of applying BI to their business. The data had to be recognizable and reasonable to the client and to have an embedded scenario which would let us show off BI technology by “discovering” it.
The steady-state business situation to be simulated was as follows:
A fictional wholesale bank BankABC does business in certain products. Transactions represent customers’ activities in these products. Transactions have a face value, and can be outstanding for multiple days, but incur a fixed overhead cost for each day they are outstanding. On the day a transaction is completed, the bank collects a fee which is 1% of the face value. Profit consists of this fee minus the accumulated overhead cost. Various back offices around the world process the transactions. In the aggregate, profit is positive and trending upward.
The “pattern” to be imposed on this steady state was as follows:
As of a certain date, BankABC begins offering a new product. Because bank employees are insufficiently trained in the new product, all transactions tend to remain outstanding on average for a significantly longer period than previously, such that profitability in all products begins to decline, eventually becoming negative even in the aggregate. BI technology is then used both to detect the decline in profitability and to determine that increasing average processing time is its cause. As a result of this analysis, the bank institutes a crash course of training in the new product for its staff. After instituting this change, further analysis in the ensuing months shows that profitability for all products becomes positive again, and overall profitability resumes its upward trend.
This perfectly reasonable, even over-simplified, business scenario was nevertheless non-trivial to simulate realistically. Note that while the scenario specifies certain things unambiguously, many essential details are left unstated (which is typical of such scenarios). I needed not only to design the mechanics, but also to make many choices to “fill in the gaps” and decide where shortcuts would not undermine the presentation.
Creating a specification using the eight guidelines mentioned previously resulted in the following:
- Data Volume: An average of 60 transactions per business day for nine months, from 7/1/2004 to 3/31/2005.
- Variables: Revenue, cost, elapsed days, customer, and product to transaction associations.
- Business rules: No data on weekends.
- Patterns to be superimposed: As per the scenario; there are two products initially, with the new product (Documentary Collections) introduced on 9/1/2004; the scenario should fully play out over the allowed date range.
- Requirements of the database schema: Rows must be created in the fact table for each transaction for every day it is active, thus each day’s data is partially dependent on the prior day’s data.
- Referential data: 24 customers, 9 back offices, 8 countries, 4 regions, 14 employees, and 3 products, with all data generated but realistic-looking. Note that these unrealistically small cardinalities were deemed acceptable in this context.
- The implementation must run fast: The small cardinalities chosen ensure this. It will be implemented as T-SQL scripts.
- The range for each variable: You can see the ranges in the code listings discussed later in this article. Average face value per transaction should be about $250K.
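The carry-forward requirement in the fourth item above can be sketched as a per-day split of the open transaction set. This Python outline is mine (the article drives the equivalent logic in T-SQL through the DateTranCountDriver control table), and the completion probability is an assumed, tweakable knob:

```python
import random

random.seed(7)  # deterministic runs make the tuning cycle reproducible

def advance_day(open_txns, completion_prob):
    """Split the prior day's open transactions: some complete today
    (and collect their fee), while the rest carry forward, each getting
    a new fact-table row for the new day."""
    completed, carried = [], []
    for txn in open_txns:
        (completed if random.random() < completion_prob else carried).append(txn)
    return completed, carried

# Lowering completion_prob after the new product's launch date keeps
# transactions outstanding longer -- the superimposed pattern.
open_txns = list(range(60))
completed, carried = advance_day(open_txns, completion_prob=0.5)
```

Varying `completion_prob` over the date range is one simple way to realize the scenario’s three phases: steady state, decline after 9/1/2004, and recovery after the training intervention.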
Given the relatively small data volume required, I initially attempted to create the generated data for the fact table BankABCFact in Excel and use DTS to transfer it into the database. This was quick, and seemed particularly appealing for the superimposed scenario because of the apparent ease of manually tweaking the data to conform. However, I quickly discovered that this approach was much more cumbersome than expected, among other reasons because I could not express “rules” effectively. I therefore switched to an approach based on T-SQL scripts. Listing 1 creates the schema (including the database) and Listing 2 populates the dimension tables. Note that there is no explicit Time dimension; the project relies on Analysis Services’ ability to use the fact table as the Time dimension. The reader must create the Analysis Services cube based on the schema, but this is simple and Figures 1 and 2 provide an outline. For further details contact the author.
Populating a star schema is entirely about populating the fact table, after you’ve established the dimensions and their data. Populating fact table columns involves two principal tasks: selecting the foreign keys (i.e. creating the joins to the dimension tables), and generating the associated measure values. The general approach was to use the rand() function with various weightings and conditional logic to try to assign these values in a realistic way. Note that in T-SQL, rand() returns a single value per statement, so a set-based INSERT or UPDATE would assign the same “random” value to every row. Instead, you must call rand() once per row, using a cursor or scrolling through the rows by other means, as is done in Listings 3, 4, and 5.
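To make the per-row approach concrete, here is a sketch in Python of what one generated fact row might look like (the article does the equivalent with rand() in a T-SQL cursor; the key weightings and the triangular distribution are my assumptions, chosen only to hit the $250K average from the spec):

```python
import random

random.seed(2004)

CUSTOMER_KEYS = list(range(1, 25))  # 24 customers, per the spec
PRODUCT_KEYS = [1, 2, 3]

def generate_fact_row(date_key):
    # Weighted foreign-key selection: a few hotspot customers and an
    # uneven product mix, rather than a flat 1/N draw
    customer_key = random.choices(CUSTOMER_KEYS,
                                  weights=[10] * 4 + [3] * 8 + [1] * 12)[0]
    product_key = random.choices(PRODUCT_KEYS, weights=[5, 4, 1])[0]
    # Triangular distribution: mean = (50K + 600K + 100K) / 3 = $250K,
    # skewed low with a long right tail, unlike a uniform draw
    face_value = round(random.triangular(50_000, 600_000, 100_000), 2)
    return {"DateKey": date_key, "CustomerKey": customer_key,
            "ProductKey": product_key, "FaceValue": face_value}
```

Calling `generate_fact_row` once per transaction per day mirrors the one-rand()-call-per-row discipline the T-SQL cursor enforces.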
Two control tables drive the fact table populator: DateTranCountDriver, populated by Listing 3, and DateCustomerRandomizer, populated by Listing 4. Each listing has adjustable factors that you can tweak to achieve the requirements.
|Figure 3. Generate-Load-Review Cycle 1: Note the choppiness of the data even when viewed at this high level of aggregation. Also note the introduction of the new product on 9/1/2004, shortly followed by an overall decline to negative profit in all products.|
DateTranCountDriver was intended to pattern the number of transactions created and completed per day, thereby simulating the initial steady-state, followed by the increase in outstanding time, and finally the return to steady-state. Listing 3 includes queries that make it easier to see the pattern that is being created.
DateCustomerRandomizer was intended to randomize, within limits, the assignment of products and transaction face values to customers. The fact table populator (Listing 5) brought together these drivers while itself adding additional random elements. Listing 3, Listing 4 and Listing 5 contain additional comments. Listing 3 and Listing 4 are independent, but after changing either, you must run Listing 5 again. After each run of Listing 5, the Analysis Services cube is fully reprocessed, so you can browse it again using Panorama NovaView to compare the results to the desired scenario.
The generate-load-review cycle was repeated many times. Space does not permit including the intermediate (terrible) results, but Figure 3 and Figure 4 show the results that were eventually deemed close enough.
An interesting note about Figure 4: it seems to show that the elapsed time started trending upwards in August, prior to the introduction of the new product. This is not true in the data, but is a distortion caused by viewing the data at Month level, which (by definition) displays only one data point per product per month. You’d be able to see the desired pattern more clearly at Day level, but for implementation reasons related to the underlying cube and schema this was not possible to exhibit without more work than time allowed (interested readers may contact the author for details). Since a daily view would have been more compelling, this is another good example of the unexpected difficulty of meeting the requirements.
|Figure 4. Generate-Load-Review Cycle 2: Note how the average elapsed days per transaction meets the scenario overall. The trend was not intended to be so similar for all products, but we ran out of time for randomization.|
Certainly there are more sophisticated ways of “patterning” data than I applied, even using only T-SQL. For example, you could use the AVG and STDEV functions to control the distribution of data more precisely. Rather than using the rand() function directly when generating fact table rows, I could have populated a table variable with, say, the results of one million calls to rand() first, so as to work from a more uniform distribution of “randomness.” I could have invented a rules syntax and developed a full-blown table-driven rules engine. The ultimate would be a tool that let you draw the desired data patterns at various granularities and then generated the underlying data to support them!
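The AVG/STDEV idea in particular is straightforward to sketch. Here it is in Python, with `statistics.mean` and `statistics.stdev` standing in for the T-SQL aggregates; treat it as an illustration of the technique, not production code:

```python
import statistics

def rescale(values, target_mean, target_std):
    """Linearly rescale generated values so their mean and standard
    deviation match targets exactly, preserving the distribution's shape.
    Caution: a large target_std can push some rescaled values negative,
    so sanity-check the output range afterward."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [target_mean + (v - mean) * (target_std / std) for v in values]
```

Generate first, then rescale: the shape of the distribution survives while its first two moments land exactly on spec, which is tighter control than tuning rand() weightings by trial and error.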
My experience during this project has convinced me that there is only so much one can reasonably hope to achieve with simple methods. Had I realized this from the start, I would have saved valuable time that I instead spent discovering the points discussed in this article.
While each case will be different, I hope I have highlighted some of the many complex issues involved in generating realistic data. Though I used OLAP as a vehicle, most of the observations would apply equally to such a need in any data-intensive application. Forewarned is forearmed!