Browse DevX
Sign up for e-mail newsletters from DevX


Generating Realistic and Non-Random Sample Data : Page 4

Generating realistic sample data is often harder than you might think, and generating sample data that you can use for a successful data-mining demonstration requires even more planning. This article explores the process and presents guidelines for realistic sample data generation.




Building the Right Environment to Support AI, Machine Learning and Deep Learning

The Implementation
Given the relatively small data volume required, I initially attempted to create the generated data for the fact table BankABCFact in Excel and use DTS to transfer it into the database. This was quick, and seemed particularly appealing for the superimposed scenario because of the apparent ease of manually tweaking the data to conform. However, I quickly discovered that this approach was much more cumbersome than expected, among other reasons because I could not express "rules" effectively. I therefore switched to an approach based on T-SQL scripts. Listing 1 creates the schema (including the database) and Listing 2 populates the dimension tables. Note that there is no explicit Time dimension—the project relies on Analysis Services' ability to use the fact table as the Time dimension. The reader must create the Analysis Services cube based on the schema, but this is simple and Figures 1 and 2 provide an outline. For further details contact the author. Populating a star schema is entirely about populating the fact table, after you've established the dimensions and their data. Populating fact table columns involves two principal tasks: selecting the foreign keys (i.e. creating the joins to the dimension tables), and generating the associated measure values. The general approach was to use the rand() function with various weightings and conditional logic to try to assign these values in a realistic way. Note that in T-SQL, the rand() function returns a single value within a given statement, thus set operations on multiple rows will not be effective. Instead, you must make individual calls to rand() on a per row basis, using a cursor or scrolling through the rows by other means, as is done in Listings 3, 4, and 5.

Two control tables drive the fact table populator: DateTranCountDriver, populated by Listing 3, and DateCustomerRandomizer, populated by Listing 4. Each listing has adjustable factors that you can tweak to achieve the requirements.

Figure 3. Generate-Load-Review Cycle 1: Note the choppiness of the data even when viewed at this high level of aggregation. Also note the introduction of the new product on 9/1/2004, shortly followed by an overall decline to negative profit in all products.
DateTranCountDriver was intended to pattern the number of transactions created and completed per day, thereby simulating the initial steady-state, followed by the increase in outstanding time, and finally the return to steady-state. Listing 3 includes queries that make it easier to see the pattern that is being created.

DateCustomerRandomizer was intended to randomize, within limits, the assignment of products and transaction face values to customers. The fact table populator (Listing 5) brought together these drivers while itself adding additional random elements. Listing 3, Listing 4 and Listing 5 contain additional comments. Listing 3 and Listing 4 are independent, but after changing either, you must run Listing 5 again. After each run of Listing 5, the Analysis Services cube is fully reprocessed, so you can browse it again using Panorama NovaView to compare the results to the desired scenario. The generate-load-view cycle was repeated many times. Space does not permit including some of the intermediate (terrible) results, but Figure 3 and Figure 4 show the results that were eventually deemed close enough.

An interesting note about Figure 4: it seems to show that the elapsed time started trending upwards in August, prior to the introduction of the new product. This is not true in the data, but is a distortion caused by viewing the data at Month level, which (by definition) displays only one data point per product per month. You'd be able to see the desired pattern more clearly at Day level, but for implementation reasons related to the underlying cube and schema this was not possible to exhibit without more work than time allowed (interested readers may contact the author for details). Since a daily view would have been more compelling, this is another good example of the unexpected difficulty of meeting the requirements.

Figure 4. Generate-Load-Review Cycle 2: Note how the average elapsed days per transaction meets the scenario overall. The trend was not intended to be so similar for all products, but we ran out of time for randomization.
Certainly there are more sophisticated ways of "patterning" data than I applied, even using only T-SQL. For example, you could use the AVG and STDEV functions to control the distribution of data more precisely. Rather than using the rand() function directly when generating fact table rows, I could have populated a table variable with, say, the results of one million calls to rand() first, so as to work from a more uniform distribution of "randomness." I could have invented a rules syntax and developed a full-blown table-driven rules engine. The ultimate would be a tool that let you draw the desired data patterns at various granularities and then generated the underlying data to support them!

My experience during this project has convinced me that there is only so much one can reasonably hope to achieve with simple methods. Had I realized this from the start, I would have saved valuable time that I instead spent discovering the points discussed in this article. While each case will be different, I hope I have highlighted some of the many complex issues involved in generating realistic data. Though I used OLAP as a vehicle, most of the observations would apply equally to such a need in any data-intensive application. Forewarned is forearmed!

Mark Frawley is a Technical Business Analyst on the Citigate Hudson Pervasive Business Intelligence team that created an Executive Dashboard for ABN Amro. His focus areas include data modeling, relational and multidimensional database technology, software development, and writing and teaching about the BI industry. Prior to joining Citigate Hudson, Mark worked at a variety of financial institutions and consulting companies on a wide range of applications and technologies, most recently as one of the chief designers and developers of data warehouses and datamarts over a seven year period at Soci�t� Generale.
Thanks for your registration, follow us on our social networks to keep up-to-date