Calculating Baseline Metrics for Business Intelligence : Page 3

At first glance, calculating a baseline metric—an average of some measure over a specific period of time—appears to be a simple task. In reality, baseline metrics are slippery beasts, and not at all as simple as they appear.

A Sample Technical Implementation
One way you can achieve baselines that take all these requirements into account is with a datamart structured as a star schema feeding a Microsoft Analysis Services cube that uses MDX for calculated measures and queries. The design and population of the star schema and of the MDX expressions used in the cube are equally important components of the solution sketched here. The scheme illustrated is intended to have the fewest constraints and the most flexibility possible, but in particular cases it might be possible to meet requirements more simply than shown, for example with one column rather than multiple columns in the fact table.

The Star Schema
A typical star schema design for the sample scenario, assuming a need to track Transaction/Day data, would have one row in the fact table per Transaction/Day combination. Or, if each transaction has many measures or other data not at a Day grain (such as financial balances), one might design two fact tables joined in a view—one containing the transaction data and one containing the Transaction/Day data. Such a design would represent best practices, yet be sub-optimal for counting problems such as the one at hand. With multiple day rows for the same transaction, how would you count Tvol? A Tvol query at any given level of Time granularity must count each transaction exactly once. Further, logic dictates that when counting Tvol at the Day level, you should count a transaction only on the day it was created. If counting at the Week level, you'd count a transaction if it were created on any day in that week. At the Month level, it counts if it was created on any day in the month. You would treat levels such as Quarter and Year in like fashion.

Achieving such behavior with COUNT functions, whether in the SQL against the datamart or in the cube's MDX, would be difficult if not impossible without double-counting or excessively complex code. To avoid this, the key idea for the datamart design is to change the "counting" problem into a "summing" problem. While this complicates the fact table load logic, it makes it possible to get a reliable count at any level of the Time hierarchy in a relatively simple way. To do this, you add an integer column to the Transaction/Day fact table for each level of the Time hierarchy: Day, Week, Month, Quarter, and Year. These columns are populated such that summing the appropriate column over the requested Time level yields the correct count.

The easiest way to understand this is to look at an example (see Table 2). Assume "today" is 4/6/2004, and that the earliest date each transaction appears is its creation date, after which it has Day rows for each consecutive day until either its lifecycle is over or "today" is reached (for incomplete transactions). In this example, 3/22/2004 is a Monday, and no rows are added on weekends. The "IsCounted" columns support counting transaction creation volume by summing.

Table 2. The table shows how values in the "IsCounted" columns might look for the Day, Week, and Month levels.

Transaction  Day        IsCountedDay  IsCountedWeek  IsCountedMonth
ABC          3/22/2004  1             0              0
ABC          3/23/2004  0             0              0
ABC          3/24/2004  0             0              0
ABC          3/25/2004  0             0              0
ABC          3/26/2004  0             1              0
ABC          3/29/2004  0             0              1
XYZ          3/25/2004  1             0              0
XYZ          3/26/2004  0             1              0
XYZ          3/29/2004  0             0              0
XYZ          3/30/2004  0             0              0
XYZ          3/31/2004  0             0              1
XYZ          4/1/2004   0             0              0
XYZ          4/2/2004   0             0              0
XYZ          4/5/2004   0             0              0
XYZ          4/6/2004   0             0              0
DEF          4/6/2004   1             1              1
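As a concrete check of the summing scheme, the following Python sketch (an illustration, not part of the original solution) loads the Table 2 rows into an in-memory SQLite database and runs one SUM query per Time level. The table and column names (fact_txn_day, is_counted_day, and so on) are invented for the example, and dates are stored in ISO format so that text comparisons with BETWEEN sort correctly; a real datamart would use its own schema and a Time dimension table.

```python
import sqlite3

# Build an in-memory copy of the Table 2 fact rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_txn_day (
        txn TEXT, day TEXT,
        is_counted_day INTEGER, is_counted_week INTEGER, is_counted_month INTEGER
    )""")
rows = [
    ("ABC", "2004-03-22", 1, 0, 0), ("ABC", "2004-03-23", 0, 0, 0),
    ("ABC", "2004-03-24", 0, 0, 0), ("ABC", "2004-03-25", 0, 0, 0),
    ("ABC", "2004-03-26", 0, 1, 0), ("ABC", "2004-03-29", 0, 0, 1),
    ("XYZ", "2004-03-25", 1, 0, 0), ("XYZ", "2004-03-26", 0, 1, 0),
    ("XYZ", "2004-03-29", 0, 0, 0), ("XYZ", "2004-03-30", 0, 0, 0),
    ("XYZ", "2004-03-31", 0, 0, 1), ("XYZ", "2004-04-01", 0, 0, 0),
    ("XYZ", "2004-04-02", 0, 0, 0), ("XYZ", "2004-04-05", 0, 0, 0),
    ("XYZ", "2004-04-06", 0, 0, 0), ("DEF", "2004-04-06", 1, 1, 1),
]
conn.executemany("INSERT INTO fact_txn_day VALUES (?, ?, ?, ?, ?)", rows)

# Tvol for Day 3/22: sum the Day-level flag constrained to that day.
day_tvol = conn.execute(
    "SELECT SUM(is_counted_day) FROM fact_txn_day WHERE day = '2004-03-22'"
).fetchone()[0]

# Tvol for Week 3/22-3/26: sum the Week-level flag over the week's days.
week_tvol = conn.execute(
    "SELECT SUM(is_counted_week) FROM fact_txn_day "
    "WHERE day BETWEEN '2004-03-22' AND '2004-03-26'"
).fetchone()[0]

# Tvol for March: sum the Month-level flag over 3/1-3/31.
month_tvol = conn.execute(
    "SELECT SUM(is_counted_month) FROM fact_txn_day "
    "WHERE day BETWEEN '2004-03-01' AND '2004-03-31'"
).fetchone()[0]

print(day_tvol, week_tvol, month_tvol)  # 1 2 2
```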

When querying the structure shown in Table 2 via SQL, queries above the Day level of the Time dimension are assumed to translate the requested level into Day-level constraints, either directly through constraints on Day or by joining Day to a dimension table that performs the mapping. From the data shown in Table 2, you can follow this logic:

For each transaction, each count column contains zero in every row except one, so summing the count column associated with the Time constraint gives the correct volume count. For example, for the Day 3/22/2004, summing the IsCountedDay column shows that one transaction was created (ABC). For the Week 3/22/2004-3/26/2004, summing the IsCountedWeek column shows that two transactions were created (ABC and XYZ). For the Month of March (where Day would be constrained to 3/1/2004-3/31/2004), summing the IsCountedMonth column likewise shows two transactions created (ABC and XYZ).

Maintaining this table as the Time dimension goes forward does require the additional overhead of updating rows for "yesterday" when adding new rows for the same transaction "today." For example, DEF looks as shown above when 4/6/2004 is "today." When 4/7/2004 becomes "today," if DEF is still active, it then has two rows, which look like Table 3:

Table 3. The table illustrates the process used to maintain the "IsCounted" columns on a day-by-day basis.

Transaction  Day        IsCountedDay  IsCountedWeek  IsCountedMonth
DEF          4/6/2004   1             0              0
DEF          4/7/2004   0             1              1

Because the update of "yesterday's" row is only necessary the first time new rows are added for "today," you can improve performance by taking this into account if the fact table is updated more than once a day. None of this deals with the multi-calendar, multi-location issues noted earlier—these are left to the Analysis Services portion of the solution.

The code to maintain this table will not be simple, but it makes the SQL for counting Tvol both simple and immune to double-counting. See Sidebar 1 for useful variations on this technique.
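A minimal Python sketch of that load logic, under stated assumptions, might look like the following. It implements the rule implied by Tables 2 and 3: the Day flag is set only on the creation row, while the Week and Month flags travel forward with each new row until the transaction leaves its creation week or month. The function name add_day_row is invented for the example, and ISO weeks stand in for whatever week definition the real Time dimension uses (the article's calendar, for instance, skips weekends); Quarter and Year flags would follow the same pattern.

```python
from datetime import date

def same_week(d1, d2):
    # ISO-week comparison; a stand-in for the real Time dimension's weeks.
    return d1.isocalendar()[:2] == d2.isocalendar()[:2]

def same_month(d1, d2):
    return (d1.year, d1.month) == (d2.year, d2.month)

def add_day_row(rows, txn, created, today):
    """Append "today's" fact row for txn and fix up earlier rows' flags.

    Each row is a dict with keys: txn, day, day_flag, week_flag, month_flag.
    """
    new = {
        "txn": txn,
        "day": today,
        "day_flag": int(today == created),              # 1 only on the creation day
        "week_flag": int(same_week(today, created)),    # 1 while still in the creation week
        "month_flag": int(same_month(today, created)),  # 1 while still in the creation month
    }
    # When a flag moves to the new row, zero it on the earlier carrier so
    # each column still sums to exactly 1 per transaction.
    for prev in rows:
        if prev["txn"] == txn:
            if new["week_flag"]:
                prev["week_flag"] = 0
            if new["month_flag"]:
                prev["month_flag"] = 0
    rows.append(new)

# Reproduce the DEF rows from Tables 2 and 3.
rows = []
created = date(2004, 4, 6)
add_day_row(rows, "DEF", created, created)           # 4/6 is "today": flags 1, 1, 1
add_day_row(rows, "DEF", created, date(2004, 4, 7))  # 4/7: the 4/6 row becomes 1, 0, 0
```

In a real load this would be a SQL UPDATE of yesterday's row plus an INSERT of today's, and, as noted above, the UPDATE is only needed the first time new rows are added for "today."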
