Calculating Baseline Metrics for Business Intelligence

Even a casual observer of today’s IT scene has seen the buzzword storm that has developed around various forms of technology-enabled business management and monitoring: Business Performance Management (BPM), Business Process Management (also BPM), Corporate Performance Management (CPM), Business Activity Monitoring (BAM), and others. At first glance these acronyms all appear to be Business Intelligence (BI) by other names. It can be difficult to discern whether such designations simply describe the latest BI marketing concept or are new and distinct approaches. However, it is safe to say that all these approaches depend on the tools and techniques of BI; therefore, any discussion of basic BI techniques applies equally well to all these areas. This article discusses a type of business metric commonly needed for all these approaches: baselines.

What Are Baselines?
Baselines are time-lagged calculations (usually averages of one sort or another) which provide a basis for making comparisons of past performance to current performance. A baseline may also be forward-looking, such as when you establish a goal and are seeking to determine whether the trends show you’re likely to meet that goal, an essential piece of a Key Performance Indicator (KPI). Baselines have been around for as long as there have been analytic approaches to measuring execution and its results: in business, athletics, and medicine, to name a few.

Prior to the multidimensional cube technologies associated with BI, calculating useful baselines was difficult and expensive; analysts produced baselines infrequently and only for the most critical measures. Today, BI makes it practical to create a baseline for virtually any metric at any frequency. But that new simplicity is deceptive; calculating a meaningful baseline can still be complex, in ways that are very application-specific.

In this article I’ll illustrate this deceptive complexity using an example based on workflow analytics, in which the basic metrics and their baselines relate to transaction-processing volumes in a multi-national operational setting. Don’t get hung up on this specific example though. The issues and ideas are applicable to a wide range of analytic applications.

Workflow Baselines in a Transaction-processing Setting
Workflow analytics largely involves problems that answer the question “how many”: for example, how many transactions were processed in a given unit of time, or how many customer inquiries occurred during a specific period. The historical baseline data (as opposed to targets) are the averages of such counts over a specified “trailing period.” There’s nothing complex about this in principle. The chart in Figure 1 shows how the baseline for a transaction creation volume measure (Tvol) might be presented:

Figure 1. Scorecard Baseline Presentation: The figure shows baselines in blue alongside the current values for those metrics. Note that these relationships are not always the same as their relative magnitudes.

In this example, suppose the “trailing period” for the baseline calculation were one year. At first glance this seems simple enough. To get the baseline, you could sum the last 365 days of Tvol measures and divide by 365. For comparison to “today’s” daily Tvol measure this would suffice. But in a typical analytic application Tvol could be viewed at many different Time granularities in addition to Day, such as Week, Month, Quarter, and Year; therefore, the baseline you provide must be relative to any arbitrary point in time. In other words, you must calculate the baseline dynamically. For example, to display a Tvol measure for Month MMYYYY, the baseline should be the sum of the daily measures for the 365 days preceding MMYYYY, divided by 12. At the Quarter level, it should be the sum of the daily measures for the 365 days preceding that quarter, divided by 4. Each time period has its own proper divisor.
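
To make the divisor logic concrete, here is a minimal T-SQL sketch of the Month-level case, assuming a hypothetical daily fact table named Fact_Tvol(TranDate, Tvol). (In the solution described later, this calculation is performed dynamically in the cube rather than in SQL.)

   -- Baseline for the month beginning @MonthStart: sum Tvol over the 365 days
   -- preceding that month, then divide by 12 because the comparison is at Month level.
   DECLARE @MonthStart date = '2004-04-01';

   SELECT SUM(f.Tvol) / 12.0 AS MonthlyBaseline
   FROM Fact_Tvol AS f
   WHERE f.TranDate >= DATEADD(DAY, -365, @MonthStart)
     AND f.TranDate <  @MonthStart;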

As a practical matter, having such dynamic calculations would be almost impossible without multidimensional cube technology (imagine doing this with SQL), and is not trivial even with it. Yet what has been discussed so far is only the beginning! In a real application the baseline calculations might additionally need to account for many specific conditions, such as:

  • Transactions Exist for Multiple Days. When transactions can exist for multiple days, counting problems become considerably more difficult. Such long-term transactions are common in business domains such as call centers and order fulfillment, where the lifecycle of a call or order may span more than one day. Unless the application is carefully designed, double-counting is almost inevitable when users can query any time period at any level.
  • Calculation on a Business Days Basis. Typically, measures are only generated on business days, not every day. In this case the proper divisor at the Day level is the number of business days in the prior 365 calendar days. Because of weekends, holidays, and leap years, that number is not a constant; it’s a function of the day chosen for reporting and of how much history the baseline includes.
  • Calculation with Multiple Calendars. In the sample scenario, the business processes transactions at multiple cities around the world. This means that the data encompasses multiple calendars reflecting different holidays and work norms (for example, six-day work weeks). The baseline divisor thus becomes a function not only of Time but also of Location. Aside from the complication of considering this in the baseline calculation, often there is not even a system of record for this information.
  • Effective “Live” Date Varies by Location. Suppose new locations periodically come “live” as the enterprise expands or as the transaction-processing application gets deployed around the world, such as in a phased Enterprise Resource Planning (ERP) implementation. In this case, the baseline calculation must take into account how long a particular location has been live; otherwise the baseline would be grossly distorted.

For example, suppose Paris went live in October 2002 and you’re trying to display Tvol at the Month level for December 2002. The baseline should be the sum of October 2002 and November 2002 activity divided by 2, not 12.

On the other hand, after the Paris location has been live for more than a year, the baseline calculation should behave as described earlier, on a trailing twelve-month basis. Continuing the example, if the display were for October 2003, the baseline should now be the sum of October 2002 through September 2003, divided by 12. More generally, when more than a year’s worth of historical data for a measure exists, the baseline divisor for a given Time level is a constant. In this case, the monthly divisor becomes a constant 12 whenever the “live” date is at least one year prior to the reporting date.
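
In SQL terms, the Month-level divisor for a location can be sketched as the number of whole months between the live date and the reporting month, capped at 12. The following T-SQL is a hypothetical illustration, assuming a DimLocation(LocationKey, LiveDate) table:

   DECLARE @ReportMonthStart date = '2002-12-01';  -- e.g., December 2002

   SELECT l.LocationKey,
          CASE
            WHEN l.LiveDate >= @ReportMonthStart THEN NULL                     -- not yet live: no baseline
            WHEN DATEDIFF(MONTH, l.LiveDate, @ReportMonthStart) >= 12 THEN 12  -- live a full year or more
            ELSE DATEDIFF(MONTH, l.LiveDate, @ReportMonthStart)                -- months live before the report month
          END AS MonthlyBaselineDivisor
   FROM DimLocation AS l;

For Paris (live October 2002), this yields 2 for December 2002 and 12 for any reporting month from October 2003 onward, matching the behavior that Table 1 illustrates.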

Table 1 illustrates this behavior for the Month and Quarter levels for Paris, assuming it went live in October 2002:

Table 1. Month and Quarter baseline divisors for the Paris location.

Quarter              | Baseline Divisor | Month                | Baseline Divisor
Q4 2002 and earlier  | N/A              | Oct 2002 and earlier | N/A
Q1 2003              | 1                | Nov 2002             | 1
Q2 2003              | 2                | Dec 2002             | 2
Q3 2003              | 3                | Jan 2003             | 3
Q4 2003              | 4                | Feb 2003             | 4
Q1 2004 and later    | 4                | Mar 2003             | 5
                     |                  | Apr 2003             | 6
                     |                  | May 2003             | 7
                     |                  | Jun 2003             | 8
                     |                  | Jul 2003             | 9
                     |                  | Aug 2003             | 10
                     |                  | Sep 2003             | 11
                     |                  | Oct 2003             | 12
                     |                  | Nov 2003 and later   | 12

Author’s Note: Strictly speaking, the Day level divisor for a given City never becomes constant but “vibrates” around a constant after the City has been live for a year. For example, for a City with a 5-day work week and 10 annual holidays, the nominal number of business days in a year is 250 ((52*5) – 10), the “constant”, but the actual number of business days in the preceding 365 calendar days varies day by day. It could assume values such as 249, 250, and 251, and possibly even 248 or 252. For baselines describing workflow volumes this makes very little difference, but in other applications (for example, if interest calculations were involved) it could be important.

It’s easy to see that supporting all these requirements in a dynamic manner that works for any time period and also performs responsively is not a trivial task.

A Sample Technical Implementation
One way you can achieve baselines that take all these requirements into account is with a datamart structured as a star schema feeding a Microsoft Analysis Services cube that uses MDX for calculated measures and queries. The design and population of the star schema, and the MDX expressions used in the cube, are equally important components of the solution sketched here. The scheme illustrated is intended to have the fewest constraints and the most flexibility possible, but in particular cases it might be possible to meet requirements more simply than shown, for example with one rather than multiple columns in the fact table.

The Star Schema
A typical star schema design for the sample scenario, assuming a need to track Transaction/Day data, would have one row in the fact table per Transaction/Day combination. Or, if each transaction has many measures or other data not at a Day grain (such as financial balances), one might design two fact tables joined in a view: one containing the transaction data and one containing the Transaction/Day data. Such a design would represent best practices, yet be sub-optimal for counting problems such as the one at hand. With multiple day rows for the same transaction, how would you count Tvol? A Tvol query at any given level of Time granularity must count each transaction exactly once. Further, logic dictates that when counting Tvol at the Day level, you should count a transaction only on the day it was created. If counting at the Week level, you’d count a transaction if it were created on any day in that week. At the Month level, it counts if it was created on any day in the month. You would treat levels such as Quarter and Year in like fashion.

Achieving such behavior using “COUNT” functions, in either the SQL against the datamart or in the cube MDX, without double-counting or excessively complex code, would be difficult if not impossible. To avoid this, the key idea for the datamart design is to change the “counting” problem to a “summing” problem. While this complicates the fact table load logic, it makes it possible to get a reliable count at any level of the Time hierarchy in a relatively simple way.

To do this, you must add an integer column to the Transaction/Day fact table for each level of the Time hierarchy: for example, for Day, Week, Month, Quarter, and Year. These columns are then populated such that when you sum the appropriate column over the requested Time level, you obtain the correct count.
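
As a concrete illustration, a T-SQL sketch of such a fact table might look like the following. The table and column names are hypothetical, and the Quarter and Year flag columns would follow the same pattern as Week and Month:

   CREATE TABLE Fact_TransactionDay (
       TransactionID  varchar(20) NOT NULL,  -- transaction identifier
       DayKey         date        NOT NULL,  -- grain: one row per transaction per day
       IsCountedDay   tinyint     NOT NULL,  -- 1 only on the creation-day row
       IsCountedWeek  tinyint     NOT NULL,  -- 1 only on the latest row within the creation week
       IsCountedMonth tinyint     NOT NULL,  -- 1 only on the latest row within the creation month
       -- IsCountedQuarter and IsCountedYear would be defined the same way
       CONSTRAINT PK_Fact_TransactionDay PRIMARY KEY (TransactionID, DayKey)
   );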

The easiest way to understand this is to look at an example (see Table 2). Assume “today” is 4/6/2004, and that the earliest date each transaction appears is its creation date, after which it has Day rows for each consecutive Day until either its lifecycle is over or it is “today” (for incomplete transactions). In this example 3/22/2004 is a Monday, and no rows are added on weekends. The “IsCounted” columns are for counting transaction creation volume by summing.

Table 2. The table shows how values in the “IsCounted” columns might look for the Day, Week, and Month levels.

Transaction | Day       | IsCountedDay | IsCountedWeek | IsCountedMonth
ABC         | 3/22/2004 | 1            | 0             | 0
ABC         | 3/23/2004 | 0            | 0             | 0
ABC         | 3/24/2004 | 0            | 0             | 0
ABC         | 3/25/2004 | 0            | 0             | 0
ABC         | 3/26/2004 | 0            | 1             | 0
ABC         | 3/29/2004 | 0            | 0             | 1
XYZ         | 3/25/2004 | 1            | 0             | 0
XYZ         | 3/26/2004 | 0            | 1             | 0
XYZ         | 3/29/2004 | 0            | 0             | 0
XYZ         | 3/30/2004 | 0            | 0             | 0
XYZ         | 3/31/2004 | 0            | 0             | 1
XYZ         | 4/1/2004  | 0            | 0             | 0
XYZ         | 4/2/2004  | 0            | 0             | 0
XYZ         | 4/5/2004  | 0            | 0             | 0
XYZ         | 4/6/2004  | 0            | 0             | 0
DEF         | 4/6/2004  | 1            | 1             | 1

If the structure shown in Table 2 were queried via SQL, queries above the Day level in the Time dimension would be expected to convert the requested level to Day-level constraints, either directly through constraints on Day or by joining Day to a dimension table that performs the conversion.

From the data shown in Table 2, you can follow this logic:

For each transaction, each count column contains zero in every row except one. Summing on the count column associated with the Time constraint gives the correct volume count. For example, for Day 3/22/2004, summing the IsCountedDay column shows one transaction created (ABC). For the Week of 3/22/2004-3/26/2004, summing the IsCountedWeek column shows two transactions created (ABC and XYZ). For the Month of March (where Day would be constrained to 3/1/2004-3/31/2004), summing the IsCountedMonth column also shows two transactions created (ABC and XYZ).
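
In T-SQL against the hypothetical Fact_TransactionDay table sketched above (with a hypothetical DimDate(DayKey, MonthStartDate) dimension supplying the higher-level attributes), the counting-by-summing queries might look like this:

   -- Day level: constrain the Day directly and sum IsCountedDay.
   SELECT SUM(f.IsCountedDay) AS Tvol
   FROM Fact_TransactionDay AS f
   WHERE f.DayKey = '2004-03-22';                 -- returns 1 (ABC)

   -- Month level: convert the Month constraint to Day via the date dimension
   -- and sum IsCountedMonth.
   SELECT SUM(f.IsCountedMonth) AS Tvol
   FROM Fact_TransactionDay AS f
   JOIN DimDate AS d ON d.DayKey = f.DayKey
   WHERE d.MonthStartDate = '2004-03-01';         -- returns 2 (ABC and XYZ)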

Maintaining this table as the Time dimension goes forward does require the additional overhead of updating rows for “yesterday” when adding new rows for the same transaction “today.” For example, DEF looks as shown above when 4/6/2004 is “today.” When 4/7/2004 becomes “today,” if DEF is still active, it then has two rows which look like Table 3:

Table 3. The table illustrates the process used to maintain the “IsCounted” columns on a day-by-day basis.

Transaction | Day       | IsCountedDay | IsCountedWeek | IsCountedMonth
DEF         | 4/6/2004  | 1            | 0             | 0
DEF         | 4/7/2004  | 0            | 1             | 1

Because the update of “yesterday’s” row is only necessary the first time new rows are added for “today,” you can improve performance by taking this into account if the fact table is updated more than once a day.
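
A hedged T-SQL sketch of that daily maintenance step for a single still-active transaction might look like the following; the transaction ID, dates, and table names are the hypothetical ones used earlier, and the Quarter and Year flags would be handled the same way:

   DECLARE @TranID  varchar(20) = 'DEF',
           @Created date = '2004-04-06',   -- the transaction's creation date
           @Today   date = '2004-04-07';

   -- If "today" is still inside the creation week/month, move those flags forward
   -- by zeroing them on the most recent prior row. The Day flag never moves.
   UPDATE f
   SET IsCountedWeek  = CASE WHEN DATEDIFF(WEEK,  @Created, @Today) = 0 THEN 0 ELSE f.IsCountedWeek  END,
       IsCountedMonth = CASE WHEN DATEDIFF(MONTH, @Created, @Today) = 0 THEN 0 ELSE f.IsCountedMonth END
   FROM Fact_TransactionDay AS f
   WHERE f.TransactionID = @TranID
     AND f.DayKey = (SELECT MAX(DayKey) FROM Fact_TransactionDay
                     WHERE TransactionID = @TranID AND DayKey < @Today);

   -- Insert today's row, carrying the week/month flags only if "today" is still
   -- inside the creation week/month.
   INSERT INTO Fact_TransactionDay (TransactionID, DayKey, IsCountedDay, IsCountedWeek, IsCountedMonth)
   VALUES (@TranID, @Today, 0,
           CASE WHEN DATEDIFF(WEEK,  @Created, @Today) = 0 THEN 1 ELSE 0 END,
           CASE WHEN DATEDIFF(MONTH, @Created, @Today) = 0 THEN 1 ELSE 0 END);

Running this for DEF with 4/7/2004 as “today” reproduces the two rows shown in Table 3.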

None of this deals with the multi-calendar, multi-location issues noted earlier; these are left to the Analysis Services portion of the solution.

The code to maintain this table will not be simple, but it makes the SQL for counting Tvol both simple and immune to double-counting. See Sidebar 1 for useful variations on this technique.

Some MDX Examples
With a star schema created as described, the next step is to use Analysis Services technology to take advantage of it. Assume that you’re translating the star schema into a cube using standard techniques, with the IsCounted facts becoming measures in the cube with “Sum” as their Aggregate Property. After doing that, you can create calculated members (CMs) to complete the solution. Assume the Time dimension is named [Date]. These MDX examples were created to illustrate the ideas, so consider them templates rather than production-ready code.

The first requirement is to create a CM to sum the proper “IsCounted” measure based on the current Date context. Calling it TranCount, the MDX would look like:

   iif([Date].CurrentMember.Level.Ordinal = 0,
      NULL, -- NULL if the [Date] dimension is at the "All" level
   iif([Date].CurrentMember.Level.Name = "Day",
      [Measures].[IsCountedDay],
   iif([Date].CurrentMember.Level.Name = "Week",
      [Measures].[IsCountedWeek],
   iif([Date].CurrentMember.Level.Name = "Month",
      [Measures].[IsCountedMonth],
   iif([Date].CurrentMember.Level.Name = "Quarter",
      [Measures].[IsCountedQuarter],
      [Measures].[IsCountedYear]
   )))))

TranCount provides Tvol for whatever the current Date constraint is, such as MMYYYY in the earlier example.

The baseline CM itself is defined as the ratio of two other CMs: BaselineNumerator / BaselineDenominator.

The BaselineNumerator needs to sum the previously defined TranCount over the one year trailing period relative to the current period. This suggests the MDX functions ParallelPeriod and Sum. A template for BaselineNumerator looks like:

   Sum({ParallelPeriod([Date].[Year], 1, [Date].CurrentMember) :
        [Date].CurrentMember.PrevMember},
       [Measures].[TranCount])

Note that what this deceptively simple expression actually does is sum TranCount over the set of Date members running from the member one year back from the current Date member up to the member immediately preceding the current member. For example, if the current member is Q1 2004, the result sums Q1 2003 : Q4 2003. Also note that you don’t need to add any explicit code regarding when a Location went live, because the expression only sums “actuals.” For a given Location, it will add either history-to-date (if less than a year) or a trailing year’s worth, depending on the value of [Date].CurrentMember. The ParallelPeriod function also automatically does the right thing regardless of the current level in the [Date] hierarchy.

A complete version of BaselineNumerator would require logic to handle hitting the “edge” of the cube (i.e. when there is no PrevMember) to avoid generating an error.

Next, BaselineDenominator must derive the correct divisor for the baseline. This calculation must take into account the business rules regarding Location-based calendars, when the Location went live relative to the Date constraint of the query, and the level within the Date hierarchy of the query. Because this CM is the most complex part of the solution and is very application-specific, I’ll sketch only high-level ideas here.

First, you need a way to get the number of business days in the prior year given a Day and a Location. Within the MDX environment, the most practical way to do this is to use the MDX LookupCube function against another cube (here named BLDays). This other cube and the star schema feeding it would need to be specially designed for this purpose, and would have the following characteristics:

  • Its dimensions are the [Date] and [Location] dimensions of the main cube, which therefore must be shared dimensions in Analysis Services terms.
  • It must be actively maintained so as to contain data for every combination of Date and Location; thus it is not a static object but must be updated as Time goes forward and as new Locations go live.
  • The star schema has one integer fact, BusinessDaysInLastYear. This value is the number of business days in the preceding year as a function of Day and Location, taking into account the calendar for the Location and when the Location went live. BusinessDaysInLastYear is zero for Days prior to the date the Location went live.
  • In effect, the complicated business rules are pre-calculated in this schema (see the sketch after this list). This not only improves query-time performance but is probably the only practical way to keep query complexity manageable.
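
A T-SQL sketch of that pre-calculation, assuming hypothetical DimDate(DayKey), DimLocation(LocationKey, LiveDate), and LocationCalendar(LocationKey, DayKey, IsBusinessDay) tables as the calendar source, might look like this:

   INSERT INTO Fact_BLDays (DayKey, LocationKey, BusinessDaysInLastYear)
   SELECT d.DayKey,
          l.LocationKey,
          CASE
            WHEN d.DayKey < l.LiveDate THEN 0     -- location not yet live on this day
            ELSE (SELECT COUNT(*)                 -- business days in the trailing year,
                  FROM LocationCalendar AS c      -- but never earlier than the live date
                  WHERE c.LocationKey   = l.LocationKey
                    AND c.IsBusinessDay = 1
                    AND c.DayKey >= CASE WHEN DATEADD(DAY, -365, d.DayKey) > l.LiveDate
                                         THEN DATEADD(DAY, -365, d.DayKey)
                                         ELSE l.LiveDate END
                    AND c.DayKey < d.DayKey)
          END
   FROM DimDate AS d
   CROSS JOIN DimLocation AS l;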

The lookup code, when Date is at Day level, looks like:

   LookupCube("BLDays", "(" +          MemberToStr([Date].[ByMonth].CurrentMember) + "," +         MemberToStr([Location].CurrentMember)+       ",[Measures].[BusinessDaysLastYear])")

The preceding query gives the required number of business days in the prior year.

Unfortunately it’s still not complete. Recall that at higher levels of the [Date] hierarchy, the proper divisor is not the number of business days in the prior year but the number of time periods at the same level in the prior year (taking the Location live date into account). The following expression gives an idea of how to achieve this:

   LookupCube("BLDays", "Sum({ParallelPeriod([Date].[Year],1," +       MemberToStr([Date].CurrentMember) + "):" +       MemberToStr([Date].CurrentMember.PrevMember) + "},iif((" +       MemberToStr([Location].CurrentMember) +       ",[Measures].[BusinessDaysInLastYear])>0,1,0))"))

The (simplified) sub-expression from this is:

   iif(([Location].CurrentMember,
        [Measures].[BusinessDaysInLastYear]) > 0, 1, 0)

When evaluated in the context of the BLDays cube, the preceding expression will yield “1” if [Location].CurrentMember was live within the current Date period, and “0” if not. This will be evaluated for each Date member generated by the ParallelPeriod function (as described for BaselineNumerator). Sum then adds the results. For example, if the current Date level is Month and the current Location were live only during the two months prior to the current month, the result would be “2,” the desired divisor. Note that this logic considers a Location live for an entire Date period if it was live in any part of that period.

A complete version of BaselineDenominator must knit together the two lookup expressions above to use the appropriate one based on the current level in the [Date] dimension.

As you can see, calculating realistic baselines can be far more complex than taking a simple average. If you neglect to take the appropriate business rules into account, your baseline data won’t reflect reality, and any comparisons made against it will be misleading.
