RSS Feed
Download our iPhone app
Browse DevX
Sign up for e-mail newsletters from DevX


Build a Generic Histogram Generator for SQL Server : Page 6

Histograms help people analyze large amounts of data, whether you display them as tables or as charts. This article shows you how to do both.


Smooth Sailing
In the sections on temporal histograms I did not mention smoothing, but it was implicit in the discussion. Recall that if you specify a @DateTemplate of, say, yyyy-mm for a temporal histogram, the first bucket spans the entire month of the lower boundary date, irrespective of whether the lower boundary date specifies the first of that month. The analogous situation applies at the trailing end of the final bucket. If your date template specifies only years, then buckets are "smoothed" to full years automatically.

With numerical histograms, on the other hand, smoothing is explicit and under your control via the @Smooth parameter. You saw in the earlier example (Table 7 vs. Table 8) how boundary smoothing can be quite useful.

The boundary smoothing algorithm is a table-valued function used by the histogram stored procedure. But it is designed as a public function, complete with API documentation. If you're wondering why values get smoothed to certain boundaries, you can experiment with the BoundSmoothing function directly. The output of the BoundSmoothing function in this case is a single row containing nine values. In practice you need only the final two values; the others are present to allow for learning or tinkering, and to allow a discussion of smoothing many different input values in this next example. This query takes a series of representative samples and reports them in a single result set; it is essentially a unit test for the function.

   select * from dbo.BoundSmoothing(3, 21)
   union select * from dbo.BoundSmoothing(71, 76)
   union select * from dbo.BoundSmoothing(19, 76)
   union select * from dbo.BoundSmoothing(119, 176)
   union select * from dbo.BoundSmoothing(119, 276)
   union select * from dbo.BoundSmoothing(119, 976)
   union select * from dbo.BoundSmoothing(119, 676)
   union select * from dbo.BoundSmoothing(49, 976)
   union select * from dbo.BoundSmoothing(199, 899)
   union select * from dbo.BoundSmoothing(199, 901)
   union select * from dbo.BoundSmoothing(199, 976)
   union select * from dbo.BoundSmoothing(201, 976)
   union select * from dbo.BoundSmoothing(119, 1176)
   union select * from dbo.BoundSmoothing(119, 1499)
   union select * from dbo.BoundSmoothing(119, 1501)
   union select * from dbo.BoundSmoothing(319, 1576)
   union select * from dbo.BoundSmoothing(2319, 11576)
   union select * from dbo.BoundSmoothing(12383, 12389)
   union select * from dbo.BoundSmoothing(12319, 12389)
   union select * from dbo.BoundSmoothing(12319, 84111)
   union select * from dbo.BoundSmoothing(12319, 90111)
   order by lownum, highnum

Figure 5 shows the output of these 21 test cases. Row 1 for example, takes the lower bound of 3 and upper bound of 21 and returns the numbers unchanged. Why? The numbers are close together (the value in the closeness column is small). You can look for other values where closeness is small—rows 4 and 21 qualify. Each of these cases returns the outputs unchanged. For example, it would not make much sense to take the range 12383-12389 and smooth it to 12000-13000.

Figure 5. Bound Smoothing Scenarios: Each row in the result set represents a representative sample for adjusting upper and lower boundaries.

As promised earlier, here's a quantitative description of the general smoothing algorithm, using the rows from Figure 5 as examples.

  • When the upper and lower bounds are close together (less than 25 units apart), they are returned unchanged. See rows 1, 4, and 21 as previously mentioned.
  • If the upper bound is within 25 percent of itself rounded to its most significant digit increased by one with remaining digits zero then use that value; otherwise, round to the penultimate significant digit increased by one with remaining digits zero. This is a mouthful that is easier to digest by example, illustrated by rows 10 and 11. In both cases (taking either 1499 or 1501, the most significant digit is 1, increased by 1 is 2, and filled out with zeroes yields 2000 as the tentative smoothed value. The value 1501 is within 25 percent of 2000 so it accepts 2000 but 1499 is just below the 25 percent threshold so it rejects it in favor of looking at the next digit, the 4. Bumping 14 up by 1 yields 15, and filling with zeroes produces the 1500 value that you see for row 10.
  • If the upper bound is within 10 percent of itself rounded to its most significant digit increased by one with remaining digits zero, and the lower bound relative to the smoothed upper bound is within 20 percent of zero, then use zero as the lower bound. Compare rows 19 and 20, for example. In both rows the distance between the lower bound and zero is constant. But in row 20 the upper bound is sufficiently larger than the lower bound to warrant smoothing the lower bound to zero. Rows 14 and 15 show the same thing with the input lower bound immediately above and below the 20 percent threshold.

Aggregate Operators
Histograms by their very nature involve aggregation of some one or more aspects of a set of data. The histogram stored procedure provides a simple mechanism for explicitly factoring in SQL aggregate operators as well using the @Aggregate parameter. For example, suppose you wanted to explore the question of how many orders each customer has. The following query returns the number of orders per customer. Each row of the result set summarizes one customer.

   SELECT CustomerID, count(SalesOrderID) AS OrderCount
   FROM Sales.SalesOrderHeader
   GROUP BY CustomerID

For AdventureWorks, that yields almost 20,000 rows—quite a lot of data. You could, of course, sort that result in one direction to see which customers ordered the most, or the opposite direction to see which ordered the least, etc. But for a good overview, funnel the request through the histogram stored procedure. This next query condenses the result to a single row (a summary histogram) by performing the same aggregation as above—counting orders per customer—then generating a standard histogram on that intermediate result (see Table 12):

   exec histogram
Table 12. Explicit Aggregation: A histogram with explicit aggregation using the SQL count() function.
0-2 3-5 6-8 9-11 12-14 15-17 18-20 21-23 24-26 27-30+
17122 1660 210 28 64 21 0 0 1 13

Author's Note: The aggregate operation is the only operation discussed in this article that does not work in SQL Server 2000. Everything else works in both SQL Server 2000 and 2005.

Summary of Histogram Usage
For your convenience, Table 13 presents a summary of parameter usage for all three histogram types. Again, refer to the formal Histogram API for more details.

Table 13. Parameter Usage: Here are the parameters for the three histogram types, defined by their @Mode values: DateRange, DateMonths, and NumberRange.
Parameter DateRange DateMonths NumberRange
@SchemaName optional optional optional
@TableName required required required
@BucketExpr required required required
@BucketFilter optional optional optional
@IdExpr optional optional optional (NA if @Aggregate used)

optional (and only when @IdExpr is also used)


optional (and only when @IdExpr is also used)

@Aggregate NA NA optional
@DateTemplate required

required if: @IdExpr is a date field

@Mode required if:bounds are omitted and@BucketExpr is not a date field required required if:bounds are omitted and@BucketExpr is not an integer or money field
@LowerBoundDate optional optional NA
@UpperBoundDate optional optional NA
@Buckets optional NA NA
@LowerBoundInt NA NA optional
@UpperBoundInt NA NA optional
@Smooth NA NA optional
@Verbose optional optional optional
@Process optional optional optional

Close Icon
Thanks for your registration, follow us on our social networks to keep up-to-date