WEBINAR:
OnDemand
Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js
Smooth Sailing
In the sections on temporal histograms I did not mention smoothing, but it was implicit in the discussion. Recall that if you specify a @DateTemplate of, say, yyyymm for a temporal histogram, the first bucket spans the entire month of the lower boundary date, irrespective of whether the lower boundary date specifies the first of that month. The analogous situation applies at the trailing end of the final bucket. If your date template specifies only years, then buckets are "smoothed" to full years automatically.
With numerical histograms, on the other hand, smoothing is explicit and under your control via the @Smooth parameter. You saw in the earlier example (Table 7 vs. Table 8) how boundary smoothing can be quite useful.
The boundary smoothing algorithm is a tablevalued function used by the histogram stored procedure. But it is designed as a public function, complete with API documentation. If you're wondering why values get smoothed to certain boundaries, you can experiment with the BoundSmoothing function directly. The output of the BoundSmoothing function in this case is a single row containing nine values. In practice you need only the final two values; the others are present to allow for learning or tinkering, and to allow a discussion of smoothing many different input values in this next example. This query takes a series of representative samples and reports them in a single result set; it is essentially a unit test for the function.
select * from dbo.BoundSmoothing(3, 21)
union select * from dbo.BoundSmoothing(71, 76)
union select * from dbo.BoundSmoothing(19, 76)
union select * from dbo.BoundSmoothing(119, 176)
union select * from dbo.BoundSmoothing(119, 276)
union select * from dbo.BoundSmoothing(119, 976)
union select * from dbo.BoundSmoothing(119, 676)
union select * from dbo.BoundSmoothing(49, 976)
union select * from dbo.BoundSmoothing(199, 899)
union select * from dbo.BoundSmoothing(199, 901)
union select * from dbo.BoundSmoothing(199, 976)
union select * from dbo.BoundSmoothing(201, 976)
union select * from dbo.BoundSmoothing(119, 1176)
union select * from dbo.BoundSmoothing(119, 1499)
union select * from dbo.BoundSmoothing(119, 1501)
union select * from dbo.BoundSmoothing(319, 1576)
union select * from dbo.BoundSmoothing(2319, 11576)
union select * from dbo.BoundSmoothing(12383, 12389)
union select * from dbo.BoundSmoothing(12319, 12389)
union select * from dbo.BoundSmoothing(12319, 84111)
union select * from dbo.BoundSmoothing(12319, 90111)
order by lownum, highnum
Figure 5 shows the output of these 21 test cases. Row 1 for example, takes the lower bound of 3 and upper bound of 21 and returns the numbers unchanged. Why? The numbers are close together (the value in the closeness column is small). You can look for other values where closeness is small—rows 4 and 21 qualify. Each of these cases returns the outputs unchanged. For example, it would not make much sense to take the range 1238312389 and smooth it to 1200013000.


Figure 5. Bound Smoothing Scenarios: Each row in the result set represents a representative sample for adjusting upper and lower boundaries. 
As promised earlier, here's a quantitative description of the general smoothing algorithm, using the rows from Figure 5 as examples.
 When the upper and lower bounds are close together (less than 25 units apart), they are returned unchanged. See rows 1, 4, and 21 as previously mentioned.
 If the upper bound is within 25 percent of itself rounded to its most significant digit increased by one with remaining digits zero then use that value; otherwise, round to the penultimate significant digit increased by one with remaining digits zero. This is a mouthful that is easier to digest by example, illustrated by rows 10 and 11. In both cases (taking either 1499 or 1501, the most significant digit is 1, increased by 1 is 2, and filled out with zeroes yields 2000 as the tentative smoothed value. The value 1501 is within 25 percent of 2000 so it accepts 2000 but 1499 is just below the 25 percent threshold so it rejects it in favor of looking at the next digit, the 4. Bumping 14 up by 1 yields 15, and filling with zeroes produces the 1500 value that you see for row 10.
 If the upper bound is within 10 percent of itself rounded to its most significant digit increased by one with remaining digits zero, and the lower bound relative to the smoothed upper bound is within 20 percent of zero, then use zero as the lower bound. Compare rows 19 and 20, for example. In both rows the distance between the lower bound and zero is constant. But in row 20 the upper bound is sufficiently larger than the lower bound to warrant smoothing the lower bound to zero. Rows 14 and 15 show the same thing with the input lower bound immediately above and below the 20 percent threshold.
Aggregate Operators
Histograms by their very nature involve aggregation of some one or more aspects of a set of data. The histogram stored procedure provides a simple mechanism for explicitly factoring in SQL aggregate operators as well using the @Aggregate parameter. For example, suppose you wanted to explore the question of how many orders each customer has. The following query returns the number of orders per customer. Each row of the result set summarizes one customer.
SELECT CustomerID, count(SalesOrderID) AS OrderCount
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
For AdventureWorks, that yields almost 20,000 rows—quite a lot of data. You could, of course, sort that result in one direction to see which customers ordered the most, or the opposite direction to see which ordered the least, etc. But for a good overview, funnel the request through the histogram stored procedure. This next query condenses the result to a single row (a summary histogram) by performing the same aggregation as above—counting orders per customer—then generating a standard histogram on that intermediate result (see Table 12):
exec histogram
@SchemaName='Sales',
@TableName='SalesOrderHeader',
@BucketExpr='count(SalesOrderID)',
@Aggregate='CustomerID'
Table 12. Explicit Aggregation: A histogram with explicit aggregation using the SQL count() function.
02 
35 
68 
911 
1214 
1517 
1820 
2123 
2426 
2730+ 
17122 
1660 
210 
28 
64 
21 
0 
0 
1 
13 
Author's Note: The aggregate operation is the only operation discussed in this article that does not work in SQL Server 2000. Everything else works in both SQL Server 2000 and 2005. 
Summary of Histogram Usage
For your convenience, Table 13 presents a summary of parameter usage for all three histogram types. Again, refer to the formal Histogram API for more details.
Table 13. Parameter Usage: Here are the parameters for the three histogram types, defined by their @Mode values: DateRange, DateMonths, and NumberRange.
Parameter 
DateRange 
DateMonths 
NumberRange 
@SchemaName 
optional 
optional 
optional 
@TableName 
required 
required 
required 
@BucketExpr 
required 
required 
required 
@BucketFilter 
optional 
optional 
optional 
@IdExpr 
optional 
optional 
optional (NA if @Aggregate used) 
@IdFilter 
optional (and only when @IdExpr is also used)

@IdAlias 
optional (and only when @IdExpr is also used)

@Aggregate 
NA 
NA 
optional 
@DateTemplate 
required 
required if: @IdExpr is a date field

@Mode 
required if:bounds are omitted and@BucketExpr is not a date field 
required 
required if:bounds are omitted and@BucketExpr is not an integer or money field 
@LowerBoundDate 
optional 
optional 
NA 
@UpperBoundDate 
optional 
optional 
NA 
@Buckets 
optional 
NA 
NA 
@LowerBoundInt 
NA 
NA 
optional 
@UpperBoundInt 
NA 
NA 
optional 
@Smooth 
NA 
NA 
optional 
@Verbose 
optional 
optional 
optional 
@Process 
optional 
optional 
optional 