Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Build a Generic Histogram Generator for SQL Server : Page 5

Histograms help people analyze large amounts of data, whether you display them as tables or as charts. This article shows you how to do both.


advertisement

Numerical Histograms
The final type of histogram to consider is a numerical histogram. This type—specified by the @Mode value of NumberRange—tabulates buckets of integers rather than dates. Still using the Purchasing.PurchaseOrderHeader table, this next example (see Table 7) addresses the question: What is the distribution of orders by sales amount?

   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='SubTotal'
Table 7. Numerical Histogram: A numerical histogram with boundary smoothing yields buckets without "sharp edges".
0-
99999
100000-
199999
200000-
299999
300000-
399999
400000-
499999
500000-
599999
600000-
699999
700000-
799999
800000-
899999
900000-
1000000+
4001 8 0 1 0 1 0 0 0 1


This type of histogram also automatically determines boundaries if you do not supply them. The smallest value in Table 7 is 37 and the largest value is 997,680. Having those values as the bucket boundaries, though, would be atypical at best, and possibly misleading. You usually want boundary values to be more regular: in this case, starting the first bucket with zero and ending the final bucket with one million is more palatable. I call this process "smoothing."

To convince you that smoothing is generally a good thing, Table 8 shows the same query with smoothing turned off. Smoothing is on by default, so to turn it off, you merely add the @Smooth parameter and set it to zero. Smoothing involves adjusting each boundary value so that it is divisible by an appropriate power of ten and giving the lower boundary a tendency towards zero when it is nearby. The key qualitative terms in that statement, shown italicized, are described more quantitatively a bit later.



   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='SubTotal',
       @Smooth=0
Table 8. Numerical Histogram without Smoothing: A numerical histogram without boundary smoothing yields buckets that are much less intuitive.
37-
99801
99802-
199566
199567-
299331
299332-
399096
399097-
498861
498862-
598626
598627-
698391
698392-
798156
798157-
897921
897922-
997680-
4001 8 0 1 0 1 0 0 0 1


Reprise: How Big is a Bucket?
With temporal histograms, buckets span a day, a month, or a year, as you have seen. With numerical histograms the details are more intricate.

The plus sign following the upper bound on the final bucket in Table 7 is an indicator that that bucket is larger than all the other buckets (though not by much): the first nine buckets all span 100,000 values while the final bucket spans 100,001 values. That is a minor artifact of a design decision on my part, to have a tendency towards starting with zero and concluding with a power of ten.

The notation in Table 8 makes it a little harder to spot, but look at the column name of the final bucket: "897922-997680-". That final minus sign is not a stray character. This minus sign indicates that the final bucket is smaller than all the others. Here, the first nine buckets span 99,765 values while the final bucket spans 99,759 values. In this case, the bucket sizes were influenced not by a design decision, but by the data. While you can make a theoretical case that buckets should always be the same size, the data is not always so accommodating. The reason for the discrepancy in the final bucket is simply due to division in the integer domain. The boundaries of each bucket are constrained to be integers and there are an integral number of buckets—but division operations typically yield values outside the integer domain, introducing the difference.

Bucket Mathematics
You do not need to fully appreciate the mathematics but here are some brief details: Consider that we have n buckets with l as the lower bound and u as the upper bound. The span of the buckets is determined as shown in Figure 4. Read the first equation as "the span of buckets 1 through (n-1) is…" and the second equation as "the span of the final bucket (bucket n) is…". The brackets indicate a mathematical ceiling—the closest integer equal to or greater than the real value. The span of the final bucket essentially gets the leftovers due to rounding to integer values. It may sometimes be the same as other buckets, but it will frequently be different, though always within n units of the other buckets.

 
Figure 4. Bucket Size Determination: The final bucket span will frequently be close, but not equal to, the span of the other buckets.

Numerical Histogram: Summary vs. Detail
Just as your temporal histograms can be either summary or detail histograms, numerical histograms use exactly the same mechanism, through the @IdExpr parameter. This allows you to expand a summary histogram and relate another variable to its buckets. To make it more interesting, the second variable here is a date. This query itemizes the order totals by day:

   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='SubTotal',
       @Buckets=5,
       <b>@IdExpr='OrderDate',</b>
       @IdFilter='OrderDate < ''4/1/2002'''
   

Table 9 shows the result set, where each OrderDate is formatted in its default state of date and time down to milliseconds. But this does not yield any additional information. Each individual OrderDate value will be enumerated, likely guaranteeing that each one will be tabulated in the first column. From Table 9, it appears that OrderDate is really just a date and does not include a time (because all the values show midnight as the time component). So this histogram is grouping all orders on a given day into a single row. If the field you picked happened to also have times in it, then you would likely get just one order per row without any grouping. Note that the presence of the @IdFilter is optional here; it's used only to limit the number of rows in the result set.

Table 9. Numerical Detail Histogram: This numerical detail histogram uses a raw datetime value.
OrderDate 0-
199999
200000-
399999
400000-
599999
600000-
799999
800000-
1000000+
2001-05-17 00:00:00.000 4 0 0 0 0
2001-05-31 00:00:00.000 4 0 0 0 0
2002-01-14 00:00:00.000 8 0 0 0 0
2002-01-15 00:00:00.000 12 0 0 0 0
2002-02-08 00:00:00.000 4 0 0 0 0
2002-02-16 00:00:00.000 16 0 0 0 0
2002-02-20 00:00:00.000 4 0 0 0 0
2002-02-24 00:00:00.000 16 0 0 0 0
2002-02-25 00:00:00.000 4 0 0 0 0
2002-03-12 00:00:00.000 16 0 0 0 0
2002-03-26 00:00:00.000 4 0 0 0 0
2002-03-30 00:00:00.000 4 0 0 0 0


Wouldn't it be nice, though, if you could group dates by day, month, or year just as you could with temporal histograms? You can, using the @IdExpr parameter. It is an ID expression rather than just an ID field name because it may contain an arbitrary expression; using a field name is simply one possible case of a generalized expression. The next example leverages this expressive power.

   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='SubTotal',
       @Buckets=5,
       @IdExpr='cast(year(OrderDate) as varchar) + ''/'' +
                cast(month(OrderDate) as varchar)',
       @IdAlias='Date',
       @IdFilter='OrderDate < ''1/1/2004'''
Author's Note: In the procedure call above, the value of @IdExpr is a single string, so it must all be on one line; I have split it across two lines here just for clarity.


The @IdExpr is a derived value, a concatenation of the year of the OrderDate, then a virgule, then the month. Since @IdExpr is a string parameter, the month and year values must be cast to varchar within the expression. Also note that to embed a string literal within a string, SQL Server uses the convention of doubling the quote character; this is used in both the @IdExpr value and the @IdFilter value above. The @IdAlias, on the other hand, is recommended whenever your @IdExpr is not a simple field name. Without it, SQL Server Management Studio has to come up with its own name for the anonymous column (it labels it (No column name)). The @IdAlias parameter gives you the opportunity to supply a more suitable name—see Table 10.

Table 10. Using Derived Values: This numerical detail histogram uses a derived value for the row identifier. Note that the sort order is suboptimal.
Date 0-
199999
200000-
399999
400000-
599999
600000-
799999
800000-
1000000+
2001/5 8 0 0 0 0
2002/1 20 0 0 0 0
2002/10 20 0 0 0 0
2002/11 24 0 0 0 0
2002/2 44 0 0 0 0
2002/3 24 0 0 0 0
2002/4 40 0 0 0 0
2002/5 16 0 0 0 0
2002/6 4 0 0 0 0
2002/7 44 0 0 0 0
2002/8 8 0 0 0 0
2002/9 28 0 0 0 0
2003/10 240 0 0 0 0
2003/11 92 0 0 0 0
2003/12 231 0 0 0 0
2003/3 16 0 0 0 0
2003/5 40 0 0 0 0
2003/6 56 0 0 0 0
2003/7 12 0 0 0 0
2003/8 12 0 0 0 0
2003/9 336 0 0 0 0


Table 10 achieves grouping, but suffers from one glaring deficiency: the sort order is suboptimal. The output has the date values sorted lexically, as strings, rather than semantically, as dates. You could modify the date expression to yield a better sortable token, but rather than go through the machinations for that, here is a better way:

   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='SubTotal',
       @Buckets=5,
       <b>@IdExpr='OrderDate',
       @DateTemplate='yyyy/mm',</b>
       @IdAlias='Date',
       @IdFilter='OrderDate < ''1/1/2004'''

This procedure call replaces the complicated @IdExpr value with the simple field name again—a field that is of type datetime, which is significant. Then, the extra magic comes in by adding the familiar @DateTemplate parameter to specify how to format the @IdExpr. Why is @IdAlias still present? I just stated above that it is only needed if @IdExpr is not a simple field name, but that was not completely accurate. You also need @IdAlias when that simple field is a date and you include a @DateTemplate. A glance at the generated query shows you why:

   SELECT
       RIGHT<b>('0000'+cast(YEAR(OrderDate) as varchar),4) + '/' +
       RIGHT('00'+cast(MONTH(OrderDate) as varchar),2)</b> as [Date],
    COUNT(CASE WHEN SubTotal >= 0 AND SubTotal <= 199999 THEN 1 END) as '0-199999',
    COUNT(CASE WHEN SubTotal >= 200000 AND SubTotal <= 399999 THEN 1 END)
           as '200000-399999',
    COUNT(CASE WHEN SubTotal >= 400000 AND SubTotal <= 599999 THEN 1 END)
           as '400000-599999',
    COUNT(CASE WHEN SubTotal >= 600000 AND SubTotal <= 799999 THEN 1 END)
           as '600000-799999',
    COUNT(CASE WHEN SubTotal >= 800000 AND SubTotal <= 1000000 THEN 1 END)
           as '800000-1000000+'
FROM [Purchasing].[PurchaseOrderHeader] (nolock) WHERE OrderDate < '1/1/2004'
GROUP BY
       RIGHT('0000'+cast(YEAR(OrderDate) as varchar),4) + '/' +
       RIGHT('00'+cast(MONTH(OrderDate) as varchar),2)
ORDER BY
       RIGHT('0000'+cast(YEAR(OrderDate) as varchar),4) + '/' +
       RIGHT('00'+cast(MONTH(OrderDate) as varchar),2)

As you can see by the highlighted code, what actually goes into that column is a derived expression; without the alias, it again becomes an anonymous column. Table 11 shows the result set with the Date column now sorted properly. Note that you could even set @IdAlias to OrderDate to make it appear as if the Date column were actually the database field.

Table 11. Better Date Sorting: The numerical detail histogram uses an implicit derived value for the row identifier that formats the date values to sort as expected.
Date 0-
199999
200000-
399999
400000-
599999
600000-
799999
800000-
1000000+
2001/05 8 0 0 0 0
2002/01 20 0 0 0 0
2002/02 44 0 0 0 0
2002/03 24 0 0 0 0
2002/04 40 0 0 0 0
2002/05 16 0 0 0 0
2002/06 4 0 0 0 0
2002/07 44 0 0 0 0
2002/08 8 0 0 0 0
2002/09 28 0 0 0 0
2002/10 20 0 0 0 0
2002/11 24 0 0 0 0
2003/03 16 0 0 0 0
2003/05 40 0 0 0 0
2003/06 56 0 0 0 0
2003/07 12 0 0 0 0
2003/08 12 0 0 0 0
2003/09 336 0 0 0 0
2003/10 240 0 0 0 0
2003/11 92 0 0 0 0
2003/12 231 0 0 0 0



Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap