Login | Register   
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Build a Generic Histogram Generator for SQL Server : Page 3

Histograms help people analyze large amounts of data, whether you display them as tables or as charts. This article shows you how to do both.


advertisement

How Big is a Bucket?
Be aware that when buckets are month-sized, each bucket contains data for a complete month, including the starting and final months—even if your bounds are not on the first and final days of the month, respectively. In other words, even when given a lower bound of 6/5/2002, the June 2002 bucket contains data for the entire month of June. It would have exactly the same count with a lower bound date of 6/22/2002 or even 6/30/2002 because the bucket label reads "June," so viewers would assume that it contains the data for all of June. For the same reason, year-sized buckets contain data for a full year, irrespective of specific boundary dates.

To show a concrete example, here is the procedure call with year-sized buckets and its corresponding query. Notice that the boundary dates are not on year boundaries but the buckets in the query are.

   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='OrderDate',
       @LowerBoundDate='6/5/2002',
       @UpperBoundDate='8/15/2003',
       @Mode='DateRange',
       @DateTemplate='yyyy'
   
   SELECT
count(CASE WHEN OrderDate >= '01/01/2002' AND 
   OrderDate < '01/01/2003' THEN 1 END) as '2002',
count(CASE WHEN OrderDate >= '01/01/2003' AND 
   OrderDate < '01/01/2004' THEN 1 END) as '2003'
FROM [Purchasing].[PurchaseOrderHeader] (nolock)

Compare this to the earlier example that resolved to months; observe that with the same upper and lower bounds, which are not on precise month boundaries, the SQL query components are on month boundaries. Only when resolving down to day-size buckets is no adjustment on the bounds necessary, as in this next query. Here the upper and lower bounds specify precise match the first and last components of the query:



   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='OrderDate',
       @LowerBoundDate='11/1/2002',
       @UpperBoundDate='11/5/2002',
       @Mode='DateRange',
       @DateTemplate='yyyy-mm-dd'
   
   SELECT
count(CASE WHEN OrderDate >= '11/01/2002' AND 
   OrderDate < '11/02/2002' THEN 1 END) as '2002-11-01',
count(CASE WHEN OrderDate >= '11/02/2002' AND 
   OrderDate < '11/03/2002' THEN 1 END) as '2002-11-02',
count(CASE WHEN OrderDate >= '11/03/2002' AND 
   OrderDate < '11/04/2002' THEN 1 END) as '2002-11-03',
count(CASE WHEN OrderDate >= '11/04/2002' AND 
   OrderDate < '11/05/2002' THEN 1 END) as '2002-11-04',
count(CASE WHEN OrderDate >= '11/05/2002' AND 
   OrderDate < '11/06/2002' THEN 1 END) as '2002-11-05'
FROM [Purchasing].[PurchaseOrderHeader] (nolock)

A Note on Range Completeness
The queries shown above completely blanket the data being tabulated; there are no date gaps due to the query construction. If you are less than fluent with T-SQL here's a brief explanation. From the first case in the preceding query, the predicate OrderDate >= '11/01/2002' AND OrderDate < '11/02/2002' includes all records where the OrderDate is at or later than 11/01/2002 12:00:00AM (due to the ">=" operator) and where the OrderDate is earlier than 11/02/2002 12:00:00AM (due to the "<" operator). In other words every record from November 2nd, no matter what fraction of a second you choose, is captured by this predicate. By induction, one can show that this holds true for any day, and therefore for all days included in the query. The logic for month-size or year-size buckets is similar.

More on Temporal Histograms
So far, you've seen how to create temporal histograms that cover an arbitrary date range, resolving this to years, months, or days. The histogram tool also lets you take a cross-section of your data, examining months without regard to years. For example, suppose you want to count all orders between March and August across multiple years. The next query does just that. The key is specifying the special @Mode value of DateMonths instead of DateRange that you have been using so far. Here the @LowerBoundDate and @UpperBoundDate boundaries specify the months for which the query should generate columns as well as the years to include in each bucket.

   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='OrderDate',
       @LowerBoundDate='3/5/2001',
       @UpperBoundDate='8/15/2004',
       @Mode='DateMonths'

Table 4 shows the output.

Table 4. Temporal Histogram By Month Across Years: Here's temporal histogram where each bucket represents a month across all specified years.
March April May June July August
353 333 417 405 438 452


The April bucket in Table 4 is a tabulation of orders from the month of May from 2001-2004. The other buckets in the result set contain similar tabulations for their respective months.

You can prove to yourself that the boundary year values are really being used by changing the boundary dates, restricting the query to the years 2002-2003:

   exec histogram
       @SchemaName='Purchasing',
       @TableName='PurchaseOrderHeader',
       @BucketExpr='OrderDate',
       @LowerBoundDate='3/5/2002',
       @UpperBoundDate='8/15/2003',
       @Mode='DateMonths'

As Table 5 shows, the values change to reflect the query restriction.

Table 5. Restricted Years: The same query that resulted in Table 4 yields different counts when you change the year date boundaries.
March April May June July August
40 40 56 60 56 20


If you omit the @LowerBoundDate, the first bucket will be for January, and the first year included will be the earliest year in the Purchasing.PurchaseOrderHeader table. If you omit the @UpperBoundDate, the final bucket will be for December and the last year included will be the latest year in the target table. Omitting both bounds, then, will give you a 12-bucket histogram tabulating all records from the target table.



Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap