The histogram is a mathematical tool that is invaluable when analyzing large quantities of data. A histogram may be represented as a table of numbers or as a chart. Because databases tend to hold large quantities of data, it follows that a database interface that leverages histograms would be a useful tool to have in your analysis toolbox. That's exactly what you'll see here: how to generate complex histograms with a single line of T-SQL code. The T-SQL procedure provides tabular output within the confines of a DB query tool (e.g., SQL Server Management Studio or Query Analyzer). As just one example of a way to convert the tabular data to a graphical representation, you'll also see how to hook up the output of the stored procedure to the input of a chart within Excel.
Wikipedia’s histogram overview defines a histogram as “a mapping. . . that counts the number of observations that fall into various disjoint categories.” The Wikipedia article provides both a tabular and a graphical representation of a simple example: measuring the commute time for US workers. Table 1 shows a subset of the Wikipedia table that is the most relevant portion for this discussion. The first column is the list of disjoint categories, also referred to as bins or buckets. The second column shows the count of the workers (in thousands) corresponding to each bucket. From census data, the article author determined that 4.18 million people had commute times of 5 minutes or less; 13.687 million had commute times between 5 and 10 minutes; 18.618 million had commute times between 10 and 15 minutes; and so forth.
|Minutes||Workers (in thousands)|
For this article, I want to produce a histogram to show the same data in a different way. Table 2 shows exactly the same data but with the X values transposed with the Y values, so each column defines a bucket instead of each row defining a bucket. This format has several advantages:
- Column-wise buckets visually map to a graphical representation more directly. Typically a histogram represented as a graph shows each bucket as a vertical bar in a bar chart. So each column in Table 2 corresponds directly to each bar in its corresponding graph.
- Column-wise buckets map to a single-row result set from a database query. This provides the flexibility to show either a summary histogram (which I define as a single-row result set) or a detail histogram containing more than one row. A detail histogram for this same data set might, for example, enumerate age groups of workers: perhaps the first row would represent workers under age 20, the next row workers aged 20 to 29, and so on.
- The header row shows ranges, not single values. Each column header explicitly states the range contained in its bucket, eliminating any ambiguity. With single values, does "10" mean "5-10", "10-15", or even "7.5-12.5"?
|Commute Time (minutes)||0-4||5-9||10-14||15-19||20-24||25-29||30-34||35-39||40-45|
|Number of Workers||4180||13687||18618||19634||17981||7190||16369||3212||4122|
What You Need
Installing the Sample Code
To follow along, first download the zipped sample code that accompanies this article, and unzip the file. Next, run histogram.sql to install the histogram stored procedure and then run both BoundSmoothing.sql and GetDataTypeName.sql to install the necessary support functions. You will, of course, need sufficient database privileges to do this; talk to your DBA if you get an error when you try to install a procedure or function.
|Author’s Note: This stored procedure makes use of dynamic SQL. Dynamic SQL, while sometimes necessary, and frequently useful, always presents a risk of abuse to your system. See The Curse and Blessings of Dynamic SQL by a noted SQL Server MVP, Erland Sommarskog.|
You also need the standard AdventureWorks database to replicate the specific examples discussed. If you don’t have it, you can download it here, and install it before continuing.
Finally, to generate connections between Excel and your SQL Server database you will need to install Microsoft Query from your Excel installation disk, because it is not installed by default with Excel. To determine whether you have it installed or not, simply follow the steps in the first Excel exercise and the program will tell you if Query is not loaded.
The Histogram API
Generating a histogram in T-SQL requires just one call to a stored procedure, the database equivalent of a subroutine.
The signature for the histogram stored procedure shows all its parameters. Though there are quite a few, not all parameters are used at one time, and many are optional.
Histogram @SchemaName, @TableName, @BucketExpr, @BucketFilter, @IdExpr, @IdFilter, @IdAlias, @Aggregate, @DateTemplate, @Mode, @LowerBoundDate, @UpperBoundDate, @Buckets, @LowerBoundInt, @UpperBoundInt, @Smooth, @Verbose, @Process
Here’s an overview of the parameters:
- Specify a table (@TableName) and a field to tabulate (@BucketExpr) for any histogram.
- To generate a cross tabulation (or detail histogram) rather than a summary histogram, supply a field to enumerate values (@IdExpr) along with an optional alias (@IdAlias) if you provide an expression rather than just a field name for @IdExpr.
- Focus your data by specifying what goes into each bucket (@BucketFilter) or which values from your @IdExpr to enumerate (@IdFilter).
- The bounds of your histogram will be determined automatically from the data unless you choose to explicitly define them. Use @LowerBoundInt and @UpperBoundInt for a numerical histogram or @LowerBoundDate and @UpperBoundDate for a temporal one.
- Set the number of buckets explicitly for a numerical histogram (@Buckets) or implicitly for a temporal histogram (@DateTemplate).
- The @DateTemplate parameter actually specifies two things: how to format the column headers (the bucket labels) in the result set, and (for a temporal histogram) what size buckets to use. You may specify years (yyyy), months (mm), or days (dd) in any arrangement, separated by virgules (/) or hyphens (-); the smallest unit present determines the bucket size.
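The smallest-unit rule can be sketched in a few lines of Python (a hypothetical helper for illustration only, not part of the stored procedure):

```python
def bucket_unit(template):
    """Return the bucket size implied by a date template: the smallest
    unit present wins (dd beats mm, and mm beats yyyy)."""
    t = template.lower()
    if "dd" in t:
        return "day"
    if "mm" in t:
        return "month"
    if "yyyy" in t:
        return "year"
    raise ValueError("template must contain yyyy, mm, or dd")
```

Both yyyy-mm and mm resolve to month-sized buckets; only the column labels differ.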
From the API, here is a more complete, formal list of parameters:
@SchemaName ::= schema or owner name
@TableName ::= table name
@BucketExpr ::= field name or expression
@BucketFilter ::= filter predicate for @BucketExpr
@IdExpr ::= field name or expression
@IdFilter ::= filter predicate for @IdExpr
@IdAlias ::= column name alias for @IdExpr
@Aggregate ::= aggregate expression (count, min, max)
@DateTemplate ::= template calling out year, month, and/or day specifiers
@Mode ::= histogram type: 'DateRange', 'DateMonths', or 'NumberRange'
@LowerBoundDate ::= smallest date to include for date histograms
@UpperBoundDate ::= largest date to include for date histograms
@Buckets ::= number of columns for NumberRange mode
@LowerBoundInt ::= smallest value to include for numerical histograms
@UpperBoundInt ::= largest value to include for numerical histograms
@Smooth ::= for NumberRange mode, adjust bounds to palatable values
@Verbose ::= "0"-"3" for no diagnostic output or progressively more
@Process ::= "1" to execute; "0" to only show generated statements
|Author’s Note: For a full description of all the parameters see the official Histogram API documentation.|
The easiest way to understand how to use this stored procedure is by working through the examples sprinkled throughout this article. Most of the examples use the Purchasing.PurchaseOrderHeader table from the AdventureWorks database.
This first example addresses the question: How many orders per month were placed between 6/5/2002 and 8/15/2003? To answer this question, you want to count orders based on the OrderDate field of the table and obtain a result that shows a summary (one-row), temporal (date-buckets) histogram. If you want to cover a specific date range, supply the upper and lower bounds; otherwise, the stored procedure automatically uses the lowest and highest dates in the table as the boundaries (see Table 3). Note that you could also omit the @Mode parameter in this instance, because the procedure can automatically determine the mode from the data type of the OrderDate field and from the presence of the upper and lower bounds given as dates. In this example, the buckets are month-sized: the smallest unit present in the @DateTemplate is mm, and the months from June 2002 through August 2003 are represented.
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader',
    @BucketExpr='OrderDate', @LowerBoundDate='6/5/2002', @UpperBoundDate='8/15/2003',
    @Mode='DateRange', @DateTemplate='yyyy-mm'
The Underlying Query Exposed
As mentioned earlier, the Histogram stored procedure uses dynamic SQL to do its magic. That is, from the parameters you supply in the procedure call, it dynamically generates a full query and then executes it, returning the result. You can see the actual query sent to your database by setting the @Verbose parameter to 2 or higher. The @Verbose parameter specifies a diagnostic output level, where 0 (the default) is defined as none. If you set @Verbose to 1, you get some derived values of interest; to 2, you also get the completed query; and to 3, you also get any subqueries (if used) that determine implicit upper and lower bounds.
For the preceding example, here’s the query generated from the procedure call:
SELECT
    count(CASE WHEN OrderDate >= '06/01/2002' AND OrderDate < '07/01/2002' THEN 1 END) as '2002-06',
    count(CASE WHEN OrderDate >= '07/01/2002' AND OrderDate < '08/01/2002' THEN 1 END) as '2002-07',
    count(CASE WHEN OrderDate >= '08/01/2002' AND OrderDate < '09/01/2002' THEN 1 END) as '2002-08',
    count(CASE WHEN OrderDate >= '09/01/2002' AND OrderDate < '10/01/2002' THEN 1 END) as '2002-09',
    count(CASE WHEN OrderDate >= '10/01/2002' AND OrderDate < '11/01/2002' THEN 1 END)
    /* This query continues in a similar fashion for each month */
    ...
FROM [Purchasing].[PurchaseOrderHeader] (nolock)
|Figure 1. Date Template Variations: Frames (1) and (2) show the same data in month-sized buckets, differing only in column headings. Frame (3) shows year-sized buckets. Frames (4), (5), and (6) show day-sized buckets.|
As you can see, the query code is rather repetitive, boring, and simple: all qualities that make it a good candidate for programmatic generation. The power and advantage of the stored procedure lies not in any one example, but rather in the flexibility to generate a wide variety of different queries by merely adjusting parameters.
Figure 1 shows the variety of different histograms that you can generate just by changing the @DateTemplate. Note that (1) and (2) have the same data and differ only in column headings, the first showing the year and the month, the second showing just the month. Recall that the smallest unit in the @DateTemplate determines the bucket size. Both yyyy-mm and mm have mm as the smallest unit, hence the buckets are both month-sized. The @DateTemplate, however, also formats the column names. Similarly, (4), (5), and (6) all show the same data, because their @DateTemplate parameters contain the day selector (dd); they all resolve down to a single day. (I changed the date range to keep the number of columns small.) The year and month selectors for (4), (5), and (6) are optional, and serve only to provide context if you need it.
How Big is a Bucket?
Be aware that when buckets are month-sized, each bucket contains data for a complete month, including the starting and final months, even if your bounds are not on the first and final days of those months. In other words, even when given a lower bound of 6/5/2002, the June 2002 bucket contains data for the entire month of June. It would have exactly the same count with a lower bound of 6/22/2002 or even 6/30/2002: because the bucket label reads "June," viewers would assume that it contains the data for all of June. For the same reason, year-sized buckets contain data for a full year, irrespective of specific boundary dates.
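The snapping of an arbitrary boundary date to its containing month can be sketched as follows (an illustrative Python helper; the stored procedure performs the equivalent adjustment in T-SQL):

```python
from datetime import date

def month_bucket(d):
    """Return the half-open [first of month, first of next month) range
    that contains d; any day in June maps to the whole of June."""
    start = d.replace(day=1)
    if d.month == 12:
        end = date(d.year + 1, 1, 1)
    else:
        end = date(d.year, d.month + 1, 1)
    return start, end
```

Here month_bucket(date(2002, 6, 5)) and month_bucket(date(2002, 6, 30)) both yield the same June 2002 range, which is exactly why the bucket counts do not change.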
To show a concrete example, here is the procedure call with year-sized buckets and its corresponding query. Notice that the boundary dates are not on year boundaries but the buckets in the query are.
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader',
    @BucketExpr='OrderDate', @LowerBoundDate='6/5/2002', @UpperBoundDate='8/15/2003',
    @Mode='DateRange', @DateTemplate='yyyy'

SELECT
    count(CASE WHEN OrderDate >= '01/01/2002' AND OrderDate < '01/01/2003' THEN 1 END) as '2002',
    count(CASE WHEN OrderDate >= '01/01/2003' AND OrderDate < '01/01/2004' THEN 1 END) as '2003'
FROM [Purchasing].[PurchaseOrderHeader] (nolock)
Compare this to the earlier example that resolved to months: with the same upper and lower bounds, which are not on precise month boundaries, the SQL query components are on month boundaries. Only when resolving down to day-sized buckets is no adjustment of the bounds necessary, as in this next query. Here the upper and lower bounds precisely match the first and last components of the query:
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader',
    @BucketExpr='OrderDate', @LowerBoundDate='11/1/2002', @UpperBoundDate='11/5/2002',
    @Mode='DateRange', @DateTemplate='yyyy-mm-dd'

SELECT
    count(CASE WHEN OrderDate >= '11/01/2002' AND OrderDate < '11/02/2002' THEN 1 END) as '2002-11-01',
    count(CASE WHEN OrderDate >= '11/02/2002' AND OrderDate < '11/03/2002' THEN 1 END) as '2002-11-02',
    count(CASE WHEN OrderDate >= '11/03/2002' AND OrderDate < '11/04/2002' THEN 1 END) as '2002-11-03',
    count(CASE WHEN OrderDate >= '11/04/2002' AND OrderDate < '11/05/2002' THEN 1 END) as '2002-11-04',
    count(CASE WHEN OrderDate >= '11/05/2002' AND OrderDate < '11/06/2002' THEN 1 END) as '2002-11-05'
FROM [Purchasing].[PurchaseOrderHeader] (nolock)
A Note on Range Completeness
The queries shown above completely blanket the data being tabulated; there are no date gaps due to the query construction. If you are less than fluent with T-SQL, here's a brief explanation. In the first case expression of the preceding query, the predicate OrderDate >= '11/01/2002' AND OrderDate < '11/02/2002' includes all records where the OrderDate is at or later than 11/01/2002 12:00:00AM (due to the ">=" operator) and earlier than 11/02/2002 12:00:00AM (due to the "<" operator). In other words, every record from November 1st, down to any fraction of a second, is captured by this predicate. By induction, one can show that this holds true for any day, and therefore for all days included in the query. The logic for month-sized or year-sized buckets is similar.

More on Temporal Histograms
So far, you've seen how to create temporal histograms that cover an arbitrary date range, resolved to years, months, or days. The histogram tool also lets you take a cross-section of your data, examining months without regard to years. For example, suppose you want to count all orders between March and August across multiple years. The next query does just that. The key is specifying the special @Mode value of DateMonths instead of the DateRange you have been using so far. Here the @LowerBoundDate and @UpperBoundDate boundaries specify the months for which the query should generate columns as well as the years to include in each bucket.
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader', @BucketExpr='OrderDate', @LowerBoundDate='3/5/2001', @UpperBoundDate='8/15/2004', @Mode='DateMonths'
Table 4 shows the output.
The April bucket in Table 4 is a tabulation of orders from the month of April across the years 2001-2004. The other buckets in the result set contain similar tabulations for their respective months.
You can prove to yourself that the boundary year values are really being used by changing the boundary dates, restricting the query to the years 2002-2003:
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader', @BucketExpr='OrderDate', @LowerBoundDate='3/5/2002', @UpperBoundDate='8/15/2003', @Mode='DateMonths'
As Table 5 shows, the values change to reflect the query restriction.
If you omit the @LowerBoundDate, the first bucket will be for January, and the first year included will be the earliest year in the Purchasing.PurchaseOrderHeader table. If you omit the @UpperBoundDate, the final bucket will be for December and the last year included will be the latest year in the target table. Omitting both bounds, then, will give you a 12-bucket histogram tabulating all records from the target table.
The Summary vs. Detail Histogram
The single-row result set typically provides a summary histogram: an analysis of an entire table or an entire criterion. A detail histogram, on the other hand, relates two variables, one enumerated in columns and the other enumerated in rows. Each row provides a separate mini-histogram for a single enumerated value of the second variable. This type of histogram is also called a cross tabulation (or crosstab for short).
You may generate a detail histogram for any of the histogram types specifiable with @Mode. To extend the previous example, you need add only one more parameter, @IdExpr, to convert the summary histogram into a detail histogram.
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader', @BucketExpr='OrderDate', @LowerBoundDate='3/5/2001', @UpperBoundDate='8/15/2004', @Mode='DateMonths', @IdExpr='EmployeeID'
This query addresses the question: Between March and August in the 4-year span 2001 to 2004, how many orders per month did each employee enter? Table 6 shows the result set.
If you compare Table 6 to Table 4, you can see that each column in Table 6 adds up to the single value in the same column in Table 4. So Table 6 shows a breakdown by employee of the histogram in Table 4. In other words, each row of Table 6 is a summary histogram focused on a single employee.
Filtering in Two Dimensions
|Figure 2. Filtering Variations: Both column (@BucketFilter) and row (@IdFilter) filters may be applied to a detail histogram. The unfiltered query (1) sheds columns (2) when you filter on a particular status value, sheds rows (3) when filtered by Employee IDs, and sheds both rows and columns (4) when both filters are applied.|
The final type of histogram to consider is a numerical histogram. This type, specified by the @Mode value of NumberRange, tabulates buckets of integers rather than dates. Still using the Purchasing.PurchaseOrderHeader table, this next example (see Table 7) addresses the question: What is the distribution of orders by sales amount?
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader', @BucketExpr='SubTotal'
This type of histogram also automatically determines boundaries if you do not supply them. The smallest value in Table 7 is 37 and the largest value is 997,680. Having those values as the bucket boundaries, though, would be atypical at best, and possibly misleading. You usually want boundary values to be more regular: in this case, starting the first bucket with zero and ending the final bucket with one million is more palatable. I call this process "smoothing."
To convince you that smoothing is generally a good thing, Table 8 shows the same query with smoothing turned off. Smoothing is on by default, so to turn it off, you merely add the @Smooth parameter and set it to zero. Smoothing involves adjusting each boundary value so that it is divisible by an appropriate power of ten and giving the lower boundary a tendency towards zero when it is nearby. The key qualitative terms in that statement, shown italicized, are described more quantitatively a bit later.
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader', @BucketExpr='SubTotal', @Smooth=0
Reprise: How Big is a Bucket?
With temporal histograms, buckets span a day, a month, or a year, as you have seen. With numerical histograms the details are more intricate.
The plus sign following the upper bound on the final bucket in Table 7 is an indicator that that bucket is larger than all the other buckets (though not by much): the first nine buckets all span 100,000 values while the final bucket spans 100,001 values. That is a minor artifact of a design decision on my part, to have a tendency towards starting with zero and concluding with a power of ten.
The notation in Table 8 makes it a little harder to spot, but look at the column name of the final bucket: "897922-997680-". That final minus sign is not a stray character: it indicates that the final bucket is smaller than all the others. Here, the first nine buckets span 99,765 values while the final bucket spans 99,759 values. In this case, the bucket sizes were influenced not by a design decision but by the data. While you can make a theoretical case that buckets should always be the same size, the data is not always so accommodating. The discrepancy in the final bucket is simply due to division in the integer domain: the boundaries of each bucket are constrained to be integers and there are an integral number of buckets, but division operations typically yield values outside the integer domain, introducing the difference.
You do not need to fully appreciate the mathematics, but here are some brief details. Consider n buckets with l as the lower bound and u as the upper bound. The span of the buckets is determined as shown in Figure 4. Read the first equation as "the span of buckets 1 through (n-1) is..." and the second equation as "the span of the final bucket (bucket n) is...". The brackets indicate a mathematical ceiling, the smallest integer equal to or greater than the real value. The span of the final bucket essentially gets the leftovers due to rounding to integer values. It may sometimes be the same as the other buckets, but it will frequently be different, though always within n units of the other buckets.
|Figure 4. Bucket Size Determination: The final bucket span will frequently be close, but not equal to, the span of the other buckets.|
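The two equations translate directly into code. This Python sketch (my restatement using the article's n, l, and u; not the procedure's own T-SQL) reproduces the spans quoted above:

```python
import math

def bucket_spans(l, u, n):
    """Spans of n integer buckets covering [l, u]: buckets 1..n-1 share
    the ceiling-rounded span, and the final bucket takes the leftovers."""
    span = math.ceil((u - l) / n)          # span of buckets 1..n-1
    final = (u - l + 1) - (n - 1) * span   # whatever remains for bucket n
    return [span] * (n - 1) + [final]
```

Checking against the text: bucket_spans(0, 1000000, 10) gives nine buckets of 100,000 plus a final bucket of 100,001 (Table 7), while bucket_spans(37, 997680, 10) gives nine buckets of 99,765 plus a final bucket of 99,759 (Table 8).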
Numerical Histogram: Summary vs. Detail
Just as temporal histograms can be either summary or detail histograms, so can numerical histograms, through exactly the same mechanism: the @IdExpr parameter. This allows you to expand a summary histogram and relate another variable to its buckets. To make it more interesting, the second variable here is a date. This query itemizes the order totals by day:
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader', @BucketExpr='SubTotal', @Buckets=5, @IdExpr='OrderDate', @IdFilter='OrderDate < ''4/1/2002'''
Table 9 shows the result set, where each OrderDate is formatted in its default state: date and time down to milliseconds. From Table 9, it appears that OrderDate is really just a date and does not include a time (all the values show midnight as the time component), so this histogram groups all orders on a given day into a single row. If the field you picked happened to also include times, each individual OrderDate value would be enumerated, and you would likely get just one order per row without any grouping, which yields no additional information. Note that the @IdFilter is optional here; it's used only to limit the number of rows in the result set.
Wouldn't it be nice, though, if you could group dates by day, month, or year just as you could with temporal histograms? You can, using the @IdExpr parameter. It is an ID expression rather than just an ID field name because it may contain an arbitrary expression; using a field name is simply one possible case of a generalized expression. The next example leverages this expressive power.
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader',
    @BucketExpr='SubTotal', @Buckets=5,
    @IdExpr='cast(year(OrderDate) as varchar) + ''/'' +
        cast(month(OrderDate) as varchar)',
    @IdAlias='Date', @IdFilter='OrderDate < ''1/1/2004'''
|Author's Note: In the procedure call above, the value of @IdExpr is a single string, so it must all be on one line; I have split it across two lines here just for clarity.|
The @IdExpr is a derived value: a concatenation of the year of the OrderDate, then a virgule, then the month. Since @IdExpr is a string parameter, the month and year values must be cast to varchar within the expression. Also note that to embed a string literal within a string, SQL Server uses the convention of doubling the quote character; this convention appears in both the @IdExpr value and the @IdFilter value above. The @IdAlias, meanwhile, is recommended whenever your @IdExpr is not a simple field name. Without it, SQL Server Management Studio has to come up with its own name for the anonymous column (it labels it (No column name)). The @IdAlias parameter gives you the opportunity to supply a more suitable name; see Table 10.
Table 10 achieves grouping, but suffers from one glaring deficiency: the sort order is suboptimal. The output has the date values sorted lexically, as strings, rather than semantically, as dates. You could modify the date expression to yield a better sortable token, but rather than go through the machinations for that, here is a better way:
exec histogram @SchemaName='Purchasing', @TableName='PurchaseOrderHeader',
    @BucketExpr='SubTotal', @Buckets=5, @IdExpr='OrderDate',
    @DateTemplate='yyyy/mm', @IdAlias='Date', @IdFilter='OrderDate < ''1/1/2004'''
This procedure call replaces the complicated @IdExpr value with the simple field name again, a field that is of type datetime, which is significant. The extra magic comes from adding the familiar @DateTemplate parameter to specify how to format the @IdExpr. Why is @IdAlias still present? I stated above that it is only needed if @IdExpr is not a simple field name, but that was not completely accurate: you also need @IdAlias when that simple field is a date and you include a @DateTemplate. A glance at the generated query shows you why:
SELECT
    RIGHT('0000'+cast(YEAR(OrderDate) as varchar),4) + '/' +
        RIGHT('00'+cast(MONTH(OrderDate) as varchar),2) as [Date],
    COUNT(CASE WHEN SubTotal >= 0 AND SubTotal <= 199999 THEN 1 END) as '0-199999',
    COUNT(CASE WHEN SubTotal >= 200000 AND SubTotal <= 399999 THEN 1 END) as '200000-399999',
    COUNT(CASE WHEN SubTotal >= 400000 AND SubTotal <= 599999 THEN 1 END) as '400000-599999',
    COUNT(CASE WHEN SubTotal >= 600000 AND SubTotal <= 799999 THEN 1 END) as '600000-799999',
    COUNT(CASE WHEN SubTotal >= 800000 AND SubTotal <= 1000000 THEN 1 END) as '800000-1000000+'
FROM [Purchasing].[PurchaseOrderHeader] (nolock)
WHERE OrderDate < '1/1/2004'
GROUP BY RIGHT('0000'+cast(YEAR(OrderDate) as varchar),4) + '/' +
    RIGHT('00'+cast(MONTH(OrderDate) as varchar),2)
ORDER BY RIGHT('0000'+cast(YEAR(OrderDate) as varchar),4) + '/' +
    RIGHT('00'+cast(MONTH(OrderDate) as varchar),2)
As you can see from the first item in the SELECT list, what actually goes into that column is a derived expression; without the alias, it again becomes an anonymous column. Table 11 shows the result set with the Date column now sorted properly. Note that you could even set @IdAlias to OrderDate to make it appear as if the Date column were actually the database field.
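The RIGHT('0000'+..., 4) and RIGHT('00'+..., 2) calls in the generated query zero-pad the year and month so that lexical ordering coincides with chronological ordering. A short Python sketch (illustrative only) shows why the padding matters:

```python
def month_key(year, month, padded=True):
    """Format a year/month label; zero-padding makes string sort order
    match date order ('2001/09' < '2001/10'), while unpadded does not."""
    return f"{year:04d}/{month:02d}" if padded else f"{year}/{month}"

months = [(2001, 9), (2001, 10), (2001, 11)]
# Unpadded labels sort lexically: '2001/10' and '2001/11' precede '2001/9'.
unpadded = sorted(month_key(y, m, padded=False) for y, m in months)
# Padded labels sort chronologically: '2001/09', '2001/10', '2001/11'.
padded = sorted(month_key(y, m) for y, m in months)
```

This is exactly the deficiency seen in Table 10, and the padding in the generated GROUP BY and ORDER BY expressions is what repairs it.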
In the sections on temporal histograms I did not mention smoothing, but it was implicit in the discussion. Recall that if you specify a @DateTemplate of, say, yyyy-mm for a temporal histogram, the first bucket spans the entire month of the lower boundary date, irrespective of whether the lower boundary date specifies the first of that month. The analogous situation applies at the trailing end of the final bucket. If your date template specifies only years, then buckets are "smoothed" to full years automatically.
With numerical histograms, on the other hand, smoothing is explicit and under your control via the @Smooth parameter. You saw in the earlier example (Table 7 vs. Table 8) how boundary smoothing can be quite useful.
The boundary smoothing algorithm is implemented as a table-valued function used by the histogram stored procedure, but it is designed as a public function, complete with API documentation. If you're wondering why values get smoothed to certain boundaries, you can experiment with the BoundSmoothing function directly. Its output is a single row containing nine values. In practice you need only the final two; the others are present for learning and tinkering, and to support discussing many different input values at once, as in this next example. This query takes a series of representative samples and reports them in a single result set; it is essentially a unit test for the function.
select * from dbo.BoundSmoothing(3, 21) union
select * from dbo.BoundSmoothing(71, 76) union
select * from dbo.BoundSmoothing(19, 76) union
select * from dbo.BoundSmoothing(119, 176) union
select * from dbo.BoundSmoothing(119, 276) union
select * from dbo.BoundSmoothing(119, 976) union
select * from dbo.BoundSmoothing(119, 676) union
select * from dbo.BoundSmoothing(49, 976) union
select * from dbo.BoundSmoothing(199, 899) union
select * from dbo.BoundSmoothing(199, 901) union
select * from dbo.BoundSmoothing(199, 976) union
select * from dbo.BoundSmoothing(201, 976) union
select * from dbo.BoundSmoothing(119, 1176) union
select * from dbo.BoundSmoothing(119, 1499) union
select * from dbo.BoundSmoothing(119, 1501) union
select * from dbo.BoundSmoothing(319, 1576) union
select * from dbo.BoundSmoothing(2319, 11576) union
select * from dbo.BoundSmoothing(12383, 12389) union
select * from dbo.BoundSmoothing(12319, 12389) union
select * from dbo.BoundSmoothing(12319, 84111) union
select * from dbo.BoundSmoothing(12319, 90111)
order by lownum, highnum
- When the upper and lower bounds are close together (less than 25 units apart), they are returned unchanged. See rows 1, 4, and 21 of the result set.
- If the upper bound is within 25 percent of itself rounded to its most significant digit increased by one with remaining digits zero, then use that value; otherwise, round to the penultimate significant digit increased by one with remaining digits zero. This is a mouthful that is easier to digest by example, illustrated by rows 10 and 11. In both cases (1499 and 1501), the most significant digit is 1; increased by 1 it becomes 2, and filled out with zeroes it yields 2000 as the tentative smoothed value. The value 1501 is within 25 percent of 2000, so it accepts 2000; 1499 falls just outside the 25 percent threshold, so it rejects 2000 in favor of looking at the next digit, the 4. Bumping 14 up by 1 yields 15, and filling with zeroes produces the 1500 value that you see for row 10.
- If the upper bound is within 10 percent of itself rounded to its most significant digit increased by one with remaining digits zero, and the lower bound is within 20 percent of zero relative to the smoothed upper bound, then use zero as the lower bound. Compare rows 19 and 20, for example. In both rows the distance between the lower bound and zero is the same, but in row 20 the upper bound is sufficiently larger than the lower bound to warrant smoothing the lower bound to zero. Rows 14 and 15 show the same thing, with the input lower bound immediately below and above the 20 percent threshold, respectively.
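The bullets above can be condensed into a simplified Python sketch. This is my approximation of BoundSmoothing's documented behavior, not its actual T-SQL implementation, and it ignores some edge cases (for instance, the interplay between the 10 and 25 percent upper-bound rules, and single-digit inputs):

```python
def smooth_upper(u):
    """Round u up to its most significant digit plus one (rest zeros) if
    that lands within 25 percent; otherwise round up at the penultimate
    significant digit. Assumes u has at least two digits."""
    scale = 10 ** (len(str(u)) - 1)     # e.g. 1000 for a 4-digit number
    coarse = (u // scale + 1) * scale   # 1501 -> 2000, 1499 -> 2000
    if (coarse - u) / coarse <= 0.25:
        return coarse
    return (u // (scale // 10) + 1) * (scale // 10)   # 1499 -> 1500

def smooth(lower, upper):
    """Leave close bounds alone; otherwise smooth the upper bound and
    pull the lower bound to zero when it sits within 20 percent of zero
    relative to the smoothed upper bound."""
    if upper - lower < 25:
        return lower, upper
    su = smooth_upper(upper)
    return (0 if lower / su <= 0.20 else lower), su
```

Mirroring rows from the result set: smooth(3, 21) returns the bounds unchanged; smooth_upper(1501) is 2000 while smooth_upper(1499) is 1500; and smooth(37, 997680) yields (0, 1000000), the Table 7 bounds.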
Histograms by their very nature involve aggregating one or more aspects of a set of data. The histogram stored procedure provides a simple mechanism for explicitly factoring in SQL aggregate operators as well, using the @Aggregate parameter. For example, suppose you want to explore the question of how many orders each customer has. The following query returns the number of orders per customer; each row of the result set summarizes one customer.
SELECT CustomerID, count(SalesOrderID) AS OrderCount FROM Sales.SalesOrderHeader GROUP BY CustomerID
For AdventureWorks, that yields almost 20,000 rows: quite a lot of data. You could, of course, sort that result in one direction to see which customers ordered the most, or in the opposite direction to see which ordered the least. But for a good overview, funnel the request through the histogram stored procedure. This next query condenses the result to a single row (a summary histogram) by performing the same aggregation as above (counting orders per customer) and then generating a standard histogram on that intermediate result (see Table 12):
exec histogram @SchemaName='Sales', @TableName='SalesOrderHeader', @BucketExpr='count(SalesOrderID)', @Aggregate='CustomerID'
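For comparison, here is a hand-written query that is conceptually equivalent to the @Aggregate call above: it counts orders per customer in a derived table, then buckets those counts itself. The bucket boundaries (1 to 5, 6 to 10, over 10) are purely illustrative assumptions; the stored procedure computes its own boundaries:

```sql
SELECT
    SUM(CASE WHEN OrderCount BETWEEN 1 AND 5  THEN 1 ELSE 0 END) AS [1 to 5],
    SUM(CASE WHEN OrderCount BETWEEN 6 AND 10 THEN 1 ELSE 0 END) AS [6 to 10],
    SUM(CASE WHEN OrderCount > 10             THEN 1 ELSE 0 END) AS [Over 10]
FROM (
    -- Same per-customer aggregation as the earlier query.
    SELECT CustomerID, COUNT(SalesOrderID) AS OrderCount
    FROM Sales.SalesOrderHeader
    GROUP BY CustomerID
) AS PerCustomer
```

The result is a single row whose columns are the buckets, which is exactly the column-wise summary-histogram shape described earlier.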
|Author's Note: The aggregate operation is the only operation discussed in this article that does not work in SQL Server 2000. Everything else works in both SQL Server 2000 and 2005.|
Summary of Histogram Usage
For your convenience, Table 13 presents a summary of parameter usage for all three histogram types. Again, refer to the formal Histogram API for more details.
|@IdExpr||optional||optional||optional (NA if @Aggregate used)|
optional (and only when @IdExpr is also used)
optional (and only when @IdExpr is also used)
required if: @IdExpr is a date field
|@Mode||required if: bounds are omitted and @BucketExpr is not a date field||required||required if: bounds are omitted and @BucketExpr is not an integer or money field|
Expanding the Box: From T-SQL to Excel Charts
Working in a query tool such as SQL Server Management Studio Express is often adequate, particularly for data-savvy individuals such as developers. But in many cases, it's better to show graphical charts. This section describes a procedure for converting histograms to Excel charts. If you glance at the figures in this section, you will see that I cannot claim that you can "convert to a chart in just 3 easy steps!" or even 5 or 10. There are a lot of little dialogs you need to go through, but after you have done it a couple of times it is not really that complicated.
|Figure 6. Defining a Data Connection: Follow the sequence of screens to define a data connection to Excel.|
Step 1: Create a Connection to Your Data Source
Before you can connect to a database in Excel, you need to define a connection in Windows that points to your database. In Windows XP, open the ODBC Data Source Administrator (on the Administrative Tools menu, which you may need to enable). Figure 6 walks through configuring and testing an existing connection with it. If you need to create a new one, select Add instead of Configure.
Step 2: Connect a Database to Excel
With your database connection in place, create a new worksheet in Excel 2003. On a blank sheet, instantiate a new database query (see Figure 7). The wizard first asks you to select your connection, then to define the query, and finally to specify where to place the results. In the middle step you actually leave Excel for Microsoft Query, Microsoft's eponymous query program, before returning to Excel. This venture into Microsoft Query is detailed in Figure 8.
Step 3A: Generate a Chart for a Summary Histogram
|Figure 9. Dynamic Table and Chart: When your query is connected to Excel, you have a dynamic connection. When the data table is active, you have access to the refresh button on the External Data toolbar. Pressing refresh updates both the table and the chart.|
After completing step 2 you'll have a data table in Excel (see Figure 9). To convert this into a chart, invoke Excel's Chart Wizard from the Insert menu. Using the wizard is very simple, so I will not detail it here. Customizing your chart to be just the way you want it, however, may take some effort. When your chart is in place you'll have both a dynamic table and a dynamic chart. Notice in Figure 9 that one cell in the table is the active cell. Because you are in the data table, the External Data toolbar appears. Simply press the refresh button in the middle of that toolbar (the button with the exclamation point icon) to refresh both the table and its linked chart.
Step 3B: Generate a Chart for a Detail Histogram
The query in Figure 9 returns a summary histogram. For this next example, assume you have a detail histogram that you want to link to an Excel table and chart. Figure 10 shows just the key pieces of this process; refer to Figures 7 and 8 for steps not shown here. After you return your result to Excel you'll have a standard two-dimensional Excel table. There is one Excel secret you need to know at this point: delete the column header on the first column (see frame 3 in Figure 10) before you invoke the chart wizard. If you do this, the chart wizard will naturally yield a chart where each row is rendered as a very thin histogram. If you don't, the chart wizard will mislead, befuddle, or frustrate you!
|Figure 10. Charting a Detail Histogram: (1) Define your query. (2) From Microsoft Query return your results to Excel. (3) Delete the column header on the first column to allow the chart wizard to properly guide you. (4) Invoke the chart wizard. (5) Use the default column chart type. You could select Finish in step (5); (6) just lets you confirm that the series are in columns. (7) Your linked chart is rendered.|
Step 4: Modify a Query
You've seen how to link dynamic data to your Excel chart and how to refresh the chart on demand. What if you actually need to change the query? Does that mean you need to throw it out and start over? Not necessarily. As shown in Figure 11, you can start with an existing dynamic chart, and then return to Microsoft Query to edit the query, which results in a different number of rows. When you return to Excel, everything gets updated automatically.
|Figure 11. Modifying the Query: Changes that affect the number of records returned are automatically propagated to Excel. (1) Select the first button on the External Data toolbar to edit the configuration. (2) Press OK. (3) From the initial result set (4) edit the query and (5) change the number of rows in the result. (6) Return the data to Excel and observe the table and chart have rendered the changes.|
Note that this holds true only when you change the number of rows, because the chart series are in columns. If you instead change the number of columns, by adjusting the upper or lower boundary date, you will get an undefined reference or a missing data error in the chart upon returning to Excel.
This article uses Excel as a channel for converting data tables to charts because it's a convenient, standard application where you can produce the charts through a smart UI rather than by programming, but there are, of course, many other possibilities. For example, you might want to feed the output of the stored procedure to your web or Windows application to create a dynamic chart. Whatever you choose, the intent of this article was to sufficiently explain the Histogram API to allow you to leverage it in the application or vehicle of your choice.