TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
 Specialized Dev Zones Research Center eBook Library .NET Java C++ Web Dev Architecture Database Security Open Source Enterprise Mobile Special Reports 10-Minute Solutions DevXtra Blogs Slideshow

Build a Generic Histogram Generator for SQL Server

Histograms help people analyze large amounts of data, whether you display them as tables or as charts. This article shows you how to do both.

 by Michael Sorens
 Jul 3, 2008
 Page 1 of 7

he histogram is a mathematical tool that is invaluable when analyzing large quantities of data. A histogram may be represented in a table of numbers or in a chart. Because databases tend to hold large quantities of data, it follows that a database interface that leverages histograms would be a useful tool to have in your analysis toolbox. That's exactly what you'll see here—how to generate complex histograms with a single line of T-SQL code. The T-SQL procedure provides tabular output within the confines of a DB query tool (e.g., SQL Server Management Studio or Query Analyzer). As just one example of a way to convert the tabular data to a graphical representation, you'll also see how to hook up the output of the stored procedure to the input of a chart within Excel.

Wikipedia's histogram overview defines a histogram as "a mapping. . . that counts the number of observations that fall into various disjoint categories." The Wikipedia article provides both a tabular and a graphical representation of a simple example: measuring the commute time for US workers. Table 1 shows a subset of the Wikipedia table that is the most relevant portion for this discussion. The first column is the list of disjoint categories, also referred to as bins or buckets. The second column shows the count of the workers (in thousands) corresponding to each bucket. From census data, the article author determined that 4.18 million people had commute times of 5 minutes or less; 13.687 million had commute times between 5 and 10 minutes; 18.618 million had commute times between 10 and 15 minutes; and so forth.

Table 1. Sample Histogram: A histogram showing the distribution of commute times for US workers in 5-minute buckets.
 Minutes Workers (in thousands) 0 4180 5 13687 10 18618 15 19634 20 17981 25 7190 30 16369 35 3212 40 4122

For this article, I want to produce a histogram to show the same data in a different way. Table 2 shows exactly the same data but with the X values transposed with the Y values, so each column defines a bucket instead of each row defining a bucket. This format has several advantages:

• Column-wise buckets visually map to a graphical representation more directly. Typically a histogram represented as a graph shows each bucket as a vertical bar in a bar chart. So each column in Table 2 corresponds directly to each bar in its corresponding graph.
• Column-wise buckets map to a single-row result set from a database query. This provides the flexibility to show either a summary histogram—which I define as a single-row result set—or a detail histogram containing more than one row. A detail histogram for this same data set might, for example, enumerate age groups of workers. Perhaps the first row would represent workers under age 20, the next row workers over 20 and under 30, etc.
• The header row shows ranges not single values. Each column header contains not just a single value but explicitly states the range contained in its bucket eliminating any ambiguity. With single values, does "10" mean "5-10" or "10-15" or even "7.5-12.5" ?
Table 2. Transposed Histogram: This table shows a transposition of the histogram in Table 1, showing the same data in a form more conducive for the techniques discussed.
 Commute Time (minutes) 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-45 Number of Workers 4180 13687 18618 19634 17981 7190 16369 3212 4122

 What You Need SQL Server 2000 or later SQL Server Query Analyzer, SQL Server Management Studio, or a similar tool for querying a database and receiving a result set Microsoft Excel 2003 for creating charts from result sets

Installing the Sample Code
To follow along, first download the zipped sample code that accompanies this article, and unzip the file. Next, run histogram.sql to install the histogram stored procedure and then run both BoundSmoothing.sql and GetDataTypeName.sql to install the necessary support functions. You will, of course, need sufficient database privileges to do this; talk to your DBA if you get an error when you try to install a procedure or function.

 Author's Note: This stored procedure makes use of dynamic SQL. Dynamic SQL, while sometimes necessary, and frequently useful, always presents a risk of abuse to your system. See The Curse and Blessings of Dynamic SQL by a noted SQL Server MVP, Erland Sommarskog.

You also need the standard AdventureWorks database to replicate the specific examples discussed. If you don't have it, you can download it here, and install it before continuing.

Finally, to generate connections between Excel and your SQL Server database you will need to install Microsoft Query from your Excel installation disk, because it is not installed by default with Excel. To determine whether you have it installed or not, simply follow the steps in the first Excel exercise and the program will tell you if Query is not loaded.