Mastering Top-N and OLAP Queries in DB2

ith the help of today’s Web integration and GUI tools, it’s now much easier to build reports without any programming. SQL provides GROUP BY and ORDER BY clauses, to collate and sort otherwise unordered data. But sometimes we want more flexibility. Consider the following report requests:

  • List my top two clients, according to sales totals
  • List the next three clients
  • Rank all clients by sales volume, but list them alphabetically
  • Rank all clients by sales volume on a per-month basis
  • Show each client’s year-to-date and all-time cumulative totals
  • A rolling three-month average, from one before to one after, and the same from the previous year
Traditionally the extra processing for these kinds of queries was done in application code. Using newer SQL constructs, we can move the processing closer to the data itself, both for efficient processing and simpler data access.



How can I rank my results in DB2 and retrieve only the data I want?



Use DB2’s OLAP functions to refine the data; Use FETCH FIRST to minimize the results.FETCH FIRST for Top-N
DB2’s method of performing a Top-N query is the FETCH FIRST clause. You can append these variations to a regular SELECT query:

  • FETCH FIRST ROW ONLY
  • FETCH FIRST 1 ROW ONLY
  • FETCH FIRST integer ROWS ONLY
Interestingly, you can also use

  • FETCH FIRST 1 ROWS ONLY
  • FETCH FIRST 5 ROW ONLY
They aren’t as nice grammatically, but they make it easier to generate queries automatically?you don’t have to worry about whether to say ROW or ROWS.

For demonstration, I’ll create a table called MYSALES, which can be generated by running Listing 1.

Now, we can ask for a single record as follows:

Listing 2. Return a Single Row

   SELECT * FROM MYSALES FETCH FIRST ROW ONLY      CLIENT     MONTHEND   SALEVOL    DEVX       03/31/1998     100
We have retrieved one row, but there’s no way to know ahead of time which row it will be.(HINT: This does give us a handy way to remind ourselves what fields are in a table, with a row of sample data as a bonus!)

To find our top two clients, use the code in Listing 3.

Listing 3: Show Top Two Clients

   SELECT CLIENT, SUM(SALEVOL) AS TOTALVOL   FROM MYSALES    GROUP BY CLIENT    ORDER BY SUM(SALEVOL) DESC    FETCH FIRST 2 ROWS ONLY      CLIENT     TOTALVOL      DEVX           5785   EGGHEAD        5341
Ranking and Numbering
DB2 supports ranking and numbering in much the same way that Oracle does (see “Performing Top-N Queries in Oracle“).

The available functions are:

  • ROW_NUMBER(), which simply numbers the returned rows sequentially
  • RANK(), which ranks the results, but, in the case of a tie, gives the same number to each and leaves a gap to compensate
  • DENSE_RANK() operates the same as RANK() but doesn’t leave any gaps.
The difference in these three methods can be seen in Listing 4.

Listing 4. Ranking and Numbering Results

   select      client, MONTHEND, SALEVOL,      ROW_NUMBER() over (order by SALEVOL desc) AS RN,      RANK() over (order by SALEVOL desc) AS RANK,      DENSE_RANK() over (order by SALEVOL desc) AS DENSE   from mysales    where MONTHEND=DATE(‘1997-11-30’)   order by RN      CLIENT     MONTHEND   SALEVOL   RN  RANK DENSE   CYRIX      11/30/1997    120     1     1     1   BIG BLUE   11/30/1997    106     2     2     2   EGGHEAD    11/30/1997    106     3     2     2   DEVX       11/30/1997     80     4     4     3   FIGTREE    11/30/1997     62     5     5     4   ACME       11/30/1997     20     6     6     5
The results need not be returned in rank order. We may wish to show each client’s rank while listing them alphabetically

Listing 5. Return Ranks in Any Order

   select     CLIENT, SALEVOL,     RANK() over (order by SALEVOL desc) AS RANK   from mysales    where MONTHEND=DATE(1997-11-30′)    order by CLIENT      CLIENT     SALEVOL RANK   ACME            20    6   BIG BLUE       106    2   CYRIX          120    1   DEVX            80    4   EGGHEAD        106    2   FIGTREE         62    5
One application of ROW_NUMBER is to select a numbered range of rows from the middle of your results. For example, you could retrieve the next three clients following the top two as shown in Listing 6.

Listing 6. Selecting Rows by Number

   WITH ALLSALES AS       (SELECT CLIENT, SUM(SALEVOL) AS TOTALVOL,      ROW_NUMBER() OVER         (ORDER BY SUM(SALEVOL) DESC, CLIENT)      AS RN      FROM MYSALES       GROUP BY CLIENT)    SELECT       CLIENT, TOTALVOL, RN   FROM ALLSALES   WHERE RN BETWEEN 3 AND 5   ORDER BY RN      CLIENT     TOTALVOL  RN   BIG BLUE       4781   3   FIGTREE        3986   4   ACME           3044   5
Partitioning?Ranking within Groups
We have seen ROW_NUMBER and RANK operating over an ORDER clause with one or more columns. The ORDER BY must be specified for RANK() and DENSE_RANK(), but is optional for ROW_NUMBER(). That is, selecting

   ROW_NUMBER() OVER()
will number the rows in an arbitrary fashion.

The other clause that may be used with RANK, DENSE_RANK, and ROW_NUMBER is a partitioning clause. This allows you to perform the function over subsets of data.

The query in Listing 7 will obtain the ranking of clients for each month in the past year.

Listing 7. Ranking in partitions

   SELECT      CLIENT, MONTHEND, SALEVOL,      RANK() OVER (PARTITION BY MONTHEND       ORDER BY SALEVOL DESC) AS MONTHRANK   FROM MYSALES   WHERE MONTHEND > CURRENT DATE – 1 YEAR   ORDER BY MONTHEND
As with the ORDER clause, it is possible to specify multiple PARTITION columns, separated by commas.A Wider RANGE of Functions
Now we come to a more powerful feature?the ability to perform column functions on a moving range of rows. You can use any column function, and it will be applied to the records:

  • Grouped in the PARTITION BY clause, if present
  • Limited to a subset of rows using a ROWS or RANGE clause, if specified, where the range operates in the ORDER given.
The easiest way to see this is by example. Listing 8 shows the SUM, AVG, and MAX functions used to fill a range of different roles. If you run the query on our sample data you’ll see some patterns emerge among the different clients.

Listing 8. A Wider Range of Functions

   SELECT      CLIENT, MONTHEND, SALEVOL,         — Total sales ever      SUM(SALEVOL) OVER ()      AS TOTALSALES,         –Year-to-date cumulative total      SUM(SALEVOL) OVER(         PARTITION BY CLIENT, YEAR(MONTHEND)         ORDER BY MONTHEND         ROWS UNBOUNDED PRECEDING)         AS YTD,         –All-time cumulative total across all       –clients prior to this month      SUM(SALEVOL) OVER(         ORDER BY MONTHEND         ROWS UNBOUNDED PRECEDING)         AS ALLYTD,         –Moving 3-month average      AVG(SALEVOL) OVER(         PARTITION BY CLIENT          ORDER BY MONTHEND          ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)         AS MOVAVGDIF,           –Difference to last year’s moving avg      SALEVOL – AVG(SALEVOL) OVER(         PARTITION BY CLIENT          ORDER BY MONTHEND          ROWS BETWEEN 13 PRECEDING AND 11 PRECEDING)         AS MOVPREVDIF,         –Greatest volume by any client for the       –current month      MAX(SALEVOL) OVER(         PARTITION BY MONTHEND) AS MONTHMAX,         –Percentage of total for month      SMALLINT(SALEVOL * 100 /          SUM(SALEVOL) OVER (PARTITION BY MONTHEND))         AS THISPCT,         –Number of months where this client’s sales      — are within +/- 5 units of this month      COUNT(*) OVER(         PARTITION BY CLIENT         ORDER BY SALEVOL         RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING)         AS NUMSAME      FROM MYSALES   ORDER BY CLIENT, MONTHEND
You can use RANGE in place of ROWS, except that RANGE operates only on a single numeric column, and compares the values in the column rather than the number of rows. (See the NUMSAME calculation near the end of Listing 8 above).

That’s only the beginning of the possibilities. It’s important to remember that OLAP queries can place a heavy load on a database, and sometimes the whole data set must be processed even to return a small number of results. That said, you don’t have to transfer large amounts of data to other programs, or worse, across a network, to get the results.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

Recent Articles: