Tame Beastly Data with Summary Tables

Tame Beastly Data with Summary Tables

anagers loved Eddie; he always had the answers on-hand.

Eddie was a production supervisor in the pre-computer era. He received comprehensive hourly production reports on paper. He also received urgent queries from managers, who needed up-to-date summaries spanning the entire year’s production. He always had a ready answer.

Eddie had learned that in his job, it paid to keep running totals. He could provide a range of summaries without shuffling reams of paper, and point straight to the answers that he kept pinned up on the wall.

Do you run summary queries against your database? Are they slowing down due to the sheer bulk of the raw data? Eddie may have your solution: keep summary tables. In this article, we will examine how DB2 can do it for you and have your answers waiting before you even ask.

How do you quickly obtain a summary from a massive store of live data?

Use summary tables to hold collated information and have them updated as required. Summary tables can be used explicitly or automatically to improve the speed of certain queries.

Summary tables can be:

  • Do-it-yourself: Maintained by application logic or triggers
  • Automatic: Maintained synchronously by DB2
  • Semi-automatic: Maintained asynchronously by DB2 (updates upon request)
  • Externally hosted: Take the data elsewhere for analysis and reporting. Where you have a lot of activity on your source data, it may be worth replicating to another machine. In fact, this is recommended practice for online analytical processing (OLAP), but outside the scope of this article.
Grouping the Data
The first step is to decide how to group the data, so let’s get some specifics on Eddie’s production line. He supervises production of “goop.” Goop is made in various colors (say, four), and sold by weight, in bags. Due to variations, each bag of goop must be weighed, and stamped with a serial number. A bag of goop is filled each minute, from each of four machines. The batch number and date must also be recorded for each bag. Listing 1 creates a table to record goop production.

Listing 1: The GOOP table.– Don’t forget to connect to a database first,CONNECT TO SAMPLE @CREATE TABLE TEST.GOOP (  Machine SMALLINT NOT NULL,  SerialNo CHAR(10) NOT NULL,  Color CHAR(10) NOT NULL,  Weight SMALLINT NOT NULL,  DateMade DATE NOT NULL,  PRIMARY KEY (SerialNo)) @– Note: I am using “@” to terminate statements.– You can set this in the Command Center’s — “Tools – Settings” dialog, in the “general” tab– Otherwise, enter commands individually

Now let’s take three queries Eddie must answer:

A. What’s the average bag weight for each color from each machine overall?
B. How much goop of each color have we produced in the past seven days excluding today, in how many bags and with what average bag weight?
C. How much goop of each color have we produced so far today, and in how many bags? (Must be up-to-the-minute)

As the database stands, each of these queries could be answered with some simple SQL (see Listing 2), though the database would do a “table scan” for each query. This is much like Eddie looking through every production report he’s ever received.

Listing 2: Ordinary queries.– Query ASELECT Machine,   SUM(Weight)/COUNT(*) as AveWeightFROM TEST.GOOPGROUP by Machine @– Query BSELECT Color,   SUM(Weight) as SumWeight,  COUNT(*) as Bags,   SUM(Weight)/COUNT(*) as AveWeightFROM TEST.GOOP WHERE DateMade BETWEEN (current date – 7 days)   AND (current date – 1 day)GROUP BY Color @– Query CSELECT Color,   COUNT(*) as Bags,  SUM(Weight) as SumWeight FROM TEST.GOOPWHERE DateMade = current dateGROUP BY Color @– Indexing can substantially improve performancecreate index TEST.GOOPMCIX ON  TEST.GOOP (Machine, Color) @create index TEST.GOOPDMCIX ON  TEST.GOOP (DateMade, Machine, Color) @– Use RUNSTATS when the table is loaded,– so the database can decide when to use an indexRUNSTATS ON TABLE TEST.GOOP WITH DISTRIBUTION AND DETAILED INDEXES ALL @

Adding an index or two (also in Listing 2, above) will help with queries B and C, like Eddie filing his papers by date, but it’s still a lot of reading.

Our raw data comprises more than 5,000 records per day of production (4 machines * 60 minutes * 24 hours = 5,760). We can group our data by machine, date, and color, which allows us to reduce our data set to no more than 16 records for each day. In the course of a year, that’s 2 million source records summarized in less than 6,000!

Now we need an implementation; let’s look at the options.

The Hard Way: Do-it-yourself
The simplest way to achieve our summary table objective, in a simple environment, is to have the application maintain the summary. It sounds straightforward, but complexity grows exponentially, even if the logic is placed in triggers. It’s also unnecessary, as DB2 can do the work for us.Immediate Results
With DB2, the functionality we need to reduce the workload of our reporting is built right into the database, by utilizing Automatic Summary Tables (ASTs). An AST is a “materialized view” ?a view that is stored on disc for direct access.

To use Automatic Summary Tables we simply:

  1. Decide how to group the data, and whether we want it updated immediately (“refresh immediate”) or on-request (“refresh deferred”)
  2. Create a summary table for the summary and optionally index it for even greater performance (see Listing 3).
Listing 3: Create a summary table.CREATE SUMMARY TABLE TEST.GOOPST AS (  SELECT Machine, Color, DateMade,   SUM(Weight) AS SumWeight,   COUNT(Weight) AS CountWeight,  COUNT(*) AS Bags  FROM TEST.GOOP  GROUP BY CUBE(Machine, Color, DateMade)) DATA INITIALLY DEFERRED REFRESH IMMEDIATE @refresh table TEST.GOOPST @create index TEST.GOOPSTMCIX ON  TEST.GOOPST (Machine, Color) @create index TEST.GOOPSTDMCIX ON  TEST.GOOPST (DateMade, Machine, Color) @– Do a RUNSTATS, but not until you’ve populated the table!RUNSTATS ON TABLE TEST.GOOP WITH DISTRIBUTION AND DETAILED INDEXES ALL @RUNSTATS ON TABLE TEST.GOOPST WITH DISTRIBUTION AND DETAILED INDEXES ALL @– DB2s query optimizer will make better-informed decisions– if you perform a RUNSTATS on both the source table– and the summary table while they are both populated

Select Your Refreshment
You could now change your reports to query the summary table or you can continue using the original queries. First you need to decide whether to use refresh immediate or refresh deferred ASTs.

Refresh immediate summary tables:

  • are defined by

    CREATE SUMMARY TABLE summaryname[(columname, columname,..)]AS (fullselect)DATA INITIALLY DEFERRED REFRESH IMMEDIATE

  • operate synchronously, that is, they are updated as part of any transaction on their source tables. This incurs some overhead, but the results are always available.
  • Can be used by DB2 to automatically optimize both static and dynamic SQL queries
  • Must contain a GROUP BY clause and a COUNT(*) or COUNT_BIG(*)
  • May use GROUPING SETS, CUBE, and ROLLUP (See Listing 4, below)
  • May contain a SUM(columname) column function, but requires a COUNT(columname) on the same column
  • May be based on an inner-join between tables or views, but cannot use the INNER JOIN keywords.

In Listing 4, we use a CUBE summary table with our revised queries to give us optimal performance. Query A can now be answered from a single row:

Listing 4: A CUBE summary table with revised queries.CREATE SUMMARY TABLE TEST.GOOPCUBE AS (  SELECT Machine, Color, DateMade,   SUM(Weight) AS SumWeight,   COUNT(Weight) AS CountWeight,  COUNT(*) AS Bags  FROM TEST.GOOP  GROUP BY CUBE(Machine, Color, DateMade)) DATA INITIALLY DEFERRED REFRESH IMMEDIATE @refresh table TEST.GOOPCUBE@create index TEST.GOOPCDMCIX ON  TEST.GOOPCUBE (DateMade, Machine, Color) @– We’ll only use one index this time, as– machine/color totals can be found where DateMade is nullRUNSTATS ON TABLE TEST.GOOPCUBE WITH DISTRIBUTION AND DETAILED INDEXES ALL @– You can DROP TABLE TEST.GOOPST @– as GOOPCUBE does the same plus more.– Now we can use queries on the original table, — and DB2 should choose the AST, — or we can explicitly query the cube AST as follows:– Query ASELECT Machine,   SumWeight/Bags as AveWeightFROM TEST.GOOPcubeWHERE Machine IS NOT NULL  AND DateMade IS NULL  AND Color IS NULL@– Query BSELECT Color,   SUM(SumWeight) as SumWeight,  SUM(Bags) as Bags,   SUM(SumWeight)/sum(Bags) as AveWeightFROM TEST.GOOPcube WHERE DateMade BETWEEN (current date – 7 days)   AND (current date – 1 day)  AND Machine IS NOT NULL  AND Color IS NOT NULLGROUP BY Color@– Query C — Also includes an all-Colors totalSELECT Color,   Bags, SumWeight FROM TEST.GOOPcubeWHERE DateMade = CURRENT DATE  AND Machine IS NULLGROUP BY Color, Bags, SumWeight@

Refresh immediate summary tables are good when:

  • Having a summary table will avoid the need to scan part or all of its source table
  • You can afford the transaction overhead
  • You need a completely up-to-date summary
  • Using a regular query or view is too slow, or fails due to other constraints.
Update on Request
Examining our queries more carefully, we might:
  • Decide that query A is acceptable without the current day’s data
  • Recognize that query B doesn’t need the current day’s data
  • Establish that query C runs fast enough on the raw data.

Hence, we could skip the overhead of a refresh immediate summary, and rebuild the summary each day. A refresh deferred summary table is a neat way to do this. It allows more flexible queries, but has its own limitations.

Refresh deferred summary tables are similar to refresh immediate, except they:
  • Operate asynchronously; they are refreshed only when you explicitly refresh them
  • May use scalar functions
  • May use other column functions, such as MIN and MAX
  • Do not require a COUNT(*) or COUNT_BIG(*)
  • May contain a HAVING clause
  • Cannot be used to automatically optimize static SQL queries
  • Can be used to automatically optimize dynamic SQL queries under certain circumstances (see sidebar “A Faster Disaster”).

Listing 5 shows a refresh deferred definition. Remember, you can increase flexibility and reduce overhead by deferring the update.

Listing 5: Create a Deferred Summary TableCREATE SUMMARY TABLE TEST.GOOP_DEF AS (  SELECT Machine, DateMade,   SUM(Weight) AS SumWeight,   COUNT(Weight) AS CountWeight,  COUNT(*) AS Bags,  MIN(SerialNo) as MinSerial,  MAX(SerialNo) as MaxSerial  FROM TEST.GOOP  GROUP BY Machine, DateMade) DATA INITIALLY DEFERRED REFRESH DEFERRED @create index TEST.GOOPDEFMDIX ON  TEST.GOOP_DEF (Machine, DateMade) @

Note that the contents of goop_def are only updated upon the command

refresh table TEST.GOOP_DEF

meaning that the contents of goop_def are not guaranteed to be current.

Refresh deferred summary tables are good when:

  • You can refresh a summary table once and query it many times, due to the timing of the transactions and queries?for example, if you run queries on historical daily summaries or the source data arrives in bursts.
  • You need the extra functions that aren’t available with refresh immediate.

So it’s still a balancing act, and there’s no single answer to all situations, but now you have two tools to cut your query times when collating large data sets. Decide whether your summary needs to be live, or simply refreshable, and let DB2 work smart?like Eddie.


Share the Post: