anagers loved Eddie; he always had the answers on-hand.
Eddie was a production supervisor in the pre-computer era. He received comprehensive hourly production reports on paper. He also received urgent queries from managers, who needed up-to-date summaries spanning the entire year's production. He always had a ready answer.
Eddie had learned that in his job, it paid to keep running totals. He could provide a range of summaries without shuffling reams of paper, and point straight to the answers that he kept pinned up on the wall.
Do you run summary queries against your database? Are they slowing down due to the sheer bulk of the raw data? Eddie may have your solution: keep summary tables. In this article, we will examine how DB2 can do it for you and have your answers waiting before you even ask.
![]()
Use summary tables to hold collated information and have them updated as required. Summary tables can be used explicitly or automatically to improve the speed of certain queries.
Summary tables can be:
Grouping the Data
The first step is to decide how to group the data, so let's get some specifics on Eddie's production line. He supervises production of "goop." Goop is made in various colors (say, four), and sold by weight, in bags. Due to variations, each bag of goop must be weighed, and stamped with a serial number. A bag of goop is filled each minute, from each of four machines. The batch number and date must also be recorded for each bag. Listing 1 creates a table to record goop production.
Listing 1: The GOOP table.
-- Don't forget to connect to a database first,
CONNECT TO SAMPLE @
CREATE TABLE TEST.GOOP (
Machine SMALLINT NOT NULL,
SerialNo CHAR(10) NOT NULL,
Color CHAR(10) NOT NULL,
Weight SMALLINT NOT NULL,
DateMade DATE NOT NULL,
PRIMARY KEY (SerialNo)) @
-- Note: I am using "@" to terminate statements.
-- You can set this in the Command Center's
-- "Tools - Settings" dialog, in the "general" tab
-- Otherwise, enter commands individually
Now let's take three queries Eddie must answer:
A. What's the average bag weight for each color from each machine overall?
B. How much goop of each color have we produced in the past seven days excluding today, in how many bags and with what average bag weight?
C. How much goop of each color have we produced so far today, and in how many bags? (Must be up-to-the-minute)
As the database stands, each of these queries could be answered with some simple SQL (see Listing 2), though the database would do a "table scan" for each query. This is much like Eddie looking through every production report he's ever received.
Listing 2: Ordinary queries.
-- Query A
SELECT Machine,
SUM(Weight)/COUNT(*) as AveWeight
FROM TEST.GOOP
GROUP by Machine @
-- Query B
SELECT Color,
SUM(Weight) as SumWeight,
COUNT(*) as Bags,
SUM(Weight)/COUNT(*) as AveWeight
FROM TEST.GOOP
WHERE DateMade BETWEEN (current date - 7 days)
AND (current date - 1 day)
GROUP BY Color @
-- Query C
SELECT Color,
COUNT(*) as Bags,
SUM(Weight) as SumWeight
FROM TEST.GOOP
WHERE DateMade = current date
GROUP BY Color @
-- Indexing can substantially improve performance
create index TEST.GOOPMCIX ON TEST.GOOP
(Machine, Color) @
create index TEST.GOOPDMCIX ON TEST.GOOP
(DateMade, Machine, Color) @
-- Use RUNSTATS when the table is loaded,
-- so the database can decide when to use an index
RUNSTATS ON TABLE TEST.GOOP WITH DISTRIBUTION AND
DETAILED INDEXES ALL @
Adding an index or two (also in Listing 2, above) will help with queries B and C, like Eddie filing his papers by date, but it's still a lot of reading.
Our raw data comprises more than 5,000 records per day of production (4 machines * 60 minutes * 24 hours = 5,760). We can group our data by machine, date, and color, which allows us to reduce our data set to no more than 16 records for each day. In the course of a year, that's 2 million source records summarized in less than 6,000!
Now we need an implementation; let's look at the options.
The Hard Way: Do-it-yourself
Immediate Results
With DB2, the functionality we need to reduce the workload of our reporting is built right into the database, by utilizing Automatic Summary Tables (ASTs). An AST is a "materialized view" a view that is stored on disc for direct access.
To use Automatic Summary Tables we simply:
Listing 3: Create a summary table.
CREATE SUMMARY TABLE TEST.GOOPST AS (
SELECT Machine, Color, DateMade,
SUM(Weight) AS SumWeight,
COUNT(Weight) AS CountWeight,
COUNT(*) AS Bags
FROM TEST.GOOP
GROUP BY CUBE(Machine, Color, DateMade)
) DATA INITIALLY DEFERRED REFRESH IMMEDIATE @
refresh table TEST.GOOPST @
create index TEST.GOOPSTMCIX ON TEST.GOOPST
(Machine, Color) @
create index TEST.GOOPSTDMCIX ON TEST.GOOPST
(DateMade, Machine, Color) @
-- Do a RUNSTATS, but not until you've populated the table!
RUNSTATS ON TABLE TEST.GOOP WITH DISTRIBUTION
AND DETAILED INDEXES ALL @
RUNSTATS ON TABLE TEST.GOOPST WITH DISTRIBUTION
AND DETAILED INDEXES ALL @
-- DB2s query optimizer will make better-informed decisions
-- if you perform a RUNSTATS on both the source table
-- and the summary table while they are both populated
Refresh immediate summary tables:
CREATE SUMMARY TABLE summaryname
[(columname, columname,..)]
AS (fullselect)
DATA INITIALLY DEFERRED REFRESH IMMEDIATE
In Listing 4, we use a CUBE summary table with our revised queries to give us optimal performance. Query A can now be answered from a single row:
Listing 4: A CUBE summary table with revised queries.
CREATE SUMMARY TABLE TEST.GOOPCUBE AS (
SELECT Machine, Color, DateMade,
SUM(Weight) AS SumWeight,
COUNT(Weight) AS CountWeight,
COUNT(*) AS Bags
FROM TEST.GOOP
GROUP BY CUBE(Machine, Color, DateMade)
) DATA INITIALLY DEFERRED REFRESH IMMEDIATE @
refresh table TEST.GOOPCUBE@
create index TEST.GOOPCDMCIX ON TEST.GOOPCUBE (
DateMade, Machine, Color) @
-- We'll only use one index this time, as
-- machine/color totals can be found where DateMade is null
RUNSTATS ON TABLE TEST.GOOPCUBE WITH DISTRIBUTION AND
DETAILED INDEXES ALL @
-- You can DROP TABLE TEST.GOOPST @
-- as GOOPCUBE does the same plus more.
-- Now we can use queries on the original table,
-- and DB2 should choose the AST,
-- or we can explicitly query the cube AST as follows:
-- Query A
SELECT Machine,
SumWeight/Bags as AveWeight
FROM TEST.GOOPcube
WHERE Machine IS NOT NULL
AND DateMade IS NULL
AND Color IS NULL@
-- Query B
SELECT Color,
SUM(SumWeight) as SumWeight,
SUM(Bags) as Bags,
SUM(SumWeight)/sum(Bags) as AveWeight
FROM TEST.GOOPcube
WHERE DateMade BETWEEN (current date - 7 days)
AND (current date - 1 day)
AND Machine IS NOT NULL
AND Color IS NOT NULL
GROUP BY Color@
-- Query C
-- Also includes an all-Colors total
SELECT Color,
Bags, SumWeight
FROM TEST.GOOPcube
WHERE DateMade = CURRENT DATE
AND Machine IS NULL
GROUP BY Color, Bags, SumWeight@
Refresh immediate summary tables are good when:
Update on Request
Examining our queries more carefully, we might:
Hence, we could skip the overhead of a refresh immediate summary, and rebuild the summary each day. A refresh deferred summary table is a neat way to do this. It allows more flexible queries, but has its own limitations.
Refresh deferred summary tables are similar to refresh immediate, except they:Listing 5 shows a refresh deferred definition. Remember, you can increase flexibility and reduce overhead by deferring the update.
Listing 5: Create a Deferred Summary Table
CREATE SUMMARY TABLE TEST.GOOP_DEF AS (
SELECT Machine, DateMade,
SUM(Weight) AS SumWeight,
COUNT(Weight) AS CountWeight,
COUNT(*) AS Bags,
MIN(SerialNo) as MinSerial,
MAX(SerialNo) as MaxSerial
FROM TEST.GOOP
GROUP BY Machine, DateMade
) DATA INITIALLY DEFERRED REFRESH DEFERRED @
create index TEST.GOOPDEFMDIX ON TEST.GOOP_DEF
(Machine, DateMade) @
Note that the contents of goop_def are only updated upon the command
refresh table TEST.GOOP_DEF
meaning that the contents of goop_def are not guaranteed to be current.
Refresh deferred summary tables are good when:
So it's still a balancing act, and there's no single answer to all situations, but now you have two tools to cut your query times when collating large data sets. Decide whether your summary needs to be live, or simply refreshable, and let DB2 work smartlike Eddie.
| DevX is a division of Internet.com. © Copyright 2010 Internet.com. All Rights Reserved. Legal Notices |