Browse DevX
Sign up for e-mail newsletters from DevX


Tame Beastly Data with Summary Tables-2 : Page 2




Building the Right Environment to Support AI, Machine Learning and Deep Learning

Grouping the Data
The first step is to decide how to group the data, so let's get some specifics on Eddie's production line. He supervises production of "goop." Goop is made in various colors (say, four), and sold by weight, in bags. Due to variations, each bag of goop must be weighed, and stamped with a serial number. A bag of goop is filled each minute, from each of four machines. The batch number and date must also be recorded for each bag. Listing 1 creates a table to record goop production.

Listing 1: The GOOP table. -- Don't forget to connect to a database first, CONNECT TO SAMPLE @ CREATE TABLE TEST.GOOP ( Machine SMALLINT NOT NULL, SerialNo CHAR(10) NOT NULL, Color CHAR(10) NOT NULL, Weight SMALLINT NOT NULL, DateMade DATE NOT NULL, PRIMARY KEY (SerialNo)) @ -- Note: I am using "@" to terminate statements. -- You can set this in the Command Center's -- "Tools - Settings" dialog, in the "general" tab -- Otherwise, enter commands individually

Now let's take three queries Eddie must answer:

A. What's the average bag weight for each color from each machine overall?
B. How much goop of each color have we produced in the past seven days excluding today, in how many bags and with what average bag weight?
C. How much goop of each color have we produced so far today, and in how many bags? (Must be up-to-the-minute)

As the database stands, each of these queries could be answered with some simple SQL (see Listing 2), though the database would do a "table scan" for each query. This is much like Eddie looking through every production report he's ever received.

Listing 2: Ordinary queries. -- Query A SELECT Machine, SUM(Weight)/COUNT(*) as AveWeight FROM TEST.GOOP GROUP by Machine @ -- Query B SELECT Color, SUM(Weight) as SumWeight, COUNT(*) as Bags, SUM(Weight)/COUNT(*) as AveWeight FROM TEST.GOOP WHERE DateMade BETWEEN (current date - 7 days) AND (current date - 1 day) GROUP BY Color @ -- Query C SELECT Color, COUNT(*) as Bags, SUM(Weight) as SumWeight FROM TEST.GOOP WHERE DateMade = current date GROUP BY Color @ -- Indexing can substantially improve performance create index TEST.GOOPMCIX ON TEST.GOOP (Machine, Color) @ create index TEST.GOOPDMCIX ON TEST.GOOP (DateMade, Machine, Color) @ -- Use RUNSTATS when the table is loaded, -- so the database can decide when to use an index RUNSTATS ON TABLE TEST.GOOP WITH DISTRIBUTION AND DETAILED INDEXES ALL @

Adding an index or two (also in Listing 2, above) will help with queries B and C, like Eddie filing his papers by date, but it's still a lot of reading.

Our raw data comprises more than 5,000 records per day of production (4 machines * 60 minutes * 24 hours = 5,760). We can group our data by machine, date, and color, which allows us to reduce our data set to no more than 16 records for each day. In the course of a year, that's 2 million source records summarized in less than 6,000!

Now we need an implementation; let's look at the options.

The Hard Way: Do-it-yourself
The simplest way to achieve our summary table objective, in a simple environment, is to have the application maintain the summary. It sounds straightforward, but complexity grows exponentially, even if the logic is placed in triggers. It's also unnecessary, as DB2 can do the work for us.

Comment and Contribute






(Maximum characters: 1200). You have 1200 characters left.



Thanks for your registration, follow us on our social networks to keep up-to-date