Tax Problem Puzzle

Question:
[Joe Celko’s Tax Problem Puzzle]

I got this one from Richard S. Romley, via CompuServe. We were engaged in a debate about Sybase’s old T-SQL versus ANSI/ISO SQL-92. I assume the reader can figure out which side I was on.

This is a simplified version of a problem relating to tax calculations. I will define a tax area as being made up of multiple tax authorities. For example, a tax area might be a city, and the tax authorities for that city might be the city, the county that the city is in, and the state that the county and city are in. When you pay tax on a purchase in the city, the tax rate you pay is made up of the city tax, the county tax, and the state tax. Each of these taxing authorities changes their tax rate independent of the others.

We have the following two tables:

    CREATE TABLE         (taxauthority CHAR(10) NOT NULL,        taxarea CHAR(10) NOT NULL,        PRIMARY KEY (taxauthority, taxarea));
This is a many to many relationship. Each tax area is made up of multiple tax authorities and each tax authority may appear in multiple tax areas.
    TaxAreaAuthority    taxauthority taxarea    ======================    ‘city1’      ‘city1’    ‘city2’      ‘city2’    ‘city3’      ‘city3’    ‘county1’    ‘city1’    ‘county1’    ‘city2’    ‘county2’    ‘city3’    ‘state1’     ‘city1’    ‘state1’     ‘city2’    ‘state1’     ‘city3’
This means that city1 and city2 are in county1 of state1; city3 is in county2 of state 1 and so forth. The other table we need is the tax rates.
    CREATE TABLE TaxRates        (taxauthority CHAR(10) NOT NULL,        dteffective DATE NOT NULL,        authtaxrate DECIMAL (8,2) NOT NULL,        PRIMARY KEY (taxauthority, dteffective));
Populate this table as follows:
    TaxRates    taxauthority dteffective authtaxrate    ======================================    ‘city1’      ‘1993-01-01’     1.0    ‘city1’      ‘1994-01-01’     1.5    ‘city2’      ‘1993-09-01’     1.5    ‘city2’      ‘1994-01-01’     2.0    ‘city2’      ‘1995-01-01’     2.5    ‘city3’      ‘1993-01-01’     1.7    ‘city3’      ‘1993-07-01’     1.9    ‘county1’    ‘1993-01-01’     2.3    ‘county1’    ‘1994-Oct01’     2.5    ‘county1’    ‘1995-01-01’     2.7    ‘county2’    ‘1993-01-01’     2.4    ‘county2’    ‘1994-01-01’     2.7    ‘county2’    ‘1995-01-01’     2.8    ‘state1’     ‘1993-01-01’     0.5    ‘state1’     ‘1994-01-01’     0.8    ‘state1’     ‘1994-07-01’     0.9    ‘state1’     ‘1994-10-01’     1.1
The problem: What is the total tax rate for City2 on November 1 1994? Richard’s solution with a single SELECT, used no views or subqueries. In T-SQL, it looked like this:
    SELECT totaltaxrate = SUM(MAX(R1.AuthTaxRate))         FROM TaxAreaAuthority A1, TaxRates R1, TaxRates R2        WHERE A1.taxarea = ‘city2’            AND A1.taxauthority = R1.taxauthority            AND A1.taxauthority = R2.taxauthority            AND R1.dteffective <= '1994-11-01'            AND R2.dteffective <= '1994-11-01'        GROUP BY R1.taxauth, R1.dteffective        HAVING R1.dteffective = MAX(R2.dteffective);
The answer for City2 on November 1, 1994 would be
    City2   taxrate = 2.0    County1 taxrate = 2.5    State1  taxrate = 1.1    ———————    TotalTaxRate    = 5.6
The nested aggregate functions will not work in any standard SQL. The SELECT list does not have the GROUP BY columns and the HAVING clause references columns not in the SELECT list, too. In short, this is not standard SQL.

Ignoring the requirement that you do not use subqueries, can you write a single query that will answer this question in SQL-92?

Answer:
The nested aggregate functions are much harder to understand than subqueries. Since aggregate functions apply to the groups, the only way SUM(MAX(R1.AuthTaxRate)) could be interpreted is that it is the sum of the largest value of R1.AuthTaxRate in each group, which is simply the largest value of R1.AuthTaxRate in each group, period. In order to read this as T-SQL intends, you have to force a hierarchy of data, instead of sets of data. In short, you are throwing away the relational model.

In English the question is: “For City2 on 1994 November 1, what was the total tax rate, based on the latest rates in effect at that time from each of the authorities which has control over City2?”

Let’s solve this problem in pieces. First we want to find out who the taxing authorities for the city are, so we write a subquery

    (SELECT taxauthority         FROM TaxAreaAuthority AS A1        WHERE A1.taxarea = ‘city2’)
which will give us the set (‘city2’, ‘county1’, ‘state1’).

Next we want to find what the tax rates were on 1994 November 1, so we write another subquery:

    (SELECT taxauthority, authtaxrate        FROM TaxRates AS R1        WHERE R1.dteffective = (SELECT MAX (R2.dteffective)            FROM TaxRates AS R2            WHERE R2.dteffective <= '1994-11-01'))
Now, let’s combine the two subqueries and put our constants in the SELECT list to make the final answer readable. Actually, I would make these constants into parameters, so the routine would generalize, but for now, let’s stick to the original problem:
    SELECT ‘city2’ AS city,             ‘1994-11-01’ AS effective_date,             R1.taxauthority,             authtaxrate        FROM TaxRates AS R1        WHERE R1.dteffective = (SELECT MAX (R2.dteffective)                FROM TaxRates AS R2                WHERE R2.dteffective <= '1994-11-01'                    AND R1.taxauthority = R2.taxauthority)            AND taxauthority IN (SELECT taxauthority                 FROM TaxAreaAuthority AS A1                WHERE A1.taxarea = 'city2');
now let’s do the totaling:
    SELECT ‘city2’ AS city,             ‘1994-11-01’ AS effective_date,             SUM (authtaxrate) AS total_taxes        FROM TaxRates AS R1        WHERE R1.dteffective = (SELECT MAX (R2.dteffective)                FROM TaxRates AS R2                WHERE R2.dteffective <= '1994-11-01'                    AND R1.taxauthority = R2.taxauthority)            AND taxauthority IN (SELECT taxauthority                 FROM TaxAreaAuthority AS A1                WHERE A1.taxarea = 'city2')        GROUP BY city, effective_date;
But wait! We can do more consolidation and get rid of the AND predicate in favor of more nesting, thus:
    SELECT ‘city2’ AS city,            ‘1994-11-01’ AS effect_date,            SUM (authtaxrate) AS total_taxes        FROM TaxRates AS R1        WHERE R1.dteffective = (SELECT MAX (R2.dteffective)            FROM TaxRates AS R2            WHERE R2.dteffective <= '1994-11-01'                AND R1.taxauthority = R2.taxauthority                AND R2.taxauthority                    IN (SELECT taxauthority                        FROM TaxAreaAuthority AS A1                        WHERE A1.taxarea = 'city2'))        GROUP BY city, effect_date;
Since the innermost subquery is a non-correlated, constant list, performance should be pretty good. And sure enough, when I look at the execution plan in Watcom SQL where I tested this, I find that the R1 and R2 tables were sequentially scanned, but the A1 table used the primary key index. If I put indexes on the TaxRates table, then I can get a faster execution plan.

Puzzle provided courtesy of:
Joe Celko
[email protected]

devxblackblue

About Our Editorial Process

At DevX, we’re dedicated to tech entrepreneurship. Our team closely follows industry shifts, new products, AI breakthroughs, technology trends, and funding announcements. Articles undergo thorough editing to ensure accuracy and clarity, reflecting DevX’s style and supporting entrepreneurs in the tech sphere.

See our full editorial policy.

About Our Journalist