Re: Datamation's Proposed Extention to SQL Language (#1)

From: <jfr_at_trwlasd.com>
Date: Sun, 17 Jul 1994 06:04:20 GMT
Message-ID: <Ct2M79.CMA_at_trwlasd.com>


Michael Sallwasser writes
> In the June 1, 1994 issue of Datamation, the cover story proposes 3
> extensions to the SQL language. In an article, "SQL is our language.
> Fix it now.", the editors suggest three extensions that appear
> syntactically compatible with SQL as I know it. Here is the first
> of the three. I have made my comments. I would be interested in
> reading yours.
>
> [If you have not read the article, what follows may not be very clear]
>
> #1 ALTERNATE clause:
>
> This feature would allow two group functions to be used in the same
> select statement with different selection criteria for each group
> function. The ALTERNATE constraint replaces all constraints on the same
> table in the surrounding query.
>
> The example they give is for comparing this year's sales with last
> year's sales by productname. The syntax might be:
>
> select item.name, sum(sales.amt)/
> sum(alternate sales.amt where sales.year = 1993)
> from sales,
> item
> where sales.item_id = item.id
> and sales.year = 1994
> group by item.name;
>
> The intention is that the WHERE clause within the alternate clause would
> replace that all parts of the main WHERE clause other than join
> conditions. I believe a clearer syntax might be as follows:
>
> select item.name,
> sum(sales.amt where sales.year=1994)/
> sum(sales.amt where sales.year=1993)
> from sales,
> item
> where sales.item_id = item.id
> group by item.name;

>

> In this case, there is no replacement, the WHERE clause within the group
> function provides constraints in addition to those contained with the
> main WHERE clause.

As they say, been there, done that :-) Seriously, this is a good idea, but ANSI SQL 92 already provides a mechanism (albeit a syntactically uglier mechanism to do this:

        select item.name,
           sum(case when sales.year=1994 then sales.amt otherwise NULL)/
           sum(case when sales.year=1993 then sales.amt otherwise NULL)
        etc., etc., etc.

The CASE construct is similar to a function and is used essentially the same way as an IF() function in a Lotus 1-2-3 spreadsheet in that it returns a single value based on one or more alternatives.

While I was Manager of Core Technology at Red Brick Systems, we implemented in the Red Brick Warehouse database product (RBW) an even more powerful feature (also a part of ANSI SQL 92): subqueries in the SELECT list. Using this feature, you could also write the above query as a set of correlated subqueries in the SELECT list:

        select item.name,
           (select sum(sales) from item
             where item.id = sales.id and year = 1994)/
           (select sum(sales) from item
             where item.id = sales.id and year = 1993)
        from sales;

Note that there is no join or group by required in this query because the outer query effectively groups the results by traversing its table only once. However, the inner queries are correlated (in this example, but they don't have to be) and so the query optimizer and executor must come up with relatively good ways of handling this. The nice thing is you can do even more interesting queries with this facility:

        select item.name,
            (select sum(sales) from item
             where item.id = sales.id and year = 1994)/
            (select sum(sales) from item where year = 1994)
        from sales;

This query gets you a comparison of sales for each product sold in 1994 against the total sales for the year. The second subquery is NOT correlated in this query, and a good optimizer would probably only calculate this value once.

Again, none of this is really new, it is already defined in ANSI SQL 92 and frankly the article in Datamation strikes me as typical of people who ask for things without being fully informed about what has been defined already (or they don't understand the impact of various features of the ANSI definition like the CASE construct and subqueries in the SELECT list).  

> Either way, this feature would be of benefit to the customers that I
> support. If the SQL could be generated by the various third party GUI
> tools we use that would be even better.

Unfortunately, GUI tools generate SQL for thelowest common denominator database systems. At Red Brick, where we had most of the functions detailed below, we could almost never get third-party tool vendors to take advantage of these functions. This was also true when I worked at Teradata as Manager of Coordinated Products (where I was charged with getting third party vendors to build cooperative links to Teradata's DBC parallel processing database system). It requires extreme forethought to build a product which can (1) work with a wide variety of databases and (2) use all of the differing features of each optimally. Most products are NOT built with that level of forethought.

> #2 sequential functions:
>
> The editors propose three sequential functions:
>
> number(),
> numberof(),
> totalof(expression)
> rank(expression)
> tertile(expression)
> quartile(expression)
> movingavg(expression,num)
> movingsum(expression,num)
> cumulative(expression)
>
> I would propose a generic version of tertile (if that a word?) and
> quartile, ntile(expression,n) (e.g. quartile(expression) ==
> ntile(expression,4)

Red Brick Warehouse had many of these operators (I would guess that at least one or more of the authors of this article had at least a fleeting acquaintance with RBW or else with FOCUS). I am not convinced that all of these functions make sense in a relational database query language. The problem is timing. The only sensible way to process these functions is AFTER the ORDER BY clause is processed (prior to that, there is no "normal" order to the query and the GROUP BY doesn't define one either, just take some time to think about it and you will see why). RBW required a LOT of strange manipulations to make these work and the range of deterministic behavior was rather limited. If you want to apply predicate behavior to these numbers ("give me the products in the second quartile" or "give me the worst fifteen sales people in terms of sales") you need ANOTHER predicate clause (you can't really put these predicates in the WHERE or HAVING clauses unless you want to heavily overload the meaning of these clauses). Its ugly stuff, to be sure.

I personally, have always wanted user-defined functions and aggregation functions. These provide a lot more return for the effort invested. User-defined scalar functions is relatively trivial (you just need a language to define them in and a way of extending the SQL function list dynamically). Aggregations are trickier because they are in effect iterator functions which require three separate phases: initialization, repetitive execution for each included value, and summary. Initialization has to be performed each time a new grouping is identified, and repetitive execution is used to accumulate whichever values are required. Summary must be performed at the end of all processing (once you know the entire set of data exists) and calculates the actual value that is returned.

--
Jon Rosen
=============================================================
So remember when you're feeling very small and insecure,
How amazingly unlikely is your birth,
And pray that there's intelligent life somewhere up in space,
Because there's bugger all down here on Earth. - Monty Python
-- 
Jon Rosen
=============================================================
So remember when you're feeling very small and insecure,
How amazingly unlikely is your birth,
Received on Sun Jul 17 1994 - 08:04:20 CEST

Original text of this message