Re: SQL competition: the shortest interval coalesce/pack query

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Fri, 03 Dec 2004 09:45:34 +0100
Message-ID: <0a90r0tudtmkmm0edh1p9u8h5qurr7qbi5_at_4ax.com>


On Thu, 2 Dec 2004 17:28:40 -0800, Mikito Harakiri wrote:

>Write the shortest sql query that would return a minimal cover of a set of
>intervals. For example, given
>
>Intervals = {(x=1,y=3), (x=2,y=5), (x=4, y=11), (x=10,y=12), (x=20,y=21)}
>
>it should return
>
>cover(Intervals) = {(x=1,y=12), (x=20,y=21)}
>
>"Developing Time-Oriented..." book by R.Snordgrass demonstrates several
>rather lenthy solutions (pp.160+). The book is freely downloadable.
>
>To be able to compare results, lets agree that each of the clauses (select,
>from, where, group by, having) starts a new line. (Each subquery/inner view
>must obey this rule as well.) Also, no more than 2 column expressions are
>allowed to fit on a single line in the select clause. Likewise, no more than
>2 view/tables are allowed on a single line within from clause, no more than
>2 predicates in the each line of the where clause, etc. If anybody knows
>alternative (maybe even standard) sql formatting rules, you are welcome to
>suggest those.
>

Hi Mikito,

I didn't check out the book you mention, so I don't know if my solutions are shorter. They are tested on SQL Server 2000. I tried not to use any proprietary code, so it should be easily ported to other platforms. I used my own formatting preferences, reformat as you see fit for either your personal style or for comparisons.

Here's a query that uses a self-join and three nested correlated subqueries:

SELECT      a.x, MAX(b.y) AS y
FROM        Intervals AS a
INNER JOIN  Intervals AS b
      ON    b.y > a.x

WHERE NOT EXISTS
 (SELECT *
  FROM       Intervals AS c
  WHERE     a.x - 1 BETWEEN c.x AND c.y)
AND NOT EXISTS
 (SELECT *
  FROM      Intervals AS d
  WHERE     d.y > a.x
  AND       d.y < b.y

  AND NOT EXISTS
   (SELECT *
    FROM Intervals AS e
    WHERE d.y + 1 BETWEEN e.x AND e.y)) GROUP BY a.x

And this is essentially the same format, but converted to use left anti semi joins instead of subqueries. I don't think it's shorter, but it might execute better on some platforms and some people prefer this format over subqueries. Your call.

SELECT      a.x, MAX(b.y) AS y
FROM        Intervals AS a
INNER JOIN  Intervals AS b
      ON    b.y > a.x
LEFT  JOIN  Intervals AS c
      ON    a.x - 1 BETWEEN c.x AND c.y

LEFT JOIN (Intervals AS d
LEFT JOIN Intervals AS e
      ON    d.y + 1 BETWEEN e.x AND e.y)
      ON    d.y > a.x
      AND   d.y < b.y
      AND   e.x IS NULL
WHERE       c.x IS NULL
AND         d.x IS NULL
GROUP BY    a.x


If the table is large, the correlated subqueries (version 1) or the quintuple self-join (version 2) will probably make it slow. But you asked for a short query, not for a quick one :-)

It might be worthwile to investigate if a numbers table can be used to construct a shorter or faster solution, but I decided to make my solution as vanilla as possible, using only the Intervals table and nothing else.

BTW, this is the SQL I used to create table and test data:

CREATE TABLE Intervals

        (x int NOT NULL
        ,y int NOT NULL
        ,PRIMARY KEY (x, y)
        )

go

INSERT INTO Intervals (x, y)
SELECT 1, 3 UNION ALL
SELECT 2, 5 UNION ALL
SELECT 4, 11 UNION ALL

SELECT 10, 12 UNION ALL
SELECT 20, 21
go

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
Received on Fri Dec 03 2004 - 09:45:34 CET

Original text of this message