Re: Newbie Query Question

From: Douglas Hawthorne <douglashawthorne_at_yahoo.com.au>
Date: Mon, 12 Apr 2004 06:38:55 GMT
Message-ID: <3Qqec.5251$ED.1237@news-server.bigpond.net.au>

"JES" <nausadge_at_yahoo.com> wrote in message news:xu1ec.14590$mP1.6039_at_newssvr22.news.prodigy.com...
> I have a table that is "long" and I must transform the table into a "wide"
> table. For example, the long table is DATE, DATAVALUE, DATANAME (where
> there is one record for each measured DATAVALUE and where DATANAME comes
> from pre-defined list of names) and I need to transform this table into
> DATE, DATANAME1, DATANAME2, DATANAME3, ETC, where I get one record for
each
> unique DATE and where DATAVALUE from the original table is assigned to the
> corresponding DATANAME column in the "wide" table.
>
> I have done this two ways, and one way is much slower in Oracle than the
> other. Why? Any suggestions for another method? Both methods seem to
run
> quickly in Access...
>
> Method 1. Something like the following...
>
> SELECT DATE AS DATE,
> (SELECT DATAVALUE
> FROM tbl1_tmp
> WHERE tbl1.DATE = tbl1_tmp.DATE AND
> WHERE DATANAME = 'DATANAME1') AS DATANAME1,
> (SELECT DATAVALUE
> FROM tbl1_tmp
> WHERE tbl1.DATE = tbl1_tmp.DATE AND
> DATANAME = 'DATANAME2') AS DATANAME2,
> (SELECT DATAVALUE
> FROM tbl1_tmp
> WHERE tbl1.DATE = tbl1_tmp.DATE AND
> WHERE DATANAME = 'DATANAME3') AS DATANAME3
>
> FROM tbl1
>
> Method 2. And something like the following
>
> SELECT DATE AS DATE,
> SELECT tbl1_tmp1.DATAVALUE AS DATANAME1
> SELECT tbl1_tmp2.DATAVALUE AS DATANAME2
> SELECT tbl1_tmp3.DATAVALUE AS DATANAME3
>
> FROM
> (SELECT DATAVALUE
> FROM tbl1_tmp
> WHERE tbl1.DATE = tbl1_tmp.DATE AND
> WHERE DATANAME = 'DATANAME1') tbl1_tmp1,
>
> (SELECT DATAVALUE
> FROM tbl1_tmp
> WHERE tbl1.DATE = tbl1_tmp.DATE AND
> DATANAME = 'DATANAME2') tbl1_tmp2,
>
> (SELECT DATAVALUE
> FROM tbl1_tmp
> WHERE tbl1.DATE = tbl1_tmp.DATE AND
> WHERE DATANAME = 'DATANAME3') tbl1_tmp3,
>
> tbl1
>
>
> Thanks,
>
> JES
>
>
>

JES, Method #1 has a couple of syntax errors which were easily fixed. However, method #2 is not valid SQL because you are mixing a correlated subquery with a Cartesian product.

In summary, the biggest improvement for method #1 came from putting a primary key constraint on the TBL1_TMP table with a reduction in elapsed time from 349.26 seconds to 2.46 seconds (a reduction of 99.55%). The best time I could get with the test data was 0.82 seconds using a pivot query on a heap table.

The technique you need to master is that of pivot query. For an explanation, see "Pivot Query" on pp.576 to 582 of "Expert One-on-One Oracle" by Thomas Kyte (A-Press:2003). You could also search http://asktom.oracle.com for "pivot query".

These tests were run under 10.1.0.2 on WinXP Pro.

(1) Creation of Test Data

Let's create some test data:
CREATE TABLE tbl1_tmp AS

SELECT

         TRUNC( SYSDATE ) + FLOOR( rownum / 3 ) AS datadate,
         DECODE( MOD( rownum, 3),
                 0, 'DATANAME1',
                 1, 'DATANAME2',
                 2, 'DATANAME3'
            ) AS dataname,
         rownum AS datavalue
      FROM
         all_objects

;

I could not call the first column, 'DATE', because it is a data type. The table structure is:
SQL> desc tbl1_tmp

 Name                Null?    Type
 ------------------- -------- ------------------
 DATADATE                     DATE
 DATANAME                     VARCHAR2(9)
 DATAVALUE                    NUMBER

The date table (TBL1) was created as follows: CREATE TABLE tbl1 AS SELECT DISTINCT datadate FROM tbl1_tmp;

I collected statistics for the tables:
SQL> exec dbms_stats.gather_table_stats( USER, 'TBL1' ) SQL> exec dbms_stats.gather_table_stats( USER, 'TBL1_TMP' )

To get the time taken for a query,
SQL> SET TIMING ON I turned SQL tracing on with:
ALTER SESSION SET timed_statistics=true; ALTER SESSION SET sql_trace=true;

(2) Original Query

I then corrected some errors in your first query to produce: SELECT datadate,
(SELECT DATAVALUE
FROM tbl1_tmp
WHERE tbl1.datadate = tbl1_tmp.datadate AND DATANAME = 'DATANAME1') AS DATANAME1,
(SELECT DATAVALUE
FROM tbl1_tmp
WHERE tbl1.datadate = tbl1_tmp.datadate AND DATANAME = 'DATANAME2') AS DATANAME2,
(SELECT DATAVALUE
FROM tbl1_tmp
WHERE tbl1.datadate = tbl1_tmp.datadate AND DATANAME = 'DATANAME3') AS DATANAME3
FROM tbl1
;

The results were as follows:

13035 rows selected.
Elapsed: 00:05:49.26

The TKPROF output shows:
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ------

Parse 1 0.31 0.32 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 870 347.29 347.95 0 6062169 0 13035
------- ------ -------- ---------- ---------- ---------- ---------- ------

total 872 347.60 348.28 0 6062169 0 13035

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 73

Rows Row Source Operation

-------  ---------------------------------------------------
  13034  TABLE ACCESS FULL TBL1_TMP (cr=2020425 pr=0 pw=0 time=115302886 us)

13034 TABLE ACCESS FULL TBL1_TMP (cr=2020425 pr=0 pw=0 time=115816431 us) 13034 TABLE ACCESS FULL TBL1_TMP (cr=2020425 pr=0 pw=0 time=115808387 us) 13035 TABLE ACCESS FULL TBL1 (cr=894 pr=0 pw=0 time=91381 us)

The main point to note is that we are doing in excess of 6 million logical I/O's to process this query because each correlated subquery requires a table scan.

(3) Original Query with PK

I then added a primary key constraint (TBL1_TMP_PK) for DATADATE and DATANAME because the nature of the first query assumes that the correlated subqueries return a single value:
ALTER TABLE tbl1_tmp

ADD CONSTRAINT tbl1_tmp_pk

PRIMARY KEY (datadate, dataname)
;

I collected statistics for the index:
SQL> exec dbms_stats.gather_index_stats( USER, 'TBL1_TMP_PK' )

The query was rerun after statistics were collected. The results were as follows:

13035 rows selected.
Elapsed: 00:00:02.46

The TKPROF output showed:
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ------

Parse 1 0.21 0.24 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 870 1.00 1.24 177 81711 0 13035
------- ------ -------- ---------- ---------- ---------- ---------- ------

total 872 1.21 1.49 177 81711 0 13035

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 73

Rows Row Source Operation

-------  ---------------------------------------------------

  13034 TABLE ACCESS BY INDEX ROWID TBL1_TMP (cr=26939 pr=52 pw=0 time=322452 us)
  13034 INDEX UNIQUE SCAN TBL1_TMP_PK (cr=13905 pr=52 pw=0 time=177941 us)(object id 54039)
  13034 TABLE ACCESS BY INDEX ROWID TBL1_TMP (cr=26939 pr=50 pw=0 time=290687 us)
  13034 INDEX UNIQUE SCAN TBL1_TMP_PK (cr=13905 pr=50 pw=0 time=159581 us)(object id 54039)
  13034 TABLE ACCESS BY INDEX ROWID TBL1_TMP (cr=26939 pr=51 pw=0 time=283721 us)
  13034 INDEX UNIQUE SCAN TBL1_TMP_PK (cr=13905 pr=51 pw=0 time=153145 us)(object id 54039)
  13035 TABLE ACCESS FULL TBL1 (cr=894 pr=24 pw=0 time=70478 us)

Now the logical I/O's have dropped from 6 million down to 81,711 and achieved a reduction in elapsed time of 99.6%.

(4) Original Query with IOT

In an attempt to improve things even further, I created an index organised table (IOT):
CREATE TABLE tbl1_tmp_iot

(

      datadate   DATE,
      dataname   VARCHAR2(9),
      datavalue  NUMBER,
      CONSTRAINT tbl1_tmp_iot_pk
         PRIMARY KEY ( datadate, dataname )

)
ORGANIZATION INDEX
;

I populated the table as follows:
INSERT INTO tbl1_tmp_iot SELECT * FROM tbl1_tmp; COMMIT; I collected the statistics:
SQL> exec dbms_stats.gather_index_stats( USER, 'TBL1_TMP_IOT_PK' )

SELECT datadate,
(SELECT DATAVALUE
FROM tbl1_tmp_iot
WHERE tbl1.datadate = tbl1_tmp_iot.datadate AND DATANAME = 'DATANAME1') AS DATANAME1,
(SELECT DATAVALUE
FROM tbl1_tmp_iot
WHERE tbl1.datadate = tbl1_tmp_iot.datadate AND DATANAME = 'DATANAME2') AS DATANAME2,
(SELECT DATAVALUE
FROM tbl1_tmp_iot
WHERE tbl1.datadate = tbl1_tmp_iot.datadate AND DATANAME = 'DATANAME3') AS DATANAME3
FROM tbl1
;

The elapsed time is 00:00:01.29 seconds. The TKPROF output is: call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ------

Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 870 0.70 0.66 0 42609 0 13035
------- ------ -------- ---------- ---------- ---------- ---------- ------

total 872 0.70 0.67 0 42609 0 13035

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 73

Rows Row Source Operation

-------  ---------------------------------------------------

  13034 INDEX UNIQUE SCAN TBL1_TMP_IOT_PK (cr=13905 pr=0 pw=0 time=167930 us)(object id 54041)
  13034 INDEX UNIQUE SCAN TBL1_TMP_IOT_PK (cr=13905 pr=0 pw=0 time=130540 us)(object id 54041)
  13034 INDEX UNIQUE SCAN TBL1_TMP_IOT_PK (cr=13905 pr=0 pw=0 time=125185 us)(object id 54041)
  13035 TABLE ACCESS FULL TBL1 (cr=894 pr=0 pw=0 time=52274 us)

Now we have almost halfed the number of logical I/O's by eliminating the table lookup for the value of the DATAVALUE column.

(5) Pivot Query with PK

I took a modified version of Thomas Kyte's query from p.577 to produce the following query:
SELECT

      datadate,
      MAX(
         DECODE(
            dataname,
            'DATANAME1', datavalue
         )
      ) AS dataname1,
      MAX(
         DECODE(
            dataname,
            'DATANAME2', datavalue
         )
      ) AS dataname2,
      MAX(
         DECODE(
            dataname,
            'DATANAME3', datavalue
         )
      ) AS dataname3
   FROM
      tbl1_tmp
   GROUP BY
      datadate

;

The elapsed time is 00:00:00.82 seconds and the TKPROF output shows: call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ------

Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 870 0.12 0.33 110 155 2 13035
------- ------ -------- ---------- ---------- ---------- ---------- ------

total 872 0.12 0.34 110 155 2 13035

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 73

Rows Row Source Operation

-------  ---------------------------------------------------

13035 SORT GROUP BY (cr=155 pr=110 pw=110 time=278318 us) 39102 TABLE ACCESS FULL TBL1_TMP (cr=155 pr=0 pw=0 time=78327 us)

(6) Pivot Query with IOT

I used the above pivot query on the IOT: SELECT

      datadate,
      MAX(
         DECODE(
            dataname,
            'DATANAME1', datavalue
         )
      ) AS dataname1,
      MAX(
         DECODE(
            dataname,
            'DATANAME2', datavalue
         )
      ) AS dataname2,
      MAX(
         DECODE(
            dataname,
            'DATANAME3', datavalue
         )
      ) AS dataname3
   FROM
      tbl1_tmp_iot
   GROUP BY
      datadate

;

The elapsed time is 00:00:01.62 seconds, and the TKPROF output shows: call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ------

Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 870 0.21 0.17 0 1013 0 13035
------- ------ -------- ---------- ---------- ---------- ---------- ------

total 872 0.21 0.17 0 1013 0 13035

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 73

Rows Row Source Operation

-------  ---------------------------------------------------

13035 SORT GROUP BY NOSORT (cr=1013 pr=0 pw=0 time=183040 us) 39102 INDEX FULL SCAN TBL1_TMP_IOT_PK (cr=1013 pr=0 pw=0 time=117988 us)(object id 54041)

(7) Overview of Pivot Query

The key to pivot queries is the use of the comibination of the MAX and DECODE functions with the GROUP BY clause.

To see how the process of pivoting works, lets look at the intermediate step (without the MAX function and GROUP BY clause): SELECT

      datadate,
      DECODE(
         dataname,
         'DATANAME1', datavalue
      ) AS dataname1,
      DECODE(
         dataname,
         'DATANAME2', datavalue
      ) AS dataname2,
      DECODE(
         dataname,
         'DATANAME3', datavalue
      ) AS dataname3
   FROM
      tbl1_tmp
   WHERE
      rownum <= 5

DATADATE DATANAME1 DATANAME2 DATANAME3

--------- ---------- ---------- ----------
11-APR-04                     1
11-APR-04                                2
12-APR-04          3
12-APR-04                     4
12-APR-04                                5

You will note that rows #3, #4, and #5 have placed the value of the DATAVALUE column into different columns depending upon the value of the DATANAME column. The blank entries are really NULLs.

To condense these groups of three (3) rows into a single row requires the use of the MAX function in conjunction with the GROUP BY clause.

Douglas Hawthorne Received on Mon Apr 12 2004 - 01:38:55 CDT