Home » SQL & PL/SQL » SQL & PL/SQL » How To question for analytics (Oracle 10gR2, Solaris)
How To question for analytics [message #601603] Fri, 22 November 2013 14:30 Go to next message
joebegenwald@gmail.com
Messages: 7
Registered: October 2013
Junior Member
I need to rewrite the following query. The table involved has about 5 to 6 million rows on it, so joining it to itself like this is not good.
select d.col1
     , d.col2
     , d.col3
     , d.col4
     , s.row_id
from tablex d,
  ( select col1,
           col2,
           min(row_id) row_id
    from tablex
    group by col1, col2 ) s
where d.row_id = s.row_id;

The goal is to group the table by columns COL1 and COL2, but them only return column COL3 and COL4 values from the FIRST row in each group.

I have been trying variations of things like this:

select col1
, col2
, first_value(col3) over (partition by col1, col2 order by row_id) mcol3
, first_value(col4) over (partition by col1, col2 order by row_id) mcol4
, first_value(row_id) over (partition by col1, col2 order by row_id) mrow_id
from tablex
group by col1, col2;

but of course, SQL doesn't like this.

How close or off base am I with this??




*BlackSwan added {code} tags. Please do so yourself in the future; as shown below
http://www.orafaq.com/forum/t/174502/

[Updated on: Fri, 22 November 2013 16:09] by Moderator

Report message to a moderator

Re: How To question for analytics [message #601612 is a reply to message #601603] Fri, 22 November 2013 23:12 Go to previous messageGo to next message
manubatham20
Messages: 443
Registered: September 2010
Location: Champaign, IL
Senior Member

Hi,

Welcome to the forum!!

What problem you are facing exactly?

Why you think first option is not a good option, any problem you are facing in execution, or any problem you are able to see which will occur in future.

Why SQL doesn't like the second option, we don't have your table structure so we can't execute and verify your queries.
Please read once link posted by BlackSwan.

And what you trying to achieve by determining first row on basis of row_id value?

Regards,
Manu

[Updated on: Fri, 22 November 2013 23:13]

Report message to a moderator

Re: How To question for analytics [message #601619 is a reply to message #601603] Sat, 23 November 2013 00:18 Go to previous messageGo to next message
flyboy
Messages: 1769
Registered: November 2006
Senior Member
It seems that the main confusion comes from the fact you are talking about analytics, but in fact you are using aggregates.

And aggregates seems to be the correct approach here as the original query is filtering only some row(s) for each (COL1, COL2) combination.

But, If there are duplicities in (COL1, COL2, ROW_ID), you are out of luck as the task is not aggregation, but filter on conditions in the posted query.

Otherwise you may use FIRST/LAST function in its aggregate form, which is described e.g. here: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions056.htm#i1000901
It could be something like:
select deptno,
  max(empno) keep ( dense_rank first order by hiredate ),
  max(sal) keep ( dense_rank first order by hiredate )
from scott.emp
group by deptno
order by deptno;

(as you did not post a test case, I used sample tample EMP from SCOTT schema)
Re: How To question for analytics [message #601684 is a reply to message #601619] Sun, 24 November 2013 08:08 Go to previous messageGo to next message
joebegenwald@gmail.com
Messages: 7
Registered: October 2013
Junior Member
Thanks. This seems to be what I needed.

To clarify: the KEEP clause in your example will ensure that the MAX(EMPNO) and MAX(SAL) values returned come from the same row, correct? That is what I am trying to achieve. Using DENSE_RANK FIRST is reducing the sub-result set to a singe row, right? And using that and the same ORDER BY clause in each select element ensures that they are the same row.

Thanks again for your assistance here.

Joe
Re: How To question for analytics [message #601685 is a reply to message #601612] Sun, 24 November 2013 08:15 Go to previous messageGo to next message
joebegenwald@gmail.com
Messages: 7
Registered: October 2013
Junior Member
I am trying to group a large number of rows(5-6 million) and return specific values from just the first row in each group.

Using the self join works, but the CBO is using a double full table scan. With 5 million rows in the table, well, I think you can see the problem.

Thanks
Re: How To question for analytics [message #601687 is a reply to message #601685] Sun, 24 November 2013 08:48 Go to previous messageGo to next message
BlackSwan
Messages: 22537
Registered: January 2009
Senior Member
SQL> set autot on explain
SQL> select deptno,
  max(empno) keep ( dense_rank first order by hiredate ),
  max(sal) keep ( dense_rank first order by hiredate )
from scott.emp
group by deptno
order by deptno;  2    3    4    5    6  

    DEPTNO MAX(EMPNO)KEEP(DENSE_RANKFIRSTORDERBYHIREDATE)
---------- ----------------------------------------------
MAX(SAL)KEEP(DENSE_RANKFIRSTORDERBYHIREDATE)
--------------------------------------------
        10                                           7782
                                        2450

        20                                           7369
                                         800

        30                                           7499
                                        1600



Execution Plan
----------------------------------------------------------
Plan hash value: 15469362

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     3 |    57 |     4  (25)| 00:00:01 |
|   1 |  SORT GROUP BY     |      |     3 |    57 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMP  |    14 |   266 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

SQL> 


Re: How To question for analytics [message #601693 is a reply to message #601684] Sun, 24 November 2013 09:58 Go to previous messageGo to next message
flyboy
Messages: 1769
Registered: November 2006
Senior Member
joebegenwald@gmail.com wrote on Sun, 24 November 2013 15:08
To clarify: the KEEP clause in your example will ensure that the MAX(EMPNO) and MAX(SAL) values returned come from the same row, correct? That is what I am trying to achieve. Using DENSE_RANK FIRST is reducing the sub-result set to a singe row, right? And using that and the same ORDER BY clause in each select element ensures that they are the same row.

Maybe, if you opened the link I posted, I would not have to copy/paste its content here:
Quote:
The aggregate_function is any one of the MIN, MAX, SUM, AVG, COUNT, VARIANCE, or STDDEV functions. It operates on values from the rows that rank either FIRST or LAST. If only one row ranks as FIRST or LAST, the aggregate operates on a singleton (nonaggregate) set.

Also I think I already stated it clearly when requiring uniqueness on (COL1, COL2, ROW_ID).

Just for completeness, you may achieve the same result with analytics, something like:
select * from (
  select emp.*, min(hiredate) over (partition by deptno) x1
  from scott.emp
) where x1 = hiredate
order by deptno;

(note the necessity of a subquery as analytics is computed after filter)
Re: How To question for analytics [message #601699 is a reply to message #601687] Sun, 24 November 2013 11:23 Go to previous message
Solomon Yakobson
Messages: 1966
Registered: January 2010
Senior Member
BlackSwan wrote on Sun, 24 November 2013 09:48

  max(empno) keep ( dense_rank first order by hiredate ),
  max(sal) keep ( dense_rank first order by hiredate )


Data magic. KEEP in non-deterministic. max(empno) will return maximum empno if there will be more than one employee with same first hiredate. max(sal) will return maximum sal if there will be more than one employee with same first hiredate. As a result, max(empno) keep ( dense_rank first order by hiredate ) might return empno of employee A, while max(sal) keep ( dense_rank first order by hiredate ) might return sal of employee B. For example:

SQL> select deptno,empno,sal,hiredate from emp order by deptno,hiredate;

    DEPTNO      EMPNO        SAL HIREDATE
---------- ---------- ---------- -------------------
        10       7782       2450 06/09/1981 00:00:00
        10       7839       5000 11/17/1981 00:00:00
        10       7934       1300 01/23/1982 00:00:00
        20       7369        800 12/17/1980 00:00:00
        20       7566       2975 04/02/1981 00:00:00
        20       7902       3000 12/03/1981 00:00:00
        20       7788       3000 04/19/1987 00:00:00
        20       7876       1100 05/23/1987 00:00:00
        30       7499       1600 02/20/1981 00:00:00
        30       7521       1250 02/22/1981 00:00:00
        30       7698       2850 05/01/1981 00:00:00

    DEPTNO      EMPNO        SAL HIREDATE
---------- ---------- ---------- -------------------
        30       7844       1500 09/08/1981 00:00:00
        30       7654       1250 09/28/1981 00:00:00
        30       7900        950 12/03/1981 00:00:00

14 rows selected.

SQL> insert into emp(deptno,empno,sal,hiredate) values(10,7781,2451,date '1981-06-09');

1 row created.

SQL> select deptno,empno,sal,hiredate from emp order by deptno,hiredate;

    DEPTNO      EMPNO        SAL HIREDATE
---------- ---------- ---------- -------------------
        10       7781       2451 06/09/1981 00:00:00
        10       7782       2450 06/09/1981 00:00:00
        10       7839       5000 11/17/1981 00:00:00
        10       7934       1300 01/23/1982 00:00:00
        20       7369        800 12/17/1980 00:00:00
        20       7566       2975 04/02/1981 00:00:00
        20       7902       3000 12/03/1981 00:00:00
        20       7788       3000 04/19/1987 00:00:00
        20       7876       1100 05/23/1987 00:00:00
        30       7499       1600 02/20/1981 00:00:00
        30       7521       1250 02/22/1981 00:00:00

    DEPTNO      EMPNO        SAL HIREDATE
---------- ---------- ---------- -------------------
        30       7698       2850 05/01/1981 00:00:00
        30       7844       1500 09/08/1981 00:00:00
        30       7654       1250 09/28/1981 00:00:00
        30       7900        950 12/03/1981 00:00:00

15 rows selected.

SQL> select deptno,
  2  max(empno) keep ( dense_rank first order by hiredate ),
  3    max(sal) keep ( dense_rank first order by hiredate )
  4  from scott.emp
  5  group by deptno
  6  order by deptno;

    DEPTNO MAX(EMPNO)KEEP(DENSE_RANKFIRSTORDERBYHIREDATE) MAX(SAL)KEEP(DENSE_RANKFIRSTORDERBYHIREDATE)
---------- ---------------------------------------------- --------------------------------------------
        10                                           7782                                         2451
        20                                           7369                                          800
        30                                           7499                                         1600

SQL>


As you can see we got empno 7782 but sal of empno 7781. We need unique order by when using multiple KEEPs and want them in synch (this will make sort deterministic). In this case all we need is adding ROWID to order by:

SQL> select deptno,
  2  max(empno) keep ( dense_rank first order by hiredate,rowid),
  3    max(sal) keep ( dense_rank first order by hiredate,rowid)
  4  from scott.emp
  5  group by deptno
  6  order by deptno;

    DEPTNO MAX(EMPNO)KEEP(DENSE_RANKFIRSTORDERBYHIREDATE,ROWID) MAX(SAL)KEEP(DENSE_RANKFIRSTORDERBYHIREDATE,ROWID)
---------- ---------------------------------------------------- --------------------------------------------------
        10                                                 7781                                               2451
        20                                                 7369                                                800
        30                                                 7499                                               1600

SQL>


SY.

[Updated on: Sun, 24 November 2013 11:25]

Report message to a moderator

Previous Topic: how to make a column primary key which contains duplicate data
Next Topic: get data between rows
Goto Forum:
  


Current Time: Wed Jul 30 15:36:37 CDT 2014

Total time taken to generate the page: 0.87144 seconds