Home » RDBMS Server » Performance Tuning » Self join removal (Oracle, 9.2.0.4, Windows XP)
Self join removal [message #318471] |
Tue, 06 May 2008 23:47 |
srihari.gurram
Messages: 13 Registered: May 2008
|
Junior Member |
|
|
The following query is taking more time to execute due to the presense of self join and max function to get the previous max date. The self joined table has almost 40 million records. I have created unique indexes on the columns of the join. Please let me know of any solutions to avoid this self join.
SELECT max(tpm2.prc_dt),
tpm2.asset_id ,
tpm2.pricing_pt_id ,
tpm1.prc_dt
FROM t_prc_master tpm1,
t_prc_master tpm2,
t_pricing_pt_grp_assign ptg,
t_pricing_pt pt
WHERE tpm1.prc_dt = report_date_
AND tpm1.asset_id = tpm2.asset_id
AND tpm1.pricing_pt_id = tpm2.pricing_pt_id
AND tpm2.prc_dt < tpm1.prc_dt
AND tpm2.accept_flg = 'Y'
AND tpm1.accept_flg = 'Y'
AND tpm1.pricing_pt_id = ptg.pricing_pt_id
AND ptg.pricing_pt_id = pt.pricing_pt_id
AND ptg.pricing_pt_grp_id = pricing_pt_grp_id_
AND ( pricing_pt_id_ is null OR
pricing_pt_id_ = -1 OR
tpm1.pricing_pt_id = pricing_pt_id_ )
GROUP BY tpm2.asset_id, tpm2.pricing_pt_id, tpm1.prc_dt;
[Moderator: formatted code]
[Updated on: Wed, 07 May 2008 03:22] by Moderator Report message to a moderator
|
|
|
|
|
Re: Self join removal [message #318569 is a reply to message #318566] |
Wed, 07 May 2008 03:27 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
Also, post the results of this SQL:
SELECT count(*), max(cnt), avg(cnt), stddev(cnt)
FROM (
SELECT asset_id, pricing_pt_id, COUNT(*) AS cnt
FROM t_prc_master
GROUP BY asset_id, pricing_pt_id
)
Ross Leishman
[Updated on: Wed, 07 May 2008 03:28] Report message to a moderator
|
|
|
Re: Self join removal [message #318632 is a reply to message #318569] |
Wed, 07 May 2008 06:28 |
srihari.gurram
Messages: 13 Registered: May 2008
|
Junior Member |
|
|
Result for the query are:
SELECT count(*), max(cnt), avg(cnt), stddev(cnt)
FROM (
SELECT asset_id, pricing_pt_id, COUNT(*) AS cnt
FROM t_prc_master
GROUP BY asset_id, pricing_pt_id
)
COUNT(*)
2514016
MAX(CNT)
197
AVG(CNT)
21.6669110300014001501979303234346957219
STDDEV(CNT)
30.26544175678646740435447457146461754385
|
|
|
|
Re: Self join removal [message #318776 is a reply to message #318634] |
Wed, 07 May 2008 23:29 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
You have a logic problem. On average there are 21 rows per asset/pricing_pt and a large standard deviation - meaning there are plenty of combos with a lot more than 21.
Consider the worst case of 197 rows on the following:
FROM t_prc_master tpm1,
t_prc_master tpm2,
t_pricing_pt_grp_assign ptg,
t_pricing_pt pt
WHERE tpm1.prc_dt = report_date_
AND tpm1.asset_id = tpm2.asset_id
AND tpm1.pricing_pt_id = tpm2.pricing_pt_id
AND tpm2.prc_dt < tpm1.prc_dt
This means that every row will join to EVERY other row with a lesser prc_dt. Those 197 rows will blow out to 197+196+...+2+1 = 19306 rows.
It's like a mini-cartesian join.
Surely you don't want to join to EVERY row with a lesser date. Since you have MAX(tpm2.prc_dt), I assume you only want the LATEST row earlier than the selected tpm1 row.
I understand that the table is large, but are you dealing with <1% of the rows after applying the report_date_? WHERE tpm1.prc_dt = report_date_
If so, this will help:
FROM t_prc_master tpm1,
t_prc_master tpm2,
t_pricing_pt_grp_assign ptg,
t_pricing_pt pt
WHERE tpm1.prc_dt = report_date_
AND (tpm2.asset_id, tpm2.pricing_pt_id, tpm2.prc_dt) = (
SELECT asset_id, pricing_pt_id, MAX(prc_dt)
FROM t_prc_master tpm3
WHERE tpm3.asset_id = tpm1.asset_id
AND tpm3.pricing_pt_id = tpm1.pricing_pt_id
AND tpm3.prc_dt < tpm1.prc_dt
AND tpm3.accept_flg = 'Y'
GROUP BY asset_id, pricing_pt_id
)
And this bit is vitally important, otherwise IT WONT WORK:
Add an index on (asset_id, pricing_pt_id, accept_flg, prc_dt) and gather stats on both the table and index. If you do not have that exact index or you do not gather statistics, the Oracle will choose the wrong plan.
When you run Explain Plan, it MUST contain a step that includes the following:
INDEX RANGE SCAN (MIN/MAX)
If you do not see the MIN/MAX then it hasn't worked - don't bother running it.
Since this SQL will not pick up multiple tpm2 rows, you should be able to eliminate the GROUP BY from the query.
If your report_date_ returns more than 10% of the table then this technique may well be slower. You may need to try an analytic function LAG().
Ross Leishman
|
|
|
Re: Self join removal [message #319504 is a reply to message #318776] |
Mon, 12 May 2008 00:36 |
srihari.gurram
Messages: 13 Registered: May 2008
|
Junior Member |
|
|
I have tried to use the mentioned approach, but didnt solve my problem... It takes more time than the original. I have created the indexes and also verified the explain plan.
The data we are dealing with the following condition is less than 10% of the data only (about 1% to 2%), but still it is not working out.
tpm1.prc_dt = report_date_
Can u give me an example on how to use the LAG function for this?
The main thing here (I believe) is to avoid to go for the MAX function.
|
|
|
|
Re: Self join removal [message #319558 is a reply to message #319542] |
Mon, 12 May 2008 03:39 |
srihari.gurram
Messages: 13 Registered: May 2008
|
Junior Member |
|
|
Please find below the SQL used and the Explain plan:
SELECT tpm2.prc_dt,
tpm2.asset_id ,
tpm2.pricing_pt_id ,
tpm1.prc_dt
FROM t_prc_master tpm1,
t_prc_master tpm2,
t_pricing_pt_grp_assign ptg,
t_pricing_pt pt
WHERE tpm1.prc_dt = report_date_
AND (tpm2.asset_id, tpm2.pricing_pt_id, tpm2.prc_dt) = (
SELECT asset_id, pricing_pt_id, MAX(prc_dt)
FROM t_prc_master tpm3
WHERE tpm3.asset_id = tpm1.asset_id
AND tpm3.pricing_pt_id = tpm1.pricing_pt_id
AND tpm3.prc_dt < tpm1.prc_dt
AND tpm3.accept_flg = 'Y'
GROUP BY asset_id, pricing_pt_id
)
AND tpm1.accept_flg = 'Y'
AND tpm1.pricing_pt_id = ptg.pricing_pt_id
AND ptg.pricing_pt_id = pt.pricing_pt_id
AND ptg.pricing_pt_grp_id = pricing_pt_grp_id_
AND ( pricing_pt_id_ is null OR
pricing_pt_id_ = -1 OR
tpm1.pricing_pt_id = pricing_pt_id_ );
Explain plan:
in attachment
|
|
|
Re: Self join removal [message #319743 is a reply to message #318776] |
Mon, 12 May 2008 22:05 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
rleishman wrote on Thu, 08 May 2008 14:29 | And this bit is vitally important, otherwise IT WONT WORK:
...
When you run Explain Plan, it MUST contain a step that includes the following:
INDEX RANGE SCAN (MIN/MAX)
If you do not see the MIN/MAX then it hasn't worked - don't bother running it.
|
srihari.gurram wrote on Mon, 12 May 2008 15:36 | I have created the indexes and also verified the explain plan.
|
But the plan did not contain the MIN/MAX step. So why did you lie?
srihari.gurram wrote on Mon, 12 May 2008 18:39 | Please find below the SQL used and the Explain plan:
...
|
The SQL you posted contains 4 tables in the FROM clause, but the plan contains only 2. So why did you lie?
It's almost as though you don't want help because you are deliberately making this harder for us.
But let's press on regardless. Since that SQL did not get the desired plan, we'll try an alternative. Please try the following and post the explain plan. Again, we are looking for a MIN/MAX step in the plan, and you must make sure that you gather statistics on both the table and the indexes.
Rather than posting a dump of the PLAN_TABLE, could you please format the Explain Plan as described here and enclose it in CODE tags.
SELECT tpm2.prc_dt,
tpm2.asset_id ,
tpm2.pricing_pt_id ,
tpm1.prc_dt
FROM t_prc_master tpm1,
t_prc_master tpm2,
t_pricing_pt_grp_assign ptg,
t_pricing_pt pt
WHERE tpm1.prc_dt = report_date_
AND tpm2.asset_id = tpm1.asset_id
AND tpm2.pricing_pt_id = tpm1.pricing_pt_id
AND tpm2.accept_flg = 'Y'
AND tpm2.prc_dt = (
SELECT MAX(prc_dt)
FROM t_prc_master tpm3
WHERE tpm3.asset_id = tpm1.asset_id
AND tpm3.pricing_pt_id = tpm1.pricing_pt_id
AND tpm3.prc_dt < tpm1.prc_dt
AND tpm3.accept_flg = 'Y'
)
AND tpm1.accept_flg = 'Y'
AND tpm1.pricing_pt_id = ptg.pricing_pt_id
AND ptg.pricing_pt_id = pt.pricing_pt_id
AND ptg.pricing_pt_grp_id = pricing_pt_grp_id_
AND ( pricing_pt_id_ is null OR
pricing_pt_id_ = -1 OR
tpm1.pricing_pt_id = pricing_pt_id_ );
Ross Leishman
|
|
|
Re: Self join removal [message #320383 is a reply to message #319743] |
Wed, 14 May 2008 23:07 |
srihari.gurram
Messages: 13 Registered: May 2008
|
Junior Member |
|
|
We have changed a couple of things in the query according to the requirement. The final form of original and modified queries are below:
Original:
SELECT max(tpm2.prc_dt),
tpm2.asset_id ,
tpm2.pricing_pt_id ,
tpm1.prc_dt
FROM t_prc_master tpm1,
t_prc_master tpm2
WHERE tpm1.prc_dt = report_date_
AND tpm2.asset_id = tpm1.asset_id
AND tpm2.pricing_pt_id = tpm1.pricing_pt_id
AND tpm2.accept_flg = 'Y'
AND tpm2.prc_dt = (
SELECT MAX(prc_dt)
FROM t_prc_master tpm3
WHERE tpm3.asset_id = tpm1.asset_id
AND tpm3.pricing_pt_id = tpm1.pricing_pt_id
AND tpm3.prc_dt < tpm1.prc_dt
AND tpm3.accept_flg = 'Y'
)
AND tpm1.accept_flg = 'Y'
AND exists (select 1 from t_temp_prcmov
where pca_flg = 'P'
and prc_pt_cntry_atyp = tpm1.pricing_pt_id)
group by tpm2.asset_id, tpm2.pricing_pt_id, tpm1.prc_dt;
Modified:
SELECT tpm2.prc_dt,
tpm2.asset_id ,
tpm2.pricing_pt_id ,
tpm1.prc_dt
FROM t_prc_master tpm1,
t_prc_master tpm2
WHERE tpm1.prc_dt = report_date_
AND tpm2.asset_id = tpm1.asset_id
AND tpm2.pricing_pt_id = tpm1.pricing_pt_id
AND tpm2.accept_flg = 'Y'
AND tpm2.prc_dt = (
SELECT MAX(prc_dt)
FROM t_prc_master tpm3
WHERE tpm3.asset_id = tpm1.asset_id
AND tpm3.pricing_pt_id = tpm1.pricing_pt_id
AND tpm3.prc_dt < tpm1.prc_dt
AND tpm3.accept_flg = 'Y'
)
AND tpm1.accept_flg = 'Y'
and exists (select 1 from t_temp_prcmov
where pca_flg = 'P'
and prc_pt_cntry_atyp = tpm1.pricing_pt_id);
This was the reason for having only 2 tables in the explain plan. The two tables which were removed had less than 100 rows. I have tried the explain plan for the above one, WITHOUT the EXISTS condition and had the same explain plan as given in the last reply.
(I didn't get the RANGE SCAN MIN/MAX step in the plan).
The performance got reduced by using the above modified query.
Let me know the usage of LAG function in this Query.
|
|
|
Re: Self join removal [message #320392 is a reply to message #320383] |
Wed, 14 May 2008 23:47 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
LAG probably won't help given your volumes. The method I have described is your best chance. If it is not working, then something has not been done correctly. Either:- Index is not on (asset_id, pricing_pt_id, accept_flg, prc_dt) in that order.
- Index or table is not analyzed with BMS_STATS.GATHER TABLE STATS()
Since you don't seem to want to sort it out by finding what the problem is, I'll respect your wishes and drop it.
You could of course read the documentation to work out how to use the LAG function, but since you've been so patient, here it is.
SELECT *
FROM (
SELECT asset_id ,
pricing_pt_id ,
prc_dt,
LAG(prc_dt) OVER (PARTITION BY asset_id, pricing_pt_id ORDER BY prc_dt) prev_prc_dt
FROM t_prc_master
WHERE accept_flg = 'Y'
)
WHERE prc_dt = report_date_
It may be an improvement on what you currently have, but it will still be sub-optimal.
Another syntax you could try would be:
SELECT asset_id ,
pricing_pt_id ,
prc_dt,
(SELECT MAX(prc_dt)
FROM t_prc_master tpm3
WHERE tpm3.asset_id = tpm1.asset_id
AND tpm3.pricing_pt_id = tpm1.pricing_pt_id
AND tpm3.prc_dt < tpm1.prc_dt
AND tpm3.accept_flg = 'Y'
)
FROM t_prc_master tpm1
WHERE accept_flg = 'Y'
AND prc_dt = report_date_
Once again though, this would require the index I mentioned, which would need to be analysed, and the Explain Plan would have to include the MIN/MAX range scan.
Ross Leishman
|
|
|
|
|
Re: Self join removal [message #328049 is a reply to message #318471] |
Wed, 18 June 2008 12:59 |
coleing
Messages: 213 Registered: February 2008
|
Senior Member |
|
|
Hi,
It looks to me like you are reading quite a lot of the t_prc_master table.
Perhaps something like this might run quicker? Its dependent on a full scan of the table being quicker than multiple index lookups against the same row.
The logic might need slight amendment, as I might have lost something in transaltion.
select *
from
(
SELECT tpm2.prc_dt,
tpm2.asset_id ,
tpm2.pricing_pt_id ,
tpm1.prc_dt
max( case
when tpm1.accept_flag = 'Y' then
tpm1.prc_dt
else
to_date('00010101','YYYYMMDD')
end)
over (partition by assit_id, pricing_pt_id) as max_prc_dt
FROM (select * from t_prc_master where tpm1.accept_flg = 'Y') tpm1
WHERE tpm1.prc_dt = report_date_
and exists (select 1 from t_temp_prcmov
where pca_flg = 'P'
and prc_pt_cntry_atyp = tpm1.pricing_pt_id)
)
where prc_dt = max_prc_dt;
|
|
|
Re: Self join removal [message #328088 is a reply to message #320392] |
Wed, 18 June 2008 22:37 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
There's something very fishy going on here. The SQL accesses T_PRC_MASTER 3 times, but the execution plan shows only 2 steps accessing that table: Steps 3 and 5. The plan step for the sub-query is MISSING!
Rows Row Source Operation
------- ---------------------------------------------------
780544 HASH JOIN SEMI (cr=6894392 r=408805 w=7230 time=90095967 us)
780544 NESTED LOOPS (cr=6894388 r=401573 w=0 time=83121308 us)
783459 TABLE ACCESS BY INDEX ROWID T_PRC_MASTER (cr=593790 r=24953 w=0 time=6516069 us)
784161 INDEX RANGE SCAN PRC_DT_ASSET_ID (cr=7493 r=75 w=0 time=378990 us)(object id 450059)
780544 INDEX RANGE SCAN ASSET_DT_ACCEPT_FLG (cr=2341687 r=14 w=0 time=3129196 us)(object id 450055)
55 TABLE ACCESS FULL TT_TEMP_PRCMOV (cr=4 r=2 w=0 time=310 us)
When we look at the re-parsed plan immediately below, we see the sub-query present as a SORT and INDEX access.
This is odd.
This can still run faster. Note that it is NOT performing the INDEX RANGE SCAN (MIN/MAX) step that I suggested it required. Quite simply, until you see this step in the plan you will not get optimal performance. I don't know how I can make that any clearer.
Perhaps you have not created the index correctly. Prove to me that you have an index on (asset_id, pricing_pt_id, accept_flg, prc_dt) IN THAT ORDER, and that statistics have been gathered for both the table and index.
Ross Leishman
|
|
|
Goto Forum:
Current Time: Sat Nov 09 16:44:01 CST 2024
|