RE: Alternative to V$PQ_TQSTAT for troubleshooting unbalanced HJs?
Date: Sun, 12 Aug 2012 12:15:09 -0500
Much thanks to Greg for pointing out _sqlmon_max_planlines and its default cutoff of not showing any query with more than 300 lines in their plan. Our plan was about 1500 lines... bumping up the limit did do the trick... on the next execution of our query it immediately show up and I began with delight seeing a nice correlation with V$SESSION_LONGOPS and the plan statistics in SQL Monitor.
HOWEVER... it was a swift transition from the thrill of victory to the agony of defeat...
After 90 minutes of watching our query in SQL Monitor... it DISAPPEARED utterly and suddenly, and of course right before the query reached one of the imbalances I'm trying to get some insight on.
Is there *another* hidden parameter that controls this (how long a query is allowed to be monitored)?
From: Greg Rahn [mailto:greg_at_structureddata.org]
Sent: Saturday, August 11, 2012 9:11 AM
To: Tornblad, John
Subject: Re: Alternative to V$PQ_TQSTAT for troubleshooting unbalanced HJs?
I think your best bet is SQL Monitor. If it does not show up in OEM, can you find it in v$sql_monitor? If not, is the plan longer than 300 lines? If so, you would need to set _sqlmon_max_planlines to a number greater than the number of plan lines for that query. Once you get the sql_id and it shows up in v$sql monitor, you could just run a loop and save some sql monitor reports to a directory and look at them later using the SQL on this page:
Your second best bet is to experiment with parts of the plan that you think may be problematic and do a "select key, count group by key order by count desc limit 10" query to see if the data has a very popular key. Things to be aware of: window functions whose output produces just a few number of values or joins (with a hash hash distribution) whose input has a key that is significantly more popular than the rest of the data or data with a Zipf distribution.
With parallel skew there are two types to be aware of:
- due to data skew (a very popular key), a worker does significantly more work than the other workers
- due to the way the data arrives to the operator, only one worker (or a small number) is busy at any one point in time, but at the end, they all process nearly the same amount of data (temporal skew)
On Fri, Aug 10, 2012 at 11:12 AM, Tornblad, John <JTornblad_at_emdeon.com> wrote:
I have a large INSERT ... SELECT parallel query involving dozens of hash joins, hundreds of millions of rows and strongly suspect we are getting bitten by an imbalance of workload distributed among the parallel servers for certain HJs, but not all. The suspicion has been sparked by watching our query in OEM's Activity window on the query... there are "good" periods of time where clearly the parallel slaves are working jointly and equally on a join... and other "bad" periods where a single parallel slave is chewing away by itself for an hour.
To top it off there appears to be a bug in 220.127.116.11 / OEM 12c SQL Monitoring such that something about our query makes it not show up at all in OEM SQL Monitor (or in V$SQL_MONITOR) so I can't easily look when the slow operation is running and just "see" what operation it's on and try to work that back to what specific join might be causing an imbalance of workload.
</pre>This message is confidential, intended only for the named
recipient(s) and may contain information that is privileged or
exempt from disclosure under applicable law. If you are not
the intended recipient(s), you are notified that the
dissemination, distribution, or copying of this message is
strictly prohibited. If you receive this message in error or
are not the named recipient(s), please notify the sender by
return email and delete this message. Thank you.
http://www.freelists.org/webpage/oracle-l Received on Sun Aug 12 2012 - 12:15:09 CDT