Re: Alternative to V$PQ_TQSTAT for troubleshooting unbalanced HJs?

From: Greg Rahn <greg_at_structureddata.org>
Date: Sat, 11 Aug 2012 09:11:08 -0700
Message-ID: <CAGXkmitfxOuVaE4zAbvPwPu8buM2S89mC8KSSCTbR8B-mbfN6w_at_mail.gmail.com>



I think your best bet is SQL Monitor. If it does not show up in OEM, can you find it in v$sql_monitor? If not, is the plan longer than 300 lines?  If so, you would need to set _sqlmon_max_planlines to a number greater than the number of plan lines for that query. Once you get the sql_id and it shows up in v$sql monitor, you could just run a loop and save some sql monitor reports to a directory and look at them later using the SQL on this page:
http://structureddata.org/2008/01/06/oracle-11g-real-time-sql-monitoring-using-dbms_sqltunereport_sql_monitor/ Your second best bet is to experiment with parts of the plan that you think may be problematic and do a "select key, count group by key order by count desc limit 10" query to see if the data has a very popular key. Things to be aware of: window functions whose output produces just a few number of values or joins (with a hash hash distribution) whose input has a key that is significantly more popular than the rest of the data or data with a Zipf distribution.

With parallel skew there are two types to be aware of: 1) due to data skew (a very popular key), a worker does significantly more work than the other workers
2) due to the way the data arrives to the operator, only one worker (or a small number) is busy at any one point in time, but at the end, they all process nearly the same amount of data (temporal skew)

On Fri, Aug 10, 2012 at 11:12 AM, Tornblad, John <JTornblad_at_emdeon.com>wrote:

> I have a large INSERT ... SELECT parallel query involving dozens of hash
> joins, hundreds of millions of rows and strongly suspect we are getting
> bitten by an imbalance of workload distributed among the parallel
> servers for certain HJs, but not all. The suspicion has been sparked by
> watching our query in OEM's Activity window on the query... there are
> "good" periods of time where clearly the parallel slaves are working
> jointly and equally on a join... and other "bad" periods where a single
> parallel slave is chewing away by itself for an hour.
>
> To top it off there appears to be a bug in 11.2.0.3 / OEM 12c SQL
> Monitoring such that something about our query makes it not show up at
> all in OEM SQL Monitor (or in V$SQL_MONITOR) so I can't easily look when
> the slow operation is running and just "see" what operation it's on and
> try to work that back to what specific join might be causing an
> imbalance of workload.
>
>

-- 
Regards,
Greg Rahn  |  blog <http://bit.ly/u9N0i8>  |  twitter <http://bit.ly/v733dJ>  |
 linkedin <http://linkd.in/gregrahn>


--
http://www.freelists.org/webpage/oracle-l
Received on Sat Aug 11 2012 - 11:11:08 CDT

Original text of this message