Alternative to V$PQ_TQSTAT for troubleshooting unbalanced HJs?

From: Tornblad, John <JTornblad_at_emdeon.com>
Date: Fri, 10 Aug 2012 13:12:43 -0500
Message-ID: <0EC8AF65BC8E9A4E9C4EB86F1882C96E046C1B31_at_ZBNAAEEX052.na.webmd.net>



I have a large INSERT ... SELECT parallel query involving dozens of hash joins, hundreds of millions of rows and strongly suspect we are getting bitten by an imbalance of workload distributed among the parallel servers for certain HJs, but not all. The suspicion has been sparked by watching our query in OEM's Activity window on the query... there are
"good" periods of time where clearly the parallel slaves are working
jointly and equally on a join... and other "bad" periods where a single parallel slave is chewing away by itself for an hour.  

Looking at the outline data on the query, I can see dozens of these three distribution modes  

PQ_DISTRIBUTE(... BROADCAST NONE) PQ_DISTRIBUTE(... NONE BROADCAST) PQ_DISTRIBUTE(... HASH HASH)   Although I suspect the CBO is making good decisions for most of these (we have skewed data and good stats including histograms for skewed columns) I am wondering if on a few of those joins we'd be better served by a BROADCAST vs. a HASH... perhaps dramatically so.  

What I don't know is a lot...  

First, I am not assuming that all parallel slaves should necessarily be just as busy as each other all the time... our query includes WINDOW SORTs and plenty of other goodies and expressions that I have no idea if they (individually or when composited into our query) can be parallelized or not in 11.2.0.3 or how they might appear in what I am watching in OEM or in various V$ views on the running query.  

Second, I read with excitement these posts...  

http://oracledoug.com/px6.html

http://jonathanlewis.wordpress.com/2007/03/14/how-parallel/  

which turned me on to V$PQ_TQSTAT but was deflated by Jonathan's caveat
"unfortunately, v$pq_tqstat tends to go wrong at exactly the moment when
the queries get tough". Confirmed. 18 hours later (an interminable wait!) a look into V$PQ_TQSTAT from the same session directly after a run of the query completed showed it's busted... one row (expected this to be hundreds). As I write this up, I am wondering if I needed to look at V$PQ_TQSTAT before I COMMITed the results but I'm not sure if I can survive the capricious let down of a potential heisenbug again.  

I've looked at _PX_TRACE but from what I've scanned in this lightly-documented area there doesn't appear to be a variation of options that gives equivalent output to V$PQ_TQSTAT showing the rows delivered to each parallel server process. Our should I say... an easy-to-read equivalent. There is plenty of output... but no docs on how to interpret it or where to look for what.  

To top it off there appears to be a bug in 11.2.0.3 / OEM 12c SQL Monitoring such that something about our query makes it not show up at all in OEM SQL Monitor (or in V$SQL_MONITOR) so I can't easily look when the slow operation is running and just "see" what operation it's on and try to work that back to what specific join might be causing an imbalance of workload.    

I'm not out of options... I think the most promising at this point is to try a trivial join test case with some version of _PX_TRACE on, and try to interpret the trace output and see where the numbers regarding TQ row counts might be.  

I'd appreciate anyone's comments, advice or even some commiseration at this point.  

-john  

</pre>This message is confidential, intended only for the named recipient(s) and may contain information that is privileged or exempt from disclosure under applicable law. If you are not the intended recipient(s), you are notified that the dissemination, distribution, or copying of this message is strictly prohibited. If you receive this message in error or are not the named recipient(s), please notify the sender by return email and delete this message. Thank you.

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Aug 10 2012 - 13:12:43 CDT

Original text of this message