Re: Alternative to V$PQ_TQSTAT for troubleshooting unbalanced HJs?

From: Mark Burgess <mark_at_burgess-consulting.com.au>
Date: Sat, 11 Aug 2012 07:01:17 +1000
Message-Id: <2907F32D-C8CC-4690-B59D-0FE71ABE50DB_at_burgess-consulting.com.au>



Hi John,

we recently worked through a similar exercise. The way that we identified the candidate operations for the PQ_DISTRIBUTE hint was to view the output from v$session_longops when the query was running and tie the operation that was taking the most time back to the line in the execution plan and ultimately the join. Correcting some of these imbalances resulted in a significant improvement in elapsed time and machine utilisation for the target queries.

Regards,

Mark Burgess

Burgess Systems Consulting
w: http://www.burgess-consulting.com.au
m: +61-405-484-455

On 11/08/2012, at 4:12 AM, "Tornblad, John" <JTornblad_at_emdeon.com> wrote:

> I have a large INSERT ... SELECT parallel query involving dozens of hash
> joins, hundreds of millions of rows and strongly suspect we are getting
> bitten by an imbalance of workload distributed among the parallel
> servers for certain HJs, but not all. The suspicion has been sparked by
> watching our query in OEM's Activity window on the query... there are
> "good" periods of time where clearly the parallel slaves are working
> jointly and equally on a join... and other "bad" periods where a single
> parallel slave is chewing away by itself for an hour.
>
>
> Looking at the outline data on the query, I can see dozens of these
> three distribution modes
>
>
>
> PQ_DISTRIBUTE(... BROADCAST NONE)
>
> PQ_DISTRIBUTE(... NONE BROADCAST)
>
> PQ_DISTRIBUTE(... HASH HASH)
>
>
>
> Although I suspect the CBO is making good decisions for most of these
> (we have skewed data and good stats including histograms for skewed
> columns) I am wondering if on a few of those joins we'd be better served
> by a BROADCAST vs. a HASH... perhaps dramatically so.
>
>
>
> What I don't know is a lot...
>
>
>
> First, I am not assuming that all parallel slaves should necessarily be
> just as busy as each other all the time... our query includes WINDOW
> SORTs and plenty of other goodies and expressions that I have no idea if
> they (individually or when composited into our query) can be
> parallelized or not in 11.2.0.3 or how they might appear in what I am
> watching in OEM or in various V$ views on the running query.
>
>
>
> Second, I read with excitement these posts...
>
>
>
> http://oracledoug.com/px6.html
>
> http://jonathanlewis.wordpress.com/2007/03/14/how-parallel/
>
>
>
> which turned me on to V$PQ_TQSTAT but was deflated by Jonathan's caveat
> "unfortunately, v$pq_tqstat tends to go wrong at exactly the moment when
> the queries get tough". Confirmed. 18 hours later (an interminable
> wait!) a look into V$PQ_TQSTAT from the same session directly after a
> run of the query completed showed it's busted... one row (expected this
> to be hundreds). As I write this up, I am wondering if I needed to look
> at V$PQ_TQSTAT before I COMMITed the results but I'm not sure if I can
> survive the capricious let down of a potential heisenbug again.
>
>
>
> I've looked at _PX_TRACE but from what I've scanned in this
> lightly-documented area there doesn't appear to be a variation of
> options that gives equivalent output to V$PQ_TQSTAT showing the rows
> delivered to each parallel server process. Our should I say... an
> easy-to-read equivalent. There is plenty of output... but no docs on
> how to interpret it or where to look for what.
>
>
>
> To top it off there appears to be a bug in 11.2.0.3 / OEM 12c SQL
> Monitoring such that something about our query makes it not show up at
> all in OEM SQL Monitor (or in V$SQL_MONITOR) so I can't easily look when
> the slow operation is running and just "see" what operation it's on and
> try to work that back to what specific join might be causing an
> imbalance of workload.
>
>
>
>
>
> I'm not out of options... I think the most promising at this point is to
> try a trivial join test case with some version of _PX_TRACE on, and try
> to interpret the trace output and see where the numbers regarding TQ row
> counts might be.
>
>
>
> I'd appreciate anyone's comments, advice or even some commiseration at
> this point.
>
>
>
> -john
>
>
>
>
>
> </pre>This message is confidential, intended only for the named
> recipient(s) and may contain information that is privileged or
> exempt from disclosure under applicable law. If you are not
> the intended recipient(s), you are notified that the
> dissemination, distribution, or copying of this message is
> strictly prohibited. If you receive this message in error or
> are not the named recipient(s), please notify the sender by
> return email and delete this message. Thank you.
> --
> http://www.freelists.org/webpage/oracle-l
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Aug 10 2012 - 16:01:17 CDT

Original text of this message