Inability to reliably monitor long-running parallel queries

From: Tornblad, John <JTornblad_at_emdeon.com>
Date: Fri, 2 Nov 2012 11:36:43 -0500
Message-ID: <0EC8AF65BC8E9A4E9C4EB86F1882C96E052505D7_at_ZBNAAEEX052.na.webmd.net>



I opened an SR about 5 months ago with support on an intermittent problem with monitoring (OEM SQL Monitor) long-running parallel queries. Inexplicably, after "some" amount of time (varied from 30-90 minutes or more) the monitoring in OEM and v$sql_monitor would terminate and the query would be marked as "DONE (ERROR)"... even though the query was still running. Very frustrating for this great feature!

We spent a lot of time at the beginning reviewing and fiddling with the parameters _sqlmon_* but none of these are ultimately the culprit (although they can cause other issues for instance if you have a really big plan then _sqlmon_max_planlines needs to be at least as large as the number of lines in the big plan).

It turns out that the issue for a parallel query is MMON is watching *all* of the parallel servers / slaves and if *any* are idle for more than an internally hard-coded 30 minutes... the finger of death is applied and the monitoring of the query is terminated... even though other parallel servers are active on the same query. I am not sure if the query coordinator is included, it seems very likely the QC is idle for big queries for more than 30 minutes (easily).

Just wondering if anyone else has run into this issue, I cannot be the only customer who has noticed this. Despite "not a bug" declarations it seems like this is a bug / misfeature / unwelcome aggressiveness from MMON. Copied in the latest feedback from support below with their responses to my questions on this issue and some "bug" references (that are "not bugs").

-john

Hello John,

  1. when you say "the update from the bug" are you talking about bug 14117966 ? That bug is in the status "32 - Not a Bug. To Filer".

==> Yes, the bug 14117966 is at status 32,however I still reopen the bug
and forward your updates to the BDE engineer of the bug.

2) Is bug 14117966 an open bug (implying it will be fixed)?

==> Right now the bug is at status 32.
==> Similar matching bugs are closed as not a bug.
eg: Bug 13523091 - INCORRECT STATUS IN V$SQL_MONITOR WITH PARALLEL QUERY

The above bug 13523091 is also closed as not a bug with the following comments.

'This is expected behavior.

As documented in the Reference manual, the entries in v$sql_monitor are reclaimed by MMON process so that new statements can be monitored.

http://st-doc.us.oracle.com/11/112/server.112/e25513/dynviews_3052.htm?t erm=v\$sql\_monitor#REFRN30479
[john: corrected link]
http://docs.oracle.com/cd/E11882_01/server.112/e25513/dynviews_3052.htm? term=v\$sql\_monitor#REFRN30479

The reclaiming process occurrs every 60 seconds in the MMON process. When MMON finds out that an entry that has active plan is idle for more than 30 minutes, it marks the entry as possible for reclamation with potential error (ERROR_keswxPlanMonState). It is marked as error because there could have been an error in this plan that MMON was unable to detect, considering that it is spending too long time (30 minutes) as "idle".

This "30 minutes" is hard coded as macro KESWX_MAX_ACTIVE_IDLE so it cannot be changed.

In real life, with some bad plan or very big data, it is possible that some process of PQ may stay idle and wait for other slaves to finish their processing for more than 30 minutes. '

==> The bug 14117966 filed from this SR has referred to Bug 13523091.

3) Are you saying that any parallel query that at least one slave which is idle for more than 30 minutes (even though other slaves are active) will be called an ERROR even though it is actually STILL RUNNING?

Yes.

</pre>This message is confidential, intended only for the named recipient(s) and may contain information that is privileged or exempt from disclosure under applicable law. If you are not the intended recipient(s), you are notified that the dissemination, distribution, or copying of this message is strictly prohibited. If you receive this message in error or are not the named recipient(s), please notify the sender by return email and delete this message. Thank you.

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Nov 02 2012 - 17:36:43 CET

Original text of this message