MMON slaves spinning

From: De DBA <dedba_at_tpg.com.au>
Date: Fri, 29 Nov 2013 12:09:00 +1000
Message-ID: <5297F73C.4010402_at_tpg.com.au>



G'day

Oracle 11.2.0.3 with Oct13 CPU, on RHEL 6

We have 2 servers. One runs a single instance - Production - with Memory_Target = 4G and Memory_Max_Target = 25G. The other server runs 7 instances: an Active DG standby database (Memory_Target = 2G, no Memory_Max_Target set) and 6 staging databases, each with Memory_Target = 4G and Memory_Max_Target = 8G. Both servers have 64G of physical RAM installed and 30G of swap available. top shows that no swap is in use on either.
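To make the memory picture explicit, here is a rough tally of the figures above. It is only a sketch: the instance labels (staging1 to staging6) are placeholders, and it assumes that an unset Memory_Max_Target equals Memory_Target for the standby.

```python
# Memory settings as stated in the post, in GB.
PHYSICAL_RAM_GB = 64

# Server 1: production only.
server1 = [{"name": "production", "target": 4, "max_target": 25}]

# Server 2: Active DG standby plus six staging databases.
# Assumption: with no Memory_Max_Target set, the standby's maximum
# is taken to be its Memory_Target. Instance names are placeholders.
server2 = [{"name": "standby", "target": 2, "max_target": 2}]
server2 += [
    {"name": f"staging{i}", "target": 4, "max_target": 8} for i in range(1, 7)
]


def totals(instances):
    """Return (sum of Memory_Target, sum of Memory_Max_Target) in GB."""
    return (
        sum(inst["target"] for inst in instances),
        sum(inst["max_target"] for inst in instances),
    )


for label, instances in (("server 1", server1), ("server 2", server2)):
    target, max_target = totals(instances)
    print(
        f"{label}: target {target} GB, max {max_target} GB, "
        f"headroom at max {PHYSICAL_RAM_GB - max_target} GB"
    )
```

Under these assumptions neither server is overcommitted, even with every instance grown to its maximum, which fits with no swap being in use.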

All databases are identical - the staging databases are regularly re-created from the standby database using RMAN DUPLICATE. Each database is accessed by the same set of web applications (excluding the standby database, obviously). Modifications made in the staging applications are migrated to the production environment after testing and approval. TDE is used in every database.

There is also a matching (and underused) set of development databases, which are created identically to the staging databases, via an intermediary where sensitive data is masked.

Four of the staging databases suffer from time to time from a spinning MMON slave (usually M000), which may or may not block the Library Cache Mutex. When it does, no more sessions can log on and the database is, for all intents and purposes, down. The 2 staging databases that do not suffer from this problem are recreated (and therefore restarted) daily. No other database suffers from this problem, even though all databases are identical in both contents and configuration.

As the production database is part of a critical 24/7 environment the fear is that this eventually will also hit the production environment and cause large losses...

The spinning slave process is still alive and a system state dump of the spinning situation shows nothing out of the ordinary (except long lists of blocked processes when the mutex is locked). We tried the following:

  • flush the Shared Pool - this provided only temporary relief (a few minutes)
  • kill MMON and its slaves on the OS level (pkill -9 ora_mmon_stgx; <etc>) - immediately afterwards a new MMON process was started, a slave was spawned and it started spinning
  • bounce the instance - this provides some relief, for hours and sometimes even days.
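For anyone wanting to spot the spinning slave from the OS side, a minimal sketch along these lines could be used. It assumes the usual ora_<process>_<SID> naming for background processes (as in ora_mmon_stgx above) and expects the text output of `ps -eo pid,pcpu,args --no-headers`; the function name and the 90% threshold are arbitrary choices for illustration.

```python
import re

# Assumed naming: MMON slaves show up as ora_m000_<SID>, ora_m001_<SID>, ...
SLAVE_RE = re.compile(r"^ora_m\d{3}_(\S+)$")


def busy_mmon_slaves(ps_output, cpu_threshold=90.0):
    """Return (pid, sid, cpu) for MMON slaves at or above cpu_threshold.

    ps_output is text as produced by `ps -eo pid,pcpu,args --no-headers`.
    """
    hits = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, %cpu, command line
        if len(parts) != 3:
            continue
        pid, cpu, args = parts
        match = SLAVE_RE.match(args.strip())
        if match and float(cpu) >= cpu_threshold:
            hits.append((int(pid), match.group(1), float(cpu)))
    return hits
```

Note that the %CPU figure reported by ps is averaged over the lifetime of the process, so this works best on a recently spawned slave; for a longer-lived process a sampling tool such as top gives a better picture.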

The SQL area of a staging database with a spinning MMON slave does not show a large number of child cursors; in fact the production database (which never suffers from this problem, and is never bounced) has 10 times as many cursors and child cursors. One thing I noticed in the staging alert logs is that sometimes, but not always, a spinning situation is preceded by an emergency ASH flush. This also never happens in the production instance. Symptoms of a spinning process (TNS timeout errors, other background processes failing to start, PMON failing to acquire a latch) always start appearing in the alert log shortly after the daily maintenance window is closed. This is the standard Oracle-defined maintenance window and associated plans.

The MMON process logs absolutely no errors of any kind, so none of the scenarios that I can find using Google or MetaLink apply (they all seem to be associated with ORA-600 errors).

Any suggestion welcome :)

Cheers,
Tony

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Nov 29 2013 - 03:09:00 CET
