RE: 11g fault diagnosability infrastructure and poor documentation

From: Robert Freeman <robertgfreeman_at_yahoo.com>
Date: Tue, 2 Oct 2007 20:40:44 -0600
Message-ID: <KEEDIPJOJLCHPPAIDPDOIEIBEGAA.robertgfreeman@yahoo.com>

The health checks are perhaps better documented, and you can get some insights into them by using OEM, which provides a window into them (the "checkers" as they are called). They feed some of the new features like the Data Recovery Advisor and so on.

The incident packages are probably easier to understand if you go through OEM too and follow some of the workflow they have there. I noticed that when I created a database with DBCA, there were two corrupted segments in the data dictionary just waiting for me to package. Did anyone else notice that?

As for the rest..... yeah, there is some frustration there. I don't like that I have to set up the config manager to use the automated SR packaging to its fullest extent (I just want to put in my Metalink ID and have it go from there).

As for automatic hang detection.... well, if I could simulate a hang reliably... ;-)

There are a lot of new mysteries in 11g that can potentially slide up and bite you.... SQL Plan Management is a big one IMHO. Beware. It's a good idea, but it can really cause you grief if you are tuning and don't realize what is going on in the background.

I also think this is the tip of the iceberg ... 11gR2 and beyond will likely build on these architectures. It's going to be important to understand these new architectural features (particularly things like Automated SQL tuning and SQL Plan Management)... perhaps more than ever, since they have more potential to jump up and kick us upside the head.

Of course, you can just turn them off too... ;-)

Much of this is covered in my 11g New Features book.... Coming soon!! Very soon!!

Robert G. Freeman
Oracle Consultant/DBA/Author
Principal Engineer/Team Manager
The Church of Jesus Christ of Latter-Day Saints
Father of Five, Husband of One,
Author of various geeky computer titles
from Osborne/McGraw Hill (Oracle Press)
Oracle Database 11g New Features Now Available for Pre-sales on Amazon.com!
BLOG: http://robertgfreeman.blogspot.com/
Sig V1.2

-----Original Message-----
From: oracle-l-bounce_at_freelists.org
[mailto:oracle-l-bounce_at_freelists.org]On Behalf Of Jeremiah Wilton
Sent: Tuesday, October 02, 2007 7:30 PM
To: ORACLE-L
Subject: 11g fault diagnosability infrastructure and poor documentation

Am I the only one who has been unable to do much with this feature due to the woefully absent documentation? Three components of "fault diagnosability" in particular seem very interesting:

automatic hang detection
automatic reactive "health checks"
incident packages as a replacement for RDA

Hang detection seems like a great idea, but there is no information on precisely what constitutes a "hang" according to DIAG and DIA0. These processes seem never to wake up, even in the most dire of hanging situations. I did find that by default in single-instance databases, the _hang_resolution, _hm_analysis_output_disk and _hm_log_incidents parameters are set to FALSE, which I take to mean the feature is turned off. Even turned on, long hangs involving chains of waiters visible in hanganalyze output do not trigger any actions that I can discern. This is slightly complicated by the fact that two components of "fault diagnosability" share the initials HM, and packages, parameters and views use HM interchangeably to mean "hang manager" and "health monitor".

As for Health Checks, there is no documentation indicating what kinds of events or incidents might result in a "reactive" health check. The existence of reactive health checks is repeatedly asserted in the documentation, and there is even a parameter called _diag_hm_rc_enabled with the description "Parameter to enable/disable Diag HM Reactive Checks". Set to FALSE by default, this parameter does nothing in the event of a badly degraded and hanging system either. We are left to wonder what "reactive" health checks react to!

Finally, the incident packaging service works well enough, but is predicated completely upon the notion that any and all problems will be associated with a fatal error of some kind. Anything that does not dump ORA-600 or another fatal error will not result in an "incident" and thus there is nothing to package. There is apparently no provision for problems that do not dump on an error. So, an on-demand incident package apparently cannot be created. Thus, despite the incident payloads having many of the same contents as the horrid RDA of yore, you cannot generate one on demand in a supported way. You can shoot a server process with a SIGSEGV, but I cannot imagine that is how Oracle intends us to get diagnostic data for opening an SR.

You can probably detect that I am frustrated, but I have been playing with this feature set for weeks, and it is a frustrating morass of nonworking, undocumented wastes of server memory. Remember, we are all now running two extra background processes, DIAG and DIA0, just for this feature. They are up and running and using memory on all of our 11g systems even if they do nothing and are turned off at the parameter level by default.

I am ranting here in hopes that someone else has gotten further than I have or knows someone on the inside who can shed some light on these concerns.

Thanks,

Jeremiah Wilton
ORA-600 Consulting

--
http://www.freelists.org/webpage/oracle-l



Received on Tue Oct 02 2007 - 21:40:44 CDT