Re: Measure database availability beyond 99.9%
Date: Fri, 29 Aug 2008 22:24:43 +0200
For many databases we only do database hosting; the applications are the responsibility of the customer. The SLAs for these contain no hard numbers except availability and service times, and no performance data.
So, the only thing we really need to measure here is the uptime of the databases as such. For accessibility from the client side we would need some sort of monitoring installed on the client machines, which we cannot always do. Besides, we "know" the network is stable. If we can reach the databases, so can the customers.
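A minimal sketch of how database-side uptime could be derived from a heartbeat, assuming a hypothetical prober that records a (timestamp, is_up) pair every few seconds. The function names and the sample data are made up for illustration; the actual probe (for example a connection attempt plus a trivial query) is left out.

```python
from datetime import datetime, timedelta

def outage_intervals(probes):
    """Turn time-ordered (timestamp, is_up) probe results into (start, end) outages.

    An outage starts at the first failed probe and ends at the next successful
    one, so each boundary is only accurate to within one probe interval.
    """
    outages = []
    down_since = None
    for ts, is_up in probes:
        if not is_up and down_since is None:
            down_since = ts                      # first failed probe of an outage
        elif is_up and down_since is not None:
            outages.append((down_since, ts))     # first successful probe afterwards
            down_since = None
    if down_since is not None:
        outages.append((down_since, None))       # still down at the last probe
    return outages

def total_downtime(outages):
    """Sum the length of all closed outages; open-ended ones are skipped."""
    return sum((end - start for start, end in outages if end is not None),
               timedelta())

# Made-up probe results, one every 10 seconds: up, up, down five times, up again.
t0 = datetime(2008, 8, 29, 12, 0, 0)
states = [True, True, False, False, False, False, False, True]
probes = [(t0 + timedelta(seconds=10 * i), s) for i, s in enumerate(states)]

outages = outage_intervals(probes)
print(outages)
print(total_downtime(outages))
```

The probe interval bounds the precision: reporting a failover "to the second" would need probes roughly once per second, or a combination with the startup time recorded by the database itself.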
For High Availability we use a simple Windows cluster with Oracle Fail Safe. Automatic failover takes about 53 seconds and has occurred twice this year (we've been lucky). The customers know that the SPOF for this solution is the SAN storage, but are not willing to pay for something more reliable.
So we would like to give them numbers like "As long as there is no storage failure, you get 99.99% availability. If there is - bad luck." Maybe we can talk them into Data Guard.
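To put rough numbers on this, here is a sketch of the arithmetic, assuming a 365-day year, 24x7 service time, and that the two failovers of about 53 seconds were the only downtime. The function names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year, 24x7 service time

def availability(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Fraction of the period during which the database was up."""
    return 1.0 - downtime_seconds / period_seconds

def downtime_budget(target, period_seconds=SECONDS_PER_YEAR):
    """Seconds of downtime allowed by an availability target such as 0.9999."""
    return (1.0 - target) * period_seconds

# Two automatic failovers of about 53 seconds each.
failover_downtime = 2 * 53

print(f"availability with failovers only: {availability(failover_downtime):.6%}")
print(f"budget at 99.9%:  {downtime_budget(0.999) / 3600:.2f} hours per year")
print(f"budget at 99.99%: {downtime_budget(0.9999) / 60:.2f} minutes per year")
```

Under these assumptions 99.99% leaves a budget of about 52.6 minutes per year, so two failovers of under a minute each fit easily, while a single storage failure lasting an hour would not.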
Business would also like to use these numbers in upcoming discussions of service levels and prices, as well as for bragging.
A real monitoring system for the whole company (not only databases) is being built, but will take time. There are several unsolved problems in the proposed solution.
Niall Litchfield wrote:
> Aaaarrrrgh! I'm sure there's a purpose that isn't lying to justify
> expensive investments. I just cannot see it. Real HA must do service
> level monitoring (aka can the users work) what you seem to propose
> has no clear benefit, please tell me I'm wrong.
>
> On 28/08/2008, Ingrid Voigt <GiantPanda_at_gmx.net> wrote:
>> we are looking for a tool to measure and report the availability of our
>> databases in the HA range, i.e. with high precision. At this time we are
>> only interested in the database state, not whether the customers can work.
>> The database versions involved are 9.2 - 10.2, 11 coming next year. All
>> editions: SE1, SE and EE.
>> So far, we have been using EM Grid Control, but beyond 99.9% this is not
>> precise enough. Too many failures of the agent/the Grid Control system
>> rather than the database and too much time between "database back up"
>> and "agent notices that database is back up". A switch in the failsafe
>> clusters takes less than a minute and should be reported to the second,
>> if possible.
>> We can get startup time easily from a database trigger or the alertlog,
>> but have no good way to measure shutdown time so far. Is there
>> something good available (free would be nice) or do we have to build it
>> ourselves?
>> Thanks for your help.
>> Ingrid Voigt
Received on Fri Aug 29 2008 - 15:24:43 CDT