Single points of failure, how to identify them?

From: Guillermo Alan Bort <cicciuxdba_at_gmail.com>
Date: Wed, 6 Jul 2011 16:19:04 -0300
Message-ID: <CAJ2dSGQ5BmfVM6G7O_EQVeRVH_qzdE-T7H8k9ytoDdXvw8kwxA_at_mail.gmail.com>



A few days ago another business was hit by a bug that caused some corruption in ASM. I'm not familiar with the details (which bug, or exactly what happened), but they ended up having to restore a bunch of databases. This got me thinking... we don't normally like to admit it, but we do have single points of failure, and identifying them could help us prepare for any issue affecting one of them (or find an alternative that minimizes downtime).

There are several things to consider when talking about points of failure, and I might even start a blog series about this topic, but I will try to describe what I consider to be a point of failure.

A point of failure is any part of a system (in our case, a computer system) that, by not performing its task as designed, could cause problems in the end result expected of the system. That means we need to define what a system is. To keep life easy, I will define a system as a set of tools and processes that transform something into something else (in our case, information). Systems include hardware, software and human components.

When looking for points of failure in a system, one must consider the full extent of the system, then take a close look at each and every component and ask: if this stops working, what will happen to the entire system?

If the answer to that question is "the entire system will stop working", well, that's a single point of failure...
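
As a rough sketch of that question in Python: assume (purely for illustration) that a system can be described as a set of roles, each served by one or more interchangeable components, and that the system works only while every role has at least one working component. The role and component names below are made up. The code fails each component in turn and reports the ones whose loss stops the whole system.

```python
# Hypothetical model: the system works only if every role it needs
# still has at least one component that has not failed.
def system_works(roles, failed):
    return all(
        any(component not in failed for component in providers)
        for providers in roles.values()
    )


def single_points_of_failure(roles):
    """Fail each component on its own and keep those that stop the system."""
    components = {c for providers in roles.values() for c in providers}
    return sorted(c for c in components if not system_works(roles, {c}))


# Illustrative inventory only; the names are assumptions, not a real setup.
roles = {
    "database instance": ["rac_node_1", "rac_node_2"],
    "storage": ["asm_diskgroup"],
    "listener": ["listener_a", "listener_b"],
    "rdbms software": ["patchset_x"],
}

print(single_points_of_failure(roles))  # ['asm_diskgroup', 'patchset_x']
```

The model is deliberately naive: it only catches roles with a single provider, and it says nothing about components that look redundant but share a hidden dependency.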

What can we do to prevent the "entire system" from "stopping working"? Usually it comes down to redundancy... we take the "piece" that is likely to bring the system down and add a couple of replacements that will pick up its task should it fail. Some setups even run the replacements at the same time, increasing efficiency while everything runs smoothly (or giving the system administrator a headache).
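
The "replacement picks up the task" idea can be sketched as a simple failover loop. This is only an illustration of the principle, not how any particular product does it; `call_with_failover`, `primary` and `standby` are hypothetical names.

```python
def call_with_failover(providers, request):
    """Try each redundant provider in order; return the first successful answer."""
    errors = []
    for provider in providers:
        try:
            return provider(request)
        except Exception as exc:  # a failed provider just means "try the next one"
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


# Hypothetical providers: the primary is down, the standby still answers.
def primary(request):
    raise ConnectionError("primary is down")


def standby(request):
    return f"standby handled {request}"


print(call_with_failover([primary, standby], "select 1"))  # standby handled select 1
```

Note that this only helps when the providers fail independently; if every provider fails for the same reason, the loop simply runs out of candidates.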

Now, back to real life... we are DBAs, well, some of us are... and we manage databases... so... what are the points of failure in a database, and how do you work your way around them? Have you ever found anything that cannot be solved by redundancy? (Data corruption usually falls into this category.) What do you do then?

Well, the RDBMS itself is a point of failure: if a bug hits a particular patchset, then no matter how many RAC nodes you have, they will all hit it (unless it's an intermittent bug!!!). ASM behaves pretty much the same way.
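
One way to look for this kind of shared weakness is to list what each "redundant" component depends on and intersect the lists: anything common to every member of the group is still a single point of failure. A minimal sketch, assuming a hand-maintained dependency map (the names `rdbms_patchset`, `host_a` and so on are invented for the example):

```python
def common_mode_causes(redundant_group, depends_on):
    """Return the dependencies shared by every member of a redundant group."""
    sets = [set(depends_on.get(component, ())) for component in redundant_group]
    return set.intersection(*sets) if sets else set()


# Hypothetical dependency map: both RAC nodes run the same software and storage.
depends_on = {
    "rac_node_1": {"rdbms_patchset", "asm", "host_a"},
    "rac_node_2": {"rdbms_patchset", "asm", "host_b"},
}

shared = common_mode_causes(["rac_node_1", "rac_node_2"], depends_on)
print(sorted(shared))  # ['asm', 'rdbms_patchset']
```

Here the two nodes protect against losing a host, but not against a bug in the shared patchset or in ASM.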

You can have multiple homes and keep listeners of different versions ready in case you run into problems with any particular one.

User error... well, users are already redundant enough, let's not make them more redundant :-P

cheers
Alan.-

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Jul 06 2011 - 14:19:04 CDT
