Re: Single point of failures, how to identify them?

From: Martin Berger <martin.a.berger_at_gmail.com>
Date: Thu, 7 Jul 2011 08:45:16 +0200
Message-ID: <CALH8A90p4w+E1Ai6NzuNKu1VdeQ4uGh5-u+4Y2p7qQz_1ANk0Q_at_mail.gmail.com>



Alan,

It has been a long time since I had to formally prove a statement, and I was never good at it. But please try to follow this short idea:

  1. If you find a SPOF, you can a) create a 2nd instance of this functionality, b) in the best case built on a different technology, so that both instances do not share the same errors, and then you have to c) introduce a ruling instance that decides which one is the 'right one' in case of an error.
  2. As 1c) is itself a SPOF, you have to apply 1a) to it.

If you can not avoid 1c), you will be in an endless loop, increasing complexity with every iteration.
So please, show me 1c) is wrong, I really would like to build my systems without 1c)!!!
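
To put rough numbers on point 1, here is a minimal sketch. It assumes that failures are independent and that the ruling instance from 1c) sits in series with the redundant pair; both are simplifications, and the function names and availability figures are made up for illustration only.

```python
def parallel(*availabilities):
    """Redundant instances: the group fails only if every instance fails."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


def series(*availabilities):
    """Chained parts: the chain works only if every part works."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


single = 0.99      # assumed availability of the original SPOF
arbiter = 0.999    # assumed availability of the ruling instance (1c)

# 1a) duplicate the component, 1c) put the ruling instance in front of the pair
pair_with_arbiter = series(parallel(single, single), arbiter)

print(f"single instance:      {single:.6f}")
print(f"pair + ruling inst.:  {pair_with_arbiter:.6f}")
```

Under these assumptions the pair is better than the single instance, but the result can never exceed the availability of the ruling instance itself, which is why 1c) keeps coming back as the next SPOF.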

The one thing I am sure about is: complexity is one of the biggest dangers to availability.

To introduce a stop condition I'd like to add the probability of failure. Now let's enhance the algorithm:

  1. define the desired availability
  2. calculate the current availability of your system
  3. if the current availability (step 2) is lower than the desired availability (step 1), then
  4. find the SPOF with the highest probability of failure
  5. replace it as described in point 1 of the first list
  6. calculate the new probability of failure; keep the change only if it is better, otherwise revert it
  7. continue with step 2
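
The loop above could be sketched roughly as follows. This is only an illustration under strong assumptions: all components are in series, failures are independent, a "replacement" means two instances plus a ruling instance of fixed availability, and every name and number here (`improve`, `make_redundant`, the component figures) is hypothetical.

```python
def system_availability(components):
    """Series model: the system works only if every component works."""
    total = 1.0
    for a in components.values():
        total *= a
    return total


def make_redundant(a, arbiter):
    """Two instances in parallel, followed by a ruling instance in series."""
    return (1.0 - (1.0 - a) ** 2) * arbiter


def improve(components, target, arbiter=0.9999):
    components = dict(components)   # work on a copy, leave the input alone
    untried = set(components)       # SPOFs that have not been examined yet
    # stop when the target is reached or no candidate is left
    while system_availability(components) < target and untried:
        # the SPOF most likely to fail is the one with the lowest availability
        worst = min(untried, key=lambda name: components[name])
        untried.discard(worst)
        old = components[worst]
        new = make_redundant(old, arbiter)
        if new > old:
            components[worst] = new  # keep the change
        # otherwise the old value stays in place (the "revert")
    return components, system_availability(components)


parts = {"storage": 0.995, "database": 0.999, "network": 0.9995}
result, achieved = improve(parts, target=0.999)
print(result)
print(f"achieved availability: {achieved:.6f}")
```

Note that the sketch never tries to make the ruling instance itself redundant; it simply treats its availability as a given, which is exactly the open question from the first list.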

I have left out one last aspect: Money.
Every component in your system costs money. If you add components, the system will get more expensive.
So in step 6 we should also calculate the additional effort, and weigh it against the expected (availability) gain.
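
One possible way to weigh cost against gain is to express the gain as downtime hours saved per year and divide by the extra cost. The sketch below does just that; the candidate changes, their availability figures, the costs and the budget are all invented for illustration.

```python
HOURS_PER_YEAR = 8760


def downtime_hours(availability):
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def hours_saved_per_cost(before, after, extra_cost):
    """Downtime hours saved per year for each unit of money spent."""
    return (downtime_hours(before) - downtime_hours(after)) / extra_cost


# made-up candidates: (name, availability before, availability after, extra cost)
candidates = [
    ("second storage array", 0.9935, 0.9984, 50_000),
    ("standby database", 0.9935, 0.9945, 5_000),
]
budget = 60_000

# only consider changes that fit into the budget, then pick the best ratio
affordable = [c for c in candidates if c[3] <= budget]
best = max(affordable, key=lambda c: hours_saved_per_cost(c[1], c[2], c[3]))
print(best[0])
```

With a ratio like this, a cheap change with a small gain can beat an expensive one with a larger gain, so it is worth checking the result against the desired availability as well.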

I'm not sure if this is really the best method, but within some iterations you should be able to show if the expected availability can be reached within the given budget.

I hope that makes some sense to you,
 Martin

On Wed, Jul 6, 2011 at 21:19, Guillermo Alan Bort <cicciuxdba_at_gmail.com> wrote:
> A few days ago another business was hit with a bug and they got some
> corruption on ASM. I'm not very familiar with what happened, what bug or
> anything like that but they ended up having to restore a bunch of databases.
> This got me thinking... we don't normally like to admit it but we do have
> single points of failure and identifying them could help us be prepared to
> deal with any issue impacting one of them (or find an alternative to
> minimize downtime).
>
> There are several things to consider when talking about points of failure,
> and I might even start a blog series about this topic, but I will try to
> describe what I consider to be a point of failure.
>
> A point of failure is any part of a system (in our case it would be a
> computer system) that by not performing its task as designed could cause
> problems in the end result expected of the system. This brings us to try and
> define what a system is, and for the purposes of making life easy for us I
> will choose to define a system as a set of tools and processes that
> transform something into something else (in our case information). Systems
> include hardware, software and human components.
>
> When looking for points of failure in a system one must consider the full
> extent of the system and then take a close look at each and every component
> of that system and ask: If this here stops working, what will happen with
> the entire system?
>
> What if the answer to that question is "the entire system will stop working"
> well, that's a point of failure...
>
> What can we do to prevent the "entire system" from "stopping working"? Usually it
> comes down to redundancy... we take the "piece" that is likely to cause the
> system to not work and we throw in a couple of replacements that will pick up
> its task should it fail to do it. Some will even do it at the same time,
> increasing efficiency while everything runs smoothly (or giving a headache
> to the system administrator).
>
> Now, back to real life... we are DBAs, well, some of us are... and we manage
> databases... so... what are the points of failure in a database and how do
> you work your way around them? Have you ever found anything that cannot be
> solved by redundancy? (usually data corruption falls into this category)
> what do you do then?
>
> Well, the RDBMS itself is a point of failure, if there is a bug hitting a
> particular patchset, no matter how many RAC nodes you have, they will all
> hit it (unless it's an intermittent bug!!!). ASM behaves pretty much the same
> way.
>
> You can have multiple homes and have listeners of different versions ready
> should you run into problems with any particular one.
>
> User error... well, users are already redundant enough, let's not make them
> more redundant :-P
>
> cheers
> Alan.-

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Jul 07 2011 - 01:45:16 CDT
