RE: 24 x 7 on NT?

From: Mohan, Ross <MohanR_at_STARS-SMI.com>
Date: Wed, 27 Jun 2001 08:54:48 -0700
Message-ID: <F001.0033A0DA.20010627080838@fatcity.com>

*VERY* Interesting! Thanks, Paul.....

(where is Eric?)

-----Original Message-----
Sent: Wednesday, June 27, 2001 5:21 AM
To: Multiple recipients of list ORACLE-L

dgoulet_at_vicr.com wrote:
>
> Well, I guess so if that was the only occurrence. I'll never know and I doubt
> that they will fess-up.
>
> At any rate, if one wants to use NT or any other OS for that matter in a 24x7
> guaranteed manner then one should look into making as much as possible
> redundant. Back in my Blue Suit days we did a lot of cause and effect analysis,
> particularly on Nuclear stuff, to ensure that if one component failed there was
> a redundant part to take over the tasks of the failed unit. We also did
> analysis to determine what the likelihood of the failure was and what the
> cost/benefit of having the redundant part was. Basically, if you can expect say
> 1 failure every 8544 hours and it will take less than 1 hour to correct the
> failure, is it worth the expense to have redundant hardware for that failure?
> It's one of those things that needs to be evaluated on a case by case basis. In
> the case of NT, you'd need a separate server and be running OPS. What is the
> cost, what is the expected frequency, and is the loss ?= the cost??
>
> Good questions, but only you can provide the answers. In the case we have here,
> our HPs fail once every 4 years on average over the 10+ years of history we
> have with HP. And each failure takes about 2 hours to fix. Now at $1000 per
> minute of lost revenue that comes to $120,000. A dual server and OPS
> architecture would cost $190,000 just to acquire the hardware and software.
> Definitely not worth the expense since all of the failures we've had have been
> soft ones anyway.
>
> Dick Goulet
>
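
A rough sketch of the arithmetic in Dick's example, in Python. The figures (one failure every 4 years, 2 hours per failure, $1000 per minute, $190,000 for the dual-server setup) come from his message. The function names are hypothetical, and the payback estimate assumes redundancy would remove the whole outage cost and has no running costs, which is optimistic.

```python
def loss_per_failure(outage_hours, revenue_per_minute):
    """Revenue lost in a single outage."""
    return outage_hours * 60 * revenue_per_minute


def expected_annual_loss(years_between_failures, outage_hours, revenue_per_minute):
    """Average yearly loss, spreading one outage over the interval between failures."""
    return loss_per_failure(outage_hours, revenue_per_minute) / years_between_failures


# Figures quoted in the message above.
per_failure = loss_per_failure(outage_hours=2, revenue_per_minute=1000)
annual = expected_annual_loss(years_between_failures=4, outage_hours=2,
                              revenue_per_minute=1000)

# Assumption: redundancy avoids the entire loss and costs nothing to run.
redundancy_cost = 190_000
payback_years = redundancy_cost / annual

print(f"loss per failure:     ${per_failure:,.0f}")
print(f"expected annual loss: ${annual:,.0f}")
print(f"payback period:       {payback_years:.1f} years")
```

Under these assumptions the redundant hardware costs more than a single outage, which matches Dick's conclusion; a different failure rate or revenue figure could reverse it.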

This email lacks organization, as it is just notes from memory and searching.

My experience in this area is somewhat dated - almost 20 years ago.

One professor at CMU (G.J. Powers) covered failure mode analysis in the design of chemical process plants in an intro to ChemE course. Basically, the event was a human fatality, and the threshold (circa 1983) was the rule of 1 death in 20,000 man-years. He is a co-author of the Lapp-Powers algorithm. I was truly impressed by his use of heuristics as a general problem-solving method.

Here, the event may be the inability of a user to connect via the internet within 10 seconds, or the inability to provide business continuity via a disaster recovery site.

http://www.drj.com is a good start for disaster recovery stuff - but that is off the topic.

A Google search on "fault tree analysis" or "failure mode analysis" turned up some interesting links. For models, also look for "Hazop", the term used for operability analysis in the chemical processing industry.

Much of this type of research was done during the US space program - the Apollo missions in particular. Handling LOX and having enriched oxygen atmospheres tends to make people pay attention to safety. Also, a great deal of research in this area was done in the nuclear power industry.

links:
NASA is usually a good one - http://www.sti.nasa.gov/new/fta34.html
Sandia National Labs - http://reliability.sandia.gov/Reliability/Fault_Tree_Analysis/fault_tree_analysis.html
here's one commercial one - http://www.fault-tree.com/
http://www.high-availability.com/docs/index.htm

Basically, you want to perform a failure mode analysis and prepare a fault tree.
Interdependencies are especially important to cover, as an instability in one system (e.g. a DNS server) can cause rippling effects in other systems that are coupled to it.
Events are classified as minor (+1) (loss of a hard disk) or major (+10) (backhoe severs fiber-optic backbone of half of the US).
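
A minimal sketch of how a fault tree combines event probabilities, in Python. The gate functions, the tree shape and all probabilities are made up for illustration, and the formulas assume the basic events are independent. Coupled systems break exactly that assumption, so treat the result as optimistic.

```python
from math import prod


def and_gate(probs):
    """Output event occurs only if ALL inputs fail (independent events)."""
    return prod(probs)


def or_gate(probs):
    """Output event occurs if ANY input fails (independent events)."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical per-year failure probabilities, not measured values.
p_server = 0.01   # one database server is down
p_dns = 0.001     # the shared DNS server is down

# The redundant pair only fails if both servers fail together.
p_both_servers = and_gate([p_server, p_server])

# Top event: the site is unreachable if DNS fails OR both servers fail.
p_outage = or_gate([p_dns, p_both_servers])

print(f"both servers down: {p_both_servers:.6f}")
print(f"site unreachable:  {p_outage:.6f}")
```

With these invented numbers the single DNS server dominates the top event, which is the kind of interdependency the tree is meant to expose.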

you'll need some sample figures for various components.

 MTTF (mean time to failure)
 MTBF (mean time between failures)
 MTTR (mean time to repair)

Couple that with component prices, and you should be able to produce a decent model of how each additional cost of redundancy incrementally decreases the chance of failure. The administrative costs of staff expertise, additional training, testing and documentation are more difficult to estimate. Don't underestimate human factors as the primary cause of various failures.
Human Reliability Analysis is a good buzzword to describe this area. http://reliability.sandia.gov/Human_Factor_Engineering/Human_Reliability_Analysis/human_reliability_analysis.html
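
A sketch of the kind of model described above, in Python. It uses the standard steady-state formula availability = MTTF / (MTTF + MTTR). The 8544-hour and 1-hour figures are borrowed from Dick's example and treated as MTTF and MTTR; the function names are hypothetical. The redundant case assumes independent failures and instant failover, and it ignores human factors entirely, so it is an upper bound at best.

```python
HOURS_PER_YEAR = 8760


def availability(mttf_hours, mttr_hours):
    """Steady-state availability of a single repairable unit."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(single, units):
    """Availability of `units` parallel copies, assuming independent failures."""
    return 1.0 - (1.0 - single) ** units


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


single = availability(mttf_hours=8544, mttr_hours=1)
pair = redundant_availability(single, units=2)

for label, a in (("single server", single), ("redundant pair", pair)):
    hours = downtime_hours_per_year(a)
    print(f"{label}: availability {a:.6f}, about {hours:.4f} hours down per year")
```

Multiplying the downtime saved by the cost of an hour of outage, and comparing that with the price of the second unit, gives the incremental cost/benefit figure for each added layer of redundancy.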

Also throw in an analysis of which spares to keep on hand vs. paying for tighter turn-around times from vendors in support agreements.

This is something that I've been meaning to do for a while.

I'd bet that many a thesis has already been prepared in this area.

An interesting link to a fault-tree analysis of intrusion detection: http://citeseer.nj.nec.com/395103.html

I've been told that much of the Oracle high-avail is in korn shell - and downloadable.

bibliography of texts in this area:
http://www.enre.umd.edu/srel/edures/rbooks.htm

sleep.

Paul

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Paul Drake
  INET: paled_at_home.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).
-- 
Author: Mohan, Ross
  INET: MohanR_at_STARS-SMI.com

Received on Wed Jun 27 2001 - 10:54:48 CDT