Re: 24 x 7 on NT?

From: Peter McLarty <peter.mclarty_at_incts.com>
Date: Wed, 27 Jun 2001 04:58:53 -0700
Message-ID: <F001.00339B7B.20010627045033@fatcity.com>

Interesting rambling,

I had never seen a fault tree before, but I had in some way used one at times without really understanding what I was doing. I will have to study these links some more.

Peter

At 01:21 AM 27/06/2001 -0800, you wrote:
>dgoulet_at_vicr.com wrote:
> >
> > Well, I guess so if that was the only occurrence. I'll never know and I doubt
> > that they will fess-up.
> >
> > At any rate, if one wants to use NT, or any other OS for that matter, in a 24x7
> > guaranteed manner, then one should look into making as much as possible
> > redundant. Back in my Blue Suit days we did a lot of cause and effect analysis,
> > particularly on nuclear stuff, to ensure that if one component failed there was
> > a redundant part to take over the tasks of the failed unit. We also did
> > analysis to determine what the likelihood of the failure was and what the
> > cost/benefit of having the redundant part was. Basically, if you can expect say
> > 1 failure every 8544 hours and it will take less than 1 hour to correct the
> > failure, is it worth the expense to have redundant hardware for that failure?
> > It's one of those things that needs to be evaluated on a case by case basis. In
> > the case of NT, you'd need a separate server and be running OPS. What is the
> > cost, what is the expected frequency, and is the loss ?= the cost??
> >
> > Good questions, but only you can provide the answers. In the case we have here,
> > our HPs fail once every 4 years on average over the 10+ years of history we
> > have with HP. And each failure takes about 2 hours to fix. Now at $1000 per
> > minute of lost revenue that comes to $120,000 per failure. A dual server and OPS
> > architecture would cost $190,000 just to acquire the hardware and software.
> > Definitely not worth the expense since all of the failures we've had have been
> > soft ones anyway.
> >
> > Dick Goulet
> >
>
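The arithmetic in Dick's HP example can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the break-even figure assumes that redundancy would eliminate the outage loss entirely and that the redundant setup has no running costs, neither of which the thread claims.

```python
def expected_annual_loss(failures_per_year, hours_per_failure, loss_per_minute):
    """Expected revenue lost to outages per year (hypothetical helper)."""
    return failures_per_year * hours_per_failure * 60 * loss_per_minute


def break_even_years(redundancy_cost, annual_loss):
    """Years of avoided loss needed to pay back the redundancy purchase.

    Assumes redundancy removes the whole loss and costs nothing to run.
    """
    return redundancy_cost / annual_loss


# Figures quoted in the thread: one failure every 4 years, about 2 hours
# to fix, $1000 per minute of lost revenue, $190,000 for dual server + OPS.
loss_per_failure = 2 * 60 * 1000  # 120 minutes at $1000 each = $120,000
annual_loss = expected_annual_loss(1 / 4, 2, 1000)
payback = break_even_years(190_000, annual_loss)

print(f"loss per failure:      ${loss_per_failure:,}")
print(f"expected loss per year: ${annual_loss:,.0f}")
print(f"break-even after:       {payback:.1f} years")
```

On these assumptions the expected loss is $30,000 per year, so the hardware would need a little over six years of avoided outages to pay for itself, which is consistent with Dick's conclusion.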
>this email lacks organization, as it is just notes from memory and
>searching.
>
>My experience in this area is somewhat dated - almost 20 years ago.
>
>One professor at CMU (G.J. Powers) covered failure mode analysis in the
>design of Chemical Process plants in an intro to ChemE course.
>Basically, the event was a human fatality, and the rule of 1 death in
>20,000 man*years was the threshold. (circa 1983). He is a co-author of
>the Lapp-Powers algorithm. I was truly impressed by his use of
>heuristics as a general problem solving method.
>
>Here, the event may be the inability of a user to connect via the
>internet within 10 seconds, or the inability to provide
>business continuity via a disaster recovery site.
>
>http://www.drj.com is a good start for disaster recovery stuff - but
>that is off the topic.
>
>A google search on "fault tree analysis" or "failure mode analysis"
>turned up some interesting links. Look for the term "Hazop" as a term
>used for operability analysis in the Chemical Processing Industry for
>models.
>
>much of this type of research was accomplished during the US space
>program - Apollo missions in particular. handling LOX and having
>enriched oxygen atmospheres tends to make people pay attention to
>safety. also - a great deal of research in this area was accomplished in
>the nuclear power industry.
>
>links:
>NASA is usually a good one - http://www.sti.nasa.gov/new/fta34.html
>Sandia National Labs
>http://reliability.sandia.gov/Reliability/Fault_Tree_Analysis/fault_tree_analysis.html
>here's one commercial one - http://www.fault-tree.com/
>http://www.high-availability.com/docs/index.htm
>
>basically, you want to perform a failure mode analysis and prepare a
>fault tree.
>interdependencies are especially important to cover, as an instability
>in one system can then cause rippling effects in other systems that are
>coupled (e.g. a DNS Server).
>events are classified as minor (+1) (loss of hard disk) and major (+10)
>(backhoe severs fiber-optic backbone of half of US).
>
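A fault tree combines basic events through AND and OR gates up to a single top event. A minimal sketch of that calculation follows, assuming the basic events are statistically independent (which is exactly what coupled systems such as a shared DNS server violate, so treat the result as optimistic). The event names and probabilities are invented for illustration and are not measurements.

```python
from math import prod


def p_and(*probs):
    """AND gate: the output event occurs only if every input occurs.

    Assumes the inputs are independent.
    """
    return prod(probs)


def p_or(*probs):
    """OR gate: the output event occurs if at least one input occurs.

    Assumes the inputs are independent.
    """
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical per-period failure probabilities for the basic events.
p_dns = 0.001     # the only DNS server fails
p_server = 0.01   # one database server fails
p_link = 0.002    # the network link is cut

# Redundant servers: service is lost only if both fail.
p_both_servers = p_and(p_server, p_server)

# Top event: users cannot connect.
p_top = p_or(p_dns, p_both_servers, p_link)

print(f"P(both servers down)  = {p_both_servers:.6f}")
print(f"P(users cannot connect) = {p_top:.6f}")
```

With these made-up numbers the redundant server pair contributes far less to the top event than the single DNS server or the single link, which is the kind of result a fault tree is meant to expose.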
>
>you'll need some sample figures for various components.
> MTTF (mean time to failure)
> MTBF (mean time between failures)
> MTTR (mean time to repair)
>
>couple that with component prices, and you should be able to produce a
>decent model for how to incrementally decrease chance of failure vs.
>additional cost of redundancy. The administrative costs of staff level
>of expertise, additional training, testing and documentation are more
>difficult to estimate. Don't under-estimate human factors as being the
>primary cause of various failures.
>Human Reliability Analysis is a good buzzword to describe this area.
>http://reliability.sandia.gov/Human_Factor_Engineering/Human_Reliability_Analysis/human_reliability_analysis.html
>
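The MTTF and MTTR figures mentioned above feed directly into an availability estimate, which can then be set against the price of each added unit. The sketch below uses the common convention that steady-state availability is MTTF / (MTTF + MTTR), with MTBF = MTTF + MTTR for a repairable component. It borrows the "1 failure every 8544 hours, under 1 hour to correct" figures from earlier in the thread and treats 8544 hours as the MTTF, which is an assumption. The unit cost is invented, failures are assumed independent, and human factors are ignored entirely.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def availability(mttf_hours, mttr_hours):
    """Steady-state availability of a single repairable component."""
    return mttf_hours / (mttf_hours + mttr_hours)


def parallel_availability(single, n):
    """Availability of n redundant units when any one unit suffices.

    Assumes the units fail independently of each other.
    """
    return 1.0 - (1.0 - single) ** n


def annual_downtime_hours(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


UNIT_COST = 50_000  # hypothetical price per server, not from the thread

single = availability(8544, 1)
for n in (1, 2, 3):
    a = parallel_availability(single, n)
    print(f"{n} unit(s): cost ${n * UNIT_COST:,}, "
          f"availability {a:.8f}, "
          f"expected downtime {annual_downtime_hours(a):.4f} h/year")
```

Each row shows what one more unit buys in expected downtime, so the cost of that unit can be compared against the revenue lost per hour of outage.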
>Throw in an analysis as to what spares to have on hand also - vs.
>carrying tighter turn-around times from vendors in support agreements.
>
>This is something that I've been meaning to do for a while.
>
>I'd bet that many a thesis has already been prepared in this area.
>
>an interesting link to a Fault-Tree analysis of Intrusion Detection:
>http://citeseer.nj.nec.com/395103.html
>
>I've been told that much of the Oracle high-avail is in korn shell - and
>downloadable.
>
>bibliography of texts in this area:
>http://www.enre.umd.edu/srel/edures/rbooks.htm
>
>sleep.
>
>Paul
>--
>Please see the official ORACLE-L FAQ: http://www.orafaq.com
>--
>Author: Paul Drake
> INET: paled_at_home.com
>
>Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
>San Diego, California -- Public Internet access / Mailing Lists
>--------------------------------------------------------------------
>To REMOVE yourself from this mailing list, send an E-Mail message
>to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
>the message BODY, include a line containing: UNSUB ORACLE-L
>(or the name of mailing list you want to be removed from). You may
>also send the HELP command for other information (like subscribing).

Author: Peter McLarty
  INET: peter.mclarty_at_incts.com


Received on Wed Jun 27 2001 - 06:58:53 CDT