Re: Failure Modes and Ranges of Operation

From: Marshall <marshall.spight_at_gmail.com>
Date: 3 Feb 2007 11:26:24 -0800
Message-ID: <1170530784.938933.208330_at_k78g2000cwa.googlegroups.com>


On Feb 3, 7:26 am, Bob Badour <bbad..._at_pei.sympatico.ca> wrote:
> Neo's recent troll started me down a meandering path leading to these
> two separate but somewhat related (engineering) concerns.
>
> An operating range describes the conditions under which one of our
> devices will operate without failure. Failure modes are what happens
> when one tries to operate a device beyond that range.
>
> The sci.logic sort of folks grapple with the problems one encounters
> when one tries to have an infinite range of operation. And yet one can
> never fully escape the failure mode problem because division by zero
> generally fails.
>
> Failure mode analysis is very important in engineering. I wonder whether
> it has any supporting theory? Certainly, one can think of general
> principles. Likewise, beyond statistical analysis and empirical
> measurement, does any theory exist regarding ranges of operation?
>
> Any thoughts?

One theoretically sound aspect of the issue you're raising is the totality/partiality of functions. "Total functions" are functions whose "operating range" is 100% of their inputs. Partial functions are those that fail in one way or another on some inputs. Current terminology doesn't distinguish between two importantly different kinds of failure: results that are undefined, and results whose computational cost is infinite. This may sound ridiculous on the surface, but I believe it is important, if nothing else, to consider the two cases separately, because they behave differently on current hardware. Consider that these broken functions are broken in very different ways:

  f(x) = x/0

  f(x) = f(f(x))

(I'll go out on a limb here and say that we may someday be able to do something with the second kind, the kind that requires infinite resources to compute. Consider that wild speculation, though.)
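
To make the "behave differently" point concrete, here is a minimal sketch in Java (the method names are just mine for illustration). The first function faults immediately with a division error; the second never produces a value, and on a real machine the recursion eventually blows the stack:

  // Undefined result: integer division by zero faults at runtime
  // (in Java, with an ArithmeticException).
  static int undef(int x) {
    return x / 0;
  }

  // Infinite computational cost: f(x) = f(f(x)) never returns a value;
  // on real hardware the recursion eventually overflows the stack.
  static int forever(int x) {
    return forever(forever(x));
  }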

In general, we find that "forward" computations, those built out of successive lower-level computations, do not cause us problems. Given the ordered naturals, we can speak of the successor function; given successor, we can construct addition, which gives rise to multiplication, which gives rise to exponentiation, which gives rise to tetration, ad infinitum. Each of these functions is total. Each successive one has an inverse, and it seems the inverses are a significant source of problems.
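
As a sketch (Java ints standing in for the naturals, so ignore overflow; the names are mine), each of these forward functions is total: feed it naturals and you always get a natural back.

  static int succ(int n) { return n + 1; }

  // Addition built from successor: a + b is succ applied b times to a.
  static int add(int a, int b) {
    return (b == 0) ? a : succ(add(a, b - 1));
  }

  // Multiplication built from addition: a * b is a added to itself b times.
  static int mul(int a, int b) {
    return (b == 0) ? 0 : add(a, mul(a, b - 1));
  }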

The predecessor function is defined on every natural number except 0. Oops! Subtraction is defined only on pairs of naturals where the minuend is >= the subtrahend. The integers solve this and make subtraction total. Multiplication is still total when we extend it to the integers, but dividing most pairs of integers doesn't produce an integer. So we get the rationals, yet we still have a problem at zero. At the next level the forward function, exponentiation, is fine, but here we have *two* inverse functions, because exponentiation is not commutative. We get logs and roots, and the rationals are closed under neither.
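
Back at the first level, here's what that partiality looks like in code (again a sketch with ints as the naturals; I'm returning null as a stand-in for "undefined" just to make the partiality visible):

  // pred is partial on the naturals: there is no answer at 0.
  static Integer pred(int n) {
    if (n > 0)
      return n - 1;
    else
      return null;   // undefined
  }

  // Subtraction is defined only when the minuend is >= the subtrahend.
  static Integer subtract(int minuend, int subtrahend) {
    if (minuend >= subtrahend)
      return minuend - subtrahend;
    else
      return null;   // undefined over the naturals
  }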

But it is entirely possible to restrict ourselves to functions we know are total, or to apply partial functions only to inputs on which they are defined. A simple example:

In this code, I don't know for sure that it will work until runtime, because I don't know that y != 0:

  int f(int x, int y) {
    return x / y;
  }

However, this code will always work, independent of the values of x and y:

  int f(int x, int y) {
    if (y != 0)
      return x / y;
    else
      return 0;
  }

Most compilers aren't smart enough to know that the first return cannot fault, but there are a few research compilers that can figure this out, and I think it's just a matter of time until this capability becomes more commonplace.

There is also resource usage to consider. Given a finite machine and a given program, how can we be sure it is free of stack overflows or out-of-memory errors? It turns out to be possible to know this with an effects system. The principle is simple: for every effect one wishes to track (maximum memory usage, nontermination, etc.), one needs an algebra that, for each kind of node in the language's abstract syntax, computes the node's effect from the effects of its children.
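
Here's a toy sketch of the idea (all of the names are made up) for a single effect, an upper bound on stack depth, computed bottom-up over a tiny expression language that has calls but no recursion:

  // Each kind of AST node combines its children's bounds into its own.
  interface Expr {
    int maxStackDepth();
  }

  class Literal implements Expr {
    int value;
    Literal(int value) { this.value = value; }
    public int maxStackDepth() { return 0; }           // no calls at all
  }

  class Add implements Expr {
    Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public int maxStackDepth() {                       // operands evaluated one after the other
      return Math.max(left.maxStackDepth(), right.maxStackDepth());
    }
  }

  class Call implements Expr {
    Expr calleeBody;                                   // body of a known, non-recursive callee
    Call(Expr calleeBody) { this.calleeBody = calleeBody; }
    public int maxStackDepth() {                       // one fresh frame plus whatever the body needs
      return 1 + calleeBody.maxStackDepth();
    }
  }

Because there is no recursion here, the bound is finite and can be computed entirely at compile time; admit recursion and you need either a fixpoint argument or a rejection.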

This *necessarily* comes at the cost of reduced expressiveness. For any given system, there are perfectly correct programs it cannot prove correct. However, the class of provably correct programs is still very expressive, and for the right application the cost in expressiveness and proof obligations is worth paying. Medical devices and avionics are often mentioned at this point.

For example, recursion makes many of these calculations harder, so there are systems in which it is simply not allowed. You pay a cost in expressiveness, and you get a benefit: without recursion, termination analysis is much easier. If you eliminate recursion and general looping, termination is guaranteed; no analysis is necessary.
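
For instance, in a language whose only loop is a counted loop with a bound fixed on entry, termination is structural. In Java syntax (pretend the language offers nothing stronger than this form):

  // The bound n is fixed before the loop starts and the index only counts
  // up toward it, so the loop runs exactly n times -- no analysis needed.
  static int sumOfSquares(int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
      total += i * i;
    }
    return total;
  }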

Guaranteeing that resource usage falls within limits can be done by an effect that establishes an upper bound on resource usage per abstract syntax node.

The vehicle within the compiler that does this work is the type system. In a very real sense, when the compiler issues a type error, it means that it was unable to construct a requested correctness proof. This is not a metaphor but a literal correspondence.

http://en.wikipedia.org/wiki/Curry_Howard
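
A tiny sketch of the correspondence in Java terms (the Function interface below is just a made-up single-method interface for the example):

  // Read a type as a proposition and a value as its proof: a function type
  // is an implication. This method's type, read as a formula, says
  // ((A implies B) and A) implies B -- modus ponens -- and the body is its proof.
  interface Function<A, B> { B apply(A a); }

  static <A, B> B modusPonens(Function<A, B> f, A a) {
    return f.apply(a);
  }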

The type systems of most modern languages are on the lame side. You can't prove much with the Java type system past the absence of type errors. But with JML, you can annotate Java code with nontrivial property requirements. And there are other systems such as Epigram, where the type system is the Calculus of Inductive Constructions. Coq is a theorem prover also using CIC; Epigram is a general purpose programming language. Those are both research projects, but SPARK/Ada is a commercial system that has similar goals:

http://www.praxis-his.com/sparkada/
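
To give a flavor of the JML side of this, here is a small sketch (not from any real codebase): the precondition puts the proof obligation y != 0 on every caller, so inside the method the division provably cannot fault.

  //@ requires y != 0;
  //@ ensures \result == x / y;
  static int divide(int x, int y) {
    return x / y;   // safe: the precondition rules out y == 0
  }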

There are hard limits to these sorts of approaches. We can prove our software conforms to our specs, but we cannot prove that our specs are right for our application. We can prove that our software will work within the confines of our idealized machine, but we can't prove the machine won't lose power, or suffer a chip failure, or whatever. Nonetheless, it seems the worst problems in software are exactly the kinds of things these proof techniques can fix: buffer overruns, security model failures, crashes on novel inputs, etc.

There is also an entire can of worms relating to communication and concurrency. We face a host of failures there: deadlock, race conditions, etc. There exist formal calculi for proving the absence of these failures as well: Pi calculus, flow calculus, CSP, etc.

As an aside, I'll also mention Erlang, which, while not on the same formal basis as the other things I've mentioned, nonetheless has an excellent reliability record on the strength of an unusual error-handling philosophy called "Let It Crash." The basic idea is that any kind of failure, divide by zero, abrupt socket closure, etc., is handled by the death of the process in which the fault occurred. No try/catch here! Instead there is a clean system of supervisor processes that monitor worker processes, detect faults, and respond according to policy. In practice this works exceedingly well; Erlang systems have some of the highest uptimes in the industry.

Apologies for the rambling nature of this post; hope it was interesting anyway.

Marshall
