12 MTBF (Mean Time Between Flareups, er, Failures)

This article is from the storage FAQ part2, by Rodney D. Van Meter with numerous contributions by others.

There is a short FAQ-like document available from IBM at
http://www.storage.ibm.com/storage/oem/tech/mtbf.htm. It has no math
for the statistically inclined, but it explains in clear prose what
IBM, at least, means when they say MTBF.

I will also note that, for a complex but reparable system such as an
autochanger, each subsystem may have a separate MTBF and a different
lifetime, which may be combined to give one figure for the unit as a
whole.
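
As a rough illustration of how subsystem figures might be combined,
here is a minimal Python sketch. It assumes a series system (any
subsystem failure counts as a unit failure), independent subsystems,
and constant failure rates, so the failure rates simply add. The
subsystem names and hour figures are made up for the example; they
are not from any vendor.

```python
def combined_mtbf(subsystem_mtbfs):
    """MTBF of a series system: add the failure rates, then invert.

    Assumes independent subsystems, each in its constant failure
    rate phase.
    """
    total_rate = sum(1.0 / mtbf for mtbf in subsystem_mtbfs)
    return 1.0 / total_rate


# Hypothetical autochanger, MTBF of each subsystem in hours.
autochanger = {
    "robot": 100_000,
    "drive_a": 200_000,
    "drive_b": 200_000,
    "power_supply": 400_000,
}

unit_mtbf = combined_mtbf(autochanger.values())
print(f"unit MTBF: about {unit_mtbf:,.0f} hours")
```

Note that the combined figure is always lower than the lowest
subsystem MTBF, and that it says nothing about the separate lifetimes
of the subsystems.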

Here is a reasonably understandable, but somewhat long, description of
MTBF that Kevin Daly (president of Odetics, kdaly@odetics.com) wrote
in 10/95 for this FAQ; my thanks to him. After some waffling, I've
included the whole thing, despite its length.

===============================================================

M T B F

In order to understand MTBF (Mean Time Between Failures) it is best to
start with something else -- something for which it is easier to
develop an intuitive feel.  This other concept is failure rate which
is, not surprisingly, the average (mean) rate at which things fail.  A
"thing" could be a component, an assembly, or a whole system.  Some
things -- rocks, for example -- are accepted to have very low failure
rates while others -- British sports cars, for example -- are (or
should be) expected to have relatively high failure rates.

It is generally accepted among reliability specialists (and you,
therefore, must not question it) that a thing's failure rate isn't
constant, but generally goes through three phases over a thing's
lifetime.  In the first phase the failure rate is relatively high, but
decreases over time -- this is called the "infant mortality" phase
(sensitive guys these reliability specialists).  In the second phase
the failure rate is low and essentially constant -- this is
(imaginatively) called the "constant failure rate" phase.  In the
third phase the failure rate begins increasing again, often quite
rapidly -- this is called the "wearout" phase.  The reliability
specialists noticed that when plotted as a function of time the
failure rate resembled a familiar bathroom appliance -- but they
called it a "bathtub" curve anyway.  The units of failure rate are
failures per unit of "thing-time"; e.g. failures per machine-hour or
failures per system-year.
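
For readers who want to see the three phases as numbers, here is a
small Python sketch of a bathtub-shaped failure rate. It is only one
possible model: it adds a decreasing Weibull hazard (infant
mortality), a constant term, and an increasing Weibull hazard
(wearout). Every parameter value is invented for illustration.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate at time t > 0 (failures per thing-hour)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_rate(t):
    """Illustrative bathtub curve; all parameters are hypothetical."""
    infant = weibull_hazard(t, shape=0.5, scale=200_000)  # decreasing
    constant = 1e-5                                       # flat bottom
    wearout = weibull_hazard(t, shape=5.0, scale=50_000)  # increasing
    return infant + constant + wearout


for hours in (100, 20_000, 60_000):
    print(f"{hours:>6} h: {bathtub_rate(hours):.2e} failures per thing-hour")
```

The rate is high early on, lowest in the middle, and climbs again at
the end, which is all the "bathtub" name is meant to convey.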

What, you may ask, does all this have to do with MTBF?  MTBF is the
inverse of the failure rate in the constant failure rate phase.
Nothing more and nothing less.  The units of MTBF are (or, should be)
units of "thing-time" per failure; e.g. machine-hours per failure or
system-years per failure but the "thing" part and the "per failure"
part are almost always omitted to enhance the mystique and confusion
and to make MTBF appear to have the units of "time" which it doesn't.
We will bow to the convention of speaking of MTBF in hours or years --
but we all know what we really mean.

What does MTBF have to do with lifetime?  Nothing at all!  It is not
at all unusual for things to have MTBF's which significantly exceed
their lifetime as defined by wearout -- in fact, you know many such
things.  A "thirty-something" American (well within his constant
failure rate phase) has a failure (death) rate of about 1.1 deaths per
1000 person-years and, therefore, has an MTBF of 900 years (of course
it's really 900 person-years per death).  Even the best ones, however,
wear out long before that.
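
The arithmetic behind that figure can be checked with a few lines of
Python. This is just the inversion described above; the function name
is made up for the example.

```python
def mtbf_from_failure_rate(failures, thing_time):
    """MTBF is thing-time per failure, the inverse of the failure rate."""
    return thing_time / failures


# 1.1 deaths per 1000 person-years, as quoted in the text.
mtbf_years = mtbf_from_failure_rate(failures=1.1, thing_time=1000.0)
print(f"about {mtbf_years:.0f} person-years per death")
```

The result is a little over 900 person-years per death, which the
text rounds to 900.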

This example points out one other important characteristic of MTBF --
it is an ensemble characteristic which applies to populations (i.e.
"lots") of things; not a sample characteristic which applies to one
specific thing.  In the good old days when failure rates were
relatively high (and, therefore, MTBF relatively low) this
characteristic of MTBF was a curiosity which created lively (?) debate
at conventions of reliability specialists (them) but otherwise didn't
unduly bother right-thinking people (us).  Things, however, have
changed.  For many systems of interest today the required failure
rates are so low that the MTBF substantially exceeds the lifetime
(obviously nature had this right a long time ago).  In these cases
MTBF's are not only "not necessarily" sample characteristics, but are
"necessarily not" sample characteristics.  In the terms of the
reliability cognoscenti, failure processes are not ergodic (i.e. you
can't blithely trade population statistics for time statistics).  The
key implication of this essential characteristic of MTBF is that it
can only be determined from populations and it should only be applied
to populations.

MTBF is, therefore, an excellent characteristic for determining how
many spare hard drives are needed to support 1000 PC's, but a poor
characteristic for guiding you on when you should change your hard
drive to avoid a crash.
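
To make the spares calculation concrete, here is a hedged Python
sketch. It assumes a constant failure rate, drives powered on around
the clock (8760 hours per year), a hypothetical drive MTBF of 500,000
hours, and failures arriving as a Poisson process. The 95% coverage
target is also just an example value.

```python
import math

HOURS_PER_YEAR = 8760  # assumes drives are powered on all year


def expected_failures(units, mtbf_hours, hours_per_unit):
    """Expected failures in a population: total thing-time / MTBF."""
    return units * hours_per_unit / mtbf_hours


def spares_needed(units, mtbf_hours, hours_per_unit, coverage=0.95):
    """Smallest spare count that covers the period's failures with
    probability >= coverage, assuming Poisson-distributed failures."""
    mean = expected_failures(units, mtbf_hours, hours_per_unit)
    term = math.exp(-mean)      # P(exactly 0 failures)
    cumulative = term
    spares = 0
    while cumulative < coverage:
        spares += 1
        term *= mean / spares   # P(exactly `spares` failures)
        cumulative += term
    return spares


# Hypothetical figures: 1000 PCs, one drive each, 500,000-hour MTBF.
mean = expected_failures(1000, 500_000, HOURS_PER_YEAR)
print(f"expected drive failures per year: {mean:.2f}")
print("spares for 95% coverage:", spares_needed(1000, 500_000, HOURS_PER_YEAR))
```

The same numbers say nothing useful about when any one particular
drive will fail, which is the point made above.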

MTBF's are best determined from large populations.  How large?  From
every point of view (theoretical, practical, statistical) but cost,
the answer is "the larger, the better".  There are, however, well
established techniques for planning and conducting test programs to
develop specified levels of confidence in a thing's MTBF.
Establishing an MTBF at the 80% confidence level, for example, is
clearly better, but much more difficult and expensive, than doing it
at a 60% confidence level.  As an example, a test designed to
demonstrate a thing's MTBF at the 80% confidence level requires a
total thing-time of 160% of the MTBF if it can be conducted with no
failures.  You don't want to know how much thing-time is required to
achieve reasonable confidence levels if any failures occur during the
test.
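
For the curious, the 160% figure can be reproduced with a short
Python sketch. It assumes a constant failure rate and a test that
runs for a fixed total thing-time, so the number of failures follows
a Poisson distribution; the required thing-time is then the smallest
multiple of the MTBF at which seeing so few failures would happen
with probability at most (1 - confidence). The function names are
invented for the example.

```python
import math


def poisson_cdf(max_failures, mean):
    """P(number of failures <= max_failures) for a Poisson mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, max_failures + 1):
        term *= mean / k
        total += term
    return total


def required_test_multiple(confidence, failures=0):
    """Total thing-time, as a multiple of the MTBF to be demonstrated,
    for a fixed-duration test in which `failures` failures occur."""
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while poisson_cdf(failures, hi) > target:   # bracket the answer
        hi *= 2.0
    for _ in range(100):                        # bisection
        mid = (lo + hi) / 2.0
        if poisson_cdf(failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi


for conf in (0.60, 0.80):
    for failures in (0, 1, 2):
        multiple = required_test_multiple(conf, failures)
        print(f"{conf:.0%} confidence, {failures} failure(s): "
              f"{multiple:.2f} x MTBF")
```

With no failures the multiple reduces to -ln(1 - confidence), which
is about 1.6 at the 80% level and about 0.9 at the 60% level. Each
failure during the test pushes the required thing-time up
considerably.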

What, by the way, is "thing-time"?  An important subtlety is that
"thing-time" isn't "clock time" (unless, of course, your thing is a
clock).  The question of how to compute "thing-time" is a critical one
in reliability engineering.  For some things (e.g. living things) time
always counts but for others the passage of "thing-time" may be highly
dependent upon the state of the thing.  Various ad hoc time
corrections (such as "power on hours" (POH)) have been used, primarily
in the electronics area.  There is significant evidence that, in the
mechanical area "thing-time" is much more related to activity rate
than it is to clock time.  Measures such as "Mean Cycles Between
Failures (MCBF)" are becoming accepted as more accurate ways to assess
the "duty cycle effect".  Well-founded, if heuristic, techniques have
been developed for combining MCBF and MTBF effects for systems in
which the average activity rate is known.
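
One simple way to combine the two effects, offered here only as an
illustrative heuristic and not necessarily the technique referred to
above, is to convert the cycle-based figure into a failure rate per
hour using the average activity rate and add it to the time-based
failure rate. All figures below are hypothetical.

```python
def effective_mtbf(mtbf_hours, mcbf_cycles, cycles_per_hour):
    """Combine time-driven and cycle-driven failures into one MTBF.

    Assumes the two failure mechanisms are independent and that the
    average activity rate (cycles per hour) is known and steady.
    """
    time_rate = 1.0 / mtbf_hours                # failures per hour
    cycle_rate = cycles_per_hour / mcbf_cycles  # failures per hour
    return 1.0 / (time_rate + cycle_rate)


# Hypothetical robot: 100,000-hour MTBF, 1,000,000-cycle MCBF.
for rate in (1, 20, 100):
    hours = effective_mtbf(100_000, 1_000_000, rate)
    print(f"{rate:>3} cycles/hour: about {hours:,.0f} hours")
```

The busier the mechanism, the more the cycle-driven term dominates
and the lower the effective figure becomes.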

MTBF need not, then, be "Mysterious Time Between Failures" or
"Misleading Time Between Failures", but an important system
characteristic which can help to quantify the suitability of a system
for a potential application.  While rising demands on system integrity
may make this characteristic seem "unnatural", remember you live in a
country of 250 million 9-million-hour MTBF people!

===================================================================
Kevin C. Daly
President
ATL Products
kdaly@odetics.com
(714) 774-6900