Test Stories - references split out



System Agreement Failures

In the history of fault tolerance, there have been many examples of
system agreement failures that have caused serious impairments or
complete loss of fielded systems. Many of these failures are due to
Byzantine faults [1]. On this page, several
examples of faults that led to system disagreement are given. Particular
attention should be given to the proximate cause of the disagreement
(typically a poorly designed fault detection, isolation, and recovery
mechanism) rather than the phenomenological (physics based) event that
starts the chain of events leading to the disagreement.

Space Transport System

NASA's space shuttle experienced several agreement failures due to
incorrect handling of Byzantine faults between its
multiplexer/demultiplexer (MDM) units and its general purpose computers
(GPCs). These faults fall within the class that the shuttle developers
called "non-universal I/O error". The MDMs act as remote I/O
concentrators for the GPCs. Data from the MDMs are transferred to the
GPCs over data busses that are similar to MIL-STD-1553. The GPCs execute
redundancy-management algorithms that include fault detection,
isolation, and recovery (FDIR) functions with specific handling for the
"non-universal I/O error" class of failure. However, these FDIR
algorithms were not correctly designed to handle Byzantine faults. With
four GPCs, the shuttle had sufficient redundancy to tolerate a
Byzantine fault, had these FDIR algorithms been designed correctly.
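
The redundancy claim above matches the classical result that agreement
in the presence of f Byzantine faults needs at least 3f + 1 participants
when messages are not signed. A minimal sketch of that arithmetic
follows; the helper name `max_byzantine_faults` is hypothetical and is
not taken from any shuttle documentation.

```python
def max_byzantine_faults(n_nodes: int) -> int:
    """Largest f such that n_nodes >= 3*f + 1 (classical bound, unsigned messages)."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    return (n_nodes - 1) // 3


# Four computers can, in principle, tolerate one Byzantine fault;
# three cannot tolerate any.
print(max_byzantine_faults(4), max_byzantine_faults(3))
```

Meeting the count is necessary but not sufficient: the agreement
algorithm itself must also be designed for Byzantine behavior, which is
the point of the examples on this page.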

In one of the earliest examples (some 25 years ago), this failure was
triggered by a technician putting incorrect terminating resistors on the
end of a data bus. Because of the impedance mismatch between the
characteristic impedance of the data bus and resistance of the
terminating resistors, signals on the data bus were reflected off of the
resistors. These reflections caused a standing wave on the data bus. Two
of the four GPCs happened to be
connected to the data bus at nodes of the standing wave and the other
two GPCs were connected to the data bus
at anti-nodes of the standing wave (see figure 1).
Because of this, two of the GPCs
disagreed with the other two GPCs. It
was lucky that this irreconcilable 2:2 disagreement occurred in the lab.

Figure 1: Standing wave caused by incorrect terminating resistors
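
The size of the reflection from a mismatched terminator can be estimated
with the standard transmission-line reflection coefficient. The sketch
below is illustrative only: the 75 ohm characteristic impedance and the
terminator values are hypothetical, and the envelope calculation assumes
a steady sinusoid rather than real bus waveforms.

```python
def reflection_coefficient(z_load_ohms: float, z0_ohms: float) -> float:
    """Voltage reflection coefficient for a load z_load_ohms on a line of impedance z0_ohms."""
    return (z_load_ohms - z0_ohms) / (z_load_ohms + z0_ohms)


def envelope_extremes(gamma: float) -> tuple:
    """Relative (minimum, maximum) standing-wave amplitude for a steady sinusoid.

    The minimum is seen at a node and the maximum at an anti-node.
    """
    return 1.0 - abs(gamma), 1.0 + abs(gamma)


Z0_OHMS = 75.0  # hypothetical characteristic impedance

for terminator_ohms in (75.0, 225.0):  # matched, then a hypothetical wrong part
    gamma = reflection_coefficient(terminator_ohms, Z0_OHMS)
    print(terminator_ohms, gamma, envelope_extremes(gamma))
```

With a matched terminator the coefficient is zero and every tap on the
bus sees the same amplitude. With a mismatch, taps near nodes and taps
near anti-nodes see different amplitudes, which is how identical
receivers can come to disagree about the same message.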

A more recent example of this problem came closer to causing a disaster.
At 12:12 GMT 13 May 2008, a space shuttle was loading its hypergolic
fuel for mission STS-124 when a 3-1 disagreement occurred among its GPCs
(GPC 4 disagreed with the other GPCs). Three seconds later, the split
became 2-1-1 (GPC 2 now disagreed with GPC 4 and with the other two
GPCs). This required that the launch countdown be stopped. During the
subsequent troubleshooting, the remaining two GPCs disagreed (1-1-1-1
split). See the reports given in [2] and [3]. This was a complete system
disagreement. However, none of the GPCs was faulty. The fault was in the
FA 2 MDM: a crack in a diode. The photomicrographs in figure 2 show two
views of this diode, rotated 90 degrees. The dark wavy line pointed to
by the red arrows is the crack. Current normally flows through the diode
from left to right through the material shown in these pictures, so the
crack was perpendicular to the normal current flow and ran completely
through the current path. As the crack opened up, it changed the diode
into another type of component: a capacitor. This transformation is
illustrated in figure 3.

Figure 2: The diode crack

Figure 3: The transformation of a diode into a capacitor (left: a diode
symbol; center: a diode symbol with a gap through its center; right: a
capacitor symbol)

Figure 4: Normal bus signals

Figure 5: Bad bus signals caused by diode failure

The normal signals that should appear on the data bus between the MDM
and the GPCs are shown in figure 4. The signals that were produced due
to the diode failure are shown in figure 5. Because some of the bits in
the signal were smaller than they should have been, some of the GPC
receivers could not see these bits. The ability to see these bits
depends on the sensitivity of the receiver, which is a function of
manufacturing variances, temperature, and its power supply voltage.
From the symptoms, it is apparent that the receiver in GPC 4 was the
least sensitive and saw the errors before the other three GPCs. This
caused GPC 4 to disagree with the other three. Then, as the crack in the
diode widened, the bits became shorter to the point where GPC 2 could no
longer see these bits, which caused it to disagree with the other GPCs.
At this point, the set of messages received correctly by GPC 4 was
different from the set received correctly by GPC 2, which was different
again from the set received correctly by GPC 1 and GPC 3. This process
continued until GPC 1 and GPC 3 also disagreed with each other, leaving
no two GPCs in agreement.
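
The progression from full agreement to a 1-1-1-1 split can be reproduced
with a toy model in which each receiver has its own detection threshold
and the signal degrades over time. All numbers below (message
amplitudes, thresholds, degradation percentages) are hypothetical and
chosen only to illustrate the mechanism; a single amplitude value stands
in for whatever property of the degraded bits the real receivers were
sensitive to.

```python
from collections import defaultdict

# Hypothetical nominal amplitude of four messages, in millivolts.
NOMINAL_MV = [1000, 900, 800, 700]

# Hypothetical receiver thresholds; GPC 4 is the least sensitive.
THRESHOLD_MV = {"GPC 1": 500, "GPC 2": 600, "GPC 3": 450, "GPC 4": 700}


def received_set(threshold_mv, percent):
    """Indices of messages still visible to a receiver at this signal level."""
    return frozenset(
        i for i, amp in enumerate(NOMINAL_MV) if amp * percent // 100 >= threshold_mv
    )


def partition(percent):
    """Group the GPCs by the exact set of messages each one received."""
    groups = defaultdict(list)
    for name, threshold in THRESHOLD_MV.items():
        groups[received_set(threshold, percent)].append(name)
    return sorted(groups.values(), key=lambda g: (-len(g), g))


for percent in (100, 90, 80, 68):  # signal level as the crack widens
    print(percent, partition(percent))
```

In this model no receiver is faulty. Each one applies its own threshold
consistently, and the disagreement comes entirely from a marginal signal
being interpreted differently by receivers with slightly different
sensitivities.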

TTP/C

A data bus known as TTP/C was developed for the needs of the emerging
automotive "by-wire" industry [4]. Its goals were to provide replica
determinism while living within the cost constraints of the automotive
marketplace. Because of its low cost, it has also found applications
within the aerospace market. TTP/C is a time-division multiple access
(TDMA) based serial communications protocol that provides
synchronization and deterministic message communication over dual
redundant physical media. TTP/C also provides a membership service at
the protocol level. The function of the membership service is to provide
global consensus on message distribution and system state. Addressing
the consensus problem at the protocol level can greatly reduce system
software complexity. However, placing a requirement for protocol-level
consensus leaves the protocol itself vulnerable to a Byzantine failure.
The FIT project confirmed this vulnerability by observing actual
occurrences of such failures [].

As part of the FIT project, a first-generation time-triggered
communication controller (TTP-C1) was irradiated with heavy ions [6].
The errors caused by this experiment were not controlled; they were the
result of random radioactive decay. The reported fault manifestations
were bit-flips in register and RAM locations within the TTP-C1 IC. ICs
with improved design are now available for TTP/C.

During the many thousands of fault injection runs, several system
failures due to Byzantine faults were recorded [6]. The dominant
Byzantine failure mode observed was due to marginal transmission timing.
Corruptions in the time-base registers of the irradiated IC led it to
transmit messages at times that were slightly-off-specification (SOS),
i.e. slightly too early or too late relative to the globally agreed upon
time base. A message transmitted slightly too early was accepted only by
the ICs of the system having slightly fast clocks; ICs with slightly
slower clocks rejected the message. Even though such a timing failure
would have been tolerated by TTP/C's Byzantine-tolerant clock
synchronization algorithm [7], the dependency of this service on TTP/C's
membership service prevented it from succeeding. After a Byzantine
erroneous transmission, the membership consensus logic of TTP/C
prevented ICs that had different perceptions of this transmission's
validity from communicating with each other. Therefore, following such a
faulty transmission, the system was partitioned into two sets or
cliques: one clique containing the ICs that accepted the erroneous
transmission, the other containing the ICs that rejected it.
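
The way a slightly-off-specification transmission divides the receivers
can be sketched with a simple acceptance-window model. This is an
assumption-laden illustration, not TTP/C's actual receive logic: the
window size, the clock offsets, and the function names are all
hypothetical.

```python
WINDOW_US = 10  # hypothetical acceptance window, +/- microseconds

# Hypothetical receiver clock offsets; positive means a slightly fast clock.
CLOCK_OFFSET_US = {"A": 4, "B": 2, "C": -1, "D": -3}


def accepts(offset_us, early_by_us, window_us=WINDOW_US):
    """True if the message falls inside this receiver's acceptance window.

    A message sent early_by_us too early appears (offset_us - early_by_us)
    away from the expected instant on a receiver whose clock runs
    offset_us ahead of the global time base.
    """
    return abs(offset_us - early_by_us) <= window_us


def cliques(early_by_us):
    """Split the receivers into (accepting, rejecting) lists."""
    accepting = [n for n, off in CLOCK_OFFSET_US.items() if accepts(off, early_by_us)]
    rejecting = [n for n in CLOCK_OFFSET_US if n not in accepting]
    return accepting, rejecting


for early_by_us in (0, 12, 30):
    print(early_by_us, cliques(early_by_us))
```

A message that is on time is accepted by everyone, and one that is
grossly early is rejected by everyone; in both cases the receivers
agree. Only the marginal case, where the error is close to the edge of
the window, separates the fast-clock receivers from the slow-clock ones.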

TTP/C incorporates a mechanism to deal with these unexpected faults, as
long as the errors are transient. The clique avoidance algorithm is
executed on every IC prior to its next scheduled message. ICs that find
themselves in a minority clique (i.e. unable to receive messages from
the majority of active ICs) are expected to cease operation before
transmitting. However, if the faulty IC is in the majority clique or is
programmed to re-integrate after a failure, then a permanent SOS fault
can cause repeated failures. This behavior was observed during the FIT
fault injections. In several fault injection tests, the faulty IC did
not cease transmission and the SOS fault persisted. The persistence of
this fault prevented the clique avoidance mechanism from successfully
recovering. In several instances, the faulty IC continued to divide the
membership of the remaining cliques, which resulted in eventual system
failure. In later analysis of the faulty behavior, these effects were
repeated with software-simulated fault injection. The original faults
were traced to upsets in either the C1 controller time-base registers or
the microcode instruction RAM [8]. Subsequent generations of the TTP/C
controller have incorporated parity and other mechanisms to reduce the
influence of random upsets (e.g. ROM-based microcode execution). SOS
faults in TTP/C can be mitigated with a central guardian. This guardian
is assumed to exhibit fail-detectable behavior and not to violate
end-to-end CRC arguments, which has to be shown by exhaustive testing.
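
The effect of a persistent SOS fault on clique avoidance can be sketched
with a simplified majority rule. This stands in for the behavior
described above and is not the protocol's actual counter logic; the
function names and the sequence of splits are hypothetical, and the
faulty sender itself is not modeled as a member.

```python
def apply_clique_avoidance(cliques):
    """Return the nodes that keep operating after one split.

    Simplified rule: only members of a clique holding a strict majority of
    the currently active nodes continue; everyone else ceases operation.
    """
    total = sum(len(clique) for clique in cliques)
    for clique in cliques:
        if 2 * len(clique) > total:
            return sorted(clique)
    return []  # no clique has a majority, so every node ceases


def run_rounds(members, rejecting_per_round):
    """Apply one split per round; each round names the nodes that reject the faulty message."""
    for rejecting in rejecting_per_round:
        accepted = [m for m in members if m not in rejecting]
        rejected = [m for m in members if m in rejecting]
        members = apply_clique_avoidance([accepted, rejected])
    return members


# A transient fault costs one node; a persistent one erodes the membership.
print(run_rounds(["A", "B", "C", "D", "E"], [{"E"}]))
print(run_rounds(["A", "B", "C", "D", "E"], [{"E"}, {"D"}, {"C"}, {"B"}]))
```

Under this rule a single split is survivable because the majority
carries on, but a fault source that keeps producing marginal
transmissions removes a minority every round until no majority is left.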

Mid-Value Select

Mid-value select is a well-known method for masking the propagation of
failures. It has properties that are similar to an M-out-of-N
voter [9], but does not require any of its inputs to be bit-for-bit
identical. Many mid-value select implementations in fielded systems are
merged with other fault tolerance mechanisms, such as reasonableness
checks or other fault detection mechanisms, which are then used to block
inputs to the mid-value selection that they have found to be bad. This
blocking of inputs makes these types of mid-value selection mechanisms a
form of hybrid nMR. Hybrid nMR systems can change the "M" and/or "N" in
the M-out-of-N calculations by using reconfiguration. The objective is
to be able to tolerate more faults than can be tolerated by using nMR
alone (details described in example below). To make things even more
complex, these additional fault detection mechanisms sometimes compare a
previous output of the mid-value select with its inputs to determine
whether they are faulty, and/or use a previous output of the mid-value
select as the replacement value for a faulty input.

One of the most common ways of blocking inputs to a mid-value selector
is to override the value of a faulty input with the value that is the
midpoint of the reasonable range of input values. This is illustrated on
the left side of figure 6, which was taken from a paper
by Stephen Osder [10] and modified
slightly. The "voter" in his figure is actually a mid-value selector,
which due to its similarity with a bit-for-bit M-out-of-N voter can be
called a voter. While Osder uses this example to show how this widely
used mid-value selection design can fail to meet its design purpose for
a non-replicated mid-value selector, we can use the same design to show
how disagreement can arise in replicated mid-value selectors.

Figure 6: Mid-value select failure example

In this example, zero input volts (ground) is assumed to be the middle
of the valid range. The switches in this figure are driven by the fault
detection logic and force faulty inputs to ground. The fault detection
mechanism used in this example is one that is commonly used. It compares
the previous output of the mid-value selection with each of the inputs.
If an input value is more than a certain epsilon away from this last
value, it gets switched to zero. The rationale for this design is that
the mid-value selector could only tolerate one failure (worst case) if
the switches were not included. With the switches (and assuming good
enough failure detection circuitry that controls the switches), this
design hopes to tolerate an additional failure using the following
argument: when the first failure is detected, it is clamped to zero. If
a second failure occurs that is further away from zero than the good
value, the good value is selected by the mid-value selector. If the bad
value is between the good value and ground, the worst value that the
mid-value selection can output is ground, which is the midpoint of the
reasonable input range. Sometimes this is sufficient. When it is not, a
common variation of this idea is to use the previous output of the
mid-value selection instead of the midpoint of the reasonable input
range (ground in this example).
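
The clamping argument above can be checked with a small sketch of a
single (non-replicated) selector. The function names, the millivolt
units, and the sample values are hypothetical; the fault detection is
reduced to the epsilon comparison described in the text.

```python
def mid_value(a, b, c):
    """Return the middle of three values."""
    return sorted((a, b, c))[1]


def select_with_clamp(inputs, prev_output, epsilon, clamp=0):
    """Clamp any input more than epsilon from prev_output, then mid-value select."""
    gated = [x if abs(x - prev_output) <= epsilon else clamp for x in inputs]
    return mid_value(*gated)


GOOD_MV = 1000    # hypothetical good input, also the previous output
EPSILON_MV = 500  # hypothetical detection threshold

# First failure (9000) is detected and clamped to ground. The second,
# undetected failure is further from zero than the good value.
print(select_with_clamp((9000, GOOD_MV, 1200), GOOD_MV, EPSILON_MV))

# Same first failure; the second failure lies between ground and the good value.
print(select_with_clamp((9000, GOOD_MV, 600), GOOD_MV, EPSILON_MV))
```

In the first call the good value is selected. In the second, the output
is the undetected bad value, which by construction lies between ground
and the good value, so ground is the worst case the selector can reach.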

Mid-value select is used most advantageously where bit-for-bit
M-out-of-N voters cannot be used due to the system's inability to ensure
that the inputs are bit-for-bit identical. Most often this is due to the
inputs being asynchronous with respect to each other. However, it is
this very asynchrony coupled with these hybrid
nMR additions to mid-value select that
can still lead to system disagreement.

Going back to the Osder example, but using replicated mid-value
selectors, the following scenario, as depicted in the simple plot on the
right side of figure 6, is possible. Two of the three signals are on
either side of the signal that is currently being selected as the
mid-value and are nearly epsilon away from this mid-value. Given that
this middle input value is sampled asynchronously and is varying to some
degree, one of the mid-value selectors could sample it when it was
closer to the more positive of the other two inputs (i.e., X2 is sampled
near point A) and another mid-value selector could sample it when it was
closer to the more negative of the other two inputs (i.e., X2 is sampled
near point B). The former mid-value selector will then block the more
negative of the other two inputs and the latter mid-value selector will
block the more positive of the other two inputs. We now have a
disagreement between these two mid-value selectors. Even more
perversely, a third mid-value selector could sample the middle input
when it is exactly between the two other inputs and not see either one
of them as being more than epsilon away. Thus, we get a three-way split:
one blocking the positive input, one blocking the negative input, and
one not blocking any inputs. The conditions for sustaining a three-way
split are highly unlikely. However, a two-way split would be persistent.
In this persistent state, the replicated mid-value selectors may
initially select the same input (e.g. X2 from the right side of
figure 6) but eventually select different inputs. For example, if X2
continues on its downward slope and eventually becomes lower than X3,
any replicated mid-value selector that has blocked X1 will select X3 as
the mid-value, while any replicated mid-value selector that has blocked
X3 will select (the grounded) X2 as the mid-value. Thus, this condition
creates a system disagreement.
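
The two-way split can be reproduced with a small simulation of two
replicated selectors. This is a sketch under stated assumptions: the
class name `BlockingSelector`, the latching of blocked inputs, the
initial previous output of zero, and all sample values are hypothetical
choices made for illustration and are not taken from Osder's paper.

```python
class BlockingSelector:
    """Mid-value selector that grounds inputs far from its own previous output."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.prev = 0          # assumed initial previous output
        self.blocked = set()   # indices of latched (grounded) inputs

    def step(self, inputs):
        # Latch any input that is more than epsilon from the last output.
        for i, x in enumerate(inputs):
            if abs(x - self.prev) > self.epsilon:
                self.blocked.add(i)
        gated = [0 if i in self.blocked else x for i, x in enumerate(inputs)]
        self.prev = sorted(gated)[1]  # middle of the three gated values
        return self.prev


X1, X3 = 9, -9  # steady outer inputs, each nearly epsilon from the middle
EPSILON = 10

# X2 as seen by each replica: sampled near point A by one and near point B
# by the other, then identical samples as X2 slopes down past X3.
x2_seen_by_a = [2, 2, -5, -8, -11]
x2_seen_by_b = [-2, -2, -5, -8, -11]

sel_a, sel_b = BlockingSelector(EPSILON), BlockingSelector(EPSILON)
outputs_a = [sel_a.step((X1, x2, X3)) for x2 in x2_seen_by_a]
outputs_b = [sel_b.step((X1, x2, X3)) for x2 in x2_seen_by_b]

print(outputs_a, sorted(sel_a.blocked))
print(outputs_b, sorted(sel_b.blocked))
```

Both replicas see the same X1 and X3 throughout and differ only in two
early samples of X2, yet one latches X3 as faulty and the other latches
X1. Once X2 falls below X3, the replica that blocked X1 follows X3 while
the other settles at ground, and the two outputs no longer agree.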

Edited Nov 14, 2012 by kevin-driscoll