Test Stories - one file, except figures






1  System Agreement Failures

In the history of fault tolerance, there have been many examples of system agreement failures that have caused serious impairments or complete loss of fielded systems. Many of these failures are due to Byzantine faults [1]. This page gives several examples of faults that led to system disagreement. Particular attention should be given to the proximate cause of the disagreement (typically a poorly designed fault detection, isolation, and recovery mechanism) rather than the phenomenological (physics-based) event that starts the chain of events leading to the disagreement.

1.1  Space Transportation System

NASA's space shuttle has experienced several agreement failures due to incorrect handling of Byzantine faults between its MDM units and its GPCs. These faults fall within the class that the shuttle developers called "non-universal I/O error". The MDMs act as remote I/O concentrators for the GPCs. Data from the MDMs are transferred to the GPCs over data busses that are similar to MIL-STD-1553. The GPCs execute redundancy-management algorithms that include FDIR functions with specific handling for the "non-universal I/O error" class of failure. However, these FDIR algorithms were not correctly designed to handle Byzantine faults. With four GPCs, the shuttle had sufficient redundancy to tolerate a Byzantine fault, had these FDIR algorithms been designed correctly.

In one of the earliest examples (some 25 years ago), this failure was triggered by a technician putting incorrect terminating resistors on the end of a data bus. Because of the mismatch between the characteristic impedance of the data bus and the resistance of the terminating resistors, signals on the data bus were reflected off the resistors. These reflections caused a standing wave on the data bus. Two of the four GPCs happened to be connected to the data bus at nodes of the standing wave and the other two GPCs were connected at anti-nodes (see figure 1). Because of this, two of the GPCs disagreed with the other two GPCs. It was lucky that this irreconcilable 2:2 disagreement occurred in the lab.

Figure 1: Standing wave caused by incorrect terminating resistors
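
The size of the reflection described above follows from the standard transmission-line reflection coefficient, (Z_L - Z_0) / (Z_L + Z_0). The sketch below only illustrates that formula; the impedance values are hypothetical and are not taken from the shuttle data bus.

```python
def reflection_coefficient(z0_ohms, termination_ohms):
    """Fraction of the incident voltage wave reflected at the termination."""
    return (termination_ohms - z0_ohms) / (termination_ohms + z0_ohms)


Z0 = 75.0  # hypothetical characteristic impedance of the bus, in ohms

# A terminator that matches the bus absorbs the signal completely.
matched = reflection_coefficient(Z0, 75.0)

# A hypothetical wrong terminator (twice the bus impedance) reflects
# part of every signal back onto the bus.
mismatched = reflection_coefficient(Z0, 150.0)

print(f"matched: {matched:.3f}, mismatched: {mismatched:.3f}")
```

Any nonzero coefficient means that part of each signal travels back along the bus and superimposes on the incident signal, which is what produces the standing wave.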

A more recent example of this problem came closer to causing a disaster. At 12:12 GMT 13 May 2008, a space shuttle was loading its hypergolic fuel for mission STS-124 when a 3-1 disagreement occurred among its GPCs (GPC 4 disagreed with the other GPCs). Three seconds later, the split became 2-1-1 (GPC 2 now disagreed with GPC 4 and the other two GPCs). This required that the launch countdown be stopped. During the subsequent troubleshooting, the remaining two GPCs disagreed (1-1-1-1 split). See the reports given in [2] and [3]. This was a complete system disagreement. However, none of the GPCs were faulty. The fault was in the FA 2 MDM. This fault was a crack in a diode. The photomicrographs in figure 2 show two views of this diode, rotated 90 degrees. The dark wavy line pointed to by the red arrows is the crack. The current flow through the diode is normally left to right through the material shown in these pictures. This means that the crack was perpendicular to the normal current flow and completely through the current path. As the crack opened up, it changed the diode into another type of component: a capacitor. This transformation is illustrated in figure 3.

Figure 2: The diode crack

Figure with three elements: the left element shows a diode symbol, the
center element shows a diode symbol with a gap through the center of it, the
right element shows a capacitor.
Figure 3: The transformation of a diode into a capacitor

Figure 4: Normal bus signals

Figure 5: Bad bus signals caused by diode failure

The normal signals that should appear on the data bus between the MDM and the GPC are shown in figure 4. The signals that were produced due to the diode failure are shown in figure 5. Because some of the bits in the signal were smaller than they should have been, some of the GPC receivers could not see these bits. The ability to see these bits depends on the sensitivity of the receiver, which is a function of manufacturing variances, temperature, and its power supply voltage. From the symptoms, it is apparent that the receiver in GPC 4 was the least sensitive and saw the errors before the other three GPCs. This caused GPC 4 to disagree with the other three. Then, as the crack in the diode widened, the bits became shorter to the point where GPC 2 could no longer see these bits, which caused it to disagree with the other GPCs. At this point, the set of messages that was received correctly by GPC 4 was different from the set of messages that was correctly received by GPC 2, which was different again from the set of messages that was correctly received by GPC 1 and GPC 3. This process continued until GPC 1 and GPC 3 also disagreed with all the other GPCs.
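
The progression from a 3-1 split to a 1-1-1-1 split can be reproduced with a toy model. The following is a minimal sketch, assuming each receiver simply accepts a message when the marginal bit amplitude is at or above its own threshold; the threshold and amplitude numbers are invented for illustration and are not measurements from the STS-124 event.

```python
from collections import defaultdict


def agreement_splits(amplitudes, thresholds):
    """Return the GPC group sizes after each message.

    Two GPCs are counted as agreeing only if they have correctly
    received exactly the same set of messages so far.
    """
    histories = {gpc: () for gpc in thresholds}
    splits = []
    for amplitude in amplitudes:
        for gpc, threshold in thresholds.items():
            histories[gpc] += (amplitude >= threshold,)
        groups = defaultdict(list)
        for gpc, history in histories.items():
            groups[history].append(gpc)
        sizes = sorted((len(members) for members in groups.values()), reverse=True)
        splits.append(sizes)
    return splits


# Hypothetical receiver thresholds: GPC 4 is the least sensitive.
thresholds = {"GPC 1": 0.4, "GPC 2": 0.6, "GPC 3": 0.3, "GPC 4": 0.8}

# Hypothetical amplitude of the marginal bits as the crack widens.
amplitudes = [1.0, 0.7, 0.5, 0.35]

splits = agreement_splits(amplitudes, thresholds)
print(splits)
```

In this model no receiver is faulty; the disagreement comes entirely from ordinary variation in receiver sensitivity meeting a degrading signal.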

1.2  TTP/C

A databus known as TTP/C was developed for the needs of the emerging automotive "by-wire" industry [4]. Its goals were to provide replica determinism while living within the cost constraints of the automotive marketplace. Because of its low cost, it has also found applications within the aerospace market. TTP/C is a TDMA-based serial communications protocol that provides synchronization and deterministic message communication over dual redundant physical media. TTP/C also provides a membership service at the protocol level. The function of the membership service is to provide global consensus on message distribution and system state. Addressing the consensus problem at the protocol level can greatly reduce system software complexity. However, placing a requirement for protocol-level consensus leaves the protocol itself vulnerable to a Byzantine failure. The FIT project confirmed this vulnerability by observing actual occurrences of such failures [5].

As part of the FIT project, a first-generation time-triggered communication controller (TTP-C1) was irradiated with heavy ions [6]. The errors caused by this experiment were not controlled; they were the result of random radioactive decay. The reported fault manifestations were bit-flips in register and RAM locations within the TTP-C1 IC. ICs with an improved design are now available for TTP/C.

During the many thousands of fault injection runs, several system failures due to Byzantine faults were recorded [6]. The dominant Byzantine failure mode observed was due to marginal transmission timing. Corruptions in the time-base registers of the irradiated IC led it to transmit messages at times that were slightly-off-specification (SOS), i.e. slightly too early or too late relative to the globally agreed-upon time base. A message transmitted slightly too early was accepted only by the ICs of the system having slightly fast clocks; ICs with slightly slower clocks rejected the message. Even though such a timing failure would have been tolerated by TTP/C's Byzantine-tolerant clock synchronization algorithm [7], the dependency of this service on TTP/C's membership service prevented it from succeeding. After a Byzantine erroneous transmission, the membership consensus logic of TTP/C prevented ICs that had different perceptions of this transmission's validity from communicating with each other. Therefore, following such a faulty transmission, the system is partitioned into two sets or cliques: one clique containing the ICs that accepted the erroneous transmission, the other clique containing the ICs that rejected the transmission.
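
The way a single SOS transmission splits the receivers can be sketched as follows. This is a simplified model, not the TTP/C acceptance logic: each receiver is assumed to accept a frame when its arrival time falls within a fixed window around the receiver's own locally expected time, and all numbers (in microseconds) are hypothetical.

```python
def accepts(arrival_us, local_expected_us, window_us):
    """A receiver accepts a frame that arrives inside its local window."""
    return abs(arrival_us - local_expected_us) <= window_us


def partition(arrival_us, receivers, window_us):
    """Split receivers into (accepting, rejecting) sets for one frame."""
    accepted = {
        name
        for name, expected in receivers.items()
        if accepts(arrival_us, expected, window_us)
    }
    return accepted, set(receivers) - accepted


# Locally expected arrival time relative to the global time base.
# Negative values model slightly fast clocks, positive values slow ones.
receivers = {"A": -3, "B": -4, "C": 2, "D": 3}
WINDOW_US = 10

on_time = partition(0, receivers, WINDOW_US)     # correct transmission
too_early = partition(-12, receivers, WINDOW_US)  # SOS transmission

print(on_time)
print(too_early)
```

With these values the on-time frame is accepted by every receiver, while the early frame is accepted only by the two receivers with fast clocks, giving the two cliques described above.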

TTP/C incorporates a mechanism to deal with these unexpected faults, as long as the errors are transient. The clique avoidance algorithm is executed on every IC prior to its next scheduled message. ICs that find themselves in a minority clique (i.e. unable to receive messages from the majority of active ICs) are expected to cease operation before transmitting. However, if the faulty IC is in the majority clique or is programmed to re-integrate after a failure, then a permanent SOS fault can cause repeated failures. This behavior was observed during the FIT fault injections. In several fault injection tests, the faulty IC did not cease transmission and the SOS fault persisted. The persistence of this fault prevented the clique avoidance mechanism from successfully recovering. In several instances, the faulty IC continued to divide the membership of the remaining cliques, which resulted in eventual system failure. In later analysis of the faulty behavior, these effects were repeated with software-simulated fault injection. The original faults were traced to upsets in either the C1 controller time-base registers or the microcode instruction RAM [8]. Subsequent generations of the TTP/C controller have incorporated parity and other mechanisms to reduce the influence of random upsets (e.g. ROM-based microcode execution). SOS faults in TTP/C can be mitigated with a central guardian. This guardian assumes fail-detectable behavior and does not violate end-to-end CRC arguments, which must be shown by exhaustive testing.
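
The repeated division caused by a persistent SOS fault can be illustrated with a small model. The sketch below is a deliberate simplification and not the TTP/C specification: it assumes every node counts the nodes it still agrees with (including itself) against the rest, and stops unless it is in a strict majority. The function name, node names, and the tie rule are assumptions made for this example.

```python
def clique_avoidance_round(active, accepters):
    """Return the nodes still active after one SOS frame.

    active:    nodes operating before the frame (includes the faulty sender)
    accepters: nodes that judged the SOS frame valid; the faulty sender
               is assumed to be in this set
    """
    accept_clique = accepters & active
    reject_clique = active - accept_clique
    survivors = set()
    for clique in (accept_clique, reject_clique):
        agreed = len(clique)
        failed = len(active) - agreed
        # A node keeps running only if it agrees with a strict majority.
        if agreed > failed:
            survivors |= clique
    return survivors


# "F" is the faulty node with a permanent SOS fault; A..D are healthy.
active = {"F", "A", "B", "C", "D"}
history = []
# Hypothetical outcome of each SOS frame: which nodes accepted it.
for accepters in ({"F", "A", "B"}, {"F", "A"}, {"F"}):
    active = clique_avoidance_round(active, accepters)
    history.append(set(active))

print(history)
```

In the first round the mechanism works as intended for a transient fault: the minority clique stops. Because the faulty node stays in the surviving majority and keeps sending SOS frames, each later round removes more healthy nodes until none remain.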

1.3  Mid-Value Select

Mid-value select is a well-known method for masking the propagation of failures. It has properties that are similar to those of an M-out-of-N voter [9], but does not require any of its inputs to be bit-for-bit identical. In fielded systems, many mid-value select implementations are merged with other fault tolerance mechanisms, such as reasonableness checks or other fault detection mechanisms, which are used to block inputs to the mid-value selection that are known to be bad. This blocking of inputs makes these types of mid-value selection mechanisms a form of hybrid nMR. Hybrid nMR systems can change the "M" and/or "N" in the M-out-of-N calculations by using reconfiguration. The objective is to be able to tolerate more faults than can be tolerated by using nMR alone (details are described in the example below). To make things even more complex, these additional fault detection mechanisms sometimes compare a previous output of the mid-value select with its inputs to determine if they are faulty, and/or use a previous output of the mid-value select as the replacement value for a faulty input.

One of the most common ways of blocking inputs to a mid-value selector is to override the value of a faulty input with the value that is the midpoint of the reasonable range of input values. This is illustrated on the left side of figure 6, which was taken from a paper by Stephen Osder [10] and modified slightly. The "voter" in his figure is actually a mid-value selector, which due to its similarity with a bit-for-bit M-out-of-N voter can be called a voter. While Osder uses this example to show how this widely used mid-value selection design can fail to meet its design purpose for a non-replicated mid-value selector, we can use the same design to show how disagreement can arise in replicated mid-value selectors.

Figure 6: Mid-value select failure example

In this example, zero input volts (ground) is assumed to be the middle of the valid range. The switches in this figure are driven by the fault detection logic and force faulty inputs to ground. The fault detection mechanism used in this example is a common one. It compares the previous output of the mid-value selection with each of the inputs. If an input value is more than a certain epsilon away from this last value, it gets switched to zero. The rationale for this design is that the mid-value selector could only tolerate one failure (worst case) if the switches were not included. With the switches (and assuming good enough failure detection circuitry that controls the switches), this design hopes to tolerate an additional failure using the following argument: when the first failure is detected, it is clamped to zero. If a second failure occurs that is further away from zero than the good value, the good value is selected by the mid-value selector. If the bad value is between the good value and ground, the worst value that the mid-value selection can output is ground, which is the midpoint of the reasonable input range. Sometimes this is sufficient. When it is not, a common variation of this idea is to use the previous output of the mid-value selection instead of the midpoint of the reasonable input range (ground in this example).
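
The argument above can be checked with a small sketch of a three-input mid-value selector with clamp-to-ground switches. It assumes the detection rule described in the text (an input more than epsilon from the previous output is switched to zero); the function names and all numeric values are hypothetical.

```python
GROUND = 0.0


def mid_value(values):
    """Middle value of three inputs."""
    return sorted(values)[1]


def clamp_faulty(inputs, previous_output, epsilon):
    """Switch any input more than epsilon from the previous output to ground."""
    return [
        x if abs(x - previous_output) <= epsilon else GROUND
        for x in inputs
    ]


def select(inputs, previous_output, epsilon):
    return mid_value(clamp_faulty(inputs, previous_output, epsilon))


PREV, EPS = 0.5, 0.1

# No failures: the selector outputs the middle of three good values.
healthy = select([0.50, 0.52, 0.48], PREV, EPS)

# First failure (5.0) is detected and clamped; a good value is selected.
one_failure = select([5.0, 0.52, 0.48], PREV, EPS)

# Second, undetected failure further from zero than the good value.
second_far = select([5.0, 0.58, 0.48], PREV, EPS)

# Second, undetected failure between the good value and ground.
second_near = select([5.0, 0.42, 0.48], PREV, EPS)

# Second failure that is detected: both bad inputs are grounded.
second_detected = select([5.0, 3.0, 0.48], PREV, EPS)

print(healthy, one_failure, second_far, second_near, second_detected)
```

In every two-failure case the output stays between ground and the good value, which is the bound the design argument relies on.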

Mid-value select is used most advantageously where bit-for-bit M-out-of-N voters cannot be used due to the system's inability to ensure that the inputs are bit-for-bit identical. Most often this is due to the inputs being asynchronous with respect to each other. However, it is this very asynchrony coupled with these hybrid nMR additions to mid-value select that can still lead to system disagreement.

Going back to the Osder example, but using replicated mid-value selectors, the following scenario, as depicted in the simple plot on the right side of figure 6, is possible: two of the three signals are on either side of the signal that is currently being selected as the mid-value and are nearly epsilon away from this mid-value. Given that this middle input value is sampled asynchronously and is varying to some degree, one of the mid-value selectors could sample it when it was closer to the more positive of the other two inputs (i.e., X2 is sampled near point A) and another mid-value selector could sample it when it was closer to the more negative of the two other signals (i.e., X2 is sampled near point B). The former mid-value selector will then block the more negative of the other two inputs and the latter mid-value selector will block the more positive of the other two inputs. We now have a disagreement between these two mid-value selectors. Even more perversely, a third mid-value selector could sample the middle input when it is exactly between the two other inputs and not see either one of them as being more than epsilon away. Thus, we get a three-way split: one blocking the positive input, one blocking the negative input, and one not blocking any inputs. The conditions for sustaining a three-way split are highly unlikely. However, a two-way split would be persistent. In this persistent state, the replicated mid-value selectors may initially select the same input (e.g. X2 from the right side of figure 6) but eventually select different inputs. For example, if X2 continues on its downward slope and eventually becomes lower than X3, any replicated mid-value selector that has blocked X1 will select X3 as the mid-value, while any replicated mid-value selector that has blocked X3 will select (the grounded) X2 as the mid-value. Thus, this condition creates a system disagreement.
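
The three-way split in blocking decisions, and the resulting output disagreement, can be reproduced with a short sketch. It assumes three replicated selectors (here called A, B, and C) that share the same X1 and X3 but sample X2 at slightly different moments, and it assumes the switch states stay latched once set. The signal values, the epsilon, and the latching behavior are assumptions chosen for illustration, not taken from figure 6.

```python
EPSILON = 1.0
GROUND = 0.0


def mid_value(values):
    """Middle value of three inputs."""
    return sorted(values)[1]


def blocked_inputs(inputs, previous_output, epsilon=EPSILON):
    """Names of inputs more than epsilon from the previous output."""
    return {
        name
        for name, value in inputs.items()
        if abs(value - previous_output) > epsilon
    }


def output(inputs, blocked):
    """Mid-value of the inputs after blocked ones are switched to ground."""
    return mid_value(
        [GROUND if name in blocked else value for name, value in inputs.items()]
    )


X1, X3 = 0.9, -0.9  # both nearly epsilon away from X2, on either side

# Each replica's asynchronous sample of X2, which is also its previous output.
x2_samples = {"A": 0.15, "B": -0.15, "C": 0.0}

blocked = {
    replica: blocked_inputs({"X1": X1, "X2": x2, "X3": X3}, previous_output=x2)
    for replica, x2 in x2_samples.items()
}

# Later, X2 has fallen below X3 (blocking assumed to remain latched).
later_inputs = {"X1": X1, "X2": -1.0, "X3": X3}
outputs = {replica: output(later_inputs, blocked[replica]) for replica in blocked}

print(blocked)
print(outputs)
```

Replica A blocks X3, replica B blocks X1, and replica C blocks nothing, even though all three see fault-free inputs. Once X2 drops below X3, replica A outputs ground while replica B outputs X3, so the replicas no longer agree.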

Acronyms and Initialisms

Acronym   Definition
AADL      Architecture Analysis & Design Language
ACE       Analog Control Electronics
ACSR      Algebra of Communicating Shared Resources
ALDERIS   Analysis Language for Distributed, Embedded, and Real-time Systems
ANSI      American National Standards Institute
ASIC      Application-specific Integrated Circuit
BAG       Bandwidth Allocation Gap
BCET      Best Case Execution Time
BDD       Binary Decision Diagram
BE        Best Effort
BFT       Byzantine Fault Tolerant
BIU       Bus Interface Unit
BRAIN     Braided Ring Availability Integrity Network
COM       Command
COM/MON   Command/Monitor
BTL       Backplane Transceiver Logic
CPU       Central Processing Unit
CRC       Cyclic Redundancy Check
CTL       Computation Tree Logic
DES       Discrete Event Simulation
DE        Discrete Event
DREAM     Distributed Real-time Embedded Analysis Method
DRE       Distributed Real-time Embedded
DSML      Domain-specific Modeling Language
ES        End System
EDF       Earliest Deadline First
EDICT     Error Detection Isolation Containment Types
ETB       Evidential Tool Bus
FADEC     Full Authority Digital Engine Control
FCS       Flight Critical Systems
FDIR      Fault Detection, Isolation, and Recovery
FIFO      First In First Out
FIT       Fault Injection Techniques in the Time-Triggered Architecture
FLP       Fischer, Lynch, and Paterson
FMEA      Failure Mode and Effects Analysis
FSM       Finite State Machine
GPC       General Purpose Computer
GSPN      Generalized Stochastic Petri Net
GUI       Graphical User Interface
HPC       High Performance Computing
IC        Integrated Circuit
IMA       Integrated Modular Avionics
ITAR      International Traffic in Arms Regulations
ITU       International Telecommunication Union
LAN       Local Area Network
LLF       Least Laxity First
LRM       Line Replaceable Module
MAC       Medium Access Control
MARTE     Modeling and Analysis of Real-Time and Embedded Systems
MDM       Multiplexer Demultiplexer
MoC       Model of Computation
MON       Monitor
MVS       Mid-Value Select
NASA      National Aeronautics and Space Administration
NIC       Network Interface Controller
nMR       n-Modular Redundant
OSATE     Open Source AADL Tool Environment
PCM       Pulse Code Modulation
PE        Processing Element
RC        Rate Constrained
RMU       Redundancy Management Unit
ROBUS     Reliable Optical Bus
SAE       Society of Automotive Engineers
SAL       Symbolic Analysis Laboratory
SCP       Self-checking Pair
SoS       Slightly out of Specification
SPIDER    Scalable Processor-Independent Design for Enhanced Reliability
STS       Space Transportation System
TDMA      Time Division Multiple Access
TMR       Triple Modular Redundancy
TTP/C     Time-triggered Protocol
TT        Time Triggered
UML       Unified Modeling Language
VL        Virtual Link
WCET      Worst Case Execution Time
XML       Extensible Markup Language

References

[1] Driscoll, K.; Hall, B.; Paulitsch, M.; Zumsteg, P.; and Sivencrona, H.: The Real Byzantine Generals. Proceedings of the Digital Avionics Systems Conference, 2004, pp. 61-71.
[2] Bergin, C.: Faulty MDM removed. May 18 2008. URL http://www.nasaspaceflight.com/2008/05/sts-124-frr-debate-outstanding-issues-faulty-mdm-removed.
[3] Bergin, C.: STS-126: Super smooth endeavor easing through the countdown. NASA Spaceflight.com, November 13 2008. URL http://www.nasaspaceflight.com/2008/11/sts-126-endeavour-easing-through-countdown.
[4] Kopetz, H.; and Grünsteidl, G.: TTP-A time triggered protocol for automotive applications, Inst. für Technische Informatik, Technische Universit, 1992.
[5] Gruenbacher, H.: Fault Injection for TTA. Deliverable 5.1-5.5 Combined Report IST 1999 10748, Carinthia Tech Institute, 2002. URL http://www3.cti.ac.at/fit/.
[6] Sivencrona, H.; Johannessen, P.; Persson, M.; and Torin, J.: Heavy-ion Fault Injection in the Time-triggered Communication Protocol. Proc. 1st Latin American Symposium on Dependable Computing, LNCS 2847, 2003, pp. 69-80.
[7] Pfeifer, H.; Schwier, D.; and von Henke, F. W.: Formal Verification for Time Triggered Clock Synchronization. Proc. 7th IFIP International Working Conference on Dependable Computing for Critical Applications, 1999.
[8] Ademaj, A.: Slightly-Off-Specification Failures in the Time Triggered Architecture. 7th IEEE Int. Workshop on High Level Design Validation and Test, 2002.
[9] Osder, S.: Chronological overview of past avionic flight control system reliability in military and commercial operations. AGARD-AG-224, P. R. Kurzhals, ed., NATO Research and Technology Organisation, vol. 224, Jan 1977, pp. 2-1-2-17. Available from NTIS HC A16/MF A01.
[10] Osder, S.: Generic Faults and Architecture Design Considerations in Flight Critical Systems. AIAA Journal Of Guidance, vol. 6, no. 2, March-April 1983, pp. 65-71.


Test Stories - Atomic (old revision)

Edited Nov 14, 2012 by kevin-driscoll

1 System Agreement Failures

In the history of fault tolerance, there have been many examples of
system agreement failures that have caused serious impairments or
complete loss of fielded systems. Many of these failures are due to
Byzantine faults [1]. On this page, several examples
of faults that led to system disagreement are given. Particular
attention should be given to the proximate cause of the disagreement
(typically a poorly designed fault detection, isolation, and recovery
mechanism) rather than the phenomenological (physics based) event that
starts the chain of events leading to the disagreement.

1.1  Space Transport System

NASA's space shuttle has experienced several examples of agreement
failures due to incorrect handling of Byzantine faults between its
MDM units and its GPC. These faults fall within the
class that the shuttle developers called "non-universal I/O error". The
MDMs act as remote I/O concentrators for the GPCs. Data
from the MDMs are transferred to the GPCs over data
busses that are similar to MIL STD 1553. The GPCs execute
redundancy-management algorithms that include FDIR functions
having specific handling for the "non-universal I/O error" class of
failure. However, these FDIR algorithms were not correctly
designed to handle Byzantine faults. Given that there were four
GPCs, the shuttle had sufficient redundancy to tolerate a
Byzantine fault, if these FDIR algorithms had been designed
correctly.

In one of the earliest examples (some 25 years ago), this failure was
triggered by a technician putting incorrect terminating resistors on the
end of a data bus. Because of the impedance mismatch between the
characteristic impedance of the data bus and resistance of the
terminating resistors, signals on the data bus were reflected off of the
resistors. These reflections caused a standing wave on the data bus. Two
of the four GPCs happened to be connected to the data bus at
nodes of the standing wave and the other two GPCs were connected
to the data bus at anti-nodes of the standing wave (see
figure 1). Because of this, two of the GPCs
disagreed with the other two GPCs. It was lucky that this
irreconcilable 2:2 disagreement occurred in the lab.

Standing wave caused by incorrect terminating
resistors

Figure 1: Standing wave caused by incorrect terminating resistors

A more recent example of this problem came closer to causing a disaster.
At 12:12 GMT 13 May 2008, a space shuttle was loading its hypergolic
fuel for mission STS-124 when a 3-1 disagreement occurred among
its GPCs (GPC 4 disagreed with the other GPCs).
Three seconds later, the split became 2-1-1 (GPC 2 now disagreed
with GPC 4 and the other two GPCs). This required that
the launch countdown be stopped. During the subsequent troubleshooting,
the remaining two GPCs disagreed (1-1-1-1 split). See the
reports given in [2] and [3]. This was a
complete system disagreement. However, none of the GPCs were
faulty. The fault was in the FA 2 MDM. This fault was a crack in
a diode. The photomicrographs in figure 2 show two
views of this diode, rotated 90 degrees. The dark wavy line pointed to
by the red arrows is the crack. The current flow through diode is
normally left to right through the material shown in these pictures.
This means that the crack was perpendicular to the normal current flow
and completely through the current path. As a crack opened up, it
changed the diode into another type of component ... a capacitor. This
transformation is illustrated in figure 3.

The diode crack \
Figure 2: The diode crack

Figure with three elements: the left element shows a diode symbol, the
center element shows a diode symbol with a gap through the center of it,
the ride element shows capacitor. \
Figure 3: The transformation of a diode into a capacitor

Normal bus signals \
Figure 4: Normal bus signals

Bad bus signals caused by diode failure \
Figure 5: Bad bus signals caused by diode failure

The normal signals that should appear on the data bus between the
MDM and the GPC are shown in
figure 4. The signals that were produced due to
the diode failure are shown in figure 5. Because
some of the bits in the signal are smaller than they should have been,
some of the GPC receivers could not see these bits. The ability
to see these bits depends on the sensitivity of the receiver, which is a
function of manufacturing variances, temperature, and its power supply
voltage. From the symptoms, it is apparent that the receiver in
GPC 4 was the least sensitive and saw the errors before the
other three GPC. This causedGPC 4 to disagree with the
other three. Then, as the crack in the diode widened, the bits became
shorter to the point where GPC 2 could no longer see these bits;
which caused it to disagree with the other GPC. At this point,
the set of messages that was received correctly by GPC 4 was
different from the set of messages that was correctly received by
GPC 2 which was different again from the set of messages that
was correctly received by GPC 1 and GPC 3. This process
continued until GPC 1 and GPC 3 also disagreed with all
the other GPC.

1.2  TTP/C

A databus known as TTP/C was developed for the needs of the
emerging automotive "by-wire" industry [4]. Its goals
were to provide replica determinism while living within the cost
constraints of the automotive marketplace. Because of its low cost, it
has also found applications within the aerospace market. TTP/C
is a TDMA-based serial communications protocol that provides
synchronization in deterministic message communication over dual
redundant physical media. TTP/C also provides a membership
service at the protocol level. The function of the membership service is
to provide global consensus on message distribution and system state.
Addressing the consensus problem at the protocol level can greatly
reduce system software complexity. However, placing a requirement for
protocol level consensus leaves the protocol itself vulnerable to a
Byzantine failure. The FIT project confirmed the possibility of
this vulnerability by observing actual occurrences of such
failures [5].

As part of the FIT project, a first generation time-triggered
communication controller (TTP-C1) was radiated with heavy
ions. [6]. The errors caused by this experiment were
not controlled; they were the result of random radioactive decay. The
reported fault manifestations were bit-flips in register and RAM
locations within the TTP-C1 IC. ICs with improved design
are now available for TTP/C.

During the many thousands of fault injection runs, several system
failures due to Byzantine faults were recorded [6].
The dominant Byzantine failure mode observed was due to marginal
transmission timing. Corruptions in the time-base registers, within the
integrated IC that had been irradiated, led it to transmit
messages at periods that were slightly-off-specification (SOS), i.e.
slightly too early or too late relative to the globally agreed upon time
base. A message transmitted slightly too early was accepted only by the
ICs of the system having slightly fast clocks; ICs with
slightly slower clocks rejected the message. Even though such a timing
failure would have been tolerated by TTP/C's Byzantine-tolerant
clock synchronization algorithm [7], the dependency of
this service on TTP/C's membership service prevented it from
succeeding. After a Byzantine erroneous transmission, the membership
consensus logic of TTP/C prevented ICs that had
different perceptions of this transmission's validity from communicating
with each other. Therefore, following such a faulty transmission, the
system is partitioned into two sets or cliques --- one clique containing
the ICs that accepted the erroneous transmission, the other
clique containing the ICs that rejected the transmission.

TTP/C incorporates a mechanism to deal with these unexpected
faults - as long as the errors are transient. The clique avoidance
algorithm is executed on every IC prior to its next scheduled
message. ICs that find themselves in a minority clique (i.e.
unable to receive messages from the majority of active ICs) are
expected to cease operation before transmitting. However, if the faulty
IC is in the majority clique or is programmed to re-integrate
after a failure, then a permanent SOS fault can cause repeated failures.
This behavior was observed during the FIT fault injections. In several
fault injection tests, the faulty IC did not cease transmission
and the SOS fault persisted. The persistence of this fault prevented the
clique avoidance mechanism from successfully recovering. In several
instances, the faulty IC continued to divide the membership of
the remaining cliques, which resulted in eventual system failure. In
later analysis of the faulty behavior, these effects were repeated with
software simulated fault injection. The original faults were traced to
upsets in either the C1 controller time-base registers or the micro-code
instruction RAM [8]. Subsequent generations of the
TTP/C controller have incorporated parity and other mechanisms
to reduce the influence of random upsets (e.g. ROM based microcode
execution). SOS faults in TTP/C can be mitigated with a central
guardian. This guardian assumes fail-detectable behavior and does not
violate end-to-end CRC arguments, which has to be shown by exhaustive
testing.

1.3  Mid-Value Select

Mid-Value select is a well-known method for masking the propagation of
failures. It has properties that are similar to an M-out-of-N
voter [9], but does not require any of its inputs to
be bit-for-bit identical. Many mid-value select implementations in
actual fielded systems are merged with other fault tolerance mechanisms
such as reasonableness checks or other fault detection mechanisms that
are then used to block some inputs to the mid-value selection if they
are known to be bad via these other fault detection mechanisms. This
blocking of inputs makes these types of mid-value selection mechanisms a
form of hybrid nMR. Hybrid nMR systems can change the
"M" and/or "N" in the M-out-of-N calculations by using reconfiguration.
The objective is to be able to tolerate more faults than can be
tolerated by using nMR alone (details described in example
below). To make things even more complex, these additional fault
detection mechanisms sometimes use a previous output of the mid-value
select in comparison with its inputs to determine if they are faulty
and/or use a previous output of the mid-value select as the replacement
value for a faulty input.

One of the most common ways of blocking inputs to a mid-value selector
is to override the value of a faulty input with the value that is the
midpoint of the reasonable range of input values. This is illustrated on
the left side of figure 6, which was taken from a paper
by Stephen Osder [10] and modified slightly. The "voter" in
his figure is actually a mid-value selector, which due to its similarity
with a bit-for-bit M-out-of-N voter can be called a voter. While Osder
uses this example to show how this widely used mid-value selection
design can fail to meet its design purpose for a non-replicated
mid-value selector, we can use the same design to show how disagreement
can arrive in replicated mid-value selectors.

Mid-value select failure example \
Figure 6: Mid-value select failure example

In this example, zero input volts (ground) is assumed to be the middle
of the valid range. The switches in this figure are driven by the fault
detection logic and force faulty inputs to ground. The fault detection
mechanism used in this example is one that is commonly used. It compares
the previous output of the mid-value selection with each of the inputs.
If an input value is more than a certain epsilon away from this last
value, it gets switched to zero. The rationale for this design is that
the mid-value selector could only tolerate one failure (worst case) if
the switches were not included. With the switches (and assuming good
enough failure detection circuitry that controls the switches), this
design hopes to tolerate an additional failure using the following
argument - When the first failure is detected, it is clamped to zero. If
a second failure occurs that is further away from zero than the good
value, the good value is selected by the mid-value selector. If the bad
value is between the good value and ground, the worst value that the
mid-value selection can output is ground, which is the midpoint of the
reasonable input range. Sometimes this is sufficient. When it is not, a
common variation of this idea is to use the previous output of the
mid-value selection instead of the midpoint of the reasonable input
range (ground in this example).
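
The clamping argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Osder's circuit: the class name `ClampedMidValueSelector`, the `epsilon` value, the sample inputs, and the choice to latch a switch once it has tripped are all hypothetical.

```python
def mid_value(a, b, c):
    """Return the middle (median) of three input values."""
    return sorted((a, b, c))[1]


class ClampedMidValueSelector:
    """Mid-value selector whose inputs can be switched to the range midpoint.

    Assumption: once the detector flags an input, its switch stays latched.
    """

    def __init__(self, epsilon, initial_output=0.0, midpoint=0.0):
        self.epsilon = epsilon              # allowed distance from the last output
        self.last_output = initial_output   # previous mid-value selection
        self.midpoint = midpoint            # "ground" in the example
        self.blocked = [False, False, False]

    def step(self, inputs):
        # Fault detection: compare each input with the previous output.
        for i, x in enumerate(inputs):
            if abs(x - self.last_output) > self.epsilon:
                self.blocked[i] = True
        # Switches: a blocked input is replaced by the midpoint.
        effective = [self.midpoint if b else x
                     for b, x in zip(self.blocked, inputs)]
        self.last_output = mid_value(*effective)
        return self.last_output


selector = ClampedMidValueSelector(epsilon=1.0, initial_output=5.0)
first = selector.step([5.0, 5.1, 9.0])   # third input is detected and clamped
second = selector.step([5.0, 5.8, 9.0])  # undetected bad value beyond the good one
third = selector.step([5.0, 4.5, 9.0])   # undetected bad value between good and ground
print(first, second, third)
```

In this sketch the first input plays the good value. With the third input clamped, an undetected bad second input that lies beyond the good value leaves the good value selected, while one that lies between the good value and ground is itself selected, so the output never moves past ground.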

Mid-value select is used most advantageously where bit-for-bit
M-out-of-N voters cannot be used due to the system's inability to ensure
that the inputs are bit-for-bit identical. Most often this is due to the
inputs being asynchronous with respect to each other. However, it is
this very asynchrony coupled with these hybrid nMR additions to
mid-value select that can still lead to system disagreement.
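
A small Python sketch may help show why mid-value select is preferred when inputs are not bit-for-bit identical. The function names and sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def m_out_of_n_vote(values, m):
    """Bit-for-bit vote: return the value shared exactly by at least m inputs,
    or None when no such value exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= m else None


def mid_value_select(values):
    """Return the median of an odd number of inputs."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# Asynchronously sampled copies of the same signal differ slightly.
samples = [5.01, 5.02, 4.99]
voted = m_out_of_n_vote(samples, m=2)   # no two inputs match exactly
selected = mid_value_select(samples)    # still yields a usable value
print(voted, selected)
```

With these samples the exact-match vote finds no majority, while the mid-value selection still returns one of the healthy readings.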

Going back to the Osder example, but using replicated mid-value
selectors, the following scenario, as depicted in the simple plot on the
right side of figure 6, is possible: two of the three
signals are on either side of the signal that is currently being
selected as the mid-value and are nearly epsilon away from this
mid-value. Given that this middle input value is sampled asynchronously
and is varying to some degree, one of the mid-value selectors could
sample it when it was closer to the more positive of the other two
inputs (i.e., X~2~ is sampled near point A) and another mid-value
selector could sample it when it was closer to the more negative of the
two other signals (i.e., X~2~ is sampled near point B). The former
mid-value selector will then block the more negative of the other two
inputs and the latter mid-value selector will block the more positive of
the other two inputs. We now have a disagreement between these two
mid-value selectors. Even more perversely, a third mid-value selector
could sample the middle input when it's exactly between the two other
inputs and not see either one of them as being an epsilon away. Thus,
we get a three-way split: one blocking the positive input, one blocking
the negative input, and one not blocking any inputs. The conditions for
sustaining a three-way split are highly unlikely. However, a two-way
split would be persistent. In this persistent state, the replicated
mid-value selectors may initially select the same input (e.g. X~2~ from
the right side of figure 6) but eventually select
different inputs. For example, if X~2~ continues on its downward slope
and eventually becomes lower than X~3~, any replicated mid-value
selector that has blocked X~1~ will select X~3~ as the mid-value, while
any replicated mid-value selector that has blocked X~3~ will select (the
grounded) X~2~ as the mid-value. Thus, this condition creates a system
disagreement.
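
The two-way split can be reproduced with a short simulation. This is a sketch under assumptions rather than a model of any fielded system: the signal levels, the `epsilon` of 1.0, the per-selector X~2~ samples, and the latching of the switches are all hypothetical values chosen to mirror the scenario described above.

```python
class Selector:
    """Replicated mid-value selector with epsilon-based input clamping.

    Assumption: a tripped switch stays latched and clamps its input to 0.0.
    """

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.last_output = 0.0
        self.blocked = [False, False, False]

    def step(self, inputs):
        # Block any input more than epsilon away from the previous output.
        for i, x in enumerate(inputs):
            if abs(x - self.last_output) > self.epsilon:
                self.blocked[i] = True
        effective = [0.0 if b else x for b, x in zip(self.blocked, inputs)]
        self.last_output = sorted(effective)[1]  # mid-value of three
        return self.last_output


def run_scenario():
    x1, x3 = 0.95, -0.95                    # steady, nearly epsilon from mid-value
    # X2 slopes downward; each selector samples it at slightly different times.
    x2_seen_by_a = [0.10, 0.05, -0.50, -0.98]
    x2_seen_by_b = [-0.10, -0.15, -0.60, -0.98]
    a, b = Selector(epsilon=1.0), Selector(epsilon=1.0)
    history = []
    for x2_a, x2_b in zip(x2_seen_by_a, x2_seen_by_b):
        history.append((a.step([x1, x2_a, x3]), b.step([x1, x2_b, x3])))
    return a, b, history


a, b, history = run_scenario()
for out_a, out_b in history:
    print(out_a, out_b)
print("A blocked:", a.blocked, "B blocked:", b.blocked)
```

In this run, selector A latches a block on X~3~ and selector B latches a block on X~1~ after the first skewed sample. Once X~2~ falls below X~3~, A outputs ground while B outputs X~3~, and nothing in either selector brings them back into agreement.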

Acronyms and Initialisms

Acronym   Definition
--------  ---------------------------------------------------------------------
AADL      Architecture Analysis & Design Language
ACE       Analog Control Electronics
ACSR      Algebra of Communicating Shared Resources
ALDERIS   Analysis Language for Distributed, Embedded, and Real-time Systems
ANSI      American National Standards Institute
ASIC      Application-specific Integrated Circuit
BAG       Bandwidth Allocation Gap
BCET      Best Case Execution Time
BDD       Binary Decision Diagram
BE        Best Effort
BFT       Byzantine Fault Tolerant
BIU       Bus Interface Unit
BRAIN     Braided Ring Availability Integrity Network
COM       Command
COM/MON   Command/Monitor
BTL       Backplane Transceiver Logic
CPU       Central Processing Unit
CRC       Cyclic Redundancy Check
CTL       Computation Tree Logic
DES       Discrete Event Simulation
DE        Discrete Event
DREAM     Distributed Real-time Embedded Analysis Method
DRE       Distributed Real-time Embedded
DSML      Domain-specific Modeling Language
ES        End System
EDF       Earliest Deadline First
EDICT     Error Detection Isolation Containment Types
ETB       Evidential Tool Bus
FADEC     Full Authority Digital Engine Control
FCS       Flight Critical Systems
FDIR      Fault Detection, Isolation, and Recovery
FIFO      First In First Out
FIT       Fault Injection Techniques in the Time-Triggered Architecture
FLP       Fischer, Lynch, and Paterson
FMEA      Failure Mode and Effects Analysis
FSM       Finite State Machine
GPC       General Purpose Computer
GSPN      Generalized Stochastic Petri Net
GUI       Graphical User Interface
HPC       High Performance Computing
IC        Integrated Circuit
IMA       Integrated Modular Avionics
ITAR      International Traffic in Arms Regulations
ITU       International Telecommunication Union
LAN       Local Area Network
LLF       Least Laxity First
LRM       Line Replaceable Module
MAC       Medium Access Control
MARTE     Modeling and Analysis of Real-Time and Embedded Systems
MDM       Multiplexer Demultiplexer
MoC       Model of Computation
MON       Monitor
MVS       Mid-Value Select
NASA      National Aeronautics and Space Administration
NIC       Network Interface Controller
nMR       n-Modular Redundant
OSATE     Open Source AADL Tool Environment
PCM       Pulse Code Modulation
PE        Processing Element
RC        Rate Constrained
RMU       Redundancy Management Unit
ROBUS     Reliable Optical Bus
SAE       Society of Automotive Engineers
SAL       Symbolic Analysis Laboratory
SCP       Self-checking Pair
SoS       Slightly out of Specification
SPIDER    Scalable Processor-Independent Design for Enhanced Reliability
STS       Space Transportation System
TDMA      Time Division Multiple Access
TMR       Triple Modular Redundancy
TTP/C     Time-triggered Protocol
TT        Time Triggered
UML       Unified Modeling Language
VL        Virtual Link
WCET      Worst Case Execution Time
XML       Extensible Markup Language

References


[1] Driscoll, K.; Hall, B.; Paulitsch, M.; Zumsteg, P.; and Sivencrona, H.:
The Real Byzantine Generals. Proceedings of the Digital Avionics Systems
Conference, 2004, pp. 61-71.


[2] Bergin, C.: Faulty MDM removed. May 18 2008. URL
<http://www.nasaspaceflight.com/2008/05/sts-124-frr-debate-outstanding-issues-faulty-mdm-removed>.


[3] Bergin, C.: STS-126: Super smooth Endeavour easing through the countdown.
NASA Spaceflight.com, November 13 2008. URL
<http://www.nasaspaceflight.com/2008/11/sts-126-endeavour-easing-through-countdown>.


[4] Kopetz, H.; and Grünsteidl, G.: TTP-A time triggered protocol for
automotive applications. Inst. für Technische Informatik, Technische
Universität, 1992.


[5] Gruenbacher, H.: Fault Injection for TTA. Deliverable 5.1-5.5 Combined
Report IST 1999 10748, Carinthia Tech Institute, 2002. URL
<http://www3.cti.ac.at/fit/>.


[6] Sivencrona, H.; Johannessen, P.; Persson, M.; and Torin, J.: Heavy-ion
Fault Injection in the Time-triggered Communication Protocol. Proc. 1st
Latin American Symposium on Dependable Computing, LNCS 2847, 2003,
pp. 69-80.


[7] Pfeifer, H.; Schwier, D.; and von Henke, F. W.: Formal Verification for
Time Triggered Clock Synchronization. Proc. 7th IFIP International
Working Conference on Dependable Computing for Critical Applications,
1999.


[8] Ademaj, A.: Slightly-Off-Specification Failures in the Time Triggered
Architecture. 7th IEEE Int. Workshop on High Level Design Validation
and Test, 2002.


[9] Osder, S.: Chronological overview of past avionic flight control system
reliability in military and commercial operations. AGARD-AG-224, P. R.
Kurzhals, ed., NATO Research and Technology Organisation, vol. 224, Jan
1977, pp. 2-1-2-17. Available from NTIS HC A16/MF A01.


[10] Osder, S.: Generic Faults and Architecture Design Considerations in
Flight Critical Systems. AIAA Journal of Guidance, vol. 6, no. 2,
March-April 1983, pp. 65-71.