Robust Design: Fault Tolerance

Monday, March 29th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master

Designing a system for fault tolerance is a robust design principle for building systems that will continue to operate correctly, or in an acceptably degraded fashion, when failures occur. This approach is appropriate for systems that are expected to experience failures or that operate in environments too complex to completely account for and control all of the potential forces acting on the system.

Implementing fault tolerance usually involves some level of redundancy in the system, which increases the system’s overall cost and complexity. The decision to adopt a fault tolerant approach is therefore usually based on three factors: the cost to implement the tolerance, the consequences of allowing the failure to occur, and the probability of the failure occurring.
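One simple way to weigh those three factors against each other is an expected-loss comparison. The sketch below is only an illustration: the function names and the rule "adopt the tolerance when it costs less than the expected loss" are my assumptions, not a standard method, and real decisions (especially safety-related ones) rarely reduce to a single number.

```python
def expected_failure_loss(probability: float, consequence_cost: float) -> float:
    """Expected loss from a failure: its probability times the cost of its consequences."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * consequence_cost


def tolerance_is_justified(implementation_cost: float,
                           probability: float,
                           consequence_cost: float) -> bool:
    """Hypothetical decision rule: tolerate the fault when doing so is cheaper
    than the loss we would expect from letting the failure happen."""
    return implementation_cost < expected_failure_loss(probability, consequence_cost)


if __name__ == "__main__":
    # Illustrative numbers only: a 1% chance of a failure costing 500,000.
    print(tolerance_is_justified(1_000, 0.01, 500_000))   # redundancy costing 1,000
    print(tolerance_is_justified(10_000, 0.01, 500_000))  # redundancy costing 10,000
```

With these made-up numbers the expected loss is about 5,000, so the cheaper redundancy passes the test and the more expensive one does not. When the consequence is a loss of life rather than a repair bill, a purely monetary comparison like this one is not an adequate basis for the decision.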

The basic tenet of a fault tolerant design is that no single failure, of any nature, can cause the system to completely stop operating. To deliver a fault tolerant system, such designs usually employ some type of redundancy in the subsystems so that if one of the subsystems fails, for whatever reason, the redundant subsystem can keep the overall system operating correctly. A low-tech example of this concept is the large truck that relies on many more than four tires to carry its load. Even when a single tire fails, the truck can continue its primary task because the load the failed tire was carrying is spread out among the remaining tires.
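The truck example can be sketched in a few lines. This is a simplified model under stated assumptions: the load is spread evenly across the working tires, every tire has the same rating, and the function names and numbers are hypothetical ones chosen for illustration.

```python
def load_per_unit(total_load: float, unit_count: int, failed_count: int) -> float:
    """Load carried by each working unit, assuming the load is shared evenly."""
    working = unit_count - failed_count
    if working <= 0:
        raise ValueError("no working units remain")
    return total_load / working


def survives_failures(total_load: float, unit_count: int,
                      failed_count: int, unit_rating: float) -> bool:
    """True if the remaining units can carry the load without exceeding their rating."""
    working = unit_count - failed_count
    if working <= 0:
        return False
    return total_load / working <= unit_rating


if __name__ == "__main__":
    # 18 tires rated at 2,500 each carrying 36,000: one failure is absorbed.
    print(survives_failures(36_000, 18, 1, 2_500))
    # 4 tires rated at 10,000 each carrying the same load: one failure is not.
    print(survives_failures(36_000, 4, 1, 10_000))
```

The redundancy only helps because each tire normally runs below its rating. The margin left on the surviving units determines how many failures the system can absorb.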

Common electronics subsystems that employ different strategies for fault tolerance include computer memories, disk-based data storage, electronic communications, and “padded-cell” operating system virtualization (we will explore each type of fault tolerant approach in more detail in future posts). However, merely relying on redundant subsystems is insufficient for a robust design. In a recent article claiming Toyota’s Acceleration Issue Due to Electronics, several failure experts share that:

“… Automakers claim that no danger is posed because they build in “redundant systems”—but that’s not foolproof unless they are truly independent, according to the engineers. EMI can affect both systems the same way. Anderson showed that the two systems lie physically next to each other. It would seem that interference affecting one would affect the other.

A safety override must be a totally independent system, said Armstrong. Safety cannot be achieved by relying only on complex electronic systems. To reduce risk to an acceptable level, independent “fail safes” or backup systems are required. “But the auto industry continues to ignore standard safety engineering principles … even though a modern vehicle is actually a computer controlled machine,” writes Armstrong.”

A fault tolerant design must account for the entire environment that the system resides within. As a result, it is impossible to build a fault tolerant design solely with software because the processor core that runs the software is itself a single point of failure. A fault tolerant design requires at least some physical redundancy, and the redundant components must be completely independent of one another.

The EMI statement in the excerpt is potentially incomplete because it assumes that the two subsystems are alike. While the two subsystems may lie next to each other, the statement does not say whether they are identical instances of the same subsystem. Even if they are logically (software) identical, each box might use different types of shielding and data interfaces, so that each is immune or susceptible to different types of EMI. The two systems might also use different processor cores and memories, as well as different software, giving them different susceptibilities to EMI.
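To see why independence matters so much, consider a rough probability sketch. It assumes a deliberately simple model of my own choosing: each of two redundant units fails on its own with probability `p_independent`, and a shared cause (such as an EMI event that both units are susceptible to) takes out both at once with probability `p_common`. The names and numbers are illustrative only and are not taken from the article quoted above.

```python
def prob_both_fail(p_independent: float, p_common: float) -> float:
    """Probability that both redundant units are down at once.

    Either the shared cause strikes (both fail together), or it does not
    and each unit happens to fail on its own.
    """
    for p in (p_independent, p_common):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_common + (1.0 - p_common) * p_independent ** 2


if __name__ == "__main__":
    truly_independent = prob_both_fail(0.01, 0.0)   # no shared susceptibility
    shared_weakness = prob_both_fail(0.01, 0.01)    # both vulnerable to the same event
    print(truly_independent, shared_weakness)
    print(shared_weakness / truly_independent)      # how much worse the shared case is
```

In this toy model, a shared susceptibility of just 1% makes the redundant pair roughly a hundred times more likely to fail together than a truly independent pair. Different shielding, interfaces, processor cores, or software are ways of driving that shared term toward zero.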

An important point is that it is not immediately obvious what is sufficient to build a fault tolerant design. In the next post about fault tolerance, I will explore fault analysis and fault injection as techniques to increase confidence in a fault tolerant design.
