Operational Single Points of Failure

Monday, June 21st, 2010 by Robert Cravotta

A key tenet of fault tolerant designs is to eliminate all single points of failure from the system. A single point of failure is a component or subsystem within a system such that if it suffers a failure, it can cause the rest of the system to fail. When I was first exposed to the single point of failure concept, we used it to refer to sub systems in electronic control systems. A classic example of a single point failure is a function that is implemented completely in software, even if you use multiple functions or algorithms to check each other, because the processor core itself represents a single point of failure in a system with only one processor.

As my experience grew, and I was exposed to more of the issues facing internal program management as well as design level trade-off complexities, I appreciated that single points of failure are not limited to just what is inside your box. It is a system level concept and while it definitely applies to the obvious engineering candidates, such as hardware, software, and mechanical considerations (and thermal and EMI and … and …), it also applies to team processes, procedures, staffing, and even third-party services.

Identifying single points of failure in team processes and procedures can be subtle, but the consequences of allowing them to stay within your system design can be as bad as a normal engineering single point of failure. As an example, processes that only a single person executes are possible sources of failures because there is no cross check or measurement to ensure the person is performing the process correctly and this might allow certain failure conditions to go undetected. In contrast, you can eliminate such a failure point if the process involves more than a single person and the tasks performed by both people support some level of cross-correlation.  

Staffing policies can introduce dangerous single points of failure into your team or company, especially if there is no mechanism for the team to detect and correct when a given skill set is not duplicated across multiple people on the team or in the company. You never know when or why that person with the unique skills or knowledge will become unavailable. While you might be able to contact them if they leave the company or win the lottery, you would have a hard time being able to tap them if they died.

There was a cartoon I displayed in my office for a while many years ago that showed a widow and her child in the rain standing over a grave, and there is an engineer standing next to them asking if the husband ever mentioned anything about source code. The message is powerful and terrifying for anyone that is responsible for maintaining systems. The answer is to plan for redundancy in your staff’s skills and knowledge. When you identify that you have a single point of failure in your staff’s skills or knowledge, commit to fixing that problem as soon as possible. Note that this is a “when condition” and not an “if condition” because it will happen from time to time for reasons completely out of your control.

The thing to remember is that single points of failure can exist anywhere in your system and are not limited to just the components in your products. As systems include more outside or third-party services or partners, the scope of the system grows accordingly and the impact of non-technical single points of failure can grow also.

If you would like to be an information source for this series or provide a guest post, please contact me at Embedded Insights.

[Editor's Note: This was originally posted at the Embedded Master]



Leave a Reply