I am accelerating my plans to start a series on robust design principles because of the timely interest in Toyota’s safety recall for sticking accelerator pedals. Many people are weighing in on the issue, but Charles J. Murray’s article “Toyota’s Problem Was Unforeseeable” and Michael Barr’s posting “Is Toyota’s Accelerator Problem Caused by Embedded Software Bugs?” make me think there is significant value in discussing robust design approaches right away.
A quick answer to the question posed by the first article is no. The failure was not unforeseeable if a robust system-level failure analysis effort is part of the specification, design, build, test, and deploy process. The subheading of Charles’ article hits the nail on the head:
“As systems grow in complexity, experts say designing for failure may be the best course of action for managing it.”
To put things in perspective, my own engineering experience with robust designs is strongly based on specifying, designing, building, and testing autonomous systems in an aerospace environment. Some of these systems were man-rated, triple-fault-tolerant designs – meaning the system had to operate with no degradation in spite of any three failures. The vast majority of the designs I worked on were at least single-fault-tolerant designs. Much of my design bias is shaped by those projects. In the next post in this series, I will explore fault-tolerant philosophies for robust design.
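To make the fault-tolerance idea a little more concrete before that post, here is a minimal sketch (purely illustrative, not taken from any of the projects mentioned above) of one classic building block: majority voting across redundant sensor channels. With 2f+1 independent channels, a median vote stays bounded by the healthy channels even if up to f channels misbehave.

```c
#include <stdlib.h>

/* Compare two int readings for qsort(). */
static int compare_readings(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Median voter: with an odd number (2f+1) of independent channels, the
 * median reading remains bounded by the healthy channels even if up to f
 * channels report arbitrary (faulty) values. */
int vote_median(int readings[], size_t count)
{
    qsort(readings, count, sizeof readings[0], compare_readings);
    return readings[count / 2];   /* middle element after sorting */
}
```

With this particular scheme, masking a single fault takes three channels and masking any three faults takes seven, which hints at why each added level of fault tolerance drives up hardware count; real fault-tolerant architectures also layer on cross-channel monitoring, reconfiguration, and independent power and wiring beyond the voting itself.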
A quick answer to the question posed by the second article is: it depends. Just because a software change can fix a problem does not make the problem a software bug, despite how many people like to imply that the root cause is the software. Only software that does not correctly implement the explicit specification and design is truly a software bug. Otherwise, it is a system-level problem for which a software change may be the most economically or technically feasible fix, but that fix requires first changing the system-level specification and design. This is more than a semantic nit; it is an essential perspective for root cause analysis and resolution, and I hope to clearly explain why in my next post.
I would like to initially propose four robust design categories (fault tolerant, sandbox, patch-it, and disposable); if you know of another category, please share it here. I plan to follow up with separate posts focusing on each of these categories. I would also like to solicit guest posts from anyone who has experience with any of these types of robust design.
Fault tolerant design focuses on keeping the system running or safe in spite of failures. These techniques are commonly applied in high-value designs where people’s lives are at stake (such as airplanes, spacecraft, and automobiles), but some of them can also be applied to lower-impact consumer designs (think exploding batteries, which I’ll expand on in the next post).
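As a hedged illustration of what “running or safe in spite of failures” can look like at the code level (a generic sketch of a common drive-by-wire pattern, not a description of Toyota’s actual design), many electronic throttles read two redundant pedal sensors and drop to a safe state whenever the readings stop agreeing:

```c
#include <stdbool.h>
#include <stdlib.h>

#define PEDAL_DISAGREE_LIMIT_PCT 5   /* max allowed sensor disagreement; illustrative value */
#define THROTTLE_SAFE_IDLE       0   /* fail-safe command: close the throttle */

/* Cross-check two redundant pedal position sensors (both already scaled
 * to 0..100 percent). If they disagree by more than the limit, assume a
 * fault, latch it for diagnostics, and command a safe idle rather than
 * trusting either reading. */
int throttle_command(int pedal_a_pct, int pedal_b_pct, bool *fault_latched)
{
    if (abs(pedal_a_pct - pedal_b_pct) > PEDAL_DISAGREE_LIMIT_PCT) {
        *fault_latched = true;
        return THROTTLE_SAFE_IDLE;
    }
    return (pedal_a_pct + pedal_b_pct) / 2;   /* average of agreeing sensors */
}
```

The point is not the specific numbers but the design decision behind them: the system-level specification, not the code, has to define what “safe” means when the inputs can no longer be trusted.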
Sandbox design focuses on controlling the environment so that failures cannot occur. Ever wonder why Apple’s new iPad does not support third-party multitasking?
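A sandbox can be as simple as an explicit allow-list: if an operation was never designed, analyzed, and tested, the system refuses to run it at all. The sketch below (hypothetical operation names, not Apple’s implementation) shows the idea of removing a whole class of failures by constraining the environment rather than tolerating faults after the fact:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical allow-list: only operations the system was designed and
 * tested for are ever executed; everything else is rejected up front. */
static const char *allowed_ops[] = { "read_sensor", "log_event", "update_display" };

bool sandbox_permits(const char *requested_op)
{
    for (size_t i = 0; i < sizeof allowed_ops / sizeof allowed_ops[0]; ++i) {
        if (strcmp(requested_op, allowed_ops[i]) == 0)
            return true;
    }
    return false;   /* unknown operations never run, so they cannot fail */
}
```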
Patch-it design focuses on fixing problems after the system is in the field. This is a common approach for software products where failures in the end system are not catastrophic and corrections are inexpensive to deploy.
Disposable design focuses on systems with short life spans. This affects robust design decisions in a meaningfully different way than the other three categories.
The categories I’ve proposed are system-level in nature, but I think the concepts we can uncover in a discussion would apply to all of the disciplines required to design each component and subsystem in contemporary projects.
[Editor's Note: This was originally posted on EDN]