Robust Design

Thursday, February 4th, 2010 by Robert Cravotta

I am accelerating my plans to start a series on robust design principles because of the timely interest in the safety recall by Toyota for a sticking accelerator pedal. Many people are weighing in on the issue, but Charles J. Murray’s article “Toyota’s Problem Was Unforeseeable” and Michael Barr’s posting “Is Toyota’s Accelerator Problem Caused by Embedded Software Bugs?” make me think there is significant value in discussing robust design approaches right away.

A quick answer to the question posed by the first article is no. The failure was not unforeseeable if a robust system-level failure analysis effort is part of the specification, design, build, test, and deploy process. The subheading of Charles’ article hits the nail on the head:

“As systems grow in complexity, experts say designing for failure may be the best course of action for managing it.”

To put things in perspective, my own engineering experience with robust designs is strongly based on specifying, designing, building, and testing autonomous systems in an aerospace environment. Some of these systems were man-rated, triple-fault-tolerant designs – meaning the system had to operate with no degradation in spite of any three failures. The vast majority of the designs I worked on were at least single-fault-tolerant designs. Much of my design bias is shaped by those projects. In the next post in this series, I will explore fault-tolerant philosophies for robust design.

A quick answer to the questions posed by the second article is – it depends. Just because a software change can fix a problem does not make it a software bug, despite the fact that so many people like to imply the root cause of the problem is the software. Only software that does not correctly implement the explicit specification and design is truly a software bug. Otherwise, it is a system-level problem for which a software change may be the most economically or technically feasible fix, but that fix requires first changing the system-level specification and design. This is more than a semantic nit; it is an essential perspective for root cause analysis and resolution, and I hope in my next post to clearly explain why.

I would like to initially propose four robust design categories (fault tolerant, sandbox, patch-it, and disposable); if you know of another category, please share it here. I plan to follow up with separate posts focusing on each of these categories. I would also like to solicit guest posts from anyone who has experience with any of these different types of robust design.

Fault-tolerant design focuses on keeping the system running or safe in spite of failures. These techniques are commonly applied in high-value designs where people’s lives are at stake (such as airplanes, spacecraft, and automobiles), but there are also techniques that can be applied to lower-impact consumer designs (think exploding batteries, which I’ll expand on in the next post).
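
To make the idea concrete, here is a minimal sketch (in C) of one classic fault-tolerant building block, a triple modular redundancy voter. The channel names, tolerance value, and safe default are hypothetical illustrations for this post, not anything taken from the Toyota design.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical illustration: three independent readings of the same
       quantity are compared and the majority value wins. One faulty channel
       is outvoted; if all three disagree, the caller is told to fall back
       to a safe default. Names and limits are invented for this sketch. */

    #define AGREE_TOLERANCE  2u   /* ADC counts treated as "the same reading" */
    #define SAFE_DEFAULT     0u   /* value commanded when no majority exists  */

    static bool close_enough(uint16_t a, uint16_t b)
    {
        return (a > b ? a - b : b - a) <= AGREE_TOLERANCE;
    }

    /* Return the voted value, or SAFE_DEFAULT when no two channels agree. */
    uint16_t tmr_vote(uint16_t ch_a, uint16_t ch_b, uint16_t ch_c, bool *voted_ok)
    {
        *voted_ok = true;
        if (close_enough(ch_a, ch_b)) return ch_a;
        if (close_enough(ch_a, ch_c)) return ch_a;
        if (close_enough(ch_b, ch_c)) return ch_b;

        *voted_ok = false;      /* all three channels disagree: report the fault */
        return SAFE_DEFAULT;    /* and command a safe state                      */
    }

A single voter like this masks only one bad channel; the triple-fault-tolerant systems mentioned above need more redundant channels and redundant voters, because the voter itself is a potential failure point.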

Sandbox design focuses on controlling the environment so that failures cannot occur. Ever wonder why Apple’s new iPad does not support third-party multitasking?
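
As a rough illustration of the sandbox idea, the hypothetical sketch below (in C) shows a firmware command handler that executes only requests on an explicit whitelist; the command names are invented for the example. By refusing anything outside the controlled set, the design keeps untested inputs from ever reaching the rest of the system.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical illustration: a command interpreter that only runs
       requests on an explicit whitelist. Unknown requests are rejected,
       so unexpected input cannot push the system into an untested state. */

    static const char *const allowed_cmds[] = { "STATUS", "RESET", "READ_TEMP" };

    bool command_is_allowed(const char *cmd)
    {
        for (size_t i = 0; i < sizeof(allowed_cmds) / sizeof(allowed_cmds[0]); i++) {
            if (strcmp(cmd, allowed_cmds[i]) == 0) {
                return true;
            }
        }
        return false;   /* default-deny: anything not explicitly permitted never runs */
    }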

Patch-it design focuses on fixing problems after the system is in the field. This is a common approach for a lot of software products where the consequences of failures in the end system are not catastrophic and where implementing a correction is low cost.

Disposable design focuses on the issues that come with a short product life span. This affects robust design decisions in a meaningfully different way than the other three types of design.

The categories I’ve proposed are system level in nature, but I think the concepts we can uncover in a discussion would apply to all of the disciplines required to design each component and subsystem in contemporary projects.

[Editor's Note: This was originally posted on EDN.]


5 Responses to “Robust Design”

  1. A. @EP says:

    “The fancier you make the plumbing, the easier it is to stop up the drain”

    Lt Cmdr Montgomery Scott
    USS Enterprise

    Every time a feature is added, it takes 2 revs to correct.

  2. D.W. @EP says:

    The big problem that everyone seems to want to avoid acknowledging is that software is different from hardware. It is so because of how the software culture developed. From the beginning, software has been treated as an academic exercise, with results delivered on a best-effort basis. We accept things from Microsoft et al. that would be actionable in a hardware product such as an appliance.

    This differs from hardware where, over time, we had to acknowledge that poorly designed hardware (bridges, elevators, airplanes, machinery in factories, etc.) kills and maims people. Criminal negligence applies.

    Now, software is starting to kill people. Yes, that is what this is about. If it is a software issue and people have died as a result, the bad software has killed those people. As this grows worse with more complex software systems being used to control machinery (cars, medical machines …), this will not be allowed to stand. The companies that built the machines that hurt people will be sued, and the programmers will be pulled into the chaos. This is happening as we watch.

    Eventually, the companies (and their executives) will start to protect themselves by creating methods of genuinely preventing these kinds of problems. We went through this in the industrial revolution. Fast forward to machine design and safety guards, etc. OSHA will be coming to software as well, and intrusively.

    Toyota, like many others, may have innocently stumbled into this fact, being originally a hardware (automobile) company. They are not the first, just the latest. For example, how did that Airbus airliner recently crash in the middle of the ocean? The sensor hardware (iced-up air sensors) was blamed, but the software is – necessarily – what drove the plane into the ocean.

    Stay tuned. If software cannot be trusted, it will be restricted until it can.

  3. R. @EP says:

    Seems to me this is heading toward some very interesting insights.

    If possible (I would be interested), could you put emphasis on test-to-fail rather than test-to-pass, which means a deeper understanding of the system and of the space of robust designs.

    regds
    R.

  4. B. @EP says:

    I can design a cheap, affordable system and I can design a triple-redundant system, but I can’t do both at the same time.

  5. E. K. @EP says:

    It proves a thermodynamics principle: no human system can maintain an efficiency rate of 100%.
    It teaches me and the rest of the developing world that concern for quality is important, no matter the cost.
