Robust Design: Best Guesses

Monday, March 15th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

An important realization about building robust systems is that the design decisions and trade-offs we make are based on our best guesses. As designers, we must rely on best guesses because it is impossible to describe a “perfect and complete” specification for all but the most simple, constrained, and isolated systems. A “perfect and complete” specification is a mythical notion that assumes that it is possible to not only specify all of the requirements to describe what a system must do, but that it is also possible to explicitly and unambiguously describe everything a system must never do under all possible operating conditions.

The second part of this assumption, explicitly describing everything the system must never do, is not feasible because the complete list of operating conditions and their forbidden behaviors is infinite. A concession to practicality is that system specifications address the anticipated operating conditions and those operating conditions with the most severe consequences – such as injury or death.

As the systems we design and build continue to grow in complexity, so too does the difficulty in explicitly identifying all of the relevant use cases that might cause a forbidden behavior. The shortcut of specifying that the system may never act a certain way under any circumstance is too ambiguous. Where do you draw the line between reasonable use cases and unreasonable ones? For example, did Toyota pursue profits at the expense of safety by knowingly ignoring the potential for unwanted acceleration? And where is the threshold between a potential problem you can safely ignore and one you must react to? Maybe sharing my experience (from twenty years ago) with a highly safe and reliable automobile can stimulate some ideas on defining such a threshold.

After a few months of ownership, my car would randomly stall while at full freeway speeds. I brought the car into the dealership three separate times. The first two times, they could not duplicate the problem, nor could they find anything that they could adjust in the car. The third time I brought the car in, I started working with a troubleshooter who was flown in from the national office. Fortunately, I was able to duplicate the problem once for the troubleshooter, so they knew this was not just a potential problem, but a real event. It took two more weeks of full-time access to the car for the troubleshooter to return the car to me with a fix.

I spoke with the technician and he shared the following insights with me. I was one of about half a dozen people in the entire country who were experiencing this problem. The conditions required to manifest this failure were specific. First, it only happened on very hot (approximately 100 degrees) and dry days. Second, the car had to be hot from sitting out in the direct sun for some time. Third, the air conditioning unit needed to be set to the highest setting while the car was turned on. Fourth, the driver of the car had to have a specific driving style (the stalls never happened to my wife, who has a heavier foot on the accelerator than I do).

It turns out the control software for managing the fuel had two phases of operation. The first phase ran for the first few minutes after the car was started, and it characterized the driving style of the driver to set the parameters for managing the fuel delivered to the engine. After a few minutes of operating the car, the second phase of operation, which never modified the parameter settings, took over until the vehicle was turned off. My driving style when combined with those other conditions caused the fuel management parameters to deliver too little fuel to the engine under a specific driving condition which I routinely performed while on the freeway.
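The two-phase behavior described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the manufacturer's algorithm: the class name `FuelController`, the 180-second learning window, the gain formula, and the units are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class FuelController:
    """Hypothetical two-phase fuel manager; every name and number is illustrative."""

    learning_window_s: float = 180.0  # assumed length of the "first few minutes"
    base_fuel_rate: float = 1.0       # arbitrary units
    gain: float = 1.0                 # learned in phase one, frozen in phase two
    _samples: list = field(default_factory=list)
    _frozen: bool = False

    def update(self, elapsed_s: float, throttle: float) -> float:
        """Return the fuel rate for a throttle position between 0.0 and 1.0."""
        if not self._frozen:
            if elapsed_s < self.learning_window_s:
                # Phase one: characterize the driver and keep adapting the gain.
                self._samples.append(throttle)
                self.gain = 0.5 + mean(self._samples)
            else:
                # Phase two starts: the gain is not modified again until restart.
                self._frozen = True
        return self.base_fuel_rate * self.gain * throttle


# A light-footed driver ends up with a smaller gain than a heavy-footed one,
# and that gain stays fixed for the rest of the trip.
light, heavy = FuelController(), FuelController()
for t in (0, 60, 120):
    light.update(t, 0.2)
    heavy.update(t, 0.8)
print(light.update(300, 0.5), heavy.update(300, 0.5))
```

The point of the sketch is that anything learned in the first window, under whatever unusual conditions happen to hold at startup, governs the rest of the trip. A parameter set that is slightly too lean has no chance to correct itself until the car is restarted.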

So it was a software problem, right? Well, not exactly; there was one more condition that was necessary to create this problem. The Freon for the air conditioning unit had to be at least slightly overcharged. Once the technician set the Freon charge level to no more than full charge, the problem went away and I never experienced the problem again over 150k miles of driving. I always made sure that we never overcharged the Freon when recharging the system.

I imagine there could have been a software fix that used a modified algorithm that also measured and correlated the Freon charge level, but I do not know if that automobile manufacturer followed that course or not for future vehicles.

So how do you specify such an esoteric use-case before experiencing it?

The tragedy of these types of situations is that the political, legal, and regulatory realities prevent the manufacturer of the vehicle in question from freely sharing what information it has without severely risking its own survival, even though sharing might help pinpoint more quickly the unique set of conditions required to make the event occur.

Have you experienced something that can help distinguish when and how to address potential, probable, and actually occurring unintended behaviors? I do not believe any long-term operating company puts out any product in volume with the intention of ignoring reasonable safety hazards. If a problem persists, I believe it is more likely because their best guesses have not yet been able to uncover which of the infinite possible conditions are contributing to the event.

My next post in this series will touch on ambiguity and uncertainty.


No Responses to “Robust Design: Best Guesses”

  1. D.R. @EM says:

    Here’s one on the erratic car behavior that was tough to find…
    One of my neighbor’s cars would just shut off when on the freeway. She would pull over to the side, and it would not crank…no response from turning the key for about 10 minutes and then it would start right up. This happened time and again. The dealership could NOT dupe the problem. This went on for months. As it turns out, she had one of those “speed pass” gadgets on her keyring. If you are not familiar with a speed pass, it is an RFID chip in a little circular plastic piece, and the RFID number is tied to your credit card, so all you have to do is swipe the speed pass on the reader at the gas station that issues the speed passes (instead of running your credit card through the CC reader). The car she was driving also had an RFID “smart key” ignition. So, when she was driving, the speed pass would come into range of the RFID smart-key reader in the steering column, get bad data, and shut the engine down. The designers of the smart-key system in the car didn’t consider the speed passes in their testing. This example is the inverse of the Toyota problems…

  2. D.W. @EM says:

    This is the class of the ugly error: too infrequent to get useful debug data yet too frequent and/or expensive/dangerous to ignore. This kind of problem is made worse when software is in the loop. Software is invisible. If it fails, you cannot know how unless the software gave you data to record and you recorded it. The only solution for this is “flight recorders”, also known as log files in computer systems. They record data before and after a failure. This recorded data allows the failure to be analyzed. They typically record all the sensor inputs and control outputs in a mechanical system, like an airplane. But that is not enough. The software has to put out additional data for analysis of program operation, so you can find the module(s) and conditions of failure. Without this kind of data, you are doomed.
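A minimal sketch of the "flight recorder" idea from the comment above, assuming a fixed-size in-memory ring buffer: it keeps the most recent samples at all times and, when a fault is flagged, saves those samples plus a few that follow. The class name `FlightRecorder`, the buffer sizes, and the sample fields (`rpm`, `module`) are hypothetical choices for illustration, not taken from any real vehicle system.

```python
from collections import deque


class FlightRecorder:
    """Keep the last `pre` samples; on a fault, save them plus the next `post`."""

    def __init__(self, pre: int = 5, post: int = 3):
        self._history = deque(maxlen=pre)  # ring buffer of recent samples
        self._post = post
        self._remaining = 0                # post-fault samples still to capture
        self.snapshot = []                 # saved record for later analysis

    def record(self, sample: dict, fault: bool = False) -> None:
        if self._remaining > 0:
            # Still capturing the samples that follow a fault.
            self.snapshot.append(sample)
            self._remaining -= 1
        elif fault:
            # Freeze what led up to the fault, then start the post-fault capture.
            self.snapshot = list(self._history) + [sample]
            self._remaining = self._post
        self._history.append(sample)


# Each sample carries sensor inputs, control outputs, and software state,
# so the failing module and its conditions can be identified afterwards.
recorder = FlightRecorder(pre=5, post=3)
for tick in range(20):
    sample = {"tick": tick, "rpm": 3000 - 10 * tick, "module": "fuel_phase_two"}
    recorder.record(sample, fault=(tick == 10))
print([s["tick"] for s in recorder.snapshot])
```

Even this small sketch requires guesses about how many samples to keep and which fields to log, which is where the cost question in the following comments comes from.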

  3. A. @EM says:

    What is the cost of the “flight recorder”?
    Are you ready to pay it for each car?

  4. D.R. @EM says:

    When the car makers decided to put 64+ microprocessors in the critical drive-by-wire systems in cars, they took the responsibility of creating an audit trail of the behavior of those systems and subsystems….so they can find and fix the failures. But they did it cheap and recklessly ignored the downstream problems, which shows that the auto makers are the blooming idiots of the embedded world. Toyota is getting what it deserves right now, for not being very concerned about their embedded HW and SW behaviors. Maybe the legal system will motivate them to properly design their uP systems and SW in the future.

  5. A.T. @EM says:

    Just throw this useful information up from the flight recorder onto the Nav screen:

  6. A. @EM says:

    Robert said: “As designers, we must rely on best guesses because it is impossible to describe a “perfect and complete” specification for all but the most simple, constrained, and isolated systems.”

    I think D.W.’s idea of a “flight recorder” is a great example of this concept. In order for the device to have the desired usefulness, we would have to make a series of best guesses as to what data we would need to record as well as what sensors and controls to build into the system in order to extract said data. We would also have to guess as to the interaction of the many variables in the system and let’s not forget about guessing what potential errors could occur or deviation tolerances to allow for in the design.

    Just to add a bit of a twist… anyone care to make a best guess as to how often the “flight recorder” might be a part of the problem? And would you have guessed that you should consider that in your design?

  7. P. @EM says:

    Another thing bothers me: What I see at polytechnical schools and universities is that the students ‘forget’ to learn about basic analog design and RF/EMC behaviour of those circuits. I have been in business with my small company for 10 years now: Having this knowledge brought me a little fortune. We did this through smooth design trajects with 1st class affordable electronic parts from good companies (National, Fairchild, Linear…. TI and the least favourite is Maxim due to stock problems all the time ;-) Customers simply like this and come back time and time again with new things to solve.
    And the solution most companies nowadays offer (Linear and now National) is to put ‘solutions in a box’ like small SMPS circuits, etc. You never completely know the behaviour, so always take very good care. These things are the basis of design failures, because the basic knowledge ‘seems’ not important anymore.

    (C. M.)

  8. A. @EM says:

    Crash Data Recorders

    The recorders are already in place on a lot of vehicles

