[Editor's Note: This was originally posted on the Embedded Master]
An important realization about building robust systems is that the design decisions and trade-offs we make are based on our best guesses. As designers, we must rely on best guesses because it is impossible to describe a “perfect and complete” specification for all but the most simple, constrained, and isolated systems. A “perfect and complete” specification is a mythical notion that assumes that it is possible to not only specify all of the requirements to describe what a system must do, but that it is also possible to explicitly and unambiguously describe everything a system must never do under all possible operating conditions.
The second part of this assumption, explicitly describing everything the system must never do, is not feasible because the complete list of operating conditions and their forbidden behaviors is infinite. A concession to practicality is that system specifications address the anticipated operating conditions and those operating conditions with the most severe consequences – such as injury or death.
As the systems we design and build continue to grow in complexity, so too does the difficulty in explicitly identifying all of the relevant use cases that might cause a forbidden behavior. The short cut of specifying that the system may never act a certain way under any circumstance is too ambiguous. Where do you draw the line between reasonable use-cases and unreasonable ones? For example, did Toyota pursue profits at the expense of safety by knowingly ignoring the potential for unwanted acceleration? But what is the threshold between when you can safely ignore or must react to a potential problem? Maybe sharing my experience (from twenty years ago) with a highly safe and reliable automobile can stimulate some ideas on defining such a threshold.
After a few months of ownership, my car would randomly stall while at full freeway speeds. I brought the car into the dealership three separate times. The first two times, they could not duplicate the problem, nor could they find anything that they could adjust in the car. The third time I brought the car in, I started working with a troubleshooter that was flown in from the national office. Fortunately, I was able to duplicate the problem once for the troubleshooter, so they knew this was not just a potential problem, but a real event. It took two more weeks of full time access to the car for the troubleshooter to return the car to me with a fix.
I spoke with the technician and he shared the following insights with me. I was one of about half a dozen people in the entire country that was experiencing this problem. The conditions required to manifest this failure were specific. First, it only happened on very hot (approximately 100 degrees) and dry days. Second, the car had to be hot from sitting out in the direct sun for some time. Third, the air conditioning unit needed to be set to the highest setting while the car was turned on. Fourth, the driver of the car had to have a specific driving style (the stalls never happened to my wife who has a heavier foot on the accelerator than I do).
It turns out the control software for managing the fuel had two phases of operation. The first phase ran for the first few minutes after the car was started, and it characterized the driving style of the driver to set the parameters for managing the fuel delivered to the engine. After a few minutes of operating the car, the second phase of operation, which never modified the parameter settings, took over until the vehicle was turned off. My driving style when combined with those other conditions caused the fuel management parameters to deliver too little fuel to the engine under a specific driving condition which I routinely performed while on the freeway.
So it was a software problem right? Well, not exactly, there was one more condition that was necessary to create this problem. The Freon for the air conditioning unit had to be at least slightly overcharged. Once the technician set the Freon charge level to no more than full charge, the problem went away and I never experienced the problem again over 150k miles of driving. I always made sure that we never overcharged the Freon when recharging the system.
I imagine there could have been a software fix that used a modified algorithm that also measured and correlated the Freon charge level, but I do not know if that automobile manufacturer followed that course or not for future vehicles.
So how do you specify such an esoteric use-case before experiencing it?
The tragedy of these types of situations is that the political, legal, and regulatory realities prevent the manufacturer of the vehicle in question from freely sharing what information they have, and possibly being able to more quickly pinpoint the unique set of conditions required to make the event occur, without severely risking their own survival.
Have you experienced something that can help distinguish when and how to address potential from probable from actually occurring unintended behaviors? I do not believe any long term operating company puts out any product in volume with the intention of ignoring reasonable safety hazards. If a problem persists, I believe it is more likely because their best guesses have not yet been able to uncover which of the infinite possible conditions are contributing to the event.
My next post in this series will touch on ambiguity and uncertainty.