Robust Design: Ambiguity and Uncertainty

Monday, March 22nd, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master blog.]

Undetected ambiguity is the bane of designers. Unfortunately, the opportunities for ambiguity to manifest in our specifications and designs are numerous, and they are easy to miss. Worse, when an ambiguity is discovered because two or more groups on a design team interpreted some information differently, the last person or team that touched the system often gets the blame – and that almost always is the software team.

For example, in the Best Guesses comments, DaveW points out that

“… This kind of problem is made worse when software is in the loop. Software is invisible. If it fails, you cannot know [how] unless the software gave you data to record and you recorded it.”

A problem with this common sentiment is unambiguously determining what constitutes a software failure. I shared in the lead-in Robust Design post that

“… Just because a software change can fix a problem does not make it a software bug – despite the fact that so many people like to imply the root cause of the problem is the software. Only software that does not correctly implement the explicit specification and design are truly software bugs. Otherwise, it is a system level problem that a software change might be more economically or technically feasible to use to solve the problem – but it requires first changing the system level specifications and design. This is more than just a semantic nit”

Charles Mingus offers a system perspective that hints at this type of problem:

“… And the solution most companies nowadays offer (Linear and now National) is to put ‘solutions in a box’ like small SMPS circuits, etc. You never completely know the behaviour, so always take very good care. These things are the basis of design failures, because the basic knowledge ‘seems’ not important anymore.”

Pat Ford, in the “Prius software bug?” LinkedIn discussion, observes that

“…this isn’t just a software bug, this is a systems design bug, where multiple subsystems are improperly implemented.”

So how do these subsystems get improperly implemented? I contend that improperly implemented systems are largely the result of ambiguity in the system specifications, design assumptions, and user instructions. A classic visual example of ambiguity is an image that you can see either as a vase or as two human faces looking at each other. Another classic example is an image that you can interpret as either a young woman or an old woman. If you are not familiar with these images, please take some time to find both interpretations in each of them.

These two images are not so much optical illusions as they are examples of interpreting the same data in two different equally valid ways. I believe one reason why these images have at least two equally valid interpretations is that they are based on symbolic representations of the things that you can interpret them to represent. Symbols are imprecise and simplified abstractions of objects, concepts, and ideas. If you were dealing with the actual objects, the different interpretations might not be equally valid anymore.

Now consider how engineers and designers create systems. They typically describe the system in a symbolic language: natural language, in either a free or a structured format. It is one thing to describe all the things the system is, but it is a much different problem to explicitly describe all the things that the system is not.

To illustrate the weakness of describing something purely in natural language, consider how you teach someone to do a new task they have never done before. Do you explain everything in words and then leave them to their own devices to accomplish the task? Do you show them how to do it the first time?

This is the same type of problem that development tool providers have to address each time they release a new development kit, and they are increasingly adopting video or animated walkthroughs to improve the rate at which users successfully adopt their systems. And this problem does not apply just to designers – it affects all types of end systems as well.

In the Best Guesses post, I talked about how a set of conditions had to coincide, one of which was that the Freon in the air conditioning unit had to be overcharged. How would you have written the instructions for properly charging the Freon in such a system? Would the instructions specify what defined a full charge? To what precision would you have specified a minimum and maximum tolerable charge – or would you have? When using language to describe something, there is a chance that certain types of information are so well understood by everyone in your circle that you never explicitly describe them. This is fine until someone from outside that circle applies a different set of assumptions because they came from a different environment, one that made different arbitrary decisions appropriate to its own operating conditions.

I was recently reminded of this concept with the iRobot Roomba vacuum that I own. I went through a larger learning curve than I expected with regard to cleaning all of the brushes, because some of the places you need to clear out are not immediately obvious until you understand how the vacuum works. But the real kick in the head came when I tried to use the brush cleaning tool. I read the instructions for the tool in the manual, and they say

“Use the included cleaning tool to easily remove hair from Roomba’s bristle brush by pulling it over the brush.”

Are these instructions simple enough that there is no room for ambiguity and misinterpretation? Well, I found the wrong way to use the tool, and judging from customer comments about the cleaning tool, so have other people. Mind you, this is a tool with a very limited number of possible ways to be used, but until you understand how it works, it is possible to use it incorrectly. I realized that the symbolic graphic on the side of the tool could be interpreted in at least two different, equally valid ways because of the positioning and use of a triangle symbol, which could represent the tool, indicate the direction in which the brush should be used, or point to the place where the brush should enter the tool. Now that I understand how the tool works, the instructions and symbols make sense, but until I actually saw the tool work, they did not.

So not only is the specification for a system – one that has never existed before – often written in a symbolic language, but so is the software that implements that system, as well as the user/maintenance manual for that system. Add to this that design teams consist of an ever larger number of people who do not necessarily work in the same company, industry, or even country. The opportunity for local, regional, and global cultural differences amplifies the chances that equally valid but incompatible interpretations of the same data can arise.

Consider the fate of the 1998 Mars Climate Orbiter, which was lost because of a mismatch between Imperial and Metric units: one piece of software supplied thruster impulse data in pound-force seconds while the software that consumed the data expected newton seconds. The opportunity to inject the mismatch into the system arose because different pieces of software assumed different units at a shared interface, and inadequate integration testing allowed the error to go undetected.
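
To make that failure mode concrete, here is a minimal sketch in C of one way to keep a unit mismatch from crossing an interface silently. It is not the actual flight code, and the names and values are illustrative; the point is that wrapping each unit in its own type forces an explicit, named conversion that the compiler can check.

    /*
     * A sketch, not the actual flight code: distinct struct types make an
     * Imperial/metric mix-up a compile-time error instead of a silent one.
     */
    #include <stdio.h>

    typedef struct { double value; } lbf_seconds;    /* Imperial impulse (lbf*s) */
    typedef struct { double value; } newton_seconds; /* SI impulse (N*s)         */

    /* The only way across the boundary is an explicit, named conversion. */
    static newton_seconds to_newton_seconds(lbf_seconds imp)
    {
        newton_seconds out = { imp.value * 4.4482216152605 }; /* 1 lbf = 4.448... N */
        return out;
    }

    /* The trajectory model accepts SI units only. */
    static void apply_impulse(newton_seconds impulse)
    {
        printf("Applying impulse of %.3f N*s\n", impulse.value);
    }

    int main(void)
    {
        lbf_seconds ground_report = { 2.5 };  /* value produced in lbf*s */

        /* apply_impulse(ground_report); <-- would not compile: wrong unit type */
        apply_impulse(to_newton_seconds(ground_report));
        return 0;
    }

The commented-out call is exactly the kind of mistake that would otherwise sail through: an Imperial value handed to code that silently assumes metric.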

I saw a similarly painful failure on a project involving the control system for a spacecraft, when the team decided to replace the 100 Hz inertial measurement unit with a 400 Hz unit. The failure was spectacular and completely avoidable.
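
The sketch below is hypothetical (the names, rates, and values are illustrative, not taken from that project), but it shows how that kind of failure happens: when the sample period is baked into the code as an unstated 100 Hz assumption, swapping in a 400 Hz sensor silently scales every integration step by a factor of four, whereas passing the period explicitly makes the assumption visible at each call site.

    #include <stdio.h>

    /* Buggy pattern: the 100 Hz assumption is buried in a constant. */
    #define DT_ASSUMED 0.01  /* seconds; only true for the original 100 Hz IMU */

    static double integrate_rate_fixed(double angle, double rate)
    {
        return angle + rate * DT_ASSUMED;
    }

    /* Safer pattern: the sample period is part of the interface. */
    static double integrate_rate(double angle, double rate, double dt)
    {
        return angle + rate * dt;
    }

    int main(void)
    {
        const double rate = 1.0;            /* deg/s reported by the IMU       */
        const double dt_new = 1.0 / 400.0;  /* sample period of the 400 Hz IMU */
        double angle = 0.0;
        int i;

        /* With the hidden 100 Hz assumption, one second of 400 Hz samples
           integrates to 4.00 degrees instead of 1.00. */
        for (i = 0; i < 400; i++) angle = integrate_rate_fixed(angle, rate);
        printf("hidden-assumption estimate after 1 s: %.2f deg\n", angle);

        angle = 0.0;
        for (i = 0; i < 400; i++) angle = integrate_rate(angle, rate, dt_new);
        printf("explicit-period estimate after 1 s: %.2f deg\n", angle);
        return 0;
    }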

The challenge, then, is this: how do we as designers increase our chances of spotting these ambiguities when they exist in our specifications and design choices – especially in evolving systems that experience changes in the people working on them? Is there a way to properly capture the tribal knowledge that is taken for granted? Are there tools that help you avoid shipping your end products with undiscovered time bombs?

I proposed four different robust design principles in the lead-in post for this series. My next post in this series will explore the fault-tolerance principle for improving the success of our best guesses and minimizing the consequences of ambiguities and uncertainty.


2 Responses to “Robust Design: Ambiguity and Uncertainty”

  1. D.W. @EM says:

    Ambiguity is the possibility of misinterpretation, proportional to lack of direct experience. Once you have “done it” or “used it,” you assume that is the way to interpret anything that asks you to do it or use it. No ambiguity. For example, there are two types of maps. The first is a map for someone who has never been “there” before, a first-time map. The second is a reminder map for someone who has been “there” before. The second type of map is useless to the first-time visitor. It assumes you recognize the landmarks. But the second kind is usually the one given. It is very tough to create the first map, because it is hard to remember what your thinking was like before you knew the landmarks.

    And detailed explanation is not the absolute remedy. You explain new things in terms of things already known. And as McCarthy of AI fame said, “There must be some knowledge. You can explain nothing to a stone.” You need to identify what you know versus what the installer, user, or maintainer knows and use explanation to bridge the gap. But you cannot consciously identify all you know that is relevant to the design, and you do not know what the user knows – and does not know – without asking. Making assumptions in these cases leads to disasters.

    Reality bites, and you have to ship something. Do the best manual you can, realizing it will have holes. Look for things that are not in the experience of the user, and expect problems with those things, realizing that you will not catch all of them. Then, listen to the field.

    Software is one of these arcane things. Designs today put a lot of functionality and associated complexity in the software. And software is a black box in a system, when viewed from the outside. In mechanical systems, you can often take off the cover and figure out how it works. Not so with software. Sometimes, not even with the source code!

    The software box becomes a major abstraction of how the thing works. Joel Spolsky’s “Law of Leaky Abstractions” applies. When it works, all is fine. But when it breaks, you have to know how it works at the next level down to fix it. If your car engine does not start one morning, you need to know the next level down: you need a charged battery and gas in the tank to start the car, etc.

    This brings us to measurements. If the car does not start, headlights and a gas gauge will tell us if the battery has charge and there is gas in the tank. In an embedded system, this means that there need to be some indicators – flashing LEDs, etc. – that tell you if things are OK at the next level down. In a software-driven system, this usually means that you have to add software to generate these readouts. And while you are at it, you should probably run a log file that records the previous few seconds before a crash, so that you can figure out what happened.
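
    A minimal sketch of that kind of rolling log, with illustrative names and sizes: a fixed-size ring buffer holds the most recent samples, each new sample overwrites the oldest one, and a fault handler can dump the whole buffer afterwards.

        #include <stdio.h>
        #include <stdint.h>

        #define LOG_DEPTH 256u  /* e.g. ~2.5 s of history at 100 Hz */

        typedef struct {
            uint32_t timestamp_ms;
            int32_t  sensor_value;
        } log_entry_t;

        static log_entry_t log_buf[LOG_DEPTH];
        static unsigned    log_head;  /* next slot to overwrite */

        /* Record one sample; the oldest entry is silently overwritten. */
        static void log_sample(uint32_t t_ms, int32_t value)
        {
            log_buf[log_head].timestamp_ms = t_ms;
            log_buf[log_head].sensor_value = value;
            log_head = (log_head + 1u) % LOG_DEPTH;
        }

        /* After a fault, walk the buffer from oldest to newest and emit it. */
        static void dump_log(void)
        {
            unsigned i;
            for (i = 0; i < LOG_DEPTH; i++) {
                const log_entry_t *e = &log_buf[(log_head + i) % LOG_DEPTH];
                printf("%lu ms: %ld\n",
                       (unsigned long)e->timestamp_ms, (long)e->sensor_value);
            }
        }

        int main(void)
        {
            uint32_t t;
            for (t = 0; t < 1000u; t += 10u)
                log_sample(t, (int32_t)(t / 10u)); /* stand-in for real sensor data */
            dump_log();                            /* what a fault handler would do */
            return 0;
        }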

  2. Mavi Jeans says:

    Good share, great article, very useful for us…thanks.
