Extreme Processing Thresholds: Low Power #1

Friday, April 2nd, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

In the previous Extreme Processing post about low cost processing options, I touched on what techniques processor vendors are using to drive down the price of their value line devices. However, the focus of these companies is not just on low price, but on delivering the best parts to match the performance, power, and price demands across the entire processing spectrum. Semir Haddad, Marketing Manager of the 32-bit ARM microcontrollers at STMicroelectronics, shares “Our goal ultimately is to have one [processor part] for each use case in the embedded world, from the lowest-cost to the highest-end.”

In addition to extreme low-cost parts, there is increasing demand for processors that support longer battery life. As with low-cost processor announcements, there is a bit of marketing specmanship when releasing a device that pushes the leading edge of lowest energy usage for a microcontroller. The ARM Cortex-M3 based EFM32 Gecko microcontrollers from EnergyMicro claim 180 μA/MHz active-mode current consumption. Texas Instruments’ 16-bit ultra-low-power line of MSP430 microcontrollers claims 165 μA/MIPS active-mode current consumption. Microchip’s new 8-bit PIC1xF182x microcontrollers claim less than 50 μA/MHz active current consumption.

There are many ways to explore and compare low power measurements, and there have been a number of exchanges between the companies, including white papers and YouTube videos. We can explore some of these claims over the next few posts and discussions, but for this post, I would like to focus on whether the μA/MHz benchmark is appropriate or whether there is a better way for low power processor vendors to communicate their power consumption to you. In the case of the Texas Instruments part, 1 MHz = 1 MIPS when there is no CPU clock divider.
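As a rough, hedged illustration of why a single μA/MHz figure rarely tells the whole story, the back-of-the-envelope calculation below also folds in duty cycle, sleep current, and battery capacity. Every number in it is a placeholder chosen for illustration, not a vendor specification.

#include <stdio.h>

/* Back-of-the-envelope battery life estimate. All numbers are
 * illustrative placeholders, not vendor specifications. */
int main(void)
{
    double active_uA_per_MHz = 180.0;  /* hypothetical active-mode figure */
    double clock_MHz         = 8.0;
    double sleep_uA          = 1.0;    /* hypothetical sleep current      */
    double duty_cycle        = 0.01;   /* active 1% of the time           */
    double battery_mAh       = 220.0;  /* roughly a coin-cell capacity    */

    double active_uA  = active_uA_per_MHz * clock_MHz;
    double average_uA = duty_cycle * active_uA + (1.0 - duty_cycle) * sleep_uA;
    double hours      = (battery_mAh * 1000.0) / average_uA;

    printf("Average current: %.2f uA\n", average_uA);
    printf("Estimated battery life: %.0f hours (%.1f years)\n",
           hours, hours / (24.0 * 365.0));
    return 0;
}

At a one percent duty cycle the sleep current dominates the result, which is exactly the kind of context a bare μA/MHz number omits.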

If the μA/MHz benchmark for active operation is appropriate for you, is there any additional information you need disclosed with the benchmark so that you can make an educated judgment and comparison between similar and competing parts? The goal here is to help suppliers communicate the information you need to make decisions more quickly. I have a list of characteristics I think you might need along with the benchmark value, and I will share it with you in the next post after you have a chance to discuss it here.

If the μA/MHz benchmark is not appropriate for you, what would be a better way to communicate a device’s relevant power consumption scenarios? I suspect the μA/MHz benchmark is popular in the same way that MIPS benchmarks are popular – because they are a single, simple number that is easy to measure and compare. The goal here is to highlight how to get the information you most need more quickly, easily, and consistently. I have some charts and tables to share with you in the follow-on post.

Question of the Week: When does it make sense to use an RTOS or operating system?

Wednesday, March 31st, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

I posted last week’s question about dynamic memory allocation here and in a group discussion at LinkedIn. There were many thoughtful comments at both locations, with the majority of responses made in the LinkedIn discussion. I’m trying to figure out a way to enable you to see the responses from both groups in one place. If you have any ideas, please let me know.

This week’s question expands on one of the underlying premises behind last week’s question – that the size of the system, real time requirements, and reliability affect your answer to the question. For this week, under what conditions do you use or avoid using an RTOS or operating system in your embedded designs?

Depending on where you draw the line on what constitutes a system, I have built embedded systems with none, one, or even multiple operating systems. In general, the smaller systems relied on an infinite loop construct or a manually built scheduler; these systems often had hard real-time and high reliability requirements. In some of the systems, we used an RTOS primarily to leverage the communication stack support it offered when the system was in a near-real-time operating mode. When the system switched to a hard real-time mode, the control software no longer made any RTOS calls.
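For readers who have not built one, a minimal sketch of the infinite loop construct with a manually built scheduler mentioned above might look like the following. The task names, the 10 ms and 100 ms frame times, and the timer-driven tick variable are assumptions for illustration, not code from those projects.

#include <stdint.h>

/* Hypothetical task and timer hooks; in a real system these would be
 * hardware- and project-specific. */
extern volatile uint32_t g_tick_ms;   /* incremented by a timer interrupt */
extern void read_sensors(void);
extern void run_control_law(void);
extern void update_outputs(void);
extern void service_comms(void);

int main(void)
{
    uint32_t next_control = 0;
    uint32_t next_comms   = 0;

    for (;;) {                                      /* the classic infinite loop */
        uint32_t now = g_tick_ms;

        if ((int32_t)(now - next_control) >= 0) {   /* 10 ms hard real-time frame */
            next_control = now + 10;
            read_sensors();
            run_control_law();
            update_outputs();
        }
        if ((int32_t)(now - next_comms) >= 0) {     /* 100 ms background frame */
            next_comms = now + 100;
            service_comms();
        }
    }
}

The appeal of this structure in hard real-time, high reliability systems is that every execution path is visible and its worst-case timing can be measured directly.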

The only context in which we used a full-blown operating system was for ground and test support equipment that provided a near-real-time, data-rich user interface for the human operator and for non-real-time post-operational analysis. In these cases, the operating system reduced the complexity of driving application-level processing such as receiving inputs from the operator, displaying messages and data to the operator, and communicating with other remotely located components of the systems.

Based on these projects, the team’s decision to use or avoid using an RTOS correlated strongly with the real time and reliability constraints of the system as well as how intimately the software was coupled to controlling and monitoring the hardware components.

Contemporary processors and RTOSes offer coupled resources, such as memory protection units and padded-cell virtualization support, which may blur the criteria for when it is appropriate to use or avoid using an RTOS. I’m trying to tease out the conditions, such as resource limits, reliability requirements, project scheduling, and real-time operating constraints, under which the trade-offs favor adopting or dropping RTOS support. So when answering this question, please try to identify the key criteria for your application space that pushed your decision one way or the other.

Robust Design: Fault Tolerance

Monday, March 29th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

Designing a system for fault tolerance is a robust design principle for building systems that will continue to operate correctly, or in an acceptably degraded fashion, when something goes wrong. This approach is appropriate for systems that are expected to experience failures or that operate in environments too complex to completely account for and control all of the potential forces acting on the system.

Implementing fault tolerance usually involves some level of redundancy in the system, which increases the system’s overall cost and complexity, so the decision to adopt a fault tolerant approach is usually based on the cost to implement the tolerance, the consequences of allowing the failure to occur, and the probability of the failure occurring.

The basic tenet of a fault tolerant design is that no single failure, of any nature, can cause the system to completely stop operating. To deliver a fault tolerant system, such designs usually employ some type of redundancy in the subsystems so that if one of the subsystems fails, for whatever reason, the redundant subsystem can keep the overall system operating correctly. A low-tech example of this concept is large trucks that rely on many more than four tires to carry their load. Even when a single tire fails, the truck can continue to carry on its primary task because the load the failed tire was carrying is spread out among the remaining tires.
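In electronics, one common software-visible form of this redundancy is triple modular redundancy, in which three independent channels feed a majority vote. The sketch below is a minimal illustration of that voting step; the exact-match-then-median fallback is my own choice for the example, not a prescription.

#include <stdint.h>

/* Majority vote across three redundant readings. If no two channels agree,
 * fall back to the median so a single faulty channel cannot drive the
 * output to an extreme value. */
static int32_t vote3(int32_t a, int32_t b, int32_t c)
{
    if (a == b || a == c) return a;
    if (b == c)           return b;
    /* No agreement: return the median of the three values. */
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

As the excerpt below points out, voting like this only buys fault tolerance if the three channels fail independently; three copies of the same flaw will happily outvote the truth.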

Common fault tolerant electronic subsystems that employ different strategies for fault tolerance (we will explore each type of fault tolerant approach in more detail in future posts) include computer memories, disk-based data storage, electronic communications, and “padded-cell” operating system virtualization. However, merely relying on redundant subsystems is insufficient for a robust design. In a recent article claiming Toyota’s acceleration issue is due to electronics, several failure experts share that

“… Automakers claim that no danger is posed because they build in “redundant systems”—but that’s not foolproof unless they are truly independent, according to the engineers. EMI can affect both systems the same way. Anderson showed that the two systems lie physically next to each other. It would seem that interference affecting one would affect the other.

A safety override must be a totally independent system, said Armstrong. Safety cannot be achieved by relying only on complex electronic systems. To reduce risk to an acceptable level, independent “fail safes” or backup systems are required. “But the auto industry continues to ignore standard safety engineering principles … even though a modern vehicle is actually a computer controlled machine,” writes Armstrong.”

A fault tolerant design must account for the entire environment that the system resides within. As a result, it is impossible to build a fault tolerant design solely with software because the processor core itself represents a single point of failure. There must be at least some redundancy in the physical components, and complete independence between the redundant components, to accomplish a fault tolerant design.

The EMI statement in the excerpt is potentially incomplete, and this shows in its implicit assumption that the two subsystems are identical. While the two subsystems in question may lie next to each other, the statement does not qualify whether they are identical instances of the same subsystem. Even if they are logically (software) identical, each box might use different types of shielding and data interfaces, making them immune or susceptible to different types of EMI in different ways. The two systems might also use different processor cores and memories, as well as different software, such that they have different susceptibilities to EMI.

An important point is that it is not immediately obvious what is sufficient to build a fault tolerant design. In the next post about fault tolerance, I will explore fault analysis and fault injection as a technique to increase confidence in the approach of the fault tolerant design.

Extreme Processing Thresholds: Low Price

Friday, March 26th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

Exploring processing thresholds is a tricky proposition. There is a certain amount of marketing specmanship when you are releasing a product that extends some limit of a processing option – say price, power, performance, or integration. It is helpful to understand how the supplying semiconductor vendor is able to meet the new threshold so you can better understand how those trade-offs will or will not affect any design you might choose to consider that new part in.

To lead off this series, I am looking at processors that cross new low price thresholds because there have been a handful of announcements for such parts in the past few months. Texas Instruments’ 16-bit MSP430 parts represent the lowest publicly priced of the group, starting at $0.25. Moving up the processing scale points our attention to NXP’s 32-bit Cortex-M0 processors, which start at $0.65. Rounding out the top end of the batch of new value-priced processors are STMicroelectronics’ 32-bit Cortex-M3 processors, which start at $0.85.

In looking at these announcements, be aware that the pricing information is not an apples-to-apples comparison. While all of the parts in the announced processor families can address a range of application spaces and overlap with each other, each of these specific announcements is significant to a different application space. What is most relevant with each of these processors is that each potentially crosses a cost threshold for a given level of processing capacity, so that existing designs using a processor at that same price point, but delivering less capability, can now consider incorporating new features with a larger processor than was previously available at that price point. The other relevant opportunity is for applications that were not using processors before, because processors cost too much, and that can now economically implement a function with one.

When looking at these types of announcements, there are a few questions you might want to get answers for. For example, what volume of parts must you purchase to get that price? The Cortex-M0 and -M3 pricing is for 10,000 units. This is a common price point for many processor announcements, but you should never assume that all announced pricing is at that level. For example, the MSP430 announcement pricing is for 100,000 units. The announced 1,000-unit pricing for the MSP430G2001 is $0.34. To get an idea of how much volume purchasing can drop the price, VC Kumar, MSP430 MCU product marketing at Texas Instruments, shares that the pricing for the G2001 part drops to around $0.20 at 1,000,000 units. Fanie Duvenhage, Director of Product Marketing/Apps/Architecture for the Security, Microcontroller & Technology Development Division at Microchip, points out that for roughly the past five years, very high-volume, small microcontrollers have been available at unit prices in the $0.10 to $0.15 range. So there is a wide range of processing options at a variety of price points.

So what do these suppliers have to do to be able to sell their processors at these lower prices? According to Joe Yu, Strategic Business Development at NXP Semiconductors, choosing the right core with the right process technology has the largest impact on lowering the price threshold of a processor. The packaging choice has the second largest impact on pricing thresholds. After that, reducing Flash, then RAM, and then individual features are choices that a processor supplier can make to further lower the price point.

VC Kumar shares that the latest MSP430 part achieves its price point using the same process node as other MSP430 devices. The lower price point is driven by smaller on-chip resources and by taking into account the boundary conditions that the processor will have to contend with. By constraining those boundary conditions, certain value-priced parts can use less expensive, lower fidelity IP blocks for different functions. As an example, standard MSP430 parts can include a clock module configuration that supports four calibrated frequencies with ±1% accuracy, while the value-line sister parts use a clock module configuration that supports a single calibrated frequency with no guarantee of ±1% accuracy.

Another area of controversy for processors that push the low end of the pricing model is how many on-chip resources they provide. To reach these price points, the on-chip resources are quite constrained. For example, the Cortex-M3 part includes 16 Kbytes of Flash, while the Cortex-M0 part includes 8 Kbytes of Flash. The MSP430 part includes 512 bytes of Flash and 128 bytes of SRAM. These memory sizes are not appropriate for many applications, but there are growing application areas, including thermometers, metering, and health monitoring, that might be able to take advantage of these resource-constrained devices.

One thing to remember when considering those devices at the lowest end of the pricing spectrum is that they might represent a new opportunity for designs that do not currently use a processor. Do not limit your thinking to tasks that processors are already doing or you might miss out on the next growth space. Are you working on any projects that can benefit from these value-priced processors or do you think they are just configurations that give bragging rights to the supplier without being practical for real world use?

Question of the Week: Do you use or allow dynamic memory allocation in your embedded design?

Wednesday, March 24th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

Back when I was deep into building embedded control systems (and snow was always 20 feet deep and going to and from school was uphill both ways), the use of dynamic memory allocation was forbidden. In fact, using compiler library calls was also forbidden in many of the systems I worked on. If we needed to use a library call, we rewrote it so that we knew exactly what it did and how. Those systems were highly dependent on predictable, deterministic real-time behavior that had to run reliably for long periods of time without a hiccup of any kind. Resetting the system was not an option, and often the system had to keep working correctly in spite of errors and failures for as long as it could – in many cases lives could be on the line. These systems were extremely resource constrained from both a memory and a processing duty-cycle perspective, and we manually planned out all of the memory usage.
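For readers who did not work on systems like those, one common alternative to dynamic allocation was a statically planned pool of fixed-size blocks, sized by hand during design. The sketch below illustrates the idea; the block size and count are arbitrary choices for the example.

#include <stddef.h>
#include <stdint.h>

/* A statically planned pool of fixed-size blocks: all memory is reserved
 * at build time, allocation and release are O(1), and there is no heap
 * fragmentation. Block size and count are illustrative choices. */
#define BLOCK_SIZE  32u
#define BLOCK_COUNT 16u

static uint8_t  pool[BLOCK_COUNT][BLOCK_SIZE];
static uint8_t *free_list[BLOCK_COUNT];
static size_t   free_top;

void pool_init(void)
{
    for (size_t i = 0; i < BLOCK_COUNT; i++)
        free_list[i] = pool[i];
    free_top = BLOCK_COUNT;
}

void *pool_alloc(void)                 /* returns NULL when the pool is empty */
{
    return (free_top > 0u) ? free_list[--free_top] : NULL;
}

void pool_free(void *block)
{
    if (block != NULL && free_top < BLOCK_COUNT)
        free_list[free_top++] = (uint8_t *)block;
}

Because the pool is exhausted deterministically rather than fragmenting over time, its worst-case behavior can be analyzed up front, which is the property those systems cared about most.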

That was then, this is now. Today’s compilers are much better than they were then. Today’s processors include enormous amounts of memory and peripherals compared to the processors back then. Processor clock rates support much more processing per processing period than before such that there is room to waste a few cycles on “inefficient” tasks. Additionally, some of what were application-level functions back then are low-level, abstracted function calls in today’s systems. Today’s tools are more aware of memory leaks and are better at detecting such anomalies. But are they good enough for low level or deeply embedded tasks?

Do today’s compilers generate good enough code with today’s “resource rich” microcontrollers to make the static versus dynamic memory allocation a non-issue for your application space? I believe there will always be some classes of applications where using dynamic allocation, regardless of rich resources, is a poor choice. So in addition to answering whether you use or allow dynamic memory allocation in your embedded designs, please share what types of applications your answer applies to.

Robust Design: Ambiguity and Uncertainty

Monday, March 22nd, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

Undetected ambiguity is the bane of designers. Unfortunately, the opportunities for ambiguity to manifest in our specifications and designs are numerous, and they are easy to miss. Worse, when an ambiguity is discovered because two or more groups on a design team interpreted some information differently, the last person or team that touched the system often gets the blame – and that almost always is the software team.

For example, in the Best Guesses comments, DaveW points out that

“… This kind of problem is made worse when software is in the loop. Software is invisible. If it fails, you cannot know [how] unless the software gave you data to record and you recorded it.”

A problem with this common sentiment is unambiguously determining what constitutes a software failure. I shared in the lead-in Robust Design post that

“… Just because a software change can fix a problem does not make it a software bug – despite the fact that so many people like to imply the root cause of the problem is the software. Only software that does not correctly implement the explicit specification and design are truly software bugs. Otherwise, it is a system level problem that a software change might be more economically or technically feasible to use to solve the problem – but it requires first changing the system level specifications and design. This is more than just a semantic nit”

Charles Mingus offers a system perspective that hints at this type of problem:

“… And the solution most companies nowadays offer (Linear and now National) is to put ‘solutions in a box’ like small SMPS circuits, etc. You never completely know the behaviour, so always take very good care. These things are the basis of design failures, because the basic knowledge ‘seems’ not important anymore.”

Pat Ford, in the “Prius software bug?” LinkedIn discussion observes that

“…this isn’t just a software bug, this is a systems design bug, where multiple subsystems are improperly implemented.”

So how do these subsystems get improperly implemented? I contend that improperly implemented systems are largely the result of ambiguity in the system specifications, design assumptions, and user instructions. A classic visual example of ambiguity is an image that can be seen either as a vase or as two human faces looking at each other. Another classic example is an image that you can interpret as either a young or an old woman. If you are not familiar with these images, please take some time to see both interpretations in each example.

These two images are not so much optical illusions as they are examples of interpreting the same data in two different equally valid ways. I believe one reason why these images have at least two equally valid interpretations is that they are based on symbolic representations of the things that you can interpret them to represent. Symbols are imprecise and simplified abstractions of objects, concepts, and ideas. If you were dealing with the actual objects, the different interpretations might not be equally valid anymore.

Now consider how engineers and designers create systems. They often describe the system symbolically, using a natural language in a free or structured format. It is one thing to describe all the things the system is, but it is a much different problem to explicitly describe all the things that the system is not.

To illustrate the weakness of a purely natural language way to describe something, consider how you teach someone to do a new task they have never done before. Do you explain everything in words and then leave them to their own devices to accomplish the task? Do you show them how to do it the first time?

This is the same type of problem development tool providers have to address each time they release a new development kit, and they are increasingly adopting video or animated walkthroughs to improve the rate at which users successfully adopt their systems. And this problem does not apply just to designers – it affects all types of end systems as well.

In the best guesses post, I talked about how a set of conditions had to coincide, including the Freon in the air conditioning unit being overcharged. How would you have written the instructions for properly charging the Freon in such a system? Would the instructions specify what defined a full charge? To what precision would you have specified a minimum and maximum tolerable charge – or would you have? When using language to describe something, there is a chance that certain types of information are so well understood by everyone that you do not explicitly describe them over and over. This is fine until someone from outside that circle applies a different set of assumptions because they came from a different environment, and that environment made different arbitrary decisions that were appropriate for its operating conditions.

I was recently reminded of this concept with the iRobot Roomba vacuum that I own. I went through a larger learning curve than I expected with regard to cleaning all of the brushes because some of the places you need to clear out are not immediately obvious until you understand how the vacuum works. But the real kick in the head came when I tried to use the brush cleaning tool. I read the instructions for the tool in the manual, which say

“Use the included cleaning tool to easily remove hair from Roomba’s bristle brush by pulling it over the brush.”

Are these instructions simple enough that there is no room for ambiguity and misinterpretation? Well, I found the wrong way to use the tool, and judging from customer comments about the cleaning tool, so have other people. Mind you, this is a tool with a very limited number of possible ways of being used, but until you understand how it works, it is possible to use it incorrectly. I realized that the symbolic graphic on the side of the tool could be interpreted in at least two different, equally valid ways because of the positioning and use of a triangle symbol, which could represent the tool itself, the direction in which the brush should be used, or the place where the brush should enter the tool. Now that I understand how the tool works, the instructions and symbols make sense, but until I actually saw the tool work, it was not clear.

So not only is the specification for a system – that has never existed before – often written in a symbolic language, but so is the software that implements that system, as well as the user/maintenance manual for that system. Add to this that design teams consist of an ever larger number of people that do not necessarily work in the same company, industry, or even country. The opportunity for local, regional, and global culture differences amplifies the chances that equally valid but incompatible interpretations of the data can arise.

Consider the fate of the 1998 Mars Climate Orbiter that failed in its mission because of a mismatch between Imperial and Metric units. The opportunity to inject the mismatch into the system occurred when the units were changed between different instances of the flight software, and because there was inadequate integration testing.
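One defensive technique against exactly this class of unit mismatch is to make the unit part of the data type, so that mixing units requires an explicit, named conversion. The sketch below shows the idea in C; the impulse example and type names are mine for illustration, not anything from the actual mission software.

/* Wrap each unit in its own struct so the compiler rejects accidental
 * mixing; crossing between units must go through a named conversion.
 * The types below are illustrative, not from any flight program. */
typedef struct { double value; } newton_seconds_t;
typedef struct { double value; } pound_force_seconds_t;

static newton_seconds_t from_pound_force_seconds(pound_force_seconds_t x)
{
    newton_seconds_t r = { x.value * 4.4482216152605 };  /* 1 lbf-s = 4.448... N-s */
    return r;
}

static newton_seconds_t add_impulse(newton_seconds_t a, newton_seconds_t b)
{
    newton_seconds_t r = { a.value + b.value };
    return r;
}

With this structure, handing a pound_force_seconds_t directly to add_impulse is a compile-time error rather than a navigation error discovered months later.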

I saw a similarly painful failure on a spacecraft project when the team decided to replace the control system's 100 Hz inertial measurement unit with a 400 Hz unit. The failure was spectacular and completely avoidable.

The challenge, then, is how we as designers increase our chances of spotting when these ambiguities exist in our specifications and design choices – especially in evolving systems that experience changes in the people working on them. Is there a way to properly capture the tribal knowledge that is taken for granted? Are there tools that help you avoid shipping your end-products with undiscovered time-bombs?

I proposed four different robust design principles in the lead-in post for this series. My next post in this series will explore the fault-tolerance principle for improving the success of our best guesses and minimizing the consequences of ambiguities and uncertainty.

Extreme Processing Thresholds

Friday, March 19th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

Just in the past few weeks there have been two value-line processor announcements that push the lower limit for pricing. STMicroelectronics’ 32-bit Cortex-M3 value line processors are available starting at $0.85, and Texas Instruments’ 16-bit MSP430 are available starting at $0.25. These announcements follow the earlier announcement that NXP’s 32-bit Cortex-M0 processors are available for as low as $0.65.

These value pricing milestones map out the current extreme thresholds for pricing for a given level of processing performance. These types of announcements are exciting because every time different size processors reach new pricing milestones, they enable new types of applications and designs to incorporate new or more powerful processors into their implementation for more sophisticated capabilities. An analogous claim can be made when new processor power and energy consumption thresholds are pushed.

There are many such thresholds that determine whether it is feasible to include some level of processing performance in a given design. Sometimes the market is slower than desired in pushing past a key threshold. Consider, for example, the Wal-Mart mandate to apply RFID labels to shipments. The mandate began in January of 2005, and progress toward fully adopting it has been slow.

In this new series, I plan to explore extreme processing thresholds such as pricing and power efficiency. What are the business, technical, hardware, and software constraints that drive where these thresholds currently are and what kinds of innovations or changes does it take for semiconductor companies to push those thresholds a little bit further?

I am planning to start this series by exploring the low-end or value pricing thresholds, followed by low energy device thresholds. However, there are many other extreme thresholds that we can explore, such as the maximum amount of processing work that you can perform within a given time or power budget. This might be addressed through higher clock rates as well as parallel processing options, including hardware accelerators for vertically targeted application spaces. Examples of other types of extreme thresholds include interrupt service response latency; how much integrated memory is available; how much peripheral integration and CPU offloading is available; higher I/O sampling rates as well as accuracy and precision; wider operating temperature tolerances; and how many integrated connectivity options are available.

I need your help to identify which thresholds matter most to you. Which types of extreme processing thresholds do you want to see more movement on and why? Your responses here will help me to direct my research to better benefit your needs.

Question of the Week: You know you’re an embedded developer when …

Wednesday, March 17th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

Two recent posts reminded me once again of a challenge that I think all embedded developers run up against. In John Sloan’s comment in a LinkedIn discussion asking “Do most embedded software developers also have hardware design experience too,” he volunteers:

“… When people ask me what I do for a living, I seldom say “embedded development” because even I don’t really know what that means. I usually just say “high tech product development”.”

The other recent example of awkwardness in describing embedded development comes from Jack Ganssle’s “The Embedded Muse 190” newsletter:

“… I long ago gave up describing my job at parties, instead telling folks I’m an engineer. Their eyes immediately glaze over for a moment till they turn to talk to someone, anyone, else.”

To the same point, in a different discussion asking whether embedded is different, I shared that

“… One of the first things I had to internalize when transitioning to embedded design was that my software was invisible in the end system. The end user had no idea it was there—nor did they ever need to. I believe this an essential component of what makes something an embedded system. This had a significant impact on how I defined my worth and my ability to tell people what I did for a living. I laughingly adopted the philosophy of “You know you’re an embedded designer when you have to oversimplify your job description for ‘normal’ people”. I found myself just telling people I worked on the Space Shuttle or aircraft because it was too frustrating to try to explain the invisible portion of the system that I actually worked on in those types of systems.”

I will be posting a question each week relevant to embedded developers. One goal of the questions is to uncover those things we have in common with each other. Another goal is to uncover trends and key care-about groupings based around different design considerations or trade-offs, such as power consumption, pricing, and connectivity issues. After a few months of these questions, I plan to produce an article summarizing and commenting on your responses.

I think being able (or not) to succinctly describe what you do as an embedded developer is a testament to our success at delivering results so that (usually) no one is even aware of our contribution to the end-products that people use in their everyday lives.

Please contribute your thoughts on this topic by answering the question “You know you are an embedded developer when …”

Feel free to expand on just completing the sentence. I suspect this community harbors a rich set of answers that will not only amuse and entertain, but that when taken together will help identify the core of what embedded development really is. Who knows, maybe someone has already found the perfect way to describe what we do.

Robust Design: Best Guesses

Monday, March 15th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

An important realization about building robust systems is that the design decisions and trade-offs we make are based on our best guesses. As designers, we must rely on best guesses because it is impossible to describe a “perfect and complete” specification for all but the most simple, constrained, and isolated systems. A “perfect and complete” specification is a mythical notion that assumes that it is possible to not only specify all of the requirements to describe what a system must do, but that it is also possible to explicitly and unambiguously describe everything a system must never do under all possible operating conditions.

The second part of this assumption, explicitly describing everything the system must never do, is not feasible because the complete list of operating conditions and their forbidden behaviors is infinite. A concession to practicality is that system specifications address the anticipated operating conditions and those operating conditions with the most severe consequences – such as injury or death.

As the systems we design and build continue to grow in complexity, so too does the difficulty in explicitly identifying all of the relevant use cases that might cause a forbidden behavior. The short cut of specifying that the system may never act a certain way under any circumstance is too ambiguous. Where do you draw the line between reasonable use-cases and unreasonable ones? For example, did Toyota pursue profits at the expense of safety by knowingly ignoring the potential for unwanted acceleration? But what is the threshold between when you can safely ignore or must react to a potential problem? Maybe sharing my experience (from twenty years ago) with a highly safe and reliable automobile can stimulate some ideas on defining such a threshold.

After a few months of ownership, my car would randomly stall at full freeway speeds. I brought the car into the dealership three separate times. The first two times, they could not duplicate the problem, nor could they find anything in the car that they could adjust. The third time I brought the car in, I started working with a troubleshooter who was flown in from the national office. Fortunately, I was able to duplicate the problem once for the troubleshooter, so they knew this was not just a potential problem but a real event. It took two more weeks of full-time access to the car before the troubleshooter returned it to me with a fix.

I spoke with the technician and he shared the following insights with me. I was one of about half a dozen people in the entire country that was experiencing this problem. The conditions required to manifest this failure were specific. First, it only happened on very hot (approximately 100 degrees) and dry days. Second, the car had to be hot from sitting out in the direct sun for some time. Third, the air conditioning unit needed to be set to the highest setting while the car was turned on. Fourth, the driver of the car had to have a specific driving style (the stalls never happened to my wife who has a heavier foot on the accelerator than I do).

It turns out the control software for managing the fuel had two phases of operation. The first phase ran for the first few minutes after the car was started, and it characterized the driving style of the driver to set the parameters for managing the fuel delivered to the engine. After a few minutes of operating the car, the second phase of operation, which never modified the parameter settings, took over until the vehicle was turned off. My driving style when combined with those other conditions caused the fuel management parameters to deliver too little fuel to the engine under a specific driving condition which I routinely performed while on the freeway.
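I do not know how that manufacturer actually implemented its fuel management, but the two-phase structure described above might be sketched roughly as follows; the timing, parameters, and function names are invented for illustration only.

#include <stdbool.h>

/* Speculative sketch of a two-phase adaptation scheme: phase one learns
 * the driver's style for a few minutes, phase two freezes the parameters
 * until the next key-off. All values and names are invented. */
extern double read_throttle(void);                /* 0.0 to 1.0          */
extern double minutes_since_start(void);

static double throttle_gain = 0.5;                /* default calibration */
static double avg_throttle  = 0.0;
static bool   params_locked = false;

double fuel_command(void)
{
    double throttle = read_throttle();

    if (!params_locked) {
        /* Phase 1: build a low-pass estimate of typical throttle input. */
        avg_throttle = 0.95 * avg_throttle + 0.05 * throttle;
        if (minutes_since_start() > 3.0) {
            throttle_gain = 0.3 + 0.7 * avg_throttle;  /* derive and freeze    */
            params_locked = true;                      /* Phase 2 until key-off */
        }
    }
    return throttle_gain * throttle;              /* simplistic fuel demand */
}

In a structure like this, a light-footed driver would produce a small learned gain, which hints at how a specific driving style, combined with the other conditions, could end up delivering too little fuel.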

So it was a software problem, right? Well, not exactly; there was one more condition necessary to create this problem. The Freon for the air conditioning unit had to be at least slightly overcharged. Once the technician set the Freon charge level to no more than a full charge, the problem went away, and I never experienced it again over 150k miles of driving. I always made sure that we never overcharged the Freon when recharging the system.

I imagine there could have been a software fix that used a modified algorithm that also measured and correlated the Freon charge level, but I do not know if that automobile manufacturer followed that course or not for future vehicles.

So how do you specify such an esoteric use-case before experiencing it?

The tragedy of these types of situations is that the political, legal, and regulatory realities prevent the manufacturer of the vehicle in question from freely sharing what information they have, and possibly being able to more quickly pinpoint the unique set of conditions required to make the event occur, without severely risking their own survival.

Have you experienced something that can help distinguish when and how to address potential, probable, and actually occurring unintended behaviors? I do not believe any company operating for the long term puts out a product in volume with the intention of ignoring reasonable safety hazards. If a problem persists, I believe it is more likely because their best guesses have not yet been able to uncover which of the infinite possible conditions are contributing to the event.

My next post in this series will touch on ambiguity and uncertainty.

Robust Design: Good, Fast, Cheap – pick two

Wednesday, February 10th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on EDN.]

Reading Battar’s response to the introduction post for this series suggested to me that it is worth exploring the relationship of the popular expression “good, fast, and cheap – pick two” to robust design principles. The basis for this expression is that it is not possible to globally maximize or minimize all three of these vectors in the same design. Nor does this relationship apply only to engineering. For example, Jacob Cass applied it to Pricing Freelance Work.

There are a few problems with this form of the expression, but the concept of picking (n-1) from (n) choices to optimize is a common trade-off relationship. With regard to embedded processors, the “three P’s” (Performance, Power, and Price) capture the essence of the expression, but with a focus on the value to the end user.

One problem is that this expression implies that the end user is interested in the extremes of these trade-offs. The focus is on realizing the full potential of an approach and robustness is assumed. This is an extremely dangerous assumption as you push further beyond the capabilities of real designs that can survive in the real world.

The danger is not in the complexity of delivering the robustness, but rather our inexperience with it because our ability to accommodate that complexity changes over time. For example, I would not want the fastest processor possible if it means it will take a whole star to power it. However, someday that amount of energy might be readily accessible (but not while we currently only have the energy from a single star to power everything on our planet). The fact that it might not be absurd to harness the full output of a star to power a future processor points out that there is a context to the trade-offs designers make. This is the relevant point to remember in robust design principles.

The danger is underestimating the “distance” of our target thresholds from the well-understood threshold points. Moore’s law implicitly captures this concept by observing that the number of transistors in a given area doubles in a constant time relationship. This rate is really driven by our ability to adjust to and maintain a minimum level of robustness with each new threshold for these new devices. The fact that Moore’s law observed a constant time relationship that has stood the test of time, versus a linear or worse relationship, suggests the processor industry has found a good-enough equilibrium point between pushing design and manufacturing thresholds with the offsetting complexity of verifying, validating, and maintaining the robustness of the new approaches.

Robust design principles are the tools and applied lessons learned when designers are pushing the threshold of a system’s performance, power, and/or price beyond the tried and tested thresholds of previous designs.

The four categories of robust design principles I propose – fault-tolerance, sandbox, patch-it, and disposable (which does not mean cheap) – provide context-relevant tools and approaches for capturing and adding to our understanding when we push system thresholds beyond our comfort points while maintaining a system that can better survive what the real world will throw at it.