Entries Tagged ‘Patch-It Principle’

Robust Design: Patch-It Principle – Teaching and Learning

Monday, May 3rd, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

In the first post introducing the patch-it principle, I made the claim that developers use software patches to add new behaviors to their systems in response to new information and changes in the operating environment. In this way, patching allows developers to offload the complex task of learning from the device – at least until we figure out how to build machines that can learn. In this post, I will peer into my crystal ball and describe how I see the robust design patch-it principle evolving into a mix of teaching and learning principles. There is a lot of room for future discussion, so if you see something you hate or like, speak up – that will flag the topic for a future post.

First, I do not see the software patch going away, but I do see it taking on a teaching and learning component. The software patch is a mature method of disseminating new information to fixed-function machines. I think software patches will evolve from executable code blocks to meta-code blocks. This will be essential to support multi-processing designs.

Without meta-code, the complexity of building robust patch blocks that can handle customized processor partitioning will grow to be insurmountable as the omniscient knowledge syndrome drowns developers by requiring them to handle ever more low-value complexity. Using meta-code may provide a bridge to distributed or local knowledge processing (more explanation in a later post), where the different semi-autonomous processors in a system make decisions about the patch block based on their specific knowledge of the system.

The meta-code may take on a form that is conducive to teaching rather than an explicit sequence of instructions to perform. I see devices learning how to improve what they do by observing their user or operator as well as by communicating with other similar devices. By building machines this way, developers will be able to focus more on specifying the “what and why” of a process, and the development tools will assist the system in genetically searching for and applying different coding implementations while focusing on a robust verification of equivalence between the implementation and the specification. This may permit systems to consist of less-than-perfect parts because verifying the implementation will include the imperfections in the system.

The possible downside of learning machines is that they will become finely tuned to a specific user and be less than desirable to another user – unless there is a means for users to carry their preferences with them to other machines. This is already manifesting in chat programs that learn your personal idioms and automagically provide adjusted spell checking and link associations, because personal idioms do not always translate cleanly to other people, nor are they always used with the same connotation.

In order for the patch-it principle to evolve into the teach-and-learn principle, machines will need to develop a sense of context of self in their environment, be able to remember random details, be able to spot repetition of random details, be able to recognize sequences of events, and be able to anticipate an event based on a current situation. These are all tall orders for today’s machines, but as we build wider multiprocessing systems, I think we will stumble upon an approach to perform these tasks for less energy than we ever thought possible.

How do you support software patches in your embedded designs?

Wednesday, April 28th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

Earlier, I posted about design-for-patching. While some patches involve fixing something that never worked, I believe most patches actually add new information to the system that was not there before the patch was applied. This means there has to be some resource headroom to 1) incorporate the new information, and 2) receive, validate, store, and activate the patch in a safe manner. For resource-constrained embedded systems, these resources are the result of deliberate trade-offs.
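To make the trade-off a little more concrete, here is a minimal sketch – my own illustration, not an industry standard – of the update state machine and the staging headroom that the receive/validate/store/activate sequence implies. The state names and the 64-Kbyte staging figure are assumptions chosen purely for the example.

```c
/* Illustrative only: a possible shape for the patch-update path on a
 * resource-constrained device. The states mirror the receive, validate,
 * store, and activate steps; the staging size is an assumed budget. */
#include <stdint.h>

typedef enum {
    PATCH_IDLE,        /* no update in progress                        */
    PATCH_RECEIVING,   /* streaming patch data into the staging buffer */
    PATCH_VALIDATING,  /* checking integrity and authenticity          */
    PATCH_STORING,     /* writing the accepted image to its flash slot */
    PATCH_ACTIVATING,  /* switching over, with a rollback path ready   */
    PATCH_FAILED       /* patch rejected; keep running the old code    */
} patch_state_t;

/* The resource headroom shows up here: RAM (or spare flash) reserved up
 * front to hold the largest patch the device is willing to accept. */
#define PATCH_STAGING_BYTES (64u * 1024u)
static uint8_t patch_staging[PATCH_STAGING_BYTES];
static patch_state_t patch_state = PATCH_IDLE;
```

Even a skeleton like this consumes code space, RAM, and an interface that the rest of the design has to give up – which is why the trade-offs need to be deliberate.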

Patching subsystems that are close to the user interface may present a straightforward way to access the physical interface ports, but I am not aware of any industry-standard “best practices” for applying patches to deeply embedded subsystems.

Please share how you support software patches in your embedded designs. Do you use the same types of interfaces from project to project – or do you make do with what is available? Do you have a standard approach to managing in-field patches – or do you require your users to ship you the devices so that you can perform the patch under controlled circumstances? How do you ensure that a patch was applied successfully, and how do you recover from failed patches?

Robust Design: Patch-It Principle – Design-for-Patching

Monday, April 26th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

Patching a tire is necessary when part of the tire has been forcibly torn or removed so that it is damaged and can no longer perform its primary function properly. The same is true when you are patching clothing. Patching software in embedded systems, however, is not based on replacing a component that has been ripped from the system – rather, it involves adding new knowledge that was not part of the original system. Because software patching involves adding new information to the system, there is a need for extra available resources to accommodate that new information. The hardware, software, and labor resources needed to support patching are growing as systems continue to increase in complexity.

Designing to support patching involves some deliberate resource trade-offs, especially for embedded systems that do not have the luxury of idle, unassigned memory and interface resources that a desktop computer might have access to. To support patching, the system needs to be able to recognize that a patch is available, be able to receive the patch through some interface, and verify that the patch is real and authentic to thwart malicious attacks. It must also be able to confirm that there is no corruption in the received patch data and that the patch information has been successfully stored and activated without breaking the system.

In addition to the different software routines needed at each of these steps of the patching process, the system needs access to a hardware input interface to receive the patch data, an output interface to signal whether or not the patch was received and applied successfully, and memory to stage, examine, validate, apply, and store the patch data. For subsystems that are close to the user interface, gaining access to physical interface ports might be straightforward, but there are no industry-standard “best practices” for applying patches to deeply embedded subsystems.
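As a concrete illustration of the verification steps described in the last two paragraphs, here is a hedged sketch of a validation routine. Nothing in it is a standard API: the header layout, the helper functions (crc32, verify_signature, current_firmware_version), and the status codes are all assumptions made for the example.

```c
/* Illustrative patch validation: check identity, target version, length,
 * integrity, and authenticity before anything is stored or activated.
 * The layout and helpers are assumptions, not a specific product's code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PATCH_MAGIC 0x50415443u            /* arbitrary marker for a patch image */

typedef struct {
    uint32_t magic;            /* must equal PATCH_MAGIC                   */
    uint32_t base_version;     /* firmware version this patch applies to   */
    uint32_t payload_length;   /* number of payload bytes that follow      */
    uint32_t payload_crc32;    /* integrity check over the payload         */
    uint8_t  signature[64];    /* placeholder for an authenticity check    */
} patch_header_t;

extern uint32_t crc32(const uint8_t *data, size_t len);            /* assumed */
extern bool verify_signature(const patch_header_t *hdr,
                             const uint8_t *payload, size_t len);  /* assumed */
extern uint32_t current_firmware_version(void);                    /* assumed */

typedef enum { PATCH_OK, PATCH_BAD_MAGIC, PATCH_WRONG_VERSION,
               PATCH_TOO_BIG, PATCH_CORRUPT, PATCH_NOT_AUTHENTIC } patch_status_t;

patch_status_t patch_validate(const uint8_t *staging, size_t staging_len)
{
    patch_header_t hdr;

    if (staging_len < sizeof(hdr))
        return PATCH_CORRUPT;
    memcpy(&hdr, staging, sizeof(hdr));    /* copy out to avoid alignment traps */

    if (hdr.magic != PATCH_MAGIC)
        return PATCH_BAD_MAGIC;                        /* not a patch image       */
    if (hdr.base_version != current_firmware_version())
        return PATCH_WRONG_VERSION;                    /* built for another image */
    if (hdr.payload_length > staging_len - sizeof(hdr))
        return PATCH_TOO_BIG;                          /* would overrun staging   */

    const uint8_t *payload = staging + sizeof(hdr);
    if (crc32(payload, hdr.payload_length) != hdr.payload_crc32)
        return PATCH_CORRUPT;                          /* damaged in transit      */
    if (!verify_signature(&hdr, payload, hdr.payload_length))
        return PATCH_NOT_AUTHENTIC;                    /* reject malicious data   */

    return PATCH_OK;   /* only now is it safe to store and activate the patch */
}
```

The output interface mentioned above would report which of these status codes came back, so the operator (or the patch server) knows whether to retry.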

It is important that the patch process does not leave the system in an inoperable state – even if the patch file is corrupted or the system loses power while applying the patch. A number of techniques designers use depend on including enough storage space in the system to house both the pre- and post-patch code so that the system can confirm the new patch is working properly before releasing the storage holding the previous version of the software. The system might also employ a safe, default boot kernel, which the patching process can never change, so that if the worst happens while applying a patch, the operator can use the safe kernel to put the system into a known state that provides basic functionality and can accept a new patch file.
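Here is a hedged sketch of how the dual-image-plus-safe-kernel idea might look at boot time. The bank descriptor fields, the three-attempt limit, and the helper functions (image_crc_ok, jump_to_image, enter_safe_kernel) are all invented for the illustration rather than taken from any particular bootloader.

```c
/* Illustrative boot-time bank selection for a dual-image patch scheme with
 * an immutable safe kernel as the last resort. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t image_addr;     /* where this firmware bank lives in flash     */
    uint32_t image_valid;    /* set by the updater after a good CRC check   */
    uint32_t confirmed;      /* set by the application once it runs cleanly */
    uint32_t boot_attempts;  /* incremented by the boot code on every try   */
} fw_bank_t;

#define MAX_BOOT_ATTEMPTS 3u

extern fw_bank_t bank_new, bank_old;              /* assumed flash-resident   */
extern bool image_crc_ok(const fw_bank_t *bank);  /* assumed integrity check  */
extern void jump_to_image(uint32_t addr);         /* assumed, does not return */
extern void enter_safe_kernel(void);              /* immutable recovery code  */

void boot_select(void)
{
    /* Prefer the freshly patched bank, but only while it still has boot
     * attempts left and passes its integrity check. */
    if (bank_new.image_valid && !bank_new.confirmed &&
        bank_new.boot_attempts < MAX_BOOT_ATTEMPTS &&
        image_crc_ok(&bank_new)) {
        bank_new.boot_attempts++;   /* in a real design this counter is persisted
                                       so a hang also counts against the new image */
        jump_to_image(bank_new.image_addr);
    }

    /* Fall back to the previous, known-good image. */
    if (bank_old.image_valid && image_crc_ok(&bank_old))
        jump_to_image(bank_old.image_addr);

    /* Worst case: both banks are unusable, so drop into the safe kernel that
     * the patch process can never overwrite and wait for a new patch file. */
    enter_safe_kernel();
}
```

The "confirmed" flag is the hook for the technique the paragraph describes: the application sets it only after the new image demonstrably works, and until then both the old image and the safe kernel remain reachable.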

In addition to receiving and applying the patch data, system designs are increasingly accommodating custom settings, so that applying the patch does not disrupt the operator's customizations. Preserving the custom settings may involve more than just not overwriting them; it may involve performing specific checks, transformations, and configurations before completing the patch. Supporting patches that preserve customization can mean more complexity and work for the developers to seamlessly address the differences between settings.
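A small sketch of what those checks and transformations might look like, assuming a versioned settings record; the field names, version numbers, and default values are invented for the illustration.

```c
/* Illustrative settings migration: the patch carries a new settings layout,
 * and the operator's stored customizations are transformed forward rather
 * than being wiped or blindly overwritten. */
#include <stdint.h>
#include <string.h>

typedef struct {                 /* layout used before the patch (v1) */
    uint32_t version;            /* == 1                              */
    uint32_t backlight_level;    /* operator customization            */
} settings_v1_t;

typedef struct {                 /* layout used after the patch (v2)  */
    uint32_t version;            /* == 2                              */
    uint32_t backlight_level;    /* carried over from v1              */
    uint32_t auto_sleep_minutes; /* new field introduced by the patch */
} settings_v2_t;

void settings_migrate(const void *stored, uint32_t stored_version,
                      settings_v2_t *out)
{
    if (stored_version == 2u) {                 /* already current: keep as-is */
        memcpy(out, stored, sizeof(*out));
        return;
    }
    if (stored_version == 1u) {                 /* transform v1 -> v2          */
        const settings_v1_t *old = (const settings_v1_t *)stored;
        out->version = 2u;
        out->backlight_level = old->backlight_level; /* preserve customization */
        out->auto_sleep_minutes = 30u;               /* default for new field  */
        return;
    }
    /* Unknown or corrupt settings: fall back to factory defaults. */
    out->version = 2u;
    out->backlight_level = 50u;
    out->auto_sleep_minutes = 30u;
}
```

The extra work the paragraph mentions shows up in that middle branch: every settings layout the product has ever shipped needs its own transformation path.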

The evolving trend for the robust design patch-it principle is that developers are building more intelligence into their patch processes. This simplifies or eliminates the learning curve for the operator to initiate a patch. Smarter patches also enable the patch process to launch, proceed, and complete in a more automated fashion without causing operators with customized settings any grief. Over time, this can build confidence in the user community so that more devices can gain the real benefit of the patch-it principle – devices can change their behavior in a way that mimics learning from their environment years before designers, as a community, figure out how to make self-reliant learning machines.

Robust Design: Patch-It Principle

Monday, April 19th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

The software patch is a much-maligned technique for keeping systems robust because many users perceive the majority of these patches as merely fixes for feature bugs that the developers should have taken care of before shipping the software. While there are many examples where this sentiment has a strong ring of truth to it, the patch-it principle is a critical approach to maintaining robust systems so that they can continue to operate within an acceptable range of behavior despite the huge variability that the real world throws at them. This post focuses on the many other reasons (beyond sloppy or curtailed design, build, and test) for patching a system – all of which share the same basis:

The software patch is currently the primary approach for a system to manifest modified behaviors in light of additional information about the environment it needs to operate within.

The basis for this statement stems from what I have referred to as the omniscient knowledge syndrome, which is the assumption that designers should identify and resolve all relevant issues facing a system at design time. This is a necessary assumption because contemporary embedded systems are not capable of sufficient learning that allows them to determine an appropriate course of action to handle previously unspecified environmental conditions.

Common examples of patching are to add new capabilities to a system; to implement countermeasures to malicious attack attempts; and to implement special case processing to support interoperability across a wider range of hardware configurations, such as new combinations of video cards, system boards, and versions of operating systems.

New capabilities are added to systems experimentally and in response to how the market reacts to those experimental features. Patching enables developers to competitively distribute successful experiments to their existing base of systems without requiring their user base to buy new devices with each new feature.

A robust system should be able to counter malicious attacks, such as viruses and hacks. A perfect static defense against malicious attacks is impossible, or at least it has so far been impossible for the entire history of mankind. Attacks and countermeasures are based on responding to what the other side is doing. Patching helps mitigate device obsolescence that would otherwise ensue when malicious entities successfully compromise those systems.

The rate of evolution in the electronics market is too rapid for any developer to completely accommodate in any single project. The constant flow of new chips, boards, algorithms, communication protocols, and new ways of using devices means that some mechanism – in this case, patching – is needed to allow older devices to integrate with newer ones.

In essence, the patch-it principle is a “poor man’s” approach to allow systems to learn how to behave in a given condition. The designer is the part of the embedded system that is able to learn how the world is changing and develop appropriate responses for those changes. Until embedded systems are able to recognize context within their environment, identify trends, and become expert predictors, designers will have to rely on the patch-it principle to keep their products relevant as the world keeps changing.

Robust Design: Ambiguity and Uncertainty

Monday, March 22nd, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

Undetected ambiguity is the bane of designers. Unfortunately, the opportunities for ambiguity to manifest in our specifications and designs are numerous, and they are easy to miss. Worse, when an ambiguity is discovered because two or more groups on a design team interpreted some information differently, the last person or team that touched the system often gets the blame – and that almost always is the software team.

For example, in the Best Guesses comments, DaveW points out that

“… This kind of problem is made worse when software is in the loop. Software is invisible. If it fails, you cannot know [how] unless the software gave you data to record and you recorded it.”

A problem with this common sentiment is unambiguously determining what constitutes a software failure. I shared in the lead-in Robust Design post that

“… Just because a software change can fix a problem does not make it a software bug – despite the fact that so many people like to imply the root cause of the problem is the software. Only software that does not correctly implement the explicit specification and design is truly a software bug. Otherwise, it is a system-level problem that a software change might be the more economically or technically feasible way to solve – but that requires first changing the system-level specifications and design. This is more than just a semantic nit.”

Charles Mingus offers a system perspective that hints at this type of problem:

“… And the solution most companies nowadays offer (Linear and now National) is to put ‘solutions in a box’ like small SMPS circuits, etc. You never completely know the behaviour, so always take very good care. These things are the basis of design failures, because the basic knowledge ‘seems’ not important anymore.”

Pat Ford, in the “Prius software bug?” LinkedIn discussion observes that

“…this isn’t just a software bug, this is a systems design bug, where multiple subsystems are improperly implemented.”

So how do these subsystems get improperly implemented? I contend that improperly implemented systems are largely the result of ambiguity in the system specifications, design assumptions, and user instructions. A classic visual example of ambiguity is an image that can be seen either as a vase or as two human faces looking at each other. Another classic example is an image that you can interpret as either a young woman or an old woman. If you are not familiar with these images, please take some time to find them and see both interpretations in each.

These two images are not so much optical illusions as they are examples of interpreting the same data in two different equally valid ways. I believe one reason why these images have at least two equally valid interpretations is that they are based on symbolic representations of the things that you can interpret them to represent. Symbols are imprecise and simplified abstractions of objects, concepts, and ideas. If you were dealing with the actual objects, the different interpretations might not be equally valid anymore.

Now consider how engineers and designers create systems. They often describe the system using symbolic language – natural language in a free or structured format. It is one thing to describe all the things the system is, but it is a much different problem to explicitly describe all the things that the system is not.

To illustrate the weakness of a purely natural language way to describe something, consider how you teach someone to do a new task they have never done before. Do you explain everything in words and then leave them to their own devices to accomplish the task? Do you show them how to do it the first time?

This is the same type of problem development-tool providers have to address each time they release a new development kit, and they are increasingly adopting video or animated walkthroughs to improve the adoption success rate of their systems. And this problem does not apply just to designers – it affects all types of end systems as well.

In the best guesses post, I talked about how a set of conditions had to coincide with an overcharged Freon level in the air conditioning unit. How would you have written the instructions for properly charging the Freon in such a system? Would the instructions specify what defined a full charge? To what precision would you have specified a minimum and maximum tolerable charge – or would you have? When using language to describe something, there is a chance that certain types of information are so well understood by everyone in a given circle that you do not explicitly describe them over and over. This is fine until someone from outside that circle applies a different set of assumptions because they came from a different environment – one that made different arbitrary decisions that were appropriate for its own operating conditions.

I was recently reminded of this concept by the iRobot Roomba vacuum that I own. I went through a larger learning curve than I expected with regard to cleaning all of the brushes because some of the places you need to clear out are not immediately obvious until you understand how the vacuum works. But the real kick in the head came when I tried to use the brush cleaning tool. I read the instructions in the manual about the tool, and they say

“Use the included cleaning tool to easily remove hair from Roomba’s bristle brush by pulling it over the brush.”

Are these instructions simple enough that there is no room for ambiguity and misinterpretation? Well, I found the wrong way to use the tool, and judging from customer comments about the cleaning tool, so have other people. Mind you, this is a tool with a very limited number of possible ways to use it, but until you understand how it works, it is possible to use it incorrectly. I realized that the symbolic graphic on the side of the tool could be interpreted in at least two different, equally valid ways because of the positioning and use of a triangle symbol, which could represent the tool itself, the direction in which the brush should be used, or the place where the brush should enter the tool. Now that I understand how the tool works, the instructions and symbols make sense, but until I actually saw the tool work, it was not clear.

So not only is the specification for a system – one that has never existed before – often written in a symbolic language, but so is the software that implements that system, as well as the user/maintenance manual for it. Add to this that design teams consist of an ever-larger number of people who do not necessarily work in the same company, industry, or even country. The opportunity for local, regional, and global cultural differences amplifies the chances that equally valid but incompatible interpretations of the data can arise.

Consider the fate of the 1998 Mars Climate Orbiter, which failed in its mission because of a mismatch between Imperial and metric units. The opportunity to inject the mismatch into the system occurred when the units changed between different pieces of the mission software, and inadequate integration testing failed to catch it.
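One defensive tactic against exactly this kind of ambiguity is to make the units part of the type system so that the compiler, not integration testing, catches the mismatch. The sketch below is purely illustrative and has nothing to do with the orbiter's actual software; the function names and the lower-level call are invented.

```c
/* Illustrative only: wrapping units in distinct struct types turns a unit
 * mismatch into a compile-time error instead of a silent scaling bug. */
typedef struct { double value; } newton_seconds_t;        /* SI impulse       */
typedef struct { double value; } pound_force_seconds_t;   /* Imperial impulse */

/* 1 lbf·s = 4.448222 N·s, so every conversion is explicit and visible. */
static inline newton_seconds_t from_lbf_s(pound_force_seconds_t imp)
{
    newton_seconds_t si = { imp.value * 4.448222 };
    return si;
}

extern void command_thrusters(double si_impulse);  /* assumed lower-level call */

/* This consumer only accepts SI units; handing it a pound_force_seconds_t
 * is rejected by the compiler rather than slipping through as a raw double. */
void apply_thruster_impulse(newton_seconds_t impulse)
{
    command_thrusters(impulse.value);   /* everything below this point is SI */
}
```

It does not remove the ambiguity from the written specification, but it forces every place where the two conventions meet to be spelled out in the code.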

I saw a similarly painful failure on a spacecraft control system project when the team decided to replace the 100 Hz inertial measurement unit with a 400 Hz unit. The failure was spectacular and completely avoidable.

The challenge, then, is how do we as designers increase our chances of spotting these ambiguities in our specifications and design choices – especially in evolving systems that experience changes in the people working on them? Is there a way to properly capture the tribal knowledge that is taken for granted? Are there tools that help you avoid shipping your end products with undiscovered time bombs?

I proposed four different robust design principles in the lead-in post for this series. My next post in this series will explore the fault-tolerance principle for improving the success of our best guesses and minimizing the consequences of ambiguities and uncertainty.

Robust Design: Best Guesses

Monday, March 15th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

An important realization about building robust systems is that the design decisions and trade-offs we make are based on our best guesses. As designers, we must rely on best guesses because it is impossible to describe a “perfect and complete” specification for all but the most simple, constrained, and isolated systems. A “perfect and complete” specification is a mythical notion that assumes that it is possible to not only specify all of the requirements to describe what a system must do, but that it is also possible to explicitly and unambiguously describe everything a system must never do under all possible operating conditions.

The second part of this assumption, explicitly describing everything the system must never do, is not feasible because the complete list of operating conditions and their forbidden behaviors is infinite. A concession to practicality is that system specifications address the anticipated operating conditions and those operating conditions with the most severe consequences – such as injury or death.

As the systems we design and build continue to grow in complexity, so too does the difficulty of explicitly identifying all of the relevant use cases that might cause a forbidden behavior. The shortcut of specifying that the system may never act a certain way under any circumstance is too ambiguous. Where do you draw the line between reasonable use cases and unreasonable ones? For example, did Toyota pursue profits at the expense of safety by knowingly ignoring the potential for unwanted acceleration? And where is the threshold between a potential problem you can safely ignore and one you must react to? Maybe sharing my experience (from twenty years ago) with a highly safe and reliable automobile can stimulate some ideas on defining such a threshold.

After a few months of ownership, my car would randomly stall at full freeway speeds. I brought the car into the dealership three separate times. The first two times, they could not duplicate the problem, nor could they find anything to adjust in the car. The third time I brought the car in, I started working with a troubleshooter who was flown in from the national office. Fortunately, I was able to duplicate the problem once for the troubleshooter, so they knew this was not just a potential problem but a real event. It took two more weeks of full-time access to the car before the troubleshooter returned it to me with a fix.

I spoke with the technician, and he shared the following insights with me. I was one of about half a dozen people in the entire country experiencing this problem. The conditions required to manifest this failure were specific. First, it only happened on very hot (approximately 100°F) and dry days. Second, the car had to be hot from sitting out in the direct sun for some time. Third, the air conditioning unit needed to be set to the highest setting while the car was turned on. Fourth, the driver of the car had to have a specific driving style (the stalls never happened to my wife, who has a heavier foot on the accelerator than I do).

It turns out the control software for managing the fuel had two phases of operation. The first phase ran for the first few minutes after the car was started, and it characterized the driving style of the driver to set the parameters for managing the fuel delivered to the engine. After a few minutes of operating the car, the second phase of operation, which never modified the parameter settings, took over until the vehicle was turned off. My driving style, when combined with those other conditions, caused the fuel management parameters to deliver too little fuel to the engine under a specific driving condition that I routinely performed while on the freeway.
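Just to make the structure of that explanation concrete, here is a purely illustrative sketch of a two-phase parameter scheme – it is emphatically not the automaker's code, and the time constant, filter, and variable names are all invented. The point is only to show how parameters learned under one set of conditions get frozen and then applied to conditions the learning phase never saw.

```c
/* Illustrative two-phase adaptation: learn briefly, then freeze. Anything
 * the learning phase never observed - such as the extra load from an
 * overcharged A/C compressor - is baked into frozen parameters that may
 * no longer deliver enough fuel later in the drive. */
#include <stdbool.h>

#define ADAPT_PERIOD_S 180.0     /* assumed length of the learning phase */

typedef struct {
    double base_fuel_rate;       /* learned from the driver's style      */
    bool   frozen;               /* set once phase one ends              */
} fuel_params_t;

void fuel_update(fuel_params_t *p, double elapsed_s,
                 double observed_throttle, double engine_load)
{
    if (!p->frozen && elapsed_s < ADAPT_PERIOD_S) {
        /* Phase one: characterize the driving style. A light-footed driver
         * pulls the learned fuel rate down with this simple filter. */
        p->base_fuel_rate = 0.9 * p->base_fuel_rate + 0.1 * observed_throttle;
        return;
    }
    p->frozen = true;            /* phase two: parameters never change again */

    /* Fuel command derived from the frozen parameters; if engine_load rises
     * above anything seen during phase one, the command can end up too lean. */
    double fuel_command = p->base_fuel_rate * engine_load;
    (void)fuel_command;          /* delivery to the injectors omitted */
}
```

An unmodeled load – the overcharged Freon driving the compressor harder – shows up only in phase two, after the parameters can no longer adapt, which is exactly the failure signature described above.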

So it was a software problem, right? Well, not exactly; there was one more condition necessary to create this problem. The Freon for the air conditioning unit had to be at least slightly overcharged. Once the technician set the Freon charge level to no more than full charge, the problem went away, and I never experienced it again over 150,000 miles of driving. I always made sure that we never overcharged the Freon when recharging the system.

I imagine there could have been a software fix that used a modified algorithm that also measured and correlated the Freon charge level, but I do not know if that automobile manufacturer followed that course or not for future vehicles.

So how do you specify such an esoteric use-case before experiencing it?

The tragedy of these types of situations is that political, legal, and regulatory realities prevent the manufacturer of the vehicle in question from freely sharing what information they have – and possibly from more quickly pinpointing the unique set of conditions required to make the event occur – without severely risking their own survival.

Have you experienced something that can help distinguish when and how to address potential, probable, and actually occurring unintended behaviors? I do not believe any company operating over the long term puts out any product in volume with the intention of ignoring reasonable safety hazards. If a problem persists, I believe it is more likely because their best guesses have not yet uncovered which of the infinite possible conditions are contributing to the event.

My next post in this series will touch on ambiguity and uncertainty.

Robust Design: Good, Fast, Cheap – Pick Two

Wednesday, February 10th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on EDN]

Battar’s response to the introduction post for this series suggested to me that it is worth exploring the popular expression “good, fast, and cheap – pick two” in the context of robust design principles. The basis for this expression is that it is not possible to globally maximize or minimize all three of these vectors in the same design. Nor does this relationship apply only to engineering; for example, Jacob Cass applied it to Pricing Freelance Work.

There are a few problems with this form of the expression, but the concept of picking (n-1) of (n) choices to optimize is a common trade-off relationship. With regard to embedded processors, the “three P’s” – performance, power, and price – capture the essence of the expression, but with a focus on the value to the end user.

One problem is that this expression implies that the end user is interested in the extremes of these trade-offs. The focus is on realizing the full potential of an approach, and robustness is assumed. This is an extremely dangerous assumption as you push further beyond the capabilities of real designs that can survive in the real world.

The danger is not in the complexity of delivering the robustness, but rather in our inexperience with it, because our ability to accommodate that complexity changes over time. For example, I would not want the fastest processor possible if it means it will take a whole star to power it. However, someday that amount of energy might be readily accessible (but not while we currently only have the energy from a single star to power everything on our planet). The fact that it might not be absurd to harness the full output of a star to power a future processor points out that there is a context to the trade-offs designers make. This is the relevant point to remember in robust design principles.

The danger is underestimating the “distance” of our target thresholds from the well-understood threshold points. Moore’s law implicitly captures this concept by observing that the number of transistors in a given area doubles over a constant time interval. This rate is really driven by our ability to adjust to and maintain a minimum level of robustness at each new threshold for these new devices. The fact that Moore’s law has held to a constant-time relationship that has stood the test of time, versus a linear or worse relationship, suggests the processor industry has found a good-enough equilibrium point between pushing design and manufacturing thresholds and the offsetting complexity of verifying, validating, and maintaining the robustness of the new approaches.

Robust design principles are the tools and applied lessons learned when designers are pushing the threshold of a system’s performance, power, and/or price beyond the tried and tested thresholds of previous designs.

The four categories of robust design principles I propose – fault-tolerance, sandbox, patch-it, and disposable (which does not mean cheap) – provide context relevant tools and approaches for capturing and adding to our understanding when we push system thresholds beyond our comfort points while maintaining a system that can better survive what the real world will throw at it.

Robust Design

Thursday, February 4th, 2010 by Robert Cravotta

I am accelerating my plans to start a series on robust design principles because of the timely interest in the safety recall by Toyota for a sticking accelerator pedal. Many people are weighing in on the issue, but Charles J. Murray’s article “Toyota’s Problem Was Unforeseeable” and Michael Barr’s posting “Is Toyota’s Accelerator Problem Caused by Embedded Software Bugs?” make me think there is significant value in discussing robust design approaches right away.

A quick answer to the questions posed by the first article is no. The failure was not unforeseeable if a robust system-level failure analysis effort is part of the specification, design, build, test, and deploy process. The subheading of Charles’ article hits the nail on the head:

“As systems grow in complexity, experts say designing for failure may be the best course of action for managing it.”

To put things in perspective, my own engineering experience with robust designs is strongly based on specifying, designing, building, and testing autonomous systems in an aerospace environment. Some of these systems were man-rated, triple-fault-tolerant designs – meaning the system had to operate with no degradation in spite of any three failures. The vast majority of the designs I worked on were at least single-fault-tolerant designs. Much of my design bias is shaped by those projects. In the next post in this series, I will explore fault-tolerant philosophies for robust design.

A quick answer to the questions posed by the second article is – it depends. Just because a software change can fix a problem does not make it a software bug – despite the fact that so many people like to imply the root cause of the problem is the software. Only software that does not correctly implement the explicit specification and design is truly a software bug. Otherwise, it is a system-level problem that a software change might be the more economically or technically feasible way to solve – but that requires first changing the system-level specifications and design. This is more than just a semantic nit – it is an essential perspective for root-cause analysis and resolution, and I hope in my next post to clearly explain why.

I would like to initially propose four robust design categories (fault-tolerant, sandbox, patch-it, and disposable); if you know of another category, please share it here. I plan to follow up with separate posts focusing on each of these categories. I would also like to solicit guest posts from anyone who has experience with any of these different types of robust design.

Fault-tolerant design focuses on keeping the system running or safe in spite of failures. These techniques are commonly applied in high-value designs where people’s lives are at stake (like airplanes, spaceships, and automobiles), but there are techniques that can be applied to even lesser-impact, consumer-level designs (think exploding batteries – which I’ll expand on in the next post).

Sandbox design focuses on controlling the environment so that failures cannot occur. Ever wonder why Apple’s new iPad does not support third party multitasking?

Patch-it design focuses on fixing problems after the system is in the field. This is a common approach for a lot of software products where the consequences of failures in the end system are not catastrophic and where implementing a correction is low cost.

Disposable design focuses on short life span issues. This affects robust design decisions in a meaningfully different way than the other three types of designs.

The categories I’ve proposed are system level in nature, but I think the concepts we can uncover in a discussion would apply to all of the disciplines required to design each component and subsystem in contemporary projects.

[Editor's Note: This was originally posted on EDN]