Robust Design Channel

Robust Design explores the toolbox of design principles that embedded developers can use to build components and systems that continue to work as intended, even when they are subjected to unintended operating conditions.

When is “cutting corners” good engineering versus derelict complacency?

Wednesday, June 16th, 2010 by Robert Cravotta

The recent articles claiming BP was demonstrating carelessness and complacency when the company cut corners in their well design got me thinking. Companies and design teams constantly improve their process in ways that cut costs and shrink schedules. This incremental process of “cutting corners” is the cornerstone of the amazing advances made in technology and improvements in our overall quality of life over the years. There seems to be a lot of second-guessing and criticism by people outside the design, build, and maintenance process when those incremental changes cause a system to cross the line between “good enough” and broken. The disaster that BP is in the middle of right now in the gulf is quite serious, but the magnitude of the disaster makes me think we should explore a thought exercise together.

What would be the cost/benefit if BP never “cut corners” on its well and rig designs? The immediately obvious answer might be that there would be no oil volcano at the bottom of the gulf right now and that BP would be happily pumping oil out of that well instead of cleaning up after it. I say volcano because the words spill and leak seem insufficient to describe the massive force required to spew out all of that oil against the tremendous pressure that the water above exerts on that opening. A possible problem with the immediately obvious answer is that it ignores an essential implied assumption: while there might not be any oil pouring into the gulf, we might not be harvesting any of the oil either.

Let’s reword the question to make it more general. What would be the cost to society if everyone only engaged in ventures that would never find the line between good enough and broken? While raising my own children, I developed a sense of how important it is for all of us to find the line between good enough and broken. I believe children do not break rules merely to break them – I think they are exploring the edges and refining their own models of what rules are and why and when they should adhere to them. If we deny children the opportunity to understand the edges of rules, they might never develop the understanding necessary to know when to follow and when to challenge a rule.

This concept applies to engineering (as well as any human endeavor). If designers always use large margins in their designs, how will they know when and why they can or should not push those margins? How will they know if the margins are excessive (wasteful) or just right? My experience shows me that people learn the most from failures, especially because failure forces them to refine their models of how and why the world works the way it does.

I think one of the biggest challenges to “cutting corners” is minimizing the impact when you do cross the line into failure, precisely because you do not know where that line is. To me, derelict complacency depends on the assumption that the designer knew where the line to failure was and crossed it anyway. If my engineering career taught me anything, it taught me that we never know what will or will not work until we try it. We can extrapolate from experience, but experience does not provide certainty for everything we have not tried yet.

To an outsider, there might not be an easily visible difference between good engineering and derelict complacency. What are your thoughts on how to describe the difference between appropriate, risk-assessed process improvement and derelict complacency? Can we use common failures in the lab to explore, refine, and communicate this difference so that we can apply it to larger disasters such as the oil in the gulf or even unintended acceleration in automobiles?

If you would like to suggest questions to explore, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Robust Design: Quality vs. Security

Monday, June 14th, 2010 by Robert Cravotta

I had a conversation recently with Nat Hillary, a field application engineer at LDRA Technologies, about examples of software fault tolerance, quality, and security. Our conversation identified many questions and paths that I would like to research further. One such path relates to how software systems that are not fault tolerant may present vulnerabilities that attackers can use to compromise the system. A system’s vulnerability and resistance to software security exploits is generally a specification, design, and implementation quality issue. However, just because secure systems require high quality does not mean that high-quality systems are also secure, because quality and security are measured with different metrics.

Determining a system’s quality involves measuring and ensuring that each component, separately and together, fits or behaves within some specified range of tolerance. The focus is on whether the system can perform its function within acceptable limits rather than on the complete elimination of all variability. The tightness or looseness of a component’s permitted tolerance balances the cost and difficulty of manufacturing identical components against the cumulative impact that variability among the components has on the system’s ability to perform its intended function. For example, many software systems ship with some number of known minor implementation defects (bugs) because the remaining bugs do not prevent the system from operating within tolerances during the expected and likely use scenarios. The software in this case is identical from unit to unit, but variability in the other components of the system can still introduce differences in system behavior. I will talk about an exhibit at this year’s ESC San Jose that demonstrated this variability in a future post.

In contrast, a system’s security depends on protecting its vulnerabilities even when the system operates under extraordinary conditions. A single vulnerability, under the proper extraordinary conditions, can compromise the system’s proper operation. However, as with quality, a system’s security is not completely dependent on a perfect implementation. If the system can isolate and contain vulnerabilities, it can still be good enough to operate in the real world. The 2008 report “Enhancing the Development Life Cycle to Produce Secure Software” identifies that secure software exhibits:

1. Dependability (Correct and Predictable Execution): Justifiable confidence can be attained that software, when executed, functions only as intended;

2. Trustworthiness: No exploitable vulnerabilities or malicious logic exist in the software, either intentionally or unintentionally inserted;

3. Resilience (and Survivability): If compromised, damage to the software will be minimized, and it will recover quickly to an acceptable level of operating capacity.

An example of a software system vulnerability that has a fault tolerant solution is the buffer overflow. A buffer overflow attack exploits functions that do not perform proper bounds checking. The Computer Security Technology Planning Study first publicly documented the technique in 1972. Static analysis tools can help developers avoid this type of vulnerability by identifying array overflows and underflows, as well as improper mixing of signed and unsigned data types. Using this fault tolerant approach can help a software system exhibit the three secure software properties listed above.
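To make this concrete, here is a minimal C sketch (not from the original article) contrasting the unchecked copy that static analysis tools flag with a bounds-checked alternative; the command-parsing scenario and buffer size are invented for illustration.

```c
#include <stdio.h>
#include <string.h>

#define CMD_BUF_LEN 16

/* Unsafe: strcpy performs no bounds checking, so any input longer than
 * CMD_BUF_LEN - 1 characters overflows cmd and corrupts adjacent memory. */
void parse_command_unsafe(const char *input)
{
    char cmd[CMD_BUF_LEN];
    strcpy(cmd, input);                 /* flagged by most static analyzers */
    printf("processing: %s\n", cmd);
}

/* Safer: measure the input against the buffer and refuse oversized input,
 * so the fault is contained and reported instead of becoming exploitable. */
int parse_command_checked(const char *input)
{
    char cmd[CMD_BUF_LEN];
    size_t len = 0;

    while (len < CMD_BUF_LEN && input[len] != '\0')
        len++;
    if (len == CMD_BUF_LEN)
        return -1;                      /* too long (or unterminated): reject */

    memcpy(cmd, input, len + 1);        /* len + 1 <= CMD_BUF_LEN, so this fits */
    printf("processing: %s\n", cmd);
    return 0;
}

int main(void)
{
    parse_command_checked("status");
    if (parse_command_checked("this command is far too long to fit") < 0)
        printf("rejected oversized command\n");
    return 0;
}
```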

[Editor's Note: This was originally posted on the Embedded Master]

Question of the Week: Do you always use formal test procedures for your embedded designs?

Wednesday, June 9th, 2010 by Robert Cravotta

I commented earlier this week about how watching the live video of the BP oil well capping exercise reminded me of building and using formal test procedures when performing complex or dangerous operations. Before I had a lot of experience with test procedures, I used to think of them as an annoying check-off box for quality assurance. They were expensive to make, and they consumed huge amounts of time to build and refine. However, with more experience, I came to appreciate formal test procedures as valuable engineering design tools because they are a mechanism that injects fault tolerance into systems where the operator is an integral part of the system’s decision process. The procedure frees up the operator’s attention during “routine tasks” with the system so that the operator can better recognize and react to the shifting external conditions of a complex environment.

Similar to the BP oil well capping exercise, the formal procedures I worked on involved complex systems that used dangerous chemicals. We needed to make sure we did not damage the systems while using them, both for safety and schedule reasons. Building the formal procedure and going through it with the entire team captured each member’s specialized knowledge so that the team was able to develop and refine each step in the procedure with a higher level of confidence than any subset of the team could have achieved alone.

I personally understand the value of formal procedures for testing and operating very complex and dangerous systems, but I wonder if the formal procedure process offers similar value, compared to the cost and effort to make one, when applied to simple, low cost, or benign system designs.

Do you always build and use formal test procedures, or are there designs that are so simple, low cost, or benign that you skip the process of building one? For what types of designs would you consider skipping the formal procedure process, and why?

If you would like to suggest questions for future posts, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Robust Design: Formal Procedures

Monday, June 7th, 2010 by Robert Cravotta

Watching the live video of the BP oil pipe capping the other day made me think about formal procedures. Using a formal procedure is a way to implement the fault tolerant principle in cases where the operator is an integral part of the system’s decision process. A formal procedure helps free up the operator’s attention during “routine tasks” so that the operator can better recognize and react to the shifting conditions of a complex environment. Even though I could not see all of the video feeds or hear any of the audio feeds, I could imagine what was going on in each of the feeds during the capping operation.

I have worked on my fair share of path-finding projects that demonstrated the feasibility of design ideas for a completely autonomous vehicle that could complete complex tasks in real-world environments. The design team would subject new ideas to substantive analysis and review before building a lab implementation that we could test. With each test-case success, we would move closer to testing the system in the target environment (space-based in many cases). One thing that impressed me then and has stayed with me ever since is the importance and value of a formal operational procedure document.

I have written and reviewed many operational procedures. When I first started working with these procedures, I thought they were a bother that took a lot of time and effort to produce. We had to explicitly account for and document every little detail that we could imagine, especially the unlikely scenarios that we needed to detect during the system checkouts. We performed what seemed like endless walkthroughs of the procedures – each time refining small details in the procedure as different members of the team would share a concern about this or that. We did the walkthroughs so many times that we could complete them in our sleep. When I finally participated in my first live test, the value of those procedures and all of those walkthroughs became apparent – and they no longer seemed like a waste of time.

The formal procedure was an effective way to capture the knowledge of the many people, each with a different set of skills and specific knowledge about the system, in a single place. By performing all of those walkthroughs, it forced us to consider what we should do under different failure conditions without the burden of having to come up with a solution in real-time. The formal procedure enabled everyone on the team to be able to quickly perform complex and coordinated tasks that would be practically impossible to execute in an impromptu fashion – especially under stressful conditions.

The first and foremost reason for the procedures was to protect the people working around the system. We were dealing with dangerous materials where injuries or deaths were very real possibilities if we were not careful. The second reason for the procedures was to protect the system from operating in a predictable destructive scenario. Unpredictable failures are a common enough occurrence when you are working on the leading (bleeding?) edge because you are working in the realm of the unknown. The systems we were building only existed as a single vehicle or a set of two vehicles, and having to rebuild them would represent a huge setback.

The capping operation of the BP oil pipe appears to encompass at least as much complexity as, if not significantly more than, the projects I worked on so long ago. The video feed showed the ROV (remotely operated vehicle) robot arm disconnecting a blue cable from the capping structure. Then the video feed showed what I interpreted as the ROV operator checking out various points in the system before moving on to the next cable or step in some checklist. I could imagine the numerous go/no-go callouts from each of the relevant team members that preceded each task performed by the ROV operator. I am guessing that the containment team went through a process similar to the one we used in building their formal procedures – first in the conference room, then in simulation, and finally on the real thing 5,000 feet under the surface of the ocean.

While building and testing prototypes of your embedded systems may not involve the same adrenaline pumping excitement as these two scenarios, the cost of destroying your prototype system can be devastating. If you have experience using formal procedures while building embedded systems, and you would like to contribute your knowledge and experience to this series, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Robust Design: Fault Tolerance – Nature vs. Malicious

Tuesday, June 1st, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

For many applications, the primary focus of robust design principles is on making the design resilient to rare or unexpected real-world phenomena. Embedded designers often employ filters to help mitigate the uncertainty that different types of signal noise can cause. They might use redundant components to mitigate or compensate for specific types of periodic failures within various subsystems. However, as our end devices become ever more connected to each other, there is another source of failure that drives and benefits from a strong set of robust design principles – the malicious attack.
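As a simple illustration of the filtering idea, here is a hedged sketch of a moving-average filter of the kind often used to knock down random noise on a sampled sensor channel; the tap count and data widths are arbitrary choices for the example.

```c
#include <stdint.h>

#define FILTER_TAPS 8   /* power of two so the compiler can reduce the divide to a shift */

/* Simple moving-average filter: one way to attenuate random sensor noise.
 * Keeps the last FILTER_TAPS samples in a ring buffer and returns their mean. */
typedef struct {
    uint16_t samples[FILTER_TAPS];
    uint32_t sum;
    uint8_t  index;
} avg_filter_t;

void filter_init(avg_filter_t *f)
{
    for (int i = 0; i < FILTER_TAPS; i++)
        f->samples[i] = 0;
    f->sum = 0;
    f->index = 0;
}

uint16_t filter_update(avg_filter_t *f, uint16_t new_sample)
{
    f->sum -= f->samples[f->index];          /* drop the oldest sample */
    f->sum += new_sample;                    /* add the newest sample */
    f->samples[f->index] = new_sample;
    f->index = (uint8_t)((f->index + 1) % FILTER_TAPS);
    return (uint16_t)(f->sum / FILTER_TAPS); /* mean of the last FILTER_TAPS samples */
}
```

Feeding each raw sample through filter_update() and acting on the returned average is often enough to keep a noisy reading from spuriously tripping a threshold.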

On most of the systems I worked on as an embedded developer, we did not need to spend much energy addressing malicious attacks on the electronics and software within the system because the systems usually included a physical barrier that our electronic control system could safely hide behind. That all changed when we started looking at supporting remote access and control into our embedded subsystems. The project that drove this concept home for me was a Space Shuttle payload that would repeatedly fire its engines to maneuver around the Shuttle. No previous payload had ever been permitted to fire its engines for multiple maneuvers around the Shuttle. The only engine firing previous payloads performed was to move away from the Shuttle and into their target orbital positions.

The control system for this payload was tolerant of multiple faults, and we often joked among ourselves that the payload would be so safe that it would never be able to fire its own engines to perform its tasks because the fault tolerant mechanisms were so complicated. This was even before we knew we had to support one additional type of fault tolerance – ensuring that none of the maneuvering commands came from a malicious source. We had assumed that because we were working in orbital space and the commands would be coming from a Shuttle transmitter, we were safe from malicious attacks. The NASA engineers were concerned that a ground-based malicious command could send the payload into the Shuttle. The authentication mechanism we adopted was crude and clunky by today’s encryption standards. Unfortunately, after more than two years of working on that payload, the program was defunded and we never actually flew the payload around the Shuttle.

Tying this back to embedded systems on the ground, malicious attacks often take advantage of the lack of fault tolerance and security in a system’s hardware and software design. By deliberately injecting fault conditions into a system or a communication stream, an attacker with sufficient access to and knowledge of how the embedded system operates can create physical breaches that provide access to the control electronics or expose vulnerabilities in the software through techniques such as forcing a buffer overflow.
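As a hedged sketch of the kind of defense this implies, the C fragment below bounds-checks and authenticates an incoming command frame before acting on it. The frame layout, key size, and the toy_mac() mixing function are all invented for illustration; a real design would use a vetted MAC (for example, an HMAC from a cryptographic library) and a provisioned secret key.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command frame: length, opcode, payload, and a short tag. */
#define MAX_PAYLOAD  32
#define TAG_LEN      4

typedef struct {
    uint8_t len;                     /* payload bytes actually used */
    uint8_t opcode;
    uint8_t payload[MAX_PAYLOAD];
    uint8_t tag[TAG_LEN];            /* authentication tag over len, opcode, payload */
} cmd_frame_t;

/* Stand-in for a real MAC; this toy mixing function only shows where the check belongs. */
static void toy_mac(const uint8_t key[16], const uint8_t *data, size_t n, uint8_t out[TAG_LEN])
{
    uint32_t acc = 0x9E3779B9u;
    for (size_t i = 0; i < n; i++)
        acc = (acc ^ data[i]) * 16777619u + key[i % 16];
    for (int i = 0; i < TAG_LEN; i++)
        out[i] = (uint8_t)(acc >> (8 * i));
}

/* Constant-time compare so timing does not leak how many tag bytes matched. */
static int tags_equal(const uint8_t *a, const uint8_t *b)
{
    uint8_t diff = 0;
    for (int i = 0; i < TAG_LEN; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);
    return diff == 0;
}

/* Reject malformed or unauthenticated frames before acting on them. */
int accept_command(const cmd_frame_t *f, const uint8_t key[16])
{
    uint8_t msg[2 + MAX_PAYLOAD];
    uint8_t expected[TAG_LEN];

    if (f->len > MAX_PAYLOAD)        /* bounds check before any copy or parse */
        return 0;

    msg[0] = f->len;
    msg[1] = f->opcode;
    memcpy(&msg[2], f->payload, f->len);
    toy_mac(key, msg, (size_t)2 + f->len, expected);

    if (!tags_equal(expected, f->tag))
        return 0;                    /* wrong key or tampered frame: drop it */

    return 1;                        /* safe to dispatch f->opcode */
}
```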

Adjusting your design to mitigate the consequences of malicious attacks can significantly change how you approach analyzing, building, and testing your system. With this in mind, this series will include topics that would overlap with a series on security issues, but with a mind towards robust design principles and fault tolerance. If you have experience designing for tolerance to malicious attacks in embedded systems, not only with regards to the electronics and software, but also from a mechanical perspective, and you would like to contribute your knowledge and experience to this series, please contact me at Embedded Insights.

Robust Design: Fault Tolerance – Performance vs. Protection

Monday, May 24th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

Fault tolerant design focuses on keeping the system running or safe in spite of failures, usually through the use of independent and redundant resources in the system. Implementing redundant resources does not mean that designers must duplicate all of the system components to gain the benefits of a fault tolerant design. To contain the system cost and design complexity, often only the critical subsystems or components are implemented with redundancy. Today we will look at fault tolerance for long-term data storage, such as the hard disk drives used for network storage, to illustrate some of the performance vs. protection trade-offs that designers must make.

As network storage needs continue to grow, the consequences of a hard disk drive failure increase. Hard disk drives are mechanical devices that will eventually fail. As storage centers increase the number of hard drives they deploy, the probability of a drive failure somewhere in the center goes up correspondingly. To avoid losing data to a hard drive failure, you must either be able to detect and replace the drive before it fails, or use a fault tolerant approach to compensate for the failure.

Using a RAID (redundant array of independent disks) configuration provides fault tolerance that keeps the data on those drives available and protects it from loss through a single disk drive failure. Using a RAID configuration does not protect the data from application failures or malicious users that cause the data to be overwritten or deleted; regular data backups are a common approach used to protect from those types of failures. RAID is a technique for configuring multiple hard disk drives into a single logical device to increase the data reliability and availability by storing the data redundantly across the drives.

A RAID configuration relies on a combination of up to three techniques: mirroring, striping, and error correction. Mirroring is the easiest method for allocating the redundant data; it consists of writing the same data to more than one disk. This approach can speed up reads because the system can read different data from different disks at the same time, but it may trade off write speed if the system must confirm the data is written correctly across all of the drives. Striping is more complicated to implement, and it consists of interlacing the data across more than one disk. This approach permits the system to complete reads and writes faster than performing the same function on a single hard drive. Error correction consists of writing redundant parity data either on a separate disk or striped across multiple disks. Storing the error correction parity data means the amount of usable storage is less than the total amount of raw storage on all of the drives.
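To show why parity provides redundancy, here is a small illustrative C example (with sizes shrunk for readability, and not a real RAID implementation) that computes an XOR parity block over a stripe and then rebuilds a lost data block from the parity and the surviving blocks.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_DATA_DRIVES 3
#define BLOCK_SIZE      8      /* tiny blocks keep the example readable */

/* Parity block = XOR of all the data blocks in the stripe. */
void compute_parity(uint8_t data[NUM_DATA_DRIVES][BLOCK_SIZE], uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < NUM_DATA_DRIVES; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d][i];
}

/* A lost block is the XOR of the parity block and the surviving data blocks. */
void rebuild_block(uint8_t data[NUM_DATA_DRIVES][BLOCK_SIZE],
                   const uint8_t parity[BLOCK_SIZE], int failed_drive)
{
    memcpy(data[failed_drive], parity, BLOCK_SIZE);
    for (int d = 0; d < NUM_DATA_DRIVES; d++)
        if (d != failed_drive)
            for (int i = 0; i < BLOCK_SIZE; i++)
                data[failed_drive][i] ^= data[d][i];
}

int main(void)
{
    uint8_t stripe[NUM_DATA_DRIVES][BLOCK_SIZE] = { "drive-A", "drive-B", "drive-C" };
    uint8_t parity[BLOCK_SIZE];

    compute_parity(stripe, parity);
    memset(stripe[1], 0, BLOCK_SIZE);              /* simulate losing drive 1 */
    rebuild_block(stripe, parity, 1);
    printf("recovered: %s\n", (const char *)stripe[1]);  /* prints "drive-B" */
    return 0;
}
```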

The different combinations of these three techniques provide different performance and fault protection trade-offs. A system that implements only data striping (RAID 0) will benefit from faster read and write performance, but all of the data will be at risk because any disk failure will cause the loss of the data in the array. Data recovery can be costly and is not guaranteed with this approach. This approach is appropriate for fixed data or program structures, such as operating system images, that do not change often and can be recovered by restoring the data image from a backup.

A system that implements only mirroring (RAID 1) creates two identical copies of the data on two different drives; this provides protection from a drive failure, but it does so at the cost of 50% storage efficiency because every bit of data is duplicated. This approach allows the system to keep a file system available at all times, even when performing backups, because it can declare a disk inactive, perform a backup of that drive, and then rebuild the mirror.

A system that implements data striping across two or more data drives with a dedicated parity drive (RAID 3 and 4) provides better storage efficiency because the fault tolerance is implemented through a single extra drive in the array. The more drives in the array, the higher the storage efficiency. However, every write must also update the dedicated parity drive, so that drive throttles the maximum write performance of the system. RAID 5, which stripes the parity data across all of the drives in the array, has all but replaced RAID 3 and 4 implementations.

RAID 5 offers good storage efficiency because the parity data consumes the equivalent of only a single drive in the system. This approach suffers from poor write performance because the system must update the parity on each write. Because the parity data is striped across the drives, the system is able to continue degraded (slower) operation despite a hard drive failure. The system can rebuild a fault tolerant data image onto a new disk drive, such as a hot-swap drive, while continuing to provide read and write access to the existing data. RAID 6 extends RAID 5 by using a second set of parity blocks to tolerate two simultaneous drive failures.
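The difference between a dedicated parity drive and striped parity comes down to where each stripe's parity block lands. The tiny sketch below shows one possible rotation scheme, purely for illustration since the exact layout varies by implementation, that spreads the parity blocks, and therefore the parity-update work, across all of the drives.

```c
#include <stdio.h>

#define NUM_DRIVES 4   /* three data blocks plus one parity block per stripe */

/* One simple rotation: the parity block for stripe s lands on drive
 * (NUM_DRIVES - 1 - (s % NUM_DRIVES)), so no single drive becomes the
 * parity bottleneck the way a dedicated parity disk does. */
int parity_drive_for_stripe(int stripe)
{
    return NUM_DRIVES - 1 - (stripe % NUM_DRIVES);
}

int main(void)
{
    for (int s = 0; s < 8; s++)
        printf("stripe %d: parity on drive %d\n", s, parity_drive_for_stripe(s));
    return 0;
}
```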

Each of these configurations provides a different performance and fault tolerance trade-off and is appropriate depending on the context of the subsystem it serves. Note that the RAID controller itself can be a single point of failure within the system.

Performance and protection trade-offs are not limited to disk storage. An embedded design might copy a firmware library from flash to SRAM to gain a performance boost, but the code in SRAM is then vulnerable to unintended modification. If you would like to participate in a guest post, please contact me at Embedded Insights.

Robust Design: Disposable Design Principle

Wednesday, May 19th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

The disposable design principle focuses on short life span or limited use issues. At first glance, you might think this principle applies only to cheap systems, but that would be incorrect. An example of an expensive system that can embody the disposable design principle is an expendable vehicle or component such as a rocket engine. These systems are in contrast to a reusable space vehicle, such as the Space Shuttle, which requires a heavier mechanical structure and a recovery system (wings, a thermal protection system, and wheels) that results in a lower overall payload capacity. In practice, single-use systems are less expensive, support a shorter time to launch, and are considered low risk for mission failure for many types of missions, including launching satellites into orbit.

Limited-use or disposable embedded systems can enjoy similar advantages over reusable versions. Limited-use systems are being embedded into all types of applications, such as inventory tracking tags, medical appliances, fireworks, environmental tracking pads for agriculture, security tags on retail items, and authentication modules to ensure that consumable subsystems are not matched with unsupported end-systems.

The disposable design principle also applies to systems that enforce an end-of-life. The plight of CFLs (compact fluorescent lamps) is a good example of a product industry that is responding to the consequences of adopting or ignoring the disposable principle. When a CFL reaches its end-of-life, it can manifest a (purportedly) rare failure mode where a fuse on the control board burns out. I say purportedly rare because every CFL I have used to end-of-life (even in different lamps) has failed the same way, with a small fire, smoke that smells like burning plastic, and burnt plastic on the base of the bulb. The CFL industry has taken notice of consumer concern about unsettling end-of-life behaviors and is setting standards for handling end-of-life for CFLs. Enforcing an end-of-life mechanism can reduce the complexity the designers must accommodate because the system will shut itself down before the probability of key failure modes manifesting crosses some threshold.
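A minimal sketch of what an enforced end-of-life might look like in firmware appears below; the hour limit and the non-volatile storage helpers are placeholders, since a real product would use its own wear model and its own EEPROM or flash driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: the NV read/write helpers stand in for whatever
 * EEPROM or flash driver a real design would use. */
#define MAX_OPERATING_HOURS 8000u  /* shut down before wear-out failures dominate */

static uint32_t nv_hours;          /* stand-in for a value kept in non-volatile memory */

static uint32_t nv_read_hours(void)        { return nv_hours; }
static void     nv_write_hours(uint32_t h) { nv_hours = h; }

/* Called once per hour of operation; returns false once end-of-life is reached
 * so the application can disable itself gracefully instead of failing in service. */
bool end_of_life_tick(void)
{
    uint32_t hours = nv_read_hours();
    if (hours >= MAX_OPERATING_HOURS)
        return false;              /* end of life: refuse to keep operating */
    nv_write_hours(hours + 1);
    return true;
}
```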

Disposable or limited-use does not necessarily mean lower quality components, but it can mean that the system can take meaningful optimizations that drastically drop the cost of the system and improve the delivered quality of the end system. Disposable contact lenses are available in many styles, including daily, weekly, and monthly wear. Each type of lens makes different material trade-offs for durability and sterility, which allows each to deliver superior quality at its price point.

Disposable hearing aids use a battery or cell that is fitted permanently within the system; there is no way to replace the battery with a new one. Using a permanent battery allows a disposable hearing aid to last longer on the same stored charge than a traditional hearing aid. The permanent battery also eliminates the need for the designer to implement a battery door and a mechanism for removing and inserting batteries. In fact, some disposable hearing aid designs are able to make use of a larger microphone area that would normally be consumed by a battery replacement door and hinge.

Robust Design: Dogfooding

Monday, May 10th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

Dogfooding is an idiom coined from the saying “Eating your own dog food.” Examples of companies that practice dogfooding include Microsoft and Google. Other companies, such as Amulet Technologies and Green Hills Software, use the term informally in their presentations when they are emphasizing that they use their own tools to make the tools they sell. CNET recently reported that Adobe is planning to provide Android phones running Flash to its employees. I interpret this as a dogfooding moment for Adobe to ensure that Flash has the best possible chance to succeed on the Android mobile platform.

In the above examples, dogfooding spans from beta testing a product to outright using your own product for production work. A common thread in these examples, though, is that they are software-based products that target developers; however, dogfooding is not limited to software. As an example, many years ago my brother worked at a company that built numerical control tooling machines that it used to build the product it ultimately sold.

The purported advantage of using your own products is that it proves to customers that you believe in your own product enough to use it. However, the claim that using your own product means you catch more bugs and design flaws is not a strong one because it might suggest that the company’s QA process is less effective than it should be. One of the biggest advantages of working with your own products is that your developers become more aware of what works well with the product, as well as why and how they could improve it from a usability perspective. In essence, working with your own products makes it more obvious to the developers where the differences lie between how you envisioned the product working and how it actually performs.

However, as a best practice concept, dogfooding needs to take a different tack for embedded developers. Using the embedded system as an end product is usually not an option because embedded systems make up the invisible components of the end product. The mechanism for dogfooding an embedded design occurs during the demonstration and validation steps of the design and build effort, when the designers, integrators, and users of the embedded system work together to quantifiably prove whether the requirements specification for the embedded system was sufficient. It is not sufficient that the embedded design meets the requirements specification if the specification was lacking key information. Often, at the demonstration and validation phase, the designers and users discover what additional requirements they need to incorporate into the system design.

In short, dogfooding is that step where designers prove that the system they envisioned actually does what the end user needs it to do. In the case of embedded systems, proving out the concept to reality requires a cooperative, and often an iterative, effort between the designer, system integrator, and end user.

Robust Design: Patch-It Principle – Teaching and Learning

Monday, May 3rd, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

In the first post introducing the patch-it principle, I made the claim that developers use software patches to add new behaviors to their systems in response to new information and changes in the operating environment. In this way, patching allows developers to move the complex task of learning off the device – at least until we figure out how to build machines that can learn. In this post, I will peer into my crystal ball and describe how I see the robust design patch-it principle evolving into a mix of teaching and learning principles. There is a lot of room for future discussions, so if you see something you hate or like, speak up – it will signal that topic for future posts.

First, I do not see the software patch going away, but I do see it taking on a teaching and learning component. The software patch is a mature method of disseminating new information to fixed-function machines. I think software patches will evolve from executable code blocks to meta-code blocks. This will be essential to support multi-processing designs.

Without using meta-code, the complexity of building robust patch blocks that can handle customized processor partitioning will grow to be insurmountable as the omniscient knowledge syndrome drowns developers by requiring them to handle even more low-value complexity. Using meta-code may provide a bridge to supporting distributed or local knowledge (more explanation in a later post) processing, where the different semi-autonomous processors in a system make decisions about the patch block based on their specific knowledge of the system.

The meta-code may take on a form that is conducive to teaching rather than an explicit sequence of instructions to perform. I see devices learning how to improve what they do by observing their user or operator as well as by communicating with other similar devices. By building machines this way, developers will be able to focus more on specifying the “what and why” of a process, while the development tools assist the system in genetically searching for and applying different coding implementations and focus on a robust verification of equivalence between the implementation and the specification. This may permit systems to consist of less-than-perfect parts because verifying the implementation will account for the imperfections in the system.

The possible downside of learning machines is that they will become finely tuned to a specific user and be less than desirable to another user – unless there is a means for users to carry their preferences with them to other machines. This is already manifesting in chat programs that learn your personal idioms and automagically provide adjusted spell checking and link associations, because personal idioms do not always translate cleanly to other people, nor are they always used with the same connotation.

In order for the patch-it principle to evolve to the teach and learn principle, machines will need to develop a sense of context of self in their environment, be able to remember random details, be able to spot repetition of random details, be able to recognize sequences of events, and be able to anticipate an event based on a current situation. These are all tall orders for today’s machines, but as we build wider multiprocessing systems, I think we will stumble upon an approach to perform these tasks for less energy than we ever thought possible.

Robust Design: Patch-It Principle – Design-for-Patching

Monday, April 26th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

Patching a tire is necessary when part of the tire has been forcibly torn or removed so that it is damaged and can no longer perform its primary function properly. The same is true when you are patching clothing. Patching software in embedded systems, however, is not based on replacing a component that has been ripped from the system – rather, it involves adding new knowledge to the system that was not part of the original system. Because software patching involves adding new information to the system, there is a need for extra available resources to accommodate the new information. The hardware, software, and labor resources needed to support patching are growing as systems continue to increase in complexity.

Designing to support patching involves some deliberate resource trade-offs, especially for embedded systems that do not have the luxury of idle, unassigned memory and interface resources that a desktop computer might have access to. To support patching, the system needs to be able to recognize that a patch is available, receive the patch through some interface, and verify that the patch is real and authentic to thwart malicious attacks. It must also be able to confirm that there is no corruption in the received patch data and that the patch information has been successfully stored and activated without breaking the system.
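Here is a hedged C sketch of the staging checks described above. The header layout, magic value, and size limit are invented for the example, and a production design would add a cryptographic signature check on top of the CRC so that authenticity, not just integrity, is established.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical patch header: nothing here is a standard format, just the
 * fields the steps above imply (identify, check target, check integrity). */
typedef struct {
    uint32_t magic;          /* identifies the blob as a patch for this product */
    uint32_t target_version; /* firmware version the patch applies to */
    uint32_t payload_len;    /* bytes of patch data that follow the header */
    uint32_t payload_crc;    /* integrity check over the payload */
} patch_header_t;

#define PATCH_MAGIC     0x50415443u      /* "PATC" */
#define MAX_PATCH_LEN   (64u * 1024u)
#define RUNNING_VERSION 7u

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320): slow but table-free. */
static uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)(-(int32_t)(crc & 1u)));
    }
    return ~crc;
}

/* Returns 1 only if the staged patch looks intact and applicable. */
int patch_is_acceptable(const patch_header_t *hdr, const uint8_t *payload)
{
    if (hdr->magic != PATCH_MAGIC)                              return 0; /* not one of ours */
    if (hdr->target_version != RUNNING_VERSION)                 return 0; /* wrong baseline */
    if (hdr->payload_len == 0 || hdr->payload_len > MAX_PATCH_LEN) return 0; /* size out of range */
    if (crc32(payload, hdr->payload_len) != hdr->payload_crc)   return 0; /* corrupted in transit */
    return 1;
}
```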

In addition to the different software routines needed at each of these steps of the patching process, the system needs access to a hardware input interface to receive the patch data, an output interface to signal whether or not the patch was received and applied successfully, and memory to stage, examine, validate, apply, and store the patch data. For subsystems that are close to the user interface, gaining access to physical interface ports might be straightforward, but there is no industry-standard best practice for applying patches to deeply embedded subsystems.

It is important that the patch process does not leave the system in an inoperable state – even if the patch file is corrupted or the system loses power while applying the patch. A number of techniques designers use depend on including enough storage space in the system to house both the pre- and post-patch code so that the system can confirm the new patch is working properly before releasing the storage holding the previous version of the software. The system might also employ a safe, default boot kernel, which the patching process can never change, so that if the worst happens while applying a patch, the operator can use the safe kernel to put the system into a known state that provides basic functionality and can accept a new patch file.
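The fallback logic itself can be quite small, as in the illustrative sketch below; the image names and validity checks are stand-ins, since a real bootloader would verify a CRC or signature stored with each image in flash before transferring control to it.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative boot-selection sketch. In a real bootloader, image_is_valid()
 * would check a CRC or signature stored with each image in flash, and
 * boot() would jump to that image; here they are stubs so the logic runs. */
typedef enum { IMAGE_NEW, IMAGE_PREVIOUS, IMAGE_SAFE_KERNEL } image_t;

static bool image_is_valid(image_t img)
{
    /* Pretend the freshly applied patch was corrupted mid-update. */
    return img != IMAGE_NEW;
}

static void boot(image_t img)
{
    static const char *names[] = { "new image", "previous image", "safe kernel" };
    printf("booting %s\n", names[img]);
}

/* Prefer the freshly patched image, fall back to the last known-good image,
 * and as a last resort run the immutable safe kernel, which can only accept
 * a new patch and provide basic functionality. */
int main(void)
{
    if (image_is_valid(IMAGE_NEW))
        boot(IMAGE_NEW);
    else if (image_is_valid(IMAGE_PREVIOUS))
        boot(IMAGE_PREVIOUS);
    else
        boot(IMAGE_SAFE_KERNEL);
    return 0;
}
```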

In addition to receiving and applying the patch data, system designs are increasingly accommodating custom settings so that applying the patch does not disrupt the operator’s customizations. Preserving the custom settings may involve more than just not overwriting them; it may involve performing specific checks, transformations, and configurations before completing the patch. Supporting patches that preserve customization can involve more complexity and work from the developers to seamlessly address the differences between each version of the settings.
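One way to preserve operator customizations is to version the stored settings and migrate them as part of applying the patch, as in the hypothetical sketch below; the fields, defaults, and the assumption that fields missing from an older record load as zero are all invented for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical settings record. The version field lets the patch process
 * upgrade an older layout instead of silently discarding customizations.
 * Assumes the stored record is loaded into the current layout with any
 * fields beyond the old record's length zero-filled. */
#define SETTINGS_VERSION_CURRENT 2u

typedef struct {
    uint8_t version;
    uint8_t brightness;   /* present since version 1 */
    uint8_t volume;       /* present since version 1 */
    uint8_t language;     /* added in version 2 */
} settings_t;

/* Bring a stored settings record up to the current layout, keeping the
 * operator's existing values and filling new fields with safe defaults. */
void migrate_settings(settings_t *s)
{
    if (s->version < 1u) {            /* unrecognized or erased: full defaults */
        memset(s, 0, sizeof(*s));
        s->brightness = 50u;
        s->volume = 30u;
    }
    if (s->version < 2u) {
        s->language = 0u;             /* default language for upgraded units */
    }
    s->version = SETTINGS_VERSION_CURRENT;
}
```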

The evolving trend for the robust design patch-it principle is that developers are building more intelligence into their patch processes. This simplifies or eliminates the learning curve for the operator to initiate a patch. Smarter patches also enable the patch process to launch, proceed, and complete in a more automated fashion without causing operators with customized settings any grief. Over time, this can build confidence in the user community so that more devices can gain the real benefit of the patch-it principle – devices can change their behavior in a way that mimics learning from their environment years before designers, as a community, figure out how to make self-reliant learning machines.