Robust Design Channel

Robust Design explores the toolbox of design principles that embedded developers can use to build components and systems that continuously work as intended, even when they are subjected to unintended operating circumstances.

How do you mitigate single-point failures in your team’s skillset?

Wednesday, December 22nd, 2010 by Robert Cravotta

One of the hardest design challenges facing developers is how to keep the system operating within acceptable bounds despite being used in non-optimal conditions. Given a large enough user base, someone will operate the equipment in ways that the developers never intended. For example, a friend recently shared that his young daughter has developed an obsession with turning the lights in the house on and off repeatedly. Complicating this scenario is that some of the lights she likes to flip on and off are fluorescent lights – the tubes, not CFLs (compact fluorescent lights). Unfortunately, repeatedly turning them on and off in this fashion significantly reduces their useful life; those lights were not designed for that kind of operating condition. I’m not sure designers can ever build a fluorescent bulb that will flourish under that kind of treatment – but you never know.

Minimizing and eliminating single-point failures in a design is a valuable strategy for increasing the robustness of the design. Experienced developers exhibit a knack for avoiding and mitigating single-point failures – often as the result of experience with similar failures in previous projects. Successful methods for avoiding single-point failures usually involve implementing some level of overlap or redundancy between separate, and ideally independent, parts of the system.

A look at the literature addressing single-point failures reveals a focus on technical and tangible items like devices and components, but there is an intangible source of single-point failures that can be devastating to a project – when a given skillset or knowledge set is a single-point failure. I was first introduced to this idea when someone asked me “What will you do if Joe wins the Lottery?” We quickly established that winning the Lottery was a nice way to describe a myriad of unpleasant scenarios to consider – in each case the outcome is the same – Joe, with all of his skills, experience, and project specific knowledge, leaves the project.

As a junior member of the technical staff, I did not need to worry about this question, but once I moved into the ranks of project lead, that question became immensely more important. If you have the luxury of a large team and budget, you might assign people to overlapping tasks. However, small teams may lack not just the budget but also the cognitive bandwidth for team members to be aware of everything everyone else is doing.

One approach we used to mitigate the consequences of a key person “winning the Lottery” involved holding regular project status meetings. Done correctly, these meetings provide a quick and cost-effective mechanism for spreading project knowledge among more people. The trick is to avoid involving too many people for too long or too frequently, so that the meetings do not cost more than the benefit they provide. Maintaining written documentation is another approach for making sure the project can recover from the loss of a key member. For more tactical types of skills, we also contracted with an outside team that specialized in that skillset. By working with someone who understands the project’s tribal knowledge, this approach can help the team recover quickly and salvage the project.

What methods do your teams employ to protect from the consequences of a key person winning the Lottery?

Does your design team suffer from “Group Think?”

Wednesday, November 17th, 2010 by Robert Cravotta

I have had the opportunity to work with many teams of excellent people. Over the years, I developed a way to measure the health of a company by examining how team members interacted with each other – most notably, how often people voice their disagreements. I was amazed when I first realized there was a positive correlation between the number of arguments you could hear between team members and the health of the company we all worked for. This correlation, at least in my personal experience, held up across different teams, different companies, and different industries.

When I started as an engineer, it was common to hear, and participate in, passionate discussions about the merits and faults of different approaches to the design problems we needed to solve. The lively discussions transcended seniority: everyone participated passionately and everyone’s voice was heard. It was never a foregone conclusion that the grey beards were always right and the “greenies” always wrong. In fact, in plenty of cases, the junior members provided valuable contributions to the trade-offs the team eventually made.

In contrast, when the health of the company was suffering, the tone in the hallways changed. The number and intensity of the lively discussions would taper off. By the time that dissenting ideas stopped being offered, the company’s poor health was visible to everyone. It was especially during the several down times I have lived through that I learned to recognize the reemergence of lively discussions as a harbinger of better times.

It wasn’t until I had experienced a couple of these cycles that I learned about the concept called Group Think. From my observations, the most important aspect of Group Think is the suppression of contradictory ideas. During these times of low disagreement, the group does not explore the problem space robustly enough because team members are less willing to take the risk of disagreeing with the leadership. That increases the risk that the team will miss an important detail, leading to an expensive failure-and-fix process.

This relationship between open disagreements and lively discussions has been so strong during my career that the obvious presence or lack of such discussions plays a key role for me when I am considering joining a company or group. Does your team suffer from Group Think? Have you discovered other ways to measure the health of a group or company? Have you discovered ways to revive a group into the more “confrontational” means of working together? Or does your experience differ from mine as to the value of lively groups?

How do you handle contractual indemnity and liabilities for embedded systems?

Wednesday, November 10th, 2010 by Robert Cravotta

Embedded designs continue to grow in complexity, and yet they are becoming a mainstay in nearly every product – even systems that demand the highest quality, such as medical, automotive, and industrial equipment. Tim Cummins comments on the trend “by buyers to simply allocate the risks of failure to their suppliers through broad-brush application of ‘burdensome’ terms, such as onerous liability and indemnity provisions.” By definition, embedded designs are not the end device; they are used by someone else in a final device – thus embedded developers always find themselves in the position of supplier to some end-device manufacturer.

I find that requests for onerous liability and indemnity provisions are not limited to embedded designs, where there is the potential for significant unknowns, but also appear in more mundane spaces such as writing articles. My background in aerospace taught me that unlimited liabilities are never worth agreeing to. In fact, it is better to avoid indemnity and liability clauses where possible, but buyers who will accept that seem to be a rarer and rarer breed. An approach I have taken is to explicitly describe and limit what liabilities I am willing to take on in a contract – specifically what I will warrant and guarantee about the product I deliver to the customer.

Warranties may explicitly identify the limits within which the subsystem or product can be expected to work as specified – which means a growing amount of resources is spent specifying what the system is not designed to handle. Is this a best-practice approach? What does your team do to address a buyer’s risk and liability concerns for your embedded components?

Are you, or someone you know, using voting within your embedded system designs?

Wednesday, November 3rd, 2010 by Robert Cravotta

With the midterm elections in the United States winding down, I thought I’d try to tie embedded design concepts to the process of elections. On some of the aerospace projects I worked on, we used voting schemes as fault-tolerance techniques. In some cases, because we could not trust the sensors, we used multiple sensors and performed voting among the sensor controllers (along with separate and independent filtering) to improve the quality of the data that fed our control algorithms. We might use multiple sensors of the same type, and in some cases we would use sensors that differed from each other significantly so that they would not be susceptible to the same types of bad readings.
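For illustration only – not any project’s actual flight code – here is a minimal sketch of a median-style voter over three redundant readings. The function name and the agreement tolerance are assumptions made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define AGREE_TOLERANCE 5   /* hypothetical "close enough" threshold */

static int32_t diff_abs(int32_t x, int32_t y)
{
    return (x > y) ? (x - y) : (y - x);
}

/* Return the median of three redundant sensor readings and flag the
 * result as degraded when no pair of readings agrees within tolerance. */
int32_t vote_three(int32_t a, int32_t b, int32_t c, bool *degraded)
{
    int32_t median;

    /* Select the middle value without sorting the inputs. */
    if ((a >= b && a <= c) || (a <= b && a >= c)) {
        median = a;
    } else if ((b >= a && b <= c) || (b <= a && b >= c)) {
        median = b;
    } else {
        median = c;
    }

    /* Degraded if no pair of readings agrees within the tolerance. */
    *degraded = (diff_abs(a, b) > AGREE_TOLERANCE) &&
                (diff_abs(b, c) > AGREE_TOLERANCE) &&
                (diff_abs(a, c) > AGREE_TOLERANCE);

    return median;
}
```

A caller would typically feed the degraded flag into its health monitoring so that a persistently outvoted sensor can be reported or excluded.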

I did a variety of searches on fault tolerance and voting to see if there was any recent material on the topic. There was not a lot of material available, and what was available was scholarly and generally not downloadable. It is possible I did a poor job choosing my search terms. However, this lack of material made me wonder whether people are using the technique at all, or whether it has evolved into a different form – in this case, sensor fusion.

Sensor fusion is the practice of combining data derived from disparate sensor sources to deliver “better” data than would be possible if those sources were used individually. “Better” in this case can mean more accurate, more complete, or more reliable data. From this perspective, the fusion of the data is not strictly a voting scheme, but there are some similarities with the original concept.
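To show how fusion differs from a straight vote, here is a minimal sketch of inverse-variance weighting of two estimates of the same quantity (say, altitude from two dissimilar sensors). The function name and the assumption that each source supplies its own variance are inventions for the example.

```c
/*
 * Variance-weighted fusion of two independent estimates of the same
 * quantity. Each input carries its own variance; the fused estimate
 * weights the more trustworthy (lower-variance) source more heavily.
 */
float fuse_two(float x1, float var1, float x2, float var2)
{
    /* Weights are inversely proportional to each source's variance. */
    float w1 = var2 / (var1 + var2);
    float w2 = var1 / (var1 + var2);

    return (w1 * x1) + (w2 * x2);
}
```

The lower-variance source automatically dominates the result, which is the sense in which fused data can be “better” than either input alone.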

This leads me to this week’s question. Are you, or someone you know, using voting or sensor fusion within embedded system designs? As systems continue to increase in complexity, robust design principles that enable systems to operate correctly with less-than-perfect components become more relevant. Are the voting schemes of yesterday still relevant, or have they evolved into something else?

Software Coding Standards

Friday, October 8th, 2010 by Robert Cravotta

The two primary goals of many software coding standards are to reduce the probability that software developers will introduce errors into their code through “poor” coding practices and to make it easier to identify errors or vulnerabilities that make it into a project’s code base. By adopting and enforcing a set of known best practices, coding standards enable software development teams to work together more effectively because they are working from a common set of assumptions. Examples of the types of assumptions that coding standards address include: prohibiting language constructs known to be associated with common runtime errors; specifying when and how compiler- or platform-specific constructs may and may not be used; and specifying policies for managing system memory resources, such as static and dynamic memory allocation.
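As a rough illustration – paraphrasing the spirit of such rules rather than quoting any particular standard – the fragment below shows the shape of code these standards tend to push toward: fixed-width types, statically allocated buffers instead of run-time allocation, and a default case in every switch.

```c
#include <stdint.h>
#include <string.h>

#define MSG_BUF_SIZE 64U

/* Statically allocated working buffer: no malloc() after initialization. */
static uint8_t msg_buf[MSG_BUF_SIZE];

typedef enum { MODE_IDLE, MODE_RUN, MODE_FAULT } run_mode_t;

uint32_t handle_mode(run_mode_t mode)
{
    uint32_t status;

    switch (mode) {
    case MODE_IDLE:
        status = 0U;
        break;
    case MODE_RUN:
        status = 1U;
        break;
    case MODE_FAULT:
        (void)memset(msg_buf, 0, sizeof(msg_buf));
        status = 2U;
        break;
    default:                /* required even when the enum looks exhaustive */
        status = 3U;
        break;
    }

    return status;
}
```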

Because coding standards involve aligning a team of software developers to a common set of design and implementation assumptions and because every project has its own unique set of requirements and constraints, there is no single, universal best coding standard. Industry-level coding standards center on a given programming language, such as C, C++, and Java. There may be variants for each language based on the target application requirements, such as MISRA-C (Motor Industry Software Reliability Association), CERT C (Computer Emergency Response Team), JSF AV C++ (Joint Strike Fighter), IPA/SEC C (Information-Technology Promotion Agency/ Software Engineering Center), and Netrino C.

MISRA started as a guideline for the use of the C language in vehicle-based software, and it has found acceptance in the aerospace, telecom, medical device, defense, and railway industries. CERT is a secure coding standard that provides rules and recommendations to eliminate insecure coding practices and undefined behaviors that can lead to exploitable vulnerabilities. JSF specifies a safe subset of the C++ language targeting use in air vehicles. The IPA/SEC specifies coding practices to assist in the consistent production of high-quality source code independent of an individual programmer’s skill. Netrino is an embedded coding standard targeting the reliability of firmware while also improving the maintainability and portability of embedded software.

Fergus Bolger, CTO at PRQA, shares that different industries need to approach software quality from different perspectives – which adds more complexity to the sea of coding standards. For example, aerospace applications exist in a heavily certified environment; adopting coding standards is common for aerospace projects, where the end-system software and the tools that developers use go through a certification process. In contrast, the medical industry takes a more process-oriented approach, where it is important to understand how the tools are made. MISRA is a common coding standard in the medical community.

At the other end of the spectrum, the automotive industry has an installed software code base that is huge and growing rapidly. Consider that a single high-end automobile can include approximately 200 million lines of code to manage the engine and system safety as well as all of the bells and whistles of the driver interface and infotainment systems. Due to the sheer amount of software, there is less code verification. Each automotive manufacturer has its own set of mandatory and advisory rules that it includes with the MISRA standard.

A coding standard loses much of its value if it is not consistently adhered to, so a number of companies have developed tools to help with compliance and software checking. The next article will start the process of identifying the players and their approaches to supporting coding standards.

Identifying sweet spot assumptions

Monday, August 30th, 2010 by Robert Cravotta

I am continuing to develop a taxonomy to describe the different types of software tools. Rather than waiting until I have a fully fleshed out model, I am sharing my thought process with you in the hopes that it will entice you to share your thoughts and speed up the process of building a workable model.

I am offering the following processor mapping as an example of how an analogous software mapping might look. The mapping identifies two independent characteristics – in this case, the number of states and the amount of computation that the system must handle. One nice thing about mapping design characteristics like this is that it provides independence from the target application and allows us to focus on what an architecture is optimizing and why.

For example, a microcontroller’s sweet spot is in the lower end of the computation load but spans from very simple to complicated state machines. Microcontroller architectures emphasize excellent context switching. In contrast, DSP architectures target streaming problems where context switching is less important and maximizing computation for the same amount of time/energy is more important.

I suspect that if we can identify the right characteristics for the axes of the mapping space, software tools will fall into analogous categories of assumptions and optimizations. The largest challenge at this moment is identifying the axes. Candidate characteristics include measures of productivity, efficiency, reusability, abstraction, coupling, and latency tolerance.

An important realization is that the best any software can accomplish is to not stall the hardware processing engine. The software will perform data manipulations and operations that cause the processing engine to stall, or be idle, some percentage of the time. As a result, all software tools are productivity tools that strive to help the developer produce software that is efficient enough to meet the performance, schedule, and budget requirements of the project. This includes operating systems, which provide a layer of abstraction from the underlying hardware implementation.

I propose using a measure of robustness or system resilience and a measure of component coupling as the two axes to map software development tools to a set of assumptions and optimization goals.

The range for the component coupling axis starts at microcode and moves toward higher levels of abstraction such as machine code, assembly code, BIOS, drivers, libraries, operating systems, and virtual machines. Many embedded software developers must be aware of multiple levels of the system in order to extract the required efficiency from the system. As a result, many software tools also target one or more of these layers of abstraction. The more abstraction layers that a tool accommodates, the more difficult it is to build and support.

Consider that while a compiler ideally allows a developer to work at a functional and/or data-flow level, it must also provide the developer visibility into the lower-level details in case the generated code performs in an unexpected fashion that varies with the hardware implementation. The compiler may include an inline assembler and support #pragma statements that enable the developer to better specify how the compiler can use special resources in the target system.
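A hedged sketch of what those escape hatches can look like follows. The pragma, the section name, and the assembly mnemonics are placeholders; real names and syntax differ from one toolchain and target to the next, so treat this as the shape of the feature rather than any specific compiler’s syntax.

```c
#include <stdint.h>

/* Hypothetical pragma asking the compiler to place this buffer in a fast
 * memory section; the pragma name and section name are placeholders. */
#pragma location = "FAST_RAM"
static volatile uint16_t sample_buf[256];

uint16_t read_sample(uint8_t index)
{
    return sample_buf[index];
}

void critical_update(uint16_t value)
{
    /* Inline assembly to bracket a critical update with interrupt
     * disable/enable (GCC-style syntax and Cortex-M mnemonics shown as
     * an example; other compilers and cores differ). */
    __asm__ volatile ("cpsid i");
    sample_buf[0] = value;
    __asm__ volatile ("cpsie i");
}
```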

The robustness axis is harder to define at this moment. The range for the robustness axis should capture the system’s tolerance to errors, inconsistent results, latency, and determinism. My expectation for this axis is to capture the trade-offs that allow the tool to improve the developer’s productivity while still producing results that are “good enough.”  I hope to be able to better define this axis in the next write-up.

Do you have any thoughts on these two axes? What other axes should we consider? The chart can go beyond a simple 2D mapping.

Software Ecosystem Sweet Spots

Monday, August 16th, 2010 by Robert Cravotta

I have been refining a sweet spot taxonomy for processors and multiprocessing designs for the past few years. This taxonomy highlights the different operating assumptions for each type of processor architecture including microcontrollers, microprocessors, digital signal processors, hardware accelerators, and processing fabrics.

I recently started to focus on developing a similar taxonomy to describe the embedded software development ecosystem that encompasses the different types of tools and work products that affect embedded software developers. I believe developing a taxonomy that identifies the sweet spot for each type of software component in the ecosystem will enable developers and tool providers to better describe the assumptions behind each type of software development tool and how to evolve them to compensate for the escalating complexity facing embedded software developers.

The growing complexity facing software developers manifests in several ways. One source is the increase in the amount of code in designs. The larger the amount of code within a system, the more opportunities there are for unintended resource dependencies that affect the performance and correct operation of the overall system. A modular design approach is one technique for managing some of this complexity, but modular design abstracts the resource usage within a module; it does not directly address how to manage the memory and timing resources that the modules share with each other by virtue of executing on the same processor.
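A minimal sketch of that distinction, using an invented filter module: the module hides its buffer and state behind a two-function interface, yet the cycles spent in filter_update() and the RAM consumed by its history buffer are still drawn from the same pools every other module on the processor uses – coupling the module boundary does not express.

```c
#include <stdint.h>

/* Public interface: other modules never see the buffer or index state. */
void    filter_init(void);
int16_t filter_update(int16_t raw_sample);

/* Private implementation: storage is owned by this module, but it still
 * consumes the shared RAM budget, and filter_update() consumes shared
 * processor time on every call. */
#define TAP_COUNT 8U

static int16_t history[TAP_COUNT];
static uint8_t head;

void filter_init(void)
{
    for (uint8_t i = 0U; i < TAP_COUNT; i++) {
        history[i] = 0;
    }
    head = 0U;
}

int16_t filter_update(int16_t raw_sample)
{
    int32_t sum = 0;

    history[head] = raw_sample;
    head = (uint8_t)((head + 1U) % TAP_COUNT);

    for (uint8_t i = 0U; i < TAP_COUNT; i++) {
        sum += history[i];
    }

    return (int16_t)(sum / (int32_t)TAP_COUNT);   /* simple moving average */
}
```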

Another source of growing complexity is “path-finding” implementations of new functions and algorithms, because many algorithms and approaches for solving a problem are first implemented as software. New algorithms undergo an evolution as the assumptions in their specification and coding implementation are exercised under a wider range of operating conditions. It is not until the implementation of those algorithms matures, by being used across a wide enough range of operating conditions, that it makes sense to implement hardened and optimized versions using coprocessors, accelerators, and specialized logic blocks.

According to many conversations I have had over the years, the software in most embedded designs consumes more than half of the development budget; this ratio holds true even for “pure” hardware products such as microcontrollers and microprocessors. Consider that no company releases contemporary processor architectures anymore without also providing significant software assets that include tools, intellectual property, and bundled software. The bundled software is necessary to ease the learning curve for developers to use the new processors and to get their designs to market in a reasonable amount of time.

The software ecosystem taxonomy will map all types of software tools including assembler/compilers, debuggers, profilers, static and dynamic analyzers, as well as design exploration and optimization tools to a set of assumptions that may abstract to a small set of sweet spots. It is my hope that applying such a taxonomy will make it easier to understand how different software development tools overlap and complement each other, and how to evolve the capabilities of each tool to improve the productivity of developers. I think we are close to the point of diminishing returns of making compilation and debugging faster; rather, we need more tools that understand system level constructs and support more exploration – otherwise the continuously growing complexity of new designs will negatively impact the productivity of embedded software developers in an increasingly meaningful manner.

Please contact me if you would like to contribute any ideas on how to describe the assumptions and optimization goals behind each type of software development tool.

Eating dog food? It’s all in the preparation.

Monday, June 28th, 2010 by Jason Williamson

Altia provides HMI (human machine interface) engineering tools to companies in industries like automotive, medical, and white goods. When you’re providing interface software, it makes sense to use your own tools for “real” work, just as your customers would. Not only do you prove you know your own product, but you get an invaluable “user’s perspective” into the workings of your software. You get the opportunity to see where your tools shine and where they are lacking, allowing your team to plan for new features to make them better. Through our own “dog fooding” experiences, we have developed some valuable guidelines that we believe make the process go more smoothly.

First, it is important to use only released versions of the product. It is tempting to pull the latest beta capabilities into a project, but this is a perilous course. There is a reason why that feature hasn’t been released: it hasn’t been through the full test cycle. You cannot risk the project schedule or the quality of what is delivered – producing quality on time is why you’ve been engaged in the first place. Another reason to stick with released versions of your tools is that you should approach all of your consulting work with the idea that the customer will ultimately need to maintain the project. They need to know that the features and output used in the creation of the project are mature and trustworthy.

The next guideline addresses releases and your revision control system. A revision control system is the repository where all of the versions of product source code are stored; this often includes the “golden” release versions of the product as well as in-development “sandboxes.” We structure our revision control system such that release-worthy code for new features is kept in a nearly ready-to-release state as the next version of our product. That is, whole feature groups should be checked in together and tested to the extent that only running the overall test suites is needed to create a product. That way, if a new feature absolutely must be used in a project, you have a lower barrier to an interim release.

Finally, it is very important to spend sufficient time architecting the project. When deadlines rapidly approach, it is tempting to take shortcuts to the end result. Since you know your software so well, you can be quite certain that these shortcuts will not be a detriment to the delivered product. However, this is almost always a shortsighted choice. When handing off the design to another person, especially a valued customer, a well-documented and rigorously-followed architecture is paramount. Your customers need to own and usually extend this design. There should be no “duct tape” in it. Who would want to receive that call to explain a kludge four years after the project has been delivered?

I encourage you to have a hearty helping of your own dog food. Not only do you serve up a result that will please your customer, but you learn by experience where you can make your software stronger and more capable. By developing with current releases, by keeping new features tested and ready to go, and by taking appropriate measures to architect the project, you make the eating of your own dog food a gourmet experience — and keep your customers coming back for seconds.

Do you permit single points of failure in your life?

Wednesday, June 23rd, 2010 by Robert Cravotta

AT&T’s recent national outage of their U-Verse voice service affected me for most of one day last month. Until recently, such outages never affected me because I was still using a traditional landline phone service. That all changed a few months ago when I decided that the risk and consequences of an outage might be offset by the additional services and lower cost of the VoIP service over the landline service. Since the outage, I have been thinking about whether I properly evaluated the risks, costs, and benefits, and whether I should keep or change my services.

The impact of the outage was significant, but not as bad as it could have been. The outage did not affect my ability to receive short phone calls or send and receive emails. It did, however, severely reduce my ability to make outgoing phone calls and to maintain a long phone call, as the calls that did get through would randomly drop. I had one scheduled phone meeting that I had to reschedule as a result of the outage. Overall, the severity and duration of the outage were not sufficient to cause me to drop the VoIP service in favor of the landline service. However, if similar outages were to occur more frequently than on a twelve-month cycle, or for more than a few hours at a time, I might seriously reconsider this position.

An offsetting factor in this experience was my cell phone. My cell phone sort of acts as my backup phone in emergencies, but it is insufficient for heavy-duty activity in my office because I work at the edge of a wireless coverage dead spot in the mountains. I find it ironic that the cell phone has replaced my landline as my last line of defense for communicating in emergencies, because I kept the landline so long as a last line of defense against the wireless phone service going down.

Many people are making this type of trade-off (knowingly or not). A May 12, 2010 report from the Centers for Disease Control and Prevention says that 24.5% of American homes in the last half of 2009 had only wireless phones. According to the report, 48.6% of adults aged 25 to 29 years old lived in households with only wireless phones. The term VoIP never shows up in the report, so I cannot determine whether the data lumps landline and VoIP services into the same category.


[Image: 100623-phones.png]

Going with a wireless-only household incurs additional single-point-of-failure exposure: 9-1-1 operators cannot automatically locate you in an emergency, and in a crisis, such as severe storms, the wireless phone infrastructure may become overloaded and prevent you from getting a cell signal.

The thing about single points of failure is that they are not always obvious until you are already experiencing the failure. Do you permit single points of failure in the way you design your projects or in your personal life choices? For the purpose of this question, ignoring the possibility of a single point of failure is an implied acceptance of the risk-and-benefit trade-off.

If you would like to suggest questions to explore, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Operational Single Points of Failure

Monday, June 21st, 2010 by Robert Cravotta

A key tenet of fault-tolerant design is to eliminate all single points of failure from the system. A single point of failure is a component or subsystem whose failure can cause the rest of the system to fail. When I was first exposed to the single-point-of-failure concept, we used it to refer to subsystems in electronic control systems. A classic example is a function implemented completely in software: even if you use multiple functions or algorithms to check each other, the processor core itself remains a single point of failure in a system with only one processor.
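A minimal sketch of that limitation, with invented function names and tolerances: the two independently coded paths below cross-check each other and can catch some logic or data-corruption faults, but both execute on the same core, so a failure of that core defeats both.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder: a real system would interpolate a calibration table. */
static int32_t throttle_from_table(int32_t pedal)
{
    return pedal * 2;
}

/* Placeholder: a real system would evaluate an independently derived model. */
static int32_t throttle_from_model(int32_t pedal)
{
    return (pedal * 4) / 2;
}

/* Cross-check two independently coded computations before using the result.
 * Disagreement drops the system into a safe state -- but note that both
 * paths still run on the same processor core. */
bool compute_throttle(int32_t pedal, int32_t *throttle_out)
{
    int32_t a = throttle_from_table(pedal);
    int32_t b = throttle_from_model(pedal);
    int32_t delta = (a > b) ? (a - b) : (b - a);

    if (delta > 2) {          /* illustrative agreement tolerance */
        return false;         /* mismatch: command the safe state instead */
    }

    *throttle_out = (a + b) / 2;
    return true;
}
```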

As my experience grew, and I was exposed to more of the issues facing internal program management as well as design level trade-off complexities, I appreciated that single points of failure are not limited to just what is inside your box. It is a system level concept and while it definitely applies to the obvious engineering candidates, such as hardware, software, and mechanical considerations (and thermal and EMI and … and …), it also applies to team processes, procedures, staffing, and even third-party services.

Identifying single points of failure in team processes and procedures can be subtle, but the consequences of allowing them to stay within your system design can be as bad as a conventional engineering single point of failure. As an example, processes that only a single person executes are possible sources of failure because there is no cross-check or measurement to ensure the person is performing the process correctly, which might allow certain failure conditions to go undetected. In contrast, you can eliminate such a failure point if the process involves more than one person and the tasks performed by both people support some level of cross-correlation.

Staffing policies can introduce dangerous single points of failure into your team or company, especially if there is no mechanism for the team to detect and correct when a given skill set is not duplicated across multiple people on the team or in the company. You never know when or why that person with the unique skills or knowledge will become unavailable. While you might be able to contact them if they leave the company or win the lottery, you would have a hard time tapping their knowledge if they died.

There was a cartoon I displayed in my office for a while many years ago that showed a widow and her child in the rain standing over a grave, and there is an engineer standing next to them asking if the husband ever mentioned anything about source code. The message is powerful and terrifying for anyone that is responsible for maintaining systems. The answer is to plan for redundancy in your staff’s skills and knowledge. When you identify that you have a single point of failure in your staff’s skills or knowledge, commit to fixing that problem as soon as possible. Note that this is a “when condition” and not an “if condition” because it will happen from time to time for reasons completely out of your control.

The thing to remember is that single points of failure can exist anywhere in your system and are not limited to just the components in your products. As systems include more outside or third-party services or partners, the scope of the system grows accordingly and the impact of non-technical single points of failure can grow also.

If you would like to be an information source for this series or provide a guest post, please contact me at Embedded Insights.

[Editor's Note: This was originally posted at the Embedded Master]