Entries Tagged ‘Failure’

How should embedded systems handle battery failures?

Wednesday, November 30th, 2011 by Robert Cravotta

Batteries – increasingly we cannot live without them. We use batteries in more devices than ever before, especially as the trend to make a mobile version of everything continues its relentless advance. However, the investigation and events surrounding the Chevy Volt battery fires are yet another reminder that every engineering decision involves tradeoffs. In this case, damaged batteries, especially large ones, can cause fires. This is not the first time we have seen damaged-battery issues, either – remember the exploding cell phone batteries from a few years ago? Well, that problem has not been completely licked, as there are still reports of exploding cell phones even today (in Brazil).

These incidents remind me of when I worked on a battery charger and controller system for an aircraft. We put a large amount of effort into ensuring that the fifty-plus-pound battery could not and would not explode, no matter what type of failure it might endure. We had to develop a range of algorithms to constantly monitor each cell of the battery and respond appropriately if anything improper started to occur with any of them. One additional constraint on our responses, though, was that the battery still had to deliver power when the system demanded it, even with parts of the battery damaged or failing.
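The article does not describe the actual monitoring algorithms, but the core idea – scan every cell, flag the improper ones, yet keep the pack delivering power as long as it safely can – might be sketched like this. All the limits, names, and the majority-of-healthy-cells policy below are hypothetical illustrations, not values from any real battery system:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cell limits -- real values depend on the battery chemistry. */
#define CELL_V_MIN_MV   1800  /* below this, treat the cell as failing        */
#define CELL_V_MAX_MV   2400  /* above this, flag a possible overcharge       */
#define CELL_TEMP_MAX_C   60  /* crude thermal-runaway guard                  */

typedef struct {
    int  voltage_mv;
    int  temp_c;
    bool flagged;   /* cell has shown improper behavior */
} cell_t;

/* Scan every cell and flag any that are out of bounds, but report whether
 * the pack as a whole can still deliver power: a flagged cell is isolated
 * rather than used as a reason to shut the whole pack down. */
bool pack_can_deliver(cell_t *cells, size_t n)
{
    size_t healthy = 0;
    for (size_t i = 0; i < n; i++) {
        if (cells[i].voltage_mv < CELL_V_MIN_MV ||
            cells[i].voltage_mv > CELL_V_MAX_MV ||
            cells[i].temp_c     > CELL_TEMP_MAX_C) {
            cells[i].flagged = true;   /* respond to the improper cell */
        }
        if (!cells[i].flagged)
            healthy++;
    }
    /* Assumed policy: the pack still delivers while a majority of cells
     * remain healthy. A real design would derive this from load analysis. */
    return healthy * 2 > n;
}
```

A real controller would run such a scan continuously and feed the flags into the charging and load-management logic rather than making a single pass.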

Keeping the battery operating as well as it can under all conditions may sound like an extreme requirement, but it is not all that extreme when you realize that automobiles and possibly even cell phones sometimes demand similar levels of operation. I recall discussing the exploding batteries a number of years ago, and one comment was that the explosions were a system-level design concern rather than just a battery manufacturing issue – in most of the exploding-phone cases at that time, the explosions were the consequence of improperly charging the battery at an earlier time. Adding intelligence to the battery to reject a charging load that was out of specification was a system-level method of minimizing the opportunity to damage the batteries through improper charging.
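That battery-side intelligence amounts to a simple gate: accept the offered charging load only when it sits inside the specified envelope, instead of trusting the charger to behave. A minimal sketch, with entirely hypothetical limits and function names (a real design would take them from the cell datasheet):

```c
#include <stdbool.h>

/* Hypothetical charging envelope -- actual limits come from the cell spec. */
#define CHARGE_V_MAX_MV   4250  /* reject chargers pushing above this voltage */
#define CHARGE_I_MAX_MA   2000  /* reject currents above the rated charge rate */
#define CHARGE_TEMP_MIN_C    0  /* no charging below freezing                  */
#define CHARGE_TEMP_MAX_C   45  /* no charging when the cell is already hot    */

/* Return true only when the offered charging load is inside the spec;
 * the battery-side controller refuses everything else. */
bool accept_charge(int charger_mv, int charger_ma, int cell_temp_c)
{
    if (charger_mv > CHARGE_V_MAX_MV)
        return false;
    if (charger_ma > CHARGE_I_MAX_MA)
        return false;
    if (cell_temp_c < CHARGE_TEMP_MIN_C || cell_temp_c > CHARGE_TEMP_MAX_C)
        return false;
    return true;
}
```

The point of putting the check on the battery side is that it protects the cells even when the charger, the host device, or the user does something out of spec.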

Given the wide range of applications that batteries are finding use in, what design guidelines do you think embedded systems should follow to provide the safest operation of batteries despite the innumerable ways that they can be damaged or fail? Is disabling the system appropriate?

Food for thought on disabling the system is how CFLs (compact fluorescent lamps) handle the end-of-life condition that arises when too much of the mercury has migrated to the other end of the lighting tube – they purposely burn out a fuse so that the controller board is unusable. While this simple approach avoids operating a CFL beyond its safe range, it has caused much concern among users, as more and more people are frightened by the burning components in their lamps.

How should embedded systems handle battery failures? Is there a one-size-fits-all approach, or even a tiered approach to handling different types of failures, so that users can confidently use their devices without fear of explosion or fire – while still knowing when there is a problem with the battery system so they can get it fixed before it becomes a major one?
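One way to picture the tiered approach the question raises is a severity-to-response mapping, where only the worst tier disables the device and the milder tiers warn or limit so the user can get the battery serviced in time. The tiers, names, and actions below are purely illustrative assumptions:

```c
/* Hypothetical tiers -- one possible mapping from fault severity to response. */
typedef enum {
    FAULT_NONE,       /* normal operation                                  */
    FAULT_DEGRADED,   /* e.g. reduced capacity: tell the user              */
    FAULT_SERIOUS,    /* e.g. cell imbalance: limit charge and load        */
    FAULT_CRITICAL    /* e.g. thermal-runaway risk: disable the pack       */
} fault_tier_t;

typedef enum {
    ACT_NORMAL,
    ACT_WARN,         /* device stays usable, user is informed             */
    ACT_LIMIT,        /* degraded operation until the battery is serviced  */
    ACT_SHUTDOWN      /* the CFL-style last resort, reserved for real danger */
} action_t;

action_t respond_to_fault(fault_tier_t tier)
{
    switch (tier) {
    case FAULT_DEGRADED: return ACT_WARN;
    case FAULT_SERIOUS:  return ACT_LIMIT;
    case FAULT_CRITICAL: return ACT_SHUTDOWN;
    default:             return ACT_NORMAL;
    }
}
```

The design choice worth noting is that shutdown is the response to exactly one tier; everything milder keeps the device usable while still surfacing the problem, which addresses the user-confidence concern raised above.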

How does your company handle test failures?

Wednesday, August 17th, 2011 by Robert Cravotta

For many years, most of the projects I worked on were systems that had never been built before in any shape or form. As a consequence, many of the iterations of each of these projects included significant and sometimes spectacular failures as we moved closer to a system that could perform its tasks successfully in an increasingly wider circle of environmental conditions. These path-finding designs needed to operate in a hostile environment (low Earth orbit), and they needed to make autonomous decisions because there was no way to guarantee that instructions could arrive from a central location in a timely fashion.

The complete units themselves were unique prototypes, with no more than two iterations in existence at a time. It would take several months to build each unit and develop the procedures by which we would stress and test what the unit could do. The testing process took many more months as the system integration team moved through ground-based testing and eventually on to space-based testing. A necessary cost of deploying each unit was losing it when it reentered the Earth’s atmosphere, but a primary goal for each stage of testing was to collect as much data as possible from the unit until it was no longer able to operate or transmit telemetry about its internal state of health.

During each stage of testing, the unit was placed into an environment that would minimize the physical damage the unit could be subjected to (such as operating the unit within a netted room that would keep it from crashing into the floor, walls, or ceiling). The preparation work for each formal test consisted of weeks of refining every detail in a written test procedure that fortyish people would follow exactly. Any deviation during the final test run could flag a possible abort of the run.

Despite all of these precautions, sometimes things just did not behave the way the team expected. In each failure case, it was essential that the post-mortem team be able to explicitly identify what went wrong and why, so that future iterations of the unit would not repeat those failures. Because we were learning how to build a completely autonomous system that had to react properly to a range of uncertain environmental conditions, it could sometimes take a significant effort to identify the root causes of failures.

Surprisingly, it also took a lot of effort to prove that the system did not experience any failures that we were not able to identify by simple observation during operation. It took a team of people days of analyzing the telemetry data to determine whether the interactions between the various subsystems had behaved correctly or had merely coincidentally behaved in an expected fashion during the test run.

The company knew we were going to experience many failures during this process, but the pressure was always present to produce a system that worked flawlessly. However, when the difference between a flawless operation and one that experienced a subtle but potentially catastrophic anomaly rests on a nuanced interpretation of the telemetry data, it is essential that the development team not be afraid to identify possible anomalies and follow them up with robust analysis.

In this project, a series of failures was the norm, but for how many projects is a sequence of system failures acceptable? Do you feel comfortable raising a flag for potential problems in a design or test run? Does how your company handles failure affect what threshold you apply to searching for anomalies and teasing out true root causes? Or is it safer to search a little less diligently and let said anomalies slip through and be discovered later when you might not be on the project anymore? How does your company handle failures?