Entries Tagged ‘System Integration’

How does your company handle test failures?

Wednesday, August 17th, 2011 by Robert Cravotta

For many years, most of the projects I worked on were systems that had never been built before in any shape or form. As a consequence, many of the iterations for each of these projects included significant and sometimes spectacular failures as we moved closer to a system that could perform its tasks successfully in an increasingly wider circle of environmental conditions. These path-finding designs needed to be able to operate in a hostile environment (low earth orbit), and they needed to make decisions autonomously because there was no way to guarantee that instructions from a central location would arrive in a timely fashion.

The complete units themselves were unique prototypes with no more than two iterations in existence at a time. It would take several months to build each unit and develop the procedures by which we would stress and test what the unit could do. The testing process took many more months as the system integration team moved through ground-based testing and eventually moved on to space-based testing. A necessary cost of deploying a unit was losing it when it reentered the Earth’s atmosphere, but a primary goal for each stage of testing was to collect as much data as possible from the unit until it was no longer able to operate or transmit telemetry about its internal state of health.

During each stage of testing, the unit was placed into an environment that would minimize the physical damage the unit would be subjected to (such as operating the unit within a netted room that would prevent it from crashing into the floor, walls, or ceiling). The preparation work for each formal test consisted of weeks of refining all of the details in a written test procedure that fortyish people would follow exactly. Any deviation during the final test run would flag a possible abort of the run.

Despite all of these precautions, sometimes things just did not behave the way the team expected. In each failure case, it was essential that the post mortem team be able to explicitly identify what went wrong and why so that future iterations of the unit would not repeat those failures. Because we were learning how to build a completely autonomous system that had to properly react to a range of uncertain environmental conditions, it could sometimes take a significant effort to identify root causes for failures.

Surprisingly, it also took a lot of effort to prove that the system did not experience any failures that we were not able to identify by simple observation during operation. It took a team of people days of analyzing the telemetry data to determine whether the interactions between the various subsystems had behaved correctly or had merely coincidentally behaved in an expected fashion during the test run.
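As a minimal sketch of that kind of cross-check (the subsystem names, signals, and limits below are hypothetical, invented purely for illustration and not taken from the actual project), an automated first pass over the telemetry might look something like this:

```python
# Hypothetical illustration: flag telemetry samples whose values fall outside
# the ranges expected for this test phase, so that a run which "looked fine"
# still gets scrutinized. Subsystems, signals, and limits are all invented.

EXPECTED_BOUNDS = {
    # (subsystem, signal): (min, max) expected during this test phase
    ("thruster", "chamber_temp_C"): (10.0, 85.0),
    ("nav", "attitude_error_deg"): (0.0, 2.5),
    ("power", "bus_voltage_V"): (26.0, 30.0),
}

def find_anomalies(samples):
    """Return (time, subsystem, signal, value) tuples for out-of-bounds readings."""
    anomalies = []
    for t, subsystem, signal, value in samples:
        bounds = EXPECTED_BOUNDS.get((subsystem, signal))
        if bounds is None:
            continue  # no expectation defined; a stricter pass could flag this too
        low, high = bounds
        if not (low <= value <= high):
            anomalies.append((t, subsystem, signal, value))
    return anomalies

if __name__ == "__main__":
    telemetry = [
        (0.0, "power", "bus_voltage_V", 28.1),
        (0.5, "nav", "attitude_error_deg", 3.1),   # a subtle anomaly
        (1.0, "thruster", "chamber_temp_C", 72.4),
    ]
    for anomaly in find_anomalies(telemetry):
        print("possible anomaly:", anomaly)
```

Anything a pass like this flags still needs human interpretation; the point is only that “no visible failure during the run” and “verified nominal against expectations” are not the same claim.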

The company knew we were going to experience many failures during this process, but the pressure was always present to produce a system that worked flawlessly. However, when the difference between a flawless operation and one that experienced a subtle but potentially catastrophic anomaly rests on nuanced interpretation of the telemetry data, it is essential that the development team not be afraid to identify possible anomalies and follow them up with robust analysis.

In this project, a series of failures was the norm, but for how many projects is a sequence of system failures acceptable? Do you feel comfortable raising a flag for potential problems in a design or test run? Does how your company handles failure affect what threshold you apply to searching for anomalies and teasing out true root causes? Or is it safer to search a little less diligently and let said anomalies slip through and be discovered later when you might not be on the project anymore? How does your company handle failures?

Can we improve traffic safety and efficiency by eliminating traffic lights?

Wednesday, August 18th, 2010 by Robert Cravotta

I love uncovering situations where there is a mismatch between the expected results and the actual results of an experiment because it helps reinforce the importance of actually performing an experiment, no matter how well you think you “know” how it will turn out. System level integration of the software with the hardware is a perfect example.

It seems, with a frequency that defies pure probability, that if the integration team fails to check out an operational scenario during integration and testing, the system will behave in an unexpected manner when that scenario occurs. Take for example Apple’s recent antenna experience:

“…The electronics giant kept such a shroud of secrecy over the iPhone 4’s development that the device didn’t get the kind of real-world testing that would have exposed such problems in phones by other manufacturers, said people familiar with the matter.

The iPhones Apple sends to its carrier partners for testing are “stealth” phones that disguise a new device’s shape and some of its functions, people familiar with the matter said. Those test phones are specifically designed so the phone can’t be touched, which made it hard to catch the iPhone 4’s antenna problem. …”

The prototype units did not operate under the same conditions as they would in a production capacity, and that allowed an undesirable behavior to get through to the production version. The message here is simple: never assume your system will work the way you expect it to; test it, because the results may just surprise you.

Two recent video articles about removing traffic lights from intersections support this sentiment. In one of the videos, a traffic specialist suggests that turning off the traffic lights at intersections can actually improve the safety and efficiency of some intersections. The other video highlights what happened when a town turned off the traffic lights at a specific intersection. The results are counterintuitive. This third video of an intersection is fun to watch, especially when you realize that there is no traffic control and all types of traffic, from pedestrians and bikes to small cars, large cars, and buses, are sharing the road. I am amazed watching the pedestrians and the near misses that do not appear to faze them.

I am not advocating that we turn off traffic lights, but I am advocating that we explore whether we are testing our assumptions sufficiently – whether in our own embedded designs or in other systems, including traffic control. What is causing better traffic flow and safety in these test cases? Is it because the traffic volume is low enough? Is it because the people using the intersection are applying a better set of rules than “green means go”? Are there any parallel lessons learned that apply to integrating and testing embedded systems?
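As a rough way to poke at the “is the flow low enough?” question, here is a toy Monte Carlo sketch. Every parameter in it (arrival rates, signal timing, the gap-acceptance window) is invented for illustration rather than taken from the videos, and it compares average delay at a fixed-cycle signal against an uncontrolled intersection where drivers yield only when cross traffic is present:

```python
# Toy Monte Carlo comparison of average per-vehicle delay at a signalized
# versus an unsignalized intersection. All parameters are invented for
# illustration; this is not a model of any real intersection.

import math
import random

def delay_with_light(green_s=30.0, red_s=30.0, n=100_000):
    """Average delay when every arrival must wait out a fixed green/red cycle."""
    cycle = green_s + red_s
    total = 0.0
    for _ in range(n):
        t = random.uniform(0.0, cycle)              # arrival phase within the cycle
        total += 0.0 if t < green_s else cycle - t  # wait for the next green
    return total / n

def delay_without_light(cross_rate_vps, gap_s=4.0, n=100_000):
    """Average delay when a vehicle yields only if cross traffic occupies its gap."""
    p_conflict = 1.0 - math.exp(-cross_rate_vps * gap_s)  # Poisson cross arrivals
    total = 0.0
    for _ in range(n):
        if random.random() < p_conflict:
            total += gap_s   # crude: wait out one gap, then proceed
    return total / n

if __name__ == "__main__":
    for rate in (0.02, 0.1, 0.3):   # cross-street vehicles per second
        print(f"cross rate {rate:.2f} veh/s   "
              f"with light: {delay_with_light():4.1f}s   "
              f"without light: {delay_without_light(rate):4.1f}s")
```

At very low volumes the uncontrolled intersection wins easily because almost nobody has to stop; as the cross-street rate climbs, that advantage shrinks, and this toy model says nothing about safety or about what happens once the intersection saturates. That gap between the model and the real intersection is exactly the kind of assumption that deserves an actual test.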