Entries Tagged ‘Failsafe’

What is your most memorable demonstration/test mishap?

Wednesday, January 12th, 2011 by Robert Cravotta

The crush of product and technology demonstrations at CES is over. As an attendee of the show, the vast majority of the product demonstrations I saw seemed to perform as expected. The technology demonstrations, on the other hand, did not always fare quite so well – but then again, they were prototypes of possibly useful ways to harness new ideas rather than fully developed, productized devices. Seeing all of these demonstrations reminded me of the prototypes I worked on and some of the spectacular ways that things could go wrong. I suspect that sharing these stories with each other will pass some valuable (and possibly expensive) lessons learned around the group here.

On the lighter side of the mishap scale, I still like the autonomous robots that Texas Instruments demonstrated last year at ESC San Jose. The demonstration consisted of four small, wheeled robots that would independently roam around the tabletop. When they bumped into each other, they would politely back away and zoom off in another direction. That is, all except for one robot, which appeared to be a bit pushy and bossy: it would push the other robots around longer before it would back away. The robots were all running the same software; the difference in behavior came down to differences in the sensitivity of the pressure bar that told each robot it had collided with something (a wall or another robot in this case).

I like this small-scale example because it demonstrates that even identical devices can take on significantly different observable behaviors because of small variances in the components that make them up. It also demonstrates the possible value of giving closed-loop control systems access to an independent or outside reference point so they can calibrate their behavior to some set of norms (how’s that for a follow-up idea on that particular demonstration). I personally would love to see an algorithm that allowed the robots to gradually influence each other’s behavior, but the robots might need more sensors to be able to do that in any meaningful way.
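As a rough sketch of that calibration idea (the function names and numbers here are hypothetical, not anything from the TI demo), each robot could sample its own pressure bar at rest during start-up and derive its collision threshold from that baseline instead of from a hard-coded constant, so unit-to-unit variance in the sensor matters less:

    /* Hypothetical calibration sketch: sample the pressure bar with nothing
     * touching it at start-up and set the collision threshold relative to
     * that baseline rather than using one hard-coded constant for all units. */
    #include <stdint.h>

    #define CAL_SAMPLES  32
    #define TRIP_MARGIN  40u    /* counts above the resting baseline = collision */

    extern uint16_t read_pressure_bar(void);   /* assumed ADC read, 0..1023 */

    static uint16_t collision_threshold;

    void calibrate_bump_sensor(void)
    {
        uint32_t sum = 0;
        for (int i = 0; i < CAL_SAMPLES; i++) {
            sum += read_pressure_bar();        /* bar must be untouched here */
        }
        collision_threshold = (uint16_t)(sum / CAL_SAMPLES) + TRIP_MARGIN;
    }

    int collision_detected(void)
    {
        return read_pressure_bar() >= collision_threshold;
    }

An outside reference, such as a test fixture pressing with a known force, would tighten this further; the self-baseline above only removes the resting offset, not differences in how steeply each bar responds.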

On the more profound side of the mishap scale, my most noteworthy stories involve live testing of fully autonomous vehicles that would maneuver in low orbit space. These tests were quite expensive to perform, so we did not have the luxury of “do-overs”. Especially in failure, we had to learn as much about the system as we could; the post mortem analysis could last months after the live test.

In one case, we had a system that was prepped (loaded with fuel) for a low-orbit test; however, the family of transport vehicles we were using had experienced several catastrophic failures over the previous year that resulted in the payloads being lost. We put the low-orbit payload system, with the fuel loaded, into storage for a few months while the company responsible for the transport vehicle went through a review and correction process. Eventually we got the go-ahead to perform the test. The transport vehicle delivered its payload perfectly; however, when the payload system activated its fuel system, the seals for the fuel lines blew.

The prototype system had not been designed to store fuel for a period of months. The intended scenario was to load the vehicle with fuel and launch it within days, not months. During its time in storage, the corrosive fuel and oxidizer weakened the seals so that they blew when the full pressure of the fuel system was placed upon them during flight. A key takeaway from this experience was to understand the full range of operating and non-operating scenarios that the system might be subjected to, including extended storage. In this case, the solution was implemented as additional steps and conditions in the test procedures.

My favorite profound failure involves a similar low-orbit vehicle that we designed to succeed when presented with a three-sigma (99.7%) scenario. In this test though, there was a cascade of failures during the delivery to low-orbit phase of the test, which presented us with a nine-sigma scenario. Despite the series of mishaps leading to deployment of the vehicle, the vehicle was almost able to completely compensate for its bad placement – except that it ran out of fuel as it was issuing the final engine commands to put it into the correct location. To the casual observer, the test was an utter failure, but to the people working on that project, the test demonstrated a system that was more robust than we ever thought it could be.
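For a sense of scale (my gloss, reading “sigma” loosely as the tail probability of a normal distribution rather than any project-specific definition), a three-sigma design envelope leaves roughly 1 case in 370 outside of it, while a nine-sigma event is on a different order entirely:

    P(|X - \mu| \le 3\sigma) \approx 0.9973, \qquad P(|X - \mu| > 9\sigma) \approx 2 \times 10^{-19}

Designing explicitly for the latter would have been unreasonable; nearly surviving it anyway is what made the system seem more robust than anyone expected.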

Do you have any demonstration or testing mishaps that you can share? What did you learn from them, and how did you change things so that the mishap would not occur again?

What is your favorite failsafe design?

Wednesday, January 5th, 2011 by Robert Cravotta

We had snow falling for a few hours where I live this week. That is remarkable only because the last time we had any snowfall here was over 21 years ago. The falling snow got me thinking about how most things in our neighborhood, such as cars, roads, houses, and our plumbing, are not subjected to the wonders of snow with any regularity. On days like that, I am thankful that the people who designed most of the things we rely on took into account the impact that different extremes, such as hot and cold weather, would have on their designs. Designing a system to operate or degrade gracefully under rare operating conditions is a robust design concept that seems to be missing from so many “scientific or technical” television shows.

Designing systems so that they fail in a safe way is an important engineering concept, and it is often invisible to the end user. Developing a failsafe system is an exercise in trading off the consequences and probability of a failure against the cost to mitigate those consequences. There is no single best way to design a failsafe system, but two main tools available to designers are to incorporate interlocks or safeties into the system and/or to implement processes that the user needs to be aware of to mitigate the failure state. Take, for example, the simple inflatable beach ball; the ones I have seen have such a long list of warnings and disclaimers printed on them that it is quite humorous – until you realize that every item printed on that ball probably has a legal case associated with it.
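To make the interlock side of that tradeoff concrete, here is a minimal sketch (hypothetical names and inputs, not drawn from any particular product) of the pattern most firmware interlocks reduce to: the hazardous output is enabled only while every independent safety condition is satisfied, and any missing or doubtful input resolves to the safe state.

    /* Hypothetical interlock sketch: the hazardous output is enabled only
     * while every independent safety input agrees it is safe; any fault,
     * missing data, or failed self-test resolves to the de-energized state. */
    #include <stdbool.h>
    #include <stddef.h>

    extern void enable_output(void);    /* assumed actuator driver calls */
    extern void disable_output(void);

    typedef struct {
        bool cover_closed;      /* guard/cover switch */
        bool estop_released;    /* emergency-stop chain intact */
        bool self_test_passed;  /* sensor self-test succeeded this cycle */
    } interlock_inputs_t;

    static bool interlocks_satisfied(const interlock_inputs_t *in)
    {
        if (in == NULL) {
            return false;               /* no data: fail safe */
        }
        return in->cover_closed && in->estop_released && in->self_test_passed;
    }

    void update_output(const interlock_inputs_t *in)
    {
        if (interlocks_satisfied(in)) {
            enable_output();
        } else {
            disable_output();           /* the default action is always "off" */
        }
    }

The procedural alternative mentioned above lives entirely outside the code: a checklist, inspection, or warning label that asks the user to do the mitigating.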

I was completely unaware until a few months ago of how a rodent could make an automobile inoperable. Worse, our vehicle became unsteerable while it was being driven. Fortunately no one got hurt (except the rat that caused the failure). In this case, it looks like the rat got caught in one of the belts in the engine compartment, which ultimately made the power steering fail. I was surprised to find out this is actually a common failure when I looked it up on the Internet. I am not aware of a way to design better safety into the vehicle, so we have changed our process when using our automobiles. We now do periodic checks of the engine compartment for any signs of an animal living in there, and we sprinkled peppermint oil around the compartment because we heard that rodents hate the smell.

The ways to make a system failsafe are numerous, and I suspect there are a lot of great ideas that have been used over the years. As an example, let me share a memorable failsafe mechanism we implemented on a Space Shuttle payload I worked on for two years. The payload was going to be actually flying around the Space Shuttle, which means it would be firing its engines more than once. This was groundbreaking, as launching satellites involves firing the engines only once. As a result, we had to go to great lengths to ensure that there could be no way that the engines could misfire – or worse, that the payload could receive a malicious command from the ground directing it onto a collision course with the Shuttle. All of the fault-tolerant systems and failsafe mechanisms made the design quite complicated. In contrast, the mechanism we implemented to prevent acting on a malicious command was simple: a table of random numbers that was loaded onto the payload 30 minutes before the launch and would be known to only two people. Using encryption was not a feasible option at that time because we just did not have the computing power to do it.
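A rough sketch of how that kind of mechanism can look in code follows; the names, table size, and one-entry-per-command consumption are my assumptions for illustration, not details of the actual payload. Each ground command carries the next unused entry from the pre-shared table, and anything that does not match is simply ignored.

    /* Hypothetical sketch of authenticating ground commands against a
     * pre-shared table of random numbers: each command must present the
     * next unused entry, and a mismatch is rejected without advancing. */
    #include <stdint.h>
    #include <stdbool.h>

    #define AUTH_TABLE_LEN 256

    static uint32_t auth_table[AUTH_TABLE_LEN];  /* loaded shortly before launch */
    static uint16_t next_entry = 0;

    bool command_authentic(uint32_t presented_code)
    {
        if (next_entry >= AUTH_TABLE_LEN) {
            return false;                 /* table exhausted: accept nothing further */
        }
        if (presented_code != auth_table[next_entry]) {
            return false;                 /* mismatch: ignore the command */
        }
        next_entry++;                     /* each entry is usable exactly once */
        return true;
    }

The appeal of this approach is that it needs almost no computing power: a compare and an index increment stand in for cryptography that the hardware of the day could not afford.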

Another story of making a system more failsafe involves an X-ray machine. I was never able to confirm whether this actually occurred or was a local urban legend, but the lesson is still valid. The model of X-ray machine in question was exposing patients to larger doses of radiation than it was supposed to when the technician pressed the backspace key during a small time window. The short-term fix was to send out an order to remove the backspace key from all of the keyboards. The take-away for me was that there are often fast and cheap ways to alleviate a problem that buy you the time to find a better way to fix it.

Have you ever used a clever approach to make your designs more failsafe? Have you ever run across a product that implemented an elegant failsafe mechanism? Have you ever seen a product and thought of a better way it could have been made failsafe or able to degrade gracefully?