Entries Tagged ‘Fault Tolerance Principle’

Are you, or someone you know, using voting within your embedded system designs?

Wednesday, November 3rd, 2010 by Robert Cravotta

With the midterm elections in the United States winding down, I thought I'd try to tie embedded design concepts to the process of elections. On some of the aerospace projects I worked on, we used voting schemes as fault-tolerance techniques. In some cases, because we could not trust the sensors, we used multiple sensors and performed voting among the sensor controllers (along with separate and independent filtering) to improve the quality of the data feeding our control algorithms. We might use several sensors of the same type, and in some cases we would use sensors that differed from each other significantly so that they would not be susceptible to the same types of bad readings.
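To make the idea concrete, here is a minimal sketch of one common form of voting, a median-style voter across three redundant sensor readings. This is a generic illustration, not code from any of those projects, and the tolerance value is an arbitrary placeholder:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sketch: vote across three redundant sensor readings by taking
 * the median value, and flag the result as suspect if any single reading
 * disagrees with the median by more than an agreed tolerance. */
typedef struct {
    int32_t value;   /* voted (median) reading */
    bool    suspect; /* true if any channel disagrees beyond the tolerance */
} vote_result_t;

static int32_t median3(int32_t a, int32_t b, int32_t c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

static int32_t abs_diff(int32_t a, int32_t b)
{
    return (a > b) ? (a - b) : (b - a);
}

vote_result_t vote_sensors(int32_t s0, int32_t s1, int32_t s2, int32_t tolerance)
{
    vote_result_t r;
    r.value   = median3(s0, s1, s2);
    r.suspect = (abs_diff(s0, r.value) > tolerance) ||
                (abs_diff(s1, r.value) > tolerance) ||
                (abs_diff(s2, r.value) > tolerance);
    return r;
}
```

Feeding this same voter with readings from dissimilar sensor types, as described above, reduces the chance that all three channels fail in the same way at the same time.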

I did a variety of searches on fault tolerance and voting to see if there was any recent material on the topic. There was not a lot of material available, what was available was scholarly, and I was generally not able to download the files. It is possible I did a poor job choosing my search terms. However, this lack of material made me wonder whether people are using the technique at all, or whether it has evolved into a different form, such as sensor fusion.

Sensor fusion is the practice of combining data derived from disparate sensor sources to deliver "better" data than would be possible if those sources were used individually. "Better" in this case can mean more accurate, complete, or reliable data. From this perspective, fusing the data is not strictly a voting scheme, but it shares some similarities with the original concept.

This leads me to this week's question. Are you, or someone you know, using voting or sensor fusion within embedded system designs? As systems continue to increase in complexity, the need for robust design principles that enable systems to operate correctly with less-than-perfect components becomes more relevant. Are the voting schemes of yesterday still relevant, or have they evolved into something else?

Do you permit single points of failure in your life?

Wednesday, June 23rd, 2010 by Robert Cravotta

AT&T’s recent national outage of their U-Verse voice service affected me for most of one day last month. Until recently, such outages never affected me because I was still using a traditional landline phone service. That all changed a few months ago when I decided that the risk and consequences of an outage might be offset by the additional services and lower cost of the VoIP service over the landline service. Since the outage, I have been thinking about whether I properly evaluated the risks, costs, and benefits, and whether I should keep or change my services.

The impact of the outage was significant, but not as bad as it could have been. The outage did not affect my ability to receive short phone calls or to send and receive emails. It did, however, severely reduce my ability to make outgoing phone calls and to maintain a long phone call, as the calls that did get through would randomly drop. I had one scheduled phone meeting that I had to reschedule as a result of the outage. Overall, the severity and duration of the outage were not sufficient to cause me to drop the VoIP service in favor of the landline service. However, if similar outages were to occur more often, say more than once in a twelve-month period or for more than a few hours at a time, I might seriously reconsider this position.

An offsetting factor in this experience was my cell phone. My cell phone sort of acts as my backup phone in emergencies, but it is insufficient for heavy-duty activity in my office because I work at the edge of a wireless coverage dead spot in the mountains. I find it ironic that the cell phone has replaced my landline as my last line of defense for communicating in emergencies, because I kept the landline for so long as a last line of defense against the wireless phone service going down.

Many people are making this type of trade-off (knowingly or not). A May 12, 2010 report from the Centers for Disease Control and Prevention says that 24.5% of American homes had only wireless phones in the last half of 2009. According to the report, 48.6% of adults aged 25 to 29 years old lived in households with only wireless phones. The term VoIP never shows up in the report, so I cannot determine whether the data lumps landline and VoIP services into the same category.



Going with a wireless-only household introduces additional single points of failure. 9-1-1 operators cannot automatically locate you in an emergency, and in a crisis, such as severe storms, the wireless phone infrastructure may overload and prevent you from getting a cell signal at all.

The thing about single points of failure is that they are not always obvious until you are already experiencing the failure. Do you permit single points of failure in the way you design your projects or in your personal life choices? For the purpose of this question, ignoring the possibility of a single point of failure is an implied acceptance of the risk and benefit trade-off.

If you would like to suggest questions to explore, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Operational Single Points of Failure

Monday, June 21st, 2010 by Robert Cravotta

A key tenet of fault-tolerant design is to eliminate all single points of failure from the system. A single point of failure is a component or subsystem whose failure can cause the rest of the system to fail. When I was first exposed to the concept, we used it to refer to subsystems in electronic control systems. A classic example of a single point of failure is a function that is implemented completely in software: even if you use multiple functions or algorithms to check each other, the processor core itself represents a single point of failure in a system with only one processor.
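To make that last point concrete, here is a purely hypothetical sketch (not from any real system) of two independently coded routines cross-checking each other. The cross-check adds some software fault detection, but everything still executes on the one processor core, which remains the single point of failure:

```c
#include <stdint.h>
#include <stdbool.h>

/* Two deliberately different implementations of the same computation
 * (here, a simple scaling of a raw sensor count by a factor of 1.25). */
static int32_t scale_primary(int32_t raw)   { return (raw * 5) / 4; }
static int32_t scale_secondary(int32_t raw) { return raw + (raw / 4); }

/* Cross-check the two results before using them.  This catches some classes
 * of software and memory faults, but the check itself runs on the same
 * processor core as both routines; that core is still a single point of failure. */
bool scale_checked(int32_t raw, int32_t *out)
{
    int32_t a = scale_primary(raw);
    int32_t b = scale_secondary(raw);

    if (a != b) {
        return false;    /* disagreement: caller should enter a safe state */
    }
    *out = a;
    return true;
}
```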

As my experience grew, and I was exposed to more of the issues facing internal program management as well as design level trade-off complexities, I appreciated that single points of failure are not limited to just what is inside your box. It is a system level concept and while it definitely applies to the obvious engineering candidates, such as hardware, software, and mechanical considerations (and thermal and EMI and … and …), it also applies to team processes, procedures, staffing, and even third-party services.

Identifying single points of failure in team processes and procedures can be subtle, but the consequences of allowing them to remain in your system can be as bad as a conventional engineering single point of failure. For example, a process that only a single person executes is a possible source of failure because there is no cross check or measurement to ensure the person is performing the process correctly, which might allow certain failure conditions to go undetected. In contrast, you can eliminate such a failure point if the process involves more than one person and the tasks the people perform support some level of cross-correlation.

Staffing policies can introduce dangerous single points of failure into your team or company, especially if there is no mechanism for the team to detect and correct a situation where a given skill set is not duplicated across multiple people on the team or in the company. You never know when or why the person with the unique skills or knowledge will become unavailable. You might be able to contact them if they leave the company or win the lottery, but you would have a hard time tapping them if they died.

Many years ago I displayed a cartoon in my office that showed a widow and her child standing over a grave in the rain, with an engineer standing next to them asking whether the husband had ever mentioned anything about the source code. The message is powerful and terrifying for anyone who is responsible for maintaining systems. The answer is to plan for redundancy in your staff's skills and knowledge. When you identify a single point of failure in your staff's skills or knowledge, commit to fixing it as soon as possible. Note that this is a "when" condition and not an "if" condition, because it will happen from time to time for reasons completely outside your control.

The thing to remember is that single points of failure can exist anywhere in your system and are not limited to just the components in your products. As systems include more outside or third-party services or partners, the scope of the system grows accordingly and the impact of non-technical single points of failure can grow also.

If you would like to be an information source for this series or provide a guest post, please contact me at Embedded Insights.

[Editor's Note: This was originally posted at the Embedded Master]

 

When is “cutting corners” good engineering versus derelict complacency?

Wednesday, June 16th, 2010 by Robert Cravotta

The recent articles claiming that BP demonstrated carelessness and complacency when the company cut corners in its well design got me thinking. Companies and design teams constantly refine their processes in ways that cut costs and shrink schedules. This incremental process of "cutting corners" is a cornerstone of the amazing advances in technology and the improvements in our overall quality of life over the years. There seems to be a lot of second-guessing and criticism from people outside the design, build, and maintenance process when those incremental changes cause a system to cross the line between "good enough" and broken. The disaster BP is in the middle of right now in the gulf is quite serious, but its magnitude makes me think we should explore a thought exercise together.

What would be the cost/benefit if BP never "cut corners" on its well and rig designs? The immediately obvious answer might be that there would be no oil volcano at the bottom of the gulf right now and BP would be happily pumping oil out of that well instead of cleaning up after it. I say volcano because the words spill and leak seem insufficient to describe the massive force required to spew out all of that oil against the tremendous pressure of the water column pushing down on the opening. A possible problem with the immediately obvious answer is that it ignores an essential implied assumption: while there might not be any oil pouring into the gulf, we might not be harvesting any of the oil either.

Let's reword the question to make it more general. What would be the cost to society if everyone only engaged in ventures that would never find the line between good enough and broken? While raising my own children, I developed a sense of how important it is for all of us to find that line. I believe children do not break rules merely to break them; they are exploring the edges and refining their own models of what rules are and why and when they should adhere to them. If we deny children the opportunity to understand the edges of rules, they might never develop the understanding necessary to know when to follow a rule and when to challenge it.

This concept applies to engineering (as well as any other human endeavor). If designers always use large margins in their designs, how will they know when and why they can push those margins and when they should not? How will they know whether the margins are excessive (wasteful) or just right? My experience shows me that people learn the most from failures, especially because failures force them to refine their models of how and why the world works the way it does.

I think one of the biggest challenges with "cutting corners" is minimizing the impact when you do cross the line into failure, precisely because you do not know where that line is. To me, derelict complacency rests on the assumption that the designer knew where the line to failure was and crossed it anyway. If my engineering career taught me anything, it taught me that we never know what will or will not work until we try it. We can extrapolate from experience, but experience does not provide certainty for everything we have not yet tried.

To an outsider, there might not be an easily visible difference between good engineering and derelict complacency. What are your thoughts on how to describe the difference between appropriate, risk-assessed process improvement and derelict complacency? Can we use common failures in the lab to explore, refine, and communicate this difference so that we can apply it to larger disasters such as the oil in the gulf, or even unintended acceleration in automobiles?

If you would like to suggest questions to explore, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Robust Design: Quality vs. Security

Monday, June 14th, 2010 by Robert Cravotta

I had a conversation recently with Nat Hillary, a field application engineer at LDRA Technologies, about examples of software fault tolerance, quality, and security. Our conversation identified many questions and paths that I would like to research further. One such path relates to how software systems that are not fault tolerant may present vulnerabilities that attackers can use to compromise the system. A system's vulnerability and resistance to software security exploits is generally a specification, design, and implementation quality issue. However, just because secure systems require high quality does not mean that high-quality systems are also secure, because measuring a system's quality and measuring its security focus on different metrics.

 

Determining a system's quality involves measuring and ensuring that each component, separately and together, fits or behaves within some specified range of tolerance. The focus is on whether the system can perform its function within acceptable limits rather than on the complete elimination of all variability. The tightness or looseness of a component's permitted tolerance balances the cost and difficulty of manufacturing identical components, and the cumulative impact of allowing variability among the components, against the system's ability to perform its intended function. For example, many software systems ship with some number of known minor implementation defects (bugs) because the remaining bugs do not prevent the system from operating within tolerances during the expected and likely use scenarios. The software in this case is identical from unit to unit, but variability in the other components of the system can introduce differences in the system's behavior. I will talk about an exhibit at this year's ESC San Jose that demonstrated this variability in a future post.

 

In contrast, a system's security depends on protecting its vulnerabilities while operating under extraordinary conditions. A single vulnerability, under the proper extraordinary conditions, can compromise the system's proper operation. However, similar to determining a system's quality, a system's security is not completely dependent on a perfect implementation. If the system can isolate and contain its vulnerabilities, it can still be good enough to operate in the real world. The 2008 report "Enhancing the Development Life Cycle to Produce Secure Software" identifies that secure software exhibits:

 

1. Dependability (Correct and Predictable Execution): Justifiable confidence can be attained that software, when executed, functions only as intended;

2. Trustworthiness: No exploitable vulnerabilities or malicious logic exist in the software, either intentionally or unintentionally inserted;

3. Resilience (and Survivability): If compromised, damage to the software will be minimized, and it will recover quickly to an acceptable level of operating capacity.

 

An example of a software system vulnerability that has a fault-tolerant solution is the buffer overflow. A buffer overflow exploits functions that do not perform proper bounds checking. The Computer Security Technology Planning Study first publicly documented the technique in 1972. Static analysis tools can help developers avoid this type of vulnerability by identifying array overflows and underflows, as well as improper mixing of signed and unsigned data types. Using this fault-tolerant approach helps a software system exhibit the three secure-software properties listed above.
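As a generic illustration of the kind of defect those tools flag (a sketch, not code from any particular system), compare an unchecked copy with a bounds-checked one:

```c
#include <string.h>

#define CMD_BUF_LEN 16

/* Vulnerable pattern: no bounds check, so input longer than the buffer
 * overwrites adjacent memory (the classic buffer overflow). */
void handle_command_unsafe(const char *input)
{
    char cmd[CMD_BUF_LEN];
    strcpy(cmd, input);          /* overflows if strlen(input) >= CMD_BUF_LEN */
    /* ... parse and act on cmd ... */
}

/* Bounds-checked pattern: reject input that does not fit, so the copy can
 * never write past the end of the buffer. */
int handle_command_safe(const char *input)
{
    char cmd[CMD_BUF_LEN];

    if (strlen(input) >= CMD_BUF_LEN) {
        return -1;               /* refuse oversized input */
    }
    strcpy(cmd, input);          /* safe: length verified above */
    /* ... parse and act on cmd ... */
    return 0;
}
```

The unchecked strcpy() in the first routine is exactly the sort of construct a static analysis tool will report as a potential overflow.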

[Editor's Note: This was originally posted on the Embedded Master]

Extreme Processing: Oil Containment Team vs. High-End Multiprocessing

Friday, June 11th, 2010 by Robert Cravotta

So far, in this extreme processing series, I have been focusing on the low or small end of the extreme processing spectrum. But extreme processing thresholds do not only apply to the small end of the spectrum – they also apply to the upper end of the spectrum where designers are pushing the processing performance so hard that they are limited by how well the devices and system enclosures are able to dissipate heat. Watching the BP oil well containment effort may offer some possible insights and hints at the direction that extreme high processing systems are headed.

[Image: "The incident command centre at Houma, Louisiana. Over 2500 people are working on the response operation. © BP p.l.c."]


According to the BP CEO's video, there are 17,000 people working on the oil containment team. At a crude level, the containment team is analogous to a 17,000-core multiprocessing system. Now consider that contemporary extreme multiprocessing devices generally offer a dozen or fewer cores in a single package, and some of the highest density multicore devices contain approximately 200 cores in a single package. The logistics of managing 17,000 distinct team members toward a single set of goals, by delivering new information where it is needed as quickly as possible, is analogous to the challenge designers of high-end multiprocessing systems face.

The people on the containment team span multiple disciplines, companies, and languages. Even though each team member brings a unique set of skills and knowledge to the team, there is some redundancy in the partitioning of those people. Take, for example, the 500 people in the crisis center. That group necessarily consists of two or three shifts of people who fulfill the same role in the center, because people need to sleep and no single person could operate the center 24 hours a day. A certain amount of redundancy for each type of task the team performs is critical to avoid single-point failures when someone gets sick, hurt, or otherwise becomes unavailable.

Out in the field are many ships directly involved in the containment effort at the surface of the ocean over the leaking oil pipe. Every movement of those ships must be carefully planned, checked, and verified by a logistics team before the ships can execute it, because those ships are hosting up to a dozen active ROVs (remotely operated vehicles) that are connected to the ships via mile-long cables. Tangling those cables could be disastrous.

In the video, we learn that the planning lead time for the procedures that the field team executes extends 6 to 12 hours ahead, and some planning extends out approximately a week. The larger, more ambitious projects require even more planning time. What is perhaps understated is that the time frames for these projects are up to four times faster than the normal pace: approximately one week to do what would normally take one month of planning.

The 17,000 people are working simultaneously, similar to the many cores in a multiprocessing system. There are people who specialize in routing data and new information to the appropriate groups, analogous to how the scheduling circuits in multiprocessing systems operate. The containment team is carrying out planning across multiple paths, analogous to speculative execution in multi-pipeline systems. The structure of the team cannot afford the critical-path hit of sending all of the information to a central core team to analyze and make decisions; those decisions are made in distributed pockets, and the results of those decisions flow to the central core team to ensure that decisions from different teams do not conflict with or exclude each other.

I see many parallels with the challenges facing designers of multiprocessing systems. How about you? If you would like to be an information source for this series or provide a guest post, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Question of the Week: Do you always use formal test procedures for your embedded designs?

Wednesday, June 9th, 2010 by Robert Cravotta

I commented earlier this week about how watching the live video of the BP oil well capping exercise reminded me about building and using formal test procedures when performing complex or dangerous operations. Before I had much experience with test procedures, I used to think of them as an annoying check-off box for quality assurance. They were expensive to make, and they consumed huge amounts of time to build and refine. With more experience, however, I came to appreciate formal test procedures as valuable engineering design tools because they are a mechanism for injecting fault tolerance into systems where the operator is an integral part of the system's decision process. The procedure frees up the operator's attention while performing "routine tasks" with the system so they can better recognize and react to the shifting external conditions of a complex environment.

Similar to the BP oil well capping exercise, the formal procedures I worked on involved complex systems that used dangerous chemicals. We needed to make sure we did not damage the systems while using them, both for safety and schedule reasons. Building the formal procedure and going through it with the entire team captured each member’s specialized knowledge so that the team was able to develop and refine each step in the procedure with a higher level of confidence than any subset of the team could have performed alone.

I personally understand the value of formal procedures for testing and operating very complex and dangerous systems, but I wonder if the formal procedure process offers similar value, compared to the cost and effort to make one, when applied to simple, low cost, or benign system designs.

Do you always build and use formal test procedures or are there designs that are so simple, low cost, or benign that you skip the process of building a formal test procedure? What types of designs would you consider skipping the formal procedure process and why?

If you would like to suggest questions for future posts, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Robust Design: Formal Procedures

Monday, June 7th, 2010 by Robert Cravotta

Watching the live video of the BP oil pipe capping the other day made me think about formal procedures. Using a formal procedure is a way to implement the fault tolerance principle in cases where the operator is an integral part of the system's decision process. A formal procedure helps free up the operator's attention while performing "routine tasks" so they can better recognize and react to the shifting conditions of a complex environment. Even though I could not see all of the video feeds or hear any of the audio feeds, I could imagine what was going on in each of the feeds during the capping operation.

I have worked on my fair share of path-finding projects that demonstrated the feasibility of design ideas for a completely autonomous vehicle that could complete complex tasks in real-world environments. The design team would subject new ideas to substantive analysis and review before building a lab implementation that we could test. With each test-case success, we would move closer to testing the system in the target environment (space-based in many cases). One thing that impressed me then, and has stayed with me ever since, is the importance and value of a formal operational procedure document.

I have written and reviewed many operational procedures. When I first started working with these procedures, I thought they were a bother that took a lot of time and effort to produce. We had to explicitly account for and document every little detail that we could imagine, especially the unlikely scenarios that we needed to detect during the system checkouts. We performed what seemed like endless walkthroughs of the procedures – each time refining small details in the procedure as different members of the team would share a concern about this or that. We did the walkthroughs so many times that we could complete them in our sleep. When I finally participated in my first live test, the value of those procedures and all of those walkthroughs became apparent – and they no longer seemed like a waste of time.

The formal procedure was an effective way to capture the knowledge of the many people, each with a different set of skills and specific knowledge about the system, in a single place. Performing all of those walkthroughs forced us to consider what we should do under different failure conditions without the burden of having to devise a solution in real time. The formal procedure enabled everyone on the team to quickly perform complex and coordinated tasks that would be practically impossible to execute in an impromptu fashion, especially under stressful conditions.

The first and foremost reason for the procedures was to protect the people working around the system. We were dealing with dangerous materials where injuries or deaths were very real possibilities if we were not careful. The second reason for the procedures was to protect the system from operating in a predictable destructive scenario. Unpredictable failures are a common enough occurrence when you are working on the leading (bleeding?) edge because you are working in the realm of the unknown. The systems we were building only existed as a single vehicle or a set of two vehicles, and having to rebuild them would represent a huge setback.

The capping operation of the BP oil pipe appears to encompass at least as much complexity as the projects I worked on so long ago, if not significantly more. The video feed showed the ROV (remotely operated vehicle) robot arm disconnecting a blue cable from the capping structure. Then the video feed showed what I interpreted as the ROV operator checking out various points in the system before moving on to the next cable or step in some checklist. I could imagine the numerous go/no-go callouts from each of the relevant team members that preceded each task performed by the ROV operator. I am guessing that the containment team went through a process similar to the one we used to build our formal procedures: first in the conference room, then in simulation, and finally on the real thing 5,000 feet under the surface of the ocean.

While building and testing prototypes of your embedded systems may not involve the same adrenaline pumping excitement as these two scenarios, the cost of destroying your prototype system can be devastating. If you have experience using formal procedures while building embedded systems, and you would like to contribute your knowledge and experience to this series, please contact me at Embedded Insights.

[Editor's Note: This was originally posted on the Embedded Master]

Robust Design: Fault Tolerance – Nature vs. Malicious

Tuesday, June 1st, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

For many applications, the primary focus of robust design principles is on making the design resilient to rare or unexpected real-world phenomena. Embedded designers often employ filters to help mitigate the uncertainty that different types of signal noise can cause. They might use redundant components to mitigate or compensate for specific types of periodic failures within various subsystems. However, as our end devices become ever more connected to each other, there is another source of failure that both drives and benefits from a strong set of robust design principles: the malicious attack.

On most of the systems I worked on as an embedded developer, we did not need to spend much energy addressing malicious attacks on the electronics and software within the system because the systems usually included a physical barrier that our electronic control system could safely hide behind. That all changed when we started looking at supporting remote access and control into our embedded subsystems. The project that drove this concept home for me was a Space Shuttle payload that would repeatedly fire its engines to maneuver around the Shuttle. No previous payload had ever been permitted to fire its engines for multiple maneuvers around the Shuttle; the only engine firing such payloads performed was to move away from the Shuttle and into their target orbital position.

The control system for this payload was tolerant of multiple faults, and we often joked among ourselves that the payload would be so safe it would never be able to fire its own engines to perform its tasks because the fault-tolerance mechanisms were so complicated. This was even before we knew we had to support one additional type of fault tolerance: ensuring that none of the maneuvering commands came from a malicious source. We had assumed that because we were working in orbital space and the commands would be coming from a Shuttle transmitter, we were safe from malicious attacks. The NASA engineers were concerned that a malicious ground-based command could send the payload into the Shuttle. The authentication mechanism we adopted was crude and clunky by today's encryption standards. Unfortunately, after more than two years of working on that payload, the program was defunded and we never actually flew the payload around the Shuttle.
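In a present-day connected design, that kind of command authentication would typically be built around a keyed message authentication code plus a replay check. The sketch below is purely illustrative: the structure layout, function names, and the toy compute_mac() routine are all assumptions made for this example, and a real system would substitute a vetted HMAC or CMAC implementation from a trusted crypto library:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MAC_LEN 16

typedef struct {
    uint32_t sequence;        /* monotonically increasing, to reject replays */
    uint8_t  payload[32];     /* the maneuver command itself */
    uint8_t  mac[MAC_LEN];    /* authentication tag sent with the command */
} command_t;

/* Toy placeholder MAC over (sequence, payload).  NOT cryptographically
 * secure; it exists only so the sketch is self-contained. */
static void compute_mac(const uint8_t *key, size_t key_len,
                        const command_t *cmd, uint8_t mac_out[MAC_LEN])
{
    const uint8_t *msg = (const uint8_t *)cmd;
    size_t msg_len = offsetof(command_t, mac);   /* cover sequence + payload only */

    for (size_t i = 0; i < MAC_LEN; i++) {
        uint8_t acc = key[i % key_len];
        for (size_t j = 0; j < msg_len; j++) {
            acc = (uint8_t)(((acc << 1) | (acc >> 7)) ^ msg[j]);  /* rotate and mix */
        }
        mac_out[i] = (uint8_t)(acc + (uint8_t)i);
    }
}

/* Constant-time comparison so timing does not reveal how many MAC bytes matched. */
static bool mac_equal(const uint8_t *a, const uint8_t *b)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < MAC_LEN; i++) {
        diff |= (uint8_t)(a[i] ^ b[i]);
    }
    return diff == 0;
}

/* Accept a command only if its MAC verifies and its sequence number is new. */
bool command_is_authentic(const uint8_t *key, size_t key_len,
                          const command_t *cmd, uint32_t *last_sequence)
{
    uint8_t expected[MAC_LEN];

    compute_mac(key, key_len, cmd, expected);
    if (!mac_equal(expected, cmd->mac)) {
        return false;         /* forged or corrupted command */
    }
    if (cmd->sequence <= *last_sequence) {
        return false;         /* replayed or stale command */
    }
    *last_sequence = cmd->sequence;
    return true;
}
```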

Tying this back to embedded systems on the ground, malicious attacks often take advantage of the lack of fault tolerance and security in a system's hardware and software design. By deliberately injecting fault conditions into a system or a communication stream, an attacker with sufficient access to, and knowledge of, how the embedded system operates can create physical breaches that provide access to the control electronics or expose vulnerabilities in the software through techniques such as forcing a buffer overflow.

Adjusting your design to mitigate the consequences of malicious attacks can significantly change how you approach analyzing, building, and testing your system. With this in mind, this series will include topics that would overlap with a series on security issues, but with a mind towards robust design principles and fault tolerance. If you have experience designing for tolerance to malicious attacks in embedded systems, not only with regards to the electronics and software, but also from a mechanical perspective, and you would like to contribute your knowledge and experience to this series, please contact me at Embedded Insights.

Robust Design: Fault Tolerance – Performance vs. Protection

Monday, May 24th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master]

Fault-tolerant design focuses on keeping the system running or safe in spite of failures, usually through the use of independent and redundant resources in the system. Implementing redundant resources does not mean that designers must duplicate all of the system components to gain the benefits of a fault-tolerant design. To contain system cost and design complexity, often only the critical subsystems or components are implemented with redundant components. Today we will look at fault tolerance for long-term data storage, such as the hard disk drives used for network storage, to illustrate some of the performance vs. protection trade-offs that designers must make.

As network storage needs continue to grow, the consequences of a hard disk drive failure increase. Hard disk drives are mechanical devices that will eventually fail. As network storage centers increase the number of hard drives in their center, the probability of a drive failure goes up in a corresponding fashion. To avoid loss of data from a hard drive failure, you either must be able to detect and replace the hard drive before it fails, or you must use a fault tolerant approach to compensate for the failure.

Using a RAID (redundant array of independent disks) configuration provides fault tolerance that keeps the data on those drives available and protects it from loss due to a single disk drive failure. A RAID configuration does not protect the data from application failures or malicious users that cause the data to be overwritten or deleted; regular data backups are the common approach for protecting against those types of failures. RAID is a technique for configuring multiple hard disk drives into a single logical device to increase data reliability and availability by storing the data redundantly across the drives.

A RAID configuration relies on a combination of up to three techniques: mirroring, striping, and error correction. Mirroring is the easiest method for allocating the redundant data; it consists of writing the same data to more than one disk. This approach can speed up reads because the system can read different data from different disks at the same time, but it may trade off write speed if the system must confirm that the data is written correctly across all of the drives. Striping is more complicated to implement; it consists of interlacing the data across more than one disk, which permits the system to complete reads and writes faster than performing the same operations on a single hard drive. Error correction consists of writing redundant parity data, either on a separate disk or striped across multiple disks. Storing the parity data means the amount of usable storage is less than the total raw storage across all of the drives.
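The parity in these schemes is typically a simple XOR across the data blocks of a stripe, which is sufficient to rebuild any single missing block. Here is a minimal sketch of that arithmetic (the block size and function names are arbitrary choices for this illustration, not any particular controller's implementation):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512   /* bytes per block, arbitrary for this sketch */

/* Compute the parity block as the XOR of all data blocks in a stripe. */
void compute_parity(const uint8_t *const data_blocks[], size_t num_blocks,
                    uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (size_t d = 0; d < num_blocks; d++) {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            parity[i] ^= data_blocks[d][i];
        }
    }
}

/* Rebuild one lost data block by XOR-ing the parity with the surviving blocks.
 * XOR parity can recover exactly one missing block per stripe, which is why a
 * single-parity array tolerates only a single drive failure. */
void rebuild_block(const uint8_t *const surviving_blocks[], size_t num_surviving,
                   const uint8_t parity[BLOCK_SIZE], uint8_t rebuilt[BLOCK_SIZE])
{
    memcpy(rebuilt, parity, BLOCK_SIZE);
    for (size_t d = 0; d < num_surviving; d++) {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            rebuilt[i] ^= surviving_blocks[d][i];
        }
    }
}
```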

The different combinations of these three techniques provide different performance and fault protection trade-offs. A system that implements only data striping (RAID 0) will benefit from faster read performance, but all of the data will be at risk because any disk failure will cause loss of the data in the array. Data recovery can be costly and not guaranteed with this approach. This approach is appropriate for fixed data or program structures, such as operating system images that do not change often and can be recovered by restoring the data image from a backup.

A system that implements only mirroring (RAID 1) creates two identical copies of the data on two different drives; this provides data protection from a drive failure, but it does so at the cost of a 50% storage efficiency because every bit of data is duplicated. This approach allows the system to keep a file system available at all times even when performing backups because it can declare a disk as inactive, perform a backup of that drive, and then rebuild the mirror.

A system that implements data striping across two or more data drives with a dedicated parity drive (RAID 3 and 4) provides better storage efficiency because the fault tolerance is implemented through a single extra drive in the array; the more drives in the array, the higher the storage efficiency. However, every write must update the dedicated parity drive, so that drive throttles the maximum write performance of the system. RAID 5, which stripes the parity data across all of the drives in the array, has all but replaced RAID 3 and 4 implementations.

RAID 5 offers good storage efficiency because the parity data consumes the equivalent of only a single drive in the array. The approach suffers from poorer write performance because the system must update the parity on every write. Because the parity data is striped across the drives, the system can continue degraded (slower) operation despite a hard drive failure, and it can rebuild a fault-tolerant data image onto a new disk drive, such as a hot-swap drive, while continuing to provide read and write access to the existing data. RAID 6 extends RAID 5 by using additional parity blocks to provide dual fault tolerance.
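The write penalty comes from the read-modify-write sequence needed to keep the parity consistent: the controller reads the old data and old parity, folds the change into the parity, then writes both back. A hedged sketch of that update follows; the read_block/write_block helpers are stand-ins declared only so the sketch compiles, not a real driver API:

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 512   /* bytes per block, arbitrary for this sketch */

/* Stand-in I/O helpers: placeholders for whatever the storage layer provides. */
extern void read_block(unsigned drive, uint64_t lba, uint8_t buf[BLOCK_SIZE]);
extern void write_block(unsigned drive, uint64_t lba, const uint8_t buf[BLOCK_SIZE]);

/* RAID 5 small-write update: new_parity = old_parity ^ old_data ^ new_data.
 * The two reads and two writes per logical write are the source of the
 * write penalty described above. */
void raid5_small_write(unsigned data_drive, unsigned parity_drive, uint64_t lba,
                       const uint8_t new_data[BLOCK_SIZE])
{
    uint8_t old_data[BLOCK_SIZE];
    uint8_t parity[BLOCK_SIZE];

    read_block(data_drive, lba, old_data);       /* read old data block   */
    read_block(parity_drive, lba, parity);       /* read old parity block */

    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        parity[i] ^= old_data[i] ^ new_data[i];  /* fold the change into parity */
    }

    write_block(data_drive, lba, new_data);      /* write new data block   */
    write_block(parity_drive, lba, parity);      /* write new parity block */
}
```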

Each of these configurations provides different performance and fault-tolerance trade-offs and is appropriate depending on the context of the subsystem it serves. Note that the RAID controller itself can be a single point of failure within a system.

Performance and protection trade-offs are not limited to disk storage. An embedded design might copy a firmware library from Flash to SRAM to gain a performance boost, but the code in SRAM is then vulnerable to unintended modification. If you would like to participate in a guest post, please contact me at Embedded Insights.