How do you mitigate single-point failures in your team’s skillset?

Wednesday, December 22nd, 2010 by Robert Cravotta

One of the hardest design challenges facing developers is how to keep the system operating within acceptable bounds despite being used in non-optimal conditions. Given a large enough user base, someone will operate the equipment in ways that the developers never intended. For example, a friend recently shared that his young daughter has developed an obsession with turning the lights in the house on and off repeatedly. Complicating this scenario is that some of the lights she likes to flip on and off are fluorescent lights (the tubes, not compact fluorescent lights, or CFLs). Unfortunately, repeatedly turning them on and off in this fashion significantly reduces their useful life. Those lights were not designed for that kind of use. I’m not sure designers can ever build a fluorescent bulb that will flourish under those types of operating conditions – but you never know.

Minimizing and eliminating single-point failures in a design is a valuable strategy for increasing the robustness of the design. Experienced developers exhibit a knack for avoiding and mitigating single-point failures – often as the result of experience with similar failures in previous projects. Successful methods for avoiding single-point failures usually involve implementing some level of overlap or redundancy between separate, and ideally independent, parts of the system.

A look at the literature addressing single-point failures reveals a focus on technical and tangible items like devices and components, but there is an intangible source of single-point failures that can be devastating to a project: a skillset or body of knowledge that only one person holds. I was first introduced to this idea when someone asked me “What will you do if Joe wins the Lottery?” We quickly established that winning the Lottery was a nice way to describe a myriad of unpleasant scenarios. In each case the outcome is the same: Joe, with all of his skills, experience, and project-specific knowledge, leaves the project.

As a junior member of the technical staff, I did not need to worry about this question, but once I moved into the ranks of project lead, that question became immensely more important. If you have the luxury of a large team and budget, you might assign people to overlapping tasks. However, small teams may lack not just the budget but also the cognitive bandwidth for each member to be aware of everything everyone else is doing.

One approach we used to mitigate the consequences of a key person “winning the Lottery” involved holding regular project status meetings. Done correctly, these meetings can provide a quick and cost-effective mechanism for spreading the project knowledge among more people. The trick is to avoid involving too many people for too long or too frequently, so that the meetings do not cost more than the benefit they provide. Maintaining written documentation is another approach for making sure the project can recover from the loss of a key member. Another approach we used for more tactical types of skills was to contract with an outside team that specialized in said skillset. Because that team already understands the project’s tribal knowledge, it can help the project team recover quickly and salvage the project.

What methods do your teams employ to protect from the consequences of a key person winning the Lottery?


One Response to “How do you mitigate single-point failures in your team’s skillset?”

  1. Martin says:

    I know I’m commenting on an article more than one year old, but this is spot on, and I really like the “lottery” analogy. It allows us to discuss SPOF in a more pragmatic, positive way than the usual “what if he leaves?”. As a founder of a company active in skills management, this really nailed it for me.

