Is it always a software problem?

Wednesday, October 20th, 2010 by Robert Cravotta

When I first started developing embedded software, I ran into an expression that seemed to be the answer for every problem – “It’s a software problem.” At first, this expression drove me crazy because it was so often blatantly wrong, but it was the only expression I ever heard. I never heard that anything was a hardware problem. If the polarity on a signal was reversed – it was a software problem. If a hardware sensor changed behavior over time – it was a software problem. In short, if it was easier, faster, or cheaper to fix a problem anywhere in the system with a change to the software – it was a software problem.
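
To make that concrete, here is a minimal sketch of the reversed-polarity case. The HAL call read_input_pin() and the board-revision macro are hypothetical names used only for illustration, not taken from any real board support package.

    /* Minimal sketch: the schematic intended an active-high signal, but the
     * board was built with the sense inverted, so software undoes it.
     * read_input_pin() and VALVE_SENSE_INVERTED are illustrative names. */
    #include <stdbool.h>
    #include <stdint.h>

    #define VALVE_SENSE_INVERTED 1        /* this board revision wired it inverted */

    extern uint8_t read_input_pin(uint8_t pin);   /* hypothetical HAL call */

    bool valve_is_open(uint8_t pin)
    {
        bool raw = (read_input_pin(pin) != 0u);
    #if VALVE_SENSE_INVERTED
        return !raw;    /* the "software problem": compensate for the hardware */
    #else
        return raw;
    #endif
    }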

Within a year of working with embedded designs, I accepted the position that any problem software could fix or limit was, by definition, a software problem – regardless of whether the software did exactly what the design documents specified. I stopped worrying about whether management would think the software developers were inept because, in the long run, they seemed to understand that a software problem did not necessarily translate to a software developer problem.

I never experienced this type of culture when I worked on application software. There were clear demarcations between hardware and software problems. Software problems occurred because the code did not capture error return codes or because the code did not handle an unexpected input from the user. A spurious or malfunctioning input device was clearly a hardware problem. A dying power supply was a hardware problem. The developer of the application code was “protected” by a set of valid and invalid operating conditions. Either a key was pressed or it was not. Inputs and operating modes had a hard binary quality to them. At worst, the application code should not act on invalid inputs.

In contrast, many embedded systems need to operate based on continuous real-world sensing that does not always translate into obvious true/false conditions. Adding to the complexity, a good sensor reading in one context may indicate a serious problem in a different operating context. In a closed-loop control system, it can be impossible to definitively classify every possible input as good or bad.
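
As a small, purely illustrative sketch (the mode names and pressure limits below are invented, not taken from any particular system), the same reading can be perfectly valid in one operating mode and a fault indicator in another:

    /* Illustrative only: a 200 kPa reading is normal while cruising but
     * indicates a failed vent during a vent cycle. Names and limits are made up. */
    #include <stdbool.h>

    typedef enum { MODE_STARTUP, MODE_CRUISE, MODE_VENT } op_mode_t;

    bool pressure_is_plausible(float kpa, op_mode_t mode)
    {
        switch (mode) {
        case MODE_CRUISE:  return (kpa > 180.0f) && (kpa < 220.0f);
        case MODE_VENT:    return (kpa < 50.0f);   /* 200 kPa here means trouble */
        case MODE_STARTUP: return (kpa >= 0.0f);   /* almost anything is plausible */
        default:           return false;
        }
    }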

Was this culture just in the teams I worked on, or is it prevalent in the embedded community? Does it apply to application developers? Is it always a software problem if the software can detect, limit, or fix an undesirable operating condition?


16 Responses to “Is it always a software problem?”

  1. Angelo says:

    It seems to be a mix. If we were to say that because software can fix things, it’s a software problem, we would be both right and wrong.
    Right, because (especially from experience) the “hardware” problem is, in most cases, a design problem. A potential hardware problem can only be overcome with good design, which comes from experience. In that case, if the circuitry for, say, a defective sensor were designed properly, the way the software (/firmware) processed the data would certainly affect the outcome.
    Wrong, because, as you clearly mentioned, a defect can perhaps be taken care of by software during development, but it certainly can’t be taken care of at runtime.
    One thing you can rest assured of, though: if the hardware is designed right and is tested correctly for defects, most problems will come from the software (/firmware).

  2. B.C. @ LI says:

    I love it! There seems to be a management view that SW can fix it. In my years at Big Blue, the push was for a SW solution to a problem before spinning the board. At my current job, the SW team has to fix it before even considering changing the HW.

    I think this can result in a deceptive perception that the HW is perfect. In my last job, I found many, many mistakes in the HW design that an outsourced company had produced from my own design. The only way to fix them was a soldering iron and yellow wires. I was constantly pushed for a SW fix.

    So if you can fix it in SW, should you say “no”? I think a lot of problems that are fixed in SW can come back later as customer defects when the HW fails to perform correctly. For example, an input that should have been negative was incorrectly designed as positive, so when that input was shorted, it hung the board or broke something.

    I really believe in SW developers working early with the HW developers on the design, as their “well intended” choices can play hell with the SW design raising development costs.

    So if you can fix it in SW, do say NO! Tell management that fixing it in SW can come back and hurt the overall product.

  3. S.R. @ LI says:

    Yes, there are times when the right answer is “no,” but often it is best to fix a h/w bug in s/w if it saves money and offers a real time-to-market advantage. Just be sure that the documentation catches up quickly, and everyone benefits (including future developers).

    The trick is to change the culture so people stop calling it a software problem, and start calling it a software solution!

  4. SdR @ LI says:

    It’s unfortunate sometimes that software is perceived as “so easy to fix.” It puts developers in a bad situation and oftentimes forces us to create a poor and brittle solution to a problem that should be fixed or designed out in hardware.

    Many studies have shown that software is much more expensive to develop than hardware, yet because the upfront cost of re-spinning a board or chip is so large and obvious, we’re often told to just fix it or work around it in software. It’s much harder to account for the debugging, development, and maintenance costs of a work-around. After all, many managers or executive teams feel that developer time is “free”: they don’t see the cost because it’s an ongoing cost sunk into the overhead of the company instead of a right-out-there budget item like another board spin.

    Then there’s the cost of a bad hardware design decision or trade-off. As an example, a company I worked for was developing a product that was going to be very low volume (less than 1,000 units sold) with a very high margin. We wanted to use a $1 switch debouncer IC, but the executives didn’t want to add $1 to the BOM, so we were told to do it in software. Considering the loaded salary cost of developers, we spent many times the lifetime savings of the IC on development, debugging, and maintenance of that debouncing routine. Not to mention the opportunity cost of not having worked on a different product or on other improvements to this one.
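
    (For anyone who hasn’t written one: a counter-based debounce of the sort I’m describing looks roughly like the sketch below. The poll period, the stable count, and read_button_raw() are illustrative, not the actual routine from that project.)

        /* Illustrative counter-based debounce, polled on a periodic timer tick.
         * read_button_raw() is a hypothetical HAL call returning the raw,
         * bouncy pin state. */
        #include <stdbool.h>
        #include <stdint.h>

        #define DEBOUNCE_TICKS 5u   /* input must be stable for 5 consecutive polls */

        extern bool read_button_raw(void);

        bool button_debounced(void)   /* call once per timer tick */
        {
            static bool    stable   = false;
            static bool    last_raw = false;
            static uint8_t count    = 0u;

            bool raw = read_button_raw();
            if (raw != last_raw) {            /* input changed: restart the counter */
                last_raw = raw;
                count = 0u;
            } else if (count < DEBOUNCE_TICKS) {
                if (++count == DEBOUNCE_TICKS) {
                    stable = raw;             /* held long enough: accept new state */
                }
            }
            return stable;
        }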

    It’s not always bad to fix it in software. But it’s often just as expensive, or more so. And it’s often a bad decision to “just fix it in software” when the _correct_ fix is in the hardware.

    To answer the original post: is it an industry-wide cultural thing? I think there are good companies and bad companies in this respect, but there are more bad ones. It comes down to the hidden cost of software. So yes, it’s an industry-wide cultural thing, but there are exceptions.

    I’ll propose one way to fix this: treat software development time as a project-based cost, just like any other item that goes into the product. If you were working with an outside software consulting firm (which is what my firm does, BTW), you’d get a quote and be billed by the hour for the work. With hardware you already do this: you know the cost of the components on the BOM, the cost to spin prototype boards or silicon, and the cost to build each unit overseas over the lifetime of the product. Internally, you can do the same with your software team. Use the fully loaded cost of each developer, averaged if you like, keep track of the time spent on the project, and work it into the budget.

    - S.

  5. R.A. @ LI says:

    SdR is correct. It is always more expensive (in the long run) to fix a lower level problem by using a hack at a higher level. This is true whether the lower level is hardware (i.e. software with a bad user interface) or simply a lower level of software (e.g. a driver).

    OTOH, higher level software is drastically underutilized as a means of diagnosing the cause of errors in lower level software (including software with a bad user interface – aka: hardware).

    The lowest cost alternative to fixing any problem is to have well instrumented software (at all levels) that correctly diagnoses the root cause of any problem, and then to FIX THE ROOT CAUSE (wherever that may be). Any other approach will be significantly more expensive in the long term.
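
    (As one small, hypothetical illustration of what “well instrumented” can mean in practice: even a simple trace macro that records where and why a fault was detected turns “the board misbehaved” into something that can be traced to a root cause. The names below are invented, not from any particular codebase.)

        /* Hypothetical instrumentation sketch: record the file, line, and a
         * fault code into a small RAM ring buffer that can be dumped over a
         * debug port when diagnosing a problem report. */
        #include <stdint.h>

        typedef struct {
            const char *file;
            uint16_t    line;
            uint16_t    fault_code;
        } trace_entry_t;

        #define TRACE_DEPTH 32u

        static trace_entry_t trace_buf[TRACE_DEPTH];
        static uint8_t       trace_head;

        #define TRACE_FAULT(code)                                          \
            do {                                                           \
                trace_buf[trace_head].file       = __FILE__;               \
                trace_buf[trace_head].line       = (uint16_t)__LINE__;     \
                trace_buf[trace_head].fault_code = (uint16_t)(code);       \
                trace_head = (uint8_t)((trace_head + 1u) % TRACE_DEPTH);   \
            } while (0)

        /* Usage: call TRACE_FAULT(<fault code>) wherever a layer rejects or
         * masks a suspect value, then dump trace_buf when chasing the root
         * cause instead of guessing which layer hid the symptom. */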

  6. S.R. @ LI says:

    At this point in my career, it’s nearly impossible for me to use words like “always.” It is sometimes (I will even concede often) more costly to fix a hardware bug in software. However, it is sometimes much more costly, in terms of opportunity cost, to delay product introduction for the sake of philosophy. In fact, costs and solutions must be determined on a case-by-case basis, and root cause analysis is an essential component in that effort.

    Finding the root cause is absolutely the right thing to do. Ok, ok, it is ALWAYS the right thing to do. Only after that is done can we determine whether a “hack” is appropriate. If the root cause investigation shows that a solid work-around is available in software, and a real fix can appear on the next re-spin of the hardware, then why not start making money with your product ASAP?

    And as for the original question: yup, it’s a pretty pervasive problem, driven by the thinking that Steve mentioned. Everyone knows that software is free, but everyone is wrong about that. The complicating factor in computing software cost is separating the cost of development from the cost of bugs. We need cost buckets for design, development, test, fixing other people’s bugs (hardware hacks), and real software bugs. For the bugs, we need to investigate where and why they occurred. When that’s done, the focus can go where it belongs, and proper up-front design will get the attention it deserves (for both hardware and software).

    When it’s easier to do it now than it is to do it right, your costs go up for all the wrong reasons. But that’s another cultural problem, and subject for another discussion altogether.

  7. B.Z. @ LI says:

    It is often cheaper to fix a hardware problem in software, but as the embedded system grows in size and complexity, you may find that trying to fix it with software either makes the overall system unwieldy and expensive to maintain, or it just won’t work properly and the bug pops up again in another form. It’s great to check whether SW can fix it without scrambling the system design, but no, it is not always a SW problem.

    In embedded systems, the SW and HW designers should be much closer and communicating constantly because there are many tradeoffs that can avoid dumping all the complexity on one or the other. In smaller designs, the SW and HW designers are the same person.

  8. R.A. @ LI says:

    B. Z. said: “In embedded systems, the SW and HW designers should be much closer and communicating constantly because there are many tradeoffs that can avoid dumping all the complexity on one or the other. In smaller designs, the SW and HW designers are the same person.”

    This is precisely why I have tried to focus the discussion more clearly on the concept that what people mean when they say they are “fixing the problem in software” is actually: “masking a lower level problem at a higher level”.

    As you say, there really isn’t a clear line between hardware and software, and certainly there must be a high degree of cooperation between developers at all levels/layers throughout a system.

    I think everyone agrees that fixing the root cause is the only correct solution. I maintain that if a full economic analysis is done, that it will always be more cost effective to fix the root cause (in the long term).

    While Sam is certainly correct in his assertion that there are times when it is necessary to bear the extra costs associated with the higher-level hack in order to (for example) meet a market window, this doesn’t alter the fact that it will always be more costly to mask the symptoms than it would be to fix the root cause (in the given example, it is simply a cost that must be borne in order to maintain the product’s market viability).

  9. A.N. @ LI says:

    I honestly think that the person managing the project needs to have a good understanding of where to fix the problem. Issues can be fixed in hardware and in software, but if the decision is not taken correctly it can cost the project a lot. In embedded systems, the very idea that a hardware engineer need not understand or write software is unacceptable. (That’s one of the reasons I never let our engineers be Embedded Hardware Engineers or Embedded Software Engineers – they are Embedded Systems Engineers.) Deciding whether something has to be done or fixed in software is exactly when a real embedded engineer is needed, one who understands both aspects thoroughly. That’s what makes an Embedded Systems Engineer. My two cents!

  10. F.G. @ LI says:

    I’d add one more thing, since no one else brought it up. The idea that it’s cheaper to fix in software is outmoded by the ubiquitous use of FPGAs in modern hardware design. An FPGA turnaround is very similar in cost, time, and effort to a new software build. So this ought to push the balance back to where Rennie wants it: focused on root cause. Unfortunately, due to organizational silos or just plain laziness, it sometimes takes a while for behaviour to catch up with economic reality.

  11. B.C. @ LI says:

    I really like Frank’s comments regarding FPGAs. It’s been years since I have created any FPGA code. I really liked working with them, and I remember writing more test code than implementation code to check my designs. FPGAs do have one advantage: better simulation tools for design verification.

  12. V.P. @ LI says:

    That’s a correct observation, but from a business perspective it’s fine. The “sw problem” does not necessarily point to a mistake a sw engineer made (or to bad sw engineering execution); it’s a practical statement about how the problem can be resolved in the most efficient way. You really don’t want to hear “it’s a hw problem,” because that means a silicon re-spin or a board re-design, and that’s… not good news.

  13. R.A. @ LI says:

    V. P. said: “…how the problem can be resolved in the most efficient way…”

    I think that what is most often really meant when this statement is made (in the context of resolving a problem) is:

    “how can the problem be resolved in the most EXPEDIENT way”

    Expediency != Efficiency.

  14. F.G. @ LI says:

    @V. In the 1990s it used to mean a silicon re-spin. How many companies are still designing their own ASICs or silicon? At least for pure digital applications, aren’t FPGAs the rule rather than the exception? Even in analog designs, the analog sections are becoming smaller and are being pushed out to the periphery of the design. Technological trends like FPGA cores with embedded processors have turned the line of S/W and H/W demarcation into a blurry gray area.

    Expediency is fine when it’s actually expedient! These decisions, originally rooted in sound business logic, over time become calcified into organizational traditions. It becomes “it’s a S/W problem” because the H/W team (who messed up the first design) is too busy with the next-generation design to address it in H/W, so we’ll just do a quick S/W hack, and another, and another.

    Another thing that happens is that most firms have a software defect tracking process that is more or less tied into the S/W release process, but because of the historical context they do not have a similar capability on the H/W side, or a way to properly assign problems in the first place. In some cases the most efficient solution might be a combination of S/W and H/W changes, but this is unlikely to be expedient.

    I think the underlying problem that this thread exposes is the siloed organizational structure that dominates most engineering design teams. It used to make sense because of the wide chasm in tools and skill sets, but as time to market becomes more and more critical and the technologies and techniques cross-pollinate, this structure makes less and less sense. What you really want are co-located teams of H/W and S/W engineers who have some cross-training. This would allow both sides to have some appreciation for the difficulties of “driving a screw with a hammer,” so to speak.

  15. SdR @ LI says:

    I think F. has hit part of it on the head:
    “Another thing that happens is that most firms have a software defect tracking process that is more or less tied into the S/W release process, but because of the historical context they do not have a similar capability on the H/W side, or a way to properly assign problems in the first place.”

    Many places have a heavy process for hardware changes – ECN paperwork and such for even little changes during development – and lack development bug tracking. SW bugs are usually tracked quickly and efficiently using software like Bugzilla, and from this standpoint they can be “fixed” much more easily.

    I advocate using a bug tracking system, like Bugzilla, for ALL defect tracking: Software, Hardware, Documentation, Process, and so on. Set up categories for each of these, with owners in the appropriate department. Separate it by product, and always include some general “products” that aren’t really products, like “Process”, “IT”, “Engineering Server”, or “Company”. That way you can track everything.

    A bug gets entered against a product. An initial category can be set by the reporter (say it is reported as a hardware problem because it looks like an obvious hardware issue). As comments and discussion accumulate, perhaps it is decided that the best fix is in a different area. For our example, perhaps we decide it’s best (easier, cheaper, whatever – just think engineering trade-offs) to fix it in software. Change the category to software and fix it.

    Make it easy to fix and switch the category of a problem, and it’s no longer “that’s a software problem”; it becomes “that’s a problem, let’s figure out the best engineering decision to fix it.”

    - S.

  16. V.P. @ LI says:

    @F. & S. Fully agree with both of you, but… Yes, I was talking from the perspective of those who build silicon today, and it is not 1990, correct: we have hundreds of millions of transistors to put on a die, and once it is out of the wafer you do everything (reasonable, of course) not to have another wafer coming out. That is the right thing to do at that scale. If you have the luxury of building an FPGA design, then it’s a different story. (Indeed, there are still nuances due to imperfections of the system and the FPGA compilers, where a change that is ‘practically’ no change may still render unexpected results; so, depending on the stage of your design, how close you are to production, and how much QA has passed, the decision needs to be reasonable and appropriate – sometimes a sw change is simply safer, sometimes a hw change is OK to try.) I think there is no general remedy; every case is different. I am just trying to make the point that “sw problem” does not always literally mean that, but that it can be the most appropriate solution to the problem in a given case (I never take it as offensive in any way, or maybe I just got used to it in the culture of a silicon-building company…).

    BTW, for future designs I have no problem requesting a few extra gates where otherwise I would have to change sw (to support an inverted signal, for example, or to redistribute control in a better unified form so that its support can be abstracted in sw and stay compatible with more future designs), and I have no problem getting that done in hw even though it could be handled with a sw change. Why change sw when there is an opportunity to “fix” the originally intended hw design to be better backward compatible with existing sw, with the advantage of reusing already validated sw (or parts of it)? Thanks!
