How do you support software patches in your embedded designs?

Wednesday, April 28th, 2010 by Robert Cravotta

[Editor's Note: This was originally posted on the Embedded Master.]

Earlier, I posted about design-for-patching. While some patches fix something that never worked, I believe most patches add information to the system that was not there before the patch was applied. This means there has to be some resource headroom to 1) incorporate the new information, and 2) receive, validate, store, and activate the patch in a safe manner. For resource-constrained embedded systems, this headroom is the result of deliberate trade-offs.

Patching subsystems that are close to the user interface may present a straightforward way to access the physical interface ports, but I am not aware of any industry-standard “best practices” for applying patches to deeply embedded subsystems.

Please share how you support software patches in your embedded designs. Do you use the same types of interfaces from project to project – or do you make do with what is available? Do you have a standard approach to managing in-field patches – or do you require your users to ship you the devices so that you can perform the patch under controlled circumstances? How do you ensure that your patch was applied successfully, and how do you recover from failed patches?

12 Responses to “How do you support software patches in your embedded designs?”

  1. S.T. @EM says:

    Our embedded system incorporates a Web server, so we can ‘upload’ files to that server using an Ethernet connection. Firmware update files get saved in a ‘staging SPI Flash ROM’. Once uploaded, they are validated via CRCs, and a header in the file specifies the target ROM in the system to be updated. Invalid update files result in an error returned to the user.
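The staging flow S.T. describes (upload, CRC check, header naming the target ROM, error on an invalid file) might be sketched roughly as follows. The header layout, the `UPDATE_MAGIC` value, the number of target ROMs, and all function names are assumptions made for illustration; the comment does not give the real file format or which CRC is used, so a standard CRC-32 stands in here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical update-file header; the real layout is not given in the post.
 * Fields are read in host byte order to keep the sketch short. */
typedef struct {
    uint32_t magic;        /* marks the file as an update image */
    uint32_t target_rom;   /* which ROM in the system the payload is for */
    uint32_t payload_len;  /* number of payload bytes after the header */
    uint32_t payload_crc;  /* CRC-32 of the payload */
} update_header_t;

#define UPDATE_MAGIC    0x55504454u  /* "UPDT", an arbitrary choice */
#define NUM_TARGET_ROMS 4u           /* assumed count of updatable ROMs */

typedef enum {
    UPDATE_OK = 0,
    UPDATE_ERR_SIZE,
    UPDATE_ERR_MAGIC,
    UPDATE_ERR_TARGET,
    UPDATE_ERR_CRC
} update_status_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and Ethernet). */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial when the dropped bit was 1. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Checks a file sitting in the staging area before anything is programmed.
 * On success, *out holds the parsed header (including the target ROM). */
update_status_t validate_staged_update(const uint8_t *staged, size_t staged_len,
                                       update_header_t *out)
{
    assert(staged != NULL && out != NULL);

    if (staged_len < sizeof *out) {
        return UPDATE_ERR_SIZE;
    }
    memcpy(out, staged, sizeof *out);  /* copy avoids unaligned access */

    if (out->magic != UPDATE_MAGIC) {
        return UPDATE_ERR_MAGIC;
    }
    if (out->payload_len != staged_len - sizeof *out) {
        return UPDATE_ERR_SIZE;
    }
    if (out->target_rom >= NUM_TARGET_ROMS) {
        return UPDATE_ERR_TARGET;
    }
    if (crc32_compute(staged + sizeof *out, out->payload_len) != out->payload_crc) {
        return UPDATE_ERR_CRC;
    }
    return UPDATE_OK;
}
```

A web-server upload handler could call `validate_staged_update()` once the file is fully written to the staging flash and map any non-zero status to the error page returned to the user.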

  2. M. @EM says:

    We reserve one flash sector as bootloader, which manages an upload coming from SPI, Ethernet, RS425, depending on the hardware of the project. The firmware image has a CRC that is checked at every startup, asking for another firmware upload if the image is not correct.

  3. J.L. @LI says:

    We used to apply patches on EPROM and have a service rep do the upgrade. If coding standards are a difficulty, apply them anyway so that even deep patches are straightforward.

  4. S.P. @EM says:

    Re: Software patches for products already in the field: For new products, in early runs we use microcontrollers in DIP packages, so we can send plug-in updates/fixes. After the product ages, we can use surface mount with in-circuit programming. This way factory returns can be updated with other fixes. The firmware revision is embedded in the chip, whether it is surface mount or through hole.

  5. D.T. @LI says:

    I have never been comfortable with the idea of patching. It seems like you are jumping in the air, and assembling a trampoline before you hit the ground. That and a few ruined products, due to a failed upgrade, makes for serious hesitation.

    I now throw my lot in with the PC motherboard manufacturers who simply double the size of flash needed for the BIOS. If you really HAVE to support field upgrades, then you probably should justify the additional cost of twice the flash.

    Don’t ever switch over internally to the new version until you can verify its CRC. Flash has gotten sufficiently cheap that the hazard isn’t usually worth it any more.

  6. J.L. @LI says:

    Suggesting flash memory is no good in critical systems like hands-off, lights-out semiconductor metrology robots, let alone in a safety-critical application.

  7. D.T. @LI says:

    I’m not aware of any reasons not to use flash memory in critical systems, but that is irrelevant to the point above. Whether it’s flash or EEPROM or something else, if you need to upgrade, then media aside, the wise old adage “look before you leap” behooves us to “verify CRC before you execute”.

  8. B.G. @LI says:

    CRC for sure; I think it is also prudent to maintain two banks of storage for your applications (note: the applications I’ve worked on tend to be larger ones–doubling image storage is free in my worldview). One stores the current version, and one stores the previous. Both are versioned and CRCed, and you should provide the means to forcibly boot either one. I find it very common for customers to want the ability to “step back from the brink” and revert to the previous version of the application if the current one does not agree with them.
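The two-bank scheme that D.T. and B.G. describe could be sketched as below. This is only an illustration under assumed names: `bank_info_t`, `select_boot_bank()`, and the `forced` parameter are hypothetical, and the `crc_ok` flag is presumed to come from a CRC check of each bank performed before the decision is made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-bank record; a real system would read this from flash. */
typedef struct {
    uint32_t version;  /* build number, larger means newer */
    bool     crc_ok;   /* true if the bank's image passed its CRC check */
} bank_info_t;

#define BANK_NONE  (-1)  /* no bootable image in either bank */
#define FORCE_NONE (-1)  /* caller has no preference */

/* Picks the bank to boot: 0, 1, or BANK_NONE.
 * 'forced' lets a user step back to a specific bank, but a forced bank is
 * only honored when its CRC is good. */
int select_boot_bank(const bank_info_t banks[2], int forced)
{
    assert(banks != NULL);

    if ((forced == 0 || forced == 1) && banks[forced].crc_ok) {
        return forced;
    }
    if (banks[0].crc_ok && banks[1].crc_ok) {
        return (banks[1].version > banks[0].version) ? 1 : 0;
    }
    if (banks[0].crc_ok) {
        return 0;
    }
    if (banks[1].crc_ok) {
        return 1;
    }
    return BANK_NONE;
}

/* A new image always goes into the bank that is not running, so a failed
 * write never damages the image currently in use. */
int select_update_bank(int running_bank)
{
    assert(running_bank == 0 || running_bank == 1);
    return 1 - running_bank;
}
```

Because the running bank is never the write target, an interrupted upgrade leaves one verified image intact, which is the point of paying for twice the flash.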

  9. J.M. @LI says:

    In most of the products I’ve worked on we implement some kind of small boot application that never gets updated in the field. The product always runs through the boot application first and verifies the application integrity (checksum and/or CRC) before executing the main application. If the main application is ever corrupted or an update fails for some reason, then the board is still recoverable via the boot application. I’ve never really seen a problem with this approach, with a few million units in the field.

  10. M.P. @LI says:

    The nature of spacecraft flight software is that we cannot simply stop execution and make an upgrade offline. We need to perform patches for various reasons – handling of unexpected situations, adapting to current mission constraints, fixing bugs, etc. The traditional patching is kind of old-fashioned: a new executable image or data could be uploaded to the EEPROM of the flight computer, or its RAM could also be patched.

    We also have a more progressive method. A carefully selected part of flight software runs in a different execution mode within a kind of virtual machine (I am not talking about Java here). The VM allows for two things: 1) to execute code in isolation from the rest of the software (i.e. no faults could propagate out) and 2) to upload new modules into the VM. The fault containment property allows us to upload new code without jeopardizing the flight software integrity.
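M.P. does not describe how the flight VM works, so the following is only a toy illustration of the fault-containment idea: uploaded code runs as bytecode inside an interpreter that checks every access, so a bad module yields an error status instead of corrupting the host. The opcodes, limits, and names are all invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical instruction set for the toy VM. */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2, OP_DIV = 3 };

typedef enum {
    VM_OK = 0,
    VM_FAULT_STACK,    /* stack overflow or underflow */
    VM_FAULT_OPCODE,   /* unknown instruction */
    VM_FAULT_DIV0,     /* division by zero */
    VM_FAULT_BOUNDS,   /* ran past the end of the module */
    VM_FAULT_BUDGET    /* used up its execution budget */
} vm_status_t;

#define VM_STACK_DEPTH 16
#define VM_STEP_BUDGET 1000

/* Runs an uploaded module. Every fault is reported through the return value;
 * *result is written only when the module halts cleanly. */
vm_status_t vm_run(const uint8_t *code, size_t code_len, int32_t *result)
{
    int32_t stack[VM_STACK_DEPTH];
    size_t sp = 0;
    size_t pc = 0;

    assert(code != NULL && result != NULL);

    for (int steps = 0; steps < VM_STEP_BUDGET; steps++) {
        if (pc >= code_len) {
            return VM_FAULT_BOUNDS;
        }
        uint8_t op = code[pc++];

        if (op == OP_HALT) {
            if (sp == 0) {
                return VM_FAULT_STACK;
            }
            *result = stack[sp - 1];
            return VM_OK;
        } else if (op == OP_PUSH) {
            if (pc >= code_len) {
                return VM_FAULT_BOUNDS;   /* operand byte is missing */
            }
            if (sp >= VM_STACK_DEPTH) {
                return VM_FAULT_STACK;
            }
            stack[sp++] = code[pc++];     /* operand is one unsigned byte */
        } else if (op == OP_ADD || op == OP_DIV) {
            if (sp < 2) {
                return VM_FAULT_STACK;
            }
            int32_t rhs = stack[--sp];
            int32_t lhs = stack[--sp];
            if (op == OP_DIV && rhs == 0) {
                return VM_FAULT_DIV0;
            }
            stack[sp++] = (op == OP_ADD) ? (lhs + rhs) : (lhs / rhs);
        } else {
            return VM_FAULT_OPCODE;
        }
    }
    return VM_FAULT_BUDGET;
}
```

The host only ever sees a `vm_status_t`, so a faulty module can be discarded and replaced by a new upload while the rest of the software keeps running.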

  11. R.W. @LI says:

    The kind of updating described by M. is exactly the kind of approach to fault tolerance that has evolved from using partitioned systems. The ability to partition code means you can sandbox it until you are sure it is safe, secure, and complete, and that allows for its use in critical systems. I have worked with many of my customers to discuss update scenarios – and we are not just talking about patching here. A partition, whether it contains a full VM or not, is a fundamental aspect of DO-178B safety critical systems, and partitions really do add something to the pot. If you can monitor your system and notice a fault developing or see some erroneous action occurring, then you have a chance to do something about it. That might involve uploading a patch or it might involve simply restarting the problem app. Either way, it means that the system in the field can survive for longer without direct human intervention. That means it’s more robust and cheaper to support in the long run.
    While I do think there is still a place for loading new images into flash and transferring control to them, I think the world has moved on. Systems need to stay live far longer and be more flexible, so this approach, which began with DO-178B but has now reached the more commercial world through things like separation kernels and hypervisors, has really taken hold and is making this possible.

  12. M.C. @LI says:

    I have worked on systems that had similar requirements to what M. described for spaceflight. Of course, this type of requirement demands more complex hw and sw which adds cost to the system.

    The systems I have worked on had fault tolerant hardware (duplicated hw) that could switch from an active side to the standby side without software interaction. But software isn’t perfect and there must be a way to replace code without having to restart the system. The method used was having a patcher that loaded the replacement code into memory so it was ready to be used. When everything was ready, the patch was activated atomically. The first instruction in every procedure was a NOP. To activate a patch, the NOP was simply replaced by a jump to the replacement code. Undo the patch by replacing the jump with a NOP. I have greatly simplified this, of course, as other structure and rules were needed so code was patch-friendly.
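Portable C cannot rewrite the first instruction of a procedure, so the sketch below substitutes a different mechanism for the NOP-to-jump trick M.C. describes: an atomic function pointer checked at the top of the procedure. A NULL pointer plays the role of the NOP and a non-NULL pointer plays the role of the jump. The function names and the example procedure are hypothetical, and C11 `<stdatomic.h>` is assumed to be available.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef int (*scale_fn)(int);

/* Stand-in for the NOP slot at the top of the procedure:
 * NULL means "NOP" (run the original body), non-NULL means "jump". */
static _Atomic(scale_fn) scale_patch_slot = NULL;

/* Original procedure. The slot is read once on entry, so a call already in
 * progress finishes on whichever version it started with. */
int scale_reading(int raw)
{
    scale_fn patched = atomic_load(&scale_patch_slot);
    if (patched != NULL) {
        return patched(raw);  /* "jump" to the replacement code */
    }
    return raw * 2;           /* original body */
}

/* Replacement code, loaded and ready before the patch is activated. */
int scale_reading_v2(int raw)
{
    return raw * 3;
}

/* Activation is a single atomic store, like overwriting the NOP with a jump. */
void patch_activate(scale_fn replacement)
{
    assert(replacement != NULL);
    atomic_store(&scale_patch_slot, replacement);
}

/* Undo is the reverse store, like restoring the NOP. */
void patch_undo(void)
{
    atomic_store(&scale_patch_slot, NULL);
}
```

Calling `patch_activate(scale_reading_v2)` switches every later call of `scale_reading()` to the new code without a restart, and `patch_undo()` switches back. The real scheme avoids the per-call pointer check by patching the instruction stream directly, which is why it needs the extra structure and rules M.C. mentions.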
