On many of the projects I worked on it made a lot of sense to implement BISTs (built-in self tests) because the systems either had some safety requirements or the cost of executing a test run of a prototype system was expensive enough that it justified the extra cost of making sure the system was in as good a shape as it could be before committing to the test. A quick search for articles about BIST techniques suggested that it may not be adopted as a general design technique except in safety critical, high margin, or automotive applications. I suspect that my literature search does not reflect reality and/or developers are using a different term for BIST.
A BIST consists of tests that a system can initiate and execute on itself, via software and extra hardware, to confirm that it is operating within some set of conditions. In designs without ECC (Error-correcting code) memory, we might include tests to ensure the memory was operating correctly; these tests might be exhaustive or based on sampling depending on the specifics of each project and the time constraints for system boot up. To test peripherals, we could use loop backs between specific pins so that the system could control what the peripheral would receive and confirm that outputs and inputs matched.
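As a rough illustration of the memory test described above, here is a minimal sketch in C. The function name `bist_ram_test`, the `stride` parameter, and the pattern constant are assumptions made for this example, not taken from any particular project. A stride of 1 gives an exhaustive pass over the region; a larger stride gives a sampled pass for tighter boot-time budgets. The test is destructive, so it suits memory whose contents do not need to survive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIST_PATTERN 0xA5A5A5A5u

/* Destructive pattern test over a RAM region (hypothetical helper).
 * stride == 1 tests every word; a larger stride samples the region.
 * Returns the number of reads that did not match what was written. */
size_t bist_ram_test(volatile uint32_t *base, size_t words, size_t stride)
{
    size_t failures = 0;

    assert(base != NULL && stride > 0);

    /* Pass 1: write an index-dependent pattern so that a stuck or
     * shorted address line is likely to show up as a mismatch. */
    for (size_t i = 0; i < words; i += stride) {
        base[i] = (uint32_t)i ^ BIST_PATTERN;
    }

    /* Pass 2: verify the pattern, then write its bitwise inverse. */
    for (size_t i = 0; i < words; i += stride) {
        uint32_t expected = (uint32_t)i ^ BIST_PATTERN;
        if (base[i] != expected) {
            failures++;
        }
        base[i] = ~expected;
    }

    /* Pass 3: verify the inverse, so each tested bit has held 0 and 1. */
    for (size_t i = 0; i < words; i += stride) {
        uint32_t expected = ~((uint32_t)i ^ BIST_PATTERN);
        if (base[i] != expected) {
            failures++;
        }
    }
    return failures;
}
```

On real hardware the region would be a physical address range rather than an ordinary array, and the test would have to run before that memory is put to use.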
We often employed a longer and a shorter version of the BIST to accommodate boot time requirements. The longer version usually was activated manually or only as part of a cold start (possibly with an override signal). The short version might be activated automatically upon a cold or warm start. Despite the effort we put into designing, implementing, and testing BIST as well as developing responses when a BIST failed, we never actually experienced a BIST failure.
Are you using BIST in your designs? Are you specifying your own test sets, or are you relying on built-in tests that reside in BIOS or third-party firmware? Are BISTs a luxury or a necessity with consumer products? What are appropriate actions that a system might make if a BIST failure is detected?
Tags: BIST, Built-In Self Test
I don’t do much design these days, but the data-acquisition systems I worked on always had a way to test operations, although the operator initiated the self-test routines by means of a special plug that connected to the back of the equipment. Then the operator could initiate several tests of everything from serial communications, the ADCs, front-panel controls, memory, etc. It took extra code but we thought it worthwhile.
I tend to use the terminology BIT (built in test) or POST (power on self test).
I have worked on systems that employed these techniques, but the vast majority of the systems I have come across just rely on a watchdog to detect an anomaly in the system.
I have created a number of BIST tests. Some of them operate during power up (POST) and some are invoked by a host application as part of a test procedure, usually considered to be part of diagnostic testing. Others may operate if an anomaly is found in the system. For example, if a kernel panic occurs, the BIST can run to determine what might have caused the panic and whether software can continue to function (maybe at a reduced functionality level) by bypassing whatever caused the panic.
Agree with all the responses. Here is my anecdote: I recently started working on a project where we decided it would be nice if we could execute certain sections of our embedded code on our PCs (in Visual C++). Nice idea. We quickly learned that our PCs (x86-based) and our embedded target (PowerPC-based) had different endian-ness. Ever since then I’ve had to use the word endian-ness a lot when working on this project …
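For readers who have not hit this yet, one common way to keep shared code independent of host byte order is to assemble multi-byte values with shifts instead of casting a byte pointer to a wider type. The sketch below assumes the shared data is stored big-endian, and the helper names `read_be32` and `write_be32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value from a byte buffer. Shifts operate on
 * values, not on memory layout, so the result is the same on any host. */
uint32_t read_be32(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value into a byte buffer in big-endian order. */
void write_be32(uint8_t *p, uint32_t value)
{
    assert(p != NULL);
    p[0] = (uint8_t)(value >> 24);
    p[1] = (uint8_t)(value >> 16);
    p[2] = (uint8_t)(value >> 8);
    p[3] = (uint8_t)value;
}
```

The same source then behaves identically when compiled for the PC and for the embedded target, at the cost of a few extra instructions per access.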
I have used a self-test module that went by the same name, BIST. It was used for periodic diagnostics and also supported on-demand remote diagnostics. The overall system was for detection purposes only. I worked on another system that had onboard diagnostics to detect a fault and trigger a fault-tolerance mechanism, which started the backup system with the same state.
I have been involved with many built-in diagnostics in embedded developments.
When I worked on flight control software we had BIST for all hardware and software. We had BIST that ran at power up, BIST that ran continuously, and BIST that ran when selected by a user.
Now that I am working on less critical industrial devices we have another kind of BIST. We have tests that run when a PCB (Printed Circuit Board) is on a test fixture. The test fixture contains the additional hardware that enables testing of all hardware on a PCB.
If the BIOS has features that we can use, we will use them. We usually need to write our own tests.
Some BIST is a high priority feature and some is a low priority feature. The lower priority BIST development may be delayed to meet a delivery schedule. BIST actions range from immediately notifying an operator to logging the event so a maintenance person can view a log later. The action depends on the severity of the problem. A broken ribbon in a printer should usually cause a failure message. A communication error that was resolved with a retry has low priority.
Avionics uses terms such as PBIT (Power Up BIT), IBIT (Initialization BIT), PBIT (Periodic BIT), CBIT (Continuous BIT), IBIT (Initiated BIT), and just plain BIT (Built In Test). Note the ambiguity that exists from one airframe to the next. Ironically I have not seen “BIST” in use on the twenty plus airframes where I have done Systems, Software, or Electronic Design. Since about 1980, the bombers and jumbo jets have started adding a variant on a CITS (Centralized Integrated Built In Test) black box to initiate BIT on other boxes, to collect the results, and to resolve ambiguity (transmitting box versus receiving box, or power bus versus one power supply fed by power bus). To say that Built In Test is essential is an understatement. The safety related systems obviously need it. The reduced time and money for test are justified for vendor and user for quantity one through quantity one million. The reduced skill level for debug is essential: skip the 50 questions, skip the test equipment, just read the error code / diagnostic message / blinking light. BIT means 10 minute debug and 40 minute repair by a dummy about 95 percent of the time: this is being written into contracts with FMECA (Failure Mode Effects and Criticality Analysis) data to back it up, even if the contract is non-safety related. Face it: BIT has a payback even for “replace cargo door switch” or “replace left front interior light”. 30 man-hours maintenance to support one hour of flight time is no longer allowed.
All of our Industrial Control products have a POST and online diagnostics. The online tests, like verifying the CRC of the flash and scanning RAM for errors, are done incrementally, over hours, to minimize CPU consumption. In Safety Systems terminology this is the Diagnostic Test Interval. This represents the longest time a fault can go undetected.
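To make the incremental idea concrete, here is a minimal sketch of a flash CRC check that processes only a small byte budget per call, so a background task can spread one full sweep over a long period. The names `crc_scan_t`, `crc_scan_begin`, and `crc_scan_poll` are hypothetical, and the standard reflected CRC-32 (polynomial 0xEDB88320) is assumed; a real product may use a different CRC or a hardware CRC unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SCAN_BUSY, SCAN_PASS, SCAN_FAIL } scan_status_t;

typedef struct {
    const uint8_t *image;   /* start of the region to check */
    size_t length;          /* region size in bytes */
    size_t offset;          /* next byte to process */
    uint32_t crc;           /* running CRC before the final XOR */
    uint32_t expected;      /* reference CRC stored at build time */
} crc_scan_t;

/* Feed one byte into a reflected CRC-32, one bit at a time. */
static uint32_t crc32_step(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++) {
        if (crc & 1u) {
            crc = (crc >> 1) ^ 0xEDB88320u;
        } else {
            crc >>= 1;
        }
    }
    return crc;
}

void crc_scan_begin(crc_scan_t *s, const uint8_t *image, size_t length,
                    uint32_t expected)
{
    assert(s != NULL && image != NULL);
    s->image = image;
    s->length = length;
    s->offset = 0;
    s->crc = 0xFFFFFFFFu;
    s->expected = expected;
}

/* Process at most `budget` bytes. Returns SCAN_BUSY until the sweep is
 * complete, then SCAN_PASS or SCAN_FAIL, and rearms for the next sweep. */
scan_status_t crc_scan_poll(crc_scan_t *s, size_t budget)
{
    assert(s != NULL);
    while (budget > 0 && s->offset < s->length) {
        s->crc = crc32_step(s->crc, s->image[s->offset]);
        s->offset++;
        budget--;
    }
    if (s->offset < s->length) {
        return SCAN_BUSY;
    }
    scan_status_t result =
        ((s->crc ^ 0xFFFFFFFFu) == s->expected) ? SCAN_PASS : SCAN_FAIL;
    s->offset = 0;
    s->crc = 0xFFFFFFFFu;
    return result;
}
```

In this sketch the time for one full sweep is roughly the region size divided by the budget, multiplied by the polling period, which is the quantity the Diagnostic Test Interval has to cover.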
The factory relies on POST plus commanded tests as necessary to validate the hardware, especially the external interfaces, because POST does not know what is attached. Because the online diagnostics are incremental and slow, they are disabled during factory test; the test controller can command a full quick execution, since there is no other demand on the hardware.
Online diagnostics generally cannot test everything because they may disturb the running system. In redundant systems, we recommend occasional switchover so that POST can run again on one of the partners, if nothing else for a full destructive memory test. Again in Safety Systems terminology, the longest time a system should run without revalidation is the Proof Test Interval. Of course in a Safety System, the diagnostics are more extensive and more CPU is dedicated to them. And the revalidation may include more than just a power cycle to run POST.
Almost everything we do has a form of BIST. Some of it runs at power up when restoring parameter values and the like: it verifies checksums on config data before using it, or else falls back to defaults and logs an error. This is all “boiler plate” kind of stuff; the real tests occur constantly during operation, monitoring currents, temperatures and so on. For example, when running a motor where a FET could overheat under low voltage conditions, we monitor FET power, back off when needed, and indicate a pending fault condition to the user. Since this is a Real Time Embedded chat area I think run time diagnostics are assumed. In my opinion you try to detect and avoid a failure and, as a second resort, indicate that the failure occurred. Of course, you always look at the likelihood of the fault and the consequences of a fault and then let the management team decide what they would like to do. Some products do things like monitor supply voltage and, if it is excessive and would cause damage, don’t start. The other side is, “you can’t fix what you can’t see”, so informing the end user of detected faults, in all devices, the same way, is always appreciated. Being from the industrial controls industry, a lot is said about diagnostics, where some products are “very good” in this area and others pathetic. Like everything else, a product is only as good as those who have designed it and those who make the requirements decisions.
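The power-up parameter check mentioned above could look something like the following sketch. Everything here is assumed for illustration: the `config_t` fields, the default values, the simple inverted-sum checksum, and the fault counter that stands in for a real event log. A production design would more likely use a CRC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t current_limit_ma;  /* hypothetical parameter */
    uint16_t temp_limit_c;      /* hypothetical parameter */
    uint16_t checksum;          /* covers the two fields above */
} config_t;

static const config_t CONFIG_DEFAULTS = { 1000u, 85u, 0u };

static unsigned config_fault_count;  /* stand-in for a real event log */

/* Inverted 16-bit sum of the fields, so that an all-zero record
 * does not validate by accident. */
uint16_t config_checksum(const config_t *c)
{
    assert(c != NULL);
    return (uint16_t)~(c->current_limit_ma + c->temp_limit_c);
}

/* Copy the stored record into `active` if its checksum matches.
 * Otherwise load defaults, count the fault, and return false. */
bool config_load(const config_t *stored, config_t *active)
{
    assert(stored != NULL && active != NULL);
    if (stored->checksum == config_checksum(stored)) {
        *active = *stored;
        return true;
    }
    *active = CONFIG_DEFAULTS;
    active->checksum = config_checksum(active);
    config_fault_count++;
    return false;
}

unsigned config_faults(void)
{
    return config_fault_count;
}
```

The caller can use the return value to decide whether to tell the user that defaults are in effect.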
In all equipment used in aircraft, built-in tests are present. As mentioned by many others in their comments, there are three types of built-in test, called BIT for short: power-on BIT, continuous BIT and initiated BIT. Power-on BIT will ensure the system is fine when we start. Continuous BIT will check whether the system is fine and can be relied upon. Based on the type of fault detected during continuous BIT, the system can be declared faulty and failed, or it will inform the pilot if degraded performance is possible. Every system will be subjected to an initiated BIT to uncover any latent faults that might have crept into the system over a period of time, or if some interface has not been used for a long time.
We design power electronics (for example, power supplies) and use POST in our products; however, most field hardware errors are not in the control board but in analogue components. Testing therefore requires that the unit be wired to an external measurement device, which can’t be done in the field. Until now it has not been possible to improve this situation, so the diagnostics are not precise enough. Any suggestions would be appreciated.
@Ondrej: Not sure I understand your problem from the description. Do you need to check the values coming from your “analog outputs” or determine if values are correct at your “analog inputs” or what?
As a rule of thumb, unless the product is very cheap I always design in some level of self testing.
The very least is the verification of the code CRC before running it, but it goes all the way to exercising the hardware, including analog and power stages.
On the battery-operated medical equipment I currently work on, we have several levels of self test.
There is a short power on self test performed each time the unit is used, but also several kinds of periodic self test where the unit powers up on its own to assess its readiness (daily, weekly, monthly), a manual self test that can be requested by the user and a battery-insertion self test.
The need for different self-tests is driven by the energy cost of each – some consume a lot of energy and we don’t want to perform these too frequently to avoid draining the battery.
On an earlier project (power supplies) the self test was embedded in each module and could be controlled either through a console or a script, ensuring that the same test was used for self test and for bench tests.
Scripting these tests allows for first-order smoke tests of your daily builds: just grab the code of the day, load it on the hardware and run it.
Ondrej, testing the hardware may require building the test equipment in the unit from the start. This may be impossible when writing code for existing hardware or when building cheap equipment.
It is also frequently necessary to dedicate run time to the test, during which the unit does not perform normally; this might be a big no-no depending on the nature of the equipment and its usage.
One approach that may work is identification of the system during its normal operation. For instance by monitoring current vs. voltage on a battery while it is used you can sometimes infer its chemistry, charge level, number of cells or whether its cells need balancing. If you have several sources/sinks, you can even run experiments without shutting the system down. Russ addressed the redundant case above.
Otherwise you of course want to have run-time fault detection and mitigation, but this to me falls in a separate category aside from self test.
@Don, yes, we need to have feedback on our analogue outputs and inputs to determine if the unit is working correctly. As these work in the range of 0-100A and 0-600V AC/DC, this is a non-trivial task. From the other feedback above it seems that design for testability would be needed. Up to now this approach has been precluded, as it would make development cost rise. Run time testing is also difficult, as our units are required to run 24/7/365, optimally from first reset to EOL. This situation is new to me, as the devices I designed before were shut down at regular intervals, so I am still pondering how to test online and during operation.
@Liptak: Welcome to the “real world of real time”. I understand the issues; you have me by 10 amps! I do contract real time products where we say to customers: “you get what you pay for, buy only what you need”. I am always in awe of products where they have the luxury of going into “test mode”, as that is usually not the case in most of our “requirements”. In reality most products “evolve” through time, and once the sales or support motivation exists you could get the time to “clean it up”. In fact in a “prior life” at a major controls company we looked at “cost of quality” and whether a “commercial versus industrial” product was warranted to reduce product cost. To make a very long story short, cost reductions from building a “commercial” product were insignificant, and quality was more dependent upon “engineering time” (assuming quality engineers) than anything else. In fact I am currently working on a “please fix it” project to “clean up” another guy’s design that “evolved”, and my guess is, by what I see, he wasn’t given the time (money) to go back and correct “stuff” after the last changes. Thus the “transient” issues, “ground is not ground” issues, “no, a wire is a resistor not a short” and all that stuff that happens at 90 amps. The before and after is amazing, but from what I have seen, it takes field failures before budget is usually allocated. Be patient, the time will likely come, hopefully later than sooner.
I also design in POST, continuous BIT, and commanded BIT like Jayanthi described. As the project develops, each type of test can be expanded. I always plan on implementing each type and design it into the hardware so the critical data is there. The commanded BIT is typically layered so that the host can ask for a summary, and if there are any errors or flags, the host can then investigate further.
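One way to picture the layered commanded BIT is a two-level query: a summary word with one bit per subsystem, and a per-subsystem detail code that the host reads only when the summary is nonzero. This is a sketch under assumptions; the subsystem list, the function names, and the meaning of the detail codes are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    BIT_SUB_MEMORY,
    BIT_SUB_ADC,
    BIT_SUB_COMMS,
    BIT_SUB_COUNT
} bit_subsystem_t;

/* Latest detail code per subsystem; 0 means no fault recorded. */
static uint16_t bit_detail_codes[BIT_SUB_COUNT];

/* Called by the individual tests to record their result. */
void bit_record(bit_subsystem_t sub, uint16_t code)
{
    assert(sub < BIT_SUB_COUNT);
    bit_detail_codes[sub] = code;
}

/* Layer 1: one bit per subsystem, set if that subsystem has a fault. */
uint32_t bit_summary(void)
{
    uint32_t summary = 0;
    for (size_t i = 0; i < (size_t)BIT_SUB_COUNT; i++) {
        if (bit_detail_codes[i] != 0) {
            summary |= (uint32_t)1u << i;
        }
    }
    return summary;
}

/* Layer 2: the detail code for one subsystem, read on demand. */
uint16_t bit_detail(bit_subsystem_t sub)
{
    assert(sub < BIT_SUB_COUNT);
    return bit_detail_codes[sub];
}
```

A healthy unit then costs the host a single summary read, and the detail queries are only needed when something is flagged.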
Even if the product is bare-bones or will ultimately not implement such features, I find it helpful during the development.