What is your favorite debugging anecdote?

Wednesday, December 1st, 2010 by Robert Cravotta

We all know stories about how something went wrong during a project and how we or someone else was able to make a leap of logic that enabled them to solve the problem. However, I think the stories that stick with us through the years are the ones that imparted a longer term insight that goes beyond the actual problem we were trying to solve at the time. For example, I have shared two such stories from my days as a junior member of the technical staff.

One story centers around solving an intermittent problem that ultimately would have been completely avoided if the development team had been using a more robust version control process. The other story involves an unexpected behavior in a vision sensor that was uncovered only because the two junior engineers who were working with the sensor were encouraged to think beyond the immediate task they were assigned to do.

More than twenty years later, these two stories still leave me with two key insights that I find valuable to pass on to other people. In the version control story, I learned that robustness is not just doing things correctly, but involves implementing processes and mechanisms to be able to automatically self-audit the system. Ronald Reagan’s saying “Trust but verify” is true on so many levels. In the valuing uncertainty story, I learned that providing an appropriate amount of wiggle room in work assignments is an essential ingredient to creating opportunities to grow your team members’ skills and experience while improving the performance of the team.

I suspect we all have analogous stories and that when we share them with each other, we scale the value of our lessons learned that much more quickly. Do you have a memorable debugging anecdote? What was the key insight(s) you got out of it? Is it a story that grows in value and is worth passing on to other members in the embedded community? I look forward to seeing your story.

80 Responses to “What is your favorite debugging anecdote?”

  1. A colleague was debugging an RTOS via a serial line debugger. He made a fix, uploaded the changes then stepped through the code (at source level and at assembler level). Despite stepping through the fixed code, it didn’t seem to fix the problem, even when it obviously should.

    Much later, he discovered the debugger upload was at fault. It computed a CRC of the code, found that this was the same as the CRC of the code on the target, so didn’t bother uploading the new code. When running single step, it displayed the updated code, but executed the old code.

    Normally, identical CRCs are a remote possibility, but in this case, the CRC algorithm was buggy anyway.

  2. Ian Broster says:

    I always remember a really great developer asking me once: “I think I’ve got a bug in this code, but the compiler hasn’t spotted it, do you remember how to use the debugger?”. To give some context to this: our development is done in Ada, which puts a lot of emphasis on getting the source right before running it. After months and months of development, we’d never once needed to use a debugger.

  3. M.M. @ LI says:

    My favorite story can be summarized as “don’t lie to yourself/customer”.
    We had an algorithm to generate certain parameters to operate equipment. When the algorithm produced a value of 14.9999999 (or so), it was rounded to 15.0 to show it to the user. The value used in operation was still 14.999999.
    Of course that difference would cause ONE data line on the data bus to change……. And of course that ONE line on the bus was shorted to ground.
    As a result, when we entered “15.0” by hand – unit operated fine. When “15.0” was created by the automated algorithm – not so much. :-)
    I do not remember how long it took to find this “clearly software” issue…….
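
    A minimal sketch of that mismatch in C, assuming (hypothetically) that the display rounds to one decimal place while the value sent to the hardware is truncated to an integer bus word. The function names are illustrative and not from the original system.

```c
#include <stdio.h>

/* Format a value the way the display did: rounded to one decimal place. */
static void format_for_display(double value, char *buf, size_t len)
{
    snprintf(buf, len, "%.1f", value);
}

/* Hypothetical conversion to the integer word driven onto the data bus.
 * The cast truncates toward zero, so 14.9999999 becomes 14, not 15. */
static unsigned to_bus_word(double value)
{
    return (unsigned)value;
}
```

    Under these assumptions the user sees "15.0" either way, but the bus carries 14 (binary 01110) for the computed value and 15 (binary 01111) for the hand-entered one: exactly one data line differs.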

  4. R.S. @ LI says:

    We had a distributed system with multiple workstations with multiple people debugging at the same time. The startup window for an app did not come up on the workstation that started it, but on another. Every time it came up, the user at that workstation killed it not knowing what it was. The user who caused it to start didn’t see the app come up but detected the death. This start-kill-restart cycle was repeated multiple times with confusion at both workstations. One person watching the two users at their workstations said it was like a three stooges show, the one where one stooge carries a long board on his shoulders and every time he turns he knocks another stooge over.

  5. S.T. @ LI says:

    I am a contract developer and used to spend many days running from client to client. Some time ago, I received the following voicemail messages back-to-back from a client:
    Message #1: “Ever since you installed that latest version of firmware, we keep getting low battery faults. Please come in and fix this bug.”
    Message #2 (30 minutes later): “Uh, we put new batteries in the unit and now it works fine. Never mind.”

  6. Shae Erisson says:

    A customer recently requested an RMA for their non-functional PCIe card.

    They sent us a picture where the card appeared to have been struck by lightning.

    A few minutes later someone noticed the customer had filed off the tab that prevents PCIe cards being plugged into PCI slots….

    The RMA request was denied.

  7. E.P. @ LI says:

    Along the lines of knowing what the software really does, let me offer this story:

    I was on a project to develop an embedded database for a medical test instrument. The design was done and I moved on to another part of the project. The developer assigned to code the database followed the design and added an enhancement, a data block cache. That developer then left the project (and company). During integration testing, it was found the performance was not as expected. After some real puzzlement, we carefully examined the code. Yes, the cache was really there, one block long. However, the block fetch algorithm basically went like this:
    1. DB request: read the DB header block (block stored in cache)
    2. From the header block, determine the block to read
    3. Read the block needed to serve the request (block stored in cache)
    4. Complete the request, and go to step 1

    Did you notice the problem? With a cache that held only ONE block, every read from the disc erases the entire cache. No wonder the DB server application was slow!

    The lesson is to make sure you understand the side effects of any enhancements.
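
    The four steps above can be simulated in a few lines of C. This is only a sketch: the structure, the block numbers and the function names are assumptions made for illustration, not the original database code.

```c
#define HEADER_BLOCK 0   /* hypothetical block number of the DB header */

struct one_block_cache {
    int  cached_block;   /* block currently held, -1 when empty */
    long disk_reads;     /* reads that had to go to the disc */
    long cache_hits;     /* reads served from the cache */
};

static void cache_init(struct one_block_cache *c)
{
    c->cached_block = -1;
    c->disk_reads = 0;
    c->cache_hits = 0;
}

/* Read one block through the cache; a miss evicts the only cached block. */
static void read_block(struct one_block_cache *c, int block)
{
    if (c->cached_block == block) {
        c->cache_hits++;
        return;
    }
    c->disk_reads++;
    c->cached_block = block;
}

/* Steps 1 to 3 of the fetch algorithm: header first, then the data block. */
static void serve_request(struct one_block_cache *c, int data_block)
{
    read_block(c, HEADER_BLOCK);
    read_block(c, data_block);
}
```

    Serving the same data block 100 times in a row costs 200 disc reads and produces zero cache hits, because the header read and the data read keep evicting each other.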

  8. A.M. @ LI says:

    OK here’s one from one of my past employments:

    We had a wire-wrapped circuit with several ALTERAs on it. It would work fine, but once in 10 minutes it would go crazy. It would produce completely illogical values along the lines of 1+1=347923874932874.12. Three engineers were frustrated for several days – they could not understand what was going on. My work table was nearby, and I noticed a rather peculiar coincidence – every time the nearby mini-bar turned on the compressor, the engineers would cry with frustration.
    I pointed that out, and sure enough we calculated that the RFI generated by the electric motor was amplified by the long wire-wrap legs of the board, because they were exactly half a wavelength of the RFI, and the distance between them was also right. Kind of like a Yagi antenna for AC motor RFI.

  9. O.G. @ LI says:

    A hardware one. A student asked a former colleague of mine, a seasoned engineer at the time, for help.

    - Ex-colleague: Are you sure your EPROM is working?
    - Student: Yes, it’s all fine but my microprocessor won’t work.

    After a good half hour of debugging, the ex-colleague asked:
    - How do you know that the EPROM is working? Your system doesn’t seem to get any data out of it…
    - Student: Yes, I know it is working. I saw some orange light coming out of the window when I powered it on.

    Always ask for proofs when a student claims something works…

  10. RSK @ LI says:

    In one of the automotive projects, we had to store the calibration values in EEPROM on a PIC microcontroller. It was a simple task. Call a library routine and be done with it. It would not work. Each time I tried the calibration routine, the uC would hang. Spent a lot of sleepless nights working. Test program to write data onto EEPROM worked like a charm. Not with my application. Everything was same or so I thought.

    The problem was that the hardware stack was overflowing due to the increasing call depth. I finally ended up writing ‘inline’-like functions using preprocessor macros to reduce the call depth.

  11. D.D. @ LI says:

    I was tasked with trying to find why an embedded print server was crashing on an 80188-based interface board my R&D group had built. It ran our own cooperative multitasking OS and a dozen or so processes and every so often a print job would just finish without throwing an abort error. Since the job state variable was accessed all over the place, I cooked up a high-frequency, high-priority timer interrupt tied to a quick ISR that would log the PC when the state “transitioned” from running to done. After experiencing a few normal behaviors, I got a hit! The PC was within an occasionally called function that contained the following test:

    if(JobState = JOB_DONE) { …

    This was in the days before compilers would warn you about assignments within test conditions (and before JTAG watchpoints!). Nowadays, if my compiler won’t warn me then I use the MISRA recommendation and test for (CONST == var) just to be safe.
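
    A short C sketch of the constant-on-the-left habit. The enum and function name are made up for illustration; only JobState and JOB_DONE come from the anecdote.

```c
enum job_state { JOB_IDLE, JOB_RUNNING, JOB_DONE };

/* Buggy form from the anecdote, kept as a comment so it is never compiled:
 *     if (JobState = JOB_DONE) { ... }
 * It assigns JOB_DONE to JobState, and with JOB_DONE non-zero (as in this
 * enum) the condition is then always true.
 */

/* Constant on the left: typing '=' instead of '==' here would not compile,
 * because a constant cannot be assigned to. */
static int job_is_done(enum job_state state)
{
    return JOB_DONE == state;
}
```

    The comparison reads a little backwards, but a mistyped operator becomes a compile error instead of a silent state change.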

  12. J.M. @ LI says:

    It once took 3 days to track down a lovely C issue that was crashing the system.

    Deep in a call tree someone was passing a negative number to a routine, which the function used as an index into an array. C being C, it happily used the negative number to index backwards in memory, thus writing its data into code space. Later on, the CPU eventually started executing this data as code, which worked for a few instructions before it crashed…
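
    One defensive pattern against this kind of bug is to check the index at the point of use. The sketch below is an assumption about how such a guard could look; the table, its size and the function name are hypothetical.

```c
#define TABLE_LEN 16

static int table[TABLE_LEN];

/* Store a value only if the index is inside the array.
 * Returns 0 on success and -1 if the index is negative or too large. */
static int table_store(int index, int value)
{
    if (index < 0 || index >= TABLE_LEN)
        return -1;
    table[index] = value;
    return 0;
}
```

    The caller still has to act on the error code, but a bad index is now reported where it happens instead of surfacing later as a crash somewhere else.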

  13. L.W. @ LI says:

    This one’s not embedded related, but does come with an important lesson.

    I had once written a library in C to decode CGI form variables as they were submitted. Over a few years, this library came to be used in about 8 or 9 projects, one of which was part of a fairly high-traffic website. No problems whatsoever.

    One time I was developing a new application with this C library, and it kept seg-faulting. I was so sure, because of the perfect history I had with this code, that the fault could not possibly be within that library. Half a day wasted later, I started poking around inside the library and found a null pointer access that had been there for years. It was inside a conditional block that would almost never run, but in this case, it did.

    It’s this very anecdote that scares me when it comes to writing software where a malfunction could cause injury (or worse) to a human being.

  14. S.B. @ LI says:

    One place that I was working I was assured that the hardware was fine, but when my code was in a PROM rather than on the ICE nothing happened… This software fault was solved by modifying the PCB layout so that the processor had a 0V rail connected: there wasn’t one, but the ICE had its own. Then there was the FIFO with the repeat line strapped to on (First In Forever Out), also a software fault that resulted in a track change. The moral: if the hardware engineer says it all works, ask them to prove it.

  15. M.M. @ LI says:

    Is it plugged in? Is it turned on? LOL……

  16. A.M. @ LI says:

    Not really debugging, but sure is an anecdote:

    The IT could not install the OS on an embedded system which just arrived from the manufacturer. They tried everything, but failed. Eventually they brought the system to me – “the guru”.
    It took only several minutes to understand that the manufacturer forgot to put a hard drive into the system :P

  17. E.P. @ LI says:

    Dan’s 80188 memory reminded me of working on several 80186 projects at CompuGraphic. I was put in charge of the OS and Debugger software. The Debugger was written in PL/M (Intel’s PL/1 for microprocessors). I enjoyed PL/M but compiling the debugger was a real pain on the dual 8-inch floppy based Intellec(?) box. It took hours and you only had enough space to generate either the executable image OR the assembler listing, but not both. One day the manager comes in with a couple of salesmen and asks if I could run a performance benchmark of their hardware: a Winchester hard disk! “Sure, I have just the thing!”

    After they installed it and we booted the system, I got the compiler going, generating BOTH executable and listing. About 5 minutes later (maybe less because we were talking and I sure did not expect it to be that fast), the compiler was done. Check the files: yes, the image AND the listing were both there. I turned to my boss and pleaded to have him buy the drive. I hadn’t begged like that since I was a kid.

    The salesmen did not take long to close the deal.

    Lesson: appropriate hardware can do wonders.

  18. R.D.F. @ LI says:

    My project was to enhance our codebase’s handling of the Simple Network Time Protocol (SNTP). Time servers provide the time as a seconds portion and a fractional portion. When testing, I used a web-based time server. My code worked fine – most of the time. It took some time to get to the root of the problem, by adding debug instrumentation and studying protocol traces. The server would occasionally not update the seconds portion but did just fine with the fractional portion, resulting in negative values. After I realized what was happening, I added a simple test ensuring that the time returned from the time server was later than the previously retrieved time. One would think that a ‘clock’ wouldn’t run backwards. Again – “Trust but verify”.
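
    A sketch of such a "not earlier than the last reading" test in C. It assumes the timestamp is held as 32-bit seconds plus a 32-bit fraction; the struct and function names are illustrative, and the plain comparison deliberately ignores wraparound of the seconds counter.

```c
#include <stdint.h>

/* Timestamp split into whole seconds and a binary fraction of a second. */
struct sntp_time {
    uint32_t seconds;
    uint32_t fraction;
};

/* Combine both parts into one 64-bit fixed-point value for comparison. */
static uint64_t as_fixed_point(struct sntp_time t)
{
    return ((uint64_t)t.seconds << 32) | t.fraction;
}

/* Accept a new reading only if it is strictly later than the last one.
 * Returns 1 and updates *last when accepted, 0 when rejected. */
static int accept_if_later(struct sntp_time *last, struct sntp_time candidate)
{
    if (as_fixed_point(candidate) <= as_fixed_point(*last))
        return 0;
    *last = candidate;
    return 1;
}
```

    A reading whose fraction wrapped around while its seconds field stayed stale compares as earlier than the previous one and is rejected.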

  19. A.M. @ LI says:

    R. – I feel like that solution is going to bite you hard. You should probably check for the time integrity some other way.

  20. J.R. @ FB says:

    I have a saying: “More Code = More Bugs”
    (People tend to measure their effort by how big their code is… who feels good about working on a subroutine all day and not having much to show for it? Too bad they don’t realize that coding is like playing “Name That Tune”, where I can make the subroutine with the least amount of wires and functions!)

  21. J.N. @ LI says:

    In 1981 we had a small 8080-based system that used a 60Hz interrupt for timing and a software RTC. The clock was running at half speed. I started checking the code and realized my colleague had optimized for the worst case, which of course is rollover. Unfortunately it means it checked for carry on every digit, which with other overhead took longer than the interrupt. We could tolerate one of those, but two in a row would lose an interrupt. All I did was fiddle the code for each digit so it quit if there was no carry. It added a slight amount of time to calculate each digit, but it was only significant for large numbers of carries, and I was reasonably confident that we’d never get two of those in a row. :)

    Obviously that was the problem and the correct fix or I wouldn’t be bragging about it now.

    One lesson from this is that one should not necessarily always optimize for the worst case. There are also lessons about code reviews and multiple eyes looking at a problem. I have no idea how long M___ would have taken to find the problem in his own code but I spotted it right away because I would have written it differently to start with. What caught my attention was the (to me) odd method he used.
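
    The early-exit idea might look like this in C. The original ran on an 8080, so everything here is an illustrative assumption: a four-digit MM:SS clock, one digit per array element, least significant digit first.

```c
#define NUM_DIGITS 4

/* Rollover limit of each digit, least significant first:
 * seconds ones, seconds tens, minutes ones, minutes tens. */
static const unsigned char digit_limit[NUM_DIGITS] = { 10, 6, 10, 6 };

/* Advance the clock by one second.
 * Returns how many digits were touched, as a rough measure of work done. */
static int clock_tick(unsigned char digits[NUM_DIGITS])
{
    int touched = 0;
    for (int i = 0; i < NUM_DIGITS; i++) {
        touched++;
        if (++digits[i] < digit_limit[i])
            break;          /* no carry out of this digit: stop early */
        digits[i] = 0;      /* digit rolled over: carry into the next one */
    }
    return touched;
}
```

    Most ticks touch a single digit; only a full rollover such as 59:59 walks all four, which is the rare long path the anecdote decided it could tolerate.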

  22. R.C. @ LI says:

    I was working on a digital TV. It would sometimes crash. Someone was able to record the video signal. I was able to reproduce it about 1 of 10 times. A table that was used in decoding the tables describing the video streams was being overwritten.

    I checksummed the table whenever it was written intentionally. I moved the code that verified the checksum around the various functions. I had gone through about 90% of the code when I got to the high priority interrupts. A driver that came with a chip we used assumed the S/W buffer was at least as long as its buffer. Our buffer was shorter, but it was longer than the worst case packet.

    The protocol is defined with a start and an end marker. There is no length marker. When the signal fades, an end marker may never be found. When that happened, the driver would overrun the buffer. The table happened to be near it in memory. I modified the driver to accept a buffer length.

  23. D.C. @ LI says:

    I once had a contract to tidy up and bug-fix somebody else’s port of an operating system and associated utilities to a new computer. It was one of the first to have 3.5 inch floppy disk drives. I finished the project, got it accepted by the customer and returned the hardware to him. A few days later, the customer telephoned to say that whenever he formatted or reformatted a floppy disk, the disk became unreadable (in those days, floppy disks did not come ready formatted out of the box). I found this very strange, because I had formatted all the 3.5″ floppies I had used on the customer’s machine, with no problems. So the customer sent the computer back to me and I looked into it. It transpired that whoever wrote the disk formatting program had decided to generate a volume label incorporating the date of formatting. Unfortunately, he/she hadn’t converted the date from binary to ASCII. The formatting program DMA’d a data stream to the controller chip, and the controller chip wrote the data to disk – except that it interpreted byte codes 0xFA to 0xFF as commands such as write an address mark, write a CRC, etc. So for 6 days out of every 256, the LSB of the binary date would be interpreted as one of these special commands, resulting in a bad format for track 0. You could format a floppy disk successfully on any day apart from those 6.
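
    The "6 days out of every 256" figure and the ASCII fix can be checked with a small C sketch. The function names and the five-digit label format are assumptions for illustration; only the 0xFA to 0xFF command range comes from the story.

```c
#include <stdio.h>

/* Byte values the controller treats as commands instead of data. */
static int is_controller_command(unsigned char byte)
{
    return byte >= 0xFA;
}

/* Count how many values of an 8-bit day counter would be misread. */
static int count_bad_days(void)
{
    int bad = 0;
    for (int day = 0; day < 256; day++) {
        if (is_controller_command((unsigned char)day))
            bad++;
    }
    return bad;
}

/* Hypothetical fix: write the day number as decimal ASCII digits.
 * The digits '0' to '9' are 0x30 to 0x39, far below the command range. */
static int format_label_date(unsigned day_number, char *buf, size_t len)
{
    return snprintf(buf, len, "%05u", day_number);
}
```

    With the binary date, any day whose low byte lands in 0xFA to 0xFF corrupts the format; with the ASCII form no byte of the label can collide with a command code.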

  24. C.S. @ LI says:

    I worked on a VME 64x system with 2 single board computers in it. During data structure transfers between the 2 single board computers, some of the data received by the second single board computer were good and some were not! I verified all my software and everything seemed good. Then I noticed that only some of the float values were wrong, while all integers and characters were good. I wrote a test program to transfer the standard 0xFFFFFFFF. On the other side I got 0xAFFFFFFF. The data lines D30 and D28 were grounded on the backplane. Since my integer values were relatively small, they were always correctly transferred, as were the characters. Even the negative values, because the D31 line was not grounded. But any float value could have been corrupted by these two grounded VME data lines.
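
    The effect of the two grounded lines can be modelled in C as a bit mask. This is a sketch under stated assumptions: 32-bit IEEE 754 floats, and function names invented for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Data lines D30 and D28 are stuck low on the (modelled) backplane. */
#define STUCK_LOW_MASK ((UINT32_C(1) << 30) | (UINT32_C(1) << 28))

/* What the receiving board sees for a 32-bit word sent by the other board. */
static uint32_t through_backplane(uint32_t word)
{
    return word & ~STUCK_LOW_MASK;
}

/* Same transfer for a float, going through its raw bit pattern. */
static float float_through_backplane(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits = through_backplane(bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

    Small positive integers never use bits 28 and 30, so they pass unchanged, while a float such as 1.0f (bit pattern 0x3F800000) has bit 28 set in its exponent and arrives as a very different number.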

  25. E.P. @ LI says:

    Yes, J., interrupts can be fun. On an I186 system, I was testing the initial installation of the OS I maintained. It crashed nearly every time just after initialization with an unsolicited interrupt. I knew the OS Initialization routine was running and doing what it was supposed to do. I touched base with the guy who wrote the Boot code and he confirmed my logic. Finally I resorted to essentially single stepping the code. Check the pending interrupts register: none waiting. Enable interrupts: Bang!

    After some weeping and gnashing of teeth, we were able to identify the bug IN THE HARDWARE! That Step of 186 had a problem. Since the Boot code ran hardware tests, it left the clock running generating interrupts. Though I followed the documented process to stop the clock and clear all interrupts, apparently an interrupt was latched somewhere between the integrated interrupt controller and the main interrupt controller. The only way to release it was to accept it.

    It was code that should not have been written, but likely existed for several generations of the OS after I left.

  26. xvr says:

    In the 198x I had a great experience with a small computer system. It was based on the i8080 chip and used a couple of other chips (i8257 and i8275) to connect to a TV, which was used as a text display. The problem was in the display – it showed chaos instead of the expected text. The second problem was in debugging – when we tried to connect an oscilloscope probe to any pin of any chip in the display circuit, the TV magically eliminated the chaos and started to show perfect text. Moreover, when we just moved the scope probe closer to a pin (without connecting it), the magic came again – a clear picture appeared.
    The answer was very surprising – the root cause was not the probe, but the shadow from my hand! There was a desk lamp near the system, which illuminated the desktop and the i8257 chip. When light was removed from the chip, it started working! We checked it twice (with a ruler instead of a hand – when the shadow from the ruler fell on the chip inside the i8257 package, it turned on).
    I had never seen a chip with a ‘photo-effect’ before, and never have since.
    PS. The bug was in the clock input of the register which latches the address from the i8257 – it was disconnected. That was a second magic – how could it work, even in darkness?

  27. K.N. @ LI says:

    “Stare at the code some more, DimWit!” This is all I heard every time the senior-most Engr in the Co. caught me sneaking a peek using gdb. Since he was spot-on with the dimwit part, I’ve followed his advice ever since.
    Many years later, we had a major issue with losing data intermittently on an I2C bus: a slow bus carrying occasional control information that no one had really bothered with for quite some time. My part of the code was reading data off a USB and that was working fine, so not really my problem, but the delivery date was fast approaching. After “staring at the code” for long hours, I noticed that one of the interrupt routines was quite long, more than half my vi screen. I pointed it out to the Engr, but since checking this involved mucking about with a ‘scope, we didn’t get around to it for a few more days. Late one night, more out of desperation than anything else, we stuck in a ‘scope and sure enough, the INT clock was off. It wasn’t our code and I don’t really know how the time was shaved off, but the project went through.

  28. N.M. @ LI says:

    I like Krishnan’s comment. I have had so many embedded debugging experiences where, despite all the slick tools in the world, the problem was ultimately solved by staring at the code. As I see it, the problem is that by the time you can see the recognisable symptoms of the bug, often so much is wrong that you can barely untangle what you have in front of you. I still like to have good debug tools to hand.

    In response to an earlier comment, yes Lauterbach tools are excellent and deserve their good reputation. And they also have excellent technical support. I remember sending a fairly deep question to their tech support email and getting a very comprehensive reply from a Mr. Stefan Lauterbach, no less.

  29. K.N. @ LI says:

    @n.: yes, unfortunately we haven’t yet learned the art of *comprehensive* testing from the h/w folks; much as we’d like to blame them for our bugs! Programmers earn money from each other’s mistakes, whereas the h/w guys have a Million $$ staring them in the face, going back to the fab to fix their bugs.

  30. J.A. @ LI says:

    I was working on a Z180 embedded system many years ago. I was handed the first, hand assembled, production prototype to begin debugging the firmware. I plugged in the processor emulator and proceeded to examine the memory and I/O lines. Nothing worked. So I started scoping signals. At the processor socket 100%, everything where it should be. At the memory circuits, RAM, ROM, and I/O, control lines 100%, address and data however were 0%. It turned out that the RAM, ROM, and I/O had been connected to each other, but the processor was left out. I still have the board with all the data and address lines as little blue wires, in a shadow box, to remind me of the importance of a solid DRC and DFM check. BTW the design number assigned by the hardware department to the board in question was 666, I should have known there were bound to be issues!

  31. R.S. @ LI says:

    Friendly competition is ok, but finger pointing and recriminations are not a good thing.
    I remember working in a lab that had a chalk board to count errors by (perceived) responsible discipline.
    If your intent is to embarrass individuals to leave the field, that’s the way to do it.

  32. R.H. @ LI says:

    Recalling an ex-colleague who once had to write a test system for EPROMs. That was in the days when the things could only be programmed three times. So he wrote a routine to program each EPROM 3 times (and erase it again). When the chip passed this test, it was shipped to the customers. Of course, enormous numbers of complaints from customers followed after a while. We seriously doubted the semiconductor company making those EPROMs (and things somehow heated up enough to get lawyers involved). Then the colleague somehow heard what was going on, and confessed…

    Also recall another colleague who had to write a float-to-int conversion in Pascal. That went something like (in C): i = 0; while (f > 0) { f = f - 1; i = i + 1; }. And then had to wait a while on large floats…. even larger floats never ended the loop….
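
    Why the loop never ends for large values can be shown in C: once a single-precision float is big enough, subtracting 1 rounds straight back to the same value. The sketch assumes 32-bit IEEE 754 floats and a 32-bit int; both function names are made up for illustration.

```c
/* Returns 1 if subtracting 1.0f actually lowers f,
 * 0 if rounding swallows the decrement. */
static int decrement_makes_progress(float f)
{
    volatile float next = f - 1.0f;   /* volatile forces rounding to float */
    return next < f;
}

/* Conversion by truncation with an explicit range check.
 * Returns 1 and writes *out on success, 0 if f is NaN or out of int range
 * (the bounds below assume a 32-bit int). */
static int float_to_int(float f, int *out)
{
    if (!(f >= -2147483648.0f && f < 2147483648.0f))
        return 0;
    *out = (int)f;
    return 1;
}
```

    At 33554432.0f (2 to the power 25) neighbouring floats are 4 apart, so f - 1 rounds back to f and the counting loop spins forever; a cast with a range check finishes in constant time.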

  33. R.D. @ LI says:

    In the early 2000s I was asked to work on a problem that was killing our wireless modem. Every now and then the connection would go into a state where the call would be up but no data would be going either way. This was holding up the release of the product. It took me quite some time to figure out that the ARQ protocol implementation was done in such a way that the SOF byte was checked nibble-wise, and when the higher nibble matched the pattern and the lower nibble didn’t, the SOF checking algo jumped to the next received nibble without checking the lower nibble for being the SOF’s higher nibble. So, there was a possible condition in the system where the SOF’s higher nibble would be preceded by the same nibble and the ARQ would never detect the frame.

    There was another case where our compiler was automatically choosing a 14-bit register to store a 16-bit address value. The code was jumping to all vague locations in the memory due to the masking of the upper 2 address bits. It took a lot of head scratching to figure out what was happening (I literally mugged up the instruction mnemonics to search for known patterns in the hex output generated by the compiler). The solution was weird! When we put a NOP instruction before assignment of the 16-bit value to a register, the compiler would use a regular 16-bit register!!

    These two were the most mind-boggling and also satisfying bug fixes in my entire career.
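
    The start-of-frame bug in the first story can be reproduced with a sketch in C. This is one reading of the description, not the modem's actual code: the nibble stream, the SOF value and both function names are assumptions.

```c
#include <stddef.h>

/* Buggy scan: after the high nibble matches and the low nibble does not,
 * it skips past the failed nibble without trying it as a new high nibble. */
static long find_sof_buggy(const unsigned char *nibbles, size_t count,
                           unsigned char sof_hi, unsigned char sof_lo)
{
    size_t i = 0;
    while (i + 1 < count) {
        if (nibbles[i] == sof_hi) {
            if (nibbles[i + 1] == sof_lo)
                return (long)i;
            i += 2;             /* bug: the nibble at i + 1 is never rechecked */
        } else {
            i++;
        }
    }
    return -1;
}

/* Fixed scan: always advance by a single nibble, so a nibble that failed
 * as the low half is still considered as the high half of the next try. */
static long find_sof(const unsigned char *nibbles, size_t count,
                     unsigned char sof_hi, unsigned char sof_lo)
{
    for (size_t i = 0; i + 1 < count; i++) {
        if (nibbles[i] == sof_hi && nibbles[i + 1] == sof_lo)
            return (long)i;
    }
    return -1;
}
```

    For a hypothetical SOF of 0x7E and the stream 7, 7, E, 1, the fixed scan finds the marker at nibble index 1 while the buggy scan reports no frame at all.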

  34. S.F. @ LI says:

    I recall trying to integrate a satellite telemetry system for the first time; initially with much success as we walked through the expected operation with lots of debug information being sent out of a serial port. When everything appeared to be working properly we then disabled the debug and the system stopped working at all. Every time we put any debug back in, the system worked fine, causing lots of head-scratching and staring at the code. Eventually the problem was found to be a variable that was shared between 2 ISRs and not properly protected. When the code didn’t have any debug then the compiler kept the variable in a processor register and so the value didn’t appear to change in memory, but when the debug code was added then the compiler couldn’t do this optimisation. A 1-line fix in the end, but took a while to find.
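
    The anecdote does not say what the one-line fix was. A common remedy for this symptom in C is to declare the shared variable volatile, sketched below with hypothetical names. Note that volatile only forces real memory accesses; it does not by itself make a read-modify-write sequence atomic.

```c
#include <signal.h>

/* Shared between interrupt context and other code. 'volatile' tells the
 * compiler that every access must really read or write memory, so the
 * value cannot be cached in a register across accesses. */
static volatile sig_atomic_t frame_ready = 0;

/* Hypothetical receive ISR: signals that a frame has arrived. */
void rx_isr(void)
{
    frame_ready = 1;
}

/* Consumer side: returns 1 once per signalled frame, 0 otherwise. */
int take_frame_if_ready(void)
{
    if (!frame_ready)
        return 0;
    frame_ready = 0;
    return 1;
}
```

    Adding debug output often hides this class of bug because the extra function calls happen to force the compiler to reload the variable from memory.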

  35. S.B. @ LI says:

    Adding a new type of display to a sonar system I found that my code was not displaying the output correctly. Eventually I realised that a context switch was needed to display graphical data so added it into the old code – a day or two later – on holiday – I got a panicked call that one of the test results screens was now “corrupted” – it developed that the actual results of the test had been displayed graphically on the original system but had been lost about 10 years earlier as a result of the same lack of context switch and had been forgotten about in the interim.

  36. S.B. @ LI says:

    Had a lovely time porting a pile of code written in PLM-86 that had been in use for years running under iRMX86 on the old Intel Iris boxes, (dual 8″ floppies), to PLM-486 running under iRMX486 on a PC. Got it all building – with no errors and fewer warnings than the old system but it would only run for a couple of mins before the OS crashed. Tried under the debugger and it ran too slow to be useful but never crashed. As it was a high profile project I was under constant pressure to explain exactly what the problem cause was – to which my reply was always that if I knew exactly what the problem was I would have fixed it. Eventually pointed out that I was spending more time in meetings, writing reports and answering questions than trying to find the problem, (after a day in which I did 1 hour of debugging and 7 explaining that I did not yet know exactly what the problem was), and was told that the project manager would take over the questions and meetings. 7 solid hours work later, adding debug output – rebuilding – testing – starting again – I found that there was a library call that passed a structure with the correct values in the correct size structure into an OS call – unfortunately the types were off. The debugger was intercepting the call and passing it on in the correct types.

    3 morals – 1 running in debug is NOT the same as for real, 2 let the engineer get on with his job and he/she can produce results, 3 “What exactly is the problem?” is the ultimate stupid question.

  37. T.Z. @ LI says:

    On the negative time theme, I’ve always used 5Hz or 10Hz GPS units. I would log the data to a KML and view it on google earth. I had one long drive with a strange time code – it said it started at midnight on the day I started the drive (in the afternoon). It turned out that on this particular GPS, they update the day with some delay from updating the time, so at the midnight-cross, the first few entries showed the previous day (visible from a log). Also the GPGLL (which doesn’t have the day) might arrive before the GPRMC so I had to avoid using the time from the GPGLL sentence. So when writing things for GPS, I’ve learned to beware the glitching hour.

  38. J.N. @ LI says:

    On the theme of time rollover, the same 8080 project (I mentioned before) ran into another issue, but this time the compiler was optimizing when we didn’t want it to. And in fact we stumbled on the bug completely by accident, never having seen it manifest.

    If you recall we had a 60Hz-interrupt-driven RTC in software that incremented the 1/60th counter and then rolled over into the digits. To read back and use the value, T___ multiplied and added the time into a single word, stored it into a variable, then did the same again into a second variable and then compared them. If they were different he’d figure it rolled over during the calculation and just loop back and do it again. Nothing wrong with that.

    One day I heard him call out in surprise. He was debugging in the emulator and just happened to be in that section of code. The code the compiler generated went through the calculation, stored the result, and then very efficiently stored the same value from the same registers into the second variable. Then it compared the two variables.

    Mind you, this was circa 1981 in Intel PL/M. I think the options were “optimize everything you can” and “don’t optimize anything” and we didn’t want to disable all the other nice optimization just for one exception.

    I guessed that the compiler was triggering on two identical source statements, so I suggested he simply reverse the order of the statement — seeing as how both addition and multiplication are commutative — and see what happened. He was skeptical but he gave it a shot and son of a gun, this time it generated two separate calculations.

    I don’t know if it was the source or some intermediate code it was optimizing but since it worked we chalked it up to experience and moved on.

    If it hadn’t worked I’d have tried something else. These days I would do something like, oh, store the seconds LSDigit into a temp variable, run the calculation, then compare the current LSDigit against the temp. Faster than running the calculation twice and less likely to get optimized away.
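
    For comparison, a rough C sketch of the same read-twice idea. The original was PL/M, so every name here is hypothetical. Declaring the counters volatile is what stops a modern compiler from folding the second calculation into the first.

```c
#include <stdint.h>

/* Hypothetical software RTC, updated by a 60 Hz interrupt handler. */
static volatile uint8_t rtc_ticks;    /* 1/60ths of a second, 0..59 */
static volatile uint8_t rtc_seconds;  /* 0..59 */
static volatile uint8_t rtc_minutes;  /* 0..59 */

/* Called from the 60 Hz interrupt: count ticks and roll over. */
void rtc_tick_isr(void)
{
    if (++rtc_ticks == 60) {
        rtc_ticks = 0;
        if (++rtc_seconds == 60) {
            rtc_seconds = 0;
            if (++rtc_minutes == 60) {
                rtc_minutes = 0;
            }
        }
    }
}

/* Combine the fields into one word. Each field is re-read on every call
 * because the variables are volatile. */
static uint32_t rtc_snapshot(void)
{
    return ((uint32_t)rtc_minutes * 60u + rtc_seconds) * 60u + rtc_ticks;
}

/* Read twice and retry until both readings agree, so a rollover in the
 * middle of the calculation cannot produce a torn value. */
uint32_t rtc_read(void)
{
    uint32_t first, second;
    do {
        first = rtc_snapshot();
        second = rtc_snapshot();
    } while (first != second);
    return first;
}
```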

    Many times when I’m first ready to start testing a new piece of code I’ll start by stepping through it in the debugger, mostly going through the initialization functions. It’s a variant on the “staring at the code” method but it lets you check a lot of your assumptions — like what values ACTUALLY get calculated and stored.

  39. N.M. @ LI says:

    Good story J.. I should say that I include the compiler’s assembly in the code to be stared at. Many a seemingly tricky problem has become simple when I looked at the assembly.

    One in particular comes to mind. We had a body of portable code and it failed with the ARM’s own ARM compiler. The failure was that it never exited from a busy wait loop:
    while( 0 == var ) ;
    It turned out that the ARM compiler compiled this the same way it compiled
    if( 0 == var ) for( ;; ) ;
    It also turned out that the ARM compiler was entirely correct in its optimisation because we had not (and did not want to, because we needed the optimisations everywhere else) declare ‘var’ as volatile. All the other compilers had had their optimisation level set to maximum but they had not done this particular optimisation.

    As I am sure you all know, the way to force a read on every iteration of such a busy wait loop is with a cast:
    while( 0 == *(volatile int *)(&var) ) ;

  40. R.S. @ LI says:

    Staring at the code won’t help with problems in the environment. On one project in trouble we once pulled in people from other projects just to stare at code. The problem though, was an out of date device driver.

  41. R.F. @ LI says:

    Using a bugged debugger (really).

  42. J.N. @ LI says:

    “Staring at the code won’t help with problems in the environment.”

    Quite true. Of course, staring at the code is a valuable tool — but one of several. Different tools have different strengths, which is why you learn to use several and switch off when one isn’t finding the problem.

    And sometimes one tool gives you a hint that gets you closer, but it takes a different tool (or tools) to get you the rest of the way.

    “Using a bugged debugger (really).”

    Oh, I quite believe it. I really, really hate fighting with the tools. The Cypress stuff drives me bugnuts. The only reason I put up with it is that I love the processor.

  43. R.C. @ LI says:

    Back in 1984 as a new engineer I was responsible for the design of the frame buffer for a 2048×1536 (!) pixel computer display. It had a horizontal microcode bit slice processor so I wrote some code to write and read values to the buffer and it all worked fine. For some reason I got it into my head to try it without the frame buffer installed. Passed fine. Turns out the bus capacitance was sufficient to keep the data if it was written and read back immediately. I changed the code to write, write different data to a different location, then read the first location again.
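
    The changed test sequence might look something like this in C. It is only a sketch: the original was bit-slice microcode, and the function name, word width and choice of the second location are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a pattern, then write the inverted pattern to a different
 * location so the data bus no longer holds the first value, and only
 * then read the first location back. Returns 1 on success, 0 on failure. */
int mem_test_word(volatile uint16_t *mem, size_t index, size_t other,
                  uint16_t pattern)
{
    mem[index] = pattern;
    mem[other] = (uint16_t)~pattern;  /* disturb whatever is left on the bus */
    return mem[index] == pattern;
}
```

    With the memory missing, the read-back should now see the inverted pattern (or nothing sensible) instead of the value that was just written.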

    I had had experience with slower microprocessor based systems where the original test strategy worked fine.

    Always test the test…

  44. G.C.Y. @ LI says:

    New guy adding “feature” to 411 operator console. Code reviewed by “experts” who said their original unreachable “dead code” should stay as safety measure and should not be an operational concern since “it can never go there”. New guy adds warning logs to every piece of dead code so that operational measurement software can be used during regression runs to track “bugs”. First run of changed code is when “experts” get lab time next morning. New guy called down to account for all the “bugs” that are spewing out of the log console. Turns out the “dead” sections are traversed on _every_ run since the experts used the “out” clause to handle the normal case. Moral of the story: If it is not your code, treat it as new and untested.

  45. R.A. @ LI says:

    G.C.Y. said: “Turns out the “dead” sections are traversed on _every_ run…”

    Hence the rule for safety critical systems, that there can be no “dead” code.

  46. M.L.M. @ LI says:

    I was working a MIPS 3000 project and was responsible for the cross compilation tools. As the project approached critical mass the optimizer was turned on and the compiler immediately exploded with stack corruption in a machine generated module.

    Since I didn’t have a viable stack frame, I couldn’t see where the call came from. I had gcc spit out the intermediate files on that module. There didn’t seem to be anything wrong with the code, but I did discover that the machine code generator had built a single basic block with over 100,000 lines of executable code.

    The optimizer pass allocated an array of data structures with one element per line of executable code. On the stack! The quick fix was to reconfigure all the build machines and engineers’ workstations to allow 32MB of stack instead of the 8MB default.

    Always be open to the possibility of the environment causing you problems.

  47. J.N. @ LI says:

    Bus capacitance! Last time that bit me my boss actually didn’t believe me when I suggested that was the problem. It can get interesting working with pure software guys when you have hardware experience.

    After that I got into the habit of, when I’m doing matrix inputs, turning the inputs to outputs just long enough to give them a good yank between readings. Even if they seemed to work anyway. (Yes, I do remember to turn the outputs to inputs first to avoid bus conflicts. :)

  48. R.S. @ LI says:

    Cable too long for driver to drive to the output! Took forever to get the hardware guys to hook up a scope at the input and stop blaming my code.

  49. R.S. @ LI says:

    N. M. said:
    As I am sure you all know, the way to force a read on every iteration of such a busy wait loop is with a cast:
    while( 0 == *(volatile int *)(&var) ) ;

    I was thinking the following code snippet should also work, though I’ve not tested it yet:
    volatile int temp = 0;
    do {
    // do whatever with var
    temp = (volatile int)var;
    } while (0 == temp);
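
    One caveat on the untested snippet above, offered as a sketch and not a verdict: in standard C, qualifiers on the target type of a value cast are discarded, so (volatile int)var on its own does not oblige the compiler to reload var. What matters is that the access goes through a volatile-qualified lvalue, as in the pointer cast. A do/while variant along those lines, with hypothetical names:

```c
/* Hypothetical flag set elsewhere (for example by an interrupt handler);
 * deliberately not declared volatile, as in the story above. */
static int var;

/* Read through a pointer to volatile. The access itself is volatile, so
 * compilers in common use perform the load on every call. */
static int read_int_volatile(const int *p)
{
    return *(const volatile int *)p;
}

/* Busy-wait until var becomes non-zero. */
void wait_for_var(void)
{
    int temp;
    do {
        /* do whatever with var */
        temp = read_int_volatile(&var);
    } while (0 == temp);
}
```

    As others note in this thread, checking the generated assembly is still the only way to be sure what a particular compiler did.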

  50. P.K. @ LI says:

    Good earning in Management careers. You can get careers in Management work.

  51. R.S. @ LI says:

    If I wanted to be in Management, I wouldn’t be posting here now would I?

    Not all C compilers properly implement volatile.

  52. N.M. @ LI says:

    Good point about volatile being treated differently by different compilers. That’s why it is so useful to spend a moment checking the generated assembly.

  53. R.H. @ LI says:

    ‘Volatile’ reminds me about a compiler (around 1984) whose optimizer looked at the last statement before the closing } of main(), and looked whether that statement’s result somehow left something of a result to be used in the exit-call, and if not, it removed that last statement from the executable code. Then it checked the new last statement, etc. etc. etc.

    Unfortunately the optimizer didn’t know about memory mapped I/O. So it could happen that it optimized away *all* your code, which resulted in extremely fast runtimes with incredibly small executables (even for those days).

    Now it happened that these executables were in turn used for benchmark testing… this compiler finished as nr. 1 !!!!

  54. J.A. @ LI says:

    Not all C compilers handle complex bit mapped data structures correctly. A case in point that has bitten me in the past. GCC for an embedded ARM system handles the following bit mapped, packed structure in ascending order (LSB first) while MS-Visual C++ handles the exact same code in descending order (MSB first)

    struct typical_Rx_item // ARM
    {
    uint32_t sub_1:24; // LSB 24 bits (bits 0-23)
    uint32_t sub_2:23; // next 23 bits (bits 24-46)
    unsigned char filler:1; // MSB. (bit 47) (bytes 0 – 5)
    unsigned char Name[16]; // item name (bytes 6-21)
    }__attribute__ ((packed));

    The example above is a simple one…
    This was a real problem because only the bit mapped variables that end on byte, word, double word, and long boundaries end up being swapped. The end result was a scrambled bit stream. It was a real mess. One applications programmer felt that this was incomprehensible as a compile time task and went so far as to write routines for each variable defined, to extract the bits, one at a time out of each byte of the communications stream, and recompile the variables at runtime. His compiled program code is huge, runs slow, uses lots of processor bandwidth, and crashes frequently.

  55. R.A. @ LI says:

    ” GCC for an embedded ARM system handles the following bit mapped, packed structure in ascending order (LSB first) while MS-Visual C++ handles the exact same code in descending order (MSB first) ”

    That’s not really a compiler issue, but an operating system (or executive) issue (and it is more of a choice than an “issue” per se).

    The ARM processor can operate in either big-endian or little-endian modes. I suspect the environment you were using Visual C with operated the processor little-endian, and the environment you operated under with gcc used big-endian mode. Gcc can be configured to generate either ARM LE or BE, but if you are compiling for a BE OS/executive, you obviously must use a gcc compiler configured for BE.

  56. T.D. @ LI says:

    I worked on a prototype ethernet switch that had a small set up and hold error in the bus controller for an address/data multiplexed bus. When certain circuitry was in operation on the board, the set up and hold conditions for the bus were violated. One bit in an opcode for the RISC CPU would be corrupted and an illegal opcode would result. The manual for the CPU indicated that the behavior of the processor was “Undefined” when presented with that opcode.
    Things that made it hard to get a handle on the problem:
    1) It was very hard to reproduce the behavior at will.
    2) The follow-on crashes were a second or third order consequence of the bus controller error and it caused us to search for a software error that did not exist.
    3) Caching was required for proper system operation. This made analysis of the memory fetches difficult to correlate with what the SW was doing. Turning off the cache also turned off the problem.

    The observed behavior was that the CPU was resetting. Using GDB we could see what looked like impossible program flow. (eg hyperspace jump from the middle of one function into another.) At first and second glance it looked a lot like something was corrupting the stack.

    It took months to figure out that what looked like stack corruption was actually a subtle and intermittent bus glitch. Ultimately it was found by poring over traces of memory fetches from a logic analyzer.

  57. J.A. @ LI says:

    Nope. We tried that first. We determined early on that it was not an endian issue. Straight bytes, words, double words, and longs on byte boundaries converted directly without issue. Something about bit mapping that way, MSVC++ had issues with and always scrambled the data. I was able to figure out the rules and re-engineer all of the data structures to compile and work in the MSVC++ universe. However, the applications programmer felt that his way was “better”. I just love it when an applications programmer with 8 years experience tells a systems programmer with 30 years under his belt “You don’t know what you’re talking about.” It always ends the same. With excessively large, and cumbersome applications code.

  58. R.A. @ LI says:

    J., sorry I misread your post, I had thought you had said that all types were of different endianess.

    Yes, the C standard does not define how bits are mapped into bit fields. This is why most ‘C’ programming books recommend not using bit fields ( http://www.cs.cf.ac.uk/Dave/C/node13.html#SECTION001322000000000000000 ).

  59. D.C. @ LI says:

    James, sometimes there’s a choice between writing compact but non-portable code, and less compact code that is portable to any standard-conforming compiler. When developing critical systems we generally prefer the portable code. The C standard states explicitly (in Appendix F.3.9) that the order of allocation of bit fields within a unit is implementation defined, so the way MSVC behaved in your example was not incorrect.

  60. S.B. @ LI says:

    The worst example of this that I have hit was writing some ISDN code using the Solaris compiler – we had “some” bitfields, (just a few 100), as defined by the specifications and testing on the development system said all was ok – we had even declared them all twice so as to allow for endian issues – but when we came to testing with an external system nothing worked. Tried re-building with the use other endian flag set – still no good. Eventually realised that given an 8 bit bitfield type sizeof() returned 64. The compiler was using 1 64 bit word for each bit in the bitfield. Also found that there were loads of padding being added internal to the structures and that there were piles of compiler options that varied these items. In the end we switched to gcc with a few simple options set and the problem, (and several others), was resolved.

    The morals – a) compilers that produce the fastest executables are often the least useful for interacting with external systems, b) CHECK all assumptions.
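
    Some layout assumptions can be checked by the compiler itself. A minimal sketch, assuming a C11 compiler and a made-up 4-byte header (the struct and its fields are hypothetical, not the ISDN structures from the story):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-byte wire header, for illustration only. */
struct wire_header {
    uint8_t msg_type;
    uint8_t flags;
    uint8_t length[2];   /* kept as raw bytes so no padding is needed */
};

/* The build fails if the compiler lays the struct out differently
 * from what the wire format requires. */
_Static_assert(sizeof(struct wire_header) == 4,
               "wire_header must be exactly 4 bytes");
_Static_assert(offsetof(struct wire_header, length) == 2,
               "length must start at byte 2");
```

    A check like this would not have fixed the bit-field layout, but it would have flagged the unexpected sizeof() at build time, before any testing against the external system.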

  61. K.B. @ LI says:

    My best one involves this line of code: int seconds_since_midnight; I worked with a group of early risers. They came in every morning and found it hung but blamed it on the hardware. I was an intern and couldn’t convince anyone that I knew what I was talking about… in staying late to do my “grunt work” I had seen that it hung at exactly the same time every day. I had to offer to buy the software engineer dinner to get him to look at it. Luckily when he found the bug he turned around and paid.

  62. P.F. @ LI says:

    James said “Something about bit mapping that way, MSVC++ had issues with and always scrambled the data. I was able to figure out the rules and re-engineer all of the data structures to compile and work in the MSVC++”.
    I’ve had the same issue with LabVIEW talking over a TCP/IP socket to a micro. It seems LV treats a float like int64, and does a hton, causing fun byte swapping. I had assumed IEEE 754 clearly defined byte order….
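
    IEEE 754 defines the bit layout of a float but not the order of its bytes in memory or on a wire, so the usual defence is to pick a byte order explicitly at the protocol boundary. A sketch, assuming a 32-bit IEEE 754 float and big-endian on the wire (function names are hypothetical):

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "sketch assumes a 32-bit float");

/* Store a float as 4 bytes, most significant byte first, regardless of
 * the byte order of the machine running this code. */
void put_float_be(uint8_t out[4], float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret without aliasing issues */
    out[0] = (uint8_t)(bits >> 24);
    out[1] = (uint8_t)(bits >> 16);
    out[2] = (uint8_t)(bits >> 8);
    out[3] = (uint8_t)bits;
}

/* Inverse of put_float_be. */
float get_float_be(const uint8_t in[4])
{
    uint32_t bits = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                    ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```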
    What a pain!

  63. J.A. @ LI says:

    David, quite true, however in the case mentioned, I did manage, through analysis of the data on both ends of the chain, to ascertain the rules used by MSVC++ and make the appropriate changes to the data structures involved. It was somewhat less pretty, but nonetheless would have worked significantly faster. So, instead of being able to process megabytes of data per second, his application chugs along at kilobytes per minute. And he wonders where the data bottleneck is, as the input data backs up to the point where the application crashes, overloaded. I handed him the answer, he didn’t want to hear it as his “Portable” answer was the best approach in his mindset. Some people just don’t get the fact that portable code is not always the fastest running code, no matter how heavily you optimize it.
    Another moral – c) Where speed is required, portable should not be.

  64. R.H. @ LI says:

    Once there was this very large telecom provider who thought that they could code better than anyone else, and thus they redesigned the daylight savings time algorithm in (then) Unix SysV.2. It ran happily for several months, so it got deployed a whole country over.

    And so it happened that we got an angry telephone call at Monday morning with a complaint that several hundred telephone exchanges still had their clock somewhere between 2 and 3 AM, which was ‘of course’ the fault of Unix if not the supplier, and that we were liable for the damages being: all telephone calls in a whole country for the duration of the problem (wrong bill due to wrong timestamp? don’t have to pay it!).

    With much management attention, we started investigating the root cause. It then quickly appeared that their own redesigned DST algorithm was at fault. It was simply coded like:

    if (last sunday in october) and (time == 03:00 AM) then time = 02:00 AM;

    But the programmer forgot that one hour later it was yet again 03:00AM….
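
    The missing piece is some memory of having already fallen back. A minimal sketch of that idea in C; the struct and function are hypothetical, and a real system would more likely keep UTC internally and convert only for display:

```c
#include <stdbool.h>

typedef struct {
    int  hour;       /* local wall-clock hour, 0..23 */
    bool fell_back;  /* true once the autumn adjustment has been applied */
} local_clock;

/* Call whenever the hour changes. The clock is set back from 03:00 to
 * 02:00 only the first time 03:00 is reached on the changeover day. */
void apply_autumn_fallback(local_clock *clk, bool is_changeover_day)
{
    if (is_changeover_day && clk->hour == 3 && !clk->fell_back) {
        clk->hour = 2;
        clk->fell_back = true;
    }
}
```

    The caller has to clear fell_back once the changeover day is over so that the adjustment works again the following year.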

  65. J.M. @ LI says:

    Back in the dark ages I was doing firmware development on a new cpu/motherboard combo. Two hardware engineers had each designed half of the motherboard – one the CPU related and the other the I/O portion.

    When they initially powered up the motherboard, it would execute a few instructions then go off to left field. They naturally pushed us software people off to the side and got the scopes out and started trouble shooting.

    I was bored and decided to try to understand the hardware better – got the data sheets out and started tracing signals. After a few hours looking at the schematic, I asked the simple question – “is reset active high or active low?”.

    Sure enough, half the design assumed it was active high and the other half active low. As soon as the firmware tried to start initializing the I/O portion by enabling the hardware, it would reset the CPU and stop itself from running.

    Some red faces on that one and a board spin.

  66. R.A. @ LI says:

    J. A. said: “handed him the answer, he didn’t want to hear it as his “Portable” answer was the best approach in his mindset. Some people just don’t get the fact that portable code is not always the fastest running code, no matter how heavily you optimize it.
    Another moral – c) Where speed is required, portable should not be.”

    But it seems that you neglected the portable and efficient approach which is simply to use unsigned storage types, and to use bit-shift macros to test and set bits (with a couple of “#ifdefs” you can make this completely portable).
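
    A sketch of what that could look like for the 48-bit header in the earlier example. It assumes the fields sit LSB first with byte 0 least significant, which is only a guess at the real wire format. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Extract `width` bits starting at bit `shift` from a 64-bit value. */
#define FIELD(value, shift, width) \
    (((value) >> (shift)) & ((UINT64_C(1) << (width)) - 1))

typedef struct {
    uint32_t sub_1;   /* bits 0-23  */
    uint32_t sub_2;   /* bits 24-46 */
    uint8_t  filler;  /* bit 47     */
} rx_item_fields;

/* Assemble 6 bytes into one value, byte 0 least significant. This is
 * done with shifts, so it does not depend on the host's endianness. */
static uint64_t load_le48(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 5; i >= 0; --i) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Unpack the three bit fields from the first 6 bytes of an item. */
rx_item_fields unpack_rx_item(const uint8_t *bytes)
{
    uint64_t v = load_le48(bytes);
    rx_item_fields f;
    f.sub_1  = (uint32_t)FIELD(v, 0, 24);
    f.sub_2  = (uint32_t)FIELD(v, 24, 23);
    f.filler = (uint8_t)FIELD(v, 47, 1);
    return f;
}
```

    This costs a handful of shifts and masks per item, far less than pulling bits out one at a time.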

  67. M.B. @ LI says:

    Part 1: Let me first state that bugs in compilers and standard libraries are rare. Very rare. So rare that no one ever expects to find one. Which makes it all the harder when there is one, because you just assume (rightly so usually) that it can’t be a bug in the compiler / standard library / binutils, etc.

    So finding this one was quite an adventure. The background, the project I’ve been working on decided to upgrade Linux distributions. Since we currently keep the development machines at the same OS version as the target this meant we all had to upgrade too. Which required upgrading our version of binutils (cross-compiled for the target), which exposed compile time bugs in the version of newlib we were using, so we ended up upgrading newlib too. Ok, so new OS, new binutils, new cross-compile of gcc, new version of newlib, but other than that everything is identical!

    And it all compiled and ran. Almost…

    A week later someone noticed that a data calculation was fluctuating. So I dug into it. And dug and dug, and discovered that the standard C library function floor() appeared to be returning 0 once in a while. Hmm, it can’t be the library, this function has been around for ages! More debug code later and I confirm that floor() sometimes returns the number not rounded down, sometimes returns 0, and sometimes works. And for those multithreaded programmers out there, the floor() function is thread safe, and in fact was only ever called by one thread anyway (This was what I looked at first.).

    Ok, newlib is open source, so I can compare versions. Easier said than done, newlib is a huge project. Eventually I find the floor() function between versions, and they are identical! Yep, floor() didn’t change at all! So I have no idea what’s going on now. So I fall back to experimenting with the floor function, let’s try putting in various floats and doubles and see what happens. And I discover that printf() crashes hard on the “%f” format. Another clue!

  68. M.B. @ LI says:

    Part 2: The debugger in use never displayed double precision numbers correctly, so now since I can’t see the return values from printf() and I can’t see them with the debugger, and I know that floor() garbles the data, I break down and dump the raw memory for each float and double and start examining the bits and make an interesting discovery. Turns out that for this target, there can be two endian settings for double precision numbers. Note that an IEEE double is 8 bytes. The target’s native int size is 4 bytes, little endian, but if you are working with a double, how are the two native 4 byte chunks ordered? Newlib actually handles both ways, via a compile / preprocessor setting. And guess what changed between versions of newlib? I traced down actual code differences in preprocessor settings related to endian, and I had it! A recent change to newlib reworked the endian preprocessor settings. Our target is native little endian, but uses big endian ordering of the 4 byte chunks within a double precision number. There was a mismatch between how the math library expected doubles to be formatted and how they actually were. So the floor() function was treating the 7-15 decimal places of a number as the whole part of the number, which doesn’t work so well.

    And everything appeared to almost work because floats work fine, being mapped to a 4 byte int which is natively handled by the library and processor, and all of the program variables are floats not doubles. It’s only when you call one of the standard C library math functions that you see problems, because just about all of these require double arguments and return doubles, so C promotes the floats to doubles and back again.
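
    To see how much damage a swapped word order does, here is a small host-side sketch. It assumes a 64-bit IEEE 754 double that shares its byte order with uint64_t on the machine running it, and the function names are hypothetical. Swapping the halves of 1.0, for example, leaves a tiny subnormal value.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t),
               "sketch assumes a 64-bit double");

/* Split a double into its high word (sign, exponent, top of the
 * mantissa) and its low word (rest of the mantissa). */
void double_to_words(double d, uint32_t *high, uint32_t *low)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    *high = (uint32_t)(bits >> 32);
    *low  = (uint32_t)(bits & 0xFFFFFFFFu);
}

/* Return the value a library would see if it read the two 32-bit
 * halves of d in the opposite order. */
double swap_double_words(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits = (bits << 32) | (bits >> 32);
    memcpy(&d, &bits, sizeof d);
    return d;
}
```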

    So a patch to the newlib endian settings file, recompile the libraries, recompile the project and try it out, and it’s fixed!

    So why didn’t anyone see this right away? Because newlib is an embedded library. This bug exists only in one specific version of the library, against a specific processor architecture only. The user base is small, and in embedded development upgrades to tools are not the norm making the user base that would stumble across this even smaller. To newlib’s credit, they did find this before I did: http://old.nabble.com/-PATCH,-ARMEL,-FPA–revert-arm-ieee-word-endian.patch—correct-newlib-1.18.0-soft-FP-breakage-td28902801.html and even had a fix to the top of the tree code in place.

  69. R.M.N. @ LI says:

    Enjoyed all the “war” stories. Great Lessons. I especially loved
    1. @M. M. • “don’t lie to yourself/customer”. …
    2. @ K. N. • “Stare at the code some more, DimWit!”
    3. @ A. M. • about RFI
    4. @ J. N. … stepping through in the debugger …

    I posted the following to my blog http://processshepherd.blogspot.com/ :

    “This happened almost two decades ago. It was my first startup. We had built a radar Moving Target Detector using four DSP processors, plus another DSP as a controller. I had taken it from proof of concept, to field trial, to ruggedized prototype. Over that time the team size had grown from one (me) to three.

    The ruggedized prototype was now to be put through acceptance checks. We gave a final check and then packed it for transporting to the customer’s site. When we assembled it, and put it into operation, the output would not settle down as it was meant to. That got me into a flap. I was sure that during transportation some damage had been done to the hardware. I had visions of doing a messy hardware debug on the customer’s premises. Luckily, I decided to sleep on the problem.

    The next day I asked one of my engineers if anything had been done to the code that was in the EEPROMs. He said that there was a long sequence of code which did nothing but write zeroes into the RAM. He had taken that out. The data memory was no longer initialized.

    So was my engineer at fault? No, I was. This was code that I had written almost at the start of the project. The purpose of the code was clear to me. But I was no longer “in contact” with the code. And I had failed to document it.

    I have since then, been rather fanatical about documenting the intent of any block of code. Documenting the WHAT [is intended to be achieved], and the WHY [it is important that it be achieved]. The HOW is not that important; the code should say that – unless of course the code is convoluted. But then one should not be writing convoluted code in the first place:-)

    This incident led me to formulate my extension to Murphy’s Law:

    If anything can go wrong, it will – and it will happen in the presence of the customer!”

  70. M.P. @ LI says:

    My story is too long to copy here, so here’s a link to my blog and the bottom line:

    http://blog.sofistes.net/2010/06/extreme-debugging.html

    “Moral of the story? Debugging is labour but sometimes also instinct. Work only gets you so far.”

  71. M.K. @ LI says:

    More years ago than I care to remember; I was writing a library to read and write some datafiles according to a customer spec.
    The library was working fine and the customer was happy, until we passed 1st August. All data recorded from that day on was inaccessible. The debugging was postponed for a couple of months. From 1st of October it was working perfectly.

    Eventually it appeared I had used %i in a sscanf to parse the month number. The only problem was that the month number was padded with leading zeroes to two digits, causing sscanf to assume the number was octal. 01, 02, 03…07 were OK, 08 and 09 are not valid octal numbers, and 10 is parsed as a decimal number. I really learned a lesson on testing….
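
    The difference is easy to reproduce. In this sketch (function names hypothetical), %i lets sscanf infer the base from the prefix, so a leading 0 means octal, while %d always reads decimal:

```c
#include <stdio.h>

/* Parse with %i: the base is inferred from the prefix, so "08" is
 * treated as octal and the conversion stops at the 8. */
int parse_month_i(const char *text)
{
    int month = -1;
    if (sscanf(text, "%i", &month) != 1) {
        return -1;
    }
    return month;
}

/* Parse with %d: always decimal, leading zeroes are harmless. */
int parse_month_d(const char *text)
{
    int month = -1;
    if (sscanf(text, "%d", &month) != 1) {
        return -1;
    }
    return month;
}
```

    Running both over "01" through "12" shows the %i version going wrong for exactly two months.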

  72. J.S. @ LI says:

    My favorites are from people who do not understand the operating system they are programming for. Two incidents from my consulting days that I remember well.

    1) Customer needed to make decisions on machine operation (running on a priority based real-time operating system) based on state of DIO. They wrote a program which checked the DIO at 25 ms intervals and then “signaled” (actually, an OS based mechanism which was implemented as a counter and received as a message) another program when particular DIO values changed. Since the programmer knew the initial state, any signal that was received was interpreted as a change in actual value and a decision made. The problem – the decision was made on the basis of the signal being received rather than checking the current state of the DIO. The customer had not adjusted the priority of their program properly, so it would (occasionally) be interrupted for seconds at a time. During that time, several changes could occur to the same DIO port, multiple “signals” queued indicating change of port state, however when the program woke up it only received the first signal and made a decision based on what the programmer THOUGHT the current value of the DIO was rather than the actual value at that time. Resulted in some “unusual” (and often damaging) behavior of the equipment. Bad design in the first place (check actual state rather than assuming you did not miss a signal indicating state change causing your own internal record of state to be off), however they also failed to understand the underlying OS signalling mechanism and how it worked (plus the scheduling mechanism).

    2) Customer wrote a program which controlled a motor. Again, customer was running on a real-time, priority driven OS, however one that defaulted to a decay based scheduling. Customer did not realize this, so the motor would work fine for 5 minutes, then shut down for 5 minutes as control program dropped in priority, then work again, etc. A failure to understand the underlying OS and how it worked (although it is a valid argument that the OS should not have defaulted to a decay based scheduling, it was also well documented that it did along with recommendations to change this scheduling for any program needing real-time scheduling).
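
    For the first incident, the safer pattern is to treat a signal only as a hint that something changed and to re-read the port before deciding. A minimal sketch with hypothetical names, assuming the caller reads the port when the signal arrives and passes the value in:

```c
#include <stdint.h>

typedef struct {
    uint8_t last_state;  /* port value seen when the previous signal was handled */
} dio_monitor;

/* Handle a "something changed" signal. port_now is the value the caller
 * just read from the DIO port. Returns a mask of the bits that really
 * differ from the last known state, which may be zero if several
 * changes cancelled out while the program was not running. */
uint8_t dio_on_signal(dio_monitor *mon, uint8_t port_now)
{
    uint8_t changed = (uint8_t)(port_now ^ mon->last_state);
    mon->last_state = port_now;
    return changed;
}
```

    Because the decision comes from the value actually read, missed or coalesced signals can delay a reaction but cannot leave the program's idea of the port state out of step with the hardware.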

  73. My favorite debugging anecdote is:

    If you’ve looked in an area of code for a bug for more than 24 hours and haven’t found it, look where you least expect it.

  74. R.A. @ LI says:

    L.R. “D” R. said: “If you’ve looked in an area of code for a bug for more than 24 hours and haven’t found it, look where you least expect it.”

    That’s more of a rule-of-thumb or technique than it is an anecdote, but it is a good rule-of-thumb.

    If we’re talking rules-of-thumb, then I would add this:

    If you can’t figure out why the code doesn’t work, grab a colleague (who knows nothing about the code in question) and explain to them why it *does* work.

    I have about a 98% success rate at finding bugs this way (and a bunch of bemused colleagues who listen to me babble on for a few minutes and then conclude with “…wait a minute… oh wow… thanks for your help” without them ever having uttered a word :-) .

  75. N.M. @ LI says:

    +1 for R.’s excellent rule-of-thumb. This is exactly why Sherlock Holmes needed Dr. Watson :)

  76. R.S.K. @ LI says:

    +1 for R.. I have had innumerable occasions when this has worked for me.

  77. J.N. @ LI says:

    @R.: Yeah, I had a friend in college who would smile and nod. Best darned debugging help I’ve ever had. :)

  78. Heck, no +1′s for me. The anecdotes are essentially the same!

  79. blues says:

    I once had a failing hard drive in my computer at home – at least I thought so at first. After having multiple but not so common issues, like a loss of the Partition Table, I found out it was the CD Burner Drive on the same IDE channel causing the hard drive failures! Even though the CD Burner Drive did work, I only recognized it was defective when I tried to burn a CD one day….
