19 votes

Debugging Folklore

10 comments

  1. [4]
    deadbeef

    I'm not sure if this is interesting to anyone, but I have a debugging story to share. I probably made it way too long. Sorry!

    Some years ago, I worked as an applications engineer for a microcontroller (MCU) manufacturer. Part of my job was doing technical support, which 99% of the time boiled down to figuring out what part of the documentation the customer had missed or misunderstood. But every now and then, we'd get cases that indicated actual hardware problems, i.e., either a design flaw or a QA issue with our MCU.

    One such case came from a manufacturer of electronically controlled heating lamps (possibly propane powered -- not an ideal situation), who reported that some percentage of the MCUs in their product would randomly run at 1/8 of the configured speed when started up. They noticed this because their product took 8 times as long to boot up, LEDs were blinking slower, etc. This was extremely curious, since the system clock of these MCUs was configured at production programming time and there was no changing that configuration at runtime. Changing the configuration required an external tool, and the change would be permanent, not sporadic as this customer saw. What could be going on here?

    We already knew that software couldn't change the system clock, so clearly, it was a hardware problem. One possible culprit could be that the system voltage in their design was not within specifications, i.e., it did not rise at an appropriate rate at power up, or it was not stable enough, which generally causes undefined behavior in electronics. However, the MCUs had integrated brown-out detection and power-on reset circuitry, precisely to mitigate against these sorts of problems. Perhaps there was a weakness in our design?

    Suspected failures with the MCUs themselves were always initially investigated by the team that did the verification and characterization of the MCUs, as well as defining and implementing the production tests. Customers would send us a few specimens of the failing product, and this team would try to reproduce the issue and determine if the failure was in any way "our" fault, i.e., due to design flaws, production issues or lacking documentation. But.. This team couldn't find any problems with the hardware -- not in the customer's design and not in the MCU. So, it appeared to be a software issue after all. Great! And so the case ended up on my desk.

    The team had been able to reproduce the issue, though, and made some observations which were helpful to me:

    1. the failing specimens had to be cooled down for the failures to occur
    2. serial communications actually ran at the normal speed

    The latter essentially confirmed that the system clock was running at the correct speed, configured during programming of the MCU. And the temperature dependency was an indication that uninitialized memory was the culprit. But it wasn't clear where things could be going wrong. The customer had sent us their full source code, and it seemed ridiculously complex for such a simple product; basically a heating lamp with timer and adjustable power. It used a homegrown OS, with several concurrent tasks and timers. Perhaps the problem was in their task scheduler?

    At this point, a month had already passed since the customer sent us devices for failure analysis. The customer was getting rather irate, demanding that we "fix our MCUs". They were not pleased when told that we couldn't find any problems with the hardware..

    I started by digging into their application code, but after a day of looking through their ~30k LOC, I couldn't identify any cases of uninitialized memory being used. And so, I decided to move on to on-chip debugging (OCD) in the hopes of uncovering the cause, the idea being that I could inspect the MCU's runtime configuration and behavior in detail when this issue arose. But OCD entails connecting an external tool to the MCU, which changes its operational parameters, plus enabling OCD in the MCU, which changes certain aspects of its behavior. There was a chance that all this would prevent the issue from arising at all.

    As mentioned, the MCU had to be cooled down for the issue to arise, so it made for a time-consuming issue to reproduce. I stuck everything in a climate cabinet, with only power and the debugger's USB cables coming out, powered everything off and gave the MCU two minutes to cool down, before turning things back on in the correct order for OCD to work. I then spent nearly a day trying to reproduce the issue. Of course, the issue never, ever arose in this setup.. sigh

    So, it seemed like the way forward was to make minor changes to the customer's code, program it to the MCU, then run it without any of the regular debugging capabilities. Of course, I needed a way to get information out of the MCU in a non-invasive way. For that purpose, I could bit bang on one of the unused general purpose input/output (GPIO) pins of the MCU. But what to investigate next?

    One hunch I now had was that the hardware timer used for their OS' task scheduler was not triggering as frequently as it should. So I added code to toggle my bit bang pin (high<->low) every time that interrupt handler was entered. It was a dead end -- the interrupt timing was fine. But this led me to another idea: might other interrupts be triggering and keeping the CPU busy? For example, an interrupt triggered every time one of the buttons was pressed and caused one of the GPIO pins to change state. Could something be causing that interrupt to trigger constantly? It turned out that this was not the cause, either. I quickly found that none of the interrupts enabled in their code was triggering.

    That's when I got my eureka moment: the factor-of-8 slowdown made perfect sense if the MCU was constantly jumping into and returning from the default interrupt handler (one which does nothing). So, I made all the interrupt handlers bit bang individual numbers on my debugging pin to identify which one it was. And voilà, I found it! The "EEPROM ready" interrupt was constantly triggering. That interrupt should only trigger when enabled, and the on-chip non-volatile memory is ready to be written to. But the code was in fact explicitly disabling all EEPROM-related interrupts. So why did it trigger??

    At this point, I started looking into the disassembly of the code which should be disabling the EEPROM interrupts. The failure was basically in a single line of code which should write the value 0 to the EEPROM control register (EECR). The code compiled to do this essentially copied the value from a specific CPU register, let's say R0, into EECR. But I did not see R0 getting set to 0 before this. How strange! How could this ever work? I did one more experiment, and my suspicion was confirmed: sometimes, the bit which enabled the "EEPROM ready" interrupt would be set to 1 in R0 before its value was copied into EECR.

    I now realized the problem was somehow related to the toolchain, so I dug into its documentation. That's when I learned about the initialization routines that are supposed to run before the application: R0 should have been initialized to 0 by that code. The toolchain expected it to be so and relied on it for optimization. Oh my! Getting close now!

    The final piece of the puzzle was something I've not told you about yet, because I hadn't given it much thought myself before this point in the story: the customer's bootloader, which could be entered by pressing a specific button on the device at power-up. The code which detected the button press acted as a separate program, which the customer invoked in such a way that it executed before the initialization code. So when did the initialization code actually run? Finally, we get to the root cause: the customer had linked their code in such a way that the initialization code only ran if the bootloader was entered! If the MCU started right into the application, no initialization was done. The content of CPU registers such as R0 would basically be random when it entered the application. That's a sure-fire way to get undefined behavior!

    I sent my report to the irate customer and offered to help them fix their project, but I never got a response back. I presume they quickly got too busy putting their bootloader to good use.

    10 votes
    1. [3]
      joplin

      Name checks out!

      This was a great story! Thank you for sharing. I'm curious about this detail:

      And the temperature dependency was an indication that uninitialized memory was the culprit.

      Can you explain why that's the case? I mostly do regular app programming on desktops, and uninitialized memory there is usually the fault of yours truly. I've not seen a case like that.

      1 vote
      1. [2]
        deadbeef

        Thank you! And good question -- that sentence certainly warrants some explanation.

        First of all, I should have written "was possibly the culprit" since several characteristics of electronic circuits will be temperature dependent. And I'm not a physicist or chip designer, but AFAIK, the main mechanism is temperature's effect on resistance and capacitance. I'm sure some other tilderino can explain it better than me.

        But which characteristics will change with temperature? The most relevant in this case would be:

        • oscillator frequencies -- clocks will drift from the nominal
        • voltage references -- detection thresholds* for digital levels (0/1) will change
        • charge leakage -- f.ex. memory can keep its level for longer at low temperatures
        • electrical noise -- lower temperature means less noise, which is usually good

        The customer wasn't really using any peripherals/functionality of the MCU that we could in any way see causing the issue at lower temperatures. The initial failure analysis team had already ruled out the main oscillator, which was driving the system clock, as a possible cause. So, in our experience, there simply weren't that many culprits left to choose from.

        Memory that is uninitialized after power-on doesn't necessarily contain a random value. You should rather consider it an unknown value. And it can be affected by bringing the temperature up or down before turning on the power. I believe that's mainly due to the detection thresholds for digital levels varying with temperature, but it's likely also in combination with f.ex. the change in charge leakage.

        Also.. There is one detail that I forgot to mention, which reinforced my suspicion that memory was involved: I actually had to leave the device powered down for a minute or two in order to reproduce the issue in the climate cabinet. It seemed more likely to me that this was due to residual charges having to leak/disperse rather than the device temperature having to sink all that much.

        *) Digital input/sensing circuits are generally specified with two voltage thresholds:

        1. VIH, above which they are guaranteed to detect a signal as logical 1
        2. VIL, below which they are guaranteed to detect a signal as logical 0

        The range between VIL and VIH is the undefined range, where you don't really know what level your signal will be detected as. But there's a threshold somewhere in that range where the sensed value switches. That's the one I meant above. The sensing circuit can also have a hysteresis in there, but that's not really relevant in this context. :-)

        3 votes
        1. joplin

          Thanks for the explanation! It's been many years since I worked directly with hardware. Fascinating stuff!

          1 vote
  2. [6]
    joplin

    I really like debugging stories. This is a collection of some of the more well-known stories. Some are apocryphal, but many are real. Some hilarious, some downright frightening.

    4 votes
    1. [2]
      gpl

      Thanks for sharing this. I've always loved these types of stories but could never quite describe what I meant; folklore really is the perfect word. Internet folklore.

      5 votes
    2. [3]
      dblohm7

      There are a few interesting ones over on Reddit too (shameless plug: mine is one of them).

      5 votes
      1. joplin

        OMG! This one is killing me. The video of the NPC climbing up an invisible ladder in the background is hilarious.

        4 votes
      2. joplin

        Thanks! Checking it out now!

        1 vote