Saturday, October 15, 2016

A memory corruption bug as a result of a 'single character'

It has been quite a few months since I last read or wrote any serious algorithmic code in C++ (read: computing surfaces or doing numerical computation). So when a weird bug was reported for a program I work with, I thought it would be fun. The bug was in a particular feature of the program that plots a surface based on some input parameters. It was strange because it made the program crash on Windows, whereas on Linux the surface was plotted correctly.

Good then, let us fire up gdb and figure out where it was faulting. But for some reason, on the build system I was using for Windows, I couldn't get gdb to work properly. So I was left with the old way: read the code and put in a lot of printf() calls. Now since this was Windows, printf() wouldn't work either! The program did, however, have a logging API that would log text to a window. The problem was that no text appeared in the window (or the window was not refreshed) just before the program crashed. So the next technique was to combine logging with commenting out one line at a time and returning prematurely from the affected function, which is usually very tricky to do when there are nested loops. And this was exactly the case here. Finally, the real culprit turned out to be one line that read:

if (k+l < numberOfXPoints) {
 ....
 x[i][j][k+l] = ...
 ..
}

There you have it. The code ought to be:

if (k+l < numberOfZPoints) {
 ....
 x[i][j][k+l] = ...
 ..
}

Apparently this error had never been observed, and looking back at the history of the source code, I found that it had been this way ever since it was written, a couple of years ago! The error probably didn't produce any visibly wrong output, and it apparently went past all the test cases as well, because in none of them did the number of points along the z-direction exceed the number of points along the x-direction.

Two things to learn:
1) Never name two variables such that they differ in only one character. If you really need to, then ideally put the differing character towards the beginning of the name. In the case above, better names would have been xPts and zPts.
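2) If an error occurs on one platform but not on another, a memory corruption issue is a very likely suspect, because an out-of-bounds write may land somewhere harmless on one platform and somewhere fatal on another. Hunt down all the code dealing with arrays.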