First, a bit of scheduling news: the monthly update of Clockwork Empires, revision 40, will be released next week. Due to E3 and a giant Steam sale there’s not a lot of point in releasing anything this week.
Last time I wrote a blog post, I had written half of the Clockwork Empires codebase in order to fix various issues with autosaves being reported by users. Naturally, these things never happen when I am playing the game; instead, we got DMP files and sorted through them. (I wrote a little in house tool to process these for me; now I can run a batch file and it runs all the DMP files we get through a little script which automatically outputs their call stacks and other pertinent information.)
To try and fix these issues, I rewrote the entire jobs system in order to move everything into nice, flat arrays, and to get rid of direct references to pointers. My original hope was that this would fix the save game crashes we were experiencing, which were due to various instances of dead pointers to jobs being left in Lua space. Well, it fixed *some* of those crashes. Once Revision 39A came out, we got a bunch more crash reports: various people’s assignments were crashing, due to assignments being held in Lua after they were dead. Well, it wasn’t part of my original plan, but they got the same treatment – assignment handles are now first class citizens, and we can now actually tell if an assignment handle exists or not, and is deleted or not. Starting from Revision 39B, all the workshop assignments were also moved over to a new system where they were controlled by C++, and we were able to remove huge chunks of broken Lua code. Now the workshops were starting to feel good, in a way that they hadn’t felt since… well, since we first put them in – but yet, we still had save game crashes.
What’s even weirder, of course, is that the save game crashes seem to be saving a building footprint – the little data structure full of points that is used to regenerate the shape of workshops and houses when you draw them. What the heck was going on?
So at first, I couldn’t figure out what was going on. I could draw buildings, save buildings and load buildings just fine, and nobody could reproduce the crash. Revision 39C, therefore, was spent ignoring the autosave crash. What changed is that somebody sent me a replay for something *else* – some sort of stockpile related issue, I believe. The replay system is still not perfectly deterministic – sometimes, somehow, determinism will fail, and the replay doesn’t match up perfectly on my machine. We’re still sorting this out, and this is one of the major stumbling blocks towards getting multiplayer in. I got it on my computer, I left for lunch while it replayed 100,000 frames of Clockwork Empires save game… and what happens when I come back from lunch? Well, wouldn’t you know it – it had crashed saving the game. Furthermore, it had crashed saving a building footprint.
Finally, I thought to myself, a repro case. Quickly pulling up the watch window in Microsoft Visual Studio, which is used to inspect the state of variables in the program when debugging, I saw the problem: the array of points thought it had, uh, 2,576,320 points in it. Yikes.
First order of business, see if it was reproducible. Quit the replay, run the replay again. Fortunately, it showed up after only about 20,000 frames of the game. Unfortunately, it only showed up sometimes, which puts it in the category of “heisenbug” – a technical term which refers to a bug which stops appearing when you’re looking for it. When you get one of these, you cannot change anything else in the program, or you will lose any hope you have of ever seeing it again – you are now stuck, working on this problem, until you solve it (or are a mewling, quivering wreck.) A bit more inspection revealed that the pointer at the *start* of the array of points (pointing to the first element in the array) was set to NULL, meaning that something *else* had trashed it. That something else could be anywhere else in the program.
At this point, I tried my luck re-running the application with AppVerifier enabled to see if it would turn up anything interesting. Sometimes you get lucky; AppVerifier lets you set guard pages around heap allocations, and occasionally, the thing that is misbehaving will write to one of those guard pages and it will instantly throw an exception at the point in your program which is misbehaving. So, it’s worth a shot – but no dice.
After gently poking the code some more, and sort of randomly guessing at things that might be broken, I decided there was really only one option left. I wrote a little C++ function called ValidateBuildings(). What it does, when you call it, is it runs through every building in the game to try and make sure that its Points() array is valid. If it’s not valid (if the size of the points array is over, say, 1000 points), we throw a breakpoint. So, I put it in, and… ugh, the program starts working again and not crashing.
At this point, you just curse and swear a bit, and hope that it starts crashing again. Sometimes, the nature of computer memory allocation is such that a random heap corrupting thing will just start writing into different areas of program memory that do not, obviously, affect the state of the program. Fortunately, after a few more run it started crashing again, and I started sprinkling the code with the ValidateBuilding() all over the place. The first place I put it was in the message broadcasting loop, which sends the Update messages to everything in the world that is supposed to be updated. Sure enough, it caught the problem, after a citizen updated. I ran the code again, and it caught the problem after a module updated. Conclusion: it’s not the gameplay thread causing this bug, it’s something on the renderer side.
I sprinkled the renderer full of ValidateBuildings() calls, and none of it really seemed to narrow it down much. The renderer and the game are both large, large functions, and it’s hard to figure out what goes where, when and how. Finally, I decided to try something different – I started dumping consoles from various runs that crashed or ended in a crash, trying to figure out what the last thing was before ValidateBuildings() failed.
It seemed like the culprit, every time, was when a workshop pickled Black Fungus. My first thought was one of horror – was it the new workshop code? Drat! But, no, it was something on the workshops. I grabbed the script, and started trying to figure out what calls to the renderer were being made from the workshop script object. The culprit? The blasted Machine Expression message had a use-after-deletion in it; a rare condition, but one that could happen if the renderer was running at certain speeds, or under certain conditions. That bug had probably been in there for ten versions of Clockwork Empires, and only now did it get caught. A one line fix, and the rendBuildingFootprint crash went away (and, consequently, the crash when autosaving which started all of this went away.)
Total time for this one line fix: I started this Friday, and finished it early Sunday morning – so, several hours.
What ended up being the key to tracking it? All three debug files, working in tandem. We used the DMP to figure out that the building footprints were crashing. Somebody gave me a replay for another, unrelated thing, which just happened to crash in the right place (the same place the game was crashing for everybody else, but not me); the console was what eventually got used to track down the exact point in the renderer that the game crashed, once I had a nice function added to the renderer to automatically detect when the save game would crash. It took all three of these little pieces of debug information to finally figure it out.
As a consequence of all this rewriting, most of the work I’ve done on this month’s patch is invisible. Revision 39E generated a bug report from one of our users, which was that “In limited testing, I’ve discovered that I have to quit and save when it is time for me to head to bed rather than being able to rely on a crash to break me free of the game.” This is a good sign – when users start complaining about the game being too stable AND keeping them up all night, you know you’re doing something right!
Otherwise, I’ve been dealing with small niceties – making sure various controls work correctly, making the module and building creator feel and look better, fixing text rendering issues, that sort of thing. Most of what I have done in this patch isn’t actually on my sprint, so I guess it’s been one of those months, but the game is much more stable and feels much more polished as a result – hopefully folks will like it.