CPU and GPU Run-time Profiling in Simulation Land

[Warning: This is a very technical post.]

It’s not easy writing a complicated simulation. There are lots of complex, interconnected moving parts to worry about and when something goes wrong, it can be hard to figure out where that small, broken thing is in the midst of a larger picture. We recently discovered that the game was running… unreasonably slowly, shall we say? My desktop was getting 15 FPS, and it wasn’t clear what the problem was. On Chris Whitman’s machine, which uses a slightly different graphics card than my computer, our FPS was in single digits. I don’t like optimizing too early – as Donald Knuth pointed out famously, “premature optimization is the root of all evil” – but something was going on. Finally, sick and tired of the problems, I decided to get some answers.

There were three solutions for profiling that we looked at: Intel’s VTune, what we might call a ‘classical’ profiler which you can download a 30-day trial of from their website, and Telemetry, a different sort of profiler made by RAD Game Tools (specifically by Brian Hook, who you might know from such games as Quake 3.) RAD Game Tools also provided us with a 30-day trial of Telemetry, and this gave me an interesting opportunity to compare two profilers. Finally, we tried our luck with NVidia’s GPU Perfstudio to see if we could figure out what was happening on the graphics card.

Three profilers. One slow down. Who cracked the mystery? Find out below.

These actually have nothing to do with what Nicholas is doing but It Was Decided that the post needed some more visual accompaniment. Think of it as a tenuous thematic connection.

Intel’s VTune operates from Microsoft Visual C++ 2012. I have worked with previous versions of VTune, most notably back in 2001 when I was trying (and failing) to optimize The Lithtech Engine. VTune integrates nicely into Visual C++, and you run the profiler from within the application. For the most part, VTune failed to tell me anything interesting at all. It may have been doing useful things with the data, but there was no real way to narrow down what in the world was actually going on. Also, it crashed a lot.

Telemetry takes a different approach. Unlike VTune, which tries to run on a codebase without markup, Telemetry asks you to manually label your C++ codebase with little snippets of code to indicate what different parts of your codebase do in context: “drawing skeletal animation”, “shaving the yak”, “flushing the pnorb”, etc. Integrating the library into our codebase turned out to be pretty easy, taking about half an hour of work to get good profiler coverage. Running your program with Telemetry causes it to output data to a stand-alone server application, which whirs away in the background recording what your program is doing.

You then open your recorded session in a standalone application called the Telemetry Visualizer, which is basically the coolest thing ever. Each frame is cleverly marked up with all the information you specified in your C++ code, and you can keep diving through a problem area until you hit the bottom level.

Because individual frames are annotated with what Telemetry calls “zones”, you can hierarchically profile your code. You can look at a frame and see that a certain function (say, “rendering all the widgets”) takes a certain amount of milliseconds in a given frame; you can then refine that by adding more C++ markup to narrow in on what, exactly, was causing the problem. In this way, we made the following exciting discoveries:

  • Certain parts of the rendering pipeline really, really sucked. In particular, we were calling malloc() at one key point, when we really, really shouldn’t be. This was removed.
  • All the things that we thought were slow, really weren’t. We had been operating on a sort of … well, implicit assumption… from day one that the skeletal animation skinning was being a huge bottleneck. (This is because it’s been a bottleneck on every other game I’ve worked on.) We still have to move it to the GPU at a certain point, because when you put 100 characters on screen at once it’s still a bottleneck, but it wasn’t causing our problems.
  • Rendering our post-processing effects was taking 50% of our frame rate. This got cleaned up, and everything ran faster.

The most interesting finding, however, was that our frame rate was highly intermittent. Two consecutive frames were having radically different frame rates, but the content being drawn was identical. Slowdowns were appearing in different functions, for different reasons, and seemingly at random. What the heck was going on?

Shamelessly taken from the RAD Game Tools Website because, well, it shows all the cool markup including the bits we don’t use.

After some thought, I decided whatever was going on wasn’t going on in our code, but somewhere in the graphics card driver. Clockwork Empires uses a renderer that is heavily designed towards processing batches, but it was looking increasingly like something we were doing was occasionally causing the GPU to stall in random areas to catch up. We use OpenGL, and at the time Telemetry didn’t support OpenGL profiling (although I understand this is in the works.) To get answers, I turned to NVidia’s GPU PerfStudio.

Rather annoyingly, GPU PerfStudio is an integrated plugin for Microsoft Visual C++ – but for Microsoft Visual C++ 2010. It does NOT support MSVC 2012, which is what we use (due to support for C++11), and so running our code through GPU Perfstudio required me to launch MSVC 2012, compile Clockwork Empires, and then run it from MSVC 2010. Cumbersome. People, make your tools stand-alone!

What GPU PerfStudio mainly does is tells you where the graphics card and graphics drivers are spending all their time. (I believe Telemetry can do this for Direct3D, and now supports GPU timings in OpenGL as of last October.) This struck paydirt fairly quickly: in order to debug rendering issues efficiently, I had stuck little functions throughout the renderer to call glGetError() and report if, in fact, I had screwed up. Being an idiot, however, these a) ran every frame, and b) were always enabled. Turn them off, and the GPU stops stalling all over the place trying to check if anything has gone wrong. Success.

GPU PerfStudio also revealed that certain functions were spending too much time doing render state switches – turning blending on and off, changing textures, that sort of thing. Via some clever rearrangement of code and a handful of macro functions, we were able to clean this up quickly. It also turns out that our UI rendering code can be quite slow, especially for complicated scenes; fixing this is a TODO to be taken care of later in the development process.

It is worth noting that sorting out the cause of our lag required multiple strategies. GPU PerfStudio told us how much time we were spending in error-testing functions, but Telemetry’s Visualizer told us that we needed to be looking at the GPU in the first place, and also let us confirm easily that certain things were, and were not, bottlenecks without us wasting our time optimizing something that we only thought was slow.

It is really nice when you find a new piece of software that you can add to your toolkit. We’ve had similar experiences with a few other products – 3D-Coat has massively improved productivity on the artist side, and Visual Assist X has been a favourite of mine for years. After begging and pleading with Daniel, I am pleased to say that Telemetry gets added to this list. I haven’t even figured out all the things it can do yet, and I haven’t started profiling our game thread with it yet, but I’m pretty confident that it will be able to take care of our few remaining mysterious frame rate drops. Since we can also mark up memory allocations, we can also use it to deal with the Horrible Memory Leak that is plaguing us this week. End result? Less time spent tracking mystery bugs, and more time spent coming up with hidden fun stuff and entertaining new features for everybody.

Disclaimer: Brian Hook gave me some bourbon once. It is 62.35% alcohol by volume. I’m still working on it.

Posted in Clockwork Empires, Programming | Tagged , , , , , , , , , , ,
19 Comments

19 Responses to “CPU and GPU Run-time Profiling in Simulation Land”

  1. Headjack says:

    Just kick the flywheel, it’s arright.

    { reply }
  2. Oscar says:

    Being a specialist on high computing architectures, I can definitely assert that big performance problems are always due to small errors and assumptions.

    { reply }
  3. Zentay says:

    So how many fps were you getting after this?

    { reply }
  4. Kamisma says:

    Please allow me to disable post-processing. In that case, my old laptop is the bottleneck :)

    Oh and memory leaks are evil, purge them with holy fire !

    { reply }
    • Seriously Mike says:

      Ha, forget it. It Was Decided that the minimum requirements for this game will include a four-core processor. Because next year, everyone will have a four-core processor.

      The thing is that at least on my laptop, I don’t.

      { reply }
  5. Wootah says:

    How big a shot of the bourbon does it take to hit the balmer peak? http://xkcd.com/323/

    :)

    OH and very interesting. I love reading this stuff.

    { reply }
  6. Dtolman says:

    As a sustaining engineer (professional bug fixer), I found this post fascinating! Thanks for posting this inside look!

    { reply }
  7. kikito says:

    Have you tried turning it off and on again?

    I’m a programmer, and this still gets me. Nice writeup about performance.

    { reply }
  8. Kazeto says:

    I have to say, I like the occasional change from in-game-content posts to ones about the technical side.

    And I actually started reminiscing about my long-past time as a student after reading about redundant malloc() calls. Heh, the times when we were still young and learning…

    { reply }
  9. Ivan says:

    I love those technical articles! Keep ‘em coming!

    { reply }
  10. Cuthalion says:

    This was interesting to read. I’m still a bit of a newbie myself, but in my own game programming thus far I’ve found NetBean’s built in… thing… to be helpful in finding out what is and isn’t a bottleneck. One day, I suspect I’ll end up checking out tools like Telemetry, so I’m grateful for the post!

    { reply }
  11. Bropocalypse says:

    It’s no matter that this post was extremely technical and light on humor- it’s still nice to be able to view whatever progress you guys make.

    { reply }
  12. Alan says:

    Excellent post. If we ask all sweet-like, can you dig up an intern to try these very useful-sounding tools on Dredmor? :) It always seemed a little pokey to me, relative to its (perceived, probably not actual) complexity.

    { reply }
  13. Callan McInally says:

    GPU PerfStudio is actually the name for AMD’s GPU Profiling tool – not Nvidia’s. It is a standalone tool. The Nvidia tool you used was probably Parallel Nsight. I would suggest you take a look at the AMD tool also.

    { reply }
  14. Disclaimer: I work for Intel. Also, it’s been years since I used Vtune or GPA but thought I’d chime in anyway :-)

    1) Sounds like you were just using time-based sampling, but you can also call APIs w/in your code to start/stop profiling, or can use event counters, etc.

    2) You should have a look at the Graphics Performance Analyzer (GPA), if you are interested in GPU analysis.

    { reply }
  15. Pingback: Dev Links: Heart Made Of Stone | The Indie Game Magazine - Indie Game Reviews, Previews, News & Downloads

  16. Pingback: Dev Links: Heart Made Of Stone « DIYGamer

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>