Technical Notes: (Finally) Sprite Batching in our Graphics Pipeline

The really big update in the current version of The Last Federation is a new sprite batching system, and this is something that is going to be making its way into our other games soon, too. This is a performance improvement that I have been putting off since 2010, and arguably since 2002. I first ran into this sort of batching problem in 2002, and never did get around to solving it. Then in 2008 I switched to using Direct3DX, which had sprite batching built in, so that made it moot. Then in 2010 we switched to Unity, and lost that capability again. Since that time I’ve made about every sort of graphics pipeline improvement under the sun except to actually do this one.

I’m not actually that lazy, but it’s a very difficult problem in our sort of pipeline. It’s hard to explain why exactly, but it has to do with the way that we queue things for rendering, the way we load things off disk, and so forth. It also has to do with the way we use the depth buffer and orthographic cameras in some limited cases (those relating to things in an isometric perspective, which thankfully doesn’t apply to TLF but does to Spectral Empire and Skyward Collapse and I believe the overworld of Valley 2).

Avoiding Transient Memory Allocation

One of the chief problems (among many others) that I was facing was trying to figure out how to do sprite batching without having to constantly reallocate arrays, given that the number of sprites that go into an array can constantly change. Something Keith said in an email last week gave me kind of a lightbulb moment, and I realized that I could have pools and pools and pools of stuff. We do a lot of pooling in general, but generally we don’t have pooled objects that reference other pooled objects that reference other pooled objects. But here that’s exactly what we’re doing, and it keeps the RAM usage incredibly tight and efficient. It also avoids hitting the garbage collector, which impacts performance, and which was one of my biggest worries.
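To make the "pools of pools" idea a bit more concrete, here is a minimal sketch in Python. It is not our engine code: the names (Pool, SpriteBatch, begin_batch, and so on) and the fixed 1024-float buffer size are assumptions made purely for illustration. The point is only the shape of it: a pooled batch object holds a pooled buffer, and both go back to their pools at the end of the frame instead of being reallocated.

```python
class Pool:
    """Hands out reusable objects instead of allocating new ones every frame."""

    def __init__(self, factory):
        self._factory = factory
        self._free = []
        self.created = 0  # how many objects were ever actually allocated

    def acquire(self):
        if self._free:
            return self._free.pop()
        self.created += 1
        return self._factory()

    def release(self, obj):
        self._free.append(obj)


class SpriteBatch:
    """A pooled batch that itself references a pooled vertex buffer."""

    def __init__(self):
        self.key = None
        self.buffer = None
        self.count = 0  # number of floats in use this frame


# Hypothetical fixed-capacity buffers; a real pipeline would need a policy
# for sprites beyond the capacity (for example, chaining another pooled buffer).
BUFFER_POOL = Pool(lambda: [0.0] * 1024)
BATCH_POOL = Pool(SpriteBatch)


def begin_batch(key):
    batch = BATCH_POOL.acquire()
    batch.key = key
    batch.buffer = BUFFER_POOL.acquire()
    batch.count = 0  # old contents are simply overwritten
    return batch


def add_sprite(batch, x, y):
    batch.buffer[batch.count] = x
    batch.buffer[batch.count + 1] = y
    batch.count += 2


def end_batch(batch):
    BUFFER_POOL.release(batch.buffer)
    batch.buffer = None
    BATCH_POOL.release(batch)
```

After the first frame warms the pools up, every later frame reuses the same batch and the same buffer, so the number of sprites can change freely without any array being reallocated.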

In other words, part of why I put this off for so long was that I felt like I could get some performance gains out of this, but that I’d also be making some tradeoffs in order to do so. I’d be basically trading some transient memory allocation and CPU processing for less data passing between the CPU and the GPU. The latter is an important goal, but the former is a dangerous thing to play with. So with Shattered Haven I used RenderTextures to get around the GPU limitations; with AI War Keith coded a proximity-based CPU-side combiner for far zoom icons (and something similar is used on the solar map with fleets in TLF); and with Valley 1 and 2 and Bionic Dues we came up with ways of combining tiles and using texture UVs to do repeating patterns across smaller numbers of vertices, which in turn reduced the draw calls.

So we made do, in other words, because I was unable to think of a solution to the transient memory allocation problem. Well, that and some other things.

Matrix Math

One of the other things is that we use Matrix4x4 transformations for scale, rotation, and positioning. Moving scale and positioning out of that and into our own code is simple enough, really. But moving rotation out of that and into our own code in an efficient way that would not bog down the CPU was no small task. We were going to have to give up some precision for that, do a lot of caching, and so forth. Keith spent a goodly bit of time last week working that out, and got it fixed up.

And then a funny thing happened yesterday: I realized that I could still use the Matrix4x4 math anyhow, and that we didn’t need to do any of our own custom code there at all. So it literally looks and works like it always did, because we didn’t wind up needing to use the reinvented wheel that we made for that. I hate reinventing wheels, but I was unaware of a few things regarding matrices and Vector3s. Keith’s work was not in vain, however, because his implementation had yet more ideas that I cribbed and used to make other parts of the pipeline code more efficient. So despite the fact that that code didn’t wind up being used directly, it still had an impact on making the pipeline as fast as possible.
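For anyone who hasn't played with this kind of math, here is a rough Python sketch of what "scale, rotation, and positioning in one matrix" means for a single sprite quad. It uses a small 3x3 2D matrix as a stand-in rather than Unity's Matrix4x4, and the function names are made up for this example, so treat it as an illustration of the idea and not as what our pipeline literally does.

```python
import math

# Corners of a 1x1 quad centered on the origin, counter-clockwise.
UNIT_QUAD = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]


def trs_matrix(tx, ty, angle_deg, sx, sy):
    """Row-major 3x3 affine matrix equal to translate * rotate * scale."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [
        [c * sx, -s * sy, tx],
        [s * sx,  c * sy, ty],
        [0.0,     0.0,    1.0],
    ]


def transform_point(m, x, y):
    """Apply the affine matrix to the point (x, y)."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def quad_corners(tx, ty, angle_deg, sx, sy):
    """Scale, rotate, then position the unit quad; returns its four corners."""
    m = trs_matrix(tx, ty, angle_deg, sx, sy)
    return [transform_point(m, x, y) for x, y in UNIT_QUAD]
```

Calling quad_corners(10, 20, 90, 2, 4) scales the quad to 2 by 4, turns it a quarter turn, and moves it to (10, 20). The one matrix does all three jobs, which is why being able to keep the built-in matrix math was such a relief.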

What Sort Of Benefits Are There?

In Spectral Empire, there is an orthographic view of a hex map where you see countryside, buildings, etc. The number of draw calls this can cause reaches into the thousands if you zoom out much. On the NVIDIA GeForce GTX 650 Ti in my main desktop, when I was all the way zoomed out prior to these updates, I could only get 27fps. And that was actually clamping the zoom a lot tighter than I ultimately wanted it to be. On the same scene, all the way zoomed out with the new updates, I now get about 200fps.

In The Last Federation, on my same card, I get around 800-1000fps when all the way zoomed out during a large battle with the bullet-crazy Obscura ships that are coming in the expansion for the game. Those guys can fill a battlefield with literally thousands of shots on the screen at once, and performance understandably suffered previously. I didn’t remember to check exactly what it was before the shift, but something sub-30fps is a pretty good bet.

Even in older versions of TLF, where bullets were not so plentiful, there were some folks on older graphics cards (in particular laptop ones) that could get bogged down during really heavy fighting and see their fps drop to the 10ish range. That’s super frustrating, even considering that’s on a computer that’s 5+ years old (I think one was actually more than 10 years old). Still, even my 4 year old MacBook Pro was dropping into the 40s during heavy fighting before, and I wanted that to stick at a solid 60fps minimum.

I can’t vouch for what will happen with all lower-end machines in terms of the improvements seen here, but I would expect that 30fps ought to be maintainable at the very least, and it’s possible that 60fps can be maintained even during heavy fighting.

Why Are The Benefits So High?

Basically this lets us just use one draw call for a given texture/shader combination (or if it’s a hue-shifting shader, then texture/shader/hue combination), rather than one draw call per sprite. This means that, depending on the scene, you can see an order-of-magnitude improvement. “Draw calls” are expensive in CPU/GPU time, because they have to pull a texture out of RAM (across that bus), push it to the GPU (across that bus), and then send some vertex data (which is very tiny by comparison). Then the next time there is a draw call, it does the same thing. There are other implications as well, a lot of which vary by your platform and how the shaders compile there. But at any rate, the rule of thumb is “Draw Calls == Slow.”

When there are a ton of shots on the screen, there are usually not more than maybe 10ish actual distinct graphics. Even if it looks like more, usually a given shot type has a dictionary of sprites, meaning that the number of textures is still like 10 for all the shots. But if you’re seeing 2000 shots, that would be 2000 draw calls without batching. Sloooow.

With batching, it only matters how many TYPES of shots you have. If you’re using 10 types of shots, it doesn’t matter if you have 10, 200, 2000, or 5000 shots on-screen: there are still only 10 draw calls. (Actually I simplified that a bit for the sake of brevity, but it’s not far off.) The faces, vertices, and colors arrays (this is the per-sprite data) are really small (6 floats, 9 floats, and 12 floats per sprite, respectively). At some point if we start sending 5000 shots to the screen we start hitting a different problem, namely that of GPU fill rates, and a couple of other possible slowdowns that are intra-GPU. But it’s a much less likely problem to hit than the bus problems, because even mobile GPUs are built for pushing out WAY more pixels and vertices than we remotely come close to.
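Here is a simplified Python sketch of that grouping, just to show why the draw call count follows the number of shot types rather than the number of shots. The sprite dictionaries, the build_batches name, and the four-corners-per-sprite layout are assumptions of this example and not our actual data layout; for a hue-shifting shader the hue would presumably be added to the key as well.

```python
from collections import defaultdict

# Two triangles per quad, expressed as offsets into the quad's four corners.
QUAD_TRIANGLES = (0, 1, 2, 0, 2, 3)


def build_batches(sprites):
    """Merge all sprites sharing a (texture, shader) pair into one mesh each."""
    batches = defaultdict(lambda: {"vertices": [], "indices": []})
    for sprite in sprites:
        mesh = batches[(sprite["texture"], sprite["shader"])]
        base = len(mesh["vertices"])  # index of this quad's first corner
        mesh["vertices"].extend(sprite["corners"])
        mesh["indices"].extend(base + i for i in QUAD_TRIANGLES)
    return dict(batches)


def make_shots(count, texture_count):
    """Hypothetical test data: 'count' shots spread over 'texture_count' textures."""
    shots = []
    for i in range(count):
        x = float(i)
        shots.append({
            "texture": "shot%d.png" % (i % texture_count),
            "shader": "basic",
            "corners": [(x, 0.0), (x + 1.0, 0.0), (x + 1.0, 1.0), (x, 1.0)],
        })
    return shots


batches = build_batches(make_shots(2000, 10))
draw_calls = len(batches)  # one draw call per (texture, shader) mesh
```

Feeding it 200 shots or 5000 shots with the same 10 textures changes the size of each mesh, but not the number of meshes, and the number of meshes is what becomes the number of draw calls.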

Anyway, so the difference in performance is something that is hard to quantify. In high-load cases it’s orders of magnitude faster. In low-load cases, it’s about the same speed as before (which is very fast). Which is good news, because I worried that in those low-load cases we’d be actually getting slightly slower with an approach like this. That worry was another barrier to my doing this over the past years. But not so! It’s pretty awesome, and I am super stoked to have this new piece of our pipeline in place after so long.

An Extra Challenge: Orthographic Views

The deal with orthographic views is that you have to be able to sort them from front-to-back in terms of tiles. You’re giving a fake sense of distance. So this means that some tiles of Grass Texture show up further away, and some show up closer. Meanwhile, some textures of Mountain show up very far away, some closer, some even closer. Aka, some of the tiles of Grass are behind those of Mountain, and some are in front of it.

With an orthographic camera, this isn’t too hard for us to handle, even with delayed-write draw calls (as opposed to single-thread immediate-mode draw calls, which we used to use heavily but no longer rely on for most of our stuff as of a year or so ago — though we do use a hybrid direct and batched system for our pipeline). Anyway, we’re able to just set z positions on sprites in an orthographic camera, and the stuff with the lower z position renders first. Easy, right?

Well… it turns out that this is really only per vertex batch. With our new sprite batching, we have a bunch of vertices that are defining textured quads (aka square sprites), and these are all at different z depths. No problem whatsoever when showing Grass relative to other Grass. It works just like it always has.

But wait… when you show Grass relative to Mountain in an orthographic camera that is using a false 2D orthographic perspective… you wind up with the entire batch of vertices being drawn as a whole, and thus you wind up with “Z fighting” issues. It’s a familiar problem to 3D programmers, and not one that I had properly considered prior to doing the sprite batching. Basically, you need to use the z buffer in order to make sure that things further back don’t draw on top of things that are closer forward, since each mesh (collection of vertices and such associated with a texture) is drawn sequentially.

That sounds fine, but normally we don’t write to the z buffer or do z checks, because we’re using 2D and the order of draws just overdraws each pixel and it’s fine. That works particularly well for 2D because if you have something with partly-transparent edges that is closer forward, it can blend well with the stuff behind it. The z buffer is an all-or-nothing thing per pixel. That means that if a partly-transparent pixel from something closer-in draws, then whatever would have shown through that partial transparency from further back will NOT draw, if its overall mesh is drawn second (but it will if it is drawn first). Hence: z-fighting.

Anyway, the overall solution was to create a new shader that I called Depth, which is a variant of our normal basic shader, but with z writing and z testing turned on. That has to be used for any layers using a fake 2D orthographic perspective, but is not required on any other layers. To go along with this, any sprites being drawn with the Depth shader automatically skip rendering any pixels that have less than 50% opacity on them. That keeps it so that the edges are sharper (unfortunately, but a minor thing especially when already layering tiles), but also prevents largely-transparent pixels from causing strange black lines and creases between tiles. Worse, those black lines and creases would flicker thanks to z-fighting.
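To see why the 50% cutoff matters, here is a toy Python model of a single pixel being drawn with z testing and z writing turned on. This is not the Depth shader itself. The draw_pixel function, the grayscale "shade" values, and the assumption that surviving pixels are still alpha-blended are all simplifications invented for this sketch.

```python
def draw_pixel(fragments, alpha_cutoff=0.5, background=0.0):
    """Resolve one pixel with z testing and z writing.

    fragments: (depth, alpha, shade) tuples in draw order, where a smaller
    depth is closer to the camera and shade is a 0..1 grayscale value.
    """
    nearest = float("inf")  # freshly cleared depth buffer
    value = background      # 0.0 is black
    for depth, alpha, shade in fragments:
        if alpha < alpha_cutoff:
            continue  # skipped entirely: writes neither color nor depth
        if depth < nearest:  # z test
            nearest = depth  # z write
            value = alpha * shade + (1.0 - alpha) * value  # alpha blend
    return value


# A mostly-transparent grass edge pixel in front of an opaque mountain pixel.
grass_edge = (1.0, 0.2, 0.5)
mountain = (2.0, 1.0, 1.0)

# Without the cutoff, the result depends on which mesh happens to draw first.
edge_first = draw_pixel([grass_edge, mountain], alpha_cutoff=0.0)
mountain_first = draw_pixel([mountain, grass_edge], alpha_cutoff=0.0)

# With the cutoff, both draw orders give the same pixel.
fixed_a = draw_pixel([grass_edge, mountain])
fixed_b = draw_pixel([mountain, grass_edge])
```

With no cutoff, drawing the grass edge first leaves a nearly black pixel, because the faint edge claims the depth buffer and the mountain behind it is rejected. With the cutoff, that faint edge pixel is skipped and the mountain shows through regardless of draw order, at the cost of a harder edge.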

Oy. Thankfully, even with the games that use a 2D orthographic perspective, anything that isn’t in the layer of sprites that is orthographic is free to continue ignoring the z buffer, and all the wonderful blending effects can remain just as they always have. That’s important for things like special blend modes (additive blending, multiplicative, etc), which rely on combining sprites with partial transparency. So the need for the Depth shader remains thankfully quite limited, as it has some slight drawbacks that would frustrate certain kinds of art (particle effects, for instance).

Conclusion

I do wish that I had done this years ago, but honestly at the same time I am glad that I never implemented this the wrong way, because that would have been a far worse problem than not doing it at all. And because we had to be creative for all those years and work around not having sprite batching, we actually came up with a number of OTHER performance improvements which still are part of our engine, and which help us squeak out even more performance than we could get if we were using sprite batching alone. So that’s a pretty big silver lining.

As it was, even once I had the epiphanies on how to handle this, it took me the bulk of four or five days to actually fully figure out all the details and get them implemented. That’s an unusually large chunk of engine work time by Arcen’s standards at this point in the life of our engine, but it was definitely worth it. At the time of this writing, The Last Federation and Spectral Empire now have this fully up and running, but I’ll be porting it to some of our other titles soon. Mainly AI War and probably Bionic Dues. Honestly our other titles are already so GPU-efficient using other methods that I don’t think they really need it.

Thanks for reading, and if you’re another indie developer I hope this gives you some ideas for potential solutions to your own sprite batching problems if you’re using a custom engine.
