I’m currently writing a software renderer in C for a game, using the X11 library as a back end. At the moment, it manages to create a window and render a scrolling pattern into it. Not too exciting, but even with a small amount of code, I’ve learned first-hand how easy it is to introduce pointer aliasing into C code, and how damaging it can be to performance.
The code to render the pattern was fairly trivial:
static void update_window(
    Window window, GC gc,
    int width, int height,
    int xoffset, int yoffset)
{
    if (!pixels)
        resize_ximage(width, height);

    uint32_t *pixel = (uint32_t*)pixels;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint8_t blue = (x + xoffset);
            uint8_t green = (y + yoffset);
            uint8_t red = 0;
            uint8_t alpha = 255;
            *pixel++ = (alpha << 24) | (red << 16) | (green << 8) | blue;
        }
    }

    XPutImage(
        display, window,
        gc, &ximage,
        0, 0,
        0, 0,
        width, height);
}
It first allocates a back buffer if it hasn’t done so already, then loops through the pixels, assigning a colour to each one based on its XY position and the XY offset passed in as arguments. The XPutImage function then copies the data from the back buffer to the window’s display buffer.
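The per-pixel colour computation can be sketched in isolation as follows. The helper names pack_argb and pattern_pixel are hypothetical and not part of the renderer. The sketch casts each channel to uint32_t before shifting; in the loop above, alpha is promoted to int first, and 255 << 24 does not fit in a 32-bit signed int, so the cast is a slightly safer spelling of the same packing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: packs four 8-bit channels into one 32-bit pixel,
 * using the same alpha/red/green/blue layout as the loop above. */
uint32_t pack_argb(uint8_t alpha, uint8_t red, uint8_t green, uint8_t blue)
{
    return ((uint32_t)alpha << 24) | ((uint32_t)red << 16) |
           ((uint32_t)green << 8) | (uint32_t)blue;
}

/* Hypothetical helper: colour of the pixel at (x, y) for a given scroll
 * offset. Truncating to uint8_t makes the pattern repeat every 256 pixels. */
uint32_t pattern_pixel(int x, int y, int xoffset, int yoffset)
{
    uint8_t blue = (uint8_t)(x + xoffset);
    uint8_t green = (uint8_t)(y + yoffset);
    return pack_argb(255, 0, green, blue);
}

/* Hypothetical self-check that can be called from main. */
void demo_pattern(void)
{
    assert(pattern_pixel(1, 2, 0, 0) == 0xFF000201u);
    assert(pattern_pixel(255, 0, 1, 0) == 0xFF000000u); /* blue wraps to 0 */
}
```

Changing xoffset and yoffset from frame to frame is what makes the pattern scroll.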
As you can probably tell, this function not only has a large number of input parameters, but also modifies a lot of global state (pixels, display, and ximage). I’m not proud of this code and was hesitant to post it, but hopefully you’ll appreciate that it was a rapid prototype and can look past it for now.
To clean things up, I decided to gather all of the global state and window data into a global x11_context struct that would eventually be passed in as a parameter.
struct x11_context context;

static void update_window(int xoffset, int yoffset)
{
    if (!context.backbuffer.pixels)
        resize_ximage(context.width, context.height);

    uint32_t *pixel = (uint32_t*)context.backbuffer.pixels;
    for (int y = 0; y < context.height; ++y) {
        for (int x = 0; x < context.width; ++x) {
            uint8_t blue = (x + xoffset);
            uint8_t green = (y + yoffset);
            uint8_t red = 0;
            uint8_t alpha = 255;
            *pixel++ = (alpha << 24) | (red << 16) | (green << 8) | blue;
        }
    }

    XPutImage(
        context.display, context.window,
        context.gc, &context.ximage,
        0, 0,
        0, 0,
        context.width, context.height);
}
After this refactor, I was surprised to see a 35% increase in frame time compared with the old code (4.1ms → 5.7ms). I immediately suspected I’d introduced some form of pointer aliasing into the loop—but was confused, as there was only a single pointer dereference in that part of the code.
Pointer Aliasing
Experienced C and C++ programmers will already be familiar with pointer aliasing and how it affects compiler optimisations, but it’s worth reviewing here, if only to cement it in my mind.
A good, if contrived, example appears on the Wikipedia page for the restrict keyword:
void updatePtrs(size_t *ptrA, size_t *ptrB, size_t *val)
{
    *ptrA += *val;
    *ptrB += *val;
}
The values pointed to by ptrA and ptrB are updated using the value pointed to by val. An optimising compiler would love to load *val once and keep it in a register rather than loading it twice. But val and ptrA could theoretically point to the same memory, so the compiler can’t assume they’re independent. It’s forced to reload *val for the second addition, just in case the write through ptrA modified it.
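To see why the reload is required rather than merely cautious, here is a small sketch that calls the function once with disjoint arguments and once with val pointing at the same object as ptrA. The driver demo_aliasing is hypothetical; updatePtrs is copied from above.

```c
#include <assert.h>
#include <stddef.h>

void updatePtrs(size_t *ptrA, size_t *ptrB, size_t *val)
{
    *ptrA += *val;
    *ptrB += *val;
}

/* Hypothetical driver: shows that the result depends on whether val
 * aliases ptrA, so the compiler cannot reuse the first load of *val. */
void demo_aliasing(void)
{
    size_t a = 1, b = 10, v = 1;
    updatePtrs(&a, &b, &v);   /* val is a separate object */
    assert(a == 2 && b == 11);

    size_t c = 1, d = 10;
    updatePtrs(&c, &d, &c);   /* val aliases ptrA */
    assert(c == 2);           /* c += c */
    assert(d == 12);          /* d += the updated c, not the original 1 */
}
```

If the compiler had kept the first *val in a register, the aliased call would produce d == 11 instead of 12, which would change the behaviour of a valid program.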
One fix is to use the restrict keyword, which tells the compiler that the pointers reference disjoint memory regions. The responsibility is on the programmer to ensure this:
void updatePtrs(
    size_t *restrict ptrA,
    size_t *restrict ptrB,
    size_t *restrict val);
We could also manually cache the value ourselves. However, this doesn’t make the aliasing contract explicit:
void updatePtrs(size_t *ptrA, size_t *ptrB, size_t *val)
{
    size_t value = *val;
    *ptrA += value;
    *ptrB += value;
}
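The two fixes can be put side by side in a compilable sketch. The names updatePtrsRestrict and updatePtrsCached are hypothetical, chosen only so both variants can live in one file. For disjoint arguments they behave identically; the difference is in what happens when the arguments do alias.

```c
#include <assert.h>
#include <stddef.h>

/* restrict variant: the caller promises the three pointers never alias.
 * Passing overlapping pointers here is undefined behaviour. */
void updatePtrsRestrict(size_t *restrict ptrA,
                        size_t *restrict ptrB,
                        size_t *restrict val)
{
    *ptrA += *val;
    *ptrB += *val;
}

/* Manual caching variant: aliasing is still allowed, but the value of
 * *val is read exactly once, before either store. */
void updatePtrsCached(size_t *ptrA, size_t *ptrB, size_t *val)
{
    size_t value = *val;
    *ptrA += value;
    *ptrB += value;
}

/* Hypothetical driver that can be called from main. */
void demo_variants(void)
{
    size_t a = 1, b = 10, v = 5;
    updatePtrsRestrict(&a, &b, &v);   /* disjoint, as promised */
    assert(a == 6 && b == 15);

    size_t c = 1, d = 10;
    updatePtrsCached(&c, &d, &v);     /* same result on disjoint input */
    assert(c == 6 && d == 15);

    size_t e = 1, f = 10;
    updatePtrsCached(&e, &f, &e);     /* aliased call is well-defined... */
    assert(e == 2 && f == 11);        /* ...but uses the original value 1 */
}
```

Note that the cached variant silently changes the result of an aliased call (f ends up as 11 rather than 12 with the uncached version), which is one reason the compiler cannot apply this transformation on its own.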
perf to the Rescue
Back to the update_window function. There’s only one pointer being dereferenced in the loop, so where’s the aliasing? I cracked out the perf profiler to investigate the hotspots in the assembly.
│ for (int x = 0; x < context.width; ++x) {
0.04 │ add $0x1,%edx
│ *pixel++ = (alpha << 24) | (red << 16) | (green << 8) | blue;
33.32 │ or %edi,%eax
0.02 │ add $0x1,%ecx
31.39 │ mov %eax,-0x4(%rsi)
│ for (int x = 0; x < context.width; ++x) {
2.85 │ mov context+0xe4,%eax
31.84 │ cmp %eax,%edx
│ ↑ jl 3b8
0.37 │ mov context+0xe8,%edx
perf showed that 95% of samples landed in the region around the inner loop [1]. The interesting part here is mov context+0xe4,%eax, which loads context.width into eax on every iteration, suggesting the compiler doesn’t cache it in a register. Why?
Because the compiler can’t assume that pixels doesn’t point somewhere within context, including context.width. Every store through pixel might therefore have changed width, so it must be reloaded on each iteration. When width and height were passed directly as function parameters, they lived in the stack frame, their addresses were never taken, and so they were safe from aliasing by pixels.
This highlighted that pointer aliasing doesn’t just occur between pointers. It can also happen between a pointer and any data that isn’t local to the current stack frame, such as a global.
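The following sketch reproduces the situation without X11. The struct demo_context, the global demo_ctx, and fill_row are hypothetical stand-ins for the real code. It assumes uint32_t is unsigned int, as on typical desktop platforms, in which case C's aliasing rules allow a uint32_t lvalue to access an int object. That is what makes the worst case legal, and it is why the compiler has to reload the width after every store.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the global x11_context. */
struct demo_context {
    int width;
    int height;
};

struct demo_context demo_ctx;

/* Writes zero pixels until x reaches demo_ctx.width and returns how many
 * were written. The loop bound is read from the global on every iteration. */
int fill_row(uint32_t *pixel)
{
    int written = 0;
    for (int x = 0; x < demo_ctx.width; ++x) {
        *pixel++ = 0;
        ++written;
    }
    return written;
}

/* Hypothetical driver that can be called from main. */
void demo_global_aliasing(void)
{
    uint32_t buffer[8];

    demo_ctx.width = 8;
    assert(fill_row(buffer) == 8);   /* ordinary buffer: full row written */

    /* Assumption: uint32_t is unsigned int, so this access is permitted.
     * The first store sets width to 0, which ends the loop immediately. */
    demo_ctx.width = 8;
    assert(fill_row((uint32_t *)&demo_ctx.width) == 1);
    assert(demo_ctx.width == 0);
}
```

Nobody would write the second call on purpose, but because it is valid C, the optimiser has to generate code that handles it correctly.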
The Solution
The fix was simple: cache width and height in local variables before entering the loop.
/* x11_context struct now passed in as a parameter as well */
static void update_window(
    struct x11_context *context,
    int xoffset, int yoffset)
{
    /* Local copies: no store through pixel can modify these,
       so the compiler is free to keep them in registers. */
    int width = context->width;
    int height = context->height;

    if (!context->backbuffer.pixels)
        resize_ximage(width, height);

    uint32_t *pixel = (uint32_t*)context->backbuffer.pixels;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint8_t blue = (x + xoffset);
            uint8_t green = (y + yoffset);
            uint8_t red = 0;
            uint8_t alpha = 255;
            *pixel++ = (alpha << 24) | (red << 16) | (green << 8) | blue;
        }
    }

    XPutImage(
        context->display, context->window,
        context->gc, &context->ximage,
        0, 0,
        0, 0,
        width, height);
}
After this minor change, performance matched the original version. perf no longer reported any significant hotspots, and the compiler could hoist the loads out of the loop and even unroll it, since width and height were now locals that nothing in the loop could modify.
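For anyone who wants to inspect the generated code without an X server, the fixed loop can be reduced to a standalone sketch like the one below. The struct demo_context and the functions render_pattern and demo_render are hypothetical and only mirror the fields the loop needs; this is not the renderer's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for x11_context with only the fields the loop uses. */
struct demo_context {
    int width;
    int height;
    uint32_t *pixels;
};

void render_pattern(struct demo_context *context, int xoffset, int yoffset)
{
    /* Local copies: no store through pixel can change the loop bounds. */
    int width = context->width;
    int height = context->height;

    uint32_t *pixel = context->pixels;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint8_t blue = (uint8_t)(x + xoffset);
            uint8_t green = (uint8_t)(y + yoffset);
            /* Opaque alpha, zero red, as in the original pattern. */
            *pixel++ = 0xFF000000u | ((uint32_t)green << 8) | blue;
        }
    }
}

/* Hypothetical driver that can be called from main. */
void demo_render(void)
{
    uint32_t buffer[4 * 2] = {0};
    struct demo_context context = { 4, 2, buffer };

    render_pattern(&context, 0, 0);
    assert(buffer[0] == 0xFF000000u);   /* x = 0, y = 0 */
    assert(buffer[3] == 0xFF000003u);   /* x = 3, y = 0 */
    assert(buffer[4] == 0xFF000100u);   /* x = 0, y = 1 */
}
```

Building this with optimisations enabled and comparing the assembly against a variant that reads context->width in the loop condition should show whether the load is hoisted on a given compiler.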
Conclusion
This post demonstrates how even minor refactors can have drastic effects on performance. Pointer aliasing issues aren’t always obvious, especially when using globals. Even if you don’t explicitly dereference memory, the compiler still has to account for potential aliasing.
It’s worth reiterating that this 35% hit was only visible in an optimised build. I strongly believe code should be regularly built with optimisations enabled, and that they should be disabled only when you need to debug logic errors. Performance bugs are best investigated with tools like perf.
Some programmers criticise C for pitfalls like this. Personally, I see it as a strength: the performance is there if you understand what’s going on. In higher-level languages, these issues are hidden, but the performance penalties often remain.
[1] To my knowledge, perf interrupts the program and inspects the instruction pointer to determine what’s executing. If it interrupts during a long instruction, it must wait for it to finish and may attribute the sample to the instruction that follows. That’s why the 31.84% value appears next to cmp rather than the more expensive mov before it.