Optimizing wLipSync even further

This post continues where we left off in the previous part: Optimizing wLipSync. While it seemed there wouldn’t be much performance left to squeeze out after the initial improvements, that turned out to be false.

As is often the case in software development, letting a problem rest for a while leads to fresh ideas and new insights.

Quick wins

At one point I realized that the project wasn’t compiled with -ffast-math. This compiler flag lets the compiler apply various optimizations to floating-point math (which wLipSync uses throughout). Initial testing was promising: the execution time was more than halved. After further testing, however, it became clear that the results were no longer correct.
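To illustrate the kind of change -ffast-math permits (a generic example, not the actual failure in wLipSync): the flag allows the compiler to reassociate floating-point operations, and in single precision a different order of operations can change the result.

#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1e-3f;
    // The order the source spells out: (a + b) + c evaluates to 1e-3f
    printf("%g\n", (a + b) + c);
    // The reassociated order a compiler may pick under -ffast-math:
    // b + c rounds to -1e8f, so the sum collapses to 0
    printf("%g\n", a + (b + c));
    return 0;
}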

There was still some hope. By selectively applying the flag to different parts of the project, it should be possible to identify both the code that breaks and the code that benefits the most from these optimizations. These turned out to be one and the same. In fact, it was the same culprit as last time: the low-pass filter.

The Code

If I wanted to improve the performance, I had to find a clever way to speed up the low-pass filter and, more importantly, ensure it remained compatible with the original uLipSync profiles. The code was as follows:

for (int j = 0; j < bLen; ++j) {
    // Start i at j rounded up to the nearest multiple of skip
    for (int i = j + (skip - j%skip)%skip; i < len; i += skip) {
        data[i] += b[j] * tmp[i - j];
    }
}

A simple nested loop: the outer loop iterates over the coefficients stored in b, and each coefficient is multiplied with samples from tmp and accumulated into data. Here tmp is a copy of data taken right before the loop, along the lines of the following sketch (the allocation details are an assumption, not the project’s actual code):
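// Snapshot the unfiltered signal so the filter can read tmp while writing data in place
// (malloc from <stdlib.h>, memcpy from <string.h>)
float *tmp = malloc(sizeof(float) * len);
memcpy(tmp, data, sizeof(float) * len);

The coefficients themselves are calculated as follows: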

for (int i = 0; i < bLen; ++i) {
    float x = i - (bLen - 1) / 2.f;           // offset from the centre of the kernel
    float ang = 2.f * PT_PI * cutoff * x;
    b[i] = 2.f * cutoff * PT_sinf(ang) / ang; // sinc kernel of an ideal low-pass filter
}

One thing I overlooked previously is that this loop and the outer loop of the previous snippet both iterate over the coefficients, so there is no need to store them in an array at all. Each iteration can compute the relevant coefficient and use it directly:

for (int j = 0; j < bLen; ++j) {
    float x = j - (bLen - 1) / 2.f;
    float ang = 2.f * PT_PI * cutoff * x;
    float b = 2.f * cutoff * PT_sinf(ang) / ang;

    // Start i at j rounded up to the nearest multiple of skip
    for (int i = j + (skip - j%skip)%skip; i < len; i += skip) {
        data[i] += b * tmp[i - j];
    }
}

While this didn’t provide a tangible improvement, it made me take a closer look at the coefficients, and then it hit me!

Abusing symmetry

The value of each coefficient only depends on its index, as all other parts of the computation were constant throughout the low-pass filter. What’s more, there was a nice symmetry to be found by pairing up coefficients (the algorithm guarantees the number of coefficients is even).

Each coefficient i is the same as coefficient bLen - 1 - i: x is antisymmetric around the centre of the kernel, and sin(ang) / ang is an even function. So what if the outer loop didn’t go through all coefficients, but only half of them?
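A quick standalone check of that claim (not project code; PT_PI and PT_sinf are swapped for their standard-library counterparts, and the kernel length and cutoff are arbitrary):

#include <assert.h>
#include <math.h>

// Coefficient formula from the snippet above
static float coeff(int i, int bLen, float cutoff) {
    float x = i - (bLen - 1) / 2.f;
    float ang = 2.f * 3.14159265f * cutoff * x;
    return 2.f * cutoff * sinf(ang) / ang;
}

int main(void) {
    int bLen = 64;          // any even kernel length
    float cutoff = 0.125f;  // any normalized cutoff below 0.5
    for (int i = 0; i < bLen; ++i)
        assert(fabsf(coeff(i, bLen, cutoff) - coeff(bLen - 1 - i, bLen, cutoff)) < 1e-6f);
    return 0;
}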

This mostly worked, but there’s an edge case at the start of the array. The inner loop’s start offset already keeps the first index, i - j, in bounds, but the second index of the pair, i - (bLen - 1 - j), can still be negative. Processing the coefficients pairwise therefore means distinguishing between only one or both samples being in bounds:

// Start i at j rounded up to the nearest multiple of skip
for (int i = j + (skip - j%skip)%skip; i < len; i += skip) {
    if(i < bLen - 1 - j) {
        // Process one
        data[i] += b * tmp[i - j];
    } else {
        // Process two (symmetry)
        data[i] += b * (tmp[i - j] + tmp[i - (bLen - 1 - j)]);
    }
}

It comes as no surprise that a branch inside the hot inner loop tanks the performance. Luckily there’s a neat little solution: since the tmp array is a copy anyway, we can pad it at the start with zeros. The zeros contribute nothing to the sum, so the outcome stays the same while the branch disappears, leaving us with the final inner loop:

// Start i at j rounded up to the nearest multiple of skip
for (int i = j + (skip - j%skip)%skip; i < len; i += skip) {
    data[i] += b * (tmp[bLen + i - j] + tmp[bLen + i - (bLen - 1 - j)]);
}
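Putting the pieces together, the complete low-pass filter would look roughly like this (a sketch reconstructed from the snippets above, not a verbatim excerpt; in particular the padded layout of tmp is my reading of the indices):

// tmp holds bLen zeros followed by a copy of the signal:
//   tmp[0 .. bLen - 1] = 0,  tmp[bLen + k] = data[k]
for (int j = 0; j < bLen / 2; ++j) {
    // Coefficient j, which by symmetry is also coefficient bLen - 1 - j
    float x = j - (bLen - 1) / 2.f;
    float ang = 2.f * PT_PI * cutoff * x;
    float b = 2.f * cutoff * PT_sinf(ang) / ang;

    // Start i at j rounded up to the nearest multiple of skip
    for (int i = j + (skip - j%skip)%skip; i < len; i += skip) {
        // The leading zeros in tmp turn the former out-of-bounds case into adding 0
        data[i] += b * (tmp[bLen + i - j] + tmp[bLen + i - (bLen - 1 - j)]);
    }
}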

This optimization, along with some other small changes, made wLipSync about 77% faster. This is on top of the previous ~7x speed-up.

Learn more

If you’re interested in the final code or the lip sync project in general, you can find it on GitHub and npm.

