Optimizing wLipSync
When porting uLipSync to the web, the choice for WASM was quickly made: performance is paramount for audio processing, especially since the lip sync should run in real time on mobile devices. However, just using WASM does not automatically guarantee performance. After porting the C# code to C++, it was time to look for possible optimizations.
Profiling
It’s tempting to just go through the code in hopes of finding places that can be optimized. Generally, this is a waste of time: you’ll be putting effort into squeezing performance out of parts that aren’t even the problem. A better approach is to profile and measure where time is actually spent, and thus where the most time can be saved. This was also the case when I looked into the main algorithm of wLipSync.
I created a simple benchmark to easily measure performance. Then I went and measured the individual steps of the algorithm to find the most expensive one. Try and guess which of the following steps ended up being the bottleneck:
- CopyRingBuffer
- LowPassFilter
- DownSample
- PreEmphasis
- HammingWindow
- Normalize
- FFT
- MelFilterBank
- PowerToDb
- DCT
- CalculateScores
- NormalizeScores
If you guessed the Fast Fourier Transform (FFT) or the Discrete Cosine Transform (DCT), you’d be wrong. My eyes were drawn to these as there’s a lot going on there, but it’s actually the low pass filter. It wasn’t even close: the low pass filter alone takes more time than all the other steps combined(!)
The total time was ~2.86 ms, with the low pass filter taking ~2.75 ms (96%).
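The benchmark itself was nothing special. A minimal sketch of such a timing harness (illustrative only; not the actual wLipSync benchmark code):

#include <chrono>

// Times `fn` over `iterations` runs and returns the average wall-clock
// time per run in milliseconds.
template <typename Fn>
double measureMs(Fn&& fn, int iterations = 1000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

Wrapping each step of the algorithm in a lambda and printing the results is enough to find the bottleneck.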
The Code
Armed with this knowledge, it was clear that in order to improve performance meaningfully, the low pass filter would need to be improved. The code was as follows:
for (int i = 0; i < len; ++i) {
    for (int j = 0; j < bLen; ++j) {
        if (i - j >= 0) {
            data[i] += b[j] * tmp[i - j];
        }
    }
}
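For context, this is a direct-form FIR (finite impulse response) filter: each output sample accumulates a weighted sum of the current and preceding input samples,

data[i] += b[j] · tmp[i − j]   summed over j = 0 … bLen − 1

where b holds the filter coefficients. Nothing exotic, but the cost is roughly len · bLen multiply-adds.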
A nested loop with a single if-statement in it. The outer loop goes over the input samples and the inner loop goes over the coefficients. It’s worth pointing out that the number of coefficients is considerably smaller than the number of input samples. So the first simple modification was swapping the loops around:
for (int j = 0; j < bLen; ++j) {
    for (int i = 0; i < len; ++i) {
        if (i - j >= 0) {
            data[i] += b[j] * tmp[i - j];
        }
    }
}
Lo and behold, the processing time dropped from 2.86 ms to 1.11 ms. But this was only the start. Now that the loops have been re-ordered, the if-statement becomes trivial: it only makes sure that i >= j, so the inner loop can simply be changed to start at j:
for (int j = 0; j < bLen; ++j) {
    for (int i = j; i < len; ++i) {
        data[i] += b[j] * tmp[i - j];
    }
}
Now the time dropped from 1.11 ms down to 0.76 ms. A nice improvement, and the code even becomes a bit easier to read. But this seems to be as far as this small piece of code can be pushed. Or is it?
Taking a step back
When focusing on a small piece of code, it’s easy to lose sight of the bigger picture. This low pass filter runs right before the downsample step to avoid aliasing. This means that its output only needs to satisfy the requirements of the downsample step. It just so happens that the downsample step has a fast path in case the input sample rate is a multiple of the target sample rate (which it generally will be for wLipSync).
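As a sketch of that idea (variable names here are illustrative, and the actual wLipSync implementation may differ), the fast path boils down to keeping every skip-th sample:

// skip is the ratio between the input and target sample rates,
// e.g. 48000 / 16000 = 3. data, output, and len are assumed context.
int skip = inputSampleRate / targetSampleRate;
for (int i = 0; i * skip < len; ++i) {
    output[i] = data[i * skip];
}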
This fast path jumps through the input samples with a certain stride. All other samples are ignored, which means they don’t have to be computed. The low pass filter only needs to be applied to the samples that the downsample step will actually use. With some careful arithmetic, this leads to the following code:
for (int j = 0; j < bLen; ++j) {
    // Start i at j rounded up to the nearest multiple of skip
    for (int i = j + (skip - j % skip) % skip; i < len; i += skip) {
        data[i] += b[j] * tmp[i - j];
    }
}
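The only tricky part is the start expression, which rounds j up to the nearest multiple of skip. A standalone sanity check (not wLipSync code) shows what it does:

#include <cassert>

int main() {
    int skip = 3;
    // j + (skip - j % skip) % skip is the smallest multiple of skip >= j
    assert(4 + (skip - 4 % skip) % skip == 6);
    assert(5 + (skip - 5 % skip) % skip == 6);
    assert(6 + (skip - 6 % skip) % skip == 6); // multiples stay put
    return 0;
}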
This brings the total time down from 0.76 ms to 0.36 ms. That is nearly an 8x improvement over the original time of 2.86 ms, just by optimizing a single step in the whole algorithm.
Conclusion
The moral of the story is to profile your code before optimizing. Since the code in wLipSync is ported from uLipSync, I was curious whether the above optimizations translated back as well. They do, so I’ve opened a PR to get these optimizations upstreamed too (hecomi/uLipSync#86).
If you’re interested in the final code or the lip sync project in general, you can find it on GitHub.