Optimizing wLipSync
When porting uLipSync to the web, the choice for WASM was quickly made: performance is paramount for audio processing, especially since the lip sync should run in real time on mobile devices. However, just using WASM does not automatically guarantee performance. After porting the C# code to C++, it was time to look for possible optimizations.
Profiling
It’s tempting to just go through the code in hopes of finding places that can be optimized. Generally, this is a waste of time: you’ll be putting effort into squeezing performance out of parts that aren’t even the problem. A better approach is to profile and measure where time is actually spent, and thus where the most time can be saved. This was also the case when I looked into the main algorithm of wLipSync.
I created a simple benchmark to easily measure performance. Then I went and measured the individual steps of the algorithm to find the most expensive one. Try and guess which of the following steps ended up being the bottleneck:
- CopyRingBuffer
- LowPassFilter
- DownSample
- PreEmphasis
- HammingWindow
- Normalize
- FFT
- MelFilterBank
- PowerToDb
- DCT
- CalculateScores
- NormalizeScores
If you guessed the Fast Fourier Transform (FFT) or the Discrete Cosine Transform (DCT), you’d be wrong. My eyes were drawn to these as there’s a lot going on there, but it’s actually the low pass filter. It wasn’t even close: the low pass filter alone takes more time than all the other steps combined(!)
The total time was ~2.86 ms, with the low pass filter taking ~2.75 ms (96%).
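The benchmark itself was nothing special. A minimal sketch of such a timing harness (illustrative only; not the actual wLipSync benchmark code):

#include <chrono>

// Times `fn` over `iterations` runs and returns the average wall-clock
// time per run in milliseconds.
template <typename Fn>
double measureMs(Fn&& fn, int iterations = 1000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

Wrapping each step of the algorithm in a lambda and printing the results is enough to find the bottleneck.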
The Code
Armed with this knowledge, it was clear that in order to improve performance meaningfully, the low pass filter would need to be improved. The code was as follows:
for (int i = 0; i < len; ++i) {
    for (int j = 0; j < bLen; ++j) {
        if (i - j >= 0) {
            data[i] += b[j] * tmp[i - j];
        }
    }
}
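For context, this is a direct-form FIR (finite impulse response) filter: each output sample accumulates a weighted sum of the current and preceding input samples,

data[i] += b[j] · tmp[i − j]   summed over j = 0 … bLen − 1

where b holds the filter coefficients. Nothing exotic, but the cost is roughly len · bLen multiply-adds.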
A nested loop with a single if-statement in it. The outer loop goes over the input samples and the inner loop goes over the coefficients. It’s worth pointing out that the number of coefficients is considerably smaller than the number of input samples. So the first simple modification was swapping the loops around:
for (int j = 0; j < bLen; ++j) {
    for (int i = 0; i < len; ++i) {
        if (i - j >= 0) {
            data[i] += b[j] * tmp[i - j];
        }
    }
}
Lo and behold, the processing time dropped from 2.86 ms to 1.11 ms. But this was only the start. Now that the loops have been re-ordered, the if-statement becomes trivial: it only makes sure that i >= j, so the inner loop can simply be changed to start at j:
for (int j = 0; j < bLen; ++j) {
    for (int i = j; i < len; ++i) {
        data[i] += b[j] * tmp[i - j];
    }
}
Now the time dropped from 1.11 ms down to 0.76 ms. A nice improvement, and the code even becomes a bit easier to read. But this seems to be as far as this small piece of code can be pushed. Or is it?
Taking a step back
When focusing on a small piece of code, it’s easy to lose sight of the bigger picture. This low pass filter runs right before the downsample step to avoid aliasing. This means that its output only needs to satisfy the requirements of the downsample step. It just so happens that the downsample step has a fast path in case the input sample rate is a multiple of the target sample rate (which it generally will be for wLipSync).
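As a sketch of that idea (variable names here are illustrative, and the actual wLipSync implementation may differ), the fast path boils down to keeping every skip-th sample:

// skip is the ratio between the input and target sample rates,
// e.g. 48000 / 16000 = 3. data, output, and len are assumed context.
int skip = inputSampleRate / targetSampleRate;
for (int i = 0; i * skip < len; ++i) {
    output[i] = data[i * skip];
}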
This fast path jumps through the input samples with a certain stride. All other samples are ignored, which means they don’t have to be computed. The low pass filter only needs to be applied to the samples that the downsample step will actually use. With some careful arithmetic, this leads to the following code:
for (int j = 0; j < bLen; ++j) {
    // Start i at j rounded up to the nearest multiple of skip
    for (int i = j + (skip - j % skip) % skip; i < len; i += skip) {
        data[i] += b[j] * tmp[i - j];
    }
}
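The only tricky part is the start expression, which rounds j up to the nearest multiple of skip. A standalone sanity check (not wLipSync code) shows what it does:

#include <cassert>

int main() {
    int skip = 3;
    // j + (skip - j % skip) % skip is the smallest multiple of skip >= j
    assert(4 + (skip - 4 % skip) % skip == 6);
    assert(5 + (skip - 5 % skip) % skip == 6);
    assert(6 + (skip - 6 % skip) % skip == 6); // multiples stay put
    return 0;
}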
This brings the total time down from 0.76 ms to 0.36 ms. That is nearly an 8x improvement over the original time of 2.86 ms, just by optimizing a single step in the whole algorithm.
Conclusion
The moral of the story is to profile your code before optimizing. Since the code in wLipSync is ported from uLipSync, I was curious whether the above optimizations translated back as well. They do, so I’ve opened a PR to get these optimizations upstreamed too (hecomi/uLipSync#86).
If you’re interested in the final code or the lip sync project in general, you can find it on GitHub.