GPU DSP — Latency

I must have stirred controversy in my previous post. Take the comments from this thread for example: The future of DSP?. I almost wanted to reply, but I’d rather not. The tone is too condescending to elicit a response. Ah, the forums, right? You’ve got to love ’em or hate ’em. Instead, I’ll present information (and code!) that substantiates my article.

Quick update: Someone graciously posted a link to this post in the thread mentioned above. Thank you!

Summary for the impatient: So far we have:

  1. A link to an actual proof of concept (a video clip): a 96-sample buffer latency at 96 kHz, holding 50 1-cell tracks in parallel with the 1 ms input buffer.
  2. Actual latency tests in the μsec range.
  3. An example solution (convolution) on how to mitigate latency that uses both the CPU and the GPU.

Now, Delta Sound-Labs gave me another interesting link (in the comments section below): In the link, Victor Zappi presents us with a tutorial on how to implement real-time physical modeling synthesis on the GPU, using C++ and OpenGL shaders. 

Here’s another interesting paper, “Analysis of CPU and GPU Implementations of Convolution Reverb Effect”: “We observed speedups in the range of 2 to 3 times over CPU implementation”… “The authors claim that GPUs are eligible for real-time applications, such as software plugins for professional audio production”

Who Me?

C++ Dev, Japan in the 90s

First, despite the ‘imposter syndrome’ that I every so often experience, I think I know what I am talking about :-). I know most of you know me as the guy who makes multichannel pickups, but my main profession is actually C++ programming. I’ve been doing that for almost 3 decades now, specializing in modern high-performance C++.

I have authored several highly successful Open Source projects such as Boost.Spirit, Boost.Phoenix and Boost.Fusion. These libraries are all part of the Boost Libraries, a well respected, peer-reviewed, Open Source, collaborative development effort. There’s a big chance that the computer you are using to read this now is using the Boost C++ Libraries. The Boost Libraries are being used by organizations and companies all over the world from CERN to Adobe. Boost.Fusion, in particular, is being used in high-performance computing.

Recently, I’ve been busy working on Q, a DSP Audio library. Q leverages the power of modern C++ and efficient use of functional programming techniques, especially function composition using fine-grained and reusable function objects (both stateless and stateful), to simplify complex DSP programming tasks without sacrificing readability.

TL;DR Syndrome

OK, ’nuff introduction… On to some substance…

It’s unfortunate that the busy Internet world does not have the attention span beyond a few minutes. Some say the average page visit lasts less than a minute and users often leave web pages in just 10-20 seconds. Sigh. And in comparison, I prepare every post I write with substantial research, often with links for further reading near the end. A typical post like GPU DSP — When You Can’t Have Enough Cores takes at least a few days to complete, with initial background research that may span months or even years. These are the topics I really care about.

The very first paragraph contains a link to an article, Why The Future Of Real-time Audio Processing Could Be In Graphics Cards, which corroborates my report. And in that link, if you dig through the comments section (interesting discourse, BTW!), there’s an actual working proof of concept: an example of a low-latency Audio DSP plugin using the GPU with a 96-sample buffer latency at 96 kHz, holding 50 1-cell tracks in parallel with the 1 millisecond input buffer. Here’s the YouTube video:


I did say that “The issue to contend with is latency and data transfer rates with data going in and out of the CPU and the GPU. But 1) modern GPUs are gaining such incredible data transfer rates in the GB/sec range and latencies in the μsec range and 2) there are software techniques you can use to mitigate the data transfer bottleneck.”

Let’s tackle the first. I’ll tackle 2 later. Here’s the clincher: the PCI-E 3.0 standard guarantees data transfer for 4 KB of data within 1-2 μsec (3-10 μsec round trip). Have a look at this article: A look at GPU memory transfer. In that article, the important point is that (quote): “We see the general shape we expected to see. Up until about 8K, all transfers take around 6 or 7 microseconds. Afterwards, the transfer time increases linearly.”

Here’s the graph:

The numbers are not shabby at all! And take note that he wrote that article in 2012! Now, I replicated the numbers using my development laptop (a mid-2015 MacBook Pro i7) with the code below (using the Boost.Compute library for the OpenCL programming parts).

I am getting similar (but better) results.

The Code

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <boost/compute.hpp>

int main()
{
   auto gpu = boost::compute::system::default_device();
   boost::compute::context context(gpu);

   // enable profiling so that each copy event carries timing information
   auto properties = boost::compute::command_queue::enable_profiling;
   auto queue = boost::compute::command_queue(context, gpu, properties);

   size_t size = 1;

   // double the transfer size on every iteration
   for (int i = 0; i < 26; ++i)
   {
      auto num_bytes = size * sizeof(float);

      std::vector<float> host_vector(size);
      std::generate(host_vector.begin(), host_vector.end(), std::rand);

      boost::compute::vector<float> device_vector(host_vector.size(), context);

      // copy data from the host to the device
      boost::compute::future<void> future =
         boost::compute::copy_async(
            host_vector.begin(), host_vector.end(), device_vector.begin(), queue);

      // wait for copy to finish
      future.wait();

      // get elapsed time in microseconds
      auto elapsed = future.get_event().duration<boost::chrono::microseconds>().count();

      std::cout
         << "size: " << num_bytes << " bytes, "
         << "time: " << elapsed << " us, "
         << std::endl;

      size *= 2;
   }

   return 0;
}

size: 4 bytes, time: 3.6 us,
size: 8 bytes, time: 3.36 us,
size: 16 bytes, time: 3.36 us,
size: 32 bytes, time: 3.68 us,
size: 64 bytes, time: 4 us,
size: 128 bytes, time: 3.68 us,
size: 256 bytes, time: 3.52 us,
size: 512 bytes, time: 3.76 us,
size: 1024 bytes, time: 4.16 us,
size: 2048 bytes, time: 4 us,
size: 4096 bytes, time: 4.08 us,
size: 8192 bytes, time: 4.72 us,

size: 16384 bytes, time: 5.12 us,
size: 32768 bytes, time: 6.24 us,
size: 65536 bytes, time: 9.2 us,
size: 131072 bytes, time: 12.48 us,
size: 262144 bytes, time: 18.24 us,
size: 524288 bytes, time: 33.68 us,
size: 1048576 bytes, time: 64.88 us,
size: 2097152 bytes, time: 105.12 us,
size: 4194304 bytes, time: 247.68 us,
size: 8388608 bytes, time: 760.32 us,
size: 16777216 bytes, time: 906.4 us,
size: 33554432 bytes, time: 1823.44 us,

My results show around 4-5 µs data transfer latency up to around 8K. I do not know about you, but to me this looks very promising. I expect future generation GPUs to push this further down, into the nanosecond range. But now that we are in the µsec latency range, IMO, GPGPU for audio has already taken off. It’s just not mainstream yet.
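To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It takes the 96-sample buffer at 96 kHz from the proof of concept and asks how much of that buffer period a round trip to the GPU would eat. The function names are made up for illustration, and it assumes that the copy back from the device costs about the same as the copy to the device, which I did not measure here.

```cpp
#include <cassert>

// Duration of one audio buffer, in microseconds.
constexpr double buffer_period_us(double frames, double sample_rate_hz)
{
   return frames / sample_rate_hz * 1e6;
}

// Fraction of the buffer period spent moving data to the GPU and back.
// Assumption: one host-to-device and one device-to-host copy per buffer,
// each taking one_way_us microseconds.
double transfer_share(double frames, double sample_rate_hz, double one_way_us)
{
   assert(frames > 0 && sample_rate_hz > 0);
   return 2.0 * one_way_us / buffer_period_us(frames, sample_rate_hz);
}
```

With these assumptions, transfer_share(96, 96000, 5.0) comes out at about 0.01. In other words, two 5 µs copies take roughly 1% of the 1 ms buffer, leaving the rest for actual processing.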

A Different Paradigm

I can venture one guess why it is not mainstream yet: The change from Sequential processing to Parallel processing is a huge leap. A new paradigm. The cute joke that “we cannot take 9 women and have a baby in 1 month”, alluding to audio DSP processing, is amusing, but misguided. Who says we should limit ourselves to making only babies? Think outside the box, man! Break the sequential one-step-at-a-time mindset.

For example, the fact that there’s CUDA FFT, an FFT implementation up to 10x faster than CPU-only alternatives, is an indication that GPGPU DSP processing is viable and should not be ignored! Anyone who studied the inner workings of FFT should know that the FFT algorithm is inherently parallel. It’s just one example, albeit a very important one! There are others, such as FIR filters, convolution and correlation (e.g. autocorrelation for pitch detection), to name a few (see Parallel multidimensional digital signal processing in the links below).
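To make the point concrete, here is a minimal sketch of a direct-form FIR filter in plain C++ (no GPU code; the function name fir is mine). Each output sample depends only on the input and the coefficients, never on another output. That independence is what would let each iteration of the outer loop be handed to its own GPU work item.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k].
// No iteration of the outer loop reads a value written by another
// iteration, so the iterations could all run in parallel.
std::vector<float> fir(std::vector<float> const& x, std::vector<float> const& h)
{
   assert(!h.empty());
   std::vector<float> y(x.size(), 0.0f);
   for (std::size_t n = 0; n < x.size(); ++n)   // parallelizable loop
   {
      float acc = 0.0f;
      for (std::size_t k = 0; k < h.size() && k <= n; ++k)
         acc += h[k] * x[n - k];
      y[n] = acc;
   }
   return y;
}
```

For instance, filtering {1, 2, 3, 4} with the two-tap averager {0.5, 0.5} gives {0.5, 1.5, 2.5, 3.5}.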

Mitigating Latency

All those mentioned above have their own inherent latencies. For instance, the latency of FFT is directly proportional to its low-frequency resolution. There will always be latency. The goal is to keep it as low as possible, below the threshold of perception.
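The FFT trade-off is easy to put into numbers. This sketch (hypothetical helper names) relates the bin spacing of an N-point FFT to the time it takes just to collect one full window of input. It ignores compute and transfer time.

```cpp
#include <cassert>

// Spacing between FFT bins, in Hz, for a given FFT size and sample rate.
constexpr double bin_spacing_hz(double fft_size, double sample_rate_hz)
{
   return sample_rate_hz / fft_size;
}

// Time needed to collect one full FFT window, in milliseconds.
// This equals 1000 / bin_spacing_hz: finer resolution, longer wait.
double window_latency_ms(double fft_size, double sample_rate_hz)
{
   assert(fft_size > 0 && sample_rate_hz > 0);
   return fft_size / sample_rate_hz * 1000.0;
}
```

For example, a 4096-point FFT at 48 kHz gives bins about 11.7 Hz apart, at the cost of roughly 85 ms of input before the first result.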

There are software techniques for mitigating latency. For example, one solution is explored here: Efficient Convolution Without Latency for both uniprocessor and multiprocessor architectures. “Essentially, early portions of the impulse response are rendered using small block transforms, and late portions of the impulse response are rendered using large block transforms. To achieve zero latency, the earliest portion of the filter response must be rendered using a direct form filter.”

This approach can be implemented using heterogeneous processor architectures. The low-latency operation is performed in the CPU, while a more efficient but higher latency operation (e.g. using FFT) proceeds in the GPGPU.
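Here is a deliberately simplified sketch of the partitioning idea, not the algorithm from the paper. The impulse response is split in two. The head is convolved directly, so it adds no latency. The tail is convolved separately and added back in, delayed by the split point. In a real system the tail would use block FFTs, possibly on the GPU; here it reuses the same plain convolution so that the result can be checked against the unpartitioned version. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using signal = std::vector<double>;

// Plain linear convolution, truncated to x.size() output samples.
signal convolve(signal const& x, signal const& h)
{
   signal y(x.size(), 0.0);
   for (std::size_t n = 0; n < x.size(); ++n)
      for (std::size_t k = 0; k < h.size() && k <= n; ++k)
         y[n] += h[k] * x[n - k];
   return y;
}

// Split h at `split`. The head is rendered directly (the low-latency part).
// The tail stands in for the block-transform work: its contribution is not
// needed until `split` samples later, which is the time it has to finish.
signal convolve_partitioned(signal const& x, signal const& h, std::size_t split)
{
   assert(split <= h.size());
   signal head(h.begin(), h.begin() + split);
   signal tail(h.begin() + split, h.end());

   signal y = convolve(x, head);
   signal late = convolve(x, tail);
   for (std::size_t n = split; n < x.size(); ++n)
      y[n] += late[n - split];
   return y;
}
```

The two paths sum to the same output as convolve(x, h). The split point sets how much slack the slower path gets before its result is needed.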

Another example is pitch detection, using a variant of autocorrelation. This requires at least 1.5 cycles of the lowest fundamental frequency of interest, which could be rather prohibitive at the lower frequencies. For instance, detecting C#1 (34.65 Hz) requires 43 ms. One way to minimize this latency is to use a faster, but less accurate, feature-based time-domain pitch detector (e.g. using peak detectors) in the attack portion.
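The 43 ms figure follows directly from the 1.5-cycle requirement. A small sketch of the arithmetic (the helper names and the 44.1 kHz sample rate in the usage note are my own choices for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Minimum analysis window, in milliseconds, needed to capture `cycles`
// periods of the lowest fundamental frequency of interest.
double min_window_ms(double lowest_hz, double cycles = 1.5)
{
   assert(lowest_hz > 0.0);
   return cycles / lowest_hz * 1000.0;
}

// The same window expressed in whole samples, rounded up.
std::size_t min_window_samples(
   double lowest_hz, double sample_rate_hz, double cycles = 1.5)
{
   assert(lowest_hz > 0.0 && sample_rate_hz > 0.0);
   return static_cast<std::size_t>(
      std::ceil(cycles * sample_rate_hz / lowest_hz));
}
```

min_window_ms(34.65) gives about 43.3 ms, which at 44.1 kHz works out to 1910 samples. That is the wait a fast attack-phase detector is meant to cover.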


So, there you go. I hope there’s enough material here for the naysayers. Or not ¯\_(ツ)_/¯. C’est la vie.

‘Till next time, friends. Cheers!

Further Reading

  1. A look at GPU memory transfer
  2. HPC-oriented Latency Numbers Every Programmer Should Know
  4. Efficient Convolution Without Latency
  5. Boost Compute library: C++ interface to multi-core CPU and GPGPU computing platforms.
  6. Parallel multidimensional digital signal processing.
