GPU DSP — When You Can’t Have Enough Cores

NVIDIA Jetson Nano

The writing is on the wall. I said it before, I’ll say it again: the dedicated DSP processor will become obsolete. I’m being modest, actually. Some even claim that the DSP is already dead (see Why The Future Of Real-time Audio Processing Could Be In Graphics Cards).

First, DSP is a crucial part of modern software, and nothing inherently prevents the CPU from absorbing it. More DSP functionality is creeping into the CPU as we speak. Modern CPUs and MCUs with SIMD and DSP capabilities such as NEON are proving to be usable now, especially for audio. I can do wonders even with small MCUs such as the STM32H7. There was a time, not too long ago, when you couldn’t do complex real-time audio processing in pure software, even on a “high-end” desktop. Now you can. It is commonplace, and it will only get better.

But more importantly, GPUs are taking off as parallel, general-purpose computing engines. Modern GPUs are actually GPGPUs: general-purpose graphics processing units that perform non-specialized calculations that would typically be done by the CPU. These GPGPUs are already inside your laptop, and are now encroaching on the land of embedded processing, thanks to advances in AI and machine learning, which demand that kind of computing power. Oh, and thanks also to cryptocurrencies such as Bitcoin. These trends pushed companies such as NVIDIA to advance the state of the art at a rapid pace.

As a computing device, the modern GPU will leave any dedicated DSP in the dust. The issues to contend with are latency and the data transfer rate as data moves between the CPU and the GPU. But 1) modern GPUs are reaching transfer rates in the GB/sec range and latencies in the μsec range, and 2) there are software techniques you can use to mitigate the data transfer bottleneck.

Jetson Nano Dev Kit

Have a look at the Jetson Nano from NVIDIA, for example. This small $99 USD(!) Developer Kit with 128 CUDA cores delivers an amazing 472 GFLOPS (NVIDIA’s quoted peak, which is a half-precision FP16 figure). It is being positioned as a platform for running modern AI algorithms with multiple neural networks in parallel. That computing power can be utilized for audio DSP. After all, it is a GPGPU! In contrast, the high-end TMS320C6678 Multicore (8 cores) Fixed and Floating Point Digital Signal Processor from Texas Instruments can deliver 160 GFLOPS. Take note that while the TMS320C6678 is a high-end device, the Jetson Nano is considered to be an entry-level offering by NVIDIA.

Jetson Nano Specs

GPGPU programming involves either CUDA, a parallel computing platform and programming model invented by NVIDIA, or OpenCL (Open Computing Language), an open, royalty-free standard for cross-platform, parallel programming of diverse processors. OpenCL is an open standard, while CUDA is tied to NVIDIA GPUs. Fortunately, NVIDIA also supports OpenCL. With OpenCL, you are not tied to a proprietary architecture that is inflexible and difficult to port. Ideally, you will want to program in a high-level language, and C/C++ is the language of choice here. C/C++ makes your DSP code easy to port to a different machine without sacrificing performance.

In the early days, OpenCL on NVIDIA GPUs was not as efficient as CUDA, but this is no longer the case. OpenCL is therefore the obvious choice if you do not want to be tied to a specific vendor, although it is certainly possible to target both OpenCL and CUDA at the same time. And OpenCL is not just for GPUs: it can be used to execute programs across heterogeneous platforms comprising CPUs, GPUs, DSPs, and even FPGAs.

Where Do I Go From Here?

So this Jetson thing looks like a super cool deal, and it’s hard to resist the temptation. I will transform Q, the Audio DSP C++ library I’ve been working on, to take full advantage of GPGPU programming. Audio DSP programming is inherently parallel and one can’t really have enough cores. I can imagine a small stand-alone rack-mount (or stomp box?) multichannel Audio DSP box based on the Jetson Nano with USB or Ethernet connectivity to a laptop.

I could definitely make good use of all those cores. For the past year, I’ve been trying hard to optimize traditionally compute-intensive processing such as pitch detection and autocorrelation (see Bitstream Autocorrelation for example). In retrospect, I wonder if I should’ve just used GPGPUs early on. Convolutions, FFTs, FIRs, correlations… all of this can take advantage of massively parallel cores. Oh… and multiply all that processing by N channels, and you’ll understand why. Now with 128 cores, all of these are fair game. Multi-core processing is definitely the way to go for nontrivial DSP.
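Here is a minimal C++ sketch of why this kind of processing parallelizes so well, using a plain (not bitstream) autocorrelation as the example. Each (channel, lag) pair is an independent unit of work, so the flat index loop below maps directly onto a GPU thread grid. The function names are hypothetical and are not part of the Q library, and the loop runs serially on the CPU purely to show the structure.

```cpp
#include <cstddef>
#include <vector>

// One unit of work: the autocorrelation of x at a single lag.
// It only reads the input and returns its own result, so every lag
// (and every channel) can be computed independently.
float autocorr_at_lag(std::vector<float> const& x, std::size_t lag)
{
    float sum = 0.0f;
    for (std::size_t n = 0; n + lag < x.size(); ++n)
        sum += x[n] * x[n + lag];
    return sum;
}

// Flat index space over (channel, lag), the way a GPU grid would be
// laid out. Here the loop runs serially; on a GPGPU, each value of
// `i` would be handled by its own thread.
std::vector<float> autocorr_all(
    std::vector<std::vector<float>> const& channels, std::size_t max_lag)
{
    std::size_t const lags = max_lag + 1;
    std::vector<float> out(channels.size() * lags);
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        std::size_t const ch = i / lags;
        std::size_t const lag = i % lags;
        out[i] = autocorr_at_lag(channels[ch], lag);
    }
    return out;
}
```

With 8 channels and 512 lags, that is already 4096 independent work items, far more than there are cores to run them on. The same pattern applies to FIR filters and convolutions, where each output sample can be computed on its own.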

A worthy endeavor? Leave a comment below or on our FB Forum.

Further Reading

  1. General-purpose computing on graphics processing units
  2. Multidimensional DSP with GPU Acceleration
  3. Why The Future Of Real-time Audio Processing Could Be In Graphics Cards
  4. NVIDIA Jetson Nano
  5. Hands-On: New NVIDIA Jetson Nano Is More Power In a Smaller Form Factor
  6. OpenCL vs. CUDA: Which Has Better Application Support?
  7. CUDA
  8. OpenCL
