SSFFT: Energy-Efficient Selective Scaling for Fast Fourier Transform in Embedded GPUs

Thread Allocation of SSFFT

Abstract

Fast Fourier Transform (FFT) is critical in applications such as signal processing, communications, and AI. Embedded GPUs are often used to accelerate FFT due to their computational efficiency, but energy efficiency remains a key challenge due to power constraints. Existing solutions, such as the cuFFT library provided by NVIDIA, employ static configurations for the number of thread blocks and threads per block. This static approach often results in ineffective threads that consume power without contributing to performance, particularly if the FFT length or batch size varies. Furthermore, for large FFT lengths, cuFFT internally splits the computation into multiple kernel invocations. This decomposition can lead to L2 cache thrashing, resulting in redundant global memory accesses and degraded efficiency. To address these challenges, this paper proposes SSFFT, a software technique for embedded GPUs. The key idea of SSFFT is to maximize the number of useful threads that contribute to performance while minimizing ineffective threads. SSFFT is implemented based on a novel theoretical model that determines how many thread blocks and threads per block are effective for a given FFT length, batch size, and hardware resource availability. SSFFT statically determines these configurations and adaptively launches either a GPU kernel for regular FFT operations or a newly implemented kernel that integrates multiple FFT steps. By tailoring thread allocation to workload characteristics and minimizing inter-kernel memory interference, SSFFT improves energy efficiency without compromising performance. In our evaluation, SSFFT achieves a 1.29× speedup and a 1.26× improvement in throughput per watt compared to cuFFT.

Publication
International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES)
Gunjae Koo
Gunjae Koo
Associate Professor