Proceedings ArticleDOI

GPU Implementation of a Programmable Turbo Decoder for Software Defined Radio Applications

TL;DR: This paper presents the implementation of a 3GPP standards compliant configurable turbo decoder on a GPU, achieved by suitably parallelizing the Log-MAP decoding algorithm and performing an architecture-aware mapping of it onto the GPU.
Abstract: This paper presents the implementation of a 3GPP standards compliant configurable turbo decoder on a GPU. The challenge in implementing a turbo decoder on a GPU lies in suitably parallelizing the Log-MAP decoding algorithm and performing an architecture-aware mapping of it onto the GPU. The approximations made in parallelizing the Log-MAP algorithm come at the cost of reduced BER performance. To mitigate this reduction, different guarding mechanisms of varying computational complexity are presented. The limited shared memory and registers available on GPUs are carefully allocated to obtain a high real-time decoding rate without requiring several independent data streams in parallel.
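
To make the parallelization concrete, here is a minimal sketch (ours, not the authors' code) of the guarded sub-block scheme the abstract describes: each CUDA thread runs the Max-Log-MAP forward recursion over one sub-block, first warming up on a guard window whose metrics are discarded, so the unknown boundary state converges before the sub-block proper. A toy 4-state RSC code (generators 5/7 octal) stands in for the 8-state 3GPP constituent code, GUARD = 32 is an assumed window length, and one thread per sub-block is a simplification of a real architecture-aware mapping.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

#define NSTATES 4   // toy code; the 3GPP constituent code has 8 states
#define GUARD   32  // assumed guard-window length

// next state and parity bit of the toy trellis, indexed [state][input bit]
__constant__ int NEXT[NSTATES][2]   = {{0, 2}, {2, 0}, {3, 1}, {1, 3}};
__constant__ int PARITY[NSTATES][2] = {{0, 1}, {0, 1}, {1, 0}, {1, 0}};

// one step of the forward recursion under the Max-Log-MAP approximation
__device__ void trellis_step(float a[NSTATES], float Lu, float Lp)
{
    float next[NSTATES];
    for (int s = 0; s < NSTATES; ++s) next[s] = -FLT_MAX;
    for (int s = 0; s < NSTATES; ++s)
        for (int u = 0; u < 2; ++u) {
            // branch metric from systematic (Lu) and parity (Lp) channel LLRs
            float g = 0.5f * ((u ? Lu : -Lu) + (PARITY[s][u] ? Lp : -Lp));
            int ns = NEXT[s][u];
            next[ns] = fmaxf(next[ns], a[s] + g);   // max* approximated by max
        }
    for (int s = 0; s < NSTATES; ++s) a[s] = next[s];
}

__global__ void forward_subblocks(const float* __restrict__ Lu,
                                  const float* __restrict__ Lp,
                                  float* __restrict__ alpha,   // [len][NSTATES]
                                  int subblock_len, int len)
{
    int sb    = blockIdx.x * blockDim.x + threadIdx.x;  // sub-block index
    int start = sb * subblock_len;
    if (start >= len) return;

    float a[NSTATES];
    for (int s = 0; s < NSTATES; ++s) a[s] = 0.0f;      // unknown boundary: uniform

    // guard window: run the recursion to converge the metrics, keep nothing
    for (int k = max(0, start - GUARD); k < start; ++k)
        trellis_step(a, Lu[k], Lp[k]);

    // sub-block proper: store alpha for the later backward/LLR pass
    int end = min(start + subblock_len, len);
    for (int k = start; k < end; ++k) {
        for (int s = 0; s < NSTATES; ++s) alpha[k * NSTATES + s] = a[s];
        trellis_step(a, Lu[k], Lp[k]);
    }
}
```

The backward recursion is symmetric, and the guard length trades BER recovery against extra computation, which is the cost/complexity trade-off the paper's guarding mechanisms address.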
Citations
Journal ArticleDOI
TL;DR: This tutorial investigates holistic design methodologies conceived for energy-constrained wireless communication applications by introducing turbo coding in detail, highlighting the various parameters of TCs and characterizing their impact on the encoded bit rate, on the radio frequency bandwidth requirement, on the transmission EC and on the BER.
Abstract: During the last two decades, wireless communication has been revolutionized by near-capacity error-correcting codes (ECCs), such as turbo codes (TCs), which offer a lower bit error ratio (BER) than their predecessors, without requiring an increased transmission energy consumption (EC). Hence, TCs have found widespread employment in spectrum-constrained wireless communication applications, such as cellular telephony, wireless local area network, and broadcast systems. Recently, however, TCs have also been considered for energy-constrained wireless communication applications, such as wireless sensor networks and the ‘Internet of Things.’ In these applications, TCs may also be employed for reducing the required transmission EC, instead of improving the BER. However, TCs have relatively high computational complexities, and hence, the associated signal-processing-related ECs are not insignificant. Therefore, when parameterizing TCs for employment in energy-constrained applications, both the processing EC and the transmission EC must be jointly considered. In this tutorial, we investigate holistic design methodologies conceived for this purpose. We commence by introducing turbo coding in detail, highlighting the various parameters of TCs and characterizing their impact on the encoded bit rate, on the radio frequency bandwidth requirement, on the transmission EC and on the BER. Following this, energy-efficient TC decoder application-specific integrated circuit (ASIC) architecture designs are exemplified, and the processing EC is characterized as a function of the TC parameters. Finally, the TC parameters are selected in order to minimize the sum of the processing EC and the transmission EC.
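
The joint-optimization step the tutorial describes can be sketched in a few lines. Everything below is a placeholder illustration of the methodology, not the paper's models: the two energy functions are invented stand-ins with roughly the right monotonic behaviour (processing EC grows with decoder effort, transmission EC shrinks with coding gain).

```cuda
#include <algorithm>
#include <cmath>
#include <vector>

struct TcParams { int iterations; double code_rate; };

// placeholder: processing energy per bit grows with decoding effort
double processing_ec(const TcParams& p)
{ return 1e-9 * p.iterations / p.code_rate; }

// placeholder: transmission energy per bit shrinks as coding gain improves
double transmission_ec(const TcParams& p)
{ return 5e-9 * std::exp(-0.4 * p.iterations) * (0.5 + p.code_rate); }

// the tutorial's final step: pick the parameterization minimizing the sum
// of processing and transmission EC (candidates must be non-empty)
TcParams pick_params(const std::vector<TcParams>& candidates)
{
    return *std::min_element(candidates.begin(), candidates.end(),
        [](const TcParams& a, const TcParams& b) {
            return processing_ec(a) + transmission_ec(a)
                 < processing_ec(b) + transmission_ec(b);
        });
}
```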

37 citations


Cites methods from "GPU Implementation of a Programmable Turbo Decoder for Software Defined Radio Applications"

  • ...Other windowing techniques include the Previous Iteration Value Initialization (PIVI) technique of [39], [41], which is also known as State-Metric Propagation (SMP) [42]....

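The PIVI/SMP technique quoted above can be sketched as a variant of the guarded sub-block forward pass: instead of re-running a guard window every iteration, each sub-block publishes its final forward metrics, and in the next iteration the neighbouring sub-block starts from them. A hedged sketch, reusing NSTATES and trellis_step from the sub-block kernel given earlier (buffer names are hypothetical, and [39], [41], [42] differ in detail); the in/out double-buffering avoids races between blocks of the same launch:

```cuda
__global__ void forward_subblocks_pivi(const float* __restrict__ Lu,
                                       const float* __restrict__ Lp,
                                       float* __restrict__ alpha,        // [len][NSTATES]
                                       const float* __restrict__ boundary_in,
                                       float* __restrict__ boundary_out, // [nsb][NSTATES]
                                       int subblock_len, int len, int iter)
{
    int sb    = blockIdx.x * blockDim.x + threadIdx.x;
    int start = sb * subblock_len;
    if (start >= len) return;

    float a[NSTATES];
    for (int s = 0; s < NSTATES; ++s)
        a[s] = (iter == 0 || sb == 0)                  // no history yet / true block start
             ? 0.0f
             : boundary_in[(sb - 1) * NSTATES + s];    // neighbour's metrics, last iteration

    int end = min(start + subblock_len, len);
    for (int k = start; k < end; ++k) {
        for (int s = 0; s < NSTATES; ++s) alpha[k * NSTATES + s] = a[s];
        trellis_step(a, Lu[k], Lp[k]);
    }
    for (int s = 0; s < NSTATES; ++s)                  // publish for the next iteration
        boundary_out[sb * NSTATES + s] = a[s];
}
```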

Proceedings ArticleDOI
05 Sep 2016
TL;DR: The results show that the proposed multi-core CPU implementation of turbo decoders is a competitive alternative to GPU implementations in terms of throughput and energy efficiency.
Abstract: This paper presents a high-throughput implementation of a portable software turbo decoder. The code is optimized for traditional multi-core CPUs (such as x86) and is based on the Enhanced max-log-MAP turbo decoding variant. The code follows the LTE-Advanced specification. The key to the high performance is an inter-frame SIMD strategy combined with a fixed-point representation. Our results show that the proposed multi-core CPU implementation of turbo decoders is a competitive alternative to GPU implementations in terms of throughput and energy efficiency. On a high-end processor, our software turbo decoder exceeds 1 Gbps information throughput for all rate-1/3 LTE codes with K < 4096.
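
The inter-frame SIMD strategy is easy to illustrate. Below is a minimal sketch of the idea (ours, not the paper's code): channel values and metrics are stored frame-major, so lane f of every operation belongs to frame f, the innermost loop carries no cross-lane dependency, and it maps one-to-one onto 16-bit fixed-point SIMD lanes. LANES = 16 assumes 256-bit AVX2 vectors.

```cuda
#include <algorithm>
#include <cstdint>

constexpr int LANES = 16;  // int16 lanes per 256-bit vector (assumed AVX2)

// one add-compare-select of the max-log-MAP recursion for LANES frames at
// once: a0/a1 are the competing path metrics, g0/g1 the branch metrics, all
// stored frame-major so the loop vectorizes directly
inline void acs_lanes(const int16_t* a0, const int16_t* g0,
                      const int16_t* a1, const int16_t* g1, int16_t* out)
{
    for (int f = 0; f < LANES; ++f) {
        int16_t m0 = int16_t(a0[f] + g0[f]);   // saturation handling omitted
        int16_t m1 = int16_t(a1[f] + g1[f]);
        out[f] = std::max(m0, m1);             // max-log approximation
    }
}
```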

18 citations

Proceedings ArticleDOI
01 Jul 2013
TL;DR: This paper proposes loosening the synchronization between the sub-decoders to achieve higher utilization of parallel processor resources, yielding what is, to the authors' knowledge, the fastest turbo decoding throughput achieved with a GPU-based implementation.
Abstract: Parallel implementations of turbo decoding have been studied extensively. Traditionally, the number of parallel sub-decoders is limited in order to keep the code block error rate degradation, caused by the edge effects of code block division, acceptable. In addition, the sub-decoders require synchronization to exchange information in the iterative process. In this paper, we propose loosening the synchronization between the sub-decoders to achieve higher utilization of parallel processor resources. Our method allows a high degree of parallel processor utilization in the decoding of a single code block, providing a scalable software-based implementation. The proposed implementation is demonstrated using a graphics processing unit. We achieve 122.8 Mbps decoding throughput using a mid-range GPU, the Nvidia GTX 480. This is, to the best of our knowledge, the fastest turbo decoding throughput achieved with a GPU-based implementation.
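
For contrast with what the paper relaxes, here is the conventional tightly-synchronized schedule in sketch form (kernel names and signatures are hypothetical, with stub bodies so the fragment compiles): every half-iteration is its own kernel launch, and the launch boundary acts as a global barrier at which all sub-decoders exchange extrinsic information. The paper's contribution is loosening exactly this barrier so sub-decoders need not wait for one another.

```cuda
#include <cuda_runtime.h>

// hypothetical kernels, stubbed so the schedule below compiles
__global__ void siso_half_iter(const float*, const float*, const float*,
                               float*, int) {}
__global__ void reorder(const float*, float*, const int*, int) {}

void turbo_decode_tight(const float* sys, const float* sys_il,
                        const float* par1, const float* par2,
                        float* ext_a, float* ext_a_il,
                        float* ext_b, float* ext_b_il,
                        const int* pi, const int* pi_inv,
                        int len, int nsb, int iters)
{
    for (int it = 0; it < iters; ++it) {
        // SISO decoder 1, then interleave its extrinsic output
        siso_half_iter<<<nsb, 128>>>(sys, par1, ext_b, ext_a, len);
        reorder<<<(len + 127) / 128, 128>>>(ext_a, ext_a_il, pi, len);
        // SISO decoder 2, then deinterleave back
        siso_half_iter<<<nsb, 128>>>(sys_il, par2, ext_a_il, ext_b_il, len);
        reorder<<<(len + 127) / 128, 128>>>(ext_b_il, ext_b, pi_inv, len);
        // same-stream launches are ordered: each boundary is a global barrier
    }
}
```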

14 citations


Cites background from "GPU Implementation of a Programmable Turbo Decoder for Software Defined Radio Applications"

  • ...General Purpose computing capable Graphics Processor Units (GPGPU) [7] often consist of a large number of parallel processing elements with reduced dynamic scheduling support....


Proceedings ArticleDOI
Adrien Cassagne, Olivier Aumage, Camille Leroux, Denis Barthou, Bertrand Le Gal
29 Aug 2016
TL;DR: This paper presents a new dynamic and fully generic implementation of a Successive Cancellation (SC) decoder, with multi-precision support and intra-/inter-frame strategy support, which is used to compare the different configurations in terms of throughput, latency and energy consumption.
Abstract: This paper presents a new dynamic and fully generic implementation of a Successive Cancellation (SC) decoder (multi-precision support and intra-/inter-frame strategy support). This fully generic SC decoder is used to compare the different configurations in terms of throughput, latency and energy consumption. Special emphasis is placed on the energy consumption of low-power embedded processors for software defined radio (SDR) systems. An N=4096, rate-1/2 software SC decoder consumes only 14 nJ per bit on an ARM Cortex-A57 core while achieving 65 Mbps. Some design guidelines are given for adapting the configuration to the application context.
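
As a quick sanity check of the reported figures (our arithmetic, not the paper's), energy per bit multiplied by information throughput gives the implied average core power:

```cuda
#include <cstdio>

int main()
{
    const double e_bit = 14e-9;  // reported energy per information bit, J
    const double tput  = 65e6;   // reported information throughput, bit/s
    std::printf("implied core power: %.2f W\n", e_bit * tput);  // ~0.91 W
    return 0;
}
```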

11 citations


Cites background from "GPU Implementation of a Programmable Turbo Decoder for Software Defined Radio Applications"

  • ...…dedicated hardware circuits on communication devices, the evolution of general purpose processors in terms of energy efficiency and parallelism (vector processing, number of cores,...) drives a growing interest for software ECC implementations (e.g. LDPC decoders [1]–[3], Turbo decoders [4], [5])....


Proceedings ArticleDOI
01 Jul 2013
TL;DR: This paper proposes a static multi-issue exposed-datapath processor design tailored for turbo decoding; it achieves over 63 Mbps turbo decoding throughput on a single low-power core and can also be used as a general-purpose OpenCL accelerator for arbitrary integer workloads.
Abstract: Turbo coding is commonly used in current wireless standards such as 3G and 4G. However, due to the high computational requirements, its software-defined implementation is challenging. This paper proposes a static multi-issue exposed-datapath processor design tailored for turbo decoding. In order to utilize the parallel processor datapath efficiently without resorting to low-level assembly programming, the turbo decoder is implemented using OpenCL, a parallel programming standard for heterogeneous devices. The proposed implementation includes only a small set of turbo-specific custom operations to accelerate the most critical parts of the algorithm. Most of the computation is performed using general-purpose integer operations; thus, the processor design can also be used as a general-purpose OpenCL accelerator for arbitrary integer workloads. The proposed processor design was evaluated both by implementing it on a Xilinx Virtex-6 FPGA and by ASIC synthesis using 130 nm and 40 nm technology libraries. The implementation achieves over 63 Mbps turbo decoding throughput on a single low-power core. According to the ASIC synthesis, the maximum operating clock frequency is 344 MHz (130 nm) and 1050 MHz (40 nm).
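
The division of labour the abstract describes, mostly general-purpose integer code plus a few hot custom operations, can be illustrated as follows (our sketch; the paper's actual custom-op set is not listed in this excerpt). The decoder isolates a primitive such as max* behind a small function that an exposed-datapath core can map to a single custom operation, while the code stays valid C for any other target:

```cuda
#include <cmath>

// max*(a, b) = log(exp(a) + exp(b)): the hot primitive of Log-MAP decoding.
// A custom functional unit would typically replace the correction term
// log(1 + exp(-|a - b|)) with a small lookup table.
inline float max_star(float a, float b)
{
    float m = a > b ? a : b;
    return m + std::log1p(std::exp(-std::fabs(a - b)));
}
```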

9 citations

References
Book
31 Dec 2012
TL;DR: Programming Massively Parallel Processors: A Hands-on Approach as discussed by the authors shows both student and professional alike the basic concepts of parallel programming and GPU architecture, and various techniques for constructing parallel programs are explored in detail.
Abstract: Programming Massively Parallel Processors: A Hands-on Approach shows both student and professional alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. This best-selling guide to CUDA and GPU parallel programming has been revised with more parallel programming examples, commonly used libraries such as Thrust, and explanations of the latest tools. With these improvements, the book retains its concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses. Updates in this new edition include:
  • New coverage of CUDA 5.0, improved performance, enhanced development tools, increased hardware support, and more
  • Increased coverage of related technology (OpenCL) and new material on algorithm patterns, GPU clusters, host programming, and data parallelism
  • Two new case studies (on MRI reconstruction and molecular visualization) that explore the latest applications of CUDA and GPUs for scientific research and high-performance computing
Table of Contents: 1. Introduction; 2. History of GPU Computing; 3. Introduction to Data Parallelism and CUDA C; 4. Data-Parallel Execution Model; 5. CUDA Memories; 6. Performance Considerations; 7. Floating-Point Considerations; 8. Parallel Patterns: Convolutions; 9. Parallel Patterns: Prefix Sum; 10. Parallel Patterns: Sparse Matrix-Vector Multiplication; 11. Application Case Study: Advanced MRI Reconstruction; 12. Application Case Study: Molecular Visualization and Analysis; 13. Parallel Programming and Computational Thinking; 14. An Introduction to OpenCL; 15. Parallel Programming with OpenACC; 16. Thrust: A Productivity-Oriented Library for CUDA; 17. CUDA FORTRAN; 18. An Introduction to C++ AMP; 19. Programming a Heterogeneous Computing Cluster; 20. CUDA Dynamic Parallelism; 21. Conclusions and Future Outlook; Appendix A: Matrix Multiplication Host-Only Version Source Code; Appendix B: GPU Compute Capabilities

1,594 citations

Journal ArticleDOI
TL;DR: This comprehensive text/reference provides a foundation for the understanding and implementation of the parallel programming skills needed to achieve breakthrough results by developing parallel applications that perform well on certain classes of graphics processing units (GPUs).
Abstract: Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu. ISBN: 978-0-12-381472-2, copyright 2010. Introduction: This book is designed for graduate/undergraduate students and practitioners from any science and engineering discipline who use computational power to further their field of research. This comprehensive text/reference provides a foundation for the understanding and implementation of the parallel programming skills needed to achieve breakthrough results by developing parallel applications that perform well on certain classes of graphics processing units (GPUs). The book guides the reader through programming in CUDA, an extension to the C language and a parallel programming environment supported on NVIDIA GPUs and emulated on less-parallel CPUs. Given that parallel programming on any high-performance computer is complex and requires knowledge of the underlying hardware in order to write efficient programs, the book's focus on particular hardware is an advantage over more general texts. The book takes the reader through a series of techniques for writing and optimizing parallel programs for several real-world applications; such experience opens the door to learning parallel programming in depth. Outline of the Book: Kirk and Hwu effectively organize and link a wide spectrum of parallel programming concepts by focusing on practical applications, in contrast to most general parallel programming texts, which are largely conceptual and theoretical. The authors are both affiliated with NVIDIA; Kirk is an NVIDIA Fellow, and Hwu is the principal investigator for the first NVIDIA CUDA Center of Excellence at the University of Illinois at Urbana-Champaign. Their coverage in the book can be divided into four sections. The first part (Chapters 1–3) starts by defining GPUs and their modern architectures and later provides a history of graphics pipelines and GPU computing. It also covers data parallelism, the basics of the CUDA memory/threading models, the CUDA extensions to the C language, and the basic programming/debugging tools. The second part (Chapters 4–7) builds the reader's programming skills by explaining the CUDA memory model and its types, strategies for reducing global memory traffic, the CUDA threading model and granularity (including thread scheduling and basic latency-hiding techniques), GPU hardware performance features, techniques for hiding latency in memory accesses, floating-point arithmetic, modern computer system architecture, and the common data-parallel programming patterns needed to develop high-performance parallel applications. The third part (Chapters 8–11) provides a broad range of parallel execution models and parallel programming principles, in addition to a brief introduction to OpenCL, along with a wide range of application case studies, such as advanced MRI reconstruction and molecular visualization and analysis. The last chapter (Chapter 12) discusses the great potential of future GPU architectures, with commentary on the evolution of memory architecture, kernel execution control, and programming environments. Summary: In general, this book is well written and well organized. Many difficult concepts in parallel computing are explained clearly, and beginners as well as advanced parallel programmers will benefit greatly. It provides a good starting point for beginning parallel programmers who can access a Tesla GPU.
The book targets specific hardware and evaluates performance on that hardware. As mentioned in the book, approximately 200 million CUDA-capable GPUs have been in active use, so many beginning parallel programmers are likely to have access to a Tesla GPU. The book also gives clear descriptions of the Tesla GPU architecture, which lays a solid foundation for both beginning and experienced parallel programmers, and it can serve as a good reference for advanced parallel computing courses. Jie Cheng, University of Hawaii Hilo

1,511 citations


"GPU Implementation of a Programmabl..." refers background in this paper

  • ...Table II showcases the speed up achieved using the GPU over an implementation done purely on the CPU for both Max-Log-MAP and Full Log-MAP implementations....


  • ...The GPU architecture differs significantly from that of a CPU [9]....


  • ...For a Max Log-MAP turbo decoder with 5 iterations, the GPU implementation with 96 parallel sub-blocks is more than an order of magnitude faster than the CPU implementation....


  • ...The C code run on the CPU is compiled using gcc with the -O3 optimization flag and is single threaded, i.e., it does not utilize any parallelism on multiple CPU cores....


  • ...More than an order of magnitude speed up over an implementation done purely on the CPU has been achieved....


Book
19 Jul 2010
TL;DR: CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology and details the techniques and trade-offs associated with each key CUDA feature.
Abstract: This book is required reading for anyone working with accelerator-based computing systems. From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory. CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required, just the ability to program in a modestly extended version of C. CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You'll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance. Major topics covered include: parallel programming; thread cooperation; constant memory and events; texture memory; graphics interoperability; atomics; streams; CUDA C on multiple GPUs; advanced atomics; and additional CUDA resources. All the CUDA software tools you'll need are freely available for download from NVIDIA: http://developer.nvidia.com/object/cuda-by-example.html

1,334 citations


"GPU Implementation of a Programmabl..." refers background in this paper

  • ...Four different kinds of device memories are presented to the programmer [10]....

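The excerpt above refers to the device memory kinds CUDA exposes to the programmer. The paper does not enumerate them in this excerpt, but global, shared, constant, and texture memory are the usual four, illustrated in this minimal sketch (launch with 128 threads per block):

```cuda
__constant__ float weights[8];   // constant memory: cached, read-only in kernels

__global__ void memory_kinds_demo(const float* __restrict__ in,  // global memory
                                  float* __restrict__ out, int n)
{
    __shared__ float tile[128];  // shared memory: per-block scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();             // every thread reaches the barrier, divergence-safe
    if (i < n)
        out[i] = tile[threadIdx.x] * weights[i & 7];
    // texture memory, the fourth kind, is read through texture objects and
    // provides cached (optionally filtered) loads; omitted for brevity
}
```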

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A simple analytical model is proposed that estimates the execution time of massively parallel programs by considering the number of running threads and the memory bandwidth; it estimates the cost of memory requests and thereby the overall execution time of a program.
Abstract: GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of the absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.
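
A grossly simplified flavour of the model (ours; the paper's model has many more terms and calibrated constants): the total latency of all memory requests is amortized by how many are in flight at once, the memory warp parallelism (MWP), and execution time is whichever of compute or memory dominates:

```cuda
#include <algorithm>

double estimated_cycles(double compute_cycles, double mem_requests,
                        double mem_latency_cycles, double mwp)
{
    // total memory latency, overlapped across mwp concurrent warp requests
    double memory_cycles = mem_requests * mem_latency_cycles / std::max(mwp, 1.0);
    return std::max(compute_cycles, memory_cycles);
}
```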

672 citations


"GPU Implementation of a Programmabl..." refers background in this paper

  • ...To obtain a high throughput on the GPU, an architecture aware [8] mapping of the algorithm is paramount....


Journal ArticleDOI
TL;DR: A class of deterministic interleavers for turbo codes (TCs) based on permutation polynomials over $\mathbb{Z}_N$ is introduced, which can be algebraically designed to fit a given component code.
Abstract: A class of deterministic interleavers for turbo codes (TCs) based on permutation polynomials over $\mathbb{Z}_N$ is introduced. The main characteristic of this class of interleavers is that they can be algebraically designed to fit a given component code. Moreover, since the interleaver can be generated by a few simple computations, storage of the interleaver tables can be avoided. By using the permutation polynomial-based interleavers, the design of the interleavers reduces to the selection of the coefficients of the polynomials. It is observed that the performance of the TCs using these permutation polynomial-based interleavers is usually dominated by a subset of input-weight-2m error events. The minimum distance and its multiplicity (or the first few spectrum lines) of this subset are used as the design criterion to select good permutation polynomials. A simple method to enumerate these error events for small m is presented. Searches for good interleavers are performed. The decoding performance of these interleavers is close to that of S-random interleavers for long frame sizes; for short frame sizes, the new interleavers outperform S-random interleavers.
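
The 3GPP LTE interleaver referenced by the citing paper (see the f1/f2 excerpt below) belongs to this class: a quadratic permutation polynomial pi(i) = (f1*i + f2*i^2) mod N. Direct evaluation is a few multiplies, and because pi(i+1) - pi(i) follows a simple linear recurrence mod N, the sequence can also be generated incrementally with two adders and no stored table, which is the storage saving the abstract mentions:

```cuda
#include <cstdint>

// direct evaluation; 64-bit intermediates suffice since N <= 6144 in LTE,
// so f2 * i * i < 2^38
uint32_t qpp(uint32_t i, uint32_t f1, uint32_t f2, uint32_t N)
{
    return uint32_t((uint64_t(f1) * i + uint64_t(f2) * i * i) % N);
}
```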

285 citations


"GPU Implementation of a Programmabl..." refers background in this paper

  • ...where f1 and f2 satisfy several properties detailed in [6]....
