Proceedings ArticleDOI

Non-uniform DFT implementation for channel simulations in GPU

TL;DR: A parallel scan based method is proposed to speed up channel simulation in wireless link-level OFDM network simulators without restricting the scope of the simulations, and the DFT properties in the scan method are utilized to reduce register usage and hence the computation overhead of sine and cosine values.
Abstract: Channel simulation in wireless link-level OFDM network simulators involves a computationally intensive non-uniform discrete Fourier transform. In this paper, we propose a parallel scan based method to speed up this computation in GPU without restricting the scope of the simulations. We further utilize the DFT properties in the scan method to reduce register usage and hence the computation overhead of sine and cosine values. This technique is compared against a method that saves computation by using uniform power delay profiles at the cost of generality, and we show that the performance is competitive. For a single DFT, up to 19x speedup over a CPU implementation is observed using the scan based approach. For a simulation with 512 channels and a 1024 point DFT, the scan method gives a speedup of 141x with respect to the CPU, which compares favourably to the more restrictive uniform PDP method.
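The computation at the core of the paper is the channel frequency response of a tapped delay line, which is a one-dimensional non-uniform DFT: H(f_k) = Σ_l g_l · e^(−j2π f_k τ_l), where the tap delays τ_l need not fall on a uniform grid. The following NumPy sketch shows the direct (matrix-vector) form of this computation; the tap gains, delays, and subcarrier spacing are illustrative assumptions, not values from the paper, and the scan-based GPU optimisation described in the abstract is not reproduced here.

```python
import numpy as np

def nudft_channel_response(gains, delays, freqs):
    """Non-uniform DFT of a tapped-delay-line channel.

    H[k] = sum_l gains[l] * exp(-j * 2*pi * freqs[k] * delays[l])

    gains  : complex tap gains g_l
    delays : tap delays tau_l in seconds (need not lie on a uniform grid)
    freqs  : subcarrier frequencies f_k in Hz
    """
    # Outer product of frequencies and delays gives the phase matrix; on a GPU,
    # each (k, l) phase term could be handled by an independent thread.
    phase = -2j * np.pi * np.outer(freqs, delays)
    return np.exp(phase) @ gains

# Illustrative example (all values made up for demonstration only)
rng = np.random.default_rng(0)
delays = np.array([0.0, 0.3e-6, 1.1e-6, 2.4e-6])   # non-uniform tap delays
gains = (rng.standard_normal(4) + 1j * rng.standard_normal(4)) / np.sqrt(2)
freqs = np.arange(1024) * 15e3                      # 1024 subcarriers, 15 kHz spacing
H = nudft_channel_response(gains, delays, freqs)    # shape (1024,)
```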
Citations
Proceedings ArticleDOI
01 Aug 2017
TL;DR: The frequency domain representation of the commonly accepted Tapped Delay Line (TDL) model is discussed and three transformation algorithms are evaluated in order to develop an efficient 3GPP compliant method for simulating multiple independent fading radio channels in a software defined E-UTRAN traffic generator.
Abstract: The purpose of the study is to develop an efficient 3GPP compliant method to simulate multiple independent fading radio channels in a software defined Evolved Universal Terrestrial Radio Access Network (E-UTRAN) traffic generator. In this paper, the frequency domain representation of the commonly accepted Tapped Delay Line (TDL) model is discussed and three transformation algorithms are evaluated. The effects of the multipath fading channel are applied to the signal at the level of the Orthogonal Frequency Division Multiplexing (OFDM) transmitter prior to the IFFT stage. Models 0 and 1 are based on the Digital Fourier Transform (DFT) of the TDL with and without consideration of the Intercarrier Interference (ICI) phenomenon. Model 2 is a novel method that extends the quasi-stationary model with a low-cost linear approximation of ICI applied directly in the frequency domain in order to gain overall accuracy with small computational effort. When limiting the ICI term to 16 neighboring subcarriers, Model 2 exhibits a 12 dB SNR improvement compared to the stationary model and offers an execution time advantage compared to the TDL model when the number of terminals sharing radio resources is high.
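In its quasi-stationary form (ICI neglected), the frequency-domain TDL model described above amounts to multiplying each subcarrier of the OFDM symbol by the channel frequency response before the IFFT stage. A minimal NumPy sketch of that step follows; the subcarrier count, spacing, and tap values are assumptions for illustration, and the linear ICI approximation of Model 2 is not included.

```python
import numpy as np

def apply_tdl_frequency_domain(X, gains, delays, subcarrier_spacing):
    """Multiply one OFDM symbol by the TDL channel frequency response
    (quasi-stationary model: Doppler variation within the symbol and ICI
    are neglected).

    X      : frequency-domain symbols, one per subcarrier
    gains  : complex tap gains of the TDL
    delays : tap delays in seconds
    """
    n = X.shape[0]
    freqs = np.arange(n) * subcarrier_spacing
    # Channel frequency response at each subcarrier: a non-uniform DFT of the taps
    H = np.exp(-2j * np.pi * np.outer(freqs, delays)) @ gains
    return H * X  # applied at the transmitter, prior to the IFFT stage

# Illustrative use with made-up parameters: QPSK on 1024 subcarriers, 15 kHz spacing
rng = np.random.default_rng(1)
X = (rng.choice([-1.0, 1.0], 1024) + 1j * rng.choice([-1.0, 1.0], 1024)) / np.sqrt(2)
gains = np.array([0.8, 0.5 + 0.2j, 0.3j])
delays = np.array([0.0, 0.5e-6, 1.6e-6])
Y = apply_tdl_frequency_domain(X, gains, delays, 15e3)
```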

6 citations


Cites background or methods from "Non-uniform DFT implementation for ..."

  • ...Equation (8) may be reduced to 1-D non-uniform DFT to compute the vector of diagonal elements [11]:...

    [...]

  • ...A non-uniform DFT based approach was used in [11] to implement frequency-domain channel simulator in GPU....

    [...]

Proceedings ArticleDOI
01 Sep 2017
TL;DR: The research shows that direct frequency domain linear ICI approximation, represented by the frequency domain model 2, offers good accuracy in terms of ICI synthesis for the most practical Doppler frequencies and simultaneously requires fewer operations if the number of simulated UEs is large.
Abstract: This paper presents a recent result of the study on an efficient 3GPP compliant method to simulate multiple independent multipath fading radio channels in a multi-UE (User Equipment) software defined Evolved Universal Terrestrial Radio Access Network (E-UTRAN) traffic generator. In the previous research [1], the authors discussed Digital Fourier Transform (DFT) based frequency domain simulation models of a multipath fading channel, including aspects of terminal mobility and Intercarrier Interference (ICI). In this paper, previously published simulation results, showing the accuracy and efficiency of the models, are confirmed using mathematical analysis of the ICI phenomenon, considering the nature of the Orthogonal Frequency Division Multiple Access (OFDMA) schemes adopted by the 3GPP Long Term Evolution (LTE) standard. The complexity of the frequency domain models is evaluated in detail using Big O notation and compared against the time domain approach. The research shows that direct frequency domain linear ICI approximation, represented by frequency domain model 2, offers good accuracy in terms of ICI synthesis for the most practical Doppler frequencies and simultaneously requires fewer operations if the number of simulated UEs is large.

1 citation


Cites background from "Non-uniform DFT implementation for ..."

  • ...Comparing to other similar existing frameworks [3]–[5], proposed solution takes into consideration consequences of wireless channel non-stationarity and practical aspects of multiuser scenarios....

    [...]

References
Proceedings ArticleDOI
25 Jul 1995
TL;DR: The authors present the MMSE and LS estimators and a method for modifications compromising between complexity and performance, and the symbol error rate for a 16-QAM system is presented by means of simulation results.
Abstract: The use of multi-amplitude signaling schemes in wireless OFDM systems requires the tracking of the fading radio channel. The paper addresses channel estimation based on time-domain channel statistics. Using a general model for a slowly fading channel, the authors present the MMSE and LS estimators and a method for modifications compromising between complexity and performance. The symbol error rate for a 16-QAM system is presented by means of simulation results. Depending upon estimator complexity, up to 4 dB in SNR can be gained over the LS estimator.
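For context, the LS estimate divides the received pilot symbols by the known transmitted symbols, while the MMSE (Wiener) estimator smooths that estimate using the channel correlation and the noise level. The sketch below shows a simplified textbook form of both estimators, assuming the channel frequency-correlation matrix R_hh is known; it is not the specific low-complexity modification proposed in the cited paper, and all numerical values are illustrative.

```python
import numpy as np

def ls_estimate(Y, X):
    """Least-squares channel estimate at known (pilot) positions: H_ls = Y / X."""
    return Y / X

def lmmse_estimate(H_ls, R_hh, noise_var, signal_power=1.0):
    """Simplified linear MMSE smoothing of the LS estimate:

        H_mmse = R_hh (R_hh + (noise_var / signal_power) I)^{-1} H_ls

    R_hh is the channel frequency-correlation matrix, assumed known or modelled.
    """
    n = R_hh.shape[0]
    A = R_hh + (noise_var / signal_power) * np.eye(n)
    return R_hh @ np.linalg.solve(A, H_ls)

# Toy usage (purely illustrative): a two-tap channel observed in noise at 64 pilots,
# smoothed with an assumed exponential correlation model.
n = 64
k = np.arange(n)
H_true = np.exp(-2j * np.pi * 3 * k / n) + 0.5 * np.exp(-2j * np.pi * 7 * k / n)
X = np.ones(n, dtype=complex)                      # known pilot symbols
noise_var = 0.1
Y = H_true * X + np.sqrt(noise_var / 2) * (np.random.randn(n) + 1j * np.random.randn(n))
R_hh = np.array([[np.exp(-abs(i - j) / 8.0) for j in range(n)] for i in range(n)], dtype=complex)
H_hat = lmmse_estimate(ls_estimate(Y, X), R_hh, noise_var)
```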

1,647 citations


"Non-uniform DFT implementation for ..." refers methods in this paper

  • ...The baseband OFDM system [3] is shown in Figure 1, where x is the transmitted symbol, g(t) is the channel impulse response, ñ(t) is additive white Gaussian noise and y is the received symbol....

    [...]

Book
31 Dec 2012
TL;DR: Programming Massively Parallel Processors: A Hands-on Approach introduces both students and professionals to the basic concepts of parallel programming and GPU architecture, and explores in detail various techniques for constructing parallel programs.
Abstract: Programming Massively Parallel Processors: A Hands-on Approach shows both student and professional alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. This best-selling guide to CUDA and GPU parallel programming has been revised with more parallel programming examples, commonly-used libraries such as Thrust, and explanations of the latest tools. With these improvements, the book retains its concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses.

Updates in this new edition include:

  • New coverage of CUDA 5.0, improved performance, enhanced development tools, increased hardware support, and more

  • Increased coverage of related technology, OpenCL, and new material on algorithm patterns, GPU clusters, host programming, and data parallelism

  • Two new case studies (on MRI reconstruction and molecular visualization) that explore the latest applications of CUDA and GPUs for scientific research and high-performance computing

Table of Contents: 1 Introduction; 2 History of GPU Computing; 3 Introduction to Data Parallelism and CUDA C; 4 Data-Parallel Execution Model; 5 CUDA Memories; 6 Performance Considerations; 7 Floating-Point Considerations; 8 Parallel Patterns: Convolution; 9 Parallel Patterns: Prefix Sum; 10 Parallel Patterns: Sparse Matrix-Vector Multiplication; 11 Application Case Study: Advanced MRI Reconstruction; 12 Application Case Study: Molecular Visualization and Analysis; 13 Parallel Programming and Computational Thinking; 14 An Introduction to OpenCL; 15 Parallel Programming with OpenACC; 16 Thrust: A Productivity-Oriented Library for CUDA; 17 CUDA FORTRAN; 18 An Introduction to C++ AMP; 19 Programming a Heterogeneous Computing Cluster; 20 CUDA Dynamic Parallelism; 21 Conclusions and Future Outlook; Appendix A: Matrix Multiplication Host-Only Version Source Code; Appendix B: GPU Compute Capabilities
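The parallel prefix-sum (scan) pattern covered in this book is what the paper under discussion cites for generating the rows of the twiddle-factor matrix (e^(−jθ), e^(−j2θ), ...) from a single sine/cosine evaluation per row. A minimal NumPy sketch of that idea is below, with np.cumprod standing in for the GPU scan kernel; it illustrates the pattern only and is not the paper's CUDA implementation.

```python
import numpy as np

def twiddle_row(theta, n):
    """Return e^{-j*theta}, e^{-j*2*theta}, ..., e^{-j*n*theta} via a prefix product."""
    step = np.exp(-1j * theta)           # the only transcendental evaluation for the row
    # Inclusive scan (prefix product) of a constant array of `step`;
    # on a GPU this maps onto a work-efficient parallel scan with complex multiplies.
    return np.cumprod(np.full(n, step))

row = twiddle_row(0.01, 8)
# Reference: direct evaluation with one complex exponential per element
ref = np.exp(-1j * 0.01 * np.arange(1, 9))
assert np.allclose(row, ref)
```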

1,594 citations

Journal ArticleDOI
TL;DR: This comprehensive text/reference provides a foundation for the understanding and implementation of parallel programming skills which are needed to achieve breakthrough results by developing parallel applications that perform well on certain classes of Graphic Processor Units (GPUs).
Abstract: Programming Massively Parallel Processors: A Hands-on Approach, David Kirk and Wen-mei Hwu, ISBN: 978-0-12-381472-2, Copyright 2010.

Introduction: This book is designed for graduate/undergraduate students and practitioners from any science and engineering discipline who use computational power to further their field of research. This comprehensive text/reference provides a foundation for the understanding and implementation of parallel programming skills which are needed to achieve breakthrough results by developing parallel applications that perform well on certain classes of Graphic Processor Units (GPUs). The book guides the reader to experience programming by using an extension to the C language, CUDA, which is a parallel programming environment supported on NVIDIA GPUs and emulated on less parallel CPUs. Given that parallel programming on any high performance computer is complex and requires knowledge about the underlying hardware in order to write an efficient program, it becomes an advantage of this book over others to be specific toward a particular hardware. The book takes the reader through a series of techniques for writing and optimizing parallel programs for several real-world applications. Such experience opens the door for the reader to learn parallel programming in depth.

Outline of the Book: Kirk and Hwu effectively organize and link a wide spectrum of parallel programming concepts by focusing on practical applications, in contrast to most general parallel programming texts that are mostly conceptual and theoretical. The authors are both affiliated with NVIDIA; Kirk is an NVIDIA Fellow and Hwu is principal investigator for the first NVIDIA CUDA Center of Excellence at the University of Illinois at Urbana-Champaign. Their coverage in the book can be divided into four sections. The first part (Chapters 1–3) starts by defining GPUs and their modern architectures and later provides a history of graphics pipelines and GPU computing. It also covers data parallelism, the basics of the CUDA memory/threading models, the CUDA extensions to the C language, and the basic programming/debugging tools. The second part (Chapters 4–7) enhances student programming skills by explaining the CUDA memory model and its types, strategies for reducing global memory traffic, the CUDA threading model and granularity (which include thread scheduling and basic latency hiding techniques), GPU hardware performance features, techniques to hide latency in memory accesses, floating point arithmetic, modern computer system architecture, and the common data-parallel programming patterns needed to develop a high-performance parallel application. The third part (Chapters 8–11) provides a broad range of parallel execution models and parallel programming principles, in addition to a brief introduction to OpenCL. It also includes a wide range of application case studies, such as advanced MRI reconstruction and molecular visualization and analysis. The last chapter (Chapter 12) discusses the great potential for future architectures of GPUs. It provides commentary on the evolution of memory architecture, kernel execution control, and programming environments.

Summary: In general, this book is well-written and well-organized. A lot of difficult concepts related to parallel computing areas are easily explained, from which beginners or even advanced parallel programmers will benefit greatly. It provides a good starting point for beginning parallel programmers who can access a Tesla GPU.
The book targets specific hardware and evaluates performance based on this specific hardware. As mentioned in the book, approximately 200 million CUDA-capable GPUs have been actively in use; therefore, the chances are that a lot of beginning parallel programmers can have access to a Tesla GPU. Also, this book gives clear descriptions of the Tesla GPU architecture, which lays a solid foundation for both beginning and experienced parallel programmers. The book can also serve as a good reference book for advanced parallel computing courses.

Jie Cheng, University of Hawaii Hilo

1,511 citations


"Non-uniform DFT implementation for ..." refers methods in this paper

  • ...A second kernel uses parallel scan method [8] to get all the rows of the twiddle factor matrices (e^(−j1θ), e^(−j2θ), ....

    [...]

Journal ArticleDOI
01 Jan 2001
TL;DR: The automatically tuned linear algebra software (ATLAS) project is described, as well as the fundamental principles that underlie it, with the present emphasis on the basic linear algebra subprograms (BLAS), a widely used, performance-critical, linear algebra kernel library.
Abstract: This paper describes the automatically tuned linear algebra software (ATLAS) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term automated empirical optimization of software (AEOS); this style of library management has been created in order to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the basic linear algebra subprograms (BLAS), a widely used, performance-critical, linear algebra kernel library.

1,302 citations

01 Jan 2000
TL;DR: This paper describes the ATLAS (Automatically Tuned Linear Algebra Software) project, as well as the fundamental principles that underlie it, with the present emphasis on the Basic Linear Algebra Subprograms (BLAS), a widely used, performance-critical, linear algebra kernel library.
Abstract: This paper describes the ATLAS (Automatically Tuned Linear Algebra Software) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term AEOS (Automated Empirical Optimization of Software); this style of library management has been created in order to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the Basic Linear Algebra Subprograms (BLAS), a widely used, performance-critical, linear algebra kernel library.

994 citations


"Non-uniform DFT implementation for ..." refers methods in this paper

  • ...A note on the CPU implementations that we use for comparison: ATLAS[10] is a library that can be tuned for optimum performance on CPUs, and makes use of some of the parallel features available on modern CPUs....

    [...]

  • ...However, the optimizations done by ATLAS lead to variations that are very sensitive to the size of the number of taps and similar parameters....

    [...]

  • ...The CPU code with the ATLAS library gives 2.2x and 3.9x speedup for the scan and time-shift methods respectively compared to its single-threaded counterpart. (Footnote 2: We have restricted to a single-threaded CPU implementation with the -O3 compiler optimisation switch: the system has obvious data parallelism across multiple user/channels that can be accounted for separately.)...

    [...]

  • ...For consistency, we have compared our GPU implementation against the regular CPU implementations with normal compiler optimizations, with the understanding that a further speedup may be possible using ATLAS, but this does not fundamentally change the observations....

    [...]