Author

Jun Lin

Other affiliations: Lehigh University
Bio: Jun Lin is an academic researcher from Nanjing University. The author has contributed to research in topics including decoding methods and low-density parity-check (LDPC) codes. The author has an h-index of 20 and has co-authored 165 publications receiving 1472 citations. Previous affiliations of Jun Lin include Lehigh University.


Papers
Journal ArticleDOI
TL;DR: An efficient selective computation algorithm, which entirely avoids the sorting process, is proposed for Min-Max decoding, and an efficient VLSI architecture for a nonbinary Min-Max decoder is presented.
Abstract: Low-density parity-check (LDPC) codes constructed over the Galois field GF(q), which are also called nonbinary LDPC codes, are an extension of binary LDPC codes with significantly better performance. Although various kinds of low-complexity quasi-optimal iterative decoding algorithms have been proposed, the VLSI implementation of nonbinary LDPC decoders has rarely been discussed due to their hardware-unfriendly properties. In this brief, an efficient selective computation algorithm, which entirely avoids the sorting process, is proposed for Min-Max decoding. In addition, an efficient VLSI architecture for a nonbinary Min-Max decoder is presented. Synthesis results are given to demonstrate the efficiency of the proposed techniques.
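For context on the decoding rule that the selective computation targets, here is a minimal Python sketch of the baseline Min-Max check-node update (the minimum, over symbol configurations satisfying the check, of the maximum incoming reliability). It is a brute-force reference, not the paper's selective algorithm, and it assumes a prime field size q so that GF(q) arithmetic reduces to arithmetic modulo q; names and calling conventions are illustrative.

```python
import itertools
import numpy as np

def min_max_check_node(Q, h, q):
    """Brute-force Min-Max check-node update for one check over GF(q), q prime.

    Q: (d_c, q) array of incoming reliabilities (smaller = more reliable).
    h: length-d_c list of nonzero check coefficients.
    Returns R: (d_c, q) array of outgoing messages.
    """
    d_c = len(h)
    R = np.full((d_c, q), np.inf)
    for n in range(d_c):
        others = [j for j in range(d_c) if j != n]
        # enumerate all configurations of the other symbols (q^(d_c-1) of them)
        for conf in itertools.product(range(q), repeat=d_c - 1):
            # symbol on edge n forced by the parity check: sum_j h_j * a_j = 0 (mod q)
            s = sum(int(h[j]) * a for j, a in zip(others, conf)) % q
            a_n = (-s * pow(int(h[n]), -1, q)) % q
            # min-max rule: keep the smallest "worst" incoming reliability
            worst = max(Q[j][a] for j, a in zip(others, conf))
            R[n][a_n] = min(R[n][a_n], worst)
    return R
```

Practical decoders restrict this search to a few most reliable symbols per edge, which is typically where the sorting step that the brief avoids would appear.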

157 citations

Journal ArticleDOI
TL;DR: The theoretical derivation of parallel fast finite impulse response algorithm (FFA) is introduced and the corresponding fast convolution units (FCUs) are developed for the computation of convolutions in the CNN models.
Abstract: The convolutional neural network (CNN) is the state-of-the-art deep learning approach employed in various applications. Real-time CNN implementations on resource-limited embedded systems have recently become highly desirable. To ensure programmable flexibility and shorten the development period, field-programmable gate arrays (FPGAs) are appropriate for implementing CNN models. However, limited bandwidth and on-chip memory storage are the bottlenecks of CNN acceleration. In this paper, we propose efficient hardware architectures to accelerate deep CNN models. The theoretical derivation of the parallel fast finite impulse response algorithm (FFA) is introduced. Based on FFAs, the corresponding fast convolution units (FCUs) are developed for the computation of convolutions in the CNN models. Novel data storage and reuse schemes are proposed, in which all intermediate pixels are stored on-chip and the bandwidth requirement is reduced. We choose one of the largest and most accurate networks, VGG16, and implement it on Xilinx Zynq ZC706 and Virtex VC707 boards, respectively. We achieve a top-5 accuracy of 86.25% using an equal-distance non-uniform quantization method. The estimated average performances are 316.23 GOP/s at a 172-MHz working frequency on the Xilinx ZC706 and 1250.21 GOP/s at a 170-MHz working frequency on the VC707, respectively. In brief, the proposed design outperforms existing works significantly, in particular surpassing related designs by more than two times in terms of resource efficiency.
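As a minimal illustration of the fast FIR algorithm idea behind the FCUs, the NumPy sketch below implements the classic 2-parallel FFA, which produces two output phases from three sub-convolutions instead of four. It only demonstrates the arithmetic sharing; it does not model the paper's hardware FCUs, quantization, or data-reuse scheme, and the function name is an assumption.

```python
import numpy as np

def ffa2_fir(x, h):
    """2-parallel fast FIR algorithm: 3 sub-convolutions instead of 4.

    Assumes len(x) and len(h) are even so the even/odd phases have equal length.
    """
    x0, x1 = x[0::2], x[1::2]            # even / odd input phases
    h0, h1 = h[0::2], h[1::2]            # even / odd filter phases
    a = np.convolve(h0, x0)
    b = np.convolve(h1, x1)
    c = np.convolve(h0 + h1, x0 + x1)    # the shared "fast" sub-filter
    y0 = np.zeros(len(a) + 1)
    y0[:-1] += a                         # y[2n]   = (h0*x0)[n] + (h1*x1)[n-1]
    y0[1:] += b
    y1 = c - a - b                       # y[2n+1] = (h0*x1)[n] + (h1*x0)[n]
    y = np.empty(len(y0) + len(y1))
    y[0::2], y[1::2] = y0, y1            # interleave the two output phases
    return y

# sanity check against direct convolution
x, h = np.random.randn(64), np.random.randn(8)
assert np.allclose(ffa2_fir(x, h), np.convolve(h, x))
```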

101 citations

Proceedings ArticleDOI
Meiqi Wang, Siyuan Lu, Danyang Zhu, Jun Lin, Zhongfeng Wang
01 Oct 2018
TL;DR: This paper presents an efficient hardware implementation of the softmax function, coded in a hardware description language (HDL) and synthesized under TSMC 28-nm CMOS technology; synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data.
Abstract: Recently, significant improvement has been achieved in hardware architecture design for deep neural networks (DNNs). However, the hardware implementation of the widely used softmax function in DNNs has not been much investigated; it involves expensive division and exponentiation units. This paper presents an efficient hardware implementation of the softmax function. Mathematical transformations and linear fitting are used to simplify the function. Multiple algorithmic strength-reduction strategies and fast addition methods are employed to optimize the architecture. With these techniques, complicated logic units such as multipliers are eliminated and memory consumption is largely reduced, while the accuracy loss is negligible. The proposed design is coded in a hardware description language (HDL) and synthesized under TSMC 28-nm CMOS technology. Synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data. A power efficiency of 463.04 Gb/(mm²·mW) is achieved, and the design costs only 0.015 mm² of area resources. To the best of our knowledge, this is the first work on an efficient hardware implementation of softmax in the open literature.
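The kind of mathematical transformation mentioned above can be seen in the standard log-sum-exp form of softmax, which turns the per-element division into a subtraction in the log domain; hardware designs then approximate the remaining exp/log, for example with base-2 decompositions and linear fitting. The sketch below shows only this mathematical starting point in floating point, not the paper's fixed-point architecture; names are illustrative.

```python
import numpy as np

def log_domain_softmax(x):
    """Softmax via the log-sum-exp transformation.

    softmax(x)_i = exp(x_i - max(x) - log(sum_j exp(x_j - max(x)))),
    so the division becomes a subtraction in the log domain.
    """
    x = np.asarray(x, dtype=np.float64)
    z = x - x.max()                   # max subtraction bounds the dynamic range
    lse = np.log(np.exp(z).sum())     # hardware approximates exp/log, e.g. in base 2
    return np.exp(z - lse)

p = log_domain_softmax([1.0, 2.0, 3.0])
assert np.isclose(p.sum(), 1.0)
```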

84 citations

Journal ArticleDOI
Jun Lin, Zhiyuan Yan
TL;DR: In this article, the authors propose an efficient list decoder architecture for the CRC-aided SCL algorithm, based on both algorithmic reformulations and architectural techniques, which achieves 1.24 to 1.83 times the area efficiency of existing list decoders.
Abstract: Long polar codes can achieve the symmetric capacity of arbitrary binary-input discrete memoryless channels under a low-complexity successive cancellation (SC) decoding algorithm. However, for polar codes with short and moderate code lengths, the decoding performance of the SC algorithm is inferior. The cyclic-redundancy-check (CRC)-aided SC-list (SCL) decoding algorithm has better error performance than the SC algorithm for short or moderate polar codes. In this paper, we propose an efficient list decoder architecture for the CRC-aided SCL algorithm, based on both algorithmic reformulations and architectural techniques. In particular, an area-efficient message memory architecture is proposed to reduce the area of the proposed decoder architecture. An efficient path pruning unit suitable for large list sizes is also proposed. For a polar code of length 1024 and rate 1/2, with list sizes $L=2$ and 4, the proposed list decoder architecture is implemented under a Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS technology. Compared with the list decoders in the literature, our decoder achieves 1.24–1.83 times the area efficiency.
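The CRC-aided part of the algorithm amounts to a selection rule at the end of list decoding: among the L surviving paths, output the best-metric path whose CRC checks, falling back to the overall best path if none passes. A minimal sketch of that rule follows; the list decoder that produces the candidates is not shown, and `paths`, `metrics`, and `crc_check` are placeholder names.

```python
def crc_aided_select(paths, metrics, crc_check):
    """Pick the decoding result from L candidate paths (CA-SCL selection rule).

    paths: list of candidate bit sequences, metrics: their path metrics
    (smaller = more likely), crc_check: callable returning True if the CRC passes.
    """
    order = sorted(range(len(paths)), key=lambda i: metrics[i])
    for i in order:
        if crc_check(paths[i]):        # most likely path that satisfies the CRC
            return paths[i]
    return paths[order[0]]             # fall back to the most likely path overall
```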

83 citations

Journal ArticleDOI
TL;DR: A novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), is presented to tackle implementation challenges; it can reduce memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks.
Abstract: Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks due to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method can reduce memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks. An efficient scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix–vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a given latency requirement. Specifically, for the circulant matrix–vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit the chessboard division and minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled in register transfer language (RTL) and synthesized under TSMC 90-nm CMOS technology. With 518.5 kB of on-chip memory, we are able to process a $512 \times 512$ compressed LSTM in 1.71 µs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77-mm² chip area. The implementation results demonstrate that the proposed design achieves significantly high flexibility and area efficiency, which satisfies many real-time applications on embedded devices. It is worth mentioning that the memory-efficient approach to accelerating LSTMs developed in this paper is also applicable to other RNN variants.
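The saving from circulant matrices is easy to see in isolation: a circulant block is fully described by its first column (n weights instead of n²) and can be applied with FFTs in O(n log n). The NumPy sketch below demonstrates that property; it does not model the paper's chessboard division, quantization, or the RTL design, and the names are illustrative.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by first column c with vector x,
    using the FFT identity C @ x = ifft(fft(c) * fft(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# sanity check against the explicit n x n circulant matrix
n = 8
c, x = np.random.randn(n), np.random.randn(n)
C = np.array([np.roll(c, k) for k in range(n)]).T   # column k is c rolled down by k
assert np.allclose(C @ x, circulant_matvec(c, x))
```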

79 citations


Cited by
Journal ArticleDOI
TL;DR: An LLR-based formulation of the successive cancellation list (SCL) decoder is presented, which leads to a more efficient hardware implementation of the decoder than the known log-likelihood-based implementation.
Abstract: We show that successive cancellation list decoding can be formulated exclusively using log-likelihood ratios. In addition to numerical stability, the log-likelihood-ratio-based formulation has useful properties that simplify the sorting step involved in successive cancellation list decoding. We propose a hardware architecture for the successive cancellation list decoder in the log-likelihood-ratio domain which, compared with a log-likelihood-domain implementation, requires less irregular and smaller memories. This simplification, together with the gains in the metric sorter, leads to $56\%$ to $137\%$ higher throughput per unit area than other recently proposed architectures. We then evaluate the empirical performance of the CRC-aided successive cancellation list decoder at different list sizes using different CRCs, and conclude that it is important to adapt the CRC length to the list size in order to achieve the best error-rate performance of concatenated polar codes. Finally, we synthesize conventional successive cancellation decoders at large block lengths with the same block-error probability as our proposed CRC-aided successive cancellation list decoders to demonstrate that, while our decoders have slightly lower throughput and larger area, they have a significantly smaller decoding latency.
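The heart of the LLR-based formulation is the path-metric update applied at every bit decision. The sketch below gives the exact update together with the common hardware-friendly approximation (add |L| only when the hypothesized bit disagrees with the sign of the LLR); it mirrors the formulation summarized in this abstract, but the function name and calling convention are assumptions.

```python
import numpy as np

def path_metric_update(pm, llr, u):
    """LLR-domain path-metric update for one bit decision in SC list decoding.

    pm: current path metric, llr: decision LLR of the current bit,
    u: hypothesized bit value (0 or 1). Smaller metric = more likely path.
    """
    exact = pm + np.log1p(np.exp(-(1 - 2 * u) * llr))
    hard = 0 if llr >= 0 else 1                  # hard decision implied by the LLR
    approx = pm + (abs(llr) if u != hard else 0.0)
    return exact, approx
```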

541 citations

Journal ArticleDOI
20 Mar 2020
TL;DR: This article reviews the mainstream compression approaches, such as compact models, tensor decomposition, data quantization, and network sparsification, answers the question of how to leverage these methods in the design of neural network accelerators, and presents the state-of-the-art hardware architectures.
Abstract: Domain-specific hardware is becoming a promising topic against the backdrop of slowing improvement for general-purpose processors due to the foreseeable end of Moore's Law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain, witnessing successful applications in a wide spectrum of artificial intelligence (AI) tasks. The incomparable accuracy of DNNs is achieved at the cost of heavy memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. Therefore, the DNN compression concept was naturally proposed and widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up to pursue a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators for gaining extremely high performance. However, the number of related works is huge and the reported approaches are quite divergent. This research chaos motivates us to provide a comprehensive survey of the recent advances toward the goal of efficient compression and execution of DNNs without significantly compromising accuracy, involving both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches such as compact models, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. In the end, we discuss several existing issues such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and we give promising topics in this field and the possible challenges as well. This article attempts to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate the various methods, and confidently get started in the right way.
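As a concrete instance of one technique the survey covers, the sketch below applies symmetric uniform post-training quantization to a weight tensor. It is a generic illustration of data quantization, not a scheme taken from the article, and the function name is illustrative.

```python
import numpy as np

def uniform_quantize(w, bits=8):
    """Symmetric uniform quantization: map weights to signed integers of the
    given bit width and return the dequantized tensor plus the scale factor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)    # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, scale

w = np.random.randn(4, 4).astype(np.float32)
w_hat, s = uniform_quantize(w, bits=8)
print("max quantization error:", np.abs(w - w_hat).max())   # bounded by ~scale/2
```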

499 citations

Journal ArticleDOI
TL;DR: A convolutional neural network architecture is proposed in which the neural network is divided into hardware and software parts to increase performance and reduce the cost of implementation resources.

308 citations

Journal ArticleDOI
TL;DR: The benefits that cloud computing offers for fifth-generation (5G) mobile networks are explored and the implications on the signal processing algorithms are investigated.
Abstract: Cloud computing draws significant attention in the information technology (IT) community as it provides ubiquitous on-demand access to a shared pool of configurable computing resources with minimal management effort. It is also gaining impact in the communication technology (CT) community and is currently discussed as an enabler for flexible, cost-efficient, and more powerful mobile network implementations. Although centralized baseband pools are already being investigated for the radio access network (RAN) to allow for efficient resource usage and advanced multicell algorithms, these technologies still require dedicated hardware and do not offer the same characteristics as cloud-computing platforms, i.e., on-demand provisioning, virtualization, resource pooling, elasticity, service metering, and multitenancy. However, these properties of cloud computing are key enablers for future mobile communication systems characterized by an ultradense deployment of radio access points (RAPs), leading to severe multicell interference in combination with a significant increase in the number of access nodes and huge fluctuations of the rate requirements over time. In this article, we explore the benefits that cloud computing offers for fifth-generation (5G) mobile networks and investigate the implications for the signal processing algorithms.

272 citations