Author

Jun Lin

Other affiliations: Lehigh University
Bio: Jun Lin is an academic researcher from Nanjing University. The author has contributed to research in topics including decoding methods and low-density parity-check (LDPC) codes. The author has an h-index of 20 and has co-authored 165 publications receiving 1472 citations. Previous affiliations of Jun Lin include Lehigh University.


Papers
Journal ArticleDOI
TL;DR: An efficient selective computation algorithm, which entirely avoids the sorting process, is proposed for Min-Max decoding, and an efficient VLSI architecture for a nonbinary Min-Max decoder is presented.
Abstract: Low-density parity-check (LDPC) codes constructed over the Galois field GF(q), which are also called nonbinary LDPC codes, are an extension of binary LDPC codes with significantly better performance. Although various kinds of low-complexity quasi-optimal iterative decoding algorithms have been proposed, the VLSI implementation of nonbinary LDPC decoders has rarely been discussed due to their hardware-unfriendly properties. In this brief, an efficient selective computation algorithm, which entirely avoids the sorting process, is proposed for Min-Max decoding. In addition, an efficient VLSI architecture for a nonbinary Min-Max decoder is presented. Synthesis results are given to demonstrate the efficiency of the proposed techniques.
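For context on the decoding rule that the selective computation targets, here is a minimal Python sketch of the baseline Min-Max check-node update (the minimum, over symbol configurations satisfying the check, of the maximum incoming reliability). It is a brute-force reference, not the paper's selective algorithm, and it assumes a prime field size q so that GF(q) arithmetic reduces to arithmetic modulo q; names and calling conventions are illustrative.

```python
import itertools
import numpy as np

def min_max_check_node(Q, h, q):
    """Brute-force Min-Max check-node update for one check over GF(q), q prime.

    Q: (d_c, q) array of incoming reliabilities (smaller = more reliable).
    h: length-d_c list of nonzero check coefficients.
    Returns R: (d_c, q) array of outgoing messages.
    """
    d_c = len(h)
    R = np.full((d_c, q), np.inf)
    for n in range(d_c):
        others = [j for j in range(d_c) if j != n]
        # enumerate all configurations of the other symbols (q^(d_c-1) of them)
        for conf in itertools.product(range(q), repeat=d_c - 1):
            # symbol on edge n forced by the parity check: sum_j h_j * a_j = 0 (mod q)
            s = sum(int(h[j]) * a for j, a in zip(others, conf)) % q
            a_n = (-s * pow(int(h[n]), -1, q)) % q
            # min-max rule: keep the smallest "worst" incoming reliability
            worst = max(Q[j][a] for j, a in zip(others, conf))
            R[n][a_n] = min(R[n][a_n], worst)
    return R
```

Practical decoders restrict this search to a few most reliable symbols per edge, which is typically where the sorting step that the brief avoids would appear.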

157 citations

Journal ArticleDOI
TL;DR: The theoretical derivation of parallel fast finite impulse response algorithm (FFA) is introduced and the corresponding fast convolution units (FCUs) are developed for the computation of convolutions in the CNN models.
Abstract: The convolutional neural network (CNN) is the state-of-the-art deep learning approach employed in various applications. Real-time CNN implementations on resource-limited embedded systems have recently become highly desirable. To ensure programmable flexibility and shorten the development period, field-programmable gate arrays (FPGAs) are appropriate for implementing CNN models. However, limited bandwidth and on-chip memory storage are the bottlenecks of CNN acceleration. In this paper, we propose efficient hardware architectures to accelerate deep CNN models. The theoretical derivation of the parallel fast finite impulse response algorithm (FFA) is introduced. Based on FFAs, the corresponding fast convolution units (FCUs) are developed for the computation of convolutions in the CNN models. Novel data storage and reuse schemes are proposed, in which all intermediate pixels are stored on-chip and the bandwidth requirement is reduced. We choose one of the largest and most accurate networks, VGG16, and implement it on Xilinx Zynq ZC706 and Virtex VC707 boards, respectively. We achieve a top-5 accuracy of 86.25% using an equal-distance non-uniform quantization method. The estimated average performances are 316.23 GOP/s at a 172-MHz working frequency on the Xilinx ZC706 and 1250.21 GOP/s at a 170-MHz working frequency on the VC707, respectively. In brief, the proposed design outperforms existing works significantly, in particular surpassing related designs by more than two times in terms of resource efficiency.
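As a minimal illustration of the fast FIR algorithm idea behind the FCUs, the NumPy sketch below implements the classic 2-parallel FFA, which produces two output phases from three sub-convolutions instead of four. It only demonstrates the arithmetic sharing; it does not model the paper's hardware FCUs, quantization, or data-reuse scheme, and the function name is an assumption.

```python
import numpy as np

def ffa2_fir(x, h):
    """2-parallel fast FIR algorithm: 3 sub-convolutions instead of 4.

    Assumes len(x) and len(h) are even so the even/odd phases have equal length.
    """
    x0, x1 = x[0::2], x[1::2]            # even / odd input phases
    h0, h1 = h[0::2], h[1::2]            # even / odd filter phases
    a = np.convolve(h0, x0)
    b = np.convolve(h1, x1)
    c = np.convolve(h0 + h1, x0 + x1)    # the shared "fast" sub-filter
    y0 = np.zeros(len(a) + 1)
    y0[:-1] += a                         # y[2n]   = (h0*x0)[n] + (h1*x1)[n-1]
    y0[1:] += b
    y1 = c - a - b                       # y[2n+1] = (h0*x1)[n] + (h1*x0)[n]
    y = np.empty(len(y0) + len(y1))
    y[0::2], y[1::2] = y0, y1            # interleave the two output phases
    return y

# sanity check against direct convolution
x, h = np.random.randn(64), np.random.randn(8)
assert np.allclose(ffa2_fir(x, h), np.convolve(h, x))
```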

101 citations

Proceedings ArticleDOI
Meiqi Wang, Siyuan Lu, Danyang Zhu, Jun Lin, Zhongfeng Wang
01 Oct 2018
TL;DR: This paper presents an efficient hardware implementation of the softmax function, coded in a hardware description language (HDL) and synthesized under TSMC 28-nm CMOS technology; synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data.
Abstract: Recently, significant improvement has been achieved in hardware architecture design for deep neural networks (DNNs). However, the hardware implementation of the widely used softmax function in DNNs has not been much investigated; it involves expensive division and exponentiation units. This paper presents an efficient hardware implementation of the softmax function. Mathematical transformations and linear fitting are used to simplify the function. Multiple algorithmic strength-reduction strategies and fast addition methods are employed to optimize the architecture. With these techniques, complicated logic units such as multipliers are eliminated and memory consumption is largely reduced, while the accuracy loss is negligible. The proposed design is coded in a hardware description language (HDL) and synthesized under TSMC 28-nm CMOS technology. Synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data. A power efficiency of 463.04 Gb/(mm²·mW) is achieved, and the design costs only 0.015 mm² of area resources. To the best of our knowledge, this is the first work on an efficient hardware implementation of softmax in the open literature.
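The kind of mathematical transformation mentioned above can be seen in the standard log-sum-exp form of softmax, which turns the per-element division into a subtraction in the log domain; hardware designs then approximate the remaining exp/log, for example with base-2 decompositions and linear fitting. The sketch below shows only this mathematical starting point in floating point, not the paper's fixed-point architecture; names are illustrative.

```python
import numpy as np

def log_domain_softmax(x):
    """Softmax via the log-sum-exp transformation.

    softmax(x)_i = exp(x_i - max(x) - log(sum_j exp(x_j - max(x)))),
    so the division becomes a subtraction in the log domain.
    """
    x = np.asarray(x, dtype=np.float64)
    z = x - x.max()                   # max subtraction bounds the dynamic range
    lse = np.log(np.exp(z).sum())     # hardware approximates exp/log, e.g. in base 2
    return np.exp(z - lse)

p = log_domain_softmax([1.0, 2.0, 3.0])
assert np.isclose(p.sum(), 1.0)
```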

84 citations

Journal ArticleDOI
Jun Lin, Zhiyuan Yan
TL;DR: In this article, the authors propose an efficient list decoder architecture for the CRC-aided SCL algorithm, based on both algorithmic reformulations and architectural techniques, which achieves 1.24 to 1.83 times the area efficiency of existing list decoders.
Abstract: Long polar codes can achieve the symmetric capacity of arbitrary binary-input discrete memoryless channels under a low-complexity successive cancellation (SC) decoding algorithm. However, for polar codes with short and moderate code lengths, the decoding performance of the SC algorithm is inferior. The cyclic-redundancy-check (CRC)-aided SC-list (SCL) decoding algorithm has better error performance than the SC algorithm for short or moderate polar codes. In this paper, we propose an efficient list decoder architecture for the CRC-aided SCL algorithm, based on both algorithmic reformulations and architectural techniques. In particular, an area-efficient message memory architecture is proposed to reduce the area of the proposed decoder architecture. An efficient path pruning unit suitable for large list sizes is also proposed. For a polar code of length 1024 and rate 1/2, with list sizes $L=2$ and 4, the proposed list decoder architecture is implemented under a Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS technology. Compared with the list decoders in the literature, our decoder achieves 1.24–1.83 times the area efficiency.
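The CRC-aided part of the algorithm amounts to a selection rule at the end of list decoding: among the L surviving paths, output the best-metric path whose CRC checks, falling back to the overall best path if none passes. A minimal sketch of that rule follows; the list decoder that produces the candidates is not shown, and `paths`, `metrics`, and `crc_check` are placeholder names.

```python
def crc_aided_select(paths, metrics, crc_check):
    """Pick the decoding result from L candidate paths (CA-SCL selection rule).

    paths: list of candidate bit sequences, metrics: their path metrics
    (smaller = more likely), crc_check: callable returning True if the CRC passes.
    """
    order = sorted(range(len(paths)), key=lambda i: metrics[i])
    for i in order:
        if crc_check(paths[i]):        # most likely path that satisfies the CRC
            return paths[i]
    return paths[order[0]]             # fall back to the most likely path overall
```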

83 citations

Journal ArticleDOI
TL;DR: A novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), is presented to tackle implementation challenges; it can reduce memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks.
Abstract: Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks due to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method can reduce memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks. An efficient scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix–vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a given latency requirement. Specifically, for the circulant matrix–vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit the chessboard division and minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled in register transfer language (RTL) and synthesized under TSMC 90-nm CMOS technology. With 518.5 kB of on-chip memory, we are able to process a $512 \times 512$ compressed LSTM in 1.71 µs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77-mm² chip area. The implementation results demonstrate that the proposed design achieves significantly high flexibility and area efficiency, which satisfies many real-time applications on embedded devices. It is worth mentioning that the memory-efficient approach to accelerating LSTMs developed in this paper is also applicable to other RNN variants.
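The saving from circulant matrices is easy to see in isolation: a circulant block is fully described by its first column (n weights instead of n²) and can be applied with FFTs in O(n log n). The NumPy sketch below demonstrates that property; it does not model the paper's chessboard division, quantization, or the RTL design, and the names are illustrative.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by first column c with vector x,
    using the FFT identity C @ x = ifft(fft(c) * fft(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# sanity check against the explicit n x n circulant matrix
n = 8
c, x = np.random.randn(n), np.random.randn(n)
C = np.array([np.roll(c, k) for k in range(n)]).T   # column k is c rolled down by k
assert np.allclose(C @ x, circulant_matvec(c, x))
```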

79 citations


Cited by
Journal ArticleDOI
TL;DR: An LLR-based formulation of the successive cancellation list (SCL) decoder is presented, which leads to a more efficient hardware implementation of the decoder than the known log-likelihood-based implementation.
Abstract: We show that successive cancellation list decoding can be formulated exclusively using log-likelihood ratios. In addition to numerical stability, the log-likelihood-ratio-based formulation has useful properties that simplify the sorting step involved in successive cancellation list decoding. We propose a hardware architecture for the successive cancellation list decoder in the log-likelihood-ratio domain which, compared with a log-likelihood-domain implementation, requires less irregular and smaller memories. This simplification, together with the gains in the metric sorter, leads to $56\%$ to $137\%$ higher throughput per unit area than other recently proposed architectures. We then evaluate the empirical performance of the CRC-aided successive cancellation list decoder at different list sizes using different CRCs, and conclude that it is important to adapt the CRC length to the list size in order to achieve the best error-rate performance of concatenated polar codes. Finally, we synthesize conventional successive cancellation decoders at large block lengths with the same block-error probability as our proposed CRC-aided successive cancellation list decoders to demonstrate that, while our decoders have slightly lower throughput and larger area, they have a significantly smaller decoding latency.
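The heart of the LLR-based formulation is the path-metric update applied at every bit decision. The sketch below gives the exact update together with the common hardware-friendly approximation (add |L| only when the hypothesized bit disagrees with the sign of the LLR); it mirrors the formulation summarized in this abstract, but the function name and calling convention are assumptions.

```python
import numpy as np

def path_metric_update(pm, llr, u):
    """LLR-domain path-metric update for one bit decision in SC list decoding.

    pm: current path metric, llr: decision LLR of the current bit,
    u: hypothesized bit value (0 or 1). Smaller metric = more likely path.
    """
    exact = pm + np.log1p(np.exp(-(1 - 2 * u) * llr))
    hard = 0 if llr >= 0 else 1                  # hard decision implied by the LLR
    approx = pm + (abs(llr) if u != hard else 0.0)
    return exact, approx
```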

541 citations

Journal ArticleDOI
20 Mar 2020
TL;DR: This article reviews the mainstream compression approaches, such as compact models, tensor decomposition, data quantization, and network sparsification, answers the question of how to leverage these methods in the design of neural network accelerators, and presents the state-of-the-art hardware architectures.
Abstract: Domain-specific hardware is becoming a promising topic against the backdrop of slowing improvement for general-purpose processors due to the foreseeable end of Moore's Law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain, witnessing successful applications in a wide spectrum of artificial intelligence (AI) tasks. The incomparable accuracy of DNNs is achieved at the cost of heavy memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. Therefore, the DNN compression concept was naturally proposed and widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up to pursue a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators for gaining extremely high performance. However, the number of related works is huge and the reported approaches are quite divergent. This research chaos motivates us to provide a comprehensive survey of the recent advances toward the goal of efficient compression and execution of DNNs without significantly compromising accuracy, involving both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches such as compact models, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. In the end, we discuss several existing issues such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and we give promising topics in this field and the possible challenges as well. This article attempts to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate the various methods, and confidently get started in the right way.
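As a concrete instance of one technique the survey covers, the sketch below applies symmetric uniform post-training quantization to a weight tensor. It is a generic illustration of data quantization, not a scheme taken from the article, and the function name is illustrative.

```python
import numpy as np

def uniform_quantize(w, bits=8):
    """Symmetric uniform quantization: map weights to signed integers of the
    given bit width and return the dequantized tensor plus the scale factor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)    # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, scale

w = np.random.randn(4, 4).astype(np.float32)
w_hat, s = uniform_quantize(w, bits=8)
print("max quantization error:", np.abs(w - w_hat).max())   # bounded by ~scale/2
```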

499 citations

Journal ArticleDOI
TL;DR: A convolutional neural network architecture is proposed in which the neural network is divided into hardware and software parts to increase performance and reduce the cost of implementation resources.

308 citations

Journal ArticleDOI
TL;DR: The benefits that cloud computing offers for fifth-generation (5G) mobile networks are explored and the implications on the signal processing algorithms are investigated.
Abstract: Cloud computing draws significant attention in the information technology (IT) community as it provides ubiquitous on-demand access to a shared pool of configurable computing resources with minimal management effort. It is also gaining impact in the communication technology (CT) community and is currently discussed as an enabler for flexible, cost-efficient, and more powerful mobile network implementations. Although centralized baseband pools are already being investigated for the radio access network (RAN) to allow for efficient resource usage and advanced multicell algorithms, these technologies still require dedicated hardware and do not offer the same characteristics as cloud-computing platforms, i.e., on-demand provisioning, virtualization, resource pooling, elasticity, service metering, and multitenancy. However, these properties of cloud computing are key enablers for future mobile communication systems characterized by an ultradense deployment of radio access points (RAPs), leading to severe multicell interference in combination with a significant increase in the number of access nodes and huge fluctuations of the rate requirements over time. In this article, we explore the benefits that cloud computing offers for fifth-generation (5G) mobile networks and investigate the implications for the signal processing algorithms.

272 citations