Journal ArticleDOI

VLSI Implementation of a Multi-Mode Turbo/LDPC Decoder Architecture

01 Jun 2013-IEEE Transactions on Circuits and Systems I-regular Papers (IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 445 HOES LANE, PISCATAWAY, NJ 08855 USA)-Vol. 60, Iss: 6, pp 1441-1454
TL;DR: This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC codes decoding, tackling the reconfiguration issue and introducing a formal and systematic treatment that was not previously addressed.
Abstract: Flexible and reconfigurable architectures have gained wide popularity in the communications field. In particular, reconfigurable architectures for the physical layer are an attractive solution not only to switch among different coding modes but also to achieve interoperability. This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC codes decoding. The novel contributions of this paper are: i) tackling the reconfiguration issue introducing a formal and systematic treatment that, to the best of our knowledge, was not previously addressed and ii) proposing a reconfigurable NoC-based turbo/LDPC decoder architecture and showing that wide flexibility can be achieved with a small complexity overhead. Obtained results show that dynamic switching between most of the considered communication standards is possible without pausing the decoding activity. Moreover, post-layout results show that tailoring the proposed architecture to the WiMAX standard leads to an area occupation of 2.75 mm² and a power consumption of 101.5 mW in the worst case.

Summary (4 min read)

Introduction

  • In recent years, several efforts have been spent to develop systems able to give ubiquitous access to telecommunication networks.
  • In both approaches, flexible and efficient interconnection structures are required to connect PEs to each other.
  • The use of an intra-IP NoC as the interconnection framework for both turbo and LDPC code decoders has been demonstrated in several works [16], [19]–[21].
  • In Section VII evaluations of the architecture performance on various existing standards are provided.

II. DECODING ALGORITHMS

  • Turbo and LDPC decoding algorithms are characterized by strong resemblances: they are iterative, work on graph-based representations, are routinely implemented in logarithmic form, process data expressed as Logarithmic-Likelihood-Ratios (LLRs) and require a high level of both processing and storage parallelism.
  • Both algorithms receive intrinsic information from the channel and produce extrinsic information that is exchanged across iterations to obtain the a priori information of uncoded bits, in the case of binary codes, or symbols, in the case of non binary codes.
  • Moreover, their arithmetical functions are so similar that joint or derived algorithms for both LDPC and turbo decoding exist [24].
  • In the following for both codes the authors will refer to K, N and r = K/N as the number of uncoded bits, the number of coded bits and the code rate respectively.

A. LDPC codes decoding algorithm

  • The decoding of LDPC codes stems from the Tanner graph representation of H where two sets of nodes are identified: Variable Nodes (VNs) and Check Nodes (CNs).
  • There are two main scheduling schemes for the BP: two-phase scheduling and layered scheduling [26].
  • In a layered decoder, parity-check constraints are grouped in layers each of which is associated to a component code.
  • This process is iterated up to the desired level of reliability.
  • Let λ[c] represent the LLR of symbol c; for column k in H, the bit LLR λ_k[c] is initialized to the corresponding received soft value.

B. Turbo codes decoding algorithm

  • Turbo codes are obtained as the parallel concatenation of two constituent Convolutional Code (CC) encoders connected by means of an interleaver (Π).
  • Each constituent decoder performs the so called BCJR algorithm [29] that starting from the intrinsic and a priori information produces the extrinsic information.
  • Several exact and approximated expressions are available for the max*{x_i} function [31]: for example, it can be implemented as max{x_i} followed by a correction term, often stored in a small Look-Up-Table (LUT).
  • On the other hand, they execute (1) to (5) in parallel for P slices of parity check constraints when configured in LDPC code mode.
  • In the following, the authors indicate the j-th message received and generated by PE i as λ′_{i,j} and λ_{i,j} respectively.

III. NOC-BASED DECODER

  • The goal of this work is to design a highly flexible LDPC and turbo decoder, able to support a very wide set of different communication standards.
  • The node architecture employed in this work for node i is represented in Fig. 1.
  • The routing algorithm is the one proposed in [19] as Single-Shortest-Path-FIFO-Length (SSP-FL).
  • It is worth noting that the destination of each λ_{i,j} is imposed by the interleaver (turbo mode) and by the H matrix (LDPC mode) respectively.
  • The PE includes both LDPC and turbo decoding cores: their architectures are structured to be as independent as possible of the supported codes.

IV. DECODER RECONFIGURATION

  • Change of decoding mode, standard or code parameters requires not only hardware support, but also memory initialization and specific controls: since in many standards a code switch can be issued as early as one data frame ahead [5], a time efficient reconfiguration technique must be developed.
  • The reconfiguration of the considered decoder to switch from the code currently processed (C1) to a new one (C2) can be overlapped with the decoding of both current and new code, provided that enough locations are free in the configuration memories.
  • Finally, in case the overlap with decoding activity is not sufficient to complete the whole configuration, a further option is pausing the decoder by skipping one or more iterations on the last received frame for C1 and using the available time, before starting the decoding of the new frame encoded with C2.
  • Two alternative cases can arise during Φ1: either this phase is limited by the available time, or it is limited by the number of free locations in the reconfiguration memory: (n_it1 − 1) · t_it1 ≷ (P/N_b) · (B − l_c1) (14); a minimal numerical sketch follows this list.
  • The bound is also proportional to N_b, and can consequently be increased by raising the number of reconfiguration buses.
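The bound in (14) can be checked numerically. Below is a minimal sketch with parameter names from Section IV; the example values are hypothetical and only illustrate which side of (14) limits phase Φ1:

```python
# Phase Phi_1 is limited either by the available time, (n_it1 - 1) * t_it1,
# or by the free buffer locations, B - l_c1, as in (14).
# All numbers below are hypothetical, chosen only for illustration.

def phi1_budget(n_it1, t_it1, P, N_b, B, l_c1):
    writable = (N_b / P) * (n_it1 - 1) * t_it1   # l_a^Phi1: writable locations
    free = B - l_c1                              # unused buffer locations
    limit = "available time" if writable < free else "free locations"
    return int(min(writable, free)), limit

print(phi1_budget(n_it1=10, t_it1=600, P=22, N_b=5, B=4000, l_c1=600))
# -> (1227, 'available time')
```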

V. RECONFIGURATION: CASES AND EXAMPLES

  • The reconfiguration method detailed in Section IV has been applied to a set of target standards, in order to identify suitable design parameters (i.e. N_b, B, n_stop, N_f^max) that enable reconfiguration without pausing the decoder for most code sizes.
  • It can be noticed that with n_stop = 3 all the large codes are below the right side of the curve: later in this section it is demonstrated that these skipped iterations are negligible in terms of BER performance.
  • In Fig. 6, the effect of different choices of N_f is shown: the plot shows that N_f > 0 actually increases the maximum l_c2 only for small C1 codes.
  • Among the remaining three combinations, the one that makes use of 6 buses yields a higher area occupation than the others.

VI. DECODING CORES

  • The design of the decoding cores must yield the same degree of flexibility of the NoC, being as independent as possible of the set of supported codes.
  • In [14] a completely serial LDPC decoding core has been designed, mostly independent of block length and code rate: an arbitrary number of CN operations can be scheduled on it.
  • The same holds true for the serial SISO, where different windows can be scheduled, regardless of the size of the interleaver.
  • As a consequence, in this work logic sharing is not addressed.
  • Experimental results show that the area of the architecture is indeed dominated by memories.

A. Quantization and Memory Organization

  • The memory organization evolves from the idea presented in [14], in which every decoding core instantiates two memories: a 7-bit memory and a 5-bit memory.
  • On the same graph, yielding similar results, a few turbo code examples (WiMAX and HPAV) are plotted, in which the λ_k[b] and channel LLR representations change from 7 to 6 bits, and λ_k[c(e)] from 5 to 4 bits (the meaning of λ_k[b] is detailed in Section VI-C1).
  • Curves obtained with floating point precision show improvements between 0.1 and 0.2 dB w.r.t. the selected precisions.
  • Thanks to these changes, a single 6-bit wide memory is instantiated, in which both λk[c] and Rlk values are saved.
  • Since R_lk can take only two possible values, for each CN the authors memorize 576 · 2 magnitudes and 576 · 15 2-bit indexes that identify the correct R_lk magnitude and its sign.

B. LDPC Decoding Core

  • The LDPC decoding core used in the decoder described in [14] relies on a serial architecture suited for exclusive memory usage.
  • The average number of cycles per data item varies between one and two.
  • Once min1 and min2 have been successfully extracted, they are compared to all the Q_lk[c] of the CN, which are delayed by a number of clock cycles equal to the degree of the CN (deg), to compute R_lk^new as in (4); see the sketch after this list.
  • Both 6-bit and 2-bit memories are implemented as dual-port RAMs, allowing two concurrent operations (memory scheduling).
  • On the contrary, port 2 is set to read mode, loading the two R_lk^old magnitudes of CN j+1 stored during the previous iteration.
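The min1/min2 extraction and the compressed R_lk storage lend themselves to a compact behavioural sketch; the following Python fragment is illustrative only (the normalization factor 0.75 and the interface are assumptions, not the RTL):

```python
# Serial extraction of min1/min2 over the |Q_lk[c]| of one CN, followed by
# the normalized min-sum output (6): each outgoing R_lk needs only the two
# stored magnitudes, an index selecting between them, and a sign.
# sigma = 0.75 is an illustrative normalization factor (sigma <= 1).

def cn_update(Q, sigma=0.75):
    min1 = min2 = float("inf")
    idx1, sign_prod = -1, 1
    for k, q in enumerate(Q):                 # one Q_lk[c] per clock cycle
        mag = abs(q)
        sign_prod *= 1 if q >= 0 else -1
        if mag < min1:
            min1, min2, idx1 = mag, min1, k
        elif mag < min2:
            min2 = mag
    R_new = []
    for k, q in enumerate(Q):                 # delayed Q values re-enter
        mag = min2 if k == idx1 else min1     # exclude own contribution
        sign = sign_prod * (1 if q >= 0 else -1)
        R_new.append(sigma * sign * mag)
    return R_new
```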

C. Turbo Decoding Core

  • As for the LDPC decoding core, also the SISO core yields a very high degree of flexibility, limited only by the size of the memories: any double-binary turbo code can be decoded as long as the memory capacity is sufficient.
  • The SISO interfaces with the NoC via two dedicated input and output blocks, respectively called Bit-To-Symbol Conversion Unit (BTS CU) and Symbol-To-Bit Conversion Unit (STB CU).
  • These metrics are computed in this exact order, storing β_k[s] values in a dedicated set of registers while the α_k[s] are being processed: the b(e) metric, which needs both β_k[s] and α_k[s], is calculated last.
  • The 2-bit memory is used in the same way, with port 1 in read mode and port 2 in write mode.
  • Since every window is composed of at least 20 trellis steps, requiring 3 · 20 clock cycles to be executed, there is enough time to load the β_k[s] and α_k[s] values that initialize the next window.

VII. SUPPORTED STANDARDS

  • The 22-node architecture presented in this work has been tested on a large set of communication standards.
  • To comply with each standard's throughput requirements, a single f_NoC = 300 MHz is sufficient in both LDPC and turbo mode, consequently identifying f_core^LDPC = 200 MHz and f_core^turbo = 170 MHz, both under the f_core/f_NoC constraint.
  • This area overhead is due to two specific functionalities that have been introduced in the proposed decoder: (i) full flexibility in terms of supported turbo and LDPC codes, and (ii) dynamic reconfiguration between different standards.
  • The parallelism of the NoC is increased from 22 nodes to 35 nodes, the reconfiguration buses rise from 5 to 8, and the support of LTE requires an increase in the size of 6-bit memories.
  • Throughput results for CMMB and DTMB are shown in the Implementation C column of Table VI.

D. Comparisons

  • Table VIII shows the detailed implementation results in comparison with state-of-the-art flexible turbo/LDPC decoders.
  • Baghdadi et al. in [11] propose an ASIP decoder architecture supporting WiMAX and WiFi LDPC codes, and WiMAX, 3GPP-LTE and DVB-RCS turbo codes.
  • On the contrary, worst case throughput in [11] is not high enough for WiMAX.
  • This leads to a better area efficiency in all three proposed implementations for most of the codes: particularly evident is the difference for DBTC (second-to-last row of Table VII).
  • They obtain very high maximum throughput efficiency in both LDPC and turbo mode: the range of supported codes is however quite limited w.r.t. all considered implementations, and the area occupation is larger than A.

IX. CONCLUSIONS

  • This work describes a flexible turbo/LDPC decoder architecture able to fully support a wide range of modern communication standards.
  • A complete analysis of the previously unaddressed inter- and intra-standard reconfiguration issue is presented, together with a dedicated reconfiguration technique that limits the complexity overhead and performance loss.
  • Three different implementations are proposed to cover different sets of standards.
  • Full layout design has been completed to provide accurate area and power figures.
  • Comparison of the proposed architectures with the state of the art shows very good efficiency, competitive area occupation and an unmatched degree of flexibility.



DOI: 10.1109/TCSI.2012.2221216

VLSI implementation of a multi-mode turbo/LDPC
decoder architecture
Carlo Condo, Maurizio Martina, Member, IEEE, Guido Masera, Senior Member, IEEE
Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino, Italy
Abstract—Flexible and reconfigurable architectures have
gained wide popularity in the communications field. In particular,
reconfigurable architectures for the physical layer are an attrac-
tive solution not only to switch among different coding modes
but also to achieve interoperability. This work concentrates on
the design of a reconfigurable architecture for both turbo and
LDPC codes decoding. The novel contributions of this paper
are: i) tackling the reconfiguration issue introducing a formal
and systematic treatment that, to the best of our knowledge, was
not previously addressed; ii) proposing a reconfigurable NoC-
based turbo/LDPC decoder architecture and showing that wide
flexibility can be achieved with a small complexity overhead.
Obtained results show that dynamic switching between most of the
considered communication standards is possible without pausing
the decoding activity. Moreover, post-layout results show that
tailoring the proposed architecture to the WiMAX standard leads
to an area occupation of 2.75 mm² and a power consumption of
101.5 mW in the worst case.
Index Terms—VLSI, LDPC/Turbo Codes Decoder, NoC, Flexibility,
Wireless communications
I. INTRODUCTION
In recent years, several efforts have been spent to develop
systems able to give ubiquitous access to telecommunication
networks. These efforts were spent mainly in three directions:
i) improving the transmission rate and reliability; ii) developing
bandwidth-efficient technologies; iii) designing low-cost
receivers. The most relevant results produced by such vivid
research were included in the latest standards for both wireless
and wired communications [1]–[7]. Besides, several standards
provide multiple modes and functionalities. However, sharing
common features to achieve flexibility and interoperability is
a challenging task.
Several recent works, including [8], have shown that flex-
ibility is an important property in the implementation of
communication systems. Some works investigated this di-
rection facing the challenge of implementing flexible archi-
tectures for the decoding of channel codes. In particular,
flexible turbo/Low-Density-Parity-Check (LDPC) decoder ar-
chitectures have been proposed not only to support differ-
ent coding modes within a specific standard but also to
enable interoperability among different standards. In [9]–
[11] flexibility is achieved through the design of Processing
Elements (PEs) based on Application-Specific-Instruction-set-
Processor (ASIP) architectures, whereas in [12]–[14] PEs rely
on Application-Specific-Integrated-Circuit (ASIC) solutions.
In both approaches, flexible and efficient interconnection struc-
tures are required to connect PEs to each other.
Unfortunately, the communication patterns of turbo and
LDPC codes suffer from collisions, namely two or more PEs
requiring concurrent access to the same memory resource. To
break these collisions, a Network-on-Chip (NoC)-like approach
was proposed in [15] for turbo codes. This idea has been
further developed in other works. In particular, in [16] the NoC
approach is used as a viable solution to implement flexible and
high throughput interconnection structures for turbo/LDPC
decoders.
An intra-IP NoC [17] is an application specific NoC [18]
where the interconnection structure is tailored to the char-
acteristics of the Intellectual Property (IP). The use of an
intra-IP NoC as the interconnection framework for both turbo
and LDPC code decoders has been demonstrated in several
works [16], [19]–[21]. This choice enables larger flexibility
with respect to other interconnection schemes [16], [22], [23],
but introduces penalties in terms of additional occupied area
and latency in the communication among PEs.
Stemming from the work presented in [14], [19], [20],
where an ASIC implementation of an NoC-based turbo/LDPC
decoder architecture is proposed, this paper aims to further
investigate and optimize it. In particular, this work features
the following novel contributions: i) management of dynamic
reconfiguration to switch from one code to another
without pausing the decoding, ii) description of a new PE
architecture with an improved shared memory solution which
provides relevant savings of occupied area for the min-sum
decoding algorithm, iii) evaluation of a wide set of standards for
both wireless and wired applications: IEEE 802.16e (WiMAX)
[5], IEEE 802.11n (WiFi) [6], China Multimedia Mobile
Broadcasting (CMMB) [3], Digital Terrestrial Multimedia
Broadcast (DTMB) [4], HomePlug AV (HPAV) [2], 3GPP
Long Term Evolution (LTE) [7], Digital Video Broadcasting
- Return Channel via Satellite (DVB-RCS) [1], iv) complete
VLSI implementation of the decoder up to layout level and
accurate evaluation of dissipated power.
It is worth noting that, to the best of our knowledge, this is
the first work addressing dynamic reconfiguration of flexible
channel decoders with an analytical approach, and showing
the actual impact of reconfiguration on both performance and
complexity. The paper is structured as follows. In Section
II decoding algorithms are briefly discussed, whereas Section
III deals with the basics of NoC-based turbo/LDPC decoder
architectures and summarizes the main results this work starts
from. The decoder reconfiguration techniques are detailed in
Sections IV and V, while Section VI deals with the descrip-
tion of LDPC and turbo decoding cores, along with their

respective memory organization. In Section VII evaluations of
the architecture performance on various existing standards are
provided. Implementation results are portrayed and discussed
in Section VIII, and conclusions are drawn in Section IX.
II. DECODING ALGORITHMS
Turbo and LDPC decoding algorithms are characterized
by strong resemblances: they are iterative, work on graph-
based representations, are routinely implemented in logarith-
mic form, process data expressed as Logarithmic-Likelihood-
Ratios (LLRs) and require high level of both processing and
storage parallelism. Both algorithms receive intrinsic informa-
tion from the channel and produce extrinsic information that is
exchanged across iterations to obtain the a priori information
of uncoded bits, in the case of binary codes, or symbols, in
the case of non binary codes. Moreover, their arithmetical
functions are so similar that joint or derived algorithms for
both LDPC and turbo decoding exist [24]. In the following
for both codes we will refer to K, N and r = K/N as the
number of uncoded bits, the number of coded bits and the
code rate respectively.
A. LDPC codes decoding algorithm
Every LDPC code is completely described by its M × N
parity check matrix H (M = N − K), which is very sparse
[25]. Each valid LDPC codeword x satisfies H · x^T = 0,
where (·)^T is the transposition operator. The decoding of LDPC
codes stems from the Tanner graph representation of H where
two sets of nodes are identified: Variable Nodes (VNs) and
Check Nodes (CNs). VNs are associated to the N bits of the
codeword, whereas CNs correspond to the M parity-check
constraints. The most common algorithm to decode LDPC
codes is the Belief Propagation (BP) algorithm. There are two
main scheduling schemes for the BP: two-phase scheduling
and layered scheduling [26]. The latter nearly doubles the
converge speed as compared to two-phase scheduling. In a
layered decoder, parity-check constraints are grouped in layers
each of which is associated to a component code. Then, layers
are decoded in sequence by propagating extrinsic information
from one layer to the following one [26]. This process is
iterated up to the desired level of reliability.
Let λ[c] represent the LLR of symbol c and, for column k
in H, let the bit LLR λ_k[c] be initialized to the corresponding
received soft value. Then, for all parity constraints l in a given
layer, the following operations are executed:

Q_lk[c] = λ_k^old[c] − R_lk^old   (1)

A_lk = Σ_{n∈N(l), n≠k} Ψ(Q_ln[c])   (2)

δ_lk = Π_{n∈N(l), n≠k} sgn(Q_ln[c])   (3)

R_lk^new = δ_lk · Ψ^{−1}(A_lk)   (4)

λ_k^new[c] = Q_lk[c] + R_lk^new   (5)

λ_k^old[c] is the extrinsic information received from the previous
layer and updated in (5) to be propagated to the succeeding
layer. Term R_lk^old, pertaining to element (l, k) of H and
initialized to 0, is used to compute (1); the same amount is
then updated in (4) to R_lk^new and stored to be used again in the
following iteration. In (2) and (3), N(l) is the set of all bit
indexes that are connected to parity constraint l.
According to [27], the Ψ(·) function in (2) and (4) can be
simplified with a limited BER performance loss as

R_lk^new ≈ δ′_lk · min_{n∈N(l), n≠k} {|Q_ln[c]|},   (6)

usually referred to as the normalized min-sum approximation,
where δ′_lk = σ · δ_lk and σ ≤ 1.
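As a concrete rendering of the layered update, the following minimal floating-point sketch applies (1)-(6) to one layer; the data structures (H_layer, lam, R) are illustrative assumptions, and the real decoder operates on quantized LLRs:

```python
import math

# One layered normalized-min-sum update over a layer, following (1)-(6).
# H_layer[l] is N(l), the bit indexes of parity constraint l; lam[k] holds
# lambda_k[c]; R maps (l, k) to R_lk and starts empty (i.e. R_lk = 0).

def process_layer(H_layer, lam, R, sigma=0.75):
    for l, N_l in enumerate(H_layer):
        Q = {k: lam[k] - R.get((l, k), 0.0) for k in N_l}           # (1)
        for k in N_l:
            others = [Q[n] for n in N_l if n != k]
            delta = math.prod(1 if q >= 0 else -1 for q in others)  # (3)
            R[(l, k)] = sigma * delta * min(abs(q) for q in others) # (6)
            lam[k] = Q[k] + R[(l, k)]                               # (5)
    return lam, R
```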
B. Turbo codes decoding algorithm
Turbo codes are obtained as the parallel concatenation of
two constituent Convolutional Code (CC) encoders connected
by means of an interleaver (Π). Thus, the decoder is
made of two constituent decoders, referred to as Soft-In-Soft-
Out (SISO) or Maximum-A-Posteriori (MAP) decoders [28]
connected in an iterative loop by means of the interleaver
Π and the de-interleaver Π^{−1}. Each constituent decoder
performs the so-called BCJR algorithm [29] that, starting from
the intrinsic and a priori information, produces the extrinsic
information. Let k be a step in the trellis representation of the
constituent CC, and u an uncoded symbol. Each constituent
decoder computes λ_k[u] = σ · (λ_k^apo[u] − λ_k^apr[u] − λ_k[c^u]),
where σ ≤ 1 [30], λ_k^apo[u] is the a-posteriori information,
λ_k^apr[u] is the a priori information and λ_k[c^u] is the systematic
component of the intrinsic information. According to [29] the
a-posteriori information is computed as

λ_k^apo[u] = max*_{e:u(e)=u} {b(e)} − max*_{e:u(e)=ũ} {b(e)}   (7)

where ũ ∈ U is an uncoded symbol taken as a reference
(usually ũ = 0) and u ∈ U \ {ũ}, with U the set of uncoded
symbols; e is a trellis transition and u(e) is the corresponding
uncoded symbol. Several exact and approximated expressions
are available for the max*{x_i} function [31]: for example, it
can be implemented as max{x_i} followed by a correction
term, often stored in a small Look-Up-Table (LUT). The
correction term, usually adopted when decoding binary codes
(Log-MAP), can be omitted with minor Bit-Error-Rate (BER)
performance degradation (Max-Log-MAP). The term b(e) in
(7) is defined as:

b(e) = α_{k−1}[s^S(e)] + γ_k[e] + β_k[s^E(e)]   (8)

α_k[s] = max*_{e:s^E(e)=s} { α_{k−1}[s^S(e)] + γ_k[e] }   (9)

β_k[s] = max*_{e:s^S(e)=s} { β_{k+1}[s^E(e)] + γ_k[e] }   (10)

γ_k[e] = λ_k^apr[u(e)] + λ_k[c(e)]   (11)

where s^S(e) and s^E(e) are the starting and the ending states
of e, and α_{k−1}[s^S(e)] and β_k[s^E(e)] are the forward and
backward metrics associated to s^S(e) and s^E(e) respectively.
The term λ_k[c(e)] represents the intrinsic information received
from the channel.

Figure 1. Node structure (RE i: 4×4 crossbar with routing algorithm and
crossbar configuration; location memory MEM i; PE i; reconfiguration bus;
inputs λ′_{i,j} and t′_{i,j}; output λ_{i,j})

For further details on the decoding algorithm the reader can
refer to [32].
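A short sketch of the max* operator used in (7)-(10) may be helpful. The exact Log-MAP form adds a LUT-based correction approximating ln(1 + e^{−|a−b|}); the table contents and quantization step below are illustrative assumptions:

```python
# max* over two operands: exact Log-MAP (max plus LUT correction) or the
# Max-Log-MAP approximation (max only). LUT entries approximate
# ln(1 + exp(-|a - b|)) for |a - b| = 0, 0.5, 1.0, ... (illustrative values).
_LUT = [0.69, 0.47, 0.31, 0.20, 0.13, 0.08, 0.05, 0.03]

def max_star(a, b, exact=True):
    m, d = max(a, b), abs(a - b)
    if not exact:
        return m                                        # Max-Log-MAP
    return m + _LUT[min(int(d / 0.5), len(_LUT) - 1)]   # Log-MAP

# alpha_k[s] in (9) is obtained by folding max_star over all transitions e
# with s^E(e) = s, each candidate being alpha_{k-1}[s^S(e)] + gamma_k[e].
```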
In a parallel decoder, the decoding operations summarized
in previous paragraphs are partitioned among P PEs. When
configured in turbo code mode, these PEs operate as con-
current SISOs. On the other hand, they execute (1) to (5)
in parallel for P slices of parity check constraints when
configured in LDPC code mode. In both cases, messages
are exchanged among PEs to propagate the λ_k[u] and λ_k^new[c]
amounts in accordance with the code structure. In the following,
we indicate the j-th message received and generated by
PE i as λ′_{i,j} and λ_{i,j} respectively.
III. NOC-BASED DECODER
The goal of this work is to design a highly flexible LDPC
and turbo decoder, able to support a very wide set of different
communication standards. The proposed multi-mode/multi-
standard decoder architecture relies on an NoC-based struc-
ture, where each node contains a PE and a routing element
(RE). Each PE implements the BCJR and layered normalized
min-sum algorithms. On the other hand, REs are devoted to
delivering λ_{i,j} values to the correct destination.
The node architecture employed in this work for node i is
represented in Fig. 1. Each RE is constituted by a 4×4 crossbar
switch with 4 input FIFOs and 4 output registers. The routing
algorithm is the one proposed in [19] as Single-Shortest-
Path-FIFO-Length (SSP-FL). SSP-FL relies on a distributed
table-based routing algorithm where each table contains the
information for shortest path routing. The routing information
is precalculated by running off-line the Floyd-Warshall algo-
rithm. Moreover, in SSP-FL shortest path routing is coupled
with an input serving policy based on the current status of
the FIFOs, namely in case two messages must be routed to
the same output port, priority is given to the message coming
from the longer FIFO. It is worth noting that the destination
of each λ_{i,j} is imposed by the interleaver (turbo mode) and
by the H matrix (LDPC mode) respectively. As a consequence,
the routing is deterministic.
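The off-line part of SSP-FL can be sketched as follows: a Floyd-Warshall pass produces, for every node, the first hop toward each destination, which fills the per-RE routing tables. The 4-node ring used below is an illustrative topology, not the 22-node Kautz graph of the decoder:

```python
INF = float("inf")

def build_routing_tables(adj):
    # adj[i] lists the neighbours of node i (unit-cost links).
    n = len(adj)
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    nxt = [[i if i == j else None for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in adj[i]:
            dist[i][j], nxt[i][j] = 1, j
    for k in range(n):                       # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]    # route via k: keep first hop to k
    return nxt                               # nxt[i][j]: output port of RE i

tables = build_routing_tables([[1, 3], [2, 0], [3, 1], [0, 2]])  # 4-node ring
# At run time, ties between input FIFOs contending for the same output are
# broken in favour of the longer FIFO (the FL part of SSP-FL).
```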
The PE includes both LDPC and turbo decoding cores:
their architectures are structured to be as independent as
possible of the supported codes. The LDPC decoding core
is completely serial and able to decode any LDPC code,
provided that enough memory is available. The SISO core
for turbo decoding is tailored around 8-state turbo codes,
and no other constraints are present: the two cores share the
memories where the incoming data λ′_{i,j} are stored and the
location memory containing the pre-computed t′_{i,j} values, i.e.
the memory addresses where the λ′_{i,j} are stored. Also, the
interconnection structure depends only on the location memory
size, which sets an upper bound to the number of messages
each PE can handle.
The decoding task is divided uniformly among the different
nodes. The process is straightforward in turbo mode, with
each node being assigned a portion of the trellis that is
processed in a sliding-window fashion [33], [34]. Extrinsic
and window-initialization information are carried through the
network according to the code interleaving and deinterleaving
rules [19]. On the contrary, in LDPC mode the partitioning
of the decoding task on the PEs is obtained as follows.
Using a proprietary tool based on the METIS graph partitioning
library [35], the H matrix is partitioned on the chosen network
topology. At this point the destination of every message
coming out of each decoding core is known. Thus, in both
turbo and LDPC modes each outgoing message is made of a
payload λ_{i,j} and a header containing the destination node.
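As an illustration of this header/payload format, a message could be packed as follows; the field widths (6-bit destination header, 8-bit two's-complement LLR) are assumptions chosen for the example, not the widths used in the decoder:

```python
HDR_BITS, LLR_BITS = 6, 8          # assumed widths, for illustration only

def pack(dest, llr):
    payload = llr & ((1 << LLR_BITS) - 1)        # two's-complement LLR
    return (dest << LLR_BITS) | payload

def unpack(msg):
    llr = msg & ((1 << LLR_BITS) - 1)
    if llr >= 1 << (LLR_BITS - 1):               # sign-extend the payload
        llr -= 1 << LLR_BITS
    return msg >> LLR_BITS, llr                  # (destination node, LLR)

assert unpack(pack(dest=13, llr=-7)) == (13, -7)
```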
The performance of meshes, toroidal meshes, spidergon,
honeycomb, De Bruijn and Kautz graphs was compared, along
with a number of other design choices, such as the routing
algorithm and collision management policies. This analysis
shows that the Kautz topology yields the best results in terms
of area occupation and obtainable throughput. In particular,
in [14] a 22-node Kautz NoC was used to fully support the
IEEE 802.16e standard, each node being connected to a decoding PE and to
three other nodes via a 4-way router.
IV. DECODER RECONFIGURATION
Flexible decoders available in the literature [9]–[13], [16],
[17], [19], [20], though supporting a wide range of codes,
do not address the reconfiguration issue. Change of decoding
mode, standard or code parameters requires not only hardware
support, but also memory initialization and specific controls:
since in many standards a code switch can be issued as early
as one data frame ahead [5], a time-efficient reconfiguration
technique must be developed.
For the proposed decoder the reconfiguration task consists
of i) rewriting the location memory containing t′_{i,j} values; ii)
reloading the CN degree (deg) parameters and the window
size in the control unit of LDPC decoding cores and SISOs
respectively. In the following, the whole set of storage loca-
tions to be updated at reconfiguration time will be indicated as
“reconfiguration memory”. When possible, the decoder must
be reconfigured while the decoding process is still running on
the previous data frame. This means that the reconfiguration
data can be distributed by means of the NoC interconnections
only at the cost of severe performance penalties. Consequently,
we suppose that the reconfiguration data are moved directly
to the PEs via a set of N_b dedicated buses, each one linked
to P/N_b PEs.

Figure 2. Memory reconfiguration process: (a) decoding of C1; (b) upload
of reconfiguration data required for C2 (phases Φ1 to Φ3 and Φ5); (c) first
iteration of C2 and concurrent upload of reconfiguration data (Φ4)
In the following we estimate the reconfiguration occurrence
assuming mobile receivers moving at different speeds and
a carrier frequency f_c = 2.4 GHz. This frequency is
included in most standards' operation range, and used in a
variety of applications. In this scenario the communication
channel is affected by fading phenomena, namely slow fading,
whose effects have very long time constants, and fast fading.
Fast fading can be modeled assuming a change of channel
conditions every time the receiver moves by a distance
similar to the wavelength λ of the carrier. Being λ = 0.125
m, at a speed v = 70 km/h the channel changes with a
frequency f_chng = 155 Hz (WiMAX, WiFi, 3GPP-LTE),
whereas at v = 10 km/h (DVB-RCS, HPAV, CMMB, DTMB)
changes occur at f_chng = 22 Hz. These scenarios result in
different reconfiguration probabilities, whose impact on BER
performance is addressed in Section V.
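The channel-change rates quoted above follow from f_chng ≈ v/λ, the rate at which the receiver crosses one carrier wavelength; a two-line numerical check:

```python
wavelength = 3e8 / 2.4e9                  # lambda = c / f_c = 0.125 m
for v_kmh in (70, 10):
    f_chng = (v_kmh / 3.6) / wavelength   # km/h to m/s, then v / lambda
    print(f"v = {v_kmh} km/h -> f_chng = {f_chng:.1f} Hz")
# v = 70 km/h -> f_chng = 155.6 Hz; v = 10 km/h -> f_chng = 22.2 Hz
```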
The reconfiguration memory is organized as a circular
buffer: two sets of pointers are used to manage reading and
writing operations. The Start of Current Configuration (SCC)
pointer and the End of Current Configuration (ECC) pointer
delimit the memory blocks that are currently being used. A
Read Pointer (RP) is used to retrieve the data during the
decoding process, as shown in Fig. 2.(a). The Start of Future
Configuration (SFC) and End of Future Configuration (EFC)
pointers are instead used concurrently with the Write Pointer
(WP) to delimit the locations that are going to be used to store
the new configuration data.
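A simplified behavioural model of this circular buffer and its two pointer sets is sketched below; it illustrates the pointer handshake only, not the hardware implementation:

```python
class ReconfBuffer:
    # Behavioural sketch of one PE's reconfiguration memory (size B).
    def __init__(self, B):
        self.mem, self.B = [None] * B, B
        self.scc = self.ecc = self.rp = 0    # current configuration pointers
        self.sfc = self.efc = self.wp = 0    # future configuration pointers

    def open_future(self, l_c2):
        self.sfc = self.wp = self.ecc        # SFC is initialized as ECC
        self.efc = (self.sfc + l_c2) % self.B

    def write(self, word):                   # fill the future configuration
        self.mem[self.wp] = word
        self.wp = (self.wp + 1) % self.B

    def read(self):                          # decoding-time read
        word = self.mem[self.rp]
        self.rp = (self.rp + 1) % self.B
        return word

    def switch(self):                        # C2 becomes the current code
        self.scc = self.rp = self.sfc        # SCC is initialized as SFC
        self.ecc = self.efc
```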
The reconfiguration of the considered decoder to switch
from the code currently processed (C1) to a new one (C2) can
be overlapped with the decoding of both current and new code,
provided that enough locations are free in the configuration
memories. In particular, part of the configuration process can
be concurrent with the decoding of one or more frames of
C1; if necessary, another portion of the configuration can be
scheduled during the first iteration of the new code C2. Finally,
in case the overlap with decoding activity is not sufficient to
complete the whole configuration, a further option is pausing
the decoder by skipping one or more iterations on the last
received frame for C1 and using the available time before
starting the decoding of the new frame encoded with C2.
Let us define B as the size of the location buffer available at
each PE to store configuration data, and t_it1 and t_it2 as the
durations in clock cycles of a single decoding iteration for
codes C1 and C2. Moreover, l_c1 and l_c2 express the number
of locations required to store the configurations of codes C1
and C2 at each PE, and n_it1 and n_it2 their iteration numbers.
In the considered architecture, the duration of one decoding
iteration t_it, expressed in clock cycles, is directly proportional
to the number of memory locations a PE has to read throughout
the decoding process, and consequently to the number of
used locations in the reconfiguration memory (l_c). Though the
actual relationship between t_it and l_c is affected by memory
scheduling and by the ratio between PE and NoC clock
frequencies, this analysis is carried out with the worst-case
assumption that the reconfiguration memory is read at every
clock cycle of each iteration, setting l_c = t_it for both C1 and
C2 codes.
We define five phases Φi, i = 1, 2, 3, 4, 5, in the configuration
process and for each phase we identify i) t_a^{Φi} as
the number of clock cycles available during phase Φi, and
ii) l_a^{Φi} = N_b · t_a^{Φi} / P as the number of locations in each
reconfiguration memory that can be written in t_a^{Φi} clock
cycles.
Φ1: In the reconfiguration from code C1 to code C2, l_c1 words
must be replaced with l_c2 new words. The first part of the
configuration can be scheduled during the initial n_it1 − 1
decoding iterations on C1 and therefore the available time
is t_a^{Φ1} = (n_it1 − 1) · t_it1; in this range of time a maximum
of l_a^{Φ1} = (N_b/P) · (n_it1 − 1) · t_it1 words can be loaded
into each buffer. However, assuming that the buffer size
is larger than l_c1, we define B − l_c1 as the number of
unused memory blocks in the current configuration for code
C1. Therefore, the actual number of locations written in
Φ1 is the minimum between B − l_c1 and l_a^{Φ1}. The SFC
pointer is thus initialized as ECC (Fig. 2.(b)).
Φ2: During the last iteration on C1, every memory location
between SCC and the current position of RP is available
for reconfiguration. This means that up to l_c1 locations
are available for receiving configuration words for C2.
However, this has to be done during a single iteration,
and therefore t_a^{Φ2} = t_it1 cycles are available. During these
cycles, up to l_a^{Φ2} = (N_b/P) · t_it1 words can be loaded.
Φ3: As mentioned before, part of the configuration can be
overlapped with the first decoding iteration on code C2.
SCC is initialized as SFC, and RP will take the duration
of a full iteration to arrive at ECC (Fig. 2.(c)). The
available time is t_a^{Φ3} = t_it2 and the maximum number of
words that can be loaded in this phase is l_a^{Φ3} = (N_b/P) · t_it2.
Φ4: In the event that the previously listed phases are not
sufficient to complete the configuration, an early stopping of
the decoding of code C1 can be scheduled to make available
additional cycles to be used for loading the remaining part
of the configuration words. We indicate the number of
cycles available in this phase as t_a^{Φ4} = t_stop. The number
of words that can be loaded in Φ4 is l_a^{Φ4} = (N_b/P) · t_stop.
As one or more complete iterations are dropped in Φ4, t_stop
is a multiple of t_it1, which can be formalized as

t_stop = n_stop · t_it1,  n_stop = 0, 1, 2, 3, ...   (12)

Differently from the other four phases, Φ4 affects the

Citations
Journal ArticleDOI
TL;DR: The low-complexity, highly parallel, and flexible systematic-encoding algorithm that is used is described and proved correct, and the flexible software encoder and decoder implementations are shown to be able to maintain high throughput and low latency.
Abstract: In this paper, we present hardware and software implementations of flexible polar systematic encoders and decoders. The proposed implementations operate on polar codes of any length less than a maximum and of any rate. We describe the low-complexity, highly parallel, and flexible systematic-encoding algorithm that we use and prove its correctness. Our hardware implementation results show that the overhead of adding code rate and length flexibility is little, and the impact on operation latency minor compared with code-specific versions. Finally, the flexible software encoder and decoder implementations are also shown to be able to maintain high throughput and low latency.

112 citations


Cites background from "VLSI Implementation of a Multi-Mode..."

  • ...The overhead of building a flexible LDPC decoder capable of decoding different codes is significant, and creating flexible LDPC decoders is an active area of research [2], [3]....

    [...]

Patent
10 Jul 2015
TL;DR: Flexible encoders and decoders supporting polar codes of any length up to a design maximum allow adaptive polar code systems responsive to communication link characteristics, performance, etc. whilst maximizing throughput.
Abstract: Modern communication systems must cope with varying channel conditions and differing throughput constraints. Polar codes, despite being the first error-correcting codes with an explicit construction to achieve the symmetric capacity of memoryless channels, are not currently employed against other older coding protocols such as low-density parity check (LDPC) codes, as their performance at short/moderate lengths has been inferior and their decoding algorithm is serial, leading to low decoding throughput. Accordingly, techniques to address these issues are identified and disclosed, including decoders that decode constituent codes without recursion and/or recognize classes of constituent directly decodable codes, thereby increasing the decoder throughput. Flexible encoders and decoders supporting polar codes of any length up to a design maximum allow adaptive polar code systems responsive to communication link characteristics, performance, etc., whilst maximizing throughput. Further, designers are provided flexibility in implementing either hardware or software implementations.

103 citations

Journal ArticleDOI
TL;DR: The logarithmic-Bahl-Cocke-Jelinek-Raviv (LBCJR) algorithm used in MAP decoders is presented, together with an ungrouped backward recursion technique for the computation of backward state metrics.
Abstract: This work focuses on the VLSI design aspect of high-speed maximum a posteriori (MAP) probability decoders which are intrinsic building-blocks of parallel turbo decoders. For the logarithmic-Bahl-Cocke-Jelinek-Raviv (LBCJR) algorithm used in MAP decoders, we have presented an ungrouped backward recursion technique for the computation of backward state metrics. Unlike the conventional decoder architectures, MAP decoder based on this technique can be extensively pipelined and retimed to achieve higher clock frequency. Additionally, the state metric normalization technique employed in the design of an add-compare-select-unit (ACSU) has reduced critical path delay of our decoder architecture. We have designed and implemented turbo decoders with 8 and 64 parallel MAP decoders in 90 nm CMOS technology. VLSI implementation of an 8 × parallel turbo-decoder has achieved a maximum throughput of 439 Mbps with 0.11 nJ/bit/iteration energy-efficiency. Similarly, 64 × parallel turbo-decoder has achieved a maximum throughput of 3.3 Gbps with an energy-efficiency of 0.079 nJ/bit/iteration. These high-throughput decoders meet peak data-rates of 3GPP-LTE and LTE-Advanced standards.

54 citations


Cites background or result from "VLSI Implementation of a Multi-Mode..."

  • ...Table III summarizes the key characteristics of implemented decoders of our work and compares them with the reported parallel turbo-decoder implementations in the literature [7], [8], [10]–[15]....

    [...]

  • ...Recently, a novel hybrid decoder-architecture of turbo low-density-parity-check (LDPC) codes for multiple wireless communication standards has been proposed in [15]....

    [...]

Journal ArticleDOI
TL;DR: This work proposes a dynamic multi-frame processing schedule which efficiently utilizes layered LDPC decoding with minimum pipeline stages, together with efficient comparison techniques for both column- and row-layered schedules and rejection-based high-speed circuits to compute the two minimum values from multiple inputs required for row-layered processing of the hardware-friendly min-sum decoding algorithm.
Abstract: This paper presents architecture of block-level-parallel layered decoder for irregular LDPC code. It can be reconfigured to support various block lengths and code rates of IEEE 802.11n (WiFi) wireless-communication standard. We have proposed efficient comparison techniques for both column and row layered schedule and rejection-based high-speed circuits to compute the two minimum values from multiple inputs required for row layered processing of hardware-friendly min-sum decoding algorithm. The results show good speed with lower area as compared to state-of-the-art circuits. Additionally, this work proposes dynamic multi-frame processing schedule which efficiently utilizes the layered-LDPC decoding with minimum pipeline stages. The suggested LDPC-decoder architecture has been synthesized and post-layout simulated in 90 nm-CMOS process. This decoder occupies 5.19 mm² area and supports multiple code rates like 1/2, 2/3, 3/4 & 5/6 as well as block-lengths of 648, 1296 & 1944. At a clock frequency of 336 MHz, the proposed LDPC-decoder has achieved better throughput of 5.13 Gbps and energy efficiency of 0.01 nJ/bits/iterations, as compared to the similar state-of-the-art works.

42 citations


Cites methods from "VLSI Implementation of a Multi-Mode..."

  • ...[14] uses different processing cores for LDPC and turbo decoding with shared memory to store messages for both the processes....

    [...]

  • ...Multi-mode reconfigurable architectures in [14] and [15] have the flexibility to switch between LDPC and turbo decoding-process....

    [...]

Journal ArticleDOI
TL;DR: A new SISO-decoder architecture is proposed that leads to significant throughput gains and better hardware efficiency compared to existing architectures for the full range of code rates.
Abstract: Turbo decoders for modern wireless communication systems have to support high throughput over a wide range of code rates. In order to support the peak throughputs specified by modern standards, parallel turbo-decoding has become a necessity, rendering the corresponding VLSI implementation a highly challenging task. In this paper, we explore the implementation trade-offs of parallel turbo decoders based on sliding-window soft-input soft-output (SISO) maximum a-posteriori (MAP) component decoders. We first introduce a new approach that allows for a systematic throughput comparison between different SISO-decoder architectures, taking their individual trade-offs in terms of window length, error-rate performance and throughput into account. A corresponding analysis of existing architectures clearly shows that the latency of the sliding-window SISO decoders causes diminishing throughput gains with increasing degree of parallelism. In order to alleviate this parallel turbo-decoder predicament, we propose a new SISO-decoder architecture that leads to significant throughput gains and better hardware efficiency compared to existing architectures for the full range of code rates.

39 citations

References
Book
01 Jan 1963
TL;DR: A simple but nonoptimum decoding scheme operating directly from the channel a posteriori probabilities is described and the probability of error using this decoder on a binary symmetric channel is shown to decrease at least exponentially with a root of the block length.
Abstract: A low-density parity-check code is a code specified by a parity-check matrix with the following properties: each column contains a small fixed number j ≥ 3 of 1's and each row contains a small fixed number k > j of 1's. The typical minimum distance of these codes increases linearly with block length for a fixed rate and fixed j. When used with maximum likelihood decoding on a sufficiently quiet binary-input symmetric channel, the typical probability of decoding error decreases exponentially with block length for a fixed rate and fixed j. A simple but nonoptimum decoding scheme operating directly from the channel a posteriori probabilities is described. Both the equipment complexity and the data-handling capacity in bits per second of this decoder increase approximately linearly with block length. For j > 3 and a sufficiently low rate, the probability of error using this decoder on a binary symmetric channel is shown to decrease at least exponentially with a root of the block length. Some experimental results show that the actual probability of decoding error is much smaller than this theoretical bound.

11,592 citations


"VLSI Implementation of a Multi-Mode..." refers background in this paper

  • ...parity check matrix which is very sparse [25]....

    [...]

Proceedings Article
01 Jan 1993

7,742 citations

Proceedings ArticleDOI
23 May 1993
TL;DR: In this article, a new class of convolutional codes called turbo-codes, whose performances in terms of bit error rate (BER) are close to the Shannon limit, is discussed.
Abstract: A new class of convolutional codes called turbo-codes, whose performances in terms of bit error rate (BER) are close to the Shannon limit, is discussed. The turbo-code encoder is built using a parallel concatenation of two recursive systematic convolutional codes, and the associated decoder, using a feedback decoding rule, is implemented as P pipelined identical elementary decoders.

5,963 citations

Journal ArticleDOI
TL;DR: The general problem of estimating the a posteriori probabilities of the states and transitions of a Markov source observed through a discrete memoryless channel is considered and an optimal decoding algorithm is derived.
Abstract: The general problem of estimating the a posteriori probabilities of the states and transitions of a Markov source observed through a discrete memoryless channel is considered. The decoding of linear block and convolutional codes to minimize symbol error probability is shown to be a special case of this problem. An optimal decoding algorithm is derived.

4,830 citations


"VLSI Implementation of a Multi-Mode..." refers methods in this paper

  • ...According to [29] a-posteriori information is computed as...

    [...]

  • ...Each constituent decoder performs the so called BCJR algorithm [29] that starting from the intrinsic and a priori information produces the extrinsic information....

    [...]

Frequently Asked Questions (12)
Q1. What contributions have the authors mentioned in the paper "VLSI implementation of a multi-mode turbo/LDPC decoder architecture"?

This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC codes decoding. The novel contributions of this paper are: i) tackling the reconfiguration issue introducing a formal and systematic treatment that, to the best of their knowledge, was not previously addressed; ii) proposing a reconfigurable NoC-based turbo/LDPC decoder architecture and showing that wide flexibility can be achieved with a small complexity overhead.

Since every window is composed of at least 20 trellis steps, requiring 3 · 20 clock cycles to be executed, there is enough time to load β_k[s] and α_k[s] values to initialize the next window.

In case of Single Binary Turbo Codes (SBTC), like those used in 3GPP-LTE, only two λ_k[c(e)] and one λ_k[b] are necessary for a trellis step, and they can be read in two clock cycles without impairing the throughput.

Six locations are used to store 2 β_k[s] or α_k[s] (Fig. 11): since at most three 8-state windows' initialization metrics, i.e. 24 β_k[s] and 24 α_k[s], are stored at the same time, only 144 out of 400 locations are used.

In the event that the previously listed phases are not sufficient to complete the configuration, an early stopping of the decoding of code C1 can be scheduled to make available additional cycles to be used for loading the remaining part of the configuration words.

Turbo and LDPC decoding algorithms are characterized by strong resemblances: they are iterative, work on graph-based representations, are routinely implemented in logarithmic form, process data expressed as Logarithmic-Likelihood-Ratios (LLRs) and require a high level of both processing and storage parallelism.

The reconfiguration probability ranges between 0.25% and 0.3% in the presence of the fast-moving receiver, while it remains under 0.15% in the other case.

Comparison of the proposed architectures with the state of the art shows very good efficiency, competitive area occupation and an unmatched degree of flexibility.

This is because the LDPC consumption is calculated on a DTMB code, which makes full use of the extended memories, while the memory usage percentages for DBTC remain low.

Though the actual relationship between t_it and l_c is affected by memory scheduling and by the ratio between PE and NoC clock frequencies, this analysis is carried out with the worst-case assumption that the reconfiguration memory is read at every clock cycle of each iteration, setting l_c = t_it for both C1 and C2 codes.

Taking into account the presented 22-node architecture, the maximum ratio f_core/f_NoC for which this assumption stands is 2/3 for LDPC codes and SBTC, while 3/5 is necessary for DBTC.

The authors suppose that the reconfiguration data are moved directly to the PEs via a set of N_b dedicated buses, each one linked to P/N_b PEs.