Journal ArticleDOI

VLSI Implementation of a Multi-Mode Turbo/LDPC Decoder Architecture

01 Jun 2013-IEEE Transactions on Circuits and Systems I-regular Papers (IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 445 HOES LANE, PISCATAWAY, NJ 08855 USA)-Vol. 60, Iss: 6, pp 1441-1454
TL;DR: This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC codes decoding, tackling the reconfiguration issue and introducing a formal and systematic treatment that was not previously addressed.
Abstract: Flexible and reconfigurable architectures have gained wide popularity in the communications field. In particular, reconfigurable architectures for the physical layer are an attractive solution not only to switch among different coding modes but also to achieve interoperability. This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC codes decoding. The novel contributions of this paper are: i) tackling the reconfiguration issue introducing a formal and systematic treatment that, to the best of our knowledge, was not previously addressed and ii) proposing a reconfigurable NoC-based turbo/LDPC decoder architecture and showing that wide flexibility can be achieved with a small complexity overhead. Obtained results show that dynamic switching between most of the considered communication standards is possible without pausing the decoding activity. Moreover, post-layout results show that tailoring the proposed architecture to the WiMAX standard leads to an area occupation of 2.75 mm² and a power consumption of 101.5 mW in the worst case.

Summary (4 min read)

Introduction

  • In recent years, several efforts have been spent to develop systems able to give ubiquitous access to telecommunication networks.
  • In both approaches, flexible and efficient interconnection structures are required to connect PEs to each other.
  • The use of an intra-IP NoC as the interconnection framework for both turbo and LDPC code decoders has been demonstrated in several works [16], [19]–[21].
  • In Section VII evaluations of the architecture performance on various existing standards are provided.

II. DECODING ALGORITHMS

  • Turbo and LDPC decoding algorithms are characterized by strong resemblances: they are iterative, work on graph-based representations, are routinely implemented in logarithmic form, process data expressed as Logarithmic-Likelihood-Ratios (LLRs) and require a high level of both processing and storage parallelism.
  • Both algorithms receive intrinsic information from the channel and produce extrinsic information that is exchanged across iterations to obtain the a priori information of uncoded bits, in the case of binary codes, or symbols, in the case of non binary codes.
  • Moreover, their arithmetical functions are so similar that joint or derived algorithms for both LDPC and turbo decoding exist [24].
  • In the following for both codes the authors will refer to K, N and r = K/N as the number of uncoded bits, the number of coded bits and the code rate respectively.

A. LDPC codes decoding algorithm

  • The decoding of LDPC codes stems from the Tanner graph representation of H where two sets of nodes are identified: Variable Nodes (VNs) and Check Nodes (CNs).
  • There are two main scheduling schemes for the BP: two-phase scheduling and layered scheduling [26].
  • In a layered decoder, parity-check constraints are grouped in layers each of which is associated to a component code.
  • This process is iterated up to the desired level of reliability.
  • Let λ[c] represent the LLR of symbol c; for column k in H, the bit LLR λ_k[c] is initialized to the corresponding received soft value.

B. Turbo codes decoding algorithm

  • Turbo codes are obtained as the parallel concatenation of two constituent Convolutional Code (CC) encoders connected by means of an interleaver (Π).
  • Each constituent decoder performs the so called BCJR algorithm [29] that starting from the intrinsic and a priori information produces the extrinsic information.
  • Several exact and approximated expressions are available for the max*{x_i} function [31]: for example, it can be implemented as max{x_i} followed by a correction term, often stored in a small Look-Up-Table (LUT).
  • On the other hand, they execute (1) to (5) in parallel for P slices of parity check constraints when configured in LDPC code mode.
  • In the following, the authors indicate the j-th message received and generated by PE i as λ′_{i,j} and λ_{i,j} respectively.

III. NOC-BASED DECODER

  • The goal of this work is to design a highly flexible LDPC and turbo decoder, able to support a very wide set of different communication standards.
  • The node architecture employed in this work for node i is represented in Fig. 1.
  • The routing algorithm is the one proposed in [19] as Single-Shortest-Path-FIFO-Length (SSP-FL).
  • It is worth noting that the destination of each λ_{i,j} is imposed by the interleaver (turbo mode) and by the H matrix (LDPC mode) respectively.
  • The PE includes both LDPC and turbo decoding cores: their architectures are structured to be as independent as possible of the supported codes.

IV. DECODER RECONFIGURATION

  • Change of decoding mode, standard or code parameters requires not only hardware support, but also memory initialization and specific controls: since in many standards a code switch can be issued as early as one data frame ahead [5], a time efficient reconfiguration technique must be developed.
  • The reconfiguration of the considered decoder to switch from the code currently processed (C1) to a new one (C2) can be overlapped with the decoding of both current and new code, provided that enough locations are free in the configuration memories.
  • Finally, in case the overlap with decoding activity is not sufficient to complete the whole configuration, a further option is pausing the decoder by skipping one or more iterations on the last received frame for C1 and using the available time, before starting the decoding of the new frame encoded with C2.
  • Two alternative cases can arise during Φ1: either this phase is limited by the available time, or it is limited by the number of free locations in the reconfiguration memory: (n_it1 − 1) · t_it1 ≷ (P/N_b) · (B − l_c1) (14); a minimal numerical sketch follows this list.
  • The bound is also proportional to N_b, and can consequently be increased by raising the number of reconfiguration buses.
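The bound in (14) can be checked numerically. Below is a minimal sketch with parameter names from Section IV; the example values are hypothetical and only illustrate which side of (14) limits phase Φ1:

```python
# Phase Phi_1 is limited either by the available time, (n_it1 - 1) * t_it1,
# or by the free buffer locations, B - l_c1, as in (14).
# All numbers below are hypothetical, chosen only for illustration.

def phi1_budget(n_it1, t_it1, P, N_b, B, l_c1):
    writable = (N_b / P) * (n_it1 - 1) * t_it1   # l_a^Phi1: writable locations
    free = B - l_c1                              # unused buffer locations
    limit = "available time" if writable < free else "free locations"
    return int(min(writable, free)), limit

print(phi1_budget(n_it1=10, t_it1=600, P=22, N_b=5, B=4000, l_c1=600))
# -> (1227, 'available time')
```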

V. RECONFIGURATION: CASES AND EXAMPLES

  • The reconfiguration method detailed in Section IV has been applied to a set of target standards, in order to identify suitable design parameters (i.e. N_b, B, n_stop, N_f^max) that enable reconfiguration without pausing the decoder for most code sizes.
  • It can be noticed that with n_stop = 3 all the large codes are below the right side of the curve: later in this section it is demonstrated that these skipped iterations are negligible in terms of BER performance.
  • In Fig. 6, the effect of different choices of N_f is shown: the plot shows that N_f > 0 actually increases the maximum l_c2 only for small C1 codes.
  • Among the remaining three combinations, the one that makes use of 6 buses yields a higher area occupation than the others.

VI. DECODING CORES

  • The design of the decoding cores must yield the same degree of flexibility of the NoC, being as independent as possible of the set of supported codes.
  • In [14] a completely serial LDPC decoding core has been designed, mostly independent of block length and code rate: an arbitrary number of CN operations can be scheduled on it.
  • The same holds true for the serial SISO, where different windows can be scheduled, regardless of the size of the interleaver.
  • As a consequence, in this work logic sharing is not addressed.
  • Experimental results show that the area of the architecture is indeed dominated by memories.

A. Quantization and Memory Organization

  • The memory organization evolves from the idea presented in [14], in which every decoding core instantiates two memories: a 7-bit memory and a 5-bit memory.
  • On the same graph, yielding similar results, a few turbo code examples (WiMAX and HPAV) are plotted, in which the λ_k[b] and channel LLR representations change from 7 to 6 bits, and λ_k[c(e)] from 5 to 4 bits (the meaning of λ_k[b] is detailed in Section VI-C1).
  • Curves obtained with floating point precision show improvements between 0.1 and 0.2 dB w.r.t. the selected precisions.
  • Thanks to these changes, a single 6-bit wide memory is instantiated, in which both λk[c] and Rlk values are saved.
  • Since R_lk can take only two possible values, for each CN the authors memorize 576 · 2 magnitudes and 576 · 15 2-bit indexes that identify the correct R_lk magnitude and its sign.

B. LDPC Decoding Core

  • The LDPC decoding core used in the decoder described in [14] relies on a serial architecture suited for exclusive memory usage.
  • The average number of cycles per data item varies between one and two.
  • Once min1 and min2 have been successfully extracted, they are compared to all the Q_lk[c] of the CN, which are delayed by a number of clock cycles equal to the degree of the CN (deg), to compute R_lk^new as in (4); see the sketch after this list.
  • Both 6-bit and 2-bit memories are implemented as dual-port RAMs, allowing two concurrent operations (memory scheduling).
  • On the contrary, port 2 is set to read mode, loading the two R_lk^old magnitudes of CN j+1 stored during the previous iteration.
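The min1/min2 extraction and the compressed R_lk storage lend themselves to a compact behavioural sketch; the following Python fragment is illustrative only (the normalization factor 0.75 and the interface are assumptions, not the RTL):

```python
# Serial extraction of min1/min2 over the |Q_lk[c]| of one CN, followed by
# the normalized min-sum output (6): each outgoing R_lk needs only the two
# stored magnitudes, an index selecting between them, and a sign.
# sigma = 0.75 is an illustrative normalization factor (sigma <= 1).

def cn_update(Q, sigma=0.75):
    min1 = min2 = float("inf")
    idx1, sign_prod = -1, 1
    for k, q in enumerate(Q):                 # one Q_lk[c] per clock cycle
        mag = abs(q)
        sign_prod *= 1 if q >= 0 else -1
        if mag < min1:
            min1, min2, idx1 = mag, min1, k
        elif mag < min2:
            min2 = mag
    R_new = []
    for k, q in enumerate(Q):                 # delayed Q values re-enter
        mag = min2 if k == idx1 else min1     # exclude own contribution
        sign = sign_prod * (1 if q >= 0 else -1)
        R_new.append(sigma * sign * mag)
    return R_new
```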

C. Turbo Decoding Core

  • As for the LDPC decoding core, also the SISO core yields a very high degree of flexibility, limited only by the size of the memories: any double-binary turbo code can be decoded as long as the memory capacity is sufficient.
  • The SISO interfaces with the NoC via two dedicated input and output blocks, respectively called Bit-To-Symbol Conversion Unit (BTS CU) and Symbol-To-Bit Conversion Unit (STB CU).
  • These metrics are computed in this exact order, storing β_k[s] values in a dedicated set of registers while the α_k[s] are being processed: the b(e) metric, which needs both β_k[s] and α_k[s], is calculated last.
  • The 2-bit memory is used in the same way, with port 1 in read mode and port 2 in write mode.
  • Since every window is composed of at least 20 trellis steps, requiring 3 · 20 clock cycles to be executed, there is enough time to load the β_k[s] and α_k[s] values that initialize the next window.

VII. SUPPORTED STANDARDS

  • The 22-node architecture presented in this work has been tested on a large set of communication standards.
  • To comply with each standard's throughput requirements, a single f_NoC = 300 MHz is sufficient in both LDPC and turbo mode, consequently identifying f_core^LDPC = 200 MHz and f_core^turbo = 170 MHz, both under the f_core/f_NoC constraint.
  • This area overhead is due to two specific functionalities that have been introduced in the proposed decoder: (i) full flexibility in terms of supported turbo and LDPC codes, and (ii) dynamic reconfiguration between different standards.
  • The parallelism of the NoC is increased from 22 nodes to 35 nodes, the reconfiguration buses rise from 5 to 8, and the support of LTE requires an increase in the size of 6-bit memories.
  • Throughput results for CMMB and DTMB are shown in the Implementation C column of Table VI.

D. Comparisons

  • Table VIII shows the detailed implementation results in comparison with state-of-the-art flexible turbo/LDPC decoders.
  • Baghdadi et al. in [11] propose an ASIP decoder architecture supporting WiMAX and WiFi LDPC codes, and WiMAX, 3GPP-LTE and DVB-RCS turbo codes.
  • On the contrary, worst case throughput in [11] is not high enough for WiMAX.
  • This leads to a better area efficiency in all three proposed implementations for most of the codes: particularly evident is the difference for DBTC (second-to-last row of Table VII).
  • They obtain very high maximum throughput efficiency in both LDPC and turbo mode: the range of supported codes is however quite limited w.r.t. all considered implementations, and the area occupation is larger than A.

IX. CONCLUSIONS

  • This work describes a flexible turbo/LDPC decoder architecture able to fully support a wide range of modern communication standards.
  • A complete analysis of the previously unaddressed inter- and intra-standard reconfiguration issue is presented, together with a dedicated reconfiguration technique that limits the complexity overhead and performance loss.
  • Three different implementations are proposed to cover different sets of standards.
  • Full layout design has been completed to provide accurate area and power figures.
  • Comparison of the proposed architectures with the state of the art shows very good efficiency, competitive area occupation and an unmatched degree of flexibility.



DOI: 10.1109/TCSI.2012.2221216

VLSI implementation of a multi-mode turbo/LDPC
decoder architecture
Carlo Condo, Maurizio Martina, Member, IEEE, Guido Masera, Senior Member, IEEE
Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino, Italy
Abstract—Flexible and reconfigurable architectures have
gained wide popularity in the communications field. In particular,
reconfigurable architectures for the physical layer are an attrac-
tive solution not only to switch among different coding modes
but also to achieve interoperability. This work concentrates on
the design of a reconfigurable architecture for both turbo and
LDPC codes decoding. The novel contributions of this paper
are: i) tackling the reconfiguration issue introducing a formal
and systematic treatment that, to the best of our knowledge, was
not previously addressed; ii) proposing a reconfigurable NoC-
based turbo/LDPC decoder architecture and showing that wide
flexibility can be achieved with a small complexity overhead.
Obtained results show that dynamic switching between most of the
considered communication standards is possible without pausing
the decoding activity. Moreover, post-layout results show that
tailoring the proposed architecture to the WiMAX standard leads
to an area occupation of 2.75 mm² and a power consumption of
101.5 mW in the worst case.
Index Terms—VLSI, LDPC/Turbo Codes Decoder, NoC, Flexibility,
Wireless communications
I. INTRODUCTION
In recent years, several efforts have been spent to develop
systems able to give ubiquitous access to telecommunication
networks. These efforts were spent mainly in three directions:
i) improving the transmission rate and reliability; ii) developing
bandwidth-efficient technologies; iii) designing low-cost
receivers. The most relevant results produced by such vivid
research were included in the latest standards for both wireless
and wired communications [1]–[7]. Besides, several standards
provide multiple modes and functionalities. However, sharing
common features to achieve flexibility and interoperability is
a challenging task.
Several recent works, including [8], have shown that flex-
ibility is an important property in the implementation of
communication systems. Some works investigated this di-
rection facing the challenge of implementing flexible archi-
tectures for the decoding of channel codes. In particular,
flexible turbo/Low-Density-Parity-Check (LDPC) decoder ar-
chitectures have been proposed not only to support differ-
ent coding modes within a specific standard but also to
enable interoperability among different standards. In [9]–
[11] flexibility is achieved through the design of Processing
Elements (PEs) based on Application-Specific-Instruction-set-
Processor (ASIP) architectures, whereas in [12]–[14] PEs rely
on Application-Specific-Integrated-Circuit (ASIC) solutions.
In both approaches, flexible and efficient interconnection struc-
tures are required to connect PEs to each other.
Unfortunately, the communication patterns of turbo and
LDPC codes suffer from collisions, namely two or more PEs
requiring concurrent access to the same memory resource. To
break these collisions, a Network-on-Chip (NoC)-like approach
was proposed in [15] for turbo codes. This idea has been
further developed in other works. In particular, in [16] the NoC
approach is used as a viable solution to implement flexible and
high throughput interconnection structures for turbo/LDPC
decoders.
An intra-IP NoC [17] is an application specific NoC [18]
where the interconnection structure is tailored to the char-
acteristics of the Intellectual Property (IP). The use of an
intra-IP NoC as the interconnection framework for both turbo
and LDPC code decoders has been demonstrated in several
works [16], [19]–[21]. This choice enables larger flexibility
with respect to other interconnection schemes [16], [22], [23],
but introduces penalties in terms of additional occupied area
and latency in the communication among PEs.
Stemming from the work presented in [14], [19], [20],
where an ASIC implementation of an NoC-based turbo/LDPC
decoder architecture is proposed, this paper aims to further
investigate and optimize it. In particular, this work features
the following novel contributions: i) management of dynamic
reconfiguration to switch from one code to another
without pausing the decoding, ii) description of a new PE
architecture with an improved shared memory solution which
provides relevant savings of occupied area for the min-sum
decoding algorithm, iii) evaluation of a wide set of standards for
both wireless and wired applications: IEEE 802.16e (WiMAX)
[5], IEEE 802.11n (WiFi) [6], China Multimedia Mobile
Broadcasting (CMMB) [3], Digital Terrestrial Multimedia
Broadcast (DTMB) [4], HomePlug AV (HPAV) [2], 3GPP
Long Term Evolution (LTE) [7], Digital Video Broadcasting
- Return Channel via Satellite (DVB-RCS) [1], iv) complete
VLSI implementation of the decoder up to layout level and
accurate evaluation of dissipated power.
It is worth noting that, to the best of our knowledge, this is
the first work addressing dynamic reconfiguration of flexible
channel decoders with an analytical approach, and showing
the actual impact of reconfiguration on both performance and
complexity. The paper is structured as follows. In Section
II decoding algorithms are briefly discussed, whereas Section
III deals with the basics of NoC-based turbo/LDPC decoder
architectures and summarizes the main results this work starts
from. The decoder reconfiguration techniques are detailed in
Sections IV and V, while Section VI deals with the descrip-
tion of LDPC and turbo decoding cores, along with their

respective memory organization. In Section VII evaluations of
the architecture performance on various existing standards are
provided. Implementation results are portrayed and discussed
in Section VIII, and conclusions are drawn in Section IX.
II. DECODING ALGORITHMS
Turbo and LDPC decoding algorithms are characterized
by strong resemblances: they are iterative, work on graph-
based representations, are routinely implemented in logarith-
mic form, process data expressed as Logarithmic-Likelihood-
Ratios (LLRs) and require high level of both processing and
storage parallelism. Both algorithms receive intrinsic informa-
tion from the channel and produce extrinsic information that is
exchanged across iterations to obtain the a priori information
of uncoded bits, in the case of binary codes, or symbols, in
the case of non binary codes. Moreover, their arithmetical
functions are so similar that joint or derived algorithms for
both LDPC and turbo decoding exist [24]. In the following
for both codes we will refer to K, N and r = K/N as the
number of uncoded bits, the number of coded bits and the
code rate respectively.
A. LDPC codes decoding algorithm
Every LDPC code is completely described by its M × N
parity check matrix H (M = N − K), which is very sparse
[25]. Each valid LDPC codeword x satisfies H · x^T = 0,
where (·)^T is the transposition operator. The decoding of LDPC
codes stems from the Tanner graph representation of H where
two sets of nodes are identified: Variable Nodes (VNs) and
Check Nodes (CNs). VNs are associated to the N bits of the
codeword, whereas CNs correspond to the M parity-check
constraints. The most common algorithm to decode LDPC
codes is the Belief Propagation (BP) algorithm. There are two
main scheduling schemes for the BP: two-phase scheduling
and layered scheduling [26]. The latter nearly doubles the
converge speed as compared to two-phase scheduling. In a
layered decoder, parity-check constraints are grouped in layers
each of which is associated to a component code. Then, layers
are decoded in sequence by propagating extrinsic information
from one layer to the following one [26]. This process is
iterated up to the desired level of reliability.
Let λ[c] represent the LLR of symbol c and, for column k
in H, let the bit LLR λ_k[c] be initialized to the corresponding
received soft value. Then, for all parity constraints l in a given
layer, the following operations are executed:

Q_lk[c] = λ_k^old[c] − R_lk^old   (1)

A_lk = Σ_{n∈N(l), n≠k} Ψ(Q_ln[c])   (2)

δ_lk = Π_{n∈N(l), n≠k} sgn(Q_ln[c])   (3)

R_lk^new = δ_lk · Ψ^{−1}(A_lk)   (4)

λ_k^new[c] = Q_lk[c] + R_lk^new   (5)

λ_k^old[c] is the extrinsic information received from the previous
layer and updated in (5) to be propagated to the succeeding
layer. Term R_lk^old, pertaining to element (l, k) of H and
initialized to 0, is used to compute (1); the same amount is
then updated in (4) to R_lk^new and stored to be used again in the
following iteration. In (2) and (3), N(l) is the set of all bit
indexes that are connected to parity constraint l.
According to [27], the Ψ(·) function in (2) and (4) can be
simplified with a limited BER performance loss as

R_lk^new ≈ δ′_lk · min_{n∈N(l), n≠k} {|Q_ln[c]|},   (6)

usually referred to as the normalized min-sum approximation,
where δ′_lk = σ · δ_lk and σ ≤ 1.
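As a concrete rendering of the layered update, the following minimal floating-point sketch applies (1)-(6) to one layer; the data structures (H_layer, lam, R) are illustrative assumptions, and the real decoder operates on quantized LLRs:

```python
import math

# One layered normalized-min-sum update over a layer, following (1)-(6).
# H_layer[l] is N(l), the bit indexes of parity constraint l; lam[k] holds
# lambda_k[c]; R maps (l, k) to R_lk and starts empty (i.e. R_lk = 0).

def process_layer(H_layer, lam, R, sigma=0.75):
    for l, N_l in enumerate(H_layer):
        Q = {k: lam[k] - R.get((l, k), 0.0) for k in N_l}           # (1)
        for k in N_l:
            others = [Q[n] for n in N_l if n != k]
            delta = math.prod(1 if q >= 0 else -1 for q in others)  # (3)
            R[(l, k)] = sigma * delta * min(abs(q) for q in others) # (6)
            lam[k] = Q[k] + R[(l, k)]                               # (5)
    return lam, R
```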
B. Turbo codes decoding algorithm
Turbo codes are obtained as the parallel concatenation of
two constituent Convolutional Code (CC) encoders connected
by means of an interleaver (Π). Thus, the decoder is
made of two constituent decoders, referred to as Soft-In-Soft-
Out (SISO) or Maximum-A-Posteriori (MAP) decoders [28]
connected in an iterative loop by means of the interleaver
Π and the de-interleaver Π^{−1}. Each constituent decoder
performs the so-called BCJR algorithm [29] that, starting from
the intrinsic and a priori information, produces the extrinsic
information. Let k be a step in the trellis representation of the
constituent CC, and u an uncoded symbol. Each constituent
decoder computes λ_k[u] = σ · (λ_k^apo[u] − λ_k^apr[u] − λ_k[c^u]),
where σ ≤ 1 [30], λ_k^apo[u] is the a-posteriori information,
λ_k^apr[u] is the a priori information and λ_k[c^u] is the systematic
component of the intrinsic information. According to [29] the
a-posteriori information is computed as

λ_k^apo[u] = max*_{e:u(e)=u} {b(e)} − max*_{e:u(e)=ũ} {b(e)}   (7)

where ũ ∈ U is an uncoded symbol taken as a reference
(usually ũ = 0) and u ∈ U \ {ũ}, with U the set of uncoded
symbols; e is a trellis transition and u(e) is the corresponding
uncoded symbol. Several exact and approximated expressions
are available for the max*{x_i} function [31]: for example, it
can be implemented as max{x_i} followed by a correction
term, often stored in a small Look-Up-Table (LUT). The
correction term, usually adopted when decoding binary codes
(Log-MAP), can be omitted with minor Bit-Error-Rate (BER)
performance degradation (Max-Log-MAP). The term b(e) in
(7) is defined as:

b(e) = α_{k−1}[s^S(e)] + γ_k[e] + β_k[s^E(e)]   (8)

α_k[s] = max*_{e:s^E(e)=s} { α_{k−1}[s^S(e)] + γ_k[e] }   (9)

β_k[s] = max*_{e:s^S(e)=s} { β_{k+1}[s^E(e)] + γ_k[e] }   (10)

γ_k[e] = λ_k^apr[u(e)] + λ_k[c(e)]   (11)

where s^S(e) and s^E(e) are the starting and the ending states
of e, and α_{k−1}[s^S(e)] and β_k[s^E(e)] are the forward and
backward metrics associated to s^S(e) and s^E(e) respectively.
The term λ_k[c(e)] represents the intrinsic information received
from the channel.

Figure 1. Node structure (RE i: 4×4 crossbar with routing algorithm and
crossbar configuration; location memory MEM i; PE i; reconfiguration bus;
inputs λ′_{i,j} and t′_{i,j}; output λ_{i,j})

For further details on the decoding algorithm the reader can
refer to [32].
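A short sketch of the max* operator used in (7)-(10) may be helpful. The exact Log-MAP form adds a LUT-based correction approximating ln(1 + e^{−|a−b|}); the table contents and quantization step below are illustrative assumptions:

```python
# max* over two operands: exact Log-MAP (max plus LUT correction) or the
# Max-Log-MAP approximation (max only). LUT entries approximate
# ln(1 + exp(-|a - b|)) for |a - b| = 0, 0.5, 1.0, ... (illustrative values).
_LUT = [0.69, 0.47, 0.31, 0.20, 0.13, 0.08, 0.05, 0.03]

def max_star(a, b, exact=True):
    m, d = max(a, b), abs(a - b)
    if not exact:
        return m                                        # Max-Log-MAP
    return m + _LUT[min(int(d / 0.5), len(_LUT) - 1)]   # Log-MAP

# alpha_k[s] in (9) is obtained by folding max_star over all transitions e
# with s^E(e) = s, each candidate being alpha_{k-1}[s^S(e)] + gamma_k[e].
```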
In a parallel decoder, the decoding operations summarized
in previous paragraphs are partitioned among P PEs. When
configured in turbo code mode, these PEs operate as con-
current SISOs. On the other hand, they execute (1) to (5)
in parallel for P slices of parity check constraints when
configured in LDPC code mode. In both cases, messages
are exchanged among PEs to propagate the λ_k[u] and λ_k^new[c]
amounts in accordance with the code structure. In the following,
we indicate the j-th message received and generated by
PE i as λ′_{i,j} and λ_{i,j} respectively.
III. NOC-BASED DECODER
The goal of this work is to design a highly flexible LDPC
and turbo decoder, able to support a very wide set of different
communication standards. The proposed multi-mode/multi-
standard decoder architecture relies on an NoC-based struc-
ture, where each node contains a PE and a routing element
(RE). Each PE implements the BCJR and layered normalized
min-sum algorithms. On the other hand, REs are devoted to
delivering λ_{i,j} values to the correct destination.
The node architecture employed in this work for node i is
represented in Fig. 1. Each RE is constituted by a 4×4 crossbar
switch with 4 input FIFOs and 4 output registers. The routing
algorithm is the one proposed in [19] as Single-Shortest-
Path-FIFO-Length (SSP-FL). SSP-FL relies on a distributed
table-based routing algorithm where each table contains the
information for shortest path routing. The routing information
is precalculated by running off-line the Floyd-Warshall algo-
rithm. Moreover, in SSP-FL shortest path routing is coupled
with an input serving policy based on the current status of
the FIFOs, namely in case two messages must be routed to
the same output port, priority is given to the message coming
from the longer FIFO. It is worth noting that the destination
of each λ_{i,j} is imposed by the interleaver (turbo mode) and
by the H matrix (LDPC mode) respectively. As a consequence,
the routing is deterministic.
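The off-line part of SSP-FL can be sketched as follows: a Floyd-Warshall pass produces, for every node, the first hop toward each destination, which fills the per-RE routing tables. The 4-node ring used below is an illustrative topology, not the 22-node Kautz graph of the decoder:

```python
INF = float("inf")

def build_routing_tables(adj):
    # adj[i] lists the neighbours of node i (unit-cost links).
    n = len(adj)
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    nxt = [[i if i == j else None for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in adj[i]:
            dist[i][j], nxt[i][j] = 1, j
    for k in range(n):                       # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]    # route via k: keep first hop to k
    return nxt                               # nxt[i][j]: output port of RE i

tables = build_routing_tables([[1, 3], [2, 0], [3, 1], [0, 2]])  # 4-node ring
# At run time, ties between input FIFOs contending for the same output are
# broken in favour of the longer FIFO (the FL part of SSP-FL).
```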
The PE includes both LDPC and turbo decoding cores:
their architectures are structured to be as independent as
possible of the supported codes. The LDPC decoding core
is completely serial and able to decode any LDPC code,
provided that enough memory is available. The SISO core
for turbo decoding is tailored around 8-state turbo codes,
and no other constraints are present: the two cores share the
memories where the incoming data λ′_{i,j} are stored and the
location memory containing the pre-computed t′_{i,j} values, i.e.
the memory addresses where the λ′_{i,j} are stored. Also, the
interconnection structure depends only on the location memory
size, which sets an upper bound to the number of messages
each PE can handle.
The decoding task is divided uniformly among the different
nodes. The process is straightforward in turbo mode, with
each node being assigned a portion of the trellis that is
processed in a sliding-window fashion [33], [34]. Extrinsic
and window-initialization information are carried through the
network according to the code interleaving and deinterleaving
rules [19]. On the contrary, in LDPC mode the partitioning
of the decoding task on the PEs is obtained as follows.
Using a proprietary tool based on the METIS graph partitioning
library [35], the H matrix is partitioned on the chosen network
topology. At this point the destination of every message
coming out of each decoding core is known. Thus, in both
turbo and LDPC modes each outgoing message is made of a
payload λ_{i,j} and a header containing the destination node.
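As an illustration of this header/payload format, a message could be packed as follows; the field widths (6-bit destination header, 8-bit two's-complement LLR) are assumptions chosen for the example, not the widths used in the decoder:

```python
HDR_BITS, LLR_BITS = 6, 8          # assumed widths, for illustration only

def pack(dest, llr):
    payload = llr & ((1 << LLR_BITS) - 1)        # two's-complement LLR
    return (dest << LLR_BITS) | payload

def unpack(msg):
    llr = msg & ((1 << LLR_BITS) - 1)
    if llr >= 1 << (LLR_BITS - 1):               # sign-extend the payload
        llr -= 1 << LLR_BITS
    return msg >> LLR_BITS, llr                  # (destination node, LLR)

assert unpack(pack(dest=13, llr=-7)) == (13, -7)
```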
The performance of meshes, toroidal meshes, spidergon,
honeycomb, De Bruijn and Kautz graphs was compared, along
with a number of other design choices, such as the routing
algorithm and collision management policies. This analysis
shows that the Kautz topology yields the best results in terms
of area occupation and obtainable throughput. In particular,
in [14] a 22-node Kautz NoC was used to fully support the
IEEE 802.16e standard, each node being connected to a decoding PE and to
three other nodes via a 4-way router.
IV. DECODER RECONFIGURATION
Flexible decoders available in the literature [9]–[13], [16],
[17], [19], [20], though supporting a wide range of codes,
do not address the reconfiguration issue. Change of decoding
mode, standard or code parameters requires not only hardware
support, but also memory initialization and specific controls:
since in many standards a code switch can be issued as early
as one data frame ahead [5], a time-efficient reconfiguration
technique must be developed.
For the proposed decoder the reconfiguration task consists
of i) rewriting the location memory containing t′_{i,j} values; ii)
reloading the CN degree (deg) parameters and the window
size in the control unit of LDPC decoding cores and SISOs
respectively. In the following, the whole set of storage loca-
tions to be updated at reconfiguration time will be indicated as
“reconfiguration memory”. When possible, the decoder must
be reconfigured while the decoding process is still running on
the previous data frame. This means that the reconfiguration
data can be distributed by means of the NoC interconnections
only at the cost of severe performance penalties. Consequently,
we suppose that the reconfiguration data are moved directly
to the PEs via a set of N_b dedicated buses, each one linked
to P/N_b PEs.

Figure 2. Memory reconfiguration process: (a) decoding of C1; (b) upload
of reconfiguration data required for C2 (phases Φ1 to Φ3 and Φ5); (c) first
iteration of C2 and concurrent upload of reconfiguration data (Φ4)
In the following we estimate the reconfiguration occurrence
assuming mobile receivers moving at different speeds and
a carrier frequency f_c = 2.4 GHz. This frequency is
included in most standards' operation range, and used in a
variety of applications. In this scenario the communication
channel is affected by fading phenomena, namely slow fading,
whose effects have very long time constants, and fast fading.
Fast fading can be modeled assuming a change of channel
conditions every time the receiver moves by a distance
similar to the wavelength λ of the carrier. Being λ = 0.125
m, at a speed v = 70 km/h the channel changes with a
frequency f_chng = 155 Hz (WiMAX, WiFi, 3GPP-LTE),
whereas at v = 10 km/h (DVB-RCS, HPAV, CMMB, DTMB)
changes occur at f_chng = 22 Hz. These scenarios result in
different reconfiguration probabilities, whose impact on BER
performance is addressed in Section V.
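The channel-change rates quoted above follow from f_chng ≈ v/λ, the rate at which the receiver crosses one carrier wavelength; a two-line numerical check:

```python
wavelength = 3e8 / 2.4e9                  # lambda = c / f_c = 0.125 m
for v_kmh in (70, 10):
    f_chng = (v_kmh / 3.6) / wavelength   # km/h to m/s, then v / lambda
    print(f"v = {v_kmh} km/h -> f_chng = {f_chng:.1f} Hz")
# v = 70 km/h -> f_chng = 155.6 Hz; v = 10 km/h -> f_chng = 22.2 Hz
```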
The reconfiguration memory is organized as a circular
buffer: two sets of pointers are used to manage reading and
writing operations. The Start of Current Configuration (SCC)
pointer and the End of Current Configuration (ECC) pointer
delimit the memory blocks that are currently being used. A
Read Pointer (RP) is used to retrieve the data during the
decoding process, as shown in Fig. 2.(a). The Start of Future
Configuration (SFC) and End of Future Configuration (EFC)
pointers are instead used concurrently with the Write Pointer
(WP) to delimit the locations that are going to be used to store
the new configuration data.
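A simplified behavioural model of this circular buffer and its two pointer sets is sketched below; it illustrates the pointer handshake only, not the hardware implementation:

```python
class ReconfBuffer:
    # Behavioural sketch of one PE's reconfiguration memory (size B).
    def __init__(self, B):
        self.mem, self.B = [None] * B, B
        self.scc = self.ecc = self.rp = 0    # current configuration pointers
        self.sfc = self.efc = self.wp = 0    # future configuration pointers

    def open_future(self, l_c2):
        self.sfc = self.wp = self.ecc        # SFC is initialized as ECC
        self.efc = (self.sfc + l_c2) % self.B

    def write(self, word):                   # fill the future configuration
        self.mem[self.wp] = word
        self.wp = (self.wp + 1) % self.B

    def read(self):                          # decoding-time read
        word = self.mem[self.rp]
        self.rp = (self.rp + 1) % self.B
        return word

    def switch(self):                        # C2 becomes the current code
        self.scc = self.rp = self.sfc        # SCC is initialized as SFC
        self.ecc = self.efc
```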
The reconfiguration of the considered decoder to switch
from the code currently processed (C1) to a new one (C2) can
be overlapped with the decoding of both current and new code,
provided that enough locations are free in the configuration
memories. In particular, part of the configuration process can
be concurrent with the decoding of one or more frames of
C1; if necessary, another portion of the configuration can be
scheduled during the first iteration of the new code C2. Finally,
in case the overlap with decoding activity is not sufficient to
complete the whole configuration, a further option is pausing
the decoder by skipping one or more iterations on the last
received frame for C1 and using the available time before
starting the decoding of the new frame encoded with C2.
Let us define B as the size of the location buffer available at
each PE to store configuration data, and t_it1 and t_it2 as the
durations in clock cycles of a single decoding iteration for
codes C1 and C2. Moreover, l_c1 and l_c2 express the number
of locations required to store the configurations of codes C1
and C2 at each PE, and n_it1 and n_it2 their iteration numbers.
In the considered architecture, the duration of one decoding
iteration t_it, expressed in clock cycles, is directly proportional
to the number of memory locations a PE has to read throughout
the decoding process, and consequently to the number of
used locations in the reconfiguration memory (l_c). Though the
actual relationship between t_it and l_c is affected by memory
scheduling and by the ratio between PE and NoC clock
frequencies, this analysis is carried out with the worst-case
assumption that the reconfiguration memory is read at every
clock cycle of each iteration, setting l_c = t_it for both C1 and
C2 codes.
We define five phases Φi, i = 1, 2, 3, 4, 5, in the configuration
process and for each phase we identify i) t_a^{Φi} as
the number of clock cycles available during phase Φi, and
ii) l_a^{Φi} = N_b · t_a^{Φi} / P as the number of locations in each
reconfiguration memory that can be written in t_a^{Φi} clock
cycles.
Φ1: In the reconfiguration from code C1 to code C2, l_c1 words
must be replaced with l_c2 new words. The first part of the
configuration can be scheduled during the initial n_it1 − 1
decoding iterations on C1 and therefore the available time
is t_a^{Φ1} = (n_it1 − 1) · t_it1; in this range of time a maximum
of l_a^{Φ1} = (N_b/P) · (n_it1 − 1) · t_it1 words can be loaded
into each buffer. However, assuming that the buffer size
is larger than l_c1, we define B − l_c1 as the number of
unused memory blocks in the current configuration for code
C1. Therefore, the actual number of locations written in
Φ1 is the minimum between B − l_c1 and l_a^{Φ1}. The SFC
pointer is thus initialized as ECC (Fig. 2.(b)).
Φ2: During the last iteration on C1, every memory location
between SCC and the current position of RP is available
for reconfiguration. This means that up to l_c1 locations
are available for receiving configuration words for C2.
However, this has to be done during a single iteration,
and therefore t_a^{Φ2} = t_it1 cycles are available. During these
cycles, up to l_a^{Φ2} = (N_b/P) · t_it1 words can be loaded.
Φ3: As mentioned before, part of the configuration can be
overlapped with the first decoding iteration on code C2.
SCC is initialized as SFC, and RP will take the duration
of a full iteration to arrive at ECC (Fig. 2.(c)). The
available time is t_a^{Φ3} = t_it2 and the maximum number of
words that can be loaded in this phase is l_a^{Φ3} = (N_b/P) · t_it2.
Φ4: In the event that the previously listed phases are not
sufficient to complete the configuration, an early stopping of
the decoding of code C1 can be scheduled to make available
additional cycles to be used for loading the remaining part
of the configuration words. We indicate the number of
cycles available in this phase as t_a^{Φ4} = t_stop. The number
of words that can be loaded in Φ4 is l_a^{Φ4} = (N_b/P) · t_stop.
As one or more complete iterations are dropped in Φ4, t_stop
is a multiple of t_it1, which can be formalized as

t_stop = n_stop · t_it1,  n_stop = 0, 1, 2, 3, ...   (12)

Differently from the other four phases, Φ4 affects the

Citations
Journal ArticleDOI
TL;DR: The low-complexity, highly parallel, and flexible systematic-encoding algorithm that is used is described and proved correct, and the flexible software encoder and decoder implementations are shown to be able to maintain high throughput and low latency.
Abstract: In this paper, we present hardware and software implementations of flexible polar systematic encoders and decoders. The proposed implementations operate on polar codes of any length less than a maximum and of any rate. We describe the low-complexity, highly parallel, and flexible systematic-encoding algorithm that we use and prove its correctness. Our hardware implementation results show that the overhead of adding code rate and length flexibility is little, and the impact on operation latency minor compared with code-specific versions. Finally, the flexible software encoder and decoder implementations are also shown to be able to maintain high throughput and low latency.

112 citations


Cites background from "VLSI Implementation of a Multi-Mode..."

  • ...The overhead of building a flexible LDPC decoder capable of decoding different codes is significant, and creating flexible LDPC decoders is an active area of research [2], [3]....

    [...]

Patent
10 Jul 2015
TL;DR: Flexible encoders and decoders supporting polar codes of any length up to a design maximum allow adaptive polar code systems responsive to communication link characteristics, performance, etc. whilst maximizing throughput.
Abstract: Modern communication systems must cope with varying channel conditions and differing throughput constraints. Polar codes, despite being the first error-correcting codes with an explicit construction to achieve the symmetric capacity of memoryless channels, are not currently employed against other older coding protocols such as low-density parity check (LDPC) codes, as their performance at short/moderate lengths has been inferior and their decoding algorithm is serial, leading to low decoding throughput. Accordingly, techniques to address these issues are identified and disclosed, including decoders that decode constituent codes without recursion and/or recognize classes of constituent directly decodable codes, thereby increasing the decoder throughput. Flexible encoders and decoders supporting polar codes of any length up to a design maximum allow adaptive polar code systems responsive to communication link characteristics, performance, etc., whilst maximizing throughput. Further, designers are provided flexibility in implementing either hardware or software implementations.

103 citations

Journal ArticleDOI
TL;DR: The logarithmic-Bahl-Cocke-Jelinek-Raviv (LBCJR) algorithm used in MAP decoders is presented, together with an ungrouped backward recursion technique for the computation of backward state metrics.
Abstract: This work focuses on the VLSI design aspect of high-speed maximum a posteriori (MAP) probability decoders which are intrinsic building-blocks of parallel turbo decoders. For the logarithmic-Bahl-Cocke-Jelinek-Raviv (LBCJR) algorithm used in MAP decoders, we have presented an ungrouped backward recursion technique for the computation of backward state metrics. Unlike the conventional decoder architectures, MAP decoder based on this technique can be extensively pipelined and retimed to achieve higher clock frequency. Additionally, the state metric normalization technique employed in the design of an add-compare-select-unit (ACSU) has reduced critical path delay of our decoder architecture. We have designed and implemented turbo decoders with 8 and 64 parallel MAP decoders in 90 nm CMOS technology. VLSI implementation of an 8 × parallel turbo-decoder has achieved a maximum throughput of 439 Mbps with 0.11 nJ/bit/iteration energy-efficiency. Similarly, 64 × parallel turbo-decoder has achieved a maximum throughput of 3.3 Gbps with an energy-efficiency of 0.079 nJ/bit/iteration. These high-throughput decoders meet peak data-rates of 3GPP-LTE and LTE-Advanced standards.

54 citations


Cites background or result from "VLSI Implementation of a Multi-Mode..."

  • ...Table III summarizes the key characteristics of implemented decoders of our work and compares them with the reported parallel turbo-decoder implementations in the literature [7], [8], [10]–[15]....

    [...]

  • ...Recently, a novel hybrid decoder-architecture of turbo low-density-parity-check (LDPC) codes for multiple wireless communication standards has been proposed in [15]....

    [...]

Journal ArticleDOI
TL;DR: This work proposes a dynamic multi-frame processing schedule which efficiently utilizes layered LDPC decoding with minimum pipeline stages, together with efficient comparison techniques for both column- and row-layered schedules and rejection-based high-speed circuits to compute the two minimum values from multiple inputs required for row-layered processing of the hardware-friendly min-sum decoding algorithm.
Abstract: This paper presents architecture of block-level-parallel layered decoder for irregular LDPC code. It can be reconfigured to support various block lengths and code rates of IEEE 802.11n (WiFi) wireless-communication standard. We have proposed efficient comparison techniques for both column and row layered schedule and rejection-based high-speed circuits to compute the two minimum values from multiple inputs required for row layered processing of hardware-friendly min-sum decoding algorithm. The results show good speed with lower area as compared to state-of-the-art circuits. Additionally, this work proposes dynamic multi-frame processing schedule which efficiently utilizes the layered-LDPC decoding with minimum pipeline stages. The suggested LDPC-decoder architecture has been synthesized and post-layout simulated in 90 nm-CMOS process. This decoder occupies 5.19 mm² area and supports multiple code rates like 1/2, 2/3, 3/4 & 5/6 as well as block-lengths of 648, 1296 & 1944. At a clock frequency of 336 MHz, the proposed LDPC-decoder has achieved better throughput of 5.13 Gbps and energy efficiency of 0.01 nJ/bits/iterations, as compared to the similar state-of-the-art works.

42 citations


Cites methods from "VLSI Implementation of a Multi-Mode..."

  • ...[14] uses different processing cores for LDPC and turbo decoding with shared memory to store messages for both the processes....

    [...]

  • ...Multi-mode reconfigurable architectures in [14] and [15] have the flexibility to switch between LDPC and turbo decoding-process....

    [...]

Journal ArticleDOI
TL;DR: A new SISO-decoder architecture is proposed that leads to significant throughput gains and better hardware efficiency compared to existing architectures for the full range of code rates.
Abstract: Turbo decoders for modern wireless communication systems have to support high throughput over a wide range of code rates. In order to support the peak throughputs specified by modern standards, parallel turbo-decoding has become a necessity, rendering the corresponding VLSI implementation a highly challenging task. In this paper, we explore the implementation trade-offs of parallel turbo decoders based on sliding-window soft-input soft-output (SISO) maximum a-posteriori (MAP) component decoders. We first introduce a new approach that allows for a systematic throughput comparison between different SISO-decoder architectures, taking their individual trade-offs in terms of window length, error-rate performance and throughput into account. A corresponding analysis of existing architectures clearly shows that the latency of the sliding-window SISO decoders causes diminishing throughput gains with increasing degree of parallelism. In order to alleviate this parallel turbo-decoder predicament, we propose a new SISO-decoder architecture that leads to significant throughput gains and better hardware efficiency compared to existing architectures for the full range of code rates.

39 citations

References
Book
01 Jan 1963
TL;DR: A simple but nonoptimum decoding scheme operating directly from the channel a posteriori probabilities is described and the probability of error using this decoder on a binary symmetric channel is shown to decrease at least exponentially with a root of the block length.
Abstract: A low-density parity-check code is a code specified by a parity-check matrix with the following properties: each column contains a small fixed number j ≥ 3 of 1's and each row contains a small fixed number k > j of 1's. The typical minimum distance of these codes increases linearly with block length for a fixed rate and fixed j. When used with maximum likelihood decoding on a sufficiently quiet binary-input symmetric channel, the typical probability of decoding error decreases exponentially with block length for a fixed rate and fixed j. A simple but nonoptimum decoding scheme operating directly from the channel a posteriori probabilities is described. Both the equipment complexity and the data-handling capacity in bits per second of this decoder increase approximately linearly with block length. For j > 3 and a sufficiently low rate, the probability of error using this decoder on a binary symmetric channel is shown to decrease at least exponentially with a root of the block length. Some experimental results show that the actual probability of decoding error is much smaller than this theoretical bound.

11,592 citations


"VLSI Implementation of a Multi-Mode..." refers background in this paper

  • ...parity check matrix which is very sparse [25]....

    [...]

Proceedings Article
01 Jan 1993

7,742 citations

Proceedings ArticleDOI
23 May 1993
TL;DR: In this article, a new class of convolutional codes called turbo-codes, whose performances in terms of bit error rate (BER) are close to the Shannon limit, is discussed.
Abstract: A new class of convolutional codes called turbo-codes, whose performances in terms of bit error rate (BER) are close to the Shannon limit, is discussed. The turbo-code encoder is built using a parallel concatenation of two recursive systematic convolutional codes, and the associated decoder, using a feedback decoding rule, is implemented as P pipelined identical elementary decoders.

5,963 citations

Journal ArticleDOI
TL;DR: The general problem of estimating the a posteriori probabilities of the states and transitions of a Markov source observed through a discrete memoryless channel is considered and an optimal decoding algorithm is derived.
Abstract: The general problem of estimating the a posteriori probabilities of the states and transitions of a Markov source observed through a discrete memoryless channel is considered. The decoding of linear block and convolutional codes to minimize symbol error probability is shown to be a special case of this problem. An optimal decoding algorithm is derived.

4,830 citations


"VLSI Implementation of a Multi-Mode..." refers methods in this paper

  • ...According to [29] a-posteriori information is computed as...

    [...]

  • ...Each constituent decoder performs the so called BCJR algorithm [29] that starting from the intrinsic and a priori information produces the extrinsic information....

    [...]

Frequently Asked Questions (12)
Q1. What contributions have the authors mentioned in the paper "VLSI implementation of a multi-mode turbo/LDPC decoder architecture"?

This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC codes decoding. The novel contributions of this paper are: i) tackling the reconfiguration issue introducing a formal and systematic treatment that, to the best of their knowledge, was not previously addressed; ii) proposing a reconfigurable NoC-based turbo/LDPC decoder architecture and showing that wide flexibility can be achieved with a small complexity overhead.

Since every window is composed of at least 20 trellis steps, requiring 3 · 20 clock cycles to be executed, there is enough time to load β_k[s] and α_k[s] values to initialize the next window.

In case of Single Binary Turbo Codes (SBTC), like those used in 3GPP-LTE, only two λ_k[c(e)] and one λ_k[b] are necessary for a trellis step, and they can be read in two clock cycles without impairing the throughput.

Six locations are used to store 2 β_k[s] or α_k[s] (Fig. 11): since at most three 8-state windows' initialization metrics, i.e. 24 β_k[s] and 24 α_k[s], are stored at the same time, only 144 out of 400 locations are used.

In the event that the previously listed phases are not sufficient to complete the configuration, an early stopping of the decoding of code C1 can be scheduled to make available additional cycles to be used for loading the remaining part of the configuration words.

Turbo and LDPC decoding algorithms are characterized by strong resemblances: they are iterative, work on graph-based representations, are routinely implemented in logarithmic form, process data expressed as Logarithmic-Likelihood-Ratios (LLRs) and require a high level of both processing and storage parallelism.

The reconfiguration probability ranges between 0.25% and 0.3% in the presence of the fast-moving receiver, while it remains under 0.15% in the other case.

Comparison of the proposed architectures with the state of the art shows very good efficiency, competitive area occupation and an unmatched degree of flexibility.

This is because the LDPC consumption is calculated on a DTMB code, which makes full use of the extended memories, while the memory usage percentages for DBTC remain low.

Though the actual relationship between t_it and l_c is affected by memory scheduling and by the ratio between PE and NoC clock frequencies, this analysis is carried out with the worst-case assumption that the reconfiguration memory is read at every clock cycle of each iteration, setting l_c = t_it for both C1 and C2 codes.

Taking into account the presented 22-node architecture, the maximum ratio f_core/f_NoC for which this assumption stands is 2/3 for LDPC codes and SBTC, while 3/5 is necessary for DBTC.

The authors suppose that the reconfiguration data are moved directly to the PEs via a set of N_b dedicated buses, each one linked to P/N_b PEs.