VLSI Implementation of a Multi-Mode Turbo/LDPC Decoder Architecture
Summary
Introduction
- In recent years, considerable effort has been devoted to developing systems that provide ubiquitous access to telecommunication networks.
- In both approaches, flexible and efficient interconnection structures are required to connect PEs to each other.
- The use of an intra-IP NoC as the interconnection framework for both turbo and LDPC code decoders has been demonstrated in several works [16], [19]–[21].
- In Section VII evaluations of the architecture performance on various existing standards are provided.
II. DECODING ALGORITHMS
- Turbo and LDPC decoding algorithms are characterized by strong resemblances: they are iterative, work on graph-based representations, are routinely implemented in logarithmic form, process data expressed as Log-Likelihood Ratios (LLRs), and require high levels of both processing and storage parallelism.
- Both algorithms receive intrinsic information from the channel and produce extrinsic information that is exchanged across iterations to obtain the a priori information of uncoded bits, in the case of binary codes, or symbols, in the case of non-binary codes.
- Moreover, their arithmetical functions are so similar that joint or derived algorithms for both LDPC and turbo decoding exist [24].
- In the following, for both codes, the authors refer to K, N, and r = K/N as the number of uncoded bits, the number of coded bits, and the code rate, respectively.
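As a toy illustration of these quantities (assuming BPSK over AWGN, a channel model not specified in this summary; names are illustrative):

```python
def channel_llr(y, noise_var):
    """Intrinsic LLR of a received BPSK sample over AWGN.

    Assumes the mapping bit 0 -> +1, bit 1 -> -1, so that
    lambda = 2*y / sigma^2 and positive values favour bit 0.
    (Channel model chosen for illustration, not from the paper.)
    """
    return 2.0 * y / noise_var


def code_rate(k, n):
    """r = K/N: uncoded bits over coded bits, as defined above."""
    return k / n
```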
A. LDPC codes decoding algorithm
- The decoding of LDPC codes stems from the Tanner graph representation of H where two sets of nodes are identified: Variable Nodes (VNs) and Check Nodes (CNs).
- There are two main scheduling schemes for the BP: two-phase scheduling and layered scheduling [26].
- In a layered decoder, parity-check constraints are grouped in layers each of which is associated to a component code.
- This process is iterated up to the desired level of reliability.
- Let λ[c] represent the LLR of symbol c; for column k in H, the bit LLR λk[c] is initialized to the corresponding received soft value.
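A minimal sketch of one layered check-node update, using the common min-sum approximation in place of full BP (function and variable names are illustrative, not the paper's):

```python
def layered_minsum_update(lam, R, layer):
    """One layered min-sum update for a single check node.

    lam   : list of bit LLRs (lambda_k[c]), updated in place
    R     : dict mapping bit index -> previous check-to-bit message R_lk
    layer : list of bit indices participating in this parity check
    """
    # Subtract the old check-to-bit messages: Q_lk = lambda_k - R_lk^old
    Q = {k: lam[k] - R[k] for k in layer}
    for k in layer:
        others = [Q[j] for j in layer if j != k]
        sign = 1
        for q in others:
            sign = -sign if q < 0 else sign
        # Min-sum approximation of the check-node rule
        R_new = sign * min(abs(q) for q in others)
        lam[k] = Q[k] + R_new   # a-posteriori LLR update
        R[k] = R_new
    return lam, R
```

After all layers have been processed once, one decoding iteration is complete; the updated λk[c] values feed the next layer immediately, which is what gives layered scheduling its convergence advantage over two-phase scheduling.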
B. Turbo codes decoding algorithm
- Turbo codes are obtained as the parallel concatenation of two constituent Convolutional Code (CC) encoders connected by means of an interleaver (Π).
- Each constituent decoder performs the so-called BCJR algorithm [29], which, starting from the intrinsic and a priori information, produces the extrinsic information.
- Several exact and approximated expressions are available for the max∗{xi} function [31]: for example, it can be implemented as max{xi} followed by a correction term, often stored in a small Look-Up Table (LUT).
- On the other hand, the PEs execute (1) to (5) in parallel for P slices of parity-check constraints when configured in LDPC code mode.
- In the following, the authors indicate the j-th message received and generated by PE i as λ′i,j and λi,j respectively.
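The max∗ operation underlying the BCJR recursions can be sketched as follows; the LUT step size and table depth below are illustrative choices, not taken from the paper:

```python
import math

def max_star(a, b):
    """Exact Jacobian logarithm: max*(a, b) = log(e^a + e^b),
    rewritten as max(a, b) plus a correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


# Hardware-style variant: the correction term is read from a small
# LUT indexed by |a - b| (an 8-entry, 0.5-step table is assumed here).
_LUT_STEP = 0.5
_LUT = [math.log1p(math.exp(-i * _LUT_STEP)) for i in range(8)]

def max_star_lut(a, b):
    d = abs(a - b)
    idx = min(int(d / _LUT_STEP), len(_LUT) - 1)
    return max(a, b) + _LUT[idx]
```

Dropping the correction term entirely yields the plain max approximation, trading a small BER penalty for simpler logic.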
III. NOC-BASED DECODER
- The goal of this work is to design a highly flexible LDPC and turbo decoder, able to support a very wide set of different communication standards.
- The node architecture employed in this work for node i is represented in Fig.
- The routing algorithm is the one proposed in [19] as Single-Shortest-Path FIFO-Length (SSP-FL).
- It is worth noting that the destination of each λi,j is imposed by the interleaver and the H matrix respectively.
- The PE includes both LDPC and turbo decoding cores: their architectures are structured to be as independent as possible of the supported codes.
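A hypothetical reading of a FIFO-length-aware routing decision (the actual SSP-FL algorithm is defined in [19]; the names and the tie-break rule here are assumptions for illustration):

```python
def ssp_fl_route(dest, shortest_ports, fifo_len):
    """Pick an output port for a message headed to node `dest`.

    shortest_ports : dict dest -> ports that all lie on a minimal-length
                     path (precomputed offline from the NoC topology)
    fifo_len       : dict port -> current occupancy of its output FIFO
    Among equally short paths, prefer the least-loaded FIFO.
    """
    return min(shortest_ports[dest], key=lambda p: fifo_len[p])
```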
IV. DECODER RECONFIGURATION
- Changing decoding mode, standard or code parameters requires not only hardware support, but also memory initialization and specific controls: since in many standards a code switch can be issued as early as one data frame ahead [5], a time-efficient reconfiguration technique must be developed.
- The reconfiguration of the considered decoder to switch from the code currently processed (C1) to a new one (C2) can be overlapped with the decoding of both current and new code, provided that enough locations are free in the configuration memories.
- Finally, in case the overlap with decoding activity is not sufficient to complete the whole configuration, a further option is pausing the decoder by skipping one or more iterations on the last received frame for C1 and using the available time, before starting the decoding of the new frame encoded with C2.
- Two alternative cases can arise during Φ1: either this phase is limited by the available time, or it is limited by the number of free locations in the reconfiguration memory: (n_it1 − 1) · t_it1 ≥ (P/N_b) · (B − l_c1) (14).
- The bound is also proportional to Nb, and can consequently be increased by raising the number of reconfiguration buses.
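The two limits can be combined into a rough feasibility check; the one-word-per-bus-per-cycle rate and all names below are illustrative assumptions, not the paper's exact bound:

```python
def can_reconfigure(n_it1, t_it1, words_c2, B, lc1, Nb):
    """Rough check of whether phase Phi_1 can absorb the new code C2.

    n_it1, t_it1 : iterations and cycles-per-iteration of current code C1
    words_c2     : configuration words required by the new code C2
    B, lc1       : locations per configuration memory, locations held by C1
    Nb           : number of reconfiguration buses
    Assumes each bus delivers one word per cycle (illustrative).
    """
    time_budget = (n_it1 - 1) * t_it1 * Nb  # words deliverable in time
    space_budget = Nb * (B - lc1)           # free locations reachable
    return words_c2 <= min(time_budget, space_budget)
```

When the check fails, the fallback described above applies: skip iterations on the last C1 frame (early stopping) to buy extra cycles before the first C2 frame arrives.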
V. RECONFIGURATION: CASES AND EXAMPLES
- The reconfiguration method detailed in Section IV has been applied to a set of target standards, in order to identify suitable design parameters (i.e. Nb, B, nstop, Nfmax) that enable reconfiguration without pausing the decoder for most code sizes.
- It can be noticed that with nstop = 3 all the large codes are below the right side of the curve: later in this section it is demonstrated that these skipped iterations are negligible in terms of BER performance.
- In Fig. 6, the effect of different choices of Nf is shown: from the plot it can be seen that Nf > 0 actually increases the maximum lc2 only for small C1 codes.
- Among the remaining three combinations, the one that makes use of 6 buses yields a higher area occupation than the others.
VI. DECODING CORES
- The design of the decoding cores must yield the same degree of flexibility as the NoC, being as independent as possible of the set of supported codes.
- In [14] a completely serial LDPC decoding core has been designed, mostly independent of block length and code rate: an arbitrary number of CN operations can be scheduled on it.
- The same holds true for the serial SISO, where different windows can be scheduled, regardless of the size of the interleaver.
- As a consequence, in this work logic sharing is not addressed.
- Experimental results show that the area of the architecture is indeed dominated by memories.
A. Quantization and Memory Organization
- Memory organization evolves from the idea presented in [14], in which two memories are instantiated in every decoding core: a 7-bit memory and a 5-bit memory.
- On the same graph, a few turbo code examples (WiMAX and HPAV) are plotted with similar results; for these, λk[b] and the channel LLR representation change from 7 to 6 bits, and λk[c(e)] from 5 to 4 bits (the meaning of λk[b] is detailed in Section VI-C1).
- Curves obtained with floating point precision show improvements between 0.1 and 0.2 dB w.r.t. the selected precisions.
- Thanks to these changes, a single 6-bit wide memory is instantiated, in which both λk[c] and Rlk values are saved.
- Since Rlk can take only two possible values, for each CN the authors can store 576 · 2 magnitudes and 576 · 15 2-bit indexes that identify the correct Rlk magnitude and its sign.
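A saturating fixed-point quantizer of the kind implied by these bit widths might look as follows (the rounding scheme and function name are assumptions, not the paper's exact implementation):

```python
def quantize(llr, bits, frac_bits=0):
    """Saturating two's-complement quantizer for LLRs.

    bits      : total width (e.g. 6 for the shared memory, 4 for
                the turbo-mode extrinsics)
    frac_bits : fractional bits (0 assumed here)
    Returns the clipped integer code.
    """
    scaled = round(llr * (1 << frac_bits))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, scaled))
```

With 6 bits the representable range is [-32, 31]; out-of-range LLRs saturate instead of wrapping, which is what keeps the BER penalty of narrow quantization small.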
B. LDPC Decoding Core
- The LDPC decoding core used in the decoder described in [14] relies on a serial architecture suited for exclusive memory usage.
- The average number of cycles per data varies between one and two.
- Once min1 and min2 have been successfully extracted, they are compared to all the Qlk[c] of the CN, which are delayed by a number of clock cycles equal to the degree of the CN (deg), to compute Rnewlk as in (4).
- Both 6-bit and 2-bit memories are implemented as dual-port RAMs, allowing two concurrent operations (memory scheduling).
- Conversely, port 2 is set to read mode, loading the two Roldlk magnitudes of CN j+1 stored during the previous iteration.
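The serial min1/min2 extraction performed by the CN unit can be sketched as a single compare-and-update scan (a generic sketch, not the paper's RTL):

```python
def two_minima(mags):
    """Serially extract the two smallest magnitudes among the
    incoming |Q_lk[c]| of a CN, plus the index of the smallest.

    One compare-and-update per incoming value, matching a serial
    hardware scan; returns (min1, min2, argmin1).
    """
    min1 = min2 = float("inf")
    idx1 = -1
    for i, m in enumerate(mags):
        if m < min1:
            min1, min2, idx1 = m, min1, i
        elif m < min2:
            min2 = m
    return min1, min2, idx1
```

Each edge then receives min1, except the edge at argmin1, which receives min2: this is how the min-sum check-node output is formed without storing a per-edge result.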
C. Turbo Decoding Core
- As for the LDPC decoding core, also the SISO core yields a very high degree of flexibility, limited only by the size of the memories: any double-binary turbo code can be decoded as long as the memory capacity is sufficient.
- The SISO interfaces with the NoC via two dedicated input and output blocks, respectively called Bit-To-Symbol Conversion Unit (BTS CU) and Symbol-To-Bit Conversion Unit (STB CU).
- These metrics are computed in this exact order, storing βk[s] values in a dedicated set of registers while αk[s] are being processed: the b(e) metric, which needs both βk[s] and αk[s], is calculated last.
- The 2-bit memory is used in the same way, with port 1 in read mode and port 2 in write mode.
- Since every window is composed of at least 20 trellis steps, requiring 3 · 20 clock cycles to be executed, there is enough time to load βk[s] and αk[s] values to initialize the next window.
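The cycle-budget argument above can be restated as a small check (assuming one metric load per cycle; names and the load model are illustrative assumptions):

```python
def window_cycle_budget(trellis_steps, states=8, passes=3):
    """Compare the cycles spent executing one window (beta, alpha and
    extrinsic passes) with the cycles needed to preload the next
    window's initialization metrics (one beta and one alpha per state).
    """
    exec_cycles = passes * trellis_steps  # e.g. 3 * 20 = 60 for W = 20
    load_cycles = 2 * states              # 8 betas + 8 alphas
    return exec_cycles >= load_cycles
```

For the minimum window of 20 trellis steps, 60 execution cycles comfortably cover the 16 initialization loads, so preloading hides entirely behind computation.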
VII. SUPPORTED STANDARDS
- The 22-node architecture presented in this work has been tested on a large set of communication standards.
- To comply with each standard's throughput requirements, a single fNoC = 300 MHz is sufficient in both LDPC and turbo mode, consequently identifying fLDPCcore = 200 MHz and fturbocore = 170 MHz, both under the fcore/fNoC constraint.
- This area overhead is due to two specific functionalities that have been introduced in the proposed decoder: (i) full flexibility in terms of supported turbo and LDPC codes, and (ii) dynamic reconfiguration between different standards.
- The parallelism of the NoC is increased from 22 nodes to 35 nodes, the number of reconfiguration buses rises from 5 to 8, and the support of LTE requires an increase in the size of the 6-bit memories.
- Throughput results for CMMB and DTMB are shown in the Implementation C column of Table VI.
D. Comparisons
- Table VIII shows the detailed implementation results in comparison with the state of the art flexible turbo/LDPC decoders.
- Baghdadi et al. in [11] propose an ASIP decoder architecture supporting WiMAX and WiFi LDPC codes, and WiMAX, 3GPP-LTE and DVB-RCS turbo codes.
- On the contrary, worst case throughput in [11] is not high enough for WiMAX.
- This leads to a better area efficiency in all three proposed implementations for most of the codes: the difference is particularly evident for DBTC (second-to-last row of Table VII).
- They obtain very high maximum throughput efficiency in both LDPC and turbo mode: the range of supported codes is, however, quite limited w.r.t. all considered implementations, and the area occupation is larger than that of implementation A.
IX. CONCLUSIONS
- This work describes a flexible turbo/LDPC decoder architecture able to fully support a wide range of modern communication standards.
- A complete analysis of the never previously addressed inter- and intra-standard reconfiguration issue is presented, together with a dedicated reconfiguration technique that limits the complexity overhead and performance loss.
- Three different implementations are proposed to cover different sets of standards.
- Full layout design has been completed to provide accurate area and power figures.
- Comparison of the proposed architectures with the state of the art shows very good efficiency, competitive area occupation and an unmatched degree of flexibility.
Frequently Asked Questions (12)
Q2. How many trellis steps are needed to initialize the next window?
Since every window is composed of at least 20 trellis steps, requiring 3 · 20 clock cycles to be executed, there is enough time to load βk[s] and αk[s] values to initialize the next window.
Q3. How many bits are needed for a trellis step?
In case of Single Binary Turbo Codes (SBTC), like those used in 3GPP-LTE, only two λk[c(e)] and one λk[b] are necessary for a trellis step, and they can be read in two clock cycles without impairing the throughput.
Q4. How many locations are used to store k[s]?
Six locations are used to store two βk[s] or αk[s] values (Fig. 11): since at most three 8-state windows' initialization metrics, i.e. 24 βk[s] and 24 αk[s], are stored at the same time, only 144 out of 400 locations are used.
Q5. How many cycles can be used to load the remaining configuration words?
In the event that previously listed phases are not sufficient to complete the configuration, an early stopping in the decoding of code C1 can be scheduled to make available additional cycles to be used for loading the remaining part of the configuration words.
Q6. What are the characteristics of LDPC decoding algorithms?
Turbo and LDPC decoding algorithms are characterized by strong resemblances: they are iterative, work on graph-based representations, are routinely implemented in logarithmic form, process data expressed as Log-Likelihood Ratios (LLRs), and require high levels of both processing and storage parallelism.
Q7. How much is the BER penalty in the presence of the fast moving receiver?
The reconfiguration probability ranges between 0.25% and 0.3% in the presence of the fast-moving receiver, while it remains under 0.15% in the other case.
Q8. What is the way to compare the proposed architectures with the state of the art?
Comparison of the proposed architectures with the state of the art shows very good efficiency, competitive area occupation and an unmatched degree of flexibility.
Q9. Why is the LDPC consumption calculated on a DTMB code?
This is because the LDPC consumption is calculated on a DTMB code, that makes full use of the extended memories, while the memory usage percentages for DBTC remains low.
Q10. What is the relationship between tit and lc?
Though the actual relationship between tit and lc is affected by memory scheduling and ratio between PE and NoC clock frequencies, this analysis is carried out with the worst-case assumption that the reconfiguration memory is read at every clock cycle of each iteration, setting lc = tit for both C1 and C2 codes.
Q11. What is the maximum ratio of fcore/fNoC for a given standard?
Taking into account the presented 22-node architecture, the maximum ratio fcore/fNoC for which this assumption holds is 2/3 for LDPC codes and SBTC, while 3/5 is necessary for DBTC.
Q12. What is the way to move the data to the PEs?
The authors suppose that the reconfiguration data are moved directly to the PEs via a set of Nb dedicated buses, each one linked to P/Nb PEs.