
Showing papers by "Keshab K. Parhi published in 2004"


Journal Article•DOI•
TL;DR: Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000E-8 BG560 device in non-feedback modes, which is faster and 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date.
Abstract: This paper presents novel high-speed architectures for the hardware implementation of the Advanced Encryption Standard (AES) algorithm. Unlike previous works, which rely on look-up tables to implement the SubBytes and InvSubBytes transformations of the AES algorithm, the proposed design employs combinational logic only. As a direct consequence, the unbreakable delay incurred by look-up tables in the conventional approaches is eliminated, and the advantage of subpipelining can be further explored. Furthermore, composite field arithmetic is employed to reduce the area requirements, and different implementations for the inversion in the subfield GF(2^4) are compared. In addition, an efficient key expansion architecture suitable for the subpipelined round units is also presented. Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000E-8 BG560 device in non-feedback modes, which is faster and 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date.
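
As a rough illustration of the LUT-free approach described above, the behavioral Python sketch below computes SubBytes purely arithmetically, taking the GF(2^8) inverse and applying the affine map. It is only a software model under our own naming: the paper's hardware maps the inversion into the composite field GF((2^4)^2) rather than raising to the 254th power.

def gf_mul(a, b, poly=0x11B):
    """Carry-less multiplication in GF(2^8), reduced by x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_inv(a):
    """Multiplicative inverse via a^254; by AES convention, inv(0) = 0."""
    r, e = 1, 254
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def sub_bytes(x):
    """SubBytes = GF(2^8) inversion followed by the AES affine transform."""
    y, s = gf_inv(x), 0
    for i in range(8):
        bit = ((y >> i) ^ (y >> ((i + 4) % 8)) ^ (y >> ((i + 5) % 8))
               ^ (y >> ((i + 6) % 8)) ^ (y >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        s |= bit << i
    return s

assert sub_bytes(0x00) == 0x63 and sub_bytes(0x53) == 0xED   # test vectors from FIPS-197

The point is only that SubBytes reduces to field arithmetic plus an affine map, which is what makes a purely combinational, deeply subpipelined datapath possible.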

450 citations


Journal Article•DOI•
TL;DR: A systematic approach is proposed to develop a high throughput decoder for quasi-cyclic low-density parity check (LDPC) codes, whose parity check matrix is constructed by circularly shifted identity matrices, and the maximum concurrency of the two stages is explored by a novel scheduling algorithm.
Abstract: In this paper, a systematic approach is proposed to develop a high-throughput decoder for quasi-cyclic low-density parity check (LDPC) codes, whose parity check matrix is constructed from circularly shifted identity matrices. Based on the properties of quasi-cyclic LDPC codes, the two stages of the belief propagation decoding algorithm, namely the check node update and the variable node update, can be overlapped, thus reducing the overall decoding latency. To avoid memory access conflicts, the maximum concurrency of the two stages is explored by a novel scheduling algorithm. Consequently, the decoding throughput can be approximately doubled, assuming dual-port memory is available.
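
For reference, the kind of parity-check matrix assumed above is easy to sketch: an array of circularly shifted identity matrices. The shift table and sub-block size below are invented for illustration and are not a code from the paper.

import numpy as np

def circ_identity(p, shift):
    """p x p identity matrix with its columns circularly shifted by `shift`."""
    return np.roll(np.eye(p, dtype=np.uint8), shift, axis=1)

def qc_ldpc_H(shifts, p):
    """Assemble H from a table of shift values, one circulant sub-block each."""
    return np.block([[circ_identity(p, s) for s in row] for row in shifts])

shifts = [[0, 1, 2, 3],          # hypothetical 2 x 4 base matrix, sub-block size p = 5
          [0, 2, 4, 1]]
H = qc_ldpc_H(shifts, 5)
print(H.shape)                   # (10, 20): a (2,4)-regular 10 x 20 parity-check matrix

It is this block-circulant structure that makes the memory access pattern of the two decoding stages predictable enough to be overlapped.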

173 citations


Journal Article•DOI•
TL;DR: Simulations show that the quantization error can be reduced by up to 50% with the proposed error compensation method compared with the existing method, with approximately the same hardware overhead in the bias generation circuit.
Abstract: This paper presents an error compensation method for a modified Booth fixed-width multiplier that receives a W-bit input and produces a W-bit product. To efficiently compensate for the quantization error, the Booth encoder outputs (not the multiplier coefficients) are used to generate the error compensation bias. The truncated bits are divided into two groups depending on their effect on the quantization error, and a different error compensation method is applied to each group. Simulations show that the quantization error can be reduced by up to 50% with the proposed error compensation method compared with the existing method, with approximately the same hardware overhead in the bias generation circuit. It is also shown that the proposed method leads to up to a 35% reduction in the area and power consumption of the multiplier compared with the ideal multiplier.
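
The effect being compensated can be reproduced numerically. The sketch below truncates a W x W product to its W most significant bits and compares no compensation against a plain constant (rounding-style) bias; the paper's Booth-encoder-derived, data-dependent bias is not reproduced here, and W = 8 is an arbitrary choice.

import random

W = 8
def fixed_width_product(a, b, bias=0):
    full = a * b                        # ideal 2W-bit product
    return ((full + bias) >> W) << W    # keep only the W most significant bits

random.seed(0)
samples = [(random.randrange(2**W), random.randrange(2**W)) for _ in range(10000)]
for bias in (0, 1 << (W - 1)):          # no compensation vs. rounding-style constant bias
    err = sum(abs(a * b - fixed_width_product(a, b, bias)) for a, b in samples) / len(samples)
    print(f"bias={bias:3d}  mean |quantization error| = {err:.1f}")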

155 citations


Journal Article•DOI•
TL;DR: This paper presents a novel group matching scheme to reduce the Chien search hardware complexity by 60% for BCH(2047, 1926, 23) code as opposed to only 26% if directly applying the iterative matching algorithm.
Abstract: To implement parallel BCH (Bose-Chaudhuri-Hocquenghem) decoders in an area-efficient manner, this paper presents a novel group matching scheme that reduces the Chien search hardware complexity by 60% for the BCH(2047, 1926, 23) code, as opposed to only 26% if directly applying the iterative matching algorithm. The proposed scheme exploits the substructure sharing within a finite field multiplier (FFM) and among groups of FFMs.
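
For context, the operation whose hardware is being shared is sketched below as a toy, fully serial Chien search over GF(2^4) (primitive polynomial x^4 + x + 1, chosen for brevity): evaluate the error-locator polynomial at every field element and report the roots. The paper's contribution concerns substructure sharing among the many parallel GF(2^11) finite field multipliers, which this sketch does not attempt to model.

def build_gf16():
    """Log/antilog tables for GF(2^4) with primitive polynomial x^4 + x + 1."""
    exp, log = [0] * 30, [0] * 16
    x = 1
    for i in range(15):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x10:
            x ^= 0x13
    for i in range(15, 30):
        exp[i] = exp[i - 15]
    return exp, log

EXP, LOG = build_gf16()

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def chien_search(lam, n=15):
    """Return the positions i in 0..n-1 where Lambda(alpha^-i) = 0."""
    roots = []
    for i in range(n):
        x, val, xp = EXP[(-i) % 15], 0, 1
        for c in lam:                     # evaluate sum_j c_j * x^j
            val ^= gf_mul(c, xp)
            xp = gf_mul(xp, x)
        if val == 0:
            roots.append(i)
    return roots

X1, X2 = EXP[2], EXP[7]                   # pretend errors occurred at positions 2 and 7
lam = [1, X1 ^ X2, gf_mul(X1, X2)]        # Lambda(x) = (1 + X1*x)(1 + X2*x)
assert chien_search(lam) == [2, 7]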

136 citations


Journal Article•DOI•
TL;DR: This paper presents an iterated short convolution (ISC) algorithm, based on the mixed radix algorithm and the fast convolution algorithm; the resulting linear convolution structure is transposed to obtain a new hardware-efficient fast parallel finite-impulse response (FIR) filter structure that saves a large amount of hardware cost.
Abstract: This paper presents an iterated short convolution (ISC) algorithm, based on the mixed radix algorithm and the fast convolution algorithm. This ISC-based linear convolution structure is transposed to obtain a new hardware-efficient fast parallel finite-impulse response (FIR) filter structure, which saves a large amount of hardware cost, especially when the length of the FIR filter is large. For example, for a 576-tap filter, the proposed structure saves 17% to 42% of the multiplications, 17% to 44% of the delay elements, and 3% to 27% of the additions required by prior fast parallel structures, when the level of parallelism varies from 6 to 72. The regular structure of the proposed filters also facilitates automatic hardware implementation of parallel FIR filters.
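
The smallest instance of the idea, the classical 2-parallel fast FIR decomposition that uses three half-length subfilters instead of four, can be checked numerically as below; the ISC construction in the paper builds much higher levels of parallelism out of such short convolutions. The test signal and filter length are our own choices.

import numpy as np

def fast_2parallel_fir(x, h):
    """2-parallel fast FIR filtering via three half-length subfilters."""
    x0, x1 = x[0::2], x[1::2]             # even / odd input phases
    h0, h1 = h[0::2], h[1::2]             # even / odd coefficient phases
    a = np.convolve(h0, x0)               # H0*X0
    b = np.convolve(h1, x1)               # H1*X1
    c = np.convolve(h0 + h1, x0 + x1)     # (H0+H1)*(X0+X1)
    y0 = a.copy()
    y0[1:] += b[:-1]                      # Y0 = H0*X0 + z^-1 * H1*X1
    y1 = c - a - b                        # Y1 = H0*X1 + H1*X0
    y = np.empty(len(y0) + len(y1))
    y[0::2], y[1::2] = y0, y1             # re-interleave the two output phases
    return y

rng = np.random.default_rng(0)
x, h = rng.standard_normal(64), rng.standard_normal(8)
assert np.allclose(fast_2parallel_fir(x, h)[:len(x)], np.convolve(h, x)[:len(x)])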

118 citations


Journal Article•DOI•
TL;DR: This paper presents a joint (3,k)-regular LDPC code and decoder/encoder design technique to construct a class of LDPC codes that not only have very good error-correcting capability but also exactly fit to high-speed partly parallel decoder and low-complexity encoder implementations.
Abstract: Recently, low-density parity-check (LDPC) codes have attracted a lot of attention in the coding theory community. However, their real-world applications are still problematic, mainly due to the lack of effective decoder/encoder hardware design approaches. In this paper, we present a joint (3,k)-regular LDPC code and decoder/encoder design technique to construct a class of (3,k)-regular LDPC codes that not only have very good error-correcting capability but also fit exactly into high-speed partly parallel decoder and low-complexity encoder implementations. We also develop two techniques to further modify this joint design scheme to achieve more flexible tradeoffs between decoder hardware complexity and decoding speed.

87 citations


Proceedings Article•DOI•
26 Apr 2004
TL;DR: Three novel architectures are proposed to reduce the achievable minimum clock period for long BCH encoders after the fanout bottleneck has been eliminated and can achieve a speedup of over 100%.
Abstract: Long BCH codes are used as the outer error-correcting code in the second-generation Digital Video Broadcasting standard from the European Telecommunications Standards Institute (ETSI). These codes can achieve around 0.6 dB additional coding gain over Reed-Solomon codes with similar codeword length and code rate in long-haul optical communication systems. BCH encoders are conventionally implemented by a linear feedback shift register architecture. High-speed applications of BCH codes require parallel implementations of the encoders. In addition, long BCH encoders suffer from the effect of large fanout. In this paper, novel architectures are proposed to reduce the achievable minimum clock period of long BCH encoders after the fanout bottleneck has been eliminated. For an (8191, 7684) BCH code, compared to the original 32-parallel BCH encoder architecture without the fanout bottleneck, the proposed architectures can achieve a speedup of over 100%.
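
The serial starting point, a linear feedback shift register that divides the message polynomial by the generator polynomial, is easy to show for a toy code. The sketch below uses the tiny (7,4) code with g(x) = x^3 + x + 1; the paper's architectures parallelize, retime, and look-ahead-transform this loop for codes thousands of bits long.

def lfsr_encode(msg_bits, gen=(1, 0, 1, 1)):       # g(x) coefficients, highest degree first
    """Systematic encoding: append the remainder of msg(x)*x^(n-k) divided by g(x)."""
    r = [0] * (len(gen) - 1)                       # parity (remainder) register
    for bit in msg_bits:                           # one message bit per clock cycle
        fb = bit ^ r[0]                            # feedback = input bit + register MSB
        r = [r[i + 1] ^ (fb & gen[i + 1]) for i in range(len(r) - 1)] + [fb & gen[-1]]
    return list(msg_bits) + r                      # codeword = message followed by parity

assert lfsr_encode([1, 0, 1, 1]) == [1, 0, 1, 1, 0, 0, 0]   # x^6+x^4+x^3 is divisible by g(x)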

61 citations


Proceedings Article•DOI•
23 May 2004
TL;DR: This paper explores the design spaces of both serial and parallel MAP decoders using graphical analysis and several existing designs are compared, and three new parallel decoding schemes are presented.
Abstract: Turbo codes are one of the most powerful error correcting codes. The VLSI implementation of Turbo codes for higher decoding speed requires use of parallel architectures. This paper explores the design spaces of both serial and parallel MAP decoders using graphical analysis. Several existing designs are compared, and three new parallel decoding schemes are presented.

51 citations


Journal Article•DOI•
TL;DR: A novel scheme based on look-ahead computation and retiming is proposed to eliminate the effect of large fanout in parallel long BCH encoders and can achieve a speedup of 132%.
Abstract: Long BCH codes can achieve about 0.6-dB additional coding gain over Reed-Solomon codes with similar code rate in long-haul optical communication systems. BCH encoders are conventionally implemented by a linear feedback shift register architecture. Encoders of long BCH codes may suffer from the effect of large fanout, which may reduce the achievable clock speed. The data rate requirements of optical applications require parallel implementations of the BCH encoders. In this paper, a novel scheme based on look-ahead computation and retiming is proposed to eliminate the effect of large fanout in parallel long BCH encoders. For a (2047, 1926) code, compared to the original parallel BCH encoder architecture, the modified architecture can achieve a speedup of 132%.
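
The algebra behind the look-ahead part is compact: if the serial encoder state evolves as s[n+1] = A·s[n] over GF(2), then a J-parallel encoder can advance J steps per clock using the precomputed matrix A^J. The sketch below checks this for an arbitrary degree-4 feedback polynomial and ignores the input terms, so it illustrates the principle only, not the paper's retimed fanout-free architecture.

import numpy as np

taps = [1, 0, 0, 1]                           # example feedback taps of a degree-4 LFSR
A = np.zeros((4, 4), dtype=np.uint8)
A[0, :] = taps                                # feedback row
A[1:, :-1] = np.eye(3, dtype=np.uint8)        # plain shift for the remaining registers

def matpow_gf2(M, J):
    """M^J with all arithmetic over GF(2)."""
    R = np.eye(len(M), dtype=np.uint8)
    for _ in range(J):
        R = (R @ M) % 2
    return R

s, J = np.array([1, 0, 1, 1], dtype=np.uint8), 8
serial = s.copy()
for _ in range(J):                            # J serial clock cycles
    serial = (A @ serial) % 2
parallel = (matpow_gf2(A, J) @ s) % 2         # one look-ahead step of size J
assert np.array_equal(serial, parallel)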

49 citations


Proceedings Article•DOI•
17 May 2004
TL;DR: A novel group matching scheme is proposed to reduce the overall hardware complexity of both Chien search and syndrome generator units by 46% for BCH(2047, 1926, 23) code as opposed to only 22% if directly applying the iterative matching algorithm.
Abstract: Long BCH codes achieve an additional coding gain of around 0.6 dB compared to Reed-Solomon codes with similar code rate in long-haul optical communication systems. For the considered parallel decoder architecture, a novel group matching scheme is proposed to reduce the overall hardware complexity of both the Chien search and syndrome generator units by 46% for the BCH(2047, 1926, 23) code, as opposed to only 22% if directly applying the iterative matching algorithm. The proposed scheme exploits the substructure sharing within a finite field multiplier (FFM) and among groups of FFMs.

39 citations


Proceedings Article•DOI•
20 Jun 2004
TL;DR: A novel scheme based on look-ahead computation and retiming is proposed to eliminate the effect of large fanout in parallel long BCH encoders and can achieve a speedup of 132%.
Abstract: Long BCH codes can achieve about 0.6 dB additional coding gain over Reed-Solomon codes with a similar code rate in long-haul optical communication systems. BCH encoders are conventionally implemented by a linear feedback shift register architecture. Encoders of long BCH codes may suffer from the effect of large fanout, which may reduce the achievable clock speed. The data rate requirements of optical applications require parallel implementations of the BCH encoders. In this paper, a novel scheme based on look-ahead computation and retiming is proposed to eliminate the effect of large fanout in parallel long BCH encoders. For a (2047, 1926) code, compared to the original parallel BCH encoder architecture, the modified architecture can achieve a speedup of 132%.

Journal Article•DOI•
TL;DR: With the proposed retimed structure, it is possible to decrease the critical path of the ACS unit by 12% to 15% compared with the conventional MSB-first structures, which can reduce the level of parallelism required for a very high-speed Viterbi decoder.
Abstract: Convolutional codes are widely used in many communication systems due to their excellent error-control performance. High-speed Viterbi decoders for convolutional codes are of great interest for high-data-rate applications. In this paper, an improved most-significant-bit (MSB)-first bit-level pipelined add-compare-select (ACS) unit structure is proposed. The ACS unit is the main bottleneck on the decoding speed of a Viterbi decoder. By balancing the settling time of different paths in the ACS unit, the length of the critical path is reduced as close as possible to the iteration bound in the ACS unit. With the proposed retimed structure, it is possible to decrease the critical path of the ACS unit by 12% to 15% compared with the conventional MSB-first structures. This reduction in critical path can reduce the level of parallelism (and area) required for a very high-speed Viterbi decoder.
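
A word-level, behavioral version of the ACS recursion is given below to pin down the operation whose bit-level critical path is being shortened. The 4-state connectivity and branch metrics are illustrative only; nothing here models the MSB-first bit-serial structure or the retiming itself.

def acs_step(path_metrics, branch_metrics, predecessors):
    """One trellis step: add branch metrics, compare the two candidates, select the survivor."""
    new_metrics, decisions = [], []
    for state, preds in enumerate(predecessors):
        candidates = [path_metrics[p] + branch_metrics[p][state] for p in preds]
        best = 0 if candidates[0] <= candidates[1] else 1
        new_metrics.append(candidates[best])
        decisions.append(preds[best])          # survivor pointer for traceback
    return new_metrics, decisions

preds = [(0, 2), (0, 2), (1, 3), (1, 3)]       # 4-state butterfly, e.g. a K=3 code
pm = [0, 3, 1, 2]                              # current path metrics (made up)
bm = [[2, 1, 0, 0], [0, 0, 1, 2],              # bm[p][s]: branch metric from state p to s
      [1, 2, 0, 0], [0, 0, 2, 1]]
print(acs_step(pm, bm, preds))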

Journal Article•DOI•
TL;DR: A novel K-nested layered look-ahead method and its corresponding architecture, which combine K trellis steps into one trellis step (where K is the encoder constraint length), are proposed for implementing low-latency, high-throughput Viterbi decoders.
Abstract: In this paper, a novel K-nested layered look-ahead method and its corresponding architecture, which combine K trellis steps into one trellis step (where K is the encoder constraint length), are proposed for implementing low-latency, high-throughput Viterbi decoders. The proposed method guarantees parallel paths between any two trellis states in the look-ahead trellises and distributes the add-compare-select (ACS) computations over all trellis layers. It leads to a regular and simple architecture for the Viterbi decoding algorithm. The look-ahead ACS computation latency of the proposed method increases logarithmically with respect to the look-ahead step (M) divided by the encoder constraint length (K), as opposed to linearly as in prior work. For a 4-state (i.e., K=3) convolutional code, the decoding latency of the Viterbi decoder using the proposed method is reduced by 84%, at the expense of about a 22% increase in hardware complexity, compared with the conventional M-step look-ahead method with M=48 (where M is also the level of parallelism). The main advantage of the proposed design is that it has the least latency among all known look-ahead Viterbi decoders for a given level of parallelism.

Proceedings Article•DOI•
17 May 2004
TL;DR: This paper exploits the similarity between the two stages of the belief propagation decoding algorithm for low-density parity check codes to derive an area-efficient design that re-maps the check node functional units and variable node functional units onto the same hardware.
Abstract: This paper exploits the similarity between the two stages of belief propagation decoding algorithm for low density parity check codes to derive an area efficient design that re-maps the check node functional units and variable node functional units into the same hardware. Consequently, the novel approach could reduce the logic core size by approximately 21% without any performance degradation. In addition, the proposed approach improves the hardware utilization efficiency as well.
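
The similarity being exploited is easiest to see with the two update rules written side by side. The sketch below uses the min-sum approximation for brevity (the paper works with the full sum-product rule and look-up tables); in both stages the outgoing message on an edge combines every incoming message except the one on that edge, which is what makes mapping them onto shared hardware attractive.

def check_node_update(incoming):
    """Outgoing message on edge i: sign product and minimum magnitude over all j != i."""
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1
        for m in others:
            sign = -sign if m < 0 else sign
        out.append(sign * min(abs(m) for m in others))
    return out

def variable_node_update(channel_llr, incoming):
    """Outgoing message on edge i: channel LLR plus all incoming messages j != i."""
    total = channel_llr + sum(incoming)
    return [total - incoming[i] for i in range(len(incoming))]

print(check_node_update([1.5, -0.5, 2.0]))        # [-0.5, 1.5, -0.5]
print(variable_node_update(0.8, [1.5, -0.5, 2.0]))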

Proceedings Article•DOI•
06 Dec 2004
TL;DR: A novel architecture based on root-order prediction is proposed to speed up the factorization step of Reed-Solomon decoding, and the time-consuming exhaustive-search-based root computation in each iteration of the factorization step is circumvented with more than 99% probability.
Abstract: Reed-Solomon (RS) codes are among the most widely utilized block error-correcting codes in modern communication and computer systems. Compared to hard-decision decoding, soft-decision decoding offers considerably higher error-correcting capability. Among the soft-decision decoding algorithms, the polynomial time complexity Koetter-Vardy (KV) algorithm can achieve substantial coding gain for high-rate RS codes. In the KV algorithm, the factorization step can consume a major part of the decoding latency. A novel architecture based on root-order prediction is proposed to speed up the factorization step. As a result, the time-consuming exhaustive-search-based root computation in each iteration of the factorization step is circumvented with more than 99% probability. Using the proposed architecture, a speedup of 141% can be achieved over prior efforts for a (255, 239) RS code, while the area consumption is reduced to 31.9%.

Proceedings Article•DOI•
20 Jun 2004
TL;DR: It is shown in this paper that the pulsed-OFDM system has better performance than the non-pulsed system in indoor multipath channels and considerably lower complexity and power consumption.
Abstract: We study the theory and implementation of pulsed orthogonal frequency division multiplexing (pulsed-OFDM) modulation. Pulsed-OFDM is an enhancement to the leading proposal to the IEEE 802.15.3a wireless personal area networks standardization effort, known as multi-band OFDM. In particular, we show in this paper that the pulsed-OFDM system has better performance than the non-pulsed system in indoor multipath channels, as well as considerably lower complexity and power consumption. We begin by studying the spectral characteristics of pulsed OFDM and the added degrees of diversity that it provides. Next, we discuss the design of receivers for such a system. We show that the diversity branches can be captured and demodulated by one or more fast Fourier transforms (FFTs). We then focus on a system for the IEEE 802.15.3a standard and derive a particularly low-complexity implementation for that system. The implementation is based on carefully designed punctured convolutional codes. It also exploits the normal inefficiencies in an FFT architecture to implement the parallel FFT operations required to demodulate a full-diversity pulsed-OFDM signal with lower complexity and smaller area than the single FFT used by the non-pulsed system. We conclude by presenting realistic simulation results for the measured indoor propagation channels provided by the IEEE 802.15.3a standard.
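
The spectral claim is straightforward to verify numerically: pulsing the subcarriers, modeled here as inserting L-1 zeros after every time-domain sample (a duty cycle of 1/L), replicates the OFDM spectrum L times across the wider band. N = 64 subcarriers and L = 4 are arbitrary choices for this check.

import numpy as np

rng = np.random.default_rng(0)
N, L = 64, 4
data = (2 * rng.integers(0, 2, N) - 1) + 1j * (2 * rng.integers(0, 2, N) - 1)   # QPSK
ofdm = np.fft.ifft(data)                       # one OFDM symbol (time domain)
pulsed = np.zeros(N * L, dtype=complex)
pulsed[::L] = ofdm                             # zero-stuffing models the pulsed subcarriers
S = np.fft.fft(pulsed)
assert np.allclose(S[:N], data) and np.allclose(S[:N], S[N:2 * N])   # L identical replicas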

Journal Article•DOI•
TL;DR: This letter addresses low-complexity design strategies on choosing the scaling factor of the log extrinsic information and on reducing the number of hard-decision decodings during a Chase search.
Abstract: In this letter, tradeoffs between very large scale integration implementation complexity and performance of block turbo decoders are explored. We address low-complexity design strategies on choosing the scaling factor of the log extrinsic information and on reducing the number of hard-decision decodings during a Chase search.

Proceedings Article•DOI•
06 Dec 2004
TL;DR: The paper demonstrates that overlapped decoding can be exploited as long as the LDPC parity-check matrix, H, is composed of identity matrices and their cyclically shifted versions, i.e., the code belongs to the class of quasi-cyclic LDPC codes.
Abstract: In low-density parity-check (LDPC) code decoding with the iterative sum-product algorithm (SPA), the randomness of the parity-check matrix, H, makes it difficult to overlap the check node processing unit (CNU) and variable node processing unit (VNU) in the same clock cycle. The paper demonstrates that overlapped decoding can be exploited as long as the parity-check matrix is composed of identity matrices and their cyclically shifted versions, i.e., the code belongs to the class of quasi-cyclic LDPC codes. It is shown that the number of clock cycles required for decoding can be reduced by 50% when overlapped decoding is applied to a (3,6)-regular LDPC code decoder.

Proceedings Article•DOI•
20 Jun 2004
TL;DR: The proposed SD algorithm, called SD-KB algorithm, can provide pseudo-MLD solutions, which have significant performance gain over the baseline method, especially when the signal-to-interference ratio (SIR) is low.
Abstract: The sphere decoding (SD) algorithm has been widely recognized as an important algorithm to solve the maximum likelihood detection (MLD) problem, given that symbols can only be selected from a set with a finite alphabet. The complexity of the sphere decoding algorithm is much lower than the directly implemented MLD method, which needs to search through all possible candidates before making a decision. However, in high-dimensional and low signal-to-noise ratio (SNR) cases, the complexity of sphere decoding is still prohibitively high for practical applications. In this paper, a simplified SD algorithm, which combines the K-best algorithm and the SD algorithm, is proposed. With carefully selected parameters, the new SD algorithm, called the SD-KB algorithm, can achieve very low complexity with acceptable performance degradation compared with the traditional SD algorithm. The low complexity of the new SD-KB algorithm makes it applicable to the simultaneously operating piconets (SOP) problem of the multi-band orthogonal frequency division multiplexing (MB-OFDM) scheme for high-speed wireless personal area networks (WPANs). We show in particular that the proposed algorithm provides over 4 dB gain in bit error rate (BER) performance over the baseline MB-OFDM scheme when several piconets interfere with each other. The SD-KB algorithm can provide pseudo-MLD solutions, which have significant performance gain over the baseline method, especially when the signal-to-interference ratio (SIR) is low. The cost of this performance improvement is higher complexity; however, the new SD algorithm has predictable computational complexity even in the worst case.
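
A bare-bones K-best detector, the breadth-first half of the combined SD-KB idea, is sketched below for a small real-valued system y = Hx + n with a 4-PAM alphabet. The channel, noise level, and K = 4 are made-up parameters, and the pruning and enumeration details of the paper's algorithm are not reproduced.

import numpy as np

def k_best_detect(y, H, alphabet, K=4):
    """Breadth-first tree search keeping the K lowest-cost partial symbol vectors."""
    Q, R = np.linalg.qr(H)
    z = Q.T @ y                                      # rotate: z = R x + rotated noise
    n = H.shape[1]
    survivors = [([], 0.0)]                          # (partial symbol vector, accumulated cost)
    for level in range(n - 1, -1, -1):               # detect x[n-1] first, x[0] last
        expanded = []
        for tail, cost in survivors:
            for s in alphabet:
                x_part = [s] + tail                  # symbols decided so far (indices level..n-1)
                resid = z[level] - R[level, level:] @ np.array(x_part)
                expanded.append((x_part, cost + resid ** 2))
        survivors = sorted(expanded, key=lambda t: t[1])[:K]
    return np.array(survivors[0][0])

rng = np.random.default_rng(1)
alphabet = [-3, -1, 1, 3]                            # 4-PAM per real dimension
H = rng.standard_normal((4, 4))
x = rng.choice(alphabet, size=4)
y = H @ x + 0.05 * rng.standard_normal(4)
print(x, k_best_detect(y, H, alphabet))              # should usually agree at this noise level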

Proceedings Article•DOI•
23 May 2004
TL;DR: A new orthogonal frequency division multiplexing (OFDM) scheme, named pulsed-OFDM, for ultra wideband (UWB) communications, which leads to an enhanced performance in multipath fading environments and a low complexity and low power consumption transceiver structure.
Abstract: We describe a new orthogonal frequency division multiplexing (OFDM) scheme, named pulsed-OFDM, for ultra-wideband (UWB) communications. Pulsed-OFDM modulation uses pulsed sinusoids instead of continuous sinusoids to send information in parallel over different sub-carriers. Pulsing the OFDM symbols spreads the spectrum of the modulated signal in the frequency domain, leading to a spreading gain equal to the inverse of the duty cycle of the pulsed sub-carriers. The spreading gain provided by this system leads to enhanced performance in multipath fading environments. We also show that, at the receiver, part of the multipath diversity can be exploited by the pulsed-OFDM system, leading to a low-complexity and low-power-consumption transceiver structure. Easy implementation and frequency spreading make pulsed-OFDM modulation well suited to UWB communications, where the ratio of bandwidth to data rate is large but the power spectral density is limited. We design a low-complexity and low-power-consumption pulsed-OFDM system for IEEE 802.15.3a wireless personal area networks and compare it with the normal OFDM system that has been proposed for this standard. We also provide realistic simulation results for the measured indoor propagation channels provided by the IEEE 802.15.3a standard to demonstrate the advantages of pulsed-OFDM modulation for UWB communications.

Proceedings Article•DOI•
17 May 2004
TL;DR: The proposed technique is demonstrated and applied to design multiplexer loop based DFEs with throughput in the range of 3.125-10 Gbps.
Abstract: The high speed implementation of a DFE (decision feedback equalizer) requires reformulation of the DFE into an array of comparators and a multiplexer loop. The throughput of the DFE is limited by the speed of the multiplexer loop. This paper proposes a novel look-ahead computation approach to pipeline multiplexer loops. The proposed technique is demonstrated and applied to design multiplexer loop based DFEs with throughput in the range of 3.125-10 Gbps.

Proceedings Article•DOI•
01 Dec 2004
TL;DR: It is shown that reconfigurability with the reduction polynomial significantly benefits from the addition of a low latency divider unit and scalar point multiplication in affine coordinates.
Abstract: This paper focuses on designing elliptic curve crypto-accelerators in GF(2^m) that are cryptographically scalable and hold some degree of reconfigurability. Previous work on elliptic curve crypto-accelerators focused on implementations using projective coordinate systems for specific field sizes. Their performance, measured in scalar point multiplications per second (kP/s), was determined primarily by the underlying multiplier implementation. In addition, a multiplier-only implementation and a multiplier-plus-divider implementation are compared in terms of critical path, area, and area-time (AT) product. Our multiplier-only design, designed for high performance, can achieve 6314 kP/s for GF(2^571) and requires 47876 LUTs. Meanwhile, our multiplier-and-divider design, with a greater degree of reconfigurability, can achieve 44 kP/s for GF(2^571). However, this design requires 27355 LUTs and has a significantly higher AT product. It is shown that reconfigurability with respect to the reduction polynomial benefits significantly from the addition of a low-latency divider unit and from scalar point multiplication in affine coordinates. In both cases the performance is limited by a critical path in the control logic.
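
The reconfigurability discussed above starts at the field-arithmetic level: a multiplier that takes the reduction polynomial as a parameter rather than hard-wiring it. A bit-serial software model is sketched below; the GF(2^163) polynomial is the NIST one, while the operand values are our own, and the hardware multiplier is of course not this Python loop.

def gf2m_mul(a, b, m, poly):
    """Shift-and-add multiplication in GF(2^m) modulo the given reduction polynomial."""
    r = 0
    for i in range(m - 1, -1, -1):        # scan the bits of b from MSB to LSB
        r <<= 1
        if r >> m:                        # reduce as soon as the degree reaches m
            r ^= poly
        if (b >> i) & 1:
            r ^= a
    return r

assert gf2m_mul(0b10, 0b100, 3, 0b1011) == 0b011   # x * x^2 = x + 1 in GF(2^3), poly x^3+x+1

p163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1     # x^163 + x^7 + x^6 + x^3 + 1
print(hex(gf2m_mul(0x5A2F9C3E1B7D, 0x1A2B3C, 163, p163)))  # example operands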

Proceedings Article•DOI•
01 Dec 2004
TL;DR: Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000E-8 BG560 device in non-feedback modes, which is faster and 79% more efficient than the fastest previous FPGA implementation known to date.
Abstract: This paper presents novel high-speed architectures for the hardware implementation of the Advanced Encryption Standard (AES) algorithm. Unlike previous works, which rely on look-up tables to implement the SubBytes and InvSubBytes transformations of the AES algorithm, the proposed design employs combinational logic only. As a direct consequence, the unbreakable delay incurred by look-up tables in the conventional approaches is eliminated, and the advantage of subpipelining can be further explored. Furthermore, composite field arithmetic is employed to reduce the area requirements, and different implementations for the inversion in the subfield GF(2^4) are compared. In addition, an efficient key expansion architecture suitable for the subpipelined round units is also presented. Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000E-8 BG560 device in non-feedback modes, which is faster and 79% more efficient than the fastest previous FPGA implementation known to date.

Proceedings Article•DOI•
23 May 2004
TL;DR: This paper presents an iterated short convolution (ISC) algorithm, based on the mixed radix algorithm and the fast convolution algorithm; the resulting linear convolution structure is transposed to obtain a new hardware-efficient fast parallel FIR filter structure, which saves a large amount of hardware cost.
Abstract: This paper presents an iterated short convolution (ISC) algorithm, based on the mixed radix algorithm and the fast convolution algorithm. This ISC-based linear convolution structure is transposed to obtain a new hardware-efficient fast parallel FIR filter structure, which saves a large amount of hardware cost, especially when the length of the FIR filter is large. For example, for a 576-tap filter, the proposed structure saves 16.7% to 43.6% of the delay elements and 2.9% to 27% of the additions required by prior fast parallel structures, when the level of parallelism varies from 6 to 72. The proposed structures are also regular.

Proceedings Article•DOI•
01 Dec 2004
TL;DR: To further reduce the critical path delay, new look-up tables (LUTs) are developed to replace both conventional LUTs and data format transformation blocks, and the adder trees are also reorganized for speed.
Abstract: This paper studies the load imbalance problem between the two stages of the belief propagation decoding algorithm for LDPC codes and redistributes the computational load between the two stages. To further reduce the critical path delay, new look-up tables (LUTs) are developed to replace both conventional LUTs and data format transformation blocks. The adder trees are also reorganized for speed. This novel approach can reduce the critical path delay by 41.0% with a negligible increase in the logic core size. This paper also exploits the similarity between these two stages and derives an area-efficient design that remaps the functional units for these two stages onto the same hardware, which can reduce the logic core size by 10.2% and the critical path delay by 16.2%.

Journal Article•DOI•
TL;DR: A state-space approach-based novel algorithm for designing fine-grain pipelined true orthogonal recursive digital filters is proposed using the matrix look-ahead technique, and it is shown both theoretically and numerically how the proposed orthogonal filter realizations achieve low sensitivity to finite word-length truncations.
Abstract: Orthogonal recursive or infinite impulse response (IIR) digital filters can achieve a sharp transition band, have good finite word-length behavior, and are used in many modern digital signal processing (DSP) applications such as mobile communications. However, Givens rotation or coordinate rotation digital computer (CORDIC)-based fine-grain pipelined true orthogonal recursive digital filters have yet to be developed. In this paper, a state-space approach-based novel algorithm for designing fine-grain pipelined true orthogonal recursive digital filters is proposed using the matrix look-ahead technique. The algorithm is developed for designing general multi-input/multi-output (MIMO) digital filters, whereas the single-input/single-output (SISO) filters are treated as special cases. The filter synthesis procedure contains five major steps and only involves applying orthogonal transformations that are known to be numerically very reliable and, therefore, is ideal for very large scale integration (VLSI) implementations. The proposed filter architectures are pipelined at fine-grain level and, thus, can be operated at arbitrarily high sample rates. The total complexity is MN(m+p)+(m+N)p+p(p-1)/2 Givens rotations for MIMO filters, where N, m, p, and M are the filter order, number of inputs, number of outputs, and pipelining level, respectively. For the SISO case, the complexity reduces to (2M+1)N+1 Givens rotations, which is linear with respect to the filter order and pipelining level. Furthermore, the filter realizations consist of only Givens rotations, which can be mapped onto CORDIC arithmetic-based processors. Different filter design and realization approaches are explored, and the resulting topologies are compared. As an application, a pipelined intermediate frequency filter for an American mobile telephone system is designed using the proposed approach. Finally, finite word-length simulations are carried out for various orthogonal topologies. It is shown both theoretically and numerically how the proposed orthogonal filter realizations achieve low sensitivity due to finite word-length truncations.
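
As a reminder of the starting description, the synthesis procedure begins from a state-space model x[n+1] = A x[n] + B u[n], y[n] = C x[n] + D u[n] before everything is factored into Givens rotations. A minimal simulation of such a model is sketched below with an arbitrary stable second-order example, not a filter from the paper and not an orthogonal realization.

import numpy as np

def state_space_filter(A, B, C, D, u):
    """Run the recursion x[n+1] = A x[n] + B u[n], y[n] = C x[n] + D u[n]."""
    x, y = np.zeros(A.shape[0]), []
    for un in u:
        y.append(C @ x + D * un)
        x = A @ x + B * un
    return np.array(y)

A = np.array([[0.5, -0.3], [0.3, 0.5]])      # arbitrary stable 2nd-order system
B = np.array([1.0, 0.0])
C = np.array([0.2, 0.4])
D = 0.1
print(state_space_filter(A, B, C, D, np.ones(8)))   # step response samples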

Proceedings Article•DOI•
23 May 2004
TL;DR: An improved most-significant-bit (MSB)-first bit-level pipelined add-compare select (ACS) unit structure is proposed and it is possible to decrease the critical path of the ACS unit by 12 to 15% compared with the conventional MSB-first structures.
Abstract: Convolutional codes are widely used in many communication systems due to their excellent error control performance. High-speed Viterbi decoders for convolutional codes are of great interest for high-data-rate applications. In this paper, an improved most-significant-bit (MSB)-first bit-level pipelined add-compare-select (ACS) unit structure is proposed. The ACS unit is the main bottleneck on the decoding speed of a Viterbi decoder. By balancing the settling time of different paths in the ACS unit, the length of the critical path is reduced as close as possible to the iteration bound in the ACS unit. With the proposed retimed structure, it is possible to decrease the critical path of the ACS unit by 12% to 15% compared with the conventional MSB-first structures. This reduction in critical path can reduce the level of parallelism (and area) required for a very high-speed (such as 10 Gbps) Viterbi decoder by about 25%.

Proceedings Article•DOI•
17 May 2004
TL;DR: Simulation results show that the error-rate performances of the two schemes are quite close to that of the conventional scheme, and the folded decoders for the two proposed schemes can achieve speedups of 4 and 2, respectively.
Abstract: It is highly likely that 10 Gigabit Ethernet over copper (10GBASE-T) transceivers will use a 10-level pulse amplitude modulation (PAM 10) as well as a 4D trellis code as in 1000BASE-T. The traditional trellis coded modulation scheme, as in 1000BASE-T, leads to a design where the corresponding decoder with a long critical path needs to operate at 833 MHz. It is difficult to meet the critical path requirements of such a decoder. To solve the problem, two interleaved trellis coded modulation schemes are proposed. The inherent decoding speed requirements are relaxed by factors of 4 and 2, respectively. Parallel decoding of the interleaved codes requires multiple decoders. To reduce the hardware overhead, time-multiplexed or folded decoder structures are proposed where only one decoder is needed and each delay in the decoder is replaced with four delays for scheme 1 and two delays for scheme 2, respectively. These delays can be used to reduce the critical path. Compared with the conventional decoder, the folded decoders for the two proposed schemes can achieve speedups of 4 and 2, respectively. Simulation results show that the error-rate performances of the two schemes are quite close to that of the conventional scheme.

Patent•
21 Jul 2004
TL;DR: In this article, the N/M trellis coded modulation and joint equalization and decoding operations can also be implemented using fewer hardware transceivers and decoders via a folding technique, where multiple operations are time-multiplexed onto the same hardware modulator or decoder operated at a higher clock speed.
Abstract: Digital communications systems employ trellis coded modulation schemes. A K-dimensional trellis coded modulated symbol is transmitted over M channels in K/M cycles (where M divides K). K/M consecutive data units are transmitted serially in a time-multiplexed manner in K/M consecutive cycles over one channel. If M=1, then each symbol is transmitted over one channel in K cycles in a time-multiplexed manner. At the receiver, a symbol is formed by grouping the data received from M channels in K/M cycles, and this symbol is then decoded by a joint equalizer and decoder. If the number of parallel channels is N, then N/M trellis coded modulators and N/M decoders can be used in parallel. The advantage of this approach is an increase in speed by a factor of N/M. The N/M trellis coded modulation and joint equalization and decoding operations can also be implemented with fewer hardware trellis coded modulators and decoders using a folding technique, where multiple operations are time-multiplexed onto the same hardware modulator or decoder operated at a higher clock speed.