
Showing papers by "Keshab K. Parhi published in 2008"


Journal ArticleDOI
TL;DR: This paper presents a systematic high-speed VLSI implementation of the discrete wavelet transform (DWT) based on hardware-efficient parallel FIR filter structures; computation time as low as N^2/12 can be achieved for an N×N image with a controlled increase in hardware cost.
Abstract: This paper presents a systematic high-speed VLSI implementation of the discrete wavelet transform (DWT) based on hardware-efficient parallel FIR filter structures. High-speed 2-D DWT with computation time as low as N^2/12 can be easily achieved for an N×N image with a controlled increase in hardware cost. Compared with recently published 2-D DWT architectures with computation times of N^2/3 and 2N^2/3, the proposed designs can also save a large number of multipliers and/or storage elements. The approach can also be used to implement those 2-D DWTs traditionally considered suitable for lifting- or flipping-based designs, such as the (9,7) and (6,10) DWTs. The throughput rate can be improved by a factor of 4 by the proposed approach, while the hardware cost increases by a factor of around 3. Furthermore, the proposed designs have very simple control signals, regular structures, and 100% hardware utilization for continuous images.
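As a point of reference for the operation being accelerated (a plain separable transform, not the paper's parallel FIR architecture), one level of a 2-D DWT can be sketched as FIR filtering plus downsampling along rows and then columns. The Haar filter pair below is an assumption for brevity; the paper targets filters such as the (9,7) pair.

```python
import numpy as np

def analysis_1d(x, h0, h1):
    """One level of 1-D DWT: filter with h0/h1, then downsample by 2."""
    lo = np.convolve(x, h0)[1::2]  # approximation (lowpass) coefficients
    hi = np.convolve(x, h1)[1::2]  # detail (highpass) coefficients
    return lo, hi

def dwt2_level(img, h0, h1):
    """One level of separable 2-D DWT: transform rows, then columns."""
    L, H = zip(*(analysis_1d(r, h0, h1) for r in img))
    L, H = np.array(L), np.array(H)
    LL, LH = zip(*(analysis_1d(c, h0, h1) for c in L.T))
    HL, HH = zip(*(analysis_1d(c, h0, h1) for c in H.T))
    return (np.array(LL).T, np.array(LH).T,
            np.array(HL).T, np.array(HH).T)

# Haar analysis filters (assumed here purely for illustration)
h0 = np.array([1.0, 1.0]) / np.sqrt(2)
h1 = np.array([1.0, -1.0]) / np.sqrt(2)
LL, LH, HL, HH = dwt2_level(np.random.rand(8, 8), h0, h1)
print(LL.shape)  # (4, 4): each subband is quarter-size
```

The paper's contribution is to restructure exactly these FIR filtering steps into parallel filter structures so that several outputs are produced per clock cycle.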

107 citations


Proceedings ArticleDOI
18 May 2008
TL;DR: An efficient barrel-shifter controller design for reconfigurable low-density parity-check (LDPC) decoders that leads to a significant reduction in hardware complexity, together with a novel simplified algorithm capable of generating all the control signals.
Abstract: In this paper, we propose an efficient controller design of the barrel shifter for reconfigurable low-density parity-check (LDPC) decoders, which leads to a significant reduction in hardware complexity. Since the structured LDPC codes for most modern wireless communication systems include multiple code rates, various block lengths, and different sizes of submatrices, a reconfigurable LDPC decoder is desirable and the barrel shifter needs to be programmable. Even though the Benes network can be optimized for the barrel shifting networks of a reconfigurable LDPC decoder, it is not trivial to generate all the control signals for the numerous 2×2 switches on the fly. A novel simplified algorithm capable of generating all the control signals is proposed, using the properties that a full-size Benes network can be broken into two half-size Benes networks and that the barrel shifters needed in structured LDPC decoders require only cyclic shifts. The proposed algorithm can be easily implemented with a small number of gates. Compared with a direct implementation using a dedicated look-up table, the proposed algorithm achieves a significant hardware reduction in implementing a reconfigurable LDPC decoder.
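The hard part addressed by the paper is generating the 2×2 switch settings of a Benes network on the fly. As a simpler point of comparison (a standard logarithmic barrel shifter, not the paper's Benes-based scheme), a programmable cyclic shift of N values, with N a power of two, needs only log2(N) conditional-rotate stages, each controlled by one bit of the shift amount:

```python
def cyclic_shift(data, s):
    """Cyclic left-shift by s, built from log2(N) mux stages.

    Stage k rotates by 2**k exactly when bit k of s is set, so the
    bits of s directly serve as the control signals (standard
    logarithmic barrel shifter; N = len(data) is assumed to be a
    power of two).
    """
    k = 1
    while k < len(data):
        if s & k:                       # control bit for this stage
            data = data[k:] + data[:k]  # conditional rotate by k
        k <<= 1
    return data

assert cyclic_shift(list(range(8)), 3) == [3, 4, 5, 6, 7, 0, 1, 2]
```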

22 citations


Journal ArticleDOI
TL;DR: The proposed architecture relaxes the constraint, required in previous work, that the look-ahead level M be a multiple of K, and reduces the latency of the conventional M-step look-ahead Viterbi architecture at the expense of 148.62% to 320.20% extra hardware complexity.
Abstract: By optimizing the number of look-ahead steps in the first layer of previous low-latency architectures for M-step look-ahead high-throughput-rate Viterbi decoders, this paper improves the hardware efficiency by a large percentage with a slight increase, or even a further decrease, of the latency of the add-compare-select (ACS) computation. This is true especially when the encoder constraint length (K) is large. For example, when K = 7 and M varies from 21 to 84, 20.83% to 41.27% of the hardware cost of the previous low-latency Viterbi method can be saved with at most a 12% increase, or up to a 4% decrease, of the latency of the conventional M-step look-ahead Viterbi decoder. The proposed architecture also relaxes the constraint that the look-ahead level M be a multiple of K, as was needed in the previous work. For example, when K = 7 and M (indivisible by K) varies from 40 to 80, 60.27% to 69.3% of the latency of the conventional M-step look-ahead Viterbi architecture can be eliminated at the expense of 148.62% to 320.20% extra hardware complexity.
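For context, the recursion that limits the clock rate, and that M-step look-ahead unrolls, is the add-compare-select (ACS) step sketched below; the 4-state radix-2 trellis and the metric values are illustrative assumptions, not the paper's design.

```python
import numpy as np

def acs_step(path_metrics, branch_metrics, predecessors):
    """One add-compare-select (ACS) step of Viterbi decoding.

    path_metrics[s]:      accumulated metric of state s
    branch_metrics[s, j]: metric of the j-th branch entering state s
    predecessors[s, j]:   state that branch j of state s comes from
    The output feeds the next step's input; this feedback loop is the
    speed bottleneck that M-step look-ahead unrolls.
    """
    candidates = path_metrics[predecessors] + branch_metrics  # add
    return candidates.min(axis=1)                             # compare-select

# illustrative 4-state radix-2 trellis
pm = np.zeros(4)
pred = np.array([[0, 1], [2, 3], [0, 1], [2, 3]])
for _ in range(3):
    pm = acs_step(pm, np.random.rand(4, 2), pred)
print(pm)
```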

20 citations


Proceedings ArticleDOI
04 May 2008
TL;DR: A novel fast composite-field S-Box architecture that can reduce the pipelining latency by 40%-60% compared with the conventional design while keeping the same throughput rate, together with an approach to further reduce the critical path delay.
Abstract: Byte substitution (S-Box), which is essentially a combination of inversion and affine operations over the finite field GF(2^8), limits the throughput of the Advanced Encryption Standard (AES) algorithm. Among existing S-Box architectures, the composite field S-Box algorithm is very attractive for its extremely low area cost, which is only 12%-20% of other implementation approaches [1]. However, the composite field S-Box suffers from an extremely low throughput rate. In this paper, we propose a novel fast composite field S-Box architecture. By applying pre-computation techniques, some computation on the critical data path can be eliminated so as to reduce the critical path delay. The complexity of the precomputation units is minimized via sharing common structures. The proposed design is implemented using a 0.18-μm CMOS technology library. The results show that the throughput rate is increased by 28.22% at the expense of a fairly modest increase in area. Based on the proposed design, we then present an approach to further reduce the critical path delay. The gate-level analysis shows that the second proposed approach can increase the throughput rate by 56.25%. In addition, the proposed designs can reduce the pipelining latency by 40%-60% compared with the conventional design while keeping the same throughput rate.
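For concreteness, a plain (non-composite-field) reference computation of the AES S-Box is sketched below: inversion in GF(2^8) followed by the affine transform. The paper's architecture instead maps the inversion into a composite field to cut area, which this sketch does not attempt.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def gf_inv(a):
    """Inverse via a^254 (Fermat); AES defines the inverse of 0 as 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def sbox(x):
    """AES byte substitution: GF(2^8) inversion, then affine transform."""
    y, out = gf_inv(x), 0
    for i in range(8):
        bit = ((y >> i) ^ (y >> ((i + 4) % 8)) ^ (y >> ((i + 5) % 8)) ^
               (y >> ((i + 6) % 8)) ^ (y >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

assert sbox(0x00) == 0x63 and sbox(0x01) == 0x7C
```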

20 citations


Proceedings ArticleDOI
01 Oct 2008
TL;DR: An optimally quantized offset min-sum decoding algorithm for a flexible low-density parity-check (LDPC) decoder that uses the received data directly instead of log-likelihood ratio (LLR) data as the intrinsic information to achieve better performance.
Abstract: In this paper, we analyze the performance of the quantized offset min-sum (MS) decoding algorithm and propose an optimally quantized offset MS algorithm for a flexible low-density parity-check (LDPC) decoder. It is known that the offset MS decoding algorithm can be implemented with simplified hardware complexity and achieves good decoding performance. However, finite precision effects in decoding LDPC codes result in performance that differs from the floating-point case. The performance degradation is caused by the differing dynamic ranges of the input data at high signal-to-noise ratio (SNR). The proposed offset MS algorithm uses the received data directly, instead of log-likelihood ratio (LLR) data, as the intrinsic information. It can achieve better performance than the conventional algorithm since its offset factor remains effective over a wide range of SNRs, and the intrinsic information is quantized more robustly because it is independent of the channel information. Moreover, the proposed scheme can use the same quantization scheme in a flexible LDPC decoder that decodes several kinds of LDPC codes. Simulation results show that our optimally quantized offset MS algorithm with 5-bit quantization for the (1728, 864) and (1728, 1296) irregular LDPC codes achieves better performance than the conventional offset MS algorithm with a 6-bit quantization scheme.
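For reference, the offset MS check-node update whose quantization is being analyzed can be sketched as follows; the offset beta and the message values are illustrative.

```python
def offset_ms_check_node(msgs, beta):
    """Offset min-sum check-node update for one parity check.

    msgs: incoming variable-to-check messages (LLR-like values).
    For each edge, the outgoing magnitude is the minimum magnitude of
    the other edges reduced by the offset beta (floored at 0), and the
    sign is the product of the other edges' signs.
    """
    out = []
    for i in range(len(msgs)):
        others = msgs[:i] + msgs[i + 1:]
        mag = max(min(abs(m) for m in others) - beta, 0.0)
        sign = 1.0
        for m in others:
            sign = -sign if m < 0 else sign
        out.append(sign * mag)
    return out

print(offset_ms_check_node([1.5, -0.7, 2.0], beta=0.25))
# [-0.45, 1.25, -0.45]
```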

12 citations


Proceedings ArticleDOI
04 May 2008
TL;DR: A novel min-sum (MS) decoder architecture using nonuniform quantization schemes for low-density parity-check (LDPC) codes that can reduce the finite word-length while achieving performance similar to a conventional quantization scheme.
Abstract: In this paper, we propose a novel min-sum (MS) decoder architecture using nonuniform quantization schemes for low-density parity-check (LDPC) codes. The finite word-length analysis in implementing an LDPC decoder is a very important factor, since it directly impacts the size of the memory that stores the intrinsic and extrinsic messages and the overall hardware area of the partially parallel LDPC decoder. The proposed nonuniform quantization scheme can reduce the finite word-length while achieving performance similar to a conventional quantization scheme. Simulation results show that the proposed 4-bit nonuniform quantization scheme achieves acceptable decoding performance, unlike a conventional 4-bit uniform quantization scheme. In addition, the hardware implementation of the proposed nonuniform quantization scheme requires a smaller area.
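The idea can be illustrated with a toy quantizer pair: a uniform 4-bit quantizer versus a nonuniform one whose levels are denser near zero, where LLR messages concentrate. The level set below is an illustrative assumption, not the paper's optimized scheme.

```python
import numpy as np

def quantize_uniform(x, step=0.5, bits=4):
    """Uniform mid-tread quantizer with roughly 2^bits levels."""
    lim = (2 ** (bits - 1) - 1) * step
    return np.clip(np.round(x / step) * step, -lim, lim)

def quantize_nonuniform(x, levels):
    """Map each sample to the nearest of a fixed nonuniform level set."""
    levels = np.asarray(levels)
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# 16-entry (4-bit) nonuniform level set: dense near 0, sparse far out
lv = [-12, -8, -5, -3, -2, -1.25, -0.75, -0.25,
      0.25, 0.75, 1.25, 2, 3, 5, 8, 12]
x = np.random.randn(10) * 3
print(quantize_uniform(x))
print(quantize_nonuniform(x, lv))
```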

8 citations


Journal ArticleDOI
TL;DR: A novel method based on a word-length reduction technique is proposed; applying it to a 10 GBASE-T Ethernet system reduces the hardware complexity of echo and NEXT cancellers by about 10.82% without performance loss.
Abstract: Gigabit and multigigabit transceivers require very long adaptive filters for echo and near-end crosstalk (NEXT) cancellation. Implementation of these filters not only occupies a large silicon area but also consumes significant power. These problems become even worse when the Tomlinson-Harashima precoding (THP) technique is used, as in 10-Gigabit Ethernet over Copper (10 GBASE-T), because the input to the echo and NEXT cancellers is no longer a simple PAM-M signal. To reduce the complexity of these cancellers, this paper proposes a novel method based on a word-length reduction technique. The proposed design is derived by replacing the original input to the echo and NEXT cancellers with a finite-level signal, which is the sum of the input to the TH precoder and a finite-level compensation signal. This modified input signal is then recoded to have a shorter word-length than the original input, so the overall complexity can be reduced. To further reduce the complexity of these cancellers, an improved design is proposed by exploiting a property of the compensation signal. Compared with the traditional design, the proposed echo and NEXT cancellers have an exact input and do not suffer from the quantization problem, and thus they are more suitable for VLSI implementation. The proposed method can also be applied to design adaptive echo and NEXT cancellers with little modification. The performance of the proposed design is verified by simulation. It is shown that, by applying the proposed method to a 10 GBASE-T Ethernet system, the hardware complexity of the echo and NEXT cancellers can be reduced by about 10.82% without performance loss, compared with the traditional design.
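A minimal numerical sketch of the key observation, under assumed PAM-4 signaling and assumed feedback taps: the TH precoder's modulo operation effectively adds a finite-level compensation signal c[n] to the data x[n], and the sum x[n] + c[n] takes only a handful of levels, so it can be recoded with a much shorter word-length than the precoder output itself.

```python
import numpy as np

def th_precode(x, b, mod=8):
    """Tomlinson-Harashima precoder for PAM-4 (levels +/-1, +/-3, modulo 8).

    Returns the precoder output v and the compensation signal c, where
    c[n] is the (finite-level) multiple of the modulo base added by the
    modulo device, so that x + c is the finite-level canceller input.
    """
    v, c = np.zeros(len(x)), np.zeros(len(x))
    for n in range(len(x)):
        fb = sum(b[k] * v[n - 1 - k] for k in range(min(len(b), n)))
        raw = x[n] - fb
        v[n] = (raw + mod / 2) % mod - mod / 2  # modulo into [-4, 4)
        c[n] = round(v[n] - raw)                # exact multiple of 8
    return v, c

b = [0.5, -0.25]                                # assumed feedback taps
x = np.random.choice([-3, -1, 1, 3], size=1000)
v, c = th_precode(x, b)
u = x + c                                       # finite-level canceller input
print("distinct canceller-input levels:", len(np.unique(u)))
print("bits needed:", int(np.ceil(np.log2(len(np.unique(u))))))
```

With these assumed taps, u takes at most a dozen levels (about 4 bits), whereas the precoder output v is essentially continuous-valued; this gap is where the word-length reduction comes from.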

7 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: An analytical approach is developed to estimate the statistics of computer arithmetic computation errors due to supply voltage overscaling; it can be several orders of magnitude faster than circuit simulation while achieving reasonable accuracy.
Abstract: This work concerns the design of low-power signal processing systems at overscaled supply voltages, in which the behavior of computer arithmetic units in response to the overscaled voltage plays an important role. We show that different hardware implementations of the same computer arithmetic function may respond to an overscaled voltage very differently, resulting in different energy-saving potential. Therefore, we develop an analytical approach to estimate the statistics of computer arithmetic computation errors due to supply voltage overscaling. Compared with computation-intensive circuit simulations, this analytical approach can be several orders of magnitude faster while achieving reasonable accuracy.
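As a toy illustration of the flavor of such analysis (not the paper's model): in a ripple-carry adder, voltage overscaling shrinks the number of carry stages that settle within the clock period, so an output errs when the longest carry chain exceeds that budget. The Monte Carlo sketch below estimates this probability for uniformly random inputs.

```python
import random

def longest_carry_chain(a, b, bits=16):
    """Longest distance a carry ripples when computing a + b."""
    chain = longest = 0
    for i in range(bits):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:                 # generate: a new carry starts here
            chain = 1
        elif (x ^ y) and chain:   # propagate: active carry ripples onward
            chain += 1
        else:                     # kill, or propagate with no carry
            chain = 0
        longest = max(longest, chain)
    return longest

def p_timing_error(budget, bits=16, trials=50_000):
    """P(longest carry chain exceeds the stages that still meet timing)."""
    hits = sum(longest_carry_chain(random.getrandbits(bits),
                                   random.getrandbits(bits), bits) > budget
               for _ in range(trials))
    return hits / trials

for budget in (4, 6, 8):  # fewer stages fit as the supply is scaled down
    print(budget, p_timing_error(budget))
```

An analytical model, as in the paper, replaces this simulation loop with closed-form statistics of the carry-chain length, which is what buys the orders-of-magnitude speedup.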

5 citations


Proceedings ArticleDOI
17 Nov 2008
TL;DR: The computational redundancy in existing branch metric computation approaches is first recognized, and a general mathematical model for describing the approach space is built, based on which a new approach with minimal complexity and latency is proposed and the proof of its optimality is given.
Abstract: For Viterbi decoders, a high throughput rate is achieved by applying look-ahead techniques in the add-compare-select unit, which is the system speed bottleneck. Look-ahead techniques combine multiple binary trellis steps into one equivalent complex trellis step; the unit performing this combination is referred to as the branch metrics precomputation (BMP) unit. The complexity and latency of BMP increase exponentially and linearly, respectively, with the number of look-ahead levels. For a Viterbi decoder with constraint length K and M-step look-ahead, 2^(M+K-1) branch metrics need to be computed and compared. In this paper, the computational redundancy in existing branch metric computation approaches is first recognized, and a general mathematical model for describing the approach space is built, based on which a new approach with minimal complexity and latency is proposed; a proof of its optimality is also given. This highly efficient approach leads to a novel overall optimal architecture for any M that is a multiple of K. The results show that the proposed approaches can reduce the complexity by up to 45.65% and the latency by up to 72.50%. In addition, the proposed architecture can also be applied for arbitrary M while still achieving the minimal complexity.
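To see the 2^(M+K-1) count concretely, a brute-force BMP enumeration for a toy trellis is sketched below; the metric function is a made-up placeholder, and real designs (including the paper's) share and prune these computations rather than enumerate them.

```python
from itertools import product

def bmp_bruteforce(branch_metric, K, M):
    """Enumerate all M-step branch metrics of a 2^(K-1)-state trellis.

    branch_metric(state, bit) -> (metric, next_state). Returns a dict
    keyed by (start_state, input_bits); its size is 2^(M+K-1), showing
    why BMP complexity grows exponentially with the look-ahead level M.
    """
    out = {}
    for s0 in range(2 ** (K - 1)):
        for bits in product((0, 1), repeat=M):
            s, total = s0, 0.0
            for b in bits:
                m, s = branch_metric(s, b)
                total += m
            out[(s0, bits)] = (total, s)
    return out

# toy K=3 trellis with placeholder metrics
bm = lambda s, b: (float((s ^ b) & 1), ((s << 1) | b) & 0b11)
table = bmp_bruteforce(bm, K=3, M=4)
print(len(table))  # 2^(4+3-1) = 64
```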

4 citations


Journal ArticleDOI
TL;DR: A novel parallel architecture is proposed to speed up Tomlinson-Harashima precoders, which can be used in many high-speed applications, such as 10-Gb Ethernet over copper.
Abstract: Like decision feedback equalizers (DFEs), Tomlinson-Harashima precoders (TH precoders) contain nonlinear feedback loops, which limit their use in high-speed applications. Unlike in DFEs, where the output levels of the nonlinear devices are finite, in TH precoders the output levels of the modulo devices are either infinite, or finite but very large. Thus, the look-ahead and pre-computation techniques that were successfully applied in the past to design parallel and pipelined infinite impulse response (IIR) filters and DFEs are difficult to apply to TH precoders. However, a TH precoder can be viewed as an IIR filter whose input equals the sum of the original input to the TH precoder and a finite-level compensation signal. Based on this point of view, a novel parallel architecture is proposed to speed up TH precoders. This architecture can be used in many high-speed applications, such as 10-Gb Ethernet over copper.
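In equation form (with modulo base 2M and feedback filter B(z) = sum_k b_k z^-k), the modulo operation is equivalent to adding an integer multiple of 2M, which yields the IIR view used by the paper:

```latex
v[n] = x[n] - \sum_{k=1}^{N} b_k\, v[n-k] + \underbrace{2M\, z[n]}_{c[n]},
\qquad z[n] \in \mathbb{Z}
\quad\Longrightarrow\quad
V(z) = \frac{X(z) + C(z)}{1 + B(z)}
```

Since c[n] takes only a few discrete levels, the nonlinear precoder becomes a linear IIR filter driven by the finite-level signal x[n] + c[n], which is what makes look-ahead-style parallelization applicable.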

Proceedings ArticleDOI
01 Oct 2008
TL;DR: In this article, the authors presented a new efficient method for stable pole-zero modeling of long finite impulse response (FIR) filters based on the formulation of a generalized system identification problem using the minimum mean-square error (MMSE) criterion.
Abstract: This paper presents a new efficient method for stable pole-zero modeling of long finite impulse response (FIR) filters. The proposed method is based on the formulation of a generalized system identification problem using the minimum mean-square error (MMSE) criterion. A computationally efficient MMSE solution is then derived by exploiting the structure of the matrices involved. The proposed method is general and applicable to cases with unequal numbers of poles and zeros. Compared with the generalized ARMA-Levinson algorithm, the proposed method has lower computational complexity without loss of accuracy. Numerical results are provided to demonstrate the effectiveness of the proposed method.
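As a baseline for the problem being solved (classic Prony-style least squares, not the paper's generalized MMSE method), fitting a pole-zero model to a long FIR impulse response can be sketched as follows:

```python
import numpy as np
from scipy.signal import lfilter

def prony_fit(h, p, q):
    """Fit H(z) ~ B(z)/A(z) with q zeros and p poles to impulse response h.

    Classic Prony least squares: solve h[n] = -sum_{k=1..p} a_k h[n-k]
    for n > q in the least-squares sense to get the denominator, then
    pick B(z) so the first q+1 samples match exactly.
    """
    h = np.asarray(h, float)
    hp = np.concatenate([np.zeros(p), h])          # zero-pad for n-k < 0
    A = np.array([[hp[p + n - k] for k in range(1, p + 1)]
                  for n in range(q + 1, len(h))])
    a = np.linalg.lstsq(A, -h[q + 1:], rcond=None)[0]
    a = np.concatenate([[1.0], a])                 # A(z) = 1 + a1 z^-1 + ...
    b = np.array([sum(a[k] * hp[p + n - k] for k in range(min(n, p) + 1))
                  for n in range(q + 1)])          # b = (a * h)[0..q]
    return b, a

# example: a long "FIR" response that is exactly a 2-pole, 2-zero system
h_true = lfilter([1.0, 0.4, 0.1], [1.0, -1.2, 0.72], np.eye(1, 200)[0])
b, a = prony_fit(h_true, p=2, q=2)
print(np.round(b, 3), np.round(a, 3))  # recovers [1, .4, .1], [1, -1.2, .72]
```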

Proceedings ArticleDOI
17 Nov 2008
TL;DR: It is shown that, by applying the proposed scheme to the 10 Gigabit Ethernet over copper (10 GBASE-T) system, the hardware cost of adaptive echo and NEXT cancellers can be reduced by about 42.02% with only a 1.5 dB performance loss, compared with the traditional design.
Abstract: Efficient implementation of adaptive echo and near-end crosstalk (NEXT) cancellers in high-speed Ethernet transceivers continues to be a challenging problem. In our previous work, we proposed a method based on a word-length reduction technique to reduce the hardware cost of the filter part of adaptive echo and NEXT cancellers. However, that method did not reduce the high hardware cost of the weight update part of these adaptive cancellers. This paper presents a new complexity reduction scheme for the weight update part of adaptive echo and NEXT cancellers. By reducing the hardware cost of the weight update part, the overall hardware cost of these cancellers can be further reduced. It is shown that, by applying the proposed scheme to the 10 Gigabit Ethernet over copper (10 GBASE-T) system, the hardware cost of adaptive echo and NEXT cancellers can be reduced by about 42.02% with only a 1.5 dB performance loss, compared with the traditional design.
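For reference, the weight update part in question is the LMS-style recursion sketched below, where every tap needs its own multiply-accumulate per sample; the tap count, echo path, and step size are illustrative assumptions.

```python
import numpy as np

def lms_update(w, x_buf, d, mu):
    """One LMS step for an adaptive echo canceller.

    w: filter weights; x_buf: most recent inputs (same length as w);
    d: desired sample (received signal); mu: step size. Each tap needs
    a multiply-accumulate here, which is the cost being reduced.
    """
    e = d - w @ x_buf            # cancellation error
    return w + mu * e * x_buf, e

# toy run: adapt 64 taps toward an assumed echo path
rng = np.random.default_rng(0)
h_echo = rng.normal(size=64) * 0.1
w = np.zeros(64)
x = rng.choice([-3.0, -1.0, 1.0, 3.0], size=5000)
for n in range(64, len(x)):
    x_buf = x[n - 64:n][::-1]
    d = h_echo @ x_buf
    w, e = lms_update(w, x_buf, d, mu=1e-3)
print("tap error:", np.linalg.norm(w - h_echo))
```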

Proceedings ArticleDOI
01 Dec 2008
TL;DR: A novel pipelined algorithm is applied in the hardware implementation of the Smith-Waterman algorithm to cut the computation time from O(m+n) to O(m+n/J), where J is the pipeline level and m and n are the lengths of the query and subject sequences, respectively.
Abstract: In this paper, a novel pipelined algorithm is applied in the hardware implementation of the Smith-Waterman algorithm. The proposed algorithm cuts the computation time from O(m+n) to O(m+n/J), where J is the pipeline level and m and n are the lengths of the query and subject sequences, respectively. Consequently, if the subject sequence is much longer than the query sequence, i.e., n >> m, the scanning of protein sequences is sped up by a factor of approximately J.
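For context, the dependency structure that such hardware exploits is shown in the sketch below: cells of the score matrix on the same anti-diagonal are mutually independent, so a whole diagonal can be computed in parallel, and deeper pipelining of this wavefront is what yields the factor-J speedup. The scoring parameters (simple linear gap model) are illustrative.

```python
def smith_waterman(q, s, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score, computed anti-diagonal by
    anti-diagonal: every cell on a diagonal depends only on the two
    previous diagonals, so hardware can evaluate each diagonal (or,
    with pipelining, several) in parallel."""
    m, n = len(q), len(s)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):              # anti-diagonal index i+j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            sub = match if q[i - 1] == s[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```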