Showing papers on "Gate count published in 2010"


Book ChapterDOI
20 May 2010
TL;DR: The result is, as far as the authors know, the circuit with the smallest gate count yet constructed for this function, and it is experimentally verified that the second step of the technique yields significant improvements over conventional methods when applied to randomly chosen linear transformations.
Abstract: A new technique for combinational logic optimization is described. The technique is a two-step process. In the first step, the non-linearity of a circuit – as measured by the number of non-linear gates it contains – is reduced. The second step reduces the number of gates in the linear components of the already reduced circuit. The technique can be applied to arbitrary combinational logic problems, and often yields improvements even after optimization by standard methods has been performed. In this paper we show the results of our technique when applied to the S-box of the Advanced Encryption Standard (AES [6]). This is an experimental proof of concept, as opposed to a full-fledged circuit optimization effort. Nevertheless the result is, as far as we know, the circuit with the smallest gate count yet constructed for this function. We have also used the technique to improve the performance (in software) of several candidates to the Cryptographic Hash Algorithm Competition. Finally, we have experimentally verified that the second step of our technique yields significant improvements over conventional methods when applied to randomly chosen linear transformations.

148 citations
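
The "number of gates in the linear components" in the abstract above can be made concrete with a small sketch. It counts the two-input XOR gates needed to evaluate a linear map over GF(2) output by output, and again after one shared subexpression is factored out by hand. The example matrix, the function names, and the single sharing step are illustrative assumptions; this is not the authors' optimization heuristic.

```python
def naive_xor_count(rows):
    """XOR gates needed when each output is computed independently.

    Each row is a set of input indices whose XOR forms one output;
    an output with k inputs needs k - 1 two-input XOR gates.
    """
    return sum(max(len(r) - 1, 0) for r in rows)


def share_pair(rows, pair, new_index):
    """Replace every occurrence of `pair` with one new intermediate signal.

    Returns the rewritten rows; the intermediate itself costs one XOR gate.
    """
    a, b = pair
    out = []
    for r in rows:
        if a in r and b in r:
            out.append((r - {a, b}) | {new_index})
        else:
            out.append(set(r))
    return out


# Hypothetical 3-output linear map on inputs x0..x3.
rows = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2, 3}]
naive = naive_xor_count(rows)               # 2 + 2 + 3 gates
shared_rows = share_pair(rows, (0, 1), 4)   # signal 4 stands for x0 ^ x1
shared = 1 + naive_xor_count(shared_rows)   # 1 gate for signal 4, plus the rest
```

Here the unshared evaluation needs 7 XOR gates and the shared one needs 5, which shows the kind of saving a linear-component optimizer looks for.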


Journal ArticleDOI
TL;DR: It is proved that the proposed method provides a minimal majority expression and an optimal QCA layout for any given three-variable Boolean function and removes all the redundancies that are produced in the process of converting a decomposed network into a majority network.
Abstract: Quantum-dot cellular automata (QCA) has been widely considered as a replacement candidate for complementary metal-oxide semiconductor (CMOS). The fundamental logic device in QCA is the majority gate. In this paper, we propose an efficient methodology for majority logic synthesis of arbitrary Boolean functions. We prove that our method provides a minimal majority expression and an optimal QCA layout for any given three-variable Boolean function. In order to obtain high-quality decomposed Boolean networks, we introduce a new decomposition scheme that can decompose all Boolean networks efficiently. Furthermore, our method removes all the redundancies that are produced in the process of converting a decomposed network into a majority network. In existing methods, however, these redundancies are not considered. We have built a majority logic synthesis tool based on our method and several existing logic synthesis tools. Experiments with 40 multiple-output benchmarks indicate that, compared to existing methods, our method optimizes 37 benchmarks, with reductions of up to 31.6%, 78.2%, 75.5%, and 83.3% in level count, gate count, gate input count, and inverter count, respectively, and average reductions of 4.7%, 14.5%, 13.3%, and 26.4%, respectively. We have also implemented the QCA layouts of 10 benchmarks by using our method. Results indicate that, compared to existing methods, reductions of up to 33.3%, 76.7%, and 75.5% in delay, cell count, and area, respectively, are possible, with average reductions of 8.1%, 28.9%, and 29.0%, respectively.

106 citations
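
As a brief illustration of why the majority gate can serve as the fundamental logic device, the sketch below models a three-input majority function and derives AND and OR from it by tying one input to a constant. The function names are illustrative assumptions; the code says nothing about the synthesis method proposed in the paper.

```python
from itertools import product


def maj(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def and2(a, b):
    """AND obtained by fixing one majority input to 0."""
    return maj(a, b, 0)


def or2(a, b):
    """OR obtained by fixing one majority input to 1."""
    return maj(a, b, 1)


# Exhaustively compare the derived gates with Python's bitwise operators.
checks = [
    (and2(a, b) == (a & b), or2(a, b) == (a | b))
    for a, b in product((0, 1), repeat=2)
]
```

Together with inverters, these two derived gates are enough to express any Boolean function, which is why majority-gate and inverter counts are the natural cost metrics in this setting.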


Posted Content
01 Jan 2010
TL;DR: In this article, a two-step technique for combinational logic optimization is described, where the first step reduces the nonlinearity of a circuit, as measured by the number of non-linear gates it contains, and the second step reduces a circuit's number of gates in the linear components.
Abstract: A new technique for combinational logic optimization is described. The technique is a two-step process. In the first step, the nonlinearity of a circuit – as measured by the number of non-linear gates it contains – is reduced. The second step reduces the number of gates in the linear components of the already reduced circuit. The technique can be applied to arbitrary combinational logic problems, and often yields improvements even after optimization by standard methods has been performed. In this paper we show the results of our technique when applied to the S-box of the Advanced Encryption Standard (AES [5]). This is an experimental proof of concept, as opposed to a full-fledged circuit optimization effort. Nevertheless the result is, as far as we know, the circuit with the smallest gate count yet constructed for this function. We have also used the technique to improve the performance (in software) of several candidates to the Cryptographic Hash Algorithm Competition. Finally, we have experimentally verified that the second step of our technique yields significant improvements over conventional methods when applied to randomly chosen linear transformations.

48 citations


Journal ArticleDOI
TL;DR: Experiments show that the proposed bandwidth adaptive hardware architecture of K-Means clustering can be used in applications such as image segmentation, and it has the maximum clock speed 400-MHz and 440-K gate count with TSMC 90-nm technology.
Abstract: K-Means is a clustering algorithm that is widely applied in many fields, including pattern classification and multimedia analysis. Due to real-time requirements and computational-cost constraints in embedded systems, it is necessary to accelerate K-Means algorithm by hardware implementations in SoC environments, where the bandwidth of the system bus is strictly limited. In this paper, a bandwidth adaptive hardware architecture of K-Means clustering is proposed. Experiments show that the proposed hardware can be used in applications such as image segmentation, and it has the maximum clock speed 400-MHz and 440-K gate count with TSMC 90-nm technology. Moreover, the throughput of the proposed hardware reaches 16 dimension/cycle, and it can deal with feature vectors with different dimensions using five parallel modes to utilize the input bandwidth efficiently.

45 citations


Journal ArticleDOI
TL;DR: This paper presents a new 4*4 parity preserving reversible logic gate, IG, which can be used to synthesize any arbitrary Boolean function and makes any fault that affects no more than a single signal readily detectable at the circuit's primary outputs.
Abstract: Reversible logic is emerging as an important research area having its application in diverse fields such as low power CMOS design, digital signal processing, cryptography, quantum computing and optical information processing. This paper presents a new 4*4 parity preserving reversible logic gate, IG. The proposed parity preserving reversible gate can be used to synthesize any arbitrary Boolean function. It makes any fault that affects no more than a single signal readily detectable at the circuit's primary outputs. It is shown that a fault tolerant reversible full adder circuit can be realized using only two IGs. The proposed fault tolerant full adder (FTFA) is used as the fundamental building block for other arithmetic logic circuits. It has also been demonstrated that the proposed design offers less hardware complexity and is more efficient in terms of gate count, garbage outputs and constant inputs than the existing counterparts.

39 citations
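
The two properties named in the abstract above, reversibility and parity preservation, can be checked exhaustively for a small gate. The sketch below does so for the well-known 3*3 Fredkin (controlled-swap) gate, which is used here as a stand-in because the abstract does not give the equations of the proposed IG gate. All names are illustrative assumptions.

```python
from itertools import product


def fredkin(a, b, c):
    """Controlled swap: when a is 1, outputs b and c are exchanged."""
    return (a, c, b) if a else (a, b, c)


inputs = list(product((0, 1), repeat=3))
outputs = [fredkin(*bits) for bits in inputs]

# Reversible: the mapping is a bijection, so no two inputs collide.
is_reversible = len(set(outputs)) == len(inputs)

# Parity preserving: XOR of the inputs equals XOR of the outputs.
preserves_parity = all(
    sum(i) % 2 == sum(o) % 2 for i, o in zip(inputs, outputs)
)
```

Because parity is preserved, a fault that flips a single signal changes the output parity, and a parity check at the primary outputs can detect it.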


Patent
24 May 2010
TL;DR: In this article, a pipelined CORDIC architecture is proposed to improve throughput and resource utilization, while reducing the gate count, by integrating the benefits of multi-dimensional annihilation capability of Householder reflections plus the low-complexity nature of the conventional 2D Givens rotations.
Abstract: A QRD processor for computing input signals in a receiver for wireless communication relies upon a combination of multi-dimensional Givens Rotations, Householder Reflections and conventional two-dimensional (2D) Givens Rotations, for computing the QRD of matrices. The proposed technique integrates the benefits of multi-dimensional annihilation capability of Householder reflections plus the low-complexity nature of the conventional 2D Givens rotations. Such integration increases throughput and reduces the hardware complexity, by first decreasing the number of rotation operations required and then by enabling their parallel execution. A pipelined architecture is presented (290) that uses un-rolled pipelined CORDIC processors (245a to 245d) iteratively to improve throughput and resource utilization, while reducing the gate count.

25 citations


Proceedings ArticleDOI
26 May 2010
TL;DR: A new algorithm, MP (multiple pass), to synthesize large reversible binary circuits without ancilla bits is presented, which allows for synthesis of large scale reversible circuits (30-bits), which is not possible with existing algorithms.
Abstract: This paper presents a new algorithm, MP (multiple pass), to synthesize large reversible binary circuits without ancilla bits. The MMD algorithm requires storing a truth table (or a Reed-Muller (RM) transform) as a 2^n vector for a reversible function of n variables. This representation prohibits synthesis of large functions. However, in MP we do not store such an exponentially growing data structure. The values of minterms are calculated in MP dynamically, one-by-one, from a set of logic equations that specify the reversible circuit to be designed. This allows for synthesis of large scale reversible circuits (30-bits), which is not possible with existing algorithms. In addition, our unique multipass approach, where the circuit is synthesized with various, yet specific, minterm orders, yields an optimal solution. The algorithm returns a description of the optimal circuit with respect to gate count or quantum cost. Although the synthesis process is relatively slower, the solution is found in real-time for smaller circuits of 8 bits or less.

20 citations


Journal ArticleDOI
TL;DR: This paper demonstrates the first fully integrated built-in self-test (BIST) Σ-Δ analog-to-digital converter (ADC) chip to the best of the authors' knowledge.
Abstract: This paper demonstrates the first fully integrated built-in self-test (BIST) Σ-Δ analog-to-digital converter (ADC) chip to the best of our knowledge. The ADC under test (AUT) comprises a second-order design-for-digital-testability Σ-Δ modulator and a decimation filter. The purely digital BIST circuitry conducts single-tone tests for the signal-to-noise-and-distortion ratio (SNDR), the dynamic range, the offset, and the gain error of the AUT. The BIST design is based on the proposed modified controlled sine-wave fitting procedure to address the component overload issues, reduce the setup parameter numbers, and eliminate the need for parallel multipliers. The total gate count of the whole BIST circuitry is only 13 300. The hardware overhead is much less than the BIST design using the traditional fast Fourier transform (FFT) analysis. Measurement results show that the peak SNDR results of the proposed BIST design and the conventional FFT analysis are 75.5 and 75.3 dB, respectively. The subtle SNDR difference is already within analog test uncertainty. The BIST Σ-Δ ADC achieves a digital test bandwidth higher than 17 kHz, very close to the rated 20-kHz bandwidth of the AUT.

19 citations


Proceedings ArticleDOI
02 Apr 2010
TL;DR: According to the approach, one can realize a variety of reversible full-adders flexibly with only two reversible gates and two garbage outputs, which have improvements in the gate count and garbage count and can reduce the cost of network.
Abstract: Reversible gates, which are attracting increasing attention, have been widely used in low-power CMOS design, optical computing and quantum computing. Much of the existing literature presents only methods of constructing certain specific reversible full-adders, whereas we propose a general approach to constructing the reversible full-adder. According to the approach, we can realize a variety of reversible full-adders flexibly with only two reversible gates and two garbage outputs, which have improvements in the gate count and garbage count and can reduce the cost of network.

18 citations


Proceedings ArticleDOI
20 Jun 2010
TL;DR: It is reasonable to consider using pipelined S-boxes in AES hardware implementations targeted at applications requiring low area and moderate speed, as well as a new compact AES encryption hardware core with 128-bit keys.
Abstract: Pipelined S-boxes are usually used in high speed hardware implementations of the Advanced Encryption Standard (AES), and not typically found in compact implementations because of the extra complexity added by the pipeline registers. In this paper, the area and speed performance of applying a pipelined S-box to compact AES hardware implementations is examined. A new compact AES encryption hardware core with 128-bit keys is proposed. The proposed design employs a single 4-stage pipelined S-box that is shared by the data path operation and the key expansion operation. Compared with the previous smallest encryption-only ASIC implementation of AES, it achieves an increase in throughput of 2.1 times while maintaining a similar gate count. This result indicates that it is reasonable to consider using pipelined S-boxes in AES hardware implementations targeted at applications requiring low area and moderate speed.

17 citations


Proceedings ArticleDOI
Xiao Peng, Zhixiang Chen, Xiongxin Zhao, Fumiaki Maehara, Satoshi Goto
07 Jul 2010
TL;DR: A variation Banyan network (VBN) based permutation network architecture for reconfigurable QC-LDPC decoders is proposed, together with a control signal generating algorithm for cyclic shift; introducing a bypass network yields a nonblocking scheme for any input number and shift number.
Abstract: Permutation network plays an important role in the reconfigurable QC-LDPC decoder for most modern wireless communication systems with multiple code rates and various code lengths. In this paper, we propose the variation Banyan network (VBN) based permutation network architecture for the reconfigurable QC-LDPC decoders and give the control signal generating algorithm for cyclic shift. Through introducing the bypass network, we put forward the nonblocking scheme for any input number and shift number. In addition, the optimized VBN is proposed for WiMAX and WiFi standard, which can shift at most 4 groups of input data, and greatly reduce the hardware complexity. The synthesis results using the 90nm technology demonstrate that the proposed permutation network can be implemented with the gate count of 18.3k and the frequency of 600 MHz.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: Novel designs for the reversible ALU of a cryptoprocessor, implemented in a standard gate library, are proposed, and the quantum costs reported here are better than the lower bounds reported in the literature.
Abstract: Reversible logic is a good candidate for defending against power analysis attacks because it ideally does not dissipate any heat, and it is an emerging research area today. Different designs for reversible hardware cryptography have been proposed in the literature, but they have been implemented using complex gate libraries; further, theorems have been proposed that define a lower limit on the implementation cost, that is, the quantum cost. We have proposed novel designs for the reversible ALU of a cryptoprocessor which have been implemented in a standard gate library, and the quantum costs reported here are better than the lower bounds reported in the literature. Further, we have calculated the delay of the proposed designs. We have verified that our proposed designs are minimal with respect to gate count, that is, circuit cost, by simulating them in RevKit. This is the first time that optimization algorithms for quantum cost and delay have been applied to improve the cost metric in reversible ALU design.

Proceedings ArticleDOI
01 Aug 2010
TL;DR: Comparisons indicate that, by applying the method presented, the hardware requirements (i.e., complexity and area) for QCA n-bit counter can be greatly reduced.
Abstract: In this paper, we present a new method for the design of an n-bit synchronous binary up counter in quantum-dot cellular automata (QCA). This method is based on the JK flip-flop which almost always produces the simplest combinational logic in traditional sequential circuits. We implement a new QCA architecture for the JK flip-flop. Compared to the existing QCA JK flip-flop, the majority gate count, cell count, clock cycle count, and area of our QCA JK flip-flop are reduced by 57.1%, 88.1%, 55.6%, and 92.0%, respectively. Based on our QCA JK flip-flop, a method of extending state cells is proposed to design the QCA layout of the n-bit counter, such that all the clock cycle counts between any two state cells become 1. This feature ensures that one count takes just one clock cycle; in existing methods, however, one count needs to take n−1 clock cycles. Comparisons indicate that, by applying our method, the hardware requirements (i.e., complexity and area) for QCA n-bit counter can be greatly reduced.
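
The counter described above builds on the standard JK flip-flop. The sketch below is a purely behavioral model of a conventional n-bit synchronous up counter, in which every flip-flop has J = K tied to the AND of all lower-order outputs. It does not model the QCA layout, clock zones, or the state-cell extension from the paper, and the function names are illustrative assumptions.

```python
def jk_next(q, j, k):
    """Next state of a JK flip-flop: Q+ = J*(not Q) + (not K)*Q."""
    return (j & (1 - q)) | ((1 - k) & q)


def counter_step(state):
    """Advance an n-bit synchronous up counter; state[0] is the LSB."""
    nxt = []
    enable = 1  # J = K = AND of all lower-order bits (1 for the LSB)
    for q in state:
        nxt.append(jk_next(q, enable, enable))
        enable &= q  # uses the current (pre-clock) value of this bit
    return nxt


def to_int(state):
    """Interpret the bit list as an unsigned integer, LSB first."""
    return sum(bit << i for i, bit in enumerate(state))


# Clock a 3-bit counter nine times, recording the value before each edge.
state = [0, 0, 0]
sequence = []
for _ in range(9):
    sequence.append(to_int(state))
    state = counter_step(state)
```

In this model every bit is updated from the same pre-clock state, so one count takes exactly one step, which is the synchronous behavior the abstract aims to reach in QCA.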

Proceedings ArticleDOI
19 Nov 2010
TL;DR: It is shown that the HNG, TSG and MKG gates proposed for designing a component of the multiplier circuit (full adder) are neither unique nor special, and many such gates may be proposed which can also perform all Boolean operations.
Abstract: Multiplier circuits play an important role in reversible computation, which is helpful in diverse areas such as low power CMOS design, optical computing, DNA computing and bioinformatics. We have proposed a reversible multiplier circuit design in the NCT gate library which is based on generating all partial products in one step and then summing these partial products using a binary tree network. The proposed reversible multiplier design has two components, which are a reversible partial product generation circuit and a reversible parallel adder circuit. Our design has a minimum number of garbage bits, gate count, and quantum cost. We have shown that the HNG, TSG and MKG gates proposed for designing a component of the multiplier circuit (full adder) are neither unique nor special, and many such gates may be proposed which can also perform all Boolean operations. As an example, three such new gates have been presented here.

Journal ArticleDOI
TL;DR: A classification scheme is presented which allows the hardware implementation of the fractal coder based on binary classification of domain and range blocks which increases the processing speed and reduces the power consumption while the qualities of the reconstructed images are comparable with those of the available software techniques.

Proceedings ArticleDOI
01 Sep 2010
TL;DR: Generic silicon area, iteration period, and energy cost models of high-throughput LDPC decoders are derived and can be used for a fair benchmarking of the implemented decoder.
Abstract: System specification of SoCs needs to be supported by quantitative cost models to avoid wrong decisions in this early design phase. For less complex logic structures, such as FIR filters, such generic cost models can be derived easily because they are based on a simple gate count. For LDPC decoders the influence of the global interconnect between the two basic components of such a decoder complicates the derivation of general cost models. This might be the reason why no accurate cost models are known from the literature yet. In this paper generic silicon area, iteration period, and energy cost models of high-throughput LDPC decoders are derived. Those models not only allow for a decoding-performance vs. hardware-cost trade-off analysis during system specification but can also be used later on to choose a suitable architecture for a certain specification. Finally these models can be used for a fair benchmarking of the implemented decoder.

Journal ArticleDOI
TL;DR: Results of benchmark experiments and comparison with similar studies demonstrate the efficiency of the proposed evolutionary approach, which reduces the gate count of a built-in self-test structure by concurrent optimization of multiple parameters that influence the final solution.
Abstract: The paper describes an approach for the generation of a deterministic test pattern generator logic, which is composed of D-type and T-type flip-flops. This approach employs a genetic algorithm that searches for an acceptable practical solution in a large space of possible implementations. In contrast to conventional approaches the proposed one reduces the gate count of a built-in self-test structure by concurrent optimization of multiple parameters that influence the final solution. The optimization includes the search for: the optimal combination of register cells type; the presence of inverters at inputs and outputs; the test patterns order in the generated test sequence; and the bit order of test patterns. Results of benchmark experiments and comparison with similar studies demonstrate the efficiency of the proposed evolutionary approach.

Posted Content
TL;DR: In this paper, a generalized k*k reversible gate family is proposed and a 3*3 gate of the family is discussed, which can be realized by Inverter, AND, OR, NAND, NOR, and EXOR gates.
Abstract: Reversible circuits have applications in digital signal processing, computer graphics, quantum computation and cryptography. In this paper, a generalized k*k reversible gate family is proposed and a 3*3 gate of the family is discussed. Inverter, AND, OR, NAND, NOR, and EXOR gates can be realized by this gate. Implementation of a full-adder circuit using two such 3*3 gates is given. This full-adder circuit contains only two reversible gates and produces no extra garbage outputs. The proposed full-adder circuit is efficient in terms of gate count, garbage outputs and quantum cost. A 4-bit carry skip adder is designed using this full-adder circuit and a variable block carry skip adder is discussed. The equations required to evaluate these adders are presented.

Proceedings ArticleDOI
Junbao Liu, Shuai Wang, Yi Li, Jun Han, Xiaoyang Zeng
13 Dec 2010
TL;DR: A novel Gabor filter hardware scheme for the fingerprint image enhancement is presented that uses accurate local frequency and orientation to generate the corresponding convolution kernel and thus achieve a better enhancement effect.
Abstract: In this paper a novel Gabor filter hardware scheme for the fingerprint image enhancement is presented. For each pixel of the image, we use accurate local frequency and orientation to generate the corresponding convolution kernel and thus achieve a better enhancement effect. And compared to the previous works, our design yields a higher throughput which is due to the pipeline techniques. Moreover the proposed design can be reconfigured to fulfill the different requirements. Evaluation results demonstrate that, when convolution kernel size is 11×11, our design can achieve 2MPixels/s @ 250MHz, and equivalent gate count is 63.8k at SMIC 0.13um worst process corner. Indeed, it's very suitable for the embedded fingerprint recognition system.

Proceedings ArticleDOI
05 May 2010
TL;DR: An area-efficient FFT processor is proposed for MIMO-OFDM based SDR systems by reducing the required number of nontrivial multipliers with mixed-radix (MR) and multi-path delay commutator (MDC) architecture.
Abstract: In this paper, an area-efficient FFT processor is proposed for MIMO-OFDM based SDR systems. The proposed scalable FFT processor can support the variable length of 64, 128, 512, 1024 and 2048. By reducing the required number of nontrivial multipliers with mixed-radix (MR) and multi-path delay commutator (MDC) architecture, the complexity of the proposed FFT processor is dramatically decreased. The proposed FFT processor was designed in hardware description language (HDL) and synthesized to gate-level circuits using 0.13um CMOS standard cell library. With the proposed architecture, the gate count for the processor is 46K and the size of memory is 90Kbits, which are reduced by 59% and 39%, respectively, compared with those of the 4-channel radix-2 single-path delay feedback (R2SDF) FFT processor. Also, compared with 4-channel radix-2 MDC (R2MDC) FFT processor, it is confirmed that the gate count and memory size are reduced by 16.4% and 26.8%, respectively.

Journal ArticleDOI
TL;DR: In this paper, a new remotely reconfigurable gate array architecture with four configuration contexts that enable remote reconfiguration using optical fiber networks is presented, and the availability of this architecture and plans based on the experimental results are discussed.

Proceedings ArticleDOI
23 Dec 2010
TL;DR: An area-efficient FFT processor is proposed for IEEE 802.16m mobile WiMAX systems and can support the variable length of 512, 1024, 2048 and 4096 by reducing the required number of non-trivial multipliers with mixed-radix (MR) and multi-path delay commutator (MDC) architecture.
Abstract: In this paper, an area-efficient FFT processor is proposed for IEEE 802.16m mobile WiMAX systems. The proposed scalable FFT processor can support the variable length of 512, 1024, 2048 and 4096. By reducing the required number of non-trivial multipliers with mixed-radix (MR) and multi-path delay commutator (MDC) architecture, the complexity of the proposed FFT processor is dramatically decreased without sacrificing system throughput. The proposed FFT processor was designed in hardware description language (HDL) and synthesized to gate-level circuits using 0.18um CMOS standard cell library. With the proposed architecture, the gate count for the processor is 49K and the size of memory is 96Kbits, which are reduced by 12% and 26%, respectively, compared with those of the 4-channel radix-2 MDC (R2MDC) FFT processor.

Proceedings ArticleDOI
17 Sep 2010
TL;DR: The design and VLSI implementation of a highly efficient Context-based Adaptive Variable Length Coding (CAVLC) encoder is proposed, which adopts a modified VLC look-up table technique and parallel processing and is suitable for real-time video applications.
Abstract: This paper proposes the design and VLSI implementation of a highly efficient Context-based Adaptive Variable Length Coding (CAVLC) encoder which adopts a modified Variable Length Coding (VLC) look-up table technique and parallel processing. The proposed CAVLC encoder uses upper and lower buffers as input buffers to perform zigzag scanning with both-way ordering. Because of this, the proposed CAVLC encoder can read and write concurrently. Moreover, we design the CAVLC encoding procedure with parallel processing, which uses two generators for information signals and control signals to operate CAVLC modules such as a coeff_token (TotalCoeff and TrailingOnes) module, a level module, a total_zeros module, and a run_before module. The proposed CAVLC encoder is prototyped in Verilog-HDL, implemented and synthesized with MagnaChip 0.18 µm CMOS technology. The synthesis result shows that the gate count is about 12K with a clock constraint of 140 MHz. The proposed CAVLC encoder is suitable for real-time video applications.

Proceedings ArticleDOI
01 Feb 2010
TL;DR: A comparative analysis of nine hardware architecture alternatives for the SAD calculation processing unit, varying the parallelism level (4, 8 and 16 samples in parallel) and the number of pipeline stages, shows that versions with fewer pipeline stages achieved lower energy consumption and higher throughput when compared with deeper pipeline versions.
Abstract: Sum of Absolute Difference (SAD) is a low complexity distortion metric widely employed in the mode decision stage of real-time video encoders. In H.264/AVC encoding, the state-of-the-art video coding standard, motion estimation accounts for most of the computational complexity, most of it coming from the SAD calculation for all the candidate blocks. Considering an H.264/AVC motion estimation hardware architecture [5], SAD calculation accounts for 79% of total gate count. Therefore, focusing on SAD hardware design space exploration can result in important area, power and performance improvement of H.264/AVC ASIC video encoders, but such exploration was not investigated in previous works. To address this, this work first presents a comparative analysis of nine hardware architecture alternatives for the SAD calculation processing unit, varying the parallelism level (4, 8 and 16 samples in parallel) and the number of pipeline stages. The comparison is presented in terms of total gate count, processing cycles, throughput, power and energy consumption. Results show that versions with fewer pipeline stages achieved lower energy consumption and higher throughput (operating at a restricted clock frequency) when compared with deeper pipeline versions. These analyses are useful to select the best SAD architectural alternative to fit each application requirement, from low-power mobile to high resolution H.264/AVC on-chip video encoder.
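
For readers unfamiliar with the metric, the sketch below computes the SAD between a current block and each candidate block and selects the candidate with the lowest value. It is a plain software reference model with hypothetical sample data and function names; it does not reflect the parallelism levels or pipeline stages compared in the paper.

```python
def sad(current, candidate):
    """Sum of absolute differences between two equally sized blocks."""
    if len(current) != len(candidate):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for row_cur, row_cand in zip(current, candidate):
        if len(row_cur) != len(row_cand):
            raise ValueError("rows must have the same length")
        total += sum(abs(a - b) for a, b in zip(row_cur, row_cand))
    return total


def best_candidate(current, candidates):
    """Index of the candidate block with the lowest SAD."""
    return min(
        range(len(candidates)),
        key=lambda i: sad(current, candidates[i]),
    )


# Hypothetical 2x2 blocks of pixel samples.
current = [[10, 12], [8, 9]]
candidates = [
    [[0, 0], [0, 0]],
    [[10, 11], [9, 9]],
    [[20, 20], [20, 20]],
]
scores = [sad(current, c) for c in candidates]
best = best_candidate(current, candidates)
```

Each sample contributes one subtraction, one absolute value, and one accumulation, which is why the number of samples processed in parallel translates so directly into gate count.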

Proceedings ArticleDOI
04 Nov 2010
TL;DR: This tutorial outlines the development of an ASIC for a CP-QPSK transponder and outlines the pitfalls and challenges.
Abstract: Optical communications systems at 10, 40 and 100G now use digital signal processing to enhance the signal tolerance against channel impairments such as chromatic dispersion and PMD. The power of digital signal processing is driven by developments in CMOS integration, which push the limits of gate count, feature sizes and development budgets. But what steps are involved in developing a full custom ASIC? What are the pitfalls and challenges? This tutorial outlines the development of an ASIC for a CP-QPSK transponder.

Journal ArticleDOI
TL;DR: Analytical results show that the proposed architecture has the smallest critical path delay, latency, and area-time complexity in comparison with similar studies.
Abstract: This paper presents a high-speed, low-complexity VLSI architecture based on the modified Euclidean (ME) algorithm for Reed-Solomon decoders. The low-complexity feature of the proposed architecture is obtained by reformulating the error locator and error evaluator polynomials to remove redundant information in the ME algorithm proposed by Truong. This increases the hardware utilization of the processing elements used to solve the key equation and reduces hardware by 30.4%. The proposed architecture retains the high-speed feature of Truong's ME algorithm with a reduced latency, achieved by changing the initial settings of the design. Analytical results show that the proposed architecture has the smallest critical path delay, latency, and area-time complexity in comparison with similar studies. An example RS(255, 239) decoder design, implemented using the TSMC 0.18µm process, can reach a throughput rate of 3Gbps at an operating frequency of 375MHz and with a total gate count of 27,271.

Proceedings ArticleDOI
01 Nov 2010
TL;DR: An area efficient, low energy, high speed pipelined architecture for a Reed-Solomon decoder based on Decomposed Inversionless Berlekamp-Massey Algorithm, where the error locator and evaluator polynomial can be computed serially.
Abstract: This paper proposes an area efficient, low energy, high speed pipelined architecture for a Reed-Solomon decoder based on Decomposed Inversionless Berlekamp-Massey Algorithm, where the error locator and evaluator polynomial can be computed serially. In the proposed architecture, a new scheduling of t Finite Field Multipliers (FFMs) is used to calculate the error locator and evaluator polynomials to achieve a good balance between area, latency, and throughput. This architecture is tested in two different decoders. The first one is a pipelined two parallel decoder, as two parallel syndrome and two parallel Chien search are used. The second one is a conventional pipelined decoder, as conventional syndrome and Chien search are used. Both decoders have been implemented using 0.13µm CMOS IBM standard cells. The two parallel RS(255, 239) decoder has a gate count of 37.6K and an area of 1.18 mm². Simulation results show this approach can work successfully at a data rate of 7.4Gbps with a power dissipation of 50mW. The conventional RS(255, 239) decoder has a gate count of 30.7K and an area of 0.99 mm². Simulation results show this approach can work successfully at a data rate of 4.85Gbps with a power dissipation of 29.28mW.

Proceedings ArticleDOI
TL;DR: A modified sub-pipelined structure is proposed targeting high speed and low power-delay product of the compact AES design with on-the-fly key expansion unit, by adding 25.8% in hardware complexity to the existing ASIC designs.
Abstract: In this paper, an efficient hardware implementation of one of the most popular encryption algorithms, the Advanced Encryption Standard (AES), is presented. A modified sub-pipelined structure is proposed targeting high speed and low power-delay product of the compact AES design with on-the-fly key expansion unit. By adding 25.8% in hardware complexity to the existing ASIC designs, the throughput is increased by more than 158% with better overall power-delay product. Compared to other compact AES implementations, the proposed structure can go up to 6Gbit/sec with about 13k gate count.

Proceedings ArticleDOI
03 Aug 2010
TL;DR: A high throughput context-based adaptive binary arithmetic coding (CABAC) decoding design with hybrid memory architecture for H.264/AVC is presented and an efficient mathematical transform method is proposed to further decrease the critical path of two-symbol binary arithmetic decoding procedure.
Abstract: A high throughput context-based adaptive binary arithmetic coding (CABAC) decoding design with hybrid memory architecture for H.264/AVC is presented in this paper. To accelerate the decoding speed with hardware cost consideration, a new hybrid memory two-symbol parallel decoding technique is proposed. In addition, an efficient mathematical transform method is also proposed to further decrease the critical path of two-symbol binary arithmetic decoding procedure. The proposed architecture is implemented by UMC 90nm technology and experimental results show that our proposal can operate at 264 MHz with 42.37k gate count, and the throughput is 483.1 Mbins/sec, which surpasses previous design with 48.6% hardware cost saving.

Journal ArticleDOI
Xiao Peng, Xiongxin Zhao, Zhixiang Chen, Fumiaki Maehara, Satoshi Goto
TL;DR: The generic permutation network (GPN) for the reconfigurable QC-LDPC decoder could break through the input number restriction, such as power of 2 and other limited number, and optimize the network for any application in demand.
Abstract: Permutation network plays an important role in the reconfigurable QC-LDPC decoder for most modern wireless communication systems with multiple code rates and various code lengths. This paper presents the generic permutation network (GPN) for the reconfigurable QC-LDPC decoder. Compared with conventional permutation networks, this proposal could break through the input number restriction, such as power of 2 and other limited number, and optimize the network for any application in demand. Moreover, the proposed scheme could greatly reduce the latency because of fewer stages and an efficient control signal generating algorithm. In addition, the proposed network possesses the nature of high parallelism which could enable several groups of data to be cyclically shifted simultaneously. The synthesis results using the 90 nm technology demonstrate that this architecture can be implemented with the gate count of 18.3k for WiMAX standard at the frequency of 600 MHz and 10.9k for WiFi standard at the frequency of 800 MHz.
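
The operation such a permutation network performs, a cyclic shift steered by per-stage control signals, can be sketched in software. The model below is a plain logarithmic barrel rotator, not the GPN or VBN itself: the abstracts do not describe those stage structures, so the staging scheme and the function name are illustrative assumptions.

```python
def cyclic_shift_staged(data, shift):
    """Rotate `data` left by `shift` using fixed-distance stages.

    Stage k rotates by 2**k positions when bit k of the shift amount
    is set, so the per-stage control signals are the bits of `shift`.
    """
    n = len(data)
    if n == 0:
        return []
    shift %= n
    out = list(data)
    stage = 0
    while (1 << stage) < n:
        if (shift >> stage) & 1:
            d = 1 << stage
            out = out[d:] + out[:d]
        stage += 1
    return out


# A block of 7 values, deliberately not a power of two in length.
block = list(range(7))
rotated = cyclic_shift_staged(block, 3)
```

In this software model the input length need not be a power of two, because successive rotations simply add modulo the length. Achieving the same flexibility in hardware with few stages is the problem the abstract addresses.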