
Showing papers on "Gate count published in 2013"


Journal ArticleDOI
Taesang Cho, Hanho Lee
TL;DR: A novel modified radix-2^5 FFT algorithm that reduces the hardware complexity is proposed, which can reduce the number of complex multiplications and the size of the twiddle factor memory.
Abstract: This paper presents a high-speed low-complexity modified radix-2^5 512-point fast Fourier transform (FFT) processor using an eight data-path pipelined approach for high rate wireless personal area network applications. A novel modified radix-2^5 FFT algorithm that reduces the hardware complexity is proposed. This method can reduce the number of complex multiplications and the size of the twiddle factor memory. It also uses a complex constant multiplier instead of a complex Booth multiplier. The proposed FFT processor achieves a signal-to-quantization noise ratio of 35 dB at 12-bit internal word length. The proposed processor has been designed and implemented using 90-nm CMOS technology with a supply voltage of 1.2 V. The results demonstrate that the total gate count of the proposed FFT processor is 290 K. Furthermore, the highest throughput rate is up to 2.5 GS/s at 310 MHz while requiring much less hardware complexity.
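The quoted peak throughput can be sanity-checked with simple arithmetic. The sketch below assumes each of the eight data paths delivers one sample per clock cycle, which the abstract does not state explicitly; the helper name `throughput_gsps` is hypothetical.

```python
def throughput_gsps(data_paths: int, clock_mhz: float) -> float:
    """Samples per second (in GS/s), assuming one sample per path per cycle."""
    return data_paths * clock_mhz / 1000.0


# Eight data paths at 310 MHz give about 2.48 GS/s,
# which is consistent with the "up to 2.5 GS/s" figure in the abstract.
rate = throughput_gsps(8, 310.0)
```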

63 citations


Patent
06 Feb 2013
TL;DR: In this article, a low-density One-Time Programmable (OTP) memory is disclosed to achieve low gate count and low overhead in the peripheral circuits to save the cost.
Abstract: A low density One-Time Programmable (OTP) memory is disclosed to achieve low gate count and low overhead in the peripheral circuits to save the cost. A maximum-length Linear Feedback Shift Register (LFSR) can be used to generate 2^n − 1 address spaces from an n-bit address. The registers used in the address generator can have two latches. Each latch has two cross-coupled inverters with two outputs coupled to the drains of two MOS input devices, respectively. The inputs of the latch are coupled to the gates of the MOS input devices, respectively. The sources of the MOS input devices are coupled to the drains of at least one MOS device(s), whose gate(s) are coupled to a clock signal and whose source(s) are coupled to a supply voltage. The two latches can be constructed in serial with the outputs of the first latch coupled to the inputs of the second latch.
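To illustrate why a maximum-length LFSR can stand in for a binary address counter, here is a minimal software sketch of a 4-bit Fibonacci LFSR. The tap positions (4, 3), the seed, and the function name `lfsr_states` are assumptions chosen for illustration; they are not taken from the patent.

```python
def lfsr_states(n_bits, taps, seed=1):
    """Return the states visited by a left-shifting Fibonacci LFSR.

    taps are 1-indexed bit positions XORed together to form the new
    low bit. The highest tap must equal n_bits so the update is
    invertible and the sequence returns to the seed.
    """
    mask = (1 << n_bits) - 1
    state = seed
    states = []
    while True:
        states.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
        if state == seed:
            return states


# With taps (4, 3) the register steps through every non-zero 4-bit
# value exactly once, i.e. 2**4 - 1 = 15 distinct addresses.
addresses = lfsr_states(4, (4, 3))
```

The all-zero state is never reached, which is why an n-bit LFSR yields 2^n − 1 addresses rather than 2^n.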

41 citations


Journal ArticleDOI
TL;DR: A semi-pipelined semi-iterative architecture is presented for the QRD core, that uses innovative design ideas to develop 2-D, Householder 3-D and 4-D/2-D configurable CORDIC processors, such that they can perform the maximum possible number of vectoring and rotation operations within the given number of cycles, while minimizing gate count and maximizing the resource utilization.
Abstract: This paper presents a hybrid QR decomposition (QRD) design that reduces the number of computations and increases their execution parallelism by using a unique combination of Multi-dimensional Givens rotations, Householder transformations and conventional 2-D Givens rotations. A semi-pipelined semi-iterative architecture is presented for the QRD core, that uses innovative design ideas to develop 2-D, Householder 3-D and 4-D/2-D configurable CORDIC processors, such that they can perform the maximum possible number of vectoring and rotation operations within the given number of cycles, while minimizing gate count and maximizing the resource utilization. Test results for the 0.3 mm² QRD chip, fabricated in 0.13 μm 1P8M CMOS technology, demonstrate that the proposed design for 4×4 complex matrices attains the lowest reported processing latency of 40 clock cycles (144 ns) at 278 MHz and dissipates 48.2 mW at 1.3 V supply and 25°C. It outperforms all of the previously published QRD designs by offering the highest QR processing efficiency.
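For readers unfamiliar with the building block, the conventional 2-D Givens rotation mentioned above can be sketched in a few lines. This is a real-valued floating-point illustration only; the paper targets complex matrices and CORDIC hardware, and the function names here are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s, r) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0
    return a / r, b / r, r


def rotate(c, s, x, y):
    """Apply the rotation to a pair of values (one column of two rows)."""
    return c * x + s * y, -s * x + c * y


# Vectoring step: find the rotation that zeroes the lower entry.
c, s, r = givens(3.0, 4.0)
top, bottom = rotate(c, s, 3.0, 4.0)  # top is the norm 5, bottom is ~0
```

In a QR decomposition the same (c, s) pair is then applied to the remaining columns of the two rows, which corresponds to the "rotation" operations in the abstract.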

39 citations


Proceedings ArticleDOI
19 May 2013
TL;DR: A reconfigurable hardware design which can support the inverse transform size from 4×4 to 32×32 in HEVC (High Efficiency Video Coding) and only needs about 133.8K gate count is presented.
Abstract: In this paper, we present a reconfigurable hardware design which can support the inverse transform size from 4×4 to 32×32 in HEVC (High Efficiency Video Coding). We explore the coefficient properties of various inverse transforms such that a base inverse transform unit can be reconfigured or refined to generate other size of inverse transform. The implementation in 90nm technology can support 3840×2160@30fps processing and only needs about 133.8K gate count, which can save 53% of gate count when compared with previous work.

30 citations


Journal ArticleDOI
TL;DR: The comparative results show that the proposed design of the FPGA is much better in terms of gate count, garbage outputs, quantum cost, delay, and hardware complexity than the existing approaches.

15 citations


Journal ArticleDOI
TL;DR: The improved Direct Digital Synthesizer (DDS) using the Hybrid Wave Pipelining (HWP) technique and COordinate Rotation DIgital Computer (CORDIC) algorithm for Software Defined Radio (SDR) is presented in this paper.
Abstract: The improved Direct Digital Synthesizer (DDS) using the Hybrid Wave Pipelining (HWP) technique and COordinate Rotation DIgital Computer (CORDIC) algorithm for Software Defined Radio (SDR) is presented in this paper. In order to achieve high throughput, the hybrid wave pipelining technique is adopted. The HWP can be used to speed up the circuits without insertion of storage elements. The CORDIC algorithm is used for phase-to-amplitude conversion and utilized for dynamic transformation rather than Read Only Memory (ROM) static addressing. The frequency resolution and phase resolution are achieved as 0.023 Hz and 0.088 degree, respectively, at the maximum operating frequency of 199.288 MHz for the proposed DDS architecture. The spectral purity of the proposed design has been improved to 114 dBc with a throughput of 94 %. This paper is focused on the design and implementation of DDS using hybrid wave pipelining with CORDIC approach to target on Xilinx Spartan 3 (XC3S400-5PQ208) Field Programmable Gate Array (FPGA) with a speed grade of −5. The proposed DDS design reduces the gate count from 49.4 % to 18.2 % as compared to the conventional pipelined Read Only Memory Look Up Table (ROMLUT) DDS method. The throughput of the proposed method has been improved from 78 % to 94 % and 55 % of total power reduction as compared with conventional DDS. The performance of the improved DDS architecture is compared with several existing DDS architectures and it is found that the present design is outperforming and can be used for software defined radios.
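The phase-to-amplitude conversion role that CORDIC plays in a DDS can be illustrated with a floating-point sketch of the rotation-mode iteration. This is a generic textbook form, not the paper's fixed-point wave-pipelined design; the iteration count and the name `cordic_sin_cos` are assumptions.

```python
import math


def cordic_sin_cos(theta, iterations=16):
    """Approximate (sin, cos) of theta using shift-and-add style rotations.

    Valid for angles within the basic CORDIC convergence range
    (roughly |theta| < 1.74 rad); a DDS would fold the phase into
    this range using quadrant symmetry first.
    """
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Pre-scale x so the accumulated CORDIC gain cancels out.
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


sin_val, cos_val = cordic_sin_cos(0.5)
```

In hardware the multiplications by powers of two become wire shifts, so each iteration needs only adders and a small table of arctangent constants instead of a sine ROM.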

14 citations


Journal ArticleDOI
TL;DR: The report shows that the proposed decoder can achieve 4.0 Gbps with 7.1 M gate count to decode a 2048-length R=1/3 turbo code, when the bit error rate (BER) is 10^-5 @ Eb/N0=1.25 dB.
Abstract: In this letter, we propose a high throughput stochastic Low Bits Computation (LBC) turbo decoder. We represent the signal by a 3-bit-wide stochastic stream, which improves the accuracy of stochastic computation significantly. We have designed and synthesized our design based on CMOS 90 nm technology. The report shows that the proposed decoder can achieve 4.0 Gbps with 7.1 M gate count to decode a 2048-length R=1/3 turbo code, when the bit error rate (BER) is 10^-5 @ Eb/N0=1.25 dB.

13 citations


Journal ArticleDOI
TL;DR: A new scheme is proposed for implementing gate operations between remote qubits in linear nearest neighbor (LNN) architectures, one that does not require qubits to be adjacent to each other in order to perform a gate operation between them, using a new two-control, one-target controlled-unitary gate operation, which is referred to as the C2(−I) gate.
Abstract: We propose a new scheme for implementing gate operations between remote qubits in linear nearest neighbor (LNN) architectures, one that does not require qubits to be adjacent to each other in order to perform a gate operation between them. The key feature of our scheme is a new two-control, one-target controlled-unitary gate operation, which we refer to as the C2(−I) gate. The gate operation can be implemented easily in a single step, requiring only a single control parameter of the system Hamiltonian. Using the C2(−I) gate, we show how to implement CNOT gate operations between remote qubits that do not have any direct coupling between them, along an LNN array. Since this is achieved without requiring swap operations or additional ancilla qubits in the circuit, the quantum cost of our circuit can be more than 50 % lower than those using conventional swap methods. All CNOT gate operations between remote qubits can be achieved with fidelity greater than 99.5 %.

12 citations


Journal ArticleDOI
TL;DR: The high-throughput parallel divider in the moving-average engine is a new solution to reduce the computational time of one division operation to a single clock cycle and to calculate cumulative moving averages with no precision loss.
Abstract: A dual-stage hardware architecture that supports two kinds of moving averages for the on-line clustering algorithm is proposed. The architectural design of this work is different from the one of previous works that focus on the iterative clustering algorithm. The system includes a set of memories that operates in ping-pong mode, so that the Manhattan distances can be computed when the centroids are updated. The high-throughput parallel divider in the moving-average engine is a new solution to reduce the computational time of one division operation to a single clock cycle and to calculate cumulative moving averages with no precision loss. Two hardware examples show the robustness of the proposed architecture, and the architectural analysis is performed with the 90 nm CMOS technology. In the first example, the gate count is the smallest and the normalized power consumption of this work is the lowest among previous works. In the second example, the architecture is compared with related works, which implement the Self-Organizing Map (SOM) algorithm. The proposed work has high flexibility for parameter combinations and can achieve high performance for color quantization in a single iteration. The functionalities of the proposed system are also verified with the background subtraction application.
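The on-line clustering idea (nearest centroid by Manhattan distance, then a cumulative moving-average update) can be sketched as follows. This is a software illustration, not the paper's dual-stage hardware; exact rational arithmetic via `Fraction` stands in for the "no precision loss" property, and treating each initial centroid as one observed sample is an assumption.

```python
from fractions import Fraction


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def online_cluster(samples, initial_centroids):
    """Single-pass clustering with cumulative moving-average updates."""
    centroids = [[Fraction(v) for v in c] for c in initial_centroids]
    counts = [1] * len(centroids)  # assumption: seed counts as one sample
    for s in samples:
        k = min(range(len(centroids)), key=lambda i: manhattan(s, centroids[i]))
        counts[k] += 1
        # CMA update: new_mean = old_mean + (x - old_mean) / n
        centroids[k] = [c + (Fraction(x) - c) / counts[k]
                        for c, x in zip(centroids[k], s)]
    return centroids, counts


centroids, counts = online_cluster([(2, 2), (10, 12)], [(0, 0), (10, 10)])
```

Each update needs one division per component, which is why a single-cycle divider matters for throughput in the hardware version.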

9 citations


Proceedings ArticleDOI
01 Aug 2013
TL;DR: It is shown how to extrapolate 3- and 4-variable reversible functions implemented by gate count minimal circuits having regular structure to construct sequences of reversible functions of an arbitrary number of variables.
Abstract: This paper reports on a method of the construction of new difficult benchmarks for reversible logic synthesis. It is shown how to extrapolate 3- and 4-variable reversible functions implemented by gate count minimal circuits having regular structure. In this way sequences of reversible functions of an arbitrary number of variables have been constructed for which we have built minimal circuits implementing them. For two example sequences of functions we applied the synthesis tool Revkit trying different synthesis algorithms. The outcome shows a large gap between the circuits synthesized by the tool and the ones proved minimal by construction.

8 citations


Proceedings ArticleDOI
07 Mar 2013
TL;DR: To evaluate the goodness of a synthesized netlist, various metrics such as gate count, quantum cost, and equivalent transistor cost have been considered by various researchers.
Abstract: Summary form only given, as follows. With the increasing emphasis on low-power design and quantum computation, research activities in the area of reversible logic synthesis and testing have gained momentum over the last couple of decades. It is expected that reversible logic will provide us with a viable alternative to building ultra-low power circuits and systems in the not too distant future. In the classical works for synthesis of reversible circuits, gate libraries comprising standard reversible gates like NOT, CNOT, TOFFOLI, FREDKIN, etc. are considered. To evaluate the goodness of a synthesized netlist, various metrics such as gate count, quantum cost, and equivalent transistor cost have been considered by various researchers. The synthesis approaches that have been reported can be broadly categorized into three groups: (a) exact synthesis approaches which try to obtain optimal reversible gate netlists, but can be used for small circuits only, (b) heuristic based approaches which try to utilize some domain knowledge intelligently to reduce the complexity of search, and can be used for somewhat larger circuits, and (c) synthesis approaches that rely on higher level functional representations like binary decision diagram (BDD) or exclusive sum-of-products (ESOP). The last approach is scalable to larger circuits (with 200 inputs or more), however, the synthesized netlist is not optimal and various rule-based heuristic approaches have been proposed to minimize the cost. There have been works also that report techniques for implementing sequential circuits with reversible properties, which will be useful for building complex systems containing finite-state machines. There are various transformations that are carried out as part of cryptographic algorithms that are inherently reversible in nature.
For instance, any block cipher that uses a key K to transform a plaintext P into a ciphertext C during encryption must be reversible, because decryption will be doing just the reverse (C to P). Also, in standard symmetric block ciphers like DES or AES, there is a combinational block called substitution box or S-box which is also reversible in nature. In AES, the S-box has 8 inputs and 8 outputs, and implements a one-to-one onto mapping. The same reversibility requirements hold for stream ciphers and public-key ciphers like RSA. Although not much work has been carried out in the area of reversible implementations of cryptographic algorithms, this can be a very good area for future research. Similar considerations hold for various coding and decoding techniques used in communication, which are also inherently reversible in nature. Some examples of such coding/decoding are Manchester, Differential Manchester, Bipolar AMI, 4B/5B, 8B/10B, Hamming error correcting code, etc. All these techniques can potentially be implemented using reversible logic circuits. Specific case studies of some of the areas as mentioned will be reported, with synthesis results.
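The notions of a reversible gate netlist and its gate count can be made concrete with a small simulator. The sketch below models NOT, CNOT and TOFFOLI as multiple-control Toffoli gates (zero, one and two controls); the example circuit and all names are hypothetical.

```python
from itertools import product


def apply_gate(bits, controls, target):
    """Flip bits[target] when every control bit is 1 (no controls = NOT)."""
    bits = list(bits)
    if all(bits[c] for c in controls):
        bits[target] ^= 1
    return tuple(bits)


def run(circuit, bits):
    """Apply a cascade of (controls, target) gates from left to right."""
    for controls, target in circuit:
        bits = apply_gate(bits, controls, target)
    return bits


def is_reversible(circuit, n):
    """A circuit is reversible if it maps the 2**n inputs one-to-one."""
    outputs = {run(circuit, b) for b in product((0, 1), repeat=n)}
    return len(outputs) == 2 ** n


# TOFFOLI(0,1 -> 2), CNOT(0 -> 1), NOT(0): gate count is 3.
circuit = [((0, 1), 2), ((0,), 1), ((), 0)]
gate_count = len(circuit)
```

Because each of these gates is its own inverse, running the cascade in reverse order undoes it, which mirrors the encryption and decryption relationship described above.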

Proceedings ArticleDOI
26 May 2013
TL;DR: This paper presents a programmable application specific instruction processor for the Adaptive Loop Filter, and to the authors' best knowledge this is the first programmable solution for ALF on embedded devices.
Abstract: The Adaptive Loop Filter (ALF) is a subjective and objective image quality improving filter in the High Efficiency Video Coding standard (HEVC). The ALF has shown to be computationally complex and its complexity has been reduced during the HEVC development process. In the HEVC TestModel HM-7.0 ALF is a 9×7 cross + 3×3 square shaped filter. This paper presents a programmable application specific instruction processor for the ALF. The proposed processor processes 1920×1080p luminance frames at 30 frames per second, when operated at a clock frequency of 311MHz. Low power consumption and a low gate count make the proposed processor suitable for embedded devices. The processor program code is written in pure C-language, which allows versatile use of the circuit and updates to the filter functionality without modifying the processor design. To the authors' best knowledge this is the first programmable solution for ALF on embedded devices.

Journal ArticleDOI
30 Jun 2013
TL;DR: In this paper, the core transform architecture is implemented with only adders and shifters instead of area-consuming multipliers, which can process from to blocks with common hardware by reusing processing elements.
Abstract: This paper proposes and implements an architecture for the core transform, which is one of the major processes in the HEVC video compression standard. The proposed core transform architecture is implemented with only adders and shifters instead of area-consuming multipliers. Shifters in the proposed core transform architecture are implemented in wires and multiplexers, which significantly reduces chip area. Also, it can process from to blocks with common hardware by reusing processing elements. The core transform architecture, designed in 0.13 μm technology, can process a block with 2-D transform in 130 cycles, and its gate count is 101,015 gates.
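The replacement of multipliers by adders and shifters can be illustrated with constant multiplications by 64, 83 and 36, the coefficients of the HEVC 4-point core transform. The particular shift-and-add decompositions below are illustrative assumptions, not necessarily the ones used in the paper, and the function names are hypothetical.

```python
def mul64(x):
    """64 * x as a single left shift (a pure wiring operation in hardware)."""
    return x << 6


def mul83(x):
    """83 * x = 64x + 16x + 2x + x, using three adders."""
    return (x << 6) + (x << 4) + (x << 1) + x


def mul36(x):
    """36 * x = 32x + 4x, using one adder."""
    return (x << 5) + (x << 2)


# The shift-and-add results match ordinary multiplication.
checks = [(mul64(v), mul83(v), mul36(v)) for v in (-7, 0, 5)]
```

Since a fixed shift costs no logic, the area of each constant multiplier is set by its adder count, which is the motivation for the multiplierless design.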

Journal ArticleDOI
TL;DR: This paper presents a version of minimized S-box with two separate proposals and improvements in the overall gate count, and presents a selective encryption architecture (SEA) which incorporates the CISA as a part of the encryption core, accompanied by the set partitioning in hierarchical trees (SPIHT) algorithm as a complete selective encryption system.
Abstract: The “S-box” algorithm is a key component in the Advanced Encryption Standard (AES) due to its nonlinear property. Various implementation approaches have been researched and discussed meeting stringent application goals (such as low power, high throughput, low area), but the ultimate goal for many researchers is to find a compact and small hardware footprint for the S-box circuit. In this paper, we present our version of minimized S-box with two separate proposals and improvements in the overall gate count. The compact S-box is adopted with a compact and optimum processor architecture specifically tailored for the AES, namely, the compact instruction set architecture (CISA). To further justify and strengthen the purpose of the compact crypto-processor’s application, we have also presented a selective encryption architecture (SEA) which incorporates the CISA as a part of the encryption core, accompanied by the set partitioning in hierarchical trees (SPIHT) algorithm as a complete selective encryption system.

Journal ArticleDOI
Muchen Li, Jinjia Zhou, Dajiang Zhou, Xiao Peng, Satoshi Goto
TL;DR: A novel dual-mode deblocking filter architecture which could support both the HEVC and H.264/AVC standards and saves 30% of the gate count in the filter part compared with dedicated designs is introduced.
Abstract: As the successor to the H.264/AVC video compression standard, High Efficiency Video Codec (HEVC) will play an important role in the video coding area. In the deblocking filter part, HEVC inherits the basic property of H.264/AVC and gives some new features. Based on this variation, this paper introduces a novel dual-mode deblocking filter architecture which could support both the HEVC and H.264/AVC standards. For the HEVC standard, the proposed symmetric unified-cross unit (SUCU) based filtering scheme greatly reduces the design complexity. As a result, processing a 16×16 block needs 24 clock cycles. For the H.264/AVC standard, it takes 48 clock cycles for a 16×16 macro-block (MB). In the synthesis result, the proposed architecture occupies 41.6k equivalent gates at a frequency of 200 MHz in the SMIC 65 nm library, which could satisfy the throughput requirement of super hi-vision (SHV) at 60 fps. With the filter reusing scheme, the universal design for the two standards saves 30% of the gate count in the filter part compared with dedicated designs. In addition, the total power consumption could be reduced by 57.2% with skipping mode when the edges need not be fil

Journal ArticleDOI
TL;DR: An intra prediction hardware architecture is proposed to reduce computational complexity of intra prediction in HEVC decoder and adopts a fast smoothing decision algorithm and a fast algorithm to generate coefficients of a filter.
Abstract: In this paper, an intra prediction hardware architecture is proposed to reduce computational complexity of intra prediction in HEVC decoder. The architecture uses shared operation units and common operation units and adopts a fast smoothing decision algorithm and a fast algorithm to generate coefficients of a filter. The shared operation unit shares adders processing common equations to remove the computational redundancy. The unit computes an average value in DC mode for reducing the number of execution cycles in DC mode. In order to reduce operation units, the common operation unit uses one operation unit generating predicted pixels and filtered pixels in all prediction modes. In order to reduce processing time and operators, the decision algorithm uses only bit-comparators and the fast algorithm uses LUT instead of multiplication operators. The proposed architecture uses four shared operation units and eight common operation units, which can reduce the execution cycles of intra prediction. The architecture is synthesized using TSMC 0.13 μm CMOS technology. The gate count and the maximum operating frequency are 40.5k and 164 MHz, respectively. As the result of measuring the performance of the proposed architecture using the extracted data from HM 7.1, the execution cycle of the architecture is about 93.7% less than the previous design.

Proceedings ArticleDOI
03 Jun 2013
TL;DR: A low power multi-Lane Mobile Industry Processor Interface (MIPI) Camera Serial Interface 2 (CSI-2) receiver architecture is proposed which adopts an 8-Byte parallel CSI protocol layer for hardware implementations and reduces the logic power consumption measured on chip by more than 37%~43%.
Abstract: This paper proposes a low power multi-Lane Mobile Industry Processor Interface (MIPI) Camera Serial Interface 2 (CSI-2) receiver architecture which adopts an 8-Byte parallel CSI protocol layer for hardware implementations. The proposed scheme can work in an environment with 4 data Lanes and 1 Gb/s per data Lane, i.e. with a maximum data rate of 4 Gb/s, at 62.5 MHz, which lengthens the time available for logic operations from 8 ns (125 MHz) to 16 ns (62.5 MHz) without throughput degradation. Therefore, the supply voltage (1.2 V) can be reduced and the power consumption can also be reduced. The proposed architecture is implemented in 0.13 μm CMOS technology and the total gate count is 32.7 K. It not only reduces the operating clock rate but also reduces the logic power consumption measured on chip by more than 37%~43%.
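The claim that halving the clock costs no throughput follows from the wider datapath, as this small arithmetic sketch shows. The 4-byte, 125 MHz baseline is inferred from the 8 ns figure in the abstract and is an assumption; the function names are hypothetical.

```python
def link_rate_gbps(lanes, gbps_per_lane):
    """Aggregate serial rate of the data lanes."""
    return lanes * gbps_per_lane


def datapath_rate_gbps(width_bytes, clock_mhz):
    """Rate a parallel protocol layer can absorb at a given clock."""
    return width_bytes * 8 * clock_mhz / 1000.0


required = link_rate_gbps(4, 1.0)          # 4 lanes at 1 Gb/s each
wide = datapath_rate_gbps(8, 62.5)         # 8-byte layer at 62.5 MHz
narrow = datapath_rate_gbps(4, 125.0)      # assumed 4-byte layer at 125 MHz
```

Both datapaths sustain the required 4 Gb/s, but the 8-byte version has a 16 ns clock period instead of 8 ns, which is the slack the abstract exploits to lower the supply voltage.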

Journal ArticleDOI
TL;DR: A new recursive recoding algorithm is proposed that shortens the critical path of the multiplier and reduces the hardware complexity of partial-product-generators as well and provides an optimal space/time partitioning of the multiplier architecture for any size N of the operands.
Abstract: This paper addresses the problem of multiplication with large operand sizes (N≥32). We propose a new recursive recoding algorithm that shortens the critical path of the multiplier and reduces the hardware complexity of partial-product-generators as well. The new recoding algorithm provides an optimal space/time partitioning of the multiplier architecture for any size N of the operands. As a result, the critical path is drastically reduced to $3\sqrt[3]{N/2} - 3$ with no area overhead in comparison to modified Booth algorithm that shows a critical path of N/2 in adder stages. For instance, only 7 adder stages are needed for a 64-bit two's complement multiplier. Compared with reference algorithms for N=64, important gain ratios of 1.62, 1.71, 2.64 are obtained in terms of multiply-time, energy consumption per multiply-operation, and total gate count, respectively.
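As background for the baseline mentioned above, the modified Booth (radix-4) recoding can be sketched in software. It turns an N-bit two's complement operand into N/2 signed digits in {−2, −1, 0, 1, 2}, which is where the N/2 figure comes from. This sketch shows only the reference recoding, not the paper's recursive algorithm; the function names are hypothetical.

```python
def booth_radix4_digits(value, n_bits):
    """Recode an n_bits two's complement value into n_bits/2 signed digits."""
    assert n_bits % 2 == 0
    u = value & ((1 << n_bits) - 1)  # two's complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, n_bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def from_digits(digits):
    """Rebuild the integer: digit i carries weight 4**i."""
    return sum(d * 4 ** i for i, d in enumerate(digits))


digits = booth_radix4_digits(-37, 8)
```

Each digit selects one partial product (0, ±X or ±2X), so a 64-bit operand yields 32 partial products to be summed.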

Journal ArticleDOI
TL;DR: A rotation-based synthesis framework for reversible logic that constructs intermediate quantum states that may be in superposition and combines techniques from reversible Boolean logic and quantum computation is proposed.
Abstract: A rotation-based synthesis framework for reversible logic is proposed. We develop a canonical representation based on binary decision diagrams and introduce operators to manipulate the developed representation model. Furthermore, a recursive functional bidecomposition approach is proposed to automatically synthesize a given function. While Boolean reversible logic is particularly addressed, our framework constructs intermediate quantum states that may be in superposition, hence we combine techniques from reversible Boolean logic and quantum computation. The proposed approach results in quadratic gate count for multiple-control Toffoli gates without ancillae, linear depth for quantum carry-ripple adder, and quasilinear size for quantum multiplexer.

Book ChapterDOI
18 Jan 2013
TL;DR: In this article, the color image enhancement is achieved by first convolving an original image with a Gaussian kernel since Gaussian distribution is a point spread function which smoothes the image and then logarithm domain processing and gain/offset corrections are employed in order to enhance and translate pixels into the display range of 0 to 255.
Abstract: This paper presents the development of a new algorithm for Gaussian based color image enhancement system. The algorithm has been designed into architecture suitable for FPGA/ASIC implementation. The color image enhancement is achieved by first convolving an original image with a Gaussian kernel since Gaussian distribution is a point spread function which smoothes the image. Further, logarithm-domain processing and gain/offset corrections are employed in order to enhance and translate pixels into the display range of 0 to 255. The proposed algorithm not only provides better dynamic range compression and color rendition effect but also achieves color constancy in an image. The design exploits high degrees of pipelining and parallel processing to achieve real time performance. The design has been realized by RTL compliant Verilog coding and fits into a single FPGA with a gate count utilization of 321,804. The proposed method is implemented using Xilinx Virtex-II Pro XC2VP40-7FF1148 FPGA device and is capable of processing high resolution color motion pictures of sizes of up to 1600×1200 pixels at the real time video rate of 116 frames per second. This shows that the proposed design would work for not only still images but also for high resolution video sequences.

Journal ArticleDOI
TL;DR: This paper presents the FPGA implementation of an LFSR based pseudorandom pattern generator that has the characteristics of high speed, low power consumption and it is especially suited in processing environment where uniform distribution random numbers are required.
Abstract: Strides in programmable logic density, speed and hardware description language (HDL) have empowered the engineer with the ability to implement high-performance digital functionality within a field programmable gate array (FPGA). The linear feedback shift register (LFSR) has become one of the central elements used in testing and self testing of contemporary complex electronic systems like processors, controllers and integrated circuits (ICs). This paper presents the FPGA implementation of an LFSR based pseudorandom pattern generator. This LFSR has the characteristics of high speed and low power consumption, and it is especially suited to processing environments where uniformly distributed random numbers are required. A typical application of the pattern generator considered in this work is the testing of micro-electro-mechanical-systems (MEMS), where low power consumption is required. Very high speed integrated circuit HDL (VHDL) was used to implement the LFSR on FPGA. A testbench in VHDL was used to verify the correctness of the design. The compiled VHDL code was synthesized to gate level. Area and timing optimization were done to achieve a very low gate count of 436 and increase the design speed to 178 MHz. Mentor Graphics and Xilinx ISE 6 electronic design automation (EDA) tool suites and a DIGILENT D2SB PROTO BOARD were used for the overall FPGA implementation process.

Proceedings ArticleDOI
04 Jul 2013
TL;DR: An 8-bit AES direct FPGA hardware implementation of CFB/OFB operations without using the Block RAM (BRAM) is presented, which is the smallest gate count for the 8-bit ASIC implementation ever proposed.
Abstract: This paper presents an 8-bit AES direct FPGA hardware implementation of CFB/OFB operations without using the Block RAM (BRAM). The 8-bit AES core is then embedded through a microcontroller to interface with Bluetooth wireless for performing encryption or decryption. Two sets of the embedded systems are configured together to test AES image encryption and decryption over wireless communication, achieving a baud rate of 0.23 Megabits per second (Mbps). CFB/OFB operations have two advantages over ECB operation; one is the low area circuit design, and the other is the complete hiding of input patterns in plain images with identical colors. Though the CFB/OFB implementation without BRAM has a slightly larger slice area than the implementation with RAM, the non-BRAM design in an ASIC implementation requires only 2.2K gates, synthesized using 0.18 μm technology, which is the smallest gate count for the 8-bit ASIC implementation ever proposed.
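The two advantages attributed to OFB can be seen in a short sketch of the mode itself. The block function below is a SHA-256 based stand-in and NOT AES (the Python standard library has no AES); it is used only to show the data flow, and every name here is hypothetical.

```python
import hashlib

BLOCK = 16  # bytes per block, as in AES


def toy_block_encrypt(key, block):
    """Stand-in for the forward block cipher. NOT AES, illustration only."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ofb_xor(key, iv, data):
    """OFB mode: encrypt the feedback register repeatedly, XOR with data.

    The same routine encrypts and decrypts, so only the forward
    cipher direction is ever needed.
    """
    out = bytearray()
    feedback = iv
    for off in range(0, len(data), BLOCK):
        feedback = toy_block_encrypt(key, feedback)
        chunk = data[off:off + BLOCK]
        out.extend(a ^ b for a, b in zip(chunk, feedback))
    return bytes(out)


key, iv = b"k" * 16, b"\x00" * 16
plain = b"A" * 32  # two identical blocks, like a flat-colored image region
cipher = ofb_xor(key, iv, plain)
recovered = ofb_xor(key, iv, cipher)
```

Because decryption reuses the encryption direction, no inverse cipher hardware is required, and because each block is XORed with a different keystream block, identical plaintext blocks no longer produce identical ciphertext blocks as they would under ECB.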

Proceedings ArticleDOI
01 Sep 2013
TL;DR: A new BCH decoding architecture is presented that combines different parallelization degrees for the Berlekamp-Massey algorithm and the Chien search, which significantly reduces the number of required multipliers.
Abstract: Error correction coding (ECC) has become one of the most important tasks of flash memory controllers. The gate count of ECC hardware is taking up a significant share of the overall SOC logic. Scaling ECC strength to growing error correction requirements has become increasingly difficult when considering cost and area limitations. In this work, a new BCH decoding architecture is presented that combines different parallelization degrees for the Berlekamp-Massey algorithm and the Chien search. This approach significantly reduces the number of required multipliers. Nevertheless, the average decoding speed is equal to that of a fully parallel implementation.
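The Chien search step named above is an exhaustive evaluation of the error locator polynomial over the non-zero field elements, and each evaluation uses Galois-field multipliers, which is why parallelism drives multiplier count. The sketch below works in GF(2^4) with the primitive polynomial x^4 + x + 1, a small field chosen purely for illustration; it reports root exponents directly rather than mapping them to error positions, and all names are hypothetical.

```python
def gf_mul(a, b, poly=0b10011, m=4):
    """Multiply two elements of GF(2^m) modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= poly
    return r


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def chien_search(locator, m=4):
    """Return exponents i for which locator(alpha**i) == 0.

    locator holds coefficients [c0, c1, c2, ...]; alpha is the element 2.
    """
    roots = []
    for i in range(2 ** m - 1):
        x = gf_pow(2, i)
        acc = 0
        for c in reversed(locator):  # Horner evaluation
            acc = gf_mul(acc, x) ^ c
        if acc == 0:
            roots.append(i)
    return roots


# Build (x + alpha^3)(x + alpha^7) and recover its roots.
a3, a7 = gf_pow(2, 3), gf_pow(2, 7)
locator = [gf_mul(a3, a7), a3 ^ a7, 1]
roots = chien_search(locator)
```

A fully parallel hardware search evaluates many candidate positions per clock cycle, multiplying the number of field multipliers accordingly.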

Journal ArticleDOI
01 May 2013
TL;DR: The new proposed SE algorithm, using new initial conditions and polynomials, can significantly reduce the computation complexity compared with the existing ME and reformulated inversionless Berlekamp-Massey (RiBM) algorithms, since it has the least number of coefficients in the new initial conditions.
Abstract: This paper proposes a cost-effective simplified Euclid's (SE) algorithm for Reed-Solomon decoders, which can replace the existing modified Euclid's (ME) algorithm. The new proposed SE algorithm, using new initial conditions and polynomials, can significantly reduce the computation complexity compared with the existing ME and reformulated inversionless Berlekamp-Massey (RiBM) algorithms, since it has the least number of coefficients in the new initial conditions. Thus, the proposed SE architecture, consisting of only 3t basic cells, has the smallest area among the existing key solver blocks, where t means the error correction capability. In addition, the SE architecture requires only the latency of 2t clock cycles to solve the key equation without initial latency. The proposed RS decoder has been synthesized using the 0.18 μm Samsung cell library, and the gate count of the RS decoder, excluding FIFO memory, is only 40,136 for the (255, 239, 8) RS code.
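The cell and latency figures scale with the error correction capability t, so they can be worked out for the (255, 239, 8) code quoted above. The sketch assumes the standard Reed-Solomon relation t = (n − k)/2; the function name and dictionary keys are hypothetical.

```python
def rs_key_solver_budget(n, k):
    """Derive t and the SE key-solver figures quoted in the abstract."""
    t = (n - k) // 2          # correctable symbol errors
    return {
        "t": t,
        "basic_cells": 3 * t,     # "only 3t basic cells"
        "latency_cycles": 2 * t,  # "latency of 2t clock cycles"
    }


budget = rs_key_solver_budget(255, 239)
```

For this code the key solver therefore uses 24 basic cells and needs 16 clock cycles per key equation.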

Proceedings ArticleDOI
29 Apr 2013
TL;DR: An LDPC decoder in 65nm CMOS targeting WPAN (IEEE 802.15.3c) is presented with measurement results and a modified-PCM based message permutation strategy with compatible data flow is proposed to solve the network problem raised by high parallelism LDPC decoding.
Abstract: An LDPC decoder in 65 nm CMOS targeting WPAN (IEEE 802.15.3c) is presented with measurement results. A modified-PCM based message permutation strategy with compatible data flow is proposed to solve the network problem raised by high parallelism LDPC decoding. Compared to the state of the art, the decoder chip achieves 17.7%, 33.5% and 49% improvements in chip density, gate count and energy efficiency, respectively.

Book ChapterDOI
01 Jan 2013
TL;DR: A hardware architecture with a shared operation unit, a common operation unit and a fast smoothing decision algorithm is proposed to reduce the computational complexity of intra prediction in an HEVC decoder; the decision algorithm uses only bit-comparators instead of arithmetic operators.
Abstract: In this paper, a hardware architecture with a shared operation unit, a common operation unit and a fast smoothing decision algorithm is proposed to reduce the computational complexity of intra prediction in an HEVC decoder. The shared operation unit shares the adders that compute common operations in the smoothing equations to remove computational redundancy, and pre-computes the mean value of the reference pixels to remove an idle cycle in DC mode. The common operation unit uses one operation unit to generate and filter predicted pixels in all prediction modes, reducing the number of operation units needed for each mode. The decision algorithm uses only bit-comparators instead of arithmetic operators. The architecture is synthesized using TSMC 0.13 μm CMOS technology. The gate count and the maximum operating frequency of the architecture are 40.5 K and 164 MHz, respectively. The architecture processes one 4 × 4 PU in one cycle, which is about 93.7% fewer cycles than the previous architecture.

01 Jan 2013
TL;DR: A combination table that stores the flip-flops that can be merged is introduced, and the multi-bit flip-flop technique is applied in the countermeasure circuit to reduce power as well as area.
Abstract: Clock power is the major source of dynamic power in VLSI circuits. The multi-bit flip-flop technique is one of the techniques used to reduce clock power. The power reduction is achieved by merging flip-flops subject to certain timing constraints. A combination table that stores the flip-flops that can be merged is introduced in the proposed work. Differential Power Analysis (DPA) is a serious threat to crypto chips since it can efficiently disclose the secret key. A countermeasure circuit based on a self-generated true random sequence can mitigate the differential power attack. The multi-bit flip-flop technique is introduced in the countermeasure circuit to reduce power as well as area. Experimental results show that merging the flip-flops reduces the dynamic power by about 27.27% and the total power by about 12.59%. The total gate count is also reduced from 7709 to 7389. Keywords: merging; combination table; multi-bit flip-flop; differential power analysis; LFSR; countermeasure circuit

Proceedings ArticleDOI
30 Sep 2013
TL;DR: A randomness analysis of the Grain-128 stream cipher algorithm using the NIST Statistical Test Suite is presented, and the algorithm is found not to be random at the 1% significance level.
Abstract: In this work, a randomness analysis of the Grain-128 stream cipher algorithm using the NIST Statistical Test Suite is presented. The NIST Statistical Test Suite is applied to determine the randomness of this algorithm. Grain-128 is based on an LFSR, an NLFSR and a Boolean function, and is suitable for environments with limited resources such as gate count, power consumption and chip area. It uses a 128-bit key and a 96-bit initial value (IV). Based on the results of our analysis, we found that this algorithm is not random at the 1% significance level.
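To make the "1% significance level" criterion concrete, here is a minimal sketch of the frequency (monobit) test, the simplest test in the NIST suite (SP 800-22). It is not the paper's code and covers only one of the suite's tests. The function names and the `alpha` parameter are illustrative assumptions. NIST recommends sequences of at least 100 bits for this test; the 10-bit input in the example is only the short worked example from the NIST document.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test p-value for a sequence of 0/1 values.

    Maps bits to +1/-1, sums them, normalizes by sqrt(n), and converts the
    result to a p-value with the complementary error function.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("bit sequence must not be empty")
    s_n = sum(1 if b else -1 for b in bits)
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def passes_monobit(bits, alpha=0.01):
    """Treat the sequence as consistent with randomness if p-value >= alpha."""
    return monobit_p_value(bits) >= alpha


if __name__ == "__main__":
    sample = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # worked example from NIST SP 800-22
    print(round(monobit_p_value(sample), 6))  # 0.527089
    print(passes_monobit(sample))             # True
```

A keystream would fail this particular test when its p-value falls below 0.01, for example when ones heavily outnumber zeros. The abstract does not say which of the suite's tests Grain-128 failed.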

Journal ArticleDOI
TL;DR: A flexible VLSI architecture for full-search VBSME (FSVBSME), allowing the partitioning of the source frames into sixteen 4x4 sub-blocks and using a MVP scheme, which can offer higher processing speed, lower power consumption, lower latency and lower gate count complexity.

Journal ArticleDOI
TL;DR: The proposed low-overhead space-time block code (STBC) interference canceller, combined with the two-stage channel estimator, can be applied to wireless metropolitan area networks (WMANs) such as IEEE 802.16e systems.
Abstract: This paper proposes a low-overhead space-time block code (STBC) interference canceller for high-mobility STBC-orthogonal frequency division multiplexing (STBC-OFDM) systems. The proposed STBC interference canceller, combined with the two-stage channel estimator, can be applied to wireless metropolitan area networks (WMANs) such as IEEE 802.16e systems. At a vehicle speed of 240 km/h with 16 quadrature amplitude modulation (16 QAM), the bit error rate (BER) can be improved by about 10 times compared with using only the two-stage channel estimator. The proposed design is implemented in 90 nm CMOS technology. The gate count is 109.3 K, and the power dissipation is 1.45 mW at an operating frequency of 83.3 MHz with a 1 V power supply. However, up to 61% of the hardware can be reused from the existing two-stage channel estimator design. After reuse, the proposed STBC interference canceller requires only 42.2 K gates, which is a 4.9% overhead relative to the two-stage channel estimator.
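The reuse figure can be checked against the two gate counts quoted in the abstract. The sketch below is only that arithmetic; the helper name `reused_fraction` is hypothetical, and the inputs are the 109.3 K and 42.2 K values reported above.

```python
def reused_fraction(total_kgates: float, dedicated_kgates: float) -> float:
    """Fraction of a design's gates covered by hardware reused from another block.

    Hypothetical helper: total_kgates is the standalone gate count and
    dedicated_kgates is what remains after reuse (both in thousands of gates).
    """
    if total_kgates <= 0 or not 0 <= dedicated_kgates <= total_kgates:
        raise ValueError("expected 0 <= dedicated_kgates <= total_kgates, total > 0")
    return 1.0 - dedicated_kgates / total_kgates


if __name__ == "__main__":
    frac = reused_fraction(109.3, 42.2)
    print(f"{frac:.1%}")  # about 61.4%, in line with the roughly 61% quoted
```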