
Showing papers on "Gate count" published in 2016


Journal ArticleDOI
TL;DR: A novel system-on-chip (SOC) solution for a portable ultrasound imaging system (PUS) for point-of-care applications that includes all of the signal processing modules and an efficient architecture for hardware-based imaging methods.
Abstract: In this paper, we present a novel system-on-chip (SOC) solution for a portable ultrasound imaging system (PUS) for point-of-care applications. The PUS-SOC includes all of the signal processing modules (i.e., the transmit and dynamic receive beamformer modules, mid- and back-end processors, and color Doppler processors) as well as an efficient architecture for hardware-based imaging methods (e.g., dynamic delay calculation, multi-beamforming, and coded excitation and compression). The PUS-SOC was fabricated using a UMC 130-nm NAND process and has 16.8 GFLOPS of computing power with a total equivalent gate count of 12.1 million, which is comparable to a Pentium-4 CPU. The size and power consumption of the PUS-SOC are $27\times 27~{\rm mm}^{2}$ and 1.2 W, respectively. Based on the PUS-SOC, a prototype hand-held US imaging system was implemented. Phantom experiments demonstrated that the PUS-SOC can provide appropriate image quality for point-of-care applications with a compact PDA size ( $200\times 120\times 45~{\rm mm}^{3}$ ) and 3 hours of battery life.

47 citations


Journal ArticleDOI
TL;DR: A top-down design methodology, which not only goes through code construction and optimization, but also hardware implementation to meet all the critical requirements, is presented, leading to a proposed hybrid storage architecture which has the advantages of better area efficiency and large enough data bandwidth for high decoding throughput.
Abstract: Although the Latin square is a well-known method for constructing low-density parity-check (LDPC) codes with long code length, high code rate, good correcting capability, and a low error floor, it has the drawback of a large submatrix, so a hardware implementation for NAND flash applications suffers from a large barrel shifter and worse routing congestion. In this paper, a top-down design methodology, which not only goes through code construction and optimization, but also hardware implementation to meet all the critical requirements, is presented. A two-step array dispersion algorithm is proposed to construct long LDPC codes with a small submatrix size. Then, the constructed LDPC code is optimized by a masking matrix to obtain better bit-error rate (BER) performance and a lower error floor. In addition, our LDPC codes have a diagonal-like structure in the parity-check matrix, leading to a proposed hybrid storage architecture, which has the advantages of better area efficiency and large enough data bandwidth for high decoding throughput. To be adopted for NAND flash applications, an (18900, 17010) LDPC code with a code rate of 0.9 and submatrix size of 63 is constructed, and the field-programmable gate array simulations show that the error floor is successfully suppressed down to a BER of $10^{-12}$. An LDPC decoder using a normalized min-sum variable-node-centric sequential scheduling decoding algorithm is implemented in a UMC 90-nm CMOS process. The postlayout result shows that the proposed LDPC decoder can achieve a throughput of 1.58 Gb/s at six iterations with a gate count of 520k under a clock frequency of 166.6 MHz. It meets the throughput requirement of both NAND flash memories with Toggle double data rate 1.0 and open NAND flash interface 2.3 NAND interfaces.
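The figures quoted in this abstract can be cross-checked with a little arithmetic. The following is a minimal sketch, not taken from the paper: the variable names are illustrative, the block-row count assumes the parity-check matrix has exactly n - k rows, and whether the quoted throughput counts coded or information bits is not stated in the abstract.

```python
# Parameters quoted in the abstract.
n, k = 18900, 17010          # code length and number of information bits
z = 63                       # submatrix size
throughput_bps = 1.58e9      # post-layout throughput at six iterations
clock_hz = 166.6e6           # clock frequency

code_rate = k / n                          # fraction of information bits
block_cols = n // z                        # submatrix columns in the parity-check matrix
block_rows = (n - k) // z                  # submatrix rows (assumes n - k parity checks)
bits_per_cycle = throughput_bps / clock_hz # average bits decoded per clock cycle

print(f"rate={code_rate:.2f}, blocks={block_rows}x{block_cols}, "
      f"bits/cycle={bits_per_cycle:.2f}")
```

Under these assumptions the code length divides evenly into submatrices of size 63, and the decoder averages a little under ten bits per clock cycle.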

33 citations


Proceedings ArticleDOI
18 May 2016
TL;DR: This paper proposes, for the first time, the inclusion of nearest neighbourhood criteria in a widely used ancilla free reversible logic synthesis method, and shows that this method easily outperforms the earlier two step techniques in terms of gate count without any runtime overhead.
Abstract: The rapid advances of quantum technologies are opening up new challenges, of which, protecting quantum states from errors is a major one. Among quantum error correction schemes, the surface code is emerging as a natural choice with high-fidelity quantum gates reported for experimental platforms. Surface codes also necessitate the quantum gates to be formed with strict nearest neighbour coupling. State-of-the-art reversible logic synthesis techniques for quantum circuit implementation do not ensure the logic gates to be formed in a nearest neighbor fashion, and this is handled as a post processing optimization by the insertion of swap gates. In this paper, we propose, for the first time, the inclusion of nearest neighbourhood criteria in a widely used ancilla free reversible logic synthesis method. Experimental results show that this method easily outperforms the earlier two step techniques in terms of gate count without any runtime overhead.

27 citations


Journal ArticleDOI
TL;DR: The proposed detector employs the signal-vector-based list detection method, but the original method is modified to realize a low-complexity implementation to reduce the hardware complexity while achieving a near-optimal error-rate performance.
Abstract: This brief presents a hardware implementation of a detector for spatial modulation multiple-input multiple-output (SM-MIMO) communication systems. The proposed detector employs the signal-vector-based list detection method, but the original method is modified to realize a low-complexity implementation. In addition, the proposed detector is designed based on the dual-data-path architecture, in which antenna selection and symbol detection are separately performed with different precision levels to reduce the hardware complexity while achieving a near-optimal error-rate performance. The proposed detector is implemented with 87.4-K logic gates in a 0.18- $\mu\text{m}$ CMOS technology, and its throughput is 858 Mb/s for 8 $\times$ 4 64 quadrature amplitude modulation SM-MIMO systems, where the operating frequency is 286 MHz, and the power consumption is 121.3 mW. This manifests that the proposed detector is very efficient with respect to the gate count as well as the energy consumption.
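The efficiency claim in this abstract can be made concrete by deriving per-cycle and per-bit figures from the quoted numbers. This is a rough sketch under one stated assumption that is not in the abstract: conventional spatial modulation with a single active transmit antenna per symbol, so that the antenna index carries log2(8) bits. All variable names are illustrative.

```python
import math

# Figures quoted in the abstract.
n_tx = 8                  # transmit antennas (the "8" in 8 x 4)
qam_order = 64            # 64-QAM
throughput_bps = 858e6    # 858 Mb/s
clock_hz = 286e6          # 286 MHz
power_w = 121.3e-3        # 121.3 mW

# Assumption: one active antenna per symbol, so the antenna index and
# the QAM symbol each contribute log2(.) bits.
bits_per_symbol = int(math.log2(n_tx)) + int(math.log2(qam_order))

bits_per_cycle = throughput_bps / clock_hz            # detected bits per clock
cycles_per_symbol = bits_per_symbol / bits_per_cycle  # clocks per SM symbol
energy_per_bit_pj = power_w / throughput_bps * 1e12   # picojoules per bit

print(bits_per_symbol, bits_per_cycle, cycles_per_symbol,
      round(energy_per_bit_pj, 1))
```

Under that assumption the detector handles three bits per clock cycle, which corresponds to one 9-bit symbol every three cycles, at roughly 141 pJ per detected bit.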

18 citations


Journal ArticleDOI
TL;DR: A high-throughput and multi-parallel VLSI hardware architecture for the deblocking filter in the HEVC video coding standard and an implementation-friendly and fast boundary judgment method are presented to avoid using the original recursion loop approach.
Abstract: This paper presents a high-throughput and multi-parallel VLSI hardware architecture for the deblocking filter in the HEVC video coding standard. First, an implementation-friendly and fast boundary judgment method is proposed to avoid using the original recursion loop approach. Then a dedicated parallel VLSI architecture composed of four parallel filtering cores is presented based on the proposed boundary judgment method. With the parallel luma/chroma filtering and parallel vertical/horizontal edges filtering order, the proposed VLSI architecture can process filtering operations for one largest coding unit (LCU) with less filtering cycles than other conventional approaches. Furthermore, filtering efficiency is improved due to a novel ping-pang buffer architecture and the on-chip single-port SRAM with dedicated data arrangement in the memory modules. Experimental results demonstrate that the proposed deblocking filter architecture improves the performance by 28–89% at the expense of the slightly increased gate count compared to the previously known architecture in HEVC. The proposed architecture can reach a high operating clock frequency of 278 MHz with TSMC 90 nm library and meet the real time requirement of the deblocking filter for 8 K × 4 K video format at 123 frame/s.

17 citations


Journal ArticleDOI
TL;DR: This work is the most compact ECDSA engine with capability for a wide range of curves and different applications and allows it to be implemented on any application specific integrated circuit (ASIC) or FPGA platform with dual-port memory support.
Abstract: Security problems introduced with rapid increase in deployment of Internet-of-Things devices can be overcome only with lightweight cryptographic schemes and modules. A compact prime field (GF(p)) elliptic curve digital signature algorithm (ECDSA) engine suitable for use in such applications is presented. Generic architecture of the engine makes it suitable for other elliptic curve (EC) based schemes (EC Diffie–Hellman key exchange, EC integrated encryption, EC factoring etc.) with slight modifications. The presented engine is composed of a simple microcoded controller and application-specific processing units. It can work with ECs of up to 256 bits, while 160-bit ECDSA signature generation takes 490 K cycles. The engine is implemented as an intellectual property (IP) in a 180 nm process. However, its architecture allows it to be implemented on any application specific integrated circuit (ASIC) or FPGA platform with dual-port memory support. In view of its gate count of 11,366 gate equivalents, the presented work is the most compact ECDSA engine with capability for a wide range of curves and different applications.

14 citations


Journal ArticleDOI
Xiaofeng Huang, Huizhu Jia, Binbin Cai, Chuang Zhu, Jie Liu, Mingyuan Yang, Don Xie, Wen Gao
TL;DR: Several fast algorithms are proposed to remove the data dependency and to reduce the computational complexity, which include source signal based Rough Mode Decision, coarse to fine rough mode search, Prediction Mode Interlaced RDO mode decision, parallelized context adaption and Chroma-free Coding Unit (CU)/Prediction Unit (PU) decision.
Abstract: The emerging intra-coding tools of the High Efficiency Video Coding (HEVC) standard can achieve up to 36% bit-rate reduction compared to H.264/AVC, but with a significant complexity increase. The design challenges, such as data dependency and computational complexity, make it difficult to implement a hardware encoder for real-time applications. In this paper, firstly, the data dependency in HEVC intra-mode decision is fully analyzed, which is caused by the reconstruction loop, the Most Probable Mode, the context adaption during Context-based Adaptive Binary Arithmetic Coding based rate estimation, and the Chroma derived mode. Then, several fast algorithms are proposed to remove the data dependency and to reduce the computational complexity, which include source signal based Rough Mode Decision, coarse to fine rough mode search, Prediction Mode Interlaced RDO mode decision, parallelized context adaption and Chroma-free Coding Unit (CU)/Prediction Unit (PU) decision. Finally, the parallelized VLSI architecture with CU reordering and Chroma reordering scheduling is proposed to improve the throughput. The experimental results demonstrate that the proposed intra-mode decision achieves 41.6% complexity reduction with 4.3% Bjontegaard Delta Rate (BDR) increase on average compared to the reference software, HM-13.0. The intra-mode decision scheme is implemented with 1571.7K gate count in 55 nm CMOS technology. The implementation results show that our design can achieve [email protected] real time processing at 294 MHz operation frequency.

13 citations


Proceedings ArticleDOI
01 May 2016
TL;DR: In this paper, a low complexity image scaling algorithm is proposed which shows significant reduction in hardware cost and energy over existing architectures without significant degradation in quality.
Abstract: Image scaling is one of the widely used techniques in various portable devices to fit the image in their respective displays. Traditional image scaling architectures consume more power and hardware, making them inefficient for use in portable devices. In this paper, a low complexity image scaling algorithm is proposed. In the proposed algorithm, the target pixel is computed either by bilinear interpolation or by replication. The edge catching module in the architecture determines the method of computation which makes the design energy efficient. Further, algebraic manipulation is done and the resulting pipelined architecture shows significant reduction in hardware cost. In order to evaluate the efficacy, the proposed and existing algorithms are implemented in MATLAB and simulated using standard benchmark images. The proposed design is synthesized in Synopsys Design Compiler using 90-nm CMOS process which shows 43.3% reduced gate count and 25.9% reduction in energy over existing architectures without significant degradation in quality.

13 citations


Journal ArticleDOI
Hokyoon Lee, Yoonah Paik, Jaeyung Jun, Youngsun Han, Seon Wook Kim
TL;DR: This paper proposes a new design of SubBytes and MixColumns in AES using constant binary matrix-vector multiplications reduced to AND and XOR operations, and proposes a four-stage pipelined AES architecture to achieve higher throughput.
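To illustrate how a constant binary matrix-vector multiplication reduces to AND and XOR operations, the sketch below expresses multiplication by 2 in GF(2^8), the basic step of MixColumns, as an 8x8 matrix over GF(2). It is only an illustration of the general idea; the matrices and decomposition actually used in the paper are not given here, and the names `xtime`, `COLUMNS` and `matvec_gf2` are chosen for this example.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

# Multiplication by a constant is linear over GF(2), so it is fully described
# by the images of the eight basis vectors: column j is xtime(1 << j).
COLUMNS = [xtime(1 << j) for j in range(8)]

def matvec_gf2(columns, v):
    """Binary matrix-vector product: AND selects columns, XOR accumulates them."""
    acc = 0
    for j, col in enumerate(columns):
        if (v >> j) & 1:   # AND with bit j of the input vector
            acc ^= col     # XOR accumulation over GF(2)
    return acc

# The matrix form agrees with the field multiplication for every byte.
assert all(matvec_gf2(COLUMNS, v) == xtime(v) for v in range(256))
```

In hardware, each output bit of such a product is simply an XOR of the input bits selected by one matrix row, which is why a fixed matrix costs only XOR gates and wiring.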

12 citations


Proceedings ArticleDOI
01 May 2016
TL;DR: A novel design for a Reversible 8-bit ALU is proposed, which has reduced gate count, and transistor count and the propagation delay was found to be significantly lesser than existing designs.
Abstract: Conventional Complementary metal oxide semiconductor circuits (CMOS) dissipate energy in the form of bits of information. This dissipation of energy is in the form of power dissipation and plays a very important role as far as low power design is considered. Today, most digital circuits are being designed using Reversible Logic. Design based on Reversible Logic helps in reducing heat dissipation, allowing nearly energy free computation, allowing higher circuit densities and enabling better testing of faults. In this paper, a novel design for a Reversible 8-bit ALU is proposed. The 8-bit ALU is designed by cascading 1-bit ALUs. The two major units of a 1-bit ALU are the control unit and the adder unit. For the control unit, the Control Output Gate (COG) has been used and for the adder unit the Haghparast and Navi Gate (HNG) has been used. The most significant aspect of this paper is that as compared to other papers, this ALU design has reduced gate count, and transistor count. The propagation delay was found to be significantly lesser at a value of 5.52ns when compared with the value of 8.29ns for an existing design. Simulation and verification of the proposed design was performed using Cadence 180nm technology software tool.
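The adder path of such a cascaded design can be modelled in a few lines. The sketch below uses the commonly cited definition of the HNG gate (P = A, Q = B, R = A xor B xor C, S = ((A xor B) and C) xor (A and B) xor D), which is an assumption here because the abstract does not spell it out. The control unit built from the COG gate is not modelled, and `hng` and `ripple_add` are illustrative names.

```python
from itertools import product

def hng(a, b, c, d):
    """HNG gate as commonly defined: returns (P, Q, R, S)."""
    return a, b, a ^ b ^ c, ((a ^ b) & c) ^ (a & b) ^ d

# Reversibility check: the 16 possible inputs map to 16 distinct outputs.
assert len({hng(*bits) for bits in product((0, 1), repeat=4)}) == 16

def ripple_add(x, y, width=8):
    """Add two integers by cascading one HNG gate per bit.

    With D tied to constant 0, R is the sum bit and S is the carry-out,
    so each gate acts as a reversible full adder.
    """
    carry, total = 0, 0
    for i in range(width):
        _, _, s, carry = hng((x >> i) & 1, (y >> i) & 1, carry, 0)
        total |= s << i
    return total, carry

print(ripple_add(200, 100))  # 8-bit sum and final carry-out
```

The P and Q outputs, which merely copy the inputs, are the garbage outputs of each stage; the constant 0 on D is the constant input.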

12 citations


Journal ArticleDOI
TL;DR: This paper categorizes various ways of implementing 1-bit ALU designs in VHDL using the Xilinx ISE Design Suite 14.2 tool and compares existing 1-bit ALU designs in terms of optimization metrics like power consumption, number of gates, number of constant inputs, number of garbage outputs and quantum cost.
Abstract: There is a tremendous growth in fabrication from small scale integration (SSI) to giant scale integration (GSI). It however raises a question of sustainability of Moore's law due to almost intolerable levels of power consumption. Researchers have invented a lot of methods to reduce power consumption and recent technologies are switching to reversible logic. Reversible logic has various applications in fields of computer graphics, optical information processing, quantum computing, DNA computing, ultra low power CMOS design and communication. The Arithmetic Logic Unit (ALU) is considered to be the basic building block of a CPU in the computing environment, and portability in computing systems highly demands a reversible logic based ALU. Modern processors usually have a word length of 32 or 64 bits. Following a divide-and-conquer approach, n 1-bit ALUs are cascaded to implement an n-bit ALU. Several researchers have proposed 1-bit ALU designs using various reversible logic gates. This paper aims at categorizing various ways of implementation in VHDL using the Xilinx ISE Design Suite 14.2 tool and at a comparative analysis of existing 1-bit ALU designs in terms of optimization metrics like power consumption, number of gates, number of constant inputs, number of garbage outputs and quantum cost. The ALU realized using a carry save adder block is found to be the best design in terms of gate count and quantum cost.

Journal ArticleDOI
13 Apr 2016
TL;DR: A new reversible gate, namely the universal parity preserving gate (UPPG), is introduced to optimise the ALU circuits, and the circuit design focuses on optimising the gate count and quantum cost.
Abstract: In digital circuit design, the primary factors are low power and a high packing density. The reversible logic circuit in the quantum-dot cellular automata (QCA) framework is hoped to be effective in addressing the factor of power consumption at the nanoscale regime. Fault tolerant circuits are suited of interruption of errors at the outputs. This manuscript focuses on the design of a QCA-based ALU and proposes a new parity preserving gate. A new reversible gate, namely the universal parity preserving gate (UPPG), is introduced to optimise the ALU circuits. An algorithm and lemmas are shown for designing the ALU. The ALU generates a number of arithmetic and logical functions with low architectural complexity. Most importantly, the circuit design focuses on optimising the gate count and quantum cost. In addition to optimisation, the workability of the UPPG gate is tested by QCA and the simulation result obtained ensures the correctness of the design.

Posted Content
TL;DR: This paper has proposed a procedure to trace error probability due to noisy gates and decoherence in quantum circuit and place an error correcting block only when the error probability exceeds a certain threshold, which shows a drastic reduction in the required number of error correcting blocks.
Abstract: Descriptions of quantum algorithms, communication protocols, etc. assume the existence of a closed quantum system. However, real life quantum systems are open and are highly sensitive to errors. Hence error correction is of utmost importance if quantum computation is to be carried out in reality. Ideally, an error correction block should be placed after every gate operation in a quantum circuit. This increases the overhead and reduces the speedup of the quantum circuit. Moreover, the error correction blocks themselves may induce errors as the gates used for error correction may be noisy. In this paper, we have proposed a procedure to trace error probability due to noisy gates and decoherence in a quantum circuit and place an error correcting block only when the error probability exceeds a certain threshold. This procedure shows a drastic reduction in the required number of error correcting blocks. Furthermore, we have considered concatenated codes with tile structure layout lattice architecture [25][21],[24] and a SWAP gate based qubit transport mechanism. Tracing errors in higher levels of concatenation shows that, in most cases, after 1 or 2 levels of concatenation, the number of QECC blocks required becomes static. However, since the gate count increases with increasing concatenation, the percentage saving in gate count is considerably high.
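The idea of placing a correction block only when the traced error probability crosses a threshold can be sketched as follows. This is a simplified toy model and not the procedure from the paper: it assumes independent per-gate error probabilities, ignores decoherence and noisy correction blocks, and assumes that a correction block resets the accumulated error to zero. The function name and parameters are hypothetical.

```python
def place_correction_blocks(gate_error_probs, threshold):
    """Return the indices of gates after which a correction block is placed.

    Assumptions (not from the abstract): gate errors are independent, and a
    correction block resets the accumulated error probability to zero.
    """
    placements = []
    p_ok = 1.0                      # probability that no error has occurred so far
    for i, p in enumerate(gate_error_probs):
        p_ok *= 1.0 - p
        if 1.0 - p_ok > threshold:  # accumulated error probability is too high
            placements.append(i)
            p_ok = 1.0              # idealised reset after correction
    return placements

gates = [0.01] * 10                              # ten gates, 1% error each
print(place_correction_blocks(gates, 0.025))     # thresholded placement
print(len(place_correction_blocks(gates, 0.0)))  # baseline: a block after every gate
```

With a threshold of 2.5% this toy circuit needs three correction blocks instead of ten, which illustrates the kind of saving the abstract describes.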

Proceedings ArticleDOI
01 Jul 2016
TL;DR: The proposed architecture is a fully pipelined architecture with optimized adders bit-widths and it was prototyped on TSMC 65 nm CMOS technology and the prototyping results show the high performance of the proposed architecture.
Abstract: HEVC (H.265) standard was proposed as a means to increase the compression rate with no loss in video quality. Large integer DCT, with sizes 16x16 and 32x32, is one of the key new features of the H.265 standard. In this paper, we propose a new scalable architecture for integer DCT in HEVC encoder. The proposed architecture is a fully pipelined architecture with optimized adders bit-widths. It was prototyped on TSMC 65 nm CMOS technology. The prototyping results show the high performance of the proposed architecture. Its gate count is 130K and it can achieve throughput of 9.26 Gsps. The proposed architecture can encode 8K @ 120 fps video sequence with working frequency of 373.25 MHz in real time.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: It is found that the Arch-3 presents the best PFSCL full adder design by incorporating the advantageous features of the other two proposed architectures.
Abstract: In this paper, implementation of full adders in positive feedback source-coupled logic style (PFSCL) is proposed. Three new architectures for PFSCL full adders are put forward. The first architecture is implemented by using conventional NOR based method. The second architecture is based on the use of configurable cell while the last architecture optimizes the structure by using both the conventional NOR and configurable cell based approaches. The functionality of the proposed architectures is verified through simulations by using TSMC 180 nm CMOS technology parameter on Tanner EDA. Their performance is compared in terms of transistor count, gate count, power, delay and power-delay product. It is found that the Arch-3 presents the best PFSCL full adder design by incorporating the advantageous features of the other two proposed architectures.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: The design and performance of Kogge Stone Parallel Prefix Adders implemented using two different design techniques, CMOS and GDI, are described.
Abstract: Adders form a major part of various arithmetic and logical operations. The Parallel Prefix Adder has been established as one of the most essential and efficient circuits for binary addition. Its particular structure and execution performance are very attractive for VLSI implementation. In this paper, we describe the design and performance of Kogge Stone Parallel Prefix Adders implemented using different design techniques: CMOS (Complementary Metal Oxide Semiconductor) and GDI (Gate Diffusion Input). The design and simulation of logic gates is performed on CADENCE Design Suite 6.1.6 using the Virtuoso and ADE environment at GPDK 180nm technology. The performance measures considered for the KSA are delay, gate count/transistor count (area) and power. Simulation studies are done for 4-bit, 8-bit and 16-bit input data.
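For readers unfamiliar with the Kogge Stone structure, the following is a minimal behavioural sketch of the prefix computation, written on whole integers so that each loop iteration corresponds to one prefix stage. It models only the logic, not the CMOS or GDI gate-level implementations compared in the paper, and the function name is illustrative.

```python
def kogge_stone_add(a, b, width=16):
    """Add two unsigned integers with Kogge-Stone generate/propagate prefixes.

    Returns (sum modulo 2**width, carry_out). Carry-in is assumed to be 0.
    """
    mask = (1 << width) - 1
    a, b = a & mask, b & mask
    g = a & b            # per-bit generate
    p = a ^ b            # per-bit propagate
    half_sum = p         # kept for the final sum
    dist = 1
    while dist < width:  # log2(width) prefix stages for power-of-two widths
        g |= p & (g << dist)   # group generate, combining spans 'dist' apart
        p &= p << dist         # group propagate over the combined span
        dist *= 2
    # After the last stage, bit i of g is the carry out of bit position i.
    total = (half_sum ^ (g << 1)) & mask
    carry_out = (g >> (width - 1)) & 1
    return total, carry_out

print(kogge_stone_add(0xFFFF, 1, width=16))
```

The number of stages grows logarithmically with the word length (2, 3 and 4 stages for 4-, 8- and 16-bit inputs), which is the source of the low delay, at the cost of more prefix cells and wiring than a ripple-carry adder.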

Proceedings ArticleDOI
21 Mar 2016
TL;DR: A unified hardware architecture for the 4×4, 8×8, 16×16 and 32×32 inverse 2D core transform (IDCT) in the HEVC standard uses only one 1D transform block and a transpose buffer based on FIFO memory blocks instead of the traditional register array in order to further reduce the memory resources.
Abstract: Most video coding standards use transform algorithms to reduce the size of data characterizing a video signal. The traditional transform matrices as in H.264 are limited to 4×4 and 8×8 sizes. However, the flexibility of the coding structure presented in the next generation video coding standard, High Efficiency Video Coding (HEVC), allows the definition of various sizes of transform matrices, which can vary from 4×4 to 32×32. This paper describes a unified hardware architecture for the 4×4, 8×8, 16×16 and 32×32 inverse 2D core transform (IDCT) in the HEVC standard. It uses only one 1D transform block and a transpose buffer based on FIFO memory blocks instead of the traditional register array in order to further reduce the memory resources. The synthesis results under TSMC 180 nm CMOS technology show that the total gate count of the design is more than 30% improved compared to previous works. However, the operating frequency of the hardware design is about 130 MHz. The design can decode 25 frames per second at Quad HD (3840×2160) resolution.

Book ChapterDOI
10 Aug 2016
TL;DR: The results show that heuristics should be considered as a viable choice for the generation of S-boxes with good implementation properties, and proposed methods to obtain \(4 \times 4\) and \(5\times 5\) S-boxes that are either power or area efficient.
Abstract: With the emergence of the Internet of Things and lightweight cryptography, one can observe a gradual shift of interest in the design of block ciphers. Naturally, security is still of paramount importance, but one is willing to trade a part of that security in order to obtain higher speed and/or smaller implementation area. Accordingly, a common metric in many cipher proposals has been the gate count for realizing the cipher in hardware. On the other side, it is also important, especially for battery powered devices, to have a small energy consumption. That is why we can observe the following shift of research focus: from the analysis of the energy consumption of existing ciphers and their building blocks to the design of new ciphers and building blocks, specifically for low energy. Existing research results focusing on the energy consumption of symmetric ciphers, suggest that the S-box is the most expensive part in the majority of lightweight implementations. If we only consider purely combinatorial S-boxes, we can focus on reducing the power consumption of the S-box in order to minimize the energy consumption of the overall cipher. In this paper, we propose several methods to obtain \(4 \times 4\) and \(5\times 5\) S-boxes that are either power or area efficient. Our results show that heuristics should be considered as a viable choice for the generation of S-boxes with good implementation properties.

Proceedings ArticleDOI
22 May 2016
TL;DR: The results show that the proposed TPC architecture is able to reduce approximately to a half the gate count and the memory size, which makes the new TPC an excellent candidate for practical VLSI implementation in commercial transceivers.
Abstract: We propose a low-complexity implementation architecture for turbo product code (TPC) suitable for next-generation fiber optic networks (e.g. ≳ 100 Gb/s). The proposed code makes use of expurgated Bose-Chaudhuri-Hocquenghem (BCH) codes to improve the performance and reduce implementation complexity. In comparison with existing solutions, our results show that the proposed TPC architecture is able to reduce approximately to a half the gate count and the memory size. This feature makes the new TPC an excellent candidate for practical VLSI implementation in commercial transceivers.

Proceedings ArticleDOI
03 Mar 2016
TL;DR: Results are presented to confirm that the PSO based algorithm is superior to Human Design Method in terms of time, effort and especially the gate count required to design the digital combinational circuits.
Abstract: With increasing complexity of electronic circuits, the design and optimization of electronic circuit needs to be automated with high degree of reliability and accuracy. In order to optimize hardware requirement of digital combinational circuits, evolutionary and innovative techniques need to be enforced at various levels such as gate level and device level. This paper presents the use of one of the evolutionary techniques, i.e., Particle Swarm Optimization (PSO) algorithm. It is motivated by the social behaviour of organisms for the optimal design of combinational logic circuit with a reduced gate count in MATLAB platform. Results are presented to confirm that the PSO based algorithm is superior to Human Design Method in terms of time, effort and especially the gate count required to design the digital combinational circuits. The paper shows that PSO based algorithm converges faster than other algorithms such as genetic algorithm and also reduces the computational complexity.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: The primary objective of this article is to design efficient latches in terms of garbage outputs, quantum cost, and delay.
Abstract: In this article, sequential circuits based on reversible quantum dot cellular automata have been designed. Thus, the proposed designs have the properties of quantum dot cellular automata with the essence of reversible logic. Gate count cannot be a decent cost metric for optimization because each reversible logic gate has its own computational complexity and distinctive type, so it will have a dissimilar delay and quantum cost. The computational complexity of a reversible gate can be represented by its delay and quantum cost. Thus, the primary objective of this article is to design efficient latches in terms of garbage outputs, quantum cost, and delay. Various efficient reversible latches, including the positive level triggered D latch, the negative level triggered D latch and the T latch, have been designed.

Journal ArticleDOI
TL;DR: Results indicate that this simple architecture based on a reconfigurable scheme for integrating several commonly used mathematical operations of speech signal processing can be effectively used in most speech processing applications.
Abstract: This brief presents a fixed-point architecture based on a reconfigurable scheme for integrating several commonly used mathematical operations of speech signal processing. The proposed design can perform two transcendental mathematical operations called logarithm and powering, and three commonly used computations with similar operations named polynomial calculation, filtering, and windowing. By analyzing the adopted algorithms of the above five operations, a simplified computing unit is designed. This unit can combine six types of operations by reconfiguring the data paths, and the same multiply–add architecture can be reused for reducing the redundant usage of logic gates. The experimental results reveal that the proposed design can work at a 200-MHz clock rate, and its gate count only has 11.9k. Compared with the results of the floating-point function, the median errors of the proposed design for computing the powering and logarithmic functions are 0.57% and 0.11%, respectively. Such results indicate that this simple architecture can be effectively used in most speech processing applications.

Journal ArticleDOI
Rui Jia, Rui Chen, Colin Yu Lin, Guo Zhenhong, Haigang Yang
TL;DR: The proposed architecture can be generally used to compute 8×8 DCT of AVS, H.264, VC-1 and HEVC in a low cost way, and can be used to decode Full-HD and WQXGA format video sequences in real time.
Abstract: The growing demand for multimedia applications brings out more and more video standards for improving the coding and compression efficiency. As the most commonly used transform, the discrete cosine transform (DCT) achieves excellent energy compaction and good compression efficiency. Hardware sharing is the most commonly used efficient strategy to reduce the cost of a video codec. Based on traditional matrix factorization, this paper makes three observations to direct the design of the proposed hardware sharing architecture. The proposed architecture can be generally used to compute 8×8 DCT of AVS, H.264, VC-1 and HEVC in a low cost way, and can be used to decode Full-HD and WQXGA format video sequences in real time. The design has been synthesized in 0.13μm technology. The synthesis results show that the proposed architecture achieves 76.9% reduction in gate count, 85.6% decrease in power consumption and 35% improvement in operational speed in comparison with other existing designs.

Proceedings ArticleDOI
01 Apr 2016
TL;DR: Results are presented to support that the PSO based algorithm is better than Human Design Method in respect of time, labour and specially the gate count required to design digital combinational circuit.
Abstract: With the increasing complexity of electronic circuits and to meet the demand of high performance, the design and optimization of electronic circuits need to be automated with a high degree of reliability and accuracy. In order to optimize hardware requirements of digital combinational circuits, evolutionary and innovative techniques need to be enforced at various levels such as at gate level and device level. One such evolutionary technique, the Particle Swarm Optimization (PSO) algorithm, motivated by the social behaviour of organisms, is used for the optimal design of combinational logic circuits with reduced gate count on the MATLAB platform. The PSO technique has been applied to optimize a Full Adder circuit. Results are presented to support that the PSO based algorithm is better than the Human Design Method in respect of time, labour and especially the gate count required to design a digital combinational circuit. The optimized circuit has then been analysed with the Microwind3.1 VLSI CAD Tool. Using the tool, parameters like Area, Power, Delay and Maximum and Average drain current are determined with 90nm, 65nm and 45nm technologies using the BSIM4 Model. The results shown in this paper reflect that technology scaling decreases the area, delay, power consumption and leakage current, which are some of the major requirements of today's VLSI design.

Posted Content
TL;DR: This paper connects the problem of the upper bound of the gate count with the multiplicative complexity analysis of classical Boolean functions and explores whether relaxing the ancilla constraint makes the upper bound tighter.
Abstract: Reversible computation is gaining increasing relevance in the context of several post-CMOS technologies, the most prominent of those being quantum computing. One of the key theoretical problems pertaining to reversible logic synthesis is the upper bound of the gate count. Compared to the known bounds, the gate counts obtained by optimal synthesis methods are significantly smaller. In this paper, we connect this problem with the multiplicative complexity analysis of classical Boolean functions. We explore the possibility of relaxing the ancilla constraint and whether that approach makes the upper bound tighter. Our results are negative. The ancilla-free synthesis methods, one using transformations and one starting from an Exclusive Sum-of-Products (ESOP) formulation, remain, theoretically, the methods that achieve the least gate count when the number of variables $n$ satisfies $n < 8$ and otherwise, respectively.

Proceedings ArticleDOI
L. Martirosyan
01 Oct 2016
TL;DR: This paper presents a quality characteristics estimation methodology for the STAR Memory System (SMS) network, based on linear and polynomial approximation, that enables area- and power-aware SMS network design at early stages of SoC design.
Abstract: For System-on-Chips (SoCs), gate count and power consumption are among the most critical design constraints. This paper presents a quality characteristics estimation methodology for the STAR Memory System (SMS) network. Our proposed methodology is based on linear and polynomial approximation. The obtained approximate functions are embedded in scripts that were developed for automated estimation of gate count and power consumption. The methodology enables area- and power-aware SMS network design at early stages of SoC design.
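
As an illustration of the linear-approximation step described above, the following is a minimal sketch of an ordinary least-squares line fit used as a gate-count estimator. It is not the paper's script: the function names `fit_line` and `estimate`, the choice of "number of memories" as the input variable, and all numeric values are hypothetical placeholders chosen only to make the example run.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(model, x):
    """Evaluate a fitted (slope, intercept) model at x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical placeholder samples (NOT measured data):
# number of memories in the network vs. synthesized gate count.
memories = [1, 2, 3, 4]
gates = [1250, 1500, 1750, 2000]

model = fit_line(memories, gates)
print(estimate(model, 10))  # extrapolated gate count for 10 memories
```

A polynomial approximation would follow the same pattern with higher-order terms in the fit; which variables and orders suit a real SMS network is something only measured synthesis data can settle.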

Book ChapterDOI
01 Jan 2016
TL;DR: The effectiveness of the negative control Toffoli and Peres gates in reducing quantum cost, delay and gate count is explored and the adder performance increases along with area optimization which will make these designs useful in future low power Reversible computing.
Abstract: Reversible logic has recently attracted a lot of research attention in the fields of quantum computation and nanotechnology due to its low power dissipation capability. Adders are one of the basic components of most digital systems, and optimizing them can improve the performance of the entire system. In this work we propose designs of reversible binary and BCD adders: a ripple carry adder and conditional adders for binary addition, and regular and flagged adders for BCD addition. The proposed adder designs are optimized for quantum cost, gate count and delay. The effectiveness of the negative control Toffoli and Peres gates in reducing quantum cost, delay and gate count is explored. As a result, the adder performance increases along with area optimization, which will make these designs useful in future low-power reversible computing.
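
For readers unfamiliar with the gates named above, here is a minimal sketch of the standard (positive-control) Toffoli and Peres gates as bit-level functions, together with the well-known construction of a full adder from two Peres gates. It does not reproduce the chapter's negative-control variants or its proposed adder designs; the function names are illustrative only.

```python
from itertools import product


def toffoli(a, b, c):
    """Toffoli gate: flips c when both controls a and b are 1."""
    return a, b, (a & b) ^ c


def peres(a, b, c):
    """Peres gate: (a, b, c) -> (a, a XOR b, (a AND b) XOR c)."""
    return a, a ^ b, (a & b) ^ c


def is_reversible(gate):
    """A 3-bit gate is reversible iff its 8 inputs map to 8 distinct outputs."""
    outputs = {gate(*bits) for bits in product((0, 1), repeat=3)}
    return len(outputs) == 8


def full_adder(a, b, cin):
    """Full adder built from two Peres gates and one constant-0 input."""
    _, p, g = peres(a, b, 0)         # p = a XOR b, g = a AND b
    _, s, cout = peres(p, cin, g)    # s = p XOR cin, cout = (p AND cin) XOR g
    return s, cout                   # the discarded outputs are garbage lines


for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert s + 2 * cout == a + b + cin
```

In this construction the gate count is 2, with one constant input and two garbage outputs, which are the kinds of figures of merit the abstract refers to.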

Journal ArticleDOI
TL;DR: The proposed dual-clock pipelined architecture can be used for real-time H.264/MPEG-4 AVC processing and achieves a throughput of 7 G and 18.7 G pixels/sec for each block of 4 × 4 and 8 × 8 forward integer transforms, respectively.

Journal ArticleDOI
TL;DR: A novel design of binary coded decimal (BCD) adder/subtractor in reversible logic has been proposed and carry skip (CSK) logic is used for reversible ripple carry adder stages to reduce delay but at the expense of little hardware.
Abstract: In the present era, reversible logic designs play a very critical role in nanotechnology, low power complementary metal-oxide semiconductor (CMOS) designs, optical computing and, especially, in quantum computing. High power dissipation and leakage current in deep submicron technologies are a severe threat in today's applications. As a consequence, the design of datapath elements in reversible logic has gained much importance. In this study, a novel design of a binary coded decimal (BCD) adder/subtractor in reversible logic is proposed. As a further optimization of the proposed reversible decimal design, carry skip (CSK) logic is used for the reversible ripple carry adder stages. This reduces delay at the expense of a little extra hardware. The proposed BCD adder/subtractor and its optimized version are designed using structural VHDL and simulated using ModelSim 6.3f. Performance analysis reveals that the proposed BCD design demonstrates reductions in gate count, garbage outputs and constant inputs of 30.5%, 46% and 28%, respectively, and its optimized version exhibits 19.4%, 32.4% and 16% reductions in gate count, garbage outputs and constant inputs compared to the design in Ref. 14 [V. Rajmohan, V. Renganathan and M. Rajmohan, A novel reversible design of unified single digit BCD adder–subtractor, Int. J. Comput. Theor. Eng. 3 (2011) 697–700].

Journal ArticleDOI
TL;DR: A highly parallel architecture and a pipelined hardware implementation achieving 8×8 Prediction Unit (PU) interpolation in only 30 clock cycles for inter-prediction, which is useful for the motion compensation (MC) module in the HEVC decoder.
Abstract: The fractional sample interpolation process is one of the most computationally intensive parts of a video decoder based on the High Efficiency Video Coding (HEVC) standard. Therefore, in this paper, we propose a high-performance hardware interpolation architecture for inter-prediction, which is useful for the motion compensation (MC) module in the HEVC decoder. For this component, we propose a highly parallel architecture and a pipelined hardware implementation achieving 8×8 Prediction Unit (PU) interpolation in only 30 clock cycles. Experimental results show that our architecture can achieve up to 3.2 pixels/cycle at 125 MHz on field-programmable gate array (FPGA) technology, and the corresponding performance can support the processing of Quad Full High Definition (QFHD, 3840×2160) video at 30 fps. The gate count of the resulting Application-Specific Integrated Circuit (ASIC) implementation in 65 nm technology is 36.7 k.
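
The throughput claim above can be sanity-checked with simple arithmetic. The sketch below compares the stated peak rate (3.2 pixels/cycle at 125 MHz) against the sample rate needed for 3840×2160 at 30 fps. Two things are assumptions not stated in the abstract: that the peak rate is sustained, and that chroma adds 50% more samples (4:2:0 sampling). The function names are illustrative.

```python
def peak_throughput(pixels_per_cycle, clock_hz):
    """Peak samples per second delivered by the interpolator."""
    return pixels_per_cycle * clock_hz


def required_rate(width, height, fps, samples_per_pixel=1.0):
    """Samples per second needed to keep up with the video format."""
    return width * height * fps * samples_per_pixel


peak = peak_throughput(3.2, 125e6)                 # about 400 Msamples/s
luma_only = required_rate(3840, 2160, 30)          # 248,832,000 samples/s
# Assumption: 4:2:0 chroma subsampling, i.e. 1.5 samples per pixel in total.
luma_chroma = required_rate(3840, 2160, 30, 1.5)   # 373,248,000 samples/s

print(f"peak:          {peak / 1e6:.1f} Msamples/s")
print(f"QFHD@30 luma:  {luma_only / 1e6:.1f} Msamples/s")
print(f"QFHD@30 4:2:0: {luma_chroma / 1e6:.1f} Msamples/s")
```

Under these assumptions the peak rate exceeds the requirement in both cases, which is consistent with the abstract's QFHD@30 fps claim; real margins depend on the PU size mix and memory bandwidth, which this estimate ignores.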