Showing papers on "Gate count" published in 2016


Open Access

Journal Article•DOI•

A System-on-Chip Solution for Point-of-Care Ultrasound Imaging Systems: Architecture and ASIC Implementation


Jeeun Kang¹, Changhan Yoon¹, Jaejin Lee¹, Sang-Bum Kye¹, Yongbae Lee¹, Jin Ho Chang¹, Gi-Duck Kim¹, Yangmo Yoo¹, Tai-Kyong Song¹•Institutions (1)

Sogang University¹

01 Apr 2016-IEEE Transactions on Biomedical Circuits and Systems

TL;DR: A novel system-on-chip (SOC) solution for a portable ultrasound imaging system (PUS) for point-of-care applications that includes all of the signal processing modules and an efficient architecture for hardware-based imaging methods.


Abstract: In this paper, we present a novel system-on-chip (SOC) solution for a portable ultrasound imaging system (PUS) for point-of-care applications. The PUS-SOC includes all of the signal processing modules (i.e., the transmit and dynamic receive beamformer modules, mid- and back-end processors, and color Doppler processors) as well as an efficient architecture for hardware-based imaging methods (e.g., dynamic delay calculation, multi-beamforming, and coded excitation and compression). The PUS-SOC was fabricated using a UMC 130-nm NAND process and has 16.8 GFLOPS of computing power with a total equivalent gate count of 12.1 million, which is comparable to a Pentium-4 CPU. The size and power consumption of the PUS-SOC are $27\times 27~{\rm mm}^{2}$ and 1.2 W, respectively. Based on the PUS-SOC, a prototype hand-held US imaging system was implemented. Phantom experiments demonstrated that the PUS-SOC can provide appropriate image quality for point-of-care applications with a compact PDA size ( $200\times 120\times 45~{\rm mm}^{3}$ ) and 3 hours of battery life.


47 citations

Journal Article•DOI•

A 520k (18900, 17010) Array Dispersion LDPC Decoder Architectures for NAND Flash Memory


Kin-Chu Ho¹, Chih-Lung Chen¹, Hsie-Chia Chang¹•Institutions (1)

National Chiao Tung University¹

01 Apr 2016-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: A top-down design methodology, which not only goes through code construction and optimization, but also hardware implementation to meet all the critical requirements, is presented, leading to a proposed hybrid storage architecture which has the advantages of better area efficiency and large enough data bandwidth for high decoding throughput.


Abstract: Although the Latin square is a well-known method for constructing low-density parity-check (LDPC) codes with long code length, high code rate, good correcting capability, and low error floor, its large submatrix is a drawback: in NAND flash applications, the hardware implementation suffers from large barrel shifters and worse routing congestion. In this paper, a top-down design methodology, which not only goes through code construction and optimization, but also hardware implementation to meet all the critical requirements, is presented. A two-step array dispersion algorithm is proposed to construct long LDPC codes with a small submatrix size. Then, the constructed LDPC code is optimized by a masking matrix to obtain better bit-error rate (BER) performance and a lower error floor. In addition, our LDPC codes have a diagonal-like structure in the parity-check matrix, leading to a proposed hybrid storage architecture, which has the advantages of better area efficiency and large enough data bandwidth for high decoding throughput. To be adopted for NAND flash applications, an (18900, 17010) LDPC code with a code rate of 0.9 and submatrix size of 63 is constructed, and the field-programmable gate array simulations show that the error floor is successfully suppressed down to a BER of $10^{-12}$. An LDPC decoder using a normalized min-sum variable-node-centric sequential scheduling decoding algorithm is implemented in a UMC 90-nm CMOS process. The postlayout result shows that the proposed LDPC decoder can achieve a throughput of 1.58 Gb/s at six iterations with a gate count of 520k under a clock frequency of 166.6 MHz. It meets the throughput requirement of NAND flash memories with both Toggle double data rate 1.0 and open NAND flash interface 2.3 interfaces.


33 citations
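
The abstract above names a normalized min-sum decoding algorithm without spelling it out. As a rough illustration only, the textbook check-node update can be sketched as follows; the function name and the normalization factor `alpha=0.75` are assumptions for this sketch, not values from the paper, and the paper's variable-node-centric scheduling is not modelled.

```python
def check_node_update(llrs, alpha=0.75):
    """Normalized min-sum check-node update (textbook form).

    llrs:  incoming variable-to-check messages (log-likelihood ratios).
    alpha: normalization factor; 0.75 is an illustrative assumption.
    Returns one outgoing message per edge, each computed from all the
    *other* incoming messages.
    """
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign          # product of the other signs
        magnitude = min(abs(v) for v in others)
        out.append(sign * alpha * magnitude)
    return out


messages = check_node_update([2.0, -1.0, 4.0])

# Code rate of the (18900, 17010) code quoted in the abstract.
code_rate = 17010 / 18900
```

The code-rate line simply confirms the 0.9 figure quoted in the abstract from the (n, k) pair.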

Proceedings Article•DOI•

Integrated Synthesis of Linear Nearest Neighbor Ancilla-Free MCT Circuits


Md. Mazder Rahman¹, Gerhard W. Dueck¹, Anupam Chattopadhyay², Robert Wille³•Institutions (3)

University of New Brunswick¹, Nanyang Technological University², Johannes Kepler University of Linz³

18 May 2016

TL;DR: This paper proposes, for the first time, the inclusion of nearest neighbourhood criteria in a widely used ancilla free reversible logic synthesis method, and shows that this method easily outperforms the earlier two step techniques in terms of gate count without any runtime overhead.


Abstract: The rapid advances of quantum technologies are opening up new challenges, of which, protecting quantum states from errors is a major one. Among quantum error correction schemes, the surface code is emerging as a natural choice with high-fidelity quantum gates reported for experimental platforms. Surface codes also necessitate the quantum gates to be formed with strict nearest neighbour coupling. State-of-the-art reversible logic synthesis techniques for quantum circuit implementation do not ensure the logic gates to be formed in a nearest neighbor fashion, and this is handled as a post processing optimization by the insertion of swap gates. In this paper, we propose, for the first time, the inclusion of nearest neighbourhood criteria in a widely used ancilla free reversible logic synthesis method. Experimental results show that this method easily outperforms the earlier two step techniques in terms of gate count without any runtime overhead.


27 citations
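
To make the cost of the two-step baseline mentioned in the abstract concrete, here is a minimal sketch that counts the SWAP gates a naive post-processing pass would add on a one-dimensional qubit line. It does not implement the paper's integrated synthesis method; the function names and the "swap back after each gate" policy are assumptions of this sketch.

```python
def swaps_for_adjacency(control, target):
    """SWAPs needed to bring two qubits on a line next to each other."""
    return max(abs(control - target) - 1, 0)


def two_step_swap_overhead(gates, restore=True):
    """Total SWAP gates added by naive post-processing.

    gates:   list of (control, target) positions on a 1-D qubit line.
    restore: if True, the qubits are swapped back after each gate
             (a simplifying assumption of this sketch).
    """
    factor = 2 if restore else 1
    return sum(factor * swaps_for_adjacency(c, t) for c, t in gates)


circuit = [(0, 3), (1, 2), (4, 0)]
overhead = two_step_swap_overhead(circuit)
overhead_no_restore = two_step_swap_overhead(circuit, restore=False)
```

Every SWAP counted here is extra gate count, which is the overhead an integrated approach tries to avoid.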

Journal Article•DOI•

Implementation of a Near-Optimal Detector for Spatial Modulation MIMO Systems


Gwang-Ho Lee¹, Tae-Hwan Kim¹•Institutions (1)

Korea Aerospace University¹

29 Feb 2016-IEEE Transactions on Circuits and Systems II: Express Briefs

TL;DR: The proposed detector employs the signal-vector-based list detection method, but the original method is modified to realize a low-complexity implementation to reduce the hardware complexity while achieving a near-optimal error-rate performance.


Abstract: This brief presents a hardware implementation of a detector for spatial modulation multiple-input multiple-output (SM-MIMO) communication systems. The proposed detector employs the signal-vector-based list detection method, but the original method is modified to realize a low-complexity implementation. In addition, the proposed detector is designed based on the dual-data-path architecture, in which antenna selection and symbol detection are separately performed with different precision levels to reduce the hardware complexity while achieving a near-optimal error-rate performance. The proposed detector is implemented with 87.4-K logic gates in a 0.18-$\mu\text{m}$ CMOS technology, and its throughput is 858 Mb/s for $8\times 4$ 64-quadrature amplitude modulation SM-MIMO systems, where the operating frequency is 286 MHz, and the power consumption is 121.3 mW. This shows that the proposed detector is very efficient with respect to the gate count as well as the energy consumption.


18 citations
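
The efficiency claim in the abstract can be checked with a little arithmetic on the quoted figures. The sketch below assumes the usual spatial-modulation convention that one transmit antenna is active per symbol, so a symbol carries log2(antennas) + log2(QAM order) bits; that convention and all variable names are assumptions, while the numbers come from the abstract.

```python
from math import log2

# Figures quoted in the abstract.
n_tx, qam_order = 8, 64        # "8 x 4 64-QAM" configuration
throughput_bps = 858e6
clock_hz = 286e6
power_w = 121.3e-3
gate_count = 87.4e3

# Assumption: one active transmit antenna per SM symbol.
bits_per_symbol = log2(n_tx) + log2(qam_order)

bits_per_cycle = throughput_bps / clock_hz
cycles_per_symbol = bits_per_symbol / bits_per_cycle

energy_per_bit_nj = power_w / throughput_bps * 1e9
mbps_per_kgate = (throughput_bps / 1e6) / (gate_count / 1e3)
```

Under these assumptions the detector delivers 3 bits per clock cycle, or one 9-bit symbol every 3 cycles, at roughly 0.14 nJ per bit.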

Journal Article•DOI•

A High-Throughput and Multi-Parallel VLSI Architecture for HEVC Deblocking Filter


Wei Zhou¹, Jingzhi Zhang¹, Xin Zhou¹, Zhenyu Liu², Xiaoxiang Liu•Institutions (2)

Northwestern Polytechnical University¹, Tsinghua University²

02 Mar 2016-IEEE Transactions on Multimedia

TL;DR: A high-throughput and multi-parallel VLSI hardware architecture for the deblocking filter in the HEVC video coding standard and an implementation-friendly and fast boundary judgment method are presented to avoid using the original recursion loop approach.


Abstract: This paper presents a high-throughput and multi-parallel VLSI hardware architecture for the deblocking filter in the HEVC video coding standard. First, an implementation-friendly and fast boundary judgment method is proposed to avoid using the original recursion loop approach. Then a dedicated parallel VLSI architecture composed of four parallel filtering cores is presented based on the proposed boundary judgment method. With the parallel luma/chroma filtering and parallel vertical/horizontal edges filtering order, the proposed VLSI architecture can process filtering operations for one largest coding unit (LCU) with fewer filtering cycles than other conventional approaches. Furthermore, filtering efficiency is improved due to a novel ping-pong buffer architecture and the on-chip single-port SRAM with dedicated data arrangement in the memory modules. Experimental results demonstrate that the proposed deblocking filter architecture improves the performance by 28–89% at the expense of a slightly increased gate count compared to the previously known architecture in HEVC. The proposed architecture can reach a high operating clock frequency of 278 MHz with the TSMC 90 nm library and meet the real-time requirement of the deblocking filter for the 8 K × 4 K video format at 123 frame/s.


17 citations

Journal Article•DOI•

Compact ECDSA engine for IoT applications


Tolga Yalcin

22 Jun 2016-Electronics Letters

TL;DR: This work is the most compact ECDSA engine with capability for a wide range of curves and different applications and allows it to be implemented on any application specific integrated circuit (ASIC) or FPGA platform with dual-port memory support.


Abstract: Security problems introduced with rapid increase in deployment of Internet-of-Things devices can be overcome only with lightweight cryptographic schemes and modules. A compact prime field (GF(p)) elliptic curve digital signature algorithm (ECDSA) engine suitable for use in such applications is presented. Generic architecture of the engine makes it suitable for other elliptic curve (EC) based schemes (EC Diffie–Hellman key exchange, EC integrated encryption, EC factoring etc.) with slight modifications. The presented engine is composed of a simple microcoded controller and application-specific processing units. It can work with ECs of up to 256 bits, while 160-bit ECDSA signature generation takes 490 K cycles. The engine is implemented as an intellectual property (IP) in a 180 nm process. However, its architecture allows it to be implemented on any application specific integrated circuit (ASIC) or FPGA platform with dual-port memory support. In view of its gate count of 11,366 gate equivalents, the presented work is the most compact ECDSA engine with capability for a wide range of curves and different applications.


14 citations

Journal Article•DOI•

Fast algorithms and VLSI architecture design for HEVC intra-mode decision


Xiaofeng Huang¹, Huizhu Jia¹, Binbin Cai¹, Chuang Zhu¹, Jie Liu¹, Mingyuan Yang¹, Don Xie¹, Wen Gao¹•Institutions (1)

Peking University¹

01 Aug 2016-Journal of Real-time Image Processing

TL;DR: Several fast algorithms are proposed to remove the data dependency and to reduce the computational complexity, which include source signal based Rough Mode Decision, coarse to fine rough mode search, Prediction Mode Interlaced RDO mode decision, parallelized context adaption and Chroma-free Coding Unit (CU)/Prediction Unit (PU) decision.


Abstract: The emerging intra-coding tools of the High Efficiency Video Coding (HEVC) standard can achieve up to 36% bit-rate reduction compared to H.264/AVC, but with a significant complexity increase. The design challenges, such as data dependency and computational complexity, make it difficult to implement a hardware encoder for real-time applications. In this paper, firstly, the data dependency in HEVC intra-mode decision is fully analyzed, which is caused by the reconstruction loop, the Most Probable Mode, the context adaption during Context-based Adaptive Binary Arithmetic Coding based rate estimation, and the Chroma derived mode. Then, several fast algorithms are proposed to remove the data dependency and to reduce the computational complexity, which include source signal based Rough Mode Decision, coarse to fine rough mode search, Prediction Mode Interlaced RDO mode decision, parallelized context adaption and Chroma-free Coding Unit (CU)/Prediction Unit (PU) decision. Finally, the parallelized VLSI architecture with CU reordering and Chroma reordering scheduling is proposed to improve the throughput. The experimental results demonstrate that the proposed intra-mode decision achieves 41.6% complexity reduction with 4.3% Bjontegaard Delta Rate (BDR) increase on average compared to the reference software, HM-13.0. The intra-mode decision scheme is implemented with 1571.7K gate count in 55 nm CMOS technology. The implementation results show that our design can achieve [email protected] real time processing at 294 MHz operation frequency.


13 citations

Proceedings Article•DOI•

A low-cost energy efficient image scaling processor for multimedia applications


Bharat Garg¹, V N S K Chaitanya Goteti¹, G. K. Sharma¹•Institutions (1)

Indian Institute of Information Technology and Management, Gwalior¹

01 May 2016

TL;DR: In this paper, a low complexity image scaling algorithm is proposed which shows significant reduction in hardware cost and energy over existing architectures without significant degradation in quality.


Abstract: Image scaling is one of the widely used techniques in various portable devices to fit the image in their respective displays. Traditional image scaling architectures consume more power and hardware, making them inefficient for use in portable devices. In this paper, a low complexity image scaling algorithm is proposed. In the proposed algorithm, the target pixel is computed either by bilinear interpolation or by replication. The edge catching module in the architecture determines the method of computation which makes the design energy efficient. Further, algebraic manipulation is done and the resulting pipelined architecture shows significant reduction in hardware cost. In order to evaluate the efficacy, the proposed and existing algorithms are implemented in MATLAB and simulated using standard benchmark images. The proposed design is synthesized in Synopsys Design Compiler using 90-nm CMOS process which shows 43.3% reduced gate count and 25.9% reduction in energy over existing architectures without significant degradation in quality.


13 citations
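
The abstract describes choosing between bilinear interpolation and pixel replication per target pixel, with an edge-detecting module making the choice. The sketch below shows one way such a selection could look; the edge test (neighbourhood range against a threshold), the rule "replicate in flat regions, interpolate otherwise", and the names `scale_pixel` and `threshold` are all assumptions, since the abstract does not give the actual criterion.

```python
def scale_pixel(img, y, x, threshold=8):
    """Compute one target pixel at fractional source coordinates (y, x).

    img is a list of rows of grayscale values; y and x are non-negative.
    """
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(img) - 1)
    x1 = min(x0 + 1, len(img[0]) - 1)
    a, b = img[y0][x0], img[y0][x1]
    c, d = img[y1][x0], img[y1][x1]

    # Hypothetical edge test: a nearly flat neighbourhood is simply
    # replicated, which avoids the multiplications below.
    if max(a, b, c, d) - min(a, b, c, d) < threshold:
        return a

    # Otherwise fall back to standard bilinear interpolation.
    fy, fx = y - y0, x - x0
    top = a * (1 - fx) + b * fx
    bottom = c * (1 - fx) + d * fx
    return top * (1 - fy) + bottom * fy


flat_value = scale_pixel([[10, 12], [11, 13]], 0.5, 0.5)
edge_value = scale_pixel([[0, 100], [0, 100]], 0.5, 0.5)
```

Skipping the interpolation arithmetic wherever replication suffices is the kind of saving that would reduce switching activity in hardware.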

Journal Article•DOI•

High-throughput low-area design of AES using constant binary matrix-vector multiplication


Hokyoon Lee¹, Yoonah Paik¹, Jaeyung Jun¹, Youngsun Han, Seon Wook Kim¹•Institutions (1)

Korea University¹

01 Nov 2016-Microprocessors and Microsystems

TL;DR: This paper proposes a new design of SubBytes and MixColumns in AES using constant binary matrix-vector multiplications reduced to AND and XOR operations, and proposes a four-stage pipelined AES architecture to achieve higher throughput.


12 citations
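
The idea of reducing an AES operation to a constant binary matrix-vector multiplication built from AND and XOR can be illustrated on the simplest building block of MixColumns, multiplication by 2 in GF(2^8). The sketch below derives the 8×8 binary matrix for that map and applies it with bitwise AND plus parity; it is an illustration of the general principle, not the paper's matrices, and the names `ROWS` and `matvec` are made up here.

```python
def xtime(b):
    """Multiply by 2 in AES's GF(2^8) (reduction polynomial 0x11B)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


# Row i of the constant matrix, packed as a bitmask: bit j is set when
# input bit j contributes to output bit i.
ROWS = [
    sum(((xtime(1 << j) >> i) & 1) << j for j in range(8))
    for i in range(8)
]


def matvec(rows, v):
    """Binary matrix-vector product: AND selects terms, XOR sums them."""
    out = 0
    for i, row in enumerate(rows):
        parity = bin(row & v).count("1") & 1   # XOR of the selected bits
        out |= parity << i
    return out


result = matvec(ROWS, 0x57)
```

Because the map is linear over GF(2), the matrix product agrees with `xtime` for every byte, and in hardware each output bit becomes a small XOR tree.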

Proceedings Article•DOI•

Design and optimization of 8 bit ALU using reversible logic


A Deeptha¹, Drishika Muthanna¹, Dhrithi M¹, M Pratiksha¹, B. S. Kariyappa¹•Institutions (1)

R.V. College of Engineering¹

01 May 2016

TL;DR: A novel design for a reversible 8-bit ALU is proposed, which has reduced gate count and transistor count, and whose propagation delay was found to be significantly lower than that of existing designs.


Abstract: Conventional complementary metal oxide semiconductor (CMOS) circuits dissipate energy in the form of bits of information. This dissipation of energy is in the form of power dissipation and plays a very important role as far as low power design is considered. Today, most digital circuits are being designed using Reversible Logic. Design based on Reversible Logic helps in reducing heat dissipation, allowing nearly energy free computation, allowing higher circuit densities and enabling better testing of faults. In this paper, a novel design for a Reversible 8-bit ALU is proposed. The 8-bit ALU is designed by cascading 1-bit ALUs. The two major units of a 1-bit ALU are the control unit and the adder unit. For the control unit, the Control Output Gate (COG) has been used and for the adder unit the Haghparast and Navi Gate (HNG) has been used. The most significant aspect of this paper is that, compared to other papers, this ALU design has reduced gate count and transistor count. The propagation delay was found to be significantly lower, at 5.52 ns compared with 8.29 ns for an existing design. Simulation and verification of the proposed design were performed using the Cadence 180 nm technology software tool.


12 citations
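
The abstract uses the Haghparast and Navi Gate (HNG) for the adder unit but does not define it. Assuming the definition commonly given in the reversible-logic literature (P = A, Q = B, R = A⊕B⊕C, S = (A⊕B)C ⊕ AB ⊕ D), the sketch below checks that the gate is reversible and that it acts as a full adder when D is tied to 0. The helper name `full_add` is made up for this illustration.

```python
from itertools import product


def hng(a, b, c, d):
    """HNG gate as commonly defined (an assumption; not from the abstract)."""
    p = a
    q = b
    r = a ^ b ^ c
    s = ((a ^ b) & c) ^ (a & b) ^ d
    return p, q, r, s


# A gate is reversible when it maps the 16 input patterns to 16
# distinct output patterns.
inputs = list(product((0, 1), repeat=4))
outputs = [hng(*bits) for bits in inputs]
is_reversible = len(set(outputs)) == 16


def full_add(a, b, cin):
    """Full adder from one HNG gate: D = 0 is the constant input,
    P and Q are garbage outputs."""
    _, _, total, carry = hng(a, b, cin, 0)
    return total, carry
```

One gate per bit with one constant input and two garbage outputs is why this gate is a popular choice for the adder unit of a cascaded 1-bit ALU.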

Journal Article•DOI•

Implementation and Analysis of Reversible logic Based Arithmetic Logic Unit


Shaveta Thakral¹, Dipali Bansal¹, S. K. Chakarvarti¹•Institutions (1)

Manav Rachna International University¹

01 Dec 2016-TELKOMNIKA Telecommunication Computing Electronics and Control

TL;DR: This paper aims at categorizing various ways of implementation in VHDL using the Xilinx ISE Design Suite 14.2 tool and at a comparative analysis of existing 1-bit ALU designs in terms of optimization metrics like power consumption, number of gates, number of constant inputs, number of garbage outputs and quantum cost.


Abstract: There has been tremendous growth in fabrication from small scale integration (SSI) to giant scale integration (GSI). It however raises a question of sustainability of Moore's law due to almost intolerable levels of power consumption. Researchers have invented a lot of methods to reduce power consumption and recent technologies are switching to reversible logic. Reversible logic has various applications in fields of computer graphics, optical information processing, quantum computing, DNA computing, ultra low power CMOS design and communication. The Arithmetic Logic Unit (ALU) is considered to be the basic building block of a CPU in the computing environment, and portability in computing systems highly demands a reversible logic based ALU. Modern processors usually have a word length of 32 or 64 bits. The divide and conquer principle cascades n 1-bit ALUs to implement an n-bit ALU. Several researchers have proposed 1-bit ALU designs using various reversible logic gates. This paper aims at categorizing various ways of implementation in VHDL using the Xilinx ISE Design Suite 14.2 tool and at a comparative analysis of existing 1-bit ALU designs in terms of optimization metrics like power consumption, number of gates, number of constant inputs, number of garbage outputs and quantum cost. The ALU realized using a carry save adder block is found to be the most optimal design in terms of gate count and quantum cost.


Journal Article•DOI•

Approach to design a high performance fault-tolerant reversible ALU


Neeraj Kumar Misra, Subodh Wairya, Vinod Kumar Singh

13 Apr 2016

TL;DR: A new reversible gate, the universal parity preserving gate (UPPG), is introduced to optimise the ALU circuits, and the circuit design focuses on optimising the gate count and quantum cost.


Abstract: In digital circuit design, the primary factors are low power and high packing density. Reversible logic circuits in the quantum-dot cellular automata (QCA) framework are expected to be effective in addressing power consumption in the nanoscale regime. Fault-tolerant circuits are suited to detecting errors at the outputs. This manuscript focuses on the design of a QCA-based ALU and proposes a new parity preserving gate. A new reversible gate, namely the universal parity preserving gate (UPPG), is introduced to optimise the ALU circuits. An algorithm and lemmas for designing the ALU are presented. The ALU generates a number of arithmetic and logical functions with low architectural complexity. Most importantly, the circuit design focuses on optimising the gate count and quantum cost. In addition to optimisation, the workability of the UPPG gate is tested in QCA, and the simulation results confirm the correctness of the design.


Posted Content•

Error tracing in linear and concatenated quantum circuits.


Ritajit Majumdar, Saikat Basu, Priyanka Mukhopadhyay, Susmita Sur-Kolay

23 Dec 2016-arXiv: Quantum Physics

TL;DR: This paper has proposed a procedure to trace error probability due to noisy gates and decoherence in quantum circuit and place an error correcting block only when the error probability exceeds a certain threshold, which shows a drastic reduction in the required number of error correcting blocks.


Abstract: Descriptions of quantum algorithms, communication protocols, etc. assume the existence of a closed quantum system. However, real-life quantum systems are open and are highly sensitive to errors. Hence error correction is of utmost importance if quantum computation is to be carried out in reality. Ideally, an error correction block should be placed after every gate operation in a quantum circuit. This increases the overhead and reduces the speedup of the quantum circuit. Moreover, the error correction blocks themselves may induce errors as the gates used for error correction may be noisy. In this paper, we have proposed a procedure to trace error probability due to noisy gates and decoherence in a quantum circuit and place an error correcting block only when the error probability exceeds a certain threshold. This procedure shows a drastic reduction in the required number of error correcting blocks. Furthermore, we have considered concatenated codes with tile structure layout lattice architecture [25], [21], [24] and a SWAP gate based qubit transport mechanism. Tracing errors in higher levels of concatenation shows that, in most cases, after 1 or 2 levels of concatenation, the number of QECC blocks required becomes static. However, since the gate count increases with increasing concatenation, the percentage saving in gate count is considerably high.
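
The threshold-based placement described in the abstract can be sketched with a deliberately simplified error model. The code below assumes independent gate errors, ignores decoherence, and assumes a correction block resets the accumulated error probability to zero; none of these modelling choices, nor the function name, come from the paper.

```python
def place_correction_blocks(gate_error_probs, threshold):
    """Return the indices of gates after which a correction block is placed.

    Simplified model (assumption): gate errors are independent, and a
    correction block resets the accumulated error probability to zero.
    """
    positions = []
    p_ok = 1.0   # probability of no error since the last block
    for i, p in enumerate(gate_error_probs):
        p_ok *= 1.0 - p
        if 1.0 - p_ok > threshold:
            positions.append(i)
            p_ok = 1.0
    return positions


probs = [0.01] * 10
blocks = place_correction_blocks(probs, threshold=0.025)
```

With ten gates of 1% error each and a 2.5% threshold, this model places three blocks instead of the ten that correcting after every gate would need.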


Proceedings Article•DOI•

Scalable Integer DCT Architecture for HEVC Encoder


Maher Abdelrasoul¹, Mohammed S. Sayed¹, Victor Goulart¹•Institutions (1)

Egypt-Japan University of Science and Technology¹

01 Jul 2016

TL;DR: The proposed architecture is a fully pipelined architecture with optimized adders bit-widths and it was prototyped on TSMC 65 nm CMOS technology and the prototyping results show the high performance of the proposed architecture.


Abstract: HEVC (H.265) standard was proposed as a means to increase the compression rate with no loss in video quality. Large integer DCT, with sizes 16x16 and 32x32, is one of the key new features of the H.265 standard. In this paper, we propose a new scalable architecture for integer DCT in HEVC encoder. The proposed architecture is a fully pipelined architecture with optimized adders bit-widths. It was prototyped on TSMC 65 nm CMOS technology. The prototyping results show the high performance of the proposed architecture. Its gate count is 130K and it can achieve throughput of 9.26 Gsps. The proposed architecture can encode 8K @ 120 fps video sequence with working frequency of 373.25 MHz in real time.


Proceedings Article•DOI•

On the implementation of PFSCL adders


Kirti Gupta¹, Pragati Shukla¹, Neeta Pandey²•Institutions (2)

Bharati Vidyapeeth's College of Engineering¹, Delhi Technological University²

01 Nov 2016

TL;DR: It is found that the Arch-3 presents the best PFSCL full adder design by incorporating the advantageous features of the other two proposed architectures.


Abstract: In this paper, implementation of full adders in positive feedback source-coupled logic style (PFSCL) is proposed. Three new architectures for PFSCL full adders are put forward. The first architecture is implemented by using conventional NOR based method. The second architecture is based on the use of configurable cell while the last architecture optimizes the structure by using both the conventional NOR and configurable cell based approaches. The functionality of the proposed architectures is verified through simulations by using TSMC 180 nm CMOS technology parameter on Tanner EDA. Their performance is compared in terms of transistor count, gate count, power, delay and power-delay product. It is found that the Arch-3 presents the best PFSCL full adder design by incorporating the advantageous features of the other two proposed architectures.


Proceedings Article•DOI•

Design, Implementation and Comparative Analysis of Kogge Stone Adder Using CMOS and GDI Design: A VLSI Based Approach


C. N. Shilpa, Kunjan D. Shinde, H V Nithin

01 Dec 2016

TL;DR: The design and performance of Kogge Stone parallel prefix adders are described, implemented using two different design techniques, CMOS and GDI.


Abstract: Adders form a major part of various arithmetic and logical operations. Parallel prefix adders have been established as the most essential and efficient circuits for binary addition. Their particular structure and performance are very attractive for VLSI implementation. In this paper, we describe the design and performance of Kogge Stone parallel prefix adders implemented using different design techniques: CMOS (Complementary Metal Oxide Semiconductor) and GDI (Gate Diffusion Input). The design and simulation of the logic gates is performed in the Cadence Design Suite 6.1.6 using the Virtuoso and ADE environments with GPDK 180 nm technology. The metrics considered for the performance of the KSA are delay, gate count/transistor count (area) and power. Simulation studies are done for 4-bit, 8-bit and 16-bit input data.
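
For readers unfamiliar with the structure being compared, a Kogge Stone adder can be modelled in software as a parallel-prefix computation over generate and propagate signals. The sketch below is a behavioural model only; it says nothing about the CMOS or GDI gate-level implementations in the paper, and the function name is an assumption.

```python
def kogge_stone_add(a, b, width=8):
    """Add two unsigned integers with a Kogge Stone prefix network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]     # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # propagate
    half_sum = p[:]

    # log2(width) prefix stages; stage k combines spans 2**k apart.
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2

    # After the prefix stages g[i] is the carry out of bit i,
    # so the carry into bit i is g[i - 1].
    carries = [0] + g[:-1]
    total = 0
    for i in range(width):
        total |= (half_sum[i] ^ carries[i]) << i
    return total, g[-1]   # (sum modulo 2**width, carry out)


total, carry_out = kogge_stone_add(200, 100)
```

The number of prefix stages grows as log2 of the word length, which is why delay rises slowly from 4-bit to 16-bit inputs while the number of prefix cells, and hence area, grows faster.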


Proceedings Article•DOI•

An optimized hardware architecture of 4×4, 8×8, 16×16 and 32×32 inverse transform for HEVC


Manel Kammoun¹, Emna Maamouri¹, Ahmed Ben Atitallah¹, Nouri Masmoudi¹•Institutions (1)

University of Sfax¹

21 Mar 2016

TL;DR: A unified hardware architecture for 4×4, 8×8, 16×16 and 32×32 inverse 2D core transform IDCT in the HEVC standard that uses only one 1D transform block and a transpose buffer based on FIFO memory blocks instead of the traditional register array in order to further reduce the memory resources.


Abstract: Most video coding standards use transform algorithms to reduce the size of data characterizing a video signal. The traditional transform matrices as in H.264 are limited to 4×4 and 8×8 sizes. However, the flexibility of the coding structure in the next-generation video coding standard, High Efficiency Video Coding (HEVC), allows the definition of various sizes of transform matrices, which can vary from 4×4 to 32×32. This paper describes a unified hardware architecture for 4×4, 8×8, 16×16 and 32×32 inverse 2D core transform IDCT in the HEVC standard. It uses only one 1D transform block and a transpose buffer based on FIFO memory blocks instead of the traditional register array in order to further reduce the memory resources. The synthesis results under TSMC 180 nm CMOS technology show that the total gate count of the design is more than 30% improved compared to previous works. However, the operating frequency of the hardware design is about 130 MHz. The design can decode 25 frames per second at Quad HD (3840×2160) resolution.
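
The "one 1D block plus transpose buffer" organisation relies on the separability of the 2D transform. The sketch below demonstrates the principle on the 4-point HEVC core matrix, reusing a single 1D routine for both passes with a transpose in between. It omits the intermediate rounding and shifts that a real HEVC inverse transform applies, and it uses a plain list transpose rather than the FIFO-based buffer described in the abstract.

```python
# 4-point HEVC core transform matrix, used here as a small example.
M = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def inverse_1d(coeffs):
    """1-D inverse transform of one row: out[n] = sum_k coeffs[k] * M[k][n]."""
    return [sum(coeffs[k] * M[k][n] for k in range(4)) for n in range(4)]


def transpose(block):
    return [list(col) for col in zip(*block)]


def inverse_2d(block):
    """Reuse the single 1-D unit twice, with a transpose in between."""
    stage1 = [inverse_1d(row) for row in block]               # rows
    stage2 = [inverse_1d(row) for row in transpose(stage1)]   # columns
    return transpose(stage2)


dc_only = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
residual = inverse_2d(dc_only)
```

A DC-only coefficient block produces a constant output block, as expected; sharing one 1D datapath for both passes is what keeps the logic area down, at the cost of the transpose storage.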


Book Chapter•DOI•

On the Construction of Hardware-Friendly $4\times 4$ and $5\times 5$ S-Boxes


Stjepan Picek¹, Bohan Yang¹, Vladimir Rozic¹, Nele Mentens¹•Institutions (1)

Katholieke Universiteit Leuven¹

10 Aug 2016

TL;DR: The results show that heuristics should be considered as a viable choice for the generation of S-boxes with good implementation properties, and methods are proposed to obtain $4 \times 4$ and $5\times 5$ S-boxes that are either power or area efficient.


Abstract: With the emergence of the Internet of Things and lightweight cryptography, one can observe a gradual shift of interest in the design of block ciphers. Naturally, security is still of paramount importance, but one is willing to trade a part of that security in order to obtain higher speed and/or smaller implementation area. Accordingly, a common metric in many cipher proposals has been the gate count for realizing the cipher in hardware. On the other side, it is also important, especially for battery powered devices, to have a small energy consumption. That is why we can observe the following shift of research focus: from the analysis of the energy consumption of existing ciphers and their building blocks to the design of new ciphers and building blocks, specifically for low energy. Existing research results focusing on the energy consumption of symmetric ciphers, suggest that the S-box is the most expensive part in the majority of lightweight implementations. If we only consider purely combinatorial S-boxes, we can focus on reducing the power consumption of the S-box in order to minimize the energy consumption of the overall cipher. In this paper, we propose several methods to obtain $4 \times 4$ and $5\times 5$ S-boxes that are either power or area efficient. Our results show that heuristics should be considered as a viable choice for the generation of S-boxes with good implementation properties.
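
Any search for hardware-friendly S-boxes has to keep the cryptographic properties acceptable while it optimises area or power. The abstract does not list the criteria the authors used, so the sketch below only shows two standard checks, bijectivity and differential uniformity, applied to the well-known $4\times 4$ S-box of the PRESENT cipher, which is not one of the paper's S-boxes.

```python
# S-box of the PRESENT block cipher, used purely as a known example.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def is_bijective(sbox):
    """True when every output value occurs exactly once."""
    return sorted(sbox) == list(range(len(sbox)))


def differential_uniformity(sbox):
    """Largest entry of the difference distribution table,
    excluding the trivial zero input difference."""
    n = len(sbox)
    worst = 0
    for dx in range(1, n):
        counts = [0] * n
        for x in range(n):
            counts[sbox[x] ^ sbox[x ^ dx]] += 1
        worst = max(worst, max(counts))
    return worst


bijective = is_bijective(PRESENT_SBOX)
uniformity = differential_uniformity(PRESENT_SBOX)
```

A heuristic search could use checks like these as constraints and an area or power estimate as the objective, though that split is an assumption about how such a search might be organised.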


Proceedings Article•DOI•

Low-complexity turbo product code for high-speed fiber-optic systems based on expurgated BCH codes


Franco Paludi, Damian A. Morera, Teodoro A. Goette, Matias S. Schnidrig, Facundo Ramos, Mario R. Hueda¹•Institutions (1)

National University of Cordoba¹

22 May 2016

TL;DR: The results show that the proposed TPC architecture is able to reduce the gate count and the memory size to approximately half, which makes the new TPC an excellent candidate for practical VLSI implementation in commercial transceivers.


Abstract: We propose a low-complexity implementation architecture for turbo product code (TPC) suitable for next-generation fiber optic networks (e.g. ≳ 100 Gb/s). The proposed code makes use of expurgated Bose-Chaudhuri-Hocquenghem (BCH) codes to improve the performance and reduce implementation complexity. In comparison with existing solutions, our results show that the proposed TPC architecture is able to reduce the gate count and the memory size to approximately half. This feature makes the new TPC an excellent candidate for practical VLSI implementation in commercial transceivers.


Proceedings Article•DOI•

Optimal design of full adder circuit using Particle Swarm Optimization algorithm


R. Das¹, A. Kumar¹, S. Kumar¹, P. K. Prasad¹, Rajib Kar¹, D. Mondal¹, Sakti Prasad Ghoshal¹•Institutions (1)

National Institute of Technology, Durgapur¹

03 Mar 2016

TL;DR: Results are presented to confirm that the PSO based algorithm is superior to Human Design Method in terms of time, effort and especially the gate count required to design the digital combinational circuits.


Abstract: With increasing complexity of electronic circuits, the design and optimization of electronic circuit needs to be automated with high degree of reliability and accuracy. In order to optimize hardware requirement of digital combinational circuits, evolutionary and innovative techniques need to be enforced at various levels such as gate level and device level. This paper presents the use of one of the evolutionary techniques, i.e., Particle Swarm Optimization (PSO) algorithm. It is motivated by the social behaviour of organisms for the optimal design of combinational logic circuit with a reduced gate count in MATLAB platform. Results are presented to confirm that the PSO based algorithm is superior to Human Design Method in terms of time, effort and especially the gate count required to design the digital combinational circuits. The paper shows that PSO based algorithm converges faster than other algorithms such as genetic algorithm and also reduces the computational complexity.

Proceedings Article•DOI•

Design of low cost latches based on reversible quantum dot cellular automata

Debajyoty Banik¹•Institutions (1)

Indian Institute of Technology Patna¹

01 Dec 2016

TL;DR: The primary objective of this article is to design efficient latches in terms of garbage outputs, quantum cost, and delay.

Abstract: In this article, sequential circuits based on reversible quantum dot cellular automata have been designed. The proposed designs thus combine the properties of quantum dot cellular automata with the essence of reversible logic. Gate count alone is not a good cost metric for optimization, because each reversible logic gate has a distinct type and computational complexity and therefore a different delay and quantum cost. The computational complexity of a reversible gate can be represented by its delay and quantum cost. Thus, the primary objective of this article is to design efficient latches in terms of garbage outputs, quantum cost, and delay. Various efficient reversible latches, including the positive level triggered D latch, the negative level triggered D latch and the T latch, have been designed.

Journal Article•DOI•

Fixed-Point Computing Element Design for Transcendental Functions and Primary Operations in Speech Processing

Chung-Hsien Chang¹, Shi-Huang Chen, Bo-Wei Chen¹, Wen Ji², K. Bharanitharan³, Jhing-Fa Wang¹•Institutions (3)

National Cheng Kung University¹, Chinese Academy of Sciences², Hanyang University³

01 May 2016-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: Results indicate that this simple architecture based on a reconfigurable scheme for integrating several commonly used mathematical operations of speech signal processing can be effectively used in most speech processing applications.

Abstract: This brief presents a fixed-point architecture based on a reconfigurable scheme for integrating several commonly used mathematical operations of speech signal processing. The proposed design can perform two transcendental mathematical operations, logarithm and powering, and three commonly used computations with similar operations, namely polynomial calculation, filtering, and windowing. By analyzing the adopted algorithms of the above five operations, a simplified computing unit is designed. This unit can combine six types of operations by reconfiguring the data paths, and the same multiply–add architecture can be reused to reduce the redundant usage of logic gates. The experimental results reveal that the proposed design can work at a 200-MHz clock rate, and its gate count is only 11.9k. Compared with the results of the floating-point function, the median errors of the proposed design for computing the powering and logarithmic functions are 0.57% and 0.11%, respectively. Such results indicate that this simple architecture can be effectively used in most speech processing applications.

Journal Article•DOI•

Low Cost 1D DCT Core for Multiple Video Codec

Rui Jia¹, Rui Chen¹, Colin Yu Lin¹, Guo Zhenhong¹, Haigang Yang¹•Institutions (1)

Chinese Academy of Sciences¹

01 Nov 2016-Chinese Journal of Electronics

TL;DR: The proposed architecture can be generally used to compute 8×8 DCT of AVS, H.264, VC-1 and HEVC in a low cost way, and can be used to decode Full-HD and WQXGA format video sequences in real time.

Abstract: The growing demand for multimedia applications has produced more and more video standards aimed at improving coding and compression efficiency. As the most commonly used transform, the discrete cosine transform (DCT) achieves excellent energy compaction and good compression efficiency. Hardware sharing is the most commonly used strategy for reducing the cost of a video codec. Based on traditional matrix factorization, this paper makes three observations that guide the design of the proposed hardware sharing architecture. The proposed architecture can be generally used to compute the 8×8 DCT of AVS, H.264, VC-1 and HEVC at low cost, and can be used to decode Full-HD and WQXGA format video sequences in real time. The design has been synthesized in 0.13 μm technology. The synthesis results show that the proposed architecture achieves a 76.9% reduction in gate count, an 85.6% decrease in power consumption and a 35% improvement in operational speed in comparison with other existing designs.

Proceedings Article•DOI•

Area-delay-power efficient PSO based full adder in different technologies

Anuj Kumar¹, S. Kumar¹, P. K. Prasad¹, R. Das¹, Rajib Kar¹, D. Mondal¹, Sakti Prasad Ghoshal¹•Institutions (1)

National Institute of Technology, Durgapur¹

01 Apr 2016

TL;DR: Results are presented to support that the PSO based algorithm is better than the Human Design Method in respect of time, labour and especially the gate count required to design digital combinational circuits.

Abstract: With the increasing complexity of electronic circuits and the demand for high performance, the design and optimization of electronic circuits need to be automated with a high degree of reliability and accuracy. In order to optimize the hardware requirements of digital combinational circuits, evolutionary and innovative techniques need to be applied at various levels, such as the gate level and the device level. One such evolutionary technique, the Particle Swarm Optimization (PSO) algorithm, motivated by the social behaviour of organisms, is used for the optimal design of combinational logic circuits with reduced gate count on the MATLAB platform. The PSO technique has been applied to optimize a full adder circuit. Results are presented to support that the PSO based algorithm is better than the Human Design Method in respect of time, labour and especially the gate count required to design digital combinational circuits. The optimized circuit is then analysed with the Microwind 3.1 VLSI CAD tool. Using the tool, parameters such as area, power, delay, and maximum and average drain current are determined for 90 nm, 65 nm and 45 nm technologies using the BSIM4 model. The results show that technology scaling decreases the area, delay, power consumption and leakage current, which are some of the major requirements of today's VLSI design.

Posted Content•

Reversible Logic Circuit Complexity Analysis via Functional Decomposition

Anupam Chattopadhyay, Anubhab Baksi

30 Jan 2016-arXiv: Emerging Technologies

TL;DR: This paper connects the problem of the upper bound of the gate count with the multiplicative complexity analysis of classical Boolean functions and explores the possibility of relaxing the ancilla and whether that approach makes the upper bound tighter.

Abstract: Reversible computation is gaining increasing relevance in the context of several post-CMOS technologies, the most prominent of those being quantum computing. One of the key theoretical problems pertaining to reversible logic synthesis is the upper bound of the gate count. Compared to the known bounds, the results obtained by optimal synthesis methods are significantly lower. In this paper, we connect this problem with the multiplicative complexity analysis of classical Boolean functions. We explore the possibility of relaxing the ancilla and whether that approach makes the upper bound tighter. Our results are negative. The ancilla-free synthesis methods by using transformations and by starting from an Exclusive Sum-of-Product (ESOP) formulation remain, theoretically, the synthesis methods for achieving the least gate count for the cases where the number of variables $n$ is $< 8$ and otherwise, respectively.

Proceedings Article•DOI•

A quality characteristics estimation methodology for the hierarchy of RTL compilers

L. Martirosyan¹•Institutions (1)

Synopsys¹

01 Oct 2016

TL;DR: This paper presents a quality characteristics estimation methodology for STAR Memory System (SMS) network based on linear and polynomial approximation that enables area- and power-aware SMS network design at early stages of SoC design.

Abstract: For Systems-on-Chip (SoCs), gate count and power consumption are among the most critical design constraints. This paper presents a quality characteristics estimation methodology for the STAR Memory System (SMS) network. Our proposed methodology is based on linear and polynomial approximation. The obtained approximate functions are embedded in scripts that were developed for automated estimation of gate count and power consumption. This methodology enables area- and power-aware SMS network design at early stages of SoC design.

Book Chapter•DOI•

Design and Analysis of Reversible Binary and BCD Adders

A. N. Nagamani¹, Nikhil J. Reddy¹, Vinod Kumar Agrawal¹•Institutions (1)

PES University¹

01 Jan 2016

TL;DR: The effectiveness of the negative control Toffoli and Peres gates in reducing quantum cost, delay and gate count is explored and the adder performance increases along with area optimization which will make these designs useful in future low power Reversible computing.

Abstract: Reversible logic has recently attracted a lot of research attention in the fields of quantum computation and nanotechnology due to its low power dissipation capability. Adders are one of the basic components in most digital systems. Optimization of these adders can improve the performance of the entire system. In this work we propose designs of reversible binary and BCD adders: ripple carry and conditional adders for binary addition, and regular and flagged adders for BCD addition. The proposed adder designs are optimized for quantum cost, gate count and delay. The effectiveness of the negative control Toffoli and Peres gates in reducing quantum cost, delay and gate count is explored. As a result, adder performance increases along with area optimization, which will make these designs useful in future low-power reversible computing.
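
As background for the gates named in this abstract, here is a small sketch of the standard 3-bit Toffoli and Peres gates as truth-table functions, with a brute-force check that each is a bijection (that is, reversible). It uses the textbook positive-control definitions, not the negative control variants or the adder designs from the chapter; the helper names `is_reversible` and `peres_half_adder` are illustrative assumptions.

```python
from itertools import product


def toffoli(a, b, c):
    """Standard Toffoli (CCNOT): flips c only when both controls a and b are 1."""
    return a, b, c ^ (a & b)


def peres(a, b, c):
    """Standard Peres gate: (a, b, c) -> (a, a XOR b, c XOR (a AND b))."""
    return a, a ^ b, c ^ (a & b)


def is_reversible(gate, n_bits=3):
    """A gate is reversible if distinct inputs always give distinct outputs."""
    inputs = list(product((0, 1), repeat=n_bits))
    outputs = {gate(*bits) for bits in inputs}
    return len(outputs) == len(inputs)


def peres_half_adder(a, b):
    """With the third input held at constant 0, a Peres gate yields sum and carry."""
    _, s, carry = peres(a, b, 0)
    return s, carry


print(is_reversible(toffoli), is_reversible(peres))
for a, b in product((0, 1), repeat=2):
    print(a, b, peres_half_adder(a, b))
```

The half-adder use shows why such gates matter for the cost metrics in the abstract: one Peres gate with one constant input produces both sum and carry, leaving the pass-through copy of `a` as a garbage output.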

Journal Article•DOI•

High throughput resource shared 2D integer transform computation for H.264/MPEG-4 AVC

Honey Durga Tiwari, Meeturani Sharma, Harsh Durga Tiwari

01 Jul 2016-Digital Signal Processing

TL;DR: The proposed dual-clock pipelined architecture can be used for real-time H.264/MPEG-4 AVC processing and achieves a throughput of 7 G and 18.7 G pixels/sec for each block of 4 × 4 and 8 × 8 forward integer transforms, respectively.

Journal Article•DOI•

Design of Efficient Reversible BCD Adder–Subtractor Architecture and Its Optimization Using Carry Skip Logic

Praveena Murugesan¹, Thanushkodi Keppanagounder², Vijeyakumar Krishnasamy Natarajan³•Institutions (3)

Anna University¹, Akshaya College of Engineering and Technology², Dr. Mahalingam College of Engineering and Technology³

22 Apr 2016-Journal of Circuits, Systems, and Computers

TL;DR: A novel design of binary coded decimal (BCD) adder/subtractor in reversible logic has been proposed and carry skip (CSK) logic is used for reversible ripple carry adder stages to reduce delay at the expense of a little additional hardware.

Abstract: In the present era, reversible logic designs play a very critical role in nanotechnology, low power complementary metal-oxide semiconductor (CMOS) designs, optical computing and, especially, in quantum computing. High power dissipation and leakage current in deep submicron technologies are a severe threat in applications created today. As a consequence, the design of datapath elements in reversible logic has gained much importance. In this study, a novel design of a binary coded decimal (BCD) adder/subtractor in reversible logic has been proposed. As a further optimization of the proposed reversible decimal design, carry skip (CSK) logic is used for the reversible ripple carry adder stages. This reduces delay at the expense of a little additional hardware. The proposed BCD adder/subtractor and its optimized version are designed using structural VHDL and simulated using ModelSim 6.3f. Performance analysis reveals that the proposed BCD design demonstrates reductions in gate count, garbage outputs and constant inputs of 30.5%, 46% and 28%, respectively, and its optimized version exhibits 19.4%, 32.4% and 16% reductions in gate count, garbage outputs and constant inputs compared to the design in Ref. 14 [V. Rajmohan, V. Renganathan and M. Rajmohan, A novel reversible design of unified single digit BCD adder–subtractor, Int. J. Comput. Theor. Eng. 3 (2011) 697–700].

Journal Article•DOI•

High-Performance Hardware Interpolation Architecture for High Efficiency Video Coding Decoder

Lella Aicha Ayadi¹, Nihel Neji¹, Hassen Loukil¹, Mouhamed Ali Ben Ayed¹, Nouri Masmoudi¹•Institutions (1)

University of Sfax¹

30 Sep 2016-International Review on Computers and Software

TL;DR: A highly parallel architecture and a pipelined hardware implementation achieving 8×8 Prediction Unit (PU) interpolation in only 30 clock cycles for inter-prediction which is useful for motion compensation (MC) module in the HEVC decoder.

Abstract: The fractional sample interpolation process is one of the most computationally intensive parts of a video decoder based on the High Efficiency Video Coding (HEVC) standard. Therefore, in this paper, we propose a high-performance hardware interpolation architecture for inter-prediction, which is useful for the motion compensation (MC) module in the HEVC decoder. For this component, we propose a highly parallel architecture and a pipelined hardware implementation achieving 8×8 Prediction Unit (PU) interpolation in only 30 clock cycles. Experimental results show that our architecture can achieve up to 3.2 pixels/cycle at 125 MHz on field-programmable gate array (FPGA) technology, and the corresponding performance can support the processing of Quad Full High Definition (QFHD, 3840×2160)@30 fps. The gate count of the resulting Application-Specific Integrated Circuit (ASIC) implementation in 65 nm technology is 36.7 k.
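
The throughput claim in this abstract can be sanity-checked with simple arithmetic. The sketch below compares the stated peak rate (3.2 pixels/cycle at 125 MHz) with the pixel rate of a 3840×2160 stream at 30 fps. It assumes one sample per pixel (luma only) and sustained peak throughput, neither of which the abstract specifies; the function names are illustrative.

```python
def sustained_rate(pixels_per_cycle, clock_hz):
    """Pixels per second delivered at a given per-cycle throughput and clock."""
    return pixels_per_cycle * clock_hz


def required_rate(width, height, fps):
    """Pixels per second needed to keep up with a video format (luma only)."""
    return width * height * fps


# Figures taken from the abstract.
available = sustained_rate(3.2, 125e6)
needed = required_rate(3840, 2160, 30)
headroom = available / needed

print(f"available: {available / 1e6:.1f} Mpixel/s")
print(f"needed:    {needed / 1e6:.1f} Mpixel/s")
print(f"headroom:  {headroom:.2f}x")
```

Under these assumptions the peak rate is about 400 Mpixel/s against roughly 248.8 Mpixel/s required, which is consistent with the QFHD@30 fps claim.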
