Showing papers on "Gate count published in 2021"

PDF

Open Access

Proceedings Article•DOI•

Orchestrated trios: compiling for efficient communication in Quantum programs with 3-Qubit gates

[...]

Casey Duckering¹, Jonathan M. Baker¹, Andrew Litteken¹, Frederic T. Chong¹•Institutions (1)

19 Apr 2021

TL;DR: In this article, the authors propose a new compiler structure, Orchestrated Trios, that first decomposes to the three-qubit Toffoli, routes the inputs of the higher-level operations to groups of nearby qubits, then finishes decomposition to hardware-supported gates.

...read moreread less

Abstract: Current quantum computers are especially error prone and require high levels of optimization to reduce operation counts and maximize the probability the compiled program will succeed. These computers only support operations decomposed into one- and two-qubit gates and only two-qubit gates between physically connected pairs of qubits. Typical compilers first decompose operations, then route data to connected qubits. We propose a new compiler structure, Orchestrated Trios, that first decomposes to the three-qubit Toffoli, routes the inputs of the higher-level Toffoli operations to groups of nearby qubits, then finishes decomposition to hardware-supported gates. This significantly reduces communication overhead by giving the routing pass access to the higher-level structure of the circuit instead of discarding it. A second benefit is the ability to now select an architecture-tuned Toffoli decomposition such as the 8-CNOT Toffoli for the specific hardware qubits now known after the routing pass. We perform real experiments on IBM Johannesburg showing an average 35% decrease in two-qubit gate count and 23% increase in success rate of a single Toffoli over Qiskit. We additionally compile many near-term benchmark algorithms showing an average 344% increase in (or 4.44x) simulated success rate on the Johannesburg architecture and compare with other architecture types.

...read moreread less

14 citations

Proceedings Article•DOI•

Orchestrated Trios: Compiling for Efficient Communication in Quantum Programs with 3-Qubit Gates

[...]

Casey Duckering¹, Jonathan M. Baker¹, Andrew Litteken¹, Frederic T. Chong²•Institutions (2)

University of Chicago¹, Association for Computing Machinery²

16 Feb 2021-arXiv: Quantum Physics

TL;DR: The Orchestrated Trios compiler as mentioned in this paper decomposes to the three-qubit Toffoli operation and routes the inputs of the higher-level operations to groups of nearby qubits, then finishes decomposition to hardware-supported gates.

...read moreread less

13 citations

Journal Article•DOI•

Memristor based high speed and low power consumption memory design using deep search method

[...]

M. Prithivi Raj¹, G. Kavithaa¹•Institutions (1)

Government College of Engineering, Salem¹

01 Mar 2021-Journal of Ambient Intelligence and Humanized Computing

TL;DR: In this work, low-power complementary metal oxide semiconductor (CMOS) flip-flops have been proposed with deep search pattern method and some key parameters such as delay, power, gate count and other memristor calculations are carried out.

...read moreread less

Abstract: The demand for low-power devices in today’s world is increasing, and the reason behind this is scaling CMOS technology. Due to scaling, the size of the chip decreases and the number of transistors in System-On-Chip increases. However, transistor miniaturization also introduces many new challenges in circuit design for very large scale integrated circuits. Therefore in this work is introduced a memristor based memory design, the memristor breaks the scaling limitations of CMOS technology and prevails over emerging semiconductor devices. The memristor is forced to the Nano scale design of the invention, and successful fabrication begins to take into account the range of conventional metal oxide semiconductor field effect transistors or more specifically the use of transistors as a whole. The memristor has a history mechanism that allows the memory operation to be combined with the inherent bipolar resistance switching characteristics. Different types of existing mathematical models have been derived from shapes that can be further implemented and tested in a prototype with some crucial parameters thus determined and how they differ from conventional transistor-based designs for a suitable circuit memristor. In this work, low-power complementary metal oxide semiconductor (CMOS) flip-flops have been proposed with deep search pattern method and some key parameters such as delay, power, gate count and other memristor calculations are carried out. The simulations are carried out using Verilog-analog mixed signal. The proposed system is considered as potential devices for building memories because they are very dense, non-volatile scalable devices with faster switching times and low power dissipation and are also compatible with the existing CMOS-technology.

...read moreread less

12 citations

Journal Article•DOI•

Efficient Designs of Reversible Synchronous Counters in Nanoscale

[...]

Mojtaba Noorallahzadeh¹, Mohammad Mosleh¹, Seyed-Sajad Ahmadpour¹•Institutions (1)

Islamic Azad University¹

06 May 2021-Circuits Systems and Signal Processing

TL;DR: In this article, an effective 5'×'5 reversible block, called NB, is first proposed, and then, using the proposed reversible block a novel reversible T flip-flop is designed.

...read moreread less

Abstract: In recent years, reversible circuits have attracted the attention of many researchers. Their applications include the design of low-power digital circuits, the design of computational circuits in quantum computers and DNA-based calculations. In this paper, an effective 5 × 5 reversible block, called new block (NB), is first proposed, and then, using the proposed reversible block, a novel reversible T flip-flop is designed. Moreover, we have used Miller synthesis method for calculating the optimal quantum cost of the proposed block. Finally, using the proposed T flip-flop, Feynman gate (FG), Fredkin gate (FRG), Reversible Multiplexer1 (RMUX1) and Modified Toffoli gate (MTG), two reversible synchronous counters including up/down and BCD are suggested. The comparison results show that the proposed up/down counters are superior to the previous designs in terms of parameters such as gate count, constant input, garbage output, quantum cost, and delay.

...read moreread less

11 citations

Journal Article•DOI•

Real time FPGA Implementation of a High Speed and Area Optimized Harris Corner Detection Algorithm

[...]

Prateek Sikka¹, Abhijit Asati¹, Chandra Shekhar¹•Institutions (1)

Birla Institute of Technology and Science¹

01 Feb 2021-Microprocessors and Microsystems

TL;DR: A high speed and area optimized implementation of a Harris corner detection algorithm is proposed using a novel high-level synthesis (HLS) design method based on application-specific bit widths for intermediate data nodes.

...read moreread less

11 citations

Posted Content•

Quantum circuit optimization with deep reinforcement learning

[...]

Thomas Fösel¹, Thomas Fösel², Thomas Fösel³, Murphy Yuezhen Niu³, Florian Marquardt¹, Florian Marquardt², Li Li³ - Show less +3 more•Institutions (3)

Max Planck Society¹, University of Erlangen-Nuremberg², Google³

13 Mar 2021-arXiv: Quantum Physics

TL;DR: In this article, an approach to quantum circuit optimization based on reinforcement learning is presented, where an agent can autonomously learn generic strategies to optimize arbitrary circuits on a specific architecture, where the optimization target can be chosen freely by the user.

...read moreread less

Abstract: A central aspect for operating future quantum computers is quantum circuit optimization, i.e., the search for efficient realizations of quantum algorithms given the device capabilities. In recent years, powerful approaches have been developed which focus on optimizing the high-level circuit structure. However, these approaches do not consider and thus cannot optimize for the hardware details of the quantum architecture, which is especially important for near-term devices. To address this point, we present an approach to quantum circuit optimization based on reinforcement learning. We demonstrate how an agent, realized by a deep convolutional neural network, can autonomously learn generic strategies to optimize arbitrary circuits on a specific architecture, where the optimization target can be chosen freely by the user. We demonstrate the feasibility of this approach by training agents on 12-qubit random circuits, where we find on average a depth reduction by 27% and a gate count reduction by 15%. We examine the extrapolation to larger circuits than used for training, and envision how this approach can be utilized for near-term quantum devices.

...read moreread less

10 citations

Proceedings Article•DOI•

Exploiting long-distance interactions and tolerating atom loss in neutral atom quantum architectures

[...]

Jonathan M. Baker¹, Andrew Litteken¹, Casey Duckering¹, Henry Hoffmann¹, Hannes Bernien¹, Frederic T. Chong¹ - Show less +2 more•Institutions (1)

University of Chicago¹

14 Jun 2021

TL;DR: In this paper, the authors evaluate the advantages and disadvantages of neutral atom (NA) architectures and propose hardware and compiler methods to increase system resilience to atom loss dramatically reducing total computation time by circumventing complete reloads or full recompilation every cycle.

...read moreread less

Abstract: Quantum technologies currently struggle to scale beyond moderate scale prototypes and are unable to execute even reasonably sized programs due to prohibitive gate error rates or coherence times. Many software approaches rely on heavy compiler optimization to squeeze extra value from noisy machines but are fundamentally limited by hardware. Alone, these software approaches help to maximize the use of available hardware but cannot overcome the inherent limitations posed by the underlying technology. An alternative approach is to explore the use of new, though potentially less developed, technology as a path towards scalability. In this work we evaluate the advantages and disadvantages of a Neutral Atom (NA) architecture. NA systems offer several promising advantages such as long range interactions and native multiqubit gates which reduce communication overhead, overall gate count, and depth for compiled programs. Long range interactions, however, impede parallelism with restriction zones surrounding interacting qubit pairs. We extend current compiler methods to maximize the benefit of these advantages and minimize the cost. Furthermore, atoms in an NA device have the possibility to randomly be lost over the course of program execution which is extremely detrimental to total program execution time as atom arrays are slow to load. When the compiled program is no longer compatible with the underlying topology, we need a fast and efficient coping mechanism. We propose hardware and compiler methods to increase system resilience to atom loss dramatically reducing total computation time by circumventing complete reloads or full recompilation every cycle.

...read moreread less

8 citations

Journal Article•DOI•

Quantum Gate Pattern Recognition and Circuit Optimization for Scientific Applications

[...]

Wonho Jang¹, Koji Terashi¹, Masahiko Saito¹, Christian W. Bauer², Benjamin Nachman², Yutaro Iiyama¹, Tomoe Kishimoto¹, Ryunosuke Okubo¹, Ryu Sawada¹, Junichi Tanaka¹ - Show less +6 more•Institutions (2)

University of Tokyo¹, Lawrence Berkeley National Laboratory²

01 May 2021-Epj Web of Conferences

TL;DR: AQCEL as discussed by the authors is a multi-tiered quantum circuit optimization protocol that combines two separate ideas for circuit optimization and combines them in a multilevel quantum circuit optimizer.

...read moreread less

Abstract: There is no unique way to encode a quantum algorithm into a quantum circuit. With limited qubit counts, connectivities, and coherence times, circuit optimization is essential to make the best use of quantum devices produced over a next decade. We introduce two separate ideas for circuit optimization and combine them in a multi-tiered quantum circuit optimization protocol called AQCEL. The first ingredient is a technique to recognize repeated patterns of quantum gates, opening up the possibility of future hardware optimization. The second ingredient is an approach to reduce circuit complexity by identifying zero- or low-amplitude computational basis states and redundant gates. As a demonstration, AQCEL is deployed on an iterative and effcient quantum algorithm designed to model final state radiation in high energy physics. For this algorithm, our optimization scheme brings a significant reduction in the gate count without losing any accuracy compared to the original circuit. Additionally, we have investigated whether this can be demonstrated on a quantum computer using polynomial resources. Our technique is generic and can be useful for a wide variety of quantum algorithms.

...read moreread less

7 citations

Journal Article•DOI•

High-Speed Architecture for Successive Cancellation Decoder With Split-g Node Block

[...]

J. Sujanth Roy¹, G. Lakshminarayanan¹, Seok-Bum Ko²•Institutions (2)

National Institute of Technology, Tiruchirappalli¹, University of Saskatchewan²

01 Sep 2021-IEEE Embedded Systems Letters

TL;DR: A novel restructuring of the 2bit-SC (2b-SC) precomputation decoder architecture is carried out to reduce the latency by 20% while reducing the hardware complexity.

...read moreread less

Abstract: Polar codes are one of the recently developed error correcting codes, and they are popular due to their capacity achieving nature. The architecture of the successive cancellation (SC) decoder algorithm is composed of a recursive processing element (PE). The PE comprises various blocks that include signed adder, subtractor, comparator, multiplexers, and few logic gates. Therefore, the latency of the PE is a primary concern. Hence, a high-speed architecture for implementing the SC decoding algorithm for polar codes is proposed. In the proposed work, a novel restructuring of the 2bit-SC (2b-SC) precomputation decoder architecture is carried out to reduce the latency by 20% while reducing the hardware complexity. Compared to the 2b-SC precomputation decoder, the proposed architecture also has 19% increased throughput for (1024, 512) polar codes with 45% reduction in the gate count.

...read moreread less

7 citations

Journal Article•DOI•

A hybrid hardware oriented motion estimation algorithm for HEVC/H.265

[...]

Sushanta Gogoi¹, Rangababu Peesapati¹•Institutions (1)

National Institute of Technology, Meghalaya¹

01 Jun 2021-Journal of Real-time Image Processing

TL;DR: This paper proposes a fast hybrid search pattern algorithm and its hardware architecture for encoding UHD videos that requires an average of 11.19% less encoding time than the default Test Zone Search (TZS) algorithm in HM reference software.

...read moreread less

Abstract: High Efficiency Video Coding (HEVC) is the latest video coding standard that supports high resolution videos by providing approximately twice the compression efficiency as compared to its previous standard H.264. Motion Estimation (ME) in HEVC is the most computation-intensive block as a result it becomes a bottleneck in the design of the encoder while implementing video applications on various computing platforms such as general purpose and embedded processors. So developing computational efficient architectures on Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) platforms is inevitable. This paper proposes a fast hybrid search pattern algorithm and its hardware architecture for encoding UHD videos. The proposed Integer ME (IME) algorithm requires an average of 11.19% less encoding time than the default Test Zone Search (TZS) algorithm in HM reference software with compromising decrement in PSNR and increment in bit rate. The proposed architecture is implemented in both FPGA and ASIC platform with TSMC 90 nm technology library. It consumed 32-33% of resources in Virtex-7 FPGA and 2784.4 K equivalent gate count (in terms of NAND ) and 18 kB of memory, respectively. The results show that maximum frequency of the proposed architecture is 162 MHz and a total power consumption is 463.4 mW. The architecture provides a maximum throughput of 2.78 Gpixels/sec due to it process $$32\times 32$$ CU comparatively much less clock cycles (59.5) as compared to the state-of-the-art literature . Further, it supports 8K UHD $$(8192\times 4320)$$ @ 78 fps.

...read moreread less

6 citations

Posted Content•

Approaching the theoretical limit in quantum gate decomposition

[...]

Péter Rakyta, Zoltán Zimborás

14 Sep 2021-arXiv: Quantum Physics

TL;DR: In this paper, the authors proposed a numerical approach to decompose general quantum programs in terms of single and two-qubit quantum gates with a $CNOT$ gate count very close to the current theoretical lower bounds.

...read moreread less

Abstract: In this work we propose a novel numerical approach to decompose general quantum programs in terms of single- and two-qubit quantum gates with a $CNOT$ gate count very close to the current theoretical lower bounds. In particular, it turns out that $15$ and $63$ $CNOT$ gates are sufficient to decompose a general $3$- and $4$-qubit unitary, respectively. This is currently the lowest achieved gate count compared to other algorithms. Our approach is based on a sequential optimization of parameters related to the single-qubit rotation gates involved in a pre-designed quantum circuit used for the decomposition. In addition, the algorithm can be adopted to sparse inter-qubit connectivity architectures provided by current mid-scale quantum computers, needing only a few additional $CNOT$ gates to be implemented in the resulting quantum circuits.

...read moreread less

Proceedings Article•DOI•

A Microarchitecture Implementation Framework for Online Learning with Temporal Neural Networks

[...]

Harideep Nair¹, John Paul Shen¹, James E. Smith¹•Institutions (1)

Carnegie Mellon University¹

01 Jul 2021

TL;DR: In this paper, the authors propose a micro-architecture framework for implementing TNNs using standard CMOS, which is embodied in a set of characteristic scaling equations for assessing the gate count, area, delay and power for any TNN design.

...read moreread less

Abstract: Temporal Neural Networks (TNNs) are spiking neural networks that use time as a resource to represent and process information, similar to the mammalian neocortex. In contrast to compute-intensive deep neural networks that employ separate training and inference phases, TNNs are capable of extremely efficient online incremental/continual learning and are excellent candidates for building edge-native sensory processing units. This work proposes a microarchitecture framework for implementing TNNs using standard CMOS. Gate-level implementations of three key building blocks are presented: 1) multi-synapse neurons, 2) multi-neuron columns, and 3) unsupervised and supervised online learning algorithms based on Spike Timing Dependent Plasticity (STDP). The proposed microarchitecture is embodied in a set of characteristic scaling equations for assessing the gate count, area, delay and power for any TNN design. Post-synthesis results (in 45nm CMOS) for the proposed designs are presented, and their online incremental learning capability is demonstrated.

...read moreread less

Journal Article•DOI•

Implementation and validation of quadral-duty digital PWM to develop a cost-optimized ASIC for BLDC motor drive

[...]

Pratikanta Mishra¹, Atanu Banerjee¹, Mousam Ghosh², Sushanta Gogoi¹, Pramod Kumar Meher - Show less +1 more•Institutions (2)

National Institute of Technology, Meghalaya¹, Government Engineering College, Sreekrishnapuram²

01 Apr 2021-Control Engineering Practice

TL;DR: A quadral-duty digital pulse width modulation technique-based low-cost hardware architecture for brushless DC (BLDC) motor drive is proposed by incorporating an efficient speed calculation and commutation circuitry to achieve the compactness of the total architecture.

...read moreread less

Posted Content•

Clifford Circuit Optimization with Templates and Symbolic Pauli Gates

[...]

Sergey Bravyi¹, Ruslan Shaydulin², Shaohan Hu³, Dmitri Maslov¹•Institutions (3)

IBM¹, Argonne National Laboratory², JPMorgan Chase³

05 May 2021-arXiv: Quantum Physics

TL;DR: In this paper, the problem of finding a short quantum circuit implementing a given Clifford group element is considered, and two methods aim to minimize the entangling gate count assuming all-to-all qubit connectivity.

...read moreread less

Abstract: The Clifford group is a finite subgroup of the unitary group generated by the Hadamard, the CNOT, and the Phase gates. This group plays a prominent role in quantum error correction, randomized benchmarking protocols, and the study of entanglement. Here we consider the problem of finding a short quantum circuit implementing a given Clifford group element. Our methods aim to minimize the entangling gate count assuming all-to-all qubit connectivity. First, we consider circuit optimization based on template matching and design Clifford-specific templates that leverage the ability to factor out Pauli and SWAP gates. Second, we introduce a symbolic peephole optimization method. It works by projecting the full circuit onto a small subset of qubits and optimally recompiling the projected subcircuit via dynamic programming. CNOT gates coupling the chosen subset of qubits with the remaining qubits are expressed using symbolic Pauli gates. Software implementation of these methods finds circuits that are only 0.2% away from optimal for 6 qubits and reduces the two-qubit gate count in circuits with up to 64 qubits by 64.7% on average, compared with the Aaronson-Gottesman canonical form.

...read moreread less

Journal Article•DOI•

VLSI Implementation of Multi-Channel ECG Lossless Compression System

[...]

Tsung-Han Tsai¹, Nai-Chieh Tung¹, Ding-Bang Lin¹•Institutions (1)

National Central University¹

08 Apr 2021-IEEE Transactions on Circuits and Systems Ii-express Briefs

TL;DR: A hardware architecture of multi-Channel lossless ECG compression system based on the algorithm including multi-channel linear prediction and adaptive linear prediction, designed with low hardware complexity usage while using optimum hardware resources.

...read moreread less

Abstract: Electrocardiogram (ECG) is used to record the electrical activity of heart. If the instrument monitoring the signal for a long time, it will produce large amount of data. So, the effective lossless ECG compression system can help to reduce the storage space. This brief presents a hardware architecture of multi-channel lossless ECG compression system. The system is based on the algorithm including multi-channel linear prediction and adaptive linear prediction. Also, Golomb rice code (GRC) is used for entropy coding. The hardware implementation has been designed with low hardware complexity usage while using optimum hardware resources. And the architecture can process multiple channels in parallel that can obtain the high throughput. PTB database has been used for verification and testing purposes. This design was implemented in TSMC 180nm. The implementation results show the gate count is 476K and power consumption is $69.18\mu \text{W}$ while working frequency is 1 KHz.

...read moreread less

Journal Article•DOI•

A Low gate count reconfigurable architecture for biomedical signal processing applications

[...]

Nupur Jain¹, Biswajit Mishra¹, Peter R. Wilson²•Institutions (2)

Indian Institute of Chemical Technology¹, University of Bath²

30 Apr 2021

TL;DR: A new reconfigurable architecture for biomedical applications that targets frequently encountered functions in biomedical signal processing algorithms thereby replacing multiple dedicated accelerators and reports low gate count is presented.

...read moreread less

Abstract: A new reconfigurable architecture for biomedical applications is presented in this paper. The architecture targets frequently encountered functions in biomedical signal processing algorithms thereby replacing multiple dedicated accelerators and reports low gate count. An optimized implementation is achieved by mapping methodologies to functions and limiting the required memory leading directly to an overall minimization of gate count. The proposed architecture has a simple configuration scheme with special provision for handling feedback. The effectiveness of the architecture is demonstrated on an FPGA to show implementation schemes for multiple DSP functions. The architecture has gate count of $$\approx$$ 25k and an operating frequency of 46.9 MHz.

...read moreread less

Journal Article•DOI•

Area and Power-Efficient Variable-Sized DCT Architecture for HEVC Using Muxed-MCM Problem

[...]

Ahmad Shabani¹, Mohammad Sabri², Bahareh Khabbazan³, Somayeh Timarchi¹•Institutions (3)

Shahid Beheshti University¹, University of Tehran², Iran University of Science and Technology³

01 Mar 2021-IEEE Transactions on Circuits and Systems I-regular Papers

TL;DR: An area and power-efficient variable-size DCT architecture for HEVC application and a reconfigurable and scalable shift-and-add unit embedded in the authors' 1D-DCT architecture by leveraging Muxed-MCM problem with the aim of increasing the hardware reusability in the arithmetic units, while reducing the hardware cost.

...read moreread less

Abstract: This paper presents an area and power-efficient variable-size DCT architecture for HEVC application. We develop a reconfigurable and scalable shift-and-add unit (SAU) embedded in our 1D-DCT architecture by leveraging Muxed-MCM problem with the aim of increasing the hardware reusability in the arithmetic units, while reducing the hardware cost. The key idea behind the proposed architecture is the fact that in most of the times (≈90%) the lower point DCTs are performed when the higher point SAUs remain unused. Accordingly, we focus on merging the SAUs of lower point DCTs into the higher point DCTs to compute multiple lower point DCTs in parallel as well as processing any combination of transform sizes. The experimental results show that the proposed folded and fully-parallel 2D-DCT architectures achieve the best hardware cost by 45% and 30% reduction in gate count, respectively, amongst the existing architectures. Moreover, power saving of 55% and 32% can be achieved for the proposed folded and fully-parallel architectures, respectively, where they can process 60 fps of 4K and 30 fps of 8K UHD video sequences in 300 MHz operating frequency.

...read moreread less

Proceedings Article•DOI•

Property-driven Automatic Generation of Reduced-ISA Hardware

[...]

Nathan Bleier¹, John Sartori², Rakesh Kumar¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, University of Minnesota²

05 Dec 2021

TL;DR: In this article, a property-based framework for automatically generating reduced-ISA hardware is presented, which directly operates on a given arbitrary RTL or gate-level netlist, uses property checking to identify gates that are guaranteed to not toggle if only a reduced ISA needs to be supported, and automatically eliminates these untoggleable gates to generate a new design.

...read moreread less

Abstract: As the diversity of computing workloads and customers continues to increase, so does the need to customize hardware at low cost for different computing needs. This work focuses on automatic customization of a given hardware, available as a soft or firm IP, through eliminating unneeded or undesired instruction set architecture (ISA) instructions. We present a property-based framework for automatically generating reduced-ISA hardware. Our framework directly operates on a given arbitrary RTL or gate-level netlist, uses property checking to identify gates that are guaranteed to not toggle if only a reduced ISA needs to be supported, and automatically eliminates these untoggleable gates to generate a new design. We show a 14% gate count reduction when the Ibex [19] core is optimized using our framework for the instructions required by a set of embedded (MiBench) workloads. Reduced-ISA versions generated by our framework that support a limited set of ISA extensions and which cannot be generated using Ibex’s parameterization options provide 10%47% gate count reduction. For an obfuscated Cortex M0 netlist optimized to support the instructions in the MiBench benchmarks, we observe a 20% area reduction and 18% gate count reduction compared to the baseline core, demonstrating applicability of our framework to obfuscated designs. We demonstrate the scalability of our approach by applying our framework to a 100,000-gate RIDECORE [21] design, showing a 14%17% gate count reduction.

...read moreread less

Book Chapter•DOI•

An Efficient Hardware Architecture for Deblocking Filter in HEVC

[...]

P. Kopperundevi¹, M. Surya Prakash¹•Institutions (1)

National Institute of Technology Calicut¹

01 Jan 2021

TL;DR: Experimental results show that the proposed deblocking filter architecture achieves similar or up to two times higher throughput compared with the existing architectures while occupying a moderate chip area and consuming relatively low logic power.

...read moreread less

Abstract: This paper proposes a new hardware architecture for deblocking filter in a high efficiency video coding (HEVC) system. The proposed hardware is designed by using mixed pipelined and parallel processing architectures. The pixels are processed in the stream of two blocks of $4 \times 32$ samples in which edge filters are applied vertically in a parallel fashion for the processing of luma and chroma samples. These pixels are transposed and reprocessed through the vertical filter for horizontal filtering in a pipelined fashion. Finally, the filtered block will be transposed back to the original direction. The proposed filter is implemented using Verilog HDL, and the design is synthesized using the GPDK 90 nm technology library. Experimental results show that the proposed deblocking filter architecture achieves similar or up to two times higher throughput compared with the existing architectures while occupying a moderate chip area and consuming relatively low logic power. The proposed architecture supports the real-time deblocking filter operation of $4\text {k} \times 2\text {k}$ @60 fps under the clock frequency of 125 MHz with a gate count of 110K.

...read moreread less

Proceedings Article•

Virtual Secure Platform: A Five-Stage Pipeline Processor over TFHE.

[...]

Kotaro Matsuoka¹, Ryotaro Banno¹, Naoki Matsumoto¹, Takashi Sato¹, Song Bian¹ - Show less +1 more•Institutions (1)

Kyoto University¹

01 Jan 2021

TL;DR: Virtual Secure Platform (VSP) as mentioned in this paper implements a multi-opcode general-purpose sequential processor over Fully Homomorphic Encryption (FHE) for Secure Multi-Party Computation (SMPC).

...read moreread less

Abstract: We present Virtual Secure Platform (VSP), the first comprehensive platform that implements a multi-opcode general-purpose sequential processor over Fully Homomorphic Encryption (FHE) for Secure Multi-Party Computation (SMPC). VSP protects both the data and functions on which the data are evaluated from the adversary in a secure computation offloading situation like cloud computing. We proposed a complete processor architecture with a five-stage pipeline, which improves the performance of the VSP by providing more parallelism in circuit evaluation. In addition, we also designed a custom Instruction Set Architecture (ISA) to reduce the gate count of our processor, along with an entire set of toolchains to ensure that arbitrary C programs can be compiled into our custom ISA. In order to speed up instruction evaluation over VSP, CMUX Memory based ROM and RAM constructions over FHE are also proposed. Our experiments show that both the pipelined architecture and the CMUX Memory technique are effective in improving the performance of the proposed processor. We provide an open-source implementation of VSP which achieves a per-instruction latency of less than 1 second. We demonstrate that compared to the best existing processor over FHE, our implementation runs nearly 1,600$\times$ faster.

...read moreread less

Proceedings Article•DOI•

Design of Parity Preserving Arithmetic and Logic Unit using Reversible Logic Gates

[...]

Sumit Pahuja¹, Gurjit Kaur¹•Institutions (1)

Delhi Technological University¹

25 Jun 2021

TL;DR: In this article, the authors proposed a new design of one-bit Parity Preserving Reversible ALU circuit with low power dissipation using the concept of Reversible Logic computation.

...read moreread less

Abstract: Arithmetic and Logical Unit (ALU) is one of the key aspects of the digital world. In generally, it requires a colossal amount of power. This paper outlines a new design of one-bit Parity Preserving Reversible ALU circuit. To design the ALU with low power dissipation, we have used the concept of Reversible Logic computation. Conventional Circuits dissipate an enormous amount of power due to the loss of information bits in computation, but Reversible Circuits has no data loss as one on one mapping between outputs and inputs which leads to the minimization of power dissipation. In our design, we have used Parity Preserving Reversible Gates which are having Fault tolerance property as fault occurring at internal nodes results in an error at the output. So, Parity Preserving Reversible Gates is the one in which Output Parity remains same as of the Inputs. The aimed ALU has been carried out using Xilinx ISE 14.7 version software in Verilog HDL. To demonstrate the efficiency of proposed Parity Preserving Reversible Arithmetic and Logical Unit, each subpart is shown in terms of various parameters such as Quantum cost, Ancilla Inputs, Gate Count and Garbage Outputs and also a comparison with existing work is shown. The intended design is far better than the existing one because of its fault-tolerant capabilities. The proposed ALU extends its application over DNA mapping, Optical computation, cryptography, nanotechnology, quantum computing and digital signal processing.

...read moreread less

Journal Article•DOI•

VLSI Implementation of a Cost-Efficient Loeffler DCT Algorithm with Recursive CORDIC for DCT-Based Encoder

[...]

Rih-Lung Chung, Chen-Wei Chen, Chiung-An Chen, Patricia Angela R. Abu, Shih-Lun Chen - Show less +1 more

05 Apr 2021-Electronics

TL;DR: Experimental results show that the proposed 2-D DCT spectral analyzer not only achieved a superior average peak signal–noise ratio (PSNR) compared to the previous CORDIC-DCT algorithms but also designed cost-efficient architecture for very large scale integration (VLSI) implementation.

...read moreread less

Abstract: This paper presents a low-cost and high-quality, hardware-oriented, two-dimensional discrete cosine transform (2-D DCT) signal analyzer for image and video encoders In order to reduce memory requirement and improve image quality, a novel Loeffler DCT based on a coordinate rotation digital computer (CORDIC) technique is proposed In addition, the proposed algorithm is realized by a recursive CORDIC architecture instead of an unfolded CORDIC architecture with approximated scale factors In the proposed design, a fully pipelined architecture is developed to efficiently increase operating frequency and throughput, and scale factors are implemented by using four hardware-sharing machines for complexity reduction Thus, the computational complexity can be decreased significantly with only 001 dB loss deviated from the optimal image quality of the Loeffler DCT Experimental results show that the proposed 2-D DCT spectral analyzer not only achieved a superior average peak signal–noise ratio (PSNR) compared to the previous CORDIC-DCT algorithms but also designed cost-efficient architecture for very large scale integration (VLSI) implementation The proposed design was realized using a UMC 018-μm CMOS process with a synthesized gate count of 804 k and core area of 75,100 μm2 Its operating frequency was 100 MHz and power consumption was 417 mW Moreover, this work had at least a 641% gate count reduction and saved at least 225% in power consumption compared to previous designs

...read moreread less

Proceedings Article•DOI•

Synthesis and Implementation of Reconfigurable Reversible Generalized Fredkin Gate

[...]

Oleksii Dovhaniuk¹, V. G. Deibuk¹•Institutions (1)

Chernivtsi University¹

19 May 2021

TL;DR: In this paper, a reconfigurable reversible fault-tolerant gate in the basis of generalized Fredkin gates is presented, and a genetic algorithm is used to optimize the characteristics of the circuit such as number of gates, quantum cost, delay, number of auxiliary inputs (garbage outputs).

...read moreread less

Abstract: The paper represents synthesis of the reconfigurable reversible fault-tolerant gate in the basis of generalized Fredkin gates. The gate is designed for an FPGA (Field Programmable Gate Array). Additionally, a genetic algorithm was used to optimize the characteristics of the circuit such as number of gates, quantum cost, delay, number of auxiliary inputs (garbage outputs). The model of the gate was created and verified in Active-HDL environment. Consequently, comparative analysis showed that the proposed design is fault-tolerant and it has more efficient quantum cost, gate count, and garbage outputs lines in contrast to the results of other authors.

...read moreread less

Posted Content•

Quantum mean value approximator for hard integer value problems

[...]

David Joseph, Antonio Martinez, Cong Ling, Florian Mintert

27 May 2021-arXiv: Quantum Physics

TL;DR: In this article, the quantum mean value problem (QMV) was used to optimize the quantum approximate optimization algorithm and other variational quantum eigensolvers, and it was shown that such an optimization can be improved substantially by using an approximation rather than the exact expectation.

...read moreread less

Abstract: Evaluating the expectation of a quantum circuit is a classically difficult problem known as the quantum mean value problem (QMV). It is used to optimize the quantum approximate optimization algorithm and other variational quantum eigensolvers. We show that such an optimization can be improved substantially by using an approximation rather than the exact expectation. Together with efficient classical sampling algorithms, a quantum algorithm with minimal gate count can thus improve the efficiency of general integer-value problems, such as the shortest vector problem (SVP) investigated in this work.

...read moreread less

DOI•

Clifford Circuit Optimization with Templates and Symbolic Pauli Gates

[...]

Sergey Bravyi¹, Ruslan Shaydulin², Shaohan Hu³, Dmitri Maslov¹•Institutions (3)

IBM¹, Argonne National Laboratory², JPMorgan Chase³

16 Nov 2021

TL;DR: In this article, the problem of finding a short quantum circuit implementing a given Clifford group element is considered, and two methods aim to minimize the entangling gate count assuming all-to-all qubit connectivity.

...read moreread less

Journal Article•DOI•

An energy efficient PVT aware novel CML-TG based Mux-Latch circuit Serializes high rate data

[...]

Alak Majumder, Monalisa Das, Suraj Kumar Saw, Bidyut K. Bhattacharyya¹•Institutions (1)

National Institute of Technology Agartala¹

01 Feb 2021-Microsystem Technologies-micro-and Nanosystems-information Storage and Processing Systems

TL;DR: This work explores a novel configuration of multiplexer embedded with cross-coupled NMOS latch after integrating the Transmission Gate principle with the MOS Current Mode Logic (MCML) to prove the robustness of the proposed Mux-Latch, which is employed to tender a new low gate count and energy efficient variation aware Serializer circuit capable of offering a data rate of as high as 50 Gbit/s.

...read moreread less

Abstract: The high speed wireline communication suffers from a lot of signal quality issues such as jitter and swing, which eventually leads to higher probability of data loss. As the current mode multiplexer, being the integral cell of any transceiver circuit guides to Serialize data in high rate, its arrangement is of utmost importance. This work explores a novel configuration of multiplexer embedded with cross-coupled NMOS latch after integrating the Transmission Gate (TG) principle with the MOS Current Mode Logic (MCML). The proposed configuration reads an average power, delay and power-delay product (PDP) of as tiny as 135.7 μW, 20.16 ps and 2.736 fJ, respectively when simulated for 90 nm CMOS using Cadence Virtuoso at 10 GHz switching frequency and 1 V power supply. The process variation is performed at different corners through Monte-Carlo runs with ‘no skew’ and ‘5% process skew’ variation at both pre-layout and post-layout to prove the robustness of the proposed Mux-Latch, which is employed to tender a new low gate count and energy efficient variation aware Serializer circuit capable of offering a data rate of as high as 50 Gbit/s. The entire circuit is also validated at lower technology nodes like 28 nm UMC.

...read moreread less

Journal Article•DOI•

Hierarchical Methodology Approach to SOC Design: A Comprehensive Look

[...]

Vivek Bhardwaj¹•Institutions (1)

Intel¹

12 Feb 2021-Social Science Research Network

TL;DR: This paper comprehensively describes hierarchical implementation flows and their artifacts like, models, post assembly ECO, context views, feasibility flows among others and looks at advances in the tool technology of timing graph reduction and physical data reduction that enables new flow implementations and can be applied to both hierarchical and flat methodologies.

...read moreread less

Abstract: With the process node (CMOS transistor size) becoming increasingly smaller, there is an ever-increasing need for the software to handle the large gate count with faster turnaround time and better accuracy. Reaching timing closure on multi-million gate VLSI chips using flat flow is infeasible due to hardware capacity limit, and excessive run time overheads. New flows and design implementations need to be introduced to manage the scalability of design sizes by scaling down the memory usage and run time. Traditionally, the flat implementation of designs was used for smaller designs with maximum accuracy. Then the hierarchical approach was introduced to better the run time and handle the millions of gates by partitioning and assembling. Through this paper, we are going to look at the various implementation flows and its artifacts viz. flat, traditional hierarchical, models, post assembly ECO, context views, feasibility flows . We are also going to look at a advances in the technology of timing graph reduction and physical data reduction that enables new flow implementations and can be applied to both hierarchical and flat methodologies.

...read moreread less

Posted Content•

2QAN: A quantum compiler for 2-local qubit Hamiltonian simulation algorithms.

[...]

Lingling Lao, Dan E. Browne

04 Aug 2021-arXiv: Quantum Physics

TL;DR: 2QAN as discussed by the authors is a compiler for 2-local qubit Hamiltonian simulation problems, which includes permutation-aware qubit mapping, qubit routing, gate optimization and scheduling techniques to minimize the compilation overhead.

...read moreread less

Abstract: Simulating quantum systems is one of the most important potential applications of quantum computers to demonstrate its advantages over classical algorithms. The high-level circuit defining the simulation needs to be transformed into one that compiles with hardware limitations such as qubit connectivity and hardware gate set. Many techniques have been developed to efficiently compile quantum circuits while minimizing compilation overhead. However, general-purpose quantum compilers work at the gate level and have little knowledge of the mathematical properties of quantum applications, missing further optimization opportunities. In this work, we exploit one application-level property in Hamiltonian simulation, which is, the flexibility of permuting different operators in the Hamiltonian (no matter whether they commute). We develop a compiler, named 2QAN, to optimize quantum circuits for 2-local qubit Hamiltonian simulation problems, a framework which includes the important quantum approximate optimization algorithm (QAOA). In particular, we propose permutation-aware qubit mapping, qubit routing, gate optimization and scheduling techniques to minimize the compilation overhead. We evaluate 2QAN by compiling three applications (up to 50 qubits) onto three quantum computers that have different qubit topologies and hardware two-qubit gates, namely, Google Sycamore, IBMQ Montreal and Rigetti Aspen. Compared to state-of-the-art quantum compilers, 2QAN can reduce the number of inserted SWAP gates by up to 11.5X, reduce overhead in hardware gate count by up to 30.7X, and reduce overhead in circuit depth by up to 21X. This significant overhead reduction will help improve application performance. Experimental results on the Montreal device demonstrate that benchmarks compiled by 2QAN achieve highest fidelity.

...read moreread less

Journal Article•DOI•

VLSI based Lossless ECG Compression Algorithm Implementation for Low Power Devices

[...]

P G Kuppusamy, R Sureshkumar, S. A. Yuvaraj, E Dilliraj¹•Institutions (1)

Prathyusha Institute of Technology and Management¹

01 Jul 2021

TL;DR: In this paper, the authors presented a VLSI design of an effective electrocardiogram data encoding lossless data compression scheme to conserve disk system to minimize channel capacity, which can save disc space and reduce transfer time.

...read moreread less

Abstract: The research study presents a VLSI design of an effective electrocardiogram data encoding lossless data compression scheme to conserve disk system to minimize channel capacity. As the data compression can save disc space, reduce transfer time, and seized this ability by introducing a memory-less architecture when operating in VLSI at a high data rate. There are two components of the ECG classification technique: an adaptive frequency-domain methodology and bandwidth. An accurate and reduced VLSI compressed algorithm design has been introduced. The current VLSI architecture uses a few more procedures to substitute for the various mathematical functions to enhance performance and implemented the VLSI's architecture to the MIT-BIH atrial fibrillation repository capable of achieving a 2.62 lossless bit compression rate. Also, the VLSI structure uses a gate count of 5.1 K.

...read moreread less

Journal Article•DOI•

Anelegance of novel digital filter using majority logic on pipelined architecturefor SNR improvement in signal processing

[...]

S. Aathilakshmi¹, R. Vimala², K. R. Aravind Britto²•Institutions (2)

M. Kumarasamy College of Engineering¹, PSNA College of Engineering and Technology²

22 Apr 2021-Journal of Ambient Intelligence and Humanized Computing

TL;DR: The proposed innovative carry save adder is constructed by using majority logic and implemented into a digital FIR filter, this new technique consumes low power, delay-free carry circuit and less number of gate counts and is implemented into Biomedical Application for reducing noise and improving signal to noise ratio.

...read moreread less

Abstract: VLSI is an enduring technology which is used to change the entire digital element into autonomy, some real-time opportunities are characterized under Very Large Scale Integration such as low power application, testing, MOS technology etc. This research focused on signal processing in low power VLSI design, in existing system the backend IC fabrication process illustrates system-level design by using digital logic elecment. An existing digital element consist of different functions of adders such as carry select, ripple carry adder, carry skip adder, carry look ahead adder, which has consume more area, delay, and power. To improve the efficiency of digital design a novel majority carry save adder is proposed and incorporate with a structured tree multiplier, this research produce an optimized carry save adder design in digital filter for improve the signal to noise ratio. The proposed innovative carry save adder is constructed by using majority logic and implemented into a digital FIR filter, this new technique consumes low power, delay-free carry circuit and less number of gate counts. The proposed adder has achieved 96 % efficiency in terms of gate count, delay, and power compared with existing analysis. Digital system design produces 83.5 % efficiencyin an existing system and it required the maximum number of gate count and an increasing number of delays when compared with recent research. The design summary is analyzed by using XILINX 14.7 ISE synthesis and the implementation process is highly reached with the help of MATLAB 2018a. The proposed design is implemented into Biomedical Application for reducing noise and improving signal to noise ratio.

...read moreread less