
Showing papers on "Gate count" published in 2018


Journal Article
TL;DR: In this article, a single-shot spin readout using radio-frequency reflectometry on a single gate electrode defining the quantum dot itself has been demonstrated, with an average readout fidelity of 73%.
Abstract: Electron spins in silicon quantum dots provide a promising route towards realising the large number of coupled qubits required for a useful quantum processor. At present, the requisite single-shot spin qubit measurements are performed using on-chip charge sensors, capacitively coupled to the quantum dots. However, as the number of qubits is increased, this approach becomes impractical due to the footprint and complexity of the charge sensors, combined with the required proximity to the quantum dots. Alternatively, the spin state can be measured directly by detecting the complex impedance of spin-dependent electron tunnelling between quantum dots. This can be achieved using radio-frequency reflectometry on a single gate electrode defining the quantum dot itself, significantly reducing gate count and architectural complexity, but thus far it has not been possible to achieve single-shot spin readout using this technique. Here, we detect single electron tunnelling in a double quantum dot and demonstrate that gate-based sensing can be used to read out the electron spin state in a single shot, with an average readout fidelity of 73%. The result demonstrates a key step towards the readout of many spin qubits in parallel, using a compact gate design that will be needed for a large-scale semiconductor quantum processor.

109 citations


Journal Article
TL;DR: In this paper, the authors present two techniques that can greatly reduce the number of gates required to realize an energy measurement, with application to ground state preparation in quantum simulations; the second technique is tailored to lattice models and targeted at reducing the use of generic single-qubit rotations.
Abstract: We present two techniques that can greatly reduce the number of gates required to realize an energy measurement, with application to ground state preparation in quantum simulations. The first technique realizes that to prepare the ground state of some Hamiltonian, it is not necessary to implement the time-evolution operator: any unitary operator which is a function of the Hamiltonian will do. We propose one such unitary operator which can be implemented exactly, circumventing any Taylor or Trotter approximation errors. The second technique is tailored to lattice models, and is targeted at reducing the use of generic single-qubit rotations, which are very expensive to produce by standard fault tolerant techniques. In particular, the number of generic single-qubit rotations used by our method scales with the number of parameters in the Hamiltonian, which contrasts with a growth proportional to the lattice size required by other techniques.

102 citations


Proceedings Article
11 Jan 2018
TL;DR: This paper studies the problem of simulating the time evolution of a lattice Hamiltonian, and proves a matching lower bound on the gate count of such a simulation, showing that any quantum algorithm that can simulate a piecewise constant bounded local Hamiltonian in one dimension to constant error requires $\tilde{\Omega}(nT)$ gates in the worst case.
Abstract: We study the problem of simulating the time evolution of a lattice Hamiltonian, where the qubits are laid out on a lattice and the Hamiltonian only includes geometrically local interactions (i.e., a qubit may only interact with qubits in its vicinity). This class of Hamiltonians is very general and encompasses all physically reasonable Hamiltonians. Our algorithm simulates the time evolution of such a Hamiltonian on $n$ qubits for time $T$ up to error $\epsilon$ using $O(nT\,\mathrm{polylog}(nT/\epsilon))$ gates with depth $O(T\,\mathrm{polylog}(nT/\epsilon))$. Our algorithm is the first simulation algorithm that achieves gate cost quasilinear in $nT$ and polylogarithmic in $1/\epsilon$. Our algorithm also readily generalizes to time-dependent Hamiltonians and yields an algorithm with similar gate count for any piecewise slowly varying time-dependent bounded local Hamiltonian. We also prove a matching lower bound on the gate count of such a simulation, showing that any quantum algorithm that can simulate a piecewise constant bounded local Hamiltonian in one dimension to constant error requires $\tilde{\Omega}(nT)$ gates in the worst case. The lower bound holds even if we only require the output state to be correct on local measurements. To our best knowledge, this is the first nontrivial lower bound on the gate complexity of the simulation problem. Our algorithm is based on a decomposition of the time-evolution unitary into a product of small unitaries using Lieb-Robinson bounds. In the appendix, we prove a Lieb-Robinson bound tailored to Hamiltonians with small commutators between local terms, giving zero Lieb-Robinson velocity in the limit of commuting Hamiltonians. This improves the performance of our algorithm when the Hamiltonian is close to commuting.

65 citations


Journal Article
TL;DR: An end-to-end CNN accelerator is presented that maximizes hardware utilization with run-time configurations of different kernel sizes and minimizes data bandwidth with the output first strategy, improving the data reuse of the convolutional layers by up to $300\times \sim 600\times $ compared with the non-reused case.
Abstract: Hardware design of deep convolutional neural networks (CNNs) faces challenges of high computational complexity and data bandwidth as well as huge divergence in different CNN network layers, in which the throughput of the convolutional layer would be bounded by available hardware resource, and throughput of the fully connected layer would be bounded by available data bandwidth. Thus, a highly flexible and efficient design is desired to meet these needs. This paper presents an end-to-end CNN accelerator that maximizes hardware utilization with run-time configurations of different kernel sizes. It also minimizes data bandwidth with the output first strategy to improve the data reuse of the convolutional layers by up to $300\times \sim 600\times $ compared with the non-reused case. The whole CNN implementation of the target network is generated optimally for both hardware and data efficiency under design resource constraints, which can be run-time reconfigured by the layer optimized parameters to achieve real-time and end-to-end CNN acceleration. An implementation example for AlexNet consumes a 1.783 M gate count for 216 MACs and a 142.64 kb internal buffer with TSMC 40 nm process, and achieves 99.7 and 61.6 f/s under 454 MHz clock frequency for the convolutional layers and the whole AlexNet, respectively.

54 citations


Journal Article
TL;DR: This paper focuses on the circuits composed with global Ising entangling gates and arbitrary addressable single-qubit gates and shows that under certain circumstances the use of global operations can substantially improve the entangling gate count.
Abstract: The disclosure describes various aspects of techniques for using global interactions in efficient quantum circuit constructions. More specifically, this disclosure describes ways to use a global entangling operator to efficiently implement circuitry common to a selection of important quantum algorithms. The circuits may be constructed with global Ising entangling gates (e.g., global Mølmer-Sørensen gates or GMS gates) and arbitrary addressable single-qubit gates. Examples of the types of circuits that can be implemented include stabilizer circuits, Toffoli-4 gates, Toffoli-n gates, quantum Fourier transformation (QFT) circuits, and quantum Fourier adder (QFA) circuits. In certain instances, the use of global operations can substantially improve the entangling gate count.

52 citations


Posted Content
TL;DR: A method for automatically recompiling a quantum circuit A into a target circuit B, with the goal that both circuits have the same action on a specific input, using a recently introduced imaginary-time technique derived from McLachlan's variational principle.
Abstract: We describe a method for automatically recompiling a quantum circuit A into a target circuit B, with the goal that both circuits have the same action on a specific input i.e. A|in> = B|in>. This is of particular relevance to hybrid, NISQ-era algorithms for dynamical simulation or eigensolving. The user initially specifies B as a blank template: a layout of parameterised unitary gates configured to the identity. The compilation then proceeds using quantum hardware to perform an isomorphic energy-minimisation task, and optionally a gate elimination phase to compress the circuit. We use a recently introduced imaginary-time technique derived from McLachlan's variational principle. If the template for B is too shallow for perfect recompilation then the method will result in an approximate solution. As a demonstration we successfully recompile a 7-qubit circuit involving 186 gates of multiple types into an alternative form with a different topology, a far lower two-qubit gate count, and a smaller family of gate types. We test the scaling of our algorithm on up to 20 qubits, recompiling into circuits with up to 400 parameterized gates, and incorporate a novel adaptive timestep technique. We note that a classical simulation of the process can be useful to optimise circuits for today's prototypes, and more generally the method may enable `blind' compilation i.e. harnessing a device whose response to control parameters is deterministic but unknown.

51 citations


Posted Content
TL;DR: A novel set of reversible modular multipliers applicable to quantum computing, derived from three classical techniques: 1) traditional integer division, 2) Montgomery residue arithmetic, and 3) Barrett reduction, is presented.
Abstract: We present a novel set of reversible modular multipliers applicable to quantum computing, derived from three classical techniques: 1) traditional integer division, 2) Montgomery residue arithmetic, and 3) Barrett reduction. Each multiplier computes an exact result for all binary input values, while maintaining the asymptotic resource complexity of a single (non-modular) integer multiplier. We additionally conduct an empirical resource analysis of our designs in order to determine the total gate count and circuit depth of each fully constructed circuit, with inputs as large as 2048 bits. Our comparative analysis considers both circuit implementations which allow for arbitrary (controlled) rotation gates, as well as those restricted to a typical fault-tolerant gate set.

27 citations


Proceedings Article
29 Sep 2018
TL;DR: It is shown how Clifford+T circuits can efficiently be mapped into the two IBM quantum computers with 5 qubits and shown that the optimized circuits can considerably reduce the gate count and number of levels and thus produce results with better fidelity.
Abstract: IBM has made several quantum computers available to researchers around the world via cloud services. Two architectures with five qubits, one with 16, and one with 20 qubits are available to run experiments. The IBM architectures implement gates from the Clifford+T gate library. However, each architecture only implements a subset of the possible CNOT gates. In this paper, we show how Clifford+T circuits can efficiently be mapped into the two IBM quantum computers with 5 qubits. We further present an algorithm and a set of circuit identities that may be used to optimize the Clifford+T circuits in terms of gate count and number of levels. It is further shown that the optimized circuits can considerably reduce the gate count and number of levels and thus produce results with better fidelity.

26 citations
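
The abstract above does not list its circuit identities. As an illustration only, the sketch below checks one textbook identity that is commonly used when a device provides a CNOT in just one direction: conjugating the CNOT with Hadamards on both qubits swaps control and target. The matrix names and the basis ordering are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

# Hadamard gate on one qubit.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

# Two-qubit basis order |q0 q1>, index = 2*q0 + q1 (an assumption of this sketch).
# CNOT with control q0 and target q1.
CNOT_01 = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
# CNOT with control q1 and target q0.
CNOT_10 = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]], dtype=float)

# Hadamard on both qubits, before and after the available CNOT.
HH = np.kron(H, H)
reversed_cnot = HH @ CNOT_01 @ HH

# True if the five-gate sequence equals the CNOT in the missing direction.
identity_holds = np.allclose(reversed_cnot, CNOT_10)
print(identity_holds)
```

Each reversal of this kind costs four extra single-qubit gates, which is why removing unnecessary reversals reduces both gate count and the number of levels.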


Proceedings Article
08 May 2018
TL;DR: The first design is the smallest AES S-box to date, breaking Canright's 13-year implementation record, and new logic-minimization heuristics that outperform the previous algorithms of Boyar-Peralta are proposed.
Abstract: Canright S-box has been known as the most compact S-box design since its introduction back in CHES’05. Boyar-Peralta proposed logic-minimization heuristics that could reduce the gate count of Canright S-box from 120 gates to 113 gates, however synthesis results did not reflect much improvement. In CHES’15, Ueno et al. proposed an S-box that has a slightly higher area, but is significantly faster than the previous designs, hence it was the most efficient (measured by area×delay) S-box implementation to date. In this paper, we propose two new designs for the AES S-box. One design has a smaller implementation area than both Canright and the 113-gate S-boxes. Hence, our first design is the smallest AES S-box to date, breaking the 13-year implementation record of Canright. The second design is faster and smaller than the Ueno S-box. Hence, our second design is both the fastest and the most efficient S-box design to date. While doing so, we also propose new logic-minimization heuristics that outperform the previous algorithms of Boyar-Peralta. Finally, we conduct an exhaustive evaluation of each and every block in the S-box circuit, using both structural and behavioral HDL modeling, to reach the optimum synergy between theoretical algorithms and technology-supported optimization tools. We show that involving the technology-supported CAD tools in the analysis results in several counter-intuitive results.

24 citations
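
For readers unfamiliar with what these circuits compute, the following is a minimal functional sketch of the AES S-box: a multiplicative inverse in GF(2^8) followed by the standard affine transform. It is a plain software reference model, not the gate-level design from the paper (compact hardware S-boxes are normally built from logic-minimized circuits rather than from a loop like this). The helper names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce when the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8), with 0 mapped to 0 (computed as a^254)."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x):
    """AES S-box: field inversion followed by the affine transform with constant 0x63."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


table = [sbox(x) for x in range(256)]
print(hex(table[0x00]), hex(table[0x53]))
```

A reference table like this is useful as a golden model when checking that a minimized gate-level netlist still implements the same function for all 256 inputs.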


Journal Article
TL;DR: This article implements PASM in a weight-shared CNN convolution hardware accelerator and analyzes its effectiveness, showing that at a clock speed of 1 GHz on a 45 nm ASIC process, this approach results in fewer gates, smaller logic, and reduced power with only a slight increase in latency.
Abstract: Convolutional neural networks (CNNs) are one of the most successful machine-learning techniques for image, voice, and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for CNNs that typically contain large numbers of multiply-accumulate (MAC) units, the multipliers of which are large in integrated circuit (IC) gate count and power consumption. “Weight-sharing” accelerators have been proposed where the full range of weight values in a trained CNN are compressed and put into bins, and the bin index is used to access the weight-shared value. We reduce power and area of the CNN by implementing parallel accumulate shared MAC (PASM) in a weight-shared CNN. PASM re-architects the MAC to instead count the frequency of each weight and place it in a bin. The accumulated value is computed in a subsequent multiply phase, significantly reducing gate count and power consumption of the CNN. In this article, we implement PASM in a weight-shared CNN convolution hardware accelerator and analyze its effectiveness. Experiments show that at a clock speed of 1 GHz on a 45 nm ASIC process, our approach results in fewer gates, smaller logic, and reduced power with only a slight increase in latency. We also show that the same weight-shared-with-PASM CNN accelerator can be implemented in resource-constrained FPGAs, where the FPGA has limited numbers of digital signal processor (DSP) units to accelerate the MAC operations.

22 citations
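
The two-phase idea described in the abstract above can be sketched in software. This is one reading of the abstract, not the paper's hardware: the accumulate phase adds each input into the bin selected by its weight index, and a short multiply phase then applies each shared weight once. Function names and the sample data are hypothetical.

```python
def weight_shared_mac(inputs, bin_indices, shared_weights):
    """Conventional weight-shared MAC: one multiply per input value."""
    return sum(x * shared_weights[b] for x, b in zip(inputs, bin_indices))


def pasm(inputs, bin_indices, shared_weights):
    """Accumulate inputs per weight bin first, then multiply once per bin."""
    accumulators = [0] * len(shared_weights)
    # Accumulate phase: additions only, no multiplier needed.
    for x, b in zip(inputs, bin_indices):
        accumulators[b] += x
    # Multiply phase: one multiplication per shared weight.
    return sum(acc * w for acc, w in zip(accumulators, shared_weights))


inputs = [3, 1, 4, 1, 5, 9]
bin_indices = [0, 1, 0, 2, 1, 2]   # which shared weight each input uses
shared_weights = [2, -1, 3]

mac_result = weight_shared_mac(inputs, bin_indices, shared_weights)
pasm_result = pasm(inputs, bin_indices, shared_weights)
print(mac_result, pasm_result)
```

Both functions return the same sum, but the second needs only as many multiplications as there are bins, which is the source of the gate-count saving when the number of bins is small compared with the number of inputs.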


Journal Article
TL;DR: The adder-based architecture is explored to reduce the hardware consumption of performing scalar multiplication (SM) and the Interleaved Modular Multiplication Algorithm and Binary Modular Inversion Algorithm are improved and implemented with two full-word adder units.
Abstract: In this paper, a low hardware consumption design of elliptic curve cryptography (ECC) over GF(p) in embedded applications is proposed. The adder-based architecture is explored to reduce the hardware consumption of performing scalar multiplication (SM). The Interleaved Modular Multiplication Algorithm and Binary Modular Inversion Algorithm are improved and implemented with two full-word adder units. The full-word register units for data storage are also optimized. The design is based on two full-word adder units and twelve full-word register units of pipeline structure and was implemented on Xilinx Virtex-4 platform. Design Compiler is used to synthesize the proposed architecture with a 0.13 μm CMOS standard cell library. For field orders of 160, 192, 224, and 256, the proposed architecture consumes 5595, 7080, 8423, and 9370 slices, respectively, and saves 17.58∼54.93% slice resources on FPGA platform when compared with other design architectures. The synthesized result uses 35.43 k, 43.37 k, 50.38 k, and 57.05 k gate area and saves 52.56∼91.34% in terms of gate count in comparison. The design takes 2.56∼4.07 ms to perform SM operation over different field orders under 150 MHz frequency. The proposed architecture is safe from simple power analysis (SPA). Thus, it is a good choice for embedded applications.
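
As background for why an adder-based datapath suffices, here is a minimal sketch of the textbook interleaved modular multiplication, which uses only shifts, additions, and subtractions. It is the standard form of the algorithm, not the improved variant from the paper, and the function name and sample modulus are assumptions made for illustration.

```python
def interleaved_modmul(x, y, m, bits):
    """Return (x * y) mod m for 0 <= x, y < m, scanning x from its top bit down.

    `bits` must be at least the bit length of x.
    """
    p = 0
    for i in reversed(range(bits)):
        p <<= 1                 # doubling: p < 2m
        if (x >> i) & 1:
            p += y              # conditional add: p < 3m
        # At most two subtractions bring p back below m.
        if p >= m:
            p -= m
        if p >= m:
            p -= m
    return p


# Example with a 160-bit modulus chosen only for illustration.
m = 2**160 - 2**31 - 1
x = 0x1234567890ABCDEF1234567890ABCDEF12345678
y = 0x0FEDCBA0987654321FEDCBA0987654321FEDCBA0
product = interleaved_modmul(x, y, m, 160)
print(product == (x * y) % m)
```

Because the partial result never exceeds three times the modulus, every step fits in a full-word adder with a small guard margin, and no multiplier or divider is needed.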

Journal Article
TL;DR: A novel reversible gate design of 128-bit Advanced Encryption Standard (AES) cryptographic algorithm is presented and shows considerable improvements in the performance metrics when compared to existing designs.
Abstract: The quantum of power consumption in wireless sensor nodes plays a vital role in power management since a larger number of functional elements are integrated in a smaller space and operated at very high frequencies. In addition, the variations in the power consumption pave the way for power analysis attacks in which the attacker gains control of the secret parameters involved in the cryptographic implementation embedded in the wireless sensor nodes. Hence, a strong countermeasure is required to provide adequate security in these systems. Traditional digital logic gates are used to build the circuits in wireless sensor nodes and the primary reason for their power consumption is the absence of the reversibility property in those gates. These irreversible logic gates consume power as heat due to the loss of per bit information. In order to minimize the power consumption and in turn to circumvent the issues related to power analysis attacks, reversible logic gates can be used in wireless sensor nodes. This shifts the focus from power-hungry irreversible gates to potentially powerful circuits based on controllable quantum systems. Reversible logic gates theoretically consume zero power and have an accurate quantum circuit model for practical realization such as quantum computers and implementations based on quantum dot cellular automata. One of the key components in wireless sensor nodes is the cryptographic algorithm implementation which is used to secure the information collected by the sensor nodes. In this work, a novel reversible gate design of 128-bit Advanced Encryption Standard (AES) cryptographic algorithm is presented. The complete structure of AES algorithm is designed by using combinational logic circuits and further they are mapped to reversible logic circuits. The proposed architectures make use of Toffoli family of reversible gates. The performance metrics such as gate count and quantum cost of the proposed designs are rigorously analyzed with respect to the existing designs and are properly tabulated. Our proposed reversible design of AES algorithm shows considerable improvements in the performance metrics when compared to existing designs.

Posted Content
TL;DR: This work discusses the complexity of compilation with a particular focus on the search space structure and decomposes the compilation problem into three combinatorial subproblems for which heuristics can be determined.
Abstract: Noisy, intermediate-scale quantum (NISQ) computers are expected to execute quantum circuits of up to a few hundred qubits. The circuits have to satisfy certain constraints concerning the placement and interactions of the involved qubits. Hence, a compiler takes an input circuit not conforming to a NISQ architecture and transforms it to a conforming output circuit. NISQ hardware is faulty and insufficient to implement computational fault-tolerance, such that computation results will be faulty, too. Accordingly, compilers need to optimise the depth and the gate count of the compiled circuits, because these influence the aggregated computation result error. This work discusses the complexity of compilation with a particular focus on the search space structure. The presented analysis decomposes the compilation problem into three combinatorial subproblems for which heuristics can be determined. The search space structure is the result of analysing jointly the gate sequence of the input circuit and its influence on how qubits have to be mapped to a NISQ architecture. These findings support the development of future NISQ compilers.

Journal Article
TL;DR: Experimental results illustrate that the proposed DBF and SAO architecture decreases the processing cycles required for processing each large coding unit compared with the state-of-the-art literature, at the cost of an increased gate count (including memory).
Abstract: This paper aims to design an efficient mixed serial five-stage pipeline processing hardware architecture of deblocking filter (DBF) and sample adaptive offset (SAO) filter for a high efficiency video coding decoder. The proposed hardware is designed to increase the throughput and reduce the number of clock cycles by processing the pixels in a stream of ${4 \times 36}$ samples in which edge filters are applied vertically in a parallel fashion for processing of luma/chroma samples. Subsequently these filtered pixels are transposed and reprocessed through the vertical filter for horizontal filtering in a pipeline fashion. Finally, the filtered block is transposed back to the original orientation and forwarded to a three-stage pipeline SAO filter. The proposed architecture is implemented in field programmable gate array and application specific integrated circuit platforms using a 90-nm library. Experimental results illustrate that the proposed DBF and SAO architecture decreases the processing cycles (172) required for processing each ${64 \times 64}$ or large coding unit compared with the state-of-the-art literature, with an increase of gate count (593.32K) including memory. The results show that the throughput of the proposed filter can successfully decode ultrahigh definition video sequences at 200 frames/s at 341 MHz.

Journal Article
TL;DR: A novel LLR representation scheme is proposed so that the kernel processing can be realized in low-complexity and high-speed circuitry and the decoder is 1.34 times superior to the previous state-of-the-art decoder.
Abstract: This brief presents an efficient architecture of the polar decoder that employs the successive-cancellation (SC) decoding algorithm. In the SC decoding algorithm, each bit is decoded successively by recursively calculating the log likelihood ratio (LLR) based on two kernels. This brief proposes a novel LLR representation scheme so that the kernel processing can be realized in low-complexity and high-speed circuitry. A 1024-bit polar decoder was designed and implemented based on the proposed scheme using a ${0.18~\mu }\text{m}$ CMOS process. Its throughput is ${252R}$ Mb/s for the rate- ${R}$ code, and the gate count is 256K. By the proposed LLR representation scheme, the decoding speed is increased by 18% while the gate count is not increased when compared to the same decoder designed with the signed-magnitude scheme. In terms of the throughput efficiency, the proposed decoder is 1.34 times superior to the previous state-of-the-art decoder.
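
The "two kernels" mentioned in the abstract above are usually called f and g in the successive-cancellation literature. The sketch below shows their common min-sum form and a length-2 decoding step. It assumes the convention LLR = log(P(0)/P(1)) and does not reproduce the paper's novel LLR representation; all names are hypothetical.

```python
def f_kernel(a, b):
    """Min-sum f kernel: sign(a) * sign(b) * min(|a|, |b|)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g_kernel(a, b, u):
    """g kernel: b + (1 - 2u) * a, where u is the previously decoded bit."""
    return b - a if u else b + a


def decode_length2(llr1, llr2):
    """Successive-cancellation decoding of a length-2 polar code (no frozen bits).

    The code maps (u1, u2) to (x1, x2) = (u1 XOR u2, u2).
    """
    u1 = 0 if f_kernel(llr1, llr2) >= 0 else 1
    u2 = 0 if g_kernel(llr1, llr2, u1) >= 0 else 1
    return u1, u2


# Channel LLRs suggesting x1 = 0 and x2 = 1, i.e. (u1, u2) = (1, 1).
decoded = decode_length2(2.0, -3.0)
print(decoded)
```

In hardware, f needs a comparator and sign logic while g needs an adder and subtractor, so the number format chosen for the LLRs directly affects both kernel delay and gate count.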

Posted Content
TL;DR: A systematic procedure is used to obtain optimized circuits (circuits having reduced gate count and number of levels) for a large number of Clifford+T circuits which have already been implemented in the IBM quantum computers.
Abstract: Recently, various quantum computing and communication tasks have been implemented using IBM's superconductivity-based quantum computers which are available on the cloud. Here, we show that the circuits used in most of those works were not optimized and the use of the optimized circuits can considerably improve the possibility of observing unique features of quantum mechanics. Specifically, a systematic procedure is used here to obtain optimized circuits (circuits having reduced gate count and number of levels) for a large number of Clifford+T circuits which have already been implemented in the IBM quantum computers. Optimized circuits implementable in IBM quantum computers are also obtained for a set of reversible benchmark circuits. With a clear example, it is shown that the reduction in circuit costs enhances the fidelity of the output state (with respect to the theoretically expected state in the absence of noise), as fewer gates and levels introduce fewer errors during evolution of the state. Further, considering the Mermin inequality as an example, it is shown that the violation of the classical limit is enhanced when we use an optimized circuit. Thus, the approach adopted here can be used to identify relatively weaker signatures of quantumness and also to establish quantum supremacy in a stronger manner.

Journal Article
TL;DR: The evaluation results show that the approximated LBP values generated by the proposed LBP circuit can achieve comparable classification accuracy with the primitive LBP method.
Abstract: In the field of computer vision, local binary pattern (LBP) is one of the most popular feature extraction methods and has been used in many object detection frameworks. To efficiently extract LBP features in high-resolution images, hardware architecture is needed to disperse CPU burden and to improve the entire object detection performance. In this paper, a hardware implementation of an approximated LBP method with adjustable parameters is introduced. For simulation, Taiwan Semiconductor Manufacturing Company $0.18~\mu \text{m}$ technology is used to implement the LBP hardware, and the hardware can achieve 500 MHz with a lower gate count than the previous study. The proposed LBP circuit is applied to the pedestrian classification application and the evaluation results show that the approximated LBP values generated by our circuit can achieve comparable classification accuracy with the primitive LBP method. Additionally, the proposed LBP hardware provides adjustable parameters to fit different applications while requiring lower hardware cost compared with the existing work.
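
For reference, the primitive LBP that the abstract uses as a baseline can be sketched as follows for a single 3x3 neighbourhood. The approximated, parameterised variant from the paper is not reproduced here. The clockwise bit ordering starting at the top-left neighbour is an assumption of this sketch, since implementations differ on that choice.

```python
def lbp_code(patch):
    """Return the 8-bit local binary pattern of the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left corner.
    neighbours = [
        patch[0][0], patch[0][1], patch[0][2],
        patch[1][2],
        patch[2][2], patch[2][1], patch[2][0],
        patch[1][0],
    ]
    code = 0
    for bit, value in enumerate(neighbours):
        # Set the bit when the neighbour is at least as bright as the centre.
        if value >= center:
            code |= 1 << bit
    return code


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch)
print(code)
```

Each pixel needs eight comparisons and no arithmetic beyond that, which is why LBP maps naturally onto small comparator-based hardware.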

Journal Article
TL;DR: Experiments on some benchmark MCT circuits indicate that the direct mapping algorithm results in fewer additional SWAP gates in about 50%, while the average improvement rate in quantum cost is 16.95% compared to the decomposition-based method.
Abstract: In recent years, quantum computing research has been attracting more and more attention, but few studies on the limited interaction distance between quantum bits (qubit) are deeply carried out. This paper presents a mapping method for transforming multiple-control Toffoli (MCT) circuits into linear nearest neighbor (LNN) quantum circuits instead of traditional decomposition-based methods. In order to reduce the number of inserted SWAP gates, a novel type of gate with the optimal LNN quantum realization was constructed, namely NNTS gate. The MCT gate with multiple control bits could be better cascaded by the NNTS gates, in which the arrangement of the input lines was LNN arrangement of the MCT gate. Then, the communication overhead measurement model on inserted SWAP gate count from the original arrangement to the new arrangement was put forward, and we selected one of the LNN arrangements with the minimum SWAP gate count. Moreover, the LNN arrangement-based mapping algorithm was given, and it dealt with the MCT gates in turn and mapped each MCT gate into its LNN form by inserting the minimum number of SWAP gates. Finally, some simplification rules were used, which can further reduce the final quantum cost of the LNN quantum circuit. Experiments on some benchmark MCT circuits indicate that the direct mapping algorithm results in fewer additional SWAP gates in about 50%, while the average improvement rate in quantum cost is 16.95% compared to the decomposition-based method. In addition, it has been verified that the proposed method has greater superiority for reversible circuits cascaded by MCT gates with more control bits.

Journal Article
TL;DR: The results on FPGA show that compressor-based converters and multipliers have lower propagation delay with a slight increase in hardware resources; in the ASIC implementation, the delay of the compressor-based converter is equivalent to that of the conventional converter, with a slight increase in gate count.

Proceedings Article
01 Jan 2018
TL;DR: This paper shows how the same effect can be achieved with fewer gates, where a reduction of 44% in the gate count and a 26% reduction in the number of levels for IBM's QX5 computer is achieved.
Abstract: IBM’s quantum computers implement gates from the Clifford+T gate library. All single qubit gates are implemented, but only a subset of the possible CNOT gates are provided. It is well known that the functionality of the missing gates can be achieved by a sequence of gates. The sequence of gates is based on SWAP gates. Up to seven elementary gates are required to implement a SWAP gate. In this paper we show how the same effect can be achieved with fewer gates. To show the potential of the proposed transformations, an example is presented where a reduction of 44% in the gate count and a 26% reduction in the number of levels for IBM’s QX5 computer is achieved. An algorithm that is considered state of the art is used for the comparison.
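
One common way to arrive at a seven-gate SWAP, assumed here because the abstract does not spell out its count, is the three-CNOT decomposition in which the middle CNOT points in the unavailable direction and is therefore wrapped in four Hadamards. The sketch below multiplies the gate matrices to confirm that the sequence equals SWAP. The basis ordering and names are assumptions of this sketch.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)

# Basis order |q0 q1>, index = 2*q0 + q1. Only the q0 -> q1 CNOT is "available".
CNOT_01 = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)

# Gates in the order they are applied to the state.
gates = [
    ("CNOT q0->q1", CNOT_01),
    ("H on q0", np.kron(H, I2)),
    ("H on q1", np.kron(I2, H)),
    ("CNOT q0->q1", CNOT_01),      # acts as CNOT q1->q0 between the Hadamards
    ("H on q0", np.kron(H, I2)),
    ("H on q1", np.kron(I2, H)),
    ("CNOT q0->q1", CNOT_01),
]

circuit = np.eye(4)
for _, gate in gates:
    circuit = gate @ circuit       # later gates multiply from the left

gate_count = len(gates)
is_swap = np.allclose(circuit, SWAP)
print(gate_count, is_swap)
```

When both CNOT directions exist between the two qubits, the four Hadamards disappear and the SWAP costs three gates, which is the kind of saving that topology-aware transformations can exploit.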

Dissertation
24 Feb 2018
TL;DR: Two low-complexity, low-area cryptographic processors based on the ultimate reduced instruction set computer (URISC) are created to provide security features for wireless visual sensor networks (WVSN) by using field-programmable gate array (FPGA) based visual processors typically used in RCEs.
Abstract: RCE (Resource Constrained Environment) is known for its stringent hardware design requirements. With the rise of Internet of Things (IoT), low-complexity and low-area designs are becoming prominent in the face of complex security threats. Two low-complexity, low-area cryptographic processors based on the ultimate reduced instruction set computer (URISC) are created to provide security features for wireless visual sensor networks (WVSN) by using field-programmable gate array (FPGA) based visual processors typically used in RCEs. The first processor is the Two Instruction Set Computer (TISC) running the Skipjack cipher. To improve security, a Compact Instruction Set Architecture (CISA) processor running the full AES with modified S-Box was created. The modified S-Box achieved a gate count reduction of 23% with no functional compromise compared to Boyar’s. Using the Spartan-3L XC3S1500L-4-FG320 FPGA, the implementation of the TISC occupies 71 slices and 1 block RAM. The TISC achieved a throughput of 46.38 kbps at a stable 24 MHz clock. The CISA, which occupies 157 slices and 1 block RAM, achieved a throughput of 119.3 kbps at a stable 24 MHz clock. The CISA processor is demonstrated in two main applications, the first in a multilevel, multi cipher architecture (MMA) with two modes of operation, (1) by selecting cipher programs (primitives) and sharing crypto-blocks, (2) by using simple authentication, key renewal schemes, and showing perceptual improvements over direct AES on images. The second application demonstrates the use of the CISA processor as part of a selective encryption architecture (SEA) in combination with the million instructions per second set partitioning in hierarchical trees (MIPS SPIHT) visual processor. The SEA is implemented on a Celoxica RC203 Virtex XC2V3000 FPGA occupying 6251 slices and a visual sensor is used to capture real world images. Four image frames were captured from a camera sensor, compressed, selectively encrypted, and sent over to a PC environment for decryption. The final design emulates a working visual sensor, from on-node processing and encryption to back-end data processing on a server computer.

Journal ArticleDOI
TL;DR: It is observed that the proposed design achieves better performance in terms of hardware complexity and normalised energy for the given specifications.
Abstract: This study presents a variable-length multi-path delay commutator fast Fourier transform (FFT)/inverse FFT (IFFT) architecture for a multiple input multiple output orthogonal frequency division multiplexing system. It supports FFT/IFFT lengths of 512/256/128/64 samples to process each symbol carried by eight spatial streams and achieves a speed of 160 MHz to meet the IEEE 802.11ac timing requirements. A resource scheduling methodology to minimise the hardware complexity of the design is proposed and adopted in the architecture presented. A novel stagger word length strategy is also proposed and applied to achieve better accuracy with less hardware. Here, a signal to quantisation noise ratio of 57.23 dB is obtained. The twiddle coefficient storage space is significantly compressed so that coefficients are generated with reduced hardware. The design is implemented using the TSMC 65 nm complementary metal oxide semiconductor technology with a supply voltage of 1 V at 160 MHz. The implementation results show that the architecture has a gate count of 348,013 with power consumption of 105.1 mW and area of 0.492 mm². The hardware complexity and performance of the design are compared with earlier reported architectures. It is observed that the proposed design achieves better performance in terms of hardware complexity and normalised energy for the given specifications.

Proceedings ArticleDOI
01 Mar 2018
TL;DR: In the proposed circuit, a Dual Rail Signal System (DRPTL) is implemented with the load condition and the clock signal to manage the power flow in the circuit, and the process is performed efficiently in terms of gate count and thereby power and speed.
Abstract: With the revolution in integrated circuits, great emphasis has been placed on performance and miniaturization. Speed, area and power have become the main criteria by which the efficiency of a VLSI system is measured. In any VLSI system, the full adder is a widely used component that also determines the performance of the system. The adder circuit is designed to achieve low power and low delay, and the choice of logic gates in the circuit improves performance. For high-speed operation, a high-speed logic circuit with low propagation delay is implemented. The hybrid CMOS design style uses various adder cells and transistors, but in the proposed circuit a Dual Rail Signal System (DRPTL) is implemented with the load condition and the clock signal to manage the power flow in the circuit, and the process is performed efficiently in terms of gate count and thereby power and speed.

Proceedings ArticleDOI
01 Mar 2018
TL;DR: This paper explores FPGA designs for two of the most important field primitives namely multiplication and inverse, and proposes a novel finite field multiplier based on the recursive Karatsuba algorithm that obtains the best area time product.
Abstract: The current era has seen explosive growth in communications. Many applications, such as internet banking, personal digital assistants, mobile communication, and smart cards, need security in resource-constrained environments. Elliptic curve cryptography (ECC) is an excellent cryptographic tool because of its security and its smaller key sizes compared with other public key algorithms. Its efficiency is largely affected by the underlying arithmetic primitives. This paper explores FPGA designs for two of the most important field primitives, namely multiplication and inverse. The smallest programmable entity in an FPGA is the lookup table (LUT). A novel finite field multiplier based on the recursive Karatsuba algorithm is proposed. The proposed multiplier combines two variants of Karatsuba. The general Karatsuba multiplier has a large gate count, but for small multiplications it is compact because it utilizes LUT resources efficiently. For large multiplications, the simple Karatsuba is efficient because it requires fewer gates. The proposed hybrid multiplier uses the simple algorithm for the initial recursions and performs the small final multiplications with the general algorithm. Compared with the reported literature, this multiplier obtains the best area-time product.
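
The recursive structure described in this abstract can be illustrated with a minimal Python sketch of Karatsuba multiplication over GF(2)[x], where polynomials are stored as integers (bit i is the coefficient of x^i). This is not the paper's hardware design: the function names and the `threshold` parameter are hypothetical, the base case uses plain shift-and-XOR multiplication as a stand-in for the "general" Karatsuba variant, and reduction modulo the field polynomial is omitted.

```python
def clmul_schoolbook(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x] by shift-and-XOR (stand-in base case)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def clmul_karatsuba(a: int, b: int, bits: int, threshold: int = 8) -> int:
    """Recursive Karatsuba product of two polynomials with `bits` coefficients.

    Assumes a, b < 2**bits and threshold >= 1 (hypothetical parameter).
    """
    if bits <= threshold:
        return clmul_schoolbook(a, b)
    half = (bits + 1) // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    # Three half-size products instead of four.
    lo = clmul_karatsuba(a_lo, b_lo, half, threshold)
    hi = clmul_karatsuba(a_hi, b_hi, bits - half, threshold)
    mid = clmul_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, half, threshold)
    # Addition and subtraction in GF(2) are both XOR.
    return (hi << (2 * half)) ^ ((mid ^ lo ^ hi) << half) ^ lo
```

Each recursion level replaces four half-size products with three, which is where the gate savings for large operands come from; the threshold marks the size below which a different multiplier is used.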

Journal ArticleDOI
TL;DR: The compressed beamforming weights (CBWs) feedback is used in the IEEE 802.11n/ac WLAN, an example of the practical beamforming multiple input multiple output-orthogonal frequency division multiplexing system, and this architecture outperforms one earlier architectural design to compute the CBWs.
Abstract: The compressed beamforming weights (CBWs) feedback is used in the IEEE 802.11n/ac WLAN, an example of the practical beamforming multiple input multiple output-orthogonal frequency division multiplexing system, to reduce the amount of feedback information so that the beamformee can respond rapidly to the beamformer. The CBW associated with each sub-carrier includes the quantized angles obtained from QR-decomposition (QRD) of the right singular vectors of each corresponding channel matrix. Efficient matrix QRD and singular value decomposition (SVD) together are therefore desirable for computing the CBWs associated with all sub-carriers. Considering the exemplary antenna configuration of 4 beamformer and 2 beamformee antennas, we propose to apply the same matrix triangulation to compute the SVD of a 2-by-4 matrix and to compute the QRD of a 4-by-2 matrix. We can achieve gate count reduction by exploiting only one matrix triangulation module in our architecture. The VLSI implementation results under the TSMC 90-nm CMOS technology reveal that our architecture requires 194.25K gates while operating at frequency 200.75 MHz. Additionally, with better normalized matrix throughput and gate efficiency, our architecture outperforms one earlier architectural design to compute the CBWs.

Proceedings ArticleDOI
01 Sep 2018
TL;DR: A high-performance and power-efficient VLSI architecture for the PRESENT block cipher and its integration in a system-on-chip (SoC) environment and an ASIC implementation of the architecture is done in SCL 180 nm technology for its usage as an intellectual-property (IP) core for SoCs.
Abstract: The essence of internet-of-things (IoT) and cyber-physical systems (CPS) infrastructures is primarily based on privacy and security of communicated data. In these resource-constrained applications, lightweight cryptography plays a vital role for data security. In this paper, we propose a high-performance and power-efficient VLSI architecture for the PRESENT block cipher and its integration in a system-on-chip (SoC) environment. The architecture is based on an 8-bit datapath and requires 48 clock cycles for processing of 64-bit plaintext and 128-bit key. When implemented on a Xilinx Virtex-5 xc5vlx50-1ff324 FPGA device, it consumes 84 slices, provides 379.78 MHz maximum frequency, and 506.37 Mbps of throughput. Dynamic power consumption is 36.57 mW, energy 57.95 nJ, and energy/bit is 0.91 nJ/bit. In comparison to an existing architecture, the proposed architecture provides improved performance. Further, an ASIC implementation of the architecture is done in SCL 180 nm technology for its usage as an intellectual-property (IP) core for SoCs. Gate count of the ASIC implementation is 1785 GE, area 1.55 mm², and it can be operated up to 448 MHz clock frequency.
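
The FPGA throughput and energy-per-bit figures quoted in this abstract are consistent with simple arithmetic on the other reported numbers. The short Python sketch below reproduces them, assuming that throughput is the clock frequency times the block size divided by the cycles per block, and that the 57.95 nJ energy figure refers to one 64-bit block; both assumptions are inferred from the numbers rather than stated in the abstract, and the function names are hypothetical.

```python
def throughput_mbps(clock_mhz: float, block_bits: int, cycles_per_block: int) -> float:
    """Throughput in Mbps for a block cipher core processing one block at a time."""
    return clock_mhz * block_bits / cycles_per_block


def energy_per_bit_nj(energy_nj: float, block_bits: int) -> float:
    """Energy per processed bit, assuming energy_nj covers one full block."""
    return energy_nj / block_bits


tp = throughput_mbps(clock_mhz=379.78, block_bits=64, cycles_per_block=48)
epb = energy_per_bit_nj(energy_nj=57.95, block_bits=64)
print(f"{tp:.2f} Mbps")    # 506.37 Mbps
print(f"{epb:.2f} nJ/bit")  # 0.91 nJ/bit
```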

Journal ArticleDOI
TL;DR: A deterministic approach based on the currently known lower and upper bounds of multiplicative complexity is proposed for logic minimization problems with not more than five inputs, and the quality of the results produced is comparable to, and in some cases better than, the results reported in previous works using the same heuristic.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A comparative study for BDD reordering algorithms in terms of the cost of the generated reversible circuit and a proposal for a new framework for reversible logic synthesis are presented.
Abstract: As billions of transistors are being placed on a few square millimeters of silicon, power dissipation is becoming a more crucial factor to be tackled for high performance computing. Reversible circuit synthesis has been considered as a promising direction for low power design due to its information lossless behavior. In addition, it forms the basis for quantum computing. However, synthesis of reversible circuits cannot be achieved with the classical approaches for irreversible logic due to the additional imposed constraints, consequently, neither fan-out nor feedback are allowed. Binary Decision Diagrams (BDDs) have been proposed as a compact data structure to represent a boolean function. They have been exploited to synthesize reversible circuits through proper mapping of each BDD’s node into a cascade of reversible Toffoli gates. Nevertheless, reordering of BDD’s nodes before circuit synthesis significantly impacts the overall cost of the synthesized circuit. In this paper, we present a comparative study for BDD reordering algorithms in terms of the cost of the generated reversible circuit. The studied algorithms include greedy, dynamic programming, and heuristic based approaches. The cost metric includes the number of lines, gate count, and quantum cost. Experimental results show that meta heuristic-based BDD reordering algorithms outperform other algorithms in terms of the overall synthesized circuit cost with slightly additional runtime. Thereafter, we conclude with a proposal for a new framework for reversible logic synthesis.
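
The three cost metrics named in this abstract (number of lines, gate count, and quantum cost) can be tallied for a Toffoli cascade with a minimal Python sketch. It does not perform BDD-based synthesis or reordering. The gate encoding, function names, and example circuit are hypothetical, and the quantum cost table uses the values commonly quoted in the reversible logic literature for NOT, CNOT, and two-control Toffoli gates, summed per gate without any further optimization.

```python
from typing import Dict, List, Tuple

# A gate is (control lines, target line); line k is bit k of the state integer.
Gate = Tuple[Tuple[int, ...], int]

# Commonly quoted costs, keyed by number of controls: NOT, CNOT, Toffoli.
QUANTUM_COST: Dict[int, int] = {0: 1, 1: 1, 2: 5}


def apply_circuit(gates: List[Gate], state: int) -> int:
    """Apply a cascade of multi-control Toffoli gates to a basis state."""
    for controls, target in gates:
        if all((state >> c) & 1 for c in controls):
            state ^= 1 << target
    return state


def is_reversible(gates: List[Gate], lines: int) -> bool:
    """A circuit is reversible if it permutes the 2**lines basis states."""
    outputs = {apply_circuit(gates, s) for s in range(1 << lines)}
    return len(outputs) == 1 << lines


def circuit_metrics(gates: List[Gate], lines: int) -> Dict[str, int]:
    """Tally lines, gate count, and summed per-gate quantum cost."""
    cost = sum(QUANTUM_COST[len(controls)] for controls, _ in gates)
    return {"lines": lines, "gate_count": len(gates), "quantum_cost": cost}


# Hypothetical 3-line cascade: a Toffoli followed by a CNOT.
example: List[Gate] = [((0, 1), 2), ((0,), 1)]
metrics = circuit_metrics(example, lines=3)
```

Two synthesized circuits for the same function can be compared by calling `circuit_metrics` on each, which is the kind of comparison a reordering study reports.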

Book ChapterDOI
01 Jan 2018
TL;DR: A parity-preserving reversible binary-to-BCD code converter is designed, the effect of reversible metrics such as gate count, ancilla input, garbage output, and quantum cost is analyzed, and a qubit transition analysis of the quantum circuit in the regime of quantum computing is presented.
Abstract: Reversible logic circuits are popular because the quantum gates they are built from are reversible and lose no information. In this paper, a parity-preserving reversible binary-to-BCD code converter is designed, and reversible metrics such as gate count, ancilla inputs, garbage outputs, and quantum cost are analyzed. The design is built from blocks of existing basic parity-preserving reversible gates. The building blocks of the code converter are constructed both from Toffoli gates and from elementary gates such as CNOT, C-V, and C-V+ gates. In addition, a qubit transition analysis of the quantum circuit in the regime of quantum computing is presented. A heuristic approach is developed for constructing the quantum circuit and optimizing the quantum cost of the binary-to-BCD code converter. Logic functions validate the quantum circuit. For testability, the single missing gate and single missing control point faults are considered in testing the quantum logic circuit.
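
The single missing gate fault mentioned at the end of this abstract can be illustrated with a minimal Python sketch: remove one gate at a time from a Toffoli cascade and list the input patterns whose output changes. The gate encoding, function names, and the two-gate example circuit are hypothetical and are not the converter from the paper. The single missing control point fault is not modeled here.

```python
from typing import Dict, List, Tuple

# A gate is (control lines, target line); line k is bit k of the state integer.
Gate = Tuple[Tuple[int, ...], int]


def apply_circuit(gates: List[Gate], state: int) -> int:
    """Apply a cascade of multi-control Toffoli gates to a basis state."""
    for controls, target in gates:
        if all((state >> c) & 1 for c in controls):
            state ^= 1 << target
    return state


def missing_gate_tests(gates: List[Gate], lines: int) -> Dict[int, List[int]]:
    """For each gate index, list the inputs that detect that gate's removal."""
    detecting: Dict[int, List[int]] = {}
    for i in range(len(gates)):
        faulty = gates[:i] + gates[i + 1:]  # circuit with gate i missing
        detecting[i] = [
            s for s in range(1 << lines)
            if apply_circuit(gates, s) != apply_circuit(faulty, s)
        ]
    return detecting


# Hypothetical 3-line cascade: a Toffoli followed by a CNOT.
example: List[Gate] = [((0, 1), 2), ((0,), 1)]
tests = missing_gate_tests(example, lines=3)
```

In this example, an input that appears in every list (here 3 or 7) is a single test vector that detects all single missing gate faults.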