
Showing papers in "Iet Computers and Digital Techniques in 2018"


Journal ArticleDOI
TL;DR: An exemplary real-world application, the video object plane decoder, is mapped on a 2D mesh NoC using different mapping algorithms under the NOCMAP and NoCTweak simulators for a comparative analysis of the NoC simulators and their embedded mapping algorithms.
Abstract: Network-on-chip (NoC) is a reliable and scalable communication paradigm deemed as an alternative to classic bus systems in modern systems-on-chip designs. Consequently, one can observe extensive multidimensional research related to the design and implementation of NoC-based systems. A basic requirement for most of these activities is the availability of NoC simulators that enable the study and comparison of different technologies. This study targets the analysis of different NoC simulators and highlights their contributions towards NoC research. Various NoC tools such as NoCTweak, Noxim, Nirgam, Nostrum, BookSim, WormSim, NOCMAP and ORION are evaluated and their strengths and weaknesses are highlighted. The comparative analysis includes methods for estimation of latency, throughput and energy consumption. Further, an exemplary real-world application, the video object plane decoder, is mapped on a 2D mesh NoC using different mapping algorithms under the NOCMAP and NoCTweak simulators for a comparative analysis of the NoC simulators and their embedded mapping algorithms.

24 citations


Journal ArticleDOI
TL;DR: A novel genetic-based hyper-heuristic algorithm (GHA) selects suitable operators automatically during the mapping process, noticeably improving convergence speed and demonstrating excellent stability; compared to state-of-the-art mapping algorithms, GHA produces improved mapping results in less time.
Abstract: In this study, a flexible energy- and delay-aware mapping approach is proposed for the co-optimisation of energy consumption and communication latency for network-on-chips (NoCs). A novel genetic-based hyper-heuristic algorithm (GHA) is proposed as the core algorithm. This algorithm consists of bottom-level optimisation which includes a variety of operators and top-level optimisation which selects suitable operators through a ‘reward’ mechanism. As this algorithm can select suitable operators automatically during the mapping process, it noticeably improves convergence speed and demonstrates excellent stability. Compared to the random algorithm, GHA can achieve on average 23.28% delay reduction and 11.81% power reduction. Compared to state-of-the-art mapping algorithms, GHA produces improved mapping results with less time, especially when the size of NoC is large.

17 citations


Journal ArticleDOI
TL;DR: The procedure has the ability to identify tests in the pool that are effective for test compaction even when they do not increase the fault coverage, and is designed for the case where multicycle functional broadside tests are extracted from functional test sequences.
Abstract: This study describes a static test compaction procedure that is applicable in the scenario where (i) a large pool of tests can be generated efficiently, but (ii) test compaction that modifies tests, and covering procedures, are not applicable, and (iii) reverse order fault simulation procedures are not sufficient for test compaction. The procedure has the ability to identify tests in the pool that are effective for test compaction even when they do not increase the fault coverage. This ability is achieved using only fault simulation with fault dropping. The procedure is designed for the case where multicycle functional broadside tests are extracted from functional test sequences. The use of multicycle tests results in higher levels of test compaction than possible with two-cycle functional broadside tests. It adds another dimension to the procedure that also needs to select a number of clock cycles for every test.

16 citations
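The baseline that such compaction procedures build on, fault simulation with fault dropping, can be sketched in a few lines of Python. This is a generic greedy compaction loop, not the paper's procedure (which, notably, can also retain tests that add no fault coverage); `detects` is a hypothetical callback returning the set of faults a given test detects.

```python
def compact_with_fault_dropping(tests, detects):
    """Greedy static compaction using fault simulation with fault dropping.

    A test is kept only if it detects at least one still-undetected fault;
    the faults it detects are then dropped from further simulation.
    """
    per_test = [detects(t) for t in tests]
    remaining = set().union(*per_test) if per_test else set()
    kept = []
    for t, s in zip(tests, per_test):
        newly = s & remaining
        if newly:
            kept.append(t)
            remaining -= newly  # fault dropping: detected faults leave the target list
    return kept
```

Here `t2` would be discarded because every fault it detects is already covered by `t1`, which is exactly the kind of test the paper's procedure can still identify as useful for compaction.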


Journal ArticleDOI
TL;DR: This study explores the method of using various physically unclonable functions (PUFs) as a potential seed for a pseudorandom number generator (PRNG), making it harder for attackers to model security systems, and presenting a lightweight and efficient solution to growing security concerns.
Abstract: Continued growth and development in the consumer electronics market have greatly expanded the realm of home automation. With this swell in smart, Internet-connected consumer electronics, there is a need to ensure the safe and secure use of these products. So how does one authenticate each product in a large connected environment? How can the authors minimise counterfeiting, cloning, and the presence of Trojans in consumer electronics? In this study, they explore their method of using various physically unclonable functions (PUFs) as a potential seed for a pseudorandom number generator (PRNG). These can then be used to authenticate consumer electronic devices or protect communication over a large interconnected network. The advantage of this work is that their method makes it more difficult for attackers to learn patterns in the seed of each PRNG while optimising PUF-based constraints in different consumer electronic domains. Through this work they enhance the function of PRNGs, increasing the difficulty of modelling the security systems, and present a lightweight and efficient solution to the growing security concerns. By making the PRNG more difficult to model, malicious actors are less able to overcome the proposed security enhancement, leading to a safe and secure environment.

15 citations


Journal ArticleDOI
TL;DR: This work proposes a partitioned enclave architecture targeting IPSec, TLS and SSL where the partitioned area ensures that the processor data-path is completely isolated from the secret-key memory.
Abstract: Internet protocol security (IPSec), secure sockets layer (SSL)/transport layer security (TLS) and other security protocols necessitate high-throughput hardware implementation of cryptographic functions. In the recent literature, cryptographic functions have been implemented in software, application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs), but they are not necessarily optimised for throughput. Due to various side-channel attacks on cache and memory, and malware-based exfiltration of security keys and other sensitive information, cryptographic enclave processors have been implemented which isolate cryptographically sensitive information such as keys. We propose a partitioned enclave architecture targeting IPSec, TLS and SSL where the partitioned area ensures that the processor data-path is completely isolated from the secret-key memory. The security processor consists of a Trivium random number generator, Rivest–Shamir–Adleman (RSA), advanced encryption standard (AES) and KECCAK cryptos. We implement three different optimised KECCAK architectures. The processing element (PE) handles all communication interfaces, data paths, and control hazards of the network security processor. KECCAK and AES memory communication is done via a direct memory access controller to reduce the PE overhead. The whole system is demonstrated by FPGA implementation using Vivado 2015.2 on an Artix-7 (XC7A100T, CSG324). The implemented KECCAKs perform better in terms of security, throughput and resources than the existing literature.

14 citations


Journal ArticleDOI
TL;DR: A new implementation schema for hierarchically-connected IoTD for indoor applications is proposed, which brings about a new low CCL RSA with two-folded power-aware implementation and is more secure than the conventional implementation due to the inherent countermeasure against the side-channel attacks.
Abstract: Hardware security issues are emerging in the crypto-algorithms of embedded portable Internet-of-Things devices (IoTD). Communication protocols/standards, including MQTT (Message Queuing Telemetry Transport), enforce additional care from a device-to-system design perspective. Due to computation-capacity limitations (CCLs) in battery-operated IoTD, heavy-duty crypto-algorithms are prohibited. This results in compromised hardware using lightweight algorithms. In this study, a new implementation schema for hierarchically-connected IoTD for indoor applications is proposed. This schema allows the IoT network to utilise strong crypto-algorithms (i.e. RSA) instead of lightweight algorithms (e.g. attribute-based encryption (ABE)). Therefore, without increasing the power consumption or complexity, the security of the IoT network increases. This method brings about a new low-CCL RSA with a two-folded power-aware implementation. Furthermore, without complexity overhead, the proposed method is more secure than the conventional implementation due to its inherent countermeasure against side-channel attacks. The presented schema is implemented on a target IoT network, utilising XC7A100T FPGAs as IoT nodes. Furthermore, both the conventional and the proposed RSA-2048 have been implemented in a Spartan6-LX75 on a SAKURA-GW board. The results show that the proposed method reduces the RSA execution time and power consumption of IoTD by about 50% and 60%, respectively. The most noticeable drawback of the current implementation is an overhead in the range of 30–53% on block random access memory (RAM) usage.

12 citations
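As background on side-channel-resistant RSA (the abstract does not detail the paper's own countermeasure), a classic technique is Montgomery-ladder exponentiation, which performs one squaring and one multiplication per exponent bit regardless of the bit's value, so the operation sequence leaks nothing about the key. A minimal, illustrative Python sketch:

```python
def montgomery_ladder_pow(base, exp, mod):
    """Modular exponentiation via the Montgomery ladder.

    The ladder maintains the invariant r1 == r0 * base (mod mod) and does
    the same pair of operations for every exponent bit, a classic
    countermeasure against simple power/timing analysis.
    """
    r0, r1 = 1, base % mod
    for i in reversed(range(exp.bit_length())):
        if (exp >> i) & 1:
            r0, r1 = (r0 * r1) % mod, (r1 * r1) % mod
        else:
            r0, r1 = (r0 * r0) % mod, (r0 * r1) % mod
    return r0
```

In software this only balances the operation count; a hardware design like the paper's must also address data-dependent power draw.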


Journal ArticleDOI
TL;DR: The authors propose efficient net susceptibility metrics to significantly speedup functional-HT detection in gate-level digital designs and confirm a 100% HT trigger detection with a low false positive as compared with previous metrics.
Abstract: A hardware Trojan (HT) is an extra circuitry inserted into a chip design with the malicious aim of functionality alteration, reliability degradation or secret information leakage. It is normally very hard to find HT activation signals since such signals are intended to activate upon very rare conditions occurring on specific nets of the infected circuit. A security engineer would have to search among thousands of gates and modules to make sure of the non-existence of design-time HTs in the circuit. The authors propose efficient net susceptibility metrics to significantly speed up functional-HT detection in gate-level digital designs. The proposed metrics perform a computationally low-overhead analysis on the controllability and observability parameters of each net of the circuit under HT test. Then, using a proposed net classifier method, a very low percentage of circuit nets is determined as HT-trigger-suspicious nets. To show the practicality and detection accuracy of the proposed metrics, gate-level circuits of the Trust-HUB benchmark suite are examined. Results confirm a 100% HT trigger detection with a low false positive rate as compared with previous metrics. More importantly, unlike previously proposed methods, the authors' detection accuracy is totally independent of the switching probability of circuit inputs.

12 citations


Journal ArticleDOI
TL;DR: This study demonstrates how to push the limits of the evolutionary design by choosing a more suitable representation on the one hand and a more efficient fitness function on the other hand and shows that employing full adders as building blocks leads to more efficient approximate circuits.
Abstract: Circuit approximation has been introduced in recent years as a viable method for constructing energy-efficient electronic systems. An open problem is how to effectively obtain approximate circuits showing good compromises between key circuit parameters - the error, power consumption, area and delay. The use of evolutionary algorithms in the task of circuit approximation has led to promising results. Unfortunately, only relatively small circuit instances have been tackled because of the scalability problems of the evolutionary design method. This study demonstrates how to push the limits of the evolutionary design by choosing a more suitable representation on the one hand and a more efficient fitness function on the other hand. In particular, the authors show that employing full adders as building blocks leads to more efficient approximate circuits. The authors focused on the approximation of key arithmetic circuits such as adders and multipliers. While the evolutionary design of adders represents a rather easy benchmark problem, the design of multipliers is known to be one of the hardest problems. The authors evolved a comprehensive library of energy-efficient 12-bit multipliers with a guaranteed worst-case error. The library consists of 65 Pareto dominant solutions considering power, delay, area and error as design objectives.

12 citations


Journal ArticleDOI
TL;DR: The authors report on the realisation of an encryption process in real-time analogue circuitry using off-the-shelf components and minimal processing power and demonstrate a fabricated printed circuit board implementation of the system.
Abstract: The authors report on the realisation of an encryption process in real-time analogue circuitry using off-the-shelf components and minimal processing power. Self-synchronisation of two similar systems through a single shared state is a unique property of the chaotic Lorenz attractor system. In this process, individual parameters of the system are modulated to mask a message before it is transmitted securely through a single shared state. However, these techniques are vulnerable to the return-map attack. The authors show that time-scaling can further obfuscate the modulation process and improve return-map attack immunity, and demonstrate a fabricated printed circuit board implementation of the system.

10 citations


Journal ArticleDOI
TL;DR: A high performance and energy efficient single- Precision and double-precision merged floating-point adder based on the two-path FP addition algorithm designed and implemented on field programmable gate array (FPGA) is presented.
Abstract: A high-performance and energy-efficient single-precision and double-precision merged floating-point adder based on the two-path FP addition algorithm, designed and implemented on a field programmable gate array (FPGA), is presented. With a fully pipelined architecture, the proposed adder can accomplish one double-precision addition or two parallel single-precision additions in six clock cycles. The proposed architecture is designed based on the double-precision adder, and each major component is segmented to support dual single-precision operations. In addition, all the components of the proposed adder are optimised for mapping on FPGA. The proposed architecture is implemented on both Altera Stratix-III and Xilinx Virtex-5 devices and it has a faster clock frequency when compared with the double-precision intellectual property (IP) core adder provided by the FPGA vendors. Owing to its support for dual single-precision operations, the proposed adder has higher throughput than the single-precision IP core adder. In addition, the proposed adder has better energy efficiency compared with both the single-precision and double-precision IP core adders. The implementation results of the proposed adder on the latest Altera Arria-10 and Xilinx Virtex-7 devices are provided. A direct implementation of the proposed architecture on an STM 90 nm technology ASIC platform is also performed.

10 citations


Journal ArticleDOI
TL;DR: In this paper, an arithmetic sign detector for the extended four-moduli set is proposed, where n and k are positive integers such that 0 ≤ k ≤ n. The proposed arithmetic unit is built using carry-save adders and carry-generation circuits.
Abstract: This work is an additional effort to improve the performance of a four-moduli-set residue-based sign detector. The study proposes an arithmetic sign detector for the extended four-moduli set {2^n − 1, 2^n + 1, 2^(2n) + 1, 2^(n+k)}, where n and k are positive integers such that 0 ≤ k ≤ n. The proposed arithmetic unit is built using carry-save adders and carry-generation circuits. When compared with the only sign detector available in the literature for a similar moduli set, the proposed one showed very slight reductions in area and power. However, it showed a huge reduction in time delay. Using very-large-scale integration tools, the presented sign detector achieved a reduction of 48.8–59.2% in time delay.
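To make the moduli set concrete, the sketch below builds the set for given n and k, converts integers to and from residues via the Chinese remainder theorem, and decides the sign by comparing against half the dynamic range M/2. This is a generic software reference for residue-based sign detection, not the paper's carry-save hardware design.

```python
from math import prod

def moduli_set(n, k):
    # Extended four-moduli set from the paper: {2^n - 1, 2^n + 1, 2^(2n) + 1, 2^(n+k)}
    assert 0 <= k <= n
    return [2**n - 1, 2**n + 1, 2**(2 * n) + 1, 2**(n + k)]

def to_rns(x, moduli):
    # Residue representation: one residue per channel
    return [x % m for m in moduli]

def from_rns(residues, moduli):
    # CRT reconstruction (software reference only)
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

def is_negative(residues, moduli):
    # In a signed RNS, values in the upper half of [0, M) encode negatives
    return from_rns(residues, moduli) >= prod(moduli) // 2
```

The channels are pairwise coprime (the first three are odd and the last is a power of two), which is what makes the CRT reconstruction, and hence sign detection, well defined.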

Journal ArticleDOI
TL;DR: This study analyses the quantisation noise in FFT computation and proposes the mixed use of multiple scaling approaches to compensate the noise, and a statistics-based optimisation scheme is proposed to configure the scaling operations of the cascaded arithmetic blocks at each stage for yielding the most optimised accuracy for a given FFT length.
Abstract: Fast Fourier transform (FFT) plays an important role in digital signal processing systems. In this study, the authors explore the very large-scale integration (VLSI) design of high-precision fixed-point reconfigurable FFT processor. To achieve high accuracy under the limited wordlength, this study analyses the quantisation noise in FFT computation and proposes the mixed use of multiple scaling approaches to compensate the noise. In addition, a statistics-based optimisation scheme is proposed to configure the scaling operations of the cascaded arithmetic blocks at each stage for yielding the most optimised accuracy for a given FFT length. On the basis of this approach, they further present a VLSI implementation of area-efficient and high-precision FFT processor, which can perform power-of-two FFT from 32 to 8192 points. By using the SMIC 0.13 μm process, the area of the proposed FFT processor is 27 mm² with a maximum operating frequency of 400 MHz. When the FFT processor is configured to perform 8192-point FFT at 40 MHz, the signal-to-quantisation-noise ratio is up to 53.28 dB and the power consumption measured by post-layout simulation is 35.7 mW.
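The accuracy figure quoted above is a signal-to-quantisation-noise ratio. As a reference for how such a number is computed, here is a small sketch comparing an ideal output vector against its fixed-point counterpart; this is the standard textbook definition, not the paper's specific measurement setup.

```python
import math

def sqnr_db(reference, quantised):
    """SQNR in dB: signal energy of the ideal output over the energy
    of the quantisation error (ideal minus fixed-point output)."""
    sig = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantised))
    return 10 * math.log10(sig / noise)
```
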

Journal ArticleDOI
TL;DR: This work leverages a novel probabilistic spintronic switching element device that provides thermally-driven and current-controlled tunable stochasticity in a compact, low-energy, and high-speed package and demonstrates that an S3N can implement perceptron functionality, such as AND-gate- and OR- gate-based logic processing, and provides future extensions of the work to more advanced stochastically neuromorphic architectures.
Abstract: The spintronic stochastic spiking neuron (S3N) developed herein realises biologically mimetic stochastic spiking characteristics observed within in vivo cortical neurons, while operating several orders of magnitude more rapidly and exhibiting a favourable energy profile. This work leverages a novel probabilistic spintronic switching element device that provides thermally-driven and current-controlled tunable stochasticity in a compact, low-energy, and high-speed package. In order to close the loop, the authors utilise a second-order complementary metal-oxide-semiconductor (CMOS) synapse with variable weight control that accumulates incoming spikes into second-order transient current signals, which resemble the excitatory post-synaptic potentials found in biological neurons, and can be used to drive post-synaptic S3Ns. Simulation program with integrated circuit emphasis (SPICE) simulation results indicate that the equivalent of 1 s of in vivo neuronal spiking characteristics can be generated on the order of nanoseconds, enabling the feasibility of extremely rapid emulation of in vivo neuronal behaviours for future statistical models of cortical information processing. Their results also indicate that the S3N can generate spikes on the order of ten picoseconds while dissipating only 0.6-9.6 μW, depending on the spiking rate. Additionally, they demonstrate that an S3N can implement perceptron functionality, such as AND-gate- and OR-gate-based logic processing, and provide future extensions of the work to more advanced stochastic neuromorphic architectures.

Journal ArticleDOI
TL;DR: This study proposes analytical models to estimate the Pass Rate, delay, power and area of ESAs; results show that modified ESAs provide higher accuracy, better quality-effort curves and a better Delay–Power–Area–Accuracy trade-off than the original ESAs.
Abstract: Recently, several approximate adders have been proposed based on the design concept of the Equal Segment Adder (ESA), i.e. segmenting an N-bit adder into several smaller and independent equal-size sub-adders. In this study, the authors propose analytical models to estimate the Pass Rate (PR), delay, power and area of ESAs, where PR represents the probability of the output being correct. From the proposed analytical models, they observe that there is scope and need for improvement in the quality-effort curves of existing ESAs. To improve the quality-effort curves, they propose modifications to existing ESAs with the design objective that modified ESAs provide higher accuracy without imposing any additional delay, power or area overheads. Both the authors' analytical and simulation results show that modified ESAs provide higher accuracy, better quality-effort curves and a better Delay–Power–Area–Accuracy trade-off than the original ESAs. In addition to accuracy enhancement, the proposed approach also provides improvements in delay and power when ESAs are used with Error Detection and Correction logic. For evaluating the effectiveness of the proposed approach in real-life applications, they process the Lena image using original ESAs and modified ESAs. Their image processing results show that modified ESAs provide more precise images than the original ESAs.
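The ESA design concept is easy to demonstrate in software. The sketch below models the simplest possible ESA, independent equal segments with inter-segment carries dropped entirely (published ESAs use more refined carry-prediction schemes), and estimates the Pass Rate by Monte-Carlo simulation rather than by the paper's analytical models.

```python
import random

def esa_add(a, b, n=16, seg=4):
    """Approximate sum: split an n-bit addition into independent seg-bit
    segments, each with carry-in 0; carries between segments are dropped."""
    mask = (1 << seg) - 1
    result = 0
    for i in range(0, n, seg):
        sa = (a >> i) & mask
        sb = (b >> i) & mask
        result |= ((sa + sb) & mask) << i  # inter-segment carry discarded
    return result

def pass_rate(n=16, seg=4, trials=10000, seed=0):
    """Monte-Carlo estimate of the probability the approximate sum is exact."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.getrandbits(n)
        b = rng.getrandbits(n)
        hits += esa_add(a, b, n, seg) == (a + b) & ((1 << n) - 1)
    return hits / trials
```

With seg equal to n there is a single segment and the adder is exact; shrinking the segments lowers delay but also the Pass Rate, which is exactly the quality-effort trade-off the paper models analytically.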

Journal ArticleDOI
TL;DR: Experimental results show MPGA achieves a significant improvement over the previous publications both on dynamic power and leakage power reduction in most benchmarks.
Abstract: With the development of deep-submicron and nano-technology, leakage power minimisation has become as important as dynamic power reduction in IC design. In order to achieve low-power state assignment for finite-state machine (FSM) synthesis, a multi-population genetic algorithm (MPGA)-based state assignment method is proposed. MPGA consists of an outer-loop and a set of inner-GAs. In MPGA, the inner-GA is a local search component for finding low-power state assignments. Selection, crossover and mutation are used to perform variations on individuals. The cost function is defined based on a power dissipation formulation of the complementary metal oxide semiconductor (CMOS) gate for dynamic power and leakage power estimation. The outer-loop is used to optimise the parameters of the inner genetic algorithm (GA) through a population variation schema, intra-specific competition and newborns. Twenty-three FSMs that are commonly used as benchmarks are employed to test the effectiveness of MPGA and compare different state assignment methods. Experimental results show MPGA achieves a significant improvement over previous publications on both dynamic power and leakage power reduction in most benchmarks.

Journal ArticleDOI
TL;DR: It is observed that the proposed design achieves better performance in terms of hardware complexity and normalised energy for the given specifications.
Abstract: This study presents a variable-length multi-path delay commutator fast Fourier transform (FFT)/inverse FFT (IFFT) architecture for a multiple input multiple output orthogonal frequency division multiplexing system. It supports FFT/IFFT lengths of 512/256/128/64 samples to process each symbol carried by eight spatial streams and achieves a speed of 160 MHz to meet the IEEE 802.11ac timing requirements. A resource scheduling methodology to minimise the hardware complexity of the design is proposed and adopted in the architecture presented. A novel stagger word length strategy is also proposed and applied to achieve better accuracy with less hardware. Here, a signal-to-quantisation-noise ratio of 57.23 dB is obtained. The twiddle coefficient storage space is significantly compressed to achieve coefficient generation with reduced hardware. The design is implemented using TSMC 65 nm complementary metal oxide semiconductor technology with a supply voltage of 1 V at 160 MHz. The implementation results show that the architecture has a gate count of 348,013 with a power consumption of 105.1 mW and an area of 0.492 mm². The hardware complexity and performance of the design are compared with earlier reported architectures. It is observed that the proposed design achieves better performance in terms of hardware complexity and normalised energy for the given specifications.

Journal ArticleDOI
TL;DR: The algorithm for the proposed architecture is derived from the Chinese remainder theorem and performs MM completely within a residue number system (RNS) and enables the construction of low-voltage and energy-efficient ECCs.
Abstract: Modular multiplication (MM) is the main operation in cryptography algorithms such as elliptic-curve cryptography (ECC) and Rivest-Shamir-Adleman, where repeated MM is used to perform elliptic curve point multiplication and modular exponentiation, respectively. The algorithm for the proposed architecture is derived from the Chinese remainder theorem and performs MM completely within a residue number system (RNS). Moreover, a 40-channel RNS moduli-set is proposed for this architecture to benefit from the short-channel width of the RNS moduli-set. The throughput of the architecture is enhanced by pipelining and pre-computations. The proposed architecture is fabricated as an ASIC using 65-nm CMOS technology. The measurement results are obtained for energy dissipation at different voltage levels from 0.43 to 1.25 V. The maximum throughput of the proposed design is 1037 Mbps while operating at a frequency of 162 MHz with an energy dissipation of 48 nJ. The proposed architecture enables the construction of low-voltage and energy-efficient ECCs.
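The appeal of RNS for modular multiplication is that each residue channel operates independently, with short channel widths, so all channels can run in parallel hardware. The sketch below shows channel-wise multiplication plus a software CRT reconstruction over small illustrative moduli; the paper performs the full modular multiplication entirely inside RNS in hardware, which this sketch does not reproduce.

```python
from math import prod

def rns_mul(xa, xb, moduli):
    """Channel-wise product: each residue channel multiplies independently,
    which is the source of RNS parallelism."""
    return [(ra * rb) % m for ra, rb, m in zip(xa, xb, moduli)]

def crt(residues, moduli):
    """Chinese remainder theorem reconstruction back to an integer in [0, M)."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M
```

The reconstruction is exact as long as the true product stays below the dynamic range M; the paper's 40-channel moduli set makes M large enough for RSA/ECC operand sizes while keeping each channel narrow.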

Journal ArticleDOI
TL;DR: An arbitration mechanism for NoC is proposed that leads to a reduction in congestion delay in routers as well as the network latency, and is compatible with the bypass and baseline pipeline in routers.
Abstract: In the movement from a multi-core to a many-core era, core counts on the chip increase quickly, so the interconnect plays a large role in achieving the desired performance. Network-on-chip (NoC) is the most widely used interconnect, a scalable alternative to the traditional shared bus in many-core chips. As the dimensions of mesh-based NoC increase, routers and links serve as a major part of achieving the desired performance and low-latency communication between cores. In this study, the authors propose an arbitration mechanism for NoC that leads to a reduction in congestion delay in routers as well as the network latency. The proposed mechanism is compatible with the bypass and baseline pipelines in routers. System simulations with Noxim demonstrate reductions in latency and power consumption using different routing algorithms for 4×4, 8×8 and 16×16 mesh topologies, as compared with a baseline router.

Journal ArticleDOI
TL;DR: An energy-efficient routing technique that can tolerate permanent faults in NoC links by introducing a simple logic unit placed next to the output port allocation stage of the deflection router pipeline, which incurs minimum wiring overheads and promises a stable network throughput for high fault rates.
Abstract: New generation multi-processor system-on-chips integrate hundreds of processing elements in a single chip which communicate with each other through on-chip communication networks, commonly known as network-on-chip (NoC). Routers are the most critical NoC components and deflection routing is a technique used in buffer-less routers for better energy efficiency. Massive integration of devices along with fabrication at deep sub-micron level feature sizes increases the possibility of wear out and damage to various components resulting in unreliable operation of the chip. Hence NoC fabric in general and routers, in particular, should be equipped with built-in fault tolerance mechanisms to ensure the reliability of the chip in the presence of faults. The authors propose an energy-efficient routing technique that can tolerate permanent faults in NoC links by introducing a simple logic unit placed next to the output port allocation stage of the deflection router pipeline. This technique incurs minimum wiring overheads and promises a stable network throughput for high fault rates. Evaluation of the proposed method on 8 × 8 mesh NoC for various fault rates reports reduced flit deflection rate and hop power which brings about a significant reduction in dynamic power consumption at the inter-router links compared to state-of-the-art fault tolerance techniques.

Journal ArticleDOI
TL;DR: A hierarchical block-merging-based technique for test data compression, which appropriately encodes test pattern blocks of fixed sizes at inter- and intra-block levels using a smaller number of bits, is presented.
Abstract: Manufacturing of semiconductor devices at the sub-micron level has led to the introduction of a huge number of faults. To ensure the quality of integrated circuits (ICs), an enormous amount of test data is needed which, in turn, increases the overall test cost of the ICs. This study presents a hierarchical block-merging-based technique (HBMT) for test data compression, which appropriately encodes test pattern blocks of fixed sizes at inter- and intra-block levels using a smaller number of bits. The proposed technique works in four steps: segmentation of the entire length of test data into equal-length blocks; categorisation of test blocks as compatible blocks and unique blocks; merging of compatible blocks to form a representative pattern block, which is further merged at the sub-block level; and compression of the non-compatible (unique) blocks using different encoding cases. Experimental results on various international symposium for circuits and systems (ISCAS) '89 benchmark circuits demonstrate the effectiveness of the proposed test data compression technique. It is found that application of HBMT can improve the compression efficiency by an average of 73% along with a reduction in the test application time. This study also presents the decoder architecture.
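The core compatibility-and-merging step on test blocks containing don't-care bits ('x') can be illustrated as follows; this is a generic sketch of block merging, not the HBMT encoding cases themselves.

```python
def blocks_compatible(b1, b2):
    """Two blocks are compatible if they agree at every position where
    neither block has a don't-care 'x'."""
    return all(c1 == c2 or 'x' in (c1, c2) for c1, c2 in zip(b1, b2))

def merge_blocks(b1, b2):
    """The merged representative keeps the specified (non-'x') bit from
    whichever block provides one; both 'x' stays 'x'."""
    return ''.join(c1 if c1 != 'x' else c2 for c1, c2 in zip(b1, b2))
```

Merging compatible blocks into one representative pattern is what lets the encoder transmit that pattern once and reference it for every member of the group.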

Journal ArticleDOI
TL;DR: The proposed systolic Dickson basis multiplier can concurrently compute a great number of multiplications with a high-throughput rate, thereby substantially increasing the speed of computation for digital signatures.
Abstract: In this study, the authors propose a high-throughput systolic Dickson basis multiplier over GF(2^m). Use of the Dickson basis seems promising when no Gaussian normal basis exists for the field, and it can easily carry out both squaring and multiplication operations. Many squaring operations and multiplications are needed when computing the digital signatures of elliptic curve digital signature algorithm. The proposed systolic Dickson basis multiplier can concurrently compute a great number of multiplications with a high-throughput rate, thereby substantially increasing the speed of computation for digital signatures.
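For readers unfamiliar with finite-field multipliers, the sketch below multiplies two GF(2^m) elements in the familiar polynomial basis (shift, XOR, and reduction by the field polynomial). The Dickson basis used in the paper represents elements differently but computes the same field product; the example below uses the AES field GF(2^8) with polynomial 0x11B purely as a well-known illustration.

```python
def gf2m_mul(a, b, poly, m):
    """Polynomial-basis multiplication in GF(2^m): shift-and-XOR with
    reduction by the degree-m field polynomial `poly` (bit m set)."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1             # multiply a by x
        if (a >> m) & 1:
            a ^= poly       # reduce modulo the field polynomial
    return r
```

Squaring is just `gf2m_mul(a, a, poly, m)`; the paper's point is that the Dickson basis makes both squaring and multiplication cheap in the same systolic structure.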

Journal ArticleDOI
TL;DR: An efficient reversible circuit synthesis scheme that constructs improved circuits by minimising the quantum cost is developed that substantially reduces the cost of the circuits to a great extent.
Abstract: Among current computing paradigms, quantum computing has evolved as a promising platform for designing very fast computable circuits. In view of the improved design of such circuits, an efficient synthesis approach needs to be developed. As direct synthesis of a quantum circuit is somewhat complicated, the concept of the reversible circuit appears, which internally implements the quantum functionality; to design a better quantum circuit, the corresponding reversible circuit has to be optimised. Considering this need, in this work, the authors develop an efficient reversible circuit synthesis scheme that constructs improved circuits by minimising the quantum cost. The entire work is completed in two phases. In the first phase, a circuit design scheme based on the best neighbour is implemented, where a function shares a portion of its own data with a chosen neighbour, termed the best neighbour, and builds the shared structure. In the second phase, the designed circuit passes through an optimisation process which further reduces the cost metrics of the circuit. The experiment shows that the optimisation process substantially reduces the cost of the circuits. At the end of the work, a comparative study with related works is also presented.

Journal ArticleDOI
TL;DR: A novel approach for efficiently activating Trojans hidden in digital signal processing (DSP) circuits by increasing the transition activity of rare bits, which effectively activate internal rare nodes and trigger HTs is proposed.
Abstract: A hardware Trojan (HT), which is usually activated under rare conditions associated with low-transition bits in a circuit, can lead to circuit functional failure or information leakage. Effectively activating hidden HTs is a major challenge during the HT detection process. In this study, the authors propose a novel approach for efficiently activating Trojans hidden in digital signal processing (DSP) circuits by increasing the transition activity of rare bits. In particular, the bit-level transition activity can be increased by controlling signal word-level statistical properties, such as standard deviation and autocorrelation, and their propagation through the various operators involved in DSP circuit design. As a result, the proposed approach can generate appropriate test vectors which effectively activate internal rare nodes and trigger HTs. The experimental results show that, using the proposed approach, the transition activity of rare bits is significantly increased and various HTs inserted into DSP circuits are activated in reduced time. Compared with an existing bit-level activation approach, the proposed approach reduces test vector generation time by up to 9 times and HT activation time by up to 66 times.
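
The notion of "rare bits" driving the approach can be sketched in a few lines: count bit-level transitions over a word-level signal trace and flag the bit positions whose toggle count falls below a threshold. The trace and threshold below are illustrative.

```python
def bit_transition_counts(trace, width):
    """Count 0->1 / 1->0 transitions per bit position over a word trace."""
    counts = [0] * width
    for prev, cur in zip(trace, trace[1:]):
        diff = prev ^ cur                  # bits that toggled between cycles
        for i in range(width):
            if (diff >> i) & 1:
                counts[i] += 1
    return counts

def rare_bits(trace, width, threshold):
    """Bit positions with low transition activity: candidate HT trigger sites."""
    counts = bit_transition_counts(trace, width)
    return [i for i, c in enumerate(counts) if c < threshold]
```

The paper's contribution is, in effect, shaping the word-level statistics of the input stream so that the counts returned for these rare positions rise, exercising the trigger logic.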

Journal ArticleDOI
TL;DR: This study proposes a novel reliability and threat analysis of negative bias temperature instability (NBTI) stress on digital signal processing (DSP) cores and identifies input vectors that cause maximum degradation of DSP cores due to NBTI stress.
Abstract: Device aging is a critical failure mechanism in nanoscale designs, and prolonged device degradation may result in failure. The delay degradation of a design depends on various factors such as threshold voltage, temperature and input vector pattern. An attacker who is aware of this phenomenon may exploit it by accelerating the performance degradation mechanism. This study proposes a novel reliability and threat analysis of negative bias temperature instability (NBTI) stress on digital signal processing (DSP) cores. The main contributions of this study are: (a) identifying input vectors that cause maximum degradation of DSP cores under NBTI stress, (b) analysing the impact of NBTI stress on DSP cores for varying stress times in terms of delay degradation and (c) comparing performance under stress versus no-stress conditions for various input vector samples.
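
The stress-time and input-vector dependence can be sketched with a toy long-term NBTI model: the threshold-voltage shift grows as a power law of effective stress time (scaled by the input-dependent duty cycle of the stress condition), and the delay penalty follows from an alpha-power-law delay model. All constants below are illustrative fitting values, not the paper's calibration.

```python
def nbti_delta_vth(stress_time_s, duty_cycle=1.0, a=3.0e-3, n=1.0 / 6.0):
    """Long-term NBTI threshold shift: Delta-Vth ~ A * (alpha * t)^n.
    'a' and the exponent n ~ 1/6 are illustrative fitting constants."""
    return a * (duty_cycle * stress_time_s) ** n

def delay_degradation(delta_vth, vdd=1.0, vth0=0.3, alpha=1.3):
    """Relative gate-delay increase from the alpha-power delay model:
    delay ~ Vdd / (Vdd - Vth)^alpha."""
    fresh = vdd / (vdd - vth0) ** alpha
    aged = vdd / (vdd - vth0 - delta_vth) ** alpha
    return aged / fresh - 1.0
```

In this picture, the attacker's worst-case input vectors are the ones maximising the duty cycle of the stress condition on critical-path transistors.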

Journal ArticleDOI
TL;DR: This study addresses the key challenge of providing a scalable communication interconnect that meets global astrocyte network requirements and integrates with the existing local communication mechanism.
Abstract: Hardware has become more prone to faults as a result of geometric scaling, wear-out and faults introduced during the manufacturing process; reliable hardware must therefore continually adapt to faults. A computational model of biological self-repair in the brain, derived from observing the role of astrocytes (a glial cell found in the mammalian brain), has captured self-repair within models of neural networks known as neuro-glia networks. This astrocyte-driven repair process can address faulty synapse connections between neurons. Astrocyte cells are distributed throughout a neuro-glia network and regulate synaptic activity, and computational models have shown that this can yield a fine-grained self-repair process. Mapping neuro-glia networks to hardware therefore provides a strategy for achieving self-repair in hardware, but interconnecting these networks internally in hardware is a challenge. Previous work has focused on neuron-to-astrocyte (local) communication; however, the global self-repair process depends on the communication infrastructure between astrocytes, i.e. the astrocyte network. This study addresses the key challenge of providing a scalable communication interconnect that meets global astrocyte network requirements and integrates with the existing local communication mechanism. Area and power results demonstrate scalable implementations with the ring topology while meeting timing requirements.
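
The scalability of the ring interconnect can be quantified with a toy hop-count model. The sketch below computes the average shortest-path hop count between nodes on a bidirectional ring, an assumption-level model only (the paper's traffic patterns and router details are not modelled).

```python
def ring_avg_hops(n):
    """Average shortest-path hop count from one node to every other node
    on a bidirectional ring of n nodes (traffic can go either direction)."""
    distances = [min(d, n - d) for d in range(1, n)]
    return sum(distances) / (n - 1)
```

Average hops grow only linearly in ring size (roughly n/4), which is the kind of trade-off, low wiring area versus rising latency, that the area/power/timing results above evaluate.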

Journal ArticleDOI
TL;DR: The analysis shows the benefits of including a scratchpad memory inside the reconfiguration controller in order to improve the efficiency of the reconfiguration process.
Abstract: In this paper we have evaluated the overheads and trade-offs of a set of components usually included in a system with run-time partial reconfiguration, implemented on a Xilinx Virtex-5. Our analysis shows the benefits of including a scratchpad memory inside the reconfiguration controller to improve the efficiency of the reconfiguration process. We have designed a simple controller for this scratchpad that supports prefetching and caching in order to further reduce both the energy and latency overheads.
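
The caching-plus-prefetching idea can be sketched as a small software model of the scratchpad controller: an LRU store of partial bitstreams where a prefetch issued ahead of time turns the later reconfiguration request into a hit. The class and method names, and the LRU policy itself, are assumptions for illustration, not the paper's design.

```python
from collections import OrderedDict

class ScratchpadModel:
    """Toy model of a reconfiguration scratchpad with LRU caching and prefetch."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # module name -> cached-bitstream flag
        self.hits = 0
        self.misses = 0

    def _insert(self, module):
        self.lines[module] = True
        self.lines.move_to_end(module)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict least recently used

    def prefetch(self, module):
        """Load a bitstream before it is needed, hiding the fetch latency."""
        if module not in self.lines:
            self._insert(module)

    def reconfigure(self, module):
        """Demand request: a hit if the bitstream is already in the scratchpad."""
        if module in self.lines:
            self.hits += 1
            self.lines.move_to_end(module)
        else:
            self.misses += 1
            self._insert(module)
```

A miss here corresponds to fetching the full partial bitstream from external memory, which is exactly the energy and latency cost the scratchpad is meant to avoid.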

Journal ArticleDOI
TL;DR: A locality-protected method based on instruction programme counter (LPC) to make use of data locality in L1 data cache with very low hardware overhead and a hardware-efficient prioritised cache allocation unit is proposed to coordinate data reuse information with time-stamps to predict the reuse possibility of each cache line.
Abstract: Graphics processing units (GPUs) are playing increasingly important roles in parallel computing. Using their multi-threaded execution model, GPUs can accelerate many parallel programmes and save energy. In contrast to their strong computing power, GPUs have limited on-chip memory, which is often inadequate. The throughput-oriented execution model of GPUs introduces thousands of hardware threads that may access the small cache simultaneously, causing cache thrashing and contention problems that limit GPU performance. Motivated by these issues, the authors put forward a locality-protected method based on the instruction programme counter (LPC) to exploit data locality in the L1 data cache with very low hardware overhead. First, they use a simple programme counter (PC)-based locality detector to collect reuse information for each cache line. Then, a hardware-efficient prioritised cache allocation unit is proposed to coordinate data reuse information with time-stamp information to predict the reuse possibility of each cache line and to evict the line with the least reuse possibility. Their simulator experiments show that LPC provides up to a 17.8% speedup and an average improvement of 5.0% over the baseline method with very low overhead.
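
The prioritised allocation idea can be sketched as a victim-selection rule: each line carries a PC-derived reuse counter and a last-access time-stamp, and eviction picks the line predicted least likely to be reused, i.e. the lowest reuse count, with the oldest time-stamp breaking ties. This is an illustrative model, not the exact hardware unit.

```python
class CacheLine:
    def __init__(self, tag, now):
        self.tag = tag
        self.reuse = 0        # reuse count learned from the line's allocating PC
        self.last_used = now  # time-stamp of the most recent access

def select_victim(lines):
    """Evict the line with the least predicted reuse; among equally cold
    lines, prefer the one untouched for the longest time."""
    return min(lines, key=lambda l: (l.reuse, l.last_used))
```

Plain LRU would use only `last_used`; coordinating it with the reuse counter is what protects lines whose PC has historically shown locality, even if they were not touched most recently.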

Journal ArticleDOI
TL;DR: A novel method is proposed which transforms the characteristic space into performance space and achieves 13.3× speed-up over the program-centric scheme with an average correlation coefficient of 0.92.
Abstract: Predictive modelling has gained much attention in the last decade, aiming at fast evaluation of different design points in the design space exploration (DSE) process. However, predictive model construction still requires costly simulations for every unseen program. To reduce the number of simulations, several cross-program prediction schemes have been developed. This study proposes a cross-program predictive scheme for micro-architectural DSE. The scheme measures a set of representative inherent characteristics of the unseen program and compares them against the same characteristics of the training programs. Then, based on the similarity information, the performance trend of the unseen program is predicted using the predictive models of the training programs. As the raw characteristic data do not characterise programs in performance space, the authors propose a novel method which transforms the characteristic space into performance space. The proposed method achieves a 13.3× speed-up over the program-centric scheme with an average correlation coefficient of 0.92.
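
The cross-program idea, predicting an unseen program from its similarity to training programs, can be sketched as a similarity-weighted average of the training programs' predicted performance at the same design point. The cosine similarity measure and the weighting scheme here are illustrative choices, not the paper's exact transformation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two characteristic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_performance(unseen_feats, train_feats, train_perfs):
    """Predict the unseen program's performance at a design point as the
    similarity-weighted average of the training programs' performance."""
    weights = [cosine(unseen_feats, f) for f in train_feats]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, train_perfs)) / total
```

Applying this at every design point yields a predicted performance trend without simulating the unseen program, which is where the reported speed-up over per-program model construction comes from.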

Journal ArticleDOI
TL;DR: The authors propose an NSP on a field programmable gate array (FPGA) platform where, according to strict power, throughput, resource and security priorities, a proposed preferential algorithm chooses a cipher suite to program the hardware.
Abstract: Efficient and cost-effective hardware design of a network security processor (NSP) is of vital importance in the present era due to the increasing need for security infrastructure in a wide range of computing applications. Here, the authors propose an NSP on a field programmable gate array (FPGA) platform where, according to strict power, throughput, resource and security priorities, a proposed preferential algorithm chooses a cipher suite to program the hardware. The choice is based on a ranked list of available cipher suites, ordered by an efficient system index evaluated from power, throughput, resource and security data together with the weights given to them by the user. Encryption, hash and key-exchange algorithms, along with their architectural variants, provide excellent hardware flexibility; their bit files are stored in secure digital memory. The proposed design uses an isolated key memory where secret keys are stored in encrypted form along with their hash values. The design is implemented using the ISE 14.4 suite on a ZYNQ7z020-clg484 FPGA platform. The performance of the architectural variants of the crypto algorithms is considerably better in terms of power, throughput and resource usage than that of existing works reported in the literature.
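
The preferential selection can be sketched as a weighted scoring rule: treat throughput and security as benefits and power and resource usage as costs, combine them with user-supplied weights into a system index, and rank the suites by it. The metric names, suite names and the linear form of the index are assumptions for illustration only.

```python
def system_index(metrics, weights):
    """Higher is better: benefits add, costs subtract (metrics pre-normalised)."""
    return (weights["throughput"] * metrics["throughput"]
            + weights["security"] * metrics["security"]
            - weights["power"] * metrics["power"]
            - weights["resource"] * metrics["resource"])

def rank_cipher_suites(suites, weights):
    """Return suite names, best first, ordered by the weighted system index."""
    return sorted(suites,
                  key=lambda name: system_index(suites[name], weights),
                  reverse=True)
```

Changing the weights re-ranks the list, so a power-constrained deployment and a throughput-hungry one can select different suites from the same bit-file library.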