
Showing papers in "IET Computers and Digital Techniques in 2019"


Journal ArticleDOI
TL;DR: A pipelined architecture is proposed in this work to speed up the point multiplication in elliptic curve cryptography (ECC) by reducing the number of clock cycles (latency), which is achieved through careful scheduling of computations involved in point addition and point doubling.
Abstract: A pipelined architecture is proposed in this work to speed up the point multiplication in elliptic curve cryptography (ECC). This is achieved, first, by pipelining the arithmetic unit to reduce the critical path delay and, second, by reducing the number of clock cycles (latency) through careful scheduling of the computations involved in point addition and point doubling. These two factors thus help in reducing the time for one point multiplication computation. On the other hand, the small area overhead of this design gives a higher throughput/area ratio. Consequently, the proposed architecture is synthesised on different FPGAs to compare with the state-of-the-art. The synthesis results over GF(2^m) show that the proposed design can work up to a frequency of 369, 357 and 337 MHz when implemented for m = 163, 233 and 283 bit key lengths, respectively, on a Virtex-7 FPGA. The corresponding throughput/slice figures are 42.22, 12.37 and 9.45, which outperform existing implementations.
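As a point of reference for the scheduling discussion above, the sketch below shows plain double-and-add scalar multiplication, whose per-bit point-doubling and conditional point-addition steps are exactly the operations such a datapath schedules. It is a minimal Python illustration over a small prime-field toy curve with made-up parameters, not the paper's GF(2^m) hardware design.

```python
# Toy illustration (not the paper's GF(2^m) hardware): scalar point
# multiplication by repeated point doubling and point addition on a
# small prime-field curve y^2 = x^3 + ax + b (mod p).  Each loop
# iteration corresponds to the doubling/addition steps whose careful
# scheduling the paper pipelines to cut clock cycles.

p, a, b = 97, 2, 3            # hypothetical toy curve parameters
O = None                      # point at infinity

def inv(x):                   # modular inverse via Fermat's little theorem
    return pow(x, p - 2, p)

def point_add(P, Q):
    if P is O: return Q
    if Q is O: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O
    if P == Q:                                 # point doubling
        lam = (3 * x1 * x1 + a) * inv(2 * y1) % p
    else:                                      # point addition
        lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def point_mul(k, P):
    """Left-to-right double-and-add: one doubling per key bit,
    one addition per set bit -- the operations the ECC datapath schedules."""
    R = O
    for bit in bin(k)[2:]:
        R = point_add(R, R)                    # always double
        if bit == '1':
            R = point_add(R, P)                # conditionally add
    return R

G = (3, 6)                                     # a point on the toy curve
print(point_mul(20, G))
```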

33 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed scheme is not only able to achieve appreciable energy savings with respect to state-of-the-art but also enables a significant improvement in resource utilisation.
Abstract: Devising energy-efficient scheduling strategies for real-time periodic tasks on heterogeneous platforms is a challenging as well as computationally demanding problem. This study proposes a low-overhead heuristic strategy called HEALERS for dynamic voltage and frequency scaling (DVFS)-cum-dynamic power management (DPM) enabled energy-aware scheduling of a set of periodic tasks executing on a heterogeneous multi-core system. The presented strategy first applies deadline-partitioning to acquire a set of distinct time-slices. At any time-slice boundary, the following three-phase operation is applied to obtain a schedule for the next time-slice: first, it computes the fragments of the execution demands of all tasks on each of the different processing cores in the platform. Next, it generates a schedule for each task on one or more processing cores such that the total execution demand of all tasks is satisfied. Finally, HEALERS applies DVFS and DPM on all processing cores so that energy consumption within the time-slice is minimised while not jeopardising the execution requirements of the scheduled tasks. Experimental results show that the proposed scheme is not only able to achieve appreciable energy savings with respect to the state-of-the-art (5–42% on average) but also enables a significant improvement in resource utilisation (as high as 58%).
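The deadline-partitioning step described above can be illustrated with a short sketch: slice the hyperperiod at every absolute deadline and hand each task a per-slice execution fragment proportional to its utilisation. This is a DP-Fair-style simplification with a hypothetical task set and identical-speed cores; HEALERS' per-core heterogeneity and its DVFS/DPM phases are omitted.

```python
# Minimal sketch (assumed task model, hypothetical names) of the
# deadline-partitioning step: slice the hyperperiod at every absolute
# deadline, then give each task a per-slice execution fragment
# proportional to its utilisation, as in DP-Fair-style schedulers.
from math import gcd
from functools import reduce

tasks = {            # task: (worst-case execution time, period == deadline)
    'T1': (2, 8),
    'T2': (3, 12),
    'T3': (4, 24),
}

def lcm(a, b):
    return a * b // gcd(a, b)

hyperperiod = reduce(lcm, (period for _, period in tasks.values()))

# Distinct absolute deadlines inside one hyperperiod define the slice boundaries.
boundaries = sorted({k * period
                     for _, period in tasks.values()
                     for k in range(1, hyperperiod // period + 1)})
slices = list(zip([0] + boundaries[:-1], boundaries))

for start, end in slices:
    length = end - start
    # Fragment of each task's demand allotted to this slice (utilisation share).
    fragments = {name: round(length * wcet / period, 3)
                 for name, (wcet, period) in tasks.items()}
    print(f"slice [{start:2d},{end:2d}): fragments {fragments}")
```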

16 citations


Journal ArticleDOI
TL;DR: The authors propose the concepts of kernel vulnerability factor (KVF) and layer vulnerability factor (LVF), which indicate the probability of faults in a kernel or layer to affect the computation.
Abstract: Video recognition applications running on Graphics Processing Units are composed of heterogeneous software portions, such as kernels or layers for neural networks. The authors propose the concepts of kernel vulnerability factor (KVF) and layer vulnerability factor (LVF), which indicate the probability of faults in a kernel or layer to affect the computation. KVF and LVF indicate the high-level portions of code that are more likely, if corrupted, to impact the application's output. KVF and LVF restrict the architecture/program vulnerability factor analysis to specific portions of the algorithm, easing the criticality analysis and the implementation of selective hardening. We apply the proposed metrics to two benchmarks, Histogram of Oriented Gradients (HOG) and You Only Look Once (YOLO). We measure the KVF for HOG by using fault injection at both the architectural level and the high level. We propose for HOG an efficient selective hardening technique able to detect 85% of critical errors with an overhead in performance as low as 118%. For YOLO, we study the LVF with architectural-level fault injection. We qualify the observed corrupted outputs, distinguishing between tolerable and critical errors. Then, we propose a smart layer duplication that detects more than 90% of errors, with an overhead lower than 60%.
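A minimal sketch of how a kernel vulnerability factor could be tallied from fault-injection outcomes is given below: the fraction of faults injected into a kernel that corrupt the output, with a separate tally for critical (non-tolerable) corruptions. The log records are made up, and the paper's actual campaign and classification are more involved.

```python
# Hypothetical fault-injection log: each record says which kernel the
# fault was injected into and how the application output was affected.
# KVF for a kernel = (faults that corrupt the output) / (faults injected
# into that kernel); critical-KVF counts only non-tolerable corruptions.
from collections import defaultdict

injections = [                       # (kernel, outcome) -- made-up data
    ('gradient', 'masked'), ('gradient', 'tolerable'),
    ('gradient', 'critical'), ('histogram', 'masked'),
    ('histogram', 'masked'), ('histogram', 'critical'),
    ('svm', 'tolerable'), ('svm', 'masked'),
]

totals = defaultdict(int)
corrupted = defaultdict(int)
critical = defaultdict(int)

for kernel, outcome in injections:
    totals[kernel] += 1
    if outcome != 'masked':
        corrupted[kernel] += 1
    if outcome == 'critical':
        critical[kernel] += 1

for kernel in totals:
    kvf = corrupted[kernel] / totals[kernel]
    crit_kvf = critical[kernel] / totals[kernel]
    print(f"{kernel:10s} KVF={kvf:.2f} critical-KVF={crit_kvf:.2f}")
```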

16 citations


Journal ArticleDOI
TL;DR: A knowledge-based memetic algorithm (KBMA) is proposed for 3D NoC for successful mapping with standard network topologies and adopts power, area and delay as a cost function for an effective mapping.
Abstract: Due to increased demands for communication at low power, efficient application mapping has become vital in the area of network on chip (NoC). Optimisation of the architectural structure in on-chip design is essential to maximise the performance of the network and minimise the cost functions. To address this issue, a knowledge-based memetic algorithm (KBMA) is proposed for 3D NoC for successful mapping with standard network topologies. The proposed KBMA adopts power, area and delay as a cost function for an effective mapping. The competence of the proposed method is verified through comparison with other nature-inspired algorithms such as particle swarm optimisation and the genetic algorithm. The presented work is validated through four case studies, which include real application benchmarks of NoC and randomly generated benchmarks created using Task Graphs For Free (TGFF).

12 citations


Journal ArticleDOI
TL;DR: P-EdgeCoolingMode is capable of pro-actively monitoring performance and, based on the user's demand, taking the necessary action, making the proposed methodology highly suitable for implementation on existing as well as conceptual Edge devices utilising heterogeneous MPSoCs with dynamic voltage and frequency scaling (DVFS) capabilities.
Abstract: Thermal cycling, as well as spatial and thermal gradients, affects the lifetime reliability and performance of heterogeneous Multi-Processor Systems-on-Chips (MPSoCs). Conventional temperature management techniques are not intelligent enough to cater for performance, energy efficiency as well as the operating temperature of the system. In this study, the authors propose a light-weight novel thermal management mechanism (P-EdgeCoolingMode) in the form of an intelligent software agent, which monitors and regulates the operating temperature of the CPU cores to improve the reliability of the system while catering for performance requirements. P-EdgeCoolingMode is capable of pro-actively monitoring performance and, based on the user's demand, the agent takes the necessary action, making the proposed methodology highly suitable for implementation on existing as well as conceptual Edge devices utilising heterogeneous MPSoCs with dynamic voltage and frequency scaling (DVFS) capabilities. They validated the authors' methodology on the Odroid-XU4 MPSoC and the Huawei P20 Lite (HiSilicon Kirin 659 MPSoC). P-EdgeCoolingMode has been successful in reducing the operating temperature while improving performance and reducing power consumption for the chosen test cases compared with the state-of-the-art. For applications with demanding performance requirements, P-EdgeCoolingMode has been found to reduce power consumption by up to 30.62% in comparison to existing state-of-the-art power management methodologies.
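A simplified control-loop sketch of the kind of software agent described above is shown below: read the core temperature, step the DVFS frequency down when a hot threshold is crossed, and step it back up when the system is cool and the user demands performance. The thresholds and frequency steps are hypothetical, the Linux sysfs paths are only placeholders for whatever interface the target board exposes, and the actual P-EdgeCoolingMode agent is considerably more elaborate.

```python
# Simplified control-loop sketch (hypothetical thresholds and sysfs paths;
# the actual P-EdgeCoolingMode agent is more elaborate): periodically read
# the core temperature and step the DVFS frequency down when hot, or back
# up when cool and the user demands performance.
import time

FREQ_STEPS_KHZ = [600_000, 1_000_000, 1_400_000, 1_800_000, 2_000_000]
TEMP_HOT_C, TEMP_SAFE_C = 75.0, 65.0

def read_temp_c():
    # Placeholder for e.g. /sys/class/thermal/thermal_zone0/temp on Linux.
    with open('/sys/class/thermal/thermal_zone0/temp') as f:
        return int(f.read()) / 1000.0

def set_max_freq(khz):
    # Placeholder for the cpufreq scaling_max_freq interface.
    with open('/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq', 'w') as f:
        f.write(str(khz))

def control_loop(performance_demand_high, period_s=1.0):
    level = len(FREQ_STEPS_KHZ) - 1            # start at the highest step
    while True:
        temp = read_temp_c()
        if temp > TEMP_HOT_C and level > 0:
            level -= 1                          # throttle one step
        elif temp < TEMP_SAFE_C and performance_demand_high \
                and level < len(FREQ_STEPS_KHZ) - 1:
            level += 1                          # restore performance
        set_max_freq(FREQ_STEPS_KHZ[level])
        time.sleep(period_s)

# control_loop(performance_demand_high=True)   # run on a DVFS-capable board
```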

11 citations


Journal ArticleDOI
TL;DR: Both EDA and field programmable gate array (FPGA) experiment results show that the proposed co-training-based hardware Trojan detection method, which exploits inaccurate simulation models and unlabelled fabricated ICs, can detect unknown Trojans with high accuracy and recall.
Abstract: Most prior hardware Trojan detection approaches require golden chips as references. A classification-based golden chips-free hardware Trojan detection technique has been proposed in the authors' previous work. However, the algorithm in that work is trained on simulated ICs without considering the shift between simulation and silicon fabrication. In this study, a co-training-based hardware Trojan detection method that exploits inaccurate simulation models and unlabelled fabricated ICs is proposed to provide reliable detection capability when facing fabricated ICs, which eliminates the need for golden chips. Two classification algorithms are trained using simulated ICs. These two algorithms can identify different patterns in the unlabelled ICs during test time, and thus each can label some of these ICs for the further training of the other algorithm. Moreover, a statistical examination is used to choose which ICs to label for the other algorithm. A statistical confidence interval-based technique is also used to combine the hypotheses of the two classification algorithms. Furthermore, the partial least squares method is used to preprocess the raw data of the ICs for feature selection. Both EDA and field programmable gate array (FPGA) experiment results show that the proposed technique can detect unknown Trojans with high accuracy and recall.
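The co-training loop itself can be sketched as follows on synthetic data that stands in for side-channel measurements: two different classifiers are trained on the labelled "simulated" set, and in each round each one labels the unlabelled samples it is most confident about for the other to train on. The data, the 0.95 confidence threshold and the probability-averaging combination are illustrative simplifications of the paper's statistical examination and confidence-interval technique.

```python
# Co-training sketch on synthetic data (stands in for side-channel
# measurements of simulated vs. fabricated ICs).  Two different
# classifiers are trained on the labelled "simulated" set; each round,
# each one labels the unlabelled samples it is most confident about for
# the other to train on.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 400, 8
X = rng.normal(size=(n, d))
y = (X[:, :4].sum(axis=1) + 0.3 * rng.normal(size=n) > 0).astype(int)

X_lab, y_lab = X[:100], y[:100]          # "simulated ICs" (labelled)
X_unl, y_unl = X[100:], y[100:]          # "fabricated ICs" (unlabelled)

clf_a = LogisticRegression(max_iter=1000)
clf_b = SVC(probability=True)

Xa, ya = X_lab.copy(), y_lab.copy()      # growing training sets
Xb, yb = X_lab.copy(), y_lab.copy()
pool = np.arange(len(X_unl))

for _ in range(5):                       # a few co-training rounds
    clf_a.fit(Xa, ya)
    clf_b.fit(Xb, yb)
    if len(pool) == 0:
        break
    pa = clf_a.predict_proba(X_unl[pool]).max(axis=1)
    pb = clf_b.predict_proba(X_unl[pool]).max(axis=1)
    conf_a = pool[pa > 0.95]             # A's confident samples -> train B
    conf_b = pool[pb > 0.95]             # B's confident samples -> train A
    if conf_a.size:
        Xb = np.vstack([Xb, X_unl[conf_a]])
        yb = np.concatenate([yb, clf_a.predict(X_unl[conf_a])])
    if conf_b.size:
        Xa = np.vstack([Xa, X_unl[conf_b]])
        ya = np.concatenate([ya, clf_b.predict(X_unl[conf_b])])
    pool = np.setdiff1d(pool, np.union1d(conf_a, conf_b))

# Combine the two hypotheses by averaging their class probabilities.
proba = (clf_a.predict_proba(X_unl) + clf_b.predict_proba(X_unl)) / 2
print("accuracy on 'fabricated' set:", (proba.argmax(axis=1) == y_unl).mean())
```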

10 citations


Journal ArticleDOI
TL;DR: Results show that the proposed routing algorithm improves temperature variance by 9-39% and reduces the number of throttled routers by 16-86%, which is achieved at the cost of one extra virtual channel per physical channel in the XY-plane.
Abstract: Dynamic thermal management (DTM) techniques for three-dimensional (3D) Networks-on-Chip (NoCs) are employed to rescue the chip from thermal difficulties. Reactive routing algorithms, which utilise the popular router-throttling DTM technique, disregard the distribution of heat generation among routers, resulting in more throttled routers as well as long packet delays in throttled processing elements. This study proposes a reactive routing algorithm for 3D NoCs to (i) dynamically detour packets from hot zones containing throttled routers and (ii) minimise the number of required router throttlings in the network. The proposed routing algorithm defines two virtual networks to enhance the path diversity for packets in each layer of 3D NoCs. The selection of diverse paths distributes heat generation to alleviate the thermal variance. The proposed routing algorithm is analysed by the turn model to achieve deadlock freedom. The Access Noxim simulator is also used to evaluate the performance and the thermal behaviour of the proposed routing algorithm under a variety of conditions. Results show that the proposed routing algorithm improves temperature variance by 9-39% and reduces the number of throttled routers by 16-86%, which is achieved at the cost of one extra virtual channel per physical channel in the XY-plane.

10 citations


Journal ArticleDOI
TL;DR: A new output selection strategy called destination intensity and congestion aware (DICA) that uses both local and regional congestion information from adjacent and two hops away neighbours on the path to destination based on the channel and switch information to distribute traffic more equally over the network.
Abstract: The selection strategy is an essential part of an adaptive routing algorithm that influences the performance of networks-on-chip (NoCs). A selection strategy is used for selecting the best output channel from the available channels according to the network status. This study presents a new output selection strategy called destination intensity and congestion aware (DICA) that uses both local and regional congestion information from adjacent and two-hops-away neighbours on the path to the destination, based on the channel and switch information. Also, the proposed output selection strategy uses a new global congestion-aware scheme based on the destination node, called the destination congestion awareness method, to distribute traffic more equally over the network. The simulation results show that the DICA strategy consistently improves the performance in both throughput and average latency with minimal overhead in terms of area consumption for various synthetic and real application traffic patterns. In addition, the microarchitecture of the NoC routers is also presented in this study, and it shows that the proposed output selection strategy can be combined with any adaptive routing algorithm. The experimental results show that the average delay improvements of DICA over the buffer-level, neighbours-on-path and regional congestion awareness strategies are 87, 57 and 24%, respectively.
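A hedged sketch of a congestion-aware selection function in this spirit is given below: among the output ports the adaptive routing function admits, pick the one with the best combined score of local free buffers and the congestion reported one and two hops towards the destination. The weights and data layout are illustrative, not the exact DICA formulation.

```python
# Hedged sketch of a congestion-aware output-selection function: among
# the output ports an adaptive routing function allows, pick the one with
# the best combined score of local free buffers and congestion reported
# by the next one and two hops towards the destination.

def select_output(candidates, w_local=0.5, w_hop1=0.3, w_hop2=0.2):
    """candidates: list of dicts, one per admissible output port."""
    def score(port):
        return (w_local * port['free_buffers']        # local status
                - w_hop1 * port['hop1_congestion']    # adjacent neighbour
                - w_hop2 * port['hop2_congestion'])   # two hops away
    return max(candidates, key=score)['name']

ports = [
    {'name': 'EAST',  'free_buffers': 3, 'hop1_congestion': 5, 'hop2_congestion': 2},
    {'name': 'NORTH', 'free_buffers': 4, 'hop1_congestion': 1, 'hop2_congestion': 6},
]
print(select_output(ports))     # -> NORTH, the less congested admissible port
```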

8 citations


Journal ArticleDOI
TL;DR: The authors discuss the parallelisation and memory optimisation strategies of a computer vision application for motion estimation using the NVIDIA compute unified device architecture (CUDA) and address optimisation techniques for algorithms that surpass the GPU resources in either computation or memory resources for the CUDA architecture.
Abstract: As video processing technologies continue to grow in complexity and image resolution more quickly than central processing unit (CPU) performance, data-parallel computing methods will become even more important. In fact, the high-performance, data-parallel architecture of modern graphics processing units (GPUs) can minimise execution times by orders of magnitude or more. However, creating an optimal GPU implementation not only needs converting sequential implementations of algorithms into parallel ones but, more importantly, needs careful balancing of the GPU resources. It also requires an understanding of the bottlenecks and defects caused by memory latency and code computation. The challenge is even greater when an implementation exceeds the GPU resources. In this study, the authors discuss the parallelisation and memory optimisation strategies of a computer vision application for motion estimation using the NVIDIA compute unified device architecture (CUDA). It addresses optimisation techniques for algorithms that surpass the GPU resources in either computation or memory resources for the CUDA architecture. The proposed implementation reveals a substantial improvement in both speedup (SU) and peak signal-to-noise ratio (PSNR). Indeed, the implementation is up to 50 times faster than the CPU counterpart. It also provides an increase in the PSNR of the coded test sequence of up to 8 dB.
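For readers unfamiliar with the kernel being offloaded, a CPU reference of full-search block matching with the sum of absolute differences (SAD) is sketched below in plain NumPy; on the GPU each candidate displacement or block would map onto CUDA threads. The frames, block size and search radius are arbitrary placeholders, not the paper's test sequences.

```python
# CPU reference of the kind of kernel being offloaded: full-search block
# matching with the sum of absolute differences (SAD).  Frame contents
# are random placeholders.
import numpy as np

def full_search(cur, ref, bx, by, block=8, radius=4):
    """Best motion vector for the block at (by, bx) of the current frame."""
    target = cur[by:by + block, bx:bx + block].astype(np.int32)
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue                     # candidate falls outside the frame
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(target - cand).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))   # content shifted by (2, 3)
# Best displacement into the reference is (-2, -3) with SAD 0.
print(full_search(cur, ref, bx=24, by=24))
```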

8 citations


Journal ArticleDOI
TL;DR: This work investigates an alternative approach that exploits on-chip data locality to a large extent, through distributed shared memory systems that permit efficient reuse of on-chip mapped data in clusterised many-core architectures.
Abstract: Power-efficient architectures have become the most important feature required for future embedded systems. Modern designs, like those released on mobile devices, reveal that clusterisation is the way to improve energy efficiency. However, such architectures are still limited by the memory subsystem (i.e. memory latency problems). This work investigates an alternative approach that exploits on-chip data locality to a large extent, through distributed shared memory systems that permit efficient reuse of on-chip mapped data in clusterised many-core architectures. First, this work reviews the current literature on memory allocations and explores the limitations of cluster-based many-core architectures. Then, several memory allocations are introduced and benchmarked in terms of scalability, performance and energy against the conventional centralised shared memory solution, in order to reveal which memory allocation is the most appropriate for future mobile architectures. The results show that distributed shared memory allocations bring performance gains and opportunities to reduce energy consumption.

7 citations


Journal ArticleDOI
TL;DR: The authors outline a rigorous mathematical method to compute the wirelength of any embedding from the guest graph into the host graph and show that the computation of the optimal wirelength depends on finding optimal solutions for another graph partition problem, such as the edge isoperimetric problem, in that guest graph.
Abstract: In this study, the authors discuss the vertex congestion of any embedding from a guest graph into a host graph and outline a rigorous mathematical method to compute the wirelength of that embedding. Further, they show that the computation of the optimal wirelength depends on finding optimal solutions for another graph partition problem, namely the edge isoperimetric problem, in that guest graph. On the other hand, they consider an important variant of the popular hypercube network, the enhanced hypercube, and obtain the nested optimal solutions for the edge isoperimetric problem. As a combined output, they illustrate the authors' technique by embedding the enhanced hypercube into a caterpillar and, from that, deriving the linear layout of the enhanced hypercube. As another application of their technique, they embed the hypercube as well as the enhanced hypercube on the two-row extended grid structure with optimal wirelength for the first time, and show that the existing edge congestion technique cannot be used to solve this problem.

Journal ArticleDOI
TL;DR: A complete fluid-level synthesis considering all the essential goals together, instead of dealing with them in isolation, is proposed; it effectively handles the trade-off scenarios and provides flexibility to the designer to decide the threshold of each individual optimisation objective, leading to the construction of a good-quality solution as a whole.
Abstract: Production of a correct bioassay outcome is the foremost objective in digital microfluidic biochips (DMFBs). In high-frequency DMFBs, continuous actuation of electrodes leads to malfunctioning or even breakdown of the system. The improper functioning of a biochip tends to produce erroneous results. On the other hand, while transporting droplets, residues may get stuck to electrode walls and cause contamination to other droplets. To ensure a proper assay outcome, washing becomes mandatory, whose incorporation may delay bioassay completion time significantly. Furthermore, each wash droplet possesses a capacity constraint within which the residues can be washed off successfully. Evidently, the design objectives possess a large degree of trade-offs among themselves and must be tackled together to prepare an efficient platform. Here, the authors propose a complete fluid-level synthesis considering all the essential goals together instead of dealing with them in isolation. The presented approach effectively handles the trade-off scenarios and provides flexibility to the designer to decide the threshold of each individual optimisation objective, leading to the construction of a good-quality solution as a whole. The performance is evaluated over several benchmark bioassays.

Journal ArticleDOI
TL;DR: This study presents a problem-specific parallel pipelined field programmable gate array-based accelerator to reduce execution time when solving complex optimisation problems, and shows a promising average speedup over software and GPU implementations.
Abstract: Cuckoo search (CS) is a recent swarm intelligence-based meta-heuristic optimisation algorithm that has shown excellent results for a broad class of optimisation problems in diverse fields. However, CS is generally compute intensive and slow when implemented in software, requiring a large number of fitness function evaluations to obtain acceptable solutions. In this study, the authors present a problem-specific parallel pipelined field programmable gate array-based accelerator to reduce execution time when solving complex optimisation problems. Experiments conducted on a large number of well-known benchmark functions revealed that the hardware approach offers promising average speedups of 75× and 53× over software and GPU implementations, respectively.
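A minimal NumPy sketch of the cuckoo search loop that such an accelerator parallelises is shown below, using the classic sphere benchmark: Lévy-flight moves around the best nest, greedy replacement, and abandonment of a fraction pa of nests. The parameters are illustrative, and the fitness evaluations, which the FPGA pipelines in hardware, are ordinary Python calls here.

```python
# Minimal cuckoo-search sketch (NumPy, sphere benchmark): Levy-flight
# moves around the best nest, greedy replacement, and abandonment of a
# fraction pa of the nests.  Parameters are illustrative.
import numpy as np

def levy_step(size, beta=1.5, rng=None):
    # Mantegna's algorithm for Levy-distributed step lengths.
    from math import gamma, sin, pi
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma, size)
    v = rng.normal(0, 1, size)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(fitness, dim=10, n_nests=15, pa=0.25, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    nests = rng.uniform(-5, 5, size=(n_nests, dim))
    fit = np.apply_along_axis(fitness, 1, nests)
    for _ in range(iters):
        best = nests[fit.argmin()]
        # Levy flight around the current best nest.
        new = nests + 0.01 * levy_step((n_nests, dim), rng=rng) * (nests - best)
        new_fit = np.apply_along_axis(fitness, 1, new)
        better = new_fit < fit
        nests[better], fit[better] = new[better], new_fit[better]
        # Abandon a fraction pa of nests and build new random ones.
        abandon = rng.random(n_nests) < pa
        if abandon.any():
            nests[abandon] = rng.uniform(-5, 5, size=(abandon.sum(), dim))
            fit[abandon] = np.apply_along_axis(fitness, 1, nests[abandon])
    return fit.min()

sphere = lambda x: float(np.sum(x * x))        # classic benchmark function
print("best sphere value:", cuckoo_search(sphere))
```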

Journal ArticleDOI
TL;DR: The radix-4³ architecture is a memory-optimised parallel architecture which computes the 64-point FFT with the least execution time, and is implemented in UMC 40 nm CMOS technology with a clock frequency of 500 MHz and an area of 0.841 mm².
Abstract: Multi-dimensional Discrete Fourier Transforms (DFTs) play an important role in signal and image processing applications. Image reconstruction is a key component in signal processing applications like medical imaging, computer vision, face recognition etc. The two-dimensional fast Fourier transform (2D FFT) and inverse FFT play a vital role in reconstruction. In this paper we present a fast 64 × 64 point 2D FFT architecture based on the radix-4³ algorithm using a parallel unrolled radix-4³ FFT as the basic block. Our radix-4³ architecture is a memory-optimised parallel architecture which computes the 64-point FFT with the least execution time. The proposed architecture produces reordered output of both the 64-point one-dimensional (1D) FFT and the 64 × 64 point 2D FFT, without using any additional hardware for reordering. The proposed architecture has been implemented in UMC 40 nm CMOS technology with a clock frequency of 500 MHz and an area of 0.841 mm². The power consumption of the proposed architecture is 358 mW at 500 MHz. Energy efficiency (FFTs computed per unit of energy) is 341 points/Joule. The computation time of the 64 × 64 point FFT is 8.19 μs. ASIC implementation results show better performance of the proposed work in terms of computation time when compared with state-of-the-art implementations. The proposed architecture has also been implemented on a Virtex-7 FPGA, which gives comparable area.
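The two decompositions the architecture exploits can be checked in a few lines of NumPy, assuming nothing about the hardware itself: a 64-point FFT built from radix-4 stages (64 = 4^3), and the 64 × 64 2D FFT computed as 1D FFTs along rows and then columns, both verified against numpy.fft.

```python
# Hedged illustration (NumPy, not the UMC 40 nm datapath) of the two
# decompositions the architecture exploits: a 64-point FFT built from
# radix-4 stages (64 = 4**3), and the 64 x 64 2D FFT computed as 1D FFTs
# along rows and then columns.  Results are checked against numpy.fft.
import numpy as np

def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT (length must be a power of 4)."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    subs = [fft_radix4(x[r::4]) for r in range(4)]       # 4 interleaved sub-FFTs
    k = np.arange(n)
    tw = np.exp(-2j * np.pi * k / n)                      # twiddle factors W_N^k
    return sum(tw ** r * subs[r][k % (n // 4)] for r in range(4))

rng = np.random.default_rng(2)
sig = rng.normal(size=64) + 1j * rng.normal(size=64)
assert np.allclose(fft_radix4(sig), np.fft.fft(sig))      # 64-point 1D check

img = rng.normal(size=(64, 64))
rows = np.array([fft_radix4(row) for row in img])         # row FFTs
two_d = np.array([fft_radix4(col) for col in rows.T]).T   # then column FFTs
assert np.allclose(two_d, np.fft.fft2(img))               # 64 x 64 2D check
print("radix-4 based 1D and row-column 2D FFTs match numpy.fft")
```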

Journal ArticleDOI
TL;DR: This study presents a novel subthreshold Darlington pair-based NBTI degradation sensor that is less affected by process variation and has a maximum deviation of 0.0011 mV at a standby leakage current of 30 nA.
Abstract: Aggressive technology scaling has inevitably led to reliability becoming a major concern for modern high-speed and high-performance integrated circuits. A major reliability concern in nanoscale very-large-scale integration design is time-dependent negative bias temperature instability (NBTI) degradation. Owing to the increasing vertical oxide field and higher operating temperature, the threshold voltage of P-channel MOS transistors increases with time under NBTI. This study presents a novel subthreshold Darlington pair-based NBTI degradation sensor under stress conditions. The proposed sensor provides a high degree of linearity and sensitivity under subthreshold conditions. The Darlington pair used in the circuit provides stability, and the high input impedance of the circuit makes it less affected by process variations. Owing to its high sensitivity, the proposed sensor is well suited for sensing temperature variation, process variation and temporal degradation during measurement. The sensitivity of the proposed sensor at room temperature is 0.239 mV/nA under subthreshold conditions. The proposed sensor is less affected by process variation and has a maximum deviation of 0.0011 mV at a standby leakage current of 30 nA.

Journal ArticleDOI
TL;DR: In the experimental evaluation, it is observed that for small functions BDD gives more compact circuits than the other two IRs, but as the input size increases, MIG as the IR makes substantial improvements in cost parameters compared with BDD, reducing quantum cost by 39% on average.
Abstract: Reversible logic synthesis is one of the best-suited approaches to act as an intermediate step for synthesising Boolean functions on quantum technologies. For a given Boolean function, there are multiple possible intermediate representations (IRs), based on functional abstraction, e.g. truth table or decision diagrams, or circuit abstraction, e.g. binary decision diagram (BDD), and-inverter graph (AIG) and majority inverter graph (MIG). These IRs play an important role in building circuits, as the choice of an IR directly impacts the cost parameters of the design. In the authors' work, they analyse the effects of different graph-based IRs (BDD, AIG and MIG) and their usability in making efficient circuit realisations. Although the application of BDDs as an IR to represent large functions has already been studied, here they demonstrate a synthesis scheme taking AIG and MIG as IRs and make a comprehensive comparative analysis over all three graph-based IRs. In the experimental evaluation, it is observed that for small functions BDD gives more compact circuits than the other two IRs, but as the input size increases, MIG as the IR makes substantial improvements in cost parameters compared with BDD, reducing quantum cost by 39% on average. Along with the experimental results, a detailed analysis of the different IRs is also included to assess their ease of use in designing circuits.

Journal ArticleDOI
TL;DR: A lightweight error-detection architecture for AES, called high-throughput fault-resilient AES (HFA), is proposed, and the authors show that HFA achieves a high error-detection rate while keeping overheads reasonable.
Abstract: As more and more confidential information is transmitted securely, the use of cryptographic algorithms has expanded. However, existing cryptographic algorithms are subject to various malicious attacks. The fault injection attack is one of the most effective attacks, able to extract private information with inexpensive equipment and in a short amount of time. AES is a block cipher that is used in many critical applications. Here, a lightweight error-detection architecture for AES has been proposed, which the authors call high-throughput fault-resilient AES (HFA). In the proposed architecture, the authors use a parallel AES architecture, which contains four equivalent blocks, and split each block into two pipeline stages. The authors show that HFA achieves a high error-detection rate while keeping overheads reasonable.

Journal ArticleDOI
TL;DR: With reduced precision numerical formats, memory footprint, computing speed, and resource utilisation are improved and the energy efficiency of SNN implementation is also improved.
Abstract: In this study, reduced precision operations are investigated in order to improve the speed and energy efficiency of spiking neural network (SNN) implementation. Instead of using the 32-bit single-precision floating-point format, a small floating-point format and a fixed-point format are used to represent SNN parameters and to perform SNN operations. The analyses are performed on the training and inference of a leaky integrate-and-fire model-based SNN that is trained and used to classify the handwritten digits in the MNIST database. The analysis results show that for SNN inference, the floating-point format with a 4-bit exponent and 3-bit mantissa or the fixed-point format with a 6-bit integer part and 7-bit fraction can be used without any accuracy degradation. For training, a floating-point format with a 5-bit exponent and 3-bit mantissa or a fixed-point format with a 6-bit integer part and 10-bit fraction can be used to obtain full accuracy. The proposed reduced precision formats can be used in SNN hardware accelerator design, and the selection between floating point and fixed point can be determined by design requirements. A case study of SNN implementation on a field-programmable gate array device is performed. With reduced precision numerical formats, memory footprint, computing speed, and resource utilisation are improved. As a result, the energy efficiency of the SNN implementation is also improved.
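A hedged helper for experimenting with such formats is sketched below: it rounds a value to a small floating-point format with a chosen number of exponent and mantissa bits (e.g. the 4-bit-exponent/3-bit-mantissa inference format mentioned above) or to a fixed-point grid. It uses round-to-nearest with saturation and a simplified treatment of very small values, so it is an approximation aid, not the paper's exact quantisation rules.

```python
# Hedged quantisation helpers (round-to-nearest, saturating, simplified
# handling of tiny values): a small float with E exponent and M mantissa
# bits, and a fixed-point grid with the given integer and fraction bits.
import math

def quantise_float(x, exp_bits=4, man_bits=3):
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    m, e = math.frexp(abs(x))            # abs(x) = m * 2**e with m in [0.5, 1)
    exp = e - 1                          # rewrite as 1.f * 2**exp
    exp = max(min(exp, bias), 1 - bias)  # clamp to representable exponents
    scale = 2.0 ** (exp - man_bits)
    q = round(abs(x) / scale) * scale    # round mantissa to man_bits fraction bits
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    q = min(q, max_val)                  # saturate instead of overflowing
    return math.copysign(q, x)

def quantise_fixed(x, int_bits=6, frac_bits=7):
    step = 2.0 ** -frac_bits
    max_val = 2.0 ** (int_bits - 1) - step
    q = round(x / step) * step
    return max(min(q, max_val), -2.0 ** (int_bits - 1))

weight = 0.8372
print(quantise_float(weight, 4, 3))      # small-float value near 0.8372
print(quantise_fixed(weight, 6, 7))      # fixed-point value near 0.8372
```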

Journal ArticleDOI
TL;DR: Overall results show that a power-constrained S*FSM consumes about 5% more power than insecure FSMs with binary encodings, though with a penalty of a 95% increase in layout area.
Abstract: Security-centric components and systems, such as System-on-Chip early-boot communication protocols and ultra-specific lightweight devices, require a departure from minimalist design constructs. The need for built-in protection mechanisms, at all levels of design, is paramount to providing cost-effective, efficient, secure systems. In this work, Securely derived Finite State Machines (S*FSM) and power-aware S*FSM are proposed and studied. Overall results show that to provide an S*FSM, the typical FSM requires a 50% increase in the number of states and a 57% increase in the number of product terms needed to define the state transitions. These increases translate to a minimum encoding space increase of 70%, raising the average encoding length from 4.8 bits to 7.9 bits. When factoring in relaxed structural constraints for power and space mitigation, the respective increases of 53 and 67% raise the average number of bits needed to 7.3 and 7.9. Regarding power savings, current minimisation is possible for both FSMs and S*FSMs through the addition of encoding constraints with average current reductions of 30 and 70%, respectively. Overall, a power-constrained S*FSM consumes about 5% more power than insecure FSMs with binary encodings, though with a penalty of a 95% increase in layout area.

Journal ArticleDOI
TL;DR: An energy-efficient in-memory computing (IMC) kernel for linear classification is devised, along with an initial prototype, achieving power savings of over 6.4× compared with a conventional discrete system while improving reliability by 54.67%.
Abstract: Large-scale machine-learning (ML) algorithms require extensive memory interactions. Managing or reducing data movement can significantly increase the speed and efficiency of many ML tasks. Towards this end, the authors devise an energy-efficient in-memory computing (IMC) kernel for linear classification and design an initial prototype. The authors achieve power savings of over 6.4× compared with a conventional discrete system while improving reliability by 54.67%. The authors employ a split-data-aware technique to manage process, voltage, and temperature variations and to achieve fair trade-offs between energy efficiency, area requirements, and accuracy. The authors utilise a trimodal architecture with a hierarchical tree structure to further decrease power consumption. The authors also explore alternatives to the hierarchical tree structure with a significantly reduced number of linear regression blocks, while maintaining a competitive classification accuracy. Overall, the scheme provides a fast, energy efficient, and competitively accurate binary classification kernel.

Journal ArticleDOI
TL;DR: The intention of this work is to present an optimised framework that can be used as reliably as one implemented with precise operations, standard training algorithms and the same network structures and hyper-parameters.
Abstract: As Machine Learning applications increase the demand for optimised implementations in both embedded and high-end processing platforms, the industry and research community have been responding with different approaches to implement these solutions. This work presents approximations to arithmetic operations and mathematical functions that, associated with a customised adaptive artificial neural network training method based on RMSProp, provide reliable and efficient implementations of classifiers. The proposed solution does not rely on mixed operations with higher precision or complex rounding methods that are commonly applied. The intention of this work is not to find the optimal simplifications for specific deep learning problems but to present an optimised framework that can be used as reliably as one implemented with precise operations, standard training algorithms and the same network structures and hyper-parameters. By simplifying the 'half-precision' floating-point format and approximating the exponentiation and square root operations, the authors' work drastically reduces the field programmable gate array implementation complexity (e.g. −43 and −57% in two of the component resources). The reciprocal square root approximation is so simple that it could be implemented with combinational logic only. In a full software implementation for a mixed-precision platform, only two of the approximations compensate for the processing overhead of precision conversions.
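As an illustration of how cheap a reciprocal square root approximation can be, the well-known bit-level trick below treats a 32-bit float's bit pattern as an integer, shifts it and subtracts it from a magic constant, optionally refining with one Newton step. This is a classic technique of the same flavour as the one alluded to above, not necessarily the authors' exact method.

```python
# Classic bit-level reciprocal square root approximation for positive
# 32-bit floats: reinterpret the bits as an integer, shift and subtract
# from a magic constant, then (optionally) refine with one Newton step.
import struct

def rsqrt_approx(x, newton_steps=1):
    i = struct.unpack('<I', struct.pack('<f', x))[0]   # float bits as uint32
    i = 0x5F3759DF - (i >> 1)                          # magic-constant estimate
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)                # Newton-Raphson refinement
    return y

x = 2.0
print(rsqrt_approx(x, newton_steps=0))   # raw bit-trick estimate
print(rsqrt_approx(x, newton_steps=1))   # ~0.7069, close to 1/sqrt(2) = 0.7071
print(x ** -0.5)                         # reference value
```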

Journal ArticleDOI
TL;DR: The authors' results show that their test snippet generation approach not only leads to the production of test snippets that properly fit the proposed test architecture, but its final fault coverage is also comparable to, and even slightly better than, the fault coverage of the best existing methods.
Abstract: In the past decades, software-based self-testing (SBST), which is the testing of a processing core using its native instructions, has attracted much attention. However, efficient SBST of a processing core that is deeply embedded in a multicore architecture is still an open issue. In this study, inspired by built-in self-test methods, the authors place a number of hardware test components next to the processing cores in order to overcome existing SBST challenges. These test components facilitate quick testing of embedded cores by providing several mechanisms such as virtual fetch, virtual jump, fake load & store, and segmented test application. In order to enable segmented test application, they propose the concept of the test snippet and a test snippet generation approach. The result is the capability of testing embedded cores in short idle times, leading to efficient online testing of the cores with zero performance overhead. The authors' results show that their test snippet generation approach not only leads to the production of test snippets that properly fit the proposed test architecture, but its final fault coverage is also comparable to, and even slightly better than, the fault coverage of the best existing methods.

Journal ArticleDOI
TL;DR: The authors address the issue of 3D IC testing using a genetic algorithm-based approach to decrease test time, considering variable partitions with or without certain power limits.
Abstract: The interconnect between the cores of a System-on-Chip (SOC) degrades circuit performance by contributing to circuit delay and power consumption. To reduce this problem, SOC-based three-dimensional (3D) integrated circuit (IC) technology has emerged as a promising solution, where multiple layers are stacked together, decreasing the length of the interconnect. However, 3D ICs introduce some new problems, including more complexity in test generation. Testing of 3D ICs requires a test access architecture called the Test Access Mechanism (TAM) for the purpose of transporting test stimuli to the cores placed in different layers. During testing, due to increased switching activity, any circuit demands higher power consumption, and this becomes more acute for 3D ICs. Moreover, testing of 3D ICs has other constraints. In this study, the authors address the issue of 3D IC testing using a genetic algorithm-based approach to decrease test time. At first, the available TAM width is partitioned into some fixed groups, and the goal is to find a partitioning of the TAM and a distribution of cores among layers that decreases test time. Next, they do the same considering variable partitions with or without certain power limits. Experimental results establish the efficacy of the authors' method.

Journal ArticleDOI
TL;DR: Here, the authors propose an energy-efficient codec design using a rate-0.91 systematic quasi-cyclic low-density parity-check (QC-LDPC) code, and a cost-effective early termination (ET) scheme is presented for efficiently terminating the decoding iterations while maintaining desirable correcting performance.
Abstract: Here, the authors propose an energy-efficient codec design using a rate-0.91 systematic quasi-cyclic low-density parity-check (QC-LDPC) code. A cost-effective early termination (ET) scheme is presented for efficiently terminating the decoding iterations while maintaining desirable correcting performance. Compared with no ET scheme, the cost-effective ET scheme achieves a 54.6% energy reduction with 1.7% area overhead. Finally, the proposed QC-LDPC codec employing the cost-effective ET scheme is implemented in a prototyping chip with a 9.86 mm² core area using the TSMC 90 nm CMOS technology. Compared with other decoder chips, the prototyping codec operating at 278 MHz achieves the best decoding energy efficiency of 156 pJ/bit with a high decoding throughput of 4.3 Gbps. The prototyping codec also achieves a high encoding throughput of 4.4 Gbps.
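Early-termination schemes of this kind build on the standard syndrome test, sketched below: stop iterating once H·c^T = 0 (mod 2) for the current hard decision. The tiny parity-check matrix and LLR convention are toys, not the paper's rate-0.91 QC-LDPC code, and the paper's cost-effective ET adds further refinements on top of this basic check.

```python
# Sketch of the syndrome test that early-termination schemes build on:
# stop decoding once H . c^T = 0 (mod 2) for the current hard decision.
import numpy as np

H = np.array([[1, 1, 0, 1, 0, 0],        # toy parity-check matrix
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])

def syndrome_ok(hard_decision):
    """True when every parity check is satisfied -> iterations can stop."""
    return not np.any(H.dot(hard_decision) % 2)

def decode(llr, max_iters=10):
    for it in range(1, max_iters + 1):
        hard = (llr < 0).astype(int)      # hard decision from current LLRs
        if syndrome_ok(hard):
            return hard, it               # early termination
        # ... one belief-propagation iteration would update llr here ...
    return hard, max_iters

codeword = np.array([1, 0, 1, 1, 1, 0])   # satisfies all three checks
llr = np.where(codeword == 1, -2.0, 2.0)  # noiseless channel LLRs
print(decode(llr))                        # terminates after one syndrome check
```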

Journal ArticleDOI
TL;DR: The authors propose ECAP, an energy-efficient caching strategy for prefetch blocks, which uses the less-used cache sets of nearby tiles running light applications as virtual cache memories for the tiles running heavy applications, in which to place the prefetch blocks.
Abstract: With the increase in the number of processing cores, performance has increased, but energy consumption and memory access latency have become crucial factors in determining system performance. In a tiled chip multiprocessor, tiles are interconnected using a network and different applications run in different tiles. Non-uniform load distribution of applications results in varying L1 cache usage patterns. An application with a larger memory footprint uses most of its L1 cache. Prefetching on top of such an application may cause cache pollution by evicting useful demand blocks from the cache. This generates further cache misses, which increases the network traffic. Therefore, an inefficient prefetch block placement strategy may result in generating more traffic, which may increase congestion and power consumption in the network. This also dampens the packet movement rate, which increases the miss penalty at the cores, thereby affecting Average Memory Access Time (AMAT). The authors propose ECAP, an energy-efficient caching strategy for prefetch blocks. It uses the less-used cache sets of nearby tiles running light applications as virtual cache memories for the tiles running heavy applications, in which to place the prefetch blocks. ECAP reduces AMAT, router power and link power in the NoC by 23.54%, 14.42%, and 27%, respectively, as compared to the conventional prefetch placement technique.
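The AMAT argument above rests on the standard relation (textbook form, not specific to this paper): AMAT = T_hit + m_miss × T_penalty, where T_hit is the cache hit time, m_miss the miss rate and T_penalty the miss penalty. Prefetch-induced cache pollution raises m_miss, and the extra network congestion it generates raises T_penalty, so both effects push AMAT up.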

Journal ArticleDOI
TL;DR: An optimum energy-efficient real-time scheduling scheme is presented that adjusts voltage dynamically to achieve optimum throughput and significantly reduces the total energy consumption of the system with respect to some popular and relatively new scheduling schemes.
Abstract: One of the critical design issues in real-time systems is energy consumption, especially in battery-operated systems. Generally, a higher processor voltage produces higher system throughput, while decreasing the voltage enables energy minimisation. Instead of simply lowering the processor voltage, this paper presents an optimum energy-efficient real-time scheduling scheme that adjusts voltage dynamically to achieve optimum throughput. Earlier research works have considered random new tasks, which have been divided into jobs using pfair scheduling to fit into the idle times of different cores of the system. In this paper we consider that each job has different power levels and that the execution time at each power level can be found using normalised execution time. Based on the power levels and their corresponding execution times, we find different combinations of the energy signature of the system and derive the optimum state of the system using a weighted average of the energy of the system and the corresponding throughput. We verify the proposed model using generated task sets, and the results show that the model performs excellently in all cases and significantly reduces the total energy consumption of the system with respect to some popular and relatively new scheduling schemes.
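The selection of the optimum state described above can be sketched as follows, with hypothetical jobs, weights and normalisation: enumerate the power-level combinations (energy signatures), score each by a weighted average of normalised energy and throughput, and keep the best.

```python
# Hedged sketch (hypothetical jobs, weights and normalisation) of the
# selection step: enumerate power-level combinations ("energy
# signatures"), score each by a weighted average of normalised energy
# and throughput, and keep the optimum state.
from itertools import product

# Each job: {power_level: (power in W, execution time in s)}.
jobs = [
    {'low': (0.5, 4.0), 'high': (1.2, 2.0)},
    {'low': (0.4, 6.0), 'high': (1.0, 2.5)},
]
W_ENERGY, W_THROUGHPUT = 0.6, 0.4          # illustrative weights

def evaluate(combo):
    energy = sum(jobs[i][lvl][0] * jobs[i][lvl][1] for i, lvl in enumerate(combo))
    makespan = sum(jobs[i][lvl][1] for i, lvl in enumerate(combo))
    throughput = len(jobs) / makespan       # jobs completed per second
    return energy, throughput

states = [(combo, *evaluate(combo))
          for combo in product(['low', 'high'], repeat=len(jobs))]
e_max = max(e for _, e, _ in states)
t_max = max(t for _, _, t in states)

def score(state):
    _, energy, throughput = state
    # Lower normalised energy and higher normalised throughput are both rewarded.
    return W_ENERGY * (1 - energy / e_max) + W_THROUGHPUT * (throughput / t_max)

best = max(states, key=score)
print("optimum state:", best[0], "energy:", round(best[1], 2), "J",
      "throughput:", round(best[2], 3), "jobs/s")
```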

Journal ArticleDOI
TL;DR: An extended two-dimensional mesh Network-on-Chip architecture for region-based fault tolerant routing methods that has an additional track of links and switches at the four sides of a mesh network so that it can partially reconfigure the network around faulty regions to provide new detour paths.
Abstract: This paper proposes an extended two-dimensional mesh Network-on-Chip architecture for region-based fault-tolerant routing methods. The proposed architecture has an additional track of links and switches at the four sides of a mesh network so that it can partially reconfigure the network around faulty regions to provide new detour paths. This makes it possible to simplify the complex routing rules of existing fault-tolerant routing methods and avoid long detour routing paths. A modified routing method is also proposed for the new architecture, and its deadlock freedom is proved. Simulation results show that the proposed architecture with the modified routing method reduces the average communication latency by about 39% compared to the existing state-of-the-art method, at the expense of low hardware overhead.

Journal ArticleDOI
TL;DR: It is demonstrated that system reliability is improved when the more vulnerable components are checked more frequently than when they are checked in round-robin order, and a genetic algorithm is proposed for finding a voter checking schedule that maximises the reliability of TMR–MER systems.
Abstract: Field-programmable gate arrays are susceptible to radiation-induced single event upsets. These are commonly dealt with using triple modular redundancy (TMR) and module-based configuration memory error recovery (MER). By triplicating components and voting on their outputs, TMR helps localise configuration memory errors, and by reconfiguring faulty components, MER swiftly corrects them. However, the order in which TMR voters are checked inevitably impacts the overall system reliability. In this study, the authors outline an approach for computing the reliability of TMR–MER systems that consist of finitely many components. They demonstrate that system reliability is improved when the more vulnerable components are checked more frequently than when they are checked in round-robin order. They propose a genetic algorithm for finding a voter checking schedule that maximises the reliability of TMR–MER systems. Results indicate that the mean time to failure (MTTF) of these systems can be increased by up to 400% when variable-rate voter checking (VRVC) is used instead of round robin. They show that VRVC achieves a 15–23% increase in MTTF with a 10× reduction in checking frequency to reduce system power. They also found that VRVC detects errors 44% faster on average than round robin.

Journal ArticleDOI
TL;DR: A low power single lead electrocardiogram front-end acquisition system in 0.18 μm CMOS operating at 0.5 V and using a moving average voltage to time converter to get amplification and anti-aliasing in the time domain is presented.
Abstract: A low-power single-lead electrocardiogram front-end acquisition system in 0.18 μm CMOS operating at 0.5 V is presented here. The analogue blocks, i.e. the low-noise amplifier (LNA), filters and passive elements that perform amplification and DC offset cancellation, are replaced by a moving-average voltage-to-time converter (MA-VTC) to obtain amplification and anti-aliasing in the time domain. A digital feedback algorithm is used to cancel out the DC offset. The front-end structure is designed in the sub-threshold region of MOS operation to reduce the power consumption of the circuit. The proposed architecture consumes 50 nW of power with a gain of 670 μs/V. The output of the front-end is fed to an all-digital time-to-digital converter (TDC) that operates in the near-threshold region with a resolution of 586.4 ps and 32.5 μW power consumption.

Journal ArticleDOI
TL;DR: Policies are proposed to reduce the leakage power consumption of NoC buffers by the use of non-volatile spin transfer torque random access memory (STT-RAM)-based buffers; they improve lifetime by 3.2 times and 1093 times, respectively.
Abstract: With the advancement of CMOS technology and multiple processors on the chip, communication across these cores is managed by a network-on-chip (NoC). The power and performance of these NoC interconnects have become significant factors. The authors aim to reduce the leakage power consumption of NoC buffers by the use of non-volatile spin transfer torque random access memory (STT-RAM)-based buffers. STT-RAM technology has the advantages of high density and low leakage but suffers from low endurance. This low endurance has an impact on the lifetime of the router as a whole due to unwanted write variations governed by virtual channel (VC) allocation policies. Here, various VC allocation policies that help distribute writes uniformly across the buffers are proposed. Iso-capacity and iso-area-based alternatives to replace SRAM buffers with STT-RAM buffers are also presented. Pure STT-RAM buffers, however, impact the network latency. To mitigate this, a hybrid variant of the proposed policies is proposed, which uses alternative VCs made of SRAM technology in the case of heavy network traffic. Experimental evaluation through full-system simulation shows that the proposed policies reduce the write variation by 99% and improve lifetime by 3.2 times and 1093 times, respectively. Also, a 55.5% gain in the energy-delay product is obtained.