
Showing papers in "IET Computers and Digital Techniques" in 2020


Journal ArticleDOI
TL;DR: A new 11T static random access memory (SRAM) cell uses power gating transistors and a transmission gate for low leakage and reliable write operation; it solves the row half-select disturbance and utilises a row-based virtual ground signal to eliminate unnecessary bit-line discharge in the un-selected row, thus decreasing energy consumption.
Abstract: This study aims for a new 11T static random access memory (SRAM) cell that uses power gating transistors and a transmission gate for low leakage and reliable write operation. The proposed cell has separate read and write paths, which successfully improves read and write abilities. Furthermore, it solves the row half-select disturbance and utilises a row-based virtual ground signal to eliminate unnecessary bit-line discharge in the un-selected row, thus decreasing energy consumption. The cell also achieves low power due to the stack effect. To show the effectiveness of the cell, its design metrics are compared with other published SRAM cells, namely, conventional 6T, 10T, 9T, and power-gated 9T (PG9T). In standby mode, a 6.71 to 7.37% leakage power reduction is observed for this cell at an operating voltage of 1.2 V, along with 29.21 to 58.68% and 32.74 to 71.11% improvements in write and read power, respectively, over the other cells. The proposed cell exhibits higher write and read static noise margins, with improvements of 13.54 and 63.28%, respectively, compared to the conventional 6T SRAM cell. The cell provides write delay improvements of 29.77 to 49.40% and read delay improvements of 7 to 12% compared to 9T, 10T, and PG9T.

29 citations


Journal ArticleDOI
TL;DR: The authors review hardware Trojan designs and implementations in the last decade and also provide an outlook, focusing on the attacker's methods, capabilities, and challenges when the attacker designs and implements a hardware Trojan.
Abstract: Hardware Trojan detection techniques have been studied extensively. However, to develop reliable and effective defenses, it is important to figure out how hardware Trojans are implemented in practical scenarios. The authors review hardware Trojan designs and implementations in the last decade and also provide an outlook. Unlike all previous surveys that discuss Trojans from the defender's perspective, for the first time, the authors study Trojans from the attacker's perspective, focusing on the attacker's methods, capabilities, and challenges when designing and implementing a hardware Trojan. First, the authors present adversarial models in terms of the adversary's methods, capabilities, and challenges in seven practical hardware Trojan implementation scenarios: in-house design team attacks, third-party intellectual property vendor attacks, computer-aided design tools attacks, fabrication stage attacks, testing stage attacks, distribution stage attacks, and field-programmable gate array Trojan attacks. Second, the authors analyse the hardware Trojan implementation methods under each adversarial model in terms of seven aspects/metrics: hardware Trojan attack scenarios, the attacker's motivation, feasibility, detectability (anti-detection capability), protection and prevention suggestions for the designer, overhead analysis, and case studies of Trojan implementations. Finally, future directions on hardware Trojan attacks and defenses are also discussed.

27 citations


Journal ArticleDOI
TL;DR: This study presents a high throughput field-programmable gate array (FPGA) implementation of advanced encryption standard-128 (AES-128), a well-known symmetric key encryption algorithm with high security against different attacks that is widely used in different applications.
Abstract: This study presents a high throughput field-programmable gate array (FPGA) implementation of advanced encryption standard-128 (AES-128). AES is a well-known symmetric key encryption algorithm with high security against different attacks that is widely used in different applications. The main goal of this study is to design a high-throughput, FPGA-efficient (FPGA-Eff) cryptosystem for high-traffic applications. To achieve high throughput, loop-unrolling and inner and outer pipelining techniques are employed. In AES, substitution bytes (Sub-Bytes) is one of the costly functions, occupying a large number of resources and incurring a large delay. To reduce the area of Sub-Bytes, a new affine transformation, which is the combination of the inverse isomorphic and affine transformations, is proposed and employed. Besides that, AES has been modified according to the proposed architecture. For the first nine rounds, Shift-Rows and Sub-Bytes have been exchanged, and Shift-Rows is merged with Add-Round-Key. To equalise the latency between stages, Mix-Columns is divided into two different stages. AES is implemented in counter mode on a Xilinx Virtex-5 using VHDL. The proposed implementation achieves a throughput of 79.7 Gbps, FPGA-Eff of 13.3 Mbps/slice, and a frequency of 622.4 MHz. Compared to the state-of-the-art work, the proposed design improves data throughput by 8.02% and FPGA-Eff by 22.63%.
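
The headline throughput follows directly from the architecture: a fully unrolled, fully pipelined core accepts one 128-bit block per clock. A quick back-of-the-envelope check using only the figures reported above:

```python
# Throughput of a fully unrolled, fully pipelined AES core:
# one 128-bit block is accepted per clock cycle.
freq_hz = 622.4e6   # reported clock frequency
block_bits = 128    # AES block width
print(freq_hz * block_bits / 1e9, "Gbps")  # ~79.7 Gbps, matching the paper
```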

19 citations


Journal ArticleDOI
TL;DR: The proposed design can be considered an efficient alternative to traditional adjacent error correcting decoders in resource-constrained applications.
Abstract: Multiple cell upsets (MCUs) caused by radiation are an important issue for the reliability of embedded static random access memories (SRAMs). Multiple random and adjacent error correcting codes have been employed extensively for several years to protect data stored in SRAMs against MCUs. A compact and fast error correcting codec is desirable in most of these applications. In this study, simplified expressions for the error location detection (ELD) block of single error correction-double error detection-double adjacent error correction (SEC-DED-DAEC) and single error correction-double error detection-triple adjacent error correction (SEC-DED-TAEC) decoders have been obtained by employing Karnaugh maps. The conventional SEC-DED-DAEC and SEC-DED-TAEC decoders have been designed and implemented on both field-programmable gate array (FPGA) and ASIC platforms using these simplified ELD expressions. On the FPGA platform, the proposed SEC-DED-DAEC and SEC-DED-TAEC decoder designs provide a 1.37–28.40% improvement in area and a maximum 14.74% improvement in delay compared to existing designs, whereas the ASIC-based designs provide a 2.20–26.81% reduction in area and a 0.30–28.96% reduction in delay compared to existing related works. The proposed design can therefore be considered an efficient alternative to traditional adjacent error correcting decoders in resource-constrained applications.
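
For readers unfamiliar with the ELD concept, the sketch below shows syndrome-based single error correction and double error detection on a small extended Hamming (8,4) code; it is illustrative only and does not reproduce the paper's Karnaugh-map-simplified expressions or its adjacent-error extensions:

```python
# Illustrative SEC-DED on extended Hamming (8,4): positions 1..7 form a
# Hamming(7,4) word, bit 0 holds the overall parity.
def encode(d):                      # d: four data bits
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d      # data positions (1-indexed Hamming layout)
    c[1] = c[3] ^ c[5] ^ c[7]       # parity over positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]       # parity over positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]       # parity over positions 4,5,6,7
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]  # overall parity
    return c

def decode(c):
    s = 0
    for i in range(1, 8):
        if c[i]:
            s ^= i                  # syndrome = XOR of set positions
    p = 0
    for bit in c:
        p ^= bit                    # overall parity check
    if s == 0 and p == 0:
        return c, "no error"
    if p == 1:                      # odd error count: single, correctable
        c[0 if s == 0 else s] ^= 1  # syndrome gives the error location
        return c, "single error corrected"
    return c, "double error detected"

word = encode([1, 0, 1, 1])
word[5] ^= 1                        # inject a single-bit error
print(decode(word))                 # corrected word
```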

13 citations


Journal ArticleDOI
TL;DR: This study aims to provide a comprehensive review of the purviews and insights provided by the extensive body of work related to Amdahl's law to date, focusing on computation speedup.
Abstract: For over 50 years, Amdahl's Law has been the hallmark model for reasoning about performance bounds for homogeneous parallel computing resources. As heterogeneous, many-core parallel resources continue to permeate into the modern server and embedded domains, there has been growing interest in promulgating realistic extensions and assumptions in keeping with newer use cases. This study aims to provide a comprehensive review of the purviews and insights provided by the extensive body of work related to Amdahl's law to date, focusing on computation speedup. The authors show that a significant portion of these studies has looked into analysing the scalability of the model considering both workload and system heterogeneity in real-world applications. The focus has been to improve the definition and semantic power of the two key parameters in the original model: the parallel fraction (f) and the computation capability improvement index (n). More recently, researchers have shown normal-form and multi-fraction extensions that can account for wider ranges of heterogeneity, validated on many-core systems running realistic workloads. Speedup models from Amdahl's law onwards have seen a wide range of uses, such as the optimisation of system execution, and these uses are even more important with the advent of the heterogeneous many-core era.
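
As a reference point, the original model that the surveyed works extend is one line of arithmetic; a minimal sketch with the two parameters named above, the parallel fraction f and the capability improvement index n:

```python
# Classical Amdahl speedup: a fraction f of the work is improved by a
# factor n, the remaining (1 - f) stays serial.
def amdahl_speedup(f: float, n: float) -> float:
    return 1.0 / ((1.0 - f) + f / n)

# Example: 95% parallel work on 64 cores caps speedup near 15.4x,
# far below the core count, which is the essence of Amdahl's argument.
print(amdahl_speedup(0.95, 64))  # ~15.42
```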

9 citations


Journal ArticleDOI
TL;DR: This survey study presents an analysis of various DRAM designs and their performances and focuses on the architecture, functionality, and performance of different hardware accelerators and PIM systems to reduce memory access time.
Abstract: A major issue faced by data scientists today is how to scale up their processing infrastructure to meet the challenge of big data and high-performance computing (HPC) workloads. In today's HPC domain, multiple graphics processing units (GPUs) must be connected alongside CPUs to accomplish large-scale parallel computing. Data movement between the processor and on-chip or off-chip memory creates a major bottleneck in overall system performance. The CPU/GPU processes all data held in the computer's memory, so the speed of data movement to/from memory and the size of the memory affect computer speed. During memory access by any processing element, the memory management unit (MMU) controls the data flow of the computer's main memory and impacts system performance and power; an effective MMU handles memory protection, cache control, and bus arbitration associated with the processors. Changes in dynamic random access memory (DRAM) architecture, integration of memory-centric hardware accelerators in heterogeneous systems, and processing-in-memory (PIM) are the techniques adopted from the available shared resource management techniques to maximise system throughput. This survey study presents an analysis of various DRAM designs and their performance. The authors also focus on the architecture, functionality, and performance of different hardware accelerators and PIM systems that reduce memory access time. Some insights and potential directions toward enhancing existing techniques are also discussed. The requirement for fast, reconfigurable, self-adaptive memory management schemes in high-speed processing scenarios motivates this survey.

8 citations


Journal ArticleDOI
TL;DR: A flexible structure is proposed that can perform various configurations of CLEFIA to support variable key sizes (128, 192, and 256 bit); results show improvements in terms of execution time, throughput, and throughput/area compared with other related works.
Abstract: In this study, high-throughput and flexible hardware implementations of the CLEFIA lightweight block cipher are presented. A unified processing element is designed and shared for implementing the generalised Feistel network, which computes the round keys and the encryption process at two separate times. The most complex blocks in the CLEFIA algorithm are the substitution boxes (S0 and S1). The S0 S-box is implemented based on area-optimised combinational logic circuits. In the proposed S-box structure, the number of logic gates and the critical path delay are reduced by simplifying the computation terms. The S-box S1 consists of three steps: a field inversion over GF(2^8) and two affine transformations over GF(2). The inversion operation is implemented over the composite field GF((2^4)^2) instead of GF(2^8), which is an important factor in reducing area consumption. In addition, the authors propose a flexible structure that can perform various configurations of CLEFIA to support variable key sizes: 128, 192, and 256 bit. Implementation results of the proposed architectures in 180 nm complementary metal-oxide-semiconductor technology for different key sizes are reported. The results show improvements in terms of execution time, throughput, and throughput/area compared with other related works.
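
A hedged illustration of the field-inversion step: every non-zero element of GF(2^8) satisfies x^255 = 1, so its inverse is x^254. The reduction polynomial below is an assumption chosen for the sketch; the paper instead maps the inversion into the composite field GF((2^4)^2) to save area.

```python
POLY = 0x11D  # assumed reduction polynomial x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a: int, b: int) -> int:
    # carry-less multiply with modular reduction in GF(2^8)
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def gf_inv(x: int) -> int:
    # x^254 by square-and-multiply (Fermat's little theorem)
    r, e, base = 1, 254, x
    while e:
        if e & 1:
            r = gf_mul(r, base)
        base = gf_mul(base, base)
        e >>= 1
    return r

assert gf_mul(gf_inv(0x53), 0x53) == 1  # inverse times element is 1
```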

8 citations


Journal ArticleDOI
TL;DR: The hardware implementation reduces the time frame to analyse the DNA sequence of Eukaryotic genes for protein formation, which plays a significant role in detecting individual diseases from genetic reports.
Abstract: In a Eukaryotic gene, identification of exon regions is crucial for protein formation. The period-3 property of exon regions has been used for their identification. An anti-notch infinite impulse response (IIR) filter is most commonly employed to recognise this period-3 property. The lattice structure realisation of the anti-notch IIR filter requires less hardware than direct form-II structures. In this study, a hardware implementation of the anti-notch IIR filter lattice structure is carried out on a Zynq-series (Zybo board) field programmable gate array (FPGA). The performance of the hardware design has been improved using techniques like retiming, pipelining, and unfolding, and is finally assessed on various Eukaryotic genes. The hardware implementation reduces the time frame to analyse the DNA sequence of Eukaryotic genes for protein formation, which plays a significant role in detecting individual diseases from genetic reports. The same evaluation is also carried out in the MATLAB simulation environment and the results are found to be similar. Application-specific integrated circuit (ASIC) implementation of the anti-notch filter lattice structure is also carried out with the CADENCE RTL compiler. It is observed that the FPGA implementation is 31 to 34 times faster and the ASIC implementation is 58 to 64 times faster than the MATLAB platform, with similar prediction accuracy.
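
A simplified software stand-in for the period-3 measurement (not the lattice IIR hardware): sliding-window spectral energy at frequency 2*pi/3 computed over binary indicator sequences. The toy sequence below is invented for illustration.

```python
import numpy as np

def period3_power(dna: str, window: int = 351) -> np.ndarray:
    # DFT coefficient at the period-3 frequency, per sliding window,
    # summed over the four base indicator sequences
    w = np.exp(-2j * np.pi * np.arange(window) / 3)
    power = np.zeros(len(dna) - window + 1)
    for base in "ACGT":
        x = np.array([1.0 if ch == base else 0.0 for ch in dna])
        s = np.convolve(x, w[::-1], mode="valid")  # windowed correlation
        power += np.abs(s) ** 2
    return power

rng = np.random.default_rng(0)
random_part = "".join(rng.choice(list("ACGT"), 600))   # intron-like noise
seq = random_part + "ATGGCC" * 100                     # exon-like periodic half
scores = period3_power(seq)
print(scores.argmax())  # the peak lies in the periodic (exon-like) region
```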

7 citations


Journal ArticleDOI
TL;DR: This study combines ternary static DCVSL (SDCVSL) with dynamic logic (DL) to realise ternary dynamic DCVSL (DDCVSL) by means of a single power source to reduce power consumption and introduce new logic styles.
Abstract: Every logic style has certain advantages for a specific application. Therefore, it is essential to introduce and investigate different logic styles. Differential cascode voltage switch logic (DCVSL), with its inherent redundancy, is known to be an ideal logic style for error detection applications. This study combines ternary static DCVSL (SDCVSL) with dynamic logic (DL) to realise ternary dynamic DCVSL (DDCVSL) by means of a single power source. First, it is shown why the static-to-dynamic conversion method used in binary logic fails to operate correctly in ternary logic. Then, two solutions are given. Static power dissipation and switching activity are particularly addressed in the second proposed ternary DDCVSL to reduce power consumption. The new designs are simulated and tested using the HSPICE simulator and the 32 nm Stanford carbon nanotube field effect transistor model. Simulation results and comparisons with a vast range of conventional and state-of-the-art competitors show the prominence and great potential of the new ternary circuit methodology. For example, the authors' second proposed ternary DDCVSL AND/NAND shows 19.7, 37.4, and 60.5% improvements in energy consumption over well-known static ternary logic styles such as CMOS-like, SDCVSL, and pseudo N-type, respectively.

7 citations


Journal ArticleDOI
TL;DR: This work presents a novel PIM concept, embedding the Akers array in QCA to achieve high-speed computing in the nano-scale era, and indicates the efficacy of QCA PIM over the conventional von Neumann architecture.
Abstract: Conventional computing systems face enormous pressure to cope with the rising demand for computing speed in today's world. In the search for high-speed computing at the nano scale, it becomes essential to explore viable alternatives that overcome the physical limits of complementary metal-oxide-semiconductor (CMOS) technology. In that direction, processing-in-memory (PIM) is growing in importance, as it keeps computation as close as possible to memory. It promises to outperform the latencies of the conventional stored-program concept by embedding storage and data computation in a single unit. The bit storing and processing capability of the Akers array provides the foundation of PIM. Quantum-dot cellular automata (QCA), in turn, is emerging as a promising nanoelectronic technology to replace CMOS and deliver fast devices in the nanoelectronics era. This work presents a novel PIM concept, embedding the Akers array in QCA to achieve high-speed computing at the nano scale. A QCA implementation of universal logic utilising the Akers array signifies its processing power and demonstrates its potential. A universal function is considered for testing the effectiveness of the proposed PIM cell. The performance evaluation indicates the efficacy of QCA PIM over the conventional von Neumann architecture.

6 citations


Journal ArticleDOI
TL;DR: The authors' ECCP design offers higher speed without any significant area overhead compared to recent designs reported in the literature, and can be ported to any field-programmable gate array family or standard ASIC libraries.
Abstract: Recent studies have shown that existing elliptic curve-based cryptographic standards provide backdoors for manipulation and hence compromise security. In this regard, two new elliptic curves, Curve448 and Curve25519, have recently been recommended by the IETF for future generations of transport layer security. Hence, cryptosystems built over these elliptic curves are expected to play a vital role in secure communications in the near future. A high-speed elliptic curve cryptographic processor (ECCP) for Curve448 is proposed in this study. The area of the ECCP is optimised by performing the different modular operations required for the elliptic curve Diffie–Hellman protocol through a unified architecture. The critical path delay of the proposed ECCP is optimised by adopting the redundant-signed-digit technique for arithmetic operations. A segmentation approach is introduced to reduce the number of clock cycles required by the ECCP. The proposed ECCP is developed using look-up tables (LUTs) only, and hence it can be ported to any field-programmable gate array family or standard ASIC libraries. The authors' ECCP design offers higher speed without any significant area overhead compared to recent designs reported in the literature.
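
For orientation, the operation such a processor accelerates is the x-only Montgomery ladder of RFC 7748. A plain-integer sketch follows; the paper's redundant-signed-digit arithmetic, unified modular unit, and segmentation are not modelled, and RFC scalar clamping is omitted.

```python
# x-only Montgomery ladder on Curve448 (RFC 7748 ladder step formulas).
P = 2**448 - 2**224 - 1   # Curve448 prime
A24 = 39081               # (A - 2) / 4 for Montgomery coefficient A = 156326

def x448_ladder(k: int, u: int) -> int:
    x1, x2, z2, x3, z3, swap = u, 1, 0, u, 1, 0
    for t in reversed(range(448)):
        kt = (k >> t) & 1
        if swap ^ kt:                     # conditional swap of working points
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = kt
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P           # differential addition
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P                  # doubling
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, z2 = x3, z3
    return x2 * pow(z2, P - 2, P) % P     # affine x = X/Z via Fermat inversion

# toy usage on the Curve448 base point u = 5 (no RFC 7748 scalar clamping)
print(hex(x448_ladder(3, 5)))
```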

Journal ArticleDOI
TL;DR: The study introduces an effective virtual machine (VM) migration strategy using an optimisation algorithm to facilitate users' selection of providers based on their budgetary requirements in running their own platforms.
Abstract: The cloud market is growing, and providers actively work to attract and retain both new and existing users in a competitive environment. There is good scope for improving provider capabilities in the cloud in order to satisfy users with attractive benefits. This study introduces an effective virtual machine (VM) migration strategy using an optimisation algorithm, designed to facilitate the users' selection of providers based on the budgetary requirements of running their own platforms. The constraints associated with provider selection include cost, revenue, and resources, which are together confined as an elective factor. The optimisation algorithm employed for the VM migration is a Taylor series-based salp swarm algorithm (Taylor-SSA), the integration of the Taylor series with SSA. The evaluation of the method proceeds using three setups that vary the number of providers and users. The cost, revenue, and resources of the proposed method are analysed, and the results show that the proposed method achieves minimal cost and maximal resource gain and revenue.
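
For context, a minimal sketch of the plain salp swarm algorithm that Taylor-SSA builds on; the Taylor-series modification and the cost/revenue/resource objective from the paper are not shown, and the sphere function below is a stand-in.

```python
import numpy as np

def ssa(cost, lb, ub, n_salps=30, iters=200, seed=0):
    # Vanilla SSA: one leader follows the best solution (food source),
    # each follower averages with the salp ahead of it in the chain.
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = len(lb)
    x = rng.uniform(lb, ub, size=(n_salps, dim))
    food = min(x, key=cost).copy()              # best solution so far
    for l in range(1, iters + 1):
        c1 = 2 * np.exp(-(4 * l / iters) ** 2)  # exploration/exploitation knob
        for i in range(n_salps):
            if i == 0:                          # leader
                c2, c3 = rng.random(dim), rng.random(dim)
                step = c1 * ((ub - lb) * c2 + lb)
                x[i] = np.where(c3 < 0.5, food + step, food - step)
            else:                               # follower
                x[i] = (x[i] + x[i - 1]) / 2
            x[i] = np.clip(x[i], lb, ub)
            if cost(x[i]) < cost(food):
                food = x[i].copy()
    return food

# toy usage: minimise a sphere function standing in for a migration cost
best = ssa(lambda v: float(np.sum(v ** 2)), lb=[-5] * 4, ub=[5] * 4)
print(best)
```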

Journal ArticleDOI
TL;DR: This study presents an analysis of the LFSR, using a known automatic test PG (ATPG) test set, and two techniques are undertaken to target difficult-to-detect faults with their respective trade-off analysis.
Abstract: Safety-critical technology rests on optimised and effective testing techniques for every embedded system involved in the equipment. Pattern generators (PGs) such as the linear feedback shift register (LFSR) are used for fault detection and are useful for reliability and online test. This study presents an analysis of the LFSR, using a known automatic test PG (ATPG) test set. Two techniques are undertaken to target difficult-to-detect faults, with their respective trade-off analysis. This is achieved using the Berlekamp-Massey (BM) algorithm with optimisations to reduce area overhead. The first (concatenated) technique combines all test sets, generating a single polynomial that covers the complete ATPG set (baseline-C); improvements are found with Algorithm 1, which reduces the polynomial size through X (don't-care) assignment. The second technique uses non-concatenated test sets and provides a group of LFSRs using BM without any optimisation (baseline-N); this is further optimised by selecting full mapping and independent polynomial expressions. Results are generated using 32 benchmarks and 65 nm technology. The concatenated technique reduces area overhead in 90.6% of cases, with a best case of 57% and a mean of 39%. For the remaining 9.4% of cases, the non-concatenated technique provides a best-case reduction of 37% with a mean of 1.4%, whilst achieving 100% test mapping in both cases.
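
The core tool here is classic; a minimal GF(2) Berlekamp-Massey sketch that returns the shortest LFSR generating a given bit sequence (the paper's area optimisations are not reproduced):

```python
def berlekamp_massey(bits):
    # Returns the feedback polynomial c[0..L] (c[0] = 1) and length L of
    # the shortest LFSR that generates `bits` over GF(2).
    n = len(bits)
    c, b = [0] * n, [0] * n
    c[0] = b[0] = 1
    L, m = 0, -1
    for i in range(n):
        d = bits[i]
        for j in range(1, L + 1):
            d ^= c[j] & bits[i - j]          # discrepancy
        if d:
            t = c[:]
            for j in range(n - (i - m)):
                c[j + i - m] ^= b[j]         # c(x) += x^(i-m) * b(x)
            if 2 * L <= i:
                L, m, b = i + 1 - L, i, t
    return c[:L + 1], L

# 1,1,0 repeating satisfies s[k] = s[k-1] XOR s[k-2]: a 2-stage LFSR
poly, L = berlekamp_massey([1, 1, 0, 1, 1, 0, 1, 1])
print(poly, L)  # [1, 1, 1], 2
```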

Journal ArticleDOI
TL;DR: In this article, a configurable self-calibrated power efficient five-bit error correction code is proposed to correct both single bit random and burst errors up to five bits; providing 100% error correction probability with crosstalk avoidance.
Abstract: A configurable, self-calibrated, power-efficient five-bit error correction code is proposed to correct both single-bit random errors and burst errors of up to five bits, providing 100% error correction probability with crosstalk avoidance. It can also correct higher-order errors of up to 9 bits with an error correction probability tolerance of 73% for on-chip interconnection links. Single error correction and double error detection with extended Hamming code (22,16) is utilised along with standard triplication error correction methods in the proposed code. A self-calibration algorithm and a data stream rerouting block are integrated into the error correction code to achieve power efficiency. Reliability, link power consumption, and link swing voltage are estimated using an analytical model of a network-on-chip. Area, power, and delay of the codec are obtained using Synopsys tools with UMC 90 nm technology. The proposed method provides 32–73% power saving and 22.3–60.6% delay reduction with negligible area overhead compared with state-of-the-art works. Estimated results show that it provides a 40.5–50% reduction in link swing voltage and link power consumption compared with state-of-the-art works. The proposed code is most appropriate for on-chip interconnect links, where it provides high reliability and low swing voltage with high error correction capability compared with existing codes.
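
The triplication component is simple to illustrate; a minimal majority-vote sketch (the extended Hamming (22,16) part and the self-calibration logic are not shown):

```python
# Triplication: each protected bit is sent three times and decoded by a
# 2-of-3 majority vote, which corrects any single error per triple.
def triplicate(bits):
    return [b for b in bits for _ in range(3)]

def majority_decode(coded):
    out = []
    for i in range(0, len(coded), 3):
        a, b, c = coded[i:i + 3]
        out.append((a & b) | (a & c) | (b & c))  # majority of three
    return out

sent = triplicate([1, 0, 1, 1])
sent[4] ^= 1                                     # inject a single-bit error
assert majority_decode(sent) == [1, 0, 1, 1]     # corrected transparently
```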

Journal ArticleDOI
TL;DR: In this paper, the authors propose a global qubit ordering technique that considers fewer permutations based on the number of interactions a qubit makes with the other qubits of its circuit, and perform local re-ordering of qubits by attempting to reduce the cost as much as possible.
Abstract: Quantum computers based on technologies like superconducting circuits and quantum dots impose a physical constraint that requires interacting qubits to be adjacent. The initial placement of qubits and the swap gate insertion technique affect the circuit cost. The authors propose a global qubit ordering technique that considers fewer permutations based on the number of interactions a qubit makes with the other qubits of its circuit. They also perform local re-ordering of qubits, attempting to reduce the cost as much as possible; the cost is estimated by defining a window with weights assigned such that gates near the current gate in question are given higher weightage. Experiments have been conducted on NCV benchmarks, and results have been compared with those of recent state-of-the-art techniques. Compared with existing works, the proposed method shows improvements of up to 53.3% for smaller benchmarks and up to 51.61% for larger benchmarks.
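
A hedged sketch of the global-ordering intuition: rank qubits by interaction count, place the busiest near the centre of a 1-D line, and score the result with a nearest-neighbour cost. The gate list and cost model below are invented for illustration and are not the paper's exact window-weighted metric.

```python
from collections import Counter

gates = [(0, 2), (0, 3), (1, 3), (0, 3), (2, 3)]  # two-qubit gates (invented)

degree = Counter()
for a, b in gates:
    degree[a] += 1
    degree[b] += 1

# alternately append/prepend so the busiest qubits end up near the centre
ranked = [q for q, _ in degree.most_common()]
line = []
for i, q in enumerate(ranked):
    line.append(q) if i % 2 == 0 else line.insert(0, q)
pos = {q: i for i, q in enumerate(line)}

def nn_cost(gates, pos):
    # each unit of distance beyond adjacency costs roughly one SWAP
    return sum(abs(pos[a] - pos[b]) - 1 for a, b in gates)

print(line, nn_cost(gates, pos))
```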

Journal ArticleDOI
TL;DR: This work provides a comparative analysis of various ML sensing techniques based on their pre-charging, evaluation, and performance improvement strategies; estimates of power dissipation and evaluation time are made, and an in-depth analysis of their power-speed-overhead trade-off is carried out on 64-bit CAM macros.
Abstract: The performance of a memory depends on storage stability, yield, and sensing speed. Differential input and the latching time of sense amplifiers are considered primary performance factors in static random access memory. In a content addressable memory (CAM), sensing is carried out through the matchline (ML), and the evaluation time is key to the search speed. The density of CAM is rising to accommodate a greater amount of information, which increases the associated power dissipation. Issues such as logical threshold variation and the low noise margin between match and mismatch are critical in the operation of a CAM. A good ML sensing technique can reduce ML power while enhancing evaluation speed. This work provides a comparative analysis of various ML sensing techniques based on their pre-charging, evaluation, and performance improvement strategies. Estimates of the power dissipation and evaluation time are made, and an in-depth analysis of their power-speed-overhead trade-off is carried out on 64-bit CAM macros.

Journal ArticleDOI
TL;DR: A novel method for identifying susceptible nets that are prone to Hardware Trojan insertion in a logic circuit is presented, along with an analysis of the impact of the number of trigger inputs and the distribution of trigger nets on the testability metrics of digital circuits.
Abstract: Insertion of malicious circuits, commonly known as Hardware Trojans, into an original integrated circuit (IC) design to alter its functionality has been a major concern in recent years. As a result, multiple techniques have been suggested by researchers over the years to combat these malicious threats. Hard-to-test nets in any logic circuit are the most vulnerable to insertion of Hardware Trojans. Testability analysis is the process of identifying these hard-to-test nets in a logic circuit and is achieved through the testability metrics of controllability and observability. These metrics can be used as a yardstick in devising efficient Hardware Trojan detection methods. The crux of this study is a novel method for identifying susceptible nets that are prone to Hardware Trojan insertion in a logic circuit. The study also presents a comprehensive analysis of the impact on testability parameters of Hardware Trojans in the identified susceptible nets. The method utilises the testability parameters of nets to define threshold values for isolating susceptible nets in a design. The study details the impact of the number of trigger inputs as well as the distribution of trigger nets on the testability metrics of digital circuits.
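
A small sketch of SCOAP-style controllability, the kind of testability metric such a method thresholds (standard rules for AND/OR gates shown; the paper's exact threshold selection is not reproduced):

```python
# SCOAP-style controllability: CC0/CC1 = cost of driving a net to 0/1.
# Primary inputs cost 1; each gate adds 1 plus the cost of its inputs.
def and_gate(in_cc):   # in_cc: list of (CC0, CC1) per input
    cc1 = sum(c1 for _, c1 in in_cc) + 1   # all inputs must be 1
    cc0 = min(c0 for c0, _ in in_cc) + 1   # any single input at 0 suffices
    return cc0, cc1

def or_gate(in_cc):
    cc1 = min(c1 for _, c1 in in_cc) + 1
    cc0 = sum(c0 for c0, _ in in_cc) + 1
    return cc0, cc1

pi = (1, 1)                       # primary-input controllability
n1 = and_gate([pi, pi, pi, pi])   # 4-input AND: CC1 grows with fan-in
n2 = and_gate([n1, pi])           # deep AND trees get ever harder to set to 1
print(n1, n2)                     # nets with CC1 above a threshold are flagged
```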

Journal ArticleDOI
TL;DR: This study proposes efficient very large scale integration (VLSI) architectures for lifting-based 3D-DWT using the (5,3) and (9,7) Daubechies wavelets, which achieve significant improvement in throughput over various existing designs.
Abstract: The discrete wavelet transform (DWT) is widely used in image and video compression due to its high compression ratio and resolution. This study proposes efficient very large scale integration (VLSI) architectures for lifting-based 3D-DWT using the (5,3) and (9,7) Daubechies wavelets. The advantage of these proposed architectures is the absence of storage buffers between the row, column, and temporal processes. Also, five and nine frames of the 3D signal can be processed in parallel using the proposed (5,3) and (9,7) lifting-based DWTs, respectively. Due to this parallelism and the elimination of storage buffers, the throughput of the proposed design is greater than that of other existing techniques. The authors have implemented all the existing and proposed 3D-DWTs using a 45 nm CMOS library with Cadence and an Artix-7 FPGA with Xilinx Vivado. The synthesis results show that the proposed designs achieve significant improvement in throughput over various existing designs. For example, the proposed (9,7) lifting-based 3D-DWT achieves an 85.4% improvement in throughput over the conventional design.
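
One level of the (5,3) lifting step is compact enough to show directly; a 1-D integer sketch in the JPEG2000 style, using simple boundary clamping rather than full symmetric extension:

```python
def dwt53_1d(x):
    # One 1-D level of the (5,3) lifting DWT: predict then update.
    even, odd = x[0::2], x[1::2]
    # predict: detail = odd - floor((left even + right even) / 2)
    d = [odd[i] - ((even[i] + even[min(i + 1, len(even) - 1)]) >> 1)
         for i in range(len(odd))]
    # update: approx = even + floor((left detail + right detail + 2) / 4)
    s = [even[i] + ((d[max(i - 1, 0)] + d[min(i, len(d) - 1)] + 2) >> 2)
         for i in range(len(even))]
    return s, d

s, d = dwt53_1d([10, 12, 14, 14, 13, 11, 8, 5])
print(s, d)  # low-pass (approximation) and high-pass (detail) subbands
```

A 3D transform cascades this same step across rows, columns, and frames; the paper's contribution is doing so without the intermediate storage buffers.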

Journal ArticleDOI
TL;DR: In this paper, a fast mode decision algorithm for the intra prediction module is proposed to reduce the number of intra prediction modes to be tested instead of performing a full intra mode search.
Abstract: High-efficiency video coding (HEVC) is the latest video coding standard, aimed at halving the bitrate for the same video quality compared to H.264/AVC. This encoding performance makes HEVC well suited to high-definition video applications. However, this performance comes with high computational complexity, which makes it hard to achieve real-time video encoding with a classic embedded processor. Multicore technology of programmable processors is a promising solution to overcome this computational complexity. Moreover, software optimisations that propose fast algorithms for the most complex functions can also speed up the encoding process. In this context, this study presents a fast mode decision algorithm for the intra prediction module. This algorithm reduces the number of intra prediction modes to be tested instead of performing a full intra mode search. Experimental results for the all-intra configuration show that the proposed fast intra mode decision saves up to 46.79% of the intra prediction time on average, while encoding performance in terms of video quality and bitrate is not significantly affected.
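
A hedged illustration of the general coarse-then-refine idea behind fast intra mode decision; the candidate subset and refinement rule here are invented, and `cost` stands in for the encoder's SATD/rate cost function:

```python
def fast_intra_mode(cost, coarse_step=4):
    # Stage 1: planar (0), DC (1), and a sparse subset of the 33 angular modes
    candidates = [0, 1] + list(range(2, 35, coarse_step))
    best = min(candidates, key=cost)
    # Stage 2: refine only around the best coarse angular candidate
    if best >= 2:
        refine = [m for m in (best - 2, best - 1, best + 1, best + 2)
                  if 2 <= m <= 34]
        best = min([best] + refine, key=cost)
    return best

# toy cost with a minimum at mode 26 (vertical)
mode = fast_intra_mode(lambda m: abs(m - 26))
print(mode)  # 26, found with far fewer evaluations than a full 35-mode search
```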

Journal ArticleDOI
TL;DR: A low power-delay-product (PDP) dynamic complementary metal oxide semiconductor (CMOS) circuit design using small swing domino logic with twist-connected transistors is proposed, leading to an improvement in PDP by using a node-discharger circuit in the conventional design.
Abstract: The incessant growth of devices such as mobile phones, digital cameras, and other portable electronic gadgets has led to a greater amount of research being dedicated to low-power digital and analogue circuits. In this study, a low power-delay-product (PDP) dynamic complementary metal oxide semiconductor (CMOS) circuit design using small-swing domino logic with twist-connected transistors is proposed. An improvement in PDP is achieved by adding a node-discharger circuit to the conventional design. The conventional benchmark and modified circuits are implemented in 90 nm CMOS technology with different power supplies, i.e. 1.2, 1, and 0.9 V. Furthermore, a decrease in the voltage level for logic '1' and an increase in the voltage level for logic '0' are achieved while maintaining the logic threshold at half the supply voltage. The output voltage swing is thus reduced, and the unnecessary nodes of the pull-down network are discharged in the pre-charge phase, eventually leading to overall PDP improvements of 43.21 and 46.83% over the conventional design for inverted two-input and three-input AND gate dynamic benchmarks, respectively, at a 1 V power supply.

Journal ArticleDOI
TL;DR: This study uses Markov Reward Models (MRMs) to model and evaluate a new core thermal management method, which can reduce hotspots and balance the thermal profile of a multi-core system.
Abstract: With successive scaling of CMOS technology, power density and cooling costs increase significantly. Consequently, the cooling system of processors can no longer be designed for the worst-case situation in each generation of CMOS technology, and there is an essential need for run-time techniques to control the operating temperature. Task scheduling and resource management with respect to thermal constraints are run-time methods used to control the thermal profile of a system. In this study, the authors use Markov Reward Models (MRMs) to model and evaluate a new core thermal management method, which can reduce hotspots and balance the thermal profile of a multi-core system. Although the proposed management method degrades system performance, as other previously presented methods do, it controls the temperature of the die to decrease temperature variation and hotspots. The proposed approach is assessed on a quad-core system, and the experimental results are compared to the results obtained from the proposed MRM to demonstrate the accuracy of the proposed analytical model.
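
A minimal discrete-time Markov reward sketch of the modelling style: solve for the stationary distribution over thermal states and weight it by a performance reward vector. The states, transition probabilities, and rewards below are invented for illustration.

```python
import numpy as np

P = np.array([[0.90, 0.10, 0.00],    # cool  -> cool/warm/hot
              [0.20, 0.70, 0.10],    # warm
              [0.00, 0.50, 0.50]])   # hot (throttled)
reward = np.array([1.0, 0.8, 0.4])   # relative throughput per state

# stationary pi solves pi P = pi with sum(pi) = 1 (solved in least squares)
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]

print(pi, pi @ reward)  # long-run state occupancy and expected performance
```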

Journal ArticleDOI
TL;DR: The authors propose an ILP model with three different objective functions, which include minimising access latency, minimising energy, and minimising energy-delay product in the hybrid cache; it obtains better results in terms of energy consumption and performance compared to the existing hybrid cache architecture.
Abstract: Spin-transfer torque random access memory (STT-RAM) has emerged as an eminent choice for the larger on-chip caches due to high density, low static power consumption and scalability. However, this technology suffers from long latency and high energy consumption during a write operation. Hybrid caches alleviate these problems by incorporating a write-friendly memory technology such as static random access memory along with STT-RAM technology. The proper allocation of data blocks has a significant effect on both performance and energy consumption in the hybrid cache. In this study, the allocation and migration problem of data blocks in the hybrid cache is examined and then modelled using integer linear programming (ILP) formulations. The authors propose an ILP model with three different objective functions which include minimising access latency, minimising energy and minimising energy-delay product in the hybrid cache. Evaluations confirm that the proposed ILP model obtains better results in terms of energy consumption and performance compared to the existing hybrid cache architecture.
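
A toy version of such an allocation ILP, assuming the PuLP library and invented access counts and energies (the paper's actual model also covers block migration and the latency and energy-delay-product objectives):

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# Assign each cache block to SRAM or STT-RAM to minimise total access
# energy, subject to capacity limits. All numbers are illustrative.
blocks = range(6)
reads  = [50, 10, 80, 5, 40, 20]             # read accesses per block
writes = [5, 40, 2, 30, 10, 25]              # write accesses per block
E = {"sram": (1.0, 1.0), "stt": (0.8, 4.0)}  # (read, write) energy per access
cap = {"sram": 2, "stt": 4}                  # blocks each technology holds

prob = LpProblem("hybrid_cache", LpMinimize)
x = {(b, t): LpVariable(f"x_{b}_{t}", cat=LpBinary) for b in blocks for t in E}
prob += lpSum(x[b, t] * (reads[b] * E[t][0] + writes[b] * E[t][1])
              for b in blocks for t in E)    # total energy objective
for b in blocks:                             # each block placed exactly once
    prob += lpSum(x[b, t] for t in E) == 1
for t in E:                                  # capacity of each technology
    prob += lpSum(x[b, t] for b in blocks) <= cap[t]

prob.solve()
print({b: t for (b, t), v in x.items() if v.value() == 1})
```

As expected, write-heavy blocks land in SRAM and read-heavy blocks in STT-RAM, which is the allocation effect the ILP formalises.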

Journal ArticleDOI
TL;DR: This study proposes an efficient method to implement concurrent fault detection for parallel CRC computation, using a serial CRC computation circuit to periodically check the results obtained from the main module and detect faults.
Abstract: As technology scales down, circuits are more prone to faults, and fault detection is necessary to ensure system reliability. However, fault-detection circuits are themselves vulnerable to stuck-at faults due to, for example, manufacturing defects or ageing; a fault can cause an incorrect output in the fault-detection scheme, so concurrent fault detection is needed. Cyclic redundancy checks (CRCs) are widely used to detect errors in many applications; for example, they are used in communication to detect errors in transmitted frames. In this study, an efficient method to implement concurrent fault detection for parallel CRC computation is proposed. The scheme relies on a serial CRC computation circuit that periodically checks the results obtained from the main module to detect faults. This introduces a lower circuit overhead than existing schemes. All CRC encoders and decoders that implement the CRC computation in parallel can employ the proposed scheme to detect faults.
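
The checking idea is easy to demonstrate in software: a bit-serial CRC re-computes what a faster parallel (here, table-driven) implementation produced, and any mismatch flags a fault. CRC-8 with polynomial 0x07 is used purely for illustration.

```python
POLY = 0x07  # illustrative CRC-8 polynomial

def crc8_serial(data: bytes) -> int:
    # bit-at-a-time LFSR: incoming bit XORs with the CRC MSB for feedback
    crc = 0
    for byte in data:
        for i in range(8):
            bit = (byte >> (7 - i)) & 1
            crc = ((crc << 1) & 0xFF) ^ (POLY if (crc >> 7) ^ bit else 0)
    return crc

TABLE = []
for b in range(256):
    c = b
    for _ in range(8):
        c = ((c << 1) & 0xFF) ^ (POLY if c & 0x80 else 0)
    TABLE.append(c)

def crc8_parallel(data: bytes) -> int:
    # byte-at-a-time lookup: one table access per input byte
    crc = 0
    for byte in data:
        crc = TABLE[crc ^ byte]
    return crc

msg = b"concurrent fault detection"
fast, check = crc8_parallel(msg), crc8_serial(msg)
assert fast == check  # a mismatch would flag a fault in the parallel datapath
```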

Journal ArticleDOI
TL;DR: This study proposes a scalable pseudo-exhaustive testing and diagnosis methodology for flow-based microfluidic biochips that employs a divide-and-conquer technique wherein large architectures are split into smaller sub-architectures and each is tested and diagnosed independently.
Abstract: Microfluidics is an upcoming field of science that is going to be used widely in many safety-critical applications including healthcare, medical research, and defence. Hence, technologies for fault testing and fault diagnosis of these chips are of extreme importance. In this study, the authors propose a scalable pseudo-exhaustive testing and diagnosis methodology for flow-based microfluidic biochips. The proposed approach employs a divide-and-conquer technique wherein large architectures are split into smaller sub-architectures, each of which is tested and diagnosed independently.

Journal ArticleDOI
TL;DR: Multi-core hardware realisation of the quasi-maximum-likelihood algorithm, the state-of-the-art estimator of polynomial phase signals (PPSs), is proposed in this study; the developed multiple-clock-cycle realisation is suitable for real-time implementation.
Abstract: A multi-core hardware realisation of the quasi-maximum-likelihood algorithm, the state-of-the-art estimator of polynomial phase signals (PPSs), is proposed in this study. The developed multiple-clock-cycle realisation is suitable for real-time implementation. To prove this, the proposed design is implemented on a field programmable gate array circuit. The hardware realisation is tested and verified on PPSs corrupted with various amounts of Gaussian noise. The obtained results are compared with software simulations, showing an excellent match between the proposed system-based and software-based outputs.
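
As a simplified baseline (not the quasi-ML estimator itself), the polynomial phase coefficients of a noiseless PPS can be recovered by unwrapping the instantaneous phase and fitting a polynomial:

```python
import numpy as np

# Generate a cubic polynomial phase signal and recover its coefficients.
# Coefficients are chosen so phase increments stay below pi per sample,
# which keeps np.unwrap valid.
t = np.arange(512) / 512.0
coeffs = [5.0, 20.0, 60.0, 0.3]            # phase polynomial, highest order first
x = np.exp(1j * np.polyval(coeffs, t))     # unit-amplitude PPS

est = np.polyfit(t, np.unwrap(np.angle(x)), deg=3)
print(np.allclose(est, coeffs, atol=1e-6))  # True in the noiseless case
```

With noise this naive approach degrades quickly, which is why estimators like quasi-ML, and hardware fast enough to run them in real time, matter.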

Journal ArticleDOI
TL;DR: This work presents a novel two-step method that combines the advantages of regular and irregular NoC topologies and shows the superiority of the proposed method over the existing work on several multimedia benchmarks.
Abstract: When designing a Network-on-Chip (NoC) architecture, designers must consider various criteria such as bandwidth, performance, energy consumption, cost, re-usability, and fault tolerance. In most design efforts, it is very difficult to meet all these interacting constraints and objectives at the same time. Some of these parameters can be optimised and met easily by regular NoC topologies due to their re-usability and fault-tolerance capabilities. On the other hand, parameters such as energy consumption, performance, and chip area can be better optimised in irregular NoC topologies. In this work, the authors present a novel two-step method that combines the advantages of regular and irregular NoC topologies. In the first step, the authors' method generates an energy- and area-optimised irregular topology for the given application by using a genetic algorithm. The generated topology uses the fewest routers and links to minimise area and energy; thus, it offers only one routing path between communicating nodes and is therefore not fault tolerant. In the second step, their method maps the generated irregular topology onto a reconfigurable mesh topology to make it fault tolerant. Detailed simulation results show the superiority of the proposed method over existing work on several multimedia benchmarks.

Journal ArticleDOI
TL;DR: Seeds are generated directly from functional boundary vectors, and the procedure is structured to explore the trade-off between the level of test data compression and the Hamming distance to functional boundary vectors, i.e. the proximity to functional operation conditions.
Abstract: This study considers the compression of a type of close-to-functional broadside tests called boundary-functional broadside tests when the on-chip decompression logic consists of a linear-feedback shift register (LFSR). Boundary-functional broadside tests maintain functional operation conditions on a set of lines (called a boundary) in a circuit. This limits the deviations from functional operation conditions by ensuring that they do not propagate across the boundary. Functional vectors for the boundary are obtained from functional broadside tests. Seeds for the LFSR are generated directly from functional boundary vectors without generating tests or test cubes. Considering the tests that the LFSR produces, the seed generation procedure attempts to obtain the lowest possible Hamming distance between their boundary vectors and functional boundary vectors. It considers multiple LFSRs with increasing lengths to achieve test data compression. The procedure is structured to explore the trade-off between the level of test data compression and the Hamming distances or the proximity to functional operation conditions.
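
The underlying machinery is easy to sketch: expand an LFSR seed into a test vector and measure its Hamming distance to a functional boundary vector. The taps, lengths, and vectors below are invented for illustration; the paper's seed generation procedure itself is not reproduced.

```python
def lfsr_expand(seed_bits, taps, length):
    # Fibonacci-style LFSR: output the last stage, feed XOR of taps back in.
    state = list(seed_bits)
    out = []
    for _ in range(length):
        out.append(state[-1])
        fb = 0
        for t in taps:
            fb ^= state[t]
        state = [fb] + state[:-1]
    return out

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

vec = lfsr_expand([1, 0, 0, 1, 0, 1, 1, 0], taps=[0, 2, 3, 7], length=16)
functional = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
print(hamming(vec, functional))  # seed selection seeks to minimise this
```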

Journal ArticleDOI
TL;DR: An approach called the density direction transform algorithm is proposed to eliminate the isomorphism of mapping sequences and accelerate the convergence of the population; the resulting DDGMAP performs better than the GA in searching for the optimal solution.
Abstract: With the development of network-on-chip (NoC) theory, many mapping algorithms have been proposed to solve the application mapping problem, which is NP-hard. Most are based on heuristic algorithms, and they are trapped by iteration limits rather than by the distance between iterations because of the isomorphism of mapping sequences. In this study, the authors define and analyse this isomorphism in the context of the genetic algorithm (GA), a heuristic algorithm. They then propose an approach called the density direction transform algorithm to eliminate the isomorphism of mapping sequences and accelerate the convergence of the population. To verify this approach, they develop a density-direction-based genetic mapping algorithm (DDGMAP) and compare it with a genetic mapping algorithm (GMA). The experiments demonstrate that, compared to a random algorithm, DDGMAP achieves an average 23.48% delay reduction and 7.15% power reduction, and DDGMAP performs better than the GA in searching for the optimal solution.
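
A toy version of the evolutionary mapping loop this builds on: a permutation maps cores to mesh tiles and fitness is communication volume weighted by hop distance. The traffic matrix, mesh size, and mutation-only search are invented for illustration; DDGMAP's density direction transform is not modelled.

```python
import random

W = 2  # 2x2 mesh of tiles
traffic = {(0, 1): 10, (1, 2): 5, (0, 3): 8, (2, 3): 3}  # core-to-core volume

def hops(t1, t2):
    # Manhattan distance between tiles on the mesh
    return abs(t1 % W - t2 % W) + abs(t1 // W - t2 // W)

def cost(mapping):  # mapping[i] = tile hosting core i
    return sum(v * hops(mapping[a], mapping[b]) for (a, b), v in traffic.items())

random.seed(1)
best = list(range(4))
for _ in range(200):                       # mutation-only evolutionary search
    cand = best[:]
    i, j = random.sample(range(4), 2)
    cand[i], cand[j] = cand[j], cand[i]    # swap two tile assignments
    if cost(cand) <= cost(best):
        best = cand
print(best, cost(best))
```

Note that swapping two assignments can yield a mapping with identical cost, the isomorphism that wastes iterations and that the density direction transform is designed to eliminate.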

Journal ArticleDOI
TL;DR: A new bus protocol named as integrated bus (IBUS), and more important, a configurable bus wrapper for connecting AXI3-interfaced IPs into IBUS is further proposed, aiming to finding the optimal balance between bus efficiency and resource cost in terms of field-programming gate array slice count, bus transfer latency, and energy consumption.
Abstract: Integrating third-party intellectual properties (IPs) into a new system-on-chip (SoC) architecture is a big challenge. Therefore, this study first presents a new bus protocol named the integrated bus (IBUS) and, more importantly, further proposes a configurable bus wrapper for connecting AXI3-interfaced IPs to the IBUS, aiming to find the optimal balance between bus efficiency and resource cost in terms of field-programmable gate array slice count, bus transfer latency, and energy consumption. As a case study, the authors implemented three IBUS wrappers for integrating three AXI3-interfaced verification IPs into an IBUS SoC. Experimental results show that the proposed work achieves a higher valid data throughput (1.35x in the block test and 1.52x in the cipher test) compared with designs based on conventional bridge-based SoC integration, as well as a large reduction in the normalised slice-time-power product (18.73% in the block benchmark and 23.45% in the cipher benchmark) when the same weights are given to slice count, data transfer latency, and energy dissipation.

Journal ArticleDOI
TL;DR: This study proposes a methodology which gleans high-level descriptions of the micro-architectural steps and uses them in an artificial intelligence planning framework to find alternative pathways through which a bug may return.
Abstract: Bug traces serve as references for patching a microprocessor design after a bug has been found. Unless the root cause of a bug has been detected and patched, variants of the bug may return through alternative bug traces, following a different sequence of micro-architectural events. To avoid such a situation, the verification engineer must think of every possible way in which the bug may return, which is a complex problem for a modern microprocessor. This study proposes a methodology which gleans high-level descriptions of the micro-architectural steps and uses them in an artificial intelligence planning framework to find alternative pathways through which a bug may return. The plans are then translated into simulation test cases which explore these potential bug scenarios. The planning tool essentially automates the verification engineer's task of exploring possible alternative sequences of micro-architectural steps that may allow a bug to return. The proposed methodology is demonstrated in three case studies.