
Showing papers in "IEEE Transactions on Computers in 2019"


Journal ArticleDOI
TL;DR: This paper proposes FastBFT, a fast and scalable BFT protocol that combines hardware-based trusted execution environments (TEEs) with lightweight secret sharing and has better scalability and performance than previous BFT protocols.
Abstract: The surging interest in blockchain technology has revitalized the search for effective Byzantine consensus schemes. In particular, the blockchain community has been looking for ways to effectively integrate traditional Byzantine fault-tolerant (BFT) protocols into a blockchain consensus layer allowing various financial institutions to securely agree on the order of transactions. However, existing BFT protocols can only scale to tens of nodes due to their $O(n^2)$ message complexity. In this paper, we propose FastBFT, a fast and scalable BFT protocol. At the heart of FastBFT is a novel message aggregation technique that combines hardware-based trusted execution environments (TEEs) with lightweight secret sharing. Combining this technique with several other optimizations (i.e., optimistic execution, tree topology and failure detection), FastBFT achieves low latency and high throughput even for large scale networks. Via systematic analysis and experiments, we demonstrate that FastBFT has better scalability and performance than previous BFT protocols.

164 citations


Journal ArticleDOI
TL;DR: NeST as discussed by the authors combines gradient-based growth and magnitude-based pruning of neurons and connections to learn both weights and compact DNN architectures during training, achieving significant additional parameter and FLOPs reduction relative to pruning-only methods.
Abstract: Deep neural networks (DNNs) have begun to have a pervasive impact on various applications of machine learning. However, the problem of finding an optimal DNN architecture for large applications is challenging. Common approaches go for deeper and larger DNN architectures but may incur substantial redundancy. To address these problems, we introduce a network growth algorithm that complements network pruning to learn both weights and compact DNN architectures during training. We propose a DNN synthesis tool (NeST) that combines both methods to automate the generation of compact and accurate DNNs. NeST starts with a randomly initialized sparse network called the seed architecture. It iteratively tunes the architecture with gradient-based growth and magnitude-based pruning of neurons and connections. Our experimental results show that NeST yields accurate, yet very compact DNNs, with a wide range of seed architecture selection. For the LeNet-300-100 (LeNet-5) architecture, we reduce network parameters by $70.2\times$ ($74.3\times$) and floating-point operations (FLOPs) by $79.4\times$ ($43.7\times$). For the AlexNet, VGG-16, and ResNet-50 architectures, we reduce network parameters (FLOPs) by $15.7\times$ ($4.6\times$), $33.2\times$ ($8.9\times$), and $4.1\times$ ($2.1\times$), respectively. NeST's grow-and-prune paradigm delivers significant additional parameter and FLOPs reduction relative to pruning-only methods.
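As a simple illustration of the magnitude-based pruning half of the grow-and-prune loop, the sketch below zeroes out the weights with the smallest absolute values. The threshold policy shown here is an assumption for illustration; NeST interleaves this step with gradient-based growth, which is not modeled.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.
    Illustrative only: not NeST's exact pruning schedule."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]      # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# Example: prune 70 percent of a random layer's weights.
layer = np.random.randn(300, 100)
print(np.mean(magnitude_prune(layer, 0.7) == 0.0))    # ~0.7
```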

154 citations


Journal ArticleDOI
TL;DR: RepuCoin, as discussed in this article, defines a miner's power by its reputation, a function of its work integrated over the time of the entire blockchain, rather than by instantaneous computing power, which can be obtained relatively quickly and/or temporarily.
Abstract: Existing proof-of-work cryptocurrencies cannot tolerate attackers controlling more than 50 percent of the network’s computing power at any time, but assume that such a condition happening is “unlikely”. However, recent attack sophistication, e.g., where attackers can rent mining capacity to obtain a majority of computing power temporarily, renders this assumption unrealistic. This paper proposes RepuCoin, the first system to provide guarantees even when more than 50 percent of the system’s computing power is temporarily dominated by an attacker. RepuCoin physically limits the rate of voting power growth of the entire system. In particular, RepuCoin defines a miner’s power by its ‘reputation’, as a function of its work integrated over the time of the entire blockchain, rather than through instantaneous computing power, which can be obtained relatively quickly and/or temporarily. As an example, after a single year of operation, RepuCoin can tolerate attacks compromising 51 percent of the network’s computing resources, even if such power stays maliciously seized for almost a whole year. Moreover, RepuCoin provides better resilience to known attacks, compared to existing proof-of-work systems, while achieving a high throughput of 10,000 transactions per second (TPS).

99 citations


Journal ArticleDOI
TL;DR: This paper introduces a novel analytical expression for calculating the MTTF due to transient faults and tackles the problem of maximizing availability for multicore real-time systems with consideration of permanent and transient faults.
Abstract: CMOS scaling has greatly increased concerns for both lifetime reliability due to permanent faults and soft-error reliability due to transient faults. Most existing works only focus on one of the two reliability concerns, but often times techniques used to increase one type of reliability may adversely impact the other type. A few efforts do consider both types of reliability together and use two different metrics to quantify the two types of reliability. However, for many systems, the user's concern is to maximize system availability by improving the mean time to failure (MTTF), regardless of whether the failure is caused by permanent or transient faults. Addressing this concern requires a uniform metric to measure the effect due to both types of faults. This paper introduces a novel analytical expression for calculating the MTTF due to transient faults. Using this new formula and an existing method to evaluate system MTTF, we tackle the problem of maximizing availability for multicore real-time systems with consideration of permanent and transient faults. A framework is proposed to solve the system availability maximization problem. Experimental results on a hardware board and simulation results of synthetic tasks show that our scheme significantly improves system MTTF (and hence availability) compared with existing techniques.
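The unified-metric view can be pictured with a small sketch: if permanent and transient faults are modeled as independent failure processes with exponentially distributed lifetimes (an assumption made here for illustration, not the paper's analytical expression), their failure rates add and a single availability-oriented MTTF follows.

```python
def combined_mttf(mttf_permanent_h: float, mttf_transient_h: float) -> float:
    """Single MTTF (hours) from two independent failure sources, assuming
    exponentially distributed lifetimes so that failure rates simply add.
    The paper derives its own expression for the transient-fault MTTF."""
    rate = 1.0 / mttf_permanent_h + 1.0 / mttf_transient_h
    return 1.0 / rate

# Example: a 10-year permanent-fault MTTF combined with a 2-year
# transient-fault MTTF gives a system MTTF of about 1.67 years.
print(combined_mttf(10 * 8760, 2 * 8760) / 8760)
```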

94 citations


Journal ArticleDOI
TL;DR: Energy-efficient approximate multipliers based on Mitchell's log multiplication, optimized for performing inferences on convolutional neural networks (CNN), are proposed and supported by detailed formal analysis as well as experimental results on CNNs.
Abstract: This paper proposes energy-efficient approximate multipliers based on Mitchell’s log multiplication, optimized for performing inferences on convolutional neural networks (CNN). Various design techniques are applied to the log multiplier, including a fully-parallel LOD, efficient shift amount calculation, and exact zero computation. Additionally, the truncation of the operands is studied to create the customizable log multiplier that further reduces energy consumption. The paper also proposes using the one’s complement to handle negative numbers, as an approximation of the two’s complement used in prior works. The viability of the proposed designs is supported by detailed formal analysis as well as experimental results on CNNs. The experiments also provide insights into the effect of approximate multiplication in CNNs, identifying the importance of minimizing the range of error. The proposed customizable design at $w = 8$ saves up to 88 percent energy compared to the exact fixed-point multiplier at 32 bits, with a performance degradation of just 0.2 percent for the ImageNet ILSVRC2012 dataset.
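Mitchell's log multiplication, on which these designs build, approximates the base-2 logarithm of each operand by its leading-one position plus the remaining bits taken as a fraction, adds the two approximate logs, and converts back. Below is a minimal fixed-point sketch of the basic, unmodified algorithm; the paper's LOD, truncation, and one's-complement refinements are not modeled.

```python
def mitchell_mult(a: int, b: int, frac_bits: int = 16) -> int:
    """Approximate a*b with Mitchell's logarithmic multiplication (unsigned).
    log2(2^k * (1 + x)) is approximated by k + x, with x in [0, 1)."""
    if a == 0 or b == 0:
        return 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1        # leading-one detection
    xa = ((a - (1 << ka)) << frac_bits) >> ka               # fractional part of a
    xb = ((b - (1 << kb)) << frac_bits) >> kb               # fractional part of b
    s = xa + xb                                             # add the approximate logs
    if s < (1 << frac_bits):                                # no carry out of the fraction
        return (((1 << frac_bits) + s) << (ka + kb)) >> frac_bits
    return (s << (ka + kb + 1)) >> frac_bits                # carry: 2^(1+f) ~ 2*(1+f)

print(mitchell_mult(100, 200), 100 * 200)   # 18432 vs. 20000; Mitchell always
                                            # underestimates, by at most ~11 percent
```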

88 citations


Journal ArticleDOI
TL;DR: Significant improvements in energy and execution time for the CRAM-based implementation over a near-memory processing system are demonstrated, and can be attributed to the ability of CRAM to overcome the memory access bottleneck, and to provide high levels of parallelism to the computation.
Abstract: The Computational Random Access Memory (CRAM) is a platform that makes a small modification to a standard spintronics-based memory array to organically enable logic operations within the array. CRAM provides a true in-memory computational platform that can perform computations within the memory array, as against other methods that send computational tasks to a separate processor module or a near-memory module at the periphery of the memory array. This paper describes how the CRAM structure can be built and utilized, accounting for considerations at the device, gate, and functional levels. Techniques for constructing fundamental gates are first overviewed, accounting for electrical and noise margin considerations. Next, these logic operations are composed to schedule operations in the array that implement basic arithmetic operations such as addition and multiplication. These methods are then demonstrated on 2D convolution with multibit data, and a binary neural inference engine. The performance of the CRAM is analyzed on near-term and longer-term spintronic device technologies. Significant improvements in energy and execution time for the CRAM-based implementation over a near-memory processing system are demonstrated, and can be attributed to the ability of CRAM to overcome the memory access bottleneck, and to provide high levels of parallelism to the computation.

83 citations


Journal ArticleDOI
TL;DR: Three Approximate Booth Multipliers are proposed in which approximate computing is applied to the radix-4 modified Booth algorithm and are found to outperform the state-of-the-art existing multipliers in terms of area and power savings while maintaining high accuracy.
Abstract: Approximate computing is an emerging technique in which power-efficient circuits are designed with reduced complexity in exchange for some loss in accuracy. Such circuits are suitable for applications in which high accuracy is not a strict requirement. Radix-4 modified Booth encoding is a popular multiplication algorithm which reduces the size of the partial product array by half. In this paper, three Approximate Booth Multiplier Models (ABM-M1, ABM-M2, and ABM-M3) are proposed in which approximate computing is applied to the radix-4 modified Booth algorithm. Each of the three designs features a unique approximation technique that involves both reducing the logic complexity of the Booth partial product generator and modifying the method of partial product accumulation. The proposed approximate multipliers are demonstrated to have better performance than existing approximate Booth multipliers in terms of accuracy and power. Compared to the exact Booth multiplier, ABM-M1 achieves up to a 23 percent reduction in area and a 15 percent reduction in power with a Mean Relative Error Distance (MRED) value of $7.9\times 10^{-4}$. ABM-M2 has area and power savings of up to 51 and 46 percent, respectively, with a MRED of $2.7\times 10^{-2}$. ABM-M3 has area savings of up to 56 percent and power savings of up to 46 percent with a MRED of $3.4\times 10^{-3}$. The proposed designs are compared with the state-of-the-art existing multipliers and are found to outperform them in terms of area and power savings while maintaining high accuracy. The performance of the proposed designs is demonstrated using image transformation, matrix multiplication, and Finite Impulse Response (FIR) filtering applications.
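For reference, the exact radix-4 modified Booth recoding that these approximate designs start from examines overlapping 3-bit groups of the multiplier and maps each to a digit in {-2, -1, 0, 1, 2}, halving the number of partial products. The sketch below shows only the exact recoding for an n-bit two's-complement multiplier; the ABM models' approximations to partial product generation and accumulation are not modeled here.

```python
def booth_radix4_digits(b: int, n_bits: int):
    """Exact radix-4 Booth recoding of an n-bit two's-complement multiplier b
    into digits in {-2, -1, 0, 1, 2} (one digit per pair of bits)."""
    digits, prev = [], 0
    for i in range(0, n_bits, 2):
        b_i = (b >> i) & 1
        b_i1 = (b >> (i + 1)) & 1
        digits.append(-2 * b_i1 + b_i + prev)   # value of the overlapping bit triplet
        prev = b_i1
    return digits

def booth_multiply(a: int, b: int, n_bits: int = 8) -> int:
    """Multiply a by the n-bit two's-complement value b via Booth partial products."""
    return sum(d * (a << (2 * i)) for i, d in enumerate(booth_radix4_digits(b, n_bits)))

print(booth_multiply(93, 57), 93 * 57)   # 5301 5301
```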

71 citations


Journal ArticleDOI
TL;DR: The proposed quantum multiplier design reduces the T-count by using a novel quantum conditional adder circuit; where one operand to the conditional adder is zero, the adder is replaced with a Toffoli gate array to further save T gates.
Abstract: Quantum circuits of many qubits are extremely difficult to realize; thus, the number of qubits is an important metric in a quantum circuit design. Further, scalable and reliable quantum circuits are based on fault tolerant implementations of quantum gates such as Clifford+T gates. An efficient quantum circuit saves quantum hardware resources by reducing the number of T gates without substantially increasing the number of qubits. This work presents a T-count optimized quantum circuit for integer multiplication with only $4 \cdot n + 1$ qubits and no garbage outputs. The proposed quantum multiplier design reduces the T-count by using a novel quantum conditional adder circuit. Also, where one operand to the conditional adder is zero, the conditional adder is replaced with a Toffoli gate array to further save T gates. Average T-count savings of 46.12, 47.55, 62.71, and 26.30 percent are achieved compared to the recent works by Kotiyal et al., Babu, Lin et al., and Jayashree et al., respectively.
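The classical structure being made reversible here is ordinary shift-and-add multiplication, where each multiplier bit conditionally adds a shifted copy of the multiplicand. The sketch below shows only that classical analogue; the paper's contribution is the T-count-efficient reversible (Clifford+T) realization of the conditional adders, which plain Python cannot capture.

```python
def shift_add_multiply(a: int, b: int, n_bits: int = 8) -> int:
    """Classical shift-and-add multiplication: one conditional add per multiplier bit.
    In the quantum design, each 'if' becomes a conditional adder circuit (or a
    Toffoli array when one adder operand is known to be zero)."""
    product = 0
    for i in range(n_bits):
        if (b >> i) & 1:                 # control bit decides whether to add
            product += a << i            # add the shifted multiplicand
    return product

print(shift_add_multiply(13, 11), 13 * 11)   # 143 143
```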

65 citations


Journal ArticleDOI
TL;DR: The proposed approximate RB multiplier designs are compared with previous approximate Booth multipliers; the results show that the approximate RB multipliers are better than approximate NB Booth multipliers, especially when the word size is large.
Abstract: As technology scaling is reaching its limits, new approaches have been proposed for computational efficiency. Approximate computing is a promising technique for high-performance and low-power circuits used in error-tolerant applications. Among approximate circuits, approximate arithmetic designs have attracted significant research interest. In this paper, the design of approximate redundant binary (RB) multipliers is studied. Two approximate Booth encoders and two RB 4:2 compressors based on RB (full and half) adders are proposed for the RB multipliers. The approximate design of the RB-Normal Binary (NB) converter in the RB multiplier is also studied by considering the error characteristics of both the approximate Booth encoders and the RB compressors. Both approximate and exact regular partial product arrays are used in the approximate RB multipliers to meet different accuracy requirements. Error analysis and hardware simulation results are provided. The proposed approximate RB multipliers are compared with previous approximate Booth multipliers; the results show that the approximate RB multipliers are better than approximate NB Booth multipliers, especially when the word size is large. Case studies of error-resilient applications are also presented to show the validity of the proposed designs.

59 citations


Journal ArticleDOI
TL;DR: NTX, as discussed by the authors, is an efficient near-memory acceleration engine that can be used to train state-of-the-art deep convolutional neural networks at scale.
Abstract: Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by $7\times$ over previously published results; (ii) an optimized IEEE 754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) evaluation of near-memory computing with NTX embedded into residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a $2.7\times$ energy efficiency improvement of NTX over contemporary GPUs at $4.4\times$ less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95 percent parallel and energy efficiency, while providing $2.1\times$ energy savings or $3.1\times$ performance improvement over a GPU-based system.

58 citations


Journal ArticleDOI
TL;DR: This paper proposes LEAD Learning-enabled Energy-Aware Dynamic voltage/frequency scaling for multicore architectures using both supervised learning and reinforcement learning approaches, and describes a reinforcement learning approach to LEAD that optimizes the DVFS mode selection directly, obviating the need for label and threshold engineering.
Abstract: Network-on-Chips (NoCs) are the de facto choice for designing the interconnect fabric in multicore chips due to their regularity, efficiency, simplicity, and scalability. However, NoCs suffer from excessive static power and dynamic energy due to transistor leakage current and data movement between the cores and caches. Power consumption issues are only exacerbated by ever-decreasing technology sizes. Dynamic Voltage and Frequency Scaling (DVFS) is one technique that seeks to reduce dynamic energy; however, this often comes at the expense of performance. In this paper, we propose LEAD, Learning-enabled Energy-Aware Dynamic voltage/frequency scaling, for multicore architectures using both supervised learning and reinforcement learning approaches. LEAD groups the router and its outgoing links into the same V/F domain and implements proactive DVFS mode management strategies that rely on offline trained machine learning models in order to provide optimal V/F mode selection between different voltage/frequency pairs. We present three supervised learning versions of LEAD that are based on buffer utilization, change in buffer utilization, and change in energy/throughput, which allow proactive mode selection based on accurate prediction of future network parameters. We then describe a reinforcement learning approach to LEAD that optimizes the DVFS mode selection directly, obviating the need for label and threshold engineering. Simulation results using PARSEC and Splash-2 benchmarks on a 4 × 4 concentrated mesh architecture show that by using supervised learning, LEAD can achieve average dynamic energy savings of 15.4 percent for a loss in throughput of 0.8 percent with no significant impact on latency. When reinforcement learning is used, LEAD increases average dynamic energy savings to 20.3 percent at the cost of a 1.5 percent decrease in throughput and a 1.7 percent increase in latency. Overall, the more flexible reinforcement learning approach enables learning an optimal behavior for a wider range of load environments under any desired energy versus throughput tradeoff.
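As a rough illustration of the reinforcement-learning variant, the sketch below runs tabular Q-learning over discretized buffer-utilization states and V/F modes. The state encoding, reward, and hyperparameters are assumptions for illustration only, not LEAD's actual formulation.

```python
import random
from collections import defaultdict

N_MODES = 4                               # hypothetical number of V/F pairs
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1     # illustrative hyperparameters
q_table = defaultdict(lambda: [0.0] * N_MODES)

def select_mode(state):
    """Epsilon-greedy choice of a V/F mode for the router's current state."""
    if random.random() < EPSILON:
        return random.randrange(N_MODES)
    return max(range(N_MODES), key=lambda m: q_table[state][m])

def update(state, mode, reward, next_state):
    """Standard Q-learning update; the reward would trade energy against throughput."""
    best_next = max(q_table[next_state])
    q_table[state][mode] += ALPHA * (reward + GAMMA * best_next - q_table[state][mode])

# Per control epoch (illustrative): discretize buffer utilization into a state,
# pick a mode, apply it, measure energy/throughput, then call update(...).
```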

Journal ArticleDOI
TL;DR: Compared to the state-of-the-art multiple-precision FMA design, the proposed FMA supports more floating-point operations, such as half-precision FMA operations and mixed-precision operations, with only 10.6 percent larger area.
Abstract: In this paper, an efficient multiple-precision floating-point fused multiply-add (FMA) unit is proposed. The proposed FMA supports not only single-precision, double-precision, and quadruple-precision operations, as some previous works do, but also half-precision operations. The proposed FMA architecture can execute one quadruple-precision operation, or two parallel double-precision operations, or four parallel single-precision operations, or eight parallel half-precision operations every clock cycle. In addition to the support of normal FMA operations, the proposed FMA also supports mixed-precision FMA operations and mixed-precision dot-product operations. Specifically, the products of two lower precision multiplications can be accumulated to a higher precision addend. By setting the operands of one multiplication to zeros, the proposed FMA can also perform mixed-precision FMA operations. Support for mixed-precision FMA and mixed-precision dot-product is newly added but it only consumes 6.5 percent more area compared to a normal multiple-precision FMA unit. Compared to the state-of-the-art multiple-precision FMA design, the proposed FMA supports more floating-point operations such as half-precision FMA operations and mixed-precision operations with only 10.6 percent larger area.
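The mixed-precision dot-product mode described above multiplies lower-precision operands and accumulates them into a higher-precision addend. The NumPy sketch below mimics that numerically (half-precision products accumulated in single precision); it illustrates the arithmetic, not the FMA hardware datapath.

```python
import numpy as np

def mixed_precision_dot(a, b):
    """Accumulate products of half-precision inputs into a single-precision sum."""
    a16 = np.asarray(a, dtype=np.float16)
    b16 = np.asarray(b, dtype=np.float16)
    acc = np.float32(0.0)
    for x, y in zip(a16, b16):
        acc = np.float32(acc + np.float32(x) * np.float32(y))  # widen before the FMA
    return acc

x = np.random.randn(1000)
print(mixed_precision_dot(x, x), np.dot(x, x))   # close, but not bit-identical
```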

Journal ArticleDOI
TL;DR: This paper proposes a novel processing in-memory architecture, called NNPIM, that significantly accelerates the inference phase of neural networks inside the memory and introduces simple optimization techniques that significantly improve NNs’ performance and reduce the overall energy consumption.
Abstract: Neural networks (NNs) have shown great ability to process emerging applications such as speech recognition, language recognition, image classification, video segmentation, and gaming. It is therefore important to make NNs efficient. Although attempts have been made to reduce NNs’ computation cost, the data movement between memory and processing cores is the main bottleneck for NNs’ energy consumption and execution time. This makes the implementation of NNs significantly slower on traditional CPU/GPU cores. In this paper, we propose a novel processing in-memory architecture, called NNPIM, that significantly accelerates the inference phase of neural networks inside the memory. First, we design a crossbar memory architecture that supports fast addition, multiplication, and search operations inside the memory. Second, we introduce simple optimization techniques which significantly improve NNs’ performance and reduce the overall energy consumption. We also map all NN functionalities using parallel in-memory components. To further improve the efficiency, our design supports weight sharing to reduce the number of computations in memory and consequently speed up NNPIM computation. We compare the efficiency of our proposed NNPIM with GPU and the state-of-the-art PIM architectures. Our evaluation shows that our design can achieve 131.5× higher energy efficiency and is 48.2× faster as compared to the NVIDIA GTX 1080 GPU architecture. Compared to state-of-the-art neural network accelerators, NNPIM can achieve on average 3.6× higher energy efficiency and is 4.6× faster, while providing the same classification accuracy.

Journal ArticleDOI
TL;DR: A 50 to 75 percent reduction in L2 misses is measured for many compiled C-language benchmarks running under a commodity operating system using compressed 128-bit and 64-bit formats, demonstrating both compatibility with and increased performance over the uncompressed, 256-bit format.
Abstract: We present CHERI Concentrate, a new fat-pointer compression scheme applied to CHERI, the most developed capability-pointer system at present. Capability fat pointers are a primary candidate to enforce fine-grained and non-bypassable security properties in future computer systems, although increased pointer size can severely affect performance. Thus, several proposals for capability compression have been suggested elsewhere that do not support legacy instruction sets, ignore features critical to the existing software base, and also introduce design inefficiencies to RISC-style processor pipelines. CHERI Concentrate improves on the state-of-the-art region-encoding efficiency, solves important pipeline problems, and eases semantic restrictions of compressed encoding, allowing it to protect a full legacy software stack. We present the first quantitative analysis of compiled capability code, which we use to guide the design of the encoding format. We analyze and extend logic from the open-source CHERI prototype processor design on FPGA to demonstrate encoding efficiency, minimize delay of pointer arithmetic, and eliminate additional load-to-use delay. To verify correctness of our proposed high-performance logic, we present a HOL4 machine-checked proof of the decode and pointer-modify operations. Finally, we measure a 50 to 75 percent reduction in L2 misses for many compiled C-language benchmarks running under a commodity operating system using compressed 128-bit and 64-bit formats, demonstrating both compatibility with and increased performance over the uncompressed, 256-bit format.

Journal ArticleDOI
TL;DR: In this article, a 3D Network-on-Chip (NoC) for heterogeneous manycore platforms that considers the appropriate design objectives for 3D heterogeneous system and explores various tradeoffs using an efficient machine learning (ML)-based multi-objective optimization (MOO) technique is proposed.
Abstract: The rising use of deep learning and other big-data algorithms has led to an increasing demand for hardware platforms that are computationally powerful, yet energy-efficient. Due to the amount of data parallelism in these algorithms, high-performance three-dimensional (3D) manycore platforms that incorporate both CPUs and GPUs present a promising direction. However, as systems use heterogeneity (e.g., a combination of CPUs, GPUs, and accelerators) to improve performance and efficiency, it becomes more pertinent to address the distinct and likely conflicting communication requirements (e.g., CPU memory access latency or GPU network throughput) that arise from such heterogeneity. Unfortunately, it is difficult to quickly explore the hardware design space and choose appropriate tradeoffs between these heterogeneous requirements. To address these challenges, we propose the design of a 3D Network-on-Chip (NoC) for heterogeneous manycore platforms that considers the appropriate design objectives for a 3D heterogeneous system and explores various tradeoffs using an efficient machine learning (ML)-based multi-objective optimization (MOO) technique. The proposed design space exploration considers the various requirements of its heterogeneous components and generates a set of 3D NoC architectures that efficiently trades off these design objectives. Our findings show that by jointly considering these requirements (latency, throughput, temperature, and energy), we can achieve 9.6 percent better Energy-Delay Product on average at nearly iso-temperature conditions when compared to a thermally-optimized design for 3D heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs optimized for a few applications can be generalized for unknown applications as well. Our results show that these generalized 3D NoCs only incur a 1.8 percent (36-tile system) and 1.1 percent (64-tile system) average performance loss compared to application-specific NoCs.

Journal ArticleDOI
TL;DR: A cache replacement policy is proposed to mitigate heat generation and high temperature in STT-MRAM cache memories; the authors demonstrate that the majority of consecutive write operations are committed to adjacent cache blocks.
Abstract: As the technology process node scales down, on-chip SRAM caches lose their efficiency because of their low scalability, high leakage power, and increasing rate of soft errors. Among emerging memory technologies, Spin-Transfer Torque Magnetic RAM (STT-MRAM) is known as the most promising replacement for SRAM-based cache memories. The main advantages of STT-MRAM are its non-volatility, near-zero leakage power, higher density, soft-error immunity, and higher scalability. Despite these advantages, the high error rate in STT-MRAM cells due to retention failure, write failure, and read disturbance threatens the reliability of cache memories built upon STT-MRAM technology. The error rate increases significantly at higher temperatures, which further affects the reliability of STT-MRAM-based cache memories. The major source of heat generation and temperature increase in STT-MRAM cache memories is write operations, which are managed by the cache replacement policy. To the best of our knowledge, none of the previous studies have attempted to mitigate heat generation and high temperature of STT-MRAM cache memories using the replacement policy. In this paper, we first analyze the cache behavior under the conventional Least-Recently Used (LRU) replacement policy and demonstrate that the majority of consecutive write operations (more than 66 percent) are committed to adjacent cache blocks. These adjacent write operations cause accumulated heat and increased temperature, which significantly increase the cache error rate. To eliminate heat accumulation and the adjacency of consecutive writes, we propose a cache replacement policy, named Thermal-Aware Least-Recently Written (TA-LRW), to smoothly distribute the generated heat by conducting consecutive write operations in distant cache blocks. TA-LRW guarantees a distance of at least three blocks between each two consecutive write operations in an 8-way associative cache. This distant write scheme reduces the temperature-induced error rate by 94.8 percent, on average, compared with the conventional LRU policy, which results in a 6.9x reduction in cache error rate. The implementation cost and complexity of TA-LRW is as low as that of the First-In, First-Out (FIFO) policy while providing near-LRU performance, having the advantages of both replacement policies. The significantly reduced error rate is achieved by imposing only 2.3 percent performance overhead compared with the LRU policy.
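To make the distance guarantee concrete, here is a toy victim-selection routine for one 8-way set that prefers the least-recently-written way at least three positions away from the last written way. This is an illustrative guess at the mechanism's spirit, not the paper's exact TA-LRW algorithm.

```python
class ThermalAwareSet:
    """Toy per-set write-victim selection that spreads consecutive writes apart."""
    MIN_DIST = 3

    def __init__(self, ways: int = 8):
        self.write_order = list(range(ways))   # least-recently-written way first
        self.last_written = None

    def pick_victim(self) -> int:
        for way in self.write_order:           # scan from least-recently written
            if self.last_written is None or abs(way - self.last_written) >= self.MIN_DIST:
                return way
        return self.write_order[0]             # fallback; unreachable for 8 ways, distance 3

    def record_write(self, way: int) -> None:
        self.write_order.remove(way)
        self.write_order.append(way)           # most-recently written at the back
        self.last_written = way
```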

Journal ArticleDOI
TL;DR: The SqueezeFlow accelerator architecture, which exploits the sparsity of CNN models for increased efficiency, is presented; it employs a PT-OS-sparse dataflow that removes ineffective computations while maintaining the regularity of CNN computations.
Abstract: Convolutional Neural Networks (CNNs) have been widely used in machine learning tasks. While delivering state-of-the-art accuracy, CNNs are known as both compute- and memory-intensive. This paper presents the SqueezeFlow accelerator architecture that exploits sparsity of CNN models for increased efficiency. Unlike prior accelerators that trade complexity for flexibility, SqueezeFlow exploits concise convolution rules to benefit from the reduction of computation and memory accesses as well as the acceleration of existing dense architectures without intrusive PE modifications. Specifically, SqueezeFlow employs a PT-OS-sparse dataflow that removes the ineffective computations while maintaining the regularity of CNN computations. We present a full design down to the layout at 65 nm, with an area of 4.80 $\mathrm{mm^2}$ and power of 536.09 mW. The experiments show that SqueezeFlow achieves a speedup of $2.9\times$ on VGG16 compared to the dense architectures, with an area and power overhead of only 8.8 and 15.3 percent, respectively. On three representative sparse CNNs, SqueezeFlow improves the performance and energy efficiency by $1.8\times$ and $1.5\times$ over the state-of-the-art sparse accelerators.

Journal ArticleDOI
Yi Wu, You Li, Xiangxuan Ge, Yuan Gao, Weikang Qian
TL;DR: This work proposes an efficient method to obtain the error statistics of block-based approximate adders: it shows how to calculate the ER and demonstrates an approach to get the error distribution, which can be used to calculate other metrics such as MED and MSE.
Abstract: Adders are key building blocks of many error-tolerant applications. Recently, a number of approximate adders were proposed. Many of them are block-based approximate adders. For approximate circuits, besides normal metrics such as area and delay, the other important design metrics are the various error statistics, such as error rate (ER), mean error distance (MED), and mean square error (MSE). Given the popularity of block-based approximate adders, in this work, we propose an efficient method to obtain their error statistics. We first show how to calculate the ER. Then, we demonstrate an approach to get the error distribution, which can be used to calculate other metrics, such as MED and MSE. Our method is applicable to an arbitrary block-based approximate adder. It is accurate for the uniformly distributed inputs. Experimental results also demonstrated that it produces error metrics close to the accurate ones for various types of non-uniform input distributions. Compared to the state-of-the-art algorithm for obtaining the error distributions of block-based approximate adders, for the uniform input distribution, our method improves the runtime by up to $4.8\times 10^4$ times with the same accuracy; for non-uniform input distributions, it achieves a speed-up of up to 400 times with very similar accuracy.
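The metrics in question are straightforward to define operationally. The sketch below estimates ER, MED, and MSE for any approximate adder by Monte-Carlo sampling under uniform inputs; it is a brute-force baseline against which the paper's analytical method could be checked, not the proposed algorithm itself. The carry-dropping adder used as an example is an assumption for illustration.

```python
import random

def error_stats(approx_add, n_bits=16, trials=100_000, seed=0):
    """Estimate error rate (ER), mean error distance (MED) and mean square
    error (MSE) of an approximate adder under uniform random inputs."""
    rng = random.Random(seed)
    errors, abs_sum, sq_sum = 0, 0.0, 0.0
    for _ in range(trials):
        a, b = rng.getrandbits(n_bits), rng.getrandbits(n_bits)
        e = approx_add(a, b) - (a + b)
        if e != 0:
            errors += 1
        abs_sum += abs(e)
        sq_sum += e * e
    return errors / trials, abs_sum / trials, sq_sum / trials

# Example block-based approximation: drop the carry between the two 8-bit halves.
def carry_cut_adder(a, b, block=8):
    low = (a + b) & ((1 << block) - 1)                 # no carry out of the low block
    high = ((a >> block) + (b >> block)) << block      # high blocks added exactly
    return high | low

print(error_stats(carry_cut_adder))   # (ER, MED, MSE) estimates
```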

Journal ArticleDOI
TL;DR: A Value Iteration Architecture based Deep Learning (VIADL) method is proposed to conduct routing design, addressing the limitations of existing deep learning based routing algorithms in dynamic networks; it can guarantee more stable network performance when the network topology changes.
Abstract: Recently, the rapid advancement of high computing platforms has accelerated the development and applications of artificial intelligence techniques. Deep learning, which has been regarded as the next paradigm to revolutionize users’ experiences, has attracted networking researchers’ interest in relieving the burden of exponentially growing traffic and increasing complexity. Various intelligent packet transmission strategies have been proposed to tackle different network problems. However, most existing research focuses only on network-related improvements and neglects to analyze computation consumption. In this paper, we propose a Value Iteration Architecture based Deep Learning (VIADL) method to conduct routing design and address the limitations of existing deep learning based routing algorithms in dynamic networks. Besides the network performance analysis, we also study the complexity of our proposal as well as the resource consumption under different deployment manners. Moreover, we adopt the Heterogeneous Computing Platform (HCP) to conduct the training and running of the proposed VIADL, since the theoretical analysis demonstrates a significant reduction in time complexity with multiple GPUs in HCPs. Furthermore, simulation results demonstrate that, compared with the existing deep learning based method, our proposal can guarantee more stable network performance when the network topology changes.

Journal ArticleDOI
TL;DR: This paper proposes a solution for embedded applications using any peripheral device to run despite transient power, following a kernel-oriented approach resulting in minimal impact on the programming model of the application.
Abstract: In the near future, energy harvesting is expected to replace batteries in ultra-low-power embedded systems. Research prototypes of such systems have recently been proposed. As the power harvested in the environment is very low, such systems need to cope with frequent power outages. They are referred to as transiently-powered systems (TPS). In order to execute non-trivial applications, TPS need to retain information between power losses. To achieve this goal, emerging non-volatile memory (NVM) technologies are a key enabler: they provide a lightweight solution to retain, between power outages, the state of an application and of its peripheral devices. These include sensors, serial interfaces, or radio devices, for instance. Existing works have described various checkpointing mechanisms to adapt embedded applications to TPS, but the use of peripherals was not yet handled in these works. This paper proposes a solution that allows embedded applications using any peripheral device to run despite transient power. We follow a kernel-oriented approach resulting in minimal impact on the programming model of the application. We implement the new concepts in our lightweight kernel called Sytare, running on an MSP430FR5739 micro-controller, and we analyze the cost of the proposed solution.

Journal ArticleDOI
TL;DR: SyRA is proposed, a system-level cross-layer early reliability analysis framework for radiation induced soft errors in memory arrays of microprocessor-based systems and implements a complete tool-chain that scales efficiently with the complexity of the system.
Abstract: Cross-layer reliability is becoming the preferred solution when reliability is a concern in the design of a microprocessor-based system. Nevertheless, deciding how to distribute the error management across the different layers of the system is a very complex task that requires the support of dedicated frameworks for cross-layer reliability analysis. This paper proposes SyRA, a system-level cross-layer early reliability analysis framework for radiation-induced soft errors in memory arrays of microprocessor-based systems. The framework exploits a multi-level hybrid Bayesian model to describe the target system and takes advantage of Bayesian inference to estimate different reliability metrics. SyRA implements several mechanisms and features to deal with the complexity of realistic models and implements a complete tool-chain that scales efficiently with the complexity of the system. The simulation time is significantly lower than that of micro-architecture level or RTL fault-injection experiments, with an accuracy high enough to take effective design decisions. To demonstrate the capability of SyRA, we analyzed the reliability of a set of microprocessor-based systems characterized by different microprocessor architectures (i.e., Intel x86, ARM Cortex-A15, ARM Cortex-A9) running either the Linux operating system or bare metal in the presence of single-bit upsets caused by radiation-induced soft errors. Each system under analysis executes different software workloads both from benchmark suites and from real applications.

Journal ArticleDOI
TL;DR: This paper proposes an automated approach for generating directed tests by suitable assignments of input variables to make the remainder non-zero, and proposes an automatic bug fixing technique by utilizing the patterns of the remainder terms.
Abstract: Optimized and custom arithmetic circuits are widely used in embedded systems such as multimedia applications, cryptography systems, signal processing, and console games. Debugging of arithmetic circuits is a challenge due to increasing complexity coupled with non-standard implementations. Existing algebraic rewriting techniques produce a remainder to indicate the presence of a potential bug. However, bug localization remains a major bottleneck. Simulation-based validation using random or constrained-random tests is not effective for complex arithmetic circuits due to bit-blasting. In this paper, we present an automated test generation and bug localization technique for debugging arithmetic circuits. This paper makes four important contributions. We propose an automated approach for generating directed tests by suitable assignments of input variables to make the remainder non-zero. The generated tests are guaranteed to activate bugs. We also propose an automatic bug fixing technique by utilizing the patterns of the remainder terms as well as by analyzing the regions activated by the generated tests to detect and correct the error(s). We also propose an efficient debugging algorithm that can handle multiple dependent as well as independent bugs. Finally, our proposed framework, consisting of directed test generation, bug localization and bug correction, is fully automated. In other words, our framework is capable of producing a corrected implementation of arithmetic circuits without any manual intervention. Our experimental results demonstrate that the proposed approach can be used for automated debugging of large and complex arithmetic circuits.

Journal ArticleDOI
TL;DR: Six efficient attacks on the ARM TrustZone extension in the SoC are detailed; a prototype system design on a Xilinx Zynq SoC is the target of the attacks presented in this paper, but they could be adapted to other SoCs.
Abstract: Cybersecurity of embedded systems has become a major challenge for the development of the Internet of Things, of Cloud computing and other trendy applications without devoting a significant part of the design budget to industrial players. Technologies like TrustZone, provided by ARM, support a Trusted Execution Environment (TEE) software architecture and are inexpensive integrated solutions. While this technology allows isolation and secure execution of critical software applications (e.g., banking), recent preliminary works highlighted some security breaches or limitations when the ARM processors are embedded in a FPGA-based heterogeneous SoCs such as the Xilinx Zynq or Intel SoC FPGA devices. This paper highlights the security issue of such complex SoCs and details six efficient attacks on the ARM TrustZone extension in the SoC. A prototype system design on a Xilinx Zynq SoC is the target of the attacks presented in this paper but they could be adapted to other SoCs. This paper also includes recommendations and security solutions to design a trustworthy embedded system with a FPGA-based heterogeneous SoC.

Journal ArticleDOI
TL;DR: This work proposes Sparsity-aware Core Extensions (SparCE) - a set of low-overhead micro-architectural and ISA extensions that dynamically detect whether an operand is zero and subsequently skip a set of future instructions that use it, and improves the performance of DNNs on general-purpose processor (GPP) cores.
Abstract: Deep Neural Networks (DNNs) have emerged as the method of choice for solving a wide range of machine learning tasks. The enormous computational demand posed by DNNs is a key challenge for computing system designers and has most commonly been addressed through the design of DNN accelerators. However, these specialized accelerators utilize large quantities of multiply-accumulate units and on-chip memory and are prohibitive in area and cost constrained systems such as wearable devices and IoT sensors. In this work, we take a complementary approach and improve the performance of DNNs on general-purpose processor (GPP) cores. We do so by exploiting a key attribute of DNNs, viz. sparsity or the prevalence of zero values. We propose Sparsity-aware Core Extensions (SparCE) - a set of low-overhead micro-architectural and ISA extensions that dynamically detect whether an operand (e.g., the result of a load instruction) is zero and subsequently skip a set of future instructions that use it. To maximize performance benefits, SparCE ensures that the instructions to be skipped are prevented from even being fetched, as squashing instructions comes with a penalty (e.g., a pipeline stall). SparCE consists of 2 key micro-architectural enhancements. First, a Sparsity Register File (SpRF) is utilized to track registers that are zero. Next, a Sparsity-Aware Skip Address (SASA) Table is used to indicate instruction sequences that can be skipped, and to specify conditions on SpRF registers that trigger instruction skipping. When an instruction is fetched, SparCE dynamically pre-identifies whether the following instruction(s) can be skipped, and if so appropriately modifies the program counter, thereby skipping the redundant instructions and improving performance. We model SparCE using the gem5 architectural simulator, and evaluate our approach on 6 state-of-the-art image-recognition DNNs in the context of both training and inference using the Caffe deep learning framework. On a scalar microprocessor, SparCE achieves 1.11×-1.96× speedups across both convolution and fully-connected layers that exhibit 10-90 percent sparsity. These speedups translate to 19-31 percent reduction in execution time at the overall application-level. We also evaluate SparCE on a 4-way SIMD ARMv8 processor using the OpenBLAS library, and demonstrate that SparCE achieves 8-15 percent reduction in the application-level execution time.
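A software analogue of the zero-skipping idea is shown below: when an activation is zero, the entire multiply-accumulate it feeds can be skipped without changing the result. SparCE does this at the micro-architecture level by preventing the dependent instructions from even being fetched; the Python loop only captures the arithmetic intuition.

```python
def sparse_dot(weights, activations):
    """Dot product that skips work for zero activations; the result is identical
    to the dense computation because a zero operand contributes nothing."""
    acc = 0.0
    for w, x in zip(weights, activations):
        if x == 0.0:          # SparCE would skip the dependent MAC instructions here
            continue
        acc += w * x
    return acc

print(sparse_dot([1.0, 2.0, 3.0], [0.0, 0.5, 0.0]))   # 1.0, with 2 of 3 MACs skipped
```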

Journal ArticleDOI
TL;DR: A novel refresh minimization scheme is proposed that exploits the process variation (PV) of flash memory by detecting the supported retention time of flash blocks, together with a scheme that matches data hotness to refresh frequency.
Abstract: Refresh schemes have been the default approach in NAND flash memory to avoid data losses. The critical issue of the refresh schemes is that they introduce additional costs on lifetime and performance. Recent work proposed to minimize the refresh costs by using uniform refresh frequencies based on the number of program/erase (P/E) cycles. However, from our investigation, we find that the refresh costs still have a high burden on the lifetime performance. In this paper, a novel refresh minimization scheme is proposed by exploiting the process variation (PV) of flash memory. State-of-the-art flash memory always has significant PV, which introduces large variations on the retention time of flash blocks. In order to reduce the refresh costs, we first propose a new refresh frequency determination scheme by detecting the supported retention time of flash blocks. If the detected retention time is large, a low refresh frequency can be applied to minimize the refresh costs. Second, considering that the retention time requirements of data are varied with each others, we further propose a data hotness and refresh frequency matching scheme. The matching scheme is designed to allocate data to blocks with right higher supported retention time. Through simulation studies, the lifetime and performance are significantly improved compared with state-of-the-art refresh schemes.

Journal ArticleDOI
TL;DR: This paper proposes two efficient modular multiplication algorithms with special primes that can be used in SIDH key exchange protocol and shows that the proposed finite field multiplier is over 6.79 times faster than the original multiplier in hardware.
Abstract: Recent progress in quantum physics shows that quantum computers may be a reality in the not too distant future. Post-quantum cryptography (PQC) refers to cryptographic schemes that are based on hard problems which are believed to be resistant to attacks from quantum computers. The supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol shows promising security properties among various post-quantum cryptosystems that have been proposed. In this paper, we propose two efficient modular multiplication algorithms with special primes that can be used in SIDH key exchange protocol. Hardware architectures for the two proposed algorithms are also proposed. The hardware implementations are provided and compared with the original modular multiplication algorithm. The results show that the proposed finite field multiplier is over 6.79 times faster than the original multiplier in hardware. Moreover, the SIDH hardware/software codesign implementation using the proposed FFM2 hardware is over 31 percent faster than the best SIDH software implementation.

Journal ArticleDOI
TL;DR: mARGOt, a dynamic autotuning framework that enhances the target application with an adaptation layer to provide self-optimization capabilities, is introduced as a C++ library that works at function level and gives the application a mechanism to adapt in both a reactive and a proactive way.
Abstract: In the autonomic computing context, the system is perceived as a set of autonomous elements capable of self-management, where end-users define high-level goals and the system shall adapt to achieve the desired behaviour. Runtime adaptation creates several optimization opportunities, especially if we consider approximate computing applications, where it is possible to trade off the accuracy of the result against performance. Given that modern systems are limited by the power dissipated, autonomic computing is an appealing approach to increase the computation efficiency. In this paper, we introduce mARGOt, a dynamic autotuning framework to enhance the target application with an adaptation layer to provide self-optimization capabilities. The framework is implemented as a C++ library that works at function level and provides the application with a mechanism to adapt in both a reactive and a proactive way. Moreover, the application can dynamically change its requirements and learn the underlying application knowledge online. We evaluated the proposed framework in three real-life scenarios, ranging from embedded to HPC applications. In the three use cases, experimental results demonstrate how, thanks to mARGOt, it is possible to increase the computation efficiency by adapting the application at runtime with a limited overhead.
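The adaptation mechanism can be pictured as a small knowledge base mapping knob configurations to observed (accuracy, time) pairs, queried against the current goal. The sketch below is a hypothetical, much-simplified Python stand-in; mARGOt's real interface is a C++ library and differs from this.

```python
class TinyAutotuner:
    """Hypothetical reactive autotuner: learn (error, time) per configuration online
    and pick the fastest configuration that satisfies the current accuracy goal."""

    def __init__(self, configurations):
        self.configurations = list(configurations)
        self.knowledge = {}                        # config -> (error, time)

    def observe(self, config, error, seconds):
        self.knowledge[config] = (error, seconds)  # online application knowledge

    def best(self, max_error):
        feasible = {c: t for c, (e, t) in self.knowledge.items() if e <= max_error}
        return min(feasible, key=feasible.get) if feasible else None

tuner = TinyAutotuner(["low", "medium", "high"])
tuner.observe("low", error=0.10, seconds=0.2)
tuner.observe("high", error=0.01, seconds=1.0)
print(tuner.best(max_error=0.05))   # "high": the only observed configuration meeting the goal
```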

Journal ArticleDOI
TL;DR: This paper provides evidence that preprocessing the source code improves Sherlock's performance; the approach, called Sherlock N-overlap, obtained similarity indexes superior to those of more complex tools such as MOSS, JPlag, and SIM.
Abstract: Some tools for detecting similarity, such as Sherlock, compare textual documents of any nature, but have limitations when comparing source code files. The presence or absence of blank spaces between structure elements, variable names, and other details interferes with the similarity index found. This paper shows that preprocessing the source code improves Sherlock's performance. The results are based on experiments conducted with 66 source code files that had been previously plagiarized, and a base of 2160 source codes created by students of engineering courses in programming classes. In this last set, the existence of similarity was not previously known, so a method was created to calculate precision and recall, in a relative way, based on a set of reference tools acting as a kind of oracle. Our approach, called Sherlock N-overlap, obtained, in most of the cases tested, similarity indexes superior to those of more complex tools such as MOSS, JPlag, and SIM.
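The kind of preprocessing the paper argues for can be illustrated with a short normalization pass that removes comments, collapses whitespace, and renames identifiers to a common token before similarity comparison. The exact steps below (comment stripping, keyword list, identifier renaming) are illustrative assumptions, not the paper's pipeline.

```python
import re

KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "void"}  # illustrative

def normalize(source: str) -> str:
    """Normalize C-like source before similarity comparison: drop comments,
    collapse whitespace, and map every identifier to the token ID."""
    source = re.sub(r"//[^\n]*|/\*.*?\*/", " ", source, flags=re.S)   # strip comments
    tokens = re.findall(r"[A-Za-z_]\w*|\S", source)                    # identifiers or single chars
    out = []
    for tok in tokens:
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            out.append("ID")                                           # hide identifier names
        else:
            out.append(tok)
    return " ".join(out)

a = "int total = 0; // sum\nfor (int i = 0; i < n; i++) total += v[i];"
b = "int s=0;for(int k=0;k<len;k++)s+=arr[k]; /* same logic */"
print(normalize(a) == normalize(b))   # True: the two snippets normalize identically
```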

Journal ArticleDOI
TL;DR: A coflow compression mechanism is proposed to minimize the completion time in data-intensive applications; the results of both trace-driven simulations and real experiments show the superiority of the algorithm over existing ones.
Abstract: Big data analytics in datacenters often involve scheduling of data-parallel jobs. Traditional scheduling techniques based on improving network resource utilization are subject to limited bandwidth in datacenter networks. To alleviate the shortage of bandwidth, some cluster frameworks employ techniques of traffic compression to reduce transmission consumption. However, they tackle scheduling in a coarse-grained manner at task level and do not perform well in terms of flow-level metrics due to high complexity. Fortunately, the abstraction of coflow pioneers a new perspective to facilitate scheduling efficiency. In this paper, we introduce a coflow compression mechanism to minimize the completion time in data-intensive applications. Due to the NP-hardness, we propose a heuristic algorithm called Fastest-Volume-Disposal-First (FVDF) to solve this problem. For online applicability, FVDF supports stage pipelining to accelerate scheduling and exploits recurrent neural networks (RNNs) to predict compression speed. Meanwhile, we build Swallow, an efficient scheduling system that implements our proposed algorithms. It minimizes coflow completion time (CCT) while guaranteeing resource conservation and starvation freedom. The results of both trace-driven simulations and real experiments show the superiority of our algorithm over existing ones. Specifically, Swallow speeds up CCT and job completion time (JCT) by up to $1.47\times$ and $1.66\times$ on average, respectively, over the SEBF in Varys, one of the most efficient coflow scheduling algorithms so far. Moreover, with coflow compression, Swallow reduces data traffic by up to 48.41 percent on average.

Journal ArticleDOI
TL;DR: This paper addresses the main processing bottlenecks involved in key compression and decompression, and suggests substantial improvements for each of them.
Abstract: Supersingular isogeny-based cryptography is one of the more recent families of post-quantum proposals. An interesting feature is the comparatively low bandwidth occupation in key agreement protocols, which stems from the possibility of key compression. However, compression and decompression introduce a significant overhead to the overall processing cost despite recent progress. In this paper we address the main processing bottlenecks involved in key compression and decompression, and suggest substantial improvements for each of them. Some of our techniques may have an independent interest for other, more conventional areas of elliptic curve cryptography as well.