
Showing papers in "ACM Journal on Emerging Technologies in Computing Systems in 2017"


Journal ArticleDOI
TL;DR: The proposed work shows that when pruning granularities are applied in combination, the CIFAR-10 network can be pruned by more than 70% with less than a 1% loss in accuracy.
Abstract: Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: feature map-wise, kernel-wise, and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, in parallel computing environments, and in hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by assessing the misclassification rate with a corresponding connectivity pattern. The pruned network is retrained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of the kernel and feature map tensors. The proposed work shows that when pruning granularities are applied in combination, we can prune the CIFAR-10 network by more than 70% with less than a 1% loss in accuracy.
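
To make intra-kernel strided sparsity concrete, here is a minimal sketch of a strided pruning mask over a convolutional weight tensor; the stride pattern and tensor sizes are illustrative assumptions and do not reproduce the paper's particle-filtering selection.

    import numpy as np

    def strided_kernel_mask(shape, stride=2, offset=0):
        # Keep every `stride`-th weight inside each 2D kernel; zero the rest.
        out_ch, in_ch, kh, kw = shape
        mask = np.zeros(shape, dtype=np.float32)
        mask.reshape(out_ch, in_ch, kh * kw)[:, :, offset::stride] = 1.0
        return mask

    # Illustrative conv layer: 16 filters, 8 input channels, 3x3 kernels.
    weights = np.random.randn(16, 8, 3, 3).astype(np.float32)
    mask = strided_kernel_mask(weights.shape, stride=2)
    pruned = weights * mask
    print("fraction of weights kept:", mask.mean())  # 5/9 for stride 2 on 3x3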

476 citations


Journal ArticleDOI
TL;DR: A review and classification are presented for current designs of approximate arithmetic circuits, including adders, multipliers, and dividers; improvements in delay, power, and area are obtained for the detection of differences in images by using approximate dividers.
Abstract: Often the most important arithmetic modules in a processor, adders, multipliers, and dividers determine the performance and energy efficiency of many computing tasks. The demand for higher speed and power efficiency, as well as the error resilience of many applications (e.g., multimedia, recognition, and data analytics), has driven the development of approximate arithmetic design. In this article, a review and classification are presented for the current designs of approximate arithmetic circuits, including adders, multipliers, and dividers. A comprehensive and comparative evaluation of their error and circuit characteristics is performed to understand the features of the various designs. By using approximate multipliers and adders, the circuit for an image processing application consumes as little as 47% of the power and 36% of the power-delay product of an accurate design while achieving similar image processing quality. Improvements in delay, power, and area are obtained for the detection of differences in images by using approximate dividers.
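
For a flavor of the accuracy-energy tradeoff these designs exploit, below is a minimal sketch of one classic approximate adder style, a lower-part OR adder, which adds the upper bits exactly and merely ORs the lower bits; this particular design and the bit widths are illustrative assumptions, not ones the article singles out.

    def lower_or_adder(a, b, k=4, width=8):
        # Exact addition on the upper bits; a bitwise OR approximates the
        # lower k bits and drops their carry into the upper part.
        low_mask = (1 << k) - 1
        low = (a | b) & low_mask
        high = ((a >> k) + (b >> k)) << k
        return (high | low) & ((1 << width) - 1)

    # Mean absolute error over all 8-bit operand pairs.
    errs = [abs(lower_or_adder(a, b) - ((a + b) & 0xFF))
            for a in range(256) for b in range(256)]
    print("mean abs error:", sum(errs) / len(errs))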

197 citations


Journal ArticleDOI
Leibin Ni, Hantao Huang, Zichuan Liu, Rajiv V. Joshi, Hao Yu
TL;DR: Based on numerical results for fingerprint matching that is mapped on the proposed RRAM-crossbar, the proposed architecture has shown 2.86x faster speed, 154x better energy efficiency, and 100x smaller area when compared to the same design by CMOS-based ASIC.
Abstract: The recently emerging resistive random-access memory (RRAM) can provide not only nonvolatile memory storage but also intrinsic computing for matrix-vector multiplication, which is ideal for a low-power, high-throughput data analytics accelerator that operates in memory. However, existing RRAM crossbar-based computing is mainly assumed to be multilevel analog computing, whose result is sensitive to process nonuniformity as well as the additional overhead of AD-conversion and I/O. In this article, we explore a matrix-vector multiplication accelerator on a binary RRAM crossbar with adaptive 1-bit-comparator-based parallel conversion. Moreover, a distributed in-memory computing architecture is developed along with the corresponding control protocol. Both the memory array and the logic accelerator are implemented on the binary RRAM crossbar, where the logic-memory pair can be distributed with the control bus protocol. Experimental results show that, compared to the analog RRAM crossbar, the proposed binary RRAM crossbar can achieve significant area savings with better calculation accuracy. Moreover, significant speedup can be achieved for matrix-vector multiplication in neural network-based machine learning, such that the overall training and testing time can both be reduced. In addition, large energy savings can also be achieved when compared to a traditional CMOS-based out-of-memory computing architecture.
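
A software analogue of the binary-crossbar idea: weights stored as 0/1 conductances, bitline "currents" accumulated column-wise, and a 1-bit comparator per column digitizing the result. The reference threshold chosen here is an illustrative assumption.

    import numpy as np

    def binary_crossbar_mvm(x_bits, G, threshold):
        # Wordline inputs select rows; each bitline accumulates the count of
        # ON cells along its column, and a 1-bit comparator digitizes that
        # accumulated current against a reference threshold.
        currents = x_bits @ G
        return (currents >= threshold).astype(np.int8)

    rng = np.random.default_rng(0)
    G = rng.integers(0, 2, size=(64, 16))   # 64x16 binary conductance matrix
    x = rng.integers(0, 2, size=64)         # binary input vector
    print(binary_crossbar_mvm(x, G, threshold=x.sum() // 2))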

65 citations


Journal ArticleDOI
TL;DR: In this article, the authors examine how QPUs can be integrated into current and future HPC system architectures by accounting for functional and physical design requirements, and identify two integration pathways that are differentiated by infrastructure constraints on the QPU and the use cases expected for the HPC systems.
Abstract: The prospects of quantum computing have driven efforts to realize fully functional quantum processing units (QPUs). Recent success in developing proof-of-principle QPUs has prompted the question of how to integrate these emerging processors into modern high-performance computing (HPC) systems. We examine how QPUs can be integrated into current and future HPC system architectures by accounting for functional and physical design requirements. We identify two integration pathways that are differentiated by infrastructure constraints on the QPU and the use cases expected for the HPC system. This includes a tight integration that assumes infrastructure bottlenecks can be overcome as well as a loose integration that assumes they cannot. We find that the performance of both approaches is likely to depend on the quantum interconnect that serves to entangle multiple QPUs. We also identify several challenges in assessing QPU performance for HPC, and we consider new metrics that capture the interplay between system architecture and the quantum parallelism underlying computational performance.

48 citations


Journal ArticleDOI
TL;DR: Various novel design approaches for approximate 4-2 and 5-2 compressors are proposed to reduce the partial product stages in multiplication for energy-efficient Very Large Scale Integration (VLSI) system design.
Abstract: Approximate computing is a promising technique for energy-efficient Very Large Scale Integration (VLSI) system design. It is best suited for error-resilient applications such as signal processing and multimedia. Approximate computing reduces accuracy but still provides meaningful results faster and with lower power consumption, which makes it attractive for arithmetic circuits. In this article, various novel design approaches for approximate 4-2 and 5-2 compressors are proposed to reduce the partial product stages in multiplication. Three approximate 8 × 8 Dadda multiplier designs using three novel approximate 4-2 compressors and two approximate 8 × 8 Dadda multiplier designs using two novel approximate 5-2 compressors have been proposed. The synthesis results show that the proposed designs achieve significant accuracy improvement together with power and delay reductions compared to existing approximate designs.
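
A 4-2 compressor encodes the arithmetic sum of its five inputs into outputs of weights 1 and 2; comparing an exact encoding against a simplified one over all input patterns shows how approximation trades correctness for fewer gates. The approximate variant below is a generic illustration, not one of the article's proposed designs.

    from itertools import product

    def exact_42(x1, x2, x3, x4, cin=0):
        # Exact 4-2 compressor: x1+x2+x3+x4+cin == sum + 2*(carry + cout).
        total = x1 + x2 + x3 + x4 + cin
        s = total & 1
        h = total >> 1                  # 0..2, carried out with weight 2
        return s, 1 if h >= 2 else 0, 1 if h >= 1 else 0  # sum, carry, cout

    def approx_42(x1, x2, x3, x4, cin=0):
        # Simplified carry logic and no cout: fewer gates, occasional error.
        return x1 ^ x2 ^ x3 ^ x4 ^ cin, (x1 & x2) | (x3 & x4), 0

    wrong = 0
    for v in product((0, 1), repeat=5):
        se, ce, coe = exact_42(*v)
        sa, ca, _ = approx_42(*v)
        wrong += (se + 2 * (ce + coe)) != (sa + 2 * ca)
    print(f"erroneous input patterns: {wrong}/32")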

43 citations


Journal ArticleDOI
TL;DR: SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks, enables deploying networks in embedded, resource-bound settings by exploiting both the efficient forms of parallelism inherent in convolutional layers and the proposed sparsification and approximation techniques.
Abstract: Deep neural networks have been shown to outperform prior state-of-the-art solutions that often relied heavily on hand-engineered feature extraction techniques coupled with simple classification algorithms. In particular, deep convolutional neural networks have been shown to dominate on several popular public benchmarks such as the ImageNet database. Unfortunately, the benefits of deep networks have yet to be fully exploited in embedded, resource-bound settings that have strict power and area budgets. Graphics processing units (GPUs) have been shown to improve throughput and energy efficiency over central processing units (CPUs) due to their highly parallel architecture, yet they still impose a significant power burden. In a similar fashion, field-programmable gate arrays (FPGAs) can be used to improve performance while allowing more fine-grained control over the implementation to improve efficiency. In order to reduce power and area while still achieving the required throughput, classification-efficient network architectures are required in addition to optimal deployment on efficient hardware. In this work, we target both of these objectives. For the first, we analyze simple, biologically inspired reduction strategies that are applied both before and after training. The central theme of these techniques is the introduction of sparsification to help dissolve away the dense connectivity often found at different levels in convolutional neural networks. The sparsification techniques include feature compression partition, structured filter pruning, and dynamic feature pruning. Additionally, we explore filter factorization and filter quantization approximation techniques to further reduce the complexity of convolutional layers. As the second contribution, we propose SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks. The accelerator enables deploying networks in such resource-bound settings by exploiting both the efficient forms of parallelism inherent in convolutional layers and the proposed sparsification and approximation techniques. To demonstrate both contributions, modern deep convolutional network architectures containing millions of parameters are explored in the context of the CIFAR computer vision dataset. Utilizing the reduction techniques, we demonstrate the ability to reduce computation and memory by 60% and 93%, respectively, with less than 0.03% impact on accuracy when compared to the best baseline network with 93.47% accuracy. The SPARCNet accelerator with different numbers of processing engines is implemented on a low-power Artix-7 FPGA platform. Additionally, the same networks are optimally implemented on a number of embedded commercial-off-the-shelf platforms, including NVIDIA's CPU+GPU SoCs TK1 and TX1 and the Intel Edison. Compared to NVIDIA's TK1 and TX1, the FPGA-based accelerator obtains 11.8× and 7.5× improvements in energy efficiency while maintaining a classification throughput of 72 images/s. When further compared to a number of recent FPGA-based accelerators, SPARCNet achieves up to 15× improvement in energy efficiency while consuming less than 2W of total board power at 100MHz. In addition to improving efficiency, the accelerator has built-in support for the sparsification techniques and the ability to perform in-place rectified linear unit (ReLU) activation, max-pooling, and batch normalization.
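
Of the reduction strategies above, structured filter pruning is the easiest to sketch: rank whole filters and drop the weakest, so entire rows of the weight tensor disappear. The L1-norm ranking used here is a common criterion adopted for illustration; the article's own selection rules may differ.

    import numpy as np

    def prune_filters_l1(weights, keep_ratio=0.5):
        # Score each output filter by its L1 norm and keep the strongest
        # `keep_ratio` fraction; pruning stays structured (whole filters).
        norms = np.abs(weights).sum(axis=(1, 2, 3))
        k = max(1, int(len(norms) * keep_ratio))
        keep = np.sort(np.argsort(norms)[-k:])
        return weights[keep]

    conv_w = np.random.randn(64, 32, 3, 3)                  # 64 filters
    print(prune_filters_l1(conv_w, keep_ratio=0.25).shape)  # (16, 32, 3, 3)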

39 citations


Journal ArticleDOI
TL;DR: Solutions to the reliability issues identified are addressed within a taxonomy created to categorize the current and future approaches to reliable STT-MRAM designs.
Abstract: Spin-Transfer Torque Random Access Memory (STT-MRAM) has been explored as a post-CMOS technology for embedded and data storage applications seeking non-volatility, near-zero standby energy, and high density. Towards attaining these objectives for practical implementations, various techniques to mitigate the specific reliability challenges associated with STT-MRAM elements are surveyed, classified, and assessed in this article. Cost and suitability metrics assessed include the area of nanomagnetic and CMOS components per bit, access time and complexity, sense margin, and energy or power consumption costs versus resiliency benefits. Solutions to the reliability issues identified are addressed within a taxonomy created to categorize the current and future approaches to reliable STT-MRAM designs. A variety of destructive and non-destructive sensing schemes are assessed for process variation tolerance, read disturbance reduction, sense margin, and write polarization asymmetry compensation. The highest resiliency strategies deliver a sensing margin above 300mV while incurring low power and energy consumption on the order of microwatts and picojoules, respectively, and attaining read sense latency of a few nanoseconds down to hundreds of picoseconds for non-destructive and destructive sensing schemes, respectively.

33 citations


Journal ArticleDOI
TL;DR: The results show that the timing characteristics of an M3D IC can be significantly altered due to coupling and wafer-bonding defects if the thickness of its ILD is less than 100nm, and test-generation methods must be enhanced to take M3D fabrication defects into account.
Abstract: Monolithic three-dimensional (M3D) integration is gaining momentum, as it has the potential to achieve significantly higher device density compared to 3D integration based on through-silicon vias. M3D integration uses several techniques that are not used in the fabrication of conventional integrated circuits (ICs). Therefore, a detailed analysis of the M3D fabrication process is required to understand the impact of defects that are likely to occur during chip fabrication. In this article, we first analyze electrostatic coupling in M3D ICs, which arises due to the aggressive scaling of the interlayer dielectric (ILD) thickness. We then analyze defects that arise due to voids created during wafer bonding, a key step in most M3D fabrication processes. We quantify the impact of these defects on the threshold voltage of a top-layer transistor in an M3D IC. We also show that wafer-bonding defects can lead to a change in the resistance of interlayer vias (ILVs), and in some cases lead to an open in an ILV or a short between two ILVs. We then analyze the impact of these defects on path delays using HSpice simulations. We study their impact on the effectiveness of delay-test patterns for multiple instances of IWLS 2005 benchmarks in which these defects were randomly injected. Our results show that the timing characteristics of an M3D IC can be significantly altered due to coupling and wafer-bonding defects if the thickness of its ILD is less than 100nm. Therefore, for such M3D ICs, test-generation methods must be enhanced to take M3D fabrication defects into account.

32 citations


Journal ArticleDOI
TL;DR: This article presents a novel energy management technique for WiNoC architectures, aimed at improving the energy efficiency of the main elements of the wireless infrastructure (the radio-hubs), based on selectively turning off, for the appropriate number of cycles, all the radio-hubs that are not involved in the current wireless communication.
Abstract: Wireless Network-on-Chip (WiNoC) represents a promising emerging communication technology for addressing the scalability limitations of future manycore architectures. In a WiNoC, high-latency and power-hungry long-range multi-hop communications can be realized by performance- and energy-efficient single-hop wireless communications. However, the energy contribution of such wireless communication accounts for a significant fraction of the overall communication energy budget. This article presents a novel energy management technique for WiNoC architectures aimed at improving the energy efficiency of the main elements of the wireless infrastructure, namely, the radio-hubs. The rationale behind the proposed technique is to selectively turn off, for the appropriate number of cycles, all the radio-hubs that are not involved in the current wireless communication. The proposed energy management technique is assessed on several network configurations under different traffic scenarios, both synthetic and extracted from the execution of real applications. The obtained results show that applying the proposed technique allows up to 25% total communication energy saving without any impact on performance and with a negligible impact on the silicon area of the radio-hub.

31 citations


Journal ArticleDOI
TL;DR: SwiftNoC is presented, a novel reconfigurable silicon-photonic NoC architecture that features improved multicast-enabled channel sharing, as well as dynamic re-prioritization and exchange of bandwidth between clusters of cores running multiple applications, to increase channel utilization and system performance.
Abstract: On-chip communication is widely considered to be one of the major performance bottlenecks in contemporary chip multiprocessors (CMPs). With recent advances in silicon nanophotonics, photonics-based network-on-chip (NoC) architectures are being considered as a viable solution to support communication in future CMPs as they can enable higher bandwidth and lower power dissipation compared to traditional electrical NoCs. In this article, we present SwiftNoC, a novel reconfigurable silicon-photonic NoC architecture that features improved multicast-enabled channel sharing, as well as dynamic re-prioritization and exchange of bandwidth between clusters of cores running multiple applications, to increase channel utilization and system performance. Experimental results show that SwiftNoC improves throughput by up to 25.4× while reducing latency by up to 72.4% and energy-per-bit by up to 95% over state-of-the-art solutions.

31 citations


Journal ArticleDOI
TL;DR: Conditional Deep Learning is proposed, where the convolutional layer features are used to identify the variability in the difficulty of input instances and conditionally activate the deeper layers of the network, resulting in improved classification over state-of-the-art baseline networks.
Abstract: Deep-learning neural networks have proven to be very successful for a wide range of recognition tasks across modern computing platforms. However, the computational requirements associated with such deep nets can be quite high, and hence their energy-efficient implementation is of great interest. Although, traditionally, the entire network is utilized for the recognition of all inputs, we observe that the classification difficulty varies widely across inputs in real-world datasets; only a small fraction of inputs requires the full computational effort of a network, while a large majority can be classified correctly with very low effort. In this article, we propose Conditional Deep Learning (CDL), where the convolutional layer features are used to identify the variability in the difficulty of input instances and conditionally activate the deeper layers of the network. We achieve this by cascading a linear network of output neurons for each convolutional layer and monitoring the output of the linear network to decide whether classification can be terminated at the current stage or not. The proposed methodology thus enables the network to dynamically adjust the computational effort depending on the difficulty of the input data while maintaining competitive classification accuracy. The overall energy benefits for MNIST/CIFAR-10/Tiny ImageNet datasets with state-of-the-art deep-learning architectures are 1.84×/2.83×/4.02×, respectively. We further employ the conditional approach to train deep-learning networks from scratch with integrated supervision from the additional output neurons appended at the intermediate convolutional layers. Our proposed integrated CDL training leads to an improvement in the gradient convergence behavior giving substantial error rate reduction on MNIST/CIFAR-10, resulting in improved classification over state-of-the-art baseline networks.
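
The conditional-activation idea amounts to an early-exit cascade: an auxiliary linear classifier is monitored after each stage, and evaluation stops once its confidence clears a threshold. The stand-in stages, softmax confidence measure, and threshold below are illustrative assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def conditional_classify(x, stages, heads, threshold=0.9):
        # Run stage by stage; exit as soon as the linear head attached to
        # the current stage is confident enough (or at the final stage).
        h = x
        for i, (stage, head) in enumerate(zip(stages, heads)):
            h = stage(h)
            p = softmax(head @ h)
            if p.max() >= threshold or i == len(stages) - 1:
                return int(p.argmax()), i   # predicted class, exit stage

    rng = np.random.default_rng(1)
    stages = [lambda h, W=rng.standard_normal((32, 32)): np.tanh(W @ h)
              for _ in range(3)]            # stand-ins for conv stages
    heads = [rng.standard_normal((10, 32)) for _ in range(3)]
    cls, depth = conditional_classify(rng.standard_normal(32), stages, heads)
    print(f"class {cls} decided at stage {depth}")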

Journal ArticleDOI
TL;DR: The proposed Spin-Torque Transfer RAM-based Ternary CAM (TCAM) cells achieve a 62.5% (33%) reduction in the number of transistors compared to conventional CMOS (spintronic) TCAMs, and the sense margin is analyzed with respect to 16-, 32-, 64-, 128-, and 256-bit word sizes in 22nm predictive technology.
Abstract: Content Addressable Memory (CAM) is widely used in applications where searching for a specific pattern of data is a major operation. Conventional CAMs suffer from area, power, and speed limitations. We propose Spin-Torque Transfer RAM-based Ternary CAM (TCAM) cells. The proposed NOR-type TCAM cell has a 62.5% (33%) reduction in the number of transistors compared to conventional CMOS (spintronic) TCAMs. We analyzed the sense margin of the proposed TCAM with respect to 16-, 32-, 64-, 128-, and 256-bit word sizes in 22nm predictive technology. Simulations indicated a reliable sense margin of 50mV even at a 0.7V supply voltage for a 256-bit word. We also explored selective threshold voltage modulation of transistors to improve the sense margin and tolerate process and voltage variations. The worst-case search latency and sense margin of the 256-bit TCAM are found to be 263ps and 220mV, respectively, at a 1V supply voltage. The average search power consumed is 13mW, and the search energy is 4.7fJ per bit-search. The write time is 4ns, and the write energy is 0.69pJ/bit. We leverage the NOR-type TCAM design to realize a 9T-2 Magnetic Tunnel Junction (MTJ) NAND-type TCAM cell that requires 43.75% fewer transistors than the conventional CMOS TCAM cell. A NAND-type cell can support up to 64-bit words with a maximum sense margin of up to 33mV. We compare the performance metrics of the NOR- and NAND-type TCAM cells with other TCAMs in the literature.
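
Functionally, a TCAM word matches a search key when every cell that is not a don't-care agrees with the corresponding key bit; a behavioral sketch (the MTJ storage, match lines, and sensing are abstracted away).

    def tcam_search(words, key):
        # Each stored word is a string over {'0', '1', 'X'}, where 'X' is
        # don't-care; return the indices of all words matching the key.
        return [i for i, w in enumerate(words)
                if all(c == 'X' or c == k for c, k in zip(w, key))]

    table = ["10X1", "0XX0", "1101"]
    print(tcam_search(table, "1011"))   # [0]: '10X1' matches '1011'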

Journal ArticleDOI
TL;DR: This article explores the best region of operation for a memristive crossbar PUF (XbarPUF) using a comprehensive temperature-dependent model of an HfOx (hafnium-oxide) memristor and presents estimates of area, power, and delay alongside security performance metrics to analyze the strengths and weaknesses of the XbarPUF.
Abstract: Hardware security has emerged as a field concerned with issues such as integrated circuit (IC) counterfeiting, cloning, piracy, and reverse engineering. Physical unclonable functions (PUFs) are hardware security primitives useful for mitigating such issues by providing hardware-specific fingerprints based on the intrinsic process variations within individual IC implementations. As technology scaling progresses further into the nanometer region, emerging nanoelectronic technologies, such as memristors or RRAMs (resistive random-access memory), have become interesting options for emerging computing systems. In this article, using a comprehensive temperature-dependent model of an HfOx (hafnium-oxide) memristor based on experimental measurements, we explore the best region of operation for a memristive crossbar PUF (XbarPUF). The design considered also employs XORing and a column shuffling technique to improve reliability and resilience to machine-learning attacks. We present a detailed analysis of the noise margin and discuss the scalability of the XbarPUF structure. Finally, we present estimates of area, power, and delay alongside security performance metrics to analyze the strengths and weaknesses of the XbarPUF. Our XbarPUF exhibits nearly ideal (near 50%) uniqueness, bit-aliasing, and uniformity, good reliability of 90% and up (with 100% being ideal), a very small footprint, and a low average power consumption of approximately 104μW.

Journal ArticleDOI
TL;DR: This work proposes three low overhead fault tolerance approaches based on instruction duplication with zero latency detection, which uses a rollback mechanism to correct soft errors in the pipelanes of a configurable VLIW processor.
Abstract: Because of technology scaling, the soft error rate has been increasing in digital circuits, which affects system reliability. Therefore, modern processors, including VLIW architectures, must have means to mitigate such effects to guarantee reliable computing. In this scenario, our work proposes three low overhead fault tolerance approaches based on instruction duplication with zero latency detection, which use a rollback mechanism to correct soft errors in the pipelanes of a configurable VLIW processor. The first uses idle issue slots within a period of time to execute extra instructions considering distinct application phases. The second works at a finer grain, adaptively exploiting idle functional units at run-time. However, some applications present high instruction-level parallelism (ILP), so the ability to provide fault tolerance is reduced: fewer functional units will be idle, decreasing the number of potential duplicated instructions. The third approach attacks this issue by dynamically reducing ILP according to a configurable threshold, increasing fault tolerance at the cost of performance. While the first two approaches achieve significant fault coverage with minimal area and power overhead for applications with low ILP, the latter improves fault tolerance with low performance degradation. All approaches are evaluated considering area, performance, power dissipation, and error coverage.

Journal ArticleDOI
TL;DR: A methodology is introduced for synthesizing a given target function stochastically using finite-state machines (FSMs), and the reconfigurable architecture is enhanced and extended using sequential logic.
Abstract: Computations based on stochastic bit streams have several advantages compared to deterministic binary radix computations, including low power consumption, low hardware cost, high fault tolerance, and skew tolerance. To take advantage of this computing technique, previous work proposed a combinational logic-based reconfigurable architecture to perform complex arithmetic operations on stochastic streams of bits. The long execution time and the cost of converting between binary and stochastic representations, however, make the stochastic architectures less energy efficient than the deterministic binary implementations. This article introduces a methodology for synthesizing a given target function stochastically using finite-state machines (FSMs), and enhances and extends the reconfigurable architecture using sequential logic. Compared to the previous approach, the proposed reconfigurable architecture can save hardware area and energy consumption by up to 30% and 40%, respectively, while achieving a higher processing speed. Both stochastic reconfigurable architectures are much more tolerant of soft errors (bit flips) than the deterministic binary radix implementations, and their fault tolerance scales gracefully to very large numbers of errors.

Journal ArticleDOI
TL;DR: It is demonstrated that careful adjustment of path delays can lead to significant error reduction under voltage and frequency scaling, and logical and physical design techniques can be combined to significantly expand the already-powerful accuracy-energy tradeoff possibilities of SC.
Abstract: As we approach the limits of traditional Moore’s-Law scaling, alternative computing techniques that consume energy more efficiently become attractive. Stochastic computing (SC), as a re-emerging computing technique, is a low-cost and error-tolerant alternative to conventional binary circuits in several important applications such as image processing and communications. SC allows a natural accuracy-energy tradeoff that has been exploited in the past. This article presents an accuracy-energy tradeoff technique for SC circuits that reduces their energy consumption with virtually no accuracy loss. To this end, we employ voltage or frequency scaling, which normally reduces energy consumption at the cost of timing errors. Then we show that due to their inherent error tolerance, SC circuits operate satisfactorily without significant accuracy loss even with aggressive scaling. This significantly improves their energy efficiency. In contrast, conventional binary circuits quickly fail as the supply voltage decreases. To find the most energy-efficient operating point of an SC circuit, we propose an error estimation method that allows us to quickly explore the circuit’s design space. The error estimation method is based on Markov chain analysis and least-squares regression. Furthermore, we investigate opportunities to optimize SC circuits under such aggressive scaling. We find that logical and physical design techniques can be combined to significantly expand the already-powerful accuracy-energy tradeoff possibilities of SC. In particular, we demonstrate that careful adjustment of path delays can lead to significant error reduction under voltage and frequency scaling. We perform buffer insertion and route detouring to achieve more balanced path delays. These techniques differ from conventional path-balancing techniques whose goal is to minimize power consumption by resizing the non-critical paths. The goal of our path-balancing approach is to increase error cancellation chances in voltage-/frequency-scaled SC circuits. Our circuit optimization accounts for the tradeoff between power overheads due to inserted buffers and wires versus the energy reduction from supply voltage downscaling enabled by more balanced path delays. Simulation results show that our optimized SC circuits can tolerate aggressive voltage scaling with no significant signal-to-noise ratio (SNR) degradation. In one example, a 40% supply voltage reduction (1V to 0.6V) on the SC circuit leads to 66% energy saving (20.7pJ to 6.9pJ) and makes it more efficient than its conventional binary counterpart. In the same example, a 100% frequency boosting (400ps to 200ps) of the optimized circuits leads to no significant SNR degradation. We also show that process variation and temperature variation have limited impact on optimized SC circuits. The error change is less than 5% when temperature changes by 100°C or process condition changes from worst case to best case.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed stochastic logic circuits require less hardware complexity than the previous stochastic polynomial implementation using Bernstein polynomials.
Abstract: This article addresses subtraction and polynomial computations using unipolar stochastic logic. Stochastic computing requires simple logic gates, and stochastic logic-based circuits are inherently fault tolerant. Thus, these structures are well suited for nanoscale CMOS technologies. It is well known that an AND gate and a multiplexer can be used to implement a stochastic unipolar multiplier and adder, respectively. Although it is easy to realize multiplication and scaled addition, implementation of subtraction is nontrivial using unipolar stochastic logic. Additionally, an accurate computation of subtraction is critical for the implementation of polynomials with negative coefficients in stochastic unipolar representation. This work, for the first time, demonstrates that instead of using well-known Bernstein polynomials, stochastic computation of polynomials can be implemented by using a stochastic subtractor and factorization. Three major contributions are given in this article. First, two approaches are proposed to compute subtraction in stochastic unipolar representation. In the first approach, the subtraction operation is approximated by cascading multiple levels of OR and AND gates. The accuracy of the approximation improves with the number of stages. In the second approach, the stochastic subtraction is implemented using a multiplexer and a stochastic divider. This approach requires more hardware complexity due to the use of a linear-feedback shift register and a counter for division. Second, computation of polynomials in stochastic unipolar format is presented using scaled addition and the proposed stochastic subtraction. Third, we propose stochastic computation of polynomials using factorization. Stochastic implementations of first- and second-order factors are presented for different locations of polynomial roots. Experimental results show that the proposed stochastic logic circuits require less hardware complexity than the previous stochastic polynomial implementation using Bernstein polynomials.
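
As the abstract notes, in unipolar stochastic logic an AND gate multiplies and a multiplexer performs scaled addition; a quick bit-stream simulation of both (the stream length is chosen arbitrarily).

    import random

    def to_stream(p, n=4096):
        # Encode probability p as a unipolar stochastic bit stream.
        return [random.random() < p for _ in range(n)]

    def from_stream(s):
        return sum(s) / len(s)

    a, b = to_stream(0.8), to_stream(0.5)
    sel = to_stream(0.5)   # select stream of the scaled adder

    mult = [x and y for x, y in zip(a, b)]                # AND: p_a * p_b
    add = [x if s else y for x, y, s in zip(a, b, sel)]   # MUX: (p_a + p_b)/2

    print(round(from_stream(mult), 2))   # close to 0.40
    print(round(from_stream(add), 2))    # close to 0.65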

Journal ArticleDOI
TL;DR: WAlloc, an efficient wear-aware manual memory allocator designed for NVRAM, is proposed; it decouples metadata and data management and redesigns an efficient and effective NVM copy mechanism that partially bypasses the CPU cache and prefetches data explicitly.
Abstract: Non-volatile memory (NVM) has the merits of byte-addressability, high speed, persistency, and low power consumption, which make it attractive as main memory. Commonly, a user process dynamically acquires memory through memory allocators. However, traditional memory allocators designed around in-place data writes are not appropriate for non-volatile main memory (NVRAM) due to its limited endurance. In this article, we first quantitatively analyze the wear-obliviousness of the DRAM-oriented allocator glibc malloc and the inefficiency of the wear-conscious allocator NVMalloc. Then, we propose WAlloc, an efficient wear-aware manual memory allocator designed for NVRAM that (1) decouples metadata and data management; (2) distinguishes metadata by volatility; (3) redirects data writes to achieve wear-leveling; and (4) redesigns an efficient and effective NVM copy mechanism that partially bypasses the CPU cache and prefetches data explicitly. Finally, experimental results show that the wear-leveling of WAlloc outperforms that of NVMalloc by about 30% and 60% under random and well-distributed workloads, respectively. Besides, WAlloc reduces the average data memory writes to 64-byte blocks by 1.5× compared with glibc malloc. While fulfilling data persistency, the cache-bypassing NVM copy outperforms a cache-line-flushing NVM copy by about 14%.
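
The wear-leveling intent of redirecting writes can be pictured with a toy allocator that steers each write to the least-worn block; this simplification does not model WAlloc's metadata/data decoupling or its cache-bypassing NVM copy path.

    import heapq

    class WearAwareAllocator:
        # A min-heap keyed on per-block write counts steers every new
        # write to the least-worn block (toy wear-leveling policy).
        def __init__(self, num_blocks):
            self.heap = [(0, b) for b in range(num_blocks)]
            heapq.heapify(self.heap)

        def alloc_write(self):
            wear, block = heapq.heappop(self.heap)
            heapq.heappush(self.heap, (wear + 1, block))  # record the write
            return block

    nvm = WearAwareAllocator(num_blocks=4)
    print([nvm.alloc_write() for _ in range(8)])  # wear spreads: 0..3, 0..3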

Journal ArticleDOI
TL;DR: This article describes in detail the algorithms and heuristics that have been proposed for resource-constrained scheduling, focusing on two recent contributions: path scheduling and force-directed list scheduling.
Abstract: Digital microfluidics based on electrowetting-on-dielectric technology is poised to revolutionize many aspects of chemistry and biochemistry through miniaturization, automation, and software programmability. Digital microfluidic biochips (DMFBs) offer ample spatial parallelism, which is then exposed to the compiler. The first problem that a DMFB compiler must solve is resource-constrained scheduling, which is NP-complete. If the compiler is applied off-line, then long-running algorithms that produce solutions of high quality, such as iterative improvement or branch-and-bound search, can be applied; in an online context, where a biochemical reaction is to be executed as soon as it is specified by the programmer, heuristics that sacrifice solution quality to attain a fast runtime are used. This article describes in detail the algorithms and heuristics that have been proposed for resource-constrained scheduling, focusing on two recent contributions: path scheduling and force-directed list scheduling. It also discusses shortcomings and limitations of existing optimal scheduling problem formulations based on Integer Linear Programming and presents an updated formulation that addresses these issues. The algorithms are compared and evaluated on an extensive benchmark suite of biochemical assays used for applications such as in vitro diagnostics, protein crystallization, and automated sample preparation.
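
A minimal list scheduler conveys the core of the resource-constrained problem: at each time step, ready operations compete for a limited number of modules. The plain FIFO priority here is a placeholder for the force-directed and path-scheduling heuristics the article covers.

    def list_schedule(ops, deps, capacity):
        # ops: operation names; deps: {op: set of predecessors};
        # capacity: max ops per time step. Returns {op: start_time}.
        done, schedule, t = set(), {}, 0
        while len(done) < len(ops):
            ready = [o for o in ops
                     if o not in done and deps.get(o, set()) <= done]
            for o in ready[:capacity]:      # greedily fill available modules
                schedule[o] = t
            done.update(ready[:capacity])
            t += 1
        return schedule

    # Toy assay graph: mix1 and mix2 feed heat, which feeds detect.
    deps = {"heat": {"mix1", "mix2"}, "detect": {"heat"}}
    print(list_schedule(["mix1", "mix2", "heat", "detect"], deps, capacity=1))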

Journal ArticleDOI
TL;DR: An in-depth analysis is provided of how the “self-organizing” ability of a coupled STNO array can be effectively used for computations that are unsuitable or inefficient in the von Neumann computing domain.
Abstract: In this article, we present a comprehensive study of four frequency locking mechanisms in Spin Torque Nano Oscillators (STNOs) and explore their suitability for a class of specialized computing applications. We implemented a physical STNO model based on the Landau-Lifshitz-Gilbert-Slonczewski equation and benchmarked the model against experimental data. Based on our simulations, we provide an in-depth analysis of how the “self-organizing” ability of a coupled STNO array can be effectively used for computations that are unsuitable or inefficient in the von Neumann computing domain. As a case study, we demonstrate the computing ability of coupled STNOs with two applications: edge detection of an image and associative computing for image recognition. We provide an analysis of the scaling trends of STNOs and the effectiveness of different frequency locking mechanisms with scaling in the presence of thermal noise. We also provide an in-depth analysis of the effect of variations on the four locking mechanisms to find the most robust one in the presence of variations.

Journal ArticleDOI
TL;DR: It is found that 3D wafer scale integration, combined with technologies nearing readiness, offers the potential for scaleup to a primate-scale brain, while further scaleup to a human-scale brain would require significant additional innovations.
Abstract: The Von Neumann architecture, defined by strict and hierarchical separation of memory and processor, has been a hallmark of conventional computer design since the 1940s. It is becoming increasingly unsuitable for cognitive applications, which require massive parallel processing of highly interdependent data. Inspired by the brain, we propose a significantly different architecture characterized by a large number of highly interconnected simple processors intertwined with very large amounts of low-latency memory. We contend that this memory-centric architecture can be realized using 3D wafer scale integration for which the technology is nearing readiness, combined with current CMOS device technologies. The natural fault tolerance and lower power requirements of neuromorphic processing make 3D wafer stacking particularly attractive. In order to assess the performance of this architecture, we propose a specific embodiment of a neuronal system using 3D wafer scale integration; formulate a simple model of brain connectivity including short- and long-range connections; and estimate the memory, bandwidth, latency, and power requirements of the system using the connectivity model. We find that 3D wafer scale integration, combined with technologies nearing readiness, offers the potential for scaleup to a primate-scale brain, while further scaleup to a human-scale brain would require significant additional innovations.

Journal ArticleDOI
TL;DR: This article proposes an advanced SBUS protocol (ASBUS) to improve the data feeding efficiency of Advanced Encryption Standard (AES) encrypted circuits, and shows that the presented ASBUS structure outperforms the AXI-based design in cipher tests.
Abstract: Security is becoming a de facto requirement of Systems-on-Chip (SoCs), accounting for a significant share of circuit design cost. In this article, we propose an advanced SBUS protocol (ASBUS) to improve the data feeding efficiency of Advanced Encryption Standard (AES) encrypted circuits. As a case study, direct memory access (DMA) combined with an AES engine and a memory controller is implemented as our design-under-test (DUT) using field-programmable gate arrays (FPGAs). The results show that the presented ASBUS structure outperforms the AXI-based design in cipher tests. As an example, the 32-bit ASBUS design costs less in terms of hardware resources and achieves higher throughput (1.30×) than the 32-bit AXI implementation, and the dynamic energy consumed by the ASBUS cipher test is reduced to 71.27% of that of the AXI test.

Journal ArticleDOI
TL;DR: This work turns to the crypsis behavior exhibited by geckos in nature to generate a runtime security technique for SoC architectures that can bypass runtime passive threats of a HTH.
Abstract: The rapid evolution of the embedded era has witnessed the globalization of SoC architecture design in the semiconductor industry. Though issues of cost and stringent marketing deadlines have been resolved by this methodology, the root of hardware trust has been eroded. Malicious circuitry, a.k.a. a Hardware Trojan Horse (HTH), is inserted by adversaries in the less trusted phases of design. A HTH remains dormant during testing but gets triggered at runtime to cause sudden active and passive attacks. In this work, we focus on runtime passive threats based on the delay parameter. Nature-inspired algorithms offer an alternative to conventional techniques for solving complex problems in computer science; however, most are optimization techniques and none is dedicated to security. We draw on the crypsis behavior exhibited by geckos in nature to generate a runtime security technique for SoC architectures that can bypass runtime passive threats of a HTH. An adaptive security intellectual property (IP) that works on the proposed security principles is designed. Embedded timing analysis is used for experimental validation. The low area and power overheads of our proposed security IP, as obtained in experimental results on standard benchmarks and practical crypto SoC architectures, support its applicability for practical implementations.

Journal ArticleDOI
TL;DR: A multiobjective quantum circuit synthesis method is proposed that generates a set of quantum circuits and attempts to simultaneously improve the measurement pattern cost metrics after the translation from this set of quantum circuits.
Abstract: One-way quantum computation (1WQC) is a model of universal quantum computation in which a specific highly entangled state called a cluster state allows for quantum computation by single-qubit measurements. The needed computations in this model are organized as measurement patterns. The traditional approach to obtaining a measurement pattern is to translate a quantum circuit that solely consists of CZ and J(α) gates into the corresponding measurement patterns and then perform some optimizations using techniques proposed for the 1WQC model. However, in these cases, the input of the problem is a quantum circuit, not an arbitrary unitary matrix. Therefore, in this article, we focus on the first phase, that is, decomposing a unitary matrix into CZ and J(α) gates. Two well-known quantum circuit synthesis methods, namely cosine-sine decomposition and quantum Shannon decomposition, are considered and then adapted for a library of gates containing CZ and J(α), equipped with optimizations. By exploring the solution space of the combinations of these two methods in a bottom-up dynamic programming approach, a multiobjective quantum circuit synthesis method is proposed that generates a set of quantum circuits. This approach attempts to simultaneously improve the measurement pattern cost metrics after the translation from this set of quantum circuits.

Journal ArticleDOI
TL;DR: An efficient hardware architecture is presented for the restricted Boltzmann machine (RBM), an important category of NN systems; analysis shows that the VLSI design of the RBM achieves significant improvement in training speed and energy efficiency compared to CPU/GPU-based solutions.
Abstract: Neural network (NN) systems are widely used in many important applications ranging from computer vision to speech recognition. To date, most NN systems are processed by general processing units like CPUs or GPUs. However, as the sizes of datasets and networks rapidly increase, the original software implementations suffer from long training times. To overcome this problem, specialized hardware accelerators are needed to build high-speed NN systems. This article presents an efficient hardware architecture for the restricted Boltzmann machine (RBM), an important category of NN systems. Various optimization approaches at the hardware level are performed to improve the training speed. As-soon-as-possible and overlapped-scheduling approaches are used to reduce the latency. It is shown that, compared with a flat design, the proposed RBM architecture can achieve a 50% reduction in training time. In addition, an on-the-fly computation scheme is used to reduce the storage requirement of binary and stochastic states by several hundreds of times. Then, based on the proposed approach, a 784-2252 RBM design example is developed for the MNIST handwritten digit recognition dataset. Analysis shows that the VLSI design of the RBM achieves significant improvement in training speed and energy efficiency as compared to CPU/GPU-based solutions.
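
RBM training of the kind such accelerators speed up typically centers on contrastive divergence; below is a minimal CD-1 weight update in NumPy, with biases omitted and sizes chosen for illustration (the accelerator's scheduling and on-the-fly state generation are not modeled).

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v0, W, lr=0.1):
        # One CD-1 update for a binary RBM: positive phase, one Gibbs
        # step of reconstruction, negative phase, then the weight delta.
        h0_p = sigmoid(v0 @ W)
        h0 = (rng.random(h0_p.shape) < h0_p) * 1.0   # sample hidden states
        v1_p = sigmoid(h0 @ W.T)                     # reconstruction
        h1_p = sigmoid(v1_p @ W)
        W += lr * (v0.T @ h0_p - v1_p.T @ h1_p) / len(v0)
        return W

    W = 0.01 * rng.standard_normal((784, 225))   # MNIST-sized visible layer
    batch = (rng.random((32, 784)) < 0.2) * 1.0  # toy binary batch
    W = cd1_step(batch, W)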

Journal ArticleDOI
Bing Li, Yu Hu, Ying Wang, Jing Ye, Xiaowei Li
TL;DR: This work investigates both write scheduling policy and power management to improve the MLC power utility and alleviate the negative impacts induced by high write power and proposes the SET Power Amortization (SPA) policy, which proactively reclaims the power tokens at the intra-SET level to promote the power utilization.
Abstract: Phase change memory (PCM) is a promising alternative to Dynamic Random Access Memory (DRAM) as main memory due to its merits of high density and low leakage power. Multi-level Cell (MLC) PCM is more attractive than Single-level Cell (SLC) PCM because it can store multiple bits per cell to achieve higher density and lower per-bit cost. With the iterative program-verify write technique, MLC PCM writes demand much higher power than DRAM writes, while the power supply system of an MLC memory system is similar to that of DRAM and its power capability is limited. The incompatibility of high write power and a limited power budget results in degraded write throughput and performance in MLC PCM. In this work, we investigate both the write scheduling policy and power management to improve the MLC power utility and alleviate the negative impacts induced by high write power. We identify power-utility-driven write scheduling as an online bin-packing problem and then derive a power-utility-driven scheduling (PUDS) policy from the First Fit algorithm to improve write power usage. Based on the ramp-down characteristic of the SET pulse (the pulse that changes the PCM to high resistance), we propose the SET Power Amortization (SPA) policy, which proactively reclaims power tokens at the intra-SET level to improve power utilization. Our experimental results demonstrate that PUDS and SPA respectively achieve 24% and 27% performance improvement over the state-of-the-art power management technique, and the combined PUDS+SPA achieves an overall 31% improvement in power utility and a 50% increase in performance compared to the baseline system.
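
The abstract casts power-utility-driven write scheduling as online bin packing solved with First Fit; here is a minimal First Fit sketch in which each bin is a service round with a fixed power budget and each item is one write's power demand (units are illustrative).

    def first_fit(write_powers, budget):
        # Place each write into the first round whose remaining power
        # budget can absorb it; open a new round when none can.
        rounds = []        # remaining budget per service round
        assignment = []    # round index chosen for each write
        for p in write_powers:
            for i, rem in enumerate(rounds):
                if p <= rem:
                    rounds[i] -= p
                    assignment.append(i)
                    break
            else:
                rounds.append(budget - p)
                assignment.append(len(rounds) - 1)
        return assignment

    print(first_fit([4, 3, 5, 2, 4], budget=8))  # -> [0, 0, 1, 1, 2]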

Journal ArticleDOI
TL;DR: This article proposes a solution to automatically generate the circuit for the welding Oracle using the Quantum Assembly Language, a language for describing quantum circuits, and optimizes the generated circuit using the Fault-Tolerant Quantum Logic Synthesis tool for any BWT instance.
Abstract: Quantum computing is a new computational paradigm that promises an exponential speed-up over classical algorithms. To develop efficient quantum algorithms for problems of a non-deterministic nature, random walk is one of the most successful concepts employed. In this article, we target both continuous-time and discrete-time random walk in both the classical and quantum regimes. Binary Welded Tree (BWT), or glued tree, is one of the most well-known quantum walk algorithms in the continuous-time domain. Prior work implements quantum walk on the BWT with static welding. In this context, static welding is randomized but case-specific. We propose a solution to automatically generate the circuit for the Oracle for welding. We implement the circuit using the Quantum Assembly Language, which is a language for describing quantum circuits. We then optimize the generated circuit using the Fault-Tolerant Quantum Logic Synthesis tool for any BWT instance. Automatic welding enables us to provide a generalized solution for quantum walk on the BWT.

Journal ArticleDOI
TL;DR: A Conductive-Bridge RAM (CBRAM)-based neuromorphic system that efficiently addresses time series prediction is presented, along with an optimized training methodology powered by a stochastic implementation of the Least-Mean-Squares (SLMS) training rule.
Abstract: In this research, we present a Conductive-Bridge RAM (CBRAM)-based neuromorphic system that efficiently addresses time series prediction. We propose (i) a new voltage-mode, stochastic, multiweight synapse circuit based on experimental bi-stable CBRAM devices, (ii) a voltage-mode neuron circuit based on the concept of charge sharing, and (iii) an optimized training methodology powered by a stochastic implementation of the Least-Mean-Squares (SLMS) training rule. To validate the proposed design, we use time series prediction for short-term electrical load forecasting in smart grids. Our system is able to forecast hourly electrical loads with a mean accuracy of 96%, an estimated power dissipation of 15μW, and an area of 14.5μm² at 65nm CMOS technology.
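
One way to picture a stochastic LMS rule for bi-stable synapses is an LMS update applied probabilistically in fixed quanta, so each synapse either switches by one step or stays put; the sketch below is a loose illustration of that idea, not the article's circuit-level rule.

    import numpy as np

    rng = np.random.default_rng(0)

    def slms_step(w, x, target, lr=0.05, quantum=0.1):
        # Switch each weight by one fixed quantum with a probability
        # proportional to the magnitude of its LMS gradient.
        err = target - w @ x
        grad = err * x
        prob = np.clip(lr * np.abs(grad), 0.0, 1.0)
        step = np.sign(grad) * (rng.random(w.shape) < prob)
        return w + quantum * step

    w = np.zeros(4)
    x = np.array([1.0, 0.5, -0.5, 1.0])
    for _ in range(200):
        w = slms_step(w, x, target=1.0)
    print(round(float(w @ x), 2))   # drifts toward the target of 1.0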

Journal ArticleDOI
TL;DR: This article uses two coupled nano-oscillators as a basic computational model and proposes an architecture for a non-Boolean coupled-oscillator-based co-processor, including an accuracy-tunable knob, capable of executing certain functions that are commonly used across a variety of approximate application domains.
Abstract: As we enter an era witnessing the end of Dennard scaling, where further reduction of the power supply voltage to reduce power consumption becomes more challenging in conventional systems, the goal of developing a system capable of performing large computations with minimal area and power overheads calls for additional optimization avenues. A rigorous exploration of alternative computing techniques, which can mitigate the limitations of Complementary Metal-Oxide Semiconductor (CMOS) technology scaling and conventional Boolean systems, is imperative. Reflecting on these lines of thought, in this article we explore the potential of non-Boolean computing employing nano-oscillators for performing varied functions. We use two coupled nano-oscillators as our basic computational model and propose an architecture for a non-Boolean coupled-oscillator-based co-processor capable of executing certain functions that are commonly used across a variety of approximate application domains. The proposed architecture includes an accuracy-tunable knob, which can be set by the programmer at runtime. The functionality of the proposed co-processor is verified using a soft coupled-oscillator model based on Kuramoto oscillators. The article also demonstrates how real-world applications such as Vector Quantization, Digit Recognition, and Structural Health Monitoring can be deployed on the proposed model. The proposed co-processor architecture is generic in nature and can be implemented using any existing modern nano-oscillator technology, such as Resonant Body Transistors (RBTs), Spin-Torque Nano-Oscillators (STNOs), and Metal-Insulator Transition (MIT) devices. In this article, we validate the proposed architecture using Hyper-Field Effect Transistor (Hyper-FET) technology-based coupled oscillators, which provide improvements of up to 3.5× in clock speed and up to 10.75× and 14.12× reductions in area and power consumption, respectively, as compared to a conventional Boolean CMOS accelerator executing the same functions.
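
The Kuramoto model used above to verify the co-processor can be simulated in a few lines; the coupling strength and natural frequencies below are illustrative, showing the phase synchronization that oscillator-based computing builds on.

    import numpy as np

    def kuramoto(omega, K, steps=2000, dt=0.01, seed=0):
        # dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)
        rng = np.random.default_rng(seed)
        theta = rng.uniform(0, 2 * np.pi, len(omega))
        for _ in range(steps):
            coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
            theta += dt * (omega + K / len(omega) * coupling)
        return theta

    omega = np.array([1.0, 1.1, 0.9, 1.05])   # natural frequencies
    theta = kuramoto(omega, K=2.0)
    r = abs(np.exp(1j * theta).mean())        # order parameter: 1 = in sync
    print(round(float(r), 2))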

Journal ArticleDOI
TL;DR: This article proposes a heterogeneous memory system that combines a double data rate (DDRx) DRAM with an emerging 3D hybrid memory cube (HMC) technology and introduces a temperature-aware algorithm that dynamically distributes the requested bandwidth between HMC and DDRx DRAM to reduce the thermal hotspot while maintaining high performance.
Abstract: Three-dimensional DRAMs (3D-DRAMs) are emerging as a promising solution to address the memory wall problem in computer systems. However, high fabrication cost per bit and thermal issues are the main reasons that prevent architects from using 3D-DRAM alone as the main memory building block. In this article, we address this issue by proposing a heterogeneous memory system that combines double data rate (DDRx) DRAM with the emerging 3D hybrid memory cube (HMC) technology. Bandwidth and temperature management are the challenging issues for this heterogeneous memory architecture. To address these challenges, we first introduce a memory page allocation policy for the heterogeneous memory system to maximize performance. Then, using the proposed policy, we introduce a temperature-aware algorithm that dynamically distributes the requested bandwidth between the HMC and DDRx DRAM to reduce the thermal hotspot while maintaining high performance. We take into account the impact of both core count and HMC channel count on performance while using the proposed policies. The results show that the proposed memory page allocation policy can utilize memory bandwidth close to 99% of the ideal bandwidth utilization. Moreover, our temperature-aware bandwidth adaptation reduces the average steady-state temperature of the HMC hotspot across various workloads by 45 K while incurring 25% performance overhead.