
Showing papers presented at "Asia and South Pacific Design Automation Conference in 2018"


Proceedings ArticleDOI
22 Jan 2018
TL;DR: DRL-Cloud, a novel Deep Reinforcement Learning (DRL)-based RP and TS system, is presented to minimize energy cost for large-scale CSPs with a very large number of servers that receive enormous numbers of user requests per day.
Abstract: Cloud computing has become an attractive computing paradigm in both academia and industry. Through virtualization technology, Cloud Service Providers (CSPs) that own data centers can structure physical servers into Virtual Machines (VMs) to provide services, resources, and infrastructures to users. Profit-driven CSPs charge users for service access and VM rental, and reduce power consumption and electric bills so as to increase profit margin. The key challenge faced by CSPs is data center energy cost minimization. Prior works proposed various algorithms to reduce energy cost through Resource Provisioning (RP) and/or Task Scheduling (TS). However, they have scalability issues or do not consider TS with task dependencies, which is a crucial factor that ensures correct parallel execution of tasks. This paper presents DRL-Cloud, a novel Deep Reinforcement Learning (DRL)-based RP and TS system, to minimize energy cost for large-scale CSPs with a very large number of servers that receive enormous numbers of user requests per day. A deep Q-learning-based two-stage RP-TS processor is designed to automatically generate the best long-term decisions by learning from the changing environment such as user request patterns and realistic electric price. With training techniques such as target network, experience replay, and exploration and exploitation, the proposed DRL-Cloud achieves remarkably high energy cost efficiency, low reject rate as well as low runtime with fast convergence. Compared with one of the state-of-the-art energy efficient algorithms, the proposed DRL-Cloud achieves up to 320% energy cost efficiency improvement while maintaining a lower reject rate on average. For an example CSP setup with 5,000 servers and 200,000 tasks, compared to a fast round-robin baseline, the proposed DRL-Cloud achieves up to 144% runtime reduction.
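
To make the deep Q-learning machinery concrete, the following is a minimal sketch of one target-network/experience-replay update loop; the state and action encodings, reward, environment transition, and hyper-parameters are placeholders, and a linear Q-function stands in for the paper's deep network.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

N_STATE, N_ACTION = 8, 4           # placeholder sizes (request/price features, VM choices)
GAMMA, LR, EPS = 0.95, 0.01, 0.1   # placeholder hyper-parameters

# A linear Q-function stands in for the deep Q-network to keep the sketch short.
q_net = rng.normal(scale=0.1, size=(N_ACTION, N_STATE))
target_net = q_net.copy()          # target network, synced periodically
replay = deque(maxlen=10_000)      # experience replay buffer

def act(state):
    # epsilon-greedy exploration / exploitation
    if rng.random() < EPS:
        return int(rng.integers(N_ACTION))
    return int(np.argmax(q_net @ state))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    for s, a, r, s_next in random.sample(list(replay), batch_size):
        # TD target uses the frozen target network for stability
        td_target = r + GAMMA * np.max(target_net @ s_next)
        td_error = td_target - q_net[a] @ s
        q_net[a] += LR * td_error * s          # gradient step for the linear Q-model

state = rng.normal(size=N_STATE)               # made-up initial state
for step in range(1_000):
    a = act(state)
    next_state = rng.normal(size=N_STATE)      # placeholder environment transition
    reward = -abs(a - 1.0)                     # placeholder reward (negative energy cost)
    replay.append((state, a, reward, next_state))
    train_step()
    if step % 100 == 0:
        target_net = q_net.copy()              # periodic target-network synchronization
    state = next_state
```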

123 citations


Proceedings ArticleDOI
Xiaoyu Sun, Xiaochen Peng, Pai-Yu Chen, Rui Liu, Jae-sun Seo, Shimeng Yu
22 Jan 2018
TL;DR: This work analyzes a fully parallel RRAM synaptic array architecture that implements the fully connected layers in a convolutional neural network with (+1, −1) weights and (+1, 0) neurons, and shows that the proposed fully parallel BNN architecture (P-BNN) can achieve 137.35 TOPS/W energy efficiency for inference.
Abstract: Binary Neural Networks (BNNs) have been recently proposed to improve the area-/energy-efficiency of the machine/deep learning hardware accelerators, which opens an opportunity to use the technologically more mature binary RRAM devices to effectively implement the binary synaptic weights. In addition, the binary neuron activation enables using the sense amplifier instead of the analog-to-digital converter to allow bitwise communication between layers of the neural networks. However, the sense amplifier has intrinsic offset that affects the threshold of binary neuron, thus it may degrade the classification accuracy. In this work, we analyze a fully parallel RRAM synaptic array architecture that implements the fully connected layers in a convolutional neural network with (+1, −1) weights and (+1, 0) neurons. The simulation results with TSMC 65 nm PDK show that the offset of current mode sense amplifier introduces a slight accuracy loss from ∼98.5% to ∼97.6% for MNIST dataset. Nevertheless, the proposed fully parallel BNN architecture (P-BNN) can achieve 137.35 TOPS/W energy efficiency for the inference, improved by ∼20X compared to the sequential BNN architecture (S-BNN) with row-by-row read-out scheme. Moreover, the proposed P-BNN architecture can save the chip area by ∼16% as it eliminates the area overhead of MAC peripheral units in the S-BNN architecture.
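
A toy model of the inference step described above: (+1, −1) weights, (+1, 0) activations, and the sense-amplifier offset modeled as a random shift of each binary neuron's threshold. Layer sizes and the offset spread are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

N_IN, N_OUT = 256, 128                            # illustrative layer sizes
W = rng.choice([-1, +1], size=(N_OUT, N_IN))      # binary synaptic weights (binary RRAM cells)
x = rng.choice([0, 1], size=N_IN)                 # binary neuron activations (+1, 0)

# Ideal binary neuron: fire when the bit-line sum read out in parallel exceeds a threshold.
ideal_out = (W @ x > 0).astype(int)

# A sense-amplifier offset shifts each neuron's effective threshold, which is the
# accuracy-loss mechanism analyzed in the paper; the spread below is an arbitrary choice.
offset = rng.normal(scale=2.0, size=N_OUT)
noisy_out = (W @ x > offset).astype(int)

flips = int(np.sum(ideal_out != noisy_out))
print(f"{flips} of {N_OUT} neuron outputs flipped by the sense-amp offset")
```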

62 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper paves a novel way towards a bit-wise In-Memory Convolution Engine (IMCE) that can implement the dominant convolution computation of Deep Convolutional Neural Networks (CNNs) within memory.
Abstract: In this paper, we pave a novel way towards the concept of bit-wise In-Memory Convolution Engine (IMCE) that could implement the dominant convolution computation of Deep Convolutional Neural Networks (CNN) within memory. IMCE employs parallel computational memory sub-array as a fundamental unit based on our proposed Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) design. Then, we propose an accelerator system architecture based on IMCE to efficiently process low bit-width CNNs. This architecture can be leveraged to greatly reduce energy consumption dealing with convolutional layers and also accelerate CNN inference. The device to architecture co-simulation results show that the proposed system architecture can process low bit-width AlexNet on ImageNet data-set favorably with 785.25μJ/img, which consumes ∼3× less energy than that of recent RRAM based counterpart. Besides, the chip area is ∼4× smaller.
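
The bit-wise style of computation IMCE relies on can be illustrated in software: a dot product of low bit-width operands decomposes into bitwise ANDs and bit-counts over bit-planes, which is the kind of bulk operation an in-memory sub-array can perform. The mapping to SOT-MRAM sub-arrays and the bit-widths chosen below are illustrative assumptions only.

```python
import numpy as np

def dot_bitwise(a, b, bits_a=2, bits_b=2):
    """Dot product of unsigned low bit-width vectors using only AND and popcount:
    each pair of bit-planes contributes popcount(a_i AND b_j) << (i + j)."""
    acc = 0
    for i in range(bits_a):
        ai = (a >> i) & 1                  # bit-plane i of the activations
        for j in range(bits_b):
            bj = (b >> j) & 1              # bit-plane j of the weights
            acc += int(np.sum(ai & bj)) << (i + j)
    return acc

rng = np.random.default_rng(2)
a = rng.integers(0, 4, size=64)            # 2-bit activations of one convolution window
b = rng.integers(0, 4, size=64)            # 2-bit weights of one kernel
assert dot_bitwise(a, b) == int(np.dot(a, b))
print(dot_bitwise(a, b))
```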

56 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: ReGAN, a novel ReRAM-based Process-In-Memory accelerator, is proposed; it efficiently reduces off-chip memory accesses and greatly increases system throughput by pipelining the layer-wise computation.
Abstract: Generative Adversarial Networks (GANs) have recently drawn tremendous attention in many artificial intelligence (AI) applications including computer vision, speech recognition, and natural language processing. While GANs deliver state-of-the-art performance on these AI tasks, this comes at the cost of high computational complexity. Although recent progress demonstrated the promise of using ReRAM-based Process-In-Memory for acceleration of convolutional neural networks (CNNs) with low energy cost, the unique training process required by GANs makes them difficult to run on existing neural network acceleration platforms: two competing networks are simultaneously co-trained in GANs, which significantly increases the need for memory and computation resources. In this work, we propose ReGAN, a novel ReRAM-based Process-In-Memory accelerator that can efficiently reduce off-chip memory accesses. Moreover, ReGAN greatly increases system throughput by pipelining the layer-wise computation. Two techniques, namely Spatial Parallelism and Computation Sharing, are proposed to further enhance the training efficiency of GANs. Our experimental results show that ReGAN achieves an average performance speedup of 240× compared to a GPU platform, with an average energy saving of 94×.

53 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: A new arbiter-based multi-PUF (MPUF) design that utilises a Weak PUF to obfuscate the challenges to a Strong PUF and is harder to model than the conventional arbiter PUF using machine learning attacks is proposed.
Abstract: Current approaches for building physical unclonable function (PUF) designs resistant to machine learning attacks often suffer from large resource overhead and are typically difficult to implement on field programmable gate arrays (FPGAs). In this paper we propose a new arbiter-based multi-PUF (MPUF) design that utilises a Weak PUF to obfuscate the challenges to a Strong PUF and is harder to model than the conventional arbiter PUF using machine learning attacks. The proposed PUF design shows a greater resistance to attacks, which have been successfully applied to other Arbiter PUFs. A mathematical model is presented to analyse the complexity and obfuscation properties of the proposed PUF design. Moreover, we show that it is feasible to implement the proposed MPUF design on a Xilinx Artix-7 FPGA, and that it achieves a good uniqueness result of 40.60 % and uniformity of 37.03 %, which significantly improves over previous work into multi-PUF designs.
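
A behavioural sketch of the idea: a strong (arbiter) PUF in the standard additive delay model, with a weak PUF's device-unique bits obfuscating the applied challenge before it reaches the strong PUF. The delay model, challenge length, and the XOR-based obfuscation are simplifying assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64                                       # challenge length

# Strong PUF: additive delay model of an arbiter PUF, response = sign(w . phi(c)).
w = rng.normal(size=N + 1)                   # device-specific stage delay differences

def phi(c):
    # standard parity feature transform of the challenge bits
    feat = np.ones(N + 1)
    for i in range(N):
        feat[i] = np.prod(1 - 2 * c[i:])
    return feat

def strong_puf(c):
    return int(w @ phi(c) > 0)

# Weak PUF: a per-device constant bit string (e.g. cell power-up values).
weak_bits = rng.integers(0, 2, size=N)

def mpuf(c):
    # The weak PUF obfuscates the externally applied challenge before it reaches
    # the strong PUF, so a modeling attacker never observes the effective challenge.
    return strong_puf(c ^ weak_bits)

challenge = rng.integers(0, 2, size=N)
print("strong PUF:", strong_puf(challenge), " MPUF:", mpuf(challenge))
```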

51 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: The experimental results show DNNs indeed outperform most traditional machine learning techniques, but such superior results can only be achieved with proper design of the right DNN architecture, where domain knowledge plays a key role.
Abstract: Recent years have seen great successes in applying deep learning (DL) to many application domains. Though powerful, DL is not easy to use well. In this invited paper, we study an urban taxi demand forecast problem using DL, and we show a number of key insights in modeling a domain problem as a suitable DL task. We also conduct a systematic comparison of two recent deep neural networks (DNNs) for taxi demand prediction, namely ST-ResNet and FLC-Net, on a New York City taxi record dataset. Our experimental results show DNNs indeed outperform most traditional machine learning techniques, but such superior results can only be achieved with proper design of the right DNN architecture, where domain knowledge plays a key role.

46 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper proposes Neu-NoC, a highly efficient interconnection network that reduces redundant data traffic in neuromorphic acceleration systems, and explores the data transfer ability between adjacent layers of fully-connected NNs.
Abstract: A modern neuromorphic acceleration system could consist of hundreds of accelerators, which are often organized through a network-on-chip (NoC). Although the overall computing ability is greatly improved by the large number of accelerators, the power consumption and average delay of the NoC itself become prominent. In this paper, we first analyze the characteristics of the data traffic in neuromorphic acceleration systems and the bottleneck of the popular NoC designs adopted in such systems. We then propose Neu-NoC, a highly efficient interconnection network that reduces redundant data traffic in neuromorphic acceleration systems, and explore the data transfer ability between adjacent layers. A sophisticated neural network aware mapping algorithm and a multicast transmission scheme are designed to alleviate data traffic congestion without increasing the average transmission distance. Finally, we explore the sparsity characteristics of fully-connected NNs. Simulation results show that compared to the most widely-used Mesh NoC design, Neu-NoC can substantially reduce the average data latency by 28.5% and the energy consumption by 39.2% in accelerated neuromorphic systems.

45 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: In contrast to other quantized DNNs that trade-off significant amounts of accuracy for lower memory requirements, LightNNs can significantly reduce storage, energy and area while still maintaining a test error similar to a large DNN configuration.
Abstract: Deep Neural Networks (DNNs) have been adopted in many systems because of their higher classification accuracy, making custom hardware implementations great candidates for high-speed, accurate inference. While progress in achieving large scale, highly accurate DNNs has been made, significant energy and area are required due to massive memory accesses and computations. Such demands pose a challenge to any DNN implementation, yet it is more natural to handle in a custom hardware platform. To alleviate the increased demand in storage and energy, quantized DNNs constrain their weights (and activations) from floating-point numbers to only a few discrete levels. Therefore, storage is reduced, thereby leading to fewer memory accesses. In this paper, we provide an overview of different types of quantized DNNs, as well as the training approaches for them. Among the various quantized DNNs, our LightNN (Light Neural Network) approach can reduce both memory accesses and computation energy, by filling the gap between classic, full-precision and binarized DNNs. We provide a detailed comparison between LightNNs, conventional DNNs and Binarized Neural Networks (BNNs), with MNIST and CIFAR-10 datasets. In contrast to other quantized DNNs that trade off significant amounts of accuracy for lower memory requirements, LightNNs can significantly reduce storage, energy and area while still maintaining a test error similar to a large DNN configuration. Thus, LightNNs provide more options for hardware designers to trade off accuracy and energy.
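
LightNN's core trick of replacing multiplications with a few shift-and-add operations can be sketched as approximating each weight by k signed powers of two; the greedy fitting below is only an illustration of the representation, not the constrained training procedure used in the paper.

```python
import numpy as np

def k_powers_of_two(w, k=2):
    """Greedily approximate a weight by k signed powers of two, so x*w becomes
    k shift-and-add operations; k and the greedy fit are illustrative choices."""
    terms, residual = [], float(w)
    for _ in range(k):
        if residual == 0.0:
            break
        exponent = int(np.round(np.log2(abs(residual))))
        term = float(np.sign(residual)) * 2.0 ** exponent
        terms.append(term)
        residual -= term
    return terms

w = 0.3426
for k in (1, 2, 3):
    approx = sum(k_powers_of_two(w, k))
    print(f"k={k}: approx={approx:.5f}, abs error={abs(w - approx):.5f}")
```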

41 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: This work investigates the multi-factor adversarial attack problem in practical model optimized deep learning systems by jointly considering the DNN model-reshaping and the input perturbations and conducts a comprehensive robustness and vulnerability analysis of deep compressed DNN models under derived adversarial attacks.
Abstract: Thanks to recent machine learning model innovation and computing hardware advancement, state-of-the-art Deep Neural Networks (DNNs) present human-level performance for many complex intelligent tasks in real-world applications. However, this also introduces ever-increasing security concerns for those intelligent systems. For example, the emerging adversarial attacks indicate that even very small and often imperceptible adversarial input perturbations can easily mislead the cognitive function of deep learning systems (DLS). Existing DNN adversarial studies are narrowly performed on ideal software-level DNN models with a focus on a single uncertainty factor, i.e. input perturbations. However, the impact of DNN model reshaping on adversarial attacks, which is introduced by various hardware-favorable techniques such as hash-based weight compression during modern DNN hardware implementation, has never been discussed. In this work, we for the first time investigate the multi-factor adversarial attack problem in practical model optimized deep learning systems by jointly considering the DNN model-reshaping (e.g. HashNet based deep compression) and the input perturbations. We first augment the adversarial example generation method for compressed DNN models by incorporating software-based approaches and mathematically modeled DNN reshaping. We then conduct a comprehensive robustness and vulnerability analysis of deep compressed DNN models under the derived adversarial attacks. A defense technique named "gradient inhibition" is further developed to hinder the generation of adversarial examples and thus effectively mitigate adversarial attacks on both software and hardware-oriented DNNs. Simulation results show that "gradient inhibition" can decrease the average success rate of adversarial attacks from 87.99% to 4.77% (from 86.74% to 4.64%) on the MNIST (CIFAR-10) benchmark with marginal accuracy degradation across various DNNs.
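
For readers unfamiliar with the input-perturbation attacks discussed here, the sketch below generates a fast-gradient-sign style adversarial example against a tiny logistic model whose input gradient is available in closed form; it illustrates only the generic attack, not the paper's augmented attack on compressed models or the gradient-inhibition defense.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 32                                      # illustrative input dimension
w, b = rng.normal(size=d), 0.1              # a tiny logistic "classifier" stands in for a DNN

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = rng.normal(size=d), 1.0              # clean input and its true label

# Closed-form input gradient of the cross-entropy loss for this model:
# dL/dx = (sigmoid(w.x + b) - y) * w
grad_x = (sigmoid(w @ x + b) - y) * w

# Fast-gradient-sign perturbation: small per-feature steps in the loss-increasing direction.
eps = 0.1
x_adv = x + eps * np.sign(grad_x)

print("clean score      :", sigmoid(w @ x + b))
print("adversarial score:", sigmoid(w @ x_adv + b))
```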

30 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: Experimental results show that the proposed TDC method achieves up to 81 times higher throughput than the state-of-the-art DCNN accelerator with the same hardware resources, and improves the speed by 7.8 times by having all layers in the hourglass-type FSRCNN processed with inter-layer parallelism without additional DSP usage.
Abstract: Convolutional neural networks (CNN) are widely used in various computer vision applications. Recently, there have been many studies on FPGA-based CNN accelerators to achieve high performance and power efficiency. Most of them have been on CNN-based object detection algorithms, but research on image super-resolution has rarely been conducted. Fast super-resolution CNN (FSRCNN), a well-known CNN-based super-resolution algorithm, is a combination of multiple convolutional layers and a single deconvolutional layer. Since the deconvolutional layer generates high-resolution (HR) output feature maps from low-resolution (LR) input feature maps, its execution cycles are larger than those of the convolutional layer. In this paper, we propose a novel architecture of the FPGA-based CNN accelerator with efficient parallelization. We develop a method of transforming a deconvolutional layer into a convolutional layer (TDC), a new methodology for deconvolutional neural networks (DCNN). There is a massive parallelization source in the deconvolutional layer where multiple outputs within the same output feature map are created with the same inputs. When this new parallelization technique is applied to the deconvolutional layer, it generates the LR output feature maps the same as the convolutional layer. Thus, the performance of the accelerator increases without any additional hardware resources because the kernel size required to generate the LR output feature maps is smaller. In addition, if there is a DSP underutilization problem in the deconvolutional layer, in which some of the processors are in an idle state, the proposed method solves this problem by allowing more output feature maps to be processed in parallel. Experimental results show that the proposed TDC method achieves up to 81 times higher throughput than the state-of-the-art DCNN accelerator with the same hardware resources. We also improve the speed by 7.8 times by having all layers in the hourglass-type FSRCNN processed with inter-layer parallelism without additional DSP usage.
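
The key observation behind the TDC transform can be checked in a few lines: the output of a strided deconvolution splits into interleaved phases, and each phase is an ordinary convolution of the low-resolution input with a sub-sampled kernel. The 1-D, single-channel example below is a simplification for illustration, not the paper's full multi-channel formulation.

```python
import numpy as np

def deconv1d(x, w, s):
    # naive stride-s transposed (de)convolution: scatter-add each input sample
    y = np.zeros((len(x) - 1) * s + len(w))
    for m, xm in enumerate(x):
        y[m * s : m * s + len(w)] += xm * w
    return y

def tdc_deconv1d(x, w, s):
    # TDC idea: each output phase r is an ordinary convolution of the
    # low-resolution input with the sub-sampled kernel w[r::s]; interleave the phases.
    y = np.zeros((len(x) - 1) * s + len(w))
    for r in range(s):
        wr = w[r::s]
        if len(wr):
            y[r::s] = np.convolve(x, wr)
    return y

x = np.array([1.0, 2.0, -1.0, 0.5])          # low-resolution input "feature map"
w = np.array([1.0, 10.0, 100.0, 5.0, 7.0])   # deconvolution kernel
assert np.allclose(deconv1d(x, w, 2), tdc_deconv1d(x, w, 2))
print(tdc_deconv1d(x, w, 2))
```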

30 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: A new metric, Percentage of Netlist Recovery (PNR), is defined and promoted, which can quantify the resilience against gate-level theft of intellectual property (IP) in a manner more meaningful than established metrics.
Abstract: Here we advance the protection of split manufacturing (SM)-based layouts through the judicious and well-controlled handling of interconnects. Initially, we explore the cost-security trade-offs of SM, which are limiting its adoption. Aiming to resolve this issue, we propose effective and efficient strategies to lift nets to the BEOL. Towards this end, we design custom "elevating cells" which we also provide to the community. Further, we define and promote a new metric, Percentage of Netlist Recovery (PNR), which can quantify the resilience against gate-level theft of intellectual property (IP) in a manner more meaningful than established metrics. Our extensive experiments show that we outperform the recent protection schemes regarding security. For example, we reduce the correct connection rate to 0% for commonly considered benchmarks, which is a first in the literature. Besides, we induce reasonably low and controllable overheads on power, performance, and area (PPA). At the same time, we also help to lower the commercial cost incurred by SM.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper proposes a low-power implementation of the approximate logarithmic multiplier to improve the power consumption of convolutional neural networks for image classification, taking advantage of its intrinsic tolerance to error.
Abstract: This paper proposes a low-power implementation of the approximate logarithmic multiplier to improve the power consumption of convolutional neural networks for image classification, taking advantage of its intrinsic tolerance to error. The approximate logarithmic multiplier converts multiplications to additions by taking approximate logarithm and achieves significant improvement in power and area while having low worst-case error, which makes it suitable for neural network computation. Our proposed design shows a significant improvement in terms of power and area over the previous work that applied logarithmic multiplication to neural networks, reducing power up to 76.6% compared to exact fixed-point multiplication, while maintaining comparable prediction accuracy in convolutional neural networks for MNIST and CIFAR10 datasets.
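
The conversion of multiplications into additions referred to above is the classic Mitchell logarithmic approximation; a software sketch is given below. The paper's specific low-power implementation details (operand widths, truncation) are not reproduced here.

```python
def mitchell_multiply(a, b):
    """Approximate unsigned multiply via Mitchell's logarithm approximation:
    log2(2^k * (1 + f)) ~= k + f, so the multiply becomes an addition of (k, f) pairs."""
    assert a > 0 and b > 0

    def log_approx(x):
        k = x.bit_length() - 1            # characteristic: position of the leading one
        f = (x - (1 << k)) / (1 << k)     # mantissa fraction in [0, 1)
        return k, f

    ka, fa = log_approx(a)
    kb, fb = log_approx(b)
    k, f = ka + kb, fa + fb
    if f >= 1.0:                          # antilogarithm approximation, two cases
        k, f = k + 1, f - 1.0
    return int((1 << k) * (1 + f))

for a, b in [(13, 27), (100, 200), (255, 255)]:
    approx, exact = mitchell_multiply(a, b), a * b
    print(f"{a}*{b}: exact={exact}, approx={approx}, error={100 * (exact - approx) / exact:.1f}%")
```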

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper proposes a methodology to design ternary gates by modeling pull-up and pull-down operations of the gates, which makes it possible to synthesize ternary gates with a minimum number of transistors.
Abstract: Over the last few decades, CMOS-based digital circuits have been steadily developed. However, because of the power density limits, device scaling may soon come to an end, and new approaches for circuit designs are required. Multi-valued logic (MVL) is one of the new approaches, which increases the radix for computation to lower the complexity of the circuit. For the MVL implementation, ternary logic circuit designs have been proposed previously, though they could not show advantages over binary logic, because of unoptimized synthesis techniques. In this paper, we propose a methodology to design ternary gates by modeling pull-up and pull-down operations of the gates. Our proposed methodology makes it possible to synthesize ternary gates with a minimum number of transistors. From HSPICE simulation results, our ternary designs show significant power-delay product reductions; 49 % in the ternary full adder and 62 % in the ternary multiplier compared to the existing methodology. We have also compared the number of transistors in CMOS-based binary logic circuits and ternary device-based logic circuits.
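
For readers new to multi-valued logic, the small enumeration below shows common unbalanced ternary gate definitions and a ternary half adder; these functional definitions are generic illustrations, not the transistor-level pull-up/pull-down models the synthesis methodology is built on.

```python
from itertools import product

# Unbalanced ternary values {0, 1, 2}; purely functional gate definitions.
def t_not(a):    return 2 - a        # standard ternary inverter
def t_min(a, b): return min(a, b)    # ternary AND analogue
def t_max(a, b): return max(a, b)    # ternary OR analogue

def ternary_half_adder(a, b):
    return (a + b) % 3, (a + b) // 3  # sum digit, carry digit

print(" a b | NOTa MIN MAX | SUM CARRY")
for a, b in product(range(3), repeat=2):
    s, c = ternary_half_adder(a, b)
    print(f" {a} {b} |  {t_not(a)}    {t_min(a, b)}   {t_max(a, b)}  |  {s}    {c}")
```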

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper describes several near-term challenges and opportunities, along with concrete existence proofs, for application of learning-based methods within the ecosystem of commercial EDA, IC design, and academic research.
Abstract: Design-based equivalent scaling now bears much of the burden of continuing the semiconductor industry's trajectory of Moore's-Law value scaling. In the future, reductions of design effort and design schedule must comprise a substantial portion of this equivalent scaling. In this context, machine learning and deep learning in EDA tools and design flows offer enormous potential for value creation. Examples of opportunities include: improved design convergence through prediction of downstream flow outcomes; margin reduction through new analysis correlation mechanisms; and use of open platforms to develop learning-based applications. These will be the foundations of future design-based equivalent scaling in the IC industry. This paper describes several near-term challenges and opportunities, along with concrete existence proofs, for application of learning-based methods within the ecosystem of commercial EDA, IC design, and academic research.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A Highly Flexible In-Memory (HieIM) computing platform using STT MRAM is proposed, which can be leveraged to implement Boolean logic functions without sacrificing memory functionality, thus overcoming the 'operand locality' problem in contemporary in-memory computing platform designs.
Abstract: In this paper we propose a Highly Flexible In-Memory (HieIM) computing platform using STT MRAM, which can be leveraged to implement Boolean logic functions without sacrificing memory functionality. It could pre-process data within memory to further reduce power hungry long distance communication between memory and processing units as in Von-Neumann computing system. HieIM can implement all the Boolean logic functions (AND/NAND, OR/NOR, XOR/XNOR) between any two cells in the same memory array, thus overcoming the 'operand locality' problem in contemporary in-memory computing platform designs. To investigate the performance of HieIM, we test in-memory bulk bit-wise Boolean logic operations using different vector datasets, which shows ∼ 8× energy saving and ∼ 5× speedup compared to recent DRAM based in-memory computing platform. We further implement an in-memory data encryption engine design based on HieIM as another case study. With AES algorithm, it shows 51.5% and 68.9% lower energy consumption compared to CMOS-ASIC and CMOL based implementations, respectively.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A deep reinforcement learning framework of the HEV power management with the aim of improving fuel economy is proposed and results demonstrate the effectiveness of the proposed framework on optimizing HEV fuel economy.
Abstract: Hybrid electric vehicles employ a hybrid propulsion system to combine the energy efficiency of electric motor and a long driving range of internal combustion engine, thereby achieving a higher fuel economy as well as convenience compared with conventional ICE vehicles. However, the relatively complicated powertrain structures of HEVs necessitate an effective power management policy to determine the power split between ICE and EM. In this work, we propose a deep reinforcement learning framework of the HEV power management with the aim of improving fuel economy. The DRL technique is comprised of an offline deep neural network construction phase and an online deep Q-learning phase. Unlike traditional reinforcement learning, DRL presents the capability of handling the high dimensional state and action space in the actual decision-making process, making it suitable for the HEV power management problem. Enabled by the DRL technique, the derived HEV power management policy is close to optimal, fully model-free, and independent of a prior knowledge of driving cycles. Simulation results based on actual vehicle setup over real-world and testing driving cycles demonstrate the effectiveness of the proposed framework on optimizing HEV fuel economy.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: The key elements of FHE-PDK include technology files for design rule checking, layout versus schematic and layout parasitics extraction, as well as SPICE-compatible models for flexible thin-film transistors (TFTs) and passive elements.
Abstract: Flexible Electronics (FE) is emerging for wearables and low-cost internet of things (IoT) nodes benefiting from its low-cost fabrication and mechanical flexibility. Combining FE with thinned silicon chips, known as flexible hybrid electronics (FHE), can take advantages of both low-cost printed electronics and high performance silicon chips. To design a FHE system, the process design kit (PDK) offering the capabilities for circuit design, simulation and verification for both FE and silicon chips is needed. The key elements of FHE-PDK include technology files for design rule checking (DRC), layout versus schematic (LVS) and layout parasitics extraction (LPE), as well as SPICE-compatible models for flexible thin-film transistors (TFTs) and passive elements. Wafer scale measurements are used to validate our SPICE models and design rules are derived accordingly to assure a satisfactory yield. With FHE-PDK, circuit and system designers can therefore focus on design innovations and can rely on design tools to produce manufacturable designs.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This work provides a precise definition of the underlying design task of Programmable Microfluidic Devices (PMDs) and presents complementary solutions (both exact as well as heuristic) and discusses how they guarantee a sound valve-control.
Abstract: In the domain of microfluidic devices, a paradigm shift from application-specific to fully-programmable solutions is taking place (a similar development from ASICs to FPGAs has been observed in conventional circuitry). So-called Programmable Microfluidic Devices (PMDs) provide a promising platform in this regard. Here, fluids can be pushed into various reaction vessels whose inflow and outflow is controlled by valves. The regular structure, in combination with the flexibility of defining various flow paths through valves, makes it possible to realize a vast range of biological or chemical applications by only changing the corresponding valve-control sequence. However, determining a sound valve-control constitutes a non-trivial task. Although first automatic approaches for this problem have recently been proposed, we show that they frequently yield impractical control sequences. In this work, we address this issue by providing a precise definition of the underlying design task. Afterwards, we present complementary solutions (both exact as well as heuristic) and discuss how they guarantee a sound valve-control. Experimental evaluations demonstrate that the proposed solutions are capable of automatically generating a sound valve-control for PMDs.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: An approximation-aware test methodology which can be easily integrated into the regular test flow and removed all potential faults that no longer need to be tested because they can be tolerated under the given error metric is presented.
Abstract: A wide range of applications significantly benefit from the Approximate Computing (AC) paradigm in terms of speed or power reduction. AC achieves this by tolerating errors in the design. These errors are introduced into the design either manually by the designer or by approximate synthesis approaches. From here, the standard design flow is taken. Hence, the manufactured AC chip is eventually tested for production errors using well established fault models. To be precise, if the test for a test pattern fails, the AC chip is sorted out. However, from a general perspective this procedure results in throwing away chips which are perfectly fine taking into account that the considered fault (i.e. physical defect that leads to the error) can still be tolerated because of approximation. This can lead to a significant amount of yield loss. In this paper, we present an approximation-aware test methodology which can be easily integrated into the regular test flow. It is based on a pre-process to identify approximation-redundant faults. By this, we remove all potential faults that no longer need to be tested because they can be tolerated under the given error metric. Our experimental results and case studies on a wide variety of benchmark circuits show a significant potential for yield improvement.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper presents a novel, flexible, and adaptable SoC security architecture that efficiently implements diverse security policies and shows, for the first time, that the proposed framework provides high level of patchability with minimal energy and performance overhead.
Abstract: System-on-Chip (SoC) security architectures targeted towards diverse applications including Internet of Things (IoT) and automotive systems enforce two critical design requirements: in-field configurability and low overhead. To simultaneously address these constraints, in this paper, we present a novel, flexible, and adaptable SoC security architecture that efficiently implements diverse security policies. The architecture and associated CAD flow enable “hardware patching” i.e. hardware security policy engine that can be seamlessly and securely upgraded in field to address unanticipated attacks or new security requirements. We implement (1) a centralized Reconfigurable Security Policy Engine (RSPE), (2) smart security wrappers, and (3) Design-for-Debug (DfD) infrastructure interface as the building blocks of the architecture. The proposed framework provides a systematic approach to represent and synthesize diverse security policies. Through extensive analysis using representative SoC models, we show, for the first time to our knowledge, that the proposed framework provides high level of patchability with minimal energy and performance overhead.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: Experimental results show that by combining temperature and reservoir accelerations, the EM lifetime of a chip can be reduced from 10 years down to a few hours under the 150°C temperature limit, which is sufficient for practical EM testing of typical nanometer CMOS ICs.
Abstract: For practical testing and detection of electromigration (EM) induced failures in dual damascene copper interconnects in today's and future sub-10nm ICs, one critical issue is how to create stressing conditions so that the chip will fail exclusively under EM in a very short period of time so that EM signoff and validation can be carried out efficiently. In this work, we propose novel EM wearout-acceleration techniques for practical VLSI chips. We will first review the recently proposed three-phase physics-based EM models and discuss the important factors contributing to the EM aging process. Then we propose a new formula for fast estimation of the void's saturation volume for general multi-segment interconnect wires, which is important for EM mortality check. We then investigate two strategies to accelerate the EM failure process: reservoir-enhanced acceleration and temperature-based acceleration. First we show that multi-segment interconnects with reservoir structures and their stressing currents can be exploited to significantly speed up the EM wearout process. Such configurable reservoir-based wires are very flexible and can achieve various EM accelerations at the cost of some routing resources. Additionally, we show that further acceleration can be achieved by increasing temperature. On average, a 10% increase in temperature yields about 10X wearout acceleration. However, purely temperature based acceleration is not possible since practical VLSI chips have temperature limitations which must be strictly enforced to ensure the chip only fails under EM, and not due to other reliability effects. In this study, we show that it is possible to achieve significantly high acceleration while staying within the feasible operating zones by combining the two acceleration techniques. Experimental results show that by combining temperature and reservoir accelerations, we can reduce the EM lifetime of a chip from 10 years down to a few hours (about 10^5X acceleration) under the 150°C temperature limit, which is sufficient for practical EM testing of typical nanometer CMOS ICs.
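
The quoted rule of thumb that a roughly 10% temperature increase yields about 10X wearout acceleration is consistent with an Arrhenius temperature dependence of EM; with an assumed activation energy of about 0.85 eV (a typical value for Cu interconnects, not a number taken from the paper) and a 373 K to 410 K step:

```latex
% Arrhenius acceleration factor between operating temperature T_1 and stress temperature T_2,
% assuming E_a ~ 0.85 eV (typical for Cu interconnects; an assumption, not a value from the paper):
\mathrm{AF}
  = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_1}-\frac{1}{T_2}\right)\right]
  = \exp\!\left[\frac{0.85}{8.617\times 10^{-5}}\left(\frac{1}{373\,\mathrm{K}}-\frac{1}{410\,\mathrm{K}}\right)\right]
  \approx \exp(2.4) \approx 11
```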

Proceedings ArticleDOI
Yi Cai, Tianqi Tang, Lixue Xia, Ming Cheng, Zhenhua Zhu, Yu Wang, Huazhong Yang
22 Jan 2018
TL;DR: A low-bitwidth CNN training method is proposed, using low-bitwidth convolution outputs, activations, weights and gradients to train CNN models based on RRAM, and a system is designed to implement the training algorithms.
Abstract: Convolutional Neural Networks (CNNs) have achieved excellent performance on various artificial intelligence (AI) applications, while a higher demand on energy efficiency is required for future AI. Resistive Random-Access Memory (RRAM)-based computing systems provide a promising solution to energy-efficient neural network training. However, it's difficult to support high-precision CNN in RRAM-based hardware systems. Firstly, multi-bit digital-analog interfaces will take up most of the energy overhead of the whole system. Secondly, it's difficult to write the RRAM to the expected resistance states accurately; only low-precision numbers can be represented. To enable CNN training based on RRAM, we propose a low-bitwidth CNN training method, using low-bitwidth convolution outputs (CO), activations (A), weights (W) and gradients (G) to train CNN models based on RRAM. Furthermore, we design a system to implement the training algorithms. We explore the accuracy under different bitwidth combinations of (A, CO, W, G), and propose a practical tradeoff between accuracy and energy overhead. Our experiments demonstrate that the proposed system performs well on low-bitwidth CNN training tasks. For example, training LeNet-5 with 4-bit convolution outputs, 4-bit weights, 4-bit activations and 4-bit gradients on MNIST can still achieve 97.67% accuracy. Moreover, the proposed system can achieve 23.0X higher energy efficiency than a GPU when processing the training task of LeNet-5, and 4.4X higher energy efficiency when processing the training task of ResNet-20.
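
A generic uniform quantizer illustrates the kind of low-bitwidth representation of weights, activations, convolution outputs, and gradients the scheme relies on; the rounding and scaling choices here are assumptions, not necessarily the paper's exact quantizer.

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform symmetric quantization to a given bit-width; a generic stand-in for
    the low-bitwidth representation of (A, CO, W, G), not the paper's exact quantizer."""
    levels = 2 ** (bits - 1) - 1                    # e.g. 7 for signed 4-bit
    scale = float(np.max(np.abs(x))) / levels or 1.0
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale                                # de-quantized values seen by the model

rng = np.random.default_rng(5)
w = rng.normal(scale=0.2, size=1000)                # stand-in for a weight tensor
for bits in (8, 4, 2):
    err = float(np.mean(np.abs(w - quantize(w, bits))))
    print(f"{bits}-bit: mean abs quantization error = {err:.4f}")
```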

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A PCE-aware scheduling scheme is developed for effective mapping of multithreaded applications onto an HMP and results indicate that the proposed learning-based scheme outperforms the state of the art solution by 10% when there is no PCE gap between big and little cores.
Abstract: Heterogeneous Multicore Processors (HMPs) are comprised of multiple core types (small vs. big core architectures) with various performance and power characteristics which offer the flexibility to assign each thread to a core that provides the maximum energy-efficiency. Although this architecture provides more flexibility for the running application to determine the optimal run-time settings that maximize energy-efficiency, due to the interdependence of various tuning parameters such as the type of core, run-time voltage and frequency, and the number of threads, the scheduling becomes more challenging. More importantly, the impact of Power Conversion Efficiency (PCE) of the On-Chip Voltage Regulators (OCVRs) is another important parameter that makes it more challenging to schedule multithreaded applications on HMPs. In this paper, the importance of concurrent optimization and fine-tuning of the circuit and architectural parameters for energy-efficient scheduling on HMPs is addressed to harness the power of heterogeneity. In addition, the scheduling challenges for multithreaded applications are investigated for HMP architectures that account for the impact of power conversion efficiency. A highly accurate learning-based model is developed for energy-efficiency prediction to guide the scheduling decision. Using the predictive model, we further develop a PCE-aware scheduling scheme for effective mapping of multithreaded applications onto an HMP. The results indicate that the proposed learning-based scheme outperforms the state-of-the-art solution by 10% when there is no PCE gap between big and little cores. The energy-efficiency improves up to 60% when the PCE gap between big and little cores increases.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: In this work, an execution framework is introduced that leverages local memory by statically allocating a subset of tasks to cores, which reduces the access times to shared memory, as off-chip memory access is avoided, and in turn improves the schedulability of such systems.
Abstract: Access to shared memory is one of the main challenges for many-core processors. One group of scheduling strategies for such platforms focuses on the division of tasks' access to shared memory and code execution. This allows to orchestrate the access to shared local and off-chip memory in a way such that access contention between different compute cores is avoided by design. In this work, an execution framework is introduced that leverages local memory by statically allocating a subset of tasks to cores. This reduces the access times to shared memory, as off-chip memory access is avoided, and in turn improves the schedulability of such systems. A Constraint Programming (CP) formulation is presented to select the statically allocated tasks and to generate the complete system schedule. Evaluations show that the proposed approach yields an up to 19% higher schedulability ratio than related work, and a case study demonstrates its applicability to industrial problems.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A novel split manufacturing framework that not only guarantees to achieve the required security level but also allows for a drastic reduction of the introduced overhead, and demonstrates much better efficiency, overhead reduction, and security guarantee compared with existing methods.
Abstract: Trojans and backdoors inserted by untrusted foundries have become serious threats to hardware security. Split manufacturing is proposed to prevent Trojan insertion proactively. Existing methods depend on wire lifting to hide partial circuit interconnections, which usually suffer from large overhead and lack of security guarantee. In this paper, we propose a novel split manufacturing framework that not only guarantees to achieve the required security level but also allows for a drastic reduction of the introduced overhead. In our framework, insertion of dummy circuit cells and wires is considered simultaneously with wire lifting. To support cell and wire insertion, we propose a new security criterion, and further derive its sufficient condition to avoid computation intensive operations in traditional methods. Then, for the first time, a novel mixed integer linear programming formulation is proposed to simultaneously consider cell and wire insertion together with wire lifting, which significantly enlarges the design space to guarantee the realization of the sufficient condition under the security requirements and overhead constraints. With extensive experimental results, our framework demonstrates much better efficiency, overhead reduction, and security guarantee compared with existing methods.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A micropower audio delta-sigma modulator is presented for mobile applications; it employs dynamic-bias inverter-based integrators, which maximize both the g_m/I_D ratio and the slew rate while compensating for PVT variations.
Abstract: A micropower audio delta-sigma modulator is presented for mobile applications. The modulator employs dynamic-bias inverter-based integrators, which maximize both the g_m/I_D ratio and the slew rate while compensating for PVT variations. A prototype modulator implemented in a 0.18μm CMOS process features a single-bit third-order topology. The modulator achieves 97.7dB SNDR, 98.6dB SNR, 100.5dB DR, and 105.8dB SFDR in a 20kHz audio band, while consuming only 300μW from a 1.8V supply. This corresponds to a state-of-the-art FoM of 178.7dB.
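
As a consistency check, the reported figure of merit follows from the dynamic range, bandwidth, and power quoted above using the Schreier FoM definition:

```latex
% Schreier figure of merit with the reported numbers (DR = 100.5 dB, BW = 20 kHz, P = 300 uW):
\mathrm{FoM_S} = \mathrm{DR} + 10\log_{10}\!\left(\frac{\mathrm{BW}}{P}\right)
  = 100.5\,\mathrm{dB} + 10\log_{10}\!\left(\frac{20\,\mathrm{kHz}}{300\,\mu\mathrm{W}}\right)
  \approx 100.5 + 78.2 = 178.7\,\mathrm{dB}
```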

Proceedings ArticleDOI
Bing Li, Wei Wen, Jiachen Mao, Sicheng Li, Yi Chen, Hai Helen Li
22 Jan 2018
TL;DR: This paper demonstrates the co-optimization of the DNN algorithm and hardware which exploits the model redundancy to accelerate DNNs.
Abstract: Deep Neural Networks (DNNs) are pervasively applied in many artificial intelligence (AI) applications. The high performance of DNNs comes at the cost of larger size and higher compute complexity. Recent studies show that DNNs have much redundancy, such as the zero-value parameters and excessive numerical precision. To reduce computing complexity, many redundancy reduction techniques have been proposed, including pruning and data quantization. In this paper, we demonstrate our co-optimization of the DNN algorithm and hardware which exploits the model redundancy to accelerate DNNs.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: It is shown that it is possible to build a hardware implementation of a processor with multiple instructions, support for non-deterministic Paillier encryption, and partially homomorphic processing; CryptoBlaze is at least 10X faster than the state of the art.
Abstract: Homomorphic computing has been suggested as a method to secure processing in insecure servers. One of the drawbacks of homomorphic processing is the enormous execution time taken to process even the simplest of operations. In this paper, we propose a processor with hardware support for homomorphic processing. The proposed processor, named CryptoBlaze, has eight additional specialized instructions and hardware to support computation on encrypted data. For the first time, we show that it is possible to build a hardware implementation of a processor with multiple instructions, support for non-deterministic Paillier encryption, and partially homomorphic processing. The system was implemented and tested on an FPGA with three benchmarks. The design space with differing security parameters was explored and results are presented. CryptoBlaze is at least 10X faster than the state of the art.
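
The partially homomorphic operation such a processor accelerates can be seen in a textbook Paillier sketch: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, and the random factor r makes encryption non-deterministic. The tiny parameters below are for illustration only; they say nothing about CryptoBlaze's actual instruction set or key sizes.

```python
import math
import random

# Textbook Paillier with tiny parameters for illustration only; a real system would
# use large secure primes (these numbers are assumptions, not CryptoBlaze's).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1                                     # common simplified generator choice
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lam mod n^2), L(x) = (x-1)/n

def encrypt(m):
    r = random.randrange(1, n)                # fresh randomness: encryption is non-deterministic
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the underlying plaintexts.
c1, c2 = encrypt(41), encrypt(17)
assert decrypt((c1 * c2) % n2) == 41 + 17
print("decrypt(c1 * c2) =", decrypt((c1 * c2) % n2))
```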

Proceedings ArticleDOI
22 Jan 2018
TL;DR: An accuracy-controllable multiplier whose final product is generated by a carry-maskable adder that can dynamically select the length of the carry propagation to satisfy the accuracy requirements flexibly is proposed.
Abstract: Multiplication is a key fundamental function for many error-tolerant applications. Approximate multiplication is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes an accuracy-controllable multiplier whose final product is generated by a carry-maskable adder. The proposed scheme can dynamically select the length of the carry propagation to satisfy the accuracy requirements flexibly. The partial product tree of the multiplier is approximated by the proposed tree compressor. An 8×8 multiplier design is implemented by employing the carry-maskable adder and the compressor. Compared with a conventional Wallace tree multiplier, the proposed multiplier reduced power consumption by between 47.3% and 56.2% and critical path delay by between 29.9% and 60.5%, depending on the required accuracy. Its silicon area was also 44.6% smaller. In addition, results from an image processing application demonstrate that the quality of the processed images can be controlled by the proposed multiplier design.
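
A behavioural model in the same spirit as a carry-maskable adder: carries are allowed to propagate only within segments of a selectable length, so longer segments give higher accuracy at the cost of a longer carry chain. This is a simplified illustration, not the paper's exact circuit or its tree compressor.

```python
def masked_add(a, b, seg_bits, width=16):
    """Approximate addition in which carries do not cross segment boundaries;
    seg_bits plays the role of the selectable carry-propagation length."""
    mask = (1 << seg_bits) - 1
    result = 0
    for pos in range(0, width, seg_bits):
        sa = (a >> pos) & mask
        sb = (b >> pos) & mask
        result |= ((sa + sb) & mask) << pos    # carry out of each segment is dropped
    return result

a, b = 16328, 8692
print("exact     :", a + b)
for seg in (4, 8, 16):
    approx = masked_add(a, b, seg)
    print(f"seg = {seg:2d} :", approx, " error =", (a + b) - approx)
```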

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A gradual training approximation is proposed that adaptively sets the level of hardware approximation depending on the neural network's internal error, instead of applying uniform hardware approximation; a layer-based approximation approach further accelerates inference.
Abstract: Neural networks have been successfully used in many applications. Due to their computational complexity, it is difficult to implement them on embedded devices. Neural networks are inherently approximate and thus can be simplified. In this paper, we propose CANNA, a gradual training approximation that adaptively sets the level of hardware approximation depending on the neural network's internal error, instead of applying uniform hardware approximation. To accelerate inference, CANNA's layer-based approximation approach selectively relaxes the computation in each layer of the neural network, as a function of its sensitivity to approximation. For hardware support, we use a configurable floating point unit in hardware that dynamically identifies inputs which produce the largest approximation error and processes them instead in precise mode. We evaluate the accuracy and efficiency of our design by integrating configurable FPUs into AMD's Southern Islands GPU architecture. Our experimental evaluation shows that CANNA achieves up to 4.84× (7.13×) energy savings and 3.22× (4.64×) speedup when training four different neural network applications with 0% (2%) quality loss compared to the implementation on a baseline GPU. During the inference phase, our layer-based approach improves the energy efficiency by 4.42× (6.06×) and results in 2.96× (3.98×) speedup while ensuring 0% (2%) quality loss.