
Showing papers presented at "Asia and South Pacific Design Automation Conference in 2018"


Proceedings ArticleDOI
22 Jan 2018
TL;DR: DRL-Cloud, a novel Deep Reinforcement Learning (DRL)-based RP and TS system, is presented to minimize energy cost for large-scale CSPs with a very large number of servers that receive enormous numbers of user requests per day.
Abstract: Cloud computing has become an attractive computing paradigm in both academia and industry. Through virtualization technology, Cloud Service Providers (CSPs) that own data centers can structure physical servers into Virtual Machines (VMs) to provide services, resources, and infrastructures to users. Profit-driven CSPs charge users for service access and VM rental, and reduce power consumption and electric bills so as to increase profit margin. The key challenge faced by CSPs is data center energy cost minimization. Prior works proposed various algorithms to reduce energy cost through Resource Provisioning (RP) and/or Task Scheduling (TS). However, they have scalability issues or do not consider TS with task dependencies, which is a crucial factor that ensures correct parallel execution of tasks. This paper presents DRL-Cloud, a novel Deep Reinforcement Learning (DRL)-based RP and TS system, to minimize energy cost for large-scale CSPs with a very large number of servers that receive enormous numbers of user requests per day. A deep Q-learning-based two-stage RP-TS processor is designed to automatically generate the best long-term decisions by learning from the changing environment such as user request patterns and realistic electric price. With training techniques such as target network, experience replay, and exploration and exploitation, the proposed DRL-Cloud achieves remarkably high energy cost efficiency, low reject rate as well as low runtime with fast convergence. Compared with one of the state-of-the-art energy efficient algorithms, the proposed DRL-Cloud achieves up to 320% energy cost efficiency improvement while maintaining a lower reject rate on average. For an example CSP setup with 5,000 servers and 200,000 tasks, compared to a fast round-robin baseline, the proposed DRL-Cloud achieves up to 144% runtime reduction.
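
To make the deep Q-learning machinery concrete, the following is a minimal sketch of one target-network/experience-replay update loop; the state and action encodings, reward, environment transition, and hyper-parameters are placeholders, and a linear Q-function stands in for the paper's deep network.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

N_STATE, N_ACTION = 8, 4           # placeholder sizes (request/price features, VM choices)
GAMMA, LR, EPS = 0.95, 0.01, 0.1   # placeholder hyper-parameters

# A linear Q-function stands in for the deep Q-network to keep the sketch short.
q_net = rng.normal(scale=0.1, size=(N_ACTION, N_STATE))
target_net = q_net.copy()          # target network, synced periodically
replay = deque(maxlen=10_000)      # experience replay buffer

def act(state):
    # epsilon-greedy exploration / exploitation
    if rng.random() < EPS:
        return int(rng.integers(N_ACTION))
    return int(np.argmax(q_net @ state))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    for s, a, r, s_next in random.sample(list(replay), batch_size):
        # TD target uses the frozen target network for stability
        td_target = r + GAMMA * np.max(target_net @ s_next)
        td_error = td_target - q_net[a] @ s
        q_net[a] += LR * td_error * s          # gradient step for the linear Q-model

state = rng.normal(size=N_STATE)               # made-up initial state
for step in range(1_000):
    a = act(state)
    next_state = rng.normal(size=N_STATE)      # placeholder environment transition
    reward = -abs(a - 1.0)                     # placeholder reward (negative energy cost)
    replay.append((state, a, reward, next_state))
    train_step()
    if step % 100 == 0:
        target_net = q_net.copy()              # periodic target-network synchronization
    state = next_state
```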

123 citations


Proceedings ArticleDOI
Xiaoyu Sun, Xiaochen Peng, Pai-Yu Chen, Rui Liu, Jae-sun Seo, Shimeng Yu
22 Jan 2018
TL;DR: This work analyzes a fully parallel RRAM synaptic array architecture that implements the fully connected layers in a convolutional neural network with (+1, −1) weights and (+1, 0) neurons, and shows that the proposed fully parallel BNN architecture (P-BNN) can achieve 137.35 TOPS/W energy efficiency for inference.
Abstract: Binary Neural Networks (BNNs) have been recently proposed to improve the area-/energy-efficiency of the machine/deep learning hardware accelerators, which opens an opportunity to use the technologically more mature binary RRAM devices to effectively implement the binary synaptic weights. In addition, the binary neuron activation enables using the sense amplifier instead of the analog-to-digital converter to allow bitwise communication between layers of the neural networks. However, the sense amplifier has intrinsic offset that affects the threshold of binary neuron, thus it may degrade the classification accuracy. In this work, we analyze a fully parallel RRAM synaptic array architecture that implements the fully connected layers in a convolutional neural network with (+1, −1) weights and (+1, 0) neurons. The simulation results with TSMC 65 nm PDK show that the offset of current mode sense amplifier introduces a slight accuracy loss from ∼98.5% to ∼97.6% for MNIST dataset. Nevertheless, the proposed fully parallel BNN architecture (P-BNN) can achieve 137.35 TOPS/W energy efficiency for the inference, improved by ∼20X compared to the sequential BNN architecture (S-BNN) with row-by-row read-out scheme. Moreover, the proposed P-BNN architecture can save the chip area by ∼16% as it eliminates the area overhead of MAC peripheral units in the S-BNN architecture.
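
A toy model of the inference step described above: (+1, −1) weights, (+1, 0) activations, and the sense-amplifier offset modeled as a random shift of each binary neuron's threshold. Layer sizes and the offset spread are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

N_IN, N_OUT = 256, 128                            # illustrative layer sizes
W = rng.choice([-1, +1], size=(N_OUT, N_IN))      # binary synaptic weights (binary RRAM cells)
x = rng.choice([0, 1], size=N_IN)                 # binary neuron activations (+1, 0)

# Ideal binary neuron: fire when the bit-line sum read out in parallel exceeds a threshold.
ideal_out = (W @ x > 0).astype(int)

# A sense-amplifier offset shifts each neuron's effective threshold, which is the
# accuracy-loss mechanism analyzed in the paper; the spread below is an arbitrary choice.
offset = rng.normal(scale=2.0, size=N_OUT)
noisy_out = (W @ x > offset).astype(int)

flips = int(np.sum(ideal_out != noisy_out))
print(f"{flips} of {N_OUT} neuron outputs flipped by the sense-amp offset")
```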

62 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper paves a novel way towards a bit-wise In-Memory Convolution Engine (IMCE) that can implement the dominant convolution computation of Deep Convolutional Neural Networks (CNNs) within memory.
Abstract: In this paper, we pave a novel way towards the concept of bit-wise In-Memory Convolution Engine (IMCE) that could implement the dominant convolution computation of Deep Convolutional Neural Networks (CNN) within memory. IMCE employs parallel computational memory sub-array as a fundamental unit based on our proposed Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) design. Then, we propose an accelerator system architecture based on IMCE to efficiently process low bit-width CNNs. This architecture can be leveraged to greatly reduce energy consumption dealing with convolutional layers and also accelerate CNN inference. The device to architecture co-simulation results show that the proposed system architecture can process low bit-width AlexNet on ImageNet data-set favorably with 785.25μJ/img, which consumes ∼3× less energy than that of recent RRAM based counterpart. Besides, the chip area is ∼4× smaller.
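
The bit-wise style of computation IMCE relies on can be illustrated in software: a dot product of low bit-width operands decomposes into bitwise ANDs and bit-counts over bit-planes, which is the kind of bulk operation an in-memory sub-array can perform. The mapping to SOT-MRAM sub-arrays and the bit-widths chosen below are illustrative assumptions only.

```python
import numpy as np

def dot_bitwise(a, b, bits_a=2, bits_b=2):
    """Dot product of unsigned low bit-width vectors using only AND and popcount:
    each pair of bit-planes contributes popcount(a_i AND b_j) << (i + j)."""
    acc = 0
    for i in range(bits_a):
        ai = (a >> i) & 1                  # bit-plane i of the activations
        for j in range(bits_b):
            bj = (b >> j) & 1              # bit-plane j of the weights
            acc += int(np.sum(ai & bj)) << (i + j)
    return acc

rng = np.random.default_rng(2)
a = rng.integers(0, 4, size=64)            # 2-bit activations of one convolution window
b = rng.integers(0, 4, size=64)            # 2-bit weights of one kernel
assert dot_bitwise(a, b) == int(np.dot(a, b))
print(dot_bitwise(a, b))
```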

56 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: ReGAN, a novel ReRAM-based Process-In-Memory accelerator, is proposed; it efficiently reduces off-chip memory accesses and greatly increases system throughput by pipelining the layer-wise computation.
Abstract: Generative Adversarial Networks (GANs) have recently drawn tremendous attention in many artificial intelligence (AI) applications including computer vision, speech recognition, and natural language processing. While GANs deliver state-of-the-art performance on these AI tasks, this comes at the cost of high computational complexity. Although recent progress demonstrated the promise of using ReRAM-based Process-In-Memory for acceleration of convolutional neural networks (CNNs) with low energy cost, the unique training process required by GANs makes them difficult to run on existing neural network acceleration platforms: two competing networks are simultaneously co-trained in GANs, which significantly increases the need for memory and computation resources. In this work, we propose ReGAN, a novel ReRAM-based Process-In-Memory accelerator that can efficiently reduce off-chip memory accesses. Moreover, ReGAN greatly increases system throughput by pipelining the layer-wise computation. Two techniques, namely Spatial Parallelism and Computation Sharing, are proposed to further enhance the training efficiency of GANs. Our experimental results show that ReGAN achieves an average performance speedup of 240× compared to a GPU platform, with an average energy saving of 94×.

53 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: A new arbiter-based multi-PUF (MPUF) design that utilises a Weak PUF to obfuscate the challenges to a Strong PUF and is harder to model than the conventional arbiter PUF using machine learning attacks is proposed.
Abstract: Current approaches for building physical unclonable function (PUF) designs resistant to machine learning attacks often suffer from large resource overhead and are typically difficult to implement on field programmable gate arrays (FPGAs). In this paper we propose a new arbiter-based multi-PUF (MPUF) design that utilises a Weak PUF to obfuscate the challenges to a Strong PUF and is harder to model than the conventional arbiter PUF using machine learning attacks. The proposed PUF design shows a greater resistance to attacks, which have been successfully applied to other Arbiter PUFs. A mathematical model is presented to analyse the complexity and obfuscation properties of the proposed PUF design. Moreover, we show that it is feasible to implement the proposed MPUF design on a Xilinx Artix-7 FPGA, and that it achieves a good uniqueness result of 40.60 % and uniformity of 37.03 %, which significantly improves over previous work into multi-PUF designs.
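
A behavioural sketch of the idea: a strong (arbiter) PUF in the standard additive delay model, with a weak PUF's device-unique bits obfuscating the applied challenge before it reaches the strong PUF. The delay model, challenge length, and the XOR-based obfuscation are simplifying assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64                                       # challenge length

# Strong PUF: additive delay model of an arbiter PUF, response = sign(w . phi(c)).
w = rng.normal(size=N + 1)                   # device-specific stage delay differences

def phi(c):
    # standard parity feature transform of the challenge bits
    feat = np.ones(N + 1)
    for i in range(N):
        feat[i] = np.prod(1 - 2 * c[i:])
    return feat

def strong_puf(c):
    return int(w @ phi(c) > 0)

# Weak PUF: a per-device constant bit string (e.g. cell power-up values).
weak_bits = rng.integers(0, 2, size=N)

def mpuf(c):
    # The weak PUF obfuscates the externally applied challenge before it reaches
    # the strong PUF, so a modeling attacker never observes the effective challenge.
    return strong_puf(c ^ weak_bits)

challenge = rng.integers(0, 2, size=N)
print("strong PUF:", strong_puf(challenge), " MPUF:", mpuf(challenge))
```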

51 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: The experimental results show DNNs indeed outperform most traditional machine learning techniques, but such superior results can only be achieved with proper design of the right DNN architecture, where domain knowledge plays a key role.
Abstract: Recent years have seen great successes in applying deep learning (DL) to many application domains. Though powerful, DL is not easy to use well. In this invited paper, we study an urban taxi demand forecast problem using DL, and we show a number of key insights in modeling a domain problem as a suitable DL task. We also conduct a systematic comparison of two recent deep neural networks (DNNs) for taxi demand prediction, namely ST-ResNet and FLC-Net, on a New York City taxi record dataset. Our experimental results show DNNs indeed outperform most traditional machine learning techniques, but such superior results can only be achieved with proper design of the right DNN architecture, where domain knowledge plays a key role.

46 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper proposes Neu-NoC, a highly efficient interconnection network that reduces redundant data traffic in neuromorphic acceleration systems, and explores the data transfer ability between adjacent layers of fully-connected NNs.
Abstract: A modern neuromorphic acceleration system could consist of hundreds of accelerators, which are often organized through a network-on-chip (NoC). Although the overall computing ability is greatly improved by the large number of accelerators, the power consumption and average delay of the NoC itself become prominent. In this paper, we first analyze the characteristics of the data traffic in neuromorphic acceleration systems and the bottleneck of the popular NoC designs adopted in such systems. We then propose Neu-NoC, a highly efficient interconnection network that reduces redundant data traffic in neuromorphic acceleration systems, and explore the data transfer ability between adjacent layers. A sophisticated neural network aware mapping algorithm and a multicast transmission scheme are designed to alleviate data traffic congestion without increasing the average transmission distance. Finally, we explore the sparsity characteristics of fully-connected NNs. Simulation results show that compared to the most widely-used Mesh NoC design, Neu-NoC can substantially reduce the average data latency by 28.5% and the energy consumption by 39.2% in accelerated neuromorphic systems.

45 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: In contrast to other quantized DNNs that trade-off significant amounts of accuracy for lower memory requirements, LightNNs can significantly reduce storage, energy and area while still maintaining a test error similar to a large DNN configuration.
Abstract: Deep Neural Networks (DNNs) have been adopted in many systems because of their higher classification accuracy, making custom hardware implementations great candidates for high-speed, accurate inference. While progress in achieving large scale, highly accurate DNNs has been made, significant energy and area are required due to massive memory accesses and computations. Such demands pose a challenge to any DNN implementation, yet it is more natural to handle in a custom hardware platform. To alleviate the increased demand in storage and energy, quantized DNNs constrain their weights (and activations) from floating-point numbers to only a few discrete levels. Therefore, storage is reduced, thereby leading to fewer memory accesses. In this paper, we provide an overview of different types of quantized DNNs, as well as the training approaches for them. Among the various quantized DNNs, our LightNN (Light Neural Network) approach can reduce both memory accesses and computation energy, by filling the gap between classic, full-precision and binarized DNNs. We provide a detailed comparison between LightNNs, conventional DNNs and Binarized Neural Networks (BNNs), with MNIST and CIFAR-10 datasets. In contrast to other quantized DNNs that trade off significant amounts of accuracy for lower memory requirements, LightNNs can significantly reduce storage, energy and area while still maintaining a test error similar to a large DNN configuration. Thus, LightNNs provide more options for hardware designers to trade off accuracy and energy.
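
LightNN's core trick of replacing multiplications with a few shift-and-add operations can be sketched as approximating each weight by k signed powers of two; the greedy fitting below is only an illustration of the representation, not the constrained training procedure used in the paper.

```python
import numpy as np

def k_powers_of_two(w, k=2):
    """Greedily approximate a weight by k signed powers of two, so x*w becomes
    k shift-and-add operations; k and the greedy fit are illustrative choices."""
    terms, residual = [], float(w)
    for _ in range(k):
        if residual == 0.0:
            break
        exponent = int(np.round(np.log2(abs(residual))))
        term = float(np.sign(residual)) * 2.0 ** exponent
        terms.append(term)
        residual -= term
    return terms

w = 0.3426
for k in (1, 2, 3):
    approx = sum(k_powers_of_two(w, k))
    print(f"k={k}: approx={approx:.5f}, abs error={abs(w - approx):.5f}")
```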

41 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: This work investigates the multi-factor adversarial attack problem in practical model optimized deep learning systems by jointly considering the DNN model-reshaping and the input perturbations and conducts a comprehensive robustness and vulnerability analysis of deep compressed DNN models under derived adversarial attacks.
Abstract: Thanks to recent machine learning model innovation and computing hardware advancement, state-of-the-art Deep Neural Networks (DNNs) present human-level performance for many complex intelligent tasks in real-world applications. However, this also introduces ever-increasing security concerns for those intelligent systems. For example, the emerging adversarial attacks indicate that even very small and often imperceptible adversarial input perturbations can easily mislead the cognitive function of deep learning systems (DLS). Existing DNN adversarial studies are narrowly performed on ideal software-level DNN models with a focus on a single uncertainty factor, i.e. input perturbations. However, the impact of DNN model reshaping on adversarial attacks, which is introduced by various hardware-favorable techniques such as hash-based weight compression during modern DNN hardware implementation, has never been discussed. In this work, we for the first time investigate the multi-factor adversarial attack problem in practical model optimized deep learning systems by jointly considering the DNN model-reshaping (e.g. HashNet based deep compression) and the input perturbations. We first augment the adversarial example generation method for compressed DNN models by incorporating software-based approaches and mathematically modeled DNN reshaping. We then conduct a comprehensive robustness and vulnerability analysis of deep compressed DNN models under the derived adversarial attacks. A defense technique named "gradient inhibition" is further developed to hinder the generation of adversarial examples and thus effectively mitigate adversarial attacks on both software and hardware-oriented DNNs. Simulation results show that "gradient inhibition" can decrease the average success rate of adversarial attacks from 87.99% to 4.77% (from 86.74% to 4.64%) on the MNIST (CIFAR-10) benchmark with marginal accuracy degradation across various DNNs.
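
For readers unfamiliar with the input-perturbation attacks discussed here, the sketch below generates a fast-gradient-sign style adversarial example against a tiny logistic model whose input gradient is available in closed form; it illustrates only the generic attack, not the paper's augmented attack on compressed models or the gradient-inhibition defense.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 32                                      # illustrative input dimension
w, b = rng.normal(size=d), 0.1              # a tiny logistic "classifier" stands in for a DNN

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = rng.normal(size=d), 1.0              # clean input and its true label

# Closed-form input gradient of the cross-entropy loss for this model:
# dL/dx = (sigmoid(w.x + b) - y) * w
grad_x = (sigmoid(w @ x + b) - y) * w

# Fast-gradient-sign perturbation: small per-feature steps in the loss-increasing direction.
eps = 0.1
x_adv = x + eps * np.sign(grad_x)

print("clean score      :", sigmoid(w @ x + b))
print("adversarial score:", sigmoid(w @ x_adv + b))
```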

30 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: Experimental results show that the proposed TDC method achieves up to 81 times higher throughput than the state-of-the-art DCNN accelerator with the same hardware resources, and improves the speed by 7.8 times by having all layers in the hourglass-type FSRCNN processed with inter-layer parallelism without additional DSP usage.
Abstract: Convolutional neural networks (CNN) are widely used in various computer vision applications. Recently, there have been many studies on FPGA-based CNN accelerators to achieve high performance and power efficiency. Most of them have been on CNN-based object detection algorithms, but research on image super-resolution has rarely been conducted. Fast super-resolution CNN (FSRCNN), a well-known CNN-based super-resolution algorithm, is a combination of multiple convolutional layers and a single deconvolutional layer. Since the deconvolutional layer generates high-resolution (HR) output feature maps from low-resolution (LR) input feature maps, its execution cycles are larger than those of the convolutional layer. In this paper, we propose a novel architecture of the FPGA-based CNN accelerator with efficient parallelization. We develop a method of transforming a deconvolutional layer into a convolutional layer (TDC), a new methodology for deconvolutional neural networks (DCNN). There is a massive parallelization source in the deconvolutional layer where multiple outputs within the same output feature map are created with the same inputs. When this new parallelization technique is applied to the deconvolutional layer, it generates the LR output feature maps the same as the convolutional layer. Thus, the performance of the accelerator increases without any additional hardware resources because the kernel size required to generate the LR output feature maps is smaller. In addition, if there is a DSP underutilization problem in the deconvolutional layer, in which some of the processors are in an idle state, the proposed method solves this problem by allowing more output feature maps to be processed in parallel. Experimental results show that the proposed TDC method achieves up to 81 times higher throughput than the state-of-the-art DCNN accelerator with the same hardware resources. We also improve the speed by 7.8 times by having all layers in the hourglass-type FSRCNN processed with inter-layer parallelism without additional DSP usage.
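
The key observation behind the TDC transform can be checked in a few lines: the output of a strided deconvolution splits into interleaved phases, and each phase is an ordinary convolution of the low-resolution input with a sub-sampled kernel. The 1-D, single-channel example below is a simplification for illustration, not the paper's full multi-channel formulation.

```python
import numpy as np

def deconv1d(x, w, s):
    # naive stride-s transposed (de)convolution: scatter-add each input sample
    y = np.zeros((len(x) - 1) * s + len(w))
    for m, xm in enumerate(x):
        y[m * s : m * s + len(w)] += xm * w
    return y

def tdc_deconv1d(x, w, s):
    # TDC idea: each output phase r is an ordinary convolution of the
    # low-resolution input with the sub-sampled kernel w[r::s]; interleave the phases.
    y = np.zeros((len(x) - 1) * s + len(w))
    for r in range(s):
        wr = w[r::s]
        if len(wr):
            y[r::s] = np.convolve(x, wr)
    return y

x = np.array([1.0, 2.0, -1.0, 0.5])          # low-resolution input "feature map"
w = np.array([1.0, 10.0, 100.0, 5.0, 7.0])   # deconvolution kernel
assert np.allclose(deconv1d(x, w, 2), tdc_deconv1d(x, w, 2))
print(tdc_deconv1d(x, w, 2))
```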

30 citations


Proceedings ArticleDOI
22 Jan 2018
TL;DR: A new metric, Percentage of Netlist Recovery (PNR), is defined and promoted, which can quantify the resilience against gate-level theft of intellectual property (IP) in a manner more meaningful than established metrics.
Abstract: Here we advance the protection of split manufacturing (SM)-based layouts through the judicious and well-controlled handling of interconnects. Initially, we explore the cost-security trade-offs of SM, which are limiting its adoption. Aiming to resolve this issue, we propose effective and efficient strategies to lift nets to the BEOL. Towards this end, we design custom "elevating cells" which we also provide to the community. Further, we define and promote a new metric, Percentage of Netlist Recovery (PNR), which can quantify the resilience against gate-level theft of intellectual property (IP) in a manner more meaningful than established metrics. Our extensive experiments show that we outperform the recent protection schemes regarding security. For example, we reduce the correct connection rate to 0% for commonly considered benchmarks, which is a first in the literature. Besides, we induce reasonably low and controllable overheads on power, performance, and area (PPA). At the same time, we also help to lower the commercial cost incurred by SM.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper proposes a low-power implementation of the approximate logarithmic multiplier to improve the power consumption of convolutional neural networks for image classification, taking advantage of its intrinsic tolerance to error.
Abstract: This paper proposes a low-power implementation of the approximate logarithmic multiplier to improve the power consumption of convolutional neural networks for image classification, taking advantage of its intrinsic tolerance to error. The approximate logarithmic multiplier converts multiplications to additions by taking approximate logarithm and achieves significant improvement in power and area while having low worst-case error, which makes it suitable for neural network computation. Our proposed design shows a significant improvement in terms of power and area over the previous work that applied logarithmic multiplication to neural networks, reducing power up to 76.6% compared to exact fixed-point multiplication, while maintaining comparable prediction accuracy in convolutional neural networks for MNIST and CIFAR10 datasets.
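
The conversion of multiplications into additions referred to above is the classic Mitchell logarithmic approximation; a software sketch is given below. The paper's specific low-power implementation details (operand widths, truncation) are not reproduced here.

```python
def mitchell_multiply(a, b):
    """Approximate unsigned multiply via Mitchell's logarithm approximation:
    log2(2^k * (1 + f)) ~= k + f, so the multiply becomes an addition of (k, f) pairs."""
    assert a > 0 and b > 0

    def log_approx(x):
        k = x.bit_length() - 1            # characteristic: position of the leading one
        f = (x - (1 << k)) / (1 << k)     # mantissa fraction in [0, 1)
        return k, f

    ka, fa = log_approx(a)
    kb, fb = log_approx(b)
    k, f = ka + kb, fa + fb
    if f >= 1.0:                          # antilogarithm approximation, two cases
        k, f = k + 1, f - 1.0
    return int((1 << k) * (1 + f))

for a, b in [(13, 27), (100, 200), (255, 255)]:
    approx, exact = mitchell_multiply(a, b), a * b
    print(f"{a}*{b}: exact={exact}, approx={approx}, error={100 * (exact - approx) / exact:.1f}%")
```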

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper proposes a methodology to design ternary gates by modeling pull-up and pull-down operations of the gates, which makes it possible to synthesize ternary gates with a minimum number of transistors.
Abstract: Over the last few decades, CMOS-based digital circuits have been steadily developed. However, because of the power density limits, device scaling may soon come to an end, and new approaches for circuit designs are required. Multi-valued logic (MVL) is one of the new approaches, which increases the radix for computation to lower the complexity of the circuit. For the MVL implementation, ternary logic circuit designs have been proposed previously, though they could not show advantages over binary logic, because of unoptimized synthesis techniques. In this paper, we propose a methodology to design ternary gates by modeling pull-up and pull-down operations of the gates. Our proposed methodology makes it possible to synthesize ternary gates with a minimum number of transistors. From HSPICE simulation results, our ternary designs show significant power-delay product reductions; 49 % in the ternary full adder and 62 % in the ternary multiplier compared to the existing methodology. We have also compared the number of transistors in CMOS-based binary logic circuits and ternary device-based logic circuits.
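
For readers new to multi-valued logic, the small enumeration below shows common unbalanced ternary gate definitions and a ternary half adder; these functional definitions are generic illustrations, not the transistor-level pull-up/pull-down models the synthesis methodology is built on.

```python
from itertools import product

# Unbalanced ternary values {0, 1, 2}; purely functional gate definitions.
def t_not(a):    return 2 - a        # standard ternary inverter
def t_min(a, b): return min(a, b)    # ternary AND analogue
def t_max(a, b): return max(a, b)    # ternary OR analogue

def ternary_half_adder(a, b):
    return (a + b) % 3, (a + b) // 3  # sum digit, carry digit

print(" a b | NOTa MIN MAX | SUM CARRY")
for a, b in product(range(3), repeat=2):
    s, c = ternary_half_adder(a, b)
    print(f" {a} {b} |  {t_not(a)}    {t_min(a, b)}   {t_max(a, b)}  |  {s}    {c}")
```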

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper describes several near-term challenges and opportunities, along with concrete existence proofs, for application of learning-based methods within the ecosystem of commercial EDA, IC design, and academic research.
Abstract: Design-based equivalent scaling now bears much of the burden of continuing the semiconductor industry's trajectory of Moore's-Law value scaling. In the future, reductions of design effort and design schedule must comprise a substantial portion of this equivalent scaling. In this context, machine learning and deep learning in EDA tools and design flows offer enormous potential for value creation. Examples of opportunities include: improved design convergence through prediction of downstream flow outcomes; margin reduction through new analysis correlation mechanisms; and use of open platforms to develop learning-based applications. These will be the foundations of future design-based equivalent scaling in the IC industry. This paper describes several near-term challenges and opportunities, along with concrete existence proofs, for application of learning-based methods within the ecosystem of commercial EDA, IC design, and academic research.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A Highly Flexible In-Memory (HieIM) computing platform using STT MRAM is proposed, which can be leveraged to implement Boolean logic functions without sacrificing memory functionality, thus overcoming the 'operand locality' problem in contemporary in-memory computing platform designs.
Abstract: In this paper we propose a Highly Flexible In-Memory (HieIM) computing platform using STT MRAM, which can be leveraged to implement Boolean logic functions without sacrificing memory functionality. It could pre-process data within memory to further reduce power hungry long distance communication between memory and processing units as in Von-Neumann computing system. HieIM can implement all the Boolean logic functions (AND/NAND, OR/NOR, XOR/XNOR) between any two cells in the same memory array, thus overcoming the 'operand locality' problem in contemporary in-memory computing platform designs. To investigate the performance of HieIM, we test in-memory bulk bit-wise Boolean logic operations using different vector datasets, which shows ∼ 8× energy saving and ∼ 5× speedup compared to recent DRAM based in-memory computing platform. We further implement an in-memory data encryption engine design based on HieIM as another case study. With AES algorithm, it shows 51.5% and 68.9% lower energy consumption compared to CMOS-ASIC and CMOL based implementations, respectively.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A deep reinforcement learning framework of the HEV power management with the aim of improving fuel economy is proposed and results demonstrate the effectiveness of the proposed framework on optimizing HEV fuel economy.
Abstract: Hybrid electric vehicles employ a hybrid propulsion system to combine the energy efficiency of electric motor and a long driving range of internal combustion engine, thereby achieving a higher fuel economy as well as convenience compared with conventional ICE vehicles. However, the relatively complicated powertrain structures of HEVs necessitate an effective power management policy to determine the power split between ICE and EM. In this work, we propose a deep reinforcement learning framework of the HEV power management with the aim of improving fuel economy. The DRL technique is comprised of an offline deep neural network construction phase and an online deep Q-learning phase. Unlike traditional reinforcement learning, DRL presents the capability of handling the high dimensional state and action space in the actual decision-making process, making it suitable for the HEV power management problem. Enabled by the DRL technique, the derived HEV power management policy is close to optimal, fully model-free, and independent of a prior knowledge of driving cycles. Simulation results based on actual vehicle setup over real-world and testing driving cycles demonstrate the effectiveness of the proposed framework on optimizing HEV fuel economy.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: The key elements of FHE-PDK include technology files for design rule checking, layout versus schematic and layout parasitics extraction, as well as SPICE-compatible models for flexible thin-film transistors (TFTs) and passive elements.
Abstract: Flexible Electronics (FE) is emerging for wearables and low-cost internet of things (IoT) nodes benefiting from its low-cost fabrication and mechanical flexibility. Combining FE with thinned silicon chips, known as flexible hybrid electronics (FHE), can take advantages of both low-cost printed electronics and high performance silicon chips. To design a FHE system, the process design kit (PDK) offering the capabilities for circuit design, simulation and verification for both FE and silicon chips is needed. The key elements of FHE-PDK include technology files for design rule checking (DRC), layout versus schematic (LVS) and layout parasitics extraction (LPE), as well as SPICE-compatible models for flexible thin-film transistors (TFTs) and passive elements. Wafer scale measurements are used to validate our SPICE models and design rules are derived accordingly to assure a satisfactory yield. With FHE-PDK, circuit and system designers can therefore focus on design innovations and can rely on design tools to produce manufacturable designs.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This work provides a precise definition of the underlying design task of Programmable Microfluidic Devices (PMDs) and presents complementary solutions (both exact as well as heuristic) and discusses how they guarantee a sound valve-control.
Abstract: In the domain of microfluidic devices, a paradigm shift from application-specific to fully-programmable solutions is taking place (a similar development from ASICs to FPGAs has been observed in conventional circuitry). So-called Programmable Microfluidic Devices (PMDs) provide a promising platform in this regard. Here, fluids can be pushed into various reaction vessels whose inflow and outflow is controlled by valves. The regular structure, in combination with the flexibility of defining various flow paths through valves, makes it possible to realize a vast range of biological or chemical applications by only changing the corresponding valve-control sequence. However, determining a sound valve-control constitutes a non-trivial task. Although first automatic approaches for this problem have recently been proposed, we show that they frequently yield impractical control sequences. In this work, we address this issue by providing a precise definition of the underlying design task. Afterwards, we present complementary solutions (both exact as well as heuristic) and discuss how they guarantee a sound valve-control. Experimental evaluations demonstrate that the proposed solutions are capable of automatically generating a sound valve-control for PMDs.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: An approximation-aware test methodology which can be easily integrated into the regular test flow and removed all potential faults that no longer need to be tested because they can be tolerated under the given error metric is presented.
Abstract: A wide range of applications significantly benefit from the Approximate Computing (AC) paradigm in terms of speed or power reduction. AC achieves this by tolerating errors in the design. These errors are introduced into the design either manually by the designer or by approximate synthesis approaches. From here, the standard design flow is taken. Hence, the manufactured AC chip is eventually tested for production errors using well established fault models. To be precise, if the test for a test pattern fails, the AC chip is sorted out. However, from a general perspective this procedure results in throwing away chips which are perfectly fine taking into account that the considered fault (i.e. physical defect that leads to the error) can still be tolerated because of approximation. This can lead to a significant amount of yield loss. In this paper, we present an approximation-aware test methodology which can be easily integrated into the regular test flow. It is based on a pre-process to identify approximation-redundant faults. By this, we remove all potential faults that no longer need to be tested because they can be tolerated under the given error metric. Our experimental results and case studies on a wide variety of benchmark circuits show a significant potential for yield improvement.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: This paper presents a novel, flexible, and adaptable SoC security architecture that efficiently implements diverse security policies and shows, for the first time, that the proposed framework provides high level of patchability with minimal energy and performance overhead.
Abstract: System-on-Chip (SoC) security architectures targeted towards diverse applications including Internet of Things (IoT) and automotive systems enforce two critical design requirements: in-field configurability and low overhead. To simultaneously address these constraints, in this paper, we present a novel, flexible, and adaptable SoC security architecture that efficiently implements diverse security policies. The architecture and associated CAD flow enable “hardware patching” i.e. hardware security policy engine that can be seamlessly and securely upgraded in field to address unanticipated attacks or new security requirements. We implement (1) a centralized Reconfigurable Security Policy Engine (RSPE), (2) smart security wrappers, and (3) Design-for-Debug (DfD) infrastructure interface as the building blocks of the architecture. The proposed framework provides a systematic approach to represent and synthesize diverse security policies. Through extensive analysis using representative SoC models, we show, for the first time to our knowledge, that the proposed framework provides high level of patchability with minimal energy and performance overhead.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: Experimental results show that by combining temperature and reservoir accelerations, the EM lifetime of a chip can be reduced from 10 years down to a few hours under the 150°C temperature limit, which is sufficient for practical EM testing of typical nanometer CMOS ICs.
Abstract: For practical testing and detection of electromigration (EM) induced failures in dual damascene copper interconnects in today's and future sub-10nm ICs, one critical issue is how to create stressing conditions so that the chip will fail exclusively under EM in a very short period of time so that EM signoff and validation can be carried out efficiently. In this work, we propose novel EM wearout-acceleration techniques for practical VLSI chips. We will first review the recently proposed three-phase physics-based EM models and discuss the important factors contributing to the EM aging process. Then we propose a new formula for fast estimation of the void's saturation volume for general multi-segment interconnect wires, which is important for EM mortality check. We then investigate two strategies to accelerate the EM failure process: reservoir-enhanced acceleration and temperature-based acceleration. First we show that multi-segment interconnects with reservoir structures and their stressing currents can be exploited to significantly speed up the EM wearout process. Such configurable reservoir-based wires are very flexible and can achieve various EM accelerations at the cost of some routing resources. Additionally, we show that further acceleration can be achieved by increasing temperature. On average, a 10% increase in temperature yields about 10X wearout acceleration. However, purely temperature based acceleration is not possible since practical VLSI chips have temperature limitations which must be strictly enforced to ensure the chip only fails under EM, and not due to other reliability effects. In this study, we show that it is possible to achieve significantly high acceleration while staying within the feasible operating zones by combining the two acceleration techniques. Experimental results show that by combining temperature and reservoir accelerations, we can reduce the EM lifetime of a chip from 10 years down to a few hours (about 10^5X acceleration) under the 150°C temperature limit, which is sufficient for practical EM testing of typical nanometer CMOS ICs.
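
The quoted rule of thumb that a roughly 10% temperature increase yields about 10X wearout acceleration is consistent with an Arrhenius temperature dependence of EM; with an assumed activation energy of about 0.85 eV (a typical value for Cu interconnects, not a number taken from the paper) and a 373 K to 410 K step:

```latex
% Arrhenius acceleration factor between operating temperature T_1 and stress temperature T_2,
% assuming E_a ~ 0.85 eV (typical for Cu interconnects; an assumption, not a value from the paper):
\mathrm{AF}
  = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_1}-\frac{1}{T_2}\right)\right]
  = \exp\!\left[\frac{0.85}{8.617\times 10^{-5}}\left(\frac{1}{373\,\mathrm{K}}-\frac{1}{410\,\mathrm{K}}\right)\right]
  \approx \exp(2.4) \approx 11
```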

Proceedings ArticleDOI
Yi Cai, Tianqi Tang, Lixue Xia, Ming Cheng, Zhenhua Zhu, Yu Wang, Huazhong Yang
22 Jan 2018
TL;DR: A low-bitwidth CNN training method is proposed, using low-bitwidth convolution outputs, activations, weights and gradients to train CNN models based on RRAM, and a system is designed to implement the training algorithms.
Abstract: Convolutional Neural Networks (CNNs) have achieved excellent performance on various artificial intelligence (AI) applications, while a higher demand on energy efficiency is required for future AI. Resistive Random-Access Memory (RRAM)-based computing systems provide a promising solution to energy-efficient neural network training. However, it's difficult to support high-precision CNN in RRAM-based hardware systems. Firstly, multi-bit digital-analog interfaces will take up most of the energy overhead of the whole system. Secondly, it's difficult to write the RRAM to the expected resistance states accurately; only low-precision numbers can be represented. To enable CNN training based on RRAM, we propose a low-bitwidth CNN training method, using low-bitwidth convolution outputs (CO), activations (A), weights (W) and gradients (G) to train CNN models based on RRAM. Furthermore, we design a system to implement the training algorithms. We explore the accuracy under different bitwidth combinations of (A, CO, W, G), and propose a practical tradeoff between accuracy and energy overhead. Our experiments demonstrate that the proposed system performs well on low-bitwidth CNN training tasks. For example, training LeNet-5 with 4-bit convolution outputs, 4-bit weights, 4-bit activations and 4-bit gradients on MNIST can still achieve 97.67% accuracy. Moreover, the proposed system can achieve 23.0X higher energy efficiency than a GPU when processing the training task of LeNet-5, and 4.4X higher energy efficiency when processing the training task of ResNet-20.
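
A generic uniform quantizer illustrates the kind of low-bitwidth representation of weights, activations, convolution outputs, and gradients the scheme relies on; the rounding and scaling choices here are assumptions, not necessarily the paper's exact quantizer.

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform symmetric quantization to a given bit-width; a generic stand-in for
    the low-bitwidth representation of (A, CO, W, G), not the paper's exact quantizer."""
    levels = 2 ** (bits - 1) - 1                    # e.g. 7 for signed 4-bit
    scale = float(np.max(np.abs(x))) / levels or 1.0
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale                                # de-quantized values seen by the model

rng = np.random.default_rng(5)
w = rng.normal(scale=0.2, size=1000)                # stand-in for a weight tensor
for bits in (8, 4, 2):
    err = float(np.mean(np.abs(w - quantize(w, bits))))
    print(f"{bits}-bit: mean abs quantization error = {err:.4f}")
```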

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A PCE-aware scheduling scheme is developed for effective mapping of multithreaded applications onto an HMP and results indicate that the proposed learning-based scheme outperforms the state of the art solution by 10% when there is no PCE gap between big and little cores.
Abstract: Heterogeneous Multicore Processors (HMPs) are comprised of multiple core types (small vs. big core architectures) with various performance and power characteristics which offer the flexibility to assign each thread to a core that provides the maximum energy-efficiency. Although this architecture provides more flexibility for the running application to determine the optimal run-time settings that maximize energy-efficiency, due to the interdependence of various tuning parameters such as the type of core, run-time voltage and frequency, and the number of threads, the scheduling becomes more challenging. More importantly, the impact of Power Conversion Efficiency (PCE) of the On-Chip Voltage Regulators (OCVRs) is another important parameter that makes it more challenging to schedule multithreaded applications on HMPs. In this paper, the importance of concurrent optimization and fine-tuning of the circuit and architectural parameters for energy-efficient scheduling on HMPs is addressed to harness the power of heterogeneity. In addition, the scheduling challenges for multithreaded applications are investigated for HMP architectures that account for the impact of power conversion efficiency. A highly accurate learning-based model is developed for energy-efficiency prediction to guide the scheduling decision. Using the predictive model, we further develop a PCE-aware scheduling scheme for effective mapping of multithreaded applications onto an HMP. The results indicate that the proposed learning-based scheme outperforms the state-of-the-art solution by 10% when there is no PCE gap between big and little cores. The energy-efficiency improves up to 60% when the PCE gap between big and little cores increases.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: In this work, an execution framework is introduced that leverages local memory by statically allocating a subset of tasks to cores, which reduces the access times to shared memory, as off-chip memory access is avoided, and in turn improves the schedulability of such systems.
Abstract: Access to shared memory is one of the main challenges for many-core processors. One group of scheduling strategies for such platforms focuses on the division of tasks' access to shared memory and code execution. This allows to orchestrate the access to shared local and off-chip memory in a way such that access contention between different compute cores is avoided by design. In this work, an execution framework is introduced that leverages local memory by statically allocating a subset of tasks to cores. This reduces the access times to shared memory, as off-chip memory access is avoided, and in turn improves the schedulability of such systems. A Constraint Programming (CP) formulation is presented to select the statically allocated tasks and to generate the complete system schedule. Evaluations show that the proposed approach yields an up to 19% higher schedulability ratio than related work, and a case study demonstrates its applicability to industrial problems.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A novel split manufacturing framework that not only guarantees to achieve the required security level but also allows for a drastic reduction of the introduced overhead, and demonstrates much better efficiency, overhead reduction, and security guarantee compared with existing methods.
Abstract: Trojans and backdoors inserted by untrusted foundries have become serious threats to hardware security. Split manufacturing is proposed to prevent Trojan insertion proactively. Existing methods depend on wire lifting to hide partial circuit interconnections, which usually suffer from large overhead and lack of security guarantee. In this paper, we propose a novel split manufacturing framework that not only guarantees to achieve the required security level but also allows for a drastic reduction of the introduced overhead. In our framework, insertion of dummy circuit cells and wires is considered simultaneously with wire lifting. To support cell and wire insertion, we propose a new security criterion, and further derive its sufficient condition to avoid computation intensive operations in traditional methods. Then, for the first time, a novel mixed integer linear programming formulation is proposed to simultaneously consider cell and wire insertion together with wire lifting, which significantly enlarges the design space to guarantee the realization of the sufficient condition under the security requirements and overhead constraints. With extensive experimental results, our framework demonstrates much better efficiency, overhead reduction, and security guarantee compared with existing methods.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A micropower audio delta-sigma modulator is presented for mobile applications; it employs dynamic-bias inverter-based integrators, which maximize both the g_m/I_D ratio and the slew rate while compensating for PVT variations.
Abstract: A micropower audio delta-sigma modulator is presented for mobile applications. The modulator employs dynamic-bias inverter-based integrators, which maximize both the g_m/I_D ratio and the slew rate while compensating for PVT variations. A prototype modulator implemented in a 0.18μm CMOS process features a single-bit third-order topology. The modulator achieves 97.7dB SNDR, 98.6dB SNR, 100.5dB DR, and 105.8dB SFDR in a 20kHz audio band, while consuming only 300μW from a 1.8V supply. This corresponds to a state-of-the-art FoM of 178.7dB.
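
As a consistency check, the reported figure of merit follows from the dynamic range, bandwidth, and power quoted above using the Schreier FoM definition:

```latex
% Schreier figure of merit with the reported numbers (DR = 100.5 dB, BW = 20 kHz, P = 300 uW):
\mathrm{FoM_S} = \mathrm{DR} + 10\log_{10}\!\left(\frac{\mathrm{BW}}{P}\right)
  = 100.5\,\mathrm{dB} + 10\log_{10}\!\left(\frac{20\,\mathrm{kHz}}{300\,\mu\mathrm{W}}\right)
  \approx 100.5 + 78.2 = 178.7\,\mathrm{dB}
```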

Proceedings ArticleDOI
Bing Li, Wei Wen, Jiachen Mao, Sicheng Li, Yi Chen, Hai Helen Li
22 Jan 2018
TL;DR: This paper demonstrates the co-optimization of the DNN algorithm and hardware which exploits the model redundancy to accelerate DNNs.
Abstract: Deep Neural Networks (DNNs) are pervasively applied in many artificial intelligence (AI) applications. The high performance of DNNs comes at the cost of larger size and higher compute complexity. Recent studies show that DNNs have much redundancy, such as the zero-value parameters and excessive numerical precision. To reduce computing complexity, many redundancy reduction techniques have been proposed, including pruning and data quantization. In this paper, we demonstrate our co-optimization of the DNN algorithm and hardware which exploits the model redundancy to accelerate DNNs.

Proceedings ArticleDOI
22 Jan 2018
TL;DR: It is shown that it is possible to build a hardware implementation of a processor with multiple instructions, support for non-deterministic Paillier encryption, and partially homomorphic processing; CryptoBlaze is at least 10X faster than the state of the art.
Abstract: Homomorphic computing has been suggested as a method to secure processing in insecure servers. One of the drawbacks of homomorphic processing is the enormous execution time taken to process even the simplest of operations. In this paper, we propose a processor with hardware support for homomorphic processing. The proposed processor, named CryptoBlaze, has eight additional specialized instructions and hardware to support computation on encrypted data. For the first time, we show that it is possible to build a hardware implementation of a processor with multiple instructions, support for non-deterministic Paillier encryption, and partially homomorphic processing. The system was implemented and tested on an FPGA with three benchmarks. The design space with differing security parameters was explored and results are presented. CryptoBlaze is at least 10X faster than the state of the art.
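
The partially homomorphic operation such a processor accelerates can be seen in a textbook Paillier sketch: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, and the random factor r makes encryption non-deterministic. The tiny parameters below are for illustration only; they say nothing about CryptoBlaze's actual instruction set or key sizes.

```python
import math
import random

# Textbook Paillier with tiny parameters for illustration only; a real system would
# use large secure primes (these numbers are assumptions, not CryptoBlaze's).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1                                     # common simplified generator choice
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lam mod n^2), L(x) = (x-1)/n

def encrypt(m):
    r = random.randrange(1, n)                # fresh randomness: encryption is non-deterministic
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the underlying plaintexts.
c1, c2 = encrypt(41), encrypt(17)
assert decrypt((c1 * c2) % n2) == 41 + 17
print("decrypt(c1 * c2) =", decrypt((c1 * c2) % n2))
```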

Proceedings ArticleDOI
22 Jan 2018
TL;DR: An accuracy-controllable multiplier whose final product is generated by a carry-maskable adder that can dynamically select the length of the carry propagation to satisfy the accuracy requirements flexibly is proposed.
Abstract: Multiplication is a key fundamental function for many error-tolerant applications. Approximate multiplication is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes an accuracy-controllable multiplier whose final product is generated by a carry-maskable adder. The proposed scheme can dynamically select the length of the carry propagation to satisfy the accuracy requirements flexibly. The partial product tree of the multiplier is approximated by the proposed tree compressor. An 8×8 multiplier design is implemented by employing the carry-maskable adder and the compressor. Compared with a conventional Wallace tree multiplier, the proposed multiplier reduced power consumption by between 47.3% and 56.2% and critical path delay by between 29.9% and 60.5%, depending on the required accuracy. Its silicon area was also 44.6% smaller. In addition, results from an image processing application demonstrate that the quality of the processed images can be controlled by the proposed multiplier design.
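
A behavioural model in the same spirit as a carry-maskable adder: carries are allowed to propagate only within segments of a selectable length, so longer segments give higher accuracy at the cost of a longer carry chain. This is a simplified illustration, not the paper's exact circuit or its tree compressor.

```python
def masked_add(a, b, seg_bits, width=16):
    """Approximate addition in which carries do not cross segment boundaries;
    seg_bits plays the role of the selectable carry-propagation length."""
    mask = (1 << seg_bits) - 1
    result = 0
    for pos in range(0, width, seg_bits):
        sa = (a >> pos) & mask
        sb = (b >> pos) & mask
        result |= ((sa + sb) & mask) << pos    # carry out of each segment is dropped
    return result

a, b = 16328, 8692
print("exact     :", a + b)
for seg in (4, 8, 16):
    approx = masked_add(a, b, seg)
    print(f"seg = {seg:2d} :", approx, " error =", (a + b) - approx)
```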

Proceedings ArticleDOI
22 Jan 2018
TL;DR: A gradual training approximation is proposed that adaptively sets the level of hardware approximation depending on the neural network's internal error, instead of applying uniform hardware approximation; a layer-based approximation approach further accelerates inference.
Abstract: Neural networks have been successfully used in many applications. Due to their computational complexity, it is difficult to implement them on embedded devices. Neural networks are inherently approximate and thus can be simplified. In this paper, we propose CANNA, a gradual training approximation that adaptively sets the level of hardware approximation depending on the neural network's internal error, instead of applying uniform hardware approximation. To accelerate inference, CANNA's layer-based approximation approach selectively relaxes the computation in each layer of the neural network, as a function of its sensitivity to approximation. For hardware support, we use a configurable floating point unit in hardware that dynamically identifies inputs which produce the largest approximation error and processes them instead in precise mode. We evaluate the accuracy and efficiency of our design by integrating configurable FPUs into AMD's Southern Islands GPU architecture. Our experimental evaluation shows that CANNA achieves up to 4.84× (7.13×) energy savings and 3.22× (4.64×) speedup when training four different neural network applications with 0% (2%) quality loss compared to the implementation on a baseline GPU. During the inference phase, our layer-based approach improves the energy efficiency by 4.42× (6.06×) and results in 2.96× (3.98×) speedup while ensuring 0% (2%) quality loss.