Proceedings ArticleDOI

1.1 Computing's energy problem (and what we can do about it)

06 Mar 2014 - pp. 10-14
TL;DR: The drive for performance and the end of voltage scaling have made power, not the number of transistors, the principal factor limiting further improvements in computing performance; continuing to scale performance will require new specialized compute engines and design tools that bring application experts into the process, enabling a new wave of innovative and efficient computing devices.
Abstract: Our challenge is clear: The drive for performance and the end of voltage scaling have made power, and not the number of transistors, the principal factor limiting further improvements in computing performance. Continuing to scale compute performance will require the creation and effective use of new specialized compute engines, and will require the participation of application experts to be successful. If we play our cards right, and develop the tools that allow our customers to become part of the design process, we will create a new wave of innovative and efficient computing devices.
Citations
Journal ArticleDOI
20 Nov 2017
TL;DR: In this paper, the authors provide a comprehensive tutorial and survey of recent advances toward enabling efficient processing of DNNs, discuss the various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes.
Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

2,391 citations

Posted Content
TL;DR: A binary matrix multiplication GPU kernel is presented with which the MNIST BNN runs 7 times faster than with an unoptimized GPU kernel, without any loss in classification accuracy.
Abstract: We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. At training time the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power efficiency. To validate the effectiveness of BNNs, we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.
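As a rough illustration of the training scheme described above (binary weights in the forward pass, real-valued weights kept only for accumulating gradient updates), the following NumPy sketch applies deterministic sign binarization and a straight-through estimator to a single linear layer; the function names and the clipping threshold are illustrative, not taken from the authors' released code.

```python
import numpy as np

def binarize(x):
    # Deterministic binarization to {-1, +1} via the sign function.
    return np.where(x >= 0, 1.0, -1.0)

def forward(x, w_real):
    # The forward pass uses the binarized weights; the real-valued weights
    # are kept only for accumulating gradient updates.
    return x @ binarize(w_real)

def weight_grad(x, grad_out, w_real, clip=1.0):
    # Straight-through estimator: treat sign() as the identity for the
    # backward pass, but zero the gradient where |w| exceeds the clip range.
    return (x.T @ grad_out) * (np.abs(w_real) <= clip)

# Toy usage: batch of 4, 8 inputs, 3 outputs, one gradient step.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = 0.1 * rng.standard_normal((8, 3))
y = forward(x, w)
w -= 0.01 * weight_grad(x, np.ones_like(y), w)   # update the real-valued weights
```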

2,320 citations

Journal ArticleDOI
TL;DR: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs) that optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, by reconfiguring its architecture for various CNN shapes.
Abstract: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring the architecture. CNNs are widely used in modern AI systems but also bring challenges on throughput and energy efficiency to the underlying hardware, because their computation requires a large amount of data, creating significant on-chip and off-chip data movement that is more energy-consuming than the computation itself. Minimizing the energy cost of data movement for any CNN shape is therefore the key to high throughput and energy efficiency. Eyeriss achieves these goals by using a proposed processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements. The RS dataflow reconfigures the computation mapping of a given shape, which optimizes energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses. Compression and data gating are also applied to further improve energy efficiency. Eyeriss processes the convolutional layers at 35 frames/s and 0.0029 DRAM accesses per multiply-and-accumulate (MAC) for AlexNet at 278 mW (batch size N = 4), and at 0.7 frames/s and 0.0035 DRAM accesses/MAC for VGG-16 at 236 mW (N = 3).
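The claim that data movement, not arithmetic, dominates energy rests on how much reuse a convolutional layer offers. The sketch below only counts MACs and unique operands for a stride-1, unpadded layer to show the reuse that a dataflow such as row stationary can exploit; the layer dimensions are made up, and this is not Eyeriss's actual mapping.

```python
def conv_layer_reuse(H, W, R, S, C, M):
    """Count MACs and unique operands for one stride-1, unpadded conv layer.
    Each filter weight is used at every output position and each input value
    feeds many output positions, which is the reuse a dataflow can exploit
    on-chip instead of refetching data from DRAM."""
    E, F = H - R + 1, W - S + 1          # output height and width
    macs = E * F * M * R * S * C         # multiply-accumulates
    weights = M * R * S * C              # unique filter weights
    inputs = H * W * C                   # unique input activations
    return macs, macs / weights, macs / inputs

# Hypothetical 3x3 layer, 64 input and 128 output channels, 56x56 input.
macs, weight_reuse, input_reuse = conv_layer_reuse(56, 56, 3, 3, 64, 128)
print(f"{macs:,} MACs; each weight used ~{weight_reuse:.0f} times, "
      f"each input used ~{input_reuse:.0f} times")
```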

2,165 citations

Posted Content
TL;DR: A binary matrix multiplication GPU kernel is presented with which the MNIST QNN runs 7 times faster than with an unoptimized GPU kernel, without any loss in classification accuracy.
Abstract: We introduce a method to train Quantized Neural Networks (QNNs), neural networks with extremely low-precision (e.g., 1-bit) weights and activations at run-time. At train time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6 bits as well, which enables gradient computation using only bit-wise operations. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved accuracy comparable to their 32-bit counterparts using only 4 bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
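A minimal sketch of uniform quantization to a given bit width, in the spirit of the low-precision activations described above; the exact quantization rule and value ranges used in the paper may differ, so treat this as illustrative only.

```python
import numpy as np

def quantize(x, bits):
    # Uniform quantization of values in [0, 1] onto 2**bits evenly spaced
    # levels; bits=1 gives two levels, bits=2 gives four.
    levels = 2 ** bits - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels

print(quantize(np.linspace(0.0, 1.0, 9), 2))   # 2-bit activations -> 4 levels
```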

1,232 citations


Cites background from "1.1 Computing's energy problem (and..."

  • ...Horowitz (2014) provides rough numbers for the energy consumed by the computation (the given numbers are for 45nm technology), as summarized in Tables 5 and 6....

  • ...Over the last decade, power has been the main constraint on performance (Horowitz, 2014)....

Journal ArticleDOI
01 Jun 2018
TL;DR: This Review Article examines the development of in-memory computing using resistive switching devices, where the two-terminal structure of the devices, their resistive switching properties, and direct data processing in the memory can enable area- and energy-efficient computation.
Abstract: Modern computers are based on the von Neumann architecture in which computation and storage are physically separated: data are fetched from the memory unit, shuttled to the processing unit (where computation takes place) and then shuttled back to the memory unit to be stored. The rate at which data can be transferred between the processing unit and the memory unit represents a fundamental limitation of modern computers, known as the memory wall. In-memory computing is an approach that attempts to address this issue by designing systems that compute within the memory, thus eliminating the energy-intensive and time-consuming data movement that plagues current designs. Here we review the development of in-memory computing using resistive switching devices, where the two-terminal structure of the devices, their resistive switching properties, and direct data processing in the memory can enable area- and energy-efficient computation. We examine the different digital, analogue, and stochastic computing schemes that have been proposed, and explore the microscopic physical mechanisms involved. Finally, we discuss the challenges in-memory computing faces, including the required scaling characteristics, in delivering next-generation computing. This Review Article examines the development of in-memory computing using resistive switching devices.
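The core idea of analogue in-memory computing with resistive devices can be stated in a few lines: program each conductance to a matrix entry, drive the columns with voltages, and read the row currents. The idealized sketch below assumes perfectly linear devices and ignores wire resistance, noise, and other non-idealities; the conductance and voltage ranges are assumptions, not values from the article.

```python
import numpy as np

# Idealized resistive crossbar: each cross-point stores a conductance G[i, j].
# Driving column j with voltage V[j] and summing the current into row i gives
# I = G @ V in a single step: Ohm's law per device, Kirchhoff's current law
# per row. This is the analogue matrix-vector multiply that in-memory
# computing exploits.
rng = np.random.default_rng(1)
G = rng.uniform(1e-6, 1e-4, size=(4, 8))   # conductances in siemens (assumed range)
V = rng.uniform(0.0, 0.2, size=8)          # read voltages in volts (assumed range)
I = G @ V                                  # row currents in amperes
print(I)
```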

1,193 citations

References
Journal ArticleDOI
TL;DR: This paper considers the design, fabrication, and characterization of very small MOSFET switching devices suitable for digital integrated circuits, using dimensions of the order of 1 μm.
Abstract: This paper considers the design, fabrication, and characterization of very small MOSFET switching devices suitable for digital integrated circuits, using dimensions of the order of 1 μm. Scaling relationships are presented which show how a conventional MOSFET can be reduced in size. An improved small device structure is presented that uses ion implantation to provide shallow source and drain regions and a nonuniform substrate doping profile. One-dimensional models are used to predict the substrate doping profile and the corresponding threshold voltage versus source voltage characteristic. A two-dimensional current transport model is used to predict the relative degree of short-channel effects for different device parameter combinations. Polysilicon-gate MOSFETs with channel lengths as short as 0.5 μm were fabricated, and the device characteristics measured and compared with predicted values. The performance improvement expected from using these very small devices in highly miniaturized integrated circuits is projected.
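The scaling relationships referred to above are the constant-field (Dennard) rules; the small sketch below lists the textbook first-order scaling factors. These are the standard relations, not values taken from the paper itself.

```python
def constant_field_scaling(kappa):
    """First-order constant-field (Dennard) scaling by a factor kappa > 1:
    dimensions and voltage shrink by 1/kappa, gate delay improves by about
    1/kappa, power per circuit drops by about 1/kappa**2, and power density
    stays constant."""
    return {
        "length / width / oxide thickness": 1 / kappa,
        "supply voltage": 1 / kappa,
        "gate delay": 1 / kappa,
        "power per circuit": 1 / kappa ** 2,
        "circuit density": kappa ** 2,
        "power density": 1.0,
    }

for quantity, factor in constant_field_scaling(2.0).items():
    print(f"{quantity}: x{factor:g}")
```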

3,008 citations

Book
30 Jun 1995
TL;DR: This book surveys low-power digital design, from the hierarchy of limits of power (J.D. Meindl) and the sources of power consumption to voltage scaling approaches, adiabatic switching, DC power supply design (coauthored with A.J. Stratakos et al.), and low-power programmable computation (coauthored with M.B. Srivastava).
Abstract: 1. Introduction. 2. Hierarchy of Limits of Power J.D. Meindl. 3. Sources of Power Consumption. 4. Voltage Scaling Approaches. 5. DC Power Supply Design in Portable Systems coauthored with A.J. Stratakos, et al. 6. Adiabatic Switching L. Svensson. 7. Minimizing Switched Capacitance. 8. Computer Aided Design Tools. 9. A Portable Multimedia Terminal. 10. Low Power Programmable Computation coauthored with M.B. Srivastava. 11. Conclusions. Subject Index.
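The voltage-scaling chapters listed above rest on the first-order switching-power model P = alpha * C * Vdd^2 * f; the short sketch below, with made-up example values, shows the quadratic benefit of lowering the supply voltage.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Switching power of CMOS logic, P = alpha * C * Vdd**2 * f: the
    first-order model behind voltage-scaling approaches. The example
    values below are made up."""
    return alpha * c_load * vdd ** 2 * freq

# Halving Vdd cuts switching power 4x at the same frequency and capacitance.
print(dynamic_power(alpha=0.1, c_load=1e-9, vdd=1.0, freq=1e9))   # 0.1 W
print(dynamic_power(alpha=0.1, c_load=1e-9, vdd=0.5, freq=1e9))   # 0.025 W
```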

1,024 citations


"1.1 Computing's energy problem (and..." refers background in this paper

  • ...had been known for decades [7] that if the application was parallel, a parallel...

01 Jan 1974

856 citations


"1.1 Computing's energy problem (and..." refers background in this paper

  • ...Section 2 quickly reviews how computing became power limited, even before Dennard constant-field scaling [3] broke down, and explains the difficulties of using a technology change to fix our problems....

Journal ArticleDOI
09 Jun 2012
TL;DR: This work architects server memory systems using mobile DRAM devices, trading peak bandwidth for lower energy consumption per bit and more efficient idle modes, and demonstrates 3-5× lower memory power, better proportionality, and negligible performance penalties for data-center workloads.
Abstract: To increase datacenter energy efficiency, we need memory systems that keep pace with processor efficiency gains. Currently, servers use DDR3 memory, which is designed for high bandwidth but not for energy proportionality. A system using 20% of the peak DDR3 bandwidth consumes 2.3x the energy per bit compared to the energy consumed by a system with fully utilized memory bandwidth. Nevertheless, many datacenter applications stress memory capacity and latency but not memory bandwidth. In response, we architect server memory systems using mobile DRAM devices, trading peak bandwidth for lower energy consumption per bit and more efficient idle modes. We demonstrate 3-5x lower memory power, better proportionality, and negligible performance penalties for datacenter workloads.
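The non-proportionality described above comes from the fixed background power that a DRAM system pays regardless of traffic; the sketch below models energy per bit as a function of bandwidth utilization. The power and bandwidth figures are illustrative assumptions, not values from the paper.

```python
def energy_per_bit(static_power_w, active_power_w, peak_bw_bits, utilization):
    # Energy per transferred bit when a fixed static power is paid no matter
    # how little bandwidth is used; at low utilization the static term
    # dominates, which is the effect behind the 2.3x figure in the abstract.
    throughput = peak_bw_bits * utilization
    return (static_power_w + active_power_w * utilization) / throughput

peak_bw_bits = 8 * 12.8e9                  # one 12.8 GB/s channel, in bits/s (assumed)
for util in (1.0, 0.5, 0.2):
    e_pj = energy_per_bit(1.5, 4.0, peak_bw_bits, util) * 1e12
    print(f"{util:.0%} utilization: {e_pj:.1f} pJ/bit")
```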

249 citations


"1.1 Computing's energy problem (and..." refers background in this paper

  • ...Given that the energy/operation is now critical, it is worth revisiting many of the cache design strategies with a view to minimizing the average memory access energy (AMAE) [8]....

  • ...AMAE minimization must also take into account the DRAM energy of the system....

  • ...Part of this high cost comes from the very energy-inefficient I/O that DRAM systems use [8], which not only takes over 20pJ/bit, but also require static power to keep the I/O active....

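The excerpts above motivate minimizing average memory access energy (AMAE) rather than only average memory access time; a minimal sketch of the analogous recurrence follows. The per-level energies and miss rates are placeholders, not figures from the paper.

```python
def amae(hit_energy_pj, miss_rate, next_level_energy_pj):
    # Average memory access energy for one level of the hierarchy, by analogy
    # with the classic AMAT recurrence:
    #   AMAE = E_hit + miss_rate * AMAE(next level)
    return hit_energy_pj + miss_rate * next_level_energy_pj

dram = 640.0                                                              # pJ per access (placeholder)
l2 = amae(hit_energy_pj=20.0, miss_rate=0.2, next_level_energy_pj=dram)  # 148 pJ
l1 = amae(hit_energy_pj=2.0, miss_rate=0.1, next_level_energy_pj=l2)     # 16.8 pJ
print(f"L1-rooted AMAE: {l1:.1f} pJ (L2-rooted: {l2:.1f} pJ)")
```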