Topic

Hardware acceleration

About: Hardware acceleration is a research topic. Over its lifetime, 5,730 publications have been published within this topic, receiving 64,621 citations. The topic is also known as: hardware accelerator & acceleration.


Papers
Journal ArticleDOI
18 Jun 2016
TL;DR: In this paper, the authors proposed an energy efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.
Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency, and area efficiency.

2,445 citations
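The core computation EIE accelerates is a sparse matrix-vector product in which each stored weight is only a small index into a shared table of values. Below is a minimal Python sketch of that idea, assuming a CSC-like layout; the names and layout are illustrative, not EIE's actual dataflow or hardware interface.

```python
import numpy as np

def sparse_mv_weight_shared(n_out, col_ptr, row_idx, weight_idx, codebook, x):
    """Compute y = W @ x with W stored compressed.

    col_ptr[j]:col_ptr[j+1] spans the nonzeros of column j; row_idx holds
    their row positions, and weight_idx holds small (e.g. 4-bit) indices
    into the shared codebook. Columns whose input activation is zero
    (typical after ReLU) are skipped entirely, mirroring the activation
    sparsity EIE exploits.
    """
    y = np.zeros(n_out)
    for j, xj in enumerate(x):
        if xj == 0.0:  # skip zero activations
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += codebook[weight_idx[k]] * xj
    return y
```

Storing a 4-bit codebook index instead of a 32-bit float is where the weight-sharing compression comes from; EIE additionally distributes this accumulation across parallel processing elements.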

Journal ArticleDOI
TL;DR: AtomEye is free atomistic visualization software for all major UNIX platforms, based on a newly developed graphics core library of higher quality than the X-window standard, with area-weighted anti-aliasing.
Abstract: AtomEye is free atomistic visualization software for all major UNIX platforms. It is based on a newly developed graphics core library of higher quality than the X-window standard, with area-weighted anti-aliasing. An order-N neighbour list algorithm is used to compute the bond connectivity. The functionalities of AtomEye include: parallel and perspective projections with full three-dimensional navigation; customizable bond and coordination number calculation; colour-encoding of arbitrary user-defined quantities; local atomic strain invariant; coloured atom tiling and tracing; up to 16 cutting planes; periodic boundary condition translations; high-quality JPEG, PNG and EPS screenshots; and animation scripting. The program is efficient compared with OpenGL hardware acceleration because it employs special algorithms that treat spheres (atoms) and cylinders (bonds) as primitive objects rather than as composites of polygons. AtomEye can handle more than one million atoms on a PC with 1 GB of memory. It is a robust, low-cost tool for surveying nanostructures and following their evolution.

1,139 citations
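The order-N neighbour list mentioned above is conventionally built with a cell (binning) method: atoms are hashed into boxes at least as large as the cutoff, so each atom only examines its own and adjacent cells. Below is a generic Python sketch of that technique, assuming an orthorhombic periodic box, coordinates already wrapped into the box, and at least three cells per dimension; it is not AtomEye's source code.

```python
import numpy as np
from collections import defaultdict

def neighbour_list(positions, box, rc):
    """O(N) neighbour search by binning atoms into cells of side >= rc.

    positions: (N, 3) coordinates wrapped into [0, box); box: (3,)
    orthorhombic edge lengths; rc: cutoff radius. Assumes at least
    3 cells per dimension so each cell pair is visited once; distances
    use the minimum-image convention.
    """
    n_cells = np.maximum((box / rc).astype(int), 1)
    cells = defaultdict(list)
    for i, pos in enumerate(positions):
        c = tuple((pos / box * n_cells).astype(int) % n_cells)
        cells[c].append(i)
    pairs = []
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = ((cx + dx) % n_cells[0],
                          (cy + dy) % n_cells[1],
                          (cz + dz) % n_cells[2])
                    for i in members:
                        for j in cells.get(nb, ()):
                            if j <= i:
                                continue  # count each pair once
                            d = positions[i] - positions[j]
                            d -= box * np.round(d / box)  # minimum image
                            if d @ d < rc * rc:
                                pairs.append((i, j))
    return pairs
```

Because each atom checks only a bounded number of neighbouring cells, the cost grows linearly with atom count rather than quadratically, which is what makes million-atom configurations tractable.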

Journal Article
TL;DR: In this paper, an Atmel ATmega128 at 8 MHz was used to implement ECC point multiplication over fields using pseudo-Mersenne primes as standardized by NIST and SECG.
Abstract: Strong public-key cryptography is often considered to be too computationally expensive for small devices if not accelerated by cryptographic hardware. We revisited this statement and implemented elliptic curve point multiplication for 160-bit, 192-bit, and 224-bit NIST/SECG curves over GF(p) and RSA-1024 and RSA-2048 on two 8-bit microcontrollers. To accelerate multiple-precision multiplication, we propose a new algorithm to reduce the number of memory accesses. Implementation and analysis led to three observations: 1. Public-key cryptography is viable on small devices without hardware acceleration. On an Atmel ATmega128 at 8 MHz we measured 0.81 s for 160-bit ECC point multiplication and 0.43 s for an RSA-1024 operation with exponent e = 2^16 + 1. 2. The relative performance advantage of ECC point multiplication over RSA modular exponentiation increases with the decrease in processor word size and the increase in key size. 3. Elliptic curves over fields using pseudo-Mersenne primes as standardized by NIST and SECG allow for high-performance implementations and show no performance disadvantage over optimal extension fields or prime fields selected specifically for a particular processor architecture.

1,113 citations
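The operation being benchmarked, elliptic curve point multiplication, is at its simplest textbook double-and-add over the curve group. The Python sketch below uses affine coordinates and a Fermat inverse for readability; real implementations such as the paper's use far faster field arithmetic (e.g. pseudo-Mersenne reduction) and side-channel-resistant ladders. All names here are illustrative.

```python
def inv_mod(a, p):
    """Modular inverse via Fermat's little theorem (p prime)."""
    return pow(a, p - 2, p)

def ec_add(P, Q, a, p):
    """Affine point addition on y^2 = x^3 + a*x + b over GF(p).
    None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                        # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * inv_mod(2 * y1, p) % p
    else:
        lam = (y2 - y1) * inv_mod(x2 - x1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def point_mul(k, P, a, p):
    """Left-to-right double-and-add: returns k*P."""
    R = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R, a, p)             # double every bit
        if bit == '1':
            R = ec_add(R, P, a, p)         # add on set bits
    return R
```

A 160-bit scalar costs roughly 160 doublings plus about 80 additions on average, so the multiple-precision field multiplication underneath, the paper's actual focus, dominates the runtime.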

Proceedings ArticleDOI
01 Jul 2002
TL;DR: It is demonstrated that ray tracing on graphics hardware could prove to be faster than CPU based implementations as well as competitive with traditional hardware accelerated feed-forward triangle rendering.
Abstract: Recently a breakthrough has occurred in graphics hardware: fixed-function pipelines have been replaced with programmable vertex and fragment processors. In the near future, the graphics pipeline is likely to evolve into a general programmable stream processor capable of more than simply feed-forward triangle rendering. In this paper, we evaluate these trends in programmability of the graphics pipeline and explain how ray tracing can be mapped to graphics hardware. Using our simulator, we analyze the performance of a ray casting implementation on next-generation programmable graphics hardware. In addition, we compare the performance difference between non-branching programmable hardware using a multipass implementation and an architecture that supports branching. We also show how this approach is applicable to other ray tracing algorithms such as Whitted ray tracing, path tracing, and hybrid rendering algorithms. Finally, we demonstrate that ray tracing on graphics hardware could prove to be faster than CPU-based implementations as well as competitive with traditional hardware-accelerated feed-forward triangle rendering.

643 citations
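The streaming formulation the paper describes evaluates the same intersection kernel over many rays, using masks instead of per-ray branches. The NumPy sketch below makes that concrete with a branch-free Möller-Trumbore ray/triangle test batched over rays; it illustrates the programming model only and is not the paper's fragment-program code.

```python
import numpy as np

def ray_triangle(orig, dirs, v0, v1, v2, eps=1e-8):
    """Branch-free Moller-Trumbore test for a batch of rays.

    orig, dirs: (N, 3) ray origins and directions; v0, v1, v2: (3,)
    triangle vertices. Returns hit distances t, with np.inf where a
    ray misses. Every ray is evaluated and the result masked, the way
    a non-branching fragment program would compute it.
    """
    e1, e2 = v1 - v0, v2 - v0
    pvec = np.cross(dirs, e2)                      # (N, 3)
    det = pvec @ e1                                # (N,)
    inv = 1.0 / np.where(np.abs(det) > eps, det, 1.0)
    tvec = orig - v0
    u = np.sum(tvec * pvec, axis=1) * inv
    qvec = np.cross(tvec, e1)
    v = np.sum(dirs * qvec, axis=1) * inv
    t = (qvec @ e2) * inv
    hit = ((np.abs(det) > eps) & (u >= 0) & (v >= 0)
           & (u + v <= 1) & (t > eps))
    return np.where(hit, t, np.inf)
```

Computing u, v, and t unconditionally and selecting with a mask at the end is exactly the trade-off the paper quantifies between multipass non-branching hardware and architectures with real branching.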

Journal ArticleDOI
20 Mar 2020
TL;DR: This article reviews mainstream compression approaches such as compact models, tensor decomposition, data quantization, and network sparsification, and answers the question of how to leverage these methods in the design of neural network accelerators, presenting the state-of-the-art hardware architectures.
Abstract: Domain-specific hardware is becoming a promising topic against the backdrop of the slowdown in improvement of general-purpose processors due to the foreseeable end of Moore's Law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain, with successful applications in a wide spectrum of artificial intelligence (AI) tasks. The incomparable accuracy of DNNs comes at the cost of heavy memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. Therefore, the concept of DNN compression was naturally proposed and is widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up to pursue a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators for gaining extremely high performance. However, the number of related works is enormous and the reported approaches are quite divergent. This research chaos motivates us to provide a comprehensive survey of the recent advances toward efficient compression and execution of DNNs without significantly compromising accuracy, involving both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches, such as compact models, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. In the end, we discuss several open issues, such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and give promising topics in this field as well as the possible challenges. This article attempts to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate the various methods, and confidently get started in the right way.

499 citations
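Two of the compression families the survey covers, network sparsification and data quantization, reduce to a few lines each in their simplest forms (magnitude pruning and uniform per-tensor quantization). The Python sketch below is a toy illustration of those baselines under that assumption, not any specific accelerator's scheme.

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.9):
    """Network sparsification: zero out the smallest-magnitude weights
    so that roughly `sparsity` fraction of entries are removed."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

def quantize_uniform(w, bits=8):
    """Data quantization: map weights to 2^bits evenly spaced levels
    (symmetric, per-tensor), then de-quantize for use in float math."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

# Toy usage: compress a random layer and report the resulting sparsity.
w = np.random.randn(256, 256).astype(np.float32)
w_sparse = prune_by_magnitude(w, sparsity=0.9)
w_q, s = quantize_uniform(w_sparse, bits=8)
print(f"sparsity: {np.mean(w_q == 0):.2%}, scale: {s:.4f}")
```

The survey's central question is then which of these transformations an accelerator can exploit directly, e.g. skipping the zeros or computing in the low-bit integer domain, without retraining away the accuracy loss.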


Network Information
Related Topics (5)

Topic               Papers    Citations    Related
Scalability         50.9K     931.6K       87%
Cache               59.1K     976.6K       86%
CMOS                81.3K     1.1M         83%
Server              79.5K     1.4M         82%
Data compression    43.6K     756.5K       82%
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    116
2022    312
2021    410
2020    524
2019    498
2018    365