Journal ArticleDOI

On-Chip Sparse Learning Acceleration With CMOS and Resistive Synaptic Devices

TLDR
This paper co-optimizes algorithm, architecture, circuit, and device for real-time, energy-efficient on-chip hardware acceleration of sparse coding, and shows that a 65 nm implementation of the CMOS ASIC and PARCA schemes accelerates sparse coding computation by 394× and 2140×, respectively, compared to software running on an eight-core CPU.
Abstract
Many recent advances in sparse coding have led to its wide adoption in signal processing, pattern classification, and object recognition applications. Even with state-of-the-art algorithms and CPU/GPU hardware platforms, solving a sparse coding problem still requires expensive computation, making real-time large-scale learning very challenging. In this paper, we co-optimize algorithm, architecture, circuit, and device for real-time, energy-efficient on-chip hardware acceleration of sparse coding. The principle of hardware acceleration is to exploit the properties of the learning algorithms, which involve many parallelizable data fetches and matrix/vector multiplications and additions. Today's von Neumann architecture, however, is not suitable for such parallelization, because the separation of memory and the computing unit makes sequential operation inevitable. This principle drives both the selection of algorithms and the design evolution from CPU to CMOS application-specific integrated circuits (ASIC) to the parallel architecture with resistive crosspoint array (PARCA) that we propose. The CMOS ASIC scheme implements sparse coding with SRAM dictionaries and all-digital circuits, while PARCA employs resistive-RAM dictionaries with special read and write circuits. We show that a 65 nm implementation of the CMOS ASIC and PARCA schemes accelerates sparse coding computation by 394× and 2140×, respectively, compared to software running on an eight-core CPU. Simulated power for both hardware schemes lies in the milliwatt range, making them viable for portable single-chip learning applications.
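To make the computational bottleneck concrete, the sketch below shows a plain ISTA-style sparse coding loop in NumPy. It is an illustrative stand-in, not the exact on-chip algorithm of the paper; the dictionary matrix-vector products inside the loop are the parallel operations that the CMOS ASIC and PARCA schemes are designed to accelerate.

```python
# Minimal ISTA-style sparse coding sketch (illustrative only; the paper's
# on-chip algorithm and parameters may differ). Solves
#   min_z 0.5*||x - D z||^2 + lam*||z||_1
import numpy as np

def ista_sparse_code(D, x, lam=0.1, n_iter=100):
    """D: (m, k) dictionary, x: (m,) input signal. Returns sparse code z."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        # The matrix-vector products below dominate the runtime; these are
        # the operations a crosspoint array or parallel ASIC can accelerate.
        grad = D.T @ (D @ z - x)
        z = z - grad / L
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.standard_normal((64, 256))
    D /= np.linalg.norm(D, axis=0)          # unit-norm dictionary atoms
    x = D @ (rng.standard_normal(256) * (rng.random(256) < 0.05))
    z = ista_sparse_code(D, x)
    print("nonzeros:", np.count_nonzero(np.abs(z) > 1e-6))
```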


Citations
Journal ArticleDOI

Memory devices and applications for in-memory computing

TL;DR: This review provides an overview of memory devices and the key computational primitives they enable, as well as their applications spanning scientific computing, signal processing, optimization, machine learning, deep learning, and stochastic computing.
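The central computational primitive behind such in-memory computing is an analog matrix-vector multiply performed directly in the memory array: conductances encode the matrix, input voltages drive the rows, and column currents sum the products. The following idealized NumPy model is a behavioral sketch only; it ignores device noise, wire resistance, and ADC quantization, and the conductance and voltage values are hypothetical.

```python
# Behavioral sketch of an analog in-memory matrix-vector multiply
# (idealized: no device noise, wire resistance, or ADC quantization).
import numpy as np

def crossbar_mvm(G, v):
    """G: (rows, cols) conductances in siemens; v: (rows,) input voltages.
    Each column current is I_j = sum_i G[i, j] * v[i] (Ohm's and Kirchhoff's
    laws), so the full product is computed in one parallel read step."""
    return G.T @ v   # column currents, shape (cols,)

# Example: map a small weight matrix onto conductances and "read" it.
rng = np.random.default_rng(1)
W = rng.uniform(0.0, 1.0, size=(4, 3))      # nonnegative weights for simplicity
g_max = 100e-6                              # hypothetical max device conductance
G = W * g_max
v = np.array([0.2, 0.1, 0.0, 0.3])          # input voltages (V)
print("column currents (A):", crossbar_mvm(G, v))
```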
Journal ArticleDOI

Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices: Design Considerations.

TL;DR: A concept of resistive processing unit (RPU) devices is proposed that can potentially accelerate DNN training by orders of magnitude while using much less power, enabling Big Data problems with trillions of parameters that are impossible to address today.
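The speedup in the RPU concept comes from performing all three backpropagation cycles of a fully connected layer in place on the crosspoint array. The sketch below is a minimal, idealized NumPy rendering of those three cycles (forward read, transpose read, rank-1 outer-product update), not the stochastic-pulse implementation described in the paper; the error signal is a placeholder.

```python
# Sketch of the three backprop cycles that an RPU crosspoint array performs
# in place on a single fully connected layer (idealized, no device effects).
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((128, 64)) * 0.1    # weights stored as conductances
x = rng.standard_normal(64)                 # layer input
lr = 0.01

# 1) Forward cycle: parallel matrix-vector multiply on the array.
y = W @ x
# 2) Backward cycle: transpose read of the same array.
delta = rng.standard_normal(128)            # hypothetical upstream error signal
dx = W.T @ delta
# 3) Update cycle: rank-1 outer-product update applied in parallel,
#    without reading the weights out of the array.
W -= lr * np.outer(delta, x)
```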
Journal ArticleDOI

Recent advances in convolutional neural network acceleration

TL;DR: In this paper, the authors present a taxonomy of CNN acceleration methods at three levels, i.e., structure level, algorithm level, and implementation level, covering CNN architecture compression, algorithm optimization, and hardware-based improvement.
Journal ArticleDOI

Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices.

TL;DR: In this article, the authors extend the concept of resistive processing unit (RPU) devices to convolutional neural networks (CNNs) and show how to map the CNN layers to fully connected RPU arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm.
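Mapping a convolutional layer onto a fully connected crosspoint array amounts to unrolling input patches into columns so that the convolution becomes a single matrix multiply (the standard im2col transformation). The sketch below illustrates that mapping; it is not necessarily the authors' exact scheme, and the tensor sizes are arbitrary.

```python
# im2col sketch: unroll a convolution into one matrix multiply so it can run
# on a fully connected crosspoint array (illustrative; the paper's exact
# mapping may differ).
import numpy as np

def im2col(x, k):
    """x: (C, H, W) input; k: kernel size. Returns (C*k*k, H_out*W_out)."""
    C, H, W = x.shape
    H_out, W_out = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, H_out * W_out))
    idx = 0
    for i in range(H_out):
        for j in range(W_out):
            cols[:, idx] = x[:, i:i + k, j:j + k].ravel()
            idx += 1
    return cols

rng = np.random.default_rng(3)
x = rng.standard_normal((3, 8, 8))            # input feature map
kernels = rng.standard_normal((16, 3, 5, 5))  # 16 output channels, 5x5 kernels
Wmat = kernels.reshape(16, -1)                # (16, 75): rows of the array
cols = im2col(x, 5)                           # (75, 16): unrolled patches
out = (Wmat @ cols).reshape(16, 4, 4)         # convolution as one matmul
```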
References
Journal ArticleDOI

Gradient-based learning applied to document recognition

TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition, which can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters.
Journal ArticleDOI

LIBSVM: A library for support vector machines

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
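As a minimal usage sketch: scikit-learn's SVC estimator is built on LIBSVM, so the following exercises the same solver, including the probability estimates discussed in the reference. The dataset and parameters here are illustrative only.

```python
# Minimal usage sketch of LIBSVM via scikit-learn's SVC wrapper
# (illustrative dataset and hyperparameters).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(C=1.0, kernel="rbf", probability=True)  # probability estimates via LIBSVM
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```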
Journal ArticleDOI

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

TL;DR: A new fast iterative shrinkage-thresholding algorithm (FISTA) is presented that preserves the computational simplicity of ISTA but has a global rate of convergence proven to be significantly better, both theoretically and practically.
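A minimal NumPy sketch of the FISTA iteration for the l1-regularized least-squares problem is given below; it differs from the plain ISTA loop shown earlier only by the momentum step, and the step size is taken from the spectral norm of the dictionary (illustrative, not tuned).

```python
# Minimal FISTA sketch for min_z 0.5*||x - D z||^2 + lam*||z||_1
# (illustrative; step size from the spectral norm of D).
import numpy as np

def fista(D, x, lam=0.1, n_iter=100):
    L = np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    y = z.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)
        z_new = y - grad / L
        z_new = np.sign(z_new) * np.maximum(np.abs(z_new) - lam / L, 0.0)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_new + ((t - 1.0) / t_new) * (z_new - z)   # momentum step
        z, t = z_new, t_new
    return z
```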
Journal ArticleDOI

K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation

TL;DR: A novel algorithm for adapting dictionaries to achieve sparse signal representations is presented: the K-SVD algorithm, an iterative method that alternates between sparse coding of the examples based on the current dictionary and updating the dictionary atoms to better fit the data.
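The alternation described above can be sketched compactly in NumPy. The version below substitutes a simple thresholded least-squares step for the usual OMP sparse coding stage, so it is illustrative rather than a faithful K-SVD implementation.

```python
# Compact K-SVD-style sketch (illustrative; production code would use OMP for
# the sparse coding step and handle unused atoms more carefully).
import numpy as np

def ksvd(X, k, n_nonzero=5, n_iter=10, seed=4):
    """X: (m, n) training signals as columns; k: number of dictionary atoms."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((m, k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # Sparse coding step: keep the n_nonzero largest-correlation atoms per
        # signal and solve a small least-squares problem on that support.
        Z = np.zeros((k, n))
        for j in range(n):
            support = np.argsort(-np.abs(D.T @ X[:, j]))[:n_nonzero]
            Z[support, j], *_ = np.linalg.lstsq(D[:, support], X[:, j], rcond=None)
        # Dictionary update step: refit each atom (and its coefficients) with a
        # rank-1 SVD of the residual restricted to signals that use the atom.
        for a in range(k):
            users = np.nonzero(Z[a, :])[0]
            if users.size == 0:
                continue
            E = X[:, users] - D @ Z[:, users] + np.outer(D[:, a], Z[a, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, a] = U[:, 0]
            Z[a, users] = S[0] * Vt[0, :]
    return D, Z
```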