Author

Arash Ardakani

Other affiliations: Sharif University of Technology
Bio: Arash Ardakani is an academic researcher from McGill University. The author has contributed to research on topics including stochastic computing and artificial neural networks. The author has an h-index of 11 and has co-authored 28 publications receiving 501 citations. Previous affiliations of Arash Ardakani include Sharif University of Technology.

Papers
Journal ArticleDOI
TL;DR: The proposed architecture uses integer stochastic streams and a modified finite-state-machine-based tanh function to improve the performance and reduce the latency compared to existing stochastic architectures for DNNs.
Abstract: The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention: many applications in fact require high-speed operations that suit a hardware implementation. However, numerous elements and complex interconnections are usually required, leading to a large area occupation and copious power consumption. Stochastic computing (SC) has shown promising results for low-power, area-efficient hardware implementations, even though existing stochastic algorithms require long streams that cause long latencies. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral SC. The proposed architecture has been implemented on a Virtex-7 field-programmable gate array, resulting in 45% and 62% average reductions in area and latency compared with the best reported architecture in the literature. We also synthesize the circuits in a 65-nm CMOS technology, and we show that the proposed integral stochastic architecture results in up to 21% reduction in energy consumption compared with the binary radix implementation at the same misclassification rate. Due to the fault-tolerant nature of stochastic architectures, we also consider a quasi-synchronous implementation that yields a 33% reduction in energy consumption with respect to the binary radix implementation without any compromise on performance.
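
The integer-stream idea can be illustrated with a short simulation. The sketch below is not the paper's hardware, just a minimal NumPy model of integral SC: a value in [0, m] is carried by a stream whose elements are sums of m independent binary stochastic bits, so addition becomes exact element-wise addition and element-wise products of independent streams multiply the represented values. Stream lengths and values are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def integer_stream(value, m, length):
    """Integer stochastic stream for a value in [0, m]: each element is
    the sum of m independent Bernoulli(value/m) bits, so the stream's
    mean converges to `value` as the stream grows."""
    bits = rng.random((m, length)) < (value / m)
    return bits.sum(axis=0)

length = 1 << 14
a = integer_stream(1.6, m=2, length=length)  # integer stream, range [0, 2]
b = integer_stream(0.7, m=1, length=length)  # ordinary binary stream

# Addition is exact element-wise addition (no scaled adder, no precision
# loss), and the element-wise product of independent streams multiplies
# the represented values.
print((a + b).mean())  # ~2.3
print((a * b).mean())  # ~1.12
```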

178 citations

Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper proposes an integer form of stochastic computation and introduces some elementary circuits and proposes an efficient implementation of a DNN based on integral SC, and considers a quasi-synchronous implementation that yields 33% reduction in energy consumption with respect to the binary radix implementation without any compromise on performance.
Abstract: The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention since many applications require high-speed operations. However, numerous processing elements and complex interconnections are usually required, leading to a large area occupation and a high power consumption. Stochastic computing has shown promising results for area-efficient hardware implementations, even though existing stochastic algorithms require long streams that exhibit long latency. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture uses integer stochastic streams and a modified finite-state-machine-based tanh function to improve the performance and reduce the latency compared to existing stochastic architectures for DNNs. The simulation results show negligible performance loss of the proposed integer stochastic DNN for different network sizes compared to their floating-point versions.
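
The FSM-based tanh mentioned in the abstract builds on the classic saturating up/down counter construction (Brown and Card's Stanh). The sketch below models that baseline binary-stream version; the paper's modified FSM for integer streams is not reproduced here, and the state count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def fsm_tanh(bits, n_states=16):
    """Saturating up/down counter: a 1 bit moves the state up, a 0 bit
    moves it down, and the output bit is 1 in the upper half of the
    states. In bipolar coding this approximates tanh((n_states/2) * x)."""
    state, half = n_states // 2, n_states // 2
    out = np.empty(bits.size, dtype=np.int8)
    for i, b in enumerate(bits):
        state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        out[i] = 1 if state >= half else 0
    return out

x = 0.25                                    # bipolar value in [-1, 1]
bits = rng.random(1 << 16) < (x + 1) / 2    # encode: P(1) = (x + 1) / 2
y = 2 * fsm_tanh(bits).mean() - 1           # decode the bipolar output
print(y, np.tanh(8 * x))                    # FSM output vs. tanh(8x)
```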

129 citations

Journal ArticleDOI
TL;DR: This paper proposes an efficient computational method, inspired by the computational core of fully connected neural networks, to process convolutional layers of state-of-the-art deep CNNs within strict latency requirements, and implements it customized for VGG and VGG-based networks, which have shown state-of-the-art performance on different classification/recognition data sets.
Abstract: In the past few years, the demand for real-time hardware implementations of deep neural networks (DNNs), especially convolutional neural networks (CNNs), has dramatically increased, thanks to their excellent performance on a wide range of recognition and classification tasks. When considering real-time action recognition and video/image classification systems, latency is of paramount importance. Therefore, applications strive to maximize the accuracy while keeping the latency under a given application-specific maximum: in most cases, this threshold cannot exceed a few hundred milliseconds. Until now, the research on DNNs has mainly focused on achieving a better classification or recognition accuracy, whereas very few works in the literature take into account the computational complexity of the model. In this paper, we propose an efficient computational method, which is inspired by the computational core of fully connected neural networks, to process convolutional layers of state-of-the-art deep CNNs within strict latency requirements. To this end, we implemented our method customized for VGG and VGG-based networks, which have shown state-of-the-art performance on different classification/recognition data sets. The implementation results in 65-nm CMOS technology show that the proposed accelerator can process convolutional layers of VGGNet up to 9.5 times faster than state-of-the-art accelerators reported to date while occupying 3.5 mm².
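
One common way to process a convolutional layer with the multiply-accumulate core of a fully connected layer is the im2col lowering, where each receptive field becomes a column of a matrix and the convolution reduces to one matrix product. The sketch below illustrates that mapping only; it is not necessarily the accelerator's actual dataflow, and all shapes are arbitrary.

```python
import numpy as np

def im2col_conv(x, w):
    """Lower a convolution to one matrix multiply: every receptive field
    of x becomes a column, so the layer runs on the same multiply-
    accumulate core as a fully connected layer.
    x: (C, H, W) input, w: (K, C, R, S) filters, stride 1, no padding."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + R, j:j + S].ravel()
    return (w.reshape(K, -1) @ cols).reshape(K, Ho, Wo)

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
print(im2col_conv(x, w).shape)  # (4, 6, 6)
```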

93 citations

Journal ArticleDOI
TL;DR: ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representation, where the performance efficiency is the ratio between the average runtime performance and the peak performance.
Abstract: Convolutional neural networks (CNNs) are a vital approach in machine learning. However, their high complexity and energy consumption make them challenging to embed in edge devices that require real-time processing, such as smartphones. In order to meet the real-time constraint of edge devices, recently proposed custom hardware CNN accelerators have exploited parallel processing elements (PEs) to increase throughput. However, this straightforward parallelization of PEs demands high memory bandwidth and data movement, leading to large energy consumption. As a result, only a certain number of PEs can be instantiated when designing bandwidth-limited custom accelerators targeting edge devices. While most bandwidth-limited designs claim a peak performance of a few hundred giga-operations per second, their average runtime performance is substantially lower than their roofline when applied to state-of-the-art CNNs such as AlexNet, VGGNet and ResNet, as a result of low resource utilization and arithmetic intensity. In this work, we propose a zero-activation-skipping convolutional accelerator (ZASCA) that avoids noncontributory multiplications with zero-valued activations. ZASCA employs a dataflow that minimizes the gap between its average and peak performances while maximizing its arithmetic intensity for both sparse and dense representations of activations, targeting the bandwidth-limited edge computing scenario. More precisely, ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representation, where the performance efficiency is the ratio between the average runtime performance and the peak performance. Using its zero-skipping feature, ZASCA can further improve the performance efficiency of the state-of-the-art CNNs by up to 1.9×, depending on the sparsity degree of activations. The implementation results in 65-nm TSMC CMOS technology show that, compared to the most energy-efficient accelerator, ZASCA can process convolutions from 5.5× to 17.5× faster, and is between 2.1× and 4.5× more energy efficient while occupying 2.1× less silicon area.
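
The effect of zero-activation skipping can be modeled in a few lines: multiplications are issued only for nonzero activations, so the MAC count scales with activation density while the result is unchanged. This is a functional model of the idea, not ZASCA's dataflow; layer sizes are arbitrary.

```python
import numpy as np

def zero_skipping_fc(a, W):
    """Functional model of zero-activation skipping for one layer:
    only nonzero activations are fetched and multiplied, so the number
    of MACs issued scales with activation density, not layer size."""
    nz = np.flatnonzero(a)          # positions of nonzero activations
    macs = nz.size * W.shape[1]     # MACs actually performed
    return a[nz] @ W[nz, :], macs   # numerically identical to a @ W

rng = np.random.default_rng(3)
a = np.maximum(rng.standard_normal(1024), 0)  # ReLU output, ~50% zeros
W = rng.standard_normal((1024, 256))
y, macs = zero_skipping_fc(a, W)
print(np.allclose(y, a @ W), macs / (a.size * W.shape[1]))  # True, ~0.5
```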

36 citations

Proceedings Article
03 Nov 2016
TL;DR: In this article, the authors propose sparsely-connected networks, showing that the number of connections in fully-connected neural networks can be reduced by up to 90% while improving accuracy on three popular datasets (MNIST, CIFAR10 and SVHN).
Abstract: Recently, deep neural networks have received considerable attention due to their ability to extract and represent high-level abstractions in data sets. Deep neural networks such as fully-connected and convolutional neural networks have shown excellent performance on a wide range of recognition and classification tasks. However, their hardware implementations currently suffer from large silicon area and high power consumption due to their high degree of complexity. The power/energy consumption of neural networks is dominated by memory accesses, the majority of which occur in fully-connected networks; in fact, they contain most of the deep neural network parameters. In this paper, we propose sparsely-connected networks, showing that the number of connections in fully-connected networks can be reduced by up to 90% while improving accuracy on three popular datasets (MNIST, CIFAR10 and SVHN). We then propose an efficient hardware architecture based on linear-feedback shift registers to reduce the memory requirements of the proposed sparsely-connected networks. The proposed architecture can save up to 90% of memory compared to conventional implementations of fully-connected neural networks. Moreover, implementation results show up to an 84% reduction in the energy consumption of a single neuron of the proposed sparsely-connected networks compared to a single neuron of fully-connected neural networks.
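
The LFSR-based idea is that a pseudorandom connection pattern can be regenerated from a small seed instead of being stored. The sketch below uses a standard 16-bit Fibonacci LFSR and an illustrative thresholding rule to derive a 90%-sparse mask; the paper's exact mapping from LFSR output to connections may differ.

```python
import numpy as np

def lfsr_states(n, seed=0xACE1, taps=(16, 14, 13, 11)):
    """Maximal-length 16-bit Fibonacci LFSR. Because the sequence is
    fully determined by the (nonzero) seed, the connection pattern it
    induces never has to be stored in memory."""
    state, out = seed, []
    for _ in range(n):
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & 0xFFFF
        out.append(state)
    return np.array(out)

in_dim, out_dim, density = 512, 256, 0.1
# Keep a connection wherever the LFSR value falls below the density
# threshold (an illustrative mapping, not necessarily the paper's).
mask = lfsr_states(in_dim * out_dim) < density * 0xFFFF
mask = mask.reshape(in_dim, out_dim)
print(mask.mean())  # ~0.1, i.e. ~90% of connections removed
```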

28 citations


Cited by
Journal ArticleDOI
20 Mar 2020
TL;DR: This article reviews the mainstream compression approaches such as compact model, tensor decomposition, data quantization, and network sparsification, answers the question of how to leverage these methods in the design of neural network accelerators, and presents the state-of-the-art hardware architectures.
Abstract: Domain-specific hardware is becoming a promising topic against the backdrop of the improvement slowdown of general-purpose processors due to the foreseeable end of Moore's Law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain, witnessing successful applications in a wide spectrum of artificial intelligence (AI) tasks. The incomparable accuracy of DNNs is achieved at the cost of hungry memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. Therefore, the DNN compression concept was naturally proposed and widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up to pursue a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators for gaining extremely high performance. However, the number of related works is huge and the reported approaches are quite divergent. This research chaos motivates us to provide a comprehensive survey of the recent advances toward the goal of efficient compression and execution of DNNs without significantly compromising accuracy, involving both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches such as compact model, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint-way use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. In the end, we discuss several existing issues such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and give promising topics in this field and the possible challenges as well. This article attempts to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate various methods, and confidently get started in the right way.
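
As a concrete instance of one surveyed approach, data quantization, the sketch below applies uniform symmetric post-training quantization, mapping floating-point weights to 8-bit integers plus a single scale factor. It is a generic textbook scheme, not a method proposed in the article.

```python
import numpy as np

def quantize_int8(w):
    """Uniform symmetric post-training quantization: map float weights
    to 8-bit integers plus one floating-point scale factor."""
    scale = np.abs(w).max() / 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(4)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
# Worst-case rounding error is about scale / 2; storage drops 4x.
print(np.abs(w - q.astype(np.float32) * scale).max(), scale / 2)
```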

499 citations

Journal ArticleDOI
TL;DR: A survey of various techniques suggested for compressing and accelerating ML and DL models is presented; the challenges of the existing techniques are discussed and future research directions in the field are provided.
Abstract: In recent years, machine learning (ML) and deep learning (DL) have shown remarkable improvement in computer vision, natural language processing, stock prediction, forecasting, and audio processing, to name a few. The size of a trained DL model is large for these complex tasks, which makes it difficult to deploy on resource-constrained devices. For instance, the size of the pre-trained VGG16 model trained on the ImageNet dataset is more than 500 MB. Resource-constrained devices such as mobile phones and internet-of-things devices have limited memory and less computation power. For real-time applications, the trained models should be deployed on resource-constrained devices. Popular convolutional neural network models have millions of parameters, which leads to an increase in the size of the trained model. Hence, it becomes essential to compress and accelerate these models before deploying them on resource-constrained devices, while compromising model accuracy as little as possible. It is a challenging task to retain the same accuracy after compressing the model. To address this challenge, in the last couple of years many researchers have suggested different techniques for model compression and acceleration. In this paper, we present a survey of various techniques suggested for compressing and accelerating ML and DL models. We also discuss the challenges of the existing techniques and provide future research directions in the field.
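
A representative compression technique from such surveys is magnitude-based pruning, sketched below: the smallest-magnitude weights are zeroed to a target sparsity, after which the model is typically fine-tuned to recover accuracy. The function and sparsity level are illustrative, not taken from the paper.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Magnitude-based pruning: zero out the `sparsity` fraction of
    weights with the smallest absolute value."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

rng = np.random.default_rng(5)
w = rng.standard_normal((512, 512))
print((magnitude_prune(w) == 0).mean())  # ~0.9 of weights removed
```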

221 citations

Journal ArticleDOI
TL;DR: The struggles of designing a family of polar codes able to satisfy the demands of 5G systems are illustrated, with particular attention to rate flexibility and low decoding latency.
Abstract: Polar codes have attracted the attention of academia and industry alike in the past decade, such that the 5th generation wireless systems (5G) standardization process of the 3rd Generation Partnership Project (3GPP) chose polar codes as a channel coding scheme. In this tutorial, we provide a description of the encoding process of polar codes adopted by the 5G standard. We illustrate the struggles of designing a family of polar codes able to satisfy the demands of 5G systems, with particular attention to rate flexibility and low decoding latency. The result of these efforts is an elaborate framework that applies novel coding techniques to provide a solid channel code for NR requirements.
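
The encoding step described in the tutorial admits a compact expression: the codeword is the input vector (message bits on reliable positions, zeros on frozen positions) multiplied over GF(2) by the n-fold Kronecker power of the 2x2 polar kernel. The sketch below implements that product; the reliable positions used are illustrative, not the 5G reliability sequence.

```python
import numpy as np

def polar_encode(u):
    """Polar encoding: multiply the length-N input vector by the n-fold
    Kronecker power of the kernel F = [[1, 0], [1, 1]] over GF(2)."""
    F = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    G = F
    while G.shape[0] < len(u):
        G = np.kron(G, F)
    return (u @ G) % 2

# N = 8, K = 4: message bits go on reliable positions, frozen bits are 0.
u = np.zeros(8, dtype=np.uint8)
u[[3, 5, 6, 7]] = [1, 0, 1, 1]  # positions chosen for illustration only
print(polar_encode(u))
```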

197 citations

Journal ArticleDOI
TL;DR: The evolution of SC is discussed, mostly focusing on recent developments, to highlight the main challenges and discuss potential methods of overcoming them.
Abstract: Stochastic computing (SC) is an unconventional method of computation that treats data as probabilities. Typically, each bit of an N-bit stochastic number (SN) X is randomly chosen to be 1 with some probability p_X, and X is generated and processed by conventional logic circuits. For instance, a single AND gate performs multiplication. The value of an SN is measured by the density of 1s in it, an information-coding scheme also found in biological neural systems. SC has uses in massively parallel systems and is very tolerant of soft errors. Its drawbacks include low accuracy, slow processing, and complex design needs. Its ability to efficiently perform tasks like communication decoding and neural network inference has rekindled interest in the field. Many challenges remain to be overcome, however, before SC becomes widespread. In this paper, we discuss the evolution of SC, mostly focusing on recent developments. We highlight the main challenges and discuss potential methods of overcoming them.
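
The AND-gate multiplication mentioned in the abstract follows from P(a AND b) = P(a) * P(b) for independent bitstreams; the short simulation below checks it, with stream length and probabilities chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(6)

def sn(p, length):
    """Unipolar stochastic number: a bitstream whose density of 1s is p."""
    return rng.random(length) < p

# A single AND gate multiplies two independent stochastic numbers,
# since P(a AND b) = P(a) * P(b).
a, b = sn(0.5, 1 << 16), sn(0.6, 1 << 16)
print((a & b).mean())  # ~0.30 = 0.5 * 0.6
```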

174 citations