Author
Manoj Alwani
Other affiliations: LNM Institute of Information Technology
Bio: Manoj Alwani is an academic researcher from Stony Brook University. The author has contributed to research on topics including wavelet transform and data compression, has an h-index of 5, and has co-authored 11 publications receiving 436 citations. Previous affiliations of Manoj Alwani include the LNM Institute of Information Technology.
Papers
15 Oct 2016
TL;DR: This work identifies a previously unexplored dimension in the design space of CNN accelerators: the dataflow across convolutional layers. By modifying the order in which input data are brought on chip, the processing of multiple CNN layers can be fused, enabling caching of intermediate data between the evaluation of adjacent CNN layers.
Abstract: Deep convolutional neural networks (CNNs) are rapidly becoming the dominant approach to computer vision and a major component of many other pervasive machine learning tasks, such as speech recognition, natural language processing, and fraud detection. As a result, accelerators for efficiently evaluating CNNs are rapidly growing in popularity. The conventional approach to designing such CNN accelerators is to focus on creating accelerators that iteratively process the CNN layers. However, by processing each layer to completion, the accelerator designs must use off-chip memory to store intermediate data between layers, because the intermediate data are too large to fit on chip. In this work, we observe that a previously unexplored dimension exists in the design space of CNN accelerators that focuses on the dataflow across convolutional layers. We find that we are able to fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip, enabling caching of intermediate data between the evaluation of adjacent CNN layers. We demonstrate the effectiveness of our approach by constructing a fused-layer CNN accelerator for the first five convolutional layers of the VGGNet-E network and comparing it to the state-of-the-art accelerator implemented on a Xilinx Virtex-7 FPGA. We find that, by using 362KB of on-chip storage, our fused-layer accelerator minimizes off-chip feature map data transfer, reducing the total transfer by 95%, from 77MB down to 3.6MB per image.
477 citations
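The fused-layer dataflow described in the entry above can be illustrated with a small sketch. The following Python/NumPy code (all names, shapes, and the tile size are illustrative, not taken from the paper or its accelerator) evaluates two stacked convolutions tile by tile, so that each intermediate tile lives only in a local buffer and the full intermediate feature map is never materialized:

import numpy as np

def conv2d(x, w):
    # 'Valid' 2-D convolution (CNN-style cross-correlation):
    # x is (C_in, H, W), w is (C_out, C_in, K, K).
    c_out, _, k, _ = w.shape
    h, wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h, wd))
    for co in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[co, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[co])
    return out

def fused_two_layer(x, w1, w2, tile=8):
    # Evaluate conv2d(conv2d(x, w1), w2) tile by tile: for each output tile of
    # the second layer, read only the input window that feeds it, keep the
    # intermediate tile in a local buffer, and never write out the full
    # intermediate feature map (the fused-layer dataflow).
    k1, k2 = w1.shape[-1], w2.shape[-1]
    h_out = x.shape[1] - k1 - k2 + 2
    w_out = x.shape[2] - k1 - k2 + 2
    out = np.zeros((w2.shape[0], h_out, w_out))
    for ti in range(0, h_out, tile):
        for tj in range(0, w_out, tile):
            th, tw = min(tile, h_out - ti), min(tile, w_out - tj)
            # Input window needed to produce this th x tw output tile.
            win = x[:, ti:ti + th + k1 + k2 - 2, tj:tj + tw + k1 + k2 - 2]
            inter = conv2d(win, w1)                     # stays "on chip"
            out[:, ti:ti + th, tj:tj + tw] = conv2d(inter, w2)
    return out

For any input, fused_two_layer(x, w1, w2) matches conv2d(conv2d(x, w1), w2); only the dataflow changes, which is what lets a hardware implementation keep the intermediate tile on chip instead of spilling it to off-chip memory.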
16 Apr 2010
TL;DR: The aim of this work is to provide a detailed analysis of state-of-the-art algorithms used for lossless compression of images and to give new researchers future research directions based on that analysis.
Abstract: In this paper we describe some important state-of-the-art algorithms used for lossless compression of images. These algorithms are broadly classified as prediction-based methods and transform-based methods. The motivation behind this work is to provide a detailed analysis of such algorithms and to give new researchers future research directions based on the analysis.
41 citations
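As a toy illustration of the prediction-based family of lossless coders surveyed above (not any specific algorithm from the paper), the following Python sketch predicts each pixel from its left neighbour and stores only the residual; decoding inverts the predictor exactly, so no information is lost:

import numpy as np

def predict_encode(img):
    # Predict each pixel from its left neighbour and keep the residual.
    # The first column is stored as-is; residuals cluster near zero, which is
    # what the entropy-coding stage of a real codec would exploit.
    img = img.astype(np.int16)
    residual = img.copy()
    residual[:, 1:] = img[:, 1:] - img[:, :-1]
    return residual

def predict_decode(residual):
    # Exactly invert the predictor: a cumulative sum along each row restores
    # the original pixel values, so the scheme is lossless.
    return np.cumsum(residual, axis=1).astype(np.uint8)

img = np.random.randint(0, 256, size=(4, 6), dtype=np.uint8)
assert np.array_equal(predict_decode(predict_encode(img)), img)

An entropy coder would then compress the near-zero residuals; transform-based methods instead apply a reversible transform (for example an integer wavelet) before the entropy-coding stage.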
28 Mar 2013
TL;DR: A regression-based method is presented to estimate the CABAC bits for the direction mode and the Transform Unit from the magnitude of the non-zero coefficients; the results show that the method gives a 5.4% average BD-rate gain for luma compared with using only distortion as the comparison measure.
Abstract: The improved performance of intra prediction in HEVC compared with H.264/AVC is primarily due to two main feature enhancements: the first is the increase in the number of intra prediction directions, and the second is the support for larger transform sizes. Using a Lagrangian rate-distortion measure to decide the appropriate prediction direction and Transform Unit size requires calculating the bits generated by a complete CABAC encoding for every candidate, which is computationally very expensive. One way to avoid the complex CABAC encoding is to use only distortion as the comparison measure, but this has a detrimental effect on coding efficiency. In this paper we present a regression-based method to estimate the CABAC bits for the direction mode and for the Transform Unit from the magnitude of the non-zero coefficients. This way we avoid running CABAC encoding for bit calculation, and the results show that our method gives a 5.4% average BD-rate gain for luma in comparison to using only distortion as the comparison measure.
12 citations
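A minimal sketch of the idea in the entry above: replace the full CABAC encode in the Lagrangian rate-distortion cost with a regression estimate of the bits derived from the non-zero transform coefficients. The regression weights below are purely illustrative placeholders, not the values fitted in the paper:

import numpy as np

def estimate_bits(coeffs, a=0.9, b=1.6, c=4.0):
    # Hypothetical linear regression model: estimated CABAC bits grow with the
    # total magnitude of the non-zero transform coefficients and with their
    # count. a, b, c are illustrative weights, not values from the paper.
    nz = coeffs[coeffs != 0]
    return a * np.abs(nz).sum() + b * nz.size + c

def rd_cost(distortion, coeffs, lam):
    # Lagrangian rate-distortion cost J = D + lambda * R, with R replaced by
    # the regression estimate instead of a full CABAC encode per candidate.
    return distortion + lam * estimate_bits(coeffs)

# Pick the candidate (distortion, coefficients) pair with the lowest estimated cost.
candidates = [(120.0, np.array([7, -3, 0, 1])),
              (95.0, np.array([9, 5, -4, 2, 1]))]
best = min(range(len(candidates)), key=lambda i: rd_cost(*candidates[i], lam=10.0))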
20 Mar 2013
TL;DR: A fast candidate selection method for the RMD step of intra prediction in which coarsely spaced prediction directions are used as the candidates for comparison, with the coarse step size a function of the Prediction Unit size.
Abstract: In this paper we propose a fast candidate selection method for the RMD step of intra prediction. The proposed method consists of two steps. In the first step, coarsely spaced prediction directions are used as the candidates for comparison, where the coarse step size is a function of the Prediction Unit size; this way we find the dominant direction with the minimum cost. In the second step, the dominant direction is refined by checking the prediction directions around it to find the best match.
10 citations
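The two-step search described above can be sketched as follows. The mapping from Prediction Unit size to coarse step size and the mode numbering (0 = planar, 1 = DC, 2-34 angular, as in HEVC) are assumptions for illustration, and cost() stands for whatever per-mode cost the RMD step uses (for example SATD plus estimated mode bits):

def fast_rmd(cost, pu_size, num_modes=35):
    # Two-step coarse-to-fine search over intra prediction modes.
    # Step 1 samples the angular modes with a coarse step chosen from the
    # Prediction Unit size; step 2 refines around the best coarse direction.
    # The step-size mapping below is an illustrative assumption, not the
    # exact values used in the paper.
    step = {4: 2, 8: 2, 16: 4, 32: 4}.get(pu_size, 4)
    coarse = [0, 1] + list(range(2, num_modes, step))   # planar, DC + sampled angles
    best = min(coarse, key=cost)
    if best >= 2:                                       # refine only angular modes
        around = [m for m in range(best - step + 1, best + step)
                  if 2 <= m < num_modes]
        best = min(around, key=cost)
    return best

For example, fast_rmd(lambda m: abs(m - 13), pu_size=16) evaluates only the coarse candidates plus a small neighbourhood around the best of them and returns mode 13, rather than scoring all 35 modes.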
Patent•
11 Oct 2017
TL;DR: In this article, intermediate data values associated with a subset of the data values of the input feature maps in the current tile structure are reused when the intermediate data values of a previous tile structure overlap the values to be computed in the current tile structure.
Abstract: System, method, and accelerator to process a convolutional neural network. In accordance therewith, a tile structure having input data values is loaded for a convolution layer. Each tile of the tile structure corresponds to a respective feature map in a set of input feature maps. The tile structure of each iteration represents a different subset of data values in the input feature maps. Intermediate data values associated with a subset of the data values of the input feature maps in the current intermediate tile structure are reused, when the intermediate data values of a previous tile structure overlap values to be computed in the current tile structure. Intermediate non-overlapping data values that are associated with the subset of the data values in the current tile structure are computed using associated filters having weight data values. Available reused intermediate data values and computed intermediate data values are buffered as intermediate data.
7 citations
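The overlap-reuse mechanism in the patent abstract can be sketched as below, assuming two stacked 'valid' convolutions tiled along the width: intermediate columns already produced for the previous tile are taken from a small cache instead of being recomputed. The shapes, the column-wise tiling, and the caching policy are illustrative assumptions, not the claimed implementation:

import numpy as np

def overlap_reuse_demo(x, w1, w2, tile=4):
    # Two stacked 'valid' convolutions over a 2-D input, tiled along the width.
    # Adjacent output tiles need intermediate columns that overlap by k2 - 1;
    # those columns are fetched from a small cache instead of being recomputed,
    # mirroring the reuse of overlapping intermediate data described above.
    k1, k2 = w1.shape[0], w2.shape[0]
    h1, w1_cols = x.shape[0] - k1 + 1, x.shape[1] - k1 + 1
    h2, w2_cols = h1 - k2 + 1, w1_cols - k2 + 1
    cache = {}                      # intermediate column index -> column values
    computed = reused = 0
    out = np.zeros((h2, w2_cols))

    def inter_col(j):
        # Return intermediate column j, computing it only on first use.
        nonlocal computed, reused
        if j in cache:
            reused += 1
        else:
            cache[j] = np.array([np.sum(x[i:i + k1, j:j + k1] * w1)
                                 for i in range(h1)])
            computed += 1
        return cache[j]

    for tj in range(0, w2_cols, tile):
        tw = min(tile, w2_cols - tj)
        cols = range(tj, tj + tw + k2 - 1)       # intermediate columns needed
        inter = np.stack([inter_col(j) for j in cols], axis=1)
        for j in range(tw):                      # second-layer convolution
            for i in range(h2):
                out[i, tj + j] = np.sum(inter[i:i + k2, j:j + k2] * w2)
    return out, computed, reused

The returned counters make the saving visible: for each pair of adjacent tiles, the k2 - 1 overlapping intermediate columns show up as reused rather than computed.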
Cited by
20 Nov 2017
TL;DR: In this paper, the authors provide a comprehensive tutorial and survey of recent advances toward the goal of enabling efficient processing of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of deep neural networks either solely via hardware design changes or via joint hardware and DNN algorithm changes.
Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
2,391 citations
24 Jun 2017
TL;DR: The Sparse CNN (SCNN) accelerator as discussed by the authors employs a dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements.
Abstract: Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs, especially in mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to a multiplier array, where they are extensively reused; product accumulation is performed in a novel accumulator array. On contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.
822 citations
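A toy 1-D version of the zero-skipping idea behind the SCNN entry above (not the SCNN architecture or its dataflow in detail): activations and weights are kept in a compressed (index, value) form, only the Cartesian product of non-zeros is multiplied, and each product is scatter-accumulated into the output:

import numpy as np

def compress(v):
    # Compressed encoding: keep only (index, value) pairs of the non-zeros.
    idx = np.flatnonzero(v)
    return idx, v[idx]

def sparse_conv1d(act, wgt):
    # Sparsity-exploiting 1-D 'full' convolution: multiply only non-zero
    # activation/weight pairs and scatter-accumulate each product into the
    # output, loosely mirroring the compressed dataflow described above.
    a_idx, a_val = compress(act)
    w_idx, w_val = compress(wgt)
    out = np.zeros(len(act) + len(wgt) - 1)
    for ai, av in zip(a_idx, a_val):
        for wi, wv in zip(w_idx, w_val):
            out[ai + wi] += av * wv            # accumulator-array style scatter
    return out

act = np.array([0.0, 1.5, 0.0, 0.0, 2.0])      # mostly-zero activations (post-ReLU)
wgt = np.array([0.5, 0.0, -1.0])               # pruned weights
assert np.allclose(sparse_conv1d(act, wgt), np.convolve(act, wgt))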
Posted Content•
TL;DR: In this article, the authors provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs, and discuss various hardware platforms and architectures that support deep neural networks.
Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.
This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.
The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
677 citations
TL;DR: By consolidating information scattered across the communication, networking, and DL areas, this survey can help readers to understand the connections between enabling technologies while promoting further discussions on the fusion of edge intelligence and intelligent edge, i.e., Edge DL.
Abstract: Ubiquitous sensors and smart devices from factories and communities are generating massive amounts of data, and ever-increasing computing power is driving the core of computation and services from the cloud to the edge of the network. As an important enabler broadly changing people’s lives, from face recognition to ambitious smart factories and cities, developments of artificial intelligence (especially deep learning, DL) based applications and services are thriving. However, due to efficiency and latency issues, the current cloud computing service architecture hinders the vision of “providing artificial intelligence for every person and every organization at everywhere”. Thus, unleashing DL services using resources at the network edge near the data sources has emerged as a desirable solution. Therefore, edge intelligence, aiming to facilitate the deployment of DL services by edge computing, has received significant attention. In addition, DL, as the representative technique of artificial intelligence, can be integrated into edge computing frameworks to build intelligent edge for dynamic, adaptive edge maintenance and management. With regard to mutually beneficial edge intelligence and intelligent edge, this paper introduces and discusses: 1) the application scenarios of both; 2) the practical implementation methods and enabling technologies, namely DL training and inference in the customized edge computing framework; 3) challenges and future trends of more pervasive and fine-grained intelligence. We believe that by consolidating information scattered across the communication, networking, and DL areas, this survey can help readers to understand the connections between enabling technologies while promoting further discussions on the fusion of edge intelligence and intelligent edge, i.e., Edge DL.
611 citations
TL;DR: In this paper, a survey on the relationship between edge intelligence and intelligent edge is presented, covering the application scenarios of both, the practical implementation methods and enabling technologies (namely DL training and inference in a customized edge computing framework), and the challenges and future trends of more pervasive and fine-grained intelligence.
Abstract: Ubiquitous sensors and smart devices from factories and communities are generating massive amounts of data, and ever-increasing computing power is driving the core of computation and services from the cloud to the edge of the network. As an important enabler broadly changing people's lives, from face recognition to ambitious smart factories and cities, developments of artificial intelligence (especially deep learning, DL) based applications and services are thriving. However, due to efficiency and latency issues, the current cloud computing service architecture hinders the vision of "providing artificial intelligence for every person and every organization at everywhere". Thus, unleashing DL services using resources at the network edge near the data sources has emerged as a desirable solution. Therefore, edge intelligence, aiming to facilitate the deployment of DL services by edge computing, has received significant attention. In addition, DL, as the representative technique of artificial intelligence, can be integrated into edge computing frameworks to build intelligent edge for dynamic, adaptive edge maintenance and management. With regard to mutually beneficial edge intelligence and intelligent edge, this paper introduces and discusses: 1) the application scenarios of both; 2) the practical implementation methods and enabling technologies, namely DL training and inference in the customized edge computing framework; 3) challenges and future trends of more pervasive and fine-grained intelligence. We believe that by consolidating information scattered across the communication, networking, and DL areas, this survey can help readers to understand the connections between enabling technologies while promoting further discussions on the fusion of edge intelligence and intelligent edge, i.e., Edge DL.
518 citations