Author

Asit K. Mishra

Other affiliations: Pennsylvania State University
Bio: Asit K. Mishra is an academic researcher from Intel. The author has contributed to research in topics: Cache & Network on a chip. The author has an h-index of 29 and has co-authored 63 publications receiving 4,196 citations. Previous affiliations of Asit K. Mishra include Pennsylvania State University.


Papers
Proceedings ArticleDOI
15 Oct 2016
TL;DR: DnnWeaver is a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level Caffe specification, producing a design that matches the needs of the DNN while delivering high performance and efficiency on the target FPGA.
Abstract: Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains. FPGAs are an attractive choice for DNNs since they offer a programmable substrate for acceleration and are becoming available across different market segments. However, obtaining both performance and energy efficiency with FPGAs is a laborious task even for expert hardware designers. Furthermore, the large memory footprint of DNNs, coupled with the FPGAs' limited on-chip storage, makes DNN acceleration using FPGAs more challenging. This work tackles these challenges by devising DnnWeaver, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe [1]. To achieve large benefits while preserving automation, DnnWeaver generates accelerators using hand-optimized design templates. First, DnnWeaver translates a given high-level DNN specification to its novel ISA that represents a macro dataflow graph of the DNN. The DnnWeaver compiler is equipped with our optimization algorithm that tiles, schedules, and batches DNN operations to maximize data reuse and best utilize the target FPGA's memory and other resources. The final result is a custom synthesizable accelerator that best matches the needs of the DNN while providing high performance and efficiency gains for the target FPGA. We use DnnWeaver to generate accelerators for a set of eight different DNN models and three different FPGAs: Xilinx Zynq, Altera Stratix V, and Altera Arria 10. We use hardware measurements to compare the generated accelerators to both multicore CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650Ti, and Tesla K40). In comparison, the generated accelerators deliver superior performance and efficiency without requiring the programmers to participate in the arduous task of hardware design.
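To make the tiling idea concrete, here is a minimal sketch (an illustrative assumption, not DnnWeaver's actual optimization algorithm) of choosing a tile size for a fully-connected layer so that the working set fits in the FPGA's on-chip buffer and each weight tile is reused across the whole batch:

```python
# Hypothetical sketch: pick the largest square tile of a matrix-multiply
# layer whose working set fits in the FPGA's on-chip buffer, so each tile
# of weights is reused across the whole batch before being evicted.

def choose_tile(rows, cols, batch, bram_bytes, elem_bytes=2):
    """Return the largest tile edge t such that one t x t weight tile,
    a t-wide input slice, and a t-wide output slice fit on chip."""
    best = 1
    for t in range(1, min(rows, cols) + 1):
        weight_tile = t * t * elem_bytes        # reused across the batch
        in_tile = t * batch * elem_bytes        # streamed input slice
        out_tile = t * batch * elem_bytes       # accumulated outputs
        if weight_tile + in_tile + out_tile <= bram_bytes:
            best = t
    return best

# Example: a 1024x1024 fully-connected layer, batch of 16, 512 KB of BRAM.
print(choose_tile(1024, 1024, 16, 512 * 1024))
```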

435 citations

Journal ArticleDOI
27 Mar 2010
TL;DR: Describes an approach to workload classification and its application to the Google Cloud Backend, arguably the largest cloud backend on the planet.
Abstract: The advent of cloud computing promises highly available, efficient, and flexible computing services for applications such as web search, email, voice over IP, and web search alerts. Our experience at Google is that realizing the promises of cloud computing requires an extremely scalable backend consisting of many large compute clusters that are shared by application tasks with diverse service level requirements for throughput, latency, and jitter. These considerations impact (a) capacity planning to determine which machine resources must grow and by how much and (b) task scheduling to achieve high machine utilization and to meet service level objectives. Both capacity planning and task scheduling require a good understanding of task resource consumption (e.g., CPU and memory usage). This in turn demands simple and accurate approaches to workload classification: determining how to form groups of tasks (workloads) with similar resource demands. One approach to workload classification is to make each task its own workload. However, this approach scales poorly since tens of thousands of tasks execute daily on Google compute clusters. Another approach to workload classification is to view all tasks as belonging to a single workload. Unfortunately, applying such a coarse-grain workload classification to the diversity of tasks running on Google compute clusters results in large variances in predicted resource consumptions. This paper describes an approach to workload classification and its application to the Google Cloud Backend, arguably the largest cloud backend on the planet. Our methodology for workload classification consists of: (1) identifying the workload dimensions; (2) constructing task classes using an off-the-shelf algorithm such as k-means; (3) determining the break points for qualitative coordinates within the workload dimensions; and (4) merging adjacent task classes to reduce the number of workloads. We use the foregoing, especially the notion of qualitative coordinates, to glean several insights about the Google Cloud Backend: (a) the duration of task executions is bimodal in that tasks either have a short duration or a long duration; (b) most tasks have short durations; and (c) most resources are consumed by a few tasks with long duration that have large demands for CPU and memory.
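A minimal sketch of the four-step methodology on synthetic task data, using off-the-shelf k-means; the dimensions, cluster count, and distributions below are illustrative assumptions rather than the paper's actual values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Step 1: pick workload dimensions -- here duration (hours), CPU, memory (GB).
tasks = np.column_stack([
    rng.exponential(2.0, 10_000),   # mostly short durations with a long tail
    rng.gamma(2.0, 0.5, 10_000),    # CPU demand
    rng.gamma(2.0, 1.0, 10_000),    # memory demand
])

# Step 2: construct task classes with an off-the-shelf algorithm (k-means).
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(tasks)

# Step 3: derive break points for qualitative coordinates ("small"/"large")
# from the per-class centroids; step 4 would merge classes whose qualitative
# coordinates coincide, shrinking the number of workloads.
centroids = np.array([tasks[labels == k].mean(axis=0) for k in range(8)])
print(np.round(centroids, 2))
```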

411 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: Proposes a BNN hardware accelerator design, implements it on an Arria 10 FPGA as well as a 14-nm ASIC, and compares them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU.
Abstract: Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-the-art accuracies. Binarized neural networks (BNNs) are a recently proposed optimized variant of DNNs. BNNs constrain network weights and/or neuron values to either +1 or −1, which are representable in 1 bit. This leads to dramatic improvements in algorithmic efficiency, due to the reduction in memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration. We first propose a BNN hardware accelerator design. We then implement the proposed accelerator on an Arria 10 FPGA as well as a 14-nm ASIC, and compare them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU. Our evaluation shows that the FPGA provides superior efficiency over the CPU and GPU. Even though the CPU and GPU offer high peak theoretical performance, they are not as efficiently utilized, since BNNs rely on binarized bit-level operations that are better suited to custom hardware. Finally, even though the ASIC is still more efficient, the FPGA can provide orders of magnitude of efficiency improvement over software, without having to lock into a fixed ASIC solution.
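The bit-level operations the abstract refers to can be illustrated with a small sketch (not the paper's accelerator design): a binarized dot product computed with XNOR and popcount over bit-packed {+1, −1} vectors:

```python
import numpy as np

def binarize_pack(v):
    """Map a real-valued vector to {+1, -1} and pack the sign bits."""
    bits = (v >= 0).astype(np.uint8)             # 1 encodes +1, 0 encodes -1
    return np.packbits(bits)

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two {+1, -1} vectors given their packed sign bits."""
    matching = np.unpackbits(~(a_bits ^ b_bits))[:n]   # 1 where signs agree
    m = int(matching.sum())
    return 2 * m - n                              # +1 per match, -1 otherwise

a, b = np.random.randn(64), np.random.randn(64)
ref = int(np.where(a >= 0, 1, -1) @ np.where(b >= 0, 1, -1))
assert xnor_dot(binarize_pack(a), binarize_pack(b), 64) == ref
```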

286 citations

Proceedings ArticleDOI
16 Mar 2013
TL;DR: This paper presents a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies; evaluations indicate that the proposed mechanism provides a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
Abstract: Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, the available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunity in improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies. In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and are as a result inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency-hiding capability. The third scheme, bank-level parallelism aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide a 33% average performance improvement compared to the commonly employed round-robin warp scheduling policy.
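A conceptual sketch of the two-level, CTA-aware idea (an illustrative simplification, not the paper's exact policy): warps are grouped by CTA, the scheduler round-robins within the active group, and it switches groups only when every warp in the active group is stalled on memory:

```python
from collections import deque

def pick_warp(cta_groups, is_stalled):
    """cta_groups: deque of lists of warp ids, one list per CTA.
    is_stalled: warp id -> bool. Returns a ready warp id or None."""
    for _ in range(len(cta_groups)):
        group = cta_groups[0]
        ready = [w for w in group if not is_stalled(w)]
        if ready:
            return ready[0]          # issue from the active CTA's warps
        cta_groups.rotate(-1)        # whole CTA stalled: switch groups
    return None                      # all warps are waiting on memory

groups = deque([[0, 1, 2, 3], [4, 5, 6, 7]])   # two CTAs, four warps each
print(pick_warp(groups, lambda w: w in {0, 1, 2, 3}))  # -> 4
```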

280 citations

Proceedings ArticleDOI
03 Jun 2012
TL;DR: This work formulates the relationship between retention time and write latency, and finds the optimal retention time for architecting an efficient STT-RAM cache hierarchy that overcomes its high write latency and energy problems.
Abstract: High density, low leakage, and non-volatility are the attractive features of Spin-Transfer Torque RAM (STT-RAM), which have made it a strong competitor against SRAM as a universal memory replacement in multi-core systems. However, STT-RAM suffers from high write latency and energy, which has impeded its widespread adoption. To this end, we look at trading off STT-RAM's non-volatility property (data retention time) to overcome these problems. We formulate the relationship between retention time and write latency, and find the optimal retention time for architecting an efficient cache hierarchy using STT-RAM. Our results show that, compared to an SRAM-based design, our proposal can improve performance by 18% and reduce energy consumption by 60%.
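As a hedged illustration of the retention-time/write-latency trade-off, the commonly used thermal-stability model for MTJ retention (t ≈ t0·exp(Δ), with t0 on the order of 1 ns) shows how relaxing retention lets a designer lower the barrier Δ and hence the write latency and energy; the paper's exact formulation may differ, but the trend is the same:

```python
import math

T0_NS = 1.0                        # attempt period, roughly 1 ns

def retention_seconds(delta):
    """Retention time under the thermal-stability model t = t0 * exp(delta)."""
    return T0_NS * 1e-9 * math.exp(delta)

for delta in (40, 30, 14):         # illustrative barrier heights
    print(f"Delta={delta}: retention ~ {retention_seconds(delta):.3g} s")
# Delta=40 gives years of retention; Delta=14 gives about a millisecond,
# which can suffice for a cache line that is refreshed or evicted quickly,
# while allowing a much faster, lower-energy write.
```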

261 citations


Cited by
Proceedings ArticleDOI
17 Apr 2015
TL;DR: A summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it are presented.
Abstract: Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
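As a purely illustrative sketch (not Borg's actual scheduler), best-fit packing with a simple CPU over-commit factor shows two of the ingredients the abstract credits for high utilization:

```python
def place(tasks, machines, overcommit=1.2):
    """tasks: [(cpu, mem)], machines: [[cpu_free, mem_free], ...].
    Returns one machine index per task (size-sorted order), or None if rejected."""
    placements = []
    for cpu, mem in sorted(tasks, reverse=True):       # biggest tasks first
        fits = [(i, m) for i, m in enumerate(machines)
                if m[0] * overcommit >= cpu and m[1] >= mem]
        if not fits:
            placements.append(None)                    # admission control: reject
            continue
        i, m = min(fits, key=lambda im: im[1][0])      # tightest CPU fit
        m[0] -= cpu / overcommit                       # discounted reservation
        m[1] -= mem
        placements.append(i)
    return placements

print(place([(2, 4), (1, 1), (3, 8)], [[4, 8], [4, 16]]))
```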

1,185 citations

Journal ArticleDOI
TL;DR: This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architectures, distillation algorithms, performance comparison, and applications.
Abstract: In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison, and applications. Furthermore, challenges in knowledge distillation are briefly reviewed, and comments on future research are discussed.
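A minimal sketch of the classic soft-target distillation loss that such surveys take as a starting point; the temperature, weighting, and tensor shapes below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend soft-target KL against the teacher with ordinary cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s = torch.randn(8, 10, requires_grad=True)        # student logits
t = torch.randn(8, 10)                            # teacher logits (fixed)
y = torch.randint(0, 10, (8,))
distillation_loss(s, t, y).backward()
```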

1,027 citations

Proceedings ArticleDOI
08 Oct 2018
TL;DR: TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends, such as mobile phones, embedded devices, and accelerators.
Abstract: There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms - such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) - requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.
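A conceptual sketch of graph-level operator fusion, one of the optimizations described above (illustrative Python, not TVM's actual API): convolution, bias-add, and ReLU are folded into a single pass so the intermediate tensors never round-trip through memory:

```python
import numpy as np

def conv_bias_relu_fused(x, w, b):
    """x: (H, W), w: (k, k), b: scalar; 'valid' convolution with fused
    bias-add and ReLU, computed per output element in a single pass."""
    k = w.shape[0]
    H, W = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((H, W), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            acc = (x[i:i + k, j:j + k] * w).sum() + b   # conv + bias
            out[i, j] = acc if acc > 0 else 0.0          # fused ReLU
    return out

x = np.random.randn(8, 8).astype(np.float32)
w = np.random.randn(3, 3).astype(np.float32)
print(conv_bias_relu_fused(x, w, 0.1).shape)    # (6, 6)
```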

991 citations

Journal ArticleDOI
TL;DR: This well-respected, market-leading text discusses the use of digital computers in the real-time control of dynamic systems, with an emphasis on the design of digital controls that achieve good dynamic response and small errors while using signals that are sampled in time and quantized in amplitude.
Abstract: Digital Control of Dynamic Systems (Gene F. Franklin, J. David Powell, and Michael L. Workman) discusses the use of digital computers in the real-time control of dynamic systems. The emphasis is on the design of digital controls that achieve good dynamic response and small errors while using signals that are sampled in time and quantized in amplitude. MATLAB statements and problems are integrated throughout the book to offer readers a complete design picture. The book covers discretization effects and design by emulation (design of a continuous-time control system followed by discretization before implementation). In a digital control system, a computer is a fundamental component of the controller: it typically receives a measurement of the controlled variable and the reference input, and produces its output using an algorithm implemented in a digital device such as a microcontroller or microprocessor.
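As a hedged illustration of the design-by-emulation approach mentioned above, a continuous-time compensator can be discretized for a sampled-data implementation; the compensator and sample rate below are arbitrary examples:

```python
from scipy.signal import cont2discrete

# Continuous-time lead compensator C(s) = 10 (s + 1) / (s + 10)
num, den = [10.0, 10.0], [1.0, 10.0]
Ts = 0.01                                    # 100 Hz sample rate
numd, dend, _ = cont2discrete((num, den), Ts, method="bilinear")
print(numd, dend)                            # difference-equation coefficients
```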

902 citations