Author

Asit K. Mishra

Other affiliations: Pennsylvania State University
Bio: Asit K. Mishra is an academic researcher from Intel. The author has contributed to research in topics: Cache & Network on a chip. The author has an h-index of 29 and has co-authored 63 publications receiving 4,196 citations. Previous affiliations of Asit K. Mishra include Pennsylvania State University.


Papers
Proceedings ArticleDOI
15 Oct 2016
TL;DR: DnnWeaver is a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level Caffe specification, producing a design that matches the needs of the DNN while delivering high performance and efficiency on the target FPGA.
Abstract: Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains. FPGAs are an attractive choice for DNNs since they offer a programmable substrate for acceleration and are becoming available across different market segments. However, obtaining both performance and energy efficiency with FPGAs is a laborious task even for expert hardware designers. Furthermore, the large memory footprint of DNNs, coupled with the FPGAs' limited on-chip storage, makes DNN acceleration using FPGAs more challenging. This work tackles these challenges by devising DnnWeaver, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe [1]. To achieve large benefits while preserving automation, DnnWeaver generates accelerators using hand-optimized design templates. First, DnnWeaver translates a given high-level DNN specification to its novel ISA that represents a macro dataflow graph of the DNN. The DnnWeaver compiler is equipped with our optimization algorithm that tiles, schedules, and batches DNN operations to maximize data reuse and best utilize the target FPGA's memory and other resources. The final result is a custom synthesizable accelerator that best matches the needs of the DNN while providing high performance and efficiency gains for the target FPGA. We use DnnWeaver to generate accelerators for a set of eight different DNN models and three different FPGAs: Xilinx Zynq, Altera Stratix V, and Altera Arria 10. We use hardware measurements to compare the generated accelerators to both multicore CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650Ti, and Tesla K40). In comparison, the generated accelerators deliver superior performance and efficiency without requiring the programmers to participate in the arduous task of hardware design.
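To make the tiling idea concrete, here is a minimal sketch (an illustrative assumption, not DnnWeaver's actual optimization algorithm) of choosing a tile size for a fully-connected layer so that the working set fits in the FPGA's on-chip buffer and each weight tile is reused across the whole batch:

```python
# Hypothetical sketch: pick the largest square tile of a matrix-multiply
# layer whose working set fits in the FPGA's on-chip buffer, so each tile
# of weights is reused across the whole batch before being evicted.

def choose_tile(rows, cols, batch, bram_bytes, elem_bytes=2):
    """Return the largest tile edge t such that one t x t weight tile,
    a t-wide input slice, and a t-wide output slice fit on chip."""
    best = 1
    for t in range(1, min(rows, cols) + 1):
        weight_tile = t * t * elem_bytes        # reused across the batch
        in_tile = t * batch * elem_bytes        # streamed input slice
        out_tile = t * batch * elem_bytes       # accumulated outputs
        if weight_tile + in_tile + out_tile <= bram_bytes:
            best = t
    return best

# Example: a 1024x1024 fully-connected layer, batch of 16, 512 KB of BRAM.
print(choose_tile(1024, 1024, 16, 512 * 1024))
```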

435 citations

Journal ArticleDOI
27 Mar 2010
TL;DR: Describes an approach to workload classification and its application to the Google Cloud Backend, arguably the largest cloud backend on the planet.
Abstract: The advent of cloud computing promises highly available, efficient, and flexible computing services for applications such as web search, email, voice over IP, and web search alerts. Our experience at Google is that realizing the promises of cloud computing requires an extremely scalable backend consisting of many large compute clusters that are shared by application tasks with diverse service level requirements for throughput, latency, and jitter. These considerations impact (a) capacity planning to determine which machine resources must grow and by how much and (b) task scheduling to achieve high machine utilization and to meet service level objectives. Both capacity planning and task scheduling require a good understanding of task resource consumption (e.g., CPU and memory usage). This in turn demands simple and accurate approaches to workload classification: determining how to form groups of tasks (workloads) with similar resource demands. One approach to workload classification is to make each task its own workload. However, this approach scales poorly since tens of thousands of tasks execute daily on Google compute clusters. Another approach to workload classification is to view all tasks as belonging to a single workload. Unfortunately, applying such a coarse-grain workload classification to the diversity of tasks running on Google compute clusters results in large variances in predicted resource consumptions. This paper describes an approach to workload classification and its application to the Google Cloud Backend, arguably the largest cloud backend on the planet. Our methodology for workload classification consists of: (1) identifying the workload dimensions; (2) constructing task classes using an off-the-shelf algorithm such as k-means; (3) determining the break points for qualitative coordinates within the workload dimensions; and (4) merging adjacent task classes to reduce the number of workloads. We use the foregoing, especially the notion of qualitative coordinates, to glean several insights about the Google Cloud Backend: (a) the duration of task executions is bimodal in that tasks either have a short duration or a long duration; (b) most tasks have short durations; and (c) most resources are consumed by a few tasks with long duration that have large demands for CPU and memory.
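A minimal sketch of the four-step methodology on synthetic task data, using off-the-shelf k-means; the dimensions, cluster count, and distributions below are illustrative assumptions rather than the paper's actual values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Step 1: pick workload dimensions -- here duration (hours), CPU, memory (GB).
tasks = np.column_stack([
    rng.exponential(2.0, 10_000),   # mostly short durations with a long tail
    rng.gamma(2.0, 0.5, 10_000),    # CPU demand
    rng.gamma(2.0, 1.0, 10_000),    # memory demand
])

# Step 2: construct task classes with an off-the-shelf algorithm (k-means).
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(tasks)

# Step 3: derive break points for qualitative coordinates ("small"/"large")
# from the per-class centroids; step 4 would merge classes whose qualitative
# coordinates coincide, shrinking the number of workloads.
centroids = np.array([tasks[labels == k].mean(axis=0) for k in range(8)])
print(np.round(centroids, 2))
```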

411 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: Proposes a BNN hardware accelerator design, implements it on an Arria 10 FPGA as well as a 14-nm ASIC, and compares them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU.
Abstract: Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-the-art accuracies. Binarized neural networks (BNNs) are a recently proposed optimized variant of DNNs. BNNs constrain network weights and/or neuron values to either +1 or −1, which are representable in 1 bit. This leads to dramatic improvements in algorithmic efficiency, due to the reduction in memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration. We first propose a BNN hardware accelerator design. We then implement the proposed accelerator on an Arria 10 FPGA as well as a 14-nm ASIC, and compare them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU. Our evaluation shows that the FPGA provides superior efficiency over the CPU and GPU. Even though the CPU and GPU offer high peak theoretical performance, they are not as efficiently utilized, since BNNs rely on binarized bit-level operations that are better suited to custom hardware. Finally, even though the ASIC is still more efficient, the FPGA can provide orders of magnitude of efficiency improvement over software, without having to lock into a fixed ASIC solution.
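The bit-level operations the abstract refers to can be illustrated with a small sketch (not the paper's accelerator design): a binarized dot product computed with XNOR and popcount over bit-packed {+1, −1} vectors:

```python
import numpy as np

def binarize_pack(v):
    """Map a real-valued vector to {+1, -1} and pack the sign bits."""
    bits = (v >= 0).astype(np.uint8)             # 1 encodes +1, 0 encodes -1
    return np.packbits(bits)

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two {+1, -1} vectors given their packed sign bits."""
    matching = np.unpackbits(~(a_bits ^ b_bits))[:n]   # 1 where signs agree
    m = int(matching.sum())
    return 2 * m - n                              # +1 per match, -1 otherwise

a, b = np.random.randn(64), np.random.randn(64)
ref = int(np.where(a >= 0, 1, -1) @ np.where(b >= 0, 1, -1))
assert xnor_dot(binarize_pack(a), binarize_pack(b), 64) == ref
```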

286 citations

Proceedings ArticleDOI
16 Mar 2013
TL;DR: This paper presents a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies; evaluations indicate that the proposed mechanism provides a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
Abstract: Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, the available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunity in improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies. In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and are as a result inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency-hiding capability. The third scheme, bank-level parallelism aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide a 33% average performance improvement compared to the commonly employed round-robin warp scheduling policy.
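A conceptual sketch of the two-level, CTA-aware idea (an illustrative simplification, not the paper's exact policy): warps are grouped by CTA, the scheduler round-robins within the active group, and it switches groups only when every warp in the active group is stalled on memory:

```python
from collections import deque

def pick_warp(cta_groups, is_stalled):
    """cta_groups: deque of lists of warp ids, one list per CTA.
    is_stalled: warp id -> bool. Returns a ready warp id or None."""
    for _ in range(len(cta_groups)):
        group = cta_groups[0]
        ready = [w for w in group if not is_stalled(w)]
        if ready:
            return ready[0]          # issue from the active CTA's warps
        cta_groups.rotate(-1)        # whole CTA stalled: switch groups
    return None                      # all warps are waiting on memory

groups = deque([[0, 1, 2, 3], [4, 5, 6, 7]])   # two CTAs, four warps each
print(pick_warp(groups, lambda w: w in {0, 1, 2, 3}))  # -> 4
```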

280 citations

Proceedings ArticleDOI
03 Jun 2012
TL;DR: This work formulates the relationship between retention time and write latency, and finds the optimal retention time for architecting an efficient STT-RAM cache hierarchy that overcomes its high write latency and energy problems.
Abstract: High density, low leakage, and non-volatility are the attractive features of Spin-Transfer Torque RAM (STT-RAM), which have made it a strong competitor against SRAM as a universal memory replacement in multi-core systems. However, STT-RAM suffers from high write latency and energy, which has impeded its widespread adoption. To this end, we look at trading off STT-RAM's non-volatility property (data retention time) to overcome these problems. We formulate the relationship between retention time and write latency, and find the optimal retention time for architecting an efficient cache hierarchy using STT-RAM. Our results show that, compared to an SRAM-based design, our proposal can improve performance by 18% and reduce energy consumption by 60%.
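As a hedged illustration of the retention-time/write-latency trade-off, the commonly used thermal-stability model for MTJ retention (t ≈ t0·exp(Δ), with t0 on the order of 1 ns) shows how relaxing retention lets a designer lower the barrier Δ and hence the write latency and energy; the paper's exact formulation may differ, but the trend is the same:

```python
import math

T0_NS = 1.0                        # attempt period, roughly 1 ns

def retention_seconds(delta):
    """Retention time under the thermal-stability model t = t0 * exp(delta)."""
    return T0_NS * 1e-9 * math.exp(delta)

for delta in (40, 30, 14):         # illustrative barrier heights
    print(f"Delta={delta}: retention ~ {retention_seconds(delta):.3g} s")
# Delta=40 gives years of retention; Delta=14 gives about a millisecond,
# which can suffice for a cache line that is refreshed or evicted quickly,
# while allowing a much faster, lower-energy write.
```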

261 citations


Cited by
Proceedings ArticleDOI
17 Apr 2015
TL;DR: A summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it are presented.
Abstract: Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
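As a purely illustrative sketch (not Borg's actual scheduler), best-fit packing with a simple CPU over-commit factor shows two of the ingredients the abstract credits for high utilization:

```python
def place(tasks, machines, overcommit=1.2):
    """tasks: [(cpu, mem)], machines: [[cpu_free, mem_free], ...].
    Returns one machine index per task (size-sorted order), or None if rejected."""
    placements = []
    for cpu, mem in sorted(tasks, reverse=True):       # biggest tasks first
        fits = [(i, m) for i, m in enumerate(machines)
                if m[0] * overcommit >= cpu and m[1] >= mem]
        if not fits:
            placements.append(None)                    # admission control: reject
            continue
        i, m = min(fits, key=lambda im: im[1][0])      # tightest CPU fit
        m[0] -= cpu / overcommit                       # discounted reservation
        m[1] -= mem
        placements.append(i)
    return placements

print(place([(2, 4), (1, 1), (3, 8)], [[4, 8], [4, 16]]))
```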

1,185 citations

Journal ArticleDOI
TL;DR: This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architectures, distillation algorithms, performance comparison, and applications.
Abstract: In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison, and applications. Furthermore, challenges in knowledge distillation are briefly reviewed, and comments on future research are discussed.
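A minimal sketch of the classic soft-target distillation loss that such surveys take as a starting point; the temperature, weighting, and tensor shapes below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend soft-target KL against the teacher with ordinary cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s = torch.randn(8, 10, requires_grad=True)        # student logits
t = torch.randn(8, 10)                            # teacher logits (fixed)
y = torch.randint(0, 10, (8,))
distillation_loss(s, t, y).backward()
```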

1,027 citations

Proceedings ArticleDOI
08 Oct 2018
TL;DR: TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends, such as mobile phones, embedded devices, and accelerators.
Abstract: There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms - such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) - requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.
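A conceptual sketch of graph-level operator fusion, one of the optimizations described above (illustrative Python, not TVM's actual API): convolution, bias-add, and ReLU are folded into a single pass so the intermediate tensors never round-trip through memory:

```python
import numpy as np

def conv_bias_relu_fused(x, w, b):
    """x: (H, W), w: (k, k), b: scalar; 'valid' convolution with fused
    bias-add and ReLU, computed per output element in a single pass."""
    k = w.shape[0]
    H, W = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((H, W), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            acc = (x[i:i + k, j:j + k] * w).sum() + b   # conv + bias
            out[i, j] = acc if acc > 0 else 0.0          # fused ReLU
    return out

x = np.random.randn(8, 8).astype(np.float32)
w = np.random.randn(3, 3).astype(np.float32)
print(conv_bias_relu_fused(x, w, 0.1).shape)    # (6, 6)
```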

991 citations

Journal ArticleDOI
TL;DR: This well-respected, market-leading text discusses the use of digital computers in the real-time control of dynamic systems, with an emphasis on the design of digital controls that achieve good dynamic response and small errors while using signals that are sampled in time and quantized in amplitude.
Abstract: Digital Control of Dynamic Systems (Gene F. Franklin, J. David Powell, and Michael L. Workman) discusses the use of digital computers in the real-time control of dynamic systems. The emphasis is on the design of digital controls that achieve good dynamic response and small errors while using signals that are sampled in time and quantized in amplitude. MATLAB statements and problems are integrated throughout the book to offer readers a complete design picture. The book covers discretization effects and design by emulation (design of a continuous-time control system followed by discretization before implementation). In a digital control system, a computer is a fundamental component of the controller: it typically receives a measurement of the controlled variable and the reference input, and produces its output using an algorithm implemented in a digital device such as a microcontroller or microprocessor.
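As a hedged illustration of the design-by-emulation approach mentioned above, a continuous-time compensator can be discretized for a sampled-data implementation; the compensator and sample rate below are arbitrary examples:

```python
from scipy.signal import cont2discrete

# Continuous-time lead compensator C(s) = 10 (s + 1) / (s + 10)
num, den = [10.0, 10.0], [1.0, 10.0]
Ts = 0.01                                    # 100 Hz sample rate
numd, dend, _ = cont2discrete((num, den), Ts, method="bilinear")
print(numd, dend)                            # difference-equation coefficients
```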

902 citations