Author

Sunil Shukla

Bio: Sunil Shukla is an academic researcher from IBM. The author has contributed to research in the topics of Field-programmable gate array and Reconfigurable computing. The author has an h-index of 12 and has co-authored 41 publications receiving 671 citations. Previous affiliations of Sunil Shukla include the University of Queensland and the Karlsruhe Institute of Technology.

Papers
Journal ArticleDOI
TL;DR: The programmability of FPGAs must improve if they are to be part of mainstream computing, and this paper presents a meta-modelling architecture suitable for this purpose.
Abstract: When looking at how hardware influences computing performance, we have GPPs (general-purpose processors) on one end of the spectrum and ASICs (application-specific integrated circuits) on the other...

142 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers by employing a dataflow architecture and an on-chip scratchpad hierarchy.
Abstract: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.
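As a rough sanity check of the quoted peak rates, the arithmetic below works backward from the 1.5 GHz clock; the per-cycle MAC counts are assumptions chosen to reproduce the numbers, not figures from the paper.

```python
# Back-of-the-envelope check of the quoted peak rates. The per-cycle MAC
# counts below are assumptions, not figures taken from the paper.
clock_hz = 1.5e9                                   # 1.5 GHz prototype clock

fp16_macs_per_cycle = 512                          # assumed fp16 MAC array width
fp16_flops = clock_hz * fp16_macs_per_cycle * 2    # 2 flops (multiply + add) per MAC
print(f"fp16 peak:    {fp16_flops / 1e12:.2f} TFLOPS")                          # ~1.54 TFLOPS

ternary_ops_per_cycle = fp16_macs_per_cycle * 8    # assumed 8x packing for 2b operands
binary_ops_per_cycle = fp16_macs_per_cycle * 16    # assumed 16x packing for 1b operands
print(f"ternary peak: {clock_hz * ternary_ops_per_cycle * 2 / 1e12:.1f} TOPS")  # ~12.3 TOPS
print(f"binary peak:  {clock_hz * binary_ops_per_cycle * 2 / 1e12:.1f} TOPS")   # ~24.6 TOPS
```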

103 citations

Proceedings ArticleDOI
01 Oct 2016
TL;DR: It is shown that hot loops in the applications can be perforated by an average of 50% with a proportional reduction in execution time, while still producing acceptable quality of results, and that benefits are compounded when these techniques are applied concurrently.
Abstract: Approximate computing is gaining traction as a computing paradigm for data analytics and cognitive applications that aim to extract deep insight from vast quantities of data. In this paper, we demonstrate that multiple approximation techniques can be applied to applications in these domains and can be further combined together to compound their benefits. In assessing the potential of approximation in these applications, we took the liberty of changing multiple layers of the system stack: architecture, programming model, and algorithms. Across a set of applications spanning the domains of DSP, robotics, and machine learning, we show that hot loops in the applications can be perforated by an average of 50% with a proportional reduction in execution time, while still producing acceptable quality of results. In addition, the width of the data used in the computation can be reduced to 10–16 bits from the currently common 32/64 bits with potential for significant performance and energy benefits. For parallel applications we reduced execution time by 50% using relaxed synchronization mechanisms. Finally, our results also demonstrate that benefits are compounded when these techniques are applied concurrently. Our results across different applications demonstrate that approximate computing is a widely applicable paradigm with potential for compounded benefits from applying multiple techniques across the system stack. In order to exploit these benefits, it is essential to re-think multiple layers of the system stack to embrace approximations ground-up and to design tightly integrated approximate accelerators. Doing so will enable moving the applications into a world in which the architecture, programming model, and even the algorithms used to implement the application are all fundamentally designed for approximate computing.
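The loop-perforation idea in this abstract (skipping a fraction of a hot loop's iterations and accepting the accuracy loss) can be illustrated in a few lines. The sketch below is a generic example, not the authors' framework; the kernel and perforation rate are chosen only for illustration.

```python
# Minimal sketch of loop perforation: execute only every `skip`-th iteration
# of a hot loop and rescale the result. Generic illustration, not the
# framework described in the paper.

def dot_exact(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def dot_perforated(xs, ys, skip=2):
    """Run 1/skip of the iterations and rescale, trading accuracy for ~skip x less work."""
    partial = sum(xs[i] * ys[i] for i in range(0, len(xs), skip))
    return partial * skip

xs = [0.1 * i for i in range(1000)]
ys = [0.2 * i for i in range(1000)]

exact = dot_exact(xs, ys)
approx = dot_perforated(xs, ys, skip=2)   # ~50% of iterations, matching the paper's average
print(f"exact={exact:.1f} approx={approx:.1f} rel_err={abs(approx - exact) / exact:.2%}")
```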

72 citations

Journal ArticleDOI
TL;DR: When looking at how hardware influences computing performance, the authors have GPPs (general-purpose processors) on one end of the spectrum and ASICs (application-specific integrated circuits) on the other.
Abstract: When looking at how hardware influences computing performance, we have GPPs (general-purpose processors) on one end of the spectrum and ASICs (application-specific integrated circuits) on the other. Processors are highly programmable but often inefficient in terms of power and performance. ASICs implement a dedicated and fixed function and provide the best power and performance characteristics, but any functional change requires a complete (and extremely expensive) re-spinning of the circuits.

70 citations

Proceedings ArticleDOI
03 Jun 2012
TL;DR: Liquid Metal is presented, a comprehensive compiler and runtime system for a new programming language called Lime that enables the use of a single language for programming heterogeneous computing platforms, and the seamless co-execution of the resultant programs on CPUs and accelerators that include GPUs and FPGAs.
Abstract: Heterogeneous systems show a lot of promise for extracting high performance by combining the benefits of conventional architectures with specialized accelerators in the form of graphics processors (GPUs) and reconfigurable hardware (FPGAs). Extracting this performance often entails programming in disparate languages and models, making it hard for a programmer to work equally well on all aspects of an application. Further, relatively little attention is paid to co-execution — the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together. We present Liquid Metal, a comprehensive compiler and runtime system for a new programming language called Lime. Our work enables the use of a single language for programming heterogeneous computing platforms, and the seamless co-execution of the resultant programs on CPUs and accelerators that include GPUs and FPGAs. We have developed a number of Lime applications, and successfully compiled some of these for co-execution on various GPU- and FPGA-enabled architectures. Our experience so far leads us to believe the Liquid Metal approach is promising and can make the computational power of heterogeneous architectures more easily accessible to mainstream programmers.
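Lime itself is a Java-derived language compiled by the Liquid Metal tool-chain, so the snippet below is only a language-agnostic sketch of the co-execution idea the abstract describes: one source-level task, with the runtime choosing among CPU, GPU, or FPGA backends at dispatch time. The backend registry and selection policy are invented for illustration and are not the Liquid Metal runtime.

```python
# Illustrative sketch of runtime co-execution dispatch (not Lime/Liquid Metal code):
# the same task description is bound to whichever backend is available.

from typing import Callable, Dict, List

def cpu_saxpy(a: float, x: List[float], y: List[float]) -> List[float]:
    # Reference CPU implementation; a GPU or FPGA backend would expose the same signature.
    return [a * xi + yi for xi, yi in zip(x, y)]

BACKENDS: Dict[str, Callable] = {
    "cpu": cpu_saxpy,
    # "gpu": gpu_saxpy,    # would be registered when a GPU is present
    # "fpga": fpga_saxpy,  # would be registered when an FPGA bitstream is loaded
}

def dispatch(task: str, preferred: List[str], *args):
    """Pick the first available backend from the preference list and run the task on it."""
    for name in preferred:
        if name in BACKENDS:
            return BACKENDS[name](*args)
    raise RuntimeError(f"no backend available for {task}")

result = dispatch("saxpy", ["fpga", "gpu", "cpu"], 2.0, [1.0, 2.0], [3.0, 4.0])
print(result)  # [5.0, 8.0]
```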

69 citations


Cited by
Journal ArticleDOI
TL;DR: The changing cloud infrastructure is discussed, along with the use of infrastructure from multiple providers and the benefits of decentralising computing away from data centers, leading to a roadmap of challenges that will need to be addressed to realise the potential of next-generation cloud systems.

471 citations

Posted Content
TL;DR: In this article, the authors discuss the changing cloud infrastructure and consider the use of infrastructure from multiple providers and the benefit of decentralising computing away from data centers, and lay out a roadmap of challenges that will need to be addressed for realising the potential of next generation cloud systems.
Abstract: The landscape of cloud computing has significantly changed over the last decade. Not only have more providers and service offerings crowded the space, but also cloud infrastructure that was traditionally limited to single-provider data centers is now evolving. In this paper, we first discuss the changing cloud infrastructure and consider the use of infrastructure from multiple providers and the benefit of decentralising computing away from data centers. These trends have resulted in the need for a variety of new computing architectures that will be offered by future cloud infrastructure. These architectures are anticipated to impact areas such as connecting people and devices, data-intensive computing, the service space, and self-learning systems. Finally, we lay out a roadmap of challenges that will need to be addressed for realising the potential of next generation cloud systems.

440 citations

Proceedings Article
Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, Kailash Gopalakrishnan
19 Dec 2018
TL;DR: In this paper, the authors demonstrate the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets.
Abstract: The state-of-the-art hardware platforms for training deep neural networks are moving from traditional single precision (32-bit) computations towards 16 bits of precision, in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4 times improved throughput over today's systems.
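Two of the techniques the abstract names, chunk-based accumulation and floating-point stochastic rounding, are easy to sketch in software. The version below only emulates low precision by rounding Python floats to a reduced mantissa width; the chunk size and bit width are illustrative choices, not the paper's.

```python
import math, random

def trunc_round(x: float, mantissa_bits: int = 8) -> float:
    """Crude emulation of a low-precision float: keep `mantissa_bits` fraction bits, truncate the rest."""
    if x == 0.0:
        return 0.0
    scale = 2.0 ** (math.floor(math.log2(abs(x))) - mantissa_bits)
    return math.floor(x / scale) * scale

def stochastic_round(x: float, mantissa_bits: int = 8) -> float:
    """Round down or up with probability given by the discarded fraction (unbiased in expectation)."""
    if x == 0.0:
        return 0.0
    scale = 2.0 ** (math.floor(math.log2(abs(x))) - mantissa_bits)
    q, frac = divmod(x / scale, 1.0)
    return (q + (1.0 if random.random() < frac else 0.0)) * scale

def lp_sum(values, rnd):
    """Single low-precision running sum: small addends are swamped once the accumulator grows."""
    acc = 0.0
    for v in values:
        acc = rnd(acc + v)
    return acc

def chunked_lp_sum(values, rnd, chunk=64):
    """Accumulate in short chunks, then sum the chunk totals: each accumulation sees far fewer terms."""
    totals = [lp_sum(values[i:i + chunk], rnd) for i in range(0, len(values), chunk)]
    return lp_sum(totals, rnd)

vals = [1e-3] * 100_000                                           # exact sum is 100.0
print("naive truncation:   ", lp_sum(vals, trunc_round))          # stalls far below 100
print("chunked truncation: ", chunked_lp_sum(vals, trunc_round))  # much closer, still biased low
print("stochastic rounding:", lp_sum(vals, stochastic_round))     # ~100 on average, with noise
```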

231 citations

Proceedings ArticleDOI
22 Aug 2016
TL;DR: To the best of the authors' knowledge, ClickNP is the first FPGA-accelerated platform for NFs written completely in a high-level language, achieving 40 Gbps line rate at any packet size and reducing latency by 10x.
Abstract: Highly flexible software network functions (NFs) are crucial components to enable multi-tenancy in the clouds. However, software packet processing on a commodity server has limited capacity and induces high latency. While software NFs could scale out using more servers, doing so adds significant cost. This paper focuses on accelerating NFs with programmable hardware, i.e., FPGA, which is now a mature technology and inexpensive for datacenters. However, FPGA is predominantly programmed using low-level hardware description languages (HDLs), which are hard to code and difficult to debug. More importantly, HDLs are almost inaccessible for most software programmers. This paper presents ClickNP, an FPGA-accelerated platform for highly flexible and high-performance NFs with commodity servers. ClickNP is highly flexible as it is completely programmable using high-level C-like languages, and exposes a modular programming abstraction that resembles the Click Modular Router. ClickNP is also high performance: our prototype NFs can process traffic at up to 200 million packets per second with ultra-low latency.
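ClickNP elements are written in a C-like language and compiled to FPGA, so the snippet below is only a host-language sketch of the modular, Click-style abstraction the abstract refers to: small packet-processing elements composed into a pipeline. The element names and packet format are invented; this is not the ClickNP API.

```python
# Host-language sketch of a Click-style element pipeline.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Packet:
    src_port: int
    dst_port: int
    payload: bytes

Element = Callable[[Packet], Optional[Packet]]         # an element drops a packet by returning None

def acl_filter(blocked_ports: set) -> Element:
    """Drop packets destined to a blocked port, pass everything else through."""
    def run(pkt: Packet) -> Optional[Packet]:
        return None if pkt.dst_port in blocked_ports else pkt
    return run

def rewrite_port(new_dst: int) -> Element:
    """Rewrite the destination port (a toy load-balancer/NAT stage)."""
    def run(pkt: Packet) -> Optional[Packet]:
        pkt.dst_port = new_dst
        return pkt
    return run

def pipeline(elements: List[Element], pkt: Packet) -> Optional[Packet]:
    """Push a packet through the chain of elements, stopping early if any element drops it."""
    for elem in elements:
        pkt = elem(pkt)
        if pkt is None:
            return None
    return pkt

nf = [acl_filter({23}), rewrite_port(8080)]            # toy NF: drop telnet, redirect the rest
print(pipeline(nf, Packet(5000, 80, b"GET /")))        # Packet(src_port=5000, dst_port=8080, ...)
print(pipeline(nf, Packet(5000, 23, b"login")))        # None (dropped)
```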

228 citations

Proceedings ArticleDOI
24 Jun 2017
TL;DR: This work designs Plasticine, a new spatially reconfigurable architecture for efficiently executing applications composed of parallel patterns; it provides an improvement of up to 76.9× in performance-per-watt over a conventional FPGA across a wide range of dense and sparse applications.
Abstract: Reconfigurable architectures have gained popularity in recent years as they allow the design of energy-efficient accelerators. Fine-grain fabrics (e.g. FPGAs) have traditionally suffered from performance and power inefficiencies due to bit-level reconfigurable abstractions. Both fine-grain and coarse-grain architectures (e.g. CGRAs) traditionally require low-level programming and suffer from long compilation times. We address both challenges with Plasticine, a new spatially reconfigurable architecture designed to efficiently execute applications composed of parallel patterns. Parallel patterns have emerged from recent research on parallel programming as powerful, high-level abstractions that can elegantly capture data locality, memory access patterns, and parallelism across a wide range of dense and sparse applications. We motivate Plasticine by first observing key application characteristics captured by parallel patterns that are amenable to hardware acceleration, such as hierarchical parallelism, data locality, memory access patterns, and control flow. Based on these observations, we architect Plasticine as a collection of Pattern Compute Units and Pattern Memory Units. Pattern Compute Units are multi-stage pipelines of reconfigurable SIMD functional units that can efficiently execute nested patterns. Data locality is exploited in Pattern Memory Units using banked scratchpad memories and configurable address decoders. Multiple on-chip address generators and scatter-gather engines make efficient use of DRAM bandwidth by supporting a large number of outstanding memory requests, memory coalescing, and burst mode for dense accesses. Plasticine has an area footprint of 113 mm² in a 28nm process, and consumes a maximum power of 49 W at a 1 GHz clock. Using a cycle-accurate simulator, we demonstrate that Plasticine provides an improvement of up to 76.9× in performance-per-watt over a conventional FPGA over a wide range of dense and sparse applications.
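The nested parallel patterns the abstract refers to (maps and reduces, possibly nested) can be written down as ordinary higher-order functions. The sketch below only illustrates that programming abstraction in plain Python; it is not code for Plasticine's actual tool-chain, and on the real architecture the outer pattern would be spread across Pattern Compute Units with the inner reduce pipelined inside one.

```python
# Nested parallel patterns as higher-order functions: an outer map over rows,
# each row performing an inner map + reduce (a dense matrix-vector product).

from functools import reduce
from typing import List

def pmap(f, xs):                       # "parallel" map (executed sequentially here)
    return [f(x) for x in xs]

def preduce(f, xs, init):              # "parallel" reduce (executed sequentially here)
    return reduce(f, xs, init)

def matvec(A: List[List[float]], x: List[float]) -> List[float]:
    # outer pattern: map over rows; inner pattern: map (multiply) then reduce (sum)
    return pmap(lambda row: preduce(lambda a, b: a + b,
                                    pmap(lambda pair: pair[0] * pair[1], zip(row, x)),
                                    0.0),
                A)

A = [[1.0, 2.0], [3.0, 4.0]]
x = [10.0, 20.0]
print(matvec(A, x))   # [50.0, 110.0]
```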

212 citations