Home
/
Authors
/
Siva Kumar Sastry Hari

Author

Siva Kumar Sastry Hari

Other affiliations: Indian Institute of Technology Madras, University of Illinois at Urbana–Champaign

Bio: Siva Kumar Sastry Hari is an academic researcher from Nvidia. The author has contributed to research in topics: Fault injection & Computer science. The author has an hindex of 21, co-authored 40 publications receiving 1610 citations. Previous affiliations of Siva Kumar Sastry Hari include Indian Institute of Technology Madras & University of Illinois at Urbana–Champaign.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Understanding error propagation in deep learning neural network (DNN) accelerators and applications

[...]

Guanpeng Li¹, Siva Kumar Sastry Hari², Michael J. Sullivan², Timothy Tsai², Karthik Pattabiraman¹, Joel Emer², Stephen W. Keckler² - Show less +3 more•Institutions (2)

University of British Columbia¹, Nvidia²

12 Nov 2017

TL;DR: It is found that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design, and two efficient protection techniques are proposed.

...read moreread less

Abstract: Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high-performance and energy efficiency. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and the DNN accelerator's architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. In this paper, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems.

...read moreread less

414 citations

Proceedings Article•DOI•

Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults

[...]

Siva Kumar Sastry Hari¹, Sarita V. Adve¹, Helia Naeimi², Pradeep Ramachandran²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Intel²

03 Mar 2012

TL;DR: Relyzer is presented, an approach that systematically analyzes all application fault sites and carefully picks a small subset to perform selective fault injections for transient faults, and employs novel fault pruning techniques that prune faults that need detailed study by either predicting their outcomes or showing them equivalent to other faults.

...read moreread less

Abstract: Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but there remains a non-negligible risk that several faults may escape the symptom detectors and result in silent data corruptions (SDCs). Most prior evaluations of symptom-based detectors perform fault injection campaigns on application benchmarks, where each run simulates the impact of a fault injected at a hardware site at a certain point in the application's execution (application fault site). Since the total number of application fault sites is very large (trillions for standard benchmark suites), it is not feasible to study all possible faults. Previous work therefore typically studies a randomly selected sample of faults. Such studies do not provide any feedback on the portions of the application where faults were not injected. Some of those instructions may be vulnerable to SDCs, and identifying them could allow protecting them through other means if needed.This paper presents Relyzer, an approach that systematically analyzes all application fault sites and carefully picks a small subset to perform selective fault injections for transient faults. Relyzer employs novel fault pruning techniques that prune faults that need detailed study by either predicting their outcomes or showing them equivalent to other faults. We find that Relyzer prunes about 99.78% of the total faults across twelve applications studied here, reducing the faults that require detailed simulation by 3 to 5 orders of magnitude for most of the applications. Fault injection simulations on the remaining faults can identify SDC causing faults in the entire application. Some of Relyzer's techniques rely on heuristics to determine fault equivalence. Our validation efforts show that Relyzer determines fault outcomes with 96% accuracy, averaged across all the applications studied here.

...read moreread less

162 citations

Proceedings Article•DOI•

Low-cost program-level detectors for reducing silent data corruptions

[...]

Siva Kumar Sastry Hari¹, Sarita V. Adve¹, Helia Naeimi²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Intel²

25 Jun 2012

TL;DR: Detailed analysis of code sections that produce over 90% of Silent Data Corruption rates facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs.

...read moreread less

Abstract: With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques is relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.

...read moreread less

126 citations

Proceedings Article•DOI•

Architectures for online error detection and recovery in multicore processors

[...]

Dimitris Gizopoulos¹, Mihalis Psarakis¹, Sarita V. Adve², Pradeep Ramachandran², Siva Kumar Sastry Hari², Daniel J. Sorin³, Albert Meixner³, Arijit Biswas⁴, Xavier Vera⁴ - Show less +5 more•Institutions (4)

University of Piraeus¹, University of Illinois at Urbana–Champaign², Duke University³, Intel⁴

14 Mar 2011

TL;DR: This paper focuses on dependable multicore processor architectures that integrate solutions for online error detection, diagnosis, recovery, and repair during field operation and discusses taxonomy of representative approaches and presents a qualitative comparison based on hardware cost, performance overhead, types of faults detected, and detection latency.

...read moreread less

Abstract: The huge investment in the design and production of multicore processors may be put at risk because the emerging highly miniaturized but unreliable fabrication technologies will impose significant barriers to the life-long reliable operation of future chips. Extremely complex, massively parallel, multi-core processor chips fabricated in these technologies will become more vulnerable to: (a) environmental disturbances that produce transient (or soft) errors, (b) latent manufacturing defects as well as aging/wearout phenomena that produce permanent (or hard) errors, and (c) verification inefficiencies that allow important design bugs to escape in the system. In an effort to cope with these reliability threats, several research teams have recently proposed multicore processor architectures that provide low-cost dependability guarantees against hardware errors and design bugs. This paper focuses on dependable multicore processor architectures that integrate solutions for online error detection, diagnosis, recovery, and repair during field operation. It discusses taxonomy of representative approaches and presents a qualitative comparison based on: hardware cost, performance overhead, types of faults detected, and detection latency. It also describes in more detail three recently proposed effective architectural approaches: a software-anomaly detection technique (SWAT), a dynamic verification technique (Argus), and a core salvaging methodology.

...read moreread less

122 citations

Proceedings Article•DOI•

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation

[...]

Siva Kumar Sastry Hari¹, Timothy Tsai¹, Mark Stephenson¹, Stephen W. Keckler¹, Joel Emer¹ - Show less +1 more•Institutions (1)

Nvidia¹

24 Apr 2017

TL;DR: This paper presents an error injection-based methodology and tool called SASSIFI to study the soft error resilience of massively parallel applications running on state-of-the-art NVIDIA GPUs.

...read moreread less

Abstract: As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors caused by high-energy particle strikes will grow increasingly important. GPU designers must develop tools and techniques to understand the effect of these soft errors on applications. This paper presents an error injection-based methodology and tool called SASSIFI to study the soft error resilience of massively parallel applications running on state-of-the-art NVIDIA GPUs. Our approach uses a low-level assembly-language instrumentation tool called SASSI to profile and inject errors. SASSI provides efficiency by allowing instrumentation code to execute entirely on the GPU and provides the ability to inject into different architecture-visible state. For example, SASSIFI can inject errors in general-purpose registers, GPU memory, condition code registers, and predicate registers. SASSIFI can also inject errors into addresses and register indices. In this paper, we describe the SASSIFI tool, its capabilities, and present experiments to illustrate some of the analyses SASSIFI can be used to perform.

...read moreread less

117 citations

1
2
3
4
…
5
6
7
8
9
10

Collapse

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

Understanding error propagation in deep learning neural network (DNN) accelerators and applications

[...]

Guanpeng Li¹, Siva Kumar Sastry Hari², Michael J. Sullivan², Timothy Tsai², Karthik Pattabiraman¹, Joel Emer², Stephen W. Keckler² - Show less +3 more•Institutions (2)

University of British Columbia¹, Nvidia²

12 Nov 2017

TL;DR: It is found that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design, and two efficient protection techniques are proposed.

...read moreread less

414 citations

Journal Article•DOI•

Addressing failures in exascale computing

[...]

Marc Snir¹, Robert W. Wisniewski², Jacob A. Abraham³, Sarita V. Adve⁴, Saurabh Bagchi⁵, Pavan Balaji¹, James Belak⁶, Pradip Bose⁷, Franck Cappello¹, Bill Carlson, Andrew A. Chien⁸, Paul W. Coteus⁷, Nathan DeBardeleben⁹, Pedro C. Diniz¹⁰, Christian Engelmann¹¹, Mattan Erez³, Saverio Fazzari¹², Al Geist¹¹, Rinku Gupta¹, Fred Johnson¹³, Sriram Krishnamoorthy¹⁴, Sven Leyffer¹, Dean A. Liberty¹⁵, Subhasish Mitra¹⁶, Todd Munson¹, Robert Schreiber¹⁷, Jon Stearley¹⁸, Eric Van Hensbergen - Show less +24 more•Institutions (18)

Argonne National Laboratory¹, Intel², University of Texas at Austin³, University of Illinois at Urbana–Champaign⁴, Purdue University⁵, Lawrence Livermore National Laboratory⁶, IBM⁷, University of Chicago⁸, Los Alamos National Laboratory⁹, Information Sciences Institute¹⁰, Oak Ridge National Laboratory¹¹, Booz Allen Hamilton¹², Science Applications International Corporation¹³, Pacific Northwest National Laboratory¹⁴, Advanced Micro Devices¹⁵, Stanford University¹⁶, Hewlett-Packard¹⁷, Sandia National Laboratories¹⁸

01 May 2014

TL;DR: This report presents a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012, which summarizes and builds on discussions on resilience.

...read moreread less

Abstract: We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.

...read moreread less

406 citations

Journal Article•DOI•

Machine Learning Testing: Survey, Landscapes and Horizons

[...]

Jie Zhang¹, Mark Harman¹, Lei Ma², Yang Liu³•Institutions (3)

University College London¹, Kyushu University², Nanyang Technological University³

17 Feb 2020-IEEE Transactions on Software Engineering

TL;DR: A comprehensive survey of machine learning testing can be found in this article, which covers 138 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (i.e., the data, learning program, and framework), testing workflow, and application scenarios.

...read moreread less

Abstract: This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 138 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in machine learning testing.

...read moreread less

343 citations

Advances in neural information processing systems 3

[...]

Richard P. Lippmann, David S. Touretzky

01 Jan 1991

259 citations

Proceedings Article•DOI•

Ares: a framework for quantifying the resilience of deep neural networks

[...]

Brandon Reagen¹, Udit Gupta¹, Lillian Pentecost¹, Paul N. Whatmough¹, Sae Kyu Lee¹, Niamh Mulholland¹, David Brooks¹, Gu-Yeon Wei¹ - Show less +4 more•Institutions (1)

Harvard University¹

24 Jun 2018

TL;DR: This paper presents Ares: a light-weight, DNN-specific fault injection framework validated within 12% of real hardware, and finds that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure.

...read moreread less

Abstract: As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a light-weight, DNN-specific fault injection framework validated within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure.

...read moreread less

249 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse