Journal ArticleDOI
Algorithm-Based Fault Tolerance for Matrix Operations
Kuang-Hua Huang, Jacob A. Abraham
TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.
Abstract:
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems. The method proposed can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed to just detect errors in matrix multiplication is also studied.
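The checksum encoding described in the abstract can be illustrated with a minimal sketch (not code from the paper itself): append a column-checksum row to A and a row-checksum column to B, so that their product carries checksums in both dimensions, letting a single erroneous element be located by its failing row and column checks and corrected from the checksum discrepancy.

```python
import numpy as np

def encode_cols(A):
    # Append a checksum row: the sum of each column of A.
    return np.vstack([A, A.sum(axis=0)])

def encode_rows(B):
    # Append a checksum column: the sum of each row of B.
    return np.hstack([B, B.sum(axis=1, keepdims=True)])

A = np.arange(9.0).reshape(3, 3)
B = np.arange(9.0, 18.0).reshape(3, 3)

# The product of the encoded matrices is a "full-checksum" matrix:
# its last row and last column are checksums of the true product A @ B.
C = encode_cols(A) @ encode_rows(B)

# Inject a single fault, as a faulty processor would.
C_bad = C.copy()
C_bad[1, 2] += 5.0

# Detect: find the row and column whose checksums disagree.
bad_rows = np.flatnonzero(~np.isclose(C_bad[:, :-1].sum(axis=1), C_bad[:, -1]))
bad_cols = np.flatnonzero(~np.isclose(C_bad[:-1, :].sum(axis=0), C_bad[-1, :]))

# Correct: a single error is fixed by the column-checksum discrepancy.
i, j = bad_rows[0], bad_cols[0]
C_bad[i, j] -= C_bad[:-1, j].sum() - C_bad[-1, j]
```

After correction, the top-left block of `C_bad` again equals `A @ B`; the key property used is that checksum encoding is preserved by matrix multiplication, so no separate recomputation is needed to check the result.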
Citations
Proceedings Article
Polynomial codes: an optimal design for high-dimensional coded matrix multiplication
TL;DR: This work considers a large-scale matrix multiplication problem where the computation is carried out on a distributed system with a master node and multiple worker nodes, each of which can store parts of the input matrices. It proposes a computation strategy that leverages ideas from coding theory to design the intermediate computations at the worker nodes, efficiently dealing with straggling workers.
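The coding-theoretic idea can be sketched as follows (a simplified illustration under my own block sizes and evaluation points, not the paper's exact construction): split A into row blocks and B into column blocks, give each worker an evaluation of the matrix polynomials Ã(x) = A0 + A1·x and B̃(x) = B0 + B1·x², and recover all four blocks of C = A·B by polynomial interpolation from any 4 worker results, so one straggler can be ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

A0, A1 = A[:2], A[2:]            # split A into two row blocks
B0, B1 = B[:, :2], B[:, 2:]      # split B into two column blocks

# Each worker i evaluates A~(x_i) @ B~(x_i)
#   = A0@B0 + A1@B0 * x + A0@B1 * x^2 + A1@B1 * x^3,
# a degree-3 polynomial per entry, so any 4 evaluations suffice.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # 5 workers, distinct evaluation points
worker_out = [((A0 + x * A1) @ (B0 + x**2 * B1), x) for x in xs]

# Pretend worker 4 straggles: interpolate from the first 4 results.
results = worker_out[:4]
V = np.vander([x for _, x in results], 4, increasing=True)  # rows [1, x, x^2, x^3]
Y = np.stack([r.reshape(-1) for r, _ in results])
coeffs = np.linalg.solve(V, Y)   # row k holds the flattened k-th coefficient block

A0B0, A1B0, A0B1, A1B1 = (c.reshape(2, 2) for c in coeffs)
C = np.block([[A0B0, A0B1], [A1B0, A1B1]])   # equals A @ B
```

The degrees of x in B̃ are chosen so that all four cross-term blocks appear as distinct polynomial coefficients, which is what makes the recovery threshold equal the number of blocks in the product.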
Journal ArticleDOI
Addressing failures in exascale computing
Marc Snir,Robert W. Wisniewski,Jacob A. Abraham,Sarita V. Adve,Saurabh Bagchi,Pavan Balaji,James Belak,Pradip Bose,Franck Cappello,Bill Carlson,Andrew A. Chien,Paul W. Coteus,Nathan DeBardeleben,Pedro C. Diniz,Christian Engelmann,Mattan Erez,Saverio Fazzari,Al Geist,Rinku Gupta,Fred Johnson,Sriram Krishnamoorthy,Sven Leyffer,Dean A. Liberty,Subhasish Mitra,Todd Munson,Robert Schreiber,Jon Stearley,Eric Van Hensbergen +27 more
TL;DR: This report summarizes and builds on the discussions on resilience from the workshop 'Addressing failures in exascale computing' held in Park City, Utah, 4–11 August 2012.
Journal ArticleDOI
Xception: a technique for the experimental evaluation of dependability in modern computers
TL;DR: Experimental results are presented to demonstrate the accuracy and potential of Xception in evaluating the dependability properties of the complex computer systems available nowadays.
Journal ArticleDOI
Toward Exascale Resilience
TL;DR: This white paper synthesizes the motivations, observations, and research issues identified as determinant by several complementary experts in HPC applications, programming models, distributed systems, and system management.
Journal ArticleDOI
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
TL;DR: Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
References
Book
The Art of Computer Programming
Journal ArticleDOI
Error detecting and error correcting codes
TL;DR: The author was led to the study given in this paper from a consideration of large scale computing machines in which a large number of operations must be performed without a single error in the end result.