scispace - formally typeset
Journal ArticleDOI

Algorithm-Based Fault Tolerance for Matrix Operations

TLDR
Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.
Abstract
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix compomations which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems. The method proposed can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed to just detect errors in matrix multiplication is also studied.

read more

Citations
More filters
Proceedings Article

Polynomial codes: an optimal design for high-dimensional coded matrix multiplication

TL;DR: This work considers a large-scale matrix multiplication problem where the computation is carried out using a distributed system with a master node and multiple worker nodes, where each worker can store parts of the input matrices, and proposes a computation strategy that leverages ideas from coding theory to design intermediate computations at the worker nodes to efficiently deal with straggling workers.
Journal ArticleDOI

Xception: a technique for the experimental evaluation of dependability in modern computers

TL;DR: Experimental, results are presented to demonstrate the accuracy and potential of Xception in the evaluation of the dependability properties of the complex computer systems available nowadays.
Journal ArticleDOI

Toward Exascale Resilience

TL;DR: This white paper synthesizes the motivations, observations and research issues considered as determinant of several complimentary experts of HPC in applications, programming models, distributed systems and system management.
Journal ArticleDOI

BLIS: A Framework for Rapidly Instantiating BLAS Functionality

TL;DR: Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
References
More filters
Book

The Art of Computer Programming

TL;DR: The arrangement of this invention provides a strong vibration free hold-down mechanism while avoiding a large pressure drop to the flow of coolant fluid.
Journal ArticleDOI

Error detecting and error correcting codes

TL;DR: The author was led to the study given in this paper from a consideration of large scale computing machines in which a large number of operations must be performed without a single error in the end result.
Journal ArticleDOI

Error-correcting codes

Related Papers (5)