Journal ArticleDOI
Algorithm-Based Fault Tolerance for Matrix Operations
Kuang-Hua Huang, Jacob A. Abraham
TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.
Abstract:
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems. The method proposed can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed to just detect errors in matrix multiplication is also studied.
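The checksum encoding described in the abstract can be illustrated with a minimal sketch (not code from the paper itself): append a column-checksum row to A and a row-checksum column to B, so that their product carries checksums in both dimensions, letting a single erroneous element be located by its failing row and column checks and corrected from the checksum discrepancy.

```python
import numpy as np

def encode_cols(A):
    # Append a checksum row: the sum of each column of A.
    return np.vstack([A, A.sum(axis=0)])

def encode_rows(B):
    # Append a checksum column: the sum of each row of B.
    return np.hstack([B, B.sum(axis=1, keepdims=True)])

A = np.arange(9.0).reshape(3, 3)
B = np.arange(9.0, 18.0).reshape(3, 3)

# The product of the encoded matrices is a "full-checksum" matrix:
# its last row and last column are checksums of the true product A @ B.
C = encode_cols(A) @ encode_rows(B)

# Inject a single fault, as a faulty processor would.
C_bad = C.copy()
C_bad[1, 2] += 5.0

# Detect: find the row and column whose checksums disagree.
bad_rows = np.flatnonzero(~np.isclose(C_bad[:, :-1].sum(axis=1), C_bad[:, -1]))
bad_cols = np.flatnonzero(~np.isclose(C_bad[:-1, :].sum(axis=0), C_bad[-1, :]))

# Correct: a single error is fixed by the column-checksum discrepancy.
i, j = bad_rows[0], bad_cols[0]
C_bad[i, j] -= C_bad[:-1, j].sum() - C_bad[-1, j]
```

After correction, the top-left block of `C_bad` again equals `A @ B`; the key property used is that checksum encoding is preserved by matrix multiplication, so no separate recomputation is needed to check the result.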
Citations
Proceedings Article
Polynomial codes: an optimal design for high-dimensional coded matrix multiplication
TL;DR: This work considers a large-scale matrix multiplication problem where the computation is carried out on a distributed system with a master node and multiple worker nodes, each of which can store parts of the input matrices. It proposes a computation strategy that leverages ideas from coding theory to design the intermediate computations at the worker nodes, efficiently dealing with straggling workers.
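The coding-theoretic idea can be sketched as follows (a simplified illustration under my own block sizes and evaluation points, not the paper's exact construction): split A into row blocks and B into column blocks, give each worker an evaluation of the matrix polynomials Ã(x) = A0 + A1·x and B̃(x) = B0 + B1·x², and recover all four blocks of C = A·B by polynomial interpolation from any 4 worker results, so one straggler can be ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

A0, A1 = A[:2], A[2:]            # split A into two row blocks
B0, B1 = B[:, :2], B[:, 2:]      # split B into two column blocks

# Each worker i evaluates A~(x_i) @ B~(x_i)
#   = A0@B0 + A1@B0 * x + A0@B1 * x^2 + A1@B1 * x^3,
# a degree-3 polynomial per entry, so any 4 evaluations suffice.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # 5 workers, distinct evaluation points
worker_out = [((A0 + x * A1) @ (B0 + x**2 * B1), x) for x in xs]

# Pretend worker 4 straggles: interpolate from the first 4 results.
results = worker_out[:4]
V = np.vander([x for _, x in results], 4, increasing=True)  # rows [1, x, x^2, x^3]
Y = np.stack([r.reshape(-1) for r, _ in results])
coeffs = np.linalg.solve(V, Y)   # row k holds the flattened k-th coefficient block

A0B0, A1B0, A0B1, A1B1 = (c.reshape(2, 2) for c in coeffs)
C = np.block([[A0B0, A0B1], [A1B0, A1B1]])   # equals A @ B
```

The degrees of x in B̃ are chosen so that all four cross-term blocks appear as distinct polynomial coefficients, which is what makes the recovery threshold equal the number of blocks in the product.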
Journal ArticleDOI
Addressing failures in exascale computing
Marc Snir,Robert W. Wisniewski,Jacob A. Abraham,Sarita V. Adve,Saurabh Bagchi,Pavan Balaji,James Belak,Pradip Bose,Franck Cappello,Bill Carlson,Andrew A. Chien,Paul W. Coteus,Nathan DeBardeleben,Pedro C. Diniz,Christian Engelmann,Mattan Erez,Saverio Fazzari,Al Geist,Rinku Gupta,Fred Johnson,Sriram Krishnamoorthy,Sven Leyffer,Dean A. Liberty,Subhasish Mitra,Todd Munson,Robert Schreiber,Jon Stearley,Eric Van Hensbergen +27 more
TL;DR: This report summarizes and builds on the discussions on resilience from the workshop 'Addressing failures in exascale computing' held in Park City, Utah, 4–11 August 2012.
Journal ArticleDOI
Xception: a technique for the experimental evaluation of dependability in modern computers
TL;DR: Experimental results are presented to demonstrate the accuracy and potential of Xception in evaluating the dependability properties of the complex computer systems available nowadays.
Journal ArticleDOI
Toward Exascale Resilience
TL;DR: This white paper synthesizes the motivations, observations, and research issues identified as determinant by several complementary experts in HPC applications, programming models, distributed systems, and system management.
Journal ArticleDOI
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
TL;DR: Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
References
Book
The Art of Computer Programming
Journal ArticleDOI
Error detecting and error correcting codes
TL;DR: The author was led to the study given in this paper from a consideration of large scale computing machines in which a large number of operations must be performed without a single error in the end result.