Journal Article

An efficient algorithm for out-of-core matrix transposition

TLDR
This paper proposes an out-of-core matrix transposition algorithm that accounts for both the index computation time and the I/O time, and reduces the overall execution time mainly by eliminating the expensive index computation.
Abstract
Efficient transposition of out-of-core matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in state-of-the-art architectures, the memory-memory data transfer time and the index computation time are also significant components of the overall time. In this paper, we propose an algorithm that considers both the index computation time and the I/O time, and reduces the total execution time by reducing the number of I/O operations and eliminating the index computation. In doing so, two techniques are employed: writing the data onto disk in pre-defined patterns and balancing the number of disk read and write operations. The index computation, an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into read and write buffers. The expensive in-processor permutation is replaced by data collection from the read buffer to the write buffer. Even though this partitioning may increase the number of I/O operations in some cases, it results in an overall reduction in the execution time due to the elimination of the expensive index computation. Our algorithm is analyzed using the well-known linear model and the parallel disk model. The experimental results on a Sun Enterprise, an SGI R12000 and a Pentium III show that our algorithm reduces the overall execution time by up to 50% compared with the best known algorithms in the literature.
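
The contrast the abstract draws between per-element index computation and buffer-to-buffer data collection can be illustrated with a small in-memory sketch. This is not the paper's algorithm: the function names, the `band_rows` parameter, and the use of a Python list to stand in for the on-disk matrix are all assumptions made for illustration. The first function computes a destination index for every element (a division, a remainder, and a multiplication); the second reads a band of rows into a read buffer and collects each column of that band into a write buffer with a strided copy, so no per-element division is needed.

```python
def transpose_by_index(data, rows, cols):
    """Transpose a row-major rows x cols matrix by computing, for every
    element, its destination index in the cols x rows result."""
    out = [None] * (rows * cols)
    for k, value in enumerate(data):
        r, c = divmod(k, cols)       # division and remainder per element
        out[c * rows + r] = value    # multiplication per element
    return out


def transpose_by_collection(data, rows, cols, band_rows):
    """Hypothetical buffer-based variant: read a band of rows into a read
    buffer, then collect each column of the band into a write buffer."""
    out = [None] * (rows * cols)
    for r0 in range(0, rows, band_rows):
        r1 = min(r0 + band_rows, rows)
        # One sequential "read" of band rows r0..r1-1.
        read_buf = data[r0 * cols : r1 * cols]
        for c in range(cols):
            # Strided collection: column c of the band, no div/mod per element.
            write_buf = read_buf[c::cols]
            # One contiguous "write" into row c of the transposed matrix.
            start = c * rows + r0
            out[start : start + len(write_buf)] = write_buf
    return out


if __name__ == "__main__":
    matrix = list(range(12))  # 3 x 4 matrix stored row-major
    print(transpose_by_index(matrix, 3, 4))
    print(transpose_by_collection(matrix, 3, 4, band_rows=2))
```

In this sketch the multiplication survives only once per contiguous run (`c * rows + r0`) rather than once per element; how the actual algorithm lays out its buffers and disk patterns is described in the paper itself.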


Citations
Journal Article

vLOD: high-fidelity walkthrough of large virtual environments

TL;DR: A novel feature of this walkthrough system is that it performs work proportional only to the required detail in visible geometry at the rendering time, and uses a precomputation phase that efficiently generates per cell vLOD: the geometry visible from a view-region at the right level of detail.
Book Chapter

Generating SIMD vectorized permutations

TL;DR: A method is introduced to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions, for high-performance numerical kernels including the fast Fourier transform.
Journal Article

Efficient parallel out-of-core matrix transposition

TL;DR: An algorithm that directly targets the improvement of overall transposition time is proposed and the I/O characteristics of the system are used to determine the read, write and communication block sizes such that the total execution time is minimised.
Journal Article

Enhancing the matrix transpose operation using Intel AVX instruction set extension

TL;DR: This paper presents a novel vector-based matrix transpose algorithm and its optimized implementation using AVX instructions, and demonstrates a speedup of 2.83 over the standard sequential implementation and a maximum speedup of 1.53 over the GCC library implementation.
References
Book

Introduction to parallel computing: design and analysis of algorithms

TL;DR: Performance and Scalability of Parallel Systems, General Issues in Mapping Systolic Systems Onto Parallel Computers, and Speedup Anomalies in Parallel Search Algorithms.
Journal Article

RAID: high-performance, reliable secondary storage

TL;DR: A comprehensive overview of disk array technology and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency are discussed.
Journal Article

The input/output complexity of sorting and related problems

TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Journal Article

Computability of Recursive Functions

TL;DR: One half of this equivalence, that all functions computable by any finite, discrete, deterministic device supplied with unlimited storage are partial recursive, is relatively straightforward once the elements of recursive function theory have been established.
Journal Article

Algorithms for parallel memory, I: Two-level memories

TL;DR: In this article, the authors provided the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for sorting, FFT, matrix transposition, standard matrix multiplication, and related problems.