
Showing papers in "IEEE Transactions on Parallel and Distributed Systems in 1997"


Journal ArticleDOI
TL;DR: This work presents efficient algorithms for two all-to-all communication operations in message-passing systems, index and concatenation, with complexity measures based on the communication start-up time and the communication bandwidth.
Abstract: We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth.
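To make the two operations concrete, the sketch below simulates their end results in plain Python (the function names are ours, not the paper's): index is an all-to-all personalized exchange, i.e. a transpose of message blocks, and concatenation is an all-to-all broadcast.

```python
def index_operation(blocks):
    """All-to-all personalized communication: blocks[i][j] is the message processor i
    holds for processor j; result[j][i] is what processor j receives from processor i
    (a transpose of the block matrix)."""
    p = len(blocks)
    return [[blocks[i][j] for i in range(p)] for j in range(p)]

def concatenation_operation(blocks):
    """All-to-all broadcast: every processor ends up with every processor's block."""
    return [list(blocks) for _ in range(len(blocks))]

# Example with p = 3 processors:
# index_operation([["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]])
# -> [['a0', 'b0', 'c0'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
```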

333 citations


Journal ArticleDOI
TL;DR: The honeycomb mesh, based on hexagonal plane tessellation, is considered as a multiprocessor interconnection network; honeycomb networks with a rhombus and a rectangle as the bounding polygons are also considered.
Abstract: The honeycomb mesh, based on hexagonal plane tessellation, is considered as a multiprocessor interconnection network. A honeycomb mesh network with n nodes has degree 3 and diameter ≈ 1.63√n − 1, which is a 25 percent smaller degree and an 18.5 percent smaller diameter than the mesh-connected computer with approximately the same number of nodes. A vertex- and edge-symmetric honeycomb torus network is obtained by adding wraparound edges to the honeycomb mesh. The network cost, defined as the product of degree and diameter, is better for honeycomb networks than for the two other families based on square (mesh-connected computers and tori) and triangular (hexagonal meshes and tori) tessellations. A convenient addressing scheme for nodes is introduced which provides simple computation of shortest paths and the diameter. Simple and optimal (in the number of required communication steps) routing, broadcasting, and semigroup computation algorithms are developed. The average distance in a honeycomb torus with n nodes is proved to be approximately 0.54√n. In addition to honeycomb meshes bounded by a regular hexagon, we also consider honeycomb networks with a rhombus and a rectangle as the bounding polygons.
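A quick back-of-the-envelope comparison of the network cost (degree × diameter) using only the figures quoted above for the honeycomb mesh; the square-mesh values (degree 4, diameter 2(√n − 1)) are the standard textbook ones, not numbers from this abstract.

```python
import math

def honeycomb_cost(n):
    """Degree x diameter for a honeycomb mesh with n nodes, using the abstract's figures."""
    degree, diameter = 3, 1.63 * math.sqrt(n) - 1
    return degree * diameter

def square_mesh_cost(n):
    """Degree x diameter for a sqrt(n) x sqrt(n) mesh-connected computer (textbook values)."""
    side = math.sqrt(n)
    degree, diameter = 4, 2 * (side - 1)
    return degree * diameter

for n in (256, 1024, 4096):
    print(n, round(honeycomb_cost(n), 1), round(square_mesh_cost(n), 1))
```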

300 citations


Journal ArticleDOI
TL;DR: The first algorithms to factor a wide class of sparse matrices that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures are presented.
Abstract: In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
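For readers who want the baseline operation in code, here is a minimal serial dense Cholesky factorization (A = L·Lᵀ). It only fixes what "factorization" means here; the paper's contribution, a scalable parallel formulation for sparse matrices, is not reproduced.

```python
def cholesky(a):
    """Serial dense Cholesky of a symmetric positive definite matrix a (list of lists).
    Returns the lower-triangular factor L with a = L * L^T."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = diag ** 0.5
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

# Example: cholesky([[4.0, 2.0], [2.0, 3.0]]) -> [[2.0, 0.0], [1.0, 1.414...]]
```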

239 citations


Journal ArticleDOI
TL;DR: This paper presents and analyzes techniques for automatically translating the overall deadline into deadlines for the individual subtasks in a real-time system.
Abstract: In a distributed environment, tasks often have processing demands at multiple different sites. A distributed task is usually divided into several subtasks, each to be executed in order at some site. In a real-time system, an overall deadline is usually specified by an application designer indicating when a distributed task is to be finished. In this paper, we present and analyze techniques for automatically translating the overall deadline into deadlines for the individual subtasks.
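As a hedged illustration of what "translating" a deadline means, the sketch below splits an end-to-end deadline among subtasks in proportion to their expected execution times. This is a generic proportional scheme, not necessarily one of the techniques analyzed in the paper.

```python
def proportional_subdeadlines(overall_deadline, expected_times):
    """Assign each subtask a cumulative deadline by dividing the overall deadline
    in proportion to the subtasks' expected execution times."""
    total = sum(expected_times)
    deadlines, elapsed = [], 0.0
    for t in expected_times:
        elapsed += overall_deadline * t / total
        deadlines.append(elapsed)
    return deadlines

# Example: proportional_subdeadlines(100.0, [10, 30, 10]) -> [20.0, 80.0, 100.0]
```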

220 citations


Journal ArticleDOI
TL;DR: This work introduces self-stabilizing protocols for synchronization that are used as building blocks by the leader-election algorithm and presents a simple, uniform, self-Stabilizing ranking protocol.
Abstract: A distributed system is self-stabilizing if it can be started in any possible global state. Once started, the system regains its consistency by itself, without any kind of outside intervention. The self-stabilization property makes the system tolerant to faults in which processors exhibit faulty behavior for a while and then recover spontaneously in an arbitrary state. When the intermediate period between one recovery and the next faulty period is long enough, the system stabilizes. A distributed system is uniform if all processors with the same number of neighbors are identical. A distributed system is dynamic if it can tolerate addition or deletion of processors and links without reinitialization. In this work, we study uniform dynamic self-stabilizing protocols for leader election under read/write atomicity. Our protocols use randomization to break symmetry. The leader election protocol stabilizes in O(ΔD log n) time when the number of processors is unknown, and in O(ΔD) time otherwise. Here Δ denotes the maximal degree of a node, D denotes the diameter of the graph, and n denotes the number of processors in the graph. We introduce self-stabilizing protocols for synchronization that are used as building blocks by the leader-election algorithm. We conclude this work by presenting a simple, uniform, self-stabilizing ranking protocol.

208 citations


Journal ArticleDOI
TL;DR: A scheme that provides fault tolerance through scheduling in real-time multiprocessor systems, using two techniques called deallocation and overloading to achieve a high acceptance ratio (percentage of arriving tasks scheduled by the system).
Abstract: Real-time systems are increasingly being used in several applications which are time critical in nature. Fault tolerance is an important requirement of such systems, due to the catastrophic consequences of not tolerating faults. We study a scheme that provides fault tolerance through scheduling in real-time multiprocessor systems. We schedule multiple copies of dynamic, aperiodic, nonpreemptive tasks in the system, and use two techniques that we call deallocation and overloading to achieve a high acceptance ratio (the percentage of arriving tasks scheduled by the system). The paper compares the performance of our scheme with that of other fault-tolerant scheduling schemes, and determines how much deallocation and overloading each contribute to the acceptance ratio of tasks. The paper also provides a technique that can help real-time system designers determine the number of processors required to provide fault tolerance in dynamic systems. Lastly, a formal model is developed for the analysis of systems with uniform tasks.

200 citations


Journal ArticleDOI
TL;DR: It is shown that noncontiguous allocation algorithms perform better overall than the contiguous ones, even when message-passing contention is considered; results of experiments on an Intel Paragon XP/S-15 with 208 nodes show that noncontiguous allocation is feasible with current technologies.
Abstract: Current processor allocation techniques for highly parallel systems are typically restricted to contiguous allocation strategies for which performance suffers significantly due to the inherent problem of fragmentation. As a result, message-passing systems have yet to achieve the high utilization levels exhibited by traditional vector supercomputers. We are investigating processor allocation algorithms which lift the restriction on contiguity of processors in order to address the problem of fragmentation. Three noncontiguous processor allocation strategies-paging allocation, random allocation, and the Multiple Buddy Strategy (MBS)-are proposed and studied in this paper. Simulations compare the performance of the noncontiguous strategies with that of several well-known contiguous algorithms. We show that noncontiguous allocation algorithms perform better overall than the contiguous ones, even when message-passing contention is considered. We also present the results of experiments on an Intel Paragon XP/S-15 with 208 nodes that show noncontiguous allocation is feasible with current technologies.
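The simplest of the noncontiguous policies mentioned above, random allocation, is easy to sketch: any k free processors satisfy a request, so fragmentation alone can never cause a rejection while enough processors are free. The sketch below is our own illustration; the paging and MBS strategies in the paper are more structured.

```python
import random

def random_allocate(free_nodes, k):
    """Grant a k-processor request from any k free nodes (noncontiguous), or refuse it
    only when fewer than k nodes are free. free_nodes is a mutable set of node ids."""
    if len(free_nodes) < k:
        return None
    granted = random.sample(sorted(free_nodes), k)
    free_nodes.difference_update(granted)
    return granted
```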

140 citations


Journal ArticleDOI
TL;DR: The fault diameter of k-ary n-cube interconnection networks (also known as n-dimensional k-torus networks) is obtained and shown to be Δ+1, where Δ is the fault-free diameter.
Abstract: We obtain the fault diameter of k-ary n-cube interconnection networks (also known as n-dimensional k-torus networks). We start by constructing a complete set of node-disjoint paths (i.e., as many paths as the degree) between any two nodes of a k-ary n-cube. Each of the obtained paths is of length zero, two, or four plus the minimum length, except for one path in a special case (when the Hamming distance between the two nodes is one), where the increase over the minimum length may reach eight. These results improve those obtained by B. Bose et al. (1995), where the length of some of the paths has a variable increase (which can be arbitrarily large) over the minimum length. These results are then used to derive the fault diameter of the k-ary n-cube, which is shown to be Δ+1, where Δ is the fault-free diameter.
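The distance measure underlying these results is easy to state in code. The sketch below computes the shortest-path (Lee) distance between two nodes of a k-ary n-cube and the fault-free diameter Δ; per the abstract, the fault diameter is Δ + 1.

```python
def karyncube_distance(u, v, k):
    """Shortest-path distance between nodes u and v of a k-ary n-cube, where u and v are
    n-tuples of radix-k digits; each dimension contributes the shorter way around its ring."""
    return sum(min(abs(a - b), k - abs(a - b)) for a, b in zip(u, v))

def fault_free_diameter(n, k):
    """Fault-free diameter Delta of the k-ary n-cube: n * floor(k / 2)."""
    return n * (k // 2)

# Example: in a 4-ary 2-cube, nodes (0, 0) and (3, 2) are at distance 1 + 2 = 3,
# and the fault-free diameter is 2 * 2 = 4 (so the fault diameter is 5).
```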

139 citations


Journal ArticleDOI
TL;DR: In this article, the authors present new techniques to allow fusion of loop nests in the presence of fusion-preventing dependences, maintain parallelism and allow the parallel execution of fused loops with minimal synchronization.
Abstract: Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which prevent parallelism. In addition, performance losses result from cache conflicts in fused loops. In this paper, we present new techniques to: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) maintain parallelism and allow the parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We describe algorithms for implementing these techniques in compilers. The techniques are evaluated on a 56-processor KSR2 multiprocessor and on an 18-processor Convex SPP-1000 multiprocessor. The results demonstrate performance improvements for both kernels and complete applications. The results also indicate that careful evaluation of the profitability of fusion is necessary as more processors are used.
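A minimal before/after example of the transformation being discussed, written as illustrative Python stand-ins for the compiler's loop nests; the paper's techniques for handling fusion-preventing dependences and eliminating cache conflicts are not shown.

```python
def unfused(a, b, c, n):
    # Two separate passes: b is written in the first loop and re-read in the second,
    # so its elements may have left the cache in between.
    for i in range(n):
        b[i] = 2.0 * a[i]
    for i in range(n):
        c[i] = b[i] + 1.0

def fused(a, b, c, n):
    # One pass: b[i] is consumed while still in cache. Fusion is legal here because the
    # second statement only reads b[i]; reading b[i - 1] instead would introduce a
    # loop-carried dependence and serialize the fused loop.
    for i in range(n):
        b[i] = 2.0 * a[i]
        c[i] = b[i] + 1.0
```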

133 citations


Journal ArticleDOI
TL;DR: This paper considers an injured star graph with some faulty links and nodes, and shows that even with f_e ≤ n−3 faulty links a Hamiltonian cycle can still be found in an n-star, and that the embedding is able to establish a ring containing at least n! − 4f_v nodes.
Abstract: The star graph interconnection network has been recognized as an attractive alternative to the hypercube network. Previously, the star graph has been shown to contain a Hamiltonian cycle. In this paper, we consider an injured star graph with some faulty links and nodes. We show that even with f_e ≤ n−3 faulty links, a Hamiltonian cycle can still be found in an n-star, and that with f_v ≤ n−3 faulty nodes, a ring containing at most 4f_v fewer nodes than a Hamiltonian cycle can be found (i.e., the ring contains at least n! − 4f_v nodes). In general, in an n-star with f_e faulty links and f_v faulty nodes, where f_e + f_v ≤ n−3, our embedding is able to establish a ring containing at least n! − 4f_v nodes.

121 citations


Journal ArticleDOI
TL;DR: This paper analyzes the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom, and proposes a sufficient condition for channel redundancy, also computing the set of redundant channels.
Abstract: Fault-tolerant systems aim at providing continuous operation in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyzes the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level. We propose a sufficient condition for channel redundancy, also computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies its value. This theory is developed on top of our necessary and sufficient condition for deadlock-free adaptive routing. The new theory also considers the failure of physical channels when virtual channels are used. Finally, we propose a methodology for the design of fault-tolerant routing algorithms, showing its application to n-dimensional meshes.

Journal ArticleDOI
TL;DR: These results show that the hardware required for CR and FCR networks is modest and that CR and FCR can achieve performance superior to alternatives such as dimension-order routing; their advantages not only simplify hardware support for adaptive routing and fault tolerance, they can also simplify software communication layers.
Abstract: Compressionless routing (CR) is an adaptive routing framework that provides unified support for efficient deadlock-free adaptive routing and fault tolerance. CR exploits the tight coupling between wormhole routers for flow control to detect and recover from potential deadlock situations. Fault-tolerant compressionless routing (FCR) extends CR to support end-to-end fault-tolerant delivery. Detailed routing algorithms, implementation complexity, and performance simulation results for CR and FCR are presented. These results show that the hardware required for CR and FCR networks is modest. Further, CR and FCR networks can achieve performance superior to alternatives such as dimension-order routing. Compressionless routing has several key advantages: deadlock-free adaptive routing in toroidal networks with no virtual channels, simple router designs, order-preserving message transmission, applicability to a wide variety of network topologies, and elimination of the need for buffer allocation messages. Fault-tolerant compressionless routing has several additional advantages: data integrity in the presence of transient faults (nonstop fault tolerance), permanent fault tolerance, and elimination of the need for software buffering and retry for reliability. The advantages of CR and FCR not only simplify hardware support for adaptive routing and fault tolerance, they can also simplify software communication layers.

Journal ArticleDOI
Alexander Thomasian1, Jai Menon1
TL;DR: This work analyzes the performance of RAID5 with distributed sparing in normal mode, degraded mode, and rebuild mode in an OLTP environment, which implies small reads and writes.
Abstract: Distributed sparing is a method to improve the performance of RAID5 disk arrays with respect to a dedicated sparing system with N+2 disks (including the spare disk), since it utilizes the bandwidth of all N+2 disks. We analyze the performance of RAID5 with distributed sparing in normal mode, degraded mode, and rebuild mode in an OLTP environment, which implies small reads and writes. The analysis in normal mode uses an M/G/1 queuing model, which takes into account the components of disk service time. In degraded mode, a low-cost approximate method is developed to estimate the mean response time of fork-join requests resulting from accesses to recreate lost data on the failed disk. Rebuild mode performance is analyzed by considering an M/G/1 vacationing server model with multiple vacations of different types to take into account differences in processing requirements for reading the first and subsequent tracks. An iterative solution method is used to estimate the mean response time of disk requests, as well as the time to read each disk, which is shown to be quite accurate through validation against simulation results. We next compare RAID5 performance in a system (1) without a cache; (2) with a cache; and (3) with a nonvolatile storage (NVS) cache. The last configuration, in addition to improved read response time due to cache hits, provides a fast-write capability, such that dirty blocks can be destaged asynchronously and at a lower priority than read requests, resulting in an improvement in read response time. The small write penalty is also reduced due to the possibility of repeated writes to dirty blocks in the cache and by taking advantage of disk geometry to efficiently destage multiple blocks at a time.
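For reference, the normal-mode building block mentioned above, an M/G/1 queue, has a closed-form mean response time (the Pollaczek-Khinchine formula). The sketch below evaluates it for hypothetical disk parameters; the paper's actual decomposition of disk service time into its components is not reproduced.

```python
def mg1_mean_response_time(arrival_rate, mean_service, second_moment):
    """Mean response time of an M/G/1 queue: E[R] = E[S] + lambda * E[S^2] / (2 * (1 - rho)),
    with utilization rho = lambda * E[S]."""
    rho = arrival_rate * mean_service
    if rho >= 1.0:
        raise ValueError("queue is unstable (utilization >= 1)")
    waiting = arrival_rate * second_moment / (2.0 * (1.0 - rho))
    return mean_service + waiting

# Hypothetical numbers: 40 requests/s, 15 ms mean service time, E[S^2] = 3.0e-4 s^2.
print(mg1_mean_response_time(40.0, 0.015, 3.0e-4))   # ~0.03 s
```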

Journal ArticleDOI
TL;DR: The cross product is studied as a method for generating and analyzing interconnection network topologies for multiprocessor systems, giving a new tool for further studying known interconnection topologies, such as the hypercube and the mesh.
Abstract: We study the cross product as a method for generating and analyzing interconnection network topologies for multiprocessor systems. Consider two interconnection graphs G_1 and G_2, each with some established properties such as symmetry, low degree and diameter, scalability, simple optimal routing, recursive structure (partitionability), fault tolerance, existence of node-disjoint paths, low-cost embedding, and efficient broadcasting. We investigate and evaluate the corresponding properties for the cross product of G_1 and G_2 based on the properties of G_1 and those of G_2. We also give a mathematical characterization of product families of graphs which are closed under the cross product operation. This investigation is useful in two ways. On one hand, it gives a new tool for further studying some of the known interconnection topologies, such as the hypercube and the mesh, which can be defined using the cross product operation. On the other hand, it can be used in defining and evaluating new interconnection graphs using the cross product operation on known topologies.
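A small constructive sketch, assuming the paper's cross product is the familiar product in which (u1, u2) and (v1, v2) are adjacent exactly when the nodes agree in one coordinate and are adjacent in the other; this recovers, for example, a 2D mesh as the product of two paths and the hypercube Q_{m+n} as the product of Q_m and Q_n.

```python
def graph_product(nodes1, edges1, nodes2, edges2):
    """Product graph: nodes are pairs (a, c); {(a, c), (b, c)} is an edge when {a, b} is an
    edge of G1, and {(a, c), (a, d)} is an edge when {c, d} is an edge of G2."""
    nodes = [(a, c) for a in nodes1 for c in nodes2]
    edges = set()
    for (a, b) in edges1:
        for c in nodes2:
            edges.add(frozenset({(a, c), (b, c)}))
    for (c, d) in edges2:
        for a in nodes1:
            edges.add(frozenset({(a, c), (a, d)}))
    return nodes, edges

# Example: the product of two 3-node paths is the 3 x 3 mesh (9 nodes, 12 edges).
path = ([0, 1, 2], [(0, 1), (1, 2)])
mesh_nodes, mesh_edges = graph_product(*path, *path)
assert len(mesh_nodes) == 9 and len(mesh_edges) == 12
```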

Journal ArticleDOI
TL;DR: It is proved that the order of retiming and unfolding is immaterial for scheduling a data-flow graph (DFG), and a polynomial-time algorithm is presented on the original DFG, before unfolding, to find the minimum-rate static schedule for a given unfolding factor.
Abstract: Loop scheduling is an important problem in parallel processing. The retiming technique reorganizes an iteration; the unfolding technique schedules several iterations together. We combine these two techniques to obtain a static schedule with a reduced average computation time per iteration. We first prove that the order of retiming and unfolding is immaterial for scheduling a data-flow graph (DFG). From this nice property, we present a polynomial-time algorithm on the original DFG, before unfolding, to find the minimum-rate static schedule for a given unfolding factor. For the case of a unit-time DFG, efficient checking and retiming algorithms are presented.

Journal ArticleDOI
TL;DR: This paper explores a new compiler optimization for regular scientific applications-the simultaneous exploitation of task and data parallelism, implemented as part of the PARADIGM HPF compiler framework the authors have developed.
Abstract: Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications-the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster.

Journal ArticleDOI
TL;DR: A modeling technique is developed that transforms the assignment problem in an array or tree into a minimum-cut maximum-flow problem, which is then solved for a general array or tree network in polynomial time.
Abstract: This paper considers the problem of assigning the tasks of a distributed application to the processors of a distributed system such that the sum of execution and communication costs is minimized. Previous work has shown this problem to be tractable for a system of two processors or a linear array of N processors, and for distributed programs of serial parallel structures. Here we focus on the assignment problem on a homogeneous network, which is composed of N functionally-identical processors, each with its own memory. Some processors in the network may have unique resources, such as data files or certain peripheral devices. Certain tasks may have to use these unique resources; they are called attached tasks. The tasks of a distributed program should therefore be assigned so as to make use of specific resources located at certain processors in the network while minimizing the amount of interprocessor communication. The assignment problem in such a homogeneous network is known to be NP-hard even for N=3, thus making it intractable for a network with a medium to large number of processors. We therefore focus on task assignment in general array networks, such as linear arrays, meshes, hypercubes, and trees. We first develop a modeling technique that transforms the assignment problem in an array or tree into a minimum-cut maximum-flow problem. The assignment problem is then solved for a general array or tree network in polynomial time.
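The flavor of the reduction can already be seen in the classical two-processor case, which the abstract cites as tractable: tasks become nodes of a flow network whose source and sink stand for the two processors, and a minimum s-t cut gives a minimum-cost assignment. Below is a sketch using networkx with made-up costs; the paper's contribution is extending this style of construction to general array and tree networks.

```python
import networkx as nx

# exec_cost[task] = (cost on P1, cost on P2); comm_cost is paid only if the endpoints split.
exec_cost = {"t1": (4, 8), "t2": (6, 3), "t3": (5, 5)}
comm_cost = {("t1", "t2"): 2, ("t2", "t3"): 4}

G = nx.DiGraph()
for task, (on_p1, on_p2) in exec_cost.items():
    G.add_edge("P1", task, capacity=on_p2)   # cut if the task is assigned to P2
    G.add_edge(task, "P2", capacity=on_p1)   # cut if the task is assigned to P1
for (i, j), c in comm_cost.items():
    G.add_edge(i, j, capacity=c)
    G.add_edge(j, i, capacity=c)

total_cost, (p1_side, p2_side) = nx.minimum_cut(G, "P1", "P2")
print("minimum total cost:", total_cost)
print("on P1:", sorted(p1_side - {"P1"}), " on P2:", sorted(p2_side - {"P2"}))
```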

Journal ArticleDOI
TL;DR: This paper presents efficient algorithms for sorting, selection, and packet routing on the AROB (Array with Reconfigurable Optical Buses) model and shows that selection from out of n elements can be done in randomized O(1) time employing n processors.
Abstract: In this paper, we present efficient algorithms for sorting, selection, and packet routing on the AROB (Array with Reconfigurable Optical Buses) model. One of our sorting algorithms sorts n general keys in O(1) time on an AROB of size n^ε × n for any constant ε > 0. We also show that selection from out of n elements can be done in randomized O(1) time employing n processors. Our routing algorithm can route any h-relation in randomized O(h) time. All these algorithms are clearly optimal.

Journal ArticleDOI
TL;DR: This paper proves exactly which local checkpoints can be used for constructing such consistent global checkpoints and illustrates the use of the results with a simple and elegant algorithm to enumerate all such consistent global checkpoints.
Abstract: Consistent global checkpoints have many uses in distributed computations. A central question in applications that use consistent global checkpoints is to determine whether a consistent global checkpoint that includes a given set of local checkpoints can exist. Netzer and Xu (1995) presented the necessary and sufficient conditions under which such a consistent global checkpoint can exist, but they did not explore what checkpoints could be constructed. In this paper, we prove exactly which local checkpoints can be used for constructing such consistent global checkpoints. We illustrate the use of our results with a simple and elegant algorithm to enumerate all such consistent global checkpoints.

Journal ArticleDOI
TL;DR: This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure.
Abstract: Distributed Shared Virtual Memory (DSVM) systems provide a shared memory abstraction on distributed memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSVM increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure.

Journal ArticleDOI
TL;DR: The goal of this work is to generate a provably optimal scheme for communicating shared data among subtasks as an enhancement to any given matching and scheduling.
Abstract: In a heterogeneous computing (HC) environment consisting of different types of machines, an application program is decomposed into subtasks, each of which is computationally homogeneous. The goal is to execute subtasks on the machines in such a way that the total program execution time is minimized. A mathematical framework is presented that models the matching of subtasks to machines, scheduling of subtasks' computation, scheduling of intermachine communication steps, and selection of sources of shared data items for intermachine communication (data relocation). The goal of this work is to generate a provably optimal scheme for communicating shared data among subtasks as an enhancement to any given matching and scheduling. Initially, it is assumed that at any instant in time, only one machine is being used for program execution and only one subtask is being executed. Based on this assumption, a polynomial algorithm is introduced to optimize scheduling and data relocation with respect to any given matching of subtasks to machines. The data relocation scheme is then extended to reduce intermachine data communication time in an HC environment with a given matching and scheduling of subtasks' computation where: multiple subtasks' computations can be performed concurrently on different machines; subtask computation steps can be overlapped with other subtasks' communication steps for intermachine data transfers; and machines in the HC suite are interconnected by a shared-bus type of network.

Journal ArticleDOI
TL;DR: This solution is generalized to obtain two faster heuristics, one for the case of homogeneous processors and the other for heterogeneous processors; a proposed performance model is successful in explaining the experimental results qualitatively.
Abstract: The problem of allocating task interaction graphs (TIGs) to heterogeneous computing systems to minimize job completion time is investigated. The only restriction is that the interprocessor communication cost is the same for any pair of processors. This is suitable for local area network based systems, such as Ethernet, as well as fully interconnected multiprocessor systems. An optimal polynomial solution exists if sufficient homogeneous processors and communication capacity are available. This solution is generalized to obtain two faster heuristics, one for the case of homogeneous processors and the other for heterogeneous processors. The heuristics were tested extensively with 60,900 systematically generated random TIGs and shown to be stable independent of the size of the TIG. A performance model is also proposed to predict the performance of the heuristic algorithms, and it is successful in explaining the experimental results qualitatively.

Journal ArticleDOI
TL;DR: Scheduling algorithms for tree, hypercube, and mesh networks are presented that can fully balance the load and maximize locality at runtime, with communication costs significantly reduced compared to other existing algorithms.
Abstract: Parallel scheduling is a new approach for load balancing. In parallel scheduling, all processors cooperate to schedule work. Parallel scheduling is able to accurately balance the load by using global load information at compile-time or runtime. It provides high-quality load balancing. This paper presents an overview of the parallel scheduling technique. Scheduling algorithms for tree, hypercube, and mesh networks are presented. These algorithms can fully balance the load and maximize locality at runtime. Communication costs are significantly reduced compared to other existing algorithms.

Journal ArticleDOI
TL;DR: New methods for the representation and distribution of such data on DMMPs are described, and simple language features are proposed that permit the user to characterize a matrix as "sparse" and specify the associated representation.
Abstract: Vienna Fortran, High Performance Fortran (HPF), and other data-parallel languages have been introduced to allow the programming of massively parallel distributed-memory machines (DMMPs) at a relatively high level of abstraction, based on the SPMD paradigm. Their main features include directives to express the distribution of data and computations across the processors of a machine. In this paper, we use Vienna Fortran as a general framework for dealing with sparse data structures. We describe new methods for the representation and distribution of such data on DMMPs, and propose simple language features that permit the user to characterize a matrix as "sparse" and specify the associated representation. Together with the data distribution for the matrix, this enables the compiler and runtime system to translate sequential sparse code into explicitly parallel message-passing code. We develop new compilation and runtime techniques, which focus on achieving storage economy and reducing communication overhead in the target program. The overall result is a powerful mechanism for dealing efficiently with sparse matrices in data parallel languages and their compilers for DMMPs.
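To ground the discussion, here is one standard sparse representation of the kind such a compiler and runtime must reason about, Compressed Row Storage; the paper's own distributed representations and distribution directives are not reproduced here.

```python
def dense_to_crs(dense):
    """Compressed Row Storage (CRS): returns (values, col_indices, row_pointers), where
    row_pointers[i]:row_pointers[i+1] delimits the nonzeros of row i."""
    values, col_indices, row_pointers = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_indices.append(j)
        row_pointers.append(len(values))
    return values, col_indices, row_pointers

# Example: dense_to_crs([[5, 0, 0], [0, 0, 3], [1, 2, 0]])
# -> ([5, 3, 1, 2], [0, 2, 0, 1], [0, 1, 2, 4])
```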

Journal ArticleDOI
TL;DR: A linear programming-based method is used to solve the incremental graph-partitioning problem and the quality of the partitioning achieved is comparable to that achieved by applying recursive spectral bisection to the incremental graphs from scratch.
Abstract: Partitioning graphs into equally large groups of nodes while minimizing the number of edges between different groups is an extremely important problem in parallel computing. For instance, efficiently parallelizing several scientific and engineering applications requires the partitioning of data or tasks among processors such that the computational load on each node is roughly the same, while communication is minimized. Obtaining exact solutions is computationally intractable, since graph partitioning is NP-complete. For a large class of irregular and adaptive data parallel applications (such as adaptive graphs), the computational structure changes from one phase to another in an incremental fashion. In incremental graph-partitioning problems the partitioning of the graph needs to be updated as the graph changes over time; a small number of nodes or edges may be added or deleted at any given instant. In this paper, we use a linear programming-based method to solve the incremental graph-partitioning problem. All the steps used by our method are inherently parallel and hence our approach can be easily parallelized. By using an initial solution for the graph partitions derived from recursive spectral bisection-based methods, our methods can achieve repartitioning at considerably lower cost than can be obtained by applying recursive spectral bisection. Further, the quality of the partitioning achieved is comparable to that achieved by applying recursive spectral bisection to the incremental graphs from scratch.

Journal ArticleDOI
TL;DR: The results show that using runtime information to adaptively adjust scheduling granularity is an effective way to handle loops with a wide range of load distributions when no prior knowledge of the execution can be used.
Abstract: Using runtime information of load distributions and processor affinity, the authors propose an adaptive scheduling algorithm and its variations from different control mechanisms. The proposed algorithm applies different degrees of aggressiveness to adjust loop scheduling granularities, aiming at improving the execution performance of parallel loops by making scheduling decisions that match the real workload distributions at runtime. They experimentally compared the performance of the algorithm and its variations with several existing scheduling algorithms on two parallel machines: the KSR-1 and the Convex Exemplar. The kernel application programs used for performance evaluation were carefully selected for different classes of parallel loops. The results show that using runtime information to adaptively adjust scheduling granularity is an effective way to handle loops with a wide range of load distributions when no prior knowledge of the execution can be used. The overhead caused by collecting runtime information is insignificant in comparison with the performance improvement. The experiments show that the adaptive algorithm and its five variations outperformed the existing scheduling algorithms.
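For context, a classic non-adaptive way of varying granularity is guided self-scheduling, where each idle processor grabs roughly 1/P of the remaining iterations. The sketch below shows that schedule only to make "granularity" concrete; the algorithm in the paper additionally adapts chunk sizes to measured load and processor affinity.

```python
def guided_chunk_sizes(total_iterations, num_processors):
    """Chunk sizes produced by guided self-scheduling: each grab takes
    ceil(remaining / P) iterations, so chunks shrink as the loop drains."""
    remaining, chunks = total_iterations, []
    while remaining > 0:
        chunk = max(1, -(-remaining // num_processors))   # ceiling division
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# Example: guided_chunk_sizes(100, 4) starts [25, 19, 14, 11, 8, ...]
# and tails off to single iterations.
```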

Journal ArticleDOI
TL;DR: In this article, the authors give deterministic algorithms for many-to-many hot-potato routing in hypercubes, meshes, tori, trees, and hypercubic networks such as the butterfly.
Abstract: We consider algorithms for many-to-many hot potato routing. In hot potato (deflection) routing, a packet cannot be buffered, and is therefore always moving until it reaches its destination. We give optimal and nearly optimal deterministic algorithms for many-to-many packet routing in commonly occurring networks such as the hypercube, meshes, and tori of various dimensions and sizes, trees, and hypercubic networks such as the butterfly. All these algorithms are analyzed using a charging scheme that may be applicable to other algorithms as well. Moreover, all bounds hold in a dynamic setting in which packets can be injected at arbitrary times.

Journal ArticleDOI
TL;DR: Comprehensive computer simulation reveals that the average allocation time and waiting delay are much smaller than those of earlier schemes of comparable performance, irrespective of the size of meshes and the distribution of the shape of the incoming tasks.
Abstract: Efficient allocation of processors to incoming tasks in parallel computer systems is very important for achieving the desired high performance. It requires recognizing the free available processors with minimum overhead. In this paper, we present an efficient task allocation scheme for 2D mesh architectures. By employing a new approach for searching the mesh, our scheme can find the available submesh without scanning the entire mesh, unlike earlier designs. Comprehensive computer simulation reveals that the average allocation time and waiting delay are much smaller than earlier schemes of comparable performances, irrespective of the size of meshes and distribution of the shape of the incoming tasks.
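For contrast with the scheme described above, here is the naive approach it improves on: an exhaustive first-fit scan of every candidate base position for a free w × h submesh (the data layout and names are ours).

```python
def first_fit_submesh(busy, mesh_w, mesh_h, w, h):
    """Return the lower-left corner (x, y) of a free w x h submesh in a mesh_w x mesh_h
    mesh, or None. busy is a set of occupied (x, y) nodes. This scans candidate corners
    exhaustively; the paper's scheme avoids scanning the entire mesh."""
    for x in range(mesh_w - w + 1):
        for y in range(mesh_h - h + 1):
            if all((x + i, y + j) not in busy for i in range(w) for j in range(h)):
                return (x, y)
    return None

# Example: first_fit_submesh({(0, 0), (1, 1)}, 4, 4, 2, 2) -> (0, 2)
```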

Journal ArticleDOI
TL;DR: The EFC_k's can be considered flexible versions of incomplete hypercubes that eliminate the restriction on the number of nodes and thus make it possible to construct parallel machines of arbitrary sizes.
Abstract: The Fibonacci Cube is an interconnection network that possesses many desirable properties that are important in network design and application. The Fibonacci Cube can efficiently emulate many hypercube algorithms and uses fewer links than the comparable hypercube, while its size does not increase as fast as the hypercube's. However, most Fibonacci Cubes (more than 2/3 of them) are not Hamiltonian. In this paper, we propose an Extended Fibonacci Cube (EFC_1) with an even number of nodes. It is defined based on the same recurrence F(i) = F(i-1) + F(i-2) as the regular Fibonacci sequence; however, its initial conditions are different. We show that the Extended Fibonacci Cube includes the Fibonacci Cube as a subgraph and maintains its sparsity property. In addition, it is Hamiltonian and is better at emulating other topologies. Specifically, the Extended Fibonacci Cube can embed binary trees more efficiently than the regular Fibonacci Cube and is almost as efficient as the hypercube, even though the Extended Fibonacci Cube is a much sparser network than the hypercube. We also propose a series of Extended Fibonacci Cubes (EFC_k) with even numbers of nodes. Any Extended Fibonacci Cube in the series contains the node set of any other cube that precedes it in the series. We show that any Extended Fibonacci Cube maintains virtually all the desirable properties of the Fibonacci Cube. The EFC_k's can be considered flexible versions of incomplete hypercubes that eliminate the restriction on the number of nodes and thus make it possible to construct parallel machines of arbitrary sizes.
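A small sketch of the regular Fibonacci Cube for concreteness (nodes are binary strings with no two consecutive 1s, adjacent when they differ in one bit); the Extended Fibonacci Cubes in the paper use the same recurrence with different initial conditions and are not constructed here.

```python
from itertools import product

def fibonacci_cube(n):
    """Return (nodes, edges) of the order-n Fibonacci Cube: nodes are n-bit strings with no
    two consecutive 1s; two nodes are adjacent when they differ in exactly one bit."""
    nodes = ["".join(bits) for bits in product("01", repeat=n) if "11" not in "".join(bits)]
    node_set = set(nodes)
    edges = []
    for u in nodes:
        for i in range(n):
            v = u[:i] + ("1" if u[i] == "0" else "0") + u[i + 1:]
            if v in node_set and u < v:
                edges.append((u, v))
    return nodes, edges

# Example: the order-3 Fibonacci Cube has 5 nodes
# ('000', '001', '010', '100', '101') -- a Fibonacci number -- and 5 edges.
nodes, edges = fibonacci_cube(3)
assert len(nodes) == 5 and len(edges) == 5
```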

Journal ArticleDOI
TL;DR: It is found that, in exchange for a small execution time overhead, the approximate scheduling algorithms can provide substantial improvements in I/O completion times.
Abstract: The I/O bottleneck in parallel computer systems has recently begun receiving increasing interest. Most attention has focused on improving the performance of I/O devices using fairly low level parallelism in techniques such as disk striping and interleaving. Widely applicable solutions, however, will require an integrated approach which addresses the problem at multiple system levels, including applications, systems software, and architecture. We propose that within the context of such an integrated approach, scheduling parallel I/O operations will become increasingly attractive and can potentially provide substantial performance benefits. We describe a simple I/O scheduling problem and present approximate algorithms for its solution. The costs of using these algorithms in terms of execution time, and the benefits in terms of reduced time to complete a batch of I/O operations, are compared with the situations in which no scheduling is used, and in which an optimal scheduling algorithm is used. The comparison is performed both theoretically and experimentally. We have found that, in exchange for a small execution time overhead, the approximate scheduling algorithms can provide substantial improvements in I/O completion times.