
Showing papers in "IEEE Transactions on Parallel and Distributed Systems in 1996"


Journal ArticleDOI
TL;DR: A static scheduling algorithm for allocating task graphs to fully connected multiprocessors which has admissible time complexity, is economical in terms of the number of processors used and is suitable for a wide range of graph structures.
Abstract: In this paper, we propose a static scheduling algorithm for allocating task graphs to fully connected multiprocessors. We discuss six recently reported scheduling algorithms and show that each possesses one drawback or another which can lead to poor performance. The proposed algorithm, which is called the Dynamic Critical-Path (DCP) scheduling algorithm, is different from the previously proposed algorithms in a number of ways. First, it determines the critical path of the task graph and selects the next node to be scheduled in a dynamic fashion. Second, it rearranges the schedule on each processor dynamically in the sense that the positions of the nodes in the partial schedules are not fixed until all nodes have been considered. Third, it selects a suitable processor for a node by looking ahead to the potential start times of the remaining nodes on that processor, and schedules relatively less important nodes to the processors already in use. A global as well as a pair-wise comparison is carried out for all seven algorithms under various scheduling conditions. The DCP algorithm outperforms the previous algorithms by a considerable margin. Despite having a number of new features, the DCP algorithm has admissible time complexity, is economical in terms of the number of processors used, and is suitable for a wide range of graph structures.
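
To make the critical-path idea concrete, here is a much-simplified list-scheduling sketch (not the DCP algorithm itself): nodes are prioritized by their bottom level, the length of the longest path to an exit node, and each node is placed on the processor that gives it the earliest start time. The graph, weights, and function names below are illustrative.

```python
# A simplified critical-path list scheduler for a task DAG. This is a
# sketch of the general technique, not the paper's DCP algorithm.

def bottom_level(graph, weights, node, memo):
    """Length of the longest path from `node` to an exit, counting `node`."""
    if node not in memo:
        memo[node] = weights[node] + max(
            (c + bottom_level(graph, weights, v, memo)
             for v, c in graph.get(node, {}).items()), default=0)
    return memo[node]

def list_schedule(graph, weights, n_procs):
    """graph: {u: {v: comm_cost}}; weights: {u: exec_time > 0}."""
    memo = {}
    # With positive weights, descending bottom level is a valid topological order.
    order = sorted(weights, key=lambda u: -bottom_level(graph, weights, u, memo))
    preds = {u: [] for u in weights}
    for u, succs in graph.items():
        for v, c in succs.items():
            preds[v].append((u, c))
    proc_free = [0] * n_procs        # earliest free time per processor
    placed = {}                      # node -> (proc, start, finish)
    for u in order:
        best = None
        for p in range(n_procs):
            # a predecessor on another processor adds its edge's comm cost
            ready = max((placed[w][2] + (0 if placed[w][0] == p else c)
                         for w, c in preds[u]), default=0)
            start = max(ready, proc_free[p])
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        placed[u] = (p, start, start + weights[u])
        proc_free[p] = start + weights[u]
    return placed

# A small fork-join graph; weights and costs are arbitrary demo values.
g = {'a': {'b': 3, 'c': 1}, 'b': {'d': 2}, 'c': {'d': 2}}
w = {'a': 2, 'b': 3, 'c': 4, 'd': 1}
print(list_schedule(g, w, n_procs=2))
```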

842 citations


Journal ArticleDOI
TL;DR: In terms of a new graph property called "symmetricity", the class of networks on which each of four representative problems on anonymous networks is solvable is characterized under four levels of available network attribute information, and the symmetricity of a network is related to its 1- and 2-factors.
Abstract: In anonymous networks, the processors do not have identity numbers. We investigate the following representative problems on anonymous networks: (a) the leader election problem, (b) the edge election problem, (c) the spanning tree construction problem, and (d) the topology recognition problem. On a given network, the above problems may or may not be solvable, depending on the amount of information about the attributes of the network made available to the processors. Some possibilities are: (1) no network attribute information at all is available, (2) an upper bound on the number of processors in the network is available, (3) the exact number of processors in the network is available, and (4) the topology of the network is available. In terms of a new graph property called "symmetricity", in each of the four cases (1)-(4) above, we characterize the class of networks on which each of the four problems (a)-(d) is solvable. We then relate the symmetricity of a network to its 1- and 2-factors.

322 citations


Journal ArticleDOI
TL;DR: A synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection, and a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s).
Abstract: A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during recovery from node failures, periodic collection of a consistent snapshot of the system (checkpoint) is required. Locating mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot, or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. If nodes take their local checkpoints independently in an uncoordinated manner, each node may have to store multiple local checkpoints in stable storage. This is not suitable for mobile nodes as they have small memory. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots of only those nodes that have directly or transitively affected the initiator since their last snapshots need to be taken. We prove that the global snapshot collection terminates within a finite time of its invocation and the collected global snapshot is consistent. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s). Both the algorithms have low communication and storage overheads and meet the low energy consumption and low bandwidth constraints of mobile computing systems.
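
The core of the minimality idea can be illustrated with a toy in-process simulation (a sketch, not the paper's protocol): each node records which nodes have affected it since its last snapshot, and a snapshot request propagates only along those recorded dependencies. All class and method names here are invented for illustration.

```python
# A minimal sketch of dependency-driven snapshot collection: only nodes
# that have directly or transitively affected the initiator since their
# last checkpoint take a new local snapshot. Messaging is simulated.

class Node:
    def __init__(self, nid, n):
        self.nid = nid
        self.dep = [False] * n      # dep[j]: node j affected me since my last snapshot
        self.snapshots = 0

    def receive_app_message(self, sender):
        self.dep[sender] = True     # record the dependency

    def take_snapshot(self, nodes, visited):
        if self.nid in visited:
            return
        visited.add(self.nid)
        self.snapshots += 1
        deps, self.dep = self.dep, [False] * len(self.dep)
        for j, d in enumerate(deps):
            if d:                   # transitively include my influencers
                nodes[j].take_snapshot(nodes, visited)

nodes = [Node(i, 4) for i in range(4)]
nodes[0].receive_app_message(1)     # node 1 affected node 0
nodes[1].receive_app_message(2)     # node 2 affected node 1
nodes[0].take_snapshot(nodes, set())
print([n.snapshots for n in nodes]) # [1, 1, 1, 0]: node 3 is untouched
```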

213 citations


Journal ArticleDOI
TL;DR: A simple greedy algorithm is presented for the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributed memory parallel architectures; it runs in O(n(n lg n+e)) time, which is n times faster than the currently best known algorithm for this problem.
Abstract: This paper addresses the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributed memory parallel architectures. Because of the high communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing fine grain tasks into single coarser ones so that the overall execution time is minimized. The task clustering problem is NP-hard, even when the number of processors is unbounded and task duplication is allowed. A simple greedy algorithm is presented for this problem which, for a task graph with arbitrary granularity, produces a schedule whose makespan is at most twice optimal. Indeed, the quality of the schedule improves as the granularity of the task graph becomes larger. For example, if the granularity is at least 1/2, the makespan of the schedule is at most 5/3 times optimal. For a task graph with n tasks and e inter-task communication constraints, the algorithm runs in O(n(n lg n+e)) time, which is n times faster than the currently best known algorithm for this problem. Similar algorithms are developed that produce: (1) optimal schedules for coarse grain graphs; (2) 2-optimal schedules for trees with no task duplication; and (3) optimal schedules for coarse grain trees with no task duplication.
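
The flavor of task clustering can be shown with a Sarkar-style edge-zeroing sketch; note this is a different, simpler heuristic than the paper's greedy 2-approximation, and the makespan estimate below optimistically ignores serialization of tasks within a cluster.

```python
# Edge-zeroing clustering sketch: merge clusters along the heaviest
# communication edge whenever the merge does not worsen the estimated
# makespan. The estimator is the DAG critical path with intra-cluster
# edges costing zero; it ignores serialization, so it is optimistic.

def critical_path(graph, w, cluster):
    memo = {}
    def bl(u):
        if u not in memo:
            memo[u] = w[u] + max(
                ((0 if cluster[u] == cluster[v] else c) + bl(v)
                 for v, c in graph.get(u, {}).items()), default=0)
        return memo[u]
    return max(bl(u) for u in w)

def edge_zeroing(graph, w):
    cluster = {u: u for u in w}                  # each task starts alone
    edges = sorted(((c, u, v) for u, s in graph.items()
                    for v, c in s.items()), reverse=True)
    best = critical_path(graph, w, cluster)
    for c, u, v in edges:                        # heaviest edge first
        trial = dict(cluster)
        cu, cv = trial[u], trial[v]
        for k in trial:                          # merge v's cluster into u's
            if trial[k] == cv:
                trial[k] = cu
        m = critical_path(graph, w, trial)
        if m <= best:                            # keep non-worsening merges
            cluster, best = trial, m
    return cluster, best

g = {'a': {'b': 10, 'c': 10}, 'b': {'d': 10}, 'c': {'d': 10}}
w = {'a': 1, 'b': 1, 'c': 1, 'd': 1}
print(edge_zeroing(g, w))   # heavy edges get zeroed into one cluster
```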

206 citations


Journal ArticleDOI
TL;DR: The workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5 are compared and contrasted to gain more insight into the general principles that should guide multiprocessor file-system design.
Abstract: Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of multiprocessor file systems. The design of a high-performance multiprocessor file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of multiprocessor file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide multiprocessor file-system design.

203 citations


Journal ArticleDOI
TL;DR: Inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed memory MIMD parallel computers, is discussed; the general d-dimensional ISP algorithm is described and empirical results with two- and three-dimensional, non-hierarchical particle methods are reported.
Abstract: We discuss inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed memory MIMD parallel computers. We consider the case of a dynamic workload distributed on a uniform mesh, and compare ISP against orthogonal recursive bisection (ORB) and a median-of-medians variant of ORB, ORB-MM. We present two results. First, ISP and ORB-MM are superior to ORB in rendering balanced workloads, because they are more fine-grained, and incur communication overheads that are comparable to ORB. Second, ISP is more attractive than ORB-MM from a software engineering standpoint because it avoids elaborate bookkeeping. Whereas ISP partitionings can be described succinctly as logically contiguous segments of the line, ORB-MM's partitionings are inherently unstructured. We describe the general d-dimensional ISP algorithm and report empirical results with two- and three-dimensional, non-hierarchical particle methods.
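
To make the idea concrete, here is a minimal 2D ISP sketch: cells are ordered by the standard Hilbert-curve index (the usual xy2d conversion) and the resulting weighted line is cut into contiguous, roughly equal-weight segments. The workload values are made up for the demo.

```python
# 2D inverse spacefilling partitioning sketch using a Hilbert curve:
# order mesh cells along the curve, then cut the weighted line into
# P contiguous, load-balanced segments.

def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the curve orientation is preserved."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Index of cell (x, y) along the Hilbert curve of an n x n grid."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d

def isp_partition(weights, n, nprocs):
    """weights: {(x, y): work}; returns {(x, y): processor}."""
    cells = sorted(weights, key=lambda c: xy2d(n, *c))
    total = sum(weights.values())
    part, acc, p = {}, 0.0, 0
    for c in cells:
        # cut the curve when the running share exceeds processor p's quota
        if acc >= total * (p + 1) / nprocs and p < nprocs - 1:
            p += 1
        part[c] = p
        acc += weights[c]
    return part

n = 8
work = {(x, y): 1 + (x < 4) * 3 for x in range(n) for y in range(n)}
parts = isp_partition(work, n, nprocs=4)
loads = [sum(w for c, w in work.items() if parts[c] == p) for p in range(4)]
print(loads)   # four roughly equal shares of the 160 total units of work
```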

151 citations


Journal ArticleDOI
TL;DR: The theoretical background for the design of deadlock-free adaptive routing algorithms for virtual cut-through and store-and-forward switching is developed and a design methodology is proposed, which automatically supplies fully adaptive, minimal and non-minimal routing algorithms.
Abstract: This paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for virtual cut-through and store-and-forward switching. This theory is valid for networks using either central buffers or edge buffers. Some basic definitions and three theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cyclic dependencies between routing resources. Moreover, we propose a necessary and sufficient condition for deadlock-free routing. Also, a design methodology is proposed. It supplies fully adaptive, minimal and non-minimal routing algorithms, guaranteeing that they are deadlock-free. The theory proposed in this paper extends the necessary and sufficient condition for wormhole switching previously proposed by us. The resulting routing algorithms are more flexible than the ones for wormhole switching. Also, the design methodology is much easier to apply because it automatically supplies deadlock-free routing algorithms.
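
A sketch of the kind of verification this theory supports: given a (hypothetical) channel wait-for relation, we can test it for cycles; in Duato-style conditions, cycles are tolerated in the full adaptive channel set as long as a restricted escape set remains acyclic. This is a simplification of the paper's necessary and sufficient condition.

```python
# Cycle detection over a channel dependency relation. The dependency
# sets below are hypothetical inputs, not derived from a real router.

def has_cycle(deps):
    """deps: {channel: set of channels it can wait on}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GRAY
        for d in deps.get(c, ()):
            if color.get(d, WHITE) == GRAY:
                return True                 # back edge closes a cycle
            if color.get(d, WHITE) == WHITE and dfs(d):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in deps)

# The full adaptive channel set may be cyclic...
full = {'a': {'b'}, 'b': {'c'}, 'c': {'a'}}
# ...but the escape channels must not be.
escape = {'a1': {'b1'}, 'b1': {'c1'}, 'c1': set()}
print(has_cycle(full), has_cycle(escape))   # True False
```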

145 citations


Journal ArticleDOI
TL;DR: Simulations of the one-fault-tolerant routing algorithm and other minimal and nonminimal routing algorithms in a two-dimensional mesh indicate that misrouting increases communication latencies significantly at high throughputs, so it is concluded that misrouting should be used only for increasing the degree of fault tolerance, never for just increasing adaptiveness.
Abstract: Previous methods of making wormhole-routed meshes fault tolerant have been based on adding virtual channels to the networks. This paper proposes an alternative method, one based on the turn model for designing wormhole routing algorithms. The turn model produces routing algorithms that are deadlock free, very adaptive, minimal or nonminimal, and livelock free for direct networks, whether or not they contain virtual channels. This paper illustrates how to modify the routing algorithms produced by the turn model to handle dynamic faults. It first describes how to modify the negative-first routing algorithm, which the turn model produces for n-dimensional meshes without virtual channels, to make it one-fault tolerant. Simulations of the one-fault-tolerant routing algorithm and other minimal and nonminimal routing algorithms in a two-dimensional mesh indicate that misrouting increases communication latencies significantly at high throughputs. The conclusion is that misrouting should be used only for increasing the degree of fault tolerance, never for just increasing adaptiveness. Finally, the paper describes how to modify the negative-first routing algorithm to make it (n-1)-fault tolerant for n-dimensional meshes without virtual channels.
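
A minimal sketch of negative-first routing on a 2D mesh, the turn-model algorithm the paper starts from: all hops in negative directions precede any hop in a positive direction. The fault-tolerance modifications are not reproduced, and the real algorithm can choose adaptively among the permitted hops; this version is deterministic.

```python
# Negative-first routing on a 2D mesh: take every needed hop in the
# negative directions (west, south) before any positive hop (east,
# north), which removes the turns that could close a dependency cycle.

def negative_first_hops(src, dst):
    """Yield the sequence of hops from src=(x, y) to dst=(x, y)."""
    x, y = src
    dx, dy = dst[0] - x, dst[1] - y
    while dx < 0 or dy < 0:            # negative phase first
        if dx < 0:
            x, dx = x - 1, dx + 1
            yield ('west', (x, y))
        else:
            y, dy = y - 1, dy + 1
            yield ('south', (x, y))
    while dx > 0 or dy > 0:            # then the positive phase
        if dx > 0:
            x, dx = x + 1, dx - 1
            yield ('east', (x, y))
        else:
            y, dy = y + 1, dy - 1
            yield ('north', (x, y))

print(list(negative_first_hops((2, 2), (0, 3))))
# west, west, then north: negative moves strictly precede positive ones
```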

132 citations


Journal ArticleDOI
Vijay K. Garg, B. Waldecker
TL;DR: This paper presents algorithms which detect if the given strong global predicate became true in a run of a distributed program, and these algorithms can be executed on line as well as off line.
Abstract: This paper discusses detection of global predicates in a distributed program. A run of a distributed program results in a set of sequential traces, one for each process. These traces may be combined to form many global sequences consistent with the single run of the program. A strong global predicate is true in a run if it is true for all global sequences consistent with the run. We present algorithms which detect if the given strong global predicate became true in a run of a distributed program. Our algorithms can be executed on line as well as off line. Moreover, our algorithms do not assume that underlying channels satisfy FIFO ordering.

120 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of leader election in the presence of intermittent link failures is studied and a message optimal algorithm with message complexity O(N^2) is presented.
Abstract: We study the problem of leader election in the presence of intermittent link failures. We assume that up to N/2-1 links incident on each node may fail during the execution of the protocol. We present a message optimal algorithm with message complexity O(N^2).

100 citations


Journal ArticleDOI
TL;DR: The LogP model is shown to be a valuable guide in the development of parallel algorithms and a good predictor of implementation performance; the model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention.
Abstract: In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
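
A back-of-the-envelope example of the kind of reasoning LogP supports: the formula below is a standard LogP estimate for a pipelined sequence of message injections, applied to an all-to-all phase like those in the sorting algorithms. The parameter values are illustrative, not measured CM-5 numbers.

```python
# LogP cost estimate for one all-to-all phase in which each of P
# processors sends one message to every other processor.

def logp_alltoall_time(L, o, g, P, msgs_per_pair=1):
    """Time for a processor to send (P-1)*msgs_per_pair messages.

    The sender pays overhead `o` per injection and must respect the
    gap `g` between injections, so consecutive sends are max(g, o)
    apart; the last message then needs latency L plus the receiver's o.
    """
    m = (P - 1) * msgs_per_pair
    send_phase = (m - 1) * max(g, o) + o   # pipelined injections
    return send_phase + L + o              # drain the last message

print(logp_alltoall_time(L=6, o=2, g=4, P=32))   # 130 with these toy values
```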

Journal ArticleDOI
TL;DR: New parallel simulated annealing algorithms which allow multiple Markov chains to be traced simultaneously by PEs which may communicate with each other which can find a solution of equivalent (or even better) quality up to an order of magnitude faster than the conventional parallel schemes.
Abstract: Simulated annealing is a general-purpose optimization technique capable of finding an optimal or near-optimal solution in various applications. However, the long execution time required for a good quality solution has been a major drawback in practice. Extensive studies have been carried out to develop parallel algorithms for simulated annealing. Most of them were not very successful, mainly because multiple processing elements (PEs) were required to follow a single Markov chain and, therefore, only a limited parallelism was exploited. In this paper, we propose new parallel simulated annealing algorithms which allow multiple Markov chains to be traced simultaneously by PEs which may communicate with each other. We have considered both synchronous and asynchronous implementations of the algorithms. Their performance has been analyzed in detail and also verified by extensive experimental results. It has been shown that for graph partitioning the proposed parallel simulated annealing schemes can find a solution of equivalent (or even better) quality up to an order of magnitude faster than the conventional parallel schemes. Among the proposed schemes, the one where PEs exchange information dynamically (not with a fixed period) performs best.
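
A sketch of the multiple-Markov-chain idea in its synchronous form: several chains anneal independently and periodically adopt the best solution found so far. This is a simplification (the paper's best scheme exchanges information dynamically rather than with a fixed period), and the toy objective function is invented for the demo.

```python
# Multiple-Markov-chain simulated annealing, synchronous variant:
# chains step independently and exchange the best state periodically.

import math, random

def anneal_multichain(energy, neighbor, init, n_chains=4,
                      steps=2000, exchange_every=100, t0=10.0):
    random.seed(1)
    states = [init() for _ in range(n_chains)]
    costs = [energy(s) for s in states]
    for step in range(steps):
        t = t0 * (0.999 ** step)                     # geometric cooling
        for i in range(n_chains):                    # one move per chain
            cand = neighbor(states[i])
            dE = energy(cand) - costs[i]
            if dE <= 0 or random.random() < math.exp(-dE / t):
                states[i], costs[i] = cand, costs[i] + dE
        if step % exchange_every == 0:               # synchronous exchange
            b = min(range(n_chains), key=costs.__getitem__)
            states = [states[b]] * n_chains          # all adopt the best
            costs = [costs[b]] * n_chains
    b = min(range(n_chains), key=costs.__getitem__)
    return states[b], costs[b]

# Toy usage: minimize a bumpy 1D function.
f = lambda x: (x - 3) ** 2 + 2 * math.sin(5 * x)
best, cost = anneal_multichain(
    energy=f, neighbor=lambda x: x + random.uniform(-0.5, 0.5),
    init=lambda: random.uniform(-10, 10))
print(round(best, 2), round(cost, 2))
```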

Journal ArticleDOI
TL;DR: It is found that sequential prefetching does as well as, and in some cases even better than, stride prefetching for five applications, and offers the extra advantage of consuming less memory-system bandwidth.
Abstract: We study the efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does as well as, and in some cases even better than, stride prefetching for five applications. This is because 1) most strides are shorter than the block size (we assume 32-byte blocks), which means that sequential prefetching is as effective for these stride accesses, and 2) sequential prefetching also exploits the locality of read misses with nonstride accesses. However, since stride prefetching in general results in fewer useless prefetches, it offers the extra advantage of consuming less memory-system bandwidth.
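
The block-size effect can be reproduced with a toy prefetch-on-miss simulation, assuming 32-byte blocks as in the paper: with a 4-byte stride, every sequentially prefetched block is fully used, roughly halving misses. The cache organization and trace are illustrative only.

```python
# Sequential (one-block-lookahead, prefetch-on-miss) prefetching against
# a toy direct-mapped cache, with a stride shorter than the block size.

BLOCK = 32                             # bytes per cache block

def run(trace, n_sets, degree=1):
    """Return (misses, prefetches) for a sequence of byte addresses."""
    cache = {}                         # set index -> resident block number
    misses = prefetches = 0
    def present(block):
        return cache.get(block % n_sets) == block
    for addr in trace:
        blk = addr // BLOCK
        if not present(blk):
            misses += 1
            cache[blk % n_sets] = blk
            for k in range(1, degree + 1):   # prefetch the next block(s)
                if not present(blk + k):
                    prefetches += 1
                    cache[(blk + k) % n_sets] = blk + k
    return misses, prefetches

stride4 = [i * 4 for i in range(256)]        # stride 4 < 32-byte block
print(run(stride4, n_sets=64, degree=0))     # (32, 0): one miss per block
print(run(stride4, n_sets=64, degree=1))     # (16, 16): misses halved
```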

Journal ArticleDOI
TL;DR: A number of reduction rules are introduced and it is shown how they can be applied to Ada nets, which are automatically generated Petri net models of Ada tasking, and experimental results from applying the reduction process are discussed.
Abstract: As part of our continuing research on using Petri nets to support automated analysis of Ada tasking behavior, we have investigated the application of Petri net reduction for deadlock analysis. Although reachability analysis is an important method to detect deadlocks, it is in general inefficient or even intractable. Net reduction can aid the analysis by reducing the size of the net while preserving relevant properties. We introduce a number of reduction rules and show how they can be applied to Ada nets, which are automatically generated Petri net models of Ada tasking. We define a reduction process and a method by which a useful description of a detected deadlock state can be obtained from the reduced net's information. A reduction tool and experimental results from applying the reduction process are discussed.

Journal ArticleDOI
TL;DR: It is shown that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, which makes the analysis procedure more efficient.
Abstract: This paper presents a framework, based on global array data-flow analysis, to reduce communication costs in a program being compiled for a distributed memory machine. We introduce available section descriptor, a novel representation of communication involving array sections. This representation allows us to apply techniques for partial redundancy elimination to obtain powerful communication optimizations. With a single framework, we are able to capture optimizations like (1) vectorizing communication, (2) eliminating communication that is redundant on any control flow path, (3) reducing the amount of data being communicated, (4) reducing the number of processors to which data must be communicated, and (5) moving communication earlier to hide latency, and to subsume previous communication. We show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, which makes the analysis procedure more efficient. We present results from a preliminary implementation of this framework, which are extremely encouraging, and demonstrate the effectiveness of this analysis in improving the performance of programs.

Journal ArticleDOI
TL;DR: Three broadcast algorithms and lower bounds on the three main components of the broadcast time for 2-dimensional torus networks (wrap-around meshes) that use synchronous circuit-switched routing are presented.
Abstract: In this paper we present three broadcast algorithms and lower bounds on the three main components of the broadcast time for 2-dimensional torus networks (wrap-around meshes) that use synchronous circuit-switched routing. The first algorithm is based on a recursive tiling of a torus and is optimal in terms of both phases and intermediate switch settings when the start-up time to initiate message transmissions is the dominant cost. It is the first broadcast algorithm to match the lower bound of log_5 N on the number of phases (where N is the number of nodes). The second and third algorithms are hybrids which combine circuit switching with the pipelining and arc-disjoint spanning trees techniques that are commonly used to speed up store-and-forward routing. When the propagation time of messages through the network is significant, our hybrid algorithms achieve close to optimal performance in terms of phases, intermediate switch settings, and total transmission time. They are the first algorithms to achieve this performance in terms of all three parameters simultaneously.

Journal ArticleDOI
TL;DR: The MS (for "mixed-sum") algorithm proposed to solve the Byzantine Agreement problem with dual failure modes is shown to overestimate the bound on the number of allowable faulty processors, and a corrected algorithm and a new bound are given.
Abstract: F.J. Meyer and D.K. Pradhan (1991) proposed the MS (for "mixed-sum") algorithm to solve the Byzantine Agreement (BA) problem with dual failure modes: arbitrary faults (Byzantine faults) and dormant faults (essentially omission faults and timing faults). Our study indicates that this algorithm uses an inappropriate method to eliminate the effects of dormant faults and that the bound on the number of allowable faulty processors is overestimated. This paper corrects the algorithm and gives a new bound for the allowable faulty processors.

Journal ArticleDOI
TL;DR: Efficient algorithms for redistribution between different cyclic(k) distributions, as defined in High Performance Fortran, are proposed, including a GCD method and an LCM method for the general case, and implemented on the Intel Touchstone Delta, where they perform well for different array sizes and numbers of processors.
Abstract: Dynamic redistribution of arrays is required very often in programs on distributed memory machines. This paper presents efficient algorithms for redistribution between different cyclic(k) distributions, as defined in High Performance Fortran. We first propose special optimized algorithms for a cyclic(x) to cyclic(y) redistribution when x is a multiple of y, or y is a multiple of x. We then propose two algorithms, called the GCD method and the LCM method, for the general cyclic(x) to cyclic(y) redistribution when there is no particular relation between x and y. We have implemented these algorithms on the Intel Touchstone Delta, and find that they perform well for different array sizes and numbers of processors.
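
The index bookkeeping behind any such redistribution can be sketched in a few lines; the brute-force version below only exposes the source/destination mapping, whereas the paper's GCD and LCM methods avoid scanning every element.

```python
# Brute-force cyclic(x) -> cyclic(y) redistribution bookkeeping: for
# each global index, compute its owner under both distributions and
# build the send sets.

def owner(i, blk, nprocs):
    """Owning processor of global index i under a cyclic(blk) distribution."""
    return (i // blk) % nprocs

def send_sets(n, x, y, nprocs):
    """sends[p][q] = global indices processor p must ship to q when
    going from cyclic(x) to cyclic(y) on the same nprocs processors."""
    sends = [[[] for _ in range(nprocs)] for _ in range(nprocs)]
    for i in range(n):
        src, dst = owner(i, x, nprocs), owner(i, y, nprocs)
        if src != dst:
            sends[src][dst].append(i)
    return sends

# cyclic(4) -> cyclic(2) on 3 processors, 24 elements:
for p, row in enumerate(send_sets(24, 4, 2, 3)):
    for q, idx in enumerate(row):
        if idx:
            print(f"P{p} -> P{q}: {idx}")
```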

Journal ArticleDOI
TL;DR: A trip-based model is proposed to support adaptive, distributed, and deadlock-free multiple multicast on any network with arbitrary topology using at most two virtual channels per physical channel.
Abstract: This paper focuses on efficient multicasting in wormhole-routed networks. A trip-based model is proposed to support adaptive, distributed, and deadlock-free multiple multicast on any network with arbitrary topology using at most two virtual channels per physical channel. This model significantly generalizes the path-based model proposed earlier, which works only for Hamiltonian networks and cannot be applied to networks whose topology has become arbitrary as a result of system faults. Fundamentals of the trip-based model, including the necessary and sufficient condition for it to be deadlock-free, and the use of an appropriate number of virtual channels to avoid deadlock, are investigated. The potential of this model is illustrated by applying it to hypercubes with faulty nodes. Simulation results indicate that the proposed model can implement multiple multicast on faulty hypercubes with negligible performance degradation.

Journal ArticleDOI
TL;DR: This paper presents a framework for a compiler algorithm for forwarding in shared-memory multiprocessors, and uses address traces to evaluate the performance impact of different levels of support for forwarding.
Abstract: Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.
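
A toy illustration of the forwarding idea (not the paper's compiler algorithm): on a write, the producer pushes the new value into the caches of compiler-identified consumers, so their later reads hit locally. The structures and counters are deliberately minimal and hypothetical.

```python
# Data forwarding in miniature: a write updates the producer's cache
# and pushes copies to the consumers the compiler identified.

class Proc:
    def __init__(self):
        self.cache = {}
        self.hits = self.misses = 0

    def read(self, mem, addr):
        if addr in self.cache:
            self.hits += 1            # local hit thanks to forwarding
        else:
            self.misses += 1          # long-latency remote access
            self.cache[addr] = mem[addr]
        return self.cache[addr]

def write(mem, procs, writer, addr, val, consumers=()):
    mem[addr] = val
    procs[writer].cache[addr] = val
    for c in consumers:               # compiler-inserted forwards
        procs[c].cache[addr] = val

mem, procs = {}, [Proc() for _ in range(4)]
write(mem, procs, writer=0, addr=100, val=7, consumers=[1, 2])
procs[1].read(mem, 100)
procs[3].read(mem, 100)
print(procs[1].misses, procs[3].misses)   # 0 1: forwarded vs. not
```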

Journal ArticleDOI
TL;DR: This paper emphasizes the versatility of the folded Petersen cube networks as a multicomputer interconnection topology by providing embeddings of many computationally important structures such as rings, multi-dimensional meshes, hypercubes, complete binary trees, tree machines, meshes of trees, and pyramids.
Abstract: We introduce and analyze a new interconnection topology, called the k-dimensional folded Petersen (FP_k) network, which is constructed by iteratively applying the Cartesian product operation on the well-known Petersen graph. Since the number of nodes in FP_k is restricted to a power of ten, for better scalability we propose a generalization, the folded Petersen cube network FPQ_{n,k} = Q_n × FP_k, which is a product of the n-dimensional binary hypercube (Q_n) and FP_k. The FPQ_{n,k} topology provides regularity, node- and edge-symmetry, optimal connectivity (and therefore maximal fault-tolerance), logarithmic diameter, modularity, and permits simple self-routing and broadcasting algorithms. With the same node-degree and connectivity, FPQ_{n,k} has smaller diameter and accommodates more nodes than Q_{n+3k}, and its packing density is higher compared to several other product networks. This paper also emphasizes the versatility of the folded Petersen cube networks as a multicomputer interconnection topology by providing embeddings of many computationally important structures such as rings, multi-dimensional meshes, hypercubes, complete binary trees, tree machines, meshes of trees, and pyramids. The dilation and edge-congestion of all such embeddings are at most two.
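
The construction is easy to reproduce: the sketch below builds the Petersen graph, forms its Cartesian product with a small hypercube, and confirms by brute-force BFS that the diameter of the product is the sum of the factors' diameters (the Petersen graph's diameter is 2).

```python
# Build FPQ_{n,k} = Q_n x FP_k as an explicit adjacency structure and
# check its diameter by BFS. Shown here for n = 1, k = 1 (20 nodes).

from collections import deque
from itertools import product

# The Petersen graph: outer 5-cycle, inner pentagram, five spokes.
PETERSEN = {i: set() for i in range(10)}
for i in range(5):
    for a, b in ((i, (i + 1) % 5),            # outer cycle
                 (5 + i, 5 + (i + 2) % 5),    # inner pentagram
                 (i, 5 + i)):                 # spoke
        PETERSEN[a].add(b)
        PETERSEN[b].add(a)

def cartesian(g, h):
    """Cartesian product of two graphs given as adjacency dicts."""
    adj = {(u, v): set() for u, v in product(g, h)}
    for u, v in adj:
        adj[(u, v)] |= {(w, v) for w in g[u]} | {(u, w) for w in h[v]}
    return adj

def diameter(adj):
    best = 0
    for s in adj:                     # BFS from every node
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        best = max(best, max(dist.values()))
    return best

Q1 = {0: {1}, 1: {0}}                  # 1-dimensional hypercube (an edge)
FPQ_1_1 = cartesian(Q1, PETERSEN)      # 20 nodes
print(len(FPQ_1_1), diameter(FPQ_1_1)) # 20 3: diam(Q_1) + diam(FP_1) = 1 + 2
```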

Journal ArticleDOI
TL;DR: This paper describes methods suitable for parallelized direct execution simulation of message-passing parallel programs, and reports on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), which has been built on the Intel Paragon.
Abstract: As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper, we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

Journal ArticleDOI
TL;DR: This paper considers the problem of all-to-all personalized communication in a torus of any dimension with the wormhole-routing capability, and proposes complete exchange algorithms that use an optimal number of phases (if each side of the torus is a multiple of eight) or an asymptotically optimal number of phases otherwise.
Abstract: All-to-all personalized communication, or complete exchange, is at the heart of numerous applications in parallel computing. It is one of the densest communication patterns. In this paper, we consider this problem in a torus of any dimension with the wormhole-routing capability. We propose complete exchange algorithms that use an optimal number of phases (if each side of the torus is a multiple of eight) or an asymptotically optimal number of phases (otherwise). Interestingly, in order to achieve this, we only make weak assumptions: that a node is capable of sending and receiving at most one message at a time, and that the network is capable of supporting the dimension-ordered (or e-cube) minimum routing.

Journal ArticleDOI
TL;DR: A framework to design fully-adaptive, deadlock-free wormhole algorithms for a variety of network topologies is presented, together with an analysis comparing the resource requirements and performance of a proposed algorithm, called the negative-hop algorithm, with some of the previously proposed algorithms for torus and mesh networks.
Abstract: This paper presents a framework to design fully-adaptive, deadlock-free wormhole algorithms for a variety of network topologies. The main theoretical contributions are: (a) design of new wormhole algorithms using store-and-forward algorithms, (b) a sufficient condition for deadlock-free routing by the wormhole algorithms so designed, and (c) a sufficient condition for deadlock-free routing by these wormhole algorithms with centralized flit buffers shared among multiple channels. To illustrate the theory, several wormhole algorithms based on store-and-forward hop schemes are designed. The hop-based wormhole algorithms can be applied to a variety of networks including torus, mesh, de Bruijn, and a class of Cayley networks, with the best known bounds on virtual channels for minimal routing on the last two classes of networks. An analysis comparing the resource requirements and performance of a proposed algorithm, called the negative-hop algorithm, with some of the previously proposed algorithms for torus and mesh networks is presented.

Journal ArticleDOI
TL;DR: An important and counter-intuitive result is shown, which proves that the authors can always obtain full-parallelism for MDFGs with more than one dimension.
Abstract: Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms.
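
The mechanics of multidimensional retiming can be shown in a few lines: a retiming r: V -> Z^2 replaces each edge delay d(e) by d(e) + r(u) - r(v), and full parallelism requires every transformed delay to be nonzero. The example graph and retiming below are made up, and legality conditions beyond nonzero delays (such as a realizable prologue) are omitted.

```python
# Apply a multidimensional retiming to an MDFG and check that no edge
# carries a zero delay vector, i.e., the loop body is fully parallel.

def retime(edges, r):
    """edges: [(u, v, (d1, d2))]; r: {node: (r1, r2)}.
    The new delay of edge u -> v is d + r(u) - r(v)."""
    return [(u, v, (d[0] + r[u][0] - r[v][0],
                    d[1] + r[u][1] - r[v][1])) for u, v, d in edges]

def fully_parallel(edges):
    """True if no edge carries a zero delay vector."""
    return all(d != (0, 0) for _, _, d in edges)

# A two-node loop body with a zero-delay (serializing) edge A -> B.
g = [('A', 'B', (0, 0)), ('B', 'A', (1, 1))]
print(fully_parallel(g))                       # False
g2 = retime(g, {'A': (0, 0), 'B': (-1, 0)})   # shifts a (1, 0) delay onto A -> B
print(g2, fully_parallel(g2))                  # True: both delays nonzero
```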

Journal ArticleDOI
TL;DR: This paper presents a simple game, and develops basic theory underlying a robust method for distributed coordination based on this game that makes use of finite state automata-one associated with each agent-which guide the agents.
Abstract: A simple game provides a framework within which agents can spontaneously self-organize. In this paper, we present this game, and develop basic theory underlying a robust method for distributed coordination based on this game. This method makes use of finite state automata-one associated with each agent-which guide the agents. We give a new, general method of analysis of these systems, which previously had been studied only in limited cases. We also provide a physical example, which should hint at the type of problems resolvable using this method.

Journal ArticleDOI
TL;DR: It turns out that, depending on the class, the complexity varies from P-time to NP-complete or co-NP-complete, and a new graph property called symmetricity played a central role in the analysis of anonymous networks.
Abstract: For Part I, see ibid. In anonymous networks, the processors do not have identity numbers. In Part I of this paper, we characterized the classes of networks on which some representative distributed computation problems are solvable under different conditions. A new graph property called symmetricity played a central role in our analysis of anonymous networks. In Part II, we turn our attention to the computational complexity issues. We first discuss the complexity of determining the symmetricity of a given graph, and then that of testing membership in each of the 16 classes of anonymous networks defined in Part I. It turns out that, depending on the class, the complexity varies from P-time to NP-complete or co-NP-complete.

Journal ArticleDOI
TL;DR: A new decomposition technique for hierarchical Cayley graphs is presented, which yields a very easy implementation of the divide and conquer paradigm for some problems on very complex architectures such as the star graph or the pancake.
Abstract: This paper presents a new decomposition technique for hierarchical Cayley graphs. This technique yields a very easy implementation of the divide and conquer paradigm for some problems on very complex architectures such as the star graph or the pancake. As applications, we introduce algorithms for broadcasting and prefix-like operations that improve the best known bounds for these problems. We also give the first nontrivial optimal gossiping algorithms for these networks. In star graphs and pancakes with N = n! processors, our algorithms take less than ⌈log N⌉ + 1.5n steps.

Journal ArticleDOI
TL;DR: The method presented in this paper delays data structure selection until the compile phase, thereby allowing the compiler to combine code optimization with explicit data structure selection, and enables the compilation of efficient code for sparse computations.
Abstract: The problem of compiler optimization of sparse codes is well known and no satisfactory solutions have been found yet. One of the major obstacles is formed by the fact that sparse programs explicitly deal with particular data structures selected for storing sparse matrices. This explicit data structure handling obscures the functionality of a code to such a degree that optimization of the code is prohibited, for instance, by the introduction of indirect addressing. The method presented in this paper delays data structure selection until the compile phase, thereby allowing the compiler to combine code optimization with explicit data structure selection. This method enables the compiler to generate efficient code for sparse computations. Moreover, the task of the programmer is greatly reduced in complexity.

Journal ArticleDOI
TL;DR: The results indicate that Banerjee's test is for all practical purposes as accurate as the more complex Omega test in detecting parallelism, but the Omega test is quite effective in proving the existence of dependences, in contrast with Banerjee's test, which can only disprove, or break, dependences.
Abstract: Data dependence analysis techniques are the main component of today's strategies for automatic detection of parallelism. Parallelism detection strategies are being incorporated in commercial compilers with increasing frequency because of the widespread use of processors capable of exploiting instruction-level parallelism and the growing importance of multiprocessors. An assessment of the accuracy of data dependence tests is therefore of great importance for compiler writers and researchers. The tests evaluated in this study include the generalized greatest common divisor test, three variants of Banerjee's test, and the Omega test. Their effectiveness was measured with respect to the Perfect Benchmarks and the linear algebra libraries EISPACK and LAPACK. Two methods were applied, one using only compile-time information for the analysis, and the second using information gathered during program execution. The results indicate that Banerjee's test is for all practical purposes as accurate as the more complex Omega test in detecting parallelism. However, the Omega test is quite effective in proving the existence of dependences, in contrast with Banerjee's test, which can only disprove, or break, dependences. The capability of the Omega test of proving dependences could have a significant impact on several compiler algorithms not considered in this study.
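
For concreteness, here is the single-subscript GCD test in a few lines: a dependence between accesses a[x*i + b1] and a[y*j + b2] requires gcd(x, y) to divide b2 - b1, so a failed divisibility check disproves the dependence, while a passed check is inconclusive (which is where Banerjee-style bounds tests come in).

```python
# The single-subscript GCD dependence test: the equation
# x*i - y*j = b2 - b1 has integer solutions iff gcd(x, y) divides b2 - b1.

from math import gcd

def gcd_test(x, b1, y, b2):
    """True  -> a dependence is possible (the test is inconclusive).
    False -> the dependence is disproved."""
    return (b2 - b1) % gcd(x, y) == 0

# a[2*i] vs a[2*j + 1]: even vs odd subscripts can never meet.
print(gcd_test(2, 0, 2, 1))   # False: independent, loop can be parallel
# a[2*i] vs a[4*j + 2]: gcd 2 divides 2, so the test cannot rule it out.
print(gcd_test(2, 0, 4, 2))   # True: may depend
```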