
Showing papers in "IEEE Transactions on Parallel and Distributed Systems in 1996"


Journal ArticleDOI
TL;DR: A static scheduling algorithm for allocating task graphs to fully connected multiprocessors which has admissible time complexity, is economical in terms of the number of processors used and is suitable for a wide range of graph structures.
Abstract: In this paper, we propose a static scheduling algorithm for allocating task graphs to fully connected multiprocessors. We discuss six recently reported scheduling algorithms and show that each possesses one drawback or another which can lead to poor performance. The proposed algorithm, which is called the Dynamic Critical-Path (DCP) scheduling algorithm, is different from the previously proposed algorithms in a number of ways. First, it determines the critical path of the task graph and selects the next node to be scheduled in a dynamic fashion. Second, it rearranges the schedule on each processor dynamically in the sense that the positions of the nodes in the partial schedules are not fixed until all nodes have been considered. Third, it selects a suitable processor for a node by looking ahead to the potential start times of the remaining nodes on that processor, and schedules relatively less important nodes to the processors already in use. A global as well as a pair-wise comparison is carried out for all seven algorithms under various scheduling conditions. The DCP algorithm outperforms the previous algorithms by a considerable margin. Despite having a number of new features, the DCP algorithm has admissible time complexity, is economical in terms of the number of processors used, and is suitable for a wide range of graph structures.
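
To make the critical-path idea concrete, here is a much-simplified list-scheduling sketch (not the DCP algorithm itself): nodes are prioritized by their bottom level, the length of the longest path to an exit node, and each node is placed on the processor that gives it the earliest start time. The graph, weights, and function names below are illustrative.

```python
# A simplified critical-path list scheduler for a task DAG. This is a
# sketch of the general technique, not the paper's DCP algorithm.

def bottom_level(graph, weights, node, memo):
    """Length of the longest path from `node` to an exit, counting `node`."""
    if node not in memo:
        memo[node] = weights[node] + max(
            (c + bottom_level(graph, weights, v, memo)
             for v, c in graph.get(node, {}).items()), default=0)
    return memo[node]

def list_schedule(graph, weights, n_procs):
    """graph: {u: {v: comm_cost}}; weights: {u: exec_time > 0}."""
    memo = {}
    # With positive weights, descending bottom level is a valid topological order.
    order = sorted(weights, key=lambda u: -bottom_level(graph, weights, u, memo))
    preds = {u: [] for u in weights}
    for u, succs in graph.items():
        for v, c in succs.items():
            preds[v].append((u, c))
    proc_free = [0] * n_procs        # earliest free time per processor
    placed = {}                      # node -> (proc, start, finish)
    for u in order:
        best = None
        for p in range(n_procs):
            # a predecessor on another processor adds its edge's comm cost
            ready = max((placed[w][2] + (0 if placed[w][0] == p else c)
                         for w, c in preds[u]), default=0)
            start = max(ready, proc_free[p])
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        placed[u] = (p, start, start + weights[u])
        proc_free[p] = start + weights[u]
    return placed

# A small fork-join graph; weights and costs are arbitrary demo values.
g = {'a': {'b': 3, 'c': 1}, 'b': {'d': 2}, 'c': {'d': 2}}
w = {'a': 2, 'b': 3, 'c': 4, 'd': 1}
print(list_schedule(g, w, n_procs=2))
```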

842 citations


Journal ArticleDOI
TL;DR: In terms of a new graph property called "symmetricity", the class of networks on which each of four representative problems on anonymous networks is solvable is characterized under four levels of available network attribute information, and the symmetricity of a network is related to its 1- and 2-factors.
Abstract: In anonymous networks, the processors do not have identity numbers. We investigate the following representative problems on anonymous networks: (a) the leader election problem, (b) the edge election problem, (c) the spanning tree construction problem, and (d) the topology recognition problem. On a given network, the above problems may or may not be solvable, depending on the amount of information about the attributes of the network made available to the processors. Some possibilities are: (1) no network attribute information at all is available, (2) an upper bound on the number of processors in the network is available, (3) the exact number of processors in the network is available, and (4) the topology of the network is available. In terms of a new graph property called "symmetricity", in each of the four cases (1)-(4) above, we characterize the class of networks on which each of the four problems (a)-(d) is solvable. We then relate the symmetricity of a network to its 1- and 2-factors.

322 citations


Journal ArticleDOI
TL;DR: A synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection, and a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s).
Abstract: A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during recovery from node failures, periodic collection of a consistent snapshot of the system (checkpoint) is required. Locating mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot, or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. If nodes take their local checkpoints independently in an uncoordinated manner, each node may have to store multiple local checkpoints in stable storage. This is not suitable for mobile nodes as they have small memory. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots of only those nodes that have directly or transitively affected the initiator since their last snapshots need to be taken. We prove that the global snapshot collection terminates within a finite time of its invocation and the collected global snapshot is consistent. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s). Both the algorithms have low communication and storage overheads and meet the low energy consumption and low bandwidth constraints of mobile computing systems.
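
The core of the minimality idea can be illustrated with a toy in-process simulation (a sketch, not the paper's protocol): each node records which nodes have affected it since its last snapshot, and a snapshot request propagates only along those recorded dependencies. All class and method names here are invented for illustration.

```python
# A minimal sketch of dependency-driven snapshot collection: only nodes
# that have directly or transitively affected the initiator since their
# last checkpoint take a new local snapshot. Messaging is simulated.

class Node:
    def __init__(self, nid, n):
        self.nid = nid
        self.dep = [False] * n      # dep[j]: node j affected me since my last snapshot
        self.snapshots = 0

    def receive_app_message(self, sender):
        self.dep[sender] = True     # record the dependency

    def take_snapshot(self, nodes, visited):
        if self.nid in visited:
            return
        visited.add(self.nid)
        self.snapshots += 1
        deps, self.dep = self.dep, [False] * len(self.dep)
        for j, d in enumerate(deps):
            if d:                   # transitively include my influencers
                nodes[j].take_snapshot(nodes, visited)

nodes = [Node(i, 4) for i in range(4)]
nodes[0].receive_app_message(1)     # node 1 affected node 0
nodes[1].receive_app_message(2)     # node 2 affected node 1
nodes[0].take_snapshot(nodes, set())
print([n.snapshots for n in nodes]) # [1, 1, 1, 0]: node 3 is untouched
```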

213 citations


Journal ArticleDOI
TL;DR: A simple greedy algorithm is presented for the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributed memory parallel architectures; it runs in O(n(n lg n+e)) time, which is n times faster than the currently best known algorithm for this problem.
Abstract: This paper addresses the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributed memory parallel architectures. Because of the high communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing fine grain tasks into single coarser ones so that the overall execution time is minimized. The task clustering problem is NP-hard, even when the number of processors is unbounded and task duplication is allowed. A simple greedy algorithm is presented for this problem which, for a task graph with arbitrary granularity, produces a schedule whose makespan is at most twice optimal. Indeed, the quality of the schedule improves as the granularity of the task graph becomes larger. For example, if the granularity is at least 1/2, the makespan of the schedule is at most 5/3 times optimal. For a task graph with n tasks and e inter-task communication constraints, the algorithm runs in O(n(n lg n+e)) time, which is n times faster than the currently best known algorithm for this problem. Similar algorithms are developed that produce: (1) optimal schedules for coarse grain graphs; (2) 2-optimal schedules for trees with no task duplication; and (3) optimal schedules for coarse grain trees with no task duplication.
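
The flavor of task clustering can be shown with a Sarkar-style edge-zeroing sketch; note this is a different, simpler heuristic than the paper's greedy 2-approximation, and the makespan estimate below optimistically ignores serialization of tasks within a cluster.

```python
# Edge-zeroing clustering sketch: merge clusters along the heaviest
# communication edge whenever the merge does not worsen the estimated
# makespan. The estimator is the DAG critical path with intra-cluster
# edges costing zero; it ignores serialization, so it is optimistic.

def critical_path(graph, w, cluster):
    memo = {}
    def bl(u):
        if u not in memo:
            memo[u] = w[u] + max(
                ((0 if cluster[u] == cluster[v] else c) + bl(v)
                 for v, c in graph.get(u, {}).items()), default=0)
        return memo[u]
    return max(bl(u) for u in w)

def edge_zeroing(graph, w):
    cluster = {u: u for u in w}                  # each task starts alone
    edges = sorted(((c, u, v) for u, s in graph.items()
                    for v, c in s.items()), reverse=True)
    best = critical_path(graph, w, cluster)
    for c, u, v in edges:                        # heaviest edge first
        trial = dict(cluster)
        cu, cv = trial[u], trial[v]
        for k in trial:                          # merge v's cluster into u's
            if trial[k] == cv:
                trial[k] = cu
        m = critical_path(graph, w, trial)
        if m <= best:                            # keep non-worsening merges
            cluster, best = trial, m
    return cluster, best

g = {'a': {'b': 10, 'c': 10}, 'b': {'d': 10}, 'c': {'d': 10}}
w = {'a': 1, 'b': 1, 'c': 1, 'd': 1}
print(edge_zeroing(g, w))   # heavy edges get zeroed into one cluster
```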

206 citations


Journal ArticleDOI
TL;DR: The workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5 are compared and contrasted to gain more insight into the general principles that should guide multiprocessor file-system design.
Abstract: Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of multiprocessor file systems. The design of a high-performance multiprocessor file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of multiprocessor file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide multiprocessor file-system design.

203 citations


Journal ArticleDOI
TL;DR: Inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed memory MIMD parallel computers, is discussed; the general d-dimensional ISP algorithm is described and empirical results with two- and three-dimensional, non-hierarchical particle methods are reported.
Abstract: We discuss inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed memory MIMD parallel computers. We consider the case of a dynamic workload distributed on a uniform mesh, and compare ISP against orthogonal recursive bisection (ORB) and a median-of-medians variant of ORB, ORB-MM. We present two results. First, ISP and ORB-MM are superior to ORB in rendering balanced workloads, because they are more fine-grained, and incur communication overheads that are comparable to ORB. Second, ISP is more attractive than ORB-MM from a software engineering standpoint because it avoids elaborate bookkeeping. Whereas ISP partitionings can be described succinctly as logically contiguous segments of the line, ORB-MM's partitionings are inherently unstructured. We describe the general d-dimensional ISP algorithm and report empirical results with two- and three-dimensional, non-hierarchical particle methods.
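
To make the idea concrete, here is a minimal 2D ISP sketch: cells are ordered by the standard Hilbert-curve index (the usual xy2d conversion) and the resulting weighted line is cut into contiguous, roughly equal-weight segments. The workload values are made up for the demo.

```python
# 2D inverse spacefilling partitioning sketch using a Hilbert curve:
# order mesh cells along the curve, then cut the weighted line into
# P contiguous, load-balanced segments.

def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the curve orientation is preserved."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Index of cell (x, y) along the Hilbert curve of an n x n grid."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d

def isp_partition(weights, n, nprocs):
    """weights: {(x, y): work}; returns {(x, y): processor}."""
    cells = sorted(weights, key=lambda c: xy2d(n, *c))
    total = sum(weights.values())
    part, acc, p = {}, 0.0, 0
    for c in cells:
        # cut the curve when the running share exceeds processor p's quota
        if acc >= total * (p + 1) / nprocs and p < nprocs - 1:
            p += 1
        part[c] = p
        acc += weights[c]
    return part

n = 8
work = {(x, y): 1 + (x < 4) * 3 for x in range(n) for y in range(n)}
parts = isp_partition(work, n, nprocs=4)
loads = [sum(w for c, w in work.items() if parts[c] == p) for p in range(4)]
print(loads)   # four roughly equal shares of the 160 total units of work
```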

151 citations


Journal ArticleDOI
TL;DR: The theoretical background for the design of deadlock-free adaptive routing algorithms for virtual cut-through and store-and-forward switching is developed and a design methodology is proposed, which automatically supplies fully adaptive, minimal and non-minimal routing algorithms.
Abstract: This paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for virtual cut-through and store-and-forward switching. This theory is valid for networks using either central buffers or edge buffers. Some basic definitions and three theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cyclic dependencies between routing resources. Moreover, we propose a necessary and sufficient condition for deadlock-free routing. Also, a design methodology is proposed. It supplies fully adaptive, minimal and non-minimal routing algorithms, guaranteeing that they are deadlock-free. The theory proposed in this paper extends the necessary and sufficient condition for wormhole switching previously proposed by us. The resulting routing algorithms are more flexible than the ones for wormhole switching. Also, the design methodology is much easier to apply because it automatically supplies deadlock-free routing algorithms.
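
A sketch of the kind of verification this theory supports: given a (hypothetical) channel wait-for relation, we can test it for cycles; in Duato-style conditions, cycles are tolerated in the full adaptive channel set as long as a restricted escape set remains acyclic. This is a simplification of the paper's necessary and sufficient condition.

```python
# Cycle detection over a channel dependency relation. The dependency
# sets below are hypothetical inputs, not derived from a real router.

def has_cycle(deps):
    """deps: {channel: set of channels it can wait on}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GRAY
        for d in deps.get(c, ()):
            if color.get(d, WHITE) == GRAY:
                return True                 # back edge closes a cycle
            if color.get(d, WHITE) == WHITE and dfs(d):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in deps)

# The full adaptive channel set may be cyclic...
full = {'a': {'b'}, 'b': {'c'}, 'c': {'a'}}
# ...but the escape channels must not be.
escape = {'a1': {'b1'}, 'b1': {'c1'}, 'c1': set()}
print(has_cycle(full), has_cycle(escape))   # True False
```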

145 citations


Journal ArticleDOI
TL;DR: Simulations of the one-fault-tolerant routing algorithm and other minimal and nonminimal routing algorithms in a two-dimensional mesh indicate that misrouting increases communication latencies significantly at high throughputs, so it is concluded that misrouting should be used only for increasing the degree of fault tolerance, never for just increasing adaptiveness.
Abstract: Previous methods of making wormhole-routed meshes fault tolerant have been based on adding virtual channels to the networks. This paper proposes an alternative method, one based on the turn model for designing wormhole routing algorithms. The turn model produces routing algorithms that are deadlock free, very adaptive, minimal or nonminimal, and livelock free for direct networks, whether or not they contain virtual channels. This paper illustrates how to modify the routing algorithms produced by the turn model to handle dynamic faults. It first describes how to modify the negative-first routing algorithm, which the turn model produces for n-dimensional meshes without virtual channels, to make it one-fault tolerant. Simulations of the one-fault-tolerant routing algorithm and other minimal and nonminimal routing algorithms in a two-dimensional mesh indicate that misrouting increases communication latencies significantly at high throughputs. The conclusion is that misrouting should be used only for increasing the degree of fault tolerance, never for just increasing adaptiveness. Finally, the paper describes how to modify the negative-first routing algorithm to make it (n-1)-fault tolerant for n-dimensional meshes without virtual channels.
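
A minimal sketch of negative-first routing on a 2D mesh, the turn-model algorithm the paper starts from: all hops in negative directions precede any hop in a positive direction. The fault-tolerance modifications are not reproduced, and the real algorithm can choose adaptively among the permitted hops; this version is deterministic.

```python
# Negative-first routing on a 2D mesh: take every needed hop in the
# negative directions (west, south) before any positive hop (east,
# north), which removes the turns that could close a dependency cycle.

def negative_first_hops(src, dst):
    """Yield the sequence of hops from src=(x, y) to dst=(x, y)."""
    x, y = src
    dx, dy = dst[0] - x, dst[1] - y
    while dx < 0 or dy < 0:            # negative phase first
        if dx < 0:
            x, dx = x - 1, dx + 1
            yield ('west', (x, y))
        else:
            y, dy = y - 1, dy + 1
            yield ('south', (x, y))
    while dx > 0 or dy > 0:            # then the positive phase
        if dx > 0:
            x, dx = x + 1, dx - 1
            yield ('east', (x, y))
        else:
            y, dy = y + 1, dy - 1
            yield ('north', (x, y))

print(list(negative_first_hops((2, 2), (0, 3))))
# west, west, then north: negative moves strictly precede positive ones
```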

132 citations


Journal ArticleDOI
Vijay K. Garg, B. Waldecker
TL;DR: This paper presents algorithms which detect if the given strong global predicate became true in a run of a distributed program, and these algorithms can be executed on line as well as off line.
Abstract: This paper discusses detection of global predicates in a distributed program. A run of a distributed program results in a set of sequential traces, one for each process. These traces may be combined to form many global sequences consistent with the single run of the program. A strong global predicate is true in a run if it is true for all global sequences consistent with the run. We present algorithms which detect if the given strong global predicate became true in a run of a distributed program. Our algorithms can be executed on line as well as off line. Moreover, our algorithms do not assume that underlying channels satisfy FIFO ordering.

120 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of leader election in the presence of intermittent link failures is studied and a message optimal algorithm with message complexity O(N^2) is presented.
Abstract: We study the problem of leader election in the presence of intermittent link failures. We assume that up to N/2-1 links incident on each node may fail during the execution of the protocol. We present a message optimal algorithm with message complexity O(N^2).

100 citations


Journal ArticleDOI
TL;DR: The LogP model is shown to be a valuable guide in the development of parallel algorithms and a good predictor of implementation performance; the model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention.
Abstract: In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
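
A back-of-the-envelope example of the kind of reasoning LogP supports: the formula below is a standard LogP estimate for a pipelined sequence of message injections, applied to an all-to-all phase like those in the sorting algorithms. The parameter values are illustrative, not measured CM-5 numbers.

```python
# LogP cost estimate for one all-to-all phase in which each of P
# processors sends one message to every other processor.

def logp_alltoall_time(L, o, g, P, msgs_per_pair=1):
    """Time for a processor to send (P-1)*msgs_per_pair messages.

    The sender pays overhead `o` per injection and must respect the
    gap `g` between injections, so consecutive sends are max(g, o)
    apart; the last message then needs latency L plus the receiver's o.
    """
    m = (P - 1) * msgs_per_pair
    send_phase = (m - 1) * max(g, o) + o   # pipelined injections
    return send_phase + L + o              # drain the last message

print(logp_alltoall_time(L=6, o=2, g=4, P=32))   # 130 with these toy values
```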

Journal ArticleDOI
TL;DR: New parallel simulated annealing algorithms which allow multiple Markov chains to be traced simultaneously by PEs which may communicate with each other which can find a solution of equivalent (or even better) quality up to an order of magnitude faster than the conventional parallel schemes.
Abstract: Simulated annealing is a general-purpose optimization technique capable of finding an optimal or near-optimal solution in various applications. However, the long execution time required for a good quality solution has been a major drawback in practice. Extensive studies have been carried out to develop parallel algorithms for simulated annealing. Most of them were not very successful, mainly because multiple processing elements (PEs) were required to follow a single Markov chain and, therefore, only a limited parallelism was exploited. In this paper, we propose new parallel simulated annealing algorithms which allow multiple Markov chains to be traced simultaneously by PEs which may communicate with each other. We have considered both synchronous and asynchronous implementations of the algorithms. Their performance has been analyzed in detail and also verified by extensive experimental results. It has been shown that for graph partitioning the proposed parallel simulated annealing schemes can find a solution of equivalent (or even better) quality up to an order of magnitude faster than the conventional parallel schemes. Among the proposed schemes, the one where PEs exchange information dynamically (not with a fixed period) performs best.
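
A sketch of the multiple-Markov-chain idea in its synchronous form: several chains anneal independently and periodically adopt the best solution found so far. This is a simplification (the paper's best scheme exchanges information dynamically rather than with a fixed period), and the toy objective function is invented for the demo.

```python
# Multiple-Markov-chain simulated annealing, synchronous variant:
# chains step independently and exchange the best state periodically.

import math, random

def anneal_multichain(energy, neighbor, init, n_chains=4,
                      steps=2000, exchange_every=100, t0=10.0):
    random.seed(1)
    states = [init() for _ in range(n_chains)]
    costs = [energy(s) for s in states]
    for step in range(steps):
        t = t0 * (0.999 ** step)                     # geometric cooling
        for i in range(n_chains):                    # one move per chain
            cand = neighbor(states[i])
            dE = energy(cand) - costs[i]
            if dE <= 0 or random.random() < math.exp(-dE / t):
                states[i], costs[i] = cand, costs[i] + dE
        if step % exchange_every == 0:               # synchronous exchange
            b = min(range(n_chains), key=costs.__getitem__)
            states = [states[b]] * n_chains          # all adopt the best
            costs = [costs[b]] * n_chains
    b = min(range(n_chains), key=costs.__getitem__)
    return states[b], costs[b]

# Toy usage: minimize a bumpy 1D function.
f = lambda x: (x - 3) ** 2 + 2 * math.sin(5 * x)
best, cost = anneal_multichain(
    energy=f, neighbor=lambda x: x + random.uniform(-0.5, 0.5),
    init=lambda: random.uniform(-10, 10))
print(round(best, 2), round(cost, 2))
```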

Journal ArticleDOI
TL;DR: It is found that sequential prefetching does as well as, and in some cases even better than, stride prefetching for five applications, and offers the extra advantage of consuming less memory-system bandwidth.
Abstract: We study the efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does as well as, and in some cases even better than, stride prefetching for five applications. This is because 1) most strides are shorter than the block size (we assume 32-byte blocks), which means that sequential prefetching is as effective for these stride accesses, and 2) sequential prefetching also exploits the locality of read misses with nonstride accesses. However, since stride prefetching in general results in fewer useless prefetches, it offers the extra advantage of consuming less memory-system bandwidth.
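
The block-size effect can be reproduced with a toy prefetch-on-miss simulation, assuming 32-byte blocks as in the paper: with a 4-byte stride, every sequentially prefetched block is fully used, roughly halving misses. The cache organization and trace are illustrative only.

```python
# Sequential (one-block-lookahead, prefetch-on-miss) prefetching against
# a toy direct-mapped cache, with a stride shorter than the block size.

BLOCK = 32                             # bytes per cache block

def run(trace, n_sets, degree=1):
    """Return (misses, prefetches) for a sequence of byte addresses."""
    cache = {}                         # set index -> resident block number
    misses = prefetches = 0
    def present(block):
        return cache.get(block % n_sets) == block
    for addr in trace:
        blk = addr // BLOCK
        if not present(blk):
            misses += 1
            cache[blk % n_sets] = blk
            for k in range(1, degree + 1):   # prefetch the next block(s)
                if not present(blk + k):
                    prefetches += 1
                    cache[(blk + k) % n_sets] = blk + k
    return misses, prefetches

stride4 = [i * 4 for i in range(256)]        # stride 4 < 32-byte block
print(run(stride4, n_sets=64, degree=0))     # (32, 0): one miss per block
print(run(stride4, n_sets=64, degree=1))     # (16, 16): misses halved
```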

Journal ArticleDOI
TL;DR: A number of reduction rules are introduced and it is shown how they can be applied to Ada nets, which are automatically generated Petri net models of Ada tasking, and experimental results from applying the reduction process are discussed.
Abstract: As part of our continuing research on using Petri nets to support automated analysis of Ada tasking behavior, we have investigated the application of Petri net reduction for deadlock analysis. Although reachability analysis is an important method to detect deadlocks, it is in general inefficient or even intractable. Net reduction can aid the analysis by reducing the size of the net while preserving relevant properties. We introduce a number of reduction rules and show how they can be applied to Ada nets, which are automatically generated Petri net models of Ada tasking. We define a reduction process and a method by which a useful description of a detected deadlock state can be obtained from the reduced net's information. A reduction tool and experimental results from applying the reduction process are discussed.

Journal ArticleDOI
TL;DR: It is shown that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, which makes the analysis procedure more efficient.
Abstract: This paper presents a framework, based on global array data-flow analysis, to reduce communication costs in a program being compiled for a distributed memory machine. We introduce available section descriptor, a novel representation of communication involving array sections. This representation allows us to apply techniques for partial redundancy elimination to obtain powerful communication optimizations. With a single framework, we are able to capture optimizations like (1) vectorizing communication, (2) eliminating communication that is redundant on any control flow path, (3) reducing the amount of data being communicated, (4) reducing the number of processors to which data must be communicated, and (5) moving communication earlier to hide latency, and to subsume previous communication. We show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, which makes the analysis procedure more efficient. We present results from a preliminary implementation of this framework, which are extremely encouraging, and demonstrate the effectiveness of this analysis in improving the performance of programs.

Journal ArticleDOI
TL;DR: Three broadcast algorithms and lower bounds on the three main components of the broadcast time for 2-dimensional torus networks (wrap-around meshes) that use synchronous circuit-switched routing are presented.
Abstract: In this paper we present three broadcast algorithms and lower bounds on the three main components of the broadcast time for 2-dimensional torus networks (wrap-around meshes) that use synchronous circuit-switched routing. The first algorithm is based on a recursive tiling of a torus and is optimal in terms of both phases and intermediate switch settings when the start-up time to initiate message transmissions is the dominant cost. It is the first broadcast algorithm to match the lower bound of log_5 N on the number of phases (where N is the number of nodes). The second and third algorithms are hybrids which combine circuit switching with the pipelining and arc-disjoint spanning trees techniques that are commonly used to speed up store-and-forward routing. When the propagation time of messages through the network is significant, our hybrid algorithms achieve close to optimal performance in terms of phases, intermediate switch settings, and total transmission time. They are the first algorithms to achieve this performance in terms of all three parameters simultaneously.

Journal ArticleDOI
TL;DR: The MS (for "mixed-sum") algorithm proposed to solve the Byzantine Agreement problem with dual failure modes is shown to overestimate the bound on the number of allowable faulty processors, and a corrected algorithm and a new bound are given.
Abstract: F.J. Meyer and D.K. Pradhan (1991) proposed the MS (for "mixed-sum") algorithm to solve the Byzantine Agreement (BA) problem with dual failure modes: arbitrary faults (Byzantine faults) and dormant faults (essentially omission faults and timing faults). Our study indicates that this algorithm uses an inappropriate method to eliminate the effects of dormant faults and that the bound on the number of allowable faulty processors is overestimated. This paper corrects the algorithm and gives a new bound for the allowable faulty processors.

Journal ArticleDOI
TL;DR: Efficient algorithms for redistribution between different cyclic(k) distributions, as defined in High Performance Fortran, are proposed, including a GCD method and an LCM method for the general case, and implemented on the Intel Touchstone Delta, where they perform well for different array sizes and numbers of processors.
Abstract: Dynamic redistribution of arrays is required very often in programs on distributed memory machines. This paper presents efficient algorithms for redistribution between different cyclic(k) distributions, as defined in High Performance Fortran. We first propose special optimized algorithms for a cyclic(x) to cyclic(y) redistribution when x is a multiple of y, or y is a multiple of x. We then propose two algorithms, called the GCD method and the LCM method, for the general cyclic(x) to cyclic(y) redistribution when there is no particular relation between x and y. We have implemented these algorithms on the Intel Touchstone Delta, and find that they perform well for different array sizes and numbers of processors.
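
The index bookkeeping behind any such redistribution can be sketched in a few lines; the brute-force version below only exposes the source/destination mapping, whereas the paper's GCD and LCM methods avoid scanning every element.

```python
# Brute-force cyclic(x) -> cyclic(y) redistribution bookkeeping: for
# each global index, compute its owner under both distributions and
# build the send sets.

def owner(i, blk, nprocs):
    """Owning processor of global index i under a cyclic(blk) distribution."""
    return (i // blk) % nprocs

def send_sets(n, x, y, nprocs):
    """sends[p][q] = global indices processor p must ship to q when
    going from cyclic(x) to cyclic(y) on the same nprocs processors."""
    sends = [[[] for _ in range(nprocs)] for _ in range(nprocs)]
    for i in range(n):
        src, dst = owner(i, x, nprocs), owner(i, y, nprocs)
        if src != dst:
            sends[src][dst].append(i)
    return sends

# cyclic(4) -> cyclic(2) on 3 processors, 24 elements:
for p, row in enumerate(send_sets(24, 4, 2, 3)):
    for q, idx in enumerate(row):
        if idx:
            print(f"P{p} -> P{q}: {idx}")
```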

Journal ArticleDOI
TL;DR: A trip-based model is proposed to support adaptive, distributed, and deadlock-free multiple multicast on any network with arbitrary topology using at most two virtual channels per physical channel.
Abstract: This paper focuses on efficient multicasting in wormhole-routed networks. A trip-based model is proposed to support adaptive, distributed, and deadlock-free multiple multicast on any network with arbitrary topology using at most two virtual channels per physical channel. This model significantly generalizes the path-based model proposed earlier, which works only for Hamiltonian networks and cannot be applied to networks whose topology has become arbitrary as a result of system faults. Fundamentals of the trip-based model, including the necessary and sufficient condition for it to be deadlock-free, and the use of an appropriate number of virtual channels to avoid deadlock, are investigated. The potential of this model is illustrated by applying it to hypercubes with faulty nodes. Simulation results indicate that the proposed model can implement multiple multicast on faulty hypercubes with negligible performance degradation.

Journal ArticleDOI
TL;DR: This paper presents a framework for a compiler algorithm for forwarding in shared-memory multiprocessors, and uses address traces to evaluate the performance impact of different levels of support for forwarding.
Abstract: Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.
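
A toy illustration of the forwarding idea (not the paper's compiler algorithm): on a write, the producer pushes the new value into the caches of compiler-identified consumers, so their later reads hit locally. The structures and counters are deliberately minimal and hypothetical.

```python
# Data forwarding in miniature: a write updates the producer's cache
# and pushes copies to the consumers the compiler identified.

class Proc:
    def __init__(self):
        self.cache = {}
        self.hits = self.misses = 0

    def read(self, mem, addr):
        if addr in self.cache:
            self.hits += 1            # local hit thanks to forwarding
        else:
            self.misses += 1          # long-latency remote access
            self.cache[addr] = mem[addr]
        return self.cache[addr]

def write(mem, procs, writer, addr, val, consumers=()):
    mem[addr] = val
    procs[writer].cache[addr] = val
    for c in consumers:               # compiler-inserted forwards
        procs[c].cache[addr] = val

mem, procs = {}, [Proc() for _ in range(4)]
write(mem, procs, writer=0, addr=100, val=7, consumers=[1, 2])
procs[1].read(mem, 100)
procs[3].read(mem, 100)
print(procs[1].misses, procs[3].misses)   # 0 1: forwarded vs. not
```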

Journal ArticleDOI
TL;DR: This paper emphasizes the versatility of the folded Petersen cube networks as a multicomputer interconnection topology by providing embeddings of many computationally important structures such as rings, multi-dimensional meshes, hypercubes, complete binary trees, tree machines, meshes of trees, and pyramids.
Abstract: We introduce and analyze a new interconnection topology, called the k-dimensional folded Petersen (FP_k) network, which is constructed by iteratively applying the Cartesian product operation on the well-known Petersen graph. Since the number of nodes in FP_k is restricted to a power of ten, for better scalability we propose a generalization, the folded Petersen cube network FPQ_{n,k} = Q_n × FP_k, which is a product of the n-dimensional binary hypercube (Q_n) and FP_k. The FPQ_{n,k} topology provides regularity, node- and edge-symmetry, optimal connectivity (and therefore maximal fault-tolerance), logarithmic diameter, modularity, and permits simple self-routing and broadcasting algorithms. With the same node-degree and connectivity, FPQ_{n,k} has smaller diameter and accommodates more nodes than Q_{n+3k}, and its packing density is higher compared to several other product networks. This paper also emphasizes the versatility of the folded Petersen cube networks as a multicomputer interconnection topology by providing embeddings of many computationally important structures such as rings, multi-dimensional meshes, hypercubes, complete binary trees, tree machines, meshes of trees, and pyramids. The dilation and edge-congestion of all such embeddings are at most two.
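
The construction is easy to reproduce: the sketch below builds the Petersen graph, forms its Cartesian product with a small hypercube, and confirms by brute-force BFS that the diameter of the product is the sum of the factors' diameters (the Petersen graph's diameter is 2).

```python
# Build FPQ_{n,k} = Q_n x FP_k as an explicit adjacency structure and
# check its diameter by BFS. Shown here for n = 1, k = 1 (20 nodes).

from collections import deque
from itertools import product

# The Petersen graph: outer 5-cycle, inner pentagram, five spokes.
PETERSEN = {i: set() for i in range(10)}
for i in range(5):
    for a, b in ((i, (i + 1) % 5),            # outer cycle
                 (5 + i, 5 + (i + 2) % 5),    # inner pentagram
                 (i, 5 + i)):                 # spoke
        PETERSEN[a].add(b)
        PETERSEN[b].add(a)

def cartesian(g, h):
    """Cartesian product of two graphs given as adjacency dicts."""
    adj = {(u, v): set() for u, v in product(g, h)}
    for u, v in adj:
        adj[(u, v)] |= {(w, v) for w in g[u]} | {(u, w) for w in h[v]}
    return adj

def diameter(adj):
    best = 0
    for s in adj:                     # BFS from every node
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        best = max(best, max(dist.values()))
    return best

Q1 = {0: {1}, 1: {0}}                  # 1-dimensional hypercube (an edge)
FPQ_1_1 = cartesian(Q1, PETERSEN)      # 20 nodes
print(len(FPQ_1_1), diameter(FPQ_1_1)) # 20 3: diam(Q_1) + diam(FP_1) = 1 + 2
```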

Journal ArticleDOI
TL;DR: This paper describes methods suitable for parallelized direct execution simulation of message-passing parallel programs, and reports on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), which has been built on the Intel Paragon.
Abstract: As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper, we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

Journal ArticleDOI
TL;DR: This paper considers the problem of all-to-all personalized communication in a torus of any dimension with the wormhole-routing capability, and proposes complete exchange algorithms that use an optimal number of phases (if each side of the torus is a multiple of eight) or an asymptotically optimal number of phases otherwise.
Abstract: All-to-all personalized communication, or complete exchange, is at the heart of numerous applications in parallel computing. It is one of the densest communication patterns. In this paper, we consider this problem in a torus of any dimension with the wormhole-routing capability. We propose complete exchange algorithms that use an optimal number of phases (if each side of the torus is a multiple of eight) or an asymptotically optimal number of phases (otherwise). Interestingly, in order to achieve this, we only make weak assumptions: that a node is capable of sending and receiving at most one message at a time, and that the network is capable of supporting the dimension-ordered (or e-cube) minimum routing.

Journal ArticleDOI
TL;DR: A framework to design fully-adaptive, deadlock-free wormhole algorithms for a variety of network topologies is presented, together with an analysis comparing the resource requirements and performance of a proposed algorithm, called the negative-hop algorithm, with some of the previously proposed algorithms for torus and mesh networks.
Abstract: This paper presents a framework to design fully-adaptive, deadlock-free wormhole algorithms for a variety of network topologies. The main theoretical contributions are: (a) design of new wormhole algorithms using store-and-forward algorithms, (b) a sufficient condition for deadlock-free routing by the wormhole algorithms so designed, and (c) a sufficient condition for deadlock-free routing by these wormhole algorithms with centralized flit buffers shared among multiple channels. To illustrate the theory, several wormhole algorithms based on store-and-forward hop schemes are designed. The hop-based wormhole algorithms can be applied to a variety of networks including torus, mesh, de Bruijn, and a class of Cayley networks, with the best known bounds on virtual channels for minimal routing on the last two classes of networks. An analysis comparing the resource requirements and performance of a proposed algorithm, called the negative-hop algorithm, with some of the previously proposed algorithms for torus and mesh networks is presented.

Journal ArticleDOI
TL;DR: An important and counter-intuitive result is shown, which proves that the authors can always obtain full-parallelism for MDFGs with more than one dimension.
Abstract: Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms.
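
The mechanics of multidimensional retiming can be shown in a few lines: a retiming r: V -> Z^2 replaces each edge delay d(e) by d(e) + r(u) - r(v), and full parallelism requires every transformed delay to be nonzero. The example graph and retiming below are made up, and legality conditions beyond nonzero delays (such as a realizable prologue) are omitted.

```python
# Apply a multidimensional retiming to an MDFG and check that no edge
# carries a zero delay vector, i.e., the loop body is fully parallel.

def retime(edges, r):
    """edges: [(u, v, (d1, d2))]; r: {node: (r1, r2)}.
    The new delay of edge u -> v is d + r(u) - r(v)."""
    return [(u, v, (d[0] + r[u][0] - r[v][0],
                    d[1] + r[u][1] - r[v][1])) for u, v, d in edges]

def fully_parallel(edges):
    """True if no edge carries a zero delay vector."""
    return all(d != (0, 0) for _, _, d in edges)

# A two-node loop body with a zero-delay (serializing) edge A -> B.
g = [('A', 'B', (0, 0)), ('B', 'A', (1, 1))]
print(fully_parallel(g))                       # False
g2 = retime(g, {'A': (0, 0), 'B': (-1, 0)})   # shifts a (1, 0) delay onto A -> B
print(g2, fully_parallel(g2))                  # True: both delays nonzero
```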

Journal ArticleDOI
TL;DR: This paper presents a simple game, and develops basic theory underlying a robust method for distributed coordination based on this game that makes use of finite state automata-one associated with each agent-which guide the agents.
Abstract: A simple game provides a framework within which agents can spontaneously self-organize. In this paper, we present this game, and develop basic theory underlying a robust method for distributed coordination based on this game. This method makes use of finite state automata-one associated with each agent-which guide the agents. We give a new, general method of analysis of these systems, which previously had been studied only in limited cases. We also provide a physical example, which should hint at the type of problems resolvable using this method.

Journal ArticleDOI
TL;DR: It turns out that, depending on the class, the complexity varies from P-time to NP-complete or co-NP-complete, and a new graph property called symmetricity played a central role in the analysis of anonymous networks.
Abstract: For Part I, see ibid. In anonymous networks, the processors do not have identity numbers. In Part I of this paper, we characterized the classes of networks on which some representative distributed computation problems are solvable under different conditions. A new graph property called symmetricity played a central role in our analysis of anonymous networks. In Part II, we turn our attention to the computational complexity issues. We first discuss the complexity of determining the symmetricity of a given graph, and then that of testing membership in each of the 16 classes of anonymous networks defined in Part I. It turns out that, depending on the class, the complexity varies from P-time to NP-complete or co-NP-complete.

Journal ArticleDOI
TL;DR: A new decomposition technique for hierarchical Cayley graphs is presented, which yields a very easy implementation of the divide and conquer paradigm for some problems on very complex architectures such as the star graph or the pancake.
Abstract: This paper presents a new decomposition technique for hierarchical Cayley graphs. This technique yields a very easy implementation of the divide and conquer paradigm for some problems on very complex architectures such as the star graph or the pancake. As applications, we introduce algorithms for broadcasting and prefix-like operations that improve the best known bounds for these problems. We also give the first nontrivial optimal gossiping algorithms for these networks. In star graphs and pancakes with N = n! processors, our algorithms take less than ⌈log N⌉ + 1.5n steps.

Journal ArticleDOI
TL;DR: The method presented in this paper delays data structure selection until the compile phase, thereby allowing the compiler to combine code optimization with explicit data structure selection, and enables the compilation of efficient code for sparse computations.
Abstract: The problem of compiler optimization of sparse codes is well known and no satisfactory solutions have been found yet. One of the major obstacles is formed by the fact that sparse programs explicitly deal with particular data structures selected for storing sparse matrices. This explicit data structure handling obscures the functionality of a code to such a degree that optimization of the code is prohibited, for instance, by the introduction of indirect addressing. The method presented in this paper delays data structure selection until the compile phase, thereby allowing the compiler to combine code optimization with explicit data structure selection. This method enables the compiler to generate efficient code for sparse computations. Moreover, the task of the programmer is greatly reduced in complexity.

Journal ArticleDOI
TL;DR: The results indicate that Banerjee's test is for all practical purposes as accurate as the more complex Omega test in detecting parallelism, but the Omega test is quite effective in proving the existence of dependences, in contrast with Banerjee's test, which can only disprove, or break, dependences.
Abstract: Data dependence analysis techniques are the main component of today's strategies for automatic detection of parallelism. Parallelism detection strategies are being incorporated in commercial compilers with increasing frequency because of the widespread use of processors capable of exploiting instruction-level parallelism and the growing importance of multiprocessors. An assessment of the accuracy of data dependence tests is therefore of great importance for compiler writers and researchers. The tests evaluated in this study include the generalized greatest common divisor test, three variants of Banerjee's test, and the Omega test. Their effectiveness was measured with respect to the Perfect Benchmarks and the linear algebra libraries EISPACK and LAPACK. Two methods were applied, one using only compile-time information for the analysis, and the second using information gathered during program execution. The results indicate that Banerjee's test is for all practical purposes as accurate as the more complex Omega test in detecting parallelism. However, the Omega test is quite effective in proving the existence of dependences, in contrast with Banerjee's test, which can only disprove, or break, dependences. The capability of the Omega test of proving dependences could have a significant impact on several compiler algorithms not considered in this study.
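
For concreteness, here is the single-subscript GCD test in a few lines: a dependence between accesses a[x*i + b1] and a[y*j + b2] requires gcd(x, y) to divide b2 - b1, so a failed divisibility check disproves the dependence, while a passed check is inconclusive (which is where Banerjee-style bounds tests come in).

```python
# The single-subscript GCD dependence test: the equation
# x*i - y*j = b2 - b1 has integer solutions iff gcd(x, y) divides b2 - b1.

from math import gcd

def gcd_test(x, b1, y, b2):
    """True  -> a dependence is possible (the test is inconclusive).
    False -> the dependence is disproved."""
    return (b2 - b1) % gcd(x, y) == 0

# a[2*i] vs a[2*j + 1]: even vs odd subscripts can never meet.
print(gcd_test(2, 0, 2, 1))   # False: independent, loop can be parallel
# a[2*i] vs a[4*j + 2]: gcd 2 divides 2, so the test cannot rule it out.
print(gcd_test(2, 0, 4, 2))   # True: may depend
```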