
Showing papers in "IEEE Transactions on Parallel and Distributed Systems in 1995"


Journal ArticleDOI
TL;DR: This paper proposes a necessary and sufficient condition for deadlock-free adaptive routing, the key for the design of fully adaptive routing algorithms with minimum restrictions, and shows the application of the new theory.
Abstract: Deadlock avoidance is a key issue in wormhole networks. A first approach by W.J. Dally and C.L. Seitz (1987) consists of removing the cyclic dependencies between channels. Many deterministic and adaptive routing algorithms have been proposed based on that approach. Although the absence of cyclic dependencies is a necessary and sufficient condition for deadlock-free deterministic routing, it is only a sufficient condition for deadlock-free adaptive routing. A more powerful approach by J. Duato (1991) only requires the absence of cyclic dependencies on a connected channel subset. The remaining channels can be used in almost any way. In this paper, we show that the previously mentioned approach is also a sufficient condition. Moreover, we propose a necessary and sufficient condition for deadlock-free adaptive routing. This condition is the key for the design of fully adaptive routing algorithms with minimum restrictions. An example shows the application of the new theory.
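
As a rough illustration of the classical cyclic-dependency test mentioned above (the Dally-Seitz sufficient condition, not the paper's new necessary-and-sufficient condition), the following sketch builds a channel dependency graph for a hypothetical routing function and checks it for cycles; the channel names and the toy routing relation are invented for illustration.

```python
# Sketch: detect cyclic channel dependencies for a (hypothetical) routing function.
# A dependency c1 -> c2 exists if a packet holding channel c1 may request c2 next.
# Acyclicity is the classic Dally/Seitz sufficient condition; the paper's weaker
# necessary-and-sufficient condition for adaptive routing is NOT reproduced here.

def has_cycle(dependencies):
    """dependencies: dict channel -> set of channels it may wait for."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in dependencies}

    def dfs(c):
        color[c] = GRAY
        for nxt in dependencies.get(c, ()):
            if color.get(nxt, WHITE) == GRAY:
                return True            # back edge: cyclic dependency found
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in list(dependencies))

# Toy example: a unidirectional 4-node ring routed without virtual channels
ring = {"c0": {"c1"}, "c1": {"c2"}, "c2": {"c3"}, "c3": {"c0"}}
print(has_cycle(ring))   # True -> deadlock cannot be ruled out by this test
```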

338 citations


Journal ArticleDOI
TL;DR: This work proves the exact conditions for an arbitrary checkpoint, or a set of checkpoints, to belong to a consistent global snapshot, a previously open problem.
Abstract: Consistent global snapshots are important in many distributed applications. We prove the exact conditions for an arbitrary checkpoint, or a set of checkpoints, to belong to a consistent global snapshot, a previously open problem. To describe the conditions, we introduce a generalization of Lamport's (1978) happened-before relation called a zigzag path.
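
To make the flavor of such conditions concrete, here is a minimal sketch of the weaker, classical check based only on happened-before, using vector clocks: checkpoints that are causally related can never belong to one consistent snapshot. This is only a necessary test; the paper's zigzag-path condition is strictly stronger and is not implemented here. The vector clocks in the example are invented.

```python
# Sketch: a NECESSARY check for a set of checkpoints (one per process) to be part
# of a consistent global snapshot: no checkpoint may happen-before another.
# The paper's exact (necessary AND sufficient) condition is stronger: it requires
# the absence of zigzag paths, which this vector-clock test does not capture.

def happened_before(vc_a, vc_b):
    """Vector-clock causality: a -> b iff VC(a) <= VC(b) componentwise and VC(a) != VC(b)."""
    return all(x <= y for x, y in zip(vc_a, vc_b)) and vc_a != vc_b

def pairwise_concurrent(checkpoint_clocks):
    """checkpoint_clocks: list of vector clocks, one chosen checkpoint per process."""
    for i, a in enumerate(checkpoint_clocks):
        for b in checkpoint_clocks[i + 1:]:
            if happened_before(a, b) or happened_before(b, a):
                return False   # causally related checkpoints can never be combined
    return True

# Hypothetical 3-process example
print(pairwise_concurrent([(2, 0, 1), (1, 3, 1), (0, 0, 2)]))  # True: passes the test
print(pairwise_concurrent([(2, 0, 0), (3, 1, 0), (0, 0, 1)]))  # False: first -> second
```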

262 citations


Journal ArticleDOI
TL;DR: This paper discusses a static algorithm for allocating and scheduling components of periodic tasks across sites in distributed systems that handles precedence, communication, and replication requirements of the subtasks of the tasks.
Abstract: This paper discusses a static algorithm for allocating and scheduling components of periodic tasks across sites in distributed systems. Besides dealing with the periodicity constraints (which have been the sole concern of many previous algorithms), this algorithm handles precedence, communication, and replication requirements of the subtasks of the tasks. The algorithm determines the allocation of subtasks of periodic tasks to sites, the scheduled start times of subtasks allocated to a site, and the schedule for communication along the communication channel(s). Simulation results show that the heuristics and search techniques incorporated in the algorithm are very effective.

253 citations


Journal ArticleDOI
TL;DR: Simulations of uniform and localized traffic patterns reveal that the normalized average internode distances in an HCN are better than in a comparable hypercube, a fact that has positive ramifications for the implementation of HCN-connected systems.
Abstract: We introduce a new interconnection network for large-scale distributed memory multiprocessors called the hierarchical cubic network (HCN). We establish that the numbers of routing steps needed by several data parallel applications running on an HCN-based system and on a hypercube-based system are about identical. Further, hypercube connections can be emulated on the HCN in constant time. Simulations of uniform and localized traffic patterns reveal that the normalized average internode distances in an HCN are better than in a comparable hypercube. Additionally, the HCN has about three-fourths the diameter of a comparable hypercube, although it uses about half as many links per node, a fact that has positive ramifications for the implementation of HCN-connected systems.

188 citations


Journal ArticleDOI
TL;DR: Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time of up to 78%, 58%, and 25%, respectively.
Abstract: To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time of up to 78%, 58%, and 25%, respectively.
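
A minimal sketch of such an adaptive controller is shown below; the counters, window size, and thresholds are invented, and the paper's exact effectiveness metric and update policy are not reproduced.

```python
# Sketch of an adaptive sequential prefetcher: on each miss, prefetch `degree`
# consecutive blocks; periodically raise or lower `degree` based on the measured
# fraction of prefetched blocks that were actually referenced.

class AdaptiveSequentialPrefetcher:
    def __init__(self, degree=1, max_degree=8, window=64,
                 raise_thresh=0.75, lower_thresh=0.25):
        self.degree = degree
        self.max_degree = max_degree
        self.window = window
        self.raise_thresh = raise_thresh
        self.lower_thresh = lower_thresh
        self.issued = 0          # prefetches issued in the current window
        self.useful = 0          # issued prefetches later hit by the processor
        self.pending = set()

    def on_miss(self, block):
        """Return the blocks to prefetch after a demand miss on `block`."""
        blocks = [block + k for k in range(1, self.degree + 1)]
        self.pending.update(blocks)
        self.issued += len(blocks)
        if self.issued >= self.window:
            self._adapt()
        return blocks

    def on_reference(self, block):
        if block in self.pending:        # a prefetched block proved useful
            self.pending.discard(block)
            self.useful += 1

    def _adapt(self):
        ratio = self.useful / self.issued
        if ratio > self.raise_thresh:
            self.degree = min(self.degree + 1, self.max_degree)
        elif ratio < self.lower_thresh:
            self.degree = max(self.degree - 1, 1)   # never drop below one block
        self.issued = self.useful = 0
```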

167 citations


Journal ArticleDOI
TL;DR: A new class of adaptive routing algorithms, misrouting backtracking with m misroutes (MB-m), is presented, made possible by PCS, and an analysis of the performance and static fault-tolerant properties of MB-m is provided.
Abstract: Our goal is to reconcile the conflicting demands of performance and fault-tolerance in interprocessor communication. To this end, we propose a pipelined communication mechanism, pipelined circuit-switching (PCS), which is a variant of the well-known wormhole routing (WR) mechanism. PCS relaxes some of the routing constraints imposed by WR and as a result enables routing behavior that cannot otherwise be realized. This paper presents a new class of adaptive routing algorithms, misrouting backtracking with m misroutes (MB-m), which is made possible by PCS. We provide an analysis of the performance and static fault-tolerant properties of MB-m. The results of an experimental evaluation of PCS and MB-3 are also presented. This methodology provides performance approaching that of WR, while realizing a level of resilience to static faults that is difficult to achieve with WR.

167 citations


Journal ArticleDOI
TL;DR: This paper presents a new combination of residue number systems with efficient modulo reduction methods; two methods are compared, and the faster one is scrutinized in detail.
Abstract: Residue number systems provide a good means for extremely long integer arithmetic. Their carry-free operations make parallel implementations feasible. Some applications involving very long integers, such as public key encryption, rely heavily on fast modulo reductions. This paper shows a new combination of residue number systems with efficient modulo reduction methods. Two methods are compared, and the faster one is scrutinized in detail. Both methods have the same order of complexity, O(log n), with n denoting the number of registers involved.
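
For readers unfamiliar with residue arithmetic, here is a minimal sketch of the representation itself: carry-free, per-modulus operations plus Chinese Remainder Theorem reconstruction. The small moduli are invented, and the paper's modulo-reduction methods are not shown.

```python
# Sketch of residue-number-system arithmetic: each long integer is held as a tuple
# of residues modulo pairwise-coprime moduli, so additions and multiplications can
# be done independently (in parallel) per modulus; CRT converts back.

from math import prod

MODULI = (13, 17, 19, 23)            # pairwise coprime, product = 96577

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_mul(a, b):
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return total % M

a, b = 1234, 5678
print(from_rns(rns_add(to_rns(a), to_rns(b))))          # 6912
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % prod(MODULI)
```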

154 citations


Journal ArticleDOI
TL;DR: This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors and shows that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers.
Abstract: This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.

124 citations


Journal ArticleDOI
TL;DR: A simple method is presented for eliminating mode-lock (the trapping of local oscillator phase in undesirable stable equilibria where global phase is not aligned) in k-ary Cartesian meshes, and a proof of its correctness for two-dimensional networks is given.
Abstract: It has historically been difficult to distribute a well-aligned hardware clock throughout the physical extent of a synchronous processor. Traditionally, this task has been accomplished by distributing the output of a central oscillator over a tree-like network, with repeaters at necessary intervals. While straightforward in concept, this method suffers from poor reliability, poor scalability, and high skew. In this paper, we present an alternative approach, distributed synchronous clocking, that maintains the simplicity of synchronous operation without suffering the drawbacks of centralized clocking. A network of independent oscillators takes the place of the centralized clock source, providing separate clock signals to the physically distant parts of a computing system. A distributed error correction algorithm effects global phase alignment by utilizing local comparisons of neighboring oscillator phase. In contrast to centralized clock distribution, distributed clocking has the inherent potential for complete scalability and graceful degradation. However, because oscillator phase is a modular quantity, a naive implementation of distributed synchronous clocking can suffer from mode-lock: the trapping of local oscillator phase in undesirable stable equilibria where global phase is not aligned. We present a simple method for eliminating this problem in k-ary Cartesian meshes and give a proof of its correctness for two-dimensional networks. An electronic implementation is also presented, and several engineering issues relating to error tolerance are discussed.
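
To see why mode-lock can arise, consider a minimal simulation (our own construction, not the paper's circuit or its correction rule) in which each oscillator on a mesh nudges its phase toward its neighbors using purely local, modular phase comparisons.

```python
# Sketch: a naive distributed phase-correction loop on a 2-D mesh of oscillators.
# Each node adjusts its phase using only local comparisons with its neighbors.
# Because phase is modular, this naive rule can settle into mode-lock (stable but
# globally misaligned states); the paper's method for eliminating mode-lock in
# k-ary meshes is not reproduced here. All constants are illustrative.

import math, random

K, GAIN, STEPS = 8, 0.2, 500
phase = [[random.uniform(0, 2 * math.pi) for _ in range(K)] for _ in range(K)]

def wrap(d):
    """Map a phase difference into [-pi, pi)."""
    return (d + math.pi) % (2 * math.pi) - math.pi

for _ in range(STEPS):
    new = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            nbrs = []
            if i > 0: nbrs.append(phase[i - 1][j])
            if i < K - 1: nbrs.append(phase[i + 1][j])
            if j > 0: nbrs.append(phase[i][j - 1])
            if j < K - 1: nbrs.append(phase[i][j + 1])
            err = sum(wrap(p - phase[i][j]) for p in nbrs) / len(nbrs)
            new[i][j] = (phase[i][j] + GAIN * err) % (2 * math.pi)
    phase = new

spread = max(abs(wrap(phase[i][j] - phase[0][0])) for i in range(K) for j in range(K))
print("residual phase spread (rad):", spread)   # near 0 unless the mesh mode-locked
```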

123 citations


Journal ArticleDOI
TL;DR: This paper proposes two different partitioning schemes for an inverted file system for a shared-everything multiprocessor machine with multiple disks and studies the performance of these schemes by simulation under a number of workloads.
Abstract: Multiple-disk I/O systems (disk arrays) have been an attractive approach to meet high performance I/O demands in data intensive applications such as information retrieval systems. When we partition and distribute files across multiple disks to exploit the potential for I/O parallelism, a balanced I/O workload distribution becomes important for good performance. Naturally, the performance of a parallel information retrieval system using an inverted file structure is affected by the partitioning scheme of the inverted file. In this paper, we propose two different partitioning schemes for an inverted file system for a shared-everything multiprocessor machine with multiple disks. We study the performance of these schemes by simulation under a number of workloads in which the term frequencies in the documents, the term frequencies in the queries, the number of disks, and the multiprogramming level are varied.
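
As background, the two natural axes along which an inverted file can be split across disks look roughly like this; the toy index is invented, and the paper's specific schemes and simulated workloads are not reproduced.

```python
# Sketch of the two natural ways to partition an inverted file across disks:
# by term (each term's whole posting list lives on one disk) or by document
# (each disk holds the postings for a slice of the document collection).

def partition_by_term(inverted_file, num_disks):
    disks = [dict() for _ in range(num_disks)]
    for term, postings in inverted_file.items():
        disks[hash(term) % num_disks][term] = postings
    return disks

def partition_by_document(inverted_file, num_disks):
    disks = [dict() for _ in range(num_disks)]
    for term, postings in inverted_file.items():
        for doc_id in postings:
            disks[doc_id % num_disks].setdefault(term, []).append(doc_id)
    return disks

index = {"parallel": [1, 2, 5, 8], "disk": [2, 3, 8], "query": [5]}
print(partition_by_term(index, 2))       # one disk per term's full posting list
print(partition_by_document(index, 2))   # every disk may hold part of each list
```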

117 citations


Journal ArticleDOI
TL;DR: FT-Linda is described, a version of Linda that addresses this problem by providing two major enhancements that facilitate the writing of fault-tolerant applications: stable tuple spaces and atomic execution of tuple space operations.
Abstract: Linda is a language for programming parallel applications whose most notable feature is a distributed shared memory called tuple space. While the language is suitable for a wide variety of programs, one shortcoming of Linda as commonly defined and implemented is a lack of support for writing programs that can tolerate failures in the underlying computing platform. This paper describes FT-Linda, a version of Linda that addresses this problem by providing two major enhancements that facilitate the writing of fault-tolerant applications: stable tuple spaces and atomic execution of tuple space operations. The former is a type of stable storage in which tuple values are guaranteed to persist across failures, while the latter allows collections of tuple operations to be executed in an all-or-nothing fashion despite failures and concurrency. The design of these enhancements is presented in detail and illustrated by examples drawn from both the Linda and fault-tolerance domains. An implementation of FT-Linda for a network of workstations is also described. The design is based on replicating the contents of stable tuple spaces to provide failure resilience and then updating the copies using atomic multicast. This strategy allows an efficient implementation in which only a single multicast message is needed for each atomic collection of tuple space operations.

Journal ArticleDOI
TL;DR: This paper focuses on three novel aspects in the design and implementation of CCL: the introduction of process groups, the definition of semantics that ensures correctness, and the design of new and tunable algorithms based on a realistic point-to-point communication model.
Abstract: A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), has been designed for IBM's line of scalable parallel computer products. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model.
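
As a flavor of the kind of tunable, point-to-point-based algorithm such a library can build on, here is a generic binomial-tree broadcast schedule; this is our sketch, not the CCL implementation or its API, and rank 0 is assumed to be the root.

```python
# Sketch: a binomial-tree broadcast expressed purely as rounds of point-to-point
# send/receive pairs, the kind of collective a library can build over its
# point-to-point communication model.

def broadcast_schedule(group_size):
    """Rank 0 is the root; each round is a list of (sender, receiver) pairs."""
    rounds, have_data, step = [], {0}, 1
    while step < group_size:
        this_round = [(r, r ^ step) for r in sorted(have_data)
                      if (r ^ step) < group_size and (r ^ step) not in have_data]
        have_data.update(recv for _, recv in this_round)
        rounds.append(this_round)
        step <<= 1
    return rounds

for i, rnd in enumerate(broadcast_schedule(8)):
    print("round", i, rnd)   # 3 rounds for 8 processes: ceil(log2 p) steps
```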

Journal ArticleDOI
TL;DR: In this paper, the authors present runtime and compile-time analysis for block structured codes on distributed memory parallel machines in an efficient and machine-independent fashion, which can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile time.
Abstract: In compiling applications for distributed memory machines, runtime analysis is required when data to be communicated cannot be determined at compile time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the runtime analysis required. The library is currently implemented on several different systems. We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and the compiler parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library.

Journal ArticleDOI
TL;DR: A global Finite State Machine model characterizing the protocol behavior is built, and protocol verification becomes equivalent to finding whether or not the global FSM may enter erroneous states, using a symbolic state expansion procedure.
Abstract: We introduce a cache protocol verification technique based on a symbolic state expansion procedure. A global Finite State Machine (FSM) model characterizing the protocol behavior is built, and protocol verification becomes equivalent to finding whether or not the global FSM may enter erroneous states. In order to reduce the complexity of the state expansion process, all the caches in the same state are grouped into an equivalence class and the number of caches in the class is symbolically represented by a repetition constructor. This symbolic representation is partly justified by the symmetry and homogeneity of cache-based systems. However, the key idea behind the representation is to exploit a unique property of cache coherence protocols: the fact that protocol correctness is not dependent on the exact number of cached copies. Rather, symbolic states only need to keep track of whether the caches have 0, 1, or multiple copies. The resulting symbolic state expansion process only takes a few steps and verifies the protocol for any system size. Therefore, it is more efficient and reliable than current approaches. The verification procedure is first applied to the verification of five existing protocols under the assumption of atomic protocol transitions. A simple snooping protocol on a split-transaction shared bus is also verified to illustrate the extension of our approach to protocols with nonatomic transitions.
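
The core idea, counting caches per protocol state only as 0, 1, or "many", can be illustrated with a self-contained sketch; the toy invalidation protocol, its error condition, and the count abstraction below are simplified stand-ins of our own, not the five protocols verified in the paper.

```python
# Sketch of symbolic state expansion with {0, 1, 'many'} repetition counts, applied
# to a toy MSI-style invalidation protocol with atomic transitions. A symbolic state
# records only how many caches are in each protocol state, so one expansion covers
# every system size.

from collections import deque

MANY = 'many'

def inc(c):                    # abstract count increment
    return 1 if c == 0 else MANY

def dec(c):                    # abstract decrement; 'many' may mean 2 or more
    return [0] if c == 1 else [1, MANY]

def add(a, b):                 # abstract sum (used when all copies change state)
    if a == 0: return b
    if b == 0: return a
    return MANY

def successors(state):
    m, s, i = state
    out = []
    if i != 0:                 # read miss by an invalid cache: any M copy degrades to S
        for i2 in dec(i):
            out.append((0, inc(add(s, m)), i2))
    for src, cnt in (('I', i), ('S', s)):   # write by an I or S cache: others invalidated
        if cnt != 0:
            for c2 in dec(cnt):
                rest_s, rest_i = (s, c2) if src == 'I' else (c2, i)
                out.append((1, 0, add(rest_i, add(rest_s, m))))
    if m != 0:                 # eviction (write-back) of the modified copy
        out.extend((m2, s, inc(i)) for m2 in dec(m))
    if s != 0:                 # eviction of a shared copy
        out.extend((m, s2, inc(i)) for s2 in dec(s))
    return out

def erroneous(state):
    m, s, _ = state
    return m == MANY or (m != 0 and s != 0)   # >1 writer, or writer alongside readers

seen = {(0, 0, MANY)}                          # initially: no M, no S, many I caches
frontier = deque(seen)
while frontier:
    st = frontier.popleft()
    assert not erroneous(st), st
    for nxt in successors(st):
        if nxt not in seen:
            seen.add(nxt)
            frontier.append(nxt)
print("verified for any number of caches;", len(seen), "symbolic states explored")
```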

Journal ArticleDOI
TL;DR: This paper first establishes the necessary and sufficient condition for deadlock-free routing, based on the analysis of the message flow on each channel, and uses the model to develop new adaptive routing algorithms for 2D meshes.
Abstract: In this paper, we introduce a new approach to deadlock-free routing in wormhole-routed networks called the message flow model. This method may be used to develop deterministic, partially-adaptive, and fully-adaptive routing algorithms for wormhole-routed networks with arbitrary topologies. We first establish the necessary and sufficient condition for deadlock-free routing, based on the analysis of the message flow on each channel. We then use the model to develop new adaptive routing algorithms for 2D meshes.

Journal ArticleDOI
TL;DR: It is shown that the HFN can emulate algorithms which are executable on the ring or the mesh-connected computer with the same time complexities in big-O notation, and can embed a folded hypercube having the same number of nodes with constant dilation.
Abstract: In this paper, a new two-level interconnection network, called a hierarchical folded-hypercube network (HFN, for short), is proposed. The HFN takes folded hypercubes as basic modules which are connected in a complete manner. We investigate the topological properties of the HFN, including the diameter, cost, average distance, embedding, connectivity, container, κ-wide diameter, and node-fault diameter. We show that the HFN can emulate algorithms which are executable on the ring or the mesh-connected computer with the same time complexities in big-O notation. Moreover, the HFN can embed a folded hypercube having the same number of nodes with constant dilation. We compute the diameter, node connectivity, best container, κ-wide diameter, and node-fault diameter of the HFN. We present optimal routing and broadcasting algorithms for the HFN. The semigroup computation and descend/ascend algorithms can be executed as well on the HFN.

Journal ArticleDOI
TL;DR: The algorithm accounts for machine resource constraints in a way that smoothly integrates their management with software pipelining; generality is not sacrificed to handle resource constraints, and scheduling choices are made with truly global information.
Abstract: This paper presents a software pipelining algorithm for the automatic extraction of fine-grain parallelism in general loops. The algorithm accounts for machine resource constraints in a way that smoothly integrates the management of resource constraints with software pipelining. Furthermore, generality in the software pipelining algorithm is not sacrificed to handle resource constraints, and scheduling choices are made with truly global information. Proofs of correctness and the results of experiments with an implementation are also presented.

Journal ArticleDOI
TL;DR: By using the approach of recovery line transformation and decomposition, this paper develops an optimal checkpoint space reclamation algorithm and shows that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints, where N is the number of processes.
Abstract: Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Previous to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints, where N is the number of processes.

Journal ArticleDOI
TL;DR: This paper presents a technique that minimizes the amount of data exchange for BLOCK to CYCLIC(c) (or vice versa) redistributions over an arbitrary number of dimensions while preserving the semantics of the target (destination) distribution pattern.
Abstract: Run-time data redistribution can enhance algorithm performance in distributed-memory machines. Explicit redistribution of data can be performed between algorithm phases when a different data decomposition is expected to deliver increased performance for a subsequent phase of computation. Redistribution, however, represents increased program overhead as algorithm computation is discontinued while data are exchanged among processor memories. In this paper, we present a technique that minimizes the amount of data exchange for BLOCK to CYCLIC(c) (or vice versa) redistributions over an arbitrary number of dimensions. Preserving the semantics of the target (destination) distribution pattern, the technique manipulates the data-to-logical-processor mapping of the target pattern. When implemented on an IBM SP, the mapping technique demonstrates redistribution performance improvements of approximately 40% over traditional data-to-processor mapping. Relative to the traditional mapping technique, the proposed method affords greater flexibility in specifying precisely which data elements are redistributed and which elements remain on-processor.
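
For a single dimension, the ownership rules of the two distribution patterns, and the transfer sets that a straightforward BLOCK-to-CYCLIC(c) redistribution implies, can be sketched as follows; the remapping optimization described above is not reproduced, and the array size, processor count, and block size are illustrative.

```python
# Sketch: element ownership under BLOCK and CYCLIC(c) distributions of a 1-D array
# over P processors, and the point-to-point transfer sets a naive BLOCK -> CYCLIC(c)
# redistribution implies.

from collections import defaultdict
from math import ceil

def block_owner(i, n, p):
    return i // ceil(n / p)            # BLOCK: contiguous chunks of ceil(n/p)

def cyclic_owner(i, c, p):
    return (i // c) % p                # CYCLIC(c): blocks of c dealt round-robin

def redistribution_sets(n, p, c):
    sends = defaultdict(list)          # (source proc, dest proc) -> element indices
    for i in range(n):
        src, dst = block_owner(i, n, p), cyclic_owner(i, c, p)
        if src != dst:
            sends[(src, dst)].append(i)
    return dict(sends)

for (src, dst), elems in sorted(redistribution_sets(n=16, p=4, c=2).items()):
    print(f"P{src} -> P{dst}: {elems}")
```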

Journal ArticleDOI
TL;DR: This work proposes an optimal and nonredundant distributed broadcasting algorithm in star graphs that takes O(n log₂ n) time and guarantees that all nodes in the star graph receive the message exactly once.
Abstract: Based on V.E. Mendia and D. Sarkar's algorithm (1992), we propose an optimal and nonredundant distributed broadcasting algorithm in star graphs. For an n-dimensional star graph, our algorithm takes O(n log₂ n) time and guarantees that all nodes in the star graph receive the message exactly once. Moreover, broadcasting m packets in a pipeline fashion takes O(m log₂ n + n log₂ n) time due to the nonredundant property of our broadcasting algorithm.

Journal ArticleDOI
TL;DR: Analysis of the performance and scalability of an iteration of the preconditioned conjugate gradient algorithm on parallel architectures with a variety of interconnection networks shows that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors.
Abstract: This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to further improvement in scalability on the CM-5 by a constant factor; as a result, a parallel formulation of the PCG algorithm with IC preconditioner may execute faster than that with a simple diagonal preconditioner even if the latter runs faster in a serial implementation.
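
To make the communication structure concrete, here is a minimal single-process PCG loop with the operations that become global reductions in a parallel formulation marked in comments; this is a generic sketch with a diagonal preconditioner and a tiny invented system, not the paper's parallel formulation.

```python
# Sketch of the preconditioned conjugate gradient iteration (NumPy, single process).
# The inner products marked below are the operations that turn into all-to-all
# reductions in a parallel formulation -- the overhead the paper identifies as
# dominant on mesh and hypercube networks.

import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    x = np.zeros_like(b)
    r = b - A @ x                      # local sparse mat-vec + boundary exchange
    z = M_inv @ r                      # apply preconditioner
    p = z.copy()
    rz = r @ z                         # inner product #1: global reduction
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)          # inner product #2: global reduction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:    # convergence test (another reduction)
            break
        z = M_inv @ r
        rz_new = r @ z                 # reused in the next iteration
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg(A, b, np.diag(1.0 / np.diag(A))))   # approx [0.0909, 0.6364]
```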

Journal ArticleDOI
TL;DR: A minimum-time multicast algorithm is presented for n-dimensional torus networks that use deterministic, dimension-ordered routing of unicasts, and can deliver a multicast message to m-1 destinations in ⌈log₂ m⌉ message-passing steps, while avoiding contention among the constituent unicast messages.
Abstract: This paper presents efficient algorithms that implement one-to-many, or multicast, communication in wormhole-routed torus networks. By exploiting the properties of the switching technology and the use of virtual channels, a minimum-time multicast algorithm is presented for n-dimensional torus networks that use deterministic, dimension-ordered routing of unicast messages. The algorithm can deliver a multicast message to m-1 destinations in ⌈log₂ m⌉ message-passing steps, while avoiding contention among the constituent unicast messages. Performance results of a simulation study on torus networks with up to 4096 nodes are also given.
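
The ⌈log₂ m⌉ step count comes from a recursive-doubling pattern in which every node that already holds the message forwards it to one new destination per step; a generic sketch is below. The destination ordering the paper uses to keep the dimension-ordered torus unicasts contention-free is not reproduced, and the node ids are invented.

```python
# Sketch: recursive doubling delivers a multicast to m - 1 destinations in
# ceil(log2 m) unicast steps (each holder of the message informs one new node).

from math import ceil, log2

def multicast_rounds(source, destinations):
    holders = [source]                 # nodes that currently have the message
    pending = list(destinations)
    rounds = []
    while pending:
        step = []
        for h in list(holders):        # only nodes informed before this round send
            if not pending:
                break
            nxt = pending.pop(0)
            step.append((h, nxt))
            holders.append(nxt)
        rounds.append(step)
    return rounds

dests = [3, 5, 6, 9, 10, 12, 15]       # hypothetical node ids; m = 8 with the source
rounds = multicast_rounds(0, dests)
print(len(rounds), "steps, bound =", ceil(log2(len(dests) + 1)))
for r in rounds:
    print(r)
```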

Journal ArticleDOI
TL;DR: The theoretical background for the design of deadlock-free adaptive multicast routing algorithms for wormhole networks is developed, including conditions to verify that an adaptive multicast routing algorithm is deadlock-free even when there are cyclic dependencies between channels.
Abstract: A theory for the design of deadlock-free adaptive routing algorithms for wormhole networks, proposed by the author (1991, 1993), supplies sufficient conditions for an adaptive routing algorithm to be deadlock-free, even when there are cyclic dependencies between channels. Also, two design methodologies were proposed. Multicast communication refers to the delivery of the same message from one source node to an arbitrary number of destination nodes. A tree-like routing scheme is not suitable for hardware-supported multicast in wormhole networks because it produces many headers for each message, drastically increasing the probability of a message being blocked. A path-based multicast routing model was proposed by Lin and Ni (1991) for multicomputers with 2D-mesh and hypercube topologies. In this model, messages are not replicated at intermediate nodes. This paper develops the theoretical background for the design of deadlock-free adaptive multicast routing algorithms. This theory is valid for wormhole networks using the path-based routing model. It is also valid when messages with a single destination and multiple destinations are mixed together. The new channel dependencies produced by messages with several destinations are studied. Also, two theorems are proposed, developing conditions to verify that an adaptive multicast routing algorithm is deadlock-free, even when there are cyclic dependencies between channels. As an example, the multicast routing algorithms of Lin and Ni are extended, so that they can take advantage of the alternative paths offered by the network.

Journal ArticleDOI
TL;DR: New upper bounds on the evacuation time of batch admissions are derived, as are bounds on worst-case transit delay for certain networks admitting packets continuously.
Abstract: We consider the problem of finding the worst-case packet transit delay in networks using deflection routing. Several classes of networks are studied, including many topologies for which deflection routing has been proposed or implemented (e.g., hypercube, Manhattan Street Network, shuffle-exchange network). We derive new upper bounds on the evacuation time of batch admissions, and present simple proofs for several existing bounds. Also derived are bounds on worst-case transit delay for certain networks admitting packets continuously. To demonstrate the practical utility of our results, we compare a new delay bound to the maximum delay observed in simulations. The results have application in both protocol design and the determination of the required capacity of packet resequencing buffers.

Journal ArticleDOI
TL;DR: Two new ideas by which a High Performance Fortran compiler can deal with irregular computations effectively are described, and performance results for these mechanisms from a Fortran 90D compiler implementation are presented.
Abstract: This paper describes two new ideas by which a High Performance Fortran compiler can deal with irregular computations effectively. The first mechanism invokes a user-specified mapping procedure via a set of proposed compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a conservative method for compiling irregular loops in which dependence arises only due to reduction operations. This mechanism in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g., communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). This paper also presents performance results for these mechanisms from a Fortran 90D compiler implementation.
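
The schedule-reuse idea rests on the inspector/executor pattern; a minimal sketch follows. The ownership function, ghost-buffer layout, and function names are invented placeholders, not the compiler's actual runtime library.

```python
# Sketch of the inspector/executor pattern for an irregular loop: the inspector
# examines the indirection array once and builds the communication "schedule"
# (which off-processor elements to fetch); the executor reuses that schedule on
# every time step as long as the indirection array is unchanged.

def inspector(indices, owner, my_rank):
    """Return the communication schedule: remote elements this rank needs."""
    needed = sorted({i for i in indices if owner(i) != my_rank})
    ghost_slot = {i: k for k, i in enumerate(needed)}     # off-processor buffer layout
    return needed, ghost_slot

def executor(x_local, ghosts, indices, owner, my_rank, ghost_slot, local_index):
    """One sweep of y[j] += x[indices[j]], reading remote values from the ghost buffer."""
    y = [0.0] * len(indices)
    for j, i in enumerate(indices):
        if owner(i) == my_rank:
            y[j] += x_local[local_index(i)]
        else:
            y[j] += ghosts[ghost_slot[i]]
    return y

# Hypothetical block distribution of 16 elements over 4 ranks; rank 1 owns 4..7
owner = lambda i: i // 4
local_index = lambda i: i % 4
indices = [4, 7, 2, 11, 5, 2]                 # the loop's indirection array
needed, slot = inspector(indices, owner, my_rank=1)
print("fetch from remote ranks:", needed)     # [2, 11] -> one gather, reused each step
ghosts = [2.0, 11.0]                          # values of elements 2 and 11 after the gather
print(executor([4.0, 5.0, 6.0, 7.0], ghosts, indices, owner, 1, slot, local_index))
```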

Journal ArticleDOI
TL;DR: A method to construct such a matrix is proposed, and it is shown that any matrix constructed by the proposed method can be mapped into a solution to the placement problem if a certain condition holds between N and p.
Abstract: In this paper, we deal with the data/parity placement problem, which is described as follows: how to place data and parity evenly across disks in order to tolerate two disk failures, given the number of disks N and the redundancy rate p, which represents the amount of disk space used to store parity information. To begin with, we transform the data/parity placement problem into the problem of constructing an N×N matrix such that the matrix will correspond to a solution to the problem. A method to construct such a matrix is proposed, and we show how it works through several illustrative examples. It is also shown that any matrix constructed by the proposed method can be mapped into a solution to the placement problem if a certain condition holds between N and p.

Journal ArticleDOI
TL;DR: An optimal algorithm that broadcasts on an n-dimensional hypercube in O(n/log₂(n+1)) routing steps with wormhole, e-cube routing, and all-port communication is given.
Abstract: We give an optimal algorithm that broadcasts on an n-dimensional hypercube in O(n/log₂(n+1)) routing steps with wormhole, e-cube routing and all-port communication. Previously, the best algorithm, by P.K. McKinley and C. Trefftz (1993), requires ⌈n/2⌉ routing steps. We also give routing algorithms that achieve tight time bounds for n ≤ 7.
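
A short sanity check on why n/log₂(n+1) is the natural target with all-port communication, and how it compares with the earlier ⌈n/2⌉-step algorithm; reading both bounds as step counts is our interpretation of the O(·) notation above.

```python
# With all-port communication, one routing step can grow the set of informed nodes
# by at most a factor of n + 1 (each informed node feeds its n neighbors), so
# covering all 2^n nodes needs at least log_{n+1}(2^n) = n / log2(n + 1) steps.

from math import ceil, log2

for n in (4, 6, 8, 10, 12, 16):
    lower = ceil(n / log2(n + 1))
    print(f"n={n:2d}  n/log2(n+1) -> {lower} steps   ceil(n/2) -> {ceil(n / 2)} steps")
```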

Journal ArticleDOI
TL;DR: A new variant of the scheduling problem is attempted by investigating the scalability of the schedule length with the required number of processors, performing scheduling partially at compile time and partially at run time using a new concept of the threshold of a task.
Abstract: We attempt a new variant of the scheduling problem by investigating the scalability of the schedule length with the required number of processors, by performing scheduling partially at compile time and partially at run time. Assuming an infinite number of processors, the compile-time schedule is found using a new concept of the threshold of a task, which quantifies a trade-off between the schedule length and the degree of parallelism. The schedule is found to minimize either the schedule length or the number of required processors, and it satisfies a feasibility condition, which guarantees that the schedule delay of a task from its earliest start time is below the threshold, and an optimality condition, which uses a merit function to decide the best task-processor match for a set of tasks competing for a given processor. At run time, the tasks are merged, producing a schedule for a smaller number of available processors. This allows the program to be scaled down to the processors actually available at run time. The usefulness of this scheduling heuristic has been demonstrated by incorporating the scheduler in the compiler backend for targeting Sisal (Streams and Iterations in a Single Assignment Language) on the iPSC/860.

Journal ArticleDOI
TL;DR: This paper analyzes some general properties of product networks that are pertinent to parallel architectures and then focuses on three case studies: products of complete binary trees, shuffle-exchange networks, and de Bruijn networks, all of which are powerful architectures for parallel computation.
Abstract: This paper analyzes some general properties of product networks that are pertinent to parallel architectures and then focuses on three case studies: products of complete binary trees, shuffle-exchange networks, and de Bruijn networks. It is shown that all of these are powerful architectures for parallel computation, as evidenced by their ability to efficiently emulate numerous other architectures. In particular, r-dimensional grids and r-dimensional meshes of trees can be embedded efficiently in products of these graphs, i.e., either as a subgraph or with small constant dilation and congestion. In addition, the shuffle-exchange network can be embedded in an r-dimensional product of shuffle-exchange networks with dilation cost 2r and congestion cost 2. Similarly, the de Bruijn network can be embedded in an r-dimensional product of de Bruijn networks with dilation cost r and congestion cost 4. Moreover, it is well known that shuffle-exchange and de Bruijn graphs can emulate the hypercube with a small constant slowdown for "normal" algorithms. This means that their product versions can also emulate these hypercube algorithms with constant slowdown. Conclusions include a discussion of many open research areas.
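
The product construction itself is simple enough to state in a few lines; the following sketch builds the Cartesian product of two small example graphs (a 3-node path with itself, giving a 3×3 mesh) just to make the definition concrete.

```python
# Sketch: the graph (Cartesian) product construction underlying product networks.
# Nodes of G x H are pairs (u, v); (u1, v1) is adjacent to (u2, v2) when the pair
# differs in exactly one coordinate and is adjacent there. The r-dimensional
# product is just repeated application.

from itertools import product as cartesian

def graph_product(G, H):
    """G, H: dict node -> set of neighbours. Returns the product graph."""
    P = {(u, v): set() for u, v in cartesian(G, H)}
    for (u, v) in P:
        for u2 in G[u]:
            P[(u, v)].add((u2, v))     # step in the first coordinate
        for v2 in H[v]:
            P[(u, v)].add((u, v2))     # step in the second coordinate
    return P

path3 = {0: {1}, 1: {0, 2}, 2: {1}}    # a 3-node path
grid = graph_product(path3, path3)     # 3 x 3 mesh
print(len(grid), "nodes,", sum(len(n) for n in grid.values()) // 2, "edges")  # 9, 12
```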

Journal ArticleDOI
TL;DR: A multicast mechanism using propagation trees is presented that guarantees the total ordering (including causal ordering) of messages in multiple groups; it introduces a concept of meta-groups (subsets of a multicast group) and organizes meta-groups into propagation trees.
Abstract: The paper discusses a multicast mechanism using propagation trees. It guarantees the total ordering (including causal ordering) of messages in multiple groups. The mechanism introduces a concept of meta-groups (a subset of a multicast group) and organizes meta-groups into propagation trees. Compared with the existing propagation tree mechanisms, this mechanism has the following advantages: 1) Greater parallelism. Messages can be sent to destinations by using broadcast networks. 2) Lower message cost and lower latency. It takes less network communication to multicast a message and less time to have the message delivered to all the destinations. 3) More flexibility with respect to dynamic membership changes and higher reliability for message propagation. It does not need to restructure propagation trees when there is a change in membership, and a site failure does not stop the message propagation to its descendants in the tree.