
Showing papers in "IEEE Transactions on Parallel and Distributed Systems in 1993"


Journal ArticleDOI
TL;DR: The authors present a compile-time scheduling heuristic called dynamic level scheduling, which accounts for interprocessor communication overhead when mapping precedence-constrained, communicating tasks onto heterogeneous processor architectures with limited or possibly irregular interconnection structures.
Abstract: The authors present a compile-time scheduling heuristic called dynamic level scheduling, which accounts for interprocessor communication overhead when mapping precedence-constrained, communicating tasks onto heterogeneous processor architectures with limited or possibly irregular interconnection structures. This technique uses dynamically changing priorities to match tasks with processors at each step, and schedules over both spatial and temporal dimensions to eliminate shared resource contention. This method is fast, flexible, widely targetable, and displays promising performance.

905 citations


Journal ArticleDOI
TL;DR: The theoretical background for the design of deadlock-free adaptive routing algorithms for wormhole networks is developed. Basic definitions and two theorems are proposed, which establish conditions for verifying that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph.
Abstract: The theoretical background for the design of deadlock-free adaptive routing algorithms for wormhole networks is developed. The author proposes some basic definitions and two theorems. These create the conditions to verify that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph. Two design methodologies are also proposed. The first supplies algorithms with a high degree of freedom, without increasing the number of physical channels. The second methodology is intended for the design of fault-tolerant algorithms. Some examples are given to show the application of the methodologies. Simulations show the performance improvement that can be achieved by designing the routing algorithms with the new theory. >

831 citations


Journal ArticleDOI
TL;DR: Two deadlock-free adaptive routing algorithms are described that allocate virtual channels using a count of the number of dimension reversals a packet has performed to eliminate cycles in resource dependency graphs and improve virtual channel utilization.
Abstract: The use of adaptive routing in a multicomputer interconnection network improves network performance by using all available paths and provides fault tolerance by allowing messages to be routed around failed channels and nodes. Two deadlock-free adaptive routing algorithms are described. Both algorithms allocate virtual channels using a count of the number of dimension reversals a packet has performed to eliminate cycles in resource dependency graphs. The static algorithm eliminates cycles in the network channel dependency graph. The dynamic algorithm improves virtual channel utilization by permitting dependency cycles and instead eliminating cycles in the packet wait-for graph. It is proved that these algorithms are deadlock-free. Experimental measurements of their performance are presented. >

574 citations


Journal ArticleDOI
TL;DR: Five DLB strategies are presented which illustrate the tradeoff between knowledge - the accuracy of each balancing decision, and overhead - the amount of added processing and communication incurred by the balancing process.
Abstract: Dynamic load balancing strategies for minimizing the execution time of single applications running in parallel on multicomputer systems are discussed. Dynamic load balancing (DLB) is essential for the efficient use of highly parallel systems when solving non-uniform problems with unpredictable load estimates. With the evolution of more highly parallel systems, centralized DLB approaches which make use of a high degree of knowledge become less feasible due to the load balancing communication overhead. Five DLB strategies are presented which illustrate the tradeoff between 1) knowledge - the accuracy of each balancing decision, and 2) overhead - the amount of added processing and communication incurred by the balancing process. All five strategies have been implemented on an Intel iPSC/2 hypercube.

564 citations


Journal ArticleDOI
TL;DR: It is proved that every nonlinear clustering of a coarse grain DAG can be transformed into a linear clustering whose parallel time is less than or equal to that of the nonlinear one.
Abstract: The authors consider the impact of the granularity on scheduling task graphs. Scheduling consists of two parts: the assignment of tasks to processors, also called clustering, and the ordering of tasks for execution in each processor. The authors introduce two types of clusterings: nonlinear and linear. A clustering is nonlinear if two parallel tasks are mapped in the same cluster; otherwise it is linear. Linear clustering fully exploits the natural parallelism of a given directed acyclic task graph (DAG), while nonlinear clustering sequentializes independent tasks to reduce parallelism. The authors also introduce a new quantification of the granularity of a DAG and define a coarse grain DAG as one whose granularity is greater than one. It is proved that every nonlinear clustering of a coarse grain DAG can be transformed into a linear clustering whose parallel time is less than or equal to that of the nonlinear one. This result is used to prove the optimality of some important linear clusterings used in parallel numerical computing.

302 citations


Journal ArticleDOI
TL;DR: Experiments conducted on a 96-node Butterfly GP-1000 clearly show the advantage of trapezoid self-scheduling over other well-known self-scheduling approaches.
Abstract: A practical processor self-scheduling scheme, trapezoid self-scheduling, is proposed for arbitrary parallel nested loops in shared-memory multiprocessors. Generally, loops are the richest source of parallelism in parallel programs. By dynamically allocating loop iterations to processors, one may achieve load balancing among processors at the expense of run-time scheduling overhead. By linearly decreasing the chunk size at run time, the best tradeoff between the scheduling overhead and balanced workload can be obtained in the proposed trapezoid self-scheduling approach. Due to its simplicity and flexibility, this approach can be efficiently implemented in any parallel compiler. The small and predictable number of chores also allows efficient management of memory in a static fashion. The experiments conducted on a 96-node Butterfly GP-1000 clearly show the advantage of trapezoid self-scheduling over other well-known self-scheduling approaches.
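The linearly decreasing chunk size described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `trapezoid_chunks`, the default first chunk of roughly N/(2P) iterations, and the default last chunk of 1 are assumptions chosen for the example.

```python
import math


def trapezoid_chunks(n_iters, n_procs, first=None, last=1):
    """Return the chunk sizes handed out by a trapezoid-style schedule.

    Assumed parameters (hypothetical, for illustration only):
      first -- size of the first chunk, defaulting to ceil(N / (2P))
      last  -- size the chunks shrink toward, defaulting to 1
    """
    if n_iters <= 0:
        return []
    last = max(1, last)  # guarantee progress on every request
    if first is None:
        first = max(last, math.ceil(n_iters / (2 * n_procs)))

    # Number of chunks if sizes fall linearly from `first` to `last`.
    n_chunks = math.ceil(2 * n_iters / (first + last))
    # Constant decrement applied after every chunk is handed out.
    step = (first - last) / (n_chunks - 1) if n_chunks > 1 else 0.0

    chunks = []
    remaining = n_iters
    size = float(first)
    while remaining > 0:
        # Never go below `last`, never hand out more than what is left.
        take = min(remaining, max(last, round(size)))
        chunks.append(take)
        remaining -= take
        size -= step
    return chunks


if __name__ == "__main__":
    schedule = trapezoid_chunks(1000, 4)
    print(len(schedule), "chunks:", schedule)
```

Each idle processor would take the next entry of the returned list, so early requests grab large chunks (low scheduling overhead) and late requests grab small ones (better balance at the end of the loop).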

279 citations


Journal ArticleDOI
TL;DR: Two efficient arbiter implementations are proposed that achieve nearly optimal system performance without becoming the critical path that limits the system clock.
Abstract: The design and implementation of symmetric crossbar arbiters are addressed. Several arbiter designs are compared based on simulations of a multistage interconnection network. These simulations demonstrate the influence of the switch arbitration policy on network throughput, average latency, and worst-case latency. It is shown that some natural designs result in poor system performance and/or slow implementations. Two efficient arbiter implementations are proposed. Based on network simulations, VLSI implementation, and circuit simulation, it is shown that these arbiters achieve nearly optimal system performance without becoming the critical path that limits the system clock.

267 citations


Journal ArticleDOI
TL;DR: A novel interconnection topology called the Fibonacci cube is shown to possess attractive recurrent structures in spite of its asymmetric and relatively sparse interconnections.
Abstract: A novel interconnection topology called the Fibonacci cube is shown to possess attractive recurrent structures in spite of its asymmetric and relatively sparse interconnections. Since it can be embedded as a subgraph in the Boolean cube (hypercube) and it is also a supergraph of other structures, the Fibonacci cube may find applications in fault-tolerant computing. For a graph with N nodes, the diameter, the edge connectivity, and the node connectivity of the Fibonacci cube are of logarithmic order in N. It is also shown that common system communication primitives can be implemented efficiently.
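To make the hypercube-subgraph property concrete, the sketch below builds a small Fibonacci cube. It assumes the commonly used definition, which the abstract itself does not spell out: nodes are the n-bit strings with no two consecutive 1s, and two nodes are linked when they differ in exactly one bit. The function name `fibonacci_cube` is hypothetical.

```python
from itertools import product


def fibonacci_cube(n):
    """Build the order-n Fibonacci cube under the assumed definition.

    Nodes: n-bit strings that do not contain "11".
    Edges: pairs of nodes at Hamming distance 1 (so every edge is
    also an edge of the n-dimensional hypercube).
    """
    nodes = []
    for bits in product("01", repeat=n):
        label = "".join(bits)
        if "11" not in label:
            nodes.append(label)

    node_set = set(nodes)
    edges = set()
    for u in nodes:
        for i in range(n):
            flipped = "1" if u[i] == "0" else "0"
            v = u[:i] + flipped + u[i + 1:]
            if v in node_set:
                # Store each undirected edge once, in sorted order.
                edges.add((min(u, v), max(u, v)))
    return nodes, sorted(edges)


if __name__ == "__main__":
    nodes, edges = fibonacci_cube(3)
    print("nodes:", nodes)
    print("edges:", edges)
```

Under this definition the node counts follow the Fibonacci sequence (5 nodes for 3-bit labels, 8 for 4-bit labels), which is where the name comes from.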

251 citations


Journal ArticleDOI
TL;DR: A model that unifies the models of weak ordering, release consistency, the VAX, and data-race-free-0 by formalizing the intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance in a manner that retains the advantages of each of the four models.
Abstract: The authors present data-race-free-1, a shared-memory model that unifies four earlier models: weak ordering, release consistency (with sequentially consistent special operations), the VAX memory model, and data-race-free-0. Data-race-free-1 unifies the models of weak ordering, release consistency, the VAX, and data-race-free-0 by formalizing the intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance in a manner that retains the advantages of each of the four models. Data-race-free-1 expresses the programmer's interface more explicitly and formally than weak ordering and the VAX, and allows an implementation not allowed by weak ordering, release consistency, or data-race-free-0. The implementation proposal for data-race-free-1 differs from earlier implementations by permitting the execution of all synchronization operations of a processor even while previous data operations of the processor are in progress. To ensure sequential consistency, two synchronizing processors exchange information to delay later operations of the second processor that conflict with an incomplete data operation of the first processor.

222 citations


Journal ArticleDOI
TL;DR: The importance of having a policy that adapts its behavior to changes in system load is demonstrated and the effects of an initial burst of cache misses experienced by tasks when they return to a processor for execution are demonstrated.
Abstract: In a shared-memory multiprocessor system, it may be more efficient to schedule a task on one processor than on another if relevant data already reside in a particular processor's cache. The effects of this type of processor affinity are examined. It is observed that tasks continuously alternate between executing at a processor and releasing this processor due to I/O, synchronization, quantum expiration, or preemption. Queuing network models of different abstract scheduling policies are formulated, spanning the range from ignoring affinity to fixing tasks on processors. These models are solved via mean value analysis, where possible, and by simulation otherwise. An analytic cache model is developed and used in these scheduling models to include the effects of an initial burst of cache misses experienced by tasks when they return to a processor for execution. A mean-value technique is also developed and used in the scheduling models to include the effects of increased bus traffic due to these bursts of cache misses. Only a small amount of affinity information needs to be maintained for each task. The importance of having a policy that adapts its behavior to changes in system load is demonstrated. >

194 citations


Journal ArticleDOI
TL;DR: The authors study routing evaluation criteria for multicast communication under different switching technologies and show that all these optimization problems are NP-complete for the popular 2D-mesh and hypercube host graphs.
Abstract: Efficient routing of messages is a key to the performance of multicomputers. Multicast communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. While multicast communication is highly demanded in many applications, most of the existing multicomputers do not directly support this service; rather it is indirectly supported by multiple one-to-one or broadcast communications, which result in more network traffic and a waste of system resources. The authors study routing evaluation criteria for multicast communication under different switching technologies. Multicast communication in multicomputers is formulated as a graph theoretical problem. Depending on the evaluation criteria and switching technologies, they study three optimal multicast communication problems, which are equivalent to the finding of the following three subgraphs: optimal multicast path, optimal multicast cycle, and minimal Steiner tree, where the interconnection of a multicomputer defines a host graph. They show that all these optimization problems are NP-complete for the popular 2D-mesh and hypercube host graphs. Heuristic multicast algorithms for these routing problems are proposed. >

Journal ArticleDOI
TL;DR: The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence; the measured hardware overhead appears to be a small cost for the ease of programming offered by coherent caches and the potential for higher performance.
Abstract: The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. The hardware overhead of directory-based cache coherence in a 48-processor prototype is examined. The data show that the overhead is only about 10-15%, which appears to be a small cost for the ease of programming offered by coherent caches and the potential for higher performance. The performance of the system is discussed, and the speedups obtained by a variety of parallel applications running on the prototype are shown. Using a sophisticated hardware performance monitor, the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup are characterized. The optimizations incorporated in the DASH protocol are evaluated in terms of their effectiveness on parallel applications and on atomic tests that stress the memory system.

Journal ArticleDOI
TL;DR: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric and show that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh.
Abstract: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh, despite the fact that large scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture.
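The definition of the isoefficiency function given in this abstract can be written out as a short derivation. The symbols below are assumptions introduced for illustration and do not appear in the abstract: $W$ is the problem size measured as serial work, $p$ the number of processors, $T_p$ the parallel run time, and $T_o(W, p)$ the total overhead summed over all processors.

```latex
% Total processor-time is useful work plus overhead.
p \, T_p = W + T_o(W, p)

% Efficiency is the fraction of processor-time spent on useful work.
E = \frac{W}{p \, T_p} = \frac{W}{W + T_o(W, p)} = \frac{1}{1 + T_o(W, p)/W}

% Holding E fixed and solving for W gives the isoefficiency relation.
W = \frac{E}{1 - E} \, T_o(W, p)
```

Solving the last relation for $W$ as a function of $p$ gives the growth rate the abstract refers to: the slower $W$ has to grow to keep $E$ constant, the more scalable the algorithm-architecture combination.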

Journal ArticleDOI
Avraham Leff, Joel L. Wolf, Philip S. Yu
TL;DR: Performance of the distributed algorithms is found to be close to optimal, while that of the greedy algorithms is far from optimal.
Abstract: Studies the cache performance in a remote caching architecture. The authors develop a set of distributed object replication policies that are designed to implement different optimization goals. Each site is responsible for local cache decisions, and modifies cache contents in response to decisions made by other sites. The authors use the optimal and greedy policies as upper and lower bounds, respectively, for performance in this environment. Critical system parameters are identified, and their effect on system performance studied. Performance of the distributed algorithms is found to be close to optimal, while that of the greedy algorithms is far from optimal. >

Journal ArticleDOI
J.L. Kim, T. Park
TL;DR: In this protocol, a process takes a checkpoint when it knows that all processes on which it computationally depends took their checkpoints, hence the process need not always wait for the decision made by the checkpointing coordinator as in the conventional synchronized protocols.
Abstract: The authors present an efficient synchronized checkpointing protocol that exploits the dependency relation between processes in distributed systems. In this protocol, a process takes a checkpoint when it knows that all processes on which it computationally depends took their checkpoints, hence the process need not always wait for the decision made by the checkpointing coordinator as in the conventional synchronized protocols. As a result, the checkpointing coordination time is substantially reduced and the possibility of total abort of the checkpointing coordination is reduced.

Journal ArticleDOI
TL;DR: A load balancing algorithm for a discrete event simulation executed under Time Warp is presented and results indicate that substantial performance gains may be realized with the algorithm.
Abstract: A load balancing algorithm for a discrete event simulation executed under Time Warp is presented. The algorithm rests upon recent developments in active process migration, which permit the use of dynamic strategies. Dynamic load balancing allows for readjustments when resource requirements vary during simulation. It is also useful when initial resource predictions are unknown or incorrect. A simulated multiprocessor environment (PARALLEX) was developed in order to evaluate the algorithm. The results indicate that substantial performance gains may be realized with the algorithm.

Journal ArticleDOI
TL;DR: MULTIFIT-COM, a static task allocator which could be incorporated into an automated compiler/linker/loader for distributed processing systems, is presented and applied to a large system to illustrate its clustering and load balancing properties.
Abstract: MULTIFIT-COM, a static task allocator which could be incorporated into an automated compiler/linker/loader for distributed processing systems, is presented. The allocator uses performance information for the processes making up the system in order to determine an appropriate mapping of tasks onto processors. It uses several heuristic extensions of the MULTIFIT bin-packing algorithm to find an allocation that will offer a high system throughput, taking into account the expected execution and interprocessor communication requirements of the software on the given hardware architecture. Throughput is evaluated by an asymptotic bound for saturated conditions and under an assumption that only processing resources are required. A set of options is proposed for each of the allocator's major steps. An evaluation was made on 680 small randomly generated examples. Using all the search options, an average performance difference of just over 1% was obtained. Using a carefully chosen small subset of only four options, a further degradation of just over 1.5% was obtained. The allocator is also applied to a digital signal processing system consisting of 119 tasks to illustrate its clustering and load balancing properties on a large system.

Journal ArticleDOI
TL;DR: A generalized mapping strategy is proposed that uses a combination of graph theory, mathematical programming, and heuristics, guided by knowledge of the given parallel algorithm and the architecture.
Abstract: A generalized mapping strategy that uses a combination of graph theory, mathematical programming, and heuristics is proposed. The authors use the knowledge from the given algorithm and the architecture to guide the mapping. The approach begins with a graphical representation of the parallel algorithm (problem graph) and the parallel computer (host graph). Using these representations, the authors generate a new graphical representation (extended host graph) on which the problem graph is mapped. An accurate characterization of the communication overhead is used in the objective functions to evaluate the optimality of the mapping. An efficient mapping scheme is developed which uses two levels of optimization procedures. The objective functions include minimizing the communication overhead and minimizing the total execution time which includes both computation and communication times. The mapping scheme is tested by simulation and further confirmed by mapping a real world application onto actual distributed environments. >

Journal ArticleDOI
TL;DR: It is proved that any function representing an ND coterie can be decomposed into copies of the three-majority function, and this decomposition is representable as a binary tree.
Abstract: A coterie under a ground set U consists of subsets (called quorums) of U such that any pair of quorums intersect with each other. Nondominated (ND) coteries are of particular interest, since they are optimal in some sense. By assigning a Boolean variable to each element in U, a family of subsets of U is represented by a Boolean function of these variables. The authors characterize the ND coteries as exactly those families which can be represented by positive, self-dual functions. In this Boolean framework, it is proved that any function representing an ND coterie can be decomposed into copies of the three-majority function, and this decomposition is representable as a binary tree. It is also shown that the class of ND coteries proposed by D. Agrawal and A. El Abbadi (1989) is related to a special case of the above binary decomposition, and that the composition proposed by M.L. Neilsen and M. Mizuno (1992) is closely related to the classical Ashenhurst decomposition of Boolean functions. A number of other results are also obtained. The compactness of the proofs of most of these results indicates the suitability of Boolean algebra for the analysis of coteries. >
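The Boolean characterization in this abstract can be checked by brute force on small examples. The sketch below is an illustration under stated assumptions, not the authors' method: the helper names are hypothetical, and the minimality condition (no quorum contains another) is the usual extra requirement on coteries, which the abstract does not state explicitly.

```python
from itertools import combinations, product


def is_coterie(quorums):
    """Pairwise-intersecting quorums, with no quorum containing another
    (minimality is an assumed, conventional extra condition)."""
    qs = [frozenset(q) for q in quorums]
    intersecting = all(a & b for a, b in combinations(qs, 2))
    minimal = not any(a < b for a in qs for b in qs)
    return intersecting and minimal


def quorum_function(quorums, assignment):
    """Positive Boolean function of a family: true when the set of
    elements assigned True contains at least one quorum."""
    up = {u for u, bit in assignment.items() if bit}
    return any(frozenset(q) <= up for q in quorums)


def is_self_dual(quorums, universe):
    """Check f(x) == not f(not x) for every assignment x."""
    elems = sorted(universe)
    for bits in product([False, True], repeat=len(elems)):
        x = dict(zip(elems, bits))
        not_x = {u: not bit for u, bit in x.items()}
        if quorum_function(quorums, x) == quorum_function(quorums, not_x):
            return False
    return True


if __name__ == "__main__":
    majority3 = [{"a", "b"}, {"b", "c"}, {"a", "c"}]
    print(is_coterie(majority3), is_self_dual(majority3, "abc"))
```

The three-majority coterie passes both checks, matching the abstract's claim that ND coteries correspond to positive, self-dual functions, while the single quorum `{a, b}` over the ground set `{a, b}` is a coterie whose function is not self-dual.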

Journal ArticleDOI
TL;DR: In this article, the authors present dynamic load-sharing heuristics that use predicted resource requirements of processes to manage workloads in a distributed system using a previously developed statistical pattern-recognition method.
Abstract: Presents dynamic load-sharing heuristics that use predicted resource requirements of processes to manage workloads in a distributed system. A previously developed statistical pattern-recognition method is employed for resource prediction. While nonprediction-based heuristics depend on a rapidly changing system status, the new heuristics depend on slowly changing program resource usage patterns. Furthermore, prediction-based heuristics can be more effective since they use future requirements rather than just the current system state. Four prediction-based heuristics, two centralized and two distributed, are presented. Using trace driven simulations, they are compared against random scheduling and two effective nonprediction based heuristics. Results show that the prediction-based centralized heuristics achieve up to 30% better response times than the nonprediction centralized heuristic, and that the prediction-based distributed heuristics achieve up to 50% improvements relative to their nonpredictive counterpart. >

Journal ArticleDOI
TL;DR: Experimental results for many synthetic and practical problems run on various parallel machines are presented that validate the theoretical analysis, which shows that for simple backtracking the average speedup obtained is linear when the distribution of solutions is uniform and superlinear when the distribution of solutions is nonuniform.
Abstract: Analytical models and experimental results concerning the average case behavior of parallel backtracking are presented. Two types of backtrack search algorithms are considered: simple backtracking, which does not use heuristics to order and prune search, and heuristic backtracking, which does. Analytical models are used to compare the average number of nodes visited in sequential and parallel search for each case. For simple backtracking, it is shown that the average speedup obtained is linear when the distribution of solutions is uniform and superlinear when the distribution of solutions is nonuniform. For heuristic backtracking, the average speedup obtained is at least linear, and the speedup obtained on a subset of instances is superlinear. Experimental results for many synthetic and practical problems run on various parallel machines that validate the theoretical analysis are presented. >

Journal ArticleDOI
TL;DR: Mtool augments a program with low overhead instrumentation which perturbs the program's execution as little as possible while generating enough information to isolate memory and synchronization bottlenecks.
Abstract: The authors describe Mtool, a software tool for analyzing performance losses in shared memory parallel programs. Mtool augments a program with low overhead instrumentation which perturbs the program's execution as little as possible while generating enough information to isolate memory and synchronization bottlenecks. After running the instrumented version of the parallel program, the programmer can use Mtool's window-based user interface to view compute time, memory, and synchronization objects. The authors describe Mtool's low overhead instrumentation methods, memory bottleneck detection technique, and attention focusing mechanisms, contrast Mtool with other approaches, and offer a case study to demonstrate its effectiveness. >

Journal ArticleDOI
TL;DR: In this article, the authors present dynamic resource reclaiming algorithms that are effective, avoid any run time anomalies, and have bounded overhead costs that are independent of the number of tasks in the schedule.
Abstract: Most real-time scheduling algorithms schedule tasks with regard to their worst case computation times. Resource reclaiming refers to the problem of utilizing the resources left unused by a task when it executes in less than its worst case computation time, or when a task is deleted from the current schedule. Dynamic resource reclaiming algorithms that are effective, avoid any run time anomalies, and have bounded overhead costs that are independent of the number of tasks in the schedule are presented. Each task is assumed to have a worst case computation time, a deadline, and a set of resource requirements. The algorithms utilize the information given in a multiprocessor task schedule and perform online local optimization. The effectiveness of the algorithms is demonstrated through simulation studies.

Journal ArticleDOI
TL;DR: Polymorphic processor arrays (PPAs), two-dimensional mesh-connected arrays of processors in which each processor is equipped with a switch able to interconnect its four NEWS ports, are discussed.
Abstract: Polymorphic processor arrays (PPAs), two-dimensional mesh-connected arrays of processors in which each processor is equipped with a switch able to interconnect its four NEWS ports, are discussed. The main features of PPA are that it models a realistic class of parallel computers, it supports the definition of high level programming models, it supports virtual parallelism, and it supports low complexity algorithms in a number of application fields. Both the PPA computation model and the PPA programming model are presented. It is shown that the PPA computation model is realistic by relating it to the design of the polymorphic torus (PT) chip. It is also shown that the PPA programming model is scalable by demonstrating that any algorithm having $O(p)$ complexity on a virtual PPA of size $\sqrt{m} \times \sqrt{m}$ has $O(kp)$ complexity on a PPA of size $\sqrt{n} \times \sqrt{n}$, with $m = kn$ and $k$ an integer. Some application algorithms in the area of numerical analysis and graph processing are presented.

Journal ArticleDOI
TL;DR: A class of hierarchical networks that is suitable for implementation of large multi-computers in VLSI with wafer scale integration (VLSI/WSI) technology is introduced. These networks employ the hypercube topology as a basic cluster, connect many of these clusters using a de Bruijn graph, and keep the node connectivity the same for all nodes.
Abstract: Introduces a class of hierarchical networks that is suitable for implementation of large multi-computers in VLSI with wafer scale integration (VLSI/WSI) technology. These networks, which are termed dBCube, employ the hypercube topology as a basic cluster, connect many of these clusters using a de Bruijn graph, and keep the node connectivity the same for all nodes. The size of this class of regular networks can be easily extended by increments of a cluster size. Local communication, to be satisfied by the hypercube topology, allows easy embedding of existing parallel algorithms, while the de Bruijn graph, which was chosen for JPL's 8096-node multiprocessor, provides the shortest distance between clusters running different parts of an application. A scheme for obtaining WSI layout is introduced and used to estimate the number of tracks needed and the required area of the wafer. The exact number of tracks in the hypercube and an approximation for the de Bruijn graph are also obtained. Tradeoffs of area versus static parameters and the size of the hypercube versus that of the de Bruijn graph are also discussed.

Journal ArticleDOI
G.C. Sih, Edward A. Lee
TL;DR: The authors present a new compile-time scheduling heuristic called declustering, which schedules acyclic precedence graphs that fit the synchronous data flow (SDF) model onto multiprocessor architectures so that shared resource contention can be avoided.
Abstract: The authors present a new compile-time scheduling heuristic called declustering, which schedules acyclic precedence graphs that fit the synchronous data flow (SDF) model onto multiprocessor architectures. This technique accounts for interprocessor communication (IPC) overheads and considers interconnection constraints in the architecture so that shared resource contention can be avoided. The algorithm initially invokes a new clustering method that uses graph-analysis techniques to isolate parallelism instances. When constructing an initial set of clusters, this procedure explicitly addresses the tradeoff between exploiting parallelism and incurring communication cost. By hierarchically combining these clusters and then systematically decomposing this hierarchy, the declustering method exposes parallelism instances in order of importance and attains a cluster granularity that fits the characteristics of the architecture. It is shown that declustering retains the clustering advantage of avoiding IPC, yet overcomes the inflexibility associated with traditional clustering approaches. >

Journal ArticleDOI
TL;DR: The architecture proposed in this paper is a combination of Hypercube and deBruijn architectures, providing some of the desirable properties of both networks such as admitting many computationally important networks, flexibility in terms of connections per node as well as level of fault-tolerance.
Abstract: Both Hypercube and deBruijn networks possess desirable properties. It should be understood, though, that some of the attractive features of one are not found in the other. The architecture proposed in this paper is a combination of these architectures, providing some of the desirable properties of both networks, such as admitting many computationally important networks and flexibility in terms of connections per node as well as level of fault-tolerance. The network also allows a simple VLSI layout, scalability, and decomposability. Thus, these networks can be a potential candidate for VLSI multiprocessor networks. The proposed network possesses logarithmic diameter, optimal connectivity, and simple routing algorithms amenable to networks with faults. Importantly, in addition to being pancyclic, these hyper-deBruijn networks admit most computationally important subnetworks including rings, multidimensional meshes, complete binary trees, and mesh of trees with perfect dilation. Techniques for optimal one-to-all (OTA) broadcasting in these networks are presented. As an intermediate result, this technique provides the fastest OTA broadcasting in binary deBruijn networks as well. The recent renewed interest in binary deBruijn networks makes this latter result valuable.

Journal ArticleDOI
TL;DR: It is shown that for a given p (1 or=5 remains connected provided that at most two neighbors of any processor are allowed to fail.
Abstract: It is shown that for a given p (1 or=5 remains connected provided that at most two neighbors of any processor are allowed to fail. >

Journal ArticleDOI
TL;DR: A new technique for estimating and understanding the speed improvement that can result from executing a program on a parallel computer is described, which indicates that the three symbolic programs differ substantially from the numeric programs and, as a consequence, cannot be automatically parallelized with the same compilation techniques.
Abstract: A new technique for estimating and understanding the speed improvement that can result from executing a program on a parallel computer is described. The technique requires no additional programming and minimal effort by a program's author. The analysis begins by tracing a sequential program. A parallelism analyzer uses information from the trace to simulate parallel execution of the program. In addition to predicting parallel performance, the parallelism analyzer measures many aspects of a program's dynamic behavior. Measurements of six substantial programs are presented. These results indicate that the three symbolic programs differ substantially from the numeric programs and, as a consequence, cannot be automatically parallelized with the same compilation techniques. >

Journal ArticleDOI
TL;DR: A parallel memory system for efficient parallel array access using perfect latin squares as skewing functions and self-routing Benes networks to realize the permutations needed between the processing elements and the memory modules is discussed.
Abstract: A parallel memory system for efficient parallel array access using perfect latin squares as skewing functions is discussed. Simple construction methods for building perfect latin squares are presented. The resulting skewing scheme provides conflict free access to several important subsets of an array. The address generation can be performed in constant time with simple circuitry. The skewing scheme can provide constant time access to rows, columns, diagonals, and $\sqrt{N} \times \sqrt{N}$ subarrays of an $N \times N$ array with maximum memory utilization. Self-routing Benes networks can be used to realize the permutations needed between the processing elements and the memory modules. Two skewing schemes that provide conflict free access to three-dimensional arrays are also discussed. Combined with self-routing Benes networks, these schemes provide efficient access to frequently used subsets of three-dimensional arrays.
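The conflict-free property of a skewing function can be illustrated with a small checker. This sketch assumes the usual definition of a perfect latin square, which the abstract does not state: every row, every column, both main diagonals, and every aligned square block of side equal to the square root of the order contain each symbol exactly once. Entry `square[i][j]` is read as the memory module holding array element (i, j); the function name and the sample squares are hypothetical and are not taken from the paper.

```python
import math


def is_perfect_latin_square(square):
    """Check the assumed perfect-latin-square conditions on an n x n grid
    whose entries are module numbers 0..n-1 (n must be a perfect square)."""
    n = len(square)
    r = math.isqrt(n)
    if r * r != n or any(len(row) != n for row in square):
        return False

    full = set(range(n))
    groups = []
    groups.extend(set(row) for row in square)                            # rows
    groups.extend({square[i][j] for i in range(n)} for j in range(n))   # columns
    groups.append({square[i][i] for i in range(n)})                     # main diagonal
    groups.append({square[i][n - 1 - i] for i in range(n)})             # anti-diagonal
    for bi in range(0, n, r):                                           # aligned r x r blocks
        for bj in range(0, n, r):
            groups.append({square[bi + i][bj + j]
                           for i in range(r) for j in range(r)})

    # A group is conflict free when it touches every module exactly once.
    return all(group == full for group in groups)


if __name__ == "__main__":
    candidate = [
        [0, 1, 2, 3],
        [2, 3, 0, 1],
        [3, 2, 1, 0],
        [1, 0, 3, 2],
    ]
    cyclic = [[(i + j) % 4 for j in range(4)] for i in range(4)]
    print(is_perfect_latin_square(candidate))
    print(is_perfect_latin_square(cyclic))
```

The simple cyclic skew `(i + j) mod n` fails the check because every element of the anti-diagonal maps to the same module, which is the kind of conflict a perfect latin square is meant to rule out.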