
Showing papers in "IEEE Transactions on Computers in 1987"


Journal ArticleDOI
TL;DR: In this article, a deadlock-free routing algorithm for arbitrary interconnection networks is presented using the concept of virtual channels; the necessary and sufficient condition for deadlock-free routing is the absence of cycles in the channel dependency graph.
Abstract: A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels. A necessary and sufficient condition for deadlock-free routing is the absence of cycles in a channel dependency graph. Given an arbitrary network and a routing function, the cycles of the channel dependency graph can be removed by splitting physical channels into groups of virtual channels. This method is used to develop deadlock-free routing algorithms for k-ary n-cubes, for cube-connected cycles, and for shuffle-exchange networks.
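The paper's condition can be checked mechanically: build the channel dependency graph and test it for cycles. A minimal sketch, with an illustrative graph encoding and helper name that are ours, not the paper's:

```python
def has_cycle(deps):
    """Depth-first search for a cycle in a channel dependency graph.
    deps maps each channel id to the set of channels the routing
    function may occupy next while still holding this one."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:      # back edge: a cycle, so deadlock is possible
                return True
            if state == WHITE and visit(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(deps))

# A unidirectional 4-ring: each channel waits on the next, so the
# dependency graph is a cycle and routing can deadlock.
ring = {0: {1}, 1: {2}, 2: {3}, 3: {0}}
# Removing one dependency (as virtual-channel splitting does) makes it acyclic.
acyclic = {0: {1}, 1: {2}, 2: {3}, 3: set()}
```

Splitting a physical channel into virtual channels rewrites edges of this graph until `has_cycle` returns false, which by the paper's theorem guarantees deadlock freedom.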

2,110 citations


Journal ArticleDOI
TL;DR: This self-contained paper develops the theory necessary to statically schedule SDF programs on single or multiple processors; a class of static (compile-time) scheduling algorithms is proven valid, and specific algorithms are given for scheduling SDF systems onto single or multiple processors.
Abstract: Large grain data flow (LGDF) programming is natural and convenient for describing digital signal processing (DSP) systems, but its runtime overhead is costly in real-time or cost-sensitive applications. In some situations, designers are not willing to squander computing resources for the sake of programmer convenience. This is particularly true when the target machine is a programmable DSP chip. However, the runtime overhead inherent in most LGDF implementations is not required for most signal processing systems because such systems are mostly synchronous (in the DSP sense). Synchronous data flow (SDF) differs from traditional data flow in that the amount of data produced and consumed by a data flow node is specified a priori for each input and output. This is equivalent to specifying the relative sample rates in a signal processing system. This means that the scheduling of SDF nodes need not be done at runtime, but can be done at compile time (statically), so the runtime overhead evaporates. The sample rates can all be different, which is not true of most current data-driven digital signal processing programming methodologies. Synchronous data flow is closely related to computation graphs, a special case of Petri nets. This self-contained paper develops the theory necessary to statically schedule SDF programs on single or multiple processors. A class of static (compile time) scheduling algorithms is proven valid, and specific algorithms are given for scheduling SDF systems onto single or multiple processors.
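The balance equations behind static SDF scheduling can be made concrete: each edge (src, dst, produced, consumed) forces r[src]·produced = r[dst]·consumed, and the repetitions vector is the smallest positive integer solution. A small illustrative solver, assuming a connected and consistently rated graph (the function name and edge encoding are ours):

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """Smallest positive integer firing counts r such that for every
    edge (src, dst, produced, consumed): r[src]*produced == r[dst]*consumed.
    Assumes the SDF graph is connected and consistently rated."""
    r = {actors[0]: Fraction(1)}
    changed = True
    while changed:              # propagate rates across edges
        changed = False
        for src, dst, p, c in edges:
            if src in r and dst not in r:
                r[dst] = r[src] * p / c
                changed = True
            elif dst in r and src not in r:
                r[src] = r[dst] * c / p
                changed = True
    for src, dst, p, c in edges:   # consistency check
        if r[src] * p != r[dst] * c:
            raise ValueError("sample-rate inconsistent SDF graph")
    scale = lcm(*(f.denominator for f in r.values()))
    return {a: int(f * scale) for a, f in r.items()}
```

For a chain A→B→C where A produces 2 tokens consumed 3 at a time by B, and B produces 1 token consumed 2 at a time by C, the solver fires A three times, B twice, and C once per schedule period.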

1,380 citations


Journal ArticleDOI
TL;DR: Instant Replay is a general solution for reproducing the execution behavior of parallel programs; during execution it saves the relative order of significant events, not the data associated with those events.
Abstract: The debugging cycle is the most common methodology for finding and correcting errors in sequential programs. Cyclic debugging is effective because sequential programs are usually deterministic. Debugging parallel programs is considerably more difficult because successive executions of the same program often do not produce the same results. In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. During program execution we save the relative order of significant events as they occur, not the data associated with such events. As a result, our approach requires less time and space to save the information needed for program replay than other methods. Our technique is not dependent on any particular form of interprocess communication. It provides for replay of an entire program, rather than individual processes in isolation. No centralized bottlenecks are introduced and there is no need for synchronized clocks or a globally consistent logical time. We describe a prototype implementation of Instant Replay on the BBN Butterfly™ Parallel Processor, and discuss how it can be incorporated into the debugging cycle for parallel programs.
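The core idea, logging the order of accesses rather than their data, fits in a few lines. A toy single-object sketch (the operation encoding and function names are our simplification, not the BBN implementation): each write bumps a version counter, each access records the version it saw, and a re-execution is a faithful replay iff it reproduces the same version history.

```python
def record_run(ops):
    """ops: sequence of (kind, process_id) with kind 'r' or 'w' on one
    shared object.  Returns the replay log: for every access, the
    object version it observed (each write creates a new version)."""
    version, log = 0, []
    for kind, pid in ops:
        if kind == 'w':
            version += 1
        log.append((pid, kind, version))
    return log

def replay_matches(ops, log):
    """A re-execution reproduces the recorded behavior iff it observes
    the same version history; no event data is ever saved."""
    return record_run(ops) == log
```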

765 citations


Journal ArticleDOI
TL;DR: For certain types of loops, it is shown analytically that guided self-scheduling uses minimal overhead and achieves optimal schedules; experimental results clearly show its advantage over the most widely known dynamic methods.
Abstract: This paper proposes guided self-scheduling, a new approach for scheduling arbitrarily nested parallel program loops on shared memory multiprocessor systems. Utilizing loop parallelism is clearly most crucial in achieving high system and program performance. Because of its simplicity, guided self-scheduling is particularly suited for implementation on real parallel machines. This method achieves simultaneously the two most important objectives: load balancing and very low synchronization overhead. For certain types of loops we show analytically that guided self-scheduling uses minimal overhead and achieves optimal schedules. Two other interesting properties of this method are its insensitivity to the initial processor configuration (in time) and its parameterized nature which allows us to tune it for different systems. Finally we discuss experimental results that clearly show the advantage of guided self-scheduling over the most widely known dynamic methods.
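The guided self-scheduling rule itself is one line: an idle processor grabs ⌈R/p⌉ of the R remaining iterations. A sketch (the function name is ours):

```python
from math import ceil

def gss_chunks(n_iterations, n_procs):
    """Chunk sizes handed out by guided self-scheduling: each processor
    request takes ceil(remaining / p) of the remaining iterations."""
    remaining, chunks = n_iterations, []
    while remaining > 0:
        take = ceil(remaining / n_procs)
        chunks.append(take)
        remaining -= take
    return chunks
```

Early chunks are large, keeping synchronization overhead low; late chunks shrink toward one iteration, balancing the processors' finishing times.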

656 citations


Journal ArticleDOI
TL;DR: Depending on the types and number of tolerated faults, this paper presents upper bounds on the achievable synchronization accuracy for external and internal synchronization in a distributed real-time system.
Abstract: The generation of a fault-tolerant global time base with known accuracy of synchronization is one of the important operating system functions in a distributed real-time system. Depending on the types and number of tolerated faults, this paper presents upper bounds on the achievable synchronization accuracy for external and internal synchronization in a distributed real-time system. The concept of continuous versus instantaneous synchronization is introduced in order to generate a uniform common time base for local, global, and external time measurements. In the last section, the functions of a VLSI clock synchronization unit, which improves the synchronization accuracy and reduces the CPU load, are described. With this unit, the CPU overhead and the network traffic for clock synchronization in state-of-the-art distributed real-time systems can be reduced to less than 1 percent.

625 citations


Journal ArticleDOI
TL;DR: This work uses a binary decomposition of the domain to partition it into rectangles requiring equal computational effort, and studies the communication costs of mapping this partitioning onto different multiprocessors: a mesh-connected array, a tree machine, and a hypercube.
Abstract: We consider the partitioning of a problem on a domain with unequal work estimates in different subdomains in a way that balances the workload across multiple processors. Such a problem arises for example in solving partial differential equations using an adaptive method that places extra grid points in certain subregions of the domain. We use a binary decomposition of the domain to partition it into rectangles requiring equal computational effort. We then study the communication costs of mapping this partitioning onto different multiprocessors: a mesh-connected array, a tree machine, and a hypercube. The communication cost expressions can be used to determine the optimal depth of the above partitioning.
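The binary decomposition can be sketched as recursive bisection of a workload grid, alternating cut axes and placing each cut where the two halves' work is closest to equal. The encoding and names below are ours, not the paper's:

```python
def bisect(weights, depth, r0=0, r1=None, c0=0, c1=None, vertical=True):
    """Recursively split a 2-D grid of work estimates into 2**depth
    rectangles of approximately equal total weight, alternating the
    cut axis at each level.  Returns (r0, r1, c0, c1) index ranges."""
    if r1 is None:
        r1, c1 = len(weights), len(weights[0])
    if depth == 0:
        return [(r0, r1, c0, c1)]

    def work(a0, a1, b0, b1):
        return sum(sum(row[b0:b1]) for row in weights[a0:a1])

    half = work(r0, r1, c0, c1) / 2
    if vertical:                     # cut between columns
        best = min(range(c0 + 1, c1),
                   key=lambda c: abs(work(r0, r1, c0, c) - half))
        return (bisect(weights, depth - 1, r0, r1, c0, best, False) +
                bisect(weights, depth - 1, r0, r1, best, c1, False))
    else:                            # cut between rows
        best = min(range(r0 + 1, r1),
                   key=lambda r: abs(work(r0, r, c0, c1) - half))
        return (bisect(weights, depth - 1, r0, best, c0, c1, True) +
                bisect(weights, depth - 1, best, r1, c0, c1, True))
```

On a uniform grid the cuts fall at the midpoints; with an adaptive method's clustered grid points, the rectangles shrink where the weights are dense.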

623 citations


Journal ArticleDOI
TL;DR: The architecture, implementation, and performance of the Warp machine are described, demonstrating that the Warp architecture is effective in the application domain of robot navigation as well as in other fields such as signal processing, scientific computation, and computer vision research.
Abstract: The Warp machine is a systolic array computer of linearly connected cells, each of which is a programmable processor capable of performing 10 million floating-point operations per second (10 MFLOPS). A typical Warp array includes ten cells, thus having a peak computation rate of 100 MFLOPS. The Warp array can be extended to include more cells to accommodate applications capable of using the increased computational bandwidth. Warp is integrated as an attached processor into a Unix host system. Programs for Warp are written in a high-level language supported by an optimizing compiler. The first ten-cell prototype was completed in February 1986; delivery of production machines started in April 1987. Extensive experimentation with both the prototype and production machines has demonstrated that the Warp architecture is effective in the application domain of robot navigation as well as in other fields such as signal processing, scientific computation, and computer vision research. For these applications, Warp is typically several hundred times faster than a VAX 11/780 class computer. This paper describes the architecture, implementation, and performance of the Warp machine. Each major architectural decision is discussed and evaluated with system, software, and application considerations. The programming model and tools developed for the machine are also described. The paper concludes with performance data for a large number of applications.

328 citations


Journal ArticleDOI
TL;DR: An algorithm to convert redundant number representations into conventional representations is presented, which is applicable in arithmetic algorithms such as nonrestoring division, square root, and on-line operations in which redundantly represented results are generated in a digit-by-digit manner.
Abstract: An algorithm to convert redundant number representations into conventional representations is presented. The algorithm is performed concurrently with the digit-by-digit generation of redundant forms by schemes such as SRT division. It has a step delay roughly equivalent to the delay of a carry-save adder and simple implementation. The conversion scheme is applicable in arithmetic algorithms such as nonrestoring division, square root, and on-line operations in which redundantly represented results are generated in a digit-by-digit manner, from most significant to least significant.
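One standard way to perform such a conversion on the fly is to maintain two conventional forms of the prefix, q and q − 1, so that appending a negative digit never triggers a borrow propagation. A radix-2 signed-digit sketch over integers (hypothetical names, not the paper's notation):

```python
def convert_msd_first(digits):
    """Convert a signed-digit number (digits in {-1, 0, 1}, most
    significant digit first, radix 2) to a conventional integer.
    q is the value of the prefix converted so far and qm == q - 1;
    each step appends one bit to q or to qm, so the update is
    carry-free and can run concurrently with digit generation."""
    q, qm = 0, -1
    for d in digits:
        if d >= 0:
            q, qm = 2 * q + d, 2 * q + (d - 1)
        else:   # d == -1: new value is 2q - 1, i.e. 2*qm + 1
            q, qm = 2 * qm + 1, 2 * qm
    return q
```

In hardware the two registers are updated by shifts and bit appends only, which is what gives the step delay comparable to a carry-save adder stage.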

256 citations


Journal ArticleDOI
TL;DR: In this article, the authors describe three mechanisms that improve network coherence: an organizational structure that provides a long-term framework for network coordination to guide each node's local control decisions; a planner at each node that develops sequences of problem-solving activities based on the current situation; and meta-level communication about the current state of local problem solving that enables nodes to dynamically refine the organization.
Abstract: When two or more computing agents work on interacting tasks, their activities should be coordinated so that they cooperate coherently. Coherence is particularly problematic in domains where each agent has only a limited view of the overall task, where communication between agents is limited, and where there is no "controller" to coordinate the agents. Our approach to coherent cooperation in such domains is developed in the context of a distributed problem-solving network where agents cooperate to solve a single problem. The approach stresses the importance of sophisticated local control by which each problem-solving node integrates knowledge of the problem domain with (meta-level) knowledge about network coordination. This allows nodes to make rapid, intelligent local decisions based on changing problem characteristics with only a limited amount of intercommunication to coordinate these decisions. We describe three mechanisms that improve network coherence: 1) an organizational structure that provides a long-term framework for network coordination to guide each node's local control decisions; 2) a planner at each node that develops sequences of problem-solving activities based on the current situation; and 3) meta-level communication about the current state of local problem solving that enables nodes to dynamically refine the organization. We present a variety of problem-solving situations to show the benefits and limitations of these mechanisms, and we provide simulation results showing the mechanisms to be particularly cost effective in more complex problem-solving situations. We also discuss how these mechanisms might be of more general use in other distributed computing applications.

254 citations


Journal ArticleDOI
TL;DR: In this paper, it is shown that even if only a small percentage of all requests are to a hot-spot, these requests can cause very serious performance problems, and networks that do the necessary combining of requests are suggested to keep the interconnection network and memory contention from becoming a bottleneck.
Abstract: When a large number of processors try to access a common variable, referred to as hot-spot accesses in [6], not only can the resulting memory contention seriously degrade performance, but it can also cause tree saturation in the interconnection network which blocks both hot and regular requests alike. It is shown in [6] that even if only a small percentage of all requests are to a hot-spot, these requests can cause very serious performance problems, and networks that do the necessary combining of requests are suggested to keep the interconnection network and memory contention from becoming a bottleneck.

252 citations


Journal ArticleDOI
TL;DR: A Gray-code (GC) allocation strategy is proposed and shown to outperform the buddy strategy in detecting the availability of subcubes; the minimal number of GCs required for complete subcube recognition in a Q_n is proved to be less than or equal to C(n, ⌈n/2⌉).
Abstract: The processor allocation problem in an n-dimensional hypercube (or an n-cube) multiprocessor is similar to the conventional memory allocation problem. The main objective in both problems is to maximize the utilization of available resources as well as minimize the inherent system fragmentation. A processor allocation strategy using the buddy system, called the buddy strategy, is discussed first and then a new allocation strategy using a Gray code (GC), called the GC strategy, is proposed. When processor relinquishment is not considered (i.e., static allocation), both of these strategies are proved to be optimal in the sense that each incoming request sequence is always assigned to a minimal subcube. It is also shown that the GC strategy outperforms the buddy strategy in detecting the availability of subcubes. Our results are extended further to implement an allocation strategy using more than one GC and derive the relationship between the GC's used and the corresponding ability of detecting the availability of various subcubes. The minimal number of GC's required for complete subcube recognition in a Q_n is proved to be less than or equal to C(n, ⌈n/2⌉). Several processor allocation strategies in a Q_5 are implemented on the NCUBE/six multiprocessor at the University of Michigan, and their performance is experimentally measured.
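The Gray-code machinery the GC strategy builds on is compact: the binary-reflected code g(i) = i XOR (i >> 1) makes consecutive codewords adjacent in the hypercube, so a run of free codes is a subcube candidate. A toy allocator in that spirit, deliberately limited to 1-subcubes and not the paper's full algorithm:

```python
def gray(i):
    """Binary-reflected Gray code: consecutive codes differ in one bit."""
    return i ^ (i >> 1)

def allocate_pair(free, dim):
    """Toy flavor of the GC strategy: scan the nodes of a dim-cube in
    Gray-code order and return the first two consecutive free nodes.
    Because consecutive Gray codes differ in exactly one bit, such a
    pair is a 1-subcube.  Illustrative only."""
    order = [gray(i) for i in range(2 ** dim)]
    for a, b in zip(order, order[1:]):
        if a in free and b in free:
            return (a, b)
    return None
```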

Journal ArticleDOI
TL;DR: This paper presents a mapping strategy for parallel processing based on an accurate characterization of the communication overhead; an efficient mapping scheme is developed for the objective functions, employing two levels of assignment optimization: initial assignment and pairwise exchange.
Abstract: This paper presents a mapping strategy for parallel processing using an accurate characterization of the communication overhead. A set of objective functions is formulated to evaluate the optimality of mapping a problem graph onto a system graph. One of them is especially suitable for real-time applications of parallel processing. These objective functions are different from the conventional objective functions in that the edges in the problem graph are weighted and the actual distance rather than the nominal distance for the edges in the system graph is employed. This facilitates a more accurate quantification of the communication overhead. An efficient mapping scheme has been developed for the objective functions, where two levels of assignment optimization procedures are employed: initial assignment and pairwise exchange. The mapping scheme has been tested using the hypercube as a system graph.

Journal ArticleDOI
TL;DR: This paper presents an innovative approach, called signatured instruction streams (SIS), to the on-line detection of control flow errors caused by transient and intermittent faults.
Abstract: This paper presents an innovative approach, called signatured instruction streams (SIS), to the on-line detection of control flow errors caused by transient and intermittent faults. At compile time an application program is appropriately partitioned into smaller subprograms, and cyclic codes, or signatures, characterizing the control flow of each subprogram are generated and embedded in the object code. At runtime, special built-in hardware regenerates these signatures using runtime information and compares them to the precomputed signatures. A mismatch indicates the detection of an error. A demonstration system, based on the MC68000 processor, has been designed and built. Fault insertion experiments have been performed using the demonstration system. The demonstration system, using 17 percent hardware overhead, is able to detect 98 percent of faults affecting the control flow and 82 percent of all randomly inserted faults.

Journal ArticleDOI
TL;DR: In this article, the authors examined the cache miss ratio as a function of line size, and found that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat.
Abstract: The line (block) size of a cache memory is one of the parameters that most strongly affects cache performance. In this paper, we study the factors that relate to the selection of a cache line size. Our primary focus is on the cache miss ratio, but we also consider influences such as logic complexity, address tags, line crossers, I/O overruns, etc. The behavior of the cache miss ratio as a function of line size is examined carefully through the use of trace driven simulation, using 27 traces from five different machine architectures. The change in cache miss ratio as the line size varies is found to be relatively stable across workloads, and tables of this function are presented for instruction caches, data caches, and unified caches. An empirical mathematical fit is obtained. This function is used to extend previously published design target miss ratios to cover line sizes from 4 to 128 bytes and cache sizes from 32 bytes to 32K bytes; design target miss ratios are to be used to guide new machine designs. Mean delays per memory reference and memory (bus) traffic rates are computed as a function of line and cache size, and memory access time parameters. We find that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat. Longer line sizes are suitable for mainframes because of the higher bandwidth to main memory.

Journal ArticleDOI
TL;DR: This paper presents a method for optimal module allocation that satisfies certain performance constraints and proposes an objective function that includes the intermodule communication (IMC) and accumulative execution time (AET) of each module.
Abstract: In a distributed processing system with the application software partitioned into a set of program modules, allocation of those modules to the processors is an important problem. This paper presents a method for optimal module allocation that satisfies certain performance constraints. An objective function that includes the intermodule communication (IMC) and accumulative execution time (AET) of each module is proposed. It minimizes the bottleneck-processor utilization—a good principle for task allocation. Next, the effects of precedence relationship (PR) among program modules on response time are studied. Both simulation and analytical results reveal that the program-size ratio between two consecutive modules plays an important role in task response time. Finally, an algorithm based on PR, AET, and IMC and on the proposed objective function is presented. This algorithm generates better module assignments than those that do not consider the PR effects.

Journal ArticleDOI
Hou
TL;DR: Through use of the fast Hartley transform, discrete cosine transforms (DCT) and discrete Fourier transforms (DFT) can be obtained and the recursive nature of the FHT algorithm derived in this paper enables us to generate the next higher order FHT from two identical lower order F HT's.
Abstract: The fast Hartley transform (FHT) is similar to the Cooley-Tukey fast Fourier transform (FFT) but performs much faster because it requires only real arithmetic computations compared to the complex arithmetic computations required by the FFT. Through use of the FHT, discrete cosine transforms (DCT) and discrete Fourier transforms (DFT) can be obtained. The recursive nature of the FHT algorithm derived in this paper enables us to generate the next higher order FHT from two identical lower order FHT's. In practice, this recursive relationship offers flexibility in programming different sizes of transforms, while the orderly structure of its signal flow-graphs indicates an ease of implementation in VLSI.
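The transform itself is real-valued: the Hartley kernel is cas θ = cos θ + sin θ. A direct O(N²) reference implementation for illustration (the FHT computes the same values in O(N log N); this sketch only defines what is computed, and the name is ours):

```python
from math import cos, sin, pi

def dht(x):
    """Direct discrete Hartley transform:
    H[k] = sum over m of x[m] * cas(2*pi*m*k/N), all in real arithmetic."""
    n = len(x)
    return [sum(x[m] * (cos(2 * pi * m * k / n) + sin(2 * pi * m * k / n))
                for m in range(n))
            for k in range(n)]
```

Because the kernel is real, no complex multiplies are needed, which is the source of the FHT's speed advantage over the FFT noted above.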

Journal ArticleDOI
TL;DR: Several loop synchronization techniques to generate synchronization instructions for singly-nested loops are presented and a technique for the elimination of redundant synchronization instructions is presented.
Abstract: Translating program loops into a parallel form is one of the most important transformations performed by concurrentizing compilers. This transformation often requires the insertion of synchronization instructions within the body of the concurrent loop. Several loop synchronization techniques are presented first. Compiler algorithms to generate synchronization instructions for singly-nested loops are then discussed. Finally, a technique for the elimination of redundant synchronization instructions is presented.

Journal ArticleDOI
TL;DR: The node organization algorithm presented in this paper provides a completely distributed, maximally localized execution of collision free channel allocation that allows for parallel channel allocation in stationary and mobile networks with provable spatial reuse properties.
Abstract: This paper proposes a solution to providing a collision free channel allocation in a multihop mobile radio network. An efficient solution to this problem provides spatial reuse of the bandwidth whenever possible. A robust solution maintains the collision free property of the allocation under any combination of topological changes. The node organization algorithm presented in this paper provides a completely distributed, maximally localized execution of collision free channel allocation. It allows for parallel channel allocation in stationary and mobile networks with provable spatial reuse properties. A simpler version of the algorithm also provides a highly localized distributed coloring algorithm for dynamic graphs.

Journal ArticleDOI
TL;DR: This paper presents a class of repair mechanisms using the concept of checkpointing and derives several properties of checkpoint repair mechanisms, and provides algorithms for performing checkpoint repair that incur little overhead in time and modest cost in hardware.
Abstract: Out-of-order execution and branch prediction are two mechanisms that can be used profitably in the design of supercomputers to increase performance. Proper exception handling and branch prediction miss handling in an out-of-order execution machine do require some kind of repair mechanism which can restore the machine to a known previous state. In this paper we present a class of repair mechanisms using the concept of checkpointing. We derive several properties of checkpoint repair mechanisms. In addition, we provide algorithms for performing checkpoint repair that incur little overhead in time and modest cost in hardware. We also note that our algorithms require no additional complexity or time for use with write-back cache memory systems than they do with write-through cache memory systems, contrary to statements made by previous researchers.
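At its core, checkpoint repair reduces to snapshotting architectural state and restoring the most recent snapshot on an exception or branch-prediction miss. A minimal register-file sketch (class and method names are ours, not the paper's mechanisms):

```python
import copy

class CheckpointMachine:
    """Toy checkpoint repair: snapshot the register file at checkpoints;
    on an exception or mispredicted branch, restore the latest snapshot
    and resume from that known previous state."""
    def __init__(self):
        self.regs = {}
        self.checkpoints = []

    def checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.regs))

    def execute(self, reg, value):
        self.regs[reg] = value           # speculative update

    def repair(self):
        self.regs = self.checkpoints.pop()   # roll back
```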

Journal ArticleDOI
TL;DR: This paper presents the principles of constructing hypernets and analyzes their architectural potentials in terms of message routing complexity, cost-effective support for global as well as localized communication, I/O capabilities, and fault tolerance.
Abstract: A new class of modular networks is proposed for hierarchically constructing massively parallel computer systems for distributed supercomputing and AI applications. These networks are called hypernets. They are constructed incrementally with identical cubelets, treelets, or buslets that are well suited for VLSI implementation. Hypernets integrate positive features of both hypercubes and tree-based topologies, and maintain a constant node degree when the network size increases. This paper presents the principles of constructing hypernets and analyzes their architectural potentials in terms of message routing complexity, cost-effective support for global as well as localized communication, I/O capabilities, and fault tolerance. Several algorithms are mapped onto hypernets to illustrate their ability to support parallel processing in a hierarchically structured or data-dependent environment. The emulation of hypercube connections using less hardware is shown. The potential of hypernets for efficient support of connectionist models of computation is also explored.

Journal ArticleDOI
TL;DR: This paper presents a graph-theoretic algorithm for safety analysis of a class of timing properties in real-time systems that are expressible in a subset of real-time logic (RTL) formulas.
Abstract: This paper presents a graph-theoretic algorithm for safety analysis of a class of timing properties in real-time systems which are expressible in a subset of real time logic (RTL) formulas. Our procedure is in three parts: the first part constructs a graph representing the system specification and the negation of the safety assertion. The second part detects positive cycles in the graph using a node removal operation. The third part determines the consistency of the safety assertion with respect to the system specification based on the positive cycles detected. The implementation and an application of this procedure will also be described.

Journal ArticleDOI
TL;DR: The proposed broadcast protocol thus possesses the advantages of TDM solutions while allowing the channel bandwidth to be shared, concurrently with the broadcast, with other transmission activities as dictated, for instance, by data link protocols.
Abstract: This paper considers the issue of broadcasting protocols in multihop radio networks. The objective of a broadcasting protocol is to deliver the broadcasted message to all network nodes. To efficiently achieve this objective, the broadcasting protocol in this paper utilizes two basic properties of the multihop radio network. One is the broadcast nature of the radio, which allows every single transmission to reach all nodes that are in line of sight and within range of the transmitting node. The other is spatial reuse of the radio channel, which, due to the multihop nature of the network, allows multiple simultaneous transmissions to be received correctly. The proposed protocol incorporates these properties to obtain a collision-free forwarding of the broadcasted message on a tree. Centralized and distributed algorithms for the tree construction are presented. The obtained trees are unique in incorporating radio oriented time ordering as part of their definition. In this way multiple copies of one or more broadcasted messages can be transmitted simultaneously without collision, requiring only a small number of message transmissions. Consequently, the protocol not only guarantees that the broadcasted message reaches all network nodes in bounded time, but also ensures that the broadcasting activity will use only limited channel bandwidth and node memory. The proposed broadcast protocol thus possesses the advantages of TDM solutions while allowing the channel bandwidth to be shared, concurrently with the broadcast, with other transmission activities as dictated, for instance, by data link protocols. Some NP-completeness proofs are also given.

Journal Article
Wagner, Chin, McCluskey
TL;DR: In this paper, pseudorandom patterns generated by a linear feedback shift register (LFSR) were used to test a circuit for high fault coverage using the detectability profile of the circuit.
Abstract: Algorithmic test generation for high fault coverage is an expensive and time-consuming process. As an alternative, circuits can be tested by applying pseudorandom patterns generated by a linear feedback shift register (LFSR). Although no fault simulation is needed, analysis of pseudorandom testing requires the circuit detectability profile.
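An LFSR pattern generator of the kind analyzed here is a few lines. A Fibonacci-style sketch with 1-based tap positions (names are ours): with a primitive feedback polynomial such as x⁴ + x³ + 1, a 4-bit register cycles through all 15 nonzero states before repeating.

```python
def lfsr_patterns(taps, width, seed=1):
    """Fibonacci LFSR: shift left, feeding back the XOR of the tapped
    bits.  Yields successive register states, which serve as the
    pseudorandom test patterns applied to the circuit under test."""
    state = seed
    mask = (1 << width) - 1
    while True:
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask
```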

Journal ArticleDOI
TL;DR: A simple and efficient algorithm, SYREL, to obtain compact terminal reliability expressions between a terminal pair of computers of complex networks that incorporates conditional probability, set theory, and Boolean algebra in a distinct approach.
Abstract: Symbolic terminal reliability algorithms are important for analysis and synthesis of computer networks. In this paper, we present a simple and efficient algorithm, SYREL, to obtain compact terminal reliability expressions between a terminal pair of computers of complex networks. This algorithm incorporates conditional probability, set theory, and Boolean algebra in a distinct approach in which most of the computations performed are directly executable Boolean operations. The conditional probability is used to avoid applying at each iteration the most time consuming step in reliability algorithms, which is making a set of events mutually exclusive. The algorithm has been implemented on a VAX 11/750 and can analyze fairly large networks with modest memory and time requirements.

Journal ArticleDOI
TL;DR: A completely new generalization of the characterization problem in the system-level diagnosis area is developed and provides necessary and sufficient conditions for any fault-pattern of any size to be uniquely diagnosable, under the symmetric, and asymmetric invalidation models with or without the intermittent faults.
Abstract: System-level diagnosis appears to be a viable alternative to circuit-level testing in complex multiprocessor systems. A completely new generalization of the characterization problem in the system-level diagnosis area is developed in this paper. This generalized characterization theorem provides necessary and sufficient conditions for any fault-pattern of any size to be uniquely diagnosable, under the symmetric and asymmetric invalidation models with or without the intermittent faults. Moreover, it is also shown that the well known t-characterization theorems under these models can be derived as special cases. In addition to the generalization provided by these results, it is hoped that these results will also have a great impact on the diagnosis of faulty units in uniform structures based on the system-level diagnosis concepts and would be particularly useful in the diagnosis of WSI-oriented multiprocessor systems.

Journal ArticleDOI
TL;DR: This paper presents the solution of an optimization problem that appears in the design of double-loop structures for local networks and also in data memory allocation and data alignment in SIMD processors.
Abstract: This paper presents the solution of the following optimization problem that appears in the design of double-loop structures for local networks and also in data memory allocation and data alignment in SIMD processors.

Journal ArticleDOI
TL;DR: This paper addresses the problem of selecting vote assignments in order to maximize the probability that the critical operations can be performed at a given time by some group of nodes, and suggests simple heuristics to assign votes.
Abstract: In a faulty distributed system, voting is commonly used to achieve mutual exclusion among groups of isolated nodes. Each node is assigned a number of votes, and any group with a majority of votes can perform the critical operations. Vote assignments can have a significant impact on system reliability. In this paper we address the problem of selecting vote assignments in order to maximize the probability that the critical operations can be performed at a given time by some group of nodes. We suggest simple heuristics to assign votes, and show that they give good results in most cases. We also study three particular homogeneous topologies (fully connected, Ethernet, and ring networks), and derive analytical expressions for system reliability. These expressions provide useful insights into the reliability provided by voting mechanisms.
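A brute-force version of the reliability objective for a fully connected topology (one of the three studied here) can be written directly, since in that case any set of live nodes can communicate and the system is available exactly when the live nodes hold a strict vote majority. The vote vectors and node reliabilities below are hypothetical:

```python
from itertools import product

def availability(votes, p):
    """Probability that the live nodes hold a strict majority of votes,
    assuming a fully connected network and independent node failures
    (node i is up with probability p[i]). Exponential enumeration,
    for illustration only."""
    total_votes = sum(votes)
    avail = 0.0
    for states in product([True, False], repeat=len(votes)):
        prob = 1.0
        live = 0
        for up, v, pi in zip(states, votes, p):
            prob *= pi if up else (1.0 - pi)
            if up:
                live += v
        if 2 * live > total_votes:
            avail += prob
    return avail
```

With three equally reliable nodes (p = 0.9) and one vote each, availability is the probability that at least two nodes are up, 0.972; giving one node three of five votes makes the system live exactly when that node is up, 0.9, which shows why the choice of assignment matters.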

Journal ArticleDOI
TL;DR: This work uses the track graph, a suitably defined grid-like structure, to obtain efficient solutions for rectilinear shortest paths and minimum spanning tree (MST) problems for a set of points in the plane in the presence of rectilInear obstacles.
Abstract: We study the rectilinear shortest paths and minimum spanning tree (MST) problems for a set of points in the plane in the presence of rectilinear obstacles. We use the track graph, a suitably defined grid-like structure, to obtain efficient solutions for both problems. The track graph consists of rectilinear tracks defined by the obstacles and the points for which shortest paths and a minimum spanning tree are sought. We use a growth process like Dijkstra's on the track graph to find shortest paths from any point in the set to all other points (the one-to-all shortest paths problem). For the one-to-all shortest paths problem for n points we derive an O(n min{log n, log e} + (e + k) log t) time algorithm, where e is the total number of edges of all obstacles, t is the number of extreme edges of all obstacles, and k is the number of intersections among obstacle tracks (all bounds are for the worst case). The MST for the points is also constructed in time O(n log n + (e + k) log t) by a hybrid method of searching for shortest paths while simultaneously constructing an MST. An interesting application of the MST algorithm is the approximation of Steiner trees in graphs.
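The growth process the authors run on the track graph is Dijkstra's algorithm; a minimal one-to-all version on an arbitrary weighted graph (not the track graph itself, whose construction from obstacle tracks is the paper's contribution) looks like this:

```python
import heapq

def one_to_all(adj, source):
    """One-to-all shortest paths by Dijkstra's algorithm.
    adj maps each node to a list of (neighbor, weight) pairs with
    non-negative weights; returns distances from source."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale entry, node already settled closer
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Run on the track graph, each settled vertex corresponds to growing the "wavefront" one step along a rectilinear track around the obstacles.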

Journal ArticleDOI
TL;DR: A heuristic two-step, graph-based mapping scheme with polynomial-time complexity is developed and a heuristic boundary refinement procedure is developed to incrementally alter the initial partition for improved load balancing among the processors.
Abstract: The processor allocation problem is addressed in the context of the parallelization of a finite element modeling program on a processor mesh. A heuristic two-step, graph-based mapping scheme with polynomial-time complexity is developed: 1) initial generation of a graph partition for nearest-neighbor mapping of the finite element graph onto the processor graph, and 2) a heuristic boundary refinement procedure to incrementally alter the initial partition for improved load balancing among the processors. The effectiveness of the approach is gauged both by estimation using a model with empirically determined parameters and by implementation and experimental measurement on a 16-node hypercube parallel computer.
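The flavor of step 2 — incrementally moving boundary vertices to rebalance load — can be sketched for the two-partition case. This greedy loop is a simplified stand-in for the paper's refinement procedure, with the graph, the tie-breaking rule, and the stopping condition all assumed:

```python
def refine_boundary(adj, part, rounds=10):
    """Greedy boundary refinement between two partitions (0 and 1):
    while the loads differ, move one boundary vertex from the heavier
    side to the lighter one, preferring the vertex with the most
    neighbors already across the cut (to limit cut growth).
    adj maps each vertex to its neighbor list; part maps vertex -> side."""
    part = dict(part)
    for _ in range(rounds):
        load = [sum(1 for s in part.values() if s == side) for side in (0, 1)]
        if abs(load[0] - load[1]) <= 1:
            break  # balanced enough
        heavy = 0 if load[0] > load[1] else 1
        best, best_gain = None, 0
        for v, side in part.items():
            if side != heavy:
                continue
            cross = sum(1 for u in adj[v] if part[u] != side)
            if cross > best_gain:
                best, best_gain = v, cross
        if best is None:
            break  # no boundary vertex on the heavy side
        part[best] = 1 - heavy
    return part
```

On a four-vertex path split 3/1, one move of the single boundary vertex yields a balanced 2/2 partition with the same unit cut.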

Journal ArticleDOI
TL;DR: Parallel algorithms for finding the connected components (CC) and a minimum spanning forest of an undirected graph are presented, and the PRAM algorithm is a simplification of the one appearing in [17].
Abstract: Parallel algorithms for finding the connected components (CC) and a minimum spanning forest (MSF) of an undirected graph are presented. The primary model of computation considered is the "shuffle-exchange network," in which each processor has its own local memory, no memory is shared, and communication among processors is done via a fixed-degree network. This model is very convenient for actual realization. Both algorithms have depth of O(log² n) while using n² processors, where n is the number of vertices in the graph. The algorithms are first presented for the PRAM (parallel RAM) model, which is not realizable but much more convenient for the design and presentation of algorithms; the CC and MSF algorithms are no exceptions. The CC PRAM algorithm is a simplification of the one appearing in [17]. A modification of this algorithm yields a simple and efficient MSF algorithm. Both have depth of O(log m) and use m processors, where m is the number of edges in the graph.
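The hooking-and-pointer-jumping structure underlying such parallel CC algorithms can be simulated sequentially round by round. This sketch follows the generic PRAM scheme (each round, roots hook onto smaller roots seen across an edge, then every parent chain is halved), not the specific algorithm of the paper or of [17]:

```python
def connected_components(n, edges):
    """Connected components by simulated parallel hooking and pointer
    jumping on vertices 0..n-1. Returns parent[], where each vertex
    ends up pointing at the minimum-labeled vertex of its component.
    Each while-iteration simulates one parallel round; the number of
    rounds is logarithmic in the longest chain length."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # hooking: attach the larger of two distinct roots to the smaller
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if parent[ru] == ru and parent[rv] == rv and ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                parent[hi] = lo
                changed = True
        # pointer jumping: halve every parent chain
        for v in range(n):
            if parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
                changed = True
    return parent
```

Each vertex's final parent serves as its component label; adding per-edge weights and choosing the minimum-weight outgoing edge during hooking is the usual route from this scheme to a minimum spanning forest.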