
Showing papers on "Parallel algorithm published in 1996"


Book
15 Oct 1996
TL;DR: This book covers the design and coding of parallel programs, performance, and grouping data for communication in the context of parallel computing.
Abstract: Chapter 1 Introduction Chapter 2 An Overview of Parallel Computing Chapter 3 Greetings! Chapter 4 An Application: Numerical Integration Chapter 5 Collective Communication Chapter 6 Grouping Data for Communication Chapter 7 Communicators and Topologies Chapter 8 Dealing with I/O Chapter 9 Debugging Your Program Chapter 10 Design and Coding of Parallel Programs Chapter 11 Performance Chapter 12 More on Performance Chapter 13 Advanced Point-to-Point Communication Chapter 14 Parallel Algorithms Chapter 15 Parallel Libraries Chapter 16 Wrapping Up Appendix A Summary of MPI Commands Appendix B MPI on the Internet

1,357 citations


Journal ArticleDOI
TL;DR: This work considers the problem of mining association rules on a shared nothing multiprocessor and presents three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem specific information.
Abstract: We consider the problem of mining association rules on a shared nothing multiprocessor. We present three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem specific information. The best algorithm exhibits near perfect scaleup behavior, yet requires only minimal overhead compared to the current best serial algorithm.
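
As a rough illustration only: the count-distribution idea behind shared-nothing association-rule miners can be sketched as below. This is not one of the paper's three algorithms; the data, candidate itemsets and function names are hypothetical, and a real system would replace the in-process summation with an all-reduce over the network.

```python
from collections import Counter
from itertools import combinations

def local_candidate_counts(partition, candidates):
    """Count candidate itemsets against one processor's local transactions."""
    counts = Counter()
    for transaction in partition:
        items = set(transaction)
        for cand in candidates:
            if items.issuperset(cand):
                counts[cand] += 1
    return counts

def count_distribution(partitions, candidates, min_support):
    """Each 'processor' counts locally; a global sum then prunes candidates."""
    total = Counter()
    for partition in partitions:            # conceptually runs in parallel, one per node
        total.update(local_candidate_counts(partition, candidates))
    return {c: n for c, n in total.items() if n >= min_support}

# Hypothetical data: two partitions of a tiny transaction database
parts = [[("a", "b", "c"), ("a", "c")], [("b", "c"), ("a", "b", "c")]]
cands = [frozenset(p) for p in combinations("abc", 2)]
print(count_distribution(parts, cands, min_support=2))
```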

1,121 citations


Journal ArticleDOI
TL;DR: A static scheduling algorithm for allocating task graphs to fully connected multiprocessors which has admissible time complexity, is economical in terms of the number of processors used and is suitable for a wide range of graph structures.
Abstract: In this paper, we propose a static scheduling algorithm for allocating task graphs to fully connected multiprocessors. We discuss six recently reported scheduling algorithms and show that they possess one drawback or the other which can lead to poor performance. The proposed algorithm, which is called the Dynamic Critical-Path (DCP) scheduling algorithm, is different from the previously proposed algorithms in a number of ways. First, it determines the critical path of the task graph and selects the next node to be scheduled in a dynamic fashion. Second, it rearranges the schedule on each processor dynamically in the sense that the positions of the nodes in the partial schedules are not fixed until all nodes have been considered. Third, it selects a suitable processor for a node by looking ahead to the potential start times of the remaining nodes on that processor, and schedules relatively less important nodes to the processors already in use. A global as well as a pair-wise comparison is carried out for all seven algorithms under various scheduling conditions. The DCP algorithm outperforms the previous algorithms by a considerable margin. Despite having a number of new features, the DCP algorithm has admissible time complexity, is economical in terms of the number of processors used and is suitable for a wide range of graph structures.
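
For flavor, a simplified static critical-path list scheduler is sketched below. It is not the DCP algorithm itself: DCP recomputes the critical path dynamically, revisits partial schedules, and accounts for communication costs, none of which this sketch does. The task graph and costs are illustrative.

```python
def critical_path_priority(tasks, succs, cost):
    """Bottom level: longest path (by cost) from each task to an exit task."""
    prio = {}
    def bl(t):
        if t not in prio:
            prio[t] = cost[t] + max((bl(s) for s in succs.get(t, [])), default=0)
        return prio[t]
    for t in tasks:
        bl(t)
    return prio

def list_schedule(tasks, succs, cost, num_procs):
    preds = {t: [] for t in tasks}
    for t, ss in succs.items():
        for s in ss:
            preds[s].append(t)
    prio = critical_path_priority(tasks, succs, cost)
    finish = {}                          # task -> finish time
    proc_free = [0.0] * num_procs        # earliest free time per processor
    order = sorted(tasks, key=lambda t: -prio[t])
    schedule = []
    while order:
        # highest-priority task whose predecessors are all scheduled
        t = next(t for t in order if all(p in finish for p in preds[t]))
        order.remove(t)
        ready = max((finish[p] for p in preds[t]), default=0.0)
        # greedily pick the processor giving the earliest start time
        p = min(range(num_procs), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[p], ready)
        finish[t] = start + cost[t]
        proc_free[p] = finish[t]
        schedule.append((t, p, start))
    return schedule

tasks = ["a", "b", "c", "d"]
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
print(list_schedule(tasks, succs, cost, num_procs=2))
```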

842 citations


Journal ArticleDOI
TL;DR: This research on parallel algorithms has not only improved the general understanding of parallelism but in several cases has led to improvements in sequential algorithms.
Abstract: In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages that are too low level, requiring specification of many details that obscure the meaning of the algorithm, and languages that are too high level, making the performance implications of various constructs unclear. In sequential computing many standard languages such as C or Pascal do a reasonable job of bridging this gap, but in parallel languages building such a bridge has been significantly more difficult.

458 citations


03 Oct 1996
TL;DR: The algorithms described in this thesis are designed to schedule cells in a very high-speed, parallel, input-queued crossbar switch, and it is proved that LQF, although too complex to implement in hardware, is stable under all admissible i.i.d. offered loads.
Abstract: The algorithms described in this thesis are designed to schedule cells in a very high-speed, parallel, input-queued crossbar switch. We present several novel scheduling algorithms that we have devised, each aiming to match the set of inputs of an input-queued switch to the set of outputs more efficiently, fairly and quickly than existing techniques. In Chapter 2 we present the simplest and fastest of these algorithms: SLIP--a parallel algorithm that uses rotating priority ("round-robin") arbitration. SLIP is simple: it is readily implemented in hardware and can operate at high speed. SLIP has high performance: for uniform i.i.d. Bernoulli arrivals, SLIP is stable for any admissible load, because the arbiters tend to desynchronize. We present analytical results to model this behavior. However, SLIP is not always stable and is not always monotonic: adding more traffic can actually make the algorithm operate more efficiently. We present an approximate analytical model of this behavior. SLIP prevents starvation: all contending inputs are eventually served. We present simulation results indicating SLIP's performance. We argue that SLIP can be readily implemented for a 32 x 32 switch on a single chip. In Chapter 3 we present i-SLIP, an iterative algorithm that improves upon SLIP by converging on a maximal size match. The performance of i-SLIP improves with up to log2 N iterations. We show that although it has a longer running time than SLIP, an i-SLIP scheduler is little more complex to implement. In Chapter 4 we describe maximum or maximal weight matching algorithms based on the occupancy of queues, or waiting times of cells. These algorithms are stable over a wider range of traffic loads. We describe two algorithms, longest queue first (LQF) and oldest cell first (OCF), and consider their performance. We prove that LQF, although too complex to implement in hardware, is stable under all admissible i.i.d. offered loads. We consider two implementable, iterative algorithms, i-LQF and i-OCF, which converge on a maximal weight matching. Finally, we present two interesting implementations of the Gale-Shapley algorithm, designed to solve the stable marriage problem.
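
The round-robin grant/accept idea can be conveyed with a minimal single-iteration sketch in the spirit of SLIP; the pointer-update rules here are simplified relative to the thesis (which, for example, updates pointers only on the first iteration of i-SLIP), and the request matrix is hypothetical.

```python
def slip_iteration(requests, grant_ptr, accept_ptr):
    """One request/grant/accept round for an N x N input-queued switch.

    requests[i][j] is True if input i has a cell for output j.
    grant_ptr[j] / accept_ptr[i] are round-robin pointers.
    Returns a partial matching as a dict {input: output}.
    """
    n = len(requests)

    # Grant phase: each output grants the requesting input nearest
    # (in round-robin order) to its grant pointer.
    grants = {}   # output -> granted input
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if requests[i][j]:
                grants[j] = i
                break

    # Accept phase: each input accepts the granting output nearest
    # to its accept pointer; matched ports advance their pointers.
    match = {}    # input -> output
    for i in range(n):
        granting = [j for j, g in grants.items() if g == i]
        for k in range(n):
            j = (accept_ptr[i] + k) % n
            if j in granting:
                match[i] = j
                accept_ptr[i] = (j + 1) % n
                grant_ptr[j] = (i + 1) % n
                break
    return match

# Hypothetical 3x3 request pattern
reqs = [[True, True, False], [True, False, False], [False, False, True]]
print(slip_iteration(reqs, grant_ptr=[0, 0, 0], accept_ptr=[0, 0, 0]))
```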

425 citations


Proceedings ArticleDOI
15 Apr 1996
TL;DR: A new VLSI architecture for a real-time pipeline FFT processor is proposed, based on a hardware-oriented radix-2^2 algorithm derived by integrating a twiddle factor decomposition technique into the divide-and-conquer approach; the algorithm has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm.
Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-2^2 algorithm is derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach. The radix-2^2 algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-N DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: log4(N)-1 complex multipliers and N-1 complex data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.
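
For reference, the butterfly structure that the radix-2^2 algorithm retains is the one below; this is a plain software radix-2 decimation-in-time FFT, not the paper's pipelined hardware architecture or its twiddle factor decomposition.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd branch
        out[k] = even[k] + t                             # butterfly, upper output
        out[k + n // 2] = even[k] - t                    # butterfly, lower output
    return out

# Small hypothetical input of length 8
print(fft_radix2([1, 2, 3, 4, 0, 0, 0, 0]))
```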

410 citations


Journal ArticleDOI
TL;DR: The proposed method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known, and it is easy to incorporate this method in silicon compilation or logic synthesis tools.
Abstract: This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1-µm CMOS ASIC technology.

370 citations


01 Jan 1996
TL;DR: This survey presents a general framework (an algorithm space) that integrates existing SAT algorithms into a unified perspective and describes sequential and parallel SAT algorithms including variable splitting, resolution, local search, global optimization, mathematical programming, and practical SAT algorithms.
Abstract: The satisfiability (SAT) problem is a core problem in mathematical logic and computing theory. In practice, SAT is fundamental in solving many problems in automated reasoning, computer aided design, computer aided manufacturing, machine vision, database, robotics, integrated circuit design, computer architecture design, and computer network design. Traditional methods treat SAT as a discrete, constrained decision problem. In recent years, many optimization methods, parallel algorithms, and practical techniques have been developed for solving SAT. In this survey, we present a general framework (an algorithm space) that integrates existing SAT algorithms into a unified perspective. We describe sequential and parallel SAT algorithms including variable splitting, resolution, local search, global optimization, mathematical programming, and practical SAT algorithms. We give performance evaluation of some existing SAT algorithms. Finally, we provide a set of practical applications of the satisfiability problems.
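
As a concrete example of the variable-splitting family surveyed here, a minimal DPLL-style search (splitting plus unit propagation) can be written in a few lines; clause lists use DIMACS-style integer literals, and the formula at the end is illustrative.

```python
def dpll(clauses, assignment=None):
    """Minimal DPLL-style SAT search by variable splitting.

    clauses: list of clauses, each a list of nonzero ints (positive/negative literals).
    Returns a satisfying assignment {var: bool} or None if unsatisfiable.
    """
    assignment = dict(assignment or {})

    def simplify(clauses, lit):
        out = []
        for c in clauses:
            if lit in c:
                continue                      # clause satisfied
            reduced = [l for l in c if l != -lit]
            if not reduced:
                return None                   # empty clause: conflict
            out.append(reduced)
        return out

    # Unit propagation
    while True:
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            break
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)
        if clauses is None:
            return None

    if not clauses:
        return assignment

    # Split on the first unassigned variable
    var = abs(clauses[0][0])
    for lit in (var, -var):
        reduced = simplify(clauses, lit)
        if reduced is not None:
            result = dpll(reduced, {**assignment, var: lit > 0})
            if result is not None:
                return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(dpll([[1, 2], [-1, 3], [-2, -3]]))
```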

329 citations


Journal ArticleDOI
TL;DR: This demonstration allows a new generation of 3D parallel thinning algorithms to be designed and proved to preserve connectivity relatively easily.

256 citations


Journal ArticleDOI
TL;DR: This paper presents two general parallel algorithms for simulated annealing that have been applied to the job shop scheduling problem and the traveling salesman problem; it is observed that superlinear speedups can be achieved using these algorithms.

179 citations


Proceedings ArticleDOI
27 Oct 1996
TL;DR: The sequential algorithm locates the cell elements intersected by isosurfaces faster than the Kd-tree searching method originally used for the Span Space algorithm, and the parallel algorithm can achieve high load balancing for massively parallel machines with distributed memory architectures.
Abstract: We present efficient sequential and parallel algorithms for isosurface extraction. Based on the Span Space data representation, new data subdivision and searching methods are described. We also present a parallel implementation with an emphasis on load balancing. Our sequential algorithm locates the cell elements intersected by isosurfaces faster than the Kd-tree searching method originally used for the Span Space algorithm. The parallel algorithm can achieve high load balancing for massively parallel machines with distributed memory architectures.
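
The core Span Space observation is that a cell is intersected by the isosurface exactly when its minimum and maximum scalar values bracket the isovalue. A sketch of that selection step (using a simple sort-by-minimum index rather than the paper's subdivision scheme or its parallel distribution) might look like this; the cell data are hypothetical.

```python
import bisect

def build_span_index(cells):
    """cells: list of (cell_id, vmin, vmax). Sort once by vmin for later queries."""
    index = sorted(cells, key=lambda c: c[1])
    mins = [c[1] for c in index]
    return index, mins

def intersected_cells(index, mins, isovalue):
    """Return ids of cells whose scalar range spans the isovalue."""
    hi = bisect.bisect_right(mins, isovalue)     # all cells with vmin <= isovalue
    return [cid for cid, vmin, vmax in index[:hi] if vmax >= isovalue]

cells = [("c0", 0.1, 0.4), ("c1", 0.3, 0.9), ("c2", 0.6, 0.8)]
index, mins = build_span_index(cells)
print(intersected_cells(index, mins, 0.5))   # -> ['c1']
```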

Journal ArticleDOI
TL;DR: Inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed-memory MIMD parallel computers, is discussed; the general d-dimensional ISP algorithm is described and empirical results with two- and three-dimensional, non-hierarchical particle methods are reported.
Abstract: We discuss inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed memory MIMD parallel computers. We consider the case of a dynamic workload distributed on a uniform mesh, and compare ISP against orthogonal recursive bisection (ORB) and a median-of-medians variant of ORB, ORB-MM. We present two results. First, ISP and ORB-MM are superior to ORB in rendering balanced workloads, because they are more fine-grained, and incur communication overheads that are comparable to ORB. Second, ISP is more attractive than ORB-MM from a software engineering standpoint because it avoids elaborate bookkeeping. Whereas ISP partitionings can be described succinctly as logically contiguous segments of the line, ORB-MM's partitionings are inherently unstructured. We describe the general d-dimensional ISP algorithm and report empirical results with two- and three-dimensional, non-hierarchical particle methods.
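
The inverse spacefilling idea can be sketched as: order the mesh cells along a space-filling curve, then cut the curve into contiguous, roughly equal-weight segments. The sketch below uses a Morton (Z-order) key purely for simplicity, not necessarily the curve used in ISP, and the uniform weights are illustrative.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer grid coordinates (Z-order curve)."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

def curve_partition(cells, weights, num_procs):
    """Cut the space-filling-curve ordering into num_procs contiguous,
    roughly equal-weight segments. cells: list of (x, y) grid coordinates."""
    order = sorted(range(len(cells)), key=lambda i: morton_key(*cells[i]))
    target = sum(weights) / num_procs
    parts, current, acc = [], [], 0.0
    for i in order:
        current.append(cells[i])
        acc += weights[i]
        if acc >= target and len(parts) < num_procs - 1:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts

cells = [(x, y) for x in range(4) for y in range(4)]
weights = [1.0] * len(cells)          # uniform load, purely illustrative
print([len(p) for p in curve_partition(cells, weights, 4)])   # -> [4, 4, 4, 4]
```

Because each partition is a contiguous segment of the curve, it can be described by two curve indices, which is the bookkeeping advantage the abstract contrasts with ORB-MM.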

Proceedings ArticleDOI
17 Nov 1996
TL;DR: This paper presents parallel algorithms for data mining of association rules, and studies the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multi-processor.
Abstract: Data mining is an emerging research area, whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criterion in supermarkets and stock trends. Many algorithms have been proposed for data mining of association rules. However, research so far has mainly focused on sequential algorithms. In this paper we present parallel algorithms for data mining of association rules, and study the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multi-processor. We further present a set of optimizations for the sequential and parallel algorithms. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm, but we observe a need for parallel I/O techniques for further performance gains.

Proceedings ArticleDOI
01 Dec 1996
TL;DR: Four parallel algorithms for mining association rules on shared nothing parallel machines are proposed to improve mining performance; the best algorithm, HPA-ELD, attains good linearity on speedup ratio and is effective for handling skew.
Abstract: We propose four parallel algorithms (NPA, SPA, HPA and HPA-ELD) for mining association rules on shared nothing parallel machines to improve mining performance. In NPA, candidate itemsets are just copied amongst all the processors, which can lead to memory overflow for large transaction databases. The remaining three algorithms partition the candidate itemsets over the processors. If it is partitioned simply (SPA), transaction data has to be broadcast to all processors. HPA partitions the candidate itemsets using a hash function to eliminate broadcasting, which also reduces the comparison workload significantly. HPA-ELD fully utilizes the available memory space by detecting the extremely large itemsets and copying them, which is also very effective at flattening the load over the processors. We implemented these algorithms in a shared nothing environment. Performance evaluations show that the best algorithm, HPA-ELD, attains good linearity on speedup ratio and is effective for handling skew.
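
The essential HPA idea, assigning each candidate itemset to an owning processor via a hash function so that only the relevant transaction subsets are routed to each owner, can be sketched as follows; in-process lists stand in for the messages a shared-nothing machine would exchange, and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def hpa_count(partitions, candidates, num_procs, k):
    """Hash-partitioned candidate counting (simplified HPA-style sketch)."""
    def owner(itemset):
        return hash(itemset) % num_procs
    owned = [{c for c in candidates if owner(c) == p} for p in range(num_procs)]
    counts = [Counter() for _ in range(num_procs)]

    for partition in partitions:              # conceptually one partition per processor
        for transaction in partition:
            for subset in combinations(sorted(transaction), k):
                c = frozenset(subset)
                p = owner(c)                  # "send" this k-subset to its owning processor
                if c in owned[p]:
                    counts[p][c] += 1

    total = Counter()                         # global result: union of per-owner counts
    for c in counts:
        total.update(c)
    return total

parts = [[("a", "b", "c")], [("a", "b"), ("b", "c")]]
cands = [frozenset(s) for s in [("a", "b"), ("b", "c"), ("a", "c")]]
print(hpa_count(parts, cands, num_procs=2, k=2))
```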

Proceedings ArticleDOI
26 Feb 1996
TL;DR: It is shown that spatial joins are well suited to processing on a parallel hardware platform; the most efficient algorithm variant shows an almost optimal speed-up under the assumption that the number of disks is sufficiently large.
Abstract: We show that spatial joins are well suited to processing on a parallel hardware platform. The parallel system is equipped with a so-called shared virtual memory which is well suited for the design and implementation of parallel spatial join algorithms. We start with an algorithm that consists of three phases: task creation, task assignment and parallel task execution. In order to reduce CPU and I/O cost, the three phases are processed in a fashion that preserves spatial locality. Dynamic load balancing is achieved by splitting tasks into smaller ones and reassigning some of the smaller tasks to idle processors. In an experimental performance comparison, we identify the advantages and disadvantages of several variants of our algorithm. The most efficient one shows an almost optimal speed-up under the assumption that the number of disks is sufficiently large.
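
A toy version of the task-creation and task-execution phases can be sketched with a regular grid: rectangles are assigned to the tiles they overlap, and each tile then becomes an independent join task. This is only a sketch of the general pattern under simplified assumptions (axis-aligned bounding boxes, a fixed grid, sequential execution of the tasks), not the paper's algorithm or its shared-virtual-memory machinery.

```python
from collections import defaultdict

def intersects(a, b):
    """Axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def grid_spatial_join(R, S, cell=1.0):
    """Tile-based spatial join: task creation assigns boxes to grid tiles,
    then each tile is processed as an independent task (here, sequentially)."""
    def tiles(box):
        x0, y0, x1, y1 = (int(v // cell) for v in box)
        return [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]

    tasks = defaultdict(lambda: ([], []))
    for i, r in enumerate(R):
        for t in tiles(r):
            tasks[t][0].append(i)
    for j, s in enumerate(S):
        for t in tiles(s):
            tasks[t][1].append(j)

    result = set()                             # set() removes duplicate reports
    for rs, ss in tasks.values():              # each tile = one parallel task
        for i in rs:
            for j in ss:
                if intersects(R[i], S[j]):
                    result.add((i, j))
    return result

R = [(0.0, 0.0, 1.5, 1.5), (3.0, 3.0, 4.0, 4.0)]
S = [(1.0, 1.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
print(grid_spatial_join(R, S))   # -> {(0, 0)}
```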

Book ChapterDOI
01 Jan 1996
TL;DR: A parallel tabu search algorithm for the vehicle routing problem under capacity and distance restrictions is described; in the neighborhood search, the algorithm uses compound moves generated by an ejection chain process.
Abstract: In this paper we describe a Parallel Tabu Search algorithm for the vehicle routing problem under capacity and distance restrictions. In the neighborhood search, the algorithm uses compound moves generated by an ejection chain process. Parallel processing is used to explore the solution space more extensively and different parallel techniques are used to accelerate the search process. Tests were carried out on a network of Sun SPARC workstations and computational results for a set of benchmark problems prove the efficiency of the algorithm proposed.

Journal ArticleDOI
TL;DR: Using this method, the standard algorithms for sequential programs computing liveness, availability, very busyness, reaching definitions, definition-use chains, or the analyses for performing code motion, assignment motion, partial dead-code elimination or strength reduction, can be transferred straightforwardly to the parallel setting at almost no cost.
Abstract: We consider parallel programs with shared memory and interleaving semantics, for which we show how to construct optimal analysis algorithms for unidirectional bitvector problems that are as efficient as their purely sequential counterparts and that can easily be implemented. Whereas the complexity result is rather obvious, our optimality result is a consequence of a new Kam/Ullman-style Coincidence Theorem. Thus, using our method, the standard algorithms for sequential programs computing liveness, availability, very busyness, reaching definitions, definition-use chains, or the analyses for performing code motion, assignment motion, partial dead-code elimination or strength reduction, can be transferred straightforwardly to the parallel setting at almost no cost.
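
For orientation, the sequential counterpart being transferred is the classic iterative bitvector analysis; a liveness solver over a control-flow graph is sketched below. The parallel-program extension with interleaving semantics is exactly what this sketch does not show. The CFG and variable sets are hypothetical.

```python
def liveness(cfg, use, defs):
    """Backward bitvector analysis: live_in[b] = use[b] | (live_out[b] - defs[b]).

    cfg: dict block -> list of successor blocks; use/defs: dict block -> set of variables.
    """
    live_in = {b: set() for b in cfg}
    live_out = {b: set() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            out = set().union(*(live_in[s] for s in cfg[b])) if cfg[b] else set()
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# Hypothetical three-block CFG: b0 -> b1 -> b2
cfg = {"b0": ["b1"], "b1": ["b2"], "b2": []}
use = {"b0": {"x"}, "b1": {"y"}, "b2": {"x", "y"}}
defs = {"b0": {"y"}, "b1": set(), "b2": set()}
print(liveness(cfg, use, defs)[0])   # live-in set per block
```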

Journal ArticleDOI
01 May 1996
TL;DR: The results reveal the speed and effectiveness of the proposed method for solving this problem, and it compares favorably with dynamic programming and a conventional genetic algorithm.
Abstract: This paper presents an application of a parallel genetic algorithm to optimal long-range generation expansion planning. The problem is formulated as a combinatorial optimization problem that determines the number of newly introduced generation units of each technology during different time intervals. A new string representation method for the problem is presented. Binary and decimal coding for the string representation method are compared. The method is implemented on transputers, a practical multiprocessor platform. The effectiveness of the proposed method is demonstrated on a typical generation expansion problem with four technologies, five intervals, and varying numbers of generation units. It compares favorably with dynamic programming and a conventional genetic algorithm. The results reveal the speed and effectiveness of the proposed method for solving this problem.
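
A minimal integer-coded GA in the spirit of the string representation described (number of new units per technology per interval) is sketched below; the cost function, constraints and parameters are placeholders, and the per-individual fitness evaluations are what a transputer-based parallel version would farm out to processors.

```python
import random

def genetic_expansion_plan(fitness, techs=4, intervals=5, max_units=3,
                           pop_size=20, generations=40, mutate_p=0.1):
    """Integer-coded GA: chromosome[i][t] = new units of technology t in interval i."""
    def random_plan():
        return [[random.randint(0, max_units) for _ in range(techs)]
                for _ in range(intervals)]

    def crossover(a, b):                       # one-point crossover over intervals
        cut = random.randint(1, intervals - 1)
        return a[:cut] + b[cut:]

    def mutate(plan):
        return [[random.randint(0, max_units) if random.random() < mutate_p else u
                 for u in interval] for interval in plan]

    pop = [random_plan() for _ in range(pop_size)]
    for _ in range(generations):
        # In a parallel implementation, these fitness evaluations run on separate PEs.
        survivors = sorted(pop, key=fitness)[: pop_size // 2]   # truncation selection
        children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=fitness)

# Toy cost: penalize deviation from a hypothetical demand of 6 new units per interval
toy_cost = lambda plan: sum(abs(sum(interval) - 6) for interval in plan)
print(genetic_expansion_plan(toy_cost))
```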

Journal ArticleDOI
TL;DR: This paper presents a classical multiscale model which consists of a label pyramid and a whole observation field, and proposes a hierarchical Markov random field model based on this classical model, which results in a relaxation algorithm with a new annealing scheme: the multitemperature annealing (MTA) scheme, which consists of associating higher temperatures to higher levels in order to be less sensitive to local minima at coarser grids.

Journal ArticleDOI
TL;DR: This work presents O(Tsequential/p + Ts(n, p)) time scalable parallel algorithms for several computational geometry problems, which use only a small number of very large messages and thereby greatly reduce the overhead for the communication protocol between processors.
Abstract: We study scalable parallel computational geometry algorithms for the coarse grained multicomputer model: p processors solving a problem on n data items, where each processor has O(n/p)≫O(1) local memory and all processors are connected via some arbitrary interconnection network (e.g. mesh, hypercube, fat tree). We present O(Tsequential/p + Ts(n, p)) time scalable parallel algorithms for several computational geometry problems. Ts(n, p) refers to the time of a global sort operation. Our results are independent of the multicomputer's interconnection network. Their time complexities become optimal when Tsequential/p dominates Ts(n, p) or when Ts(n, p) is optimal. This is the case for several standard architectures, including meshes and hypercubes, and a wide range of ratios n/p that include many of the currently available machine configurations. Our methods also have some important practical advantages: for interprocessor communication, they use only a small fixed number of calls to a single global routing operation, global sort, and all other programming is in the sequential domain. Furthermore, our algorithms use only a small number of very large messages, which greatly reduces the overhead for the communication protocol between processors. (Note, however, that our time complexities account for the lengths of messages.) Experiments show that our methods are easy to implement and give good timing results.

Journal ArticleDOI
TL;DR: The LogP model is shown to be a valuable guide in the development of parallel algorithms and a good predictor of implementation performance; the model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention.
Abstract: In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
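
A small helper shows how the four LogP parameters compose into a back-of-the-envelope cost estimate for one communication phase; the phase model and the numbers below are illustrative, not the empirically fitted CM-5 model used in the paper.

```python
def logp_all_to_all(P, m, L, o, g):
    """Rough LogP estimate for each processor sending m messages to every other processor.

    Each message costs the sender and the receiver an overhead o, successive injections
    are spaced by the gap g, and the final message additionally pays the latency L.
    Assumes no contention and fully overlapped communication.
    """
    msgs = m * (P - 1)
    pipeline = o + (msgs - 1) * max(g, o) + o   # first send + spaced sends + receive overhead
    return pipeline + L

# Illustrative parameters in microseconds (not measured values)
print(logp_all_to_all(P=32, m=4, L=6.0, o=2.0, g=4.0))
```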

Journal ArticleDOI
TL;DR: New parallel simulated annealing algorithms allow multiple Markov chains to be traced simultaneously by PEs that may communicate with each other; for graph partitioning, they can find a solution of equivalent (or even better) quality up to an order of magnitude faster than conventional parallel schemes.
Abstract: Simulated annealing is a general-purpose optimization technique capable of finding an optimal or near-optimal solution in various applications. However, the long execution time required for a good quality solution has been a major drawback in practice. Extensive studies have been carried out to develop parallel algorithms for simulated annealing. Most of them were not very successful, mainly because multiple processing elements (PEs) were required to follow a single Markov chain and, therefore, only a limited parallelism was exploited. In this paper, we propose new parallel simulated annealing algorithms which allow multiple Markov chains to be traced simultaneously by PEs which may communicate with each other. We have considered both synchronous and asynchronous implementations of the algorithms. Their performance has been analyzed in detail and also verified by extensive experimental results. It has been shown that for graph partitioning the proposed parallel simulated annealing schemes can find a solution of equivalent (or even better) quality up to an order of magnitude faster than the conventional parallel schemes. Among the proposed schemes, the one where PEs exchange information dynamically (not with a fixed period) performs best.
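
A minimal multiple-chain sketch conveys the idea: several chains anneal independently and periodically adopt the best solution found so far. The paper's synchronous and asynchronous PE communication schemes are considerably richer; the toy objective below is only for illustration.

```python
import math
import random

def multi_chain_sa(cost, neighbor, init, num_chains=4, sweeps=50,
                   exchange_every=10, t0=1.0, alpha=0.9):
    """Several annealing chains; every exchange_every sweeps all chains adopt the best state."""
    states = [init() for _ in range(num_chains)]
    costs = [cost(s) for s in states]
    temp = t0
    for sweep in range(1, sweeps + 1):
        for c in range(num_chains):                     # conceptually one chain per PE
            cand = neighbor(states[c])
            delta = cost(cand) - costs[c]
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                states[c], costs[c] = cand, costs[c] + delta
        if sweep % exchange_every == 0:                 # the "communication" step
            best = min(range(num_chains), key=lambda c: costs[c])
            states = [states[best]] * num_chains        # fine for the immutable toy state below
            costs = [costs[best]] * num_chains
        temp *= alpha
    best = min(range(num_chains), key=lambda c: costs[c])
    return states[best], costs[best]

# Toy 1-D example: minimize (x - 3)^2 over the integers
sol, val = multi_chain_sa(cost=lambda x: (x - 3) ** 2,
                          neighbor=lambda x: x + random.choice([-1, 1]),
                          init=lambda: random.randint(-10, 10))
print(sol, val)
```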

Journal ArticleDOI
TL;DR: Experiments indicate that the interval Newton/generalized bisection method works quite well on relatively small problems, providing a powerful method for finding all solutions to a problem, at least when reasonable initial bounds are not provided.

Proceedings ArticleDOI
01 Jun 1996
TL;DR: This paper presents an efficient parallel BDD package for a distributed environment such as a network of workstations or a distributed memory parallel computer that exploits a number of different forms of parallelism that can be found in depth-first algorithms.
Abstract: Large BDD applications push computing resources to their limits. One solution to overcoming resource limitations is to distribute the BDD data structure across multiple networked workstations. This paper presents an efficient parallel BDD package for a distributed environment such as a network of workstations (NOW) or a distributed memory parallel computer. The implementation exploits a number of different forms of parallelism that can be found in depth-first algorithms. Significant effort is made to limit the communication overhead, including a two-level distributed hash table and an uncomputed cache. The package simultaneously executes multiple threads of computation on a distributed BDD.

Journal ArticleDOI
TL;DR: The results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets and that cache locality requirements impose a limit on parallelism in volume rendering algorithms.
Abstract: This paper presents a parallel volume rendering algorithm that can render a 256×256×225 voxel medical data set at over 15 Hz and a 512×512×334 voxel data set at over 7 Hz on a 32-processor Silicon Graphics Challenge. The algorithm achieves these results by minimizing each of the three components of execution time: computation time, synchronization time, and data communication time. Computation time is low because the parallel algorithm is based on the recently reported shear-warp serial volume rendering algorithm, which is over five times faster than previous serial algorithms. The algorithm uses run-length encoding to exploit coherence and an efficient volume traversal to reduce overhead. Synchronization time is minimized by using dynamic load balancing and a task partition that minimizes synchronization events. Data communication costs are low because the algorithm is implemented for shared-memory multiprocessors, a class of machines with hardware support for low-latency fine-grain communication and hardware caching to hide latency. We draw two conclusions from our implementation. First, we find that on shared-memory architectures data redistribution and communication costs do not dominate rendering time. Second, we find that cache locality requirements impose a limit on parallelism in volume rendering algorithms. Specifically, our results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets.

Journal ArticleDOI
TL;DR: The cellular automata methods studied show very fast convergence to fixed points, noise stability, and improvements on real images, features that allow them to be proposed as a first-level elementary image enhancement.

Journal ArticleDOI
TL;DR: N-1 directed edge disjoint spanning trees on the star network are constructed to derive a near optimal single node broadcasting algorithm, and fault tolerant algorithms for the single node and multinode broadcasting, and for the single node and multinode scattering problems.
Abstract: Data communication and fault tolerance are important issues in parallel computers in which the processors are interconnected according to a specific topology. One way to achieve fault tolerant interprocessor communication is by exploiting the disjoint paths that exist between pairs of source and destination nodes. We construct n-1 directed edge disjoint spanning trees on the star network. These spanning trees are used to derive a near optimal single node broadcasting algorithm, and fault tolerant algorithms for the single node and multinode broadcasting, and for the single node and multinode scattering problems. Broadcasting is the distribution of the same group of messages from one processor to all the other processors. Scattering is the distribution of distinct groups of messages from one processor to all the other processors. We consider broadcasting and scattering from a single processor of the network and simultaneously from all processors of the network. The single node broadcasting algorithm offers a speed up of n-1 for a large number of messages, over the straightforward algorithm that uses a single shortest path spanning tree. Fault tolerance is achieved by transmitting the same messages through a number of edge disjoint spanning trees. The fault tolerant algorithms operate successfully in the presence of up to n-2 faulty nodes or edges in the network. No prior knowledge of the faulty nodes or edges is required. All of the algorithms operate under the store-and-forward, all-port communication model.

Journal ArticleDOI
TL;DR: It is shown that, for this type of manipulator, the inverse kinematics and the inverse dynamics procedures can be easily parallelized and the result is a closed-form efficient algorithm using n processors.
Abstract: This paper introduces a novel approach for the computation of the inverse dynamics of parallel manipulators. It is shown that, for this type of manipulator, the inverse kinematics and the inverse dynamics procedures can be easily parallelized. The result is a closed-form efficient algorithm using n processors, where n is the number of kinematic chains connecting the base to the end-effector. The dynamics computations are based on the Newton-Euler formalism. The parallel algorithm arises from a judicious choice of the coordinate frames attached to each of the legs, which allows the exploitation of the parallel nature of the mechanism itself. Examples of the application of the algorithm to a planar three-degree-of-freedom parallel manipulator and to a spatial six-degree-of-freedom parallel manipulator are presented.

Journal ArticleDOI
TL;DR: Experimental results on a Silicon Graphics Challenge multiprocessor demonstrate good overall performance for the new algorithm on small heaps, and significant performance improvements over known alternatives on large heaps with mixed insertion/deletion workloads.

Journal ArticleDOI
TL;DR: The RMRN is shown to be a truly scalable network in that each node in the network has a fixed degree of connectivity and the reconfiguration mechanism ensures a network diameter of O(log2 N) for an N-processor network.
Abstract: A reconfigurable network termed the reconfigurable multi-ring network (RMRN) is described. The RMRN is shown to be a truly scalable network in that each node in the network has a fixed degree of connectivity and the reconfiguration mechanism ensures a network diameter of O(log2 N) for an N-processor network. Algorithms for the two-dimensional mesh and the SIMD or SPMD n-cube are shown to map very elegantly onto the RMRN. Basic message passing and reconfiguration primitives for the SIMD/SPMD RMRN are designed for use as building blocks for more complex parallel algorithms. The RMRN is shown to be a viable architecture for image processing and computer vision problems using the parallel computation of the stereocorrelation imaging operation as an example. Stereocorrelation is one of the most computationally intensive imaging tasks. It is used as a visualization tool in many applications, including remote sensing, geographic information systems and robot vision.