
Showing papers on "Parallel algorithm published in 1996"


Book
15 Oct 1996
TL;DR: This book covers the design and coding of parallel programs, performance, and grouping data for communication in the context of parallel computing.
Abstract: Chapter 1 Introduction Chapter 2 An Overview of Parallel Computing Chapter 3 Greetings! Chapter 4 An Application: Numerical Integration Chapter 5 Collective Communication Chapter 6 Grouping Data for Communication Chapter 7 Communicators and Topologies Chapter 8 Dealing with I/O Chapter 9 Debugging Your Program Chapter 10 Design and Coding of Parallel Programs Chapter 11 Performance Chapter 12 More on Performance Chapter 13 Advanced Point-to-Point Communication Chapter 14 Parallel Algorithms Chapter 15 Parallel Libraries Chapter 16 Wrapping Up Appendix A Summary of MPI Commands Appendix B MPI on the Internet

1,357 citations


Journal ArticleDOI
TL;DR: This work considers the problem of mining association rules on a shared nothing multiprocessor and presents three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem specific information.
Abstract: We consider the problem of mining association rules on a shared nothing multiprocessor. We present three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem specific information. The best algorithm exhibits near perfect scaleup behavior, yet requires only minimal overhead compared to the current best serial algorithm.
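
As a rough illustration only: the count-distribution idea behind shared-nothing association-rule miners can be sketched as below. This is not one of the paper's three algorithms; the data, candidate itemsets and function names are hypothetical, and a real system would replace the in-process summation with an all-reduce over the network.

```python
from collections import Counter
from itertools import combinations

def local_candidate_counts(partition, candidates):
    """Count candidate itemsets against one processor's local transactions."""
    counts = Counter()
    for transaction in partition:
        items = set(transaction)
        for cand in candidates:
            if items.issuperset(cand):
                counts[cand] += 1
    return counts

def count_distribution(partitions, candidates, min_support):
    """Each 'processor' counts locally; a global sum then prunes candidates."""
    total = Counter()
    for partition in partitions:            # conceptually runs in parallel, one per node
        total.update(local_candidate_counts(partition, candidates))
    return {c: n for c, n in total.items() if n >= min_support}

# Hypothetical data: two partitions of a tiny transaction database
parts = [[("a", "b", "c"), ("a", "c")], [("b", "c"), ("a", "b", "c")]]
cands = [frozenset(p) for p in combinations("abc", 2)]
print(count_distribution(parts, cands, min_support=2))
```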

1,121 citations


Journal ArticleDOI
TL;DR: A static scheduling algorithm for allocating task graphs to fully connected multiprocessors which has admissible time complexity, is economical in terms of the number of processors used and is suitable for a wide range of graph structures.
Abstract: In this paper, we propose a static scheduling algorithm for allocating task graphs to fully connected multiprocessors. We discuss six recently reported scheduling algorithms and show that they possess one drawback or the other which can lead to poor performance. The proposed algorithm, which is called the Dynamic Critical-Path (DCP) scheduling algorithm, is different from the previously proposed algorithms in a number of ways. First, it determines the critical path of the task graph and selects the next node to be scheduled in a dynamic fashion. Second, it rearranges the schedule on each processor dynamically in the sense that the positions of the nodes in the partial schedules are not fixed until all nodes have been considered. Third, it selects a suitable processor for a node by looking ahead to the potential start times of the remaining nodes on that processor, and schedules relatively less important nodes to the processors already in use. A global as well as a pair-wise comparison is carried out for all seven algorithms under various scheduling conditions. The DCP algorithm outperforms the previous algorithms by a considerable margin. Despite having a number of new features, the DCP algorithm has admissible time complexity, is economical in terms of the number of processors used and is suitable for a wide range of graph structures.
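
For flavor, a simplified static critical-path list scheduler is sketched below. It is not the DCP algorithm itself: DCP recomputes the critical path dynamically, revisits partial schedules, and accounts for communication costs, none of which this sketch does. The task graph and costs are illustrative.

```python
def critical_path_priority(tasks, succs, cost):
    """Bottom level: longest path (by cost) from each task to an exit task."""
    prio = {}
    def bl(t):
        if t not in prio:
            prio[t] = cost[t] + max((bl(s) for s in succs.get(t, [])), default=0)
        return prio[t]
    for t in tasks:
        bl(t)
    return prio

def list_schedule(tasks, succs, cost, num_procs):
    preds = {t: [] for t in tasks}
    for t, ss in succs.items():
        for s in ss:
            preds[s].append(t)
    prio = critical_path_priority(tasks, succs, cost)
    finish = {}                          # task -> finish time
    proc_free = [0.0] * num_procs        # earliest free time per processor
    order = sorted(tasks, key=lambda t: -prio[t])
    schedule = []
    while order:
        # highest-priority task whose predecessors are all scheduled
        t = next(t for t in order if all(p in finish for p in preds[t]))
        order.remove(t)
        ready = max((finish[p] for p in preds[t]), default=0.0)
        # greedily pick the processor giving the earliest start time
        p = min(range(num_procs), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[p], ready)
        finish[t] = start + cost[t]
        proc_free[p] = finish[t]
        schedule.append((t, p, start))
    return schedule

tasks = ["a", "b", "c", "d"]
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
print(list_schedule(tasks, succs, cost, num_procs=2))
```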

842 citations


Journal ArticleDOI
TL;DR: This research on parallel algorithms has not only improved the general understanding of parallelism but in several cases has led to improvements in sequential algorithms.
Abstract: In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages that are too low level, requiring specification of many details that obscure the meaning of the algorithm, and languages that are too high level, making the performance implications of various constructs unclear. In sequential computing many standard languages such as C or Pascal do a reasonable job of bridging this gap, but in parallel languages building such a bridge has been significantly more difficult.

458 citations


03 Oct 1996
TL;DR: The algorithms described in this thesis are designed to schedule cells in a very high-speed, parallel, input-queued crossbar switch, and it is proved that LQF, although too complex to implement in hardware, is stable under all admissible i.i.d. offered loads.
Abstract: The algorithms described in this thesis are designed to schedule cells in a very high-speed, parallel, input-queued crossbar switch. We present several novel scheduling algorithms that we have devised, each aiming to match the set of inputs of an input-queued switch to the set of outputs more efficiently, fairly and quickly than existing techniques. In Chapter 2 we present the simplest and fastest of these algorithms: SLIP--a parallel algorithm that uses rotating priority ("round-robin") arbitration. SLIP is simple: it is readily implemented in hardware and can operate at high speed. SLIP has high performance: for uniform i.i.d. Bernoulli arrivals, SLIP is stable for any admissible load, because the arbiters tend to desynchronize. We present analytical results to model this behavior. However, SLIP is not always stable and is not always monotonic: adding more traffic can actually make the algorithm operate more efficiently. We present an approximate analytical model of this behavior. SLIP prevents starvation: all contending inputs are eventually served. We present simulation results indicating SLIP's performance. We argue that SLIP can be readily implemented for a 32 x 32 switch on a single chip. In Chapter 3 we present i-SLIP, an iterative algorithm that improves upon SLIP by converging on a maximal size match. The performance of i-SLIP improves with up to log2 N iterations. We show that although it has a longer running time than SLIP, an i-SLIP scheduler is little more complex to implement. In Chapter 4 we describe maximum or maximal weight matching algorithms based on the occupancy of queues, or waiting times of cells. These algorithms are stable over a wider range of traffic loads. We describe two algorithms, longest queue first (LQF) and oldest cell first (OCF), and consider their performance. We prove that LQF, although too complex to implement in hardware, is stable under all admissible i.i.d. offered loads. We consider two implementable, iterative algorithms, i-LQF and i-OCF, which converge on a maximal weight matching. Finally, we present two interesting implementations of the Gale-Shapley algorithm, designed to solve the stable marriage problem.
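
The round-robin grant/accept idea can be conveyed with a minimal single-iteration sketch in the spirit of SLIP; the pointer-update rules here are simplified relative to the thesis (which, for example, updates pointers only on the first iteration of i-SLIP), and the request matrix is hypothetical.

```python
def slip_iteration(requests, grant_ptr, accept_ptr):
    """One request/grant/accept round for an N x N input-queued switch.

    requests[i][j] is True if input i has a cell for output j.
    grant_ptr[j] / accept_ptr[i] are round-robin pointers.
    Returns a partial matching as a dict {input: output}.
    """
    n = len(requests)

    # Grant phase: each output grants the requesting input nearest
    # (in round-robin order) to its grant pointer.
    grants = {}   # output -> granted input
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if requests[i][j]:
                grants[j] = i
                break

    # Accept phase: each input accepts the granting output nearest
    # to its accept pointer; matched ports advance their pointers.
    match = {}    # input -> output
    for i in range(n):
        granting = [j for j, g in grants.items() if g == i]
        for k in range(n):
            j = (accept_ptr[i] + k) % n
            if j in granting:
                match[i] = j
                accept_ptr[i] = (j + 1) % n
                grant_ptr[j] = (i + 1) % n
                break
    return match

# Hypothetical 3x3 request pattern
reqs = [[True, True, False], [True, False, False], [False, False, True]]
print(slip_iteration(reqs, grant_ptr=[0, 0, 0], accept_ptr=[0, 0, 0]))
```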

425 citations


Proceedings ArticleDOI
15 Apr 1996
TL;DR: A new VLSI architecture for a real-time pipeline FFT processor is proposed, based on a hardware-oriented radix-2^2 algorithm derived by integrating a twiddle factor decomposition technique into the divide-and-conquer approach; the algorithm has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm.
Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-2^2 algorithm is derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach. The radix-2^2 algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-N DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: log4(N)-1 complex multipliers and N-1 complex data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.
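
For reference, the butterfly structure that the radix-2^2 algorithm retains is the one below; this is a plain software radix-2 decimation-in-time FFT, not the paper's pipelined hardware architecture or its twiddle factor decomposition.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd branch
        out[k] = even[k] + t                             # butterfly, upper output
        out[k + n // 2] = even[k] - t                    # butterfly, lower output
    return out

# Small hypothetical input of length 8
print(fft_radix2([1, 2, 3, 4, 0, 0, 0, 0]))
```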

410 citations


Journal ArticleDOI
TL;DR: The proposed method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known, and it is easy to incorporate this method in silicon compilation or logic synthesis tools.
Abstract: This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1-µm CMOS ASIC technology.

370 citations


01 Jan 1996
TL;DR: This survey presents a general framework (an algorithm space) that integrates existing SAT algorithms into a unified perspective and describes sequential and parallel SAT algorithms including variable splitting, resolution, local search, global optimization, mathematical programming, and practical SAT algorithms.
Abstract: The satisfiability (SAT) problem is a core problem in mathematical logic and computing theory. In practice, SAT is fundamental in solving many problems in automated reasoning, computer aided design, computer aided manufacturing, machine vision, database, robotics, integrated circuit design, computer architecture design, and computer network design. Traditional methods treat SAT as a discrete, constrained decision problem. In recent years, many optimization methods, parallel algorithms, and practical techniques have been developed for solving SAT. In this survey, we present a general framework (an algorithm space) that integrates existing SAT algorithms into a unified perspective. We describe sequential and parallel SAT algorithms including variable splitting, resolution, local search, global optimization, mathematical programming, and practical SAT algorithms. We give performance evaluation of some existing SAT algorithms. Finally, we provide a set of practical applications of the satisfiability problems.
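
As a concrete example of the variable-splitting family surveyed here, a minimal DPLL-style search (splitting plus unit propagation) can be written in a few lines; clause lists use DIMACS-style integer literals, and the formula at the end is illustrative.

```python
def dpll(clauses, assignment=None):
    """Minimal DPLL-style SAT search by variable splitting.

    clauses: list of clauses, each a list of nonzero ints (positive/negative literals).
    Returns a satisfying assignment {var: bool} or None if unsatisfiable.
    """
    assignment = dict(assignment or {})

    def simplify(clauses, lit):
        out = []
        for c in clauses:
            if lit in c:
                continue                      # clause satisfied
            reduced = [l for l in c if l != -lit]
            if not reduced:
                return None                   # empty clause: conflict
            out.append(reduced)
        return out

    # Unit propagation
    while True:
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            break
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)
        if clauses is None:
            return None

    if not clauses:
        return assignment

    # Split on the first unassigned variable
    var = abs(clauses[0][0])
    for lit in (var, -var):
        reduced = simplify(clauses, lit)
        if reduced is not None:
            result = dpll(reduced, {**assignment, var: lit > 0})
            if result is not None:
                return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(dpll([[1, 2], [-1, 3], [-2, -3]]))
```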

329 citations


Journal ArticleDOI
TL;DR: This demonstration allows a new generation of 3D parallel thinning algorithms to be designed and proved to preserve connectivity relatively easily.

256 citations


Journal ArticleDOI
TL;DR: This paper presents two general parallel algorithms for simulated annealing that have been applied to the job shop scheduling problem and the traveling salesman problem; it is observed that superlinear speedups can be achieved using these algorithms.

179 citations


Proceedings ArticleDOI
27 Oct 1996
TL;DR: The sequential algorithm locates the cell elements intersected by isosurfaces faster than the Kd-tree searching method originally used for the Span Space algorithm, and the parallel algorithm can achieve high load balancing for massively parallel machines with distributed memory architectures.
Abstract: We present efficient sequential and parallel algorithms for isosurface extraction. Based on the Span Space data representation, new data subdivision and searching methods are described. We also present a parallel implementation with an emphasis on load balancing. Our sequential algorithm locates the cell elements intersected by isosurfaces faster than the Kd-tree searching method originally used for the Span Space algorithm. The parallel algorithm can achieve high load balancing for massively parallel machines with distributed memory architectures.
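
The core Span Space observation is that a cell is intersected by the isosurface exactly when its minimum and maximum scalar values bracket the isovalue. A sketch of that selection step (using a simple sort-by-minimum index rather than the paper's subdivision scheme or its parallel distribution) might look like this; the cell data are hypothetical.

```python
import bisect

def build_span_index(cells):
    """cells: list of (cell_id, vmin, vmax). Sort once by vmin for later queries."""
    index = sorted(cells, key=lambda c: c[1])
    mins = [c[1] for c in index]
    return index, mins

def intersected_cells(index, mins, isovalue):
    """Return ids of cells whose scalar range spans the isovalue."""
    hi = bisect.bisect_right(mins, isovalue)     # all cells with vmin <= isovalue
    return [cid for cid, vmin, vmax in index[:hi] if vmax >= isovalue]

cells = [("c0", 0.1, 0.4), ("c1", 0.3, 0.9), ("c2", 0.6, 0.8)]
index, mins = build_span_index(cells)
print(intersected_cells(index, mins, 0.5))   # -> ['c1']
```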

Journal ArticleDOI
TL;DR: Inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed-memory MIMD parallel computers, is discussed; the general d-dimensional ISP algorithm is described and empirical results with two- and three-dimensional, non-hierarchical particle methods are reported.
Abstract: We discuss inverse spacefilling partitioning (ISP), a partitioning strategy for non-uniform scientific computations running on distributed memory MIMD parallel computers. We consider the case of a dynamic workload distributed on a uniform mesh, and compare ISP against orthogonal recursive bisection (ORB) and a median-of-medians variant of ORB, ORB-MM. We present two results. First, ISP and ORB-MM are superior to ORB in rendering balanced workloads, because they are more fine-grained, and incur communication overheads that are comparable to ORB. Second, ISP is more attractive than ORB-MM from a software engineering standpoint because it avoids elaborate bookkeeping. Whereas ISP partitionings can be described succinctly as logically contiguous segments of the line, ORB-MM's partitionings are inherently unstructured. We describe the general d-dimensional ISP algorithm and report empirical results with two- and three-dimensional, non-hierarchical particle methods.
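
The inverse spacefilling idea can be sketched as: order the mesh cells along a space-filling curve, then cut the curve into contiguous, roughly equal-weight segments. The sketch below uses a Morton (Z-order) key purely for simplicity, not necessarily the curve used in ISP, and the uniform weights are illustrative.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer grid coordinates (Z-order curve)."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

def curve_partition(cells, weights, num_procs):
    """Cut the space-filling-curve ordering into num_procs contiguous,
    roughly equal-weight segments. cells: list of (x, y) grid coordinates."""
    order = sorted(range(len(cells)), key=lambda i: morton_key(*cells[i]))
    target = sum(weights) / num_procs
    parts, current, acc = [], [], 0.0
    for i in order:
        current.append(cells[i])
        acc += weights[i]
        if acc >= target and len(parts) < num_procs - 1:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts

cells = [(x, y) for x in range(4) for y in range(4)]
weights = [1.0] * len(cells)          # uniform load, purely illustrative
print([len(p) for p in curve_partition(cells, weights, 4)])   # -> [4, 4, 4, 4]
```

Because each partition is a contiguous segment of the curve, it can be described by two curve indices, which is the bookkeeping advantage the abstract contrasts with ORB-MM.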

Proceedings ArticleDOI
17 Nov 1996
TL;DR: This paper presents parallel algorithms for data mining of association rules, and studies the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multi-processor.
Abstract: Data mining is an emerging research area, whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criterion in supermarkets and stock trends. Many algorithms have been proposed for data mining of association rules. However, research so far has mainly focused on sequential algorithms. In this paper we present parallel algorithms for data mining of association rules, and study the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multi-processor. We further present a set of optimizations for the sequential and parallel algorithms. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm, but we observe a need for parallel I/O techniques for further performance gains.

Proceedings ArticleDOI
01 Dec 1996
TL;DR: Four parallel algorithms for mining association rules on shared nothing parallel machines are proposed to improve mining performance; the best algorithm, HPA-ELD, attains good linearity on speedup ratio and is effective for handling skew.
Abstract: We propose four parallel algorithms (NPA, SPA, HPA and HPA-ELD) for mining association rules on shared nothing parallel machines to improve mining performance. In NPA, candidate itemsets are just copied amongst all the processors, which can lead to memory overflow for large transaction databases. The remaining three algorithms partition the candidate itemsets over the processors. If it is partitioned simply (SPA), transaction data has to be broadcast to all processors. HPA partitions the candidate itemsets using a hash function to eliminate broadcasting, which also reduces the comparison workload significantly. HPA-ELD fully utilizes the available memory space by detecting the extremely large itemsets and copying them, which is also very effective at flattening the load over the processors. We implemented these algorithms in a shared nothing environment. Performance evaluations show that the best algorithm, HPA-ELD, attains good linearity on speedup ratio and is effective for handling skew.
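
The essential HPA idea, assigning each candidate itemset to an owning processor via a hash function so that only the relevant transaction subsets are routed to each owner, can be sketched as follows; in-process lists stand in for the messages a shared-nothing machine would exchange, and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def hpa_count(partitions, candidates, num_procs, k):
    """Hash-partitioned candidate counting (simplified HPA-style sketch)."""
    def owner(itemset):
        return hash(itemset) % num_procs
    owned = [{c for c in candidates if owner(c) == p} for p in range(num_procs)]
    counts = [Counter() for _ in range(num_procs)]

    for partition in partitions:              # conceptually one partition per processor
        for transaction in partition:
            for subset in combinations(sorted(transaction), k):
                c = frozenset(subset)
                p = owner(c)                  # "send" this k-subset to its owning processor
                if c in owned[p]:
                    counts[p][c] += 1

    total = Counter()                         # global result: union of per-owner counts
    for c in counts:
        total.update(c)
    return total

parts = [[("a", "b", "c")], [("a", "b"), ("b", "c")]]
cands = [frozenset(s) for s in [("a", "b"), ("b", "c"), ("a", "c")]]
print(hpa_count(parts, cands, num_procs=2, k=2))
```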

Proceedings ArticleDOI
26 Feb 1996
TL;DR: It is shown that spatial joins are well suited to processing on a parallel hardware platform; the most efficient algorithm variant shows an almost optimal speed-up under the assumption that the number of disks is sufficiently large.
Abstract: We show that spatial joins are well suited to processing on a parallel hardware platform. The parallel system is equipped with a so-called shared virtual memory which is well suited for the design and implementation of parallel spatial join algorithms. We start with an algorithm that consists of three phases: task creation, task assignment and parallel task execution. In order to reduce CPU and I/O cost, the three phases are processed in a fashion that preserves spatial locality. Dynamic load balancing is achieved by splitting tasks into smaller ones and reassigning some of the smaller tasks to idle processors. In an experimental performance comparison, we identify the advantages and disadvantages of several variants of our algorithm. The most efficient one shows an almost optimal speed-up under the assumption that the number of disks is sufficiently large.
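
A toy version of the task-creation and task-execution phases can be sketched with a regular grid: rectangles are assigned to the tiles they overlap, and each tile then becomes an independent join task. This is only a sketch of the general pattern under simplified assumptions (axis-aligned bounding boxes, a fixed grid, sequential execution of the tasks), not the paper's algorithm or its shared-virtual-memory machinery.

```python
from collections import defaultdict

def intersects(a, b):
    """Axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def grid_spatial_join(R, S, cell=1.0):
    """Tile-based spatial join: task creation assigns boxes to grid tiles,
    then each tile is processed as an independent task (here, sequentially)."""
    def tiles(box):
        x0, y0, x1, y1 = (int(v // cell) for v in box)
        return [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]

    tasks = defaultdict(lambda: ([], []))
    for i, r in enumerate(R):
        for t in tiles(r):
            tasks[t][0].append(i)
    for j, s in enumerate(S):
        for t in tiles(s):
            tasks[t][1].append(j)

    result = set()                             # set() removes duplicate reports
    for rs, ss in tasks.values():              # each tile = one parallel task
        for i in rs:
            for j in ss:
                if intersects(R[i], S[j]):
                    result.add((i, j))
    return result

R = [(0.0, 0.0, 1.5, 1.5), (3.0, 3.0, 4.0, 4.0)]
S = [(1.0, 1.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
print(grid_spatial_join(R, S))   # -> {(0, 0)}
```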

Book ChapterDOI
01 Jan 1996
TL;DR: A parallel tabu search algorithm for the vehicle routing problem under capacity and distance restrictions is described; in the neighborhood search, the algorithm uses compound moves generated by an ejection chain process.
Abstract: In this paper we describe a Parallel Tabu Search algorithm for the vehicle routing problem under capacity and distance restrictions. In the neighborhood search, the algorithm uses compound moves generated by an ejection chain process. Parallel processing is used to explore the solution space more extensively and different parallel techniques are used to accelerate the search process. Tests were carried out on a network of Sun SPARC workstations and computational results for a set of benchmark problems prove the efficiency of the algorithm proposed.

Journal ArticleDOI
TL;DR: Using this method, the standard algorithms for sequential programs computing liveness, availability, very busyness, reaching definitions, definition-use chains, or the analyses for performing code motion, assignment motion, partial dead-code elimination or strength reduction, can be transferred straightforwardly to the parallel setting at almost no cost.
Abstract: We consider parallel programs with shared memory and interleaving semantics, for which we show how to construct optimal analysis algorithms for unidirectional bitvector problems that are as efficient as their purely sequential counterparts and that can easily be implemented. Whereas the complexity result is rather obvious, our optimality result is a consequence of a new Kam/Ullman-style Coincidence Theorem. Thus, using our method, the standard algorithms for sequential programs computing liveness, availability, very busyness, reaching definitions, definition-use chains, or the analyses for performing code motion, assignment motion, partial dead-code elimination or strength reduction, can be transferred straightforwardly to the parallel setting at almost no cost.
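
For orientation, the sequential counterpart being transferred is the classic iterative bitvector analysis; a liveness solver over a control-flow graph is sketched below. The parallel-program extension with interleaving semantics is exactly what this sketch does not show. The CFG and variable sets are hypothetical.

```python
def liveness(cfg, use, defs):
    """Backward bitvector analysis: live_in[b] = use[b] | (live_out[b] - defs[b]).

    cfg: dict block -> list of successor blocks; use/defs: dict block -> set of variables.
    """
    live_in = {b: set() for b in cfg}
    live_out = {b: set() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            out = set().union(*(live_in[s] for s in cfg[b])) if cfg[b] else set()
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# Hypothetical three-block CFG: b0 -> b1 -> b2
cfg = {"b0": ["b1"], "b1": ["b2"], "b2": []}
use = {"b0": {"x"}, "b1": {"y"}, "b2": {"x", "y"}}
defs = {"b0": {"y"}, "b1": set(), "b2": set()}
print(liveness(cfg, use, defs)[0])   # live-in set per block
```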

Journal ArticleDOI
01 May 1996
TL;DR: The results reveal the speed and effectiveness of the proposed method for solving this problem, and it compares favorably with dynamic programming and a conventional genetic algorithm.
Abstract: This paper presents an application of a parallel genetic algorithm to optimal long-range generation expansion planning. The problem is formulated as a combinatorial optimization problem that determines the number of newly introduced generation units of each technology during different time intervals. A new string representation method for the problem is presented. Binary and decimal coding for the string representation method are compared. The method is implemented on transputers, a practical multiprocessor platform. The effectiveness of the proposed method is demonstrated on a typical generation expansion problem with four technologies, five intervals, and varying numbers of generation units. It compares favorably with dynamic programming and a conventional genetic algorithm. The results reveal the speed and effectiveness of the proposed method for solving this problem.
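
A minimal integer-coded GA in the spirit of the string representation described (number of new units per technology per interval) is sketched below; the cost function, constraints and parameters are placeholders, and the per-individual fitness evaluations are what a transputer-based parallel version would farm out to processors.

```python
import random

def genetic_expansion_plan(fitness, techs=4, intervals=5, max_units=3,
                           pop_size=20, generations=40, mutate_p=0.1):
    """Integer-coded GA: chromosome[i][t] = new units of technology t in interval i."""
    def random_plan():
        return [[random.randint(0, max_units) for _ in range(techs)]
                for _ in range(intervals)]

    def crossover(a, b):                       # one-point crossover over intervals
        cut = random.randint(1, intervals - 1)
        return a[:cut] + b[cut:]

    def mutate(plan):
        return [[random.randint(0, max_units) if random.random() < mutate_p else u
                 for u in interval] for interval in plan]

    pop = [random_plan() for _ in range(pop_size)]
    for _ in range(generations):
        # In a parallel implementation, these fitness evaluations run on separate PEs.
        survivors = sorted(pop, key=fitness)[: pop_size // 2]   # truncation selection
        children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=fitness)

# Toy cost: penalize deviation from a hypothetical demand of 6 new units per interval
toy_cost = lambda plan: sum(abs(sum(interval) - 6) for interval in plan)
print(genetic_expansion_plan(toy_cost))
```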

Journal ArticleDOI
TL;DR: This paper presents a classical multiscale model which consists of a label pyramid and a whole observation field, and proposes a hierarchical Markov random field model based on this classical model, which results in a relaxation algorithm with a new annealing scheme: the multitemperature annealing (MTA) scheme, which consists of associating higher temperatures to higher levels in order to be less sensitive to local minima at coarser grids.

Journal ArticleDOI
TL;DR: This work presents O(Tsequential/p + Ts(n, p)) time scalable parallel algorithms for several computational geometry problems, which use only a small number of very large messages and thereby greatly reduce the overhead for the communication protocol between processors.
Abstract: We study scalable parallel computational geometry algorithms for the coarse grained multicomputer model: p processors solving a problem on n data items, where each processor has O(n/p)≫O(1) local memory and all processors are connected via some arbitrary interconnection network (e.g. mesh, hypercube, fat tree). We present O(Tsequential/p + Ts(n, p)) time scalable parallel algorithms for several computational geometry problems. Ts(n, p) refers to the time of a global sort operation. Our results are independent of the multicomputer's interconnection network. Their time complexities become optimal when Tsequential/p dominates Ts(n, p) or when Ts(n, p) is optimal. This is the case for several standard architectures, including meshes and hypercubes, and a wide range of ratios n/p that include many of the currently available machine configurations. Our methods also have some important practical advantages: for interprocessor communication, they use only a small fixed number of calls to a single global routing operation, global sort, and all other programming is in the sequential domain. Furthermore, our algorithms use only a small number of very large messages, which greatly reduces the overhead for the communication protocol between processors. (Note, however, that our time complexities account for the lengths of messages.) Experiments show that our methods are easy to implement and give good timing results.

Journal ArticleDOI
TL;DR: The LogP model is shown to be a valuable guide in the development of parallel algorithms and a good predictor of implementation performance; the model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention.
Abstract: In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
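
A small helper shows how the four LogP parameters compose into a back-of-the-envelope cost estimate for one communication phase; the phase model and the numbers below are illustrative, not the empirically fitted CM-5 model used in the paper.

```python
def logp_all_to_all(P, m, L, o, g):
    """Rough LogP estimate for each processor sending m messages to every other processor.

    Each message costs the sender and the receiver an overhead o, successive injections
    are spaced by the gap g, and the final message additionally pays the latency L.
    Assumes no contention and fully overlapped communication.
    """
    msgs = m * (P - 1)
    pipeline = o + (msgs - 1) * max(g, o) + o   # first send + spaced sends + receive overhead
    return pipeline + L

# Illustrative parameters in microseconds (not measured values)
print(logp_all_to_all(P=32, m=4, L=6.0, o=2.0, g=4.0))
```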

Journal ArticleDOI
TL;DR: New parallel simulated annealing algorithms allow multiple Markov chains to be traced simultaneously by PEs that may communicate with each other; for graph partitioning, they can find a solution of equivalent (or even better) quality up to an order of magnitude faster than conventional parallel schemes.
Abstract: Simulated annealing is a general-purpose optimization technique capable of finding an optimal or near-optimal solution in various applications. However, the long execution time required for a good quality solution has been a major drawback in practice. Extensive studies have been carried out to develop parallel algorithms for simulated annealing. Most of them were not very successful, mainly because multiple processing elements (PEs) were required to follow a single Markov chain and, therefore, only a limited parallelism was exploited. In this paper, we propose new parallel simulated annealing algorithms which allow multiple Markov chains to be traced simultaneously by PEs which may communicate with each other. We have considered both synchronous and asynchronous implementations of the algorithms. Their performance has been analyzed in detail and also verified by extensive experimental results. It has been shown that for graph partitioning the proposed parallel simulated annealing schemes can find a solution of equivalent (or even better) quality up to an order of magnitude faster than the conventional parallel schemes. Among the proposed schemes, the one where PEs exchange information dynamically (not with a fixed period) performs best.
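
A minimal multiple-chain sketch conveys the idea: several chains anneal independently and periodically adopt the best solution found so far. The paper's synchronous and asynchronous PE communication schemes are considerably richer; the toy objective below is only for illustration.

```python
import math
import random

def multi_chain_sa(cost, neighbor, init, num_chains=4, sweeps=50,
                   exchange_every=10, t0=1.0, alpha=0.9):
    """Several annealing chains; every exchange_every sweeps all chains adopt the best state."""
    states = [init() for _ in range(num_chains)]
    costs = [cost(s) for s in states]
    temp = t0
    for sweep in range(1, sweeps + 1):
        for c in range(num_chains):                     # conceptually one chain per PE
            cand = neighbor(states[c])
            delta = cost(cand) - costs[c]
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                states[c], costs[c] = cand, costs[c] + delta
        if sweep % exchange_every == 0:                 # the "communication" step
            best = min(range(num_chains), key=lambda c: costs[c])
            states = [states[best]] * num_chains        # fine for the immutable toy state below
            costs = [costs[best]] * num_chains
        temp *= alpha
    best = min(range(num_chains), key=lambda c: costs[c])
    return states[best], costs[best]

# Toy 1-D example: minimize (x - 3)^2 over the integers
sol, val = multi_chain_sa(cost=lambda x: (x - 3) ** 2,
                          neighbor=lambda x: x + random.choice([-1, 1]),
                          init=lambda: random.randint(-10, 10))
print(sol, val)
```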

Journal ArticleDOI
TL;DR: Experiments indicate that the interval Newton/generalized bisection method works quite well on relatively small problems, providing a powerful method for finding all solutions to a problem, at least when reasonable initial bounds are not provided.

Proceedings ArticleDOI
01 Jun 1996
TL;DR: This paper presents an efficient parallel BDD package for a distributed environment such as a network of workstations or a distributed memory parallel computer that exploits a number of different forms of parallelism that can be found in depth-first algorithms.
Abstract: Large BDD applications push computing resources to their limits. One solution to overcoming resource limitations is to distribute the BDD data structure across multiple networked workstations. This paper presents an efficient parallel BDD package for a distributed environment such as a network of workstations (NOW) or a distributed memory parallel computer. The implementation exploits a number of different forms of parallelism that can be found in depth-first algorithms. Significant effort is made to limit the communication overhead, including a two-level distributed hash table and an uncomputed cache. The package simultaneously executes multiple threads of computation on a distributed BDD.

Journal ArticleDOI
TL;DR: The results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets and that cache locality requirements impose a limit on parallelism in volume rendering algorithms.
Abstract: This paper presents a parallel volume rendering algorithm that can render a 256×256×225 voxel medical data set at over 15 Hz and a 512×512×334 voxel data set at over 7 Hz on a 32-processor Silicon Graphics Challenge. The algorithm achieves these results by minimizing each of the three components of execution time: computation time, synchronization time, and data communication time. Computation time is low because the parallel algorithm is based on the recently reported shear-warp serial volume rendering algorithm, which is over five times faster than previous serial algorithms. The algorithm uses run-length encoding to exploit coherence and an efficient volume traversal to reduce overhead. Synchronization time is minimized by using dynamic load balancing and a task partition that minimizes synchronization events. Data communication costs are low because the algorithm is implemented for shared-memory multiprocessors, a class of machines with hardware support for low-latency fine-grain communication and hardware caching to hide latency. We draw two conclusions from our implementation. First, we find that on shared-memory architectures data redistribution and communication costs do not dominate rendering time. Second, we find that cache locality requirements impose a limit on parallelism in volume rendering algorithms. Specifically, our results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets.

Journal ArticleDOI
TL;DR: The cellular automata methods studied show very fast convergence to fixed points, noise stability, and improvements on real images, features that allow them to be proposed as a first-level elementary image enhancement.

Journal ArticleDOI
TL;DR: N-1 directed edge disjoint spanning trees on the star network are constructed to derive a near optimal single node broadcasting algorithm, and fault tolerant algorithms for the single node and multinode broadcasting, and for the single node and multinode scattering problems.
Abstract: Data communication and fault tolerance are important issues in parallel computers in which the processors are interconnected according to a specific topology. One way to achieve fault tolerant interprocessor communication is by exploiting the disjoint paths that exist between pairs of source and destination nodes. We construct n-1 directed edge disjoint spanning trees on the star network. These spanning trees are used to derive a near optimal single node broadcasting algorithm, and fault tolerant algorithms for the single node and multinode broadcasting, and for the single node and multinode scattering problems. Broadcasting is the distribution of the same group of messages from one processor to all the other processors. Scattering is the distribution of distinct groups of messages from one processor to all the other processors. We consider broadcasting and scattering from a single processor of the network and simultaneously from all processors of the network. The single node broadcasting algorithm offers a speed up of n-1 for a large number of messages, over the straightforward algorithm that uses a single shortest path spanning tree. Fault tolerance is achieved by transmitting the same messages through a number of edge disjoint spanning trees. The fault tolerant algorithms operate successfully in the presence of up to n-2 faulty nodes or edges in the network. No prior knowledge of the faulty nodes or edges is required. All of the algorithms operate under the store-and-forward, all-port communication model.

Journal ArticleDOI
TL;DR: It is shown that, for this type of manipulator, the inverse kinematics and the inverse dynamics procedures can be easily parallelized and the result is a closed-form efficient algorithm using n processors.
Abstract: This paper introduces a novel approach for the computation of the inverse dynamics of parallel manipulators. It is shown that, for this type of manipulator, the inverse kinematics and the inverse dynamics procedures can be easily parallelized. The result is a closed-form efficient algorithm using n processors, where n is the number of kinematic chains connecting the base to the end-effector. The dynamics computations are based on the Newton-Euler formalism. The parallel algorithm arises from a judicious choice of the coordinate frames attached to each of the legs, which allows the exploitation of the parallel nature of the mechanism itself. Examples of the application of the algorithm to a planar three-degree-of-freedom parallel manipulator and to a spatial six-degree-of-freedom parallel manipulator are presented.

Journal ArticleDOI
TL;DR: Experimental results on a Silicon Graphics Challenge multiprocessor demonstrate good overall performance for the new algorithm on small heaps, and significant performance improvements over known alternatives on large heaps with mixed insertion/deletion workloads.

Journal ArticleDOI
TL;DR: The RMRN is shown to be a truly scalable network in that each node in the network has a fixed degree of connectivity and the reconfiguration mechanism ensures a network diameter of O(log2 N) for an N-processor network.
Abstract: A reconfigurable network termed the reconfigurable multi-ring network (RMRN) is described. The RMRN is shown to be a truly scalable network in that each node in the network has a fixed degree of connectivity and the reconfiguration mechanism ensures a network diameter of O(log2 N) for an N-processor network. Algorithms for the two-dimensional mesh and the SIMD or SPMD n-cube are shown to map very elegantly onto the RMRN. Basic message passing and reconfiguration primitives for the SIMD/SPMD RMRN are designed for use as building blocks for more complex parallel algorithms. The RMRN is shown to be a viable architecture for image processing and computer vision problems using the parallel computation of the stereocorrelation imaging operation as an example. Stereocorrelation is one of the most computationally intensive imaging tasks. It is used as a visualization tool in many applications, including remote sensing, geographic information systems and robot vision.