
Showing papers on "Parallel algorithm published in 2005"


Proceedings ArticleDOI
27 Nov 2005
TL;DR: This paper designs incremental and parallel versions of a co-clustering algorithm, uses them to build an efficient real-time CF framework, and demonstrates that this approach provides accuracy comparable to that of correlation- and matrix-factorization-based approaches at a much lower computational cost.
Abstract: Collaborative filtering-based recommender systems have become extremely popular due to the increase in Web-based activities such as e-commerce and online content distribution. Current collaborative filtering (CF) techniques such as correlation and SVD based methods provide good accuracy, but are computationally expensive and can be deployed only in static off-line settings. However, a number of practical scenarios require dynamic real-time collaborative filtering that can allow new users, items and ratings to enter the system at a rapid rate. In this paper, we consider a novel CF approach based on a proposed weighted co-clustering algorithm (Banerjee et al., 2004) that involves simultaneous clustering of users and items. We design incremental and parallel versions of the co-clustering algorithm and use them to build an efficient real-time CF framework. Empirical evaluation demonstrates that our approach provides an accuracy comparable to that of the correlation and matrix factorization based approaches at a much lower computational cost.
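
As a rough illustration of the co-clustering idea described above, the Python sketch below shows one plausible prediction rule once user and item cluster assignments are available: the co-cluster average corrected by user and item biases. It is an illustrative approximation, not the paper's exact weighted co-clustering update.

```python
import numpy as np

def predict_rating(R, observed, user_cl, item_cl, u, i):
    """Illustrative co-clustering-based prediction (assumed form, not the
    paper's exact weighted formula).
    R: ratings matrix; observed: boolean mask of known entries;
    user_cl / item_cl: arrays mapping users / items to cluster ids."""
    in_g = (user_cl == user_cl[u])[:, None]      # users in u's cluster
    in_h = (item_cl == item_cl[i])[None, :]      # items in i's cluster
    co = observed & in_g & in_h
    if not co.any():
        return R[observed].mean()                # global-mean fallback
    cocluster_mean = R[co].mean()
    row_cluster_mean = R[observed & in_g].mean()
    col_cluster_mean = R[observed & in_h].mean()
    user_mean = R[u, observed[u]].mean() if observed[u].any() else row_cluster_mean
    item_mean = R[observed[:, i], i].mean() if observed[:, i].any() else col_cluster_mean
    return cocluster_mean + (user_mean - row_cluster_mean) + (item_mean - col_cluster_mean)
```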

445 citations


Journal ArticleDOI
TL;DR: An overview of the algorithms, design philosophy, and implementation techniques in the software SuperLU for solving sparse unsymmetric linear systems, with some examples of how the solver has been used in large-scale scientific applications and of its performance.
Abstract: We give an overview of the algorithms, design philosophy, and implementation techniques in the software SuperLU, for solving sparse unsymmetric linear systems. In particular, we highlight the differences between the sequential SuperLU (including its multithreaded extension) and parallel SuperLU_DIST. These include the numerical pivoting strategy, the ordering strategy for preserving sparsity, the ordering in which the updating tasks are performed, the numerical kernel, and the parallelization strategy. Because of the scalability concern, the parallel code is drastically different from the sequential one. We describe the user interfaces of the libraries, and illustrate how to use the libraries most efficiently depending on some matrix characteristics. Finally, we give some examples of how the solver has been used in large-scale scientific applications, and report its performance.

371 citations


Journal ArticleDOI
TL;DR: This paper presents a novel evolutionary optimization methodology for multiband and wide-band patch antenna designs that combines the particle swarm optimization and the finite-difference time-domain to achieve the optimum antenna satisfying a certain design criterion.
Abstract: This paper presents a novel evolutionary optimization methodology for multiband and wide-band patch antenna designs. The particle swarm optimization (PSO) and the finite-difference time-domain (FDTD) are combined to achieve the optimum antenna satisfying a certain design criterion. The antenna geometric parameters are extracted to be optimized by PSO, and a fitness function is evaluated by FDTD simulations to represent the performance of each candidate design. The optimization process is implemented on parallel clusters to reduce the computational time introduced by full-wave analysis. Two examples are investigated in the paper: first, the design of rectangular patch antennas is presented as a test of the parallel PSO/FDTD algorithm. The optimizer is then applied to design E-shaped patch antennas. It is observed that by using different fitness functions, both dual-frequency and wide-band antennas with desired performance are obtained by the optimization. The optimized E-shaped patch antennas are analyzed, fabricated, and measured to validate the robustness of the algorithm. The measured return loss of less than -18 dB (for the dual-frequency antenna) and 30.5% bandwidth (for the wide-band antenna) exhibit the prospect of the parallel PSO/FDTD algorithm in practical patch antenna designs.
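
The optimization loop itself is a standard global-best PSO; in the paper each fitness evaluation is a full FDTD simulation of a candidate geometry, which is why those evaluations are farmed out to a cluster. A minimal sketch of that loop, with the FDTD call left as a user-supplied placeholder:

```python
import numpy as np

def pso_optimize(fitness, lower, upper, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic global-best PSO sketch.  `fitness` would wrap a full-wave
    (e.g. FDTD) evaluation of one candidate antenna geometry; lower/upper
    bound the geometric parameters being optimized."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    x = rng.uniform(lower, upper, size=(n_particles, lower.size))
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[pbest_val.argmin()]
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lower, upper)
        vals = np.array([fitness(p) for p in x])   # the expensive, parallelizable step
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()]
    return gbest, pbest_val.min()
```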

306 citations


Journal ArticleDOI
TL;DR: The employment of overset-grid techniques, coupled with high-order interpolation at overset boundaries, was found to be an effective way of applying the high-order algorithm to more complex geometries than was previously possible.

269 citations


Journal Article
TL;DR: A parallel version of the particle swarm optimization (PPSO) algorithm is presented together with three communication strategies that can be used according to the independence of the data; experimental results demonstrate the usefulness of the proposed PPSO algorithm.
Abstract: Particle swarm optimization (PSO) is an alternative population-based evolutionary computation technique. It has been shown to be capable of optimizing hard mathematical problems in continuous or binary space. We present here a parallel version of the particle swarm optimization (PPSO) algorithm together with three communication strategies which can be used according to the independence of the data. The first strategy is designed for solution parameters that are independent or are only loosely correlated, such as the Rosenbrock and Rastrigin functions. The second communication strategy can be applied to parameters that are more strongly correlated such as the Griewank function. In cases where the properties of the parameters are unknown, a third hybrid communication strategy can be used. Experimental results demonstrate the usefulness of the proposed PPSO algorithm.
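
One way to picture such communication strategies is a set of subswarms that evolve independently (one per processor) and periodically share their best positions. The sketch below simulates that structure sequentially with a single broadcast-the-best exchange rule; it illustrates the shape of the scheme rather than the paper's three specific strategies.

```python
import numpy as np

def parallel_pso(fitness, lower, upper, n_swarms=4, swarm_size=10, n_iter=200,
                 exchange_every=20, w=0.7, c=1.5, seed=1):
    """Coarse-grained parallel PSO sketch: each subswarm would run on its own
    processor; every `exchange_every` iterations the overall best position is
    broadcast to all subswarms (one simple communication strategy)."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    x = rng.uniform(lower, upper, size=(n_swarms, swarm_size, lower.size))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([[fitness(p) for p in swarm] for swarm in x])
    sbest = pbest[np.arange(n_swarms), pbest_val.argmin(axis=1)]   # per-swarm best
    for t in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c * r1 * (pbest - x) + c * r2 * (sbest[:, None, :] - x)
        x = np.clip(x + v, lower, upper)
        vals = np.array([[fitness(p) for p in swarm] for swarm in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        sbest = pbest[np.arange(n_swarms), pbest_val.argmin(axis=1)]
        if (t + 1) % exchange_every == 0:            # communication step
            best_swarm = pbest_val.min(axis=1).argmin()
            sbest[:] = sbest[best_swarm]             # broadcast the global best
    flat_val = pbest_val.reshape(-1)
    return pbest.reshape(-1, lower.size)[flat_val.argmin()], flat_val.min()
```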

250 citations


Journal ArticleDOI
TL;DR: Two new parallel AMG coarsening schemes are proposed that are based solely on enforcing a maximum independent set property, resulting in sparser coarse grids; the performance of the new preconditioners is examined.
Abstract: Algebraic multigrid (AMG) is a very efficient iterative solver and preconditioner for large unstructured sparse linear systems. Traditional coarsening schemes for AMG can, however, lead to computational complexity growth as problem size increases, resulting in increased memory use and execution time, and diminished scalability. Two new parallel AMG coarsening schemes are proposed that are based solely on enforcing a maximum independent set property, resulting in sparser coarse grids. The new coarsening techniques remedy memory and execution time complexity growth for various large three-dimensional (3D) problems. If used within AMG as a preconditioner for Krylov subspace methods, the resulting iterative methods tend to converge fast. This paper discusses complexity issues that can arise in AMG, describes the new coarsening schemes, and examines the performance of the new preconditioners for various large 3D problems.
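
The coarsening schemes in question revolve around selecting an independent set of points to serve as the coarse grid. The sketch below shows a generic Luby-style randomized maximal independent set computation over a strength-of-connection graph; it is a sequential simulation of the parallelizable idea, not the paper's specific schemes.

```python
import random

def randomized_mis(adjacency, seed=0):
    """Luby-style maximal independent set over a graph given as
    {node: set(neighbours)}.  Each round, every undecided node whose random
    weight beats all undecided neighbours joins the set; its neighbours are
    then removed.  In AMG coarsening the selected nodes would become the
    coarse-grid points."""
    rng = random.Random(seed)
    undecided = set(adjacency)
    selected = set()
    while undecided:
        weight = {v: rng.random() for v in undecided}
        winners = {v for v in undecided
                   if all(weight[v] > weight[u]
                          for u in adjacency[v] if u in undecided)}
        selected |= winners
        undecided -= winners | {u for v in winners for u in adjacency[v]}
    return selected
```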

196 citations


Book ChapterDOI
30 Aug 2005
TL;DR: This study focuses on the high-performance, parallel application SMG2000, a much studied code whose variations in execution times are still not well understood, and employs multilayer neural networks trained on input data from executions on the target platform to predict performance.
Abstract: Accurately modeling and predicting performance for large-scale applications becomes increasingly difficult as system complexity scales dramatically. Analytic predictive models are useful, but are difficult to construct, usually limited in scope, and often fail to capture subtle interactions between architecture and software. In contrast, we employ multilayer neural networks trained on input data from executions on the target platform. This approach is useful for predicting many aspects of performance, and it captures full system complexity. Our models are developed automatically from the training input set, avoiding the difficult and potentially error-prone process required to develop analytic models. This study focuses on the high-performance, parallel application SMG2000, a much studied code whose variations in execution times are still not well understood. Our model predicts performance on two large-scale parallel platforms within 5%-7% error across a large, multi-dimensional parameter space.
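
The modeling approach itself is straightforward to reproduce in outline: collect (configuration, runtime) pairs from runs on the target platform and fit a small multilayer network to them. A hedged sketch using scikit-learn, purely as an illustration of the idea; the paper's feature set and network details are not reproduced here.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_runtime_model(X, y):
    """X: rows of application/system parameters (e.g. problem dimensions,
    process counts) measured on the target platform; y: observed runtimes.
    Returns a model whose .predict() estimates runtimes for new points."""
    model = make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(16, 16),
                                       max_iter=5000, random_state=0))
    return model.fit(X, y)
```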

182 citations


Book ChapterDOI
12 Dec 2005
TL;DR: A novel “lazy” list-based implementation of a concurrent set object based on an optimistic locking scheme for inserts and removes, eliminating the need to use the equivalent of an atomically markable reference.
Abstract: List-based implementations of sets are a fundamental building block of many concurrent algorithms. A skiplist based on the lock-free list-based set algorithm of Michael will be included in the JavaTM Concurrency Package of JDK 1.6.0. However, Michael's lock-free algorithm has several drawbacks, most notably that it requires all list traversal operations, including membership tests, to perform cleanup operations of logically removed nodes, and that it uses the equivalent of an atomically markable reference, a pointer that can be atomically “marked,” which is expensive in some languages and unavailable in others. We present a novel “lazy” list-based implementation of a concurrent set object. It is based on an optimistic locking scheme for inserts and removes, eliminating the need to use the equivalent of an atomically markable reference. It also has a novel wait-free membership test operation (as opposed to Michael's lock-free one) that does not need to perform cleanup operations and is more efficient than that of all previous algorithms. Empirical testing shows that the new lazy-list algorithm consistently outperforms all known algorithms, including Michael's lock-free algorithm, throughout the concurrency range. At high load, with 90% membership tests, the lazy algorithm is more than twice as fast as Michael's. This is encouraging given that typical search structure usage patterns include around 90% membership tests. By replacing the lock-free membership test of Michael's algorithm with our new wait-free one, we achieve an algorithm that slightly outperforms our new lazy-list (though it may not be as efficient in other contexts as it uses Java's RTTI mechanism to create pointers that can be atomically marked).
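
The structure of the lazy list is compact enough to sketch: each node carries a lock and a `marked` flag for logical removal, `add`/`remove` lock only the two affected nodes and re-validate after locking, and the membership test traverses without locking or cleanup. The Python below is a simplified illustration of that scheme (the original is in Java; under CPython's GIL this version does not actually run in parallel).

```python
import threading

class _Node:
    def __init__(self, key, nxt=None):
        self.key, self.next = key, nxt
        self.marked = False                  # logical-removal flag
        self.lock = threading.Lock()

class LazyList:
    """Sorted linked-list set with sentinel head/tail nodes."""
    def __init__(self):
        self.head = _Node(float('-inf'), _Node(float('inf')))

    def _locate(self, key):
        pred = self.head
        curr = pred.next
        while curr.key < key:
            pred, curr = curr, curr.next
        return pred, curr

    @staticmethod
    def _validate(pred, curr):
        return not pred.marked and not curr.marked and pred.next is curr

    def add(self, key):
        while True:                          # optimistic retry loop
            pred, curr = self._locate(key)
            with pred.lock, curr.lock:
                if self._validate(pred, curr):
                    if curr.key == key:
                        return False         # already present
                    pred.next = _Node(key, curr)
                    return True

    def remove(self, key):
        while True:
            pred, curr = self._locate(key)
            with pred.lock, curr.lock:
                if self._validate(pred, curr):
                    if curr.key != key:
                        return False
                    curr.marked = True       # logical removal
                    pred.next = curr.next    # physical unlink
                    return True

    def contains(self, key):
        # wait-free membership test: no locks, no cleanup
        curr = self.head
        while curr.key < key:
            curr = curr.next
        return curr.key == key and not curr.marked
```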

178 citations


Book
15 May 2005
TL;DR: This work attempts to bridge the gap between theory and practice, concentrating on modern algorithmic implementation on parallel architecture machines, with the main focus on parallel algorithm development, often applied to industrial problems.
Abstract: The Monte Carlo method is inherently parallel and the extensive and rapid development in vector and parallel computers has resulted in renewed and increasing interest in this method. At the same time there has been an expansion in the application areas and the method is now widely used in many important areas of science including nuclear and semiconductor physics, statistical mechanics and heat and mass transfer. This work attempts to bridge the gap between theory and practice, concentrating on modern algorithmic implementation on parallel architecture machines. Although a suitable text for final-year or postgraduate mathematicians, it is principally aimed at applied scientists: only a small amount of mathematical knowledge is assumed and theorem proving is kept to a minimum, with the main focus on parallel algorithm development, often applied to industrial problems. Algorithms are developed both for MIMD machines with distributed memory and for SIMD machines; a selection of programs is provided.

175 citations


Journal ArticleDOI
TL;DR: The communication pattern and scalability of a distributed memory implementation of the multilevel fast multipole algorithm (MLFMA), called ScaleME, are analyzed; ScaleME uses the message passing interface (MPI) for communication between processors.
Abstract: In this paper, we analyze the communication pattern and study the scalability of a distributed memory implementation of the multilevel fast multipole algorithm (MLFMA) called ScaleME. ScaleME uses the message passing interface (MPI) for communication between processors. The parallelization of MLFMA uses a novel hybrid scheme for distributing the workload across the processors. We study the communication and computational behavior and demonstrate the effectiveness of the parallelization scheme using realistic problems.

153 citations


Journal ArticleDOI
TL;DR: An algorithmic extension of Powell's UOBYQA algorithm (Unconstrained Optimization BY Quadratic Approximation) is presented, along with a new, easily comprehensible, and fully stand-alone C++ implementation of the parallel algorithm.

Proceedings ArticleDOI
15 Jun 2005
TL;DR: This work develops a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL), using machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time.
Abstract: Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the run-time environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines run-time tests that correctly select the best performing algorithm from among several competing algorithmic options in 86-100% of the cases studied, depending on the operation and the system.
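
The selection mechanism can be pictured as: benchmark each algorithmic option at installation time, fit a classifier that maps cheap input/system features to the fastest option, and query it before dispatching at run time. The sketch below uses scikit-learn and invented feature and algorithm names purely to show that shape; it is not STAPL's actual interface.

```python
from sklearn.tree import DecisionTreeClassifier

def train_selector(benchmark_features, fastest_option):
    """benchmark_features: rows like (input_size, presortedness, processors)
    gathered by installation benchmarks; fastest_option: name of the option
    that won each benchmark run.  Names are illustrative, not STAPL's."""
    return DecisionTreeClassifier(max_depth=4).fit(benchmark_features, fastest_option)

def estimate_presortedness(data, sample=1000):
    """Cheap run-time feature: fraction of adjacent pairs already in order."""
    s = list(data[:sample])
    pairs = list(zip(s, s[1:]))
    return sum(a <= b for a, b in pairs) / max(len(pairs), 1)

def adaptive_sort(selector, data, n_procs, implementations):
    """At run time, extract features, ask the trained model which registered
    implementation to use, and dispatch to it."""
    features = [[len(data), estimate_presortedness(data), n_procs]]
    choice = selector.predict(features)[0]       # e.g. 'sample_sort'
    return implementations[choice](data)
```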

Journal ArticleDOI
TL;DR: A series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized fulllocking, and cache-sensitive locking are developed, and a reduction-object-based interface for specifying a data mining algorithm is proposed.
Abstract: With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this work, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.
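
The reduction-object idea can be shown with a tiny interface: the mining loop only ever calls `reduce(key, amount)`, and the runtime decides how that update is made thread-safe. The sketch below contrasts two simplified strategies, replication and locking, in that spirit (Python threading is used purely for illustration; this is not the paper's interface, and its locking schemes vary the number and placement of locks).

```python
import threading
from collections import Counter, defaultdict

class ReplicatedReduction:
    """Replication: each thread accumulates into a private counter and the
    copies are merged once at the end (no locking in the main loop)."""
    def __init__(self):
        self._local = defaultdict(Counter)        # one counter per thread id

    def reduce(self, key, amount=1):
        self._local[threading.get_ident()][key] += amount

    def merge(self):
        total = Counter()
        for counter in self._local.values():
            total.update(counter)
        return total

class LockedReduction:
    """Locking: updates to a shared counter are serialized by a lock."""
    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def reduce(self, key, amount=1):
        with self._lock:
            self._counts[key] += amount

    def merge(self):
        return self._counts
```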

Journal ArticleDOI
TL;DR: Results suggest that shortening scheduling times leads to a higher guarantee ratio, and if parallel scheduling algorithms are applied to shorten scheduling times, the performance of heterogeneous clusters will be further enhanced.

Journal ArticleDOI
TL;DR: A parallel genetic simulated annealing (PGSA) algorithm has been developed and used to optimize the cutting parameters for the multi-pass milling process, and is shown to be more suitable and efficient for optimizing the cutting parameters for milling operations than GP+DP and PGA.
Abstract: This paper presents an approach to select the optimal machining parameters for multi-pass milling. It is based on two recent approaches, genetic algorithm (GA) and simulated annealing (SA), which have been applied to many difficult combinatorial optimization problems with certain strengths and weaknesses. In this paper, a hybrid of GA and SA (GSA) is presented to use the strengths of GA and SA and overcome their weaknesses. In order to improve the performance of GSA further, the parallel genetic simulated annealing (PGSA) has been developed and used to optimize the cutting parameters for the multi-pass milling process. For comparison, conventional parallel GA (PGA) is also chosen as another optimization method. An application example that has been solved previously using the geometric programming (GP) and dynamic programming (DP) method is presented. From the given results, PGSA is shown to be more suitable and efficient for optimizing the cutting parameters for the milling operation than GP+DP and PGA.
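
The GA/SA hybrid can be pictured as a genetic algorithm in which offspring replace their parents only under a simulated-annealing acceptance test, so worse candidates are occasionally accepted early on and acceptance tightens as the temperature drops. A generic sketch of that combination, with problem-specific operators supplied by the caller (not the paper's exact operators, encoding, or parallelization):

```python
import math
import random

def gsa_minimize(cost, random_solution, mutate, crossover,
                 pop_size=20, generations=200, t0=1.0, cooling=0.97, seed=0):
    """Generic GA + simulated-annealing hybrid: offspring are produced by
    crossover and mutation, and replace a parent with Metropolis acceptance."""
    rng = random.Random(seed)
    pop = [random_solution(rng) for _ in range(pop_size)]
    temperature = t0
    for _ in range(generations):
        for i in range(pop_size):
            mate = pop[rng.randrange(pop_size)]
            child = mutate(crossover(pop[i], mate, rng), rng)
            delta = cost(child) - cost(pop[i])
            # SA acceptance: always keep improvements, sometimes keep worse ones
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                pop[i] = child
        temperature *= cooling
    return min(pop, key=cost)
```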

Journal ArticleDOI
TL;DR: A new randomized algorithm and implementation with superior performance that for the first time achieves parallel speedup on arbitrary graphs (both regular and irregular topologies) when compared with the best sequential implementation for finding a spanning tree.

Journal ArticleDOI
01 Sep 2005
TL;DR: A Recovering Beam Search algorithm requiring polynomial time is developed for the unrelated parallel machine scheduling problem; it is able to generate approximate solutions for large instances within a few minutes of computation time.
Abstract: This paper considers the problem of scheduling jobs on unrelated parallel machines to minimize the makespan. Recovering Beam Search is a recently introduced method for obtaining approximate solutions to combinatorial optimization problems. A traditional Beam Search algorithm is a type of truncated branch-and-bound approach. However, Recovering Beam Search allows the possibility of correcting wrong decisions by replacing partial solutions with others. We develop a Recovering Beam Search algorithm for our unrelated parallel machine scheduling problem that requires polynomial time. Computational results show that it is able to generate approximate solutions for large instances (up to 1000 jobs) within a few minutes of computation time.
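
The underlying search is easy to outline: build schedules job by job, keep only the best few partial schedules (the beam) at each level, and, in the recovering variant, additionally try to repair each surviving partial schedule before expanding it. The sketch below shows a plain beam search for the unrelated-machines makespan problem; the recovering step is noted but omitted.

```python
import heapq

def beam_search_schedule(p, beam_width=5):
    """p[j][m]: processing time of job j on machine m (unrelated machines).
    Plain beam-search sketch for minimizing makespan; Recovering Beam Search
    would additionally try to improve each surviving partial schedule
    (e.g. by reassigning jobs) before the next level."""
    n_jobs, n_mach = len(p), len(p[0])
    beam = [((0,) * n_mach, ())]                 # (machine loads, assignment)
    for j in range(n_jobs):
        candidates = []
        for loads, assign in beam:
            for m in range(n_mach):
                new_loads = list(loads)
                new_loads[m] += p[j][m]
                candidates.append((tuple(new_loads), assign + (m,)))
        # keep the beam_width partial schedules with the smallest makespan
        beam = heapq.nsmallest(beam_width, candidates, key=lambda c: max(c[0]))
    loads, assign = min(beam, key=lambda c: max(c[0]))
    return max(loads), assign
```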

Proceedings ArticleDOI
18 Apr 2005
TL;DR: This work introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling in the Apriori algorithm.
Abstract: The Apriori algorithm is a popular correlation-based data mining kernel. However, it is a computationally expensive algorithm and the running times can stretch up to days for large databases, as database sizes can extend to Gigabytes. Through the use of a new extension to the systolic array architecture, time required for processing can be significantly reduced. Our array architecture implementation on a Xilinx Virtex-II Pro 100 provides a performance improvement that can be orders of magnitude faster than the state-of-the-art software implementations. The system is easily scalable and introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling.

Proceedings ArticleDOI
21 Aug 2005
TL;DR: An algorithm, called Par-CSP (Parallel Closed Sequential Pattern mining), to conduct parallel mining of closed sequential patterns on a distributed memory system by exploiting the divide-and-conquer property so that the overhead of interprocessor communication is minimized.
Abstract: Discovery of sequential patterns is an essential data mining task with broad applications. Among several variations of sequential patterns, closed sequential pattern is the most useful one since it retains all the information of the complete pattern set but is often much more compact than it. Unfortunately, there is no parallel closed sequential pattern mining method proposed yet. In this paper we develop an algorithm, called Par-CSP (Parallel Closed Sequential Pattern mining), to conduct parallel mining of closed sequential patterns on a distributed memory system. Par-CSP partitions the work among the processors by exploiting the divide-and-conquer property so that the overhead of interprocessor communication is minimized. Par-CSP applies dynamic scheduling to avoid processor idling. Moreover, it employs a technique, called selective sampling to address the load imbalance problem. We implement Par-CSP using MPI on a 64-node Linux cluster. Our experimental results show that Par-CSP attains good parallelization efficiencies on various input datasets.

Proceedings ArticleDOI
David A. Bader, Guojing Cong1, John Feo2
14 Jun 2005
TL;DR: This paper considers the performance and scalability of two graph algorithms, list ranking and connected components, on two classes of shared-memory computers: symmetric multiprocessors such as the Sun Enterprise servers and multithreaded architectures (MTA)such as the Cray MTA-2.
Abstract: Combinatorial problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. The hierarchical memory systems of symmetric multiprocessor (SMP) clusters optimize for local, contiguous memory accesses, and so are inefficient platforms for such algorithms. Few parallel graph algorithms outperform their best sequential implementation on SMP clusters due to long memory latencies and high synchronization costs. In this paper, we consider the performance and scalability of two graph algorithms, list ranking and connected components, on two classes of shared-memory computers: symmetric multiprocessors such as the Sun Enterprise servers and multithreaded architectures (MTA) such as the Cray MTA-2. While previous studies have shown that parallel graph algorithms can speedup on SMPs, the systems' reliance on cache microprocessors limits performance. The MTA's latency tolerant processors and hardware support for fine-grain synchronization make performance a function of parallelism. Since parallel graph algorithms have an abundance of parallelism, they perform and scale significantly better on the MTA. We describe and give a performance model for each architecture. We analyze the performance of the two algorithms and discuss how the features of each architecture affect algorithm development, ease of programming, performance, and scalability.
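
For a feel of why such algorithms stress memory systems, the connected-components computation can be written as repeated data-parallel sweeps over the edge list with label shortcutting, where every access is irregular. The sequential simulation below shows that structure in a generic label-propagation form (not the exact variant evaluated in the paper); each inner loop is the step the SMP and MTA versions would execute in parallel.

```python
def connected_components(n, edges):
    """Generic label-propagation / pointer-jumping sketch: labels converge to
    one representative vertex id per connected component."""
    label = list(range(n))
    changed = True
    while changed:
        changed = False
        for u, v in edges:                       # data-parallel edge sweep
            if label[u] != label[v]:
                m = min(label[u], label[v])
                label[u] = label[v] = m
                changed = True
        for i in range(n):                       # pointer jumping / shortcutting
            label[i] = label[label[i]]
    return label
```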

Journal ArticleDOI
TL;DR: The analytical and experimental performance shows that the proposed parallel algorithm has better speed-up, less communication time, and a better space reduction factor than the earlier algorithm.
Abstract: This work presents an efficient mapping scheme for the multilayer perceptron (MLP) network trained using back-propagation (BP) algorithm on network of workstations (NOWs). Hybrid partitioning (HP) scheme is used to partition the network and each partition is mapped on to processors in NOWs. We derive the processing time and memory space required to implement the parallel BP algorithm in NOWs. The performance parameters like speed-up and space reduction factor are evaluated for the HP scheme and it is compared with earlier work involving vertical partitioning (VP) scheme for mapping the MLP on NOWs. The performance of the HP scheme is evaluated by solving optical character recognition (OCR) problem in a network of ALPHA machines. The analytical and experimental performance shows that the proposed parallel algorithm has better speed-up, less communication time, and better space reduction factor than the earlier algorithm. This work also presents a simple and efficient static mapping scheme on heterogeneous system. Using divisible load scheduling theory, a closed-form expression for number of neurons assigned to each processor in the NOW is obtained. Analytical and experimental results for static mapping problem on NOWs are also presented.

Book ChapterDOI
27 Aug 2005
TL;DR: This paper describes how fine-grained parallel genetic algorithms can be mapped to the programmable graphics hardware found in commodity PCs and demonstrates the effectiveness of the approach by comparing it with a compatible software implementation.
Abstract: Parallel genetic algorithms are usually implemented on parallel machines or distributed systems. This paper describes how fine-grained parallel genetic algorithms can be mapped to the programmable graphics hardware found in commodity PCs. Our approach stores chromosomes and their fitness values in texture memory on the graphics card. Both fitness evaluation and genetic operations are implemented entirely with fragment programs executed on the graphics processing unit in parallel. We demonstrate the effectiveness of our approach by comparing it with a compatible software implementation. The presented approach allows us to benefit from the advantages of parallel genetic algorithms on a low-cost platform.

Journal ArticleDOI
TL;DR: Factoring the product of two large prime numbers using basic biological operations on a molecular computer is a breakthrough that indicates that public-key cryptosystems are perhaps insecure, and presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
Abstract: The RSA public-key cryptosystem is an algorithm that converts input data to an unrecognizable encryption and converts the unrecognizable data back into its original decryption form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates how to factor the product of two large prime numbers, a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for a parallel subtractor, a parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that cryptosystems using public keys are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.

Journal ArticleDOI
01 Jan 2005
TL;DR: A data-distributed parallel algorithm capable of aligning large-scale three-dimensional images of deformable objects; it requires less memory, aligning datasets of up to 1024x1024x590 voxels while reducing the execution time from hours to minutes, a clinically compatible time.
Abstract: Image registration is a technique for defining a geometric relationship between each point in images. This paper presents a data distributed parallel algorithm that is capable of aligning large-scale three-dimensional (3-D) images of deformable objects. The novelty of our algorithm is to overcome the limitations on the memory space as well as the execution time. In order to enable this, our algorithm incorporates data distribution, data-parallel processing, and load balancing techniques into Schnabel's registration algorithm that realizes robust and efficient alignment based on information theory and adaptive mesh refinement. We also present some experimental results obtained on a 128-CPU cluster of PCs interconnected by Myrinet and Fast Ethernet switches. The results show that our algorithm requires a smaller amount of memory, allowing it to align datasets of up to 1024x1024x590 voxel images while reducing the execution time from hours to minutes, a clinically compatible time.

01 Jan 2005
TL;DR: A parallel algorithm for distributed memory parallel computers for adaptive local refinement of tetrahedral meshes using bisection, part of PHG, Parallel Hierarchical Grid, a toolbox under development for parallel adaptive multigrid solution of PDEs.
Abstract: Local mesh refinement is one of the key steps in implementations of adaptive finite element methods. This paper presents a parallel algorithm for distributed memory parallel computers for adaptive local refinement of tetrahedral meshes using bisection. The algorithm is part of PHG, Parallel Hierarchical Grid, a toolbox under development for parallel adaptive multigrid solution of PDEs. The algorithm proposed is characterized by allowing simultaneous refinement of submeshes to arbitrary levels before synchronization between submeshes and without the need of a central coordinator process for managing new vertices. Some general properties on local refinement of conforming tetrahedral meshes using bisection are also discussed which are useful in analysing and validating the parallel refinement algorithm as well as in simplifying the implementation.

Book ChapterDOI
19 Jun 2005
TL;DR: This paper documents numerous pitfalls one may fall into when evaluating the performance of a complex system, with the hope of providing at least some help in avoiding them.
Abstract: There are many choices to make when evaluating the performance of a complex system. In the context of parallel job scheduling, one must decide what workload to use and what measurements to take. These decisions sometimes have subtle implications that are easy to overlook. In this paper we document numerous pitfalls one may fall into, with the hope of providing at least some help in avoiding them. Along the way, we also identify topics that could benefit from additional research.

Proceedings ArticleDOI
27 Dec 2005
TL;DR: The application of commodity GPUs to two kinds of ANN models, the self-organizing map (SOM) and the multilayer perceptron (MLP), is explored; the results show that ANN computation on a GPU is much faster than on a standard CPU when the neural network is large.
Abstract: Artificial neural networks (ANNs) are widely used in pattern recognition related areas. In some cases the computational load is very heavy; in others, real-time processing is required. So there is a need to apply a parallel algorithm, and the computation for an ANN is usually inherently parallel. In this paper, graphics hardware is used to speed up the computation of ANNs. In recent years, the graphics processing unit (GPU) has grown faster than the CPU, and graphics hardware vendors provide programmability on the GPU. In this paper, the application of commodity available GPUs to two kinds of ANN models is explored: one is the self-organizing map (SOM); the other is the multilayer perceptron (MLP). The computation results show that ANN computing on the GPU is much faster than on a standard CPU when the neural network is large, and some design rules for improving the efficiency on the GPU are given.
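
The reason the MLP in particular maps well to graphics hardware is that its forward pass is a chain of dense matrix products. The numpy sketch below shows that formulation; on a GPU the same products are what the fragment programs of the era (or modern shaders) would compute in parallel.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """MLP forward pass expressed as matrix products, the operation that maps
    directly onto GPU hardware.
    x: (batch, n_in); weights[k]: (n_k, n_{k+1}); biases[k]: (n_{k+1},)."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)                   # hidden layers
    return a @ weights[-1] + biases[-1]          # linear output layer

# Tiny example with random parameters: 2 inputs -> 8 hidden -> 1 output
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 8)), rng.standard_normal((8, 1))]
biases = [np.zeros(8), np.zeros(1)]
print(mlp_forward(rng.standard_normal((4, 2)), weights, biases).shape)   # (4, 1)
```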

Journal ArticleDOI
TL;DR: This paper studies the speed-up gained via adaptive mesh refinement and/or parallelization in multiphase flow in general geometries, through a parallelized, adaptive algorithm.

Journal ArticleDOI
TL;DR: The implementation of a recently proposed parallel algorithm that finds strongly connected components in distributed graphs is described, along with how it is used in a radiation transport solver.

Journal ArticleDOI
TL;DR: This paper presents an iterative list scheduling algorithm to deal with scheduling on heterogeneous computing systems and shows that in the majority of cases there is a significant improvement over the initial schedule.