
Showing papers on "Parallel algorithm published in 2011"


Journal ArticleDOI
TL;DR: This work investigates two representative ways of approximating the dense similarity matrix, picks the strategy of sparsifying the matrix by retaining nearest neighbors, and parallelizes it so that large clustering problems can be handled effectively.
Abstract: Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863 instances, we show that our parallel algorithm can effectively handle large problems.

591 citations
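A minimal serial sketch of the retained-nearest-neighbor strategy described above, assuming a Gaussian similarity, the normalized spectral embedding, and plain Lloyd's k-means; the parameter names (t, sigma) are illustrative, and the paper's contribution is distributing this work across machines, which the sketch omits:

import numpy as np

def knn_sparse_spectral(X, k, t=10, sigma=1.0, iters=50, seed=0):
    """Toy spectral clustering with a t-nearest-neighbor sparsified similarity.

    X is an (n, d) array; this O(n^2) version only illustrates the strategy.
    """
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    # Sparsify: keep only the t largest similarities per row, then symmetrize.
    keep = np.argsort(S, axis=1)[:, -t:]
    W = np.zeros_like(S)
    rows = np.repeat(np.arange(n), t)
    W[rows, keep.ravel()] = S[rows, keep.ravel()]
    W = np.maximum(W, W.T)
    # Normalized Laplacian; its k smallest eigenvectors form the embedding.
    d = W.sum(1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Plain Lloyd's k-means on the row-normalized spectral embedding.
    rng = np.random.default_rng(seed)
    C = U[rng.choice(n, k, replace=False)]
    for _ in range(iters):
        labels = ((U[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = U[labels == j].mean(0)
    return labels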


Book
22 Aug 2011
TL;DR: The algorithms apply a novel “random-like” deterministic technique that provides for a fast and efficient breaking of an apparently symmetric situation in parallel and distributed computation.
Abstract: The following problem is considered: given a linked list of length n, compute the distance from each element of the linked list to the end of the list. The problem has two standard deterministic algorithms: a linear time serial algorithm, and an O(log n) time parallel algorithm using n processors. We present new deterministic parallel algorithms for the problem. Our strongest results are (1) O(log n log* n) time using n/(log n log* n) processors (this algorithm achieves optimal speed-up); (2) O(log n) time using n log^(k) n/log n processors, for any fixed positive integer k. The algorithms apply a novel “random-like” deterministic technique. This technique provides for a fast and efficient breaking of an apparently symmetric situation in parallel and distributed computation.

474 citations
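The O(log n), n-processor baseline that this paper improves on is pointer jumping; a serial numpy simulation of its rounds, with numpy's vectorized indexing standing in for the processors:

import numpy as np

def list_rank(succ):
    """Distance from each node to the end of a linked list via pointer jumping.

    succ[i] is the successor of node i; the tail points to itself. Each round
    halves the remaining path lengths, so O(log n) rounds suffice, and every
    assignment within a round is independent (one processor per node).
    """
    succ = succ.copy()
    dist = (succ != np.arange(len(succ))).astype(int)  # one hop each, except the tail
    while (succ != succ[succ]).any():
        dist = dist + dist[succ]   # add the distance my successor already covers
        succ = succ[succ]          # jump: my new successor is my successor's successor
    return dist

# list_rank(np.array([1, 2, 3, 3])) -> array([3, 2, 1, 0])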


Book
09 Sep 2011
TL;DR: In this paper, a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time is given, and the constant in the running time is small.
Abstract: We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and O(log n) time. The constant in the running time is still moderate, though not as small.

346 citations


Journal ArticleDOI
TL;DR: This work proposes a new parallel bi-objective hybrid genetic algorithm that takes into account not only makespan but also energy consumption, and focuses on the island parallel model and the multi-start parallel model.

327 citations


Journal ArticleDOI
TL;DR: This PEGA, consisting of two parallel EGAs along with a migration operator, maintains better population diversity, inhibits premature convergence, and preserves parallelism in comparison with conventional GAs, thus significantly expediting computation.
Abstract: This paper presents a parallel elite genetic algorithm (PEGA) and its application to global path planning for autonomous mobile robots navigating in structured environments. This PEGA, consisting of two parallel EGAs along with a migration operator, takes advantage of maintaining better population diversity, inhibiting premature convergence, and keeping parallelism in comparison with conventional GAs. The initial feasible path generated by the PEGA planner is then smoothed using the cubic B-spline technique, in order to construct a near-optimal collision-free continuous path. Both the global path planner and the smoother are implemented in one field-programmable gate array chip utilizing the system-on-a-programmable-chip technology and a pipelined hardware implementation scheme, thus significantly expediting computation speed. Simulation and experimental results show the merit of the proposed PEGA path planner and smoother for global path planning of autonomous mobile robots.

254 citations
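A toy sketch of the island model with a migration operator, in the spirit of the two parallel EGAs described above; the bit-string encoding, the operators, and every parameter here are illustrative, not the paper's:

import random

def island_ga(fitness, dim, pop=30, gens=200, migrate_every=20, elite=2, seed=0):
    """Two-island GA with elitism and periodic migration of the best individual."""
    rng = random.Random(seed)

    def rand_ind():
        return [rng.randint(0, 1) for _ in range(dim)]

    def evolve(pop_):
        pop_.sort(key=fitness, reverse=True)
        next_ = pop_[:elite]                            # elitism
        while len(next_) < len(pop_):
            a, b = rng.sample(pop_[:len(pop_) // 2], 2) # truncation selection
            cut = rng.randrange(1, dim)
            child = a[:cut] + b[cut:]                   # one-point crossover
            child[rng.randrange(dim)] ^= 1              # bit-flip mutation
            next_.append(child)
        return next_

    islands = [[rand_ind() for _ in range(pop)] for _ in range(2)]
    for g in range(gens):
        islands = [evolve(p) for p in islands]          # the two EGAs run independently
        if (g + 1) % migrate_every == 0:                # migration operator: swap elites
            best0 = max(islands[0], key=fitness)
            best1 = max(islands[1], key=fitness)
            islands[0][-1], islands[1][-1] = best1, best0
    return max(islands[0] + islands[1], key=fitness)

# e.g. island_ga(sum, dim=40) maximizes the number of ones in a 40-bit string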


Proceedings ArticleDOI
12 Nov 2011
TL;DR: In this article, the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms, is explored, and two highly-tuned parallel approaches for BFS on large parallel systems are presented: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead.
Abstract: Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.

229 citations
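A serial sketch of the level-synchronous strategy: the frontier is expanded one whole level at a time. In the paper's distributed 1D variant each process owns a slice of the vertices and, after expanding its share of the frontier, sends newly discovered vertices to their owners in one collective exchange per level; the adjacency-dict representation here is purely illustrative:

def bfs_levels(adj, source):
    """Level-synchronous BFS returning each reachable vertex's level.

    adj maps a vertex to an iterable of neighbors. Every frontier vertex in a
    level is independent work, which is where the parallelism comes from.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in level:          # first visit claims the vertex
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

# bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, 0) -> {0: 0, 1: 1, 2: 1, 3: 2}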


Proceedings ArticleDOI
12 Feb 2011
TL;DR: The language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code are discussed and the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations are introduced.
Abstract: Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.

225 citations
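A plain-Python sketch of the nested data-parallel style the abstract describes; the primitive names (pmap, preduce) are my own illustration, not Copperhead's actual API:

def pmap(f, *seqs):
    # elementwise map; every application is independent (the parallelism a
    # compiler like Copperhead exploits when lowering to CUDA)
    return [f(*args) for args in zip(*seqs)]

def preduce(f, seq, init):
    # associative reduction; a compiler may evaluate it as a tree
    acc = init
    for x in seq:
        acc = f(acc, x)
    return acc

def spmv(rows, x):
    """Sparse matrix-vector product over a list-of-rows representation.

    rows is a list of (value, column) lists, one inner sequence per matrix row.
    The outer pmap is flat parallelism over rows; the per-row map+reduce is the
    nested data parallelism the abstract mentions.
    """
    def row_dot(row):
        return preduce(lambda a, b: a + b,
                       pmap(lambda vc: vc[0] * x[vc[1]], row), 0.0)
    return pmap(row_dot, rows)

# spmv([[(2.0, 0), (1.0, 2)], [(3.0, 1)]], [1.0, 1.0, 1.0]) == [3.0, 3.0]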


Proceedings ArticleDOI
21 Aug 2011
TL;DR: In this article, the authors consider the problem of k-center and k-median clustering in MapReduce and develop fast clustering algorithms with constant factor approximation guarantees.
Abstract: Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

215 citations
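A sketch of the overall shape of such algorithms, assuming plain uniform sampling and Gonzalez's greedy 2-approximation as the sequential k-center subroutine; the paper's sampling scheme differs and is what carries the stated guarantees:

import random

def greedy_k_center(points, k, dist):
    """Gonzalez's farthest-point heuristic: a classical 2-approximation for k-center."""
    centers = [points[0]]
    while len(centers) < k:
        # pick the point farthest from its nearest chosen center
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    return centers

def sample_then_cluster(points, k, dist, sample_size, seed=0):
    """MapReduce-style shape: shrink the data by sampling (a map/shuffle round),
    then run the expensive sequential routine on the small sample (one reducer).
    Uniform sampling here is a simplification of the paper's scheme."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return greedy_k_center(sample, k, dist)

# euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
# centers = sample_then_cluster(data, k=5, dist=euclid, sample_size=1000)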


Posted Content
TL;DR: This work designs optimal simulations of the well-established PRAM and BSP models in MapReduce, immediately resulting in optimal solutions to the problems of computing fixed-dimensional linear programming and 2-D and 3-D convex hulls.
Abstract: In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and BSP parallel models, which would benefit both the theory and practice of MapReduce algorithms. We describe efficient MapReduce algorithms for sorting, multi-searching, and simulations of parallel algorithms specified in the BSP and CRCW PRAM models. We also provide some applications of these results to problems in parallel computational geometry for the MapReduce framework, which result in efficient MapReduce algorithms for sorting, 2- and 3-dimensional convex hulls, and fixed-dimensional linear programming. For the case when mappers and reducers have a memory/message-I/O size of $M=\Theta(N^\epsilon)$, for a small constant $\epsilon>0$, all of our MapReduce algorithms for these applications run in a constant number of rounds.

201 citations
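A toy illustration of the simulation idea: one BSP superstep becomes one MapReduce round, with outgoing messages keyed by destination processor so the shuffle realizes the superstep's communication. All names are illustrative, not the paper's construction:

from collections import defaultdict

def mapreduce_round(state, compute):
    """Simulate one BSP superstep as a single MapReduce round.

    state maps processor id -> (local_state, inbox). compute(pid, local, inbox)
    returns (new_local, outgoing), where outgoing is a list of (dest_pid, msg).
    The shuffle groups messages by destination, which is exactly the
    communication a BSP superstep performs; iterating simulates the program.
    """
    # map: every processor computes locally and emits keyed messages
    emitted, new_state = [], {}
    for pid, (local, inbox) in state.items():
        new_local, outgoing = compute(pid, local, inbox)
        new_state[pid] = new_local
        emitted.extend(outgoing)
    # shuffle + reduce: deliver each message to its destination's inbox
    inboxes = defaultdict(list)
    for dest, msg in emitted:
        inboxes[dest].append(msg)
    return {pid: (local, inboxes[pid]) for pid, local in new_state.items()}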


Book
27 Aug 2011
TL;DR: A parallel algorithm for the prefix sums problem which runs in O(log n/log log n) time using n log log n/log n processors (optimal speedup) is presented.
Abstract: We present a parallel algorithm for the prefix sums problem which runs in O(log n/log log n) time using n log log n/log n processors (optimal speedup). This algorithm leads to a parallel list ranking algorithm which runs in O(log n) time using n/log n processors (optimal speedup).

199 citations
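For contrast with the result above, the classical work-efficient exclusive scan (up-sweep/down-sweep), which runs in O(log n) time with n/log n processors; the paper's faster O(log n/log log n) algorithm is more intricate and is not reproduced here. Serial code, with the iterations of each inner loop being the independent per-processor work:

def exclusive_scan(a):
    """Work-efficient exclusive prefix sum; len(a) must be a power of two."""
    t, n = list(a), len(a)
    d = 1
    while d < n:                          # up-sweep: build partial sums in a tree
        for i in range(0, n, 2 * d):      # these iterations are independent
            t[i + 2 * d - 1] += t[i + d - 1]
        d *= 2
    t[n - 1] = 0
    while d > 1:                          # down-sweep: push prefixes back down
        d //= 2
        for i in range(0, n, 2 * d):      # these iterations are independent
            t[i + d - 1], t[i + 2 * d - 1] = (
                t[i + 2 * d - 1],
                t[i + d - 1] + t[i + 2 * d - 1],
            )
    return t

# exclusive_scan([1, 2, 3, 4]) == [0, 1, 3, 6]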


Journal ArticleDOI
TL;DR: A family of very efficient parallel algorithms for radix sorting is presented, together with allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism.
Abstract: The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
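A serial sketch of the counting/scan/scatter structure a radix sorting pass is built from; the paper fuses these phases into GPU kernels and runs the per-bin scans concurrently (its multi-scan), which this sketch does not attempt:

from itertools import accumulate

def radix_sort(keys, key_bits=32, radix_bits=4):
    """LSD radix sort as histogram + prefix scan + stable scatter per digit.

    keys are non-negative ints below 2**key_bits.
    """
    buckets = 1 << radix_bits
    for shift in range(0, key_bits, radix_bits):
        count = [0] * buckets
        for k in keys:                               # digit histogram
            count[(k >> shift) & (buckets - 1)] += 1
        start = [0] + list(accumulate(count))[:-1]   # exclusive scan: bin offsets
        out = [0] * len(keys)
        for k in keys:                               # stable scatter by digit
            d = (k >> shift) & (buckets - 1)
            out[start[d]] = k
            start[d] += 1
        keys = out
    return keys

# radix_sort([170, 45, 75, 90, 2, 802, 24, 66]) -> [2, 24, 45, 66, 75, 90, 170, 802]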

Journal Article
TL;DR: A Load Balanced Min-Min (LBMM) algorithm is proposed that reduces the makespan and increases resource utilization in grid computing; the proposed method has two phases.
Abstract: Grid computing has become a real alternative to traditional supercomputing environments for developing parallel applications that harness massive computational resources. However, the complexity incurred in building such parallel Grid-aware applications is higher than in traditional parallel computing environments, as issues such as resource discovery, heterogeneity, fault tolerance and task scheduling must be addressed. Load balanced task scheduling is a very important problem in complex grid environments, and task scheduling, which is NP-complete, has therefore become a focus of research in the grid computing area. The traditional Min-Min algorithm is a simple algorithm that produces a schedule with a smaller makespan than the other traditional algorithms in the literature, but it fails to produce a load balanced schedule. In this paper a Load Balanced Min-Min (LBMM) algorithm is proposed that reduces the makespan and increases resource utilization. The proposed method has two phases: in the first phase the traditional Min-Min algorithm is executed, and in the second phase the tasks are rescheduled to use the unutilized resources effectively.
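A sketch of phase one, the traditional Min-Min schedule, assuming an expected-time-to-compute matrix etc[t][m]; LBMM's second, rebalancing phase (moving tasks off the makespan machine) is only noted in a comment:

def min_min(etc):
    """Traditional Min-Min: repeatedly bind the task with the smallest
    earliest completion time to the machine achieving it.

    etc[t][m] is the expected time to compute task t on machine m. LBMM's
    phase two would then reschedule tasks from the most loaded machine onto
    under-used ones, which this sketch omits.
    """
    n_tasks, n_machines = len(etc), len(etc[0])
    ready = [0.0] * n_machines             # machine ready times
    unscheduled = set(range(n_tasks))
    assignment = {}
    while unscheduled:
        best = None                        # (completion_time, task, machine)
        for t in unscheduled:
            for m in range(n_machines):
                c = ready[m] + etc[t][m]
                if best is None or c < best[0]:
                    best = (c, t, m)
        c, t, m = best
        assignment[t] = m
        ready[m] = c
        unscheduled.remove(t)
    return assignment, max(ready)          # the schedule and its makespan

# min_min([[3, 1], [2, 4]]) -> ({0: 1, 1: 0}, 2)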

Journal ArticleDOI
TL;DR: It is suggested that the considerable intellectual effort needed for designing efficient algorithms for multi-core architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model.

Posted Content
TL;DR: This paper designs clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets, and focuses on the practical and popular clustering problems, k-center and k-median.
Abstract: Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [KarloffSV10]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

Journal ArticleDOI
TL;DR: In this article, the authors discuss possible approaches to parallelizing PSO on graphics hardware within the Compute Unified Device Architecture (CUDA), a GPU programming environment from NVIDIA which supports the company's latest cards.

Journal ArticleDOI
TL;DR: This paper presents a new exact maximum clique algorithm which improves the bounds obtained in state of the art approximate coloring by reordering the vertices at each step, and significantly outperforms a current leading algorithm.

Journal ArticleDOI
TL;DR: In this article, the authors adopt a convex optimization framework where the criterion to be minimized is split into the sum of more than two terms, and an accelerated version of the Parallel Proximal Algorithm is proposed to perform the minimization.
Abstract: Regularization approaches have demonstrated their effectiveness for solving ill-posed problems. However, in the context of variational restoration methods, a challenging question remains, namely how to find a good regularizer. While total variation introduces staircase effects, wavelet-domain regularization brings other artefacts, e.g., ringing. However, a tradeoff can be made by introducing a hybrid regularization including several terms not necessarily acting in the same domain (e.g., spatial and wavelet transform domains). While this approach was shown to provide good results for solving deconvolution problems in the presence of additive Gaussian noise, an important issue is to efficiently deal with this hybrid regularization for more general noise models. To solve this problem, we adopt a convex optimization framework where the criterion to be minimized is split into the sum of more than two terms. For spatial domain regularization, isotropic or anisotropic total variation definitions using various gradient filters are considered. An accelerated version of the Parallel Proximal Algorithm is proposed to perform the minimization. Some difficulties in the computation of the proximity operators involved in this algorithm are also addressed in this paper. Numerical experiments performed in the context of Poisson data recovery show the good behavior of the algorithm as well as promising results concerning the use of hybrid regularization techniques.
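A toy instance of the basic (unaccelerated) Parallel Proximal Algorithm for a sum of m terms, each touched only through its proximity operator; equal weights and unit relaxation are assumed here, and quadratic terms keep the prox closed-form. The paper's accelerated variant and actual regularizers are not reproduced:

import numpy as np

def ppxa(proxes, x0, gamma=1.0, iters=200):
    """Basic PPXA for min_x f_1(x) + ... + f_m(x).

    proxes[i](v, tau) must return prox_{tau * f_i}(v). All prox evaluations in
    a sweep are independent, hence the 'parallel' in the name. Weights
    w_i = 1/m and relaxation lambda_n = 1 are hard-coded for simplicity.
    """
    m = len(proxes)
    y = [x0.astype(float).copy() for _ in range(m)]
    x = x0.astype(float).copy()
    for _ in range(iters):
        p = [proxes[i](y[i], gamma * m) for i in range(m)]  # gamma / w_i, parallelizable
        pbar = sum(p) / m
        y = [y[i] + 2.0 * pbar - x - p[i] for i in range(m)]
        x = pbar
    return x

# Toy check: f_i(z) = 0.5 * ||z - a_i||^2 has prox_{tau f_i}(v) = (v + tau * a_i) / (1 + tau),
# and the minimizer of the sum is the mean of the anchors a_i:
# anchors = [np.array([0.0]), np.array([4.0])]
# proxes = [lambda v, t, a=a: (v + t * a) / (1 + t) for a in anchors]
# ppxa(proxes, np.array([10.0]))  # converges to approximately [2.0]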

Journal ArticleDOI
TL;DR: A new slack reclamation algorithm is proposed by approaching the energy reduction problem from a different angle, and a novel algorithm is presented to find the combination of processor frequencies that yields optimal energy consumption.

Journal ArticleDOI
TL;DR: This work follows carefully chosen data transfer schemes in global memory for the Lattice Boltzmann Method and shows that highly efficient implementations of LBM on GPUs are possible, even for complex models.
Abstract: Emerging many-core processors, like CUDA capable nVidia GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since the global memory for graphic devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performances. Whenever possible, global memory loads and stores should be coalescent and aligned, but the propagation phase in LBM can lead to frequent misaligned memory accesses. Most previous CUDA implementations of 3D LBM addressed this problem by using low latency on chip shared memory. Instead of this, our CUDA implementation of LBM follows carefully chosen data transfer schemes in global memory. For the 3D lid-driven cavity test case, we obtained up to 86% of the global memory maximal throughput on nVidia's GT200. We show that as a consequence highly efficient implementations of LBM on GPUs are possible, even for complex models.
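A numpy sketch of the D2Q9 propagation (streaming) step, the phase whose shifted writes produce the misaligned global-memory accesses discussed above; the lattice layout, array shapes, and periodic boundaries are illustrative:

import numpy as np

# D2Q9 lattice velocities: rest, the four axis directions, the four diagonals.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]

def stream(f):
    """Propagation step of the LBM: distribution i moves one cell along c_i.

    f has shape (9, ny, nx) with periodic boundaries. Each np.roll is the
    shifted copy that, done naively on a GPU, yields the misaligned accesses
    the paper's carefully chosen transfer schemes are designed around.
    """
    return np.stack([np.roll(f[i], shift=(cy, cx), axis=(0, 1))
                     for i, (cx, cy) in enumerate(C)])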

Journal ArticleDOI
TL;DR: This work introduces a class of parallel preconditioners for the FSI problem obtained by exploiting the block-structure of the linear system and shows that the construction and evaluation of the devised preconditioner is modular.
Abstract: The increasing computational load required by most applications and the limits in hardware performances affecting scientific computing contributed in the last decades to the development of parallel software and architectures. In fluid-structure interaction (FSI) for haemodynamic applications, parallelization and scalability are key issues (see [L. Formaggia, A. Quarteroni, and A. Veneziani, eds., Cardiovascular Mathematics: Modeling and Simulation of the Circulatory System, Modeling, Simulation and Applications 1, Springer, Milan, 2009]). In this work we introduce a class of parallel preconditioners for the FSI problem obtained by exploiting the block-structure of the linear system. We stress the possibility of extending the approach to a general linear system with a block-structure, then we provide a bound in the condition number of the preconditioned system in terms of the conditioning of the preconditioned diagonal blocks, and finally we show that the construction and evaluation of the devised preconditioner is modular. The preconditioners are tested on a benchmark three-dimensional (3D) geometry discretized in both a coarse and a fine mesh, as well as on two physiological aorta geometries. The simulations that we have performed show an advantage in using the block preconditioners introduced and confirm our theoretical results.

Proceedings ArticleDOI
16 May 2011
TL;DR: This work gives a new data structure transformation, called bit masked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros, and shows how to incorporate this transformation into existing parallel algorithms without limiting their parallel scalability.
Abstract: On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bit masked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bit masked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
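A serial sketch of the symmetric kernel's arithmetic, assuming the upper triangle of A stored in CSR: each stored entry serves two updates, which is where the bandwidth halving comes from. The scattered y[j] updates are precisely what the paper's multithreaded algorithm must coordinate without serializing:

def sym_spmv(n, rowptr, colidx, vals, x):
    """y = A x for symmetric A stored as its upper triangle in CSR.

    Each stored a_ij (i <= j) feeds both y[i] += a_ij * x[j] and
    y[j] += a_ij * x[i], so only half the matrix crosses the memory bus.
    """
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            j, a = colidx[k], vals[k]
            y[i] += a * x[j]
            if j != i:              # off-diagonal entries act twice
                y[j] += a * x[i]
    return y

# A = [[2, 1], [1, 3]] stored as its upper triangle:
# sym_spmv(2, [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0], [1.0, 1.0]) == [3.0, 4.0]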

Journal ArticleDOI
TL;DR: A message passing interface (MPI) parallel approach designed to increase both the size and the speed with which hydrological proximity measures (HPMs) are computed from a Digital Elevation Model (DEM), allowing efficient analysis of much larger DEMs than was possible using the serial recursive algorithms.
Abstract: Land surface topography is one of the most important terrain properties which impact hydrological, geomorphological, and ecological processes active on a landscape. In our previous efforts to develop a soil depth model based upon topographic and land cover variables, we derived a set of hydrological proximity measures (HPMs) from a Digital Elevation Model (DEM) as potential explanatory variables for soil depth. These HPMs are variations of the distance up to ridge points (cells with no incoming flow) and variations of the distance down to stream points (cells with a contributing area greater than a threshold), following the flow path. The HPMs were computed using the D-infinity flow model that apportions flow between adjacent neighbors based on the direction of steepest downward slope on the eight triangular facets constructed in a 3 x 3 grid cell window using the center cell and each pair of adjacent neighboring grid cells in turn. The D-infinity model typically results in multiple flow paths between two points on the topography, with the result that distances may be computed as the minimum, maximum or average of the individual flow paths. In addition, each of the HPMs is calculated vertically, horizontally, and along the land surface. Previously, these HPMs were calculated using recursive serial algorithms which suffered from stack overflow problems when used to process large datasets, limiting the size of DEMs that could be analyzed. To overcome this limitation, we developed a message passing interface (MPI) parallel approach designed to both increase the size and speed with which these HPMs are computed. The parallel HPM algorithms spatially partition the input grid into stripes which are each assigned to separate processes for computation. Each of those processes then uses a queue data structure to order the processing of cells so that each cell is visited only once and the cross-process communications that are a standard part of MPI are handled in an efficient manner. This parallel approach allows efficient analysis of much larger DEMs than was possible using the serial recursive algorithms. The HPMs given here may also have other, more general modeling applicability in hydrology, geomorphology and ecology, and so are described here from a general perspective. In this paper, we present the definitions of the HPMs, the serial and parallel algorithms used in their computation and their potential applications.
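A much-simplified sketch of one HPM, the distance down to a stream along the flow path, assuming a single downstream neighbor per cell (a D8-style simplification of the paper's multi-path D-infinity model) and unit cell spacing. The explicit stack replaces the recursion whose overflow motivated the parallel redesign:

def distance_to_stream(downstream, is_stream):
    """Flow-path distance from every cell to the nearest downstream stream cell.

    downstream[c] is the single cell that c drains to, or None at an outlet.
    A recursive definition overflows the call stack on large grids, so an
    explicit stack defers each cell until its downstream neighbor is resolved;
    every cell is resolved exactly once, echoing the single-visit queue
    discipline of the MPI algorithm.
    """
    dist = {}
    for start in downstream:
        stack = [start]
        while stack:
            c = stack[-1]
            if c in dist:
                stack.pop()
            elif is_stream(c) or downstream[c] is None:
                dist[c] = 0.0          # stream cell (outlets are also left at 0 here)
                stack.pop()
            elif downstream[c] in dist:
                dist[c] = 1.0 + dist[downstream[c]]
                stack.pop()
            else:
                stack.append(downstream[c])   # resolve the downstream cell first
    return dist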

Book
12 Sep 2011
TL;DR: The mapping of algorithms structured as depth-p nested FOR loops into special-purpose systolic VLSI linear arrays is addressed by using linear functions to transform the original sequential algorithms into a form suitable for parallel execution on linear arrays.
Abstract: The mapping of algorithms structured as depth-p nested FOR loops into special-purpose systolic VLSI linear arrays is addressed. The mappings are done by using linear functions to transform the original sequential algorithms into a form suitable for parallel execution on linear arrays. A feasible mapping is derived by identifying formal criteria to be satisfied by both the original sequential algorithm and the proposed transformation function. The methodology is illustrated by synthesizing algorithms for matrix multiplication and a version of the Warshall-Floyd transitive closure algorithm.

Journal ArticleDOI
TL;DR: The paper presents a systematic framework for exploiting the potential of the decomposition structures as a way to obtain different parallel algorithms, each with a different tradeoff among convergence speed, message passing amount and distributed computation architecture.

Book
19 Aug 2011
TL;DR: Parallel algorithms are presented for important components of computational fluid dynamics algorithms, along with implementations on hypercube computers; these programs are used to solve hyperbolic and other problems.
Abstract: Parallel algorithms are presented for important components of computational fluid dynamics algorithms along with implementations on hypercube computers. These programs, used to solve hyperbolic and...

Journal ArticleDOI
TL;DR: This work demonstrates the implementation of the FETI method in a hybrid CPU–GPU computing environment and reveals the tremendous potential of this type of hybrid computing environment, resulting from the full exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs.

Journal ArticleDOI
TL;DR: This paper analyzes the bag-of-words model for visual categorization in terms of computational cost, identifies two major bottlenecks (the quantization step and the classification step), and proposes two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model.
Abstract: Visual categorization is important to manage large collections of digital images and video, where textual metadata is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the trend to increase computational power in newer CPU and GPU architectures is to increase their level of parallelism, exploiting this parallelism becomes an important direction to handle the computational cost of the bag-of-words approach. When optimizing a system based on the bag-of-words approach, the goal is to minimize the time it takes to process batches of images. In this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to (1) keep categorization accuracy intact, (2) decompose the problem, and (3) give the same numerical results. In the experiments on large scale datasets, it is shown that, by using a parallel implementation on the Geforce GTX260 GPU, classifying unseen images is 4.8 times faster than a quad-core CPU version on the Core i7 920, while giving the exact same numerical results. In addition, we show how the algorithms can be generalized to other applications, such as text retrieval and video retrieval. Moreover, when the obtained speedup is used to process extra video frames in a video retrieval benchmark, the accuracy of visual categorization is improved by 29%.
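A numpy sketch of the quantization bottleneck, using the standard expansion of squared distances so that the dominant cost becomes one dense matrix multiply, the kind of formulation that maps well onto GPU hardware; array shapes are illustrative:

import numpy as np

def quantize(descriptors, codebook):
    """Nearest visual word for each descriptor (the quantization step).

    ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c, and the ||x||^2 term is constant
    per descriptor, so the argmin needs only ||c||^2 and one dense matrix
    multiply over all descriptor/word pairs at once.
    """
    cc = (codebook ** 2).sum(axis=1)          # ||c||^2 for every visual word
    dots = descriptors @ codebook.T           # all inner products in one GEMM
    return (cc[None, :] - 2.0 * dots).argmin(axis=1)

# assignments = quantize(np.random.rand(1000, 128), np.random.rand(4096, 128))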

Proceedings ArticleDOI
29 Jul 2011
TL;DR: The results show that the strategy designed in this paper can achieve higher efficiency when doing frequent item set mining in a cloud computing environment.
Abstract: Cloud computing provides cheap and efficient solutions for storing and analyzing mass data. It is very important to research data mining strategies based on cloud computing from both theoretical and practical viewpoints. In this paper, the strategy of mining association rules in a cloud computing environment is focused on. Firstly, cloud computing, Hadoop, the MapReduce programming model, the Apriori algorithm and parallel association rule mining algorithms are introduced. Then, a parallel association rule mining strategy adapted to the cloud computing environment is designed. It includes a data set division method, a data set allocation method, an improved Apriori algorithm, and the implementation procedure of the improved Apriori algorithm on MapReduce. Finally, the Hadoop platform is built and experiments testing the performance of the strategy as well as the improved algorithm have been done. The results show that the strategy designed in this paper can achieve higher efficiency when doing frequent item set mining in a cloud computing environment.
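A toy simulation of one candidate-counting round, the MapReduce core of a parallel Apriori: mappers count candidates per data split, the shuffle groups counts by itemset, and the reducer filters by support. Function and variable names are mine, not the paper's:

from collections import Counter

def frequent_itemsets_pass(partitions, candidates, min_support):
    """One MapReduce round of candidate counting for parallel Apriori.

    Iterating this round, with candidates generated from the previous level's
    frequent itemsets, gives the full level-wise algorithm.
    """
    # map phase: each partition (one mapper per split) counts locally
    local_counts = []
    for transactions in partitions:
        c = Counter()
        for t in transactions:
            tset = set(t)
            for cand in candidates:
                if set(cand) <= tset:
                    c[cand] += 1
        local_counts.append(c)
    # shuffle + reduce phase: sum counts by itemset key, filter by support
    total = Counter()
    for c in local_counts:
        total.update(c)
    return {iset: n for iset, n in total.items() if n >= min_support}

# parts = [[("a", "b"), ("a", "c")], [("a", "b", "c")]]
# frequent_itemsets_pass(parts, [("a",), ("a", "b")], min_support=2)
# -> {("a",): 3, ("a", "b"): 2}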

Proceedings ArticleDOI
16 May 2011
TL;DR: A new CUDA-aware procedure for pivot selection is designed and selected parallel algorithms are adapted for CUDA-accelerated computation; experiments demonstrate that with a single GTX 480 GPU card, the optimal serial CPU implementation is easily outperformed by an order of magnitude.
Abstract: The problem of decomposing a directed graph into its strongly connected components is a fundamental graph problem inherently present in many scientific and commercial applications. In this paper we show how some of the existing parallel algorithms can be reformulated in order to be accelerated by NVIDIA CUDA technology. In particular, we design a new CUDA-aware procedure for pivot selection and we adapt selected parallel algorithms for CUDA accelerated computation. We also experimentally demonstrate that with a single GTX 480 GPU card we can easily outperform the optimal serial CPU implementation by an order of magnitude in most cases, 40 times on some sufficiently big instances. This is an interesting result as unlike the serial CPU case, the asymptotic complexity of the parallel algorithms is not optimal.
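A serial sketch of the Forward-Backward decomposition such parallel SCC algorithms build on: a pivot's forward and backward reachable sets intersect in one SCC, and the three leftover regions can be processed independently. Pivot choice, the step the paper tunes for CUDA, is arbitrary here:

def reach(adj, pivot, allowed):
    """Vertices of 'allowed' reachable from pivot; the fwd and bwd calls below
    are independent of each other and can run in parallel."""
    seen, stack = {pivot}, [pivot]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def fwd_bwd_scc(vertices, adj, radj):
    """Forward-Backward SCC decomposition.

    adj maps a vertex to its successors, radj to its predecessors. Every SCC
    other than the pivot's lies wholly inside one of the three leftover
    regions, so those recurse (here, via a worklist) independently.
    """
    sccs, work = [], [set(vertices)]
    while work:
        sub = work.pop()
        pivot = next(iter(sub))
        fwd = reach(adj, pivot, sub)
        bwd = reach(radj, pivot, sub)
        scc = fwd & bwd
        sccs.append(scc)
        for region in (fwd - scc, bwd - scc, sub - fwd - bwd):
            if region:
                work.append(region)
    return sccs

# adj  = {1: [2], 2: [3], 3: [1, 4]}
# radj = {2: [1], 3: [2], 1: [3], 4: [3]}
# fwd_bwd_scc([1, 2, 3, 4], adj, radj) -> [{1, 2, 3}, {4}] (in some order)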

Journal ArticleDOI
TL;DR: It is found that the proposed methodology can successfully consider the benefits of all stakeholders in the introduction of transit lanes, and the parallel GA enables the methodology to be used at real-world-network scale with shorter computer processing time.
Abstract: This paper proposes a detailed formulation to optimize transit road space priority at the network level and utilizes an efficient heuristic method to find the optimum solution. Previous approaches to transit priority have a localized focus in which only limited combinations of transit exclusive lanes could be assessed. The aim of this work is to reallocate the road space between private car and transit modes so that the system is optimized. A bilevel programming approach is adapted for this purpose. The upper level involves an objective function from the system managers' perspective, whereas at the lower level, a users' perspective is modeled. To take into account the major effects of a priority provision, three models are used: 1) a modal split; 2) a user equilibrium traffic assignment; and 3) a transit assignment. A genetic algorithm (GA) approach is used, which enables the method to be applied to large networks. Application of a parallel GA is also demonstrated in the solution method, which has a considerably shorter execution time. The methodology is applied to an example network, and results are discussed. It is found that the proposed methodology can successfully consider benefits of all stakeholders in the introduction of transit lanes. Furthermore, using parallel GA enables the methodology to be used for real-world-network scale in a shorter computer processing time.