
Showing papers on "Parallel algorithm" published in 2010


Proceedings ArticleDOI
05 Jun 2010
TL;DR: The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines.
Abstract: MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
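FlumeJava itself is a Java library; purely as an illustration of the deferred-evaluation idea the abstract describes, here is a minimal C++ sketch (all names hypothetical, nothing resembling FlumeJava's actual API) of a collection that records element-wise stages and fuses them into a single pass when the result is finally demanded.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Conceptual sketch only: a lazy integer collection that records stages
// instead of executing them, then fuses them into one pass on demand --
// the deferred-evaluation/optimization idea, minus distributed execution.
class LazyInts {
public:
    explicit LazyInts(std::vector<int> data) : data_(std::move(data)) {}

    LazyInts& map(std::function<int(int)> f) {   // record, don't execute
        stages_.push_back(std::move(f));
        return *this;
    }

    std::vector<int> run() const {               // "optimize" = fuse stages
        std::vector<int> out;
        out.reserve(data_.size());
        for (int v : data_) {                    // one pass; parallelizable by chunks
            for (const auto& f : stages_) v = f(v);
            out.push_back(v);
        }
        return out;
    }

private:
    std::vector<int> data_;
    std::vector<std::function<int(int)>> stages_;
};

// Example: two logical passes execute as one fused loop.
//   LazyInts p({1, 2, 3});
//   auto r = p.map([](int x) { return x + 1; })
//             .map([](int x) { return x * 2; })
//             .run();   // yields {4, 6, 8}
```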

421 citations


Journal ArticleDOI
01 Jun 2010
TL;DR: The paper motivates the need for new algorithms that split the computation so as to fully exploit the power of each hybrid component, and envisions a dense linear algebra (DLA) library similar to LAPACK but for hybrid manycore/GPU systems.
Abstract: We highlight the trends leading to the increased appeal of using hybrid multicore+GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.
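For context on the computation being split, the following is a minimal C++ sketch of conventional blocked, right-looking LU factorization without pivoting, not the paper's hybrid algorithm: step 1 is the tall, skinny panel work typically kept on the multicore CPU, while step 3 is the large matrix-matrix update usually offloaded to the GPU.

```cpp
#include <algorithm>
#include <vector>

// Textbook blocked, right-looking LU without pivoting of a column-major
// n x n matrix, overwriting A with L (unit lower triangular) and U.
void blocked_lu(std::vector<double>& A, int n, int nb) {
    auto a = [&](int i, int j) -> double& { return A[i + j * n]; };
    for (int k = 0; k < n; k += nb) {
        const int kb = std::min(nb, n - k);
        // 1) Unblocked LU of the panel A(k:n, k:k+kb).
        for (int j = k; j < k + kb; ++j) {
            for (int i = j + 1; i < n; ++i) a(i, j) /= a(j, j);
            for (int jj = j + 1; jj < k + kb; ++jj)
                for (int i = j + 1; i < n; ++i)
                    a(i, jj) -= a(i, j) * a(j, jj);
        }
        // 2) Triangular solve for the block row: U12 = L11^{-1} * A12.
        for (int j = k + kb; j < n; ++j)
            for (int i = k; i < k + kb; ++i)
                for (int p = k; p < i; ++p)
                    a(i, j) -= a(i, p) * a(p, j);
        // 3) Trailing-matrix update: A22 -= L21 * U12 (the GPU-friendly GEMM).
        for (int j = k + kb; j < n; ++j)
            for (int p = k; p < k + kb; ++p)
                for (int i = k + kb; i < n; ++i)
                    a(i, j) -= a(i, p) * a(p, j);
    }
}
```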

398 citations


Journal ArticleDOI
TL;DR: In this article, the synergy between the LATIN multiscale method and the Proper Generalized Decomposition (PGD), which is the key to the method's performance, is discussed.

275 citations


Journal ArticleDOI
TL;DR: The proposed algorithm not only improved several of the known solutions, but also exhibited very satisfactory scalability.

262 citations


Proceedings ArticleDOI
09 Jan 2010
TL;DR: To combine the benefits of the basic algorithms, this work proposes hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively.
Abstract: We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.
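As a reference point for one of the three building blocks the paper studies, here is a serial C++ sketch of parallel cyclic reduction (PCR) in its generic textbook form, not the paper's tuned implementation; every iteration of the inner loop is independent across rows, which is what a GPU kernel exploits.

```cpp
#include <vector>

// Serial sketch of parallel cyclic reduction (PCR) for a tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1.
// Out-of-range neighbours are treated as identity rows (b = 1, a = c = d = 0).
// Once the stride exceeds n, every equation is fully decoupled.
std::vector<double> pcr_solve(std::vector<double> a, std::vector<double> b,
                              std::vector<double> c, std::vector<double> d) {
    const int n = static_cast<int>(b.size());
    auto at = [n](const std::vector<double>& v, int i, double pad) {
        return (i >= 0 && i < n) ? v[i] : pad;
    };
    for (int s = 1; s < n; s *= 2) {
        std::vector<double> na(n), nb(n), nc(n), nd(n);
        for (int i = 0; i < n; ++i) {                    // one GPU thread per row
            const double k1 = at(a, i, 0.0) / at(b, i - s, 1.0);
            const double k2 = at(c, i, 0.0) / at(b, i + s, 1.0);
            nb[i] = b[i] - at(c, i - s, 0.0) * k1 - at(a, i + s, 0.0) * k2;
            nd[i] = d[i] - at(d, i - s, 0.0) * k1 - at(d, i + s, 0.0) * k2;
            na[i] = -at(a, i - s, 0.0) * k1;
            nc[i] = -at(c, i + s, 0.0) * k2;
        }
        a.swap(na); b.swap(nb); c.swap(nc); d.swap(nd);
    }
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = d[i] / b[i];      // rows are now decoupled
    return x;
}
```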

232 citations


Journal ArticleDOI
TL;DR: Two genetic algorithms are developed, with heuristic principles added to improve performance; the developed algorithms are found to consistently outperform the traditional algorithms.

194 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A general method for analyzing nondeterministic programs that use reducers is presented, and it is shown that for a graph G=(V,E) with diameter D and bounded out-degree, a data-race-free version of the PBFS algorithm attains near-perfect linear speedup if P ≪ (V+E)/(D lg³(V/D)).
Abstract: We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices -- a condition met by many real-world graphs -- PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" -- a "hyperobject" feature of Cilk++ -- the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS is also nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of PBFS runs in time O((V+E)/P + D lg³(V/D)) on P processors, which means that it attains near-perfect linear speedup if P ≪ (V+E)/(D lg³(V/D)).
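For readers unfamiliar with the structure PBFS parallelizes, the following C++ sketch shows a plain level-synchronous BFS; PBFS replaces the per-level vectors used here with its "bag" reducer and processes each frontier with Cilk++ parallelism. This is only the serial skeleton, not the paper's code.

```cpp
#include <vector>

// Level-synchronous BFS: within one level, the loop over the current
// frontier has no ordering dependence, which is what PBFS exploits.
// dist[v] == -1 marks an unvisited vertex; adj is an adjacency list.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier{src};
    dist[src] = 0;
    int level = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        for (int u : frontier) {            // parallel loop in PBFS
            for (int v : adj[u]) {
                if (dist[v] == -1) {        // the benign race in the parallel version
                    dist[v] = level + 1;
                    next.push_back(v);
                }
            }
        }
        frontier.swap(next);
        ++level;
    }
    return dist;
}
```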

174 citations


Journal ArticleDOI
TL;DR: This work presents a parallel algorithm for k-nearest neighbor graph construction that uses Morton ordering, and shows that this approach has the following advantages over existing methods: faster construction of k-nearest neighbor graphs in practice on multicore machines, less space usage, better cache efficiency, ability to handle large data sets, and ease of parallelization and implementation.
Abstract: We present a parallel algorithm for k-nearest neighbor graph construction that uses Morton ordering. Experiments show that our approach has the following advantages over existing methods: 1) faster construction of k-nearest neighbor graphs in practice on multicore machines, 2) less space usage, 3) better cache efficiency, 4) ability to handle large data sets, and 5) ease of parallelization and implementation. If the point set has a bounded expansion constant, our algorithm requires one comparison-based parallel sort of the points according to Morton order, plus near-linear additional steps to output the k-nearest neighbor graph.
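A small C++ sketch of the standard Morton (Z-order) key computation that underlies such a sort; this is the generic bit-interleaving construction, not the authors' code, and the k-nearest-neighbor search over the sorted order (and the parallel sort itself) is not shown.

```cpp
#include <cstdint>

// Morton (Z-order) key: interleave the bits of 21-bit integer coordinates so
// that an ordinary comparison-based sort groups spatially nearby points.
std::uint64_t morton3d(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    std::uint64_t code = 0;
    for (int i = 0; i < 21; ++i) {
        code |= (std::uint64_t{(x >> i) & 1u}) << (3 * i)
              | (std::uint64_t{(y >> i) & 1u}) << (3 * i + 1)
              | (std::uint64_t{(z >> i) & 1u}) << (3 * i + 2);
    }
    return code;
}
```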

163 citations


Proceedings ArticleDOI
21 Jun 2010
TL;DR: This work introduces LogGOPSim---a fast simulation framework for parallel algorithms at large-scale that utilizes a slightly extended version of the well-known LogGPS model in combination with full MPI message matching semantics and detailed simulation of collective operations.
Abstract: We introduce LogGOPSim---a fast simulation framework for parallel algorithms at large-scale. LogGOPSim utilizes a slightly extended version of the well-known LogGPS model in combination with full MPI message matching semantics and detailed simulation of collective operations. In addition, it enables simulation in the traditional LogP, LogGP, and LogGPS models. Its simple and fast single-queue design computes more than 1 million events per second on a single processor and enables large-scale simulations of more than 8 million processes. LogGOPSim also supports the simulation of full MPI applications by reading and simulating MPI profiling traces. We analyze the accuracy and the performance of the simulation and propose a simple extrapolation scheme for parallel applications. Our scheme extrapolates collective operations with high accuracy by rebuilding the communication pattern. Point-to-point operation patterns can be copied in the extrapolation and thus retain the main characteristics of scalable parallel applications.
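To make the model family concrete, here is a tiny C++ sketch of the classic LogGP cost of a single point-to-point message and of a binomial-tree broadcast built from it; LogGOPSim's extended LogGOPS model adds further parameters beyond these four, and nothing of its simulation machinery is reproduced here.

```cpp
#include <cmath>

// Classic LogGP cost of a k-byte point-to-point message: sender overhead,
// per-byte serialization, wire latency, receiver overhead.  The gap g only
// matters for streams of messages, so it does not appear in this formula.
struct LogGP {
    double L;   // latency
    double o;   // per-message CPU overhead
    double g;   // gap between consecutive messages
    double G;   // gap per byte
};

double p2p_time(const LogGP& m, double k_bytes) {
    return m.o + (k_bytes - 1.0) * m.G + m.L + m.o;
}

// A binomial-tree broadcast over P processes needs about ceil(log2 P)
// such message steps on its critical path.
double bcast_time(const LogGP& m, double k_bytes, int P) {
    return std::ceil(std::log2(static_cast<double>(P))) * p2p_time(m, k_bytes);
}
```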

159 citations


Proceedings ArticleDOI
19 Apr 2010
TL;DR: A new power-aware performance prediction model of hybrid MPI/OpenMP applications is used to derive a novel algorithm for power-efficient execution of realistic applications from the ASC Sequoia and NPB MZ benchmarks.
Abstract: Power-aware execution of parallel programs is now a primary concern in large-scale HPC environments. Prior research in this area has explored models and algorithms based on dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) to achieve power-aware execution of programs written in a single programming model, typically MPI or OpenMP. However, hybrid programming models combining MPI and OpenMP are growing in popularity as emerging large-scale systems have many nodes with several processors per node and multiple cores per processor. In this paper we present and evaluate solutions for power-efficient execution of programs written in this hybrid model targeting large-scale distributed systems with multicore nodes. We use a new power-aware performance prediction model of hybrid MPI/OpenMP applications to derive a novel algorithm for power-efficient execution of realistic applications from the ASC Sequoia and NPB MZ benchmarks. Our new algorithm yields substantial energy savings (4.18% on average and up to 13.8%) with either negligible performance loss or performance gain (up to 7.2%).

153 citations


Journal ArticleDOI
TL;DR: The simulation code PIConGPU presented in this paper is, to the authors' knowledge, the first scalable GPU cluster implementation of the PIC algorithm in plasma physics.
Abstract: The particle-in-cell (PIC) algorithm is one of the most widely used algorithms in computational plasma physics. With the advent of graphical processing units (GPUs), large-scale plasma simulations on inexpensive GPU clusters are in reach. We present an implementation of a fully relativistic plasma PIC algorithm for GPUs based on the NVIDIA CUDA library. It supports a hybrid architecture consisting of single computation nodes interconnected in a standard cluster topology, with each node carrying one or more GPUs. The internode communication is realized using the message-passing interface. The simulation code PIConGPU presented in this paper is, to our knowledge, the first scalable GPU cluster implementation of the PIC algorithm in plasma physics.
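As a rough illustration of the per-particle work a PIC code parallelizes, here is a C++ sketch of a simple electrostatic, non-relativistic particle push on a 1-D grid; the paper's relativistic electromagnetic push and its CUDA implementation are considerably more involved, and only the one-thread-per-particle structure is suggested here.

```cpp
#include <vector>

// Per-particle update: gather the field at the particle position, advance
// velocity, advance position.  E holds the electric field at grid nodes with
// spacing dx; particles are assumed to stay inside the grid.
struct Particle { double x, v; };

void push_particles(std::vector<Particle>& particles,
                    const std::vector<double>& E,
                    double dx, double q_over_m, double dt) {
    for (auto& p : particles) {                 // one GPU thread per particle
        const int cell = static_cast<int>(p.x / dx);
        const double w = p.x / dx - cell;       // linear weight within the cell
        const double Ep = (1.0 - w) * E[cell] + w * E[cell + 1];   // gather
        p.v += q_over_m * Ep * dt;              // velocity update
        p.x += p.v * dt;                        // position update
    }
}
```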

Journal ArticleDOI
TL;DR: The algorithms and functionality of a new module developed to support overset grid assembly associated with performing time-dependent and adaptive moving body calculations of external aerodynamic flows using a multi-solver paradigm are described.

Journal ArticleDOI
TL;DR: This work uses thread and data parallelism to perform fast hierarchy construction, updating, and traversal using tight‐fitting bounding volumes such as oriented bounding boxes (OBB) and rectangular swept spheres (RSS) and describes efficient algorithms to compute a linear bounding volume hierarchy (LBVH) and update them using refitting methods.
Abstract: We present novel parallel algorithms for collision detection and separation distance computation for rigid and deformable models that exploit the computational capabilities of many-core GPUs. Our approach uses thread and data parallelism to perform fast hierarchy construction, updating, and traversal using tight-fitting bounding volumes such as oriented bounding boxes (OBB) and rectangular swept spheres (RSS). We also describe efficient algorithms to compute a linear bounding volume hierarchy (LBVH) and update them using refitting methods. Moreover, we show that tight-fitting bounding volume hierarchies offer improved performance on GPU-like throughput architectures. We use our algorithms to perform discrete and continuous collision detection including self-collisions, as well as separation distance computation between non-overlapping models. In practice, our approach (gProximity) can perform these queries in a few milliseconds on a PC with NVIDIA GTX 285 card on models composed of tens or hundreds of thousands of triangles used in cloth simulation, surgical simulation, virtual prototyping and N-body simulation. Moreover, we observe more than an order of magnitude performance improvement over prior GPU-based algorithms.
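For orientation, a C++ sketch of the serial skeleton of hierarchy-based collision detection: an AABB overlap test and a tandem traversal of two bounding volume hierarchies that collects candidate primitive pairs. The paper uses tighter-fitting OBBs and RSS and runs the traversal with GPU thread and data parallelism; none of that is reproduced here.

```cpp
#include <utility>
#include <vector>

struct AABB { float lo[3], hi[3]; };

// Two axis-aligned boxes overlap iff they overlap on every axis.
bool overlap(const AABB& a, const AABB& b) {
    for (int k = 0; k < 3; ++k)
        if (a.hi[k] < b.lo[k] || b.hi[k] < a.lo[k]) return false;
    return true;
}

struct BVHNode {
    AABB box;
    int left = -1, right = -1;  // child indices; both -1 at a leaf
    int prim = -1;              // primitive index stored at a leaf
};

// Tandem traversal: prune whenever the bounding volumes are disjoint.
void collide(const std::vector<BVHNode>& A, int ia,
             const std::vector<BVHNode>& B, int ib,
             std::vector<std::pair<int, int>>& pairs) {
    if (!overlap(A[ia].box, B[ib].box)) return;
    const bool leafA = A[ia].left < 0;
    const bool leafB = B[ib].left < 0;
    if (leafA && leafB) {
        pairs.emplace_back(A[ia].prim, B[ib].prim);  // exact test done later
    } else if (leafA) {                              // descend into B
        collide(A, ia, B, B[ib].left, pairs);
        collide(A, ia, B, B[ib].right, pairs);
    } else {                                         // descend into A
        collide(A, A[ia].left, B, ib, pairs);
        collide(A, A[ia].right, B, ib, pairs);
    }
}
```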

Journal ArticleDOI
TL;DR: The empirical results show that the resulting elimination-backoff stack performs as well as the simple stack at low loads, and increasingly outperforms all other methods (lock-based and non-blocking) as concurrency increases.

Journal ArticleDOI
TL;DR: An ensemble of DDE (eDDE) algorithms with parallel populations, in which each parameter set and crossover operator is assigned to one of the populations, is presented and compared against the best-performing algorithms from the literature.

Journal ArticleDOI
TL;DR: Model-based clustering using a family of Gaussian mixture models, with parsimonious factor analysis-like covariance structure, is described and an efficient algorithm for its implementation is presented, showing its effectiveness when compared to existing software.

Proceedings ArticleDOI
09 Jan 2010
TL;DR: A novel approach to predict the sequential computation time accurately and efficiently for large-scale parallel applications on non-existing target machines is proposed and a performance prediction framework, called PHANTOM, is implemented, which integrates the above computation-time acquisition approach with a trace-driven network simulator.
Abstract: For designers of large-scale parallel computers, it is greatly desired that performance of parallel applications can be predicted at the design phase. However, this is difficult because the execution time of parallel applications is determined by several factors, including sequential computation time in each process, communication time and their convolution. Despite previous efforts, it remains an open problem to estimate sequential computation time in each process accurately and efficiently for large-scale parallel applications on non-existing target machines. This paper proposes a novel approach to predict the sequential computation time accurately and efficiently. We assume that there is at least one node of the target platform but the whole target system need not be available. We make two main technical contributions. First, we employ deterministic replay techniques to execute any process of a parallel application on a single node at real speed. As a result, we can simply measure the real sequential computation time on a target node for each process one by one. Second, we observe that computation behavior of processes in parallel applications can be clustered into a few groups while processes in each group have similar computation behavior. This observation helps us reduce measurement time significantly because we only need to execute representative parallel processes instead of all of them. We have implemented a performance prediction framework, called PHANTOM, which integrates the above computation-time acquisition approach with a trace-driven network simulator. We validate our approach on several platforms. For ASCI Sweep3D, the error of our approach is less than 5% on 1024 processor cores. Compared to a recent regression-based prediction approach, PHANTOM presents better prediction accuracy across different platforms.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: In this article, the authors describe an approach to parallel graph partitioning that scales to hundreds of processors and produces high solution quality, improving the best known partitionings for many instances from Walshaw's benchmark collection.
Abstract: We describe an approach to parallel graph partitioning that scales to hundreds of processors and produces a high solution quality. For example, for many instances from Walshaw's benchmark collection we improve the best known partitioning. We use the well known framework of multi-level graph partitioning. All components are implemented by scalable parallel algorithms. Quality improvements compared to previous systems are due to better prioritization of edges to be contracted, better approximation algorithms for identifying matchings, better local search heuristics, and perhaps most notably, a parallelization of the FM local search algorithm that works more locally than previous approaches.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: New parallel algorithms for parallel dynamic AMR on forest-of-octrees geometries with arbitrary-order continuous and discontinuous finite/spectral element discretizations for multiscale geophysics problems are presented.
Abstract: Many problems are characterized by dynamics occurring on a wide range of length and time scales. One approach to overcoming the tyranny of scales is adaptive mesh refinement/coarsening (AMR), which dynamically adapts the mesh to resolve features of interest. However, the benefits of AMR are difficult to achieve in practice, particularly on the petascale computers that are essential for difficult problems. Due to the complex dynamic data structures and frequent load balancing, scaling dynamic AMR to hundreds of thousands of cores has long been considered a challenge. Another difficulty is extending parallel AMR techniques to high-order-accurate, complex-geometry-respecting methods that are favored for many classes of problems. Here we present new parallel algorithms for parallel dynamic AMR on forest-of-octrees geometries with arbitrary-order continuous and discontinuous finite/spectral element discretizations. The implementations of these algorithms exhibit excellent weak and strong scaling to over 224,000 Cray XT5 cores for multiscale geophysics problems.

Proceedings Article
01 Jan 2010
TL;DR: In this article, the authors proposed effective parallelization strategies for the ACO metaheuristic on Graphics Processing Units (GPUs) using the Max-Min Ant System (MMAS) algorithm augmented with 3-opt local search.
Abstract: The purpose of this paper is to propose effective parallelization strategies for the Ant Colony Optimization (ACO) metaheuristic on Graphics Processing Units (GPUs). The Max-Min Ant System (MMAS) algorithm augmented with 3-opt local search is used as a framework for the implementation of the parallel ants and multiple ant colonies general parallelization approaches. The four resulting GPU algorithms are extensively evaluated and compared on both speedup and solution quality on a state-of-the-art Fermi GPU architecture. A rigorous effort is made to keep parallel algorithms true to the original MMAS applied to the Traveling Salesman Problem. We report speedups of up to 23.60 with solution quality similar to the original sequential implementation. With the intent of providing a parallelization framework for ACO on GPUs, a comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures and parallelization granularity.
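For reference, the sequential Max-Min Ant System pheromone update that such GPU implementations have to reproduce, sketched in C++; this is the generic textbook rule, not the paper's kernels, and tour construction is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// MMAS pheromone update: evaporate on every edge, deposit only on the edges
// of the best tour found, then clamp every value into [tau_min, tau_max].
void mmas_update(std::vector<std::vector<double>>& tau,
                 const std::vector<int>& best_tour, double best_len,
                 double rho, double tau_min, double tau_max) {
    for (auto& row : tau)
        for (double& t : row) t *= (1.0 - rho);             // evaporation
    const double deposit = 1.0 / best_len;
    for (std::size_t i = 0; i < best_tour.size(); ++i) {    // deposit on best tour
        const int u = best_tour[i];
        const int v = best_tour[(i + 1) % best_tour.size()];
        tau[u][v] += deposit;
        tau[v][u] += deposit;                               // symmetric TSP
    }
    for (auto& row : tau)
        for (double& t : row) t = std::clamp(t, tau_min, tau_max);
}
```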

Journal ArticleDOI
TL;DR: An efficient hybrid MPI/OpenMP parallel implementation of an innovative approach that combines the Fast Fourier Transform and the Multilevel Fast Multipole Algorithm has been successfully used to solve an electromagnetic problem involving 620 million unknowns.
Abstract: MLFMA-FFT Parallel Algorithm for the Solution of Large-Scale Problems in Electromagnetics (Invited Paper). J. M. Taboada and L. Landesa (Universidad de Extremadura) and M. G. Araújo, J. M. Bértolo, F. Obelleiro, and J. L. Rodríguez (Universidade de Vigo). An efficient hybrid MPI/OpenMP parallel implementation of an innovative approach that combines the Fast Fourier Transform (FFT) and the Multilevel Fast Multipole Algorithm (MLFMA) has been successfully used to solve an electromagnetic problem involving 620 million unknowns. The MLFMA-FFT method can deal with extremely large problems due to its high scalability and its reduced computational complexity.

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper focuses on document collections, which are characterized by a sparseness that allows effective pruning strategies, and proposes a new parallel algorithm within the MapReduce framework that outperforms the state of the art by a factor of 4.5.
Abstract: Given a collection of objects, the Similarity Self-Join problem requires discovering all pairs of objects whose similarity is above a user-defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real world data shows that our algorithm outperforms the state of the art by a factor of 4.5.

Proceedings ArticleDOI
18 Jul 2010
TL;DR: This paper provides an implementation of the Differential Evolution (DE) algorithm in C-CUDA and demonstrates that the computing time can be significantly reduced using C-CUDA.
Abstract: Several areas of knowledge are being benefited with the reduction of the computing time by using the technology of Graphics Processing Units (GPU) and the Compute Unified Device Architecture (CUDA) platform. In case of Evolutionary algorithms, which are inherently parallel, this technology may be advantageous for running experiments demanding high computing time. In this paper, we provide an implementation of the Differential Evolution (DE) algorithm in C-CUDA. The algorithm was tested on a suite of well-known benchmark optimization problems and the computing time has been compared with the same algorithm implemented in C. Results demonstrate that the computing time can significantly be reduced using C-CUDA. As far as we know, this is the first implementation of DE algorithm in C-CUDA.
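For readers unfamiliar with the algorithm being ported, here is a plain C++ sketch of one generation of classic DE/rand/1/bin; each trial vector is built and evaluated independently of the others, which is what makes the method map naturally onto CUDA threads. This is a generic textbook formulation, not the paper's C-CUDA code.

```cpp
#include <functional>
#include <random>
#include <vector>

using Vec = std::vector<double>;

// One DE/rand/1/bin generation for minimizing f.  Requires a population of
// at least four vectors; F is the mutation factor, CR the crossover rate.
void de_generation(std::vector<Vec>& pop, std::vector<double>& fitness,
                   const std::function<double(const Vec&)>& f,
                   double F, double CR, std::mt19937& rng) {
    const std::size_t NP = pop.size(), D = pop[0].size();
    std::uniform_int_distribution<std::size_t> pick(0, NP - 1), dim(0, D - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    for (std::size_t i = 0; i < NP; ++i) {         // independent per individual
        std::size_t r1, r2, r3;                    // three distinct partners != i
        do { r1 = pick(rng); } while (r1 == i);
        do { r2 = pick(rng); } while (r2 == i || r2 == r1);
        do { r3 = pick(rng); } while (r3 == i || r3 == r1 || r3 == r2);
        Vec trial = pop[i];
        const std::size_t jrand = dim(rng);        // force at least one mutated gene
        for (std::size_t j = 0; j < D; ++j)
            if (j == jrand || unit(rng) < CR)
                trial[j] = pop[r1][j] + F * (pop[r2][j] - pop[r3][j]);
        const double ft = f(trial);
        if (ft <= fitness[i]) { pop[i] = trial; fitness[i] = ft; }  // greedy selection
    }
}
```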

Proceedings ArticleDOI
24 May 2010
TL;DR: The major components of stapl are described and performance results for both algorithms and data structures showing scalability up to tens of thousands of processors are presented.
Abstract: The Standard Template Adaptive Parallel Library (stapl) is a high-productivity parallel programming framework that extends C++ and stl with unified support for shared and distributed memory parallelism. stapl provides distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. The stapl runtime system provides the abstraction for communication and program execution. In this paper, we describe the major components of stapl and present performance results for both algorithms and data structures showing scalability up to tens of thousands of processors.

Journal ArticleDOI
TL;DR: With the new code, the excellent scalability of the parallelization scheme is demonstrated in large-scale four-component multireference CI (MRCI) benchmark tests on two of the most common computer architectures, and the hardware-dependent aspects with respect to possible speedup limitations are discussed.
Abstract: We present a parallel implementation of a large-scale relativistic double-group configuration interaction (CI) program. It is applicable with a large variety of two- and four-component Hamiltonians. The parallel algorithm is based on a distributed data model in combination with a static load balancing scheme. The excellent scalability of our parallelization scheme is demonstrated in large-scale four-component multireference CI (MRCI) benchmark tests on two of the most common computer architectures, and we also discuss hardware-dependent aspects with respect to possible speedup limitations. With the new code we have been able to calculate accurate spectroscopic properties for the ground state and the first excited state of the BiH molecule using extensive basis sets. We focused, in particular, on an accurate description of the splitting of these two states which is caused by spin-orbit coupling. Our largest parallel MRCI calculation thereby comprised an expansion length of 2.7×10⁹ Slater determinants.

Proceedings ArticleDOI
09 Jan 2010
TL;DR: SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler, is designed for programming models where locality hints are provided to the runtime by the programmer or compiler; it achieves locality awareness by grouping workers into places and thereby obtains speedup over locality-oblivious scheduling.
Abstract: This poster introduces SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler. SLAW features an adaptive task scheduling algorithm combined with a locality-aware scheduling framework. Past work has demonstrated the pros and cons of using fixed scheduling policies, such as work-first and help-first, in different cases without a clear winner. Prior work also assumes the availability and successful execution of a serial version of the parallel program. This assumption can limit the expressiveness of dynamic task parallel languages. The SLAW scheduler supports both work-first and help-first policies simultaneously. It does so by using an adaptive approach that selects a scheduling policy on a per-task basis at runtime. The SLAW scheduler also establishes bounds on the stack usage and the heap space needed to store tasks. The experimental results for the benchmarks studied show that SLAW's adaptive scheduler achieves 0.98x - 9.2x speedup over the help-first scheduler and 0.97x - 4.5x speedup over the work-first scheduler for 64-thread executions, thereby establishing the robustness of using an adaptive approach instead of a fixed policy. In contrast, the help-first policy is 9.2x slower than work-first in the worst case for a fixed help-first policy, and the work-first policy is 3.7x slower than help-first in the worst case for a fixed work-first policy. Further, for large irregular recursive parallel computations, the adaptive scheduler runs with bounded stack usage and achieves performance (and supports data sizes) that cannot be delivered by the use of any single fixed policy. The SLAW scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler, and achieves locality-awareness by grouping workers into places. Locality awareness can lead to improved performance by increasing temporal data reuse within a worker and among workers in the same place. Our experimental results show that locality-aware scheduling can achieve up to 2.6x speedup over locality-oblivious scheduling, for the benchmarks studied.

Journal ArticleDOI
TL;DR: To the authors' knowledge, the multigrid algorithm presented in this work is the only matrix-free multiplicative geometric multigrid implementation for solving finite element equations on octree meshes using thousands of processors.
Abstract: In this article, we present a parallel geometric multigrid algorithm for solving variable-coefficient elliptic partial differential equations on the unit box (with Dirichlet or Neumann boundary conditions) using highly nonuniform, octree-based, conforming finite element discretizations. Our octrees are 2:1 balanced, that is, we allow no more than one octree-level difference between octants that share a face, edge, or vertex. We describe a parallel algorithm whose input is an arbitrary 2:1 balanced fine-grid octree and whose output is a set of coarser 2:1 balanced octrees that are used in the multigrid scheme. Also, we derive matrix-free schemes for the discretized finite element operators and the intergrid transfer operations. The overall scheme is second-order accurate for sufficiently smooth right-hand sides and material properties; its complexity for nearly uniform trees is $\mathcal{O}(\frac{N}{n_p}\log\frac{N}{n_p})+\mathcal{O}(n_p\log n_p)$, where $N$ is the number of octree nodes and $n_p$ is the number of processors. Our implementation uses the Message Passing Interface standard. We present numerical experiments for the Laplace and Navier (linear elasticity) operators that demonstrate the scalability of our method. Our largest run was a highly nonuniform, 8-billion-unknown, elasticity calculation using 32,000 processors on the Teragrid system, “Ranger,” at the Texas Advanced Computing Center. Our implementation is publicly available in the Dendro library, which is built on top of the PETSc library from Argonne National Laboratory.
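The paper's solver operates on octree-based finite element meshes in parallel; as a scalar analogue of the underlying multigrid idea, here is a C++ sketch of one V-cycle for the 1-D Poisson problem with weighted-Jacobi smoothing, full-weighting restriction, and linear interpolation. This is textbook material, not the Dendro implementation.

```cpp
#include <cstddef>
#include <vector>

// One V-cycle for -u'' = f on (0,1) with zero Dirichlet boundaries,
// discretized on n interior points, n = 2^k - 1, mesh width h = 1/(n+1).
using Vec = std::vector<double>;

static void jacobi(Vec& u, const Vec& f, double h, int sweeps) {
    const double w = 2.0 / 3.0;                           // weighted Jacobi
    const std::size_t n = u.size();
    for (int s = 0; s < sweeps; ++s) {
        Vec un = u;
        for (std::size_t i = 0; i < n; ++i) {
            const double ul = (i > 0) ? u[i - 1] : 0.0;
            const double ur = (i + 1 < n) ? u[i + 1] : 0.0;
            un[i] = (1.0 - w) * u[i] + w * 0.5 * (ul + ur + h * h * f[i]);
        }
        u = un;
    }
}

void v_cycle(Vec& u, const Vec& f, double h) {
    const std::size_t n = u.size();
    if (n == 1) { u[0] = 0.5 * h * h * f[0]; return; }    // exact coarsest solve
    jacobi(u, f, h, 3);                                   // pre-smoothing
    Vec r(n);                                             // residual r = f - A u
    for (std::size_t i = 0; i < n; ++i) {
        const double ul = (i > 0) ? u[i - 1] : 0.0;
        const double ur = (i + 1 < n) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - ul - ur) / (h * h);
    }
    const std::size_t nc = (n - 1) / 2;                   // coarse grid, H = 2h
    Vec rc(nc), ec(nc, 0.0);
    for (std::size_t j = 0; j < nc; ++j)                  // full-weighting restriction
        rc[j] = 0.25 * (r[2 * j] + 2.0 * r[2 * j + 1] + r[2 * j + 2]);
    v_cycle(ec, rc, 2.0 * h);                             // coarse-grid correction
    for (std::size_t i = 0; i < n; ++i) {                 // linear interpolation
        if (i % 2 == 1) {
            u[i] += ec[i / 2];
        } else {
            const double lo = (i >= 2) ? ec[i / 2 - 1] : 0.0;
            const double hi = (i / 2 < nc) ? ec[i / 2] : 0.0;
            u[i] += 0.5 * (lo + hi);
        }
    }
    jacobi(u, f, h, 3);                                   // post-smoothing
}
```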

Book ChapterDOI
30 Jun 2010
TL;DR: This paper presents a scalable, parallel algorithm for data clustering in the astronomy simulation domain that matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.
Abstract: Scientists' ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability rather than limitations in the availability of data have become the bottleneck to scientific discovery. MapReduce-style platforms hold the promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprised of 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: A novel parallel algorithm to speed up string matching on GPUs is proposed, along with a new state machine for string matching that is better suited to execution on a GPU.
Abstract: Network intrusion detection systems have been widely used to protect computer systems from network attacks. Due to the ever-increasing number of attacks and network complexity, traditional software approaches on uni-processors have become inadequate for the current high-speed network. In this paper, we propose a novel parallel algorithm to speed up string matching performed on GPUs. We also introduce a new state machine for string matching that is better suited to execution on a GPU. We describe several speedup techniques that take account of the special architectural properties of GPUs. The experimental results demonstrate that the new algorithm on GPUs achieves up to a 4,000 times speedup compared to the AC algorithm on a CPU. Compared to other GPU approaches, the new algorithm is 3 times faster with a significant improvement in memory efficiency. Furthermore, because the new algorithm reduces the complexity of the Aho-Corasick algorithm, it also improves on memory requirements.
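For context, the inner loop that such GPU string matchers parallelize is a table-driven automaton walk; a C++ sketch is below. The dense table layout and the `next`/`is_match` names are illustrative assumptions, not the paper's data structures, and building the automaton is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Table-driven multi-pattern scan: once an Aho-Corasick-style automaton has
// been compiled into a dense transition table, matching is a pure table walk.
// GPU implementations typically give each thread its own (overlapping)
// segment of the input and run this loop per segment.
int count_matches(const std::vector<int>& next,               // next[s*256 + byte]
                  const std::vector<std::uint8_t>& is_match,  // is_match[s] != 0 => accept
                  const std::uint8_t* data, std::size_t len) {
    int state = 0, hits = 0;
    for (std::size_t i = 0; i < len; ++i) {
        state = next[static_cast<std::size_t>(state) * 256 + data[i]];
        if (is_match[state]) ++hits;
    }
    return hits;
}
```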

Journal ArticleDOI
TL;DR: Numerical results demonstrate the validity and potential of the parallel AMR approach, which predicts fine-scale features of complex turbulent non-premixed flames and allows for automatic solution-directed mesh adaptation according to physics-based refinement criteria.