
Showing papers on "Speedup" published in 2010


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper proposes a 3-stage approach for end-to-end set-similarity joins in parallel using the popular MapReduce framework, and reports results from extensive experiments on real datasets to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
Abstract: In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
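
The abstract does not spell out the join algorithm itself; as a rough, hedged illustration of the underlying idea (group records by shared tokens so that only candidate pairs sharing at least one token are verified), a minimal single-machine sketch of a Jaccard self-join in map/reduce style is given below. The token-grouping scheme, the jaccard helper and the similarity thresholds are illustrative assumptions, not the paper's actual three-stage method.

    from collections import defaultdict

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def set_similarity_self_join(records, threshold=0.8):
        # "Map": emit (token, record_id) so records sharing a token meet in one group.
        groups = defaultdict(list)
        for rid, tokens in records.items():
            for tok in tokens:
                groups[tok].append(rid)
        # "Reduce": verify candidate pairs within each group, deduplicating across groups.
        seen, results = set(), []
        for rids in groups.values():
            for i in range(len(rids)):
                for j in range(i + 1, len(rids)):
                    pair = tuple(sorted((rids[i], rids[j])))
                    if pair in seen:
                        continue
                    seen.add(pair)
                    sim = jaccard(records[pair[0]], records[pair[1]])
                    if sim >= threshold:
                        results.append((pair, sim))
        return results

    recs = {1: ["new", "york", "city"], 2: ["new", "york"], 3: ["los", "angeles"]}
    print(set_similarity_self_join(recs, threshold=0.6))   # [((1, 2), 0.666...)]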

516 citations


Proceedings ArticleDOI
01 Apr 2010
TL;DR: This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure and demonstrates that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers with near linear speedup.
Abstract: This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multi-core processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this including: direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification. Our results demonstrate that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers. Performance scales well as more machines are added with near linear speedup in many cases. Simulation slowdown is as low as 41× versus native execution.

498 citations


Proceedings ArticleDOI
28 Mar 2010
TL;DR: This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU, exposing undocumented features that impact program performance and correctness.
Abstract: Graphics processors (GPU) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. Because the GPU is often presented as a C-like abstraction (e.g., Nvidia's CUDA), little is known about the characteristics of the GPU's architecture beyond what the manufacturer has documented. This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured. This analysis exposes undocumented features that impact program performance and correctness. These measurements can be useful for improving performance optimization, analysis, and modeling on this architecture and offer additional insight on the decisions made in developing this GPU.

471 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: By carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5, and is thus more comparable to that of parallel database systems.
Abstract: MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems in performing a variety of analytical tasks by a factor of 3.1 to 6.5. MapReduce can achieve better performance with the allocation of more compute nodes from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users desire an economical elastically scalable data processing system, and therefore, are interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop, and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], and is thus more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.

426 citations


Journal ArticleDOI
TL;DR: It is suggested that GPUs have the potential to facilitate the growth of statistical modeling into complex data-rich domains through the availability of cheap and accessible many-core computation.
Abstract: We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35- to 500-fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data-rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.
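
The paper's GPU kernels are not reproduced here; as a hedged illustration of why population-based Monte Carlo maps well onto data-parallel hardware, the sketch below advances many independent simulation "particles" in one vectorized NumPy operation, a stand-in for what a GPU would do across thousands of threads. The toy target (estimating E[x^2] under a standard normal) is an assumption for illustration only.

    import numpy as np

    def parallel_monte_carlo(n_particles=1_000_000, seed=0):
        # Each particle is an independent draw; the whole population is updated
        # in one vectorized step, which is the structure GPUs exploit.
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(n_particles)
        return np.mean(x ** 2)   # should be close to 1.0

    print(parallel_monte_carlo())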

334 citations


Proceedings ArticleDOI
Anthony Nguyen1, Nadathur Satish1, Jatin Chhugani1, Changkyu Kim1, Pradeep Dubey1 
13 Nov 2010
TL;DR: A novel 3.5D-blocking algorithm is presented that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs, scaling near-linearly with SIMD width and core count.
Abstract: Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows slower than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and multiple cores. Our performance numbers are faster or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs, and 1.8X faster on GPUs for single-precision floating point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup number on CPUs is 2.1X.
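
For readers unfamiliar with the kernel being blocked, a plain (unblocked) 7-point stencil sweep looks roughly like the sketch below; the paper's 3.5D algorithm additionally tiles this loop nest in space and time to keep the working set in on-chip memory. The coefficients and grid size here are illustrative assumptions.

    import numpy as np

    def sweep_7pt(grid, c0=0.5, c1=0.1):
        # One time step of a 7-point stencil: each interior cell is updated from
        # itself and its six face neighbours. Memory traffic, not arithmetic,
        # dominates, which is why the paper blocks in space and time.
        out = grid.copy()
        out[1:-1, 1:-1, 1:-1] = (
            c0 * grid[1:-1, 1:-1, 1:-1]
            + c1 * (grid[:-2, 1:-1, 1:-1] + grid[2:, 1:-1, 1:-1]
                    + grid[1:-1, :-2, 1:-1] + grid[1:-1, 2:, 1:-1]
                    + grid[1:-1, 1:-1, :-2] + grid[1:-1, 1:-1, 2:])
        )
        return out

    g = np.random.rand(64, 64, 64).astype(np.float32)
    for _ in range(4):          # four time steps
        g = sweep_7pt(g)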

299 citations


Journal ArticleDOI
TL;DR: A high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI.

278 citations


Proceedings Article
11 Jul 2010
TL;DR: It is shown that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging "equivalent" stacks based on feature values, and the final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster.
Abstract: Incremental parsing techniques such as shift-reduce have gained popularity thanks to their efficiency, but there remains a major problem: the search is greedy and only explores a tiny fraction of the whole space (even with beam search) as opposed to dynamic programming. We show that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging "equivalent" stacks based on feature values. Empirically, our algorithm yields up to a five-fold speedup over a state-of-the-art shift-reduce dependency parser with no loss in accuracy. Better search also leads to better learning, and our final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster.

253 citations


Proceedings ArticleDOI
19 Apr 2010
TL;DR: In this article, the authors present a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA.
Abstract: Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. Results show that our generalized methodology delivers significant performance gains of up to 22× speedup over the reference serial implementation. Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems.
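
The framework's code generator is not shown in the abstract; the essential auto-tuning loop it relies on (generate candidate variants, benchmark each, keep the fastest) can be sketched in a few lines. The candidate block sizes and the toy kernel below are assumptions for illustration, not the framework's search space.

    import time
    import numpy as np

    def blocked_sum(a, block):
        # Toy stand-in for a tunable stencil kernel: process the array in tiles.
        total = 0.0
        for i in range(0, a.shape[0], block):
            for j in range(0, a.shape[1], block):
                total += a[i:i + block, j:j + block].sum()
        return total

    def autotune(a, candidates=(16, 32, 64, 128, 256)):
        # Benchmark each candidate configuration and keep the fastest one.
        best = None
        for block in candidates:
            t0 = time.perf_counter()
            blocked_sum(a, block)
            elapsed = time.perf_counter() - t0
            if best is None or elapsed < best[1]:
                best = (block, elapsed)
        return best

    a = np.random.rand(2048, 2048)
    print(autotune(a))   # (best_block_size, seconds)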

243 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy that guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
Abstract: Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
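
The hierarchical queue management is GPU-specific, but the level-synchronous BFS that such kernels parallelize is simple; a sequential sketch with the work-optimal O(V+E) structure is shown below (adjacency-list graph representation assumed for illustration).

    def bfs_levels(adj, source):
        # Level-synchronous BFS: each 'frontier' is one level; GPU versions expand
        # all frontier vertices in parallel instead of popping them one by one.
        dist = {source: 0}
        frontier = [source]
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        next_frontier.append(v)
            frontier = next_frontier
        return dist

    adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(bfs_levels(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}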

235 citations


Proceedings ArticleDOI
09 Jan 2010
TL;DR: To combine the benefits of the basic algorithms, this work proposes hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively.
Abstract: We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.
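
The paper's hybrid CR+PCR and CR+RD solvers and their GPU memory optimizations are beyond a short listing, but plain parallel cyclic reduction (PCR) itself is compact: at every step each equation eliminates its two neighbours independently, so all rows can be processed in parallel. Below is a minimal sequential sketch of PCR for a diagonally dominant tridiagonal system; the per-row loop body is what one GPU thread per row would execute.

    import numpy as np

    def pcr_solve(a, b, c, d):
        # Parallel cyclic reduction for a tridiagonal system
        #   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]   (a[0] = c[-1] = 0).
        # Assumes a well-conditioned (e.g., diagonally dominant) system.
        a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
        n = len(b)

        def eq(i):
            # Return row i, or the identity equation 1*x = 0 outside the range.
            if 0 <= i < n:
                return a[i], b[i], c[i], d[i]
            return 0.0, 1.0, 0.0, 0.0

        stride = 1
        while stride < n:
            na, nb, nc, nd = a.copy(), b.copy(), c.copy(), d.copy()
            for i in range(n):                    # every row updated independently
                am, bm, cm, dm = eq(i - stride)   # equation above
                ap, bp, cp, dp = eq(i + stride)   # equation below
                k1 = a[i] / bm
                k2 = c[i] / bp
                na[i] = -k1 * am
                nb[i] = b[i] - k1 * cm - k2 * ap
                nc[i] = -k2 * cp
                nd[i] = d[i] - k1 * dm - k2 * dp
            a, b, c, d = na, nb, nc, nd
            stride *= 2
        return d / b          # each equation is now decoupled: b[i]*x[i] = d[i]

    # Example: solve a small diagonally dominant system and check the result.
    n = 8
    a = [0.0] + [-1.0] * (n - 1)
    c = [-1.0] * (n - 1) + [0.0]
    b = [4.0] * n
    x_true = np.arange(1.0, n + 1)
    d = np.array(b) * x_true + np.array(a) * np.r_[0.0, x_true[:-1]] + np.array(c) * np.r_[x_true[1:], 0.0]
    print(np.allclose(pcr_solve(a, b, c, d), x_true))   # True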

Proceedings ArticleDOI
11 Sep 2010
TL;DR: This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors using a parallel scan stream primitive that has been generalized in two ways: with local interfaces for producer/consumer operations (visiting logic), and with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
Abstract: This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
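
The GPGPU multi-scan machinery is beyond a short sketch, but the core of one radix sort pass (histogram the current digit, prefix-scan the counts, then scatter) can be shown sequentially; GPU radix sorts parallelize exactly these three phases with scan primitives. The digit width and 32-bit key assumption below are illustrative.

    def radix_sort_u32(keys, bits_per_pass=8):
        # LSD radix sort: each pass is histogram -> exclusive prefix scan -> scatter.
        radix = 1 << bits_per_pass
        mask = radix - 1
        for shift in range(0, 32, bits_per_pass):
            # 1) histogram of the current digit
            counts = [0] * radix
            for k in keys:
                counts[(k >> shift) & mask] += 1
            # 2) exclusive prefix scan turns counts into output offsets
            offsets, running = [0] * radix, 0
            for dgt in range(radix):
                offsets[dgt] = running
                running += counts[dgt]
            # 3) stable scatter into the output positions
            out = [0] * len(keys)
            for k in keys:
                dgt = (k >> shift) & mask
                out[offsets[dgt]] = k
                offsets[dgt] += 1
            keys = out
        return keys

    print(radix_sort_u32([170, 45, 75, 90, 802, 24, 2, 66]))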

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A general method for analyzing nondeterministic programs that use reducers is provided, and it is shown that for a graph G=(V,E) with diameter D and bounded out-degree, the data-race-free version of the PBFS algorithm attains near-perfect linear speedup if P << (V+E)/(D lg³(V/D)).
Abstract: We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices -- a condition met by many real-world graphs -- PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" -- a "hyperobject" feature of Cilk++ -- the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS also is nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time O((V+E)/P + D lg³(V/D)) on P processors, which means that it attains near-perfect linear speedup if P << (V+E)/(D lg³(V/D)).

Journal ArticleDOI
TL;DR: In this article, the authors compare the dynamics of photosynthetic light-harvesting systems to quantum walks, in order to elucidate the limits of such quantum speedups.
Abstract: It has been suggested that excitation transport in photosynthetic light-harvesting complexes features speedups analogous to those found in quantum algorithms. Here we compare the dynamics in these light-harvesting systems to the dynamics of quantum walks, in order to elucidate the limits of such quantum speedups. For the Fenna–Matthews–Olson complex of green sulfur bacteria, we show that while there is indeed speedup at short times, this is short lived (70 fs) despite longer-lived (ps) quantum coherence. Remarkably, this timescale is independent of the details of the decoherence model. More generally, we show that the distinguishing features of light-harvesting complexes not only limit the extent of quantum speedup but also reduce the rates of diffusive transport. These results suggest that quantum coherent effects in biological systems are optimized for efficiency or robustness rather than the more elusive goal of quantum speedup.

Proceedings ArticleDOI
12 Jul 2010
TL;DR: This work proposes a technique, Simple Static Partitioning, for parallelizing symbolic execution, which uses a set of pre-conditions to partition the symbolic execution tree, allowing us to effectively distribute symbolic execution and decrease the time needed to explore the symbolic executions tree.
Abstract: Symbolic execution is a popular technique for automatically generating test cases achieving high structural coverage. Symbolic execution suffers from scalability issues since the number of symbolic paths that need to be explored is very large (or even infinite) for most realistic programs. To address this problem, we propose a technique, Simple Static Partitioning, for parallelizing symbolic execution. The technique uses a set of pre-conditions to partition the symbolic execution tree, allowing us to effectively distribute symbolic execution and decrease the time needed to explore the symbolic execution tree. The proposed technique requires little communication between parallel instances and is designed to work with a variety of architectures, ranging from fast multi-core machines to cloud or grid computing environments. We implement our technique in the Java PathFinder verification tool-set and evaluate it on six case studies with respect to the performance improvement when exploring a finite symbolic execution tree and performing automatic test generation. We demonstrate speedup in both the analysis time over finite symbolic execution trees and in the time required to generate tests relative to sequential execution, with a maximum analysis time speedup of 90x observed using 128 workers and a maximum test generation speedup of 70x observed using 64 workers.

Journal ArticleDOI
TL;DR: A new paradigm for eight-connection labeling is defined, which employs a general approach to improve neighborhood exploration and minimizes the number of memory accesses, and a new scanning technique that moves on a 2 × 2 pixel grid over the image, which is optimized by the automatically generated decision tree.
Abstract: In this paper, we define a new paradigm for eight-connection labeling, which employs a general approach to improve neighborhood exploration and minimizes the number of memory accesses. First, we exploit and extend the decision table formalism introducing or-decision tables, in which multiple alternative actions are managed. An automatic procedure to synthesize the optimal decision tree from the decision table is used, providing the most effective conditions evaluation order. Second, we propose a new scanning technique that moves on a 2 × 2 pixel grid over the image, which is optimized by the automatically generated decision tree. An extensive comparison with the state of art approaches is proposed, both on synthetic and real datasets. The synthetic dataset is composed of random images of different sizes and densities, while the real datasets are an artistic image analysis dataset, a document analysis dataset for text detection and recognition, and finally a standard resolution dataset for picture segmentation tasks. The algorithm provides an impressive speedup over the state of the art algorithms.
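
The or-decision-table synthesis and the 2 × 2 scanning grid are the paper's contributions; the baseline two-scan eight-connectivity labeling that they accelerate can be sketched with a union-find, as below. The binary NumPy image representation and the simple equivalence handling are assumptions for illustration.

    import numpy as np

    def label_8connected(img):
        # Two-scan connected-component labeling with union-find.
        # First scan: assign provisional labels and record equivalences.
        # Second scan: replace each provisional label with its representative.
        parent = [0]                       # parent[label] -> representative

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        def union(x, y):
            rx, ry = find(x), find(y)
            if rx != ry:
                parent[max(rx, ry)] = min(rx, ry)

        h, w = img.shape
        labels = np.zeros((h, w), dtype=int)
        for y in range(h):
            for x in range(w):
                if not img[y, x]:
                    continue
                # Eight-connectivity: look at the four already-visited neighbours.
                neigh = [labels[y + dy, x + dx]
                         for dy, dx in ((-1, -1), (-1, 0), (-1, 1), (0, -1))
                         if 0 <= y + dy and 0 <= x + dx < w and labels[y + dy, x + dx]]
                if not neigh:
                    parent.append(len(parent))
                    labels[y, x] = len(parent) - 1
                else:
                    labels[y, x] = min(neigh)
                    for n in neigh:
                        union(labels[y, x], n)
        # Second scan: flatten equivalences.
        for y in range(h):
            for x in range(w):
                if labels[y, x]:
                    labels[y, x] = find(labels[y, x])
        return labels

    img = np.array([[1, 1, 0, 0],
                    [0, 1, 0, 1],
                    [0, 0, 0, 1]])
    print(label_8connected(img))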

Proceedings ArticleDOI
19 Apr 2010
TL;DR: SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler, and achieves locality-awareness by grouping workers into places and achieves speedup over locality-oblivious scheduling.
Abstract: This paper introduces SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler. The SLAW scheduler is designed to address two common limitations in current work-stealing schedulers: use of a fixed task scheduling policy and locality-obliviousness due to randomized stealing. Past work has demonstrated the pros and cons of using fixed scheduling policies, such as work-first and help-first, in different cases without a clear win for one policy over the other. The SLAW scheduler addresses this limitation by supporting both work-first and help-first policies simultaneously. It does so by using an adaptive approach that selects a scheduling policy on a per-task basis at runtime. The SLAW scheduler also establishes bounds on the stack and heap space needed to store tasks. The experimental results for the benchmarks studied in this paper show that SLAW's adaptive scheduler achieves 0.98× to 9.2× speedup over the help-first scheduler and 0.97× to 4.5× speedup over the work-first scheduler for 64-thread executions, thereby establishing the robustness of using an adaptive approach instead of a fixed policy. In contrast, the help-first policy is 9.2× slower than work-first in the worst case for a fixed help-first policy, and the work-first policy is 3.7× slower than help-first in the worst case for a fixed work-first policy. Further, for large irregular recursive parallel computations, the adaptive scheduler runs with bounded stack usage and achieves performance (and supports data sizes) that cannot be delivered by the use of any single fixed policy. It is also known that work-stealing schedulers can be cache-unfriendly for some applications due to randomized stealing. The SLAW scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler, and achieves locality-awareness by grouping workers into places. Locality awareness can lead to improved performance by increasing temporal data reuse within a worker and among workers in the same place. Our experimental results show that locality-aware scheduling can achieve up to 2.6× speedup over locality-oblivious scheduling, for the benchmarks studied in this paper.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: This work presents a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory to overcome data latencies and provide significant speedup over alternative approaches.
Abstract: Processing large graphs is becoming increasingly important for many domains such as social networks, bioinformatics, etc. Unfortunately, many algorithms and implementations do not scale with increasing graph sizes. As a result, researchers have attempted to meet the growing data demands using parallel and external memory techniques. We present a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory. Our highly parallel asynchronous approach hides data latency due to both poor locality and delays in the underlying graph data storage. We present an experimental study applying our technique to both In-Memory and Semi-External Memory graphs utilizing multi-core processors and solid-state memory devices. Our experiments using synthetic and real-world datasets show that our asynchronous approach is able to overcome data latencies and provide significant speedup over alternative approaches. For example, on billion vertex graphs our asynchronous BFS scales up to 14x on 16-cores.

Proceedings ArticleDOI
Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang
21 Feb 2010
TL;DR: FPMR, a MapReduce framework on FPGA, which provides programming abstraction, hardware architecture, and basic building blocks to developers so that more attention can be paid to the application itself and the speedup of this framework is demonstrated.
Abstract: Machine learning and data mining are gaining increasing attention from the computing society. FPGA provides a highly parallel, low power, and flexible hardware platform for this domain, while the difficulty of programming FPGA greatly limits its prevalence. MapReduce is a parallel programming framework that could easily utilize inherent parallelism in algorithms. In this paper, we describe FPMR, a MapReduce framework on FPGA, which provides programming abstraction, hardware architecture, and basic building blocks to developers. An on-chip processor scheduler is implemented to maximize the utilization of computation resources and achieve better load balancing. An efficient data access scheme is carefully designed to maximize data reuse and throughput. Meanwhile, the FPMR framework hides the task control, synchronization, and communication away from designers so that more attention can be paid to the application itself. A case study of RankBoost acceleration based on FPMR demonstrates that FPMR efficiently helps with development productivity; the speedup is 31.8x versus a CPU-based implementation. This performance is comparable to a fully manually designed version, which achieves 33.5x speedup. Two other applications, SVM and PageRank, are also discussed to show the generalization of the framework.

Proceedings ArticleDOI
Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, Haibo Lin
11 Sep 2010
TL;DR: This research presents a novel and scalable approach to solve the problem of high development and maintenance cost of writing GPU-specific code with low-level GPU APIs such as CUDA.
Abstract: Graphics Processing Units (GPU) have been playing an important role in the general purpose computing market recently. The common approach to program GPUs today is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers are required to write a specific version of code for each potential target architecture. This results in high development and maintenance cost. We believe it is desirable to have a programming model which provides source code portability between CPUs and GPUs, and across different GPUs: programmers only need to write one version of code, which can be compiled and executed efficiently on either CPUs or GPUs without modification. In this paper, we propose MapCG, a MapReduce framework to provide source code level portability between CPU and GPU. Different from OpenCL, our framework is based on MapReduce, which provides a high-level programming model, making programming much easier. We describe the design of the MapReduce-based high-level programming language and the underlying runtime system to enable portability between CPU and GPU. A prototype of the MapCG runtime was implemented, supporting multi-core CPUs and NVIDIA GPUs. Experiments show that our implementation can execute the same source code efficiently on multi-core CPU platforms and GPUs, achieving an average of 1.6-2.5x speedup over previous implementations of MapReduce on eight commonly used applications.

Journal ArticleDOI
TL;DR: This paper proposes a grid-alignment scheme and associated data structures that greatly reduce the complexity of the registration algorithm, and develops highly data parallel designs for B-spline registration within the stream-processing model, suitable for implementation on multi-core processors such as graphics processing units (GPUs).
Abstract: Spline-based deformable registration methods are quite popular within the medical-imaging community due to their flexibility and robustness. However, they require a large amount of computing time to obtain adequate results. This paper makes two contributions towards accelerating B-spline-based registration. First, we propose a grid-alignment scheme and associated data structures that greatly reduce the complexity of the registration algorithm. Based on this grid-alignment scheme, we then develop highly data parallel designs for B-spline registration within the stream-processing model, suitable for implementation on multi-core processors such as graphics processing units (GPUs). Particular attention is focused on an optimal method for performing analytic gradient computations in a data parallel fashion. CPU and GPU versions are validated for execution time and registration quality. Performance results on large images show that our GPU algorithm achieves a speedup of 15 times over the single-threaded CPU implementation whereas our multi-core CPU algorithm achieves a speedup of 8 times over the single-threaded implementation. The CPU and GPU versions achieve near-identical registration quality in terms of RMS differences between the generated vector fields.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: Experimental results show that the proposed task-based dynamic load-balancing solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workload, and achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.
Abstract: The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in current GPU programming APIs, such as NVIDIA's CUDA. We evaluate our approach using both micro-benchmarks and a molecular dynamics application that exhibits significant load imbalance. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workloads. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.

Proceedings ArticleDOI
01 Apr 2010
TL;DR: In this paper, the authors propose interval simulation, which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model.
Abstract: Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator, show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs.
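
The mechanistic model itself is detailed in the paper; its flavour (a base dispatch time plus penalties for the miss events that end each interval) can be conveyed in a few lines. The event counts and penalties below are made-up inputs for illustration, not values from the paper.

    def interval_model_cycles(instructions, dispatch_width, miss_events):
        # First-order interval model: absent miss events, a balanced core sustains
        # `dispatch_width` instructions per cycle; each miss event (branch
        # mispredict, cache/TLB miss) adds its penalty on top.
        base = instructions / dispatch_width
        penalty = sum(count * cost for count, cost in miss_events.values())
        return base + penalty

    events = {                      # (count, penalty in cycles) -- illustrative only
        "branch_mispredict": (2_000_000, 15),
        "l2_miss":           (500_000, 40),
        "llc_miss":          (100_000, 200),
    }
    cycles = interval_model_cycles(1_000_000_000, dispatch_width=4, miss_events=events)
    print(f"estimated CPI = {cycles / 1_000_000_000:.2f}")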

Proceedings ArticleDOI
04 Feb 2010
TL;DR: By proposing optimization strategies that allow short-circuiting score computations in additive learning systems, this paper is able to speed up the score computation process by more than four times with almost no loss in result quality.
Abstract: Some commercial web search engines rely on sophisticated machine learning systems for ranking web documents. Due to very large collection sizes and tight constraints on query response times, online efficiency of these learning systems forms a bottleneck. An important problem in such systems is to speed up the ranking process without sacrificing much from the quality of results. In this paper, we propose optimization strategies that allow short-circuiting score computations in additive learning systems. The strategies are evaluated over a state-of-the-art machine learning system and a large, real-life query log obtained from Yahoo!. With the proposed strategies, we are able to speed up the score computations by more than four times with almost no loss in result quality.
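
The paper's specific short-circuiting strategies are evaluated on a production system and are not reproduced here; the generic idea (stop accumulating an additive score once it can no longer reach the current cut-off) can be sketched as below. The stage structure, scores and threshold are illustrative assumptions.

    def score_with_early_exit(doc_features, stages, cutoff):
        # Additive learning system: the final score is a sum of per-stage scores.
        # If the partial score plus the best the remaining stages could add can
        # no longer reach `cutoff`, stop early and skip the remaining stages.
        remaining_max = sum(s["max_score"] for s in stages)
        score = 0.0
        for stage in stages:
            remaining_max -= stage["max_score"]
            score += stage["scorer"](doc_features)
            if score + remaining_max < cutoff:
                return None              # cannot make the cut; short-circuit
        return score

    stages = [
        {"scorer": lambda f: 0.3 * f["bm25"],      "max_score": 3.0},
        {"scorer": lambda f: 0.5 * f["clicks"],    "max_score": 5.0},
        {"scorer": lambda f: 0.2 * f["freshness"], "max_score": 2.0},
    ]
    print(score_with_early_exit({"bm25": 4.0, "clicks": 1.0, "freshness": 0.5}, stages, cutoff=6.0))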

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This work argues that we are now close to exhausting all possible speedup from software and must turn to hardware-based solutions, and investigates both GPU- and FPGA-based acceleration of subsequence similarity search under the DTW measure.
Abstract: Many time series data mining problems require subsequence similarity search as a subroutine. Dozens of similarity/distance measures have been proposed in the last decade and there is increasing evidence that Dynamic Time Warping (DTW) is the best measure across a wide range of domains. Given DTW’s usefulness and ubiquity, there has been a large community-wide effort to mitigate its relative lethargy. Proposed speedup techniques include early abandoning strategies, lower-bound based pruning, indexing and embedding. In this work we argue that we are now close to exhausting all possible speedup from software, and that we must turn to hardware-based solutions. With this motivation, we investigate both GPU (Graphics Processing Unit) and FPGA (Field Programmable Gate Array) based acceleration of subsequence similarity search under the DTW measure. As we shall show, our novel algorithms allow GPUs to achieve two orders of magnitude speedup and FPGAs to produce four orders of magnitude speedup. We conduct detailed case studies on the classification of astronomical observations and demonstrate that our ideas allow us to tackle problems that would be untenable otherwise.
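
For context, the DTW computation that all of these speedup efforts target is a simple dynamic program over a cost matrix; a plain Python version is sketched below (full DTW, without the lower-bounding or warping-window constraints discussed in the literature).

    import math

    def dtw_distance(a, b):
        # Classic O(len(a)*len(b)) dynamic program: cell (i, j) holds the cost of
        # the best warping path aligning a[:i] with b[:j].
        n, m = len(a), len(b)
        dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = (a[i - 1] - b[j - 1]) ** 2
                dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                      dp[i][j - 1],      # deletion
                                      dp[i - 1][j - 1])  # match
        return math.sqrt(dp[n][m])

    print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 2, 1]))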

Proceedings ArticleDOI
11 Sep 2010
TL;DR: MapReduce, a simple and elegant programming model to program large scale clusters, has recently been shown to be a promising alternative to harness the multicore platform.
Abstract: The prevalence of chip multiprocessors opens opportunities of running data-parallel applications originally in clusters on a single machine with many cores. MapReduce, a simple and elegant programming model to program large scale clusters, has recently been shown to be a promising alternative to harness the multicore platform. The differences such as memory hierarchy and communication patterns between clusters and multicore platforms raise new challenges to design and implement an efficient MapReduce system on multicore. This paper argues that it is more efficient for MapReduce to iteratively process small chunks of data in turn than processing a large chunk of data at one time on shared memory multicore platforms. Based on the argument, we extend the general MapReduce programming model with a "tiling strategy", called Tiled-MapReduce (TMR). TMR partitions a large MapReduce job into a number of small sub-jobs and iteratively processes one sub-job at a time with efficient use of resources; TMR finally merges the results of all sub-jobs for output. Based on Tiled-MapReduce, we design and implement several optimizing techniques targeting multicore, including the reuse of input and intermediate data structures among sub-jobs, a NUCA/NUMA-aware scheduler, and pipelining a sub-job's reduce phase with the successive sub-job's map phase, to optimize the memory, cache and CPU resources accordingly. We have implemented a prototype of Tiled-MapReduce based on Phoenix, an already highly optimized MapReduce runtime for shared memory multiprocessors. The prototype, namely Ostrich, runs on an Intel machine with 16 cores. Experiments on four different types of benchmarks show that Ostrich saves up to 85% memory, causes fewer cache misses and makes more efficient use of CPU cores, resulting in a speedup ranging from 1.2X to 3.3X.

Proceedings ArticleDOI
19 Jun 2010
TL;DR: It is shown that parallel performance is not only limited by sequential code but is also fundamentally limited by synchronization through critical sections, and the surprising result that the impact of critical sections on parallel performance can be modeled as a completely sequential part and a completely parallel part.
Abstract: This paper presents a fundamental law for parallel performance: it shows that parallel performance is not only limited by sequential code (as suggested by Amdahl's law) but is also fundamentally limited by synchronization through critical sections. Extending Amdahl's software model to include critical sections, we derive the surprising result that the impact of critical sections on parallel performance can be modeled as a completely sequential part and a completely parallel part. The sequential part is determined by the probability for entering a critical section and the contention probability (i.e., multiple threads wanting to enter the same critical section). This fundamental result reveals at least three important insights for multicore design. (i) Asymmetric multicore processors deliver less performance benefits relative to symmetric processors than suggested by Amdahl's law, and in some cases even worse performance. (ii) Amdahl's law suggests many tiny cores for optimum performance in asymmetric processors, however, we find that fewer but larger small cores can yield substantially better performance. (iii) Executing critical sections on the big core can yield substantial speedups, however, performance is sensitive to the accuracy of the critical section contention predictor.
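
Classic Amdahl scaling and the paper's qualitative extension can be compared with a couple of lines of arithmetic. The sketch below treats the contended fraction of critical-section work as effectively sequential, following the abstract's statement that critical sections contribute a completely sequential part and a completely parallel part; the specific fractions are illustrative assumptions, and the probabilistic model in the paper is more refined.

    def amdahl_speedup(f_parallel, n):
        # Classic Amdahl's law: only f_parallel of the work scales with n cores.
        return 1.0 / ((1.0 - f_parallel) + f_parallel / n)

    def speedup_with_critical_sections(f_seq, f_cs, p_contention, n):
        # Rough extension: the contended share of critical-section time behaves
        # sequentially; the uncontended share scales like ordinary parallel work.
        seq = f_seq + f_cs * p_contention
        par = 1.0 - seq
        return 1.0 / (seq + par / n)

    print(amdahl_speedup(0.95, 64))                               # ~15.4x
    print(speedup_with_critical_sections(0.05, 0.20, 0.5, 64))    # noticeably lower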

Proceedings ArticleDOI
13 Nov 2010
TL;DR: This work presents the first full CUDA porting of the high-resolution weather prediction model ASUCA, the first such porting known to date, and demonstrates over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision.
Abstract: Regional weather forecasting demands fast simulation over fine-grained grids, resulting in extremely memory-bottlenecked computation, a difficult problem on conventional supercomputers. Early work on accelerating the mainstream weather code WRF using GPUs with their high memory performance, however, resulted in only minor speedup due to partial GPU porting of the huge code. Our full CUDA porting of the high-resolution weather prediction model ASUCA is the first such porting we know of to date; ASUCA is a next-generation, production weather code developed by the Japan Meteorological Agency, similar to WRF in the underlying physics (non-hydrostatic model). Benchmarks on the 528-GPU (NVIDIA GT200 Tesla) TSUBAME Supercomputer at the Tokyo Institute of Technology demonstrated over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision for a 6956 x 6052 x 48 mesh. Further benchmarks on TSUBAME 2.0, which will embody over 4000 NVIDIA Fermi GPUs and be deployed in October 2010, will be presented.

Book ChapterDOI
08 Nov 2010
TL;DR: A novel method that, given a sequence of synchronized views of a human hand, recovers its 3D position, orientation and full articulation parameters using Particle Swarm Optimization and achieves a speedup of two orders of magnitude over the case of CPU processing.
Abstract: We present a novel method that, given a sequence of synchronized views of a human hand, recovers its 3D position, orientation and full articulation parameters. The adopted hand model is based on properly selected and assembled 3D geometric primitives. Hypothesized configurations/poses of the hand model are projected to different camera views and image features such as edge maps and hand silhouettes are computed. An objective function is then used to quantify the discrepancy between the predicted and the actual, observed features. The recovery of the 3D hand pose amounts to estimating the parameters that minimize this objective function which is performed using Particle Swarm Optimization. All the basic components of the method (feature extraction, objective function evaluation, optimization process) are inherently parallel. Thus, a GPU-based implementation achieves a speedup of two orders of magnitude over the case of CPU processing. Extensive experimental results demonstrate qualitatively and quantitatively that accurate 3D pose recovery of a hand can be achieved robustly at a rate that greatly outperforms the current state of the art.
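
The hand model and GPU rendering pipeline cannot be reproduced here, but the Particle Swarm Optimization loop at the heart of the method is generic; a minimal sketch minimizing a toy objective is shown below. The hyper-parameters are common textbook values and the quadratic objective is a stand-in for the paper's image-discrepancy function, not its actual formulation.

    import numpy as np

    def pso_minimize(objective, dim, n_particles=32, iters=100,
                     w=0.72, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
        # Each particle keeps a position, a velocity, and its personal best;
        # all particles are attracted toward the global best found so far.
        rng = np.random.default_rng(seed)
        lo, hi = bounds
        x = rng.uniform(lo, hi, (n_particles, dim))
        v = np.zeros((n_particles, dim))
        pbest, pbest_val = x.copy(), np.array([objective(p) for p in x])
        g = pbest[np.argmin(pbest_val)].copy()
        for _ in range(iters):
            r1, r2 = rng.random((2, n_particles, dim))
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
            x = np.clip(x + v, lo, hi)
            vals = np.array([objective(p) for p in x])
            improved = vals < pbest_val
            pbest[improved], pbest_val[improved] = x[improved], vals[improved]
            g = pbest[np.argmin(pbest_val)].copy()
        return g, pbest_val.min()

    # Toy objective standing in for the multi-view image-discrepancy function.
    print(pso_minimize(lambda p: np.sum((p - 1.0) ** 2), dim=4))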

Journal ArticleDOI
TL;DR: This work shows that effective use of the GPU requires a novel reformulation of the Smith-Waterman algorithm, and indicates that for large problems a single GPU is up to 45 times faster than a CPU for this application.
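
The GPU reformulation is the paper's contribution; the underlying Smith-Waterman recurrence that any implementation must compute is shown below as a straightforward (and deliberately unoptimized) dynamic program with linear gap penalties. The scoring values are common defaults, not necessarily those used in the paper.

    def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
        # Local alignment score: H[i][j] is the best alignment ending at
        # seq1[i-1] / seq2[j-1]; negative prefixes are clamped to zero.
        n, m = len(seq1), len(seq2)
        H = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if seq1[i - 1] == seq2[j - 1] else mismatch
                H[i][j] = max(0,
                              H[i - 1][j - 1] + s,   # align the two characters
                              H[i - 1][j] + gap,     # gap in seq2
                              H[i][j - 1] + gap)     # gap in seq1
                best = max(best, H[i][j])
        return best

    print(smith_waterman("GGTTGACTA", "TGTTACGG"))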