
Showing papers on "Speedup" published in 2010


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper proposes a 3-stage approach for end-to-end set-similarity joins in parallel using the popular MapReduce framework, and reports results from extensive experiments on real datasets to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
Abstract: In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
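
The abstract does not spell out the join algorithm itself; as a rough, hedged illustration of the underlying idea (group records by shared tokens so that only candidate pairs sharing at least one token are verified), a minimal single-machine sketch of a Jaccard self-join in map/reduce style is given below. The token-grouping scheme, the jaccard helper and the similarity thresholds are illustrative assumptions, not the paper's actual three-stage method.

    from collections import defaultdict

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def set_similarity_self_join(records, threshold=0.8):
        # "Map": emit (token, record_id) so records sharing a token meet in one group.
        groups = defaultdict(list)
        for rid, tokens in records.items():
            for tok in tokens:
                groups[tok].append(rid)
        # "Reduce": verify candidate pairs within each group, deduplicating across groups.
        seen, results = set(), []
        for rids in groups.values():
            for i in range(len(rids)):
                for j in range(i + 1, len(rids)):
                    pair = tuple(sorted((rids[i], rids[j])))
                    if pair in seen:
                        continue
                    seen.add(pair)
                    sim = jaccard(records[pair[0]], records[pair[1]])
                    if sim >= threshold:
                        results.append((pair, sim))
        return results

    recs = {1: ["new", "york", "city"], 2: ["new", "york"], 3: ["los", "angeles"]}
    print(set_similarity_self_join(recs, threshold=0.6))   # [((1, 2), 0.666...)]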

516 citations


Proceedings ArticleDOI
01 Apr 2010
TL;DR: This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure and demonstrates that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers with near linear speedup.
Abstract: This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multi-core processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this including: direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification. Our results demonstrate that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers. Performance scales well as more machines are added with near linear speedup in many cases. Simulation slowdown is as low as 41× versus native execution.

498 citations


Proceedings ArticleDOI
28 Mar 2010
TL;DR: This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU, exposing undocumented features that impact program performance and correctness.
Abstract: Graphics processors (GPU) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. Because the GPU is often presented as a C-like abstraction (e.g., Nvidia's CUDA), little is known about the characteristics of the GPU's architecture beyond what the manufacturer has documented. This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured. This analysis exposes undocumented features that impact program performance and correctness. These measurements can be useful for improving performance optimization, analysis, and modeling on this architecture and offer additional insight on the decisions made in developing this GPU.

471 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: By carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5, and is thus more comparable to that of parallel database systems.
Abstract: MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems in performing a variety of analytical tasks by a factor of 3.1 to 6.5. MapReduce can achieve better performance with the allocation of more compute nodes from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users desire an economical elastically scalable data processing system, and therefore, are interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop, and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], and is thus more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.

426 citations


Journal ArticleDOI
TL;DR: It is suggested that GPUs have the potential to facilitate the growth of statistical modeling into complex data-rich domains through the availability of cheap and accessible many-core computation.
Abstract: We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35- to 500-fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data-rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.
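
The paper's GPU kernels are not reproduced here; as a hedged illustration of why population-based Monte Carlo maps well onto data-parallel hardware, the sketch below advances many independent simulation "particles" in one vectorized NumPy operation, a stand-in for what a GPU would do across thousands of threads. The toy target (estimating E[x^2] under a standard normal) is an assumption for illustration only.

    import numpy as np

    def parallel_monte_carlo(n_particles=1_000_000, seed=0):
        # Each particle is an independent draw; the whole population is updated
        # in one vectorized step, which is the structure GPUs exploit.
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(n_particles)
        return np.mean(x ** 2)   # should be close to 1.0

    print(parallel_monte_carlo())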

334 citations


Proceedings ArticleDOI
Anthony Nguyen1, Nadathur Satish1, Jatin Chhugani1, Changkyu Kim1, Pradeep Dubey1 
13 Nov 2010
TL;DR: A novel 3.5D-blocking algorithm is presented that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs, scaling near-linearly with SIMD width and core count.
Abstract: Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows slower than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and multiple cores. Our performance numbers are faster or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs, and 1.8X faster on GPUs for single-precision floating point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup number on CPUs is 2.1X.
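
For readers unfamiliar with the kernel being blocked, a plain (unblocked) 7-point stencil sweep looks roughly like the sketch below; the paper's 3.5D algorithm additionally tiles this loop nest in space and time to keep the working set in on-chip memory. The coefficients and grid size here are illustrative assumptions.

    import numpy as np

    def sweep_7pt(grid, c0=0.5, c1=0.1):
        # One time step of a 7-point stencil: each interior cell is updated from
        # itself and its six face neighbours. Memory traffic, not arithmetic,
        # dominates, which is why the paper blocks in space and time.
        out = grid.copy()
        out[1:-1, 1:-1, 1:-1] = (
            c0 * grid[1:-1, 1:-1, 1:-1]
            + c1 * (grid[:-2, 1:-1, 1:-1] + grid[2:, 1:-1, 1:-1]
                    + grid[1:-1, :-2, 1:-1] + grid[1:-1, 2:, 1:-1]
                    + grid[1:-1, 1:-1, :-2] + grid[1:-1, 1:-1, 2:])
        )
        return out

    g = np.random.rand(64, 64, 64).astype(np.float32)
    for _ in range(4):          # four time steps
        g = sweep_7pt(g)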

299 citations


Journal ArticleDOI
TL;DR: A high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI.

278 citations


Proceedings Article
11 Jul 2010
TL;DR: It is shown that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging "equivalent" stacks based on feature values, and the final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster.
Abstract: Incremental parsing techniques such as shift-reduce have gained popularity thanks to their efficiency, but there remains a major problem: the search is greedy and only explores a tiny fraction of the whole space (even with beam search) as opposed to dynamic programming. We show that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging "equivalent" stacks based on feature values. Empirically, our algorithm yields up to a five-fold speedup over a state-of-the-art shift-reduce dependency parser with no loss in accuracy. Better search also leads to better learning, and our final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster.

253 citations


Proceedings ArticleDOI
19 Apr 2010
TL;DR: In this article, the authors present a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA.
Abstract: Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. Results show that our generalized methodology delivers significant performance gains of up to 22× speedup over the reference serial implementation. Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems.
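
The framework's code generator is not shown in the abstract; the essential auto-tuning loop it relies on (generate candidate variants, benchmark each, keep the fastest) can be sketched in a few lines. The candidate block sizes and the toy kernel below are assumptions for illustration, not the framework's search space.

    import time
    import numpy as np

    def blocked_sum(a, block):
        # Toy stand-in for a tunable stencil kernel: process the array in tiles.
        total = 0.0
        for i in range(0, a.shape[0], block):
            for j in range(0, a.shape[1], block):
                total += a[i:i + block, j:j + block].sum()
        return total

    def autotune(a, candidates=(16, 32, 64, 128, 256)):
        # Benchmark each candidate configuration and keep the fastest one.
        best = None
        for block in candidates:
            t0 = time.perf_counter()
            blocked_sum(a, block)
            elapsed = time.perf_counter() - t0
            if best is None or elapsed < best[1]:
                best = (block, elapsed)
        return best

    a = np.random.rand(2048, 2048)
    print(autotune(a))   # (best_block_size, seconds)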

243 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy that guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
Abstract: Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
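
The hierarchical queue management is GPU-specific, but the level-synchronous BFS that such kernels parallelize is simple; a sequential sketch with the work-optimal O(V+E) structure is shown below (adjacency-list graph representation assumed for illustration).

    def bfs_levels(adj, source):
        # Level-synchronous BFS: each 'frontier' is one level; GPU versions expand
        # all frontier vertices in parallel instead of popping them one by one.
        dist = {source: 0}
        frontier = [source]
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        next_frontier.append(v)
            frontier = next_frontier
        return dist

    adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(bfs_levels(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}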

235 citations


Proceedings ArticleDOI
09 Jan 2010
TL;DR: To combine the benefits of the basic algorithms, this work proposes hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively.
Abstract: We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.
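
The paper's hybrid CR+PCR and CR+RD solvers and their GPU memory optimizations are beyond a short listing, but plain parallel cyclic reduction (PCR) itself is compact: at every step each equation eliminates its two neighbours independently, so all rows can be processed in parallel. Below is a minimal sequential sketch of PCR for a diagonally dominant tridiagonal system; the per-row loop body is what one GPU thread per row would execute.

    import numpy as np

    def pcr_solve(a, b, c, d):
        # Parallel cyclic reduction for a tridiagonal system
        #   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]   (a[0] = c[-1] = 0).
        # Assumes a well-conditioned (e.g., diagonally dominant) system.
        a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
        n = len(b)

        def eq(i):
            # Return row i, or the identity equation 1*x = 0 outside the range.
            if 0 <= i < n:
                return a[i], b[i], c[i], d[i]
            return 0.0, 1.0, 0.0, 0.0

        stride = 1
        while stride < n:
            na, nb, nc, nd = a.copy(), b.copy(), c.copy(), d.copy()
            for i in range(n):                    # every row updated independently
                am, bm, cm, dm = eq(i - stride)   # equation above
                ap, bp, cp, dp = eq(i + stride)   # equation below
                k1 = a[i] / bm
                k2 = c[i] / bp
                na[i] = -k1 * am
                nb[i] = b[i] - k1 * cm - k2 * ap
                nc[i] = -k2 * cp
                nd[i] = d[i] - k1 * dm - k2 * dp
            a, b, c, d = na, nb, nc, nd
            stride *= 2
        return d / b          # each equation is now decoupled: b[i]*x[i] = d[i]

    # Example: solve a small diagonally dominant system and check the result.
    n = 8
    a = [0.0] + [-1.0] * (n - 1)
    c = [-1.0] * (n - 1) + [0.0]
    b = [4.0] * n
    x_true = np.arange(1.0, n + 1)
    d = np.array(b) * x_true + np.array(a) * np.r_[0.0, x_true[:-1]] + np.array(c) * np.r_[x_true[1:], 0.0]
    print(np.allclose(pcr_solve(a, b, c, d), x_true))   # True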

Proceedings ArticleDOI
11 Sep 2010
TL;DR: This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors using a parallel scan stream primitive that has been generalized in two ways: with local interfaces for producer/consumer operations (visiting logic), and with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
Abstract: This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
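
The GPGPU multi-scan machinery is beyond a short sketch, but the core of one radix sort pass (histogram the current digit, prefix-scan the counts, then scatter) can be shown sequentially; GPU radix sorts parallelize exactly these three phases with scan primitives. The digit width and 32-bit key assumption below are illustrative.

    def radix_sort_u32(keys, bits_per_pass=8):
        # LSD radix sort: each pass is histogram -> exclusive prefix scan -> scatter.
        radix = 1 << bits_per_pass
        mask = radix - 1
        for shift in range(0, 32, bits_per_pass):
            # 1) histogram of the current digit
            counts = [0] * radix
            for k in keys:
                counts[(k >> shift) & mask] += 1
            # 2) exclusive prefix scan turns counts into output offsets
            offsets, running = [0] * radix, 0
            for dgt in range(radix):
                offsets[dgt] = running
                running += counts[dgt]
            # 3) stable scatter into the output positions
            out = [0] * len(keys)
            for k in keys:
                dgt = (k >> shift) & mask
                out[offsets[dgt]] = k
                offsets[dgt] += 1
            keys = out
        return keys

    print(radix_sort_u32([170, 45, 75, 90, 802, 24, 2, 66]))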

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A general method for analyzing nondeterministic programs that use reducers is provided, and it is shown that for a graph G=(V,E) with diameter D and bounded out-degree, the data-race-free version of the PBFS algorithm attains near-perfect linear speedup if P << (V+E)/(D lg³(V/D)).
Abstract: We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices -- a condition met by many real-world graphs -- PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" -- a "hyperobject" feature of Cilk++ -- the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS also is nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time O((V+E)/P + D lg³(V/D)) on P processors, which means that it attains near-perfect linear speedup if P << (V+E)/(D lg³(V/D)).

Journal ArticleDOI
TL;DR: In this article, the authors compare the dynamics of photosynthetic light-harvesting systems to quantum walks, in order to elucidate the limits of such quantum speedups.
Abstract: It has been suggested that excitation transport in photosynthetic light-harvesting complexes features speedups analogous to those found in quantum algorithms. Here we compare the dynamics in these light-harvesting systems to the dynamics of quantum walks, in order to elucidate the limits of such quantum speedups. For the Fenna–Matthews–Olson complex of green sulfur bacteria, we show that while there is indeed speedup at short times, this is short lived (70 fs) despite longer-lived (ps) quantum coherence. Remarkably, this timescale is independent of the details of the decoherence model. More generally, we show that the distinguishing features of light-harvesting complexes not only limit the extent of quantum speedup but also reduce the rates of diffusive transport. These results suggest that quantum coherent effects in biological systems are optimized for efficiency or robustness rather than the more elusive goal of quantum speedup.

Proceedings ArticleDOI
12 Jul 2010
TL;DR: This work proposes a technique, Simple Static Partitioning, for parallelizing symbolic execution, which uses a set of pre-conditions to partition the symbolic execution tree, allowing us to effectively distribute symbolic execution and decrease the time needed to explore the symbolic executions tree.
Abstract: Symbolic execution is a popular technique for automatically generating test cases achieving high structural coverage. Symbolic execution suffers from scalability issues since the number of symbolic paths that need to be explored is very large (or even infinite) for most realistic programs. To address this problem, we propose a technique, Simple Static Partitioning, for parallelizing symbolic execution. The technique uses a set of pre-conditions to partition the symbolic execution tree, allowing us to effectively distribute symbolic execution and decrease the time needed to explore the symbolic execution tree. The proposed technique requires little communication between parallel instances and is designed to work with a variety of architectures, ranging from fast multi-core machines to cloud or grid computing environments. We implement our technique in the Java PathFinder verification tool-set and evaluate it on six case studies with respect to the performance improvement when exploring a finite symbolic execution tree and performing automatic test generation. We demonstrate speedup in both the analysis time over finite symbolic execution trees and in the time required to generate tests relative to sequential execution, with a maximum analysis time speedup of 90x observed using 128 workers and a maximum test generation speedup of 70x observed using 64 workers.

Journal ArticleDOI
TL;DR: A new paradigm for eight-connection labeling is defined, which employs a general approach to improve neighborhood exploration and minimizes the number of memory accesses, and a new scanning technique that moves on a 2 × 2 pixel grid over the image, which is optimized by the automatically generated decision tree.
Abstract: In this paper, we define a new paradigm for eight-connection labeling, which employs a general approach to improve neighborhood exploration and minimizes the number of memory accesses. First, we exploit and extend the decision table formalism introducing or-decision tables, in which multiple alternative actions are managed. An automatic procedure to synthesize the optimal decision tree from the decision table is used, providing the most effective conditions evaluation order. Second, we propose a new scanning technique that moves on a 2 × 2 pixel grid over the image, which is optimized by the automatically generated decision tree. An extensive comparison with the state of art approaches is proposed, both on synthetic and real datasets. The synthetic dataset is composed of random images of different sizes and densities, while the real datasets are an artistic image analysis dataset, a document analysis dataset for text detection and recognition, and finally a standard resolution dataset for picture segmentation tasks. The algorithm provides an impressive speedup over the state of the art algorithms.
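
The or-decision-table synthesis and the 2 × 2 scanning grid are the paper's contributions; the baseline two-scan eight-connectivity labeling that they accelerate can be sketched with a union-find, as below. The binary NumPy image representation and the simple equivalence handling are assumptions for illustration.

    import numpy as np

    def label_8connected(img):
        # Two-scan connected-component labeling with union-find.
        # First scan: assign provisional labels and record equivalences.
        # Second scan: replace each provisional label with its representative.
        parent = [0]                       # parent[label] -> representative

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        def union(x, y):
            rx, ry = find(x), find(y)
            if rx != ry:
                parent[max(rx, ry)] = min(rx, ry)

        h, w = img.shape
        labels = np.zeros((h, w), dtype=int)
        for y in range(h):
            for x in range(w):
                if not img[y, x]:
                    continue
                # Eight-connectivity: look at the four already-visited neighbours.
                neigh = [labels[y + dy, x + dx]
                         for dy, dx in ((-1, -1), (-1, 0), (-1, 1), (0, -1))
                         if 0 <= y + dy and 0 <= x + dx < w and labels[y + dy, x + dx]]
                if not neigh:
                    parent.append(len(parent))
                    labels[y, x] = len(parent) - 1
                else:
                    labels[y, x] = min(neigh)
                    for n in neigh:
                        union(labels[y, x], n)
        # Second scan: flatten equivalences.
        for y in range(h):
            for x in range(w):
                if labels[y, x]:
                    labels[y, x] = find(labels[y, x])
        return labels

    img = np.array([[1, 1, 0, 0],
                    [0, 1, 0, 1],
                    [0, 0, 0, 1]])
    print(label_8connected(img))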

Proceedings ArticleDOI
19 Apr 2010
TL;DR: SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler, and achieves locality-awareness by grouping workers into places and achieves speedup over locality-oblivious scheduling.
Abstract: This paper introduces SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler. The SLAW scheduler is designed to address two common limitations in current work-stealing schedulers: use of a fixed task scheduling policy and locality-obliviousness due to randomized stealing. Past work has demonstrated the pros and cons of using fixed scheduling policies, such as work-first and help-first, in different cases without a clear win for one policy over the other. The SLAW scheduler addresses this limitation by supporting both work-first and help-first policies simultaneously. It does so by using an adaptive approach that selects a scheduling policy on a per-task basis at runtime. The SLAW scheduler also establishes bounds on the stack and heap space needed to store tasks. The experimental results for the benchmarks studied in this paper show that SLAW's adaptive scheduler achieves 0.98× to 9.2× speedup over the help-first scheduler and 0.97× to 4.5× speedup over the work-first scheduler for 64-thread executions, thereby establishing the robustness of using an adaptive approach instead of a fixed policy. In contrast, the help-first policy is 9.2× slower than work-first in the worst case for a fixed help-first policy, and the work-first policy is 3.7× slower than help-first in the worst case for a fixed work-first policy. Further, for large irregular recursive parallel computations, the adaptive scheduler runs with bounded stack usage and achieves performance (and supports data sizes) that cannot be delivered by the use of any single fixed policy. It is also known that work-stealing schedulers can be cache-unfriendly for some applications due to randomized stealing. The SLAW scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler, and achieves locality-awareness by grouping workers into places. Locality awareness can lead to improved performance by increasing temporal data reuse within a worker and among workers in the same place. Our experimental results show that locality-aware scheduling can achieve up to 2.6× speedup over locality-oblivious scheduling, for the benchmarks studied in this paper.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: This work presents a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory to overcome data latencies and provide significant speedup over alternative approaches.
Abstract: Processing large graphs is becoming increasingly important for many domains such as social networks, bioinformatics, etc. Unfortunately, many algorithms and implementations do not scale with increasing graph sizes. As a result, researchers have attempted to meet the growing data demands using parallel and external memory techniques. We present a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory. Our highly parallel asynchronous approach hides data latency due to both poor locality and delays in the underlying graph data storage. We present an experimental study applying our technique to both In-Memory and Semi-External Memory graphs utilizing multi-core processors and solid-state memory devices. Our experiments using synthetic and real-world datasets show that our asynchronous approach is able to overcome data latencies and provide significant speedup over alternative approaches. For example, on billion vertex graphs our asynchronous BFS scales up to 14x on 16-cores.

Proceedings ArticleDOI
Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang
21 Feb 2010
TL;DR: FPMR, a MapReduce framework on FPGA, which provides programming abstraction, hardware architecture, and basic building blocks to developers so that more attention can be paid to the application itself and the speedup of this framework is demonstrated.
Abstract: Machine learning and data mining are gaining increasing attention from the computing society. FPGA provides a highly parallel, low power, and flexible hardware platform for this domain, while the difficulty of programming FPGA greatly limits its prevalence. MapReduce is a parallel programming framework that could easily utilize inherent parallelism in algorithms. In this paper, we describe FPMR, a MapReduce framework on FPGA, which provides programming abstraction, hardware architecture, and basic building blocks to developers. An on-chip processor scheduler is implemented to maximize the utilization of computation resources and achieve better load balancing. An efficient data access scheme is carefully designed to maximize data reuse and throughput. Meanwhile, the FPMR framework hides the task control, synchronization, and communication away from designers so that more attention can be paid to the application itself. A case study of RankBoost acceleration based on FPMR demonstrates that FPMR efficiently helps with development productivity; the speedup is 31.8x versus a CPU-based implementation. This performance is comparable to a fully manually designed version, which achieves 33.5x speedup. Two other applications, SVM and PageRank, are also discussed to show the generalization of the framework.

Proceedings ArticleDOI
Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, Haibo Lin
11 Sep 2010
TL;DR: This research presents a novel and scalable approach to solve the problem of high development and maintenance cost of writing GPU-specific code with low-level GPU APIs such as CUDA.
Abstract: Graphics Processing Units (GPU) have been playing an important role in the general purpose computing market recently. The common approach to program GPUs today is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers are required to write a specific version of code for each potential target architecture. This results in high development and maintenance cost. We believe it is desirable to have a programming model which provides source code portability between CPUs and GPUs, and across different GPUs: programmers only need to write one version of code, which can be compiled and executed efficiently on either CPUs or GPUs without modification. In this paper, we propose MapCG, a MapReduce framework to provide source code level portability between CPU and GPU. Different from OpenCL, our framework is based on MapReduce, which provides a high-level programming model, making programming much easier. We describe the design of the MapReduce-based high-level programming language and the underlying runtime system to enable portability between CPU and GPU. A prototype of the MapCG runtime was implemented, supporting multi-core CPUs and NVIDIA GPUs. Experiments show that our implementation can execute the same source code efficiently on multi-core CPU platforms and GPUs, achieving an average of 1.6-2.5x speedup over previous implementations of MapReduce on eight commonly used applications.

Journal ArticleDOI
TL;DR: This paper proposes a grid-alignment scheme and associated data structures that greatly reduce the complexity of the registration algorithm, and develops highly data parallel designs for B-spline registration within the stream-processing model, suitable for implementation on multi-core processors such as graphics processing units (GPUs).
Abstract: Spline-based deformable registration methods are quite popular within the medical-imaging community due to their flexibility and robustness. However, they require a large amount of computing time to obtain adequate results. This paper makes two contributions towards accelerating B-spline-based registration. First, we propose a grid-alignment scheme and associated data structures that greatly reduce the complexity of the registration algorithm. Based on this grid-alignment scheme, we then develop highly data parallel designs for B-spline registration within the stream-processing model, suitable for implementation on multi-core processors such as graphics processing units (GPUs). Particular attention is focused on an optimal method for performing analytic gradient computations in a data parallel fashion. CPU and GPU versions are validated for execution time and registration quality. Performance results on large images show that our GPU algorithm achieves a speedup of 15 times over the single-threaded CPU implementation whereas our multi-core CPU algorithm achieves a speedup of 8 times over the single-threaded implementation. The CPU and GPU versions achieve near-identical registration quality in terms of RMS differences between the generated vector fields.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: Experimental results show that the proposed task-based dynamic load-balancing solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workload, and achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.
Abstract: The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in current GPU programming APIs, such as NVIDIA's CUDA. We evaluate our approach using both micro-benchmarks and a molecular dynamics application that exhibits significant load imbalance. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workloads. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.

Proceedings ArticleDOI
01 Apr 2010
TL;DR: In this paper, the authors propose interval simulation, which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model.
Abstract: Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator, show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs.
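
The mechanistic model itself is detailed in the paper; its flavour (a base dispatch time plus penalties for the miss events that end each interval) can be conveyed in a few lines. The event counts and penalties below are made-up inputs for illustration, not values from the paper.

    def interval_model_cycles(instructions, dispatch_width, miss_events):
        # First-order interval model: absent miss events, a balanced core sustains
        # `dispatch_width` instructions per cycle; each miss event (branch
        # mispredict, cache/TLB miss) adds its penalty on top.
        base = instructions / dispatch_width
        penalty = sum(count * cost for count, cost in miss_events.values())
        return base + penalty

    events = {                      # (count, penalty in cycles) -- illustrative only
        "branch_mispredict": (2_000_000, 15),
        "l2_miss":           (500_000, 40),
        "llc_miss":          (100_000, 200),
    }
    cycles = interval_model_cycles(1_000_000_000, dispatch_width=4, miss_events=events)
    print(f"estimated CPI = {cycles / 1_000_000_000:.2f}")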

Proceedings ArticleDOI
04 Feb 2010
TL;DR: By proposing optimization strategies that allow short-circuiting score computations in additive learning systems, this paper is able to speed up the score computation process by more than four times with almost no loss in result quality.
Abstract: Some commercial web search engines rely on sophisticated machine learning systems for ranking web documents. Due to very large collection sizes and tight constraints on query response times, online efficiency of these learning systems forms a bottleneck. An important problem in such systems is to speed up the ranking process without sacrificing much from the quality of results. In this paper, we propose optimization strategies that allow short-circuiting score computations in additive learning systems. The strategies are evaluated over a state-of-the-art machine learning system and a large, real-life query log obtained from Yahoo!. With the proposed strategies, we are able to speed up the score computations by more than four times with almost no loss in result quality.
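
The paper's specific short-circuiting strategies are evaluated on a production system and are not reproduced here; the generic idea (stop accumulating an additive score once it can no longer reach the current cut-off) can be sketched as below. The stage structure, scores and threshold are illustrative assumptions.

    def score_with_early_exit(doc_features, stages, cutoff):
        # Additive learning system: the final score is a sum of per-stage scores.
        # If the partial score plus the best the remaining stages could add can
        # no longer reach `cutoff`, stop early and skip the remaining stages.
        remaining_max = sum(s["max_score"] for s in stages)
        score = 0.0
        for stage in stages:
            remaining_max -= stage["max_score"]
            score += stage["scorer"](doc_features)
            if score + remaining_max < cutoff:
                return None              # cannot make the cut; short-circuit
        return score

    stages = [
        {"scorer": lambda f: 0.3 * f["bm25"],      "max_score": 3.0},
        {"scorer": lambda f: 0.5 * f["clicks"],    "max_score": 5.0},
        {"scorer": lambda f: 0.2 * f["freshness"], "max_score": 2.0},
    ]
    print(score_with_early_exit({"bm25": 4.0, "clicks": 1.0, "freshness": 0.5}, stages, cutoff=6.0))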

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This work argues that we are now close to exhausting all possible speedup from software and must turn to hardware-based solutions, and investigates both GPU- and FPGA-based acceleration of subsequence similarity search under the DTW measure.
Abstract: Many time series data mining problems require subsequence similarity search as a subroutine. Dozens of similarity/distance measures have been proposed in the last decade and there is increasing evidence that Dynamic Time Warping (DTW) is the best measure across a wide range of domains. Given DTW’s usefulness and ubiquity, there has been a large community-wide effort to mitigate its relative lethargy. Proposed speedup techniques include early abandoning strategies, lower-bound based pruning, indexing and embedding. In this work we argue that we are now close to exhausting all possible speedup from software, and that we must turn to hardware-based solutions. With this motivation, we investigate both GPU (Graphics Processing Unit) and FPGA (Field Programmable Gate Array) based acceleration of subsequence similarity search under the DTW measure. As we shall show, our novel algorithms allow GPUs to achieve two orders of magnitude speedup and FPGAs to produce four orders of magnitude speedup. We conduct detailed case studies on the classification of astronomical observations and demonstrate that our ideas allow us to tackle problems that would be untenable otherwise.
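
For context, the DTW computation that all of these speedup efforts target is a simple dynamic program over a cost matrix; a plain Python version is sketched below (full DTW, without the lower-bounding or warping-window constraints discussed in the literature).

    import math

    def dtw_distance(a, b):
        # Classic O(len(a)*len(b)) dynamic program: cell (i, j) holds the cost of
        # the best warping path aligning a[:i] with b[:j].
        n, m = len(a), len(b)
        dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = (a[i - 1] - b[j - 1]) ** 2
                dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                      dp[i][j - 1],      # deletion
                                      dp[i - 1][j - 1])  # match
        return math.sqrt(dp[n][m])

    print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 2, 1]))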

Proceedings ArticleDOI
11 Sep 2010
TL;DR: MapReduce, a simple and elegant programming model to program large scale clusters, has recently been shown to be a promising alternative to harness the multicore platform.
Abstract: The prevalence of chip multiprocessors opens opportunities of running data-parallel applications originally in clusters on a single machine with many cores. MapReduce, a simple and elegant programming model to program large scale clusters, has recently been shown to be a promising alternative to harness the multicore platform. The differences such as memory hierarchy and communication patterns between clusters and multicore platforms raise new challenges to design and implement an efficient MapReduce system on multicore. This paper argues that it is more efficient for MapReduce to iteratively process small chunks of data in turn than processing a large chunk of data at one time on shared memory multicore platforms. Based on the argument, we extend the general MapReduce programming model with a "tiling strategy", called Tiled-MapReduce (TMR). TMR partitions a large MapReduce job into a number of small sub-jobs and iteratively processes one sub-job at a time with efficient use of resources; TMR finally merges the results of all sub-jobs for output. Based on Tiled-MapReduce, we design and implement several optimizing techniques targeting multicore, including the reuse of input and intermediate data structures among sub-jobs, a NUCA/NUMA-aware scheduler, and pipelining a sub-job's reduce phase with the successive sub-job's map phase, to optimize the memory, cache and CPU resources accordingly. We have implemented a prototype of Tiled-MapReduce based on Phoenix, an already highly optimized MapReduce runtime for shared memory multiprocessors. The prototype, namely Ostrich, runs on an Intel machine with 16 cores. Experiments on four different types of benchmarks show that Ostrich saves up to 85% memory, causes fewer cache misses and makes more efficient use of CPU cores, resulting in a speedup ranging from 1.2X to 3.3X.

Proceedings ArticleDOI
19 Jun 2010
TL;DR: It is shown that parallel performance is not only limited by sequential code but is also fundamentally limited by synchronization through critical sections, and the surprising result that the impact of critical sections on parallel performance can be modeled as a completely sequential part and a completely parallel part.
Abstract: This paper presents a fundamental law for parallel performance: it shows that parallel performance is not only limited by sequential code (as suggested by Amdahl's law) but is also fundamentally limited by synchronization through critical sections. Extending Amdahl's software model to include critical sections, we derive the surprising result that the impact of critical sections on parallel performance can be modeled as a completely sequential part and a completely parallel part. The sequential part is determined by the probability for entering a critical section and the contention probability (i.e., multiple threads wanting to enter the same critical section). This fundamental result reveals at least three important insights for multicore design. (i) Asymmetric multicore processors deliver less performance benefits relative to symmetric processors than suggested by Amdahl's law, and in some cases even worse performance. (ii) Amdahl's law suggests many tiny cores for optimum performance in asymmetric processors, however, we find that fewer but larger small cores can yield substantially better performance. (iii) Executing critical sections on the big core can yield substantial speedups, however, performance is sensitive to the accuracy of the critical section contention predictor.
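
Classic Amdahl scaling and the paper's qualitative extension can be compared with a couple of lines of arithmetic. The sketch below treats the contended fraction of critical-section work as effectively sequential, following the abstract's statement that critical sections contribute a completely sequential part and a completely parallel part; the specific fractions are illustrative assumptions, and the probabilistic model in the paper is more refined.

    def amdahl_speedup(f_parallel, n):
        # Classic Amdahl's law: only f_parallel of the work scales with n cores.
        return 1.0 / ((1.0 - f_parallel) + f_parallel / n)

    def speedup_with_critical_sections(f_seq, f_cs, p_contention, n):
        # Rough extension: the contended share of critical-section time behaves
        # sequentially; the uncontended share scales like ordinary parallel work.
        seq = f_seq + f_cs * p_contention
        par = 1.0 - seq
        return 1.0 / (seq + par / n)

    print(amdahl_speedup(0.95, 64))                               # ~15.4x
    print(speedup_with_critical_sections(0.05, 0.20, 0.5, 64))    # noticeably lower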

Proceedings ArticleDOI
13 Nov 2010
TL;DR: This work presents the first full CUDA porting of the high-resolution weather prediction model ASUCA, the first such porting known to date, and demonstrates over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision.
Abstract: Regional weather forecasting demands fast simulation over fine-grained grids, resulting in extremely memory-bottlenecked computation, a difficult problem on conventional supercomputers. Early work on accelerating the mainstream weather code WRF using GPUs with their high memory performance, however, resulted in only minor speedup due to partial GPU porting of the huge code. Our full CUDA porting of the high-resolution weather prediction model ASUCA is the first such porting we know of to date; ASUCA is a next-generation, production weather code developed by the Japan Meteorological Agency, similar to WRF in the underlying physics (non-hydrostatic model). Benchmarks on the 528-GPU (NVIDIA GT200 Tesla) TSUBAME Supercomputer at the Tokyo Institute of Technology demonstrated over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision for a 6956 x 6052 x 48 mesh. Further benchmarks on TSUBAME 2.0, which will embody over 4000 NVIDIA Fermi GPUs and be deployed in October 2010, will be presented.

Book ChapterDOI
08 Nov 2010
TL;DR: A novel method that, given a sequence of synchronized views of a human hand, recovers its 3D position, orientation and full articulation parameters using Particle Swarm Optimization and achieves a speedup of two orders of magnitude over the case of CPU processing.
Abstract: We present a novel method that, given a sequence of synchronized views of a human hand, recovers its 3D position, orientation and full articulation parameters. The adopted hand model is based on properly selected and assembled 3D geometric primitives. Hypothesized configurations/poses of the hand model are projected to different camera views and image features such as edge maps and hand silhouettes are computed. An objective function is then used to quantify the discrepancy between the predicted and the actual, observed features. The recovery of the 3D hand pose amounts to estimating the parameters that minimize this objective function which is performed using Particle Swarm Optimization. All the basic components of the method (feature extraction, objective function evaluation, optimization process) are inherently parallel. Thus, a GPU-based implementation achieves a speedup of two orders of magnitude over the case of CPU processing. Extensive experimental results demonstrate qualitatively and quantitatively that accurate 3D pose recovery of a hand can be achieved robustly at a rate that greatly outperforms the current state of the art.
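
The hand model and GPU rendering pipeline cannot be reproduced here, but the Particle Swarm Optimization loop at the heart of the method is generic; a minimal sketch minimizing a toy objective is shown below. The hyper-parameters are common textbook values and the quadratic objective is a stand-in for the paper's image-discrepancy function, not its actual formulation.

    import numpy as np

    def pso_minimize(objective, dim, n_particles=32, iters=100,
                     w=0.72, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
        # Each particle keeps a position, a velocity, and its personal best;
        # all particles are attracted toward the global best found so far.
        rng = np.random.default_rng(seed)
        lo, hi = bounds
        x = rng.uniform(lo, hi, (n_particles, dim))
        v = np.zeros((n_particles, dim))
        pbest, pbest_val = x.copy(), np.array([objective(p) for p in x])
        g = pbest[np.argmin(pbest_val)].copy()
        for _ in range(iters):
            r1, r2 = rng.random((2, n_particles, dim))
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
            x = np.clip(x + v, lo, hi)
            vals = np.array([objective(p) for p in x])
            improved = vals < pbest_val
            pbest[improved], pbest_val[improved] = x[improved], vals[improved]
            g = pbest[np.argmin(pbest_val)].copy()
        return g, pbest_val.min()

    # Toy objective standing in for the multi-view image-discrepancy function.
    print(pso_minimize(lambda p: np.sum((p - 1.0) ** 2), dim=4))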

Journal ArticleDOI
TL;DR: This work shows that effective use of the GPU requires a novel reformulation of the Smith-Waterman algorithm, and indicates that for large problems a single GPU is up to 45 times faster than a CPU for this application.
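
The GPU reformulation is the paper's contribution; the underlying Smith-Waterman recurrence that any implementation must compute is shown below as a straightforward (and deliberately unoptimized) dynamic program with linear gap penalties. The scoring values are common defaults, not necessarily those used in the paper.

    def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
        # Local alignment score: H[i][j] is the best alignment ending at
        # seq1[i-1] / seq2[j-1]; negative prefixes are clamped to zero.
        n, m = len(seq1), len(seq2)
        H = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if seq1[i - 1] == seq2[j - 1] else mismatch
                H[i][j] = max(0,
                              H[i - 1][j - 1] + s,   # align the two characters
                              H[i - 1][j] + gap,     # gap in seq2
                              H[i][j - 1] + gap)     # gap in seq1
                best = max(best, H[i][j])
        return best

    print(smith_waterman("GGTTGACTA", "TGTTACGG"))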