
Showing papers on "Multi-core processor published in 2008"


Proceedings ArticleDOI
11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

2,216 citations
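The transparent scaling the abstract describes is conventionally written in CUDA as a grid-stride loop. The sketch below is our own illustration, not code from the paper (it uses unified memory, a later convenience, purely for brevity): the same kernel is correct for any core count, and larger GPUs simply run more blocks concurrently.

```cuda
// Minimal grid-stride SAXPY sketch (ours, not from the paper).
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    // Each thread strides over the array, so correctness never depends
    // on how many threads the hardware actually runs at once.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));   // (newer than the 2008 paper)
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<256, 256>>>(n, 2.0f, x, y);  // block count is a tuning knob, not a correctness requirement
    cudaDeviceSynchronize();
    std::printf("y[0] = %f\n", y[0]);    // expect 4.0
    cudaFree(x); cudaFree(y);
}
```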


Journal ArticleDOI
TL;DR: This paper develops a general-purpose molecular dynamics code that runs entirely on a single GPU and shows that the GPU implementation provides performance equivalent to that of a fast 30-processor-core distributed-memory cluster.

1,514 citations


Book
Maurice Herlihy
14 Mar 2008
TL;DR: Transactional memory as discussed by the authors is a computational model in which threads synchronize by optimistic, lock-free transactions, and there is a growing community of researchers working on both software and hardware support for this approach.
Abstract: Computer architecture is about to undergo, if not another revolution, then a vigorous shaking-up. The major chip manufacturers have, for the time being, simply given up trying to make processors run faster. Instead, they have recently started shipping "multicore" architectures, in which multiple processors (cores) communicate directly through shared hardware caches, providing increased concurrency instead of increased clock speed. As a result, system designers and software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must somehow learn to make effective use of increasing parallelism. This adaptation will not be easy. Conventional synchronization techniques based on locks and conditions are unlikely to be effective in such a demanding environment. Coarse-grained locks, which protect relatively large amounts of data, do not scale, and fine-grained locks introduce substantial software engineering problems. Transactional memory is a computational model in which threads synchronize by optimistic, lock-free transactions. This synchronization model promises to alleviate many (not all) of the problems associated with locking, and there is a growing community of researchers working on both software and hardware support for this approach. This talk will survey the area, with a focus on open research problems.

1,268 citations
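The optimistic, commit-or-retry discipline that transactional memory generalizes can be previewed on a single word with an atomic compare-and-swap; the sketch below is our illustration, not code from the book.

```cpp
// Optimistic read-modify-write (ours, not the book's): speculatively
// compute a new value and commit it only if no other thread changed
// the location in the meantime; on conflict, simply retry. No locks.
#include <atomic>

std::atomic<long> balance{100};

void deposit(long amount) {
    long observed = balance.load(std::memory_order_relaxed);
    // compare_exchange_weak commits only if `balance` still equals
    // `observed`; otherwise it reloads `observed` and we retry.
    while (!balance.compare_exchange_weak(observed, observed + amount)) {
        // conflict detected: another thread committed first; retry
    }
}
```

A CAS makes only one word atomic; transactional memory's promise is to extend the same optimistic commit-or-retry semantics across arbitrary sets of memory locations, with support in software, hardware, or both.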


Journal ArticleDOI
TL;DR: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores.
Abstract: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research both in extracting more parallelism and in making sequential cores faster.

1,245 citations
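The corollary is easiest to state for a symmetric chip. In the paper's cost model (sketched here from its published form; notation follows the paper), a chip has a budget of $n$ base-core equivalents (BCEs), a core built from $r$ BCEs delivers sequential performance $\mathrm{perf}(r)$, and a fraction $f$ of the work is parallelizable:

$$\mathrm{Speedup}_{\mathrm{symmetric}}(f,n,r)=\frac{1}{\dfrac{1-f}{\mathrm{perf}(r)}+\dfrac{f\,r}{\mathrm{perf}(r)\,n}},\qquad \mathrm{perf}(r)\approx\sqrt{r}.$$

Setting $r=1$ recovers classic Amdahl's law with $n$ simple cores, $r=n$ recovers a single big sequential core, and intermediate $r$ trades sequential speed against core count, which is exactly the design space the corollary opens up.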


Journal ArticleDOI
TL;DR: In this article, the authors present a framework to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism on manycore GPUs with widely varying numbers of cores.
Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore’s law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

1,148 citations


Proceedings ArticleDOI
25 Oct 2008
TL;DR: Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface, and is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.
Abstract: We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures are designed as special-purpose co-processors and their programming interfaces are typically for graphics applications. As the first attempt to harness the GPU's power for MapReduce, we developed Mars on an NVIDIA G80 GPU, which contains over one hundred processors, and evaluated it in comparison with Phoenix, the state-of-the-art MapReduce framework on multi-core CPUs. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.

793 citations
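For flavor, a GPU map stage reduces to a kernel in which each thread transforms one input record into a key/value slot. The sketch below is hypothetical: the names and the fixed one-output-per-record layout are ours, not the Mars API, which also has to solve output allocation for maps that emit variable amounts of data.

```cuda
// Hypothetical GPU "map" stage in the MapReduce style (not the Mars API).
#include <cstdio>

struct KeyVal { int key; float val; };

__global__ void mapStage(const float* records, KeyVal* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].key = static_cast<int>(records[i]);  // example map: bucket by integer part
        out[i].val = records[i];
    }
}

int main() {
    const int n = 1024;
    float* rec; KeyVal* kv;
    cudaMallocManaged(&rec, n * sizeof(float));
    cudaMallocManaged(&kv, n * sizeof(KeyVal));
    for (int i = 0; i < n; ++i) rec[i] = i * 0.5f;
    mapStage<<<(n + 255) / 256, 256>>>(rec, kv, n);
    cudaDeviceSynchronize();
    std::printf("(%d, %.1f)\n", kv[3].key, kv[3].val);  // (1, 1.5)
    cudaFree(rec); cudaFree(kv);
}
```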


Proceedings ArticleDOI
15 Nov 2008
TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units; they exploit blocking, as on vector computers, and the heterogeneity of the system by computing on both the GPU and the CPU.
Abstract: We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80--90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

787 citations
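For contrast with the paper's tuned kernels, the textbook shared-memory-tiled SGEMM below (our sketch; the paper's kernel is register-blocked and substantially more sophisticated) shows the baseline blocking idea: stage tiles of A and B on chip so each global element is reused TILE times.

```cuda
// Textbook tiled SGEMM sketch: C = A * B for N x N matrices,
// N a multiple of TILE. Tiles are staged in shared memory.
#include <cstdio>
#define TILE 16

__global__ void sgemm(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // tile fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done with this tile
    }
    C[row * N + col] = acc;
}

int main() {
    const int N = 256;                   // multiple of TILE
    float *A, *B, *C;
    cudaMallocManaged(&A, N * N * sizeof(float));
    cudaMallocManaged(&B, N * N * sizeof(float));
    cudaMallocManaged(&C, N * N * sizeof(float));
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }
    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    sgemm<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();
    std::printf("C[0] = %f\n", C[0]);    // expect 2*N = 512
    cudaFree(A); cudaFree(B); cudaFree(C);
}
```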


Journal ArticleDOI
01 Aug 2008
TL;DR: This article consists of a collection of slides from the author's conference presentation. Topics discussed include architecture convergence, the Larrabee architecture, and the graphics pipeline.
Abstract: This paper presents a many-core visual computing architecture code-named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd-level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed-function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

784 citations


Journal ArticleDOI
01 Jun 2008
TL;DR: This work explores more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count, to achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads on a quad-core processor.
Abstract: Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on our memory-intensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8% performance improvement over our 3D-stacked memory architecture.

679 citations


Journal ArticleDOI
TL;DR: This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU.

660 citations


Proceedings ArticleDOI
01 Feb 2008
TL;DR: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across the networking and digital multimedia domains.
Abstract: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across the networking and digital multimedia domains. The chip contains 64 tile processors arranged in an 8×8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.

Proceedings ArticleDOI
24 Oct 2008
TL;DR: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores.
Abstract: Summary form only given. In this paper, we apply Amdahl's law to several multicore chip variants: symmetric cores, asymmetric cores, and dynamic techniques that allow cores to work together on sequential execution. Starting with Amdahl's simple software model, we add a simple hardware model based on fixed chip resources.

Proceedings ArticleDOI
24 Oct 2008
TL;DR: This paper has comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS) and provides new insights into dynamic behaviors and interaction effects.
Abstract: Cache partitioning and sharing is critical to the effective utilization of multicore processors. However, almost all existing studies have been evaluated by simulation that often has several limitations, such as excessive simulation time, absence of OS activities and proneness to simulation inaccuracy. To address these issues, we have taken an efficient software approach to supporting both static and dynamic cache partitioning in OS through memory address mapping. We have comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS). Our software approach makes it possible to run the SPEC CPU2006 benchmark suite to completion. Besides confirming important conclusions from previous work, we are able to gain several insights from whole-program executions, which are infeasible from simulation. For example, giving up some cache space in one program to help another one may improve the performance of both programs for certain workloads due to reduced contention for memory bandwidth. Our evaluation of previously proposed fairness metrics is also significantly different from a simulation-based study. The contributions of this study are threefold. (1) To the best of our knowledge, this is a highly comprehensive execution- and measurement-based study on multicore cache partitioning. This paper not only confirms important conclusions from simulation-based studies, but also provides new insights into dynamic behaviors and interaction effects. (2) Our approach provides a unique and efficient option for evaluating multicore cache partitioning. The implemented software layer can be used as a tool in multicore performance evaluation and hardware design. (3) The proposed schemes can be further refined for OS kernels to improve performance.
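The usual OS mechanism behind such software cache partitioning is page coloring through the virtual-to-physical mapping; the arithmetic below is a generic sketch with assumed cache parameters, not the authors' exact platform.

```cpp
// Generic page-coloring arithmetic (illustrative; cache parameters assumed).
// A physically indexed L2 is partitioned by giving each partition only
// physical pages whose "color" (the cache-set-group bits just above the
// page offset) falls within its share.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kPageSize    = 4096;       // 4 KiB pages
constexpr uint64_t kLineSize    = 64;         // cache line bytes
constexpr uint64_t kCacheSize   = 4u << 20;   // 4 MiB shared L2 (assumed)
constexpr uint64_t kAssoc       = 16;         // 16-way (assumed)
constexpr uint64_t kNumSets     = kCacheSize / (kLineSize * kAssoc);  // 4096 sets
constexpr uint64_t kSetsPerPage = kPageSize / kLineSize;              // 64 sets
constexpr uint64_t kNumColors   = kNumSets / kSetsPerPage;            // 64 colors

uint64_t color_of(uint64_t phys_addr) {
    return (phys_addr / kPageSize) % kNumColors;
}

int main() {
    // Pages with different colors map to disjoint groups of L2 sets, so
    // allocating a process only colors 0..31 caps it at half the cache.
    std::printf("page at 0x10000 has color %llu\n",
                (unsigned long long)color_of(0x10000));
}
```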

Proceedings ArticleDOI
08 Dec 2008
TL;DR: This paper proposes three operating system abstractions (address ranges, kernel cores, and shares) that allow applications to control inter-core sharing and to take advantage of the likely abundance of cores by dedicating cores to specific operating system functions.
Abstract: Multiprocessor application performance can be limited by the operating system when the application uses the operating system frequently and the operating system services use data structures shared and modified by multiple processing cores. If the application does not need the sharing, then the operating system will become an unnecessary bottleneck to the application's performance. This paper argues that applications should control sharing: the kernel should arrange each data structure so that only a single processor need update it, unless directed otherwise by the application. Guided by this design principle, this paper proposes three operating system abstractions (address ranges, kernel cores, and shares) that allow applications to control inter-core sharing and to take advantage of the likely abundance of cores by dedicating cores to specific operating system functions. Measurements of microbenchmarks on the Corey prototype operating system, which embodies the new abstractions, show how control over sharing can improve performance. Application benchmarks, using MapReduce and a Web server, show that the improvements can be significant for overall performance: MapReduce on Corey performs 25% faster than on Linux when using 16 cores. Hardware event counters confirm that these improvements are due to avoiding operations that are expensive on multicore machines.

Proceedings ArticleDOI
Shuai Che, Jie Li, Jeremy W. Sheaffer, Kevin Skadron, John Lach
08 Jun 2008
TL;DR: A comparative study of application behavior on accelerators considering performance and code complexity and an application characteristic to accelerator platform mapping are presented, which can aid developers in selecting an appropriate target architecture for their chosen application.
Abstract: Accelerators are special purpose processors designed to speed up compute-intensive sections of applications. Two extreme endpoints in the spectrum of possible accelerators are FPGAs and GPUs, which can often achieve better performance than CPUs on certain workloads. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. Applications typically exhibit vastly different performance characteristics depending on the accelerator. This is an inherent problem attributable to architectural design, middleware support and programming style of the target platform. For the best application-to-accelerator mapping, factors such as programmability, performance, programming cost and sources of overhead in the design flows must all be taken into consideration. In general, FPGAs provide the best expectation of performance, flexibility and low overhead, while GPUs tend to be easier to program and require fewer hardware resources. We present a performance study of three diverse applications - Gaussian elimination, data encryption standard (DES), and Needleman-Wunsch - on an FPGA, a GPU and a multicore CPU system. We perform a comparative study of application behavior on accelerators considering performance and code complexity. Based on our results, we present an application characteristic to accelerator platform mapping, which can aid developers in selecting an appropriate target architecture for their chosen application.

Proceedings ArticleDOI
01 Mar 2008
TL;DR: The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores to achieve increased energy and performance efficiency.
Abstract: In this paper we propose the Merge framework, a general purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores to achieve increased energy and performance efficiency. The Merge framework provides (1) a predicate dispatch-based library system for managing and invoking function variants for multiple architectures; (2) a high-level, library-oriented parallel language based on map-reduce; and (3) a compiler and runtime which implement the map-reduce language pattern by dynamically selecting the best available function implementations for a given input and machine configuration. Using a generic sequencer architecture interface for heterogeneous accelerators, the Merge framework can integrate function variants for specialized accelerators, offering the potential for to-the-metal performance for a wide range of heterogeneous architectures, all transparent to the user. The Merge framework has been prototyped on a heterogeneous platform consisting of an Intel Core 2 Duo CPU and an 8-core 32-thread Intel Graphics and Media Accelerator X3000, and a homogeneous 32-way Unisys SMP system with Intel Xeon processors. We implemented a set of benchmarks using the Merge framework and enhanced the library with X3000 specific implementations, achieving speedups of 3.6x -- 8.5x using the X3000 and 5.2x -- 22x using the 32-way system relative to the straight C reference implementation on a single IA32 core.
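The predicate-dispatch idea can be sketched independently of Merge's actual syntax: each function variant carries an applicability predicate, and the dispatcher selects the best registered variant at call time. Everything below (names, the probe, the ordering rule) is our hypothetical illustration, not the Merge API.

```cpp
// Hypothetical predicate-based dispatch over function variants.
#include <cstdio>
#include <functional>
#include <vector>

struct Variant {
    std::function<bool(int /*n*/)> applicable;  // e.g. "GPU present and n large"
    std::function<void(int)>       run;
};

std::vector<Variant> g_variants;

bool gpu_present() { return false; }  // assumption: probed once at startup

void dispatch(int n) {
    // Most-specific variants are registered first; fall through to generic C.
    for (auto& v : g_variants)
        if (v.applicable(n)) { v.run(n); return; }
}

int main() {
    g_variants.push_back({[](int n) { return gpu_present() && n > 4096; },
                          [](int n) { std::printf("accelerator variant, n=%d\n", n); }});
    g_variants.push_back({[](int)   { return true; },
                          [](int n) { std::printf("generic CPU variant, n=%d\n", n); }});
    dispatch(100000);  // no GPU probed here, so the CPU variant runs
}
```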

DOI
01 Jan 2008
TL;DR: This work explores multicore stencil (nearest-neighbor) computations - a class of algorithms at the heart of many structured grid codes, including PDE solvers - and develops a number of effective optimization strategies, and builds an auto-tuning environment that searches over the optimizations and their parameters to minimize runtime, while maximizing performance portability.
Abstract: Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures. Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA.

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.

1. Introduction. The computing industry has recently moved away from exponential scaling of clock frequency toward chip multiprocessors (CMPs) in order to better manage trade-offs among performance, energy efficiency, and reliability [1]. Because this design approach is relatively immature, there is a vast diversity of available CMP architectures. System designers and programmers are confronted with a confusing variety of architectural features, such as multicore, SIMD, simultaneous multithreading, core heterogeneity, and unconventional memory hierarchies, often combined in novel arrangements. Given the current flux in CMP design, it is unclear which architectural philosophy is best suited for a given class of algorithms. Likewise, this architectural diversity leads to uncertainty on how to refactor existing algorithms and tune them to take the maximum advantage of existing and emerging platforms. Understanding the most efficient design and utilization of these increasingly parallel multicore systems is one of the most challenging questions faced by the computing industry since it began.

This work presents a comprehensive set of multicore optimizations for stencil (nearest-neighbor) computations, a class of algorithms at the heart of most calculations involving structured (rectangular) grids, including both implicit and explicit partial differential equation (PDE) solvers. Our work explores the relatively simple 3D heat equation, which can be used as a proxy for more complex stencil calculations. In addition to their importance in scientific calculations, stencils are interesting as an architectural evaluation benchmark because they have abundant parallelism and low computational intensity, offering a mixture of opportunities for on-chip parallelism and challenges for associated memory systems.

Our optimizations include NUMA affinity, array padding, core/register blocking, prefetching, and SIMDization, as well as novel stencil algorithmic transformations that leverage multicore resources: thread blocking and circular queues. Since there are complex and unpredictable interactions between our optimizations and the underlying architectures, we develop an auto-tuning environment for stencil codes that searches over a set of optimizations and their parameters to minimize runtime and provide performance portability across the breadth of existing and future architectures. We believe such application-specific auto-tuners are the most practical near-term approach for obtaining high performance on multicore systems.

To evaluate the effectiveness of our optimization strategies we explore the broadest set of multicore architectures in the current HPC literature, including the out-of-order cache-based microprocessor designs of the dual-socket × quad-core AMD Barcelona and the dual-socket × quad-core Intel Clovertown, the heterogeneous local-store based architecture of the dual-socket × eight-core fast double precision STI Cell QS22 PowerXCell 8i Blade, as well as one of the first scientific studies of the hardware-multithreaded dual-socket × eight-core × eight-thread Sun Victoria Falls machine. Additionally, we present results on the single-socket × 240-core multithreaded streaming NVIDIA GeForce GTX280 general purpose graphics processing unit (GPGPU). This suite of architectures allows us to compare the mainstream multicore approach of replicating conventional cores that emphasize serial performance (Barcelona and …
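As a concrete baseline, the 3D heat equation used as the paper's proxy is a 7-point stencil. The sketch below (ours, with illustrative coefficients) is the naive sweep that optimizations such as padding, core/register blocking, and circular queues then transform.

```cpp
// One Jacobi sweep of the 7-point 3D heat equation: the baseline loop
// nest from which NUMA-aware initialization, padding, blocking,
// SIMDization, and circular queues all start.
#include <vector>

void heat3d_sweep(const std::vector<double>& in, std::vector<double>& out,
                  int nx, int ny, int nz, double alpha, double beta) {
    auto idx = [=](int i, int j, int k) { return (k * ny + j) * nx + i; };
    #pragma omp parallel for collapse(2)
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                out[idx(i, j, k)] =
                    alpha * in[idx(i, j, k)] +
                    beta  * (in[idx(i - 1, j, k)] + in[idx(i + 1, j, k)] +
                             in[idx(i, j - 1, k)] + in[idx(i, j + 1, k)] +
                             in[idx(i, j, k - 1)] + in[idx(i, j, k + 1)]);
}

int main() {
    int n = 64;
    std::vector<double> a(n * n * n, 1.0), b(a);
    heat3d_sweep(a, b, n, n, n, 0.4, 0.1);   // one time step
}
```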

Journal Article
TL;DR: In this paper, the authors argue that the benefits of heterogeneous CMPs are bolstered by the usage of a dynamic assignment policy, i.e., a runtime mechanism which observes the behavior of the running threads and exploits thread migration between the cores.
Abstract: In a multi-programmed computing environment, threads of execution exhibit different runtime characteristics and hardware resource requirements. Not only do the behaviors of distinct threads differ, but each thread may also present diversity in its performance and resource usage over time. A heterogeneous chip multiprocessor (CMP) architecture consists of processor cores and caches of varying size and complexity. Prior work has shown that heterogeneous CMPs can meet the needs of a multi-programmed computing environment better than a homogeneous CMP system. In fact, the use of a combination of cores with different caches and instruction issue widths better accommodates threads with different computational requirements.A central issue in the design and use of heterogeneous systems is to determine an assignment of tasks to processors which better exploits the hardware resources in order to improve performance. In this paper we argue that the benefits of heterogeneous CMPs are bolstered by the usage of a dynamic assignment policy, i.e., a runtime mechanism which observes the behavior of the running threads and exploits thread migration between the cores. We validate our analysis by means of simulation. Specifically, our model assumes a combination of Alpha EV5 and Alpha EV6 processors and of integer and floating point programs from the SPEC2000 benchmark suite. We show that a dynamic assignment can outperform a static one by 20% to 40% on average and by as much as 80% in extreme cases, depending on the degree of multithreading simulated.

Proceedings ArticleDOI
31 Oct 2008
TL;DR: A programming model for those environments, based on automatic function-level parallelism, that strives to be easy, flexible, portable, and performant is presented; it is demonstrated to offer reasonable performance without tuning and to rival highly tuned libraries with minimal tuning effort.
Abstract: Parallel programming on SMP and multi-core architectures is hard. In this paper we present a programming model for those environments based on automatic function level parallelism that strives to be easy, flexible, portable, and performant. Its main trait is its ability to exploit task level parallelism by analyzing task dependencies at run time. We present the programming environment in the context of algorithms from several domains and pinpoint its benefits compared to other approaches. We discuss its execution model and its scheduler. Finally we analyze its performance and demonstrate that it offers reasonable performance without tuning, and that it can rival highly tuned libraries with minimal tuning effort.
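The core mechanism, discovering task dependences from annotated arguments at run time, was later standardized in a similar form as OpenMP's task depend clauses; the sketch below uses that analogue rather than the paper's own syntax.

```cpp
// Dependence-driven tasking via OpenMP depend clauses (an analogue of
// the paper's model, not its syntax). The runtime sees that t2 reads
// what t1 writes and serializes them; t3 is independent and may run
// concurrently with t1.
#include <cstdio>

int main() {
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)   // t1: produces a
        a = 42;
        #pragma omp task depend(in: a)    // t2: waits for t1
        std::printf("a = %d\n", a);
        #pragma omp task depend(out: b)   // t3: independent of a
        b = 7;
        #pragma omp taskwait
        std::printf("b = %d\n", b);
    }
}
```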

Journal ArticleDOI
TL;DR: An updated take on Amdahl's analytical model uses modern design constraints to analyze many-core design alternatives, providing computer architects with a better understanding of many- core design types, enabling them to make more informed tradeoffs.
Abstract: An updated take on Amdahl's analytical model uses modern design constraints to analyze many-core design alternatives. The revised models provide computer architects with a better understanding of many-core design types, enabling them to make more informed tradeoffs.

Journal ArticleDOI
01 Aug 2008
TL;DR: An efficient implementation and detailed analysis of MergeSort on current CPU architectures, and performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count.
Abstract: Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than 0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results. Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.
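The parallel structure, separate from the paper's SIMD contribution, can be sketched as a depth-limited threaded mergesort; in the paper's algorithm the scalar std::inplace_merge below is replaced by a 128-bit SSE merge kernel and a bandwidth-friendly multiway merge.

```cpp
// Scalar, multi-threaded mergesort skeleton (parallel structure only).
#include <algorithm>
#include <thread>
#include <vector>

void merge_sort(std::vector<float>& v, size_t lo, size_t hi, int depth) {
    if (hi - lo < 2) return;
    size_t mid = lo + (hi - lo) / 2;
    if (depth > 0) {   // spawn across cores until the depth budget is spent
        std::thread left(merge_sort, std::ref(v), lo, mid, depth - 1);
        merge_sort(v, mid, hi, depth - 1);
        left.join();
    } else {
        merge_sort(v, lo, mid, 0);
        merge_sort(v, mid, hi, 0);
    }
    // The paper's SIMD bitonic-merge kernel would replace this call.
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

int main() {
    std::vector<float> data = {5, 1, 4, 2, 3, 0, 7, 6};
    merge_sort(data, 0, data.size(), 2);   // up to 4 threads at the leaves
}
```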

Proceedings ArticleDOI
07 Jun 2008
TL;DR: A compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms and a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented.
Abstract: While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming models execute as a set of independent actors that explicitly communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding and partitioning step based on integer linear programming is presented that unfolds data parallel actors as needed and maximally packs actors onto cores. Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores. To facilitate experimentation, a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented. For a range of streaming applications, a geometric mean speedup of 14.7x is achieved on a 16-core Cell platform compared to a single core.

Proceedings ArticleDOI
David Luebke
14 May 2008
TL;DR: This work describes the CUDA programming model, which enables high-performance computing developers to harness the tremendous computational power and memory bandwidth of the GPU in a familiar programming environment (the C programming language), and motivates its use in the biomedical imaging community.
Abstract: Graphics processing units (GPUs) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. Unlike multicore CPU architectures, which currently ship with two or four cores, GPU architectures are "manycore" with hundreds of cores capable of running thousands of threads in parallel. NVIDIA's CUDA is a co-evolved hardware-software architecture that enables high-performance computing developers to harness the tremendous computational power and memory bandwidth of the GPU in a familiar programming environment - the C programming language. We describe the CUDA programming model and motivate its use in the biomedical imaging community.

Proceedings ArticleDOI
15 Nov 2008
TL;DR: In this article, the authors describe strategies for decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution, and report the adaptation of these techniques to NAMD, a widely used parallel molecular dynamics simulation package.
Abstract: Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster.
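The overlap strategy described above, staging chunks through multiple CUDA streams so copies, kernels, and CPU-side communication proceed concurrently, can be sketched as follows. This is our illustration with assumed names, not NAMD code.

```cuda
// Two CUDA streams: the copy for chunk i+1 overlaps the kernel for
// chunk i, and the CPU stays free between launches for MPI work.
__global__ void forceKernel(const float4* coords, float4* forces, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) forces[i] = coords[i];   // placeholder physics
}

void run_chunks(const float4* h_in, float4* h_out,
                float4* d_in, float4* d_out, int chunks, int n) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        cudaMemcpyAsync(d_in + c * n, h_in + c * n, n * sizeof(float4),
                        cudaMemcpyHostToDevice, st);
        forceKernel<<<(n + 255) / 256, 256, 0, st>>>(d_in + c * n,
                                                     d_out + c * n, n);
        cudaMemcpyAsync(h_out + c * n, d_out + c * n, n * sizeof(float4),
                        cudaMemcpyDeviceToHost, st);
        // ...CPU work / communication can proceed here...
    }
    cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
}

int main() {
    const int chunks = 8, n = 1 << 16;
    float4 *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in, chunks * n * sizeof(float4));   // pinned memory is
    cudaMallocHost(&h_out, chunks * n * sizeof(float4));  // required for overlap
    cudaMalloc(&d_in, chunks * n * sizeof(float4));
    cudaMalloc(&d_out, chunks * n * sizeof(float4));
    run_chunks(h_in, h_out, d_in, d_out, chunks, n);
    cudaFreeHost(h_in); cudaFreeHost(h_out); cudaFree(d_in); cudaFree(d_out);
}
```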

Journal ArticleDOI
TL;DR: Simulation results show that Dynamic Repartitioning can produce energy savings of about 8 percent even with the best energy-efficient partitioning algorithm, while Dynamic Core Scaling, which adjusts the number of active cores, reduces leakage power consumption under low load conditions.
Abstract: Multicore processors deliver a higher throughput at lower power consumption than unicore processors. In the near future, they will thus be widely used in mobile real-time systems. There has been much research on energy-efficient scheduling of real-time tasks using DVS. These approaches must be modified for multicore processors, however, since normally all the cores in a chip must run at the same performance level. Thus, blindly adopting existing DVS algorithms that do not consider this restriction will result in a waste of energy. This article suggests a Dynamic Repartitioning algorithm based on existing partitioning approaches for multiprocessor systems. The algorithm dynamically balances the task loads of multiple cores to optimize power consumption during execution. We also suggest a Dynamic Core Scaling algorithm, which adjusts the number of active cores to reduce leakage power consumption under low load conditions. Simulation results show that Dynamic Repartitioning can produce energy savings of about 8 percent even with the best energy-efficient partitioning algorithm. The results also show that Dynamic Core Scaling can reduce energy consumption by about 26 percent under low load conditions.

Proceedings ArticleDOI
25 Oct 2008
TL;DR: A multi-dimensional, online performance predictor is presented, which is deployed to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems and outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS.
Abstract: Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance predictor, which we deploy to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems. We present results from an implementation of the predictor in a runtime library linked to the Intel OpenMP environment and running on an actual dual-processor quad-core system. We show that our predictor derives near-optimal settings of the power-aware program adaptation knobs that we consider. Our overall framework achieves significant reductions in energy (19% mean) and ED2 (40% mean), through simultaneous power savings (6% mean) and performance improvements (14% mean). We also find that our framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings.

Proceedings ArticleDOI
22 Apr 2008
TL;DR: The proposed approach can reasonably estimate the worst-case shared L2 instruction cache misses by considering inter-thread instruction conflicts, and the WCET of applications running on multi-core processors estimated by the approach is much better than the estimation obtained by simply assuming all L2 instruction accesses are misses.
Abstract: Multi-core chips have been increasingly adopted by the microprocessor industry. For real-time systems to safely harness the potential of multi-core computing, designers must be able to accurately obtain the worst-case execution time (WCET) of applications running on multi-core platforms, which is very challenging due to the possible runtime inter-core interferences in using shared resources such as the shared L2 caches. As the first step toward time-predictable multi-core computing, this paper presents a novel approach to bounding the worst-case performance for threads running on multi-core processors with shared L2 instruction caches. The idea of our approach is to compute the worst-case instruction access interferences between different threads based on the program control flow information of each thread, which can be statically analyzed. Our experiments indicate that the proposed approach can reasonably estimate the worst-case shared L2 instruction cache misses by considering inter-thread instruction conflicts. Also, the WCET of applications running on multi-core processors estimated by our approach is much better than the estimation obtained by simply assuming all L2 instruction accesses are misses.

Proceedings ArticleDOI
20 Feb 2008
TL;DR: FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures with weakly to strongly ordered consistency models, is presented; it is up to 5x faster than the next best solution.
Abstract: Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures with weakly to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.
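FastForward's central trick is that producer and consumer never read each other's index: a slot's null/non-null content alone signals empty versus full, so hot cache lines are not ping-ponged by shared head/tail counters. Below is a simplified sketch in that spirit (ours; the paper adds further cache optimizations).

```cpp
// Simplified FastForward-style SPSC queue: indices are thread-local,
// and a nullptr slot means "empty" while a non-null slot means "full".
#include <atomic>
#include <cstddef>

template <size_t N>           // N must be a power of two
class SpscQueue {
    std::atomic<void*> slots_[N] = {};
    size_t head_ = 0;         // touched only by the producer
    size_t tail_ = 0;         // touched only by the consumer
public:
    bool enqueue(void* p) {   // p must be non-null
        if (slots_[head_].load(std::memory_order_acquire) != nullptr)
            return false;     // queue full
        slots_[head_].store(p, std::memory_order_release);
        head_ = (head_ + 1) & (N - 1);
        return true;
    }
    void* dequeue() {
        void* p = slots_[tail_].load(std::memory_order_acquire);
        if (p == nullptr) return nullptr;   // queue empty
        slots_[tail_].store(nullptr, std::memory_order_release);
        tail_ = (tail_ + 1) & (N - 1);
        return p;
    }
};

int main() {
    SpscQueue<8> q;
    int x = 42;
    q.enqueue(&x);
    int* p = static_cast<int*>(q.dequeue());   // yields &x
    (void)p;
}
```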

Journal Article
TL;DR: Intel® Threading Building Blocks is a C++ runtime library that abstracts the low-level threading details necessary for effectively utilizing multi-core processors and uses C++ templates to eliminate the need to create and manage threads.
Abstract: Intel® Threading Building Blocks [1] is a C++ runtime library that abstracts the low-level threading details necessary for effectively utilizing multi-core processors. It uses C++ templates to eliminate the need to create and manage threads. Applications tend to be more portable since parallelism is achieved through library calls and utilization of a task manager for scheduling. The task manager analyzes the system the software is running on, chooses the optimal number of threads, and performs load balancing that spreads out the work evenly across all processor cores. The library consists of data structures and algorithms that simplify parallel programming in C++ without requiring a programmer to use native threading packages such as POSIX threads or Windows threads, or even the portable Boost Threads.
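A minimal tbb::parallel_for call shows the shape of the library's loop API: the task scheduler splits the range and balances the pieces across the worker threads it manages. (Sketched against current TBB headers; the 2008-era library took function objects where this uses a C++11 lambda.)

```cpp
// Minimal TBB loop parallelism: no std::thread or pthread calls appear.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;   // each worker handles one sub-range
        });
}
```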