
Showing papers on "Degree of parallelism published in 2011"


Journal ArticleDOI
TL;DR: It is demonstrated that by making alternative design decisions in the GPU implementation, an additional speedup of an order of magnitude can be obtained: by carefully considering memory access locality when dividing the workload among blocks of threads, the GPU's cache is used more efficiently, making more effective use of the available memory bandwidth.

330 citations


Proceedings ArticleDOI
29 Nov 2011
TL;DR: LARTS attempts to collocate reduce tasks with the maximum required data, computed after recognizing input data network locations and sizes, and adopts a cooperative paradigm seeking good data locality while circumventing scheduling delay, scheduling skew, poor system utilization, and low degree of parallelism.
Abstract: MapReduce offers a promising programming model for big data processing. Inspired by functional languages, MapReduce allows programmers to write functional-style code which gets automatically divided into multiple map and/or reduce tasks and scheduled over distributed data across multiple machines. Hadoop, an open source implementation of MapReduce, schedules map tasks in the vicinity of their inputs in order to diminish network traffic and improve performance. However, Hadoop schedules reduce tasks at requesting nodes without considering data locality, leading to performance degradation. This paper describes Locality-Aware Reduce Task Scheduler (LARTS), a practical strategy for improving MapReduce performance. LARTS attempts to collocate reduce tasks with the maximum required data computed after recognizing input data network locations and sizes. LARTS adopts a cooperative paradigm seeking good data locality while circumventing scheduling delay, scheduling skew, poor system utilization, and low degree of parallelism. We implemented LARTS in Hadoop-0.20.2. Evaluation results show that LARTS outperforms the native Hadoop reduce task scheduler by an average of 7%, and up to 11.6%.
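To make the locality-aware placement idea concrete, here is a minimal sketch (our illustration, not the LARTS code; the function and data layout are hypothetical): for each reduce partition, the scheduler sums the intermediate data each candidate node already holds and places the reduce task on the node with the most.

```python
# Hypothetical sketch of locality-aware reduce placement (not the LARTS code).
# partition_sizes[node][partition] = bytes of intermediate data for `partition`
# stored on `node` after the map phase.

def place_reduce_tasks(partition_sizes, partitions, nodes):
    """Assign each reduce partition to the node holding most of its input."""
    assignment = {}
    for p in partitions:
        best_node = max(nodes, key=lambda n: partition_sizes.get(n, {}).get(p, 0))
        assignment[p] = best_node
    return assignment

# Example: partition 0 has 300 MB on node "a" and 50 MB on node "b",
# so it is scheduled on "a" to minimize network traffic.
sizes = {"a": {0: 300, 1: 10}, "b": {0: 50, 1: 200}}
print(place_reduce_tasks(sizes, [0, 1], ["a", "b"]))  # {0: 'a', 1: 'b'}
```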

155 citations


Book ChapterDOI
01 Jan 2011
TL;DR: This contribution reveals the main ideas, potential benefits, and challenges of supporting invasive computing at the architectural, programming, and compiler levels, and gives an overview of required research topics rather than presenting mature solutions.
Abstract: A novel paradigm for designing and programming future parallel computing systems called invasive computing is proposed. The main idea and novelty of invasive computing is to introduce resource-aware programming support in the sense that a given program gets the ability to explore and dynamically spread its computations to neighbour processors in a phase called invasion, then to execute portions of code with a high degree of parallelism in parallel, based on the available invasible region on a given multi-processor architecture. Afterwards, once the program terminates or if the degree of parallelism should be lower again, the program may enter a retreat phase, deallocate resources and resume execution, for example, sequentially on a single processor. To support this idea of self-adaptive and resource-aware programming, not only new programming concepts, languages, compilers and operating systems are necessary, but also revolutionary architectural changes in the design of Multi-Processor Systems-on-a-Chip must be provided so as to efficiently support invasion, infection and retreat operations involving concepts for dynamic processor, interconnect and memory reconfiguration. This contribution reveals the main ideas, potential benefits and challenges of supporting invasive computing at the architectural, programming and compiler level in the future. It serves to give an overview of required research topics rather than presenting mature solutions.
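To make the invade/infect/retreat phases concrete, the following is a purely illustrative sketch under our own assumptions; the names and the Python API are hypothetical, since the paper proposes language, compiler, and hardware support rather than a library.

```python
# Illustrative only: a toy resource manager mimicking the invade/infect/retreat
# phases of invasive computing. All names are hypothetical.
from concurrent.futures import ProcessPoolExecutor

class Claim:
    def __init__(self, pool, workers):
        self.pool, self.workers = pool, workers

def invade(requested_pes, free_pes):
    """Claim up to `requested_pes` processing elements from the free set."""
    granted = min(requested_pes, free_pes)
    return Claim(ProcessPoolExecutor(max_workers=granted), granted)

def infect(claim, kernel, work_items):
    """Run the parallel kernel on the claimed resources."""
    return list(claim.pool.map(kernel, work_items))

def retreat(claim):
    """Release the claimed resources; execution may continue sequentially."""
    claim.pool.shutdown()

def square(x):
    return x * x

if __name__ == "__main__":
    claim = invade(requested_pes=8, free_pes=4)   # only 4 PEs are invasible
    results = infect(claim, square, range(16))
    retreat(claim)
    print(claim.workers, results)
```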

144 citations


Proceedings ArticleDOI
Jens Teubner, Rene Mueller
12 Jun 2011
TL;DR: This work presents handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution and gives a new intuition of window semantics, which it believes could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
Abstract: In spite of the omnipresence of parallel (multi-core) systems, the predominant strategy to evaluate window-based stream joins is still strictly sequential, mostly just straightforward along the definition of the operation semantics. In this work we present handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution. Handshake join naturally leverages available hardware parallelism, which we demonstrate with an implementation on a modern multi-core system and on top of field-programmable gate arrays (FPGAs), an emerging technology that has shown distinctive advantages for high-throughput data processing. On the practical side, we provide a join implementation that substantially outperforms CellJoin (the fastest published result) and that will directly turn any degree of parallelism into higher throughput or larger supported window sizes. On the semantic side, our work gives a new intuition of window semantics, which we believe could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
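The window-join semantics that handshake join parallelizes can be sketched sequentially as follows (a toy illustration with made-up window size and predicate; it deliberately omits the parallel, FPGA-friendly execution that is the paper's actual contribution).

```python
from collections import deque

# Toy, sequential illustration of window-based stream join semantics; the
# window size and the join predicate here are made up for the example.
WINDOW = 4

def stream_join(stream_r, stream_s, predicate):
    window_r, window_s = deque(maxlen=WINDOW), deque(maxlen=WINDOW)
    results = []
    for r, s in zip(stream_r, stream_s):
        # Each new tuple is compared against the opposite stream's window.
        results += [(r, x) for x in window_s if predicate(r, x)]
        results += [(x, s) for x in window_r if predicate(x, s)]
        window_r.append(r)
        window_s.append(s)
    return results

print(stream_join(range(8), range(1, 9), lambda a, b: a == b))
```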

144 citations


Proceedings ArticleDOI
20 Oct 2011
TL;DR: A set of link component models for performing interconnect design-space exploration connected to the underlying device and circuit technology is presented, demonstrating the link-level interactions between components in achieving the optimal degree of parallelism and energy-efficiency.
Abstract: Integrated photonic interconnects have emerged recently as a potential solution for relieving on-chip and chip-to-chip bandwidth bottlenecks for next-generation many-core processors. To help bridge the gap between device and circuit/system designers, and aid in understanding of inherent photonic link tradeoffs, we present a set of link component models for performing interconnect design-space exploration connected to the underlying device and circuit technology. To compensate for process and thermal-induced ring resonator mismatches, we take advantage of device and circuit characteristics to propose an efficient ring tuning solution. Finally, we perform optimization of a wavelength-division-multiplexed link, demonstrating the link-level interactions between components in achieving the optimal degree of parallelism and energy-efficiency.

134 citations


Patent
26 Aug 2011
TL;DR: In this article, the authors propose an approach to improve management of multiple I/O threads that take advantage of the high-performing and concurrent nature of the back end media, so the resulting storage system can achieve very high performance.
Abstract: Solid State Drives (SSDs) can yield very high performance if designed properly. An SSD typically includes both a front end that interfaces with the host and a back end that interfaces with the flash media. Typically, SSDs include flash media that is designed with a high degree of parallelism and can support very high bandwidth on input/output (I/O). An SSD front end designed according to a traditional hard disk drive (HDD) model will not be able to take advantage of the high performance offered by the typical flash media. Embodiments of the invention provide improved management of multiple I/O threads that take advantage of the high-performing and concurrent nature of the back end media, so the resulting storage system can achieve very high performance.

128 citations


Book ChapterDOI
29 May 2011
TL;DR: This paper proposes an approach for optimizing graph pattern matching by reinterpreting certain join tree structures as grouping operations, which enables a greater degree of parallelism in join processing, resulting in more "bushy"-like query execution plans with fewer Map-Reduce cycles.
Abstract: Existing MapReduce systems support relational style join operators which translate multi-join query plans into several Map-Reduce cycles. This leads to high I/O and communication costs due to the multiple data transfer steps between map and reduce phases. SPARQL graph pattern matching is dominated by join operations, and is unlikely to be efficiently processed using existing techniques. This cost is prohibitive for RDF graph pattern matching queries, which typically involve several join operations. In this paper, we propose an approach for optimizing graph pattern matching by reinterpreting certain join tree structures as grouping operations. This enables a greater degree of parallelism in join processing, resulting in more "bushy"-like query execution plans with fewer Map-Reduce cycles. This approach requires that the intermediate results are managed as sets of groups of triples, or TripleGroups. We therefore propose a data model and algebra - Nested TripleGroup Algebra - for capturing and manipulating TripleGroups. The relationship with the traditional relational style algebra used in Apache Pig is discussed. A comparative performance evaluation of the traditional Pig approach and RAPID+ (Pig extended with NTGA) for graph pattern matching queries on the BSBM benchmark dataset is presented. Results show up to 60% performance improvement of our approach over traditional Pig for some tasks.

91 citations


Journal ArticleDOI
Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, David I. August
04 Jun 2011
TL;DR: The Degree of Parallelism Executive (DoPE) is presented, an API and run-time system that separates the concern of exposing parallelism from that of optimizing it and dynamically optimizing the parallelism for a variety of performance goals.
Abstract: In writing parallel programs, programmers expose parallelism and optimize it to meet a particular performance goal on a single platform under an assumed set of workload characteristics. In the field, changing workload characteristics, new parallel platforms, and deployments with different performance goals make the programmer's development-time choices suboptimal. To address this problem, this paper presents the Degree of Parallelism Executive (DoPE), an API and run-time system that separates the concern of exposing parallelism from that of optimizing it. Using the DoPE API, the application developer expresses parallelism options. During program execution, DoPE's run-time system uses this information to dynamically optimize the parallelism options in response to the facts on the ground. We easily port several emerging parallel applications to DoPE's API and demonstrate the DoPE run-time system's effectiveness in dynamically optimizing the parallelism for a variety of performance goals.
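The separation of concerns that DoPE advocates can be illustrated, purely hypothetically (this is not the DoPE API), as the developer exposing parallelism options while a run-time policy picks a degree of parallelism to meet a latency goal.

```python
# Hypothetical illustration of separating "expose parallelism" (developer)
# from "optimize parallelism" (run-time); this is not the actual DoPE API.
from concurrent.futures import ThreadPoolExecutor

class Runtime:
    def __init__(self, goal_latency_s):
        self.goal = goal_latency_s

    def choose_dop(self, options, measured_latency_s):
        """Pick the smallest exposed degree of parallelism expected to meet the goal."""
        needed = measured_latency_s / self.goal        # assume near-linear scaling
        feasible = [o for o in options if o >= needed]
        return min(feasible) if feasible else max(options)

def process(item):
    return item * item

developer_options = [1, 2, 4, 8]      # parallelism options exposed by the developer
runtime = Runtime(goal_latency_s=0.5)
dop = runtime.choose_dop(developer_options, measured_latency_s=1.6)  # -> 4
with ThreadPoolExecutor(max_workers=dop) as pool:
    results = list(pool.map(process, range(100)))
print(dop, len(results))
```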

65 citations


Proceedings ArticleDOI
07 Dec 2011
TL;DR: A two-level MapReduce scheduler built on previous techniques, incorporating a deadline scheduler that adopts a sampling-based approach and a resource allocation model to dynamically control each real-time job to execute with a minimum task assignment at any time, so as to maximize the number of concurrent real-time jobs.
Abstract: MapReduce scheduling is becoming a hot topic as MapReduce attracts more and more attention from both industry and academia. In this paper, we focus on the scheduling of mixed real-time and non-real-time applications in the MapReduce environment, which is a challenging problem that has received only limited attention. To solve this problem, we present a two-level MapReduce scheduler built on previous techniques and make two key contributions. First, to meet the performance goal of real-time applications, we propose a deadline scheduler which adopts (1) a sampling-based approach, Tasks Forward Scheduling (TFS), to predict map/reduce task execution time (unlike prior work that requires users to input an estimated value), and (2) a resource allocation model, Approximately Uniform Minimum Degree of Parallelism (AUMD), to dynamically control each real-time job to execute with a minimum task assignment at any time, so as to maximize the number of concurrent real-time jobs. Second, by integrating this deadline scheduler into the existing MapReduce scheduler, we develop a two-level scheduler with resource preemption support, which can schedule mixed real-time and non-real-time jobs according to their respective performance demands. We implement our scheduler in the Hadoop system, and experiments running on a real, small-scale cluster demonstrate that it can schedule mixed real-time and non-real-time jobs to meet their different quality-of-service (QoS) demands.
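The "minimum degree of parallelism" idea can be made concrete with a rough calculation based on our reading of the abstract (not the paper's exact AUMD model): run each real-time job with just enough concurrent tasks that its remaining work finishes before the deadline, leaving the other slots free for additional jobs.

```python
import math

# Rough illustration (not the paper's exact AUMD model): the fewest concurrent
# tasks a job needs so that its remaining work completes before its deadline.
def minimum_degree_of_parallelism(remaining_tasks, task_time_s, time_to_deadline_s):
    if time_to_deadline_s <= 0:
        raise ValueError("deadline already passed")
    rounds_available = time_to_deadline_s / task_time_s  # sequential rounds that still fit
    return math.ceil(remaining_tasks / max(1, math.floor(rounds_available)))

# 120 remaining map tasks of ~30 s each, 10 minutes to the deadline:
# 20 rounds fit, so about 6 tasks must run concurrently.
print(minimum_degree_of_parallelism(120, 30, 600))  # 6
```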

57 citations


Proceedings ArticleDOI
12 Jul 2011
TL;DR: This work investigates the performance of a highly parallel Particle Swarm Optimization (PSO) algorithm implemented on the GPU and shows that the GPU offers a high degree of performance and achieves a maximum of 37 times speedup over a sequential implementation when the problem size in terms of tasks is large and many swarms are used.
Abstract: We investigate the performance of a highly parallel Particle Swarm Optimization (PSO) algorithm implemented on the GPU. In order to achieve this high degree of parallelism we implement a collaborative multi-swarm PSO algorithm on the GPU which relies on the use of many swarms rather than just one. We choose to apply our PSO algorithm against a real-world application: the task matching problem in a heterogeneous distributed computing environment. Due to the potential for large problem sizes with high dimensionality, the task matching problem proves to be very thorough in testing the GPU's capabilities for handling PSO. Our results show that the GPU offers a high degree of performance and achieves a maximum speedup of 37 times over a sequential implementation when the problem size in terms of tasks is large and many swarms are used.
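For reference, the per-particle update at the core of any PSO variant, including the collaborative multi-swarm version described above, looks roughly like this; a GPU implementation would evaluate many particles and swarms concurrently, and the constants below are conventional choices, not the paper's.

```python
import random

# Standard PSO velocity/position update (conventional constants, not the paper's).
W, C1, C2 = 0.7, 1.5, 1.5

def pso_step(position, velocity, personal_best, swarm_best):
    new_pos, new_vel = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, swarm_best):
        v = (W * v
             + C1 * random.random() * (pb - x)    # pull toward the particle's own best
             + C2 * random.random() * (gb - x))   # pull toward the swarm's best
        new_vel.append(v)
        new_pos.append(x + v)
    return new_pos, new_vel

pos, vel = [0.0, 0.0], [0.1, -0.1]
pos, vel = pso_step(pos, vel, personal_best=[1.0, 1.0], swarm_best=[2.0, 2.0])
print(pos, vel)
```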

56 citations


Proceedings Article
14 Jun 2011
TL;DR: It is found that SSDs adopt significantly different ways of exploiting channel-level and way-level parallelism to maximize write throughput, which governs the peak current consumption.
Abstract: In this work, we perform µsec time scale analysis on the energy consumption behavior of the SSD write operation and exploit this information to extract key technical characteristics of SSD internals: channel utilization policy, page allocation strategy, cluster size, channel switch delay, way switch delay, etc. We found that some SSDs adopt a multi-page cluster as a write unit instead of a page. We found that SSDs adopt significantly different ways of exploiting channel-level and way-level parallelism to maximize write throughput, which governs the peak current consumption. The X25M (Intel) emphasizes the performance aspect of SSDs and linearly increases the channel parallelism as the I/O size increases. The MXP (Samsung) puts more emphasis on the energy consumption aspect of the SSD and controls the degree of parallelism to reduce the peak current consumption. The cluster sizes of the X25M and the MXP correspond to one and eight pages, respectively. The current consumed when writing a page to NAND flash varies significantly depending on the NAND model (17 mA-35 mA).

Proceedings ArticleDOI
13 Jun 2011
TL;DR: In this article, the authors present an efficient event processing platform to support high-frequency and low-latency event matching over reconfigurable hardware, where each solution is formulated as a design trade-off between the degree of parallelism and the desired application requirement.
Abstract: We present fpga-ToPSS (Toronto Publish/Subscribe System), an efficient event processing platform to support high-frequency and low-latency event matching. fpga-ToPSS is built over reconfigurable hardware---FPGAs---to achieve line-rate processing by exploring various degrees of parallelism. Furthermore, each of our proposed FPGA-based designs is geared towards a unique application requirement, such as flexibility, adaptability, scalability, or pure performance, such that each solution is specifically optimized to attain a high level of parallelism. Therefore, each solution is formulated as a design trade-off between the degree of parallelism and the desired application requirement. Moreover, our event processing engine supports Boolean expression matching with an expressive predicate language applicable to a wide range of applications including real-time data analysis, algorithmic trading, targeted advertisement, and (complex) event processing.

Journal ArticleDOI
TL;DR: An efficient parallel fault simulator, FSimGP2, that exploits the high degree of parallelism supported by a state-of-the-art graphic processing unit (GPU) with the NVIDIA compute unified device architecture to achieve extremely high computation efficiency on the GPU.
Abstract: General purpose computing on graphical processing units (GPGPU) is a paradigm shift in computing that promises a dramatic increase in performance. GPGPU also brings an unprecedented level of complexity in algorithmic design and software development. In this paper, we present an efficient parallel fault simulator, FSimGP2, that exploits the high degree of parallelism supported by a state-of-the-art graphic processing unit (GPU) with the NVIDIA compute unified device architecture. A novel 3-D parallel fault simulation technique is proposed to achieve extremely high computation efficiency on the GPU. Global communication is minimized by concentrating as much work as possible on the local device's memory. We present results on a GPU platform from NVIDIA (a GeForce GTX 285 graphics card) that demonstrate a speedup of up to 63× and 4× compared to two other GPU-based fault simulators and up to 95× over a state-of-the-art algorithm on conventional processor architectures.

Journal ArticleDOI
TL;DR: The Abstract Next Subvolume Method is introduced, in which the model representation is decoupled from the sequential simulation algorithms, and it is proved that state trajectories generated by its executions statistically accord with those generated by the Next Subvolume Method.

Proceedings ArticleDOI
07 Dec 2011
TL;DR: This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP), and proposes a methodology that can adaptively adjust the thread-level parallelism in mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages.
Abstract: Dynamic programming (DP) is an important computational method for solving a wide variety of discrete optimization problems such as scheduling, string editing, packaging, and inventory management. In general, DP is classified into four categories based on the characteristics of the optimization equation. Because applications that are classified in the same category of DP have similar program behavior, the research community has sought to propose general solutions for parallelizing each category of DP. However, most existing studies focus on running DP on CPU-based parallel systems rather than on accelerating DP algorithms on the graphics processing unit (GPU). This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP). In NPDP applications, the degree of parallelism varies significantly in different stages of computation, making it difficult to fully utilize the compute power of hundreds of processing cores in a GPU. To address this challenge, we propose a methodology that can adaptively adjust the thread-level parallelism in mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages. We realize our approach in a real-world NPDP application -- the optimal matrix parenthesization problem. Experimental results demonstrate our method can achieve a speedup of 13.40 over the previously published GPU algorithm.
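The varying degree of parallelism in NPDP is easy to see in the optimal matrix parenthesization recurrence: all subproblems on the same diagonal (the same chain length) are mutually independent and could be computed concurrently, but the number of such subproblems shrinks from n-1 down to 1 as the computation progresses. A small CPU-side sketch of the recurrence (our illustration, not the paper's GPU kernel):

```python
# Matrix-chain parenthesization DP; entries on each diagonal are independent,
# which is the parallelism a GPU mapping would exploit (illustration only).
def matrix_chain_cost(dims):
    n = len(dims) - 1                       # number of matrices
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # stage = chain length (one diagonal)
        # All (i, j) pairs in this inner loop are independent of one another.
        for i in range(0, n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]

# Chain of 4 matrices with dimensions 10x30, 30x5, 5x60, 60x10.
print(matrix_chain_cost([10, 30, 5, 60, 10]))  # 5000 scalar multiplications
```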

Proceedings ArticleDOI
04 Jun 2011
TL;DR: This paper parallelizes a dynamic algorithm for well-spaced point sets, an important problem related to mesh refinement in computational geometry, and describes techniques for implementing the algorithm on modern multi-core computers and provides a prototype implementation.
Abstract: Parallel algorithms and dynamic algorithms possess an interesting duality property: compared to sequential algorithms, parallel algorithms improve run-time while preserving work, while dynamic algorithms improve work but typically offer no parallelism. Although they are often considered separately, parallel and dynamic algorithms employ similar design techniques. They both identify parts of the computation that are independent of each other. This suggests that dynamic algorithms could be parallelized to improve work efficiency while preserving fast parallel run-time. In this paper, we parallelize a dynamic algorithm for well-spaced point sets, an important problem related to mesh refinement in computational geometry. Our parallel dynamic algorithm computes a well-spaced superset of a dynamically changing set of points, allowing arbitrary dynamic modifications to the input set. On an EREW PRAM, our algorithm processes batches of k modifications such as insertions and deletions in O(k log Δ) total work and in O(log Δ) parallel time using k processors, where Δ is the geometric spread of the data, while ensuring that the output is always within a constant factor of the optimal size. The EREW PRAM model is quite different from actual hardware such as modern multiprocessors. We therefore describe techniques for implementing our algorithm on modern multi-core computers and provide a prototype implementation. Our empirical evaluation shows that our algorithm can be practical, yielding a large degree of parallelism and good speedups.

ReportDOI
01 Jun 2011
TL;DR: A new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently, and an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing.
Abstract: We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.
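The heterogeneous 1-D block-cyclic idea can be sketched as follows (a simplification under our own assumptions; the tile widths and device labels are made up, not the paper's tuned sizes): columns of tiles are dealt out cyclically, with the host receiving narrower tiles than each GPU to balance their different throughputs.

```python
# Simplified sketch of a heterogeneous 1-D block-cyclic column distribution
# (illustrative widths; not the paper's tuned tile sizes).
def distribute_columns(n_cols, host_tile, gpu_tile, n_gpus):
    owners, col = [], 0
    devices = [("host", host_tile)] + [(f"gpu{i}", gpu_tile) for i in range(n_gpus)]
    while col < n_cols:
        for name, width in devices:          # deal tiles out cyclically
            take = min(width, n_cols - col)
            if take > 0:
                owners.append((name, col, col + take))
                col += take
    return owners

# 2 GPUs, host tiles of 2 columns, GPU tiles of 6 columns, 20 columns total.
for owner, start, stop in distribute_columns(20, host_tile=2, gpu_tile=6, n_gpus=2):
    print(owner, list(range(start, stop)))
```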

Journal ArticleDOI
21 Dec 2011
TL;DR: A principled approach to designing energy-efficient, heterogeneous data centers that are robust against data center workload variations, using Wald’s minimax criterion as a starting point is outlined.
Abstract: Data centers represent the fastest growing component of the information and communication technologies (ICT) energy footprint. With the advent of cloud computing, data centers will increasingly be used to process a wide array of jobs with differing characteristics such as degree of parallelism, memory access patterns, etc. From an energy efficiency perspective, the most energy efficient server architecture differs for jobs with different characteristics [4], motivating the need to consider heterogeneous data center designs consisting of many server types [3, 5]. Even though the types of jobs that a data center is expected to serve might be known at design time, the workload statistics are often unknown until the data center is deployed. Therefore, data centers should be designed keeping in mind the uncertainty in workload statistics. In this paper, we outline a principled approach to designing energy-efficient, heterogeneous data centers that are robust against data center workload variations, using Wald's minimax criterion as a starting point. In the proposed formulation, we assume that the only thing that is known at design time is an upper bound on the total rate (over all job types) at which jobs arrive at the data center, and design the data center to have the minimum worst-case energy consumption over all job type mixes. We then highlight a number of potential avenues for further investigation.
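Under our reading of the abstract (notation ours, not the paper's), the design objective can be written as a minimax problem: choose the server mix x that minimizes the worst-case energy over all job-type mixes λ whose total arrival rate stays below the known bound Λ.

```latex
\min_{x \in \mathcal{X}} \;\; \max_{\lambda \ge 0,\ \sum_i \lambda_i \le \Lambda} \; E(x, \lambda)
```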

Journal ArticleDOI
TL;DR: Based on a novel self-assembly platform consisting of self-propulsive centimetre-sized modules capable of aggregation on the surface of water, the effect of stochasticity and morphology (shape) on the yield of targeted formations in self-assembly processes is studied.
Abstract: The decay in structure size of manufacturing products has yielded new demands on spontaneous composition methods. The key for the realization of small-sized robots lies in how to achieve an efficient assembly sequence in a bottom-up manner, where most of the parts have only limited or no computational (i.e. deliberative) abilities. In this paper, based on a novel self-assembly platform consisting of self-propulsive centimetre-sized modules capable of aggregation on the surface of water, we study the effect of stochasticity and morphology (shape) on the yield of targeted formations in self-assembly processes. Specifically, we focus on a unique phenomenon, termed one-shot aggregation, in which a number of modules instantly compose a target product without forming intermediate subassemblies, some of which constitute undesired geometrical formations. Together with a focus on the role that the morphology of the modules plays, we validate the effect of one-shot aggregation with a kinetic rate mathematical model. Moreover, we examine the degree of parallelism of the assembly process, which is an essential factor in self-assembly but is not systematically taken into account by existing frameworks.

Proceedings ArticleDOI
16 May 2011
TL;DR: This paper proposes a methodology for applications to automatically claim linear arrays of processing elements within massively parallel processor arrays at run-time depending on the available degree of parallelism or dynamic computing requirements.
Abstract: This paper proposes a methodology for applications to automatically claim linear arrays of processing elements within massively parallel processor arrays at run-time, depending on the available degree of parallelism or dynamic computing requirements. Using this methodology, parallel programs running on individual processing elements gain the capability of autonomously managing the available processing resources in their neighborhood. We present different protocols and architectural support for gathering and transporting the result of a resource exploration in order to inform a configuration loader about the number and location of the claimed resources. The timing and data overhead costs of four different approaches are mathematically evaluated. In order to verify and compare these decentralized algorithms, a simulation platform has been developed to compare the data overhead and scalability of each approach for different sizes of processor arrays.

Journal ArticleDOI
01 Jan 2011
TL;DR: The very last step of the algorithm, producing a gapped alignment with the Needleman-Wunsch algorithm, is performed in software, with the option of hardware processing after reconfiguration; this saves FPGA resources and allows an even higher degree of parallelism.
Abstract: Protein database search requests are generally performed using the BLASTp algorithm, introduced by NCBI [1]. Since it is computationally intensive, it becomes more and more ineffective with today's growth of sequence database sizes. The need for an efficient parallel implementation arises. In this paper, we focus on a massive parallelization using the FPGA-based hardware architecture RIVYERA [2]. The aim is to reach speedups of orders of magnitude with a flexible implementation while saving energy costs compared to PC-based database search. We keep our implementation close to the structure published by Kasap et al. [3], [4] and include ideas from Sotiriades et al. [5], such that all parts of the algorithm are organized in components of a long pipeline. We also use the idea of the two-hit method [6] to keep the computational effort small. In contrast to the related work, we perform the very last step of the algorithm, which produces a gapped alignment with the Needleman-Wunsch algorithm, in software, only with the option of hardware processing after reconfiguration. This saves FPGA resources and allows an even higher degree of parallelism.
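The two-hit method mentioned above is a standard BLAST-style seeding idea, sketched here generically rather than as this paper's FPGA pipeline: an ungapped extension is triggered only when two non-overlapping word hits fall on the same diagonal within a short distance.

```python
# Generic two-hit seeding (textbook BLAST idea, not this paper's FPGA pipeline):
# an extension is triggered only when two word hits share a diagonal and are
# close together, which drastically reduces the number of extensions.
WORD = 3          # word length
MAX_GAP = 40      # maximum distance between the two hits on a diagonal

def two_hit_seeds(query, subject):
    words = {}
    for q in range(len(query) - WORD + 1):
        words.setdefault(query[q:q + WORD], []).append(q)
    last_hit, seeds = {}, []
    for s in range(len(subject) - WORD + 1):
        for q in words.get(subject[s:s + WORD], []):
            diag = s - q
            prev = last_hit.get(diag)
            if prev is not None and WORD <= s - prev <= MAX_GAP:
                seeds.append((q, s))      # second hit on the diagonal: extend here
            last_hit[diag] = s
    return seeds

print(two_hit_seeds("MKVLAAGMKVLA", "XXMKVLAYYMKVLAZZ"))
```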

DOI
01 Jul 2011
TL;DR: This work uses the approach of matrix-based multigrid that has high flexibility and adapts well to the exigencies of modern computing platforms, and investigates multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers.
Abstract: Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities, stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes in order to properly resolve all problem-inherent specifics and maintain a moderate number of unknowns. However, flexibility in the meshes goes along with severe drawbacks with respect to parallel execution - especially with respect to the definition of adequate smoothers. This point becomes particularly pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid that has high flexibility and adapts well to the exigencies of modern computing platforms. In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. In combination with matrix-based multigrid methods on unstructured meshes, our smoothers provide powerful solvers that are applicable across a wide range of parallel computing platforms and almost arbitrary geometries. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.
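The parallelism in a multi-colored Gauss-Seidel smoother comes from the coloring: unknowns of the same color have no direct coupling, so a whole color can be relaxed concurrently. A small sequential sketch of one sweep (illustrative only, not the lmpLAtoolbox implementation):

```python
# One multi-colored Gauss-Seidel sweep: within a color, updates are independent
# and could run in parallel (sequential sketch for illustration).
def colored_gauss_seidel_sweep(A, b, x, colors):
    n = len(b)
    for color in colors:                    # colors are processed one after another
        for i in color:                     # every i in a color is independent
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - sigma) / A[i][i]
    return x

# 1-D Laplacian on 4 points: red = even indices, black = odd indices.
A = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
b = [1.0, 0.0, 0.0, 1.0]
x = [0.0] * 4
for _ in range(20):
    x = colored_gauss_seidel_sweep(A, b, x, colors=[[0, 2], [1, 3]])
print(x)   # converges toward the solution of Ax = b, i.e. [1, 1, 1, 1]
```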

Book ChapterDOI
05 Dec 2011
TL;DR: This paper initiates a study on computing the degree of parallelism for three classes of BPMN processes, which are defined based on the use of BPMN gateways, and presents an algorithm for computing the degree of parallelism of each class.
Abstract: For sequential processes and workflows (i.e., pipelined tasks), each enactment (process instance) only has one task being performed at each time instant. When a process allows tasks to be performed in parallel, an enactment may have a number of tasks being performed concurrently and this number may change in time. We define the “degree of parallelism” of a process as the maximum number of tasks to be performed concurrently during an execution of the process. This paper initiates a study on computing degree of parallelism for three classes of BPMN processes, which are defined based on the use of BPMN gateways. For each class, an algorithm for computing degree of parallelism is presented. In particular, the algorithms for “homogeneous” and acyclic “choice-less” processes (respectively) have polynomial time complexity, while the algorithm for “asynchronous” processes runs in exponential time.
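For processes built only from parallel (AND) gateways, the intuition is that the degree of parallelism of a sequence is the maximum over its steps, while that of a parallel block is the sum over its branches. A tiny recursive sketch of that intuition (ours, not the paper's BPMN algorithm):

```python
# Toy degree-of-parallelism computation for nested AND-fork/join structures
# (our illustration of the intuition; not the paper's BPMN algorithm).
def degree_of_parallelism(node):
    kind, children = node
    if kind == "task":
        return 1
    if kind == "seq":                      # sequential steps: maximum over the steps
        return max(degree_of_parallelism(c) for c in children)
    if kind == "par":                      # AND-fork: branches run concurrently
        return sum(degree_of_parallelism(c) for c in children)
    raise ValueError(kind)

# seq(A, par(B, seq(C, par(D, E)))) -> at most 3 tasks run at the same time.
process = ("seq", [("task", []),
                   ("par", [("task", []),
                            ("seq", [("task", []),
                                     ("par", [("task", []), ("task", [])])])])])
print(degree_of_parallelism(process))  # 3
```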

Journal ArticleDOI
TL;DR: An improvement to a parallel implementation of T-Coffee, a widely used MSA package, that resolves the bottleneck of the progressive alignment stage on MSA and shows improvements in execution time of over 68% while maintaining the biological accuracy.
Abstract: Multiple Sequence Alignment (MSA) constitutes an extremely powerful tool for important biological applications such as phylogenetic analysis, identification of conserved motifs and domains, and structure prediction. In spite of the improvement in speed and accuracy introduced by MSA programs, the computational requirements for large-scale alignments require high-performance computing and parallel applications. In this paper we present an improvement to a parallel implementation of T-Coffee, a widely used MSA package. Our approach resolves the bottleneck of the progressive alignment stage of MSA. This is achieved by increasing the degree of parallelism by balancing the guide tree that drives the progressive alignment process. The experimental results show improvements in execution time of over 68% while maintaining the biological accuracy.

01 Jan 2011
TL;DR: In this article, a matrix-based geometric multigrid method with high flexibility with respect to complex geometries and local singularities is proposed, which adapts well to the exigencies of modern computing platforms.
Abstract: Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. We use the approach of matrix-based geometric multigrid that has high flexibility with respect to complex geometries and local singularities. Furthermore, it adapts well to the exigencies of modern computing platforms. In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p,q) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.

01 Jan 2011
TL;DR: The results presented in this dissertation demonstrate that data-driven execution, coupled with metadata abstractions, effectively support latency tolerance and enable performance optimization techniques that are decoupled from the algorithmic formulation and the control flow of the application code.
Abstract: In supercomputing systems, architectural changes that increase computational power are often reflected in the programming model. As a result, in order to realize and sustain the potential performance of such systems, it is necessary in practice to deal with architectural details and explicitly manage the resources to an increasing extent. In particular, programmers are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realize the potential of today’s distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations limiting overlap and the program’s ability to adapt to communication delays. This thesis proposes an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e. task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon’s functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations. The results presented in this dissertation demonstrate that data-driven execution, coupled with metadata abstractions, effectively support latency tolerance. In addition, performance metadata enable performance optimization techniques that are decoupled from the algorithmic formulation and the control flow of the application code. By expressing the structure of the computation and its characteristics with metadata, the programmer can focus on the application and rely on Tarragon and its run-time system to automatically overlap communication with computation and optimize the performance.

Journal ArticleDOI
TL;DR: A rarely used metric is discussed that is well suited to evaluate online schedules for independent jobs on massively parallel processors; an almost tight competitive factor of 1.25 is proved for nondelay schedules, while for rigid parallel jobs no constant competitive factor exists.
Abstract: The paper discusses a rarely used metric that is well suited to evaluate online schedules for independent jobs on massively parallel processors. The metric is based on the total weighted completion time objective with the weight being the resource consumption of the job. Although every job contributes to the objective value, the metric exhibits many properties that are similar to the properties of the makespan objective. For this metric, we particularly address nonclairvoyant online scheduling of sequential jobs on parallel identical machines and prove an almost tight competitive factor of 1.25 for nondelay schedules. For the extension of the problem to rigid parallel jobs, we show that no constant competitive factor exists. However, if all jobs are released at time 0, List Scheduling in descending order of the degree of parallelism guarantees an approximation factor of 2.
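Written out, the metric is a total weighted completion time in which each job's weight equals its resource consumption; under our reading, for a rigid job j that occupies m_j machines for p_j time units and completes at time C_j (with m_j = 1 for sequential jobs), the objective is:

```latex
\sum_{j} w_j \, C_j \qquad \text{with} \qquad w_j = m_j \, p_j
```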

Proceedings ArticleDOI
26 Sep 2011
TL;DR: The evolutionary strategy ensures that EMTS can be used with any underlying model for predicting the execution time of moldable tasks, and significantly reduces the makespan of PTGs compared to other heuristics for both non-monotonically and monotonically decreasing models.
Abstract: Parallel task graphs (PTGs) arise when parallel programs are combined into larger applications, e.g., scientific workflows. Scheduling these PTGs onto clusters is a challenging problem due to the additional degree of parallelism stemming from moldable tasks. Most algorithms are based on the assumption that the execution time of a parallel task is monotonically decreasing as the number of processors increases. But this assumption does not hold in practice, since parallel programs often perform better if the number of processors is a multiple of internally used block sizes. In this article, we introduce the Evolutionary Moldable Task Scheduling (EMTS) algorithm for scheduling static PTGs onto homogeneous clusters. We apply an evolutionary approach to determine the processor allocation of each task. The evolutionary strategy ensures that EMTS can be used with any underlying model for predicting the execution time of moldable tasks. With the purpose of finding solutions quickly, EMTS considers results of other heuristics (e.g., HCPA, MCPA) as starting solutions. The experimental results show that EMTS significantly reduces the makespan of PTGs compared to other heuristics for both non-monotonically and monotonically decreasing models.

Proceedings ArticleDOI
10 Oct 2011
TL;DR: Tarragon, which is based on dataflow, targets latency tolerant scientific computations and achieves high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations.
Abstract: In the current practice, scientific programmers and HPC users are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realize the potential of today's distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations, limiting overlap and the program's ability to adapt to communication delays. In this paper we present an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e. task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon's functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations.

Journal ArticleDOI
TL;DR: The analogy between GPGPU and hardware/software co-design (HSCD), a more mature design paradigm, is spotlighted to derive a design process for GPGPU, which will ease GPGPU design significantly.
Abstract: With the recent development of high-performance graphical processing units (GPUs), capable of performing general-purpose computation (GPGPU: general-purpose computation on the GPU), a new platform is emerging. It consists of a central processing unit (CPU), which is very fast in sequential execution, and a GPU, which exhibits a high degree of parallelism and thus very high performance on certain types of computations. Optimally leveraging the advantages of this platform is challenging in practice. We spotlight the analogy between GPGPU and hardware/software co-design (HSCD), a more mature design paradigm, to derive a design process for GPGPU. This process, with appropriate tool support and automation, will ease GPGPU design significantly. Identifying the challenges associated with establishing this process can serve as a roadmap for the future development of the GPGPU field.