
Showing papers on "Degree of parallelism published in 2011"


Journal ArticleDOI
TL;DR: It is demonstrated that by making alternative design decisions in the GPU implementation, an additional speedup of an order of magnitude can be obtained: by carefully considering memory access locality when dividing the workload among blocks of threads, the GPU's cache is used more efficiently, making more effective use of the available memory bandwidth.

330 citations


Proceedings ArticleDOI
29 Nov 2011
TL;DR: LARTS attempts to collocate reduce tasks with the maximum required data, computed after recognizing input data network locations and sizes, and adopts a cooperative paradigm seeking good data locality while circumventing scheduling delay, scheduling skew, poor system utilization, and low degree of parallelism.
Abstract: MapReduce offers a promising programming model for big data processing. Inspired by functional languages, MapReduce allows programmers to write functional-style code which gets automatically divided into multiple map and/or reduce tasks and scheduled over distributed data across multiple machines. Hadoop, an open source implementation of MapReduce, schedules map tasks in the vicinity of their inputs in order to diminish network traffic and improve performance. However, Hadoop schedules reduce tasks at requesting nodes without considering data locality, leading to performance degradation. This paper describes Locality-Aware Reduce Task Scheduler (LARTS), a practical strategy for improving MapReduce performance. LARTS attempts to collocate reduce tasks with the maximum required data computed after recognizing input data network locations and sizes. LARTS adopts a cooperative paradigm seeking good data locality while circumventing scheduling delay, scheduling skew, poor system utilization, and low degree of parallelism. We implemented LARTS in Hadoop-0.20.2. Evaluation results show that LARTS outperforms the native Hadoop reduce task scheduler by an average of 7%, and up to 11.6%.
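To make the locality-aware placement idea concrete, here is a minimal sketch (our illustration, not the LARTS code; the function and data layout are hypothetical): for each reduce partition, the scheduler sums the intermediate data each candidate node already holds and places the reduce task on the node with the most.

```python
# Hypothetical sketch of locality-aware reduce placement (not the LARTS code).
# partition_sizes[node][partition] = bytes of intermediate data for `partition`
# stored on `node` after the map phase.

def place_reduce_tasks(partition_sizes, partitions, nodes):
    """Assign each reduce partition to the node holding most of its input."""
    assignment = {}
    for p in partitions:
        best_node = max(nodes, key=lambda n: partition_sizes.get(n, {}).get(p, 0))
        assignment[p] = best_node
    return assignment

# Example: partition 0 has 300 MB on node "a" and 50 MB on node "b",
# so it is scheduled on "a" to minimize network traffic.
sizes = {"a": {0: 300, 1: 10}, "b": {0: 50, 1: 200}}
print(place_reduce_tasks(sizes, [0, 1], ["a", "b"]))  # {0: 'a', 1: 'b'}
```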

155 citations


Book ChapterDOI
01 Jan 2011
TL;DR: This contribution reveals the main ideas, potential benefits, and challenges of supporting invasive computing at the architectural, programming, and compiler levels, and gives an overview of required research topics rather than presenting mature solutions.
Abstract: A novel paradigm for designing and programming future parallel computing systems called invasive computing is proposed. The main idea and novelty of invasive computing is to introduce resource-aware programming support in the sense that a given program gets the ability to explore and dynamically spread its computations to neighbour processors in a phase called invasion, then to execute portions of code with a high degree of parallelism in parallel, based on the available invasible region on a given multi-processor architecture. Afterwards, once the program terminates or if the degree of parallelism should be lower again, the program may enter a retreat phase, deallocate resources and resume execution, for example, sequentially on a single processor. To support this idea of self-adaptive and resource-aware programming, not only new programming concepts, languages, compilers and operating systems are necessary, but also revolutionary architectural changes in the design of Multi-Processor Systems-on-a-Chip must be provided so as to efficiently support invasion, infection and retreat operations involving concepts for dynamic processor, interconnect and memory reconfiguration. This contribution reveals the main ideas, potential benefits and challenges of supporting invasive computing at the architectural, programming and compiler level in the future. It serves to give an overview of required research topics rather than presenting mature solutions.
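To make the invade/infect/retreat phases concrete, the following is a purely illustrative sketch under our own assumptions; the names and the Python API are hypothetical, since the paper proposes language, compiler, and hardware support rather than a library.

```python
# Illustrative only: a toy resource manager mimicking the invade/infect/retreat
# phases of invasive computing. All names are hypothetical.
from concurrent.futures import ProcessPoolExecutor

class Claim:
    def __init__(self, pool, workers):
        self.pool, self.workers = pool, workers

def invade(requested_pes, free_pes):
    """Claim up to `requested_pes` processing elements from the free set."""
    granted = min(requested_pes, free_pes)
    return Claim(ProcessPoolExecutor(max_workers=granted), granted)

def infect(claim, kernel, work_items):
    """Run the parallel kernel on the claimed resources."""
    return list(claim.pool.map(kernel, work_items))

def retreat(claim):
    """Release the claimed resources; execution may continue sequentially."""
    claim.pool.shutdown()

def square(x):
    return x * x

if __name__ == "__main__":
    claim = invade(requested_pes=8, free_pes=4)   # only 4 PEs are invasible
    results = infect(claim, square, range(16))
    retreat(claim)
    print(claim.workers, results)
```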

144 citations


Proceedings ArticleDOI
Jens Teubner, Rene Mueller
12 Jun 2011
TL;DR: This work presents handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution and gives a new intuition of window semantics, which it believes could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
Abstract: In spite of the omnipresence of parallel (multi-core) systems, the predominant strategy to evaluate window-based stream joins is still strictly sequential, mostly just straightforward along the definition of the operation semantics. In this work we present handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution. Handshake join naturally leverages available hardware parallelism, which we demonstrate with an implementation on a modern multi-core system and on top of field-programmable gate arrays (FPGAs), an emerging technology that has shown distinctive advantages for high-throughput data processing. On the practical side, we provide a join implementation that substantially outperforms CellJoin (the fastest published result) and that will directly turn any degree of parallelism into higher throughput or larger supported window sizes. On the semantic side, our work gives a new intuition of window semantics, which we believe could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
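The window-join semantics that handshake join parallelizes can be sketched sequentially as follows (a toy illustration with made-up window size and predicate; it deliberately omits the parallel, FPGA-friendly execution that is the paper's actual contribution).

```python
from collections import deque

# Toy, sequential illustration of window-based stream join semantics; the
# window size and the join predicate here are made up for the example.
WINDOW = 4

def stream_join(stream_r, stream_s, predicate):
    window_r, window_s = deque(maxlen=WINDOW), deque(maxlen=WINDOW)
    results = []
    for r, s in zip(stream_r, stream_s):
        # Each new tuple is compared against the opposite stream's window.
        results += [(r, x) for x in window_s if predicate(r, x)]
        results += [(x, s) for x in window_r if predicate(x, s)]
        window_r.append(r)
        window_s.append(s)
    return results

print(stream_join(range(8), range(1, 9), lambda a, b: a == b))
```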

144 citations


Proceedings ArticleDOI
20 Oct 2011
TL;DR: A set of link component models for performing interconnect design-space exploration connected to the underlying device and circuit technology is presented, demonstrating the link-level interactions between components in achieving the optimal degree of parallelism and energy-efficiency.
Abstract: Integrated photonic interconnects have emerged recently as a potential solution for relieving on-chip and chip-to-chip bandwidth bottlenecks for next-generation many-core processors. To help bridge the gap between device and circuit/system designers, and aid in understanding of inherent photonic link tradeoffs, we present a set of link component models for performing interconnect design-space exploration connected to the underlying device and circuit technology. To compensate for process and thermal-induced ring resonator mismatches, we take advantage of device and circuit characteristics to propose an efficient ring tuning solution. Finally, we perform optimization of a wavelength-division-multiplexed link, demonstrating the link-level interactions between components in achieving the optimal degree of parallelism and energy-efficiency.

134 citations


Patent
26 Aug 2011
TL;DR: In this article, the authors propose an approach to improve management of multiple I/O threads that take advantage of the high-performing and concurrent nature of the back end media, so the resulting storage system can achieve very high performance.
Abstract: Solid State Drives (SSDs) can yield very high performance if designed properly. An SSD typically includes both a front end that interfaces with the host and a back end that interfaces with the flash media. Typically, SSDs include flash media that is designed with a high degree of parallelism and can support very high bandwidth on input/output (I/O). An SSD front end designed according to a traditional hard disk drive (HDD) model will not be able to take advantage of the high performance offered by the typical flash media. Embodiments of the invention provide improved management of multiple I/O threads that take advantage of the high-performing and concurrent nature of the back end media, so the resulting storage system can achieve very high performance.

128 citations


Book ChapterDOI
29 May 2011
TL;DR: This paper proposes an approach for optimizing graph pattern matching by reinterpreting certain join tree structures as grouping operations, which enables a greater degree of parallelism in join processing, resulting in more "bushy"-like query execution plans with fewer Map-Reduce cycles.
Abstract: Existing MapReduce systems support relational style join operators which translate multi-join query plans into several Map-Reduce cycles. This leads to high I/O and communication costs due to the multiple data transfer steps between map and reduce phases. SPARQL graph pattern matching is dominated by join operations, and is unlikely to be efficiently processed using existing techniques. This cost is prohibitive for RDF graph pattern matching queries, which typically involve several join operations. In this paper, we propose an approach for optimizing graph pattern matching by reinterpreting certain join tree structures as grouping operations. This enables a greater degree of parallelism in join processing, resulting in more "bushy"-like query execution plans with fewer Map-Reduce cycles. This approach requires that the intermediate results are managed as sets of groups of triples, or TripleGroups. We therefore propose a data model and algebra - Nested TripleGroup Algebra - for capturing and manipulating TripleGroups. The relationship with the traditional relational style algebra used in Apache Pig is discussed. A comparative performance evaluation of the traditional Pig approach and RAPID+ (Pig extended with NTGA) for graph pattern matching queries on the BSBM benchmark dataset is presented. Results show up to 60% performance improvement of our approach over traditional Pig for some tasks.

91 citations


Journal ArticleDOI
Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, David I. August
04 Jun 2011
TL;DR: The Degree of Parallelism Executive (DoPE) is presented, an API and run-time system that separates the concern of exposing parallelism from that of optimizing it and dynamically optimizing the parallelism for a variety of performance goals.
Abstract: In writing parallel programs, programmers expose parallelism and optimize it to meet a particular performance goal on a single platform under an assumed set of workload characteristics. In the field, changing workload characteristics, new parallel platforms, and deployments with different performance goals make the programmer's development-time choices suboptimal. To address this problem, this paper presents the Degree of Parallelism Executive (DoPE), an API and run-time system that separates the concern of exposing parallelism from that of optimizing it. Using the DoPE API, the application developer expresses parallelism options. During program execution, DoPE's run-time system uses this information to dynamically optimize the parallelism options in response to the facts on the ground. We easily port several emerging parallel applications to DoPE's API and demonstrate the DoPE run-time system's effectiveness in dynamically optimizing the parallelism for a variety of performance goals.
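The separation of concerns that DoPE advocates can be illustrated, purely hypothetically (this is not the DoPE API), as the developer exposing parallelism options while a run-time policy picks a degree of parallelism to meet a latency goal.

```python
# Hypothetical illustration of separating "expose parallelism" (developer)
# from "optimize parallelism" (run-time); this is not the actual DoPE API.
from concurrent.futures import ThreadPoolExecutor

class Runtime:
    def __init__(self, goal_latency_s):
        self.goal = goal_latency_s

    def choose_dop(self, options, measured_latency_s):
        """Pick the smallest exposed degree of parallelism expected to meet the goal."""
        needed = measured_latency_s / self.goal        # assume near-linear scaling
        feasible = [o for o in options if o >= needed]
        return min(feasible) if feasible else max(options)

def process(item):
    return item * item

developer_options = [1, 2, 4, 8]      # parallelism options exposed by the developer
runtime = Runtime(goal_latency_s=0.5)
dop = runtime.choose_dop(developer_options, measured_latency_s=1.6)  # -> 4
with ThreadPoolExecutor(max_workers=dop) as pool:
    results = list(pool.map(process, range(100)))
print(dop, len(results))
```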

65 citations


Proceedings ArticleDOI
07 Dec 2011
TL;DR: A two-level MapReduce scheduler built on previous techniques, incorporating a deadline scheduler that adopts a sampling-based approach and a resource allocation model to dynamically control each real-time job to execute with a minimum task assignment at any time, so as to maximize the number of concurrent real-time jobs.
Abstract: MapReduce scheduling is becoming a hot topic as MapReduce attracts more and more attention from both industry and academia. In this paper, we focus on the scheduling of mixed real-time and non-real-time applications in the MapReduce environment, which is a challenging problem that has received only limited attention. To solve this problem, we present a two-level MapReduce scheduler built on previous techniques and make two key contributions. First, to meet the performance goal of real-time applications, we propose a deadline scheduler which adopts (1) a sampling-based approach, Tasks Forward Scheduling (TFS), to predict map/reduce task execution time (unlike prior work that requires users to input an estimated value), and (2) a resource allocation model, Approximately Uniform Minimum Degree of Parallelism (AUMD), to dynamically control each real-time job to execute with a minimum task assignment at any time, so as to maximize the number of concurrent real-time jobs. Second, by integrating this deadline scheduler into the existing MapReduce scheduler, we develop a two-level scheduler with resource preemption support, which can schedule mixed real-time and non-real-time jobs according to their respective performance demands. We implement our scheduler in the Hadoop system, and experiments running on a real, small-scale cluster demonstrate that it can schedule mixed real-time and non-real-time jobs to meet their different quality-of-service (QoS) demands.
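The "minimum degree of parallelism" idea can be made concrete with a rough calculation based on our reading of the abstract (not the paper's exact AUMD model): run each real-time job with just enough concurrent tasks that its remaining work finishes before the deadline, leaving the other slots free for additional jobs.

```python
import math

# Rough illustration (not the paper's exact AUMD model): the fewest concurrent
# tasks a job needs so that its remaining work completes before its deadline.
def minimum_degree_of_parallelism(remaining_tasks, task_time_s, time_to_deadline_s):
    if time_to_deadline_s <= 0:
        raise ValueError("deadline already passed")
    rounds_available = time_to_deadline_s / task_time_s  # sequential rounds that still fit
    return math.ceil(remaining_tasks / max(1, math.floor(rounds_available)))

# 120 remaining map tasks of ~30 s each, 10 minutes to the deadline:
# 20 rounds fit, so about 6 tasks must run concurrently.
print(minimum_degree_of_parallelism(120, 30, 600))  # 6
```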

57 citations


Proceedings ArticleDOI
12 Jul 2011
TL;DR: This work investigates the performance of a highly parallel Particle Swarm Optimization (PSO) algorithm implemented on the GPU and shows that the GPU offers a high degree of performance and achieves a maximum of 37 times speedup over a sequential implementation when the problem size in terms of tasks is large and many swarms are used.
Abstract: We investigate the performance of a highly parallel Particle Swarm Optimization (PSO) algorithm implemented on the GPU. In order to achieve this high degree of parallelism we implement a collaborative multi-swarm PSO algorithm on the GPU which relies on the use of many swarms rather than just one. We choose to apply our PSO algorithm against a real-world application: the task matching problem in a heterogeneous distributed computing environment. Due to the potential for large problem sizes with high dimensionality, the task matching problem proves to be very thorough in testing the GPU's capabilities for handling PSO. Our results show that the GPU offers a high degree of performance and achieves a maximum speedup of 37 times over a sequential implementation when the problem size in terms of tasks is large and many swarms are used.
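For reference, the per-particle update at the core of any PSO variant, including the collaborative multi-swarm version described above, looks roughly like this; a GPU implementation would evaluate many particles and swarms concurrently, and the constants below are conventional choices, not the paper's.

```python
import random

# Standard PSO velocity/position update (conventional constants, not the paper's).
W, C1, C2 = 0.7, 1.5, 1.5

def pso_step(position, velocity, personal_best, swarm_best):
    new_pos, new_vel = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, swarm_best):
        v = (W * v
             + C1 * random.random() * (pb - x)    # pull toward the particle's own best
             + C2 * random.random() * (gb - x))   # pull toward the swarm's best
        new_vel.append(v)
        new_pos.append(x + v)
    return new_pos, new_vel

pos, vel = [0.0, 0.0], [0.1, -0.1]
pos, vel = pso_step(pos, vel, personal_best=[1.0, 1.0], swarm_best=[2.0, 2.0])
print(pos, vel)
```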

56 citations


Proceedings Article
14 Jun 2011
TL;DR: It is found that SSDs adopt significantly different ways of exploiting channel-level and way-level parallelism to maximize write throughput, which governs the peak current consumption.
Abstract: In this work, we perform µsec time scale analysis on the energy consumption behavior of the SSD write operation and exploit this information to extract key technical characteristics of SSD internals: channel utilization policy, page allocation strategy, cluster size, channel switch delay, way switch delay, etc. We found that some SSDs adopt a multi-page cluster as a write unit instead of a page. We found that SSDs adopt significantly different ways of exploiting channel-level and way-level parallelism to maximize write throughput, which governs the peak current consumption. The X25M (Intel) emphasizes the performance aspect of SSDs and linearly increases the channel parallelism as the I/O size increases. The MXP (Samsung) puts more emphasis on the energy consumption aspect of the SSD and controls the degree of parallelism to reduce the peak current consumption. The cluster sizes of the X25M and the MXP correspond to one and eight pages, respectively. The current consumed when writing a page to NAND flash varies significantly depending on the NAND model (17 mA-35 mA).

Proceedings ArticleDOI
13 Jun 2011
TL;DR: In this article, the authors present an efficient event processing platform to support high-frequency and low-latency event matching over reconfigurable hardware, where each solution is formulated as a design trade-off between the degree of parallelism and the desired application requirement.
Abstract: We present fpga-ToPSS (Toronto Publish/Subscribe System), an efficient event processing platform to support high-frequency and low-latency event matching. fpga-ToPSS is built over reconfigurable hardware---FPGAs---to achieve line-rate processing by exploring various degrees of parallelism. Furthermore, each of our proposed FPGA-based designs is geared towards a unique application requirement, such as flexibility, adaptability, scalability, or pure performance, such that each solution is specifically optimized to attain a high level of parallelism. Therefore, each solution is formulated as a design trade-off between the degree of parallelism and the desired application requirement. Moreover, our event processing engine supports Boolean expression matching with an expressive predicate language applicable to a wide range of applications including real-time data analysis, algorithmic trading, targeted advertisement, and (complex) event processing.

Journal ArticleDOI
TL;DR: An efficient parallel fault simulator, FSimGP2, that exploits the high degree of parallelism supported by a state-of-the-art graphic processing unit (GPU) with the NVIDIA compute unified device architecture to achieve extremely high computation efficiency on the GPU.
Abstract: General purpose computing on graphical processing units (GPGPU) is a paradigm shift in computing that promises a dramatic increase in performance. GPGPU also brings an unprecedented level of complexity in algorithmic design and software development. In this paper, we present an efficient parallel fault simulator, FSimGP2, that exploits the high degree of parallelism supported by a state-of-the-art graphic processing unit (GPU) with the NVIDIA compute unified device architecture. A novel 3-D parallel fault simulation technique is proposed to achieve extremely high computation efficiency on the GPU. Global communication is minimized by concentrating as much work as possible on the local device's memory. We present results on a GPU platform from NVIDIA (a GeForce GTX 285 graphics card) that demonstrate a speedup of up to 63× and 4× compared to two other GPU-based fault simulators and up to 95× over a state-of-the-art algorithm on conventional processor architectures.

Journal ArticleDOI
TL;DR: The Abstract Next Subvolume Method is introduced, in which the model representation is decoupled from the sequential simulation algorithms, and it is proved that state trajectories generated by its executions statistically accord with those generated by the Next Subvolume Method.

Proceedings ArticleDOI
07 Dec 2011
TL;DR: This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP), and proposes a methodology that can adaptively adjust the thread-level parallelism in mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages.
Abstract: Dynamic programming (DP) is an important computational method for solving a wide variety of discrete optimization problems such as scheduling, string editing, packaging, and inventory management. In general, DP is classified into four categories based on the characteristics of the optimization equation. Because applications that are classified in the same category of DP have similar program behavior, the research community has sought to propose general solutions for parallelizing each category of DP. However, most existing studies focus on running DP on CPU-based parallel systems rather than on accelerating DP algorithms on the graphics processing unit (GPU). This paper presents the GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP). In NPDP applications, the degree of parallelism varies significantly in different stages of computation, making it difficult to fully utilize the compute power of hundreds of processing cores in a GPU. To address this challenge, we propose a methodology that can adaptively adjust the thread-level parallelism in mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages. We realize our approach in a real-world NPDP application -- the optimal matrix parenthesization problem. Experimental results demonstrate our method can achieve a speedup of 13.40 over the previously published GPU algorithm.
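The varying degree of parallelism in NPDP is easy to see in the optimal matrix parenthesization recurrence: all subproblems on the same diagonal (the same chain length) are mutually independent and could be computed concurrently, but the number of such subproblems shrinks from n-1 down to 1 as the computation progresses. A small CPU-side sketch of the recurrence (our illustration, not the paper's GPU kernel):

```python
# Matrix-chain parenthesization DP; entries on each diagonal are independent,
# which is the parallelism a GPU mapping would exploit (illustration only).
def matrix_chain_cost(dims):
    n = len(dims) - 1                       # number of matrices
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # stage = chain length (one diagonal)
        # All (i, j) pairs in this inner loop are independent of one another.
        for i in range(0, n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]

# Chain of 4 matrices with dimensions 10x30, 30x5, 5x60, 60x10.
print(matrix_chain_cost([10, 30, 5, 60, 10]))  # 5000 scalar multiplications
```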

Proceedings ArticleDOI
04 Jun 2011
TL;DR: This paper parallelizes a dynamic algorithm for well-spaced point sets, an important problem related to mesh refinement in computational geometry, and describes techniques for implementing the algorithm on modern multi-core computers and provides a prototype implementation.
Abstract: Parallel algorithms and dynamic algorithms possess an interesting duality property: compared to sequential algorithms, parallel algorithms improve run-time while preserving work, while dynamic algorithms improve work but typically offer no parallelism. Although they are often considered separately, parallel and dynamic algorithms employ similar design techniques. They both identify parts of the computation that are independent of each other. This suggests that dynamic algorithms could be parallelized to improve work efficiency while preserving fast parallel run-time. In this paper, we parallelize a dynamic algorithm for well-spaced point sets, an important problem related to mesh refinement in computational geometry. Our parallel dynamic algorithm computes a well-spaced superset of a dynamically changing set of points, allowing arbitrary dynamic modifications to the input set. On an EREW PRAM, our algorithm processes batches of k modifications such as insertions and deletions in O(k log Δ) total work and in O(log Δ) parallel time using k processors, where Δ is the geometric spread of the data, while ensuring that the output is always within a constant factor of the optimal size. The EREW PRAM model is quite different from actual hardware such as modern multiprocessors. We therefore describe techniques for implementing our algorithm on modern multi-core computers and provide a prototype implementation. Our empirical evaluation shows that our algorithm can be practical, yielding a large degree of parallelism and good speedups.

ReportDOI
01 Jun 2011
TL;DR: A new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently, and an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing.
Abstract: We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.
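The heterogeneous 1-D block-cyclic idea can be sketched as follows (a simplification under our own assumptions; the tile widths and device labels are made up, not the paper's tuned sizes): columns of tiles are dealt out cyclically, with the host receiving narrower tiles than each GPU to balance their different throughputs.

```python
# Simplified sketch of a heterogeneous 1-D block-cyclic column distribution
# (illustrative widths; not the paper's tuned tile sizes).
def distribute_columns(n_cols, host_tile, gpu_tile, n_gpus):
    owners, col = [], 0
    devices = [("host", host_tile)] + [(f"gpu{i}", gpu_tile) for i in range(n_gpus)]
    while col < n_cols:
        for name, width in devices:          # deal tiles out cyclically
            take = min(width, n_cols - col)
            if take > 0:
                owners.append((name, col, col + take))
                col += take
    return owners

# 2 GPUs, host tiles of 2 columns, GPU tiles of 6 columns, 20 columns total.
for owner, start, stop in distribute_columns(20, host_tile=2, gpu_tile=6, n_gpus=2):
    print(owner, list(range(start, stop)))
```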

Journal ArticleDOI
21 Dec 2011
TL;DR: A principled approach to designing energy-efficient, heterogeneous data centers that are robust against data center workload variations, using Wald’s minimax criterion as a starting point is outlined.
Abstract: Data centers represent the fastest growing component of the information and communication technologies (ICT) energy footprint. With the advent of cloud computing, data centers will increasingly be used to process a wide array of jobs with differing characteristics such as degree of parallelism, memory access patterns, etc. From an energy efficiency perspective, the most energy efficient server architecture differs for jobs with different characteristics [4], motivating the need to consider heterogeneous data center designs consisting of many server types [3, 5]. Even though the types of jobs that a data center is expected to serve might be known at design time, the workload statistics are often unknown until the data center is deployed. Therefore, data centers should be designed keeping in mind the uncertainty in workload statistics. In this paper, we outline a principled approach to designing energy-efficient, heterogeneous data centers that are robust against data center workload variations, using Wald's minimax criterion as a starting point. In the proposed formulation, we assume that the only thing that is known at design time is an upper bound on the total rate (over all job types) at which jobs arrive at the data center, and design the data center to have the minimum worst-case energy consumption over all job type mixes. We then highlight a number of potential avenues for further investigation.
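Under our reading of the abstract (notation ours, not the paper's), the design objective can be written as a minimax problem: choose the server mix x that minimizes the worst-case energy over all job-type mixes λ whose total arrival rate stays below the known bound Λ.

```latex
\min_{x \in \mathcal{X}} \;\; \max_{\lambda \ge 0,\ \sum_i \lambda_i \le \Lambda} \; E(x, \lambda)
```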

Journal ArticleDOI
TL;DR: Based on a novel self-assembly platform consisting of self-propulsive centimetre-sized modules capable of aggregation on the surface of water, the effect of stochasticity and morphology (shape) on the yield of targeted formations in self-assembly processes is studied.
Abstract: The decay in structure size of manufacturing products has yielded new demands on spontaneous composition methods. The key for the realization of small-sized robots lies in how to achieve an efficient assembly sequence in a bottom-up manner, where most of the parts have only limited or no computational (i.e. deliberative) abilities. In this paper, based on a novel self-assembly platform consisting of self-propulsive centimetre-sized modules capable of aggregation on the surface of water, we study the effect of stochasticity and morphology (shape) on the yield of targeted formations in self-assembly processes. Specifically, we focus on a unique phenomenon, termed one-shot aggregation, in which a number of modules instantly compose a target product without forming intermediate subassemblies, some of which constitute undesired geometrical formations. Together with a focus on the role that the morphology of the modules plays, we validate the effect of one-shot aggregation with a kinetic rate mathematical model. Moreover, we examine the degree of parallelism of the assembly process, which is an essential factor in self-assembly but is not systematically taken into account by existing frameworks.

Proceedings ArticleDOI
16 May 2011
TL;DR: This paper proposes a methodology for applications to automatically claim linear arrays of processing elements within massively parallel processor arrays at run-time depending on the available degree of parallelism or dynamic computing requirements.
Abstract: This paper proposes a methodology for applications to automatically claim linear arrays of processing elements within massively parallel processor arrays at run-time, depending on the available degree of parallelism or dynamic computing requirements. Using this methodology, parallel programs running on individual processing elements gain the capability of autonomously managing the available processing resources in their neighborhood. We present different protocols and architectural support for gathering and transporting the result of a resource exploration in order to inform a configuration loader about the number and location of the claimed resources. The timing and data overhead costs of four different approaches are mathematically evaluated. In order to verify and compare these decentralized algorithms, a simulation platform has been developed to compare the data overhead and scalability of each approach for different sizes of processor arrays.

Journal ArticleDOI
01 Jan 2011
TL;DR: The very last step of the algorithm, producing a gapped alignment with the Needleman-Wunsch algorithm, is performed in software, with the option of hardware processing after reconfiguration; this saves FPGA resources and allows an even higher degree of parallelism.
Abstract: Protein database search requests are generally performed using the BLASTp algorithm, introduced by NCBI [1]. Since it is computationally intensive, it becomes more and more ineffective with today's growth of sequence database sizes. The need for an efficient parallel implementation arises. In this paper, we focus on a massive parallelization using the FPGA-based hardware architecture RIVYERA [2]. The aim is to reach speedups of orders of magnitude with a flexible implementation while saving energy costs compared to PC-based database search. We keep our implementation close to the structure published by Kasap et al. [3], [4] and include ideas from Sotiriades et al. [5], such that all parts of the algorithm are organized in components of a long pipeline. We also use the idea of the two-hit method [6] to keep the computational effort small. In contrast to the related work, we perform the very last step of the algorithm, which produces a gapped alignment with the Needleman-Wunsch algorithm, in software, only with the option of hardware processing after reconfiguration. This saves FPGA resources and allows an even higher degree of parallelism.
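The two-hit method mentioned above is a standard BLAST-style seeding idea, sketched here generically rather than as this paper's FPGA pipeline: an ungapped extension is triggered only when two non-overlapping word hits fall on the same diagonal within a short distance.

```python
# Generic two-hit seeding (textbook BLAST idea, not this paper's FPGA pipeline):
# an extension is triggered only when two word hits share a diagonal and are
# close together, which drastically reduces the number of extensions.
WORD = 3          # word length
MAX_GAP = 40      # maximum distance between the two hits on a diagonal

def two_hit_seeds(query, subject):
    words = {}
    for q in range(len(query) - WORD + 1):
        words.setdefault(query[q:q + WORD], []).append(q)
    last_hit, seeds = {}, []
    for s in range(len(subject) - WORD + 1):
        for q in words.get(subject[s:s + WORD], []):
            diag = s - q
            prev = last_hit.get(diag)
            if prev is not None and WORD <= s - prev <= MAX_GAP:
                seeds.append((q, s))      # second hit on the diagonal: extend here
            last_hit[diag] = s
    return seeds

print(two_hit_seeds("MKVLAAGMKVLA", "XXMKVLAYYMKVLAZZ"))
```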

DOI
01 Jul 2011
TL;DR: This work uses the approach of matrix-based multigrid that has high flexibility and adapts well to the exigencies of modern computing platforms, and investigates multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers.
Abstract: Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities, stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes in order to properly resolve all problem-inherent specifics and maintain a moderate number of unknowns. However, flexibility in the meshes goes along with severe drawbacks with respect to parallel execution - especially with respect to the definition of adequate smoothers. This point becomes particularly pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid that has high flexibility and adapts well to the exigencies of modern computing platforms. In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. In combination with matrix-based multigrid methods on unstructured meshes, our smoothers provide powerful solvers that are applicable across a wide range of parallel computing platforms and almost arbitrary geometries. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.
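The parallelism in a multi-colored Gauss-Seidel smoother comes from the coloring: unknowns of the same color have no direct coupling, so a whole color can be relaxed concurrently. A small sequential sketch of one sweep (illustrative only, not the lmpLAtoolbox implementation):

```python
# One multi-colored Gauss-Seidel sweep: within a color, updates are independent
# and could run in parallel (sequential sketch for illustration).
def colored_gauss_seidel_sweep(A, b, x, colors):
    n = len(b)
    for color in colors:                    # colors are processed one after another
        for i in color:                     # every i in a color is independent
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - sigma) / A[i][i]
    return x

# 1-D Laplacian on 4 points: red = even indices, black = odd indices.
A = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
b = [1.0, 0.0, 0.0, 1.0]
x = [0.0] * 4
for _ in range(20):
    x = colored_gauss_seidel_sweep(A, b, x, colors=[[0, 2], [1, 3]])
print(x)   # converges toward the solution of Ax = b, i.e. [1, 1, 1, 1]
```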

Book ChapterDOI
05 Dec 2011
TL;DR: This paper initiates a study on computing the degree of parallelism for three classes of BPMN processes, which are defined based on the use of BPMN gateways, and presents an algorithm for computing the degree of parallelism of each class.
Abstract: For sequential processes and workflows (i.e., pipelined tasks), each enactment (process instance) only has one task being performed at each time instant. When a process allows tasks to be performed in parallel, an enactment may have a number of tasks being performed concurrently and this number may change in time. We define the “degree of parallelism” of a process as the maximum number of tasks to be performed concurrently during an execution of the process. This paper initiates a study on computing degree of parallelism for three classes of BPMN processes, which are defined based on the use of BPMN gateways. For each class, an algorithm for computing degree of parallelism is presented. In particular, the algorithms for “homogeneous” and acyclic “choice-less” processes (respectively) have polynomial time complexity, while the algorithm for “asynchronous” processes runs in exponential time.
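For processes built only from parallel (AND) gateways, the intuition is that the degree of parallelism of a sequence is the maximum over its steps, while that of a parallel block is the sum over its branches. A tiny recursive sketch of that intuition (ours, not the paper's BPMN algorithm):

```python
# Toy degree-of-parallelism computation for nested AND-fork/join structures
# (our illustration of the intuition; not the paper's BPMN algorithm).
def degree_of_parallelism(node):
    kind, children = node
    if kind == "task":
        return 1
    if kind == "seq":                      # sequential steps: maximum over the steps
        return max(degree_of_parallelism(c) for c in children)
    if kind == "par":                      # AND-fork: branches run concurrently
        return sum(degree_of_parallelism(c) for c in children)
    raise ValueError(kind)

# seq(A, par(B, seq(C, par(D, E)))) -> at most 3 tasks run at the same time.
process = ("seq", [("task", []),
                   ("par", [("task", []),
                            ("seq", [("task", []),
                                     ("par", [("task", []), ("task", [])])])])])
print(degree_of_parallelism(process))  # 3
```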

Journal ArticleDOI
TL;DR: An improvement to a parallel implementation of T-Coffee, a widely used MSA package, that resolves the bottleneck of the progressive alignment stage on MSA and shows improvements in execution time of over 68% while maintaining the biological accuracy.
Abstract: Multiple Sequence Alignment (MSA) constitutes an extremely powerful tool for important biological applications such as phylogenetic analysis, identification of conserved motifs and domains, and structure prediction. In spite of the improvement in speed and accuracy introduced by MSA programs, the computational requirements for large-scale alignments require high-performance computing and parallel applications. In this paper we present an improvement to a parallel implementation of T-Coffee, a widely used MSA package. Our approach resolves the bottleneck of the progressive alignment stage of MSA. This is achieved by increasing the degree of parallelism by balancing the guide tree that drives the progressive alignment process. The experimental results show improvements in execution time of over 68% while maintaining the biological accuracy.

01 Jan 2011
TL;DR: In this article, a matrix-based geometric multigrid method with high flexibility with respect to complex geometries and local singularities is proposed, which adapts well to the exigencies of modern computing platforms.
Abstract: Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. We use the approach of matrix-based geometric multigrid that has high flexibility with respect to complex geometries and local singularities. Furthermore, it adapts well to the exigencies of modern computing platforms. In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p,q) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.

01 Jan 2011
TL;DR: The results presented in this dissertation demonstrate that data-driven execution, coupled with metadata abstractions, effectively support latency tolerance and enable performance optimization techniques that are decoupled from the algorithmic formulation and the control flow of the application code.
Abstract: In supercomputing systems, architectural changes that increase computational power are often reflected in the programming model. As a result, in order to realize and sustain the potential performance of such systems, it is necessary in practice to deal with architectural details and explicitly manage the resources to an increasing extent. In particular, programmers are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realize the potential of today’s distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations limiting overlap and the program’s ability to adapt to communication delays. This thesis proposes an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e. task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon’s functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations. The results presented in this dissertation demonstrate that data-driven execution, coupled with metadata abstractions, effectively support latency tolerance. In addition, performance metadata enable performance optimization techniques that are decoupled from the algorithmic formulation and the control flow of the application code. By expressing the structure of the computation and its characteristics with metadata, the programmer can focus on the application and rely on Tarragon and its run-time system to automatically overlap communication with computation and optimize the performance.

Journal ArticleDOI
TL;DR: A rarely used metric is discussed that is well suited to evaluate online schedules for independent jobs on massively parallel processors; an almost tight competitive factor of 1.25 is proved for nondelay schedules, while for rigid parallel jobs no constant competitive factor exists.
Abstract: The paper discusses a rarely used metric that is well suited to evaluate online schedules for independent jobs on massively parallel processors. The metric is based on the total weighted completion time objective with the weight being the resource consumption of the job. Although every job contributes to the objective value, the metric exhibits many properties that are similar to the properties of the makespan objective. For this metric, we particularly address nonclairvoyant online scheduling of sequential jobs on parallel identical machines and prove an almost tight competitive factor of 1.25 for nondelay schedules. For the extension of the problem to rigid parallel jobs, we show that no constant competitive factor exists. However, if all jobs are released at time 0, List Scheduling in descending order of the degree of parallelism guarantees an approximation factor of 2.
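Written out, the metric is a total weighted completion time in which each job's weight equals its resource consumption; under our reading, for a rigid job j that occupies m_j machines for p_j time units and completes at time C_j (with m_j = 1 for sequential jobs), the objective is:

```latex
\sum_{j} w_j \, C_j \qquad \text{with} \qquad w_j = m_j \, p_j
```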

Proceedings ArticleDOI
26 Sep 2011
TL;DR: The evolutionary strategy ensures that EMTS can be used with any underlying model for predicting the execution time of moldable tasks, and significantly reduces the makespan of PTGs compared to other heuristics for both non-monotonically and monotonically decreasing models.
Abstract: Parallel task graphs (PTGs) arise when parallel programs are combined into larger applications, e.g., scientific workflows. Scheduling these PTGs onto clusters is a challenging problem due to the additional degree of parallelism stemming from moldable tasks. Most algorithms are based on the assumption that the execution time of a parallel task is monotonically decreasing as the number of processors increases. But this assumption does not hold in practice, since parallel programs often perform better if the number of processors is a multiple of internally used block sizes. In this article, we introduce the Evolutionary Moldable Task Scheduling (EMTS) algorithm for scheduling static PTGs onto homogeneous clusters. We apply an evolutionary approach to determine the processor allocation of each task. The evolutionary strategy ensures that EMTS can be used with any underlying model for predicting the execution time of moldable tasks. With the purpose of finding solutions quickly, EMTS considers results of other heuristics (e.g., HCPA, MCPA) as starting solutions. The experimental results show that EMTS significantly reduces the makespan of PTGs compared to other heuristics for both non-monotonically and monotonically decreasing models.

Proceedings ArticleDOI
10 Oct 2011
TL;DR: Tarragon, which is based on dataflow, targets latency tolerant scientific computations and achieves high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations.
Abstract: In the current practice, scientific programmers and HPC users are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realize the potential of today's distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations, limiting overlap and the program's ability to adapt to communication delays. In this paper we present an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e. task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon's functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard coded MPI implementations.

Journal ArticleDOI
TL;DR: The analogy between GPGPU and hardware/software co-design (HSCD), a more mature design paradigm, is spotlighted to derive a design process for GPGPU, which will ease GPGPU design significantly.
Abstract: With the recent development of high-performance graphical processing units (GPUs), capable of performing general-purpose computation (GPGPU: general-purpose computation on the GPU), a new platform is emerging. It consists of a central processing unit (CPU), which is very fast in sequential execution, and a GPU, which exhibits a high degree of parallelism and thus very high performance on certain types of computations. Optimally leveraging the advantages of this platform is challenging in practice. We spotlight the analogy between GPGPU and hardware/software co-design (HSCD), a more mature design paradigm, to derive a design process for GPGPU. This process, with appropriate tool support and automation, will ease GPGPU design significantly. Identifying the challenges associated with establishing this process can serve as a roadmap for the future development of the GPGPU field.