
Showing papers on "Degree of parallelism" published in 2016


Journal ArticleDOI
TL;DR: With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation and supports the spike-timing-dependent plasticity (STDP) rule for learning.
Abstract: NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current- or conductance-based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve real-time performance for 400,000 neurons. Using one FPGA, NeuroFlow runs up to 33.6 times faster than an 8-core processor, or 2.83 times faster than GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.

87 citations


Journal ArticleDOI
TL;DR: This article introduces a parallel and distributed memory-based algorithm that builds vulnerability-based attack graphs on a distributed multi-agent platform and introduces a rich attack template and network model in order to form chains of vulnerability exploits in attack graphs more precisely.
Abstract: Attack graphs show possible paths that an attacker can use to intrude into a target network and gain privileges through series of vulnerability exploits. The computation of attack graphs suffers from the state explosion problem occurring most notably when the number of vulnerabilities in the target network grows large. Parallel computation of attack graphs can be utilized to attenuate this problem. When employed in online network security evaluation, the computation of attack graphs can be triggered with the correlated intrusion alerts received from sensors scattered throughout the target network. In such cases, distributed computation of attack graphs becomes valuable. This article introduces a parallel and distributed memory-based algorithm that builds vulnerability-based attack graphs on a distributed multi-agent platform. A virtual shared memory abstraction is proposed to be used over such a platform, whose memory pages are initialized by partitioning the network reachability information. We demonstrate the feasibility of parallel distributed computation of attack graphs and show that even a small degree of parallelism can effectively speed up the generation process as the problem size grows. We also introduce a rich attack template and network model in order to form chains of vulnerability exploits in attack graphs more precisely.

70 citations
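The core idea above, partitioning the network reachability information and letting parallel agents expand exploit chains over their own partitions, can be illustrated with a small sketch. This is not the paper's algorithm or data model; the hosts, vulnerability table, and can_exploit() rule below are hypothetical stand-ins.

```python
# Minimal sketch (not the paper's algorithm): reachability data is partitioned
# across workers, each worker derives the attack-graph edges whose source host
# lies in its partition, and the results are merged. All names are hypothetical.
from multiprocessing import Pool

reachability = {("web", "app"), ("web", "db"), ("app", "db")}   # directed host pairs
vulns = {"app": ["CVE-B"], "db": ["CVE-A"]}                     # vulnerabilities per host

def can_exploit(src, dst, cve):
    # Hypothetical rule: any vulnerability on a host reachable from src is exploitable.
    return (src, dst) in reachability

def expand_partition(hosts):
    """Derive exploit edges originating from hosts in this partition."""
    edges = []
    for src in hosts:
        for dst, cves in vulns.items():
            edges += [(src, cve, dst) for cve in cves if can_exploit(src, dst, cve)]
    return edges

if __name__ == "__main__":
    partitions = [["web"], ["app"], ["db"]]     # reachability "pages" split across agents
    with Pool(len(partitions)) as pool:
        attack_edges = [e for part in pool.map(expand_partition, partitions) for e in part]
    print(attack_edges)   # [('web', 'CVE-B', 'app'), ('web', 'CVE-A', 'db'), ('app', 'CVE-A', 'db')]
```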


Proceedings ArticleDOI
16 May 2016
TL;DR: NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources to fully utilize the memory space and reduce the amount of data transfer.
Abstract: Recent studies show that graph processing systems on a single machine can achieve competitive performance compared with cluster-based graph processing systems. In this paper, we present NXgraph, an efficient graph processing system on a single machine. We propose the Destination-Sorted Sub-Shard (DSSS) structure to store a graph. To ensure graph data access locality and enable fine-grained scheduling, NXgraph divides vertices and edges into intervals and sub-shards. To reduce write conflicts among different threads and achieve a high degree of parallelism, NXgraph sorts edges within each sub-shard according to their destination vertices. Then, three updating strategies, i.e., Single-Phase Update (SPU), Double-Phase Update (DPU), and Mixed-Phase Update (MPU), are proposed in this paper. NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources to fully utilize the memory space and reduce the amount of data transfer. All these three strategies exploit streamlined disk access patterns. Extensive experiments on three real-world graphs and five synthetic graphs show that NXgraph outperforms GraphChi, TurboGraph, VENUS, and GridGraph in various situations. Moreover, NXgraph, running on a single commodity PC, can finish an iteration of PageRank on the Twitter [1] graph with 1.5 billion edges in 2.05 seconds; while PowerGraph, a distributed graph processing system, needs 3.6s to finish the same task on a 64-node cluster.

60 citations
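As a rough illustration of the adaptive strategy selection claimed above, the sketch below picks an update strategy from the graph footprint and the available memory. The decision rules and sizes are simplified assumptions for illustration only, not the exact conditions NXgraph uses.

```python
# Illustrative sketch only: choose an NXgraph-style update strategy from memory
# availability. The thresholds are assumptions, not NXgraph's actual rules.

def choose_update_strategy(vertex_bytes, edge_bytes, free_mem_bytes):
    if vertex_bytes + edge_bytes <= free_mem_bytes:
        return "SPU"   # whole graph resident: Single-Phase Update
    if vertex_bytes <= free_mem_bytes:
        return "DPU"   # vertices resident, edges streamed: Double-Phase Update
    return "MPU"       # not even vertices fit: Mixed-Phase Update

print(choose_update_strategy(4 << 30, 24 << 30, 16 << 30))   # -> DPU
```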


Journal ArticleDOI
TL;DR: A parallel implementation of the Bellman-Ford algorithm that exploits the architectural characteristics of recent GPU architectures (i.e., NVIDIA Kepler, Maxwell) to improve both performance and work efficiency is presented.
Abstract: Finding the shortest paths from a single source to all other vertices is a common problem in graph analysis. The Bellman-Ford algorithm solves this single-source shortest path (SSSP) problem and lends itself well to parallelization on many-core architectures. Nevertheless, its high degree of parallelism comes at the cost of low work efficiency: compared to similar algorithms in the literature (e.g., Dijkstra's), it involves much more redundant work and consequently wastes power. This article presents a parallel implementation of the Bellman-Ford algorithm that exploits the architectural characteristics of recent GPU architectures (i.e., NVIDIA Kepler, Maxwell) to improve both performance and work efficiency. The article presents different optimizations to the implementation, oriented both to the algorithm and to the architecture. The experimental results show that the proposed implementation provides an average speedup of 5x over the most efficient existing parallel implementations for SSSP, that it works on graphs where those implementations cannot work or are inefficient (e.g., graphs with negative-weight edges, sparse graphs), and that it considerably reduces the redundant work caused by the parallelization process.

59 citations
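To make the parallelism/work-efficiency trade-off concrete, here is a minimal serial sketch of Bellman-Ford written so that every edge relaxation within a round is independent, which is exactly the structure a GPU implementation maps to one thread per edge. It is not the paper's Kepler/Maxwell implementation or its optimizations.

```python
# Minimal sketch: Jacobi-style Bellman-Ford where each round reads the old
# distance array and writes a new one, so all per-edge relaxations in a round
# are independent (the data parallelism a GPU exploits). Serial for clarity.
import math

def bellman_ford(num_vertices, edges, source):
    """edges: list of (u, v, weight); supports negative weights (no negative cycles)."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):          # at most |V|-1 rounds
        new_dist = dist[:]
        for u, v, w in edges:                  # each relaxation is independent
            if dist[u] + w < new_dist[v]:
                new_dist[v] = dist[u] + w
        if new_dist == dist:                   # early exit: no update this round
            break
        dist = new_dist
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, -2.0), (1, 3, 3.0)]
print(bellman_ford(4, edges, source=0))        # [0.0, -1.0, 1.0, 2.0]
```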


Proceedings ArticleDOI
01 Aug 2016
TL;DR: In this article, the authors propose two new techniques to increase the degree of parallelism during decompression, which can exploit the massive parallelism of modern multi-core processors and GPUs for data decompression within a block.
Abstract: Today's exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics queries that repeatedly read compressed data. While decompression can be parallelized somewhat by assigning each data block to a different process, break-through speed-ups require exploiting the massive parallelism of modern multi-core processors and GPUs for data decompression within a block. We propose two new techniques to increase the degree of parallelism during decompression. The first technique exploits the massive parallelism of GPU and SIMD architectures. The second sacrifices some compression efficiency to eliminate data dependencies that limit parallelism during decompression. We evaluate these techniques on the decompressor of the DEFLATE scheme, called Inflate, which is based on LZ77 compression and Huffman encoding. We achieve a 2× speed-up in a head-to-head comparison with several multi-core CPU-based libraries, while achieving a 17% energy saving with comparable compression ratios.

42 citations
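For reference, the coarse-grained baseline mentioned above, one independently compressed block per worker, can be sketched in a few lines. The intra-block GPU/SIMD parallelization the paper proposes requires format changes and kernels that are not shown here; the block size below is an arbitrary assumption.

```python
# Minimal sketch of block-level parallel decompression: each block is an
# independent zlib stream, so workers can inflate them concurrently.
import zlib
from concurrent.futures import ProcessPoolExecutor

def compress_blocks(data, block_size=1 << 20):
    """Compress fixed-size chunks independently so they can be inflated in parallel."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def parallel_inflate(blocks, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

if __name__ == "__main__":
    payload = b"some repetitive analytics data " * 100_000
    blocks = compress_blocks(payload)
    assert parallel_inflate(blocks) == payload
```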


Posted Content
TL;DR: In this article, the authors propose two new techniques to increase the degree of parallelism during decompression of DEFLATE, which is based on LZ77 compression and Huffman encoding and achieves a 2X speedup in a head-to-head comparison with several multi-core CPU-based libraries.
Abstract: Today's exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics queries that repeatedly read compressed data. While decompression can be parallelized somewhat by assigning each data block to a different process, break-through speed-ups require exploiting the massive parallelism of modern multi-core processors and GPUs for data decompression within a block. We propose two new techniques to increase the degree of parallelism during decompression. The first technique exploits the massive parallelism of GPU and SIMD architectures. The second sacrifices some compression efficiency to eliminate data dependencies that limit parallelism during decompression. We evaluate these techniques on the decompressor of the DEFLATE scheme, called Inflate, which is based on LZ77 compression and Huffman encoding. We achieve a 2X speed-up in a head-to-head comparison with several multi-core CPU-based libraries, while achieving a 17% energy saving with comparable compression ratios.

37 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: IHK/McKernel is described, a hybrid software stack that seamlessly blends an LWK with Linux by selectively offloading system services from the lightweight kernel to Linux by focusing on transparent reuse of Linux device drivers and detail the design of the framework that enables the LWK to naturally leverage the Linux driver codebase without sacrificing scalability or the POSIX API.
Abstract: Extreme degree of parallelism in high-end computing requires low operating system noise so that large scale, bulk-synchronous parallel applications can be run efficiently. Noiseless execution has been historically achieved by deploying lightweight kernels (LWK), which, on the other hand, can provide only a restricted set of the POSIX API in exchange for scalability. However, the increasing prevalence of more complex application constructs, such as in-situ analysis and workflow composition, dictates the need for the rich programming APIs of POSIX/Linux. In order to comply with these seemingly contradictory requirements, hybrid kernels, where Linux and a lightweight kernel (LWK) are run side-by-side on compute nodes, have been recently recognized as a promising approach. Although multiple research projects are now pursuing this direction, the questions of how node resources are shared between the two types of kernels, how exactly the two kernels interact with each other, and to what extent they are integrated remain subjects of ongoing debate. In this paper, we describe IHK/McKernel, a hybrid software stack that seamlessly blends an LWK with Linux by selectively offloading system services from the lightweight kernel to Linux. Specifically, we focus on transparent reuse of Linux device drivers and detail the design of our framework that enables the LWK to naturally leverage the Linux driver codebase without sacrificing scalability or the POSIX API. Through rigorous evaluation on a medium-size cluster we demonstrate how McKernel provides consistent, isolated performance for simulations even in the face of competing, in-situ workloads.

36 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of minimizing the continuous valued total variation subject to different unary terms on trees and propose fast direct algorithms based on dynamic programming to solve these problems.
Abstract: We consider the problem of minimizing the continuous valued total variation subject to different unary terms on trees and propose fast direct algorithms based on dynamic programming to solve these problems. We treat both the convex and the nonconvex case and derive worst-case complexities that are equal to or better than existing methods. We show applications to total variation based two-dimensional image processing and computer vision problems based on a Lagrangian decomposition approach. The resulting algorithms are very efficient, offer a high degree of parallelism, and have memory requirements only on the order of the number of image pixels.

36 citations


Proceedings ArticleDOI
13 Jun 2016
TL;DR: A novel graph-based Complex Event Processing system GraphCEP is proposed and its performance is evaluated in the setting of two case studies from the DEBS Grand Challenge 2016.
Abstract: In recent years, the proliferation of highly dynamic graph-structured data streams fueled the demand for real-time data analytics. For instance, detecting recent trends in social networks enables new applications in areas such as disaster detection, business analytics or health-care. Parallel Complex Event Processing has evolved as the paradigm of choice to analyze data streams in a timely manner, where the incoming data streams are split and processed independently by parallel operator instances. However, the degree of parallelism is limited by the feasibility of splitting the data streams into independent parts such that correctness of event processing is still ensured. In this paper, we overcome this limitation for graph-structured data by further parallelizing individual operator instances using modern graph processing systems. These systems partition the graph data and execute graph algorithms in a highly parallel fashion, for instance using cloud resources. To this end, we propose a novel graph-based Complex Event Processing system GraphCEP and evaluate its performance in the setting of two case studies from the DEBS Grand Challenge 2016.

28 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: The approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.
Abstract: GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism -- such as flat or two-level parallelism -- and a degree of parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested parallelism at runtime. However, the effective use of DP must still be understood: a naive use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naive use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.

22 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This paper proposes a Workflow Partitioning for Energy Minimization (WPEM) algorithm that allows reducing the network energy consumption of the workflow and the total amount of data communication while achieving a high degree of parallelism.
Abstract: Energy consumption is emerging as a crucial new issue in Cloud Computing environments such as data centers. The problem of power consumption is especially challenging in the context of scientific workflow deployment in the Cloud, as such workflows trigger intensive computational tasks and data manipulation steps that cause excessive data movement over communication networks. For instance, it has been reported that network devices consume up to one-third of the total energy consumption of Cloud data centers. In this paper, we propose an energy-aware approach for scientific workflow scheduling in the Cloud. In the first step, we propose a Workflow Partitioning for Energy Minimization (WPEM) algorithm that reduces the network energy consumption of the workflow and the total amount of data communication while achieving a high degree of parallelism. In the second step, we use the Cat Swarm Optimization heuristic to schedule the generated partitions in order to minimize the workflow's overall energy consumption and execution time. We evaluated the proposed approach using three real cases of data-intensive workflows and compared it with other algorithms from the literature. The experimental results show that our proposal markedly reduces the network energy consumption of the tested workflows (saving up to 96% of network energy for memory-intensive workflows) as well as their overall energy consumption, while ensuring a reasonable execution time and using fewer Cloud resources.

Journal ArticleDOI
TL;DR: The proposed level based autonomic Workflow-and-Platform Aware (WPA) task clustering technique aims to achieve maximum possible parallelism among the tasks at a level of a workflow while minimizing the system overheads and resource wastage.

Journal ArticleDOI
TL;DR: A two-level space–time domain decomposition method for solving an inverse source problem associated with the time-dependent convection–diffusion equation in three dimensions that eliminates the sequential steps in the optimization outer loop and the inner forward and backward time marching processes, thus achieving a high degree of parallelism.
Abstract: As the number of processor cores on supercomputers becomes larger and larger, algorithms with a high degree of parallelism attract more attention. In this work, we propose a two-level space-time domain decomposition method for solving an inverse source problem associated with the time-dependent convection-diffusion equation in three dimensions. We introduce a mixed finite element/finite difference method and a one-level and a two-level space-time parallel domain decomposition preconditioner for the Karush-Kuhn-Tucker system induced by reformulating the inverse problem as an output least-squares optimization problem in the entire space-time domain. The new full space-time approach eliminates the sequential steps in the optimization outer loop and the inner forward and backward time marching processes, thus achieving a high degree of parallelism. Numerical experiments validate that this approach is effective and robust for recovering unsteady moving sources. We present strong scalability results obtained on a supercomputer with more than 1000 processors.

Posted Content
TL;DR: PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging, is proposed and can significantly reduce recovery time without compromising the efficiency of transaction processing.
Abstract: Main-memory database management systems (DBMS) can achieve excellent performance when processing massive volume of on-line transactions on modern multi-core machines. But existing durability schemes, namely, tuple-level and transaction-level logging-and-recovery mechanisms, either degrade the performance of transaction processing or slow down the process of failure recovery. In this paper, we show that, by exploiting application semantics, it is possible to achieve speedy failure recovery without introducing any costly logging overhead to the execution of concurrent transactions. We propose PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging. PACMAN leverages a combination of static and dynamic analyses to parallelize the log recovery: at compile time, PACMAN decomposes stored procedures by carefully analyzing dependencies within and across programs; at recovery time, PACMAN exploits the availability of the runtime parameter values to attain an execution schedule with a high degree of parallelism. As such, recovery performance is remarkably increased. We evaluated PACMAN in a fully-fledged main-memory DBMS running on a 40-core machine. Compared to several state-of-the-art database recovery mechanisms, PACMAN can significantly reduce recovery time without compromising the efficiency of transaction processing.
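The underlying idea, replaying log records concurrently when their write sets do not conflict, can be sketched as follows. This is not PACMAN's static/dynamic analysis of stored procedures; the log records, write sets, and replay functions are hypothetical.

```python
# Minimal sketch (not PACMAN itself): consecutive log records with pairwise
# disjoint write sets are grouped into a batch and replayed in parallel;
# batches replay in log order, so conflicting records stay serialized.
from concurrent.futures import ThreadPoolExecutor

log = [  # (txn_id, write_set, replay operation) -- hypothetical records
    (1, {"a"}, lambda db: db.update(a=db["a"] + 1)),
    (2, {"b"}, lambda db: db.update(b=7)),
    (3, {"a", "c"}, lambda db: db.update(c=db["a"])),
]

def schedule_batches(records):
    """Pack consecutive records with pairwise-disjoint write sets into one batch."""
    batches = []
    for rec in records:
        _, wset, _ = rec
        if batches and all(wset.isdisjoint(w) for _, w, _ in batches[-1]):
            batches[-1].append(rec)
        else:
            batches.append([rec])
    return batches

db = {"a": 0, "b": 0, "c": 0}
for batch in schedule_batches(log):               # batches replay in log order
    with ThreadPoolExecutor() as pool:            # records in a batch replay in parallel
        list(pool.map(lambda rec: rec[2](db), batch))
print(db)   # {'a': 1, 'b': 7, 'c': 1}
```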

Journal ArticleDOI
TL;DR: This paper proposes a new multi-scale semi-dense point tracker called Video Extruder, whose purpose is to fill the gap between short-term, dense motion estimation (optical flow) and long-term, sparse salient point tracking, and presents a new detector, including a new salience function with low computational complexity and a new selection strategy that yields a large number of keypoints.
Abstract: Two crucial aspects of general-purpose embedded visual point tracking are addressed in this paper. First, the algorithm should reliably track as many points as possible. Second, the computation should achieve real-time video processing, which is challenging on low-power embedded platforms. We propose a new multi-scale semi-dense point tracker called Video Extruder, whose purpose is to fill the gap between short-term, dense motion estimation (optical flow) and long-term, sparse salient point tracking. This paper presents a new detector, including a new salience function with low computational complexity and a new selection strategy that yields a large number of keypoints. Its density and reliability in mobile video scenarios are compared with those of the FAST detector. Then, a multi-scale matching strategy is presented, based on hybrid regional coarse-to-fine and temporal prediction, which provides robustness to large camera and object accelerations. Filtering and merging strategies are then used to eliminate most of the wrong or useless trajectories. Thanks to its high degree of parallelism, the proposed algorithm extracts beams of trajectories from the video very efficiently. We compare it with the state-of-the-art pyramidal Lucas-Kanade point tracker and show that, in short-range mobile video scenarios, it yields similar quality results, while being up to one order of magnitude faster. Three different parallel implementations of this tracker are presented, on multi-core CPU, GPU and ARM SoCs. On a commodity 2010 CPU, it can track 8,500 points in a 640 × 480 video at 150 Hz.

Proceedings Article
01 Mar 2016
TL;DR: Adaptively parallelized plans show optimal multi-core utilization and up to five times improvement compared to heuristically parallelized plans on the workload under evaluation.
Abstract: With the rise of multi-core CPU platforms, their optimal utilization for in-memory OLAP workloads using column store databases has become one of the biggest challenges. Some of the inherent limitations in the achievable query parallelism are due to the dependency of the degree of parallelism on the data skew, the overheads incurred by thread coordination, and the hardware resource limits. Finding the right balance between the degree of parallelism and the multi-core utilization is even trickier. It makes parallel plan generation using traditional query optimizers a complex task. In this paper we introduce adaptive parallelization, which exploits execution feedback to gradually increase the level of parallelism until we reach a sweet spot. After each query has been executed, we replace an expensive operator (or a sequence) by a faster parallel version, i.e. the query plan is morphed into a faster one. A convergence algorithm is designed to reach the optimum as quickly as possible. The approach is evaluated against a full-fledged column store using micro-benchmarks and a subset of the TPC-H and TPC-DS queries. It confirms the feasibility of the design and proves to be competitive against a statically optimized heuristic plan generator. Adaptively parallelized plans show optimal multi-core utilization and up to five times improvement compared to heuristically parallelized plans on the workload under evaluation.
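A toy version of the feedback loop described above: run the workload, measure, and keep doubling the degree of parallelism while the measured time still improves. run_query() and the convergence rule below are simplified assumptions, not the system's operator-level plan morphing.

```python
# Minimal sketch of feedback-driven parallelization: increase the degree of
# parallelism (DOP) until the measured runtime stops improving (a "sweet spot").
import time
from concurrent.futures import ProcessPoolExecutor

def chunk(_):
    """Hypothetical unit of query work."""
    return sum(i * i for i in range(200_000))

def run_query(dop):
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=dop) as pool:
        sum(pool.map(chunk, range(8)))          # 8 chunks of work, dop workers
    return time.perf_counter() - start

def adaptive_dop(max_dop=16, tolerance=0.95):
    best_dop, best_time = 1, run_query(1)
    dop = 2
    while dop <= max_dop:
        t = run_query(dop)
        if t < best_time * tolerance:           # still improving: keep morphing
            best_dop, best_time = dop, t
            dop *= 2
        else:                                   # converged to the sweet spot
            break
    return best_dop, best_time

if __name__ == "__main__":
    print(adaptive_dop())
```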

Patent
29 Aug 2016
TL;DR: In this paper, a hybrid parallelization of in-memory table scans is described, where the work for each granule is further parallelized by dividing the work granule into one or more tasks.
Abstract: Techniques are described herein for hybrid parallelization of in-memory table scans. Work for an in-memory scan is divided into granules based on a degree of parallelism. The granules are assigned to one or more processes. The work for each granule is further parallelized by dividing the work granule into one or more tasks. The tasks are assigned to one or more threads, the number of which can be dynamically adjusted.
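A minimal sketch of the two-level split the claim describes: the scan is first cut into granules according to a degree of parallelism, and each granule is further cut into tasks dispatched to threads. The row data, the filter predicate, and the granule/task/thread counts are illustrative assumptions.

```python
# Minimal sketch of a two-level parallel scan: granules per the DOP, then
# finer-grained tasks per granule, executed by a thread pool.
from concurrent.futures import ThreadPoolExecutor

rows = list(range(1_000_000))          # stand-in for an in-memory column

def split(seq, parts):
    step = -(-len(seq) // parts)       # ceiling division
    return [seq[i:i + step] for i in range(0, len(seq), step)]

def scan_task(chunk):
    return sum(1 for r in chunk if r % 7 == 0)   # hypothetical filter

def scan(dop=4, tasks_per_granule=8, threads=8):
    granules = split(rows, dop)                           # one granule per DOP slot
    matches = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for granule in granules:
            tasks = split(granule, tasks_per_granule)     # finer-grained tasks
            matches += sum(pool.map(scan_task, tasks))
    return matches

print(scan())   # 142858
```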

Journal ArticleDOI
TL;DR: A physics-based material model, the Jiles-Atherton model, is implemented in a GPU to compute the B-H hysteretic relationship, which can be directly incorporated in FE simulations and the performance of the GPU is compared with that of the given microprocessor in terms of computational time.
Abstract: Design engineers are always looking for extra computational power to speed up the execution of their tasks. One way to achieve this speedup is to identify tasks with a high degree of parallelism and process them with graphic processing units (GPUs). GPUs are optimized to process such tasks efficiently and quickly in massive multicore hardware. The steps involved in a finite-element (FE) electromagnetic simulation are computationally very expensive. One such step is the communication between FE solver and the material loss model that takes place for all the elements in the mesh for each time step. This task is massively parallel and, thus, could be executed in a GPU. As an example, a physics-based material model, the Jiles–Atherton model, is implemented in a GPU to compute the B-H hysteretic relationship, which can be directly incorporated in FE simulations. The performance of the GPU is compared with that of the given microprocessor in terms of computational time. A time gain of 13.8 times has been achieved.

Journal ArticleDOI
TL;DR: This work proposes a technique that employs the preprocessing of fault scenarios based on forecasting fault tendencies, which is performed with a fault threshold circuit operating in accordance with high-level software, and proposes methods for dissimilarity analysis of scenarios based on cross-correlation measurements of link fault matrices.

Journal ArticleDOI
18 Jun 2016
TL;DR: Experimental results show that the proposed software-hardware cooperative mechanism can effectively increase the opportunity of threads entering the critical section in the low-overhead spinning phase, reducing the competition overhead and accelerating the execution of the Region-of-Interest on average across all 25 benchmark programs.
Abstract: As the degree of parallelism increases, the performance of multi-threaded shared-variable applications is limited not only by serialized critical section execution, but also by the serialized competition overhead for threads to gain access to the critical section. As the number of concurrent threads grows, such competition overhead may exceed the time spent in the critical section itself and become the dominating factor limiting the performance of parallel applications. In modern operating systems, the queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In this paper, we show that this advanced locking solution may create very high competition overhead for multithreaded applications executing in NoC-based CMPs. We then propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins the critical section access in the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects how soon the thread must enter the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests and let the NoC prioritize locking request packets according to the RTR information. The principle is that the smaller the RTR a locking request packet carries, the higher the priority it gets and thus the quicker its delivery. We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using PARSEC (11 programs) and SPEC OMP2012 (14 programs) benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method can effectively increase the opportunity of threads entering the critical section in the low-overhead spinning phase, reducing the competition overhead by 39.9% on average (61.8% maximum) and accelerating the execution of the Region-of-Interest by 14.4% on average (24.5% maximum) across all 25 benchmark programs.
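For intuition, below is a toy spin-then-sleep lock in which a thread retries acquisition a bounded number of times before falling back to blocking; the remaining-times-of-retry (RTR) counter is the quantity the paper exposes to the NoC to prioritize lock-request packets. The hardware prioritization itself is not modeled, and the class is an illustrative assumption rather than the kernel's queue spinlock.

```python
# Toy two-phase lock: a cheap spinning phase with a bounded retry budget (RTR),
# then an expensive blocking phase. Illustrative only.
import threading

class SpinThenSleepLock:
    def __init__(self, max_spins=100):
        self._lock = threading.Lock()
        self.max_spins = max_spins

    def acquire(self):
        rtr = self.max_spins                     # remaining times of retry
        while rtr > 0:
            if self._lock.acquire(blocking=False):
                return True                      # won in the low-overhead phase
            rtr -= 1                             # smaller RTR: closer to sleeping
        self._lock.acquire()                     # high-overhead blocking phase
        return True

    def release(self):
        self._lock.release()

counter, lock = 0, SpinThenSleepLock()

def work():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1                             # critical section
        lock.release()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 40000
```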

Journal ArticleDOI
TL;DR: A novel numerical framework for pricing American options in high dimensions that processes an entire cross section of options in a single execution and offers an immediate solution to the estimation of hedging coefficients through finite differences, which brings valuable advantages over Monte Carlo simulations.
Abstract: We introduce a novel numerical framework for pricing American options in high dimensions. Such settings naturally arise for derivatives with multiple underlying assets, like basket options. They are equally important for single-asset options because high-dimensional models are best capable of capturing observed price dynamics. Yet, higher-dimensional settings come at the cost of a loss of tractability due to the accompanying exponential growth of computational complexity. Our scheme manages to alleviate the problem of dimension scaling through the use of adaptive sparse grids. We approximate the value function with a low number of points and recursively apply fast approximations of the expectation operator from an exercise period to the previous one. The algorithm copes with discretely spaced, possibly nonuniform, time grids. This makes it particularly fast for options with a limited number of exercise periods, like Bermudan options, and options for which the optimal exercise schedule is known ex ante. As compared to Monte Carlo simulations, our scheme processes an entire cross section of options in a single execution. It thereby offers an immediate solution to the estimation of hedging coefficients through finite differences and is ideal when multiple related options need to be analyzed. The algorithm is also capable of dealing with discrete dividends with no performance deterioration, thus improving on the documented inefficiency of exercise policies under continuous dividend yield approximations. We benchmark our algorithm under both the canonical model of Black and Scholes and the stochastic volatility model of Heston in the presence of discrete dividends. We illustrate the massive improvement of complexity scaling over dense grids with a basket option study including up to eight underlying assets. We show how the high degree of parallelism of our scheme makes it suitable for deployment on massively parallel computing units to scale to higher dimensions or further speed up the solution process.

Journal ArticleDOI
01 May 2016
TL;DR: In this paper, the authors present algorithms for parallel query optimization in left-deep and bushy plan spaces, where each worker returns the optimal plan in its partition to the master which determines the globally optimal plan from the partition-optimal plans.
Abstract: Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query evaluation. We show how to parallelize query optimization at a massive scale. We present algorithms for parallel query optimization in left-deep and bushy plan spaces. At optimization start, we divide the plan space for a given query into partitions of equal size that are explored in parallel by worker nodes. At the end of optimization, each worker returns the optimal plan in its partition to the master which determines the globally optimal plan from the partition-optimal plans. No synchronization or data exchange is required during the actual optimization phase. The amount of data sent over the network, at the start and at the end of optimization, as well as the complexity of serial steps within our algorithms increase only linearly in the number of workers and in the query size. The time and space complexity of optimization within one partition decreases uniformly in the number of workers. We parallelize single- and multi-objective query optimization over a cluster with 100 nodes in our experiments, using more than 250 concurrent worker threads (Spark executors). Despite high network latency and task assignment overheads, parallelization yields speedups of up to one order of magnitude for large queries whose optimization takes minutes on a single node.
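The partition-and-merge structure described above can be mimicked in a few lines: the space of join orders is split into equal-size partitions, each worker returns its partition-optimal plan, and the master keeps the cheapest. The cost model below is a deliberately crude stand-in, not the paper's.

```python
# Minimal sketch: partition the join-order space across workers, take the best
# plan per partition, then the global best. Cost model and tables are made up.
from itertools import permutations, islice
from multiprocessing import Pool

TABLES = ["A", "B", "C", "D", "E", "F"]
CARD = {"A": 10, "B": 1000, "C": 50, "D": 5000, "E": 20, "F": 300}

def plan_cost(order):
    """Toy left-deep cost: sum of intermediate sizes, assuming 1% join selectivity."""
    cost, inter = 0, CARD[order[0]]
    for t in order[1:]:
        inter = inter * CARD[t] // 100
        cost += inter
    return cost

def best_in_partition(args):
    start, step = args                     # worker explores every `step`-th plan
    plans = islice(permutations(TABLES), start, None, step)
    return min(plans, key=plan_cost)

if __name__ == "__main__":
    workers = 4
    with Pool(workers) as pool:            # each worker returns its partition-optimal plan
        partition_best = pool.map(best_in_partition, [(i, workers) for i in range(workers)])
    print(min(partition_best, key=plan_cost))   # master picks the global optimum
```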

Proceedings ArticleDOI
01 Jul 2016
TL;DR: This paper presents XL-STaGe, a cross-layer tool for traffic-inclusive directed acyclic graph generation and implementation, which consists of a set of processes that generate the task-graphs as well as a detailed process model for each node in each graph.
Abstract: This paper presents XL-STaGe, a cross-layer tool for traffic-inclusive directed acyclic graph generation and implementation. In contrast to other graph-generation tools, which focus on high-level DAG models, XL-STaGe consists of a set of processes that generate the task-graphs as well as a detailed process model for each node in each graph. The tool is highly customizable, with many parameters that can be tuned to meet the user's requirements to control the topology, connection density, degree of parallelism, and duration of the task-graph. Moreover, two use cases are presented: a high-level one, which illustrates the benefit of the developed tool in application mapping, and a circuit-level one, to verify the accuracy of the XL-STaGe process models when implemented in hardware.

Journal ArticleDOI
TL;DR: Efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design is presented, where a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700 is achieved.
Abstract: We present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where we achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed to find opportunities for exhibiting a higher degree of parallelism, where parallelism is quantified as the number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on the theoretical analysis of classical HT, an opportunity to re-arrange the computations in classical HT is identified, resulting in a Modified HT (MHT) that is shown to exhibit 1.33x higher parallelism than classical HT. Experiments on off-the-shelf multicore processors and General Purpose Graphics Processing Units (GPGPUs) for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for Dense Linear Algebra (DLA) and show that MHT achieves 1.3x better performance than a native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that, due to the re-arrangement of computations in MHT, the custom realization of MHT achieves a 12% larger performance improvement over multicore and GPGPUs than the improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs, which is counter-intuitive.

Journal ArticleDOI
TL;DR: A novel protocol for reconfiguring cloud applications, able to ensure communication between virtual machines and resolve dependencies by exchanging messages, (dis)connecting, and starting/stopping components in a specific order.

Journal ArticleDOI
TL;DR: TransMap stores only a single implementation and applies a series of transformations to the stored bitstream for remapping or parallelizing an application, and it offers significant reductions in configuration memory requirements compared to state-of-the-art compaction techniques.
Abstract: In the era of platforms hosting multiple applications with arbitrary inter-application communication and computation patterns, compile-time mapping decisions are neither optimal nor desirable. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remapping techniques displace or parallelize/serialize an application to optimize different parameters (e.g., utilization and energy). To implement the dynamic remapping, reconfigurable architectures commonly store multiple (compile-time generated) implementations of an application. Each implementation represents a different platform location and/or degree of parallelism. The optimal implementation is selected at run-time. However, the compile-time binding either incurs excessive configuration memory overheads and/or is unable to map/parallelize an application even when sufficient resources are available. As a solution to this problem, we present Transformation-based reMapping and parallelism (TransMap). TransMap stores only a single implementation and applies a series of transformations to the stored bitstream for remapping or parallelizing an application. Compared to the state of the art, in addition to simple relocation in horizontal/vertical directions, TransMap also allows an application to be rotated for mapping or parallelizing it in resource-constrained scenarios. By storing only a single implementation, TransMap offers significant reductions in configuration memory requirements (up to 73 percent for the tested applications) compared to state-of-the-art compaction techniques. Simulation results reveal that the additional flexibility reduces the energy requirements by 33 percent and enhances device utilization by 50 percent for the tested applications. Gate-level analysis reveals that TransMap incurs negligible silicon (0.2 percent of the platform) and timing (6 additional cycles per application) penalties.

Journal ArticleDOI
TL;DR: The performance of a Monte Carlo model for the simulation of electromagnetic wave propagation in particle-filled atmospheres has been evaluated for different CUDA versions and design approaches; the algorithm exhibits a high degree of parallelism, which allows favorable implementation in a GPU.
Abstract: The performance of a Monte Carlo model for the simulation of electromagnetic wave propagation in particle-filled atmospheres has been evaluated for different CUDA versions and design approaches. The proposed algorithm exhibits a high degree of parallelism, which allows favorable implementation in a GPU. Practical implementation aspects of the model, such as the use of the different types of memory present in a GPU, have also been explained and their impact assessed. A number of setups have been chosen in order to compare the performance of manually optimized versus Unified Virtual Memory (UVM) implementations for different CUDA versions. Features and the relative performance impact of the different options have been discussed, extracting practical hints and rules useful for speeding up CUDA programs.

01 Feb 2016
TL;DR: The most significant contributions on computation/communication overlapping are gathered, and a technical explanation of how such overlap can be achieved on modern supercomputers is provided.
Abstract: In High Performance Computing (HPC), minimizing communication overhead is one of the most important goals in order to achieve high performance. This is more important than ever on exascale platforms, where there will be a much higher degree of parallelism compared to petascale platforms, resulting in increased communication overhead with considerable impact on application execution time and energy expenses. A good strategy for containing this overhead is to hide communication costs by overlapping them with computation. Despite the increasing interest in achieving computation/communication overlapping, details about the reasons that prevent it from succeeding are not easy to find, leading to confusion and poor application optimization. The Message Passing Interface (MPI) library, a de-facto standard in the HPC world, has always provided non-blocking communication routines able, in theory, to achieve communication/computation overlapping. Unfortunately, several factors related to MPI's independent progress and the offload capability of the underlying network make this overlap hard to achieve. With the introduction of one-sided communication routines, providing high-quality MPI implementations able to progress communication independently is becoming as important as providing low-latency and high-bandwidth communication. In this paper, we gather the most significant contributions about computation/communication overlapping and provide a technical explanation of how such overlap can be achieved on modern supercomputers.
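A minimal mpi4py illustration of the non-blocking pattern discussed above: post Isend/Irecv, perform independent computation, then wait. Whether real overlap occurs still depends on asynchronous progress and network offload, exactly as argued above; the buffer size and the two-rank layout are assumptions.

```python
# Minimal non-blocking send/recv with computation in between. Run with e.g.:
#   mpiexec -n 2 python overlap.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                      # assumes exactly 2 ranks

send_buf = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty(1_000_000, dtype=np.float64)

requests = [comm.Isend(send_buf, dest=peer, tag=0),
            comm.Irecv(recv_buf, source=peer, tag=0)]

# Computation that does not depend on recv_buf can proceed while the messages
# are (ideally) progressed in the background by the MPI library or the NIC.
local = np.sin(send_buf).sum()

MPI.Request.Waitall(requests)        # communication must complete before using recv_buf
print(rank, local, recv_buf[0])
```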

Journal ArticleDOI
TL;DR: This work employs a per-application predictive power manager that autonomously controls the power states of the cores with the goal of energy efficiency, and allows the applications to lend their idle cores for a short time period to expedite other critical applications.
Abstract: We present a scalable Dynamic Power Management (DPM) scheme where malleable applications may change their degree of parallelism at run time depending upon the workload and performance constraints. We employ a per-application predictive power manager that autonomously controls the power states of the cores with the goal of energy efficiency. Furthermore, our DPM allows the applications to lend their idle cores for a short time period to expedite other critical applications. In this way, it allows for application-level scalability, while aiming at the overall system energy optimization. Compared to state-of-the-art centralized and distributed power management approaches, we achieve up to 58 percent (average ~15-20 percent) ED2P reduction.

Book ChapterDOI
01 Jan 2016
TL;DR: A novel set of concurrency-related source code metrics to be used as the basis for bug prediction methods is proposed; the approach is discussed with respect to the existing state of the art, and the research challenges that have to be addressed are outlined.
Abstract: As physical limits began to negate the assumption known as Moore’s law, chip manufacturers started focusing on multicore architectures as the main solution to improve the processing power of modern computers. Today, multicore CPUs are commonly found in servers, PCs, smartphones, cars, airplanes, and home appliances. As this happens, more and more programs are designed with some degree of parallelism to take advantage of these implicitly concurrent architectures. In this context, new challenges are presented to software engineers. For example, software validation becomes much more expensive (since testing concurrency is difficult), and strategies such as bug prediction could be used to better focus the effort during the development process. However, most of the existing bug prediction approaches have been designed with sequential programs in mind. In this paper, we propose a novel set of concurrency-related source code metrics to be used as the basis for bug prediction methods; we discuss our approach with respect to the existing state of the art, and we outline the research challenges that have to be addressed to realize our goal.