
Showing papers on "Degree of parallelism" published in 2016


Journal ArticleDOI
TL;DR: With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation and supports the spike-timing-dependent plasticity (STDP) rule for learning.
Abstract: NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current- or conductance-based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve real-time performance for 400,000 neurons. Using one FPGA, NeuroFlow runs up to 33.6 times faster than an 8-core processor, or 2.83 times faster than GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.

87 citations


Journal ArticleDOI
TL;DR: This article introduces a parallel and distributed memory-based algorithm that builds vulnerability-based attack graphs on a distributed multi-agent platform and introduces a rich attack template and network model in order to form chains of vulnerability exploits in attack graphs more precisely.
Abstract: Attack graphs show possible paths that an attacker can use to intrude into a target network and gain privileges through series of vulnerability exploits. The computation of attack graphs suffers from the state explosion problem occurring most notably when the number of vulnerabilities in the target network grows large. Parallel computation of attack graphs can be utilized to attenuate this problem. When employed in online network security evaluation, the computation of attack graphs can be triggered with the correlated intrusion alerts received from sensors scattered throughout the target network. In such cases, distributed computation of attack graphs becomes valuable. This article introduces a parallel and distributed memory-based algorithm that builds vulnerability-based attack graphs on a distributed multi-agent platform. A virtual shared memory abstraction is proposed to be used over such a platform, whose memory pages are initialized by partitioning the network reachability information. We demonstrate the feasibility of parallel distributed computation of attack graphs and show that even a small degree of parallelism can effectively speed up the generation process as the problem size grows. We also introduce a rich attack template and network model in order to form chains of vulnerability exploits in attack graphs more precisely.

70 citations
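The core idea above, partitioning the network reachability information and letting parallel agents expand exploit chains over their own partitions, can be illustrated with a small sketch. This is not the paper's algorithm or data model; the hosts, vulnerability table, and can_exploit() rule below are hypothetical stand-ins.

```python
# Minimal sketch (not the paper's algorithm): reachability data is partitioned
# across workers, each worker derives the attack-graph edges whose source host
# lies in its partition, and the results are merged. All names are hypothetical.
from multiprocessing import Pool

reachability = {("web", "app"), ("web", "db"), ("app", "db")}   # directed host pairs
vulns = {"app": ["CVE-B"], "db": ["CVE-A"]}                     # vulnerabilities per host

def can_exploit(src, dst, cve):
    # Hypothetical rule: any vulnerability on a host reachable from src is exploitable.
    return (src, dst) in reachability

def expand_partition(hosts):
    """Derive exploit edges originating from hosts in this partition."""
    edges = []
    for src in hosts:
        for dst, cves in vulns.items():
            edges += [(src, cve, dst) for cve in cves if can_exploit(src, dst, cve)]
    return edges

if __name__ == "__main__":
    partitions = [["web"], ["app"], ["db"]]     # reachability "pages" split across agents
    with Pool(len(partitions)) as pool:
        attack_edges = [e for part in pool.map(expand_partition, partitions) for e in part]
    print(attack_edges)   # [('web', 'CVE-B', 'app'), ('web', 'CVE-A', 'db'), ('app', 'CVE-A', 'db')]
```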


Proceedings ArticleDOI
16 May 2016
TL;DR: NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources to fully utilize the memory space and reduce the amount of data transfer.
Abstract: Recent studies show that graph processing systems on a single machine can achieve competitive performance compared with cluster-based graph processing systems. In this paper, we present NXgraph, an efficient graph processing system on a single machine. We propose the Destination-Sorted Sub-Shard (DSSS) structure to store a graph. To ensure graph data access locality and enable fine-grained scheduling, NXgraph divides vertices and edges into intervals and sub-shards. To reduce write conflicts among different threads and achieve a high degree of parallelism, NXgraph sorts edges within each sub-shard according to their destination vertices. Then, three updating strategies, i.e., Single-Phase Update (SPU), Double-Phase Update (DPU), and Mixed-Phase Update (MPU), are proposed in this paper. NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources to fully utilize the memory space and reduce the amount of data transfer. All these three strategies exploit streamlined disk access patterns. Extensive experiments on three real-world graphs and five synthetic graphs show that NXgraph outperforms GraphChi, TurboGraph, VENUS, and GridGraph in various situations. Moreover, NXgraph, running on a single commodity PC, can finish an iteration of PageRank on the Twitter [1] graph with 1.5 billion edges in 2.05 seconds; while PowerGraph, a distributed graph processing system, needs 3.6s to finish the same task on a 64-node cluster.

60 citations
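As a rough illustration of the adaptive strategy selection claimed above, the sketch below picks an update strategy from the graph footprint and the available memory. The decision rules and sizes are simplified assumptions for illustration only, not the exact conditions NXgraph uses.

```python
# Illustrative sketch only: choose an NXgraph-style update strategy from memory
# availability. The thresholds are assumptions, not NXgraph's actual rules.

def choose_update_strategy(vertex_bytes, edge_bytes, free_mem_bytes):
    if vertex_bytes + edge_bytes <= free_mem_bytes:
        return "SPU"   # whole graph resident: Single-Phase Update
    if vertex_bytes <= free_mem_bytes:
        return "DPU"   # vertices resident, edges streamed: Double-Phase Update
    return "MPU"       # not even vertices fit: Mixed-Phase Update

print(choose_update_strategy(4 << 30, 24 << 30, 16 << 30))   # -> DPU
```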


Journal ArticleDOI
TL;DR: A parallel implementation of the Bellman-Ford algorithm that exploits the architectural characteristics of recent GPU architectures (i.e., NVIDIA Kepler, Maxwell) to improve both performance and work efficiency is presented.
Abstract: Finding the shortest paths from a single source to all other vertices is a common problem in graph analysis. The Bellman-Ford algorithm solves this single-source shortest path (SSSP) problem and lends itself well to parallelization on many-core architectures. Nevertheless, its high degree of parallelism comes at the cost of low work efficiency: compared to similar algorithms in the literature (e.g., Dijkstra's), it involves much more redundant work and consequently wastes power. This article presents a parallel implementation of the Bellman-Ford algorithm that exploits the architectural characteristics of recent GPU architectures (i.e., NVIDIA Kepler, Maxwell) to improve both performance and work efficiency. The article presents different optimizations to the implementation, oriented both to the algorithm and to the architecture. The experimental results show that the proposed implementation provides an average speedup of 5x over the most efficient existing parallel implementations for SSSP, that it works on graphs where those implementations cannot work or are inefficient (e.g., graphs with negative-weight edges, sparse graphs), and that it considerably reduces the redundant work caused by the parallelization process.

59 citations
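To make the parallelism/work-efficiency trade-off concrete, here is a minimal serial sketch of Bellman-Ford written so that every edge relaxation within a round is independent, which is exactly the structure a GPU implementation maps to one thread per edge. It is not the paper's Kepler/Maxwell implementation or its optimizations.

```python
# Minimal sketch: Jacobi-style Bellman-Ford where each round reads the old
# distance array and writes a new one, so all per-edge relaxations in a round
# are independent (the data parallelism a GPU exploits). Serial for clarity.
import math

def bellman_ford(num_vertices, edges, source):
    """edges: list of (u, v, weight); supports negative weights (no negative cycles)."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):          # at most |V|-1 rounds
        new_dist = dist[:]
        for u, v, w in edges:                  # each relaxation is independent
            if dist[u] + w < new_dist[v]:
                new_dist[v] = dist[u] + w
        if new_dist == dist:                   # early exit: no update this round
            break
        dist = new_dist
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, -2.0), (1, 3, 3.0)]
print(bellman_ford(4, edges, source=0))        # [0.0, -1.0, 1.0, 2.0]
```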


Proceedings ArticleDOI
01 Aug 2016
TL;DR: In this article, the authors propose two new techniques to increase the degree of parallelism during decompression, which can exploit the massive parallelism of modern multi-core processors and GPUs for data decompression within a block.
Abstract: Today's exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics queries that repeatedly read compressed data. While decompression can be parallelized somewhat by assigning each data block to a different process, break-through speed-ups require exploiting the massive parallelism of modern multi-core processors and GPUs for data decompression within a block. We propose two new techniques to increase the degree of parallelism during decompression. The first technique exploits the massive parallelism of GPU and SIMD architectures. The second sacrifices some compression efficiency to eliminate data dependencies that limit parallelism during decompression. We evaluate these techniques on the decompressor of the DEFLATE scheme, called Inflate, which is based on LZ77 compression and Huffman encoding. We achieve a 2× speed-up in a head-to-head comparison with several multi-core CPU-based libraries, while achieving a 17% energy saving with comparable compression ratios.

42 citations
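For reference, the coarse-grained baseline mentioned above, one independently compressed block per worker, can be sketched in a few lines. The intra-block GPU/SIMD parallelization the paper proposes requires format changes and kernels that are not shown here; the block size below is an arbitrary assumption.

```python
# Minimal sketch of block-level parallel decompression: each block is an
# independent zlib stream, so workers can inflate them concurrently.
import zlib
from concurrent.futures import ProcessPoolExecutor

def compress_blocks(data, block_size=1 << 20):
    """Compress fixed-size chunks independently so they can be inflated in parallel."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def parallel_inflate(blocks, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

if __name__ == "__main__":
    payload = b"some repetitive analytics data " * 100_000
    blocks = compress_blocks(payload)
    assert parallel_inflate(blocks) == payload
```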


Posted Content
TL;DR: In this article, the authors propose two new techniques to increase the degree of parallelism during decompression of DEFLATE, which is based on LZ77 compression and Huffman encoding and achieves a 2X speedup in a head-to-head comparison with several multi-core CPU-based libraries.
Abstract: Today's exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics queries that repeatedly read compressed data. While decompression can be parallelized somewhat by assigning each data block to a different process, break-through speed-ups require exploiting the massive parallelism of modern multi-core processors and GPUs for data decompression within a block. We propose two new techniques to increase the degree of parallelism during decompression. The first technique exploits the massive parallelism of GPU and SIMD architectures. The second sacrifices some compression efficiency to eliminate data dependencies that limit parallelism during decompression. We evaluate these techniques on the decompressor of the DEFLATE scheme, called Inflate, which is based on LZ77 compression and Huffman encoding. We achieve a 2X speed-up in a head-to-head comparison with several multi-core CPU-based libraries, while achieving a 17% energy saving with comparable compression ratios.

37 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: IHK/McKernel is described, a hybrid software stack that seamlessly blends an LWK with Linux by selectively offloading system services from the lightweight kernel to Linux by focusing on transparent reuse of Linux device drivers and detail the design of the framework that enables the LWK to naturally leverage the Linux driver codebase without sacrificing scalability or the POSIX API.
Abstract: Extreme degree of parallelism in high-end computing requires low operating system noise so that large scale, bulk-synchronous parallel applications can be run efficiently. Noiseless execution has been historically achieved by deploying lightweight kernels (LWK), which, on the other hand, can provide only a restricted set of the POSIX API in exchange for scalability. However, the increasing prevalence of more complex application constructs, such as in-situ analysis and workflow composition, dictates the need for the rich programming APIs of POSIX/Linux. In order to comply with these seemingly contradictory requirements, hybrid kernels, where Linux and a lightweight kernel (LWK) are run side-by-side on compute nodes, have been recently recognized as a promising approach. Although multiple research projects are now pursuing this direction, the questions of how node resources are shared between the two types of kernels, how exactly the two kernels interact with each other, and to what extent they are integrated remain subjects of ongoing debate. In this paper, we describe IHK/McKernel, a hybrid software stack that seamlessly blends an LWK with Linux by selectively offloading system services from the lightweight kernel to Linux. Specifically, we focus on transparent reuse of Linux device drivers and detail the design of our framework that enables the LWK to naturally leverage the Linux driver codebase without sacrificing scalability or the POSIX API. Through rigorous evaluation on a medium-size cluster we demonstrate how McKernel provides consistent, isolated performance for simulations even in the face of competing, in-situ workloads.

36 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of minimizing the continuous valued total variation subject to different unary terms on trees and propose fast direct algorithms based on dynamic programming to solve these problems.
Abstract: We consider the problem of minimizing the continuous valued total variation subject to different unary terms on trees and propose fast direct algorithms based on dynamic programming to solve these problems. We treat both the convex and the nonconvex case and derive worst-case complexities that are equal to or better than existing methods. We show applications to total variation based two-dimensional image processing and computer vision problems based on a Lagrangian decomposition approach. The resulting algorithms are very efficient, offer a high degree of parallelism, and have memory requirements only on the order of the number of image pixels.

36 citations


Proceedings ArticleDOI
13 Jun 2016
TL;DR: A novel graph-based Complex Event Processing system GraphCEP is proposed and its performance is evaluated in the setting of two case studies from the DEBS Grand Challenge 2016.
Abstract: In recent years, the proliferation of highly dynamic graph-structured data streams fueled the demand for real-time data analytics. For instance, detecting recent trends in social networks enables new applications in areas such as disaster detection, business analytics or health-care. Parallel Complex Event Processing has evolved as the paradigm of choice to analyze data streams in a timely manner, where the incoming data streams are split and processed independently by parallel operator instances. However, the degree of parallelism is limited by the feasibility of splitting the data streams into independent parts such that correctness of event processing is still ensured. In this paper, we overcome this limitation for graph-structured data by further parallelizing individual operator instances using modern graph processing systems. These systems partition the graph data and execute graph algorithms in a highly parallel fashion, for instance using cloud resources. To this end, we propose a novel graph-based Complex Event Processing system GraphCEP and evaluate its performance in the setting of two case studies from the DEBS Grand Challenge 2016.

28 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: The approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.
Abstract: GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism -- such as flat or two-level parallelism -- and a degree of parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested parallelism at runtime. However, the effective use of DP must still be understood: a naive use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naive use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.

22 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This paper proposes a Workflow Partitioning for Energy Minimization (WPEM) algorithm that allows reducing the network energy consumption of the workflow and the total amount of data communication while achieving a high degree of parallelism.
Abstract: Energy consumption is emerging as a crucial new issue in Cloud Computing environments such as data centers. The problem of power consumption is especially challenging in the context of scientific workflow deployment in the Cloud, as such workflows trigger intensive computational tasks and data manipulation steps that cause excessive data movement over communication networks. For instance, it has been reported that network devices consume up to one-third of the total energy consumption of Cloud data centers. In this paper, we propose an energy-aware approach for scientific workflow scheduling in the Cloud. In the first step, we propose a Workflow Partitioning for Energy Minimization (WPEM) algorithm that reduces the network energy consumption of the workflow and the total amount of data communication while achieving a high degree of parallelism. In the second step, we use the Cat Swarm Optimization heuristic to schedule the generated partitions in order to minimize the workflow's overall energy consumption and execution time. We evaluated the proposed approach using three real cases of data-intensive workflows and compared it with other algorithms from the literature. The experimental results show that our proposal markedly reduces the network energy consumption of the tested workflows (saving up to 96% of network energy for memory-intensive workflows) as well as their overall energy consumption, while ensuring a reasonable execution time and using fewer Cloud resources.

Journal ArticleDOI
TL;DR: The proposed level based autonomic Workflow-and-Platform Aware (WPA) task clustering technique aims to achieve maximum possible parallelism among the tasks at a level of a workflow while minimizing the system overheads and resource wastage.

Journal ArticleDOI
TL;DR: A two-level space–time domain decomposition method for solving an inverse source problem associated with the time-dependent convection–diffusion equation in three dimensions that eliminates the sequential steps in the optimization outer loop and the inner forward and backward time marching processes, thus achieving a high degree of parallelism.
Abstract: As the number of processor cores on supercomputers becomes larger and larger, algorithms with a high degree of parallelism attract more attention. In this work, we propose a two-level space-time domain decomposition method for solving an inverse source problem associated with the time-dependent convection-diffusion equation in three dimensions. We introduce a mixed finite element/finite difference method and a one-level and a two-level space-time parallel domain decomposition preconditioner for the Karush-Kuhn-Tucker system induced by reformulating the inverse problem as an output least-squares optimization problem in the entire space-time domain. The new full space-time approach eliminates the sequential steps in the optimization outer loop and the inner forward and backward time marching processes, thus achieving a high degree of parallelism. Numerical experiments validate that this approach is effective and robust for recovering unsteady moving sources. We present strong scalability results obtained on a supercomputer with more than 1000 processors.

Posted Content
TL;DR: PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging, is proposed and can significantly reduce recovery time without compromising the efficiency of transaction processing.
Abstract: Main-memory database management systems (DBMS) can achieve excellent performance when processing massive volume of on-line transactions on modern multi-core machines. But existing durability schemes, namely, tuple-level and transaction-level logging-and-recovery mechanisms, either degrade the performance of transaction processing or slow down the process of failure recovery. In this paper, we show that, by exploiting application semantics, it is possible to achieve speedy failure recovery without introducing any costly logging overhead to the execution of concurrent transactions. We propose PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging. PACMAN leverages a combination of static and dynamic analyses to parallelize the log recovery: at compile time, PACMAN decomposes stored procedures by carefully analyzing dependencies within and across programs; at recovery time, PACMAN exploits the availability of the runtime parameter values to attain an execution schedule with a high degree of parallelism. As such, recovery performance is remarkably increased. We evaluated PACMAN in a fully-fledged main-memory DBMS running on a 40-core machine. Compared to several state-of-the-art database recovery mechanisms, PACMAN can significantly reduce recovery time without compromising the efficiency of transaction processing.
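The underlying idea, replaying log records concurrently when their write sets do not conflict, can be sketched as follows. This is not PACMAN's static/dynamic analysis of stored procedures; the log records, write sets, and replay functions are hypothetical.

```python
# Minimal sketch (not PACMAN itself): consecutive log records with pairwise
# disjoint write sets are grouped into a batch and replayed in parallel;
# batches replay in log order, so conflicting records stay serialized.
from concurrent.futures import ThreadPoolExecutor

log = [  # (txn_id, write_set, replay operation) -- hypothetical records
    (1, {"a"}, lambda db: db.update(a=db["a"] + 1)),
    (2, {"b"}, lambda db: db.update(b=7)),
    (3, {"a", "c"}, lambda db: db.update(c=db["a"])),
]

def schedule_batches(records):
    """Pack consecutive records with pairwise-disjoint write sets into one batch."""
    batches = []
    for rec in records:
        _, wset, _ = rec
        if batches and all(wset.isdisjoint(w) for _, w, _ in batches[-1]):
            batches[-1].append(rec)
        else:
            batches.append([rec])
    return batches

db = {"a": 0, "b": 0, "c": 0}
for batch in schedule_batches(log):               # batches replay in log order
    with ThreadPoolExecutor() as pool:            # records in a batch replay in parallel
        list(pool.map(lambda rec: rec[2](db), batch))
print(db)   # {'a': 1, 'b': 7, 'c': 1}
```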

Journal ArticleDOI
TL;DR: This paper proposes a new multi-scale semi-dense point tracker called Video Extruder, whose purpose is to fill the gap between short-term, dense motion estimation (optical flow) and long-term, sparse salient point tracking, and presents a new detector, including a new salience function with low computational complexity and a new selection strategy that yields a large number of keypoints.
Abstract: Two crucial aspects of general-purpose embedded visual point tracking are addressed in this paper. First, the algorithm should reliably track as many points as possible. Second, the computation should achieve real-time video processing, which is challenging on low-power embedded platforms. We propose a new multi-scale semi-dense point tracker called Video Extruder, whose purpose is to fill the gap between short-term, dense motion estimation (optical flow) and long-term, sparse salient point tracking. This paper presents a new detector, including a new salience function with low computational complexity and a new selection strategy that yields a large number of keypoints. Its density and reliability in mobile video scenarios are compared with those of the FAST detector. Then, a multi-scale matching strategy is presented, based on hybrid regional coarse-to-fine and temporal prediction, which provides robustness to large camera and object accelerations. Filtering and merging strategies are then used to eliminate most of the wrong or useless trajectories. Thanks to its high degree of parallelism, the proposed algorithm extracts beams of trajectories from the video very efficiently. We compare it with the state-of-the-art pyramidal Lucas-Kanade point tracker and show that, in short-range mobile video scenarios, it yields similar quality results, while being up to one order of magnitude faster. Three different parallel implementations of this tracker are presented, on multi-core CPU, GPU and ARM SoCs. On a commodity 2010 CPU, it can track 8,500 points in a 640 × 480 video at 150 Hz.

Proceedings Article
01 Mar 2016
TL;DR: Adaptively parallelized plans show optimal multi-core utilization and up to five times improvement compared to heuristically parallelized plans on the workload under evaluation.
Abstract: With the rise of multi-core CPU platforms, their optimal utilization for in-memory OLAP workloads using column store databases has become one of the biggest challenges. Some of the inherent limitations in the achievable query parallelism are due to the dependency of the degree of parallelism on the data skew, the overheads incurred by thread coordination, and the hardware resource limits. Finding the right balance between the degree of parallelism and the multi-core utilization is even trickier. It makes parallel plan generation using traditional query optimizers a complex task. In this paper we introduce adaptive parallelization, which exploits execution feedback to gradually increase the level of parallelism until we reach a sweet spot. After each query has been executed, we replace an expensive operator (or a sequence) by a faster parallel version, i.e. the query plan is morphed into a faster one. A convergence algorithm is designed to reach the optimum as quickly as possible. The approach is evaluated against a full-fledged column store using micro-benchmarks and a subset of the TPC-H and TPC-DS queries. It confirms the feasibility of the design and proves to be competitive against a statically optimized heuristic plan generator. Adaptively parallelized plans show optimal multi-core utilization and up to five times improvement compared to heuristically parallelized plans on the workload under evaluation.
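A toy version of the feedback loop described above: run the workload, measure, and keep doubling the degree of parallelism while the measured time still improves. run_query() and the convergence rule below are simplified assumptions, not the system's operator-level plan morphing.

```python
# Minimal sketch of feedback-driven parallelization: increase the degree of
# parallelism (DOP) until the measured runtime stops improving (a "sweet spot").
import time
from concurrent.futures import ProcessPoolExecutor

def chunk(_):
    """Hypothetical unit of query work."""
    return sum(i * i for i in range(200_000))

def run_query(dop):
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=dop) as pool:
        sum(pool.map(chunk, range(8)))          # 8 chunks of work, dop workers
    return time.perf_counter() - start

def adaptive_dop(max_dop=16, tolerance=0.95):
    best_dop, best_time = 1, run_query(1)
    dop = 2
    while dop <= max_dop:
        t = run_query(dop)
        if t < best_time * tolerance:           # still improving: keep morphing
            best_dop, best_time = dop, t
            dop *= 2
        else:                                   # converged to the sweet spot
            break
    return best_dop, best_time

if __name__ == "__main__":
    print(adaptive_dop())
```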

Patent
29 Aug 2016
TL;DR: In this paper, a hybrid parallelization of in-memory table scans is described, where the work for each granule is further parallelized by dividing the work granule into one or more tasks.
Abstract: Techniques are described herein for hybrid parallelization of in-memory table scans. Work for an in-memory scan is divided into granules based on a degree of parallelism. The granules are assigned to one or more processes. The work for each granule is further parallelized by dividing the work granule into one or more tasks. The tasks are assigned to one or more threads, the number of which can be dynamically adjusted.
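A minimal sketch of the two-level split the claim describes: the scan is first cut into granules according to a degree of parallelism, and each granule is further cut into tasks dispatched to threads. The row data, the filter predicate, and the granule/task/thread counts are illustrative assumptions.

```python
# Minimal sketch of a two-level parallel scan: granules per the DOP, then
# finer-grained tasks per granule, executed by a thread pool.
from concurrent.futures import ThreadPoolExecutor

rows = list(range(1_000_000))          # stand-in for an in-memory column

def split(seq, parts):
    step = -(-len(seq) // parts)       # ceiling division
    return [seq[i:i + step] for i in range(0, len(seq), step)]

def scan_task(chunk):
    return sum(1 for r in chunk if r % 7 == 0)   # hypothetical filter

def scan(dop=4, tasks_per_granule=8, threads=8):
    granules = split(rows, dop)                           # one granule per DOP slot
    matches = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for granule in granules:
            tasks = split(granule, tasks_per_granule)     # finer-grained tasks
            matches += sum(pool.map(scan_task, tasks))
    return matches

print(scan())   # 142858
```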

Journal ArticleDOI
TL;DR: A physics-based material model, the Jiles-Atherton model, is implemented in a GPU to compute the B-H hysteretic relationship, which can be directly incorporated in FE simulations and the performance of the GPU is compared with that of the given microprocessor in terms of computational time.
Abstract: Design engineers are always looking for extra computational power to speed up the execution of their tasks. One way to achieve this speedup is to identify tasks with a high degree of parallelism and process them with graphic processing units (GPUs). GPUs are optimized to process such tasks efficiently and quickly in massive multicore hardware. The steps involved in a finite-element (FE) electromagnetic simulation are computationally very expensive. One such step is the communication between FE solver and the material loss model that takes place for all the elements in the mesh for each time step. This task is massively parallel and, thus, could be executed in a GPU. As an example, a physics-based material model, the Jiles–Atherton model, is implemented in a GPU to compute the B-H hysteretic relationship, which can be directly incorporated in FE simulations. The performance of the GPU is compared with that of the given microprocessor in terms of computational time. A time gain of 13.8 times has been achieved.

Journal ArticleDOI
TL;DR: This work proposes a technique that employs the preprocessing of fault scenarios based on forecasting fault tendencies, which is performed with a fault threshold circuit operating in accordance with high-level software, and proposes methods for dissimilarity analysis of scenarios based on cross-correlation measurements of link fault matrices.

Journal ArticleDOI
18 Jun 2016
TL;DR: Experimental results show that the proposed software-hardware cooperative mechanism can effectively increase the opportunity of threads entering the critical section in the low-overhead spinning phase, reducing the competition overhead and accelerating the execution of the Region-of-Interest on average across all 25 benchmark programs.
Abstract: As the degree of parallelism increases, the performance of multi-threaded shared-variable applications is limited not only by serialized critical section execution, but also by the serialized competition overhead for threads to gain access to the critical section. As the number of concurrent threads grows, such competition overhead may exceed the time spent in the critical section itself and become the dominating factor limiting the performance of parallel applications. In modern operating systems, the queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In this paper, we show that this advanced locking solution may create very high competition overhead for multithreaded applications executing in NoC-based CMPs. We then propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins the critical section access in the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects how soon the thread must enter the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests and let the NoC prioritize locking request packets according to the RTR information. The principle is that the smaller the RTR a locking request packet carries, the higher the priority it gets and thus the quicker its delivery. We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using PARSEC (11 programs) and SPEC OMP2012 (14 programs) benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method can effectively increase the opportunity of threads entering the critical section in the low-overhead spinning phase, reducing the competition overhead by 39.9% on average (61.8% maximum) and accelerating the execution of the Region-of-Interest by 14.4% on average (24.5% maximum) across all 25 benchmark programs.
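For intuition, below is a toy spin-then-sleep lock in which a thread retries acquisition a bounded number of times before falling back to blocking; the remaining-times-of-retry (RTR) counter is the quantity the paper exposes to the NoC to prioritize lock-request packets. The hardware prioritization itself is not modeled, and the class is an illustrative assumption rather than the kernel's queue spinlock.

```python
# Toy two-phase lock: a cheap spinning phase with a bounded retry budget (RTR),
# then an expensive blocking phase. Illustrative only.
import threading

class SpinThenSleepLock:
    def __init__(self, max_spins=100):
        self._lock = threading.Lock()
        self.max_spins = max_spins

    def acquire(self):
        rtr = self.max_spins                     # remaining times of retry
        while rtr > 0:
            if self._lock.acquire(blocking=False):
                return True                      # won in the low-overhead phase
            rtr -= 1                             # smaller RTR: closer to sleeping
        self._lock.acquire()                     # high-overhead blocking phase
        return True

    def release(self):
        self._lock.release()

counter, lock = 0, SpinThenSleepLock()

def work():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1                             # critical section
        lock.release()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 40000
```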

Journal ArticleDOI
TL;DR: A novel numerical framework for pricing American options in high dimensions that processes an entire cross section of options in a single execution and offers an immediate solution to the estimation of hedging coefficients through finite differences, which brings valuable advantages over Monte Carlo simulations.
Abstract: We introduce a novel numerical framework for pricing American options in high dimensions. Such settings naturally arise for derivatives with multiple underlying assets, like basket options. They are equally important for single-asset options because high-dimensional models are best capable of capturing observed price dynamics. Yet, higher-dimensional settings come at the cost of a loss of tractability due to the accompanying exponential growth of computational complexity. Our scheme manages to alleviate the problem of dimension scaling through the use of adaptive sparse grids. We approximate the value function with a low number of points and recursively apply fast approximations of the expectation operator from an exercise period to the previous one. The algorithm copes with discretely spaced, possibly nonuniform, time grids. This makes it particularly fast for options with a limited number of exercise periods, like Bermudan options, and options for which the optimal exercise schedule is known ex ante. As compared to Monte Carlo simulations, our scheme processes an entire cross section of options in a single execution. It thereby offers an immediate solution to the estimation of hedging coefficients through finite differences and is ideal when multiple related options need to be analyzed. The algorithm is also capable of dealing with discrete dividends with no performance deterioration, thus improving on the documented inefficiency of exercise policies under continuous dividend yield approximations. We benchmark our algorithm under both the canonical model of Black and Scholes and the stochastic volatility model of Heston in the presence of discrete dividends. We illustrate the massive improvement of complexity scaling over dense grids with a basket option study including up to eight underlying assets. We show how the high degree of parallelism of our scheme makes it suitable for deployment on massively parallel computing units to scale to higher dimensions or further speed up the solution process.

Journal ArticleDOI
01 May 2016
TL;DR: In this paper, the authors present algorithms for parallel query optimization in left-deep and bushy plan spaces, where each worker returns the optimal plan in its partition to the master which determines the globally optimal plan from the partition-optimal plans.
Abstract: Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query evaluation. We show how to parallelize query optimization at a massive scale. We present algorithms for parallel query optimization in left-deep and bushy plan spaces. At optimization start, we divide the plan space for a given query into partitions of equal size that are explored in parallel by worker nodes. At the end of optimization, each worker returns the optimal plan in its partition to the master which determines the globally optimal plan from the partition-optimal plans. No synchronization or data exchange is required during the actual optimization phase. The amount of data sent over the network, at the start and at the end of optimization, as well as the complexity of serial steps within our algorithms increase only linearly in the number of workers and in the query size. The time and space complexity of optimization within one partition decreases uniformly in the number of workers. We parallelize single- and multi-objective query optimization over a cluster with 100 nodes in our experiments, using more than 250 concurrent worker threads (Spark executors). Despite high network latency and task assignment overheads, parallelization yields speedups of up to one order of magnitude for large queries whose optimization takes minutes on a single node.
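The partition-and-merge structure described above can be mimicked in a few lines: the space of join orders is split into equal-size partitions, each worker returns its partition-optimal plan, and the master keeps the cheapest. The cost model below is a deliberately crude stand-in, not the paper's.

```python
# Minimal sketch: partition the join-order space across workers, take the best
# plan per partition, then the global best. Cost model and tables are made up.
from itertools import permutations, islice
from multiprocessing import Pool

TABLES = ["A", "B", "C", "D", "E", "F"]
CARD = {"A": 10, "B": 1000, "C": 50, "D": 5000, "E": 20, "F": 300}

def plan_cost(order):
    """Toy left-deep cost: sum of intermediate sizes, assuming 1% join selectivity."""
    cost, inter = 0, CARD[order[0]]
    for t in order[1:]:
        inter = inter * CARD[t] // 100
        cost += inter
    return cost

def best_in_partition(args):
    start, step = args                     # worker explores every `step`-th plan
    plans = islice(permutations(TABLES), start, None, step)
    return min(plans, key=plan_cost)

if __name__ == "__main__":
    workers = 4
    with Pool(workers) as pool:            # each worker returns its partition-optimal plan
        partition_best = pool.map(best_in_partition, [(i, workers) for i in range(workers)])
    print(min(partition_best, key=plan_cost))   # master picks the global optimum
```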

Proceedings ArticleDOI
01 Jul 2016
TL;DR: This paper presents XL-STaGe, a cross-layer tool for traffic-inclusive directed acyclic graph generation and implementation, which consists of a set of processes that generate the task-graphs as well as a detailed process model for each node in each graph.
Abstract: This paper presents XL-STaGe, a cross-layer tool for traffic-inclusive directed acyclic graph generation and implementation. In contrast to other graph-generation tools, which focus on high-level DAG models, XL-STaGe consists of a set of processes that generate the task-graphs as well as a detailed process model for each node in each graph. The tool is highly customizable, with many parameters that can be tuned to meet the user's requirements to control the topology, connection density, degree of parallelism, and duration of the task-graph. Moreover, two use cases are presented: a high-level one, which illustrates the benefit of the developed tool in application mapping, and a circuit-level one, to verify the accuracy of the XL-STaGe process models when implemented in hardware.

Journal ArticleDOI
TL;DR: Efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design is presented, where a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700 is achieved.
Abstract: We present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where we achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed to find opportunities for exhibiting a higher degree of parallelism, where parallelism is quantified as the number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on the theoretical analysis of classical HT, an opportunity to re-arrange the computations in classical HT is identified, resulting in a Modified HT (MHT) that is shown to exhibit 1.33x higher parallelism than classical HT. Experiments on off-the-shelf multicore processors and General Purpose Graphics Processing Units (GPGPUs) for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for Dense Linear Algebra (DLA) and show that MHT achieves 1.3x better performance than a native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that, due to the re-arrangement of computations in MHT, the custom realization of MHT achieves a 12% larger performance improvement over multicore and GPGPUs than the improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs, which is counter-intuitive.

Journal ArticleDOI
TL;DR: A novel protocol for reconfiguring cloud applications, able to ensure communication between virtual machines and resolve dependencies by exchanging messages, (dis)connecting, and starting/stopping components in a specific order.

Journal ArticleDOI
TL;DR: TransMap stores only a single implementation and applies a series of transformations to the stored bitstream for remapping or parallelizing an application, and it offers significant reductions in configuration memory requirements compared to state-of-the-art compaction techniques.
Abstract: In the era of platforms hosting multiple applications with arbitrary inter-application communication and computation patterns, compile-time mapping decisions are neither optimal nor desirable. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remapping techniques displace or parallelize/serialize an application to optimize different parameters (e.g., utilization and energy). To implement the dynamic remapping, reconfigurable architectures commonly store multiple (compile-time generated) implementations of an application. Each implementation represents a different platform location and/or degree of parallelism. The optimal implementation is selected at run-time. However, the compile-time binding either incurs excessive configuration memory overheads and/or is unable to map/parallelize an application even when sufficient resources are available. As a solution to this problem, we present Transformation-based reMapping and parallelism (TransMap). TransMap stores only a single implementation and applies a series of transformations to the stored bitstream for remapping or parallelizing an application. Compared to the state of the art, in addition to simple relocation in horizontal/vertical directions, TransMap also allows an application to be rotated for mapping or parallelizing it in resource-constrained scenarios. By storing only a single implementation, TransMap offers significant reductions in configuration memory requirements (up to 73 percent for the tested applications) compared to state-of-the-art compaction techniques. Simulation results reveal that the additional flexibility reduces the energy requirements by 33 percent and enhances device utilization by 50 percent for the tested applications. Gate-level analysis reveals that TransMap incurs negligible silicon (0.2 percent of the platform) and timing (6 additional cycles per application) penalties.

Journal ArticleDOI
TL;DR: The performance of a Monte Carlo model for the simulation of electromagnetic wave propagation in particle-filled atmospheres has been evaluated for different CUDA versions and design approaches; the algorithm exhibits a high degree of parallelism, which allows favorable implementation in a GPU.
Abstract: The performance of a Monte Carlo model for the simulation of electromagnetic wave propagation in particle-filled atmospheres has been evaluated for different CUDA versions and design approaches. The proposed algorithm exhibits a high degree of parallelism, which allows favorable implementation in a GPU. Practical implementation aspects of the model, such as the use of the different types of memory present in a GPU, have also been explained and their impact assessed. A number of setups have been chosen in order to compare the performance of manually optimized versus Unified Virtual Memory (UVM) implementations for different CUDA versions. Features and the relative performance impact of the different options have been discussed, extracting practical hints and rules useful for speeding up CUDA programs.

01 Feb 2016
TL;DR: The most significant contributions on computation/communication overlapping are gathered, and a technical explanation of how such overlap can be achieved on modern supercomputers is provided.
Abstract: In High Performance Computing (HPC), minimizing communication overhead is one of the most important goals in order to achieve high performance. This is more important than ever on exascale platforms, where there will be a much higher degree of parallelism compared to petascale platforms, resulting in increased communication overhead with considerable impact on application execution time and energy expenses. A good strategy for containing this overhead is to hide communication costs by overlapping them with computation. Despite the increasing interest in achieving computation/communication overlapping, details about the reasons that prevent it from succeeding are not easy to find, leading to confusion and poor application optimization. The Message Passing Interface (MPI) library, a de-facto standard in the HPC world, has always provided non-blocking communication routines able, in theory, to achieve communication/computation overlapping. Unfortunately, several factors related to MPI's independent progress and the offload capability of the underlying network make this overlap hard to achieve. With the introduction of one-sided communication routines, providing high-quality MPI implementations able to progress communication independently is becoming as important as providing low-latency and high-bandwidth communication. In this paper, we gather the most significant contributions about computation/communication overlapping and provide a technical explanation of how such overlap can be achieved on modern supercomputers.
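A minimal mpi4py illustration of the non-blocking pattern discussed above: post Isend/Irecv, perform independent computation, then wait. Whether real overlap occurs still depends on asynchronous progress and network offload, exactly as argued above; the buffer size and the two-rank layout are assumptions.

```python
# Minimal non-blocking send/recv with computation in between. Run with e.g.:
#   mpiexec -n 2 python overlap.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                      # assumes exactly 2 ranks

send_buf = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty(1_000_000, dtype=np.float64)

requests = [comm.Isend(send_buf, dest=peer, tag=0),
            comm.Irecv(recv_buf, source=peer, tag=0)]

# Computation that does not depend on recv_buf can proceed while the messages
# are (ideally) progressed in the background by the MPI library or the NIC.
local = np.sin(send_buf).sum()

MPI.Request.Waitall(requests)        # communication must complete before using recv_buf
print(rank, local, recv_buf[0])
```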

Journal ArticleDOI
TL;DR: This work employs a per-application predictive power manager that autonomously controls the power states of the cores with the goal of energy efficiency, and allows the applications to lend their idle cores for a short time period to expedite other critical applications.
Abstract: We present a scalable Dynamic Power Management (DPM) scheme where malleable applications may change their degree of parallelism at run time depending upon the workload and performance constraints. We employ a per-application predictive power manager that autonomously controls the power states of the cores with the goal of energy efficiency. Furthermore, our DPM allows the applications to lend their idle cores for a short time period to expedite other critical applications. In this way, it allows for application-level scalability, while aiming at the overall system energy optimization. Compared to state-of-the-art centralized and distributed power management approaches, we achieve up to 58 percent (average ~15-20 percent) ED2P reduction.

Book ChapterDOI
01 Jan 2016
TL;DR: A novel set of concurrency-related source code metrics to be used as the basis for bug prediction methods is proposed; the approach is discussed with respect to the existing state of the art, and the research challenges that have to be addressed are outlined.
Abstract: As physical limits began to negate the assumption known as Moore’s law, chip manufacturers started focusing on multicore architectures as the main solution to improve the processing power of modern computers. Today, multicore CPUs are commonly found in servers, PCs, smartphones, cars, airplanes, and home appliances. As this happens, more and more programs are designed with some degree of parallelism to take advantage of these implicitly concurrent architectures. In this context, new challenges are presented to software engineers. For example, software validation becomes much more expensive (since testing concurrency is difficult), and strategies such as bug prediction could be used to better focus the effort during the development process. However, most of the existing bug prediction approaches have been designed with sequential programs in mind. In this paper, we propose a novel set of concurrency-related source code metrics to be used as the basis for bug prediction methods; we discuss our approach with respect to the existing state of the art, and we outline the research challenges that have to be addressed to realize our goal.