Showing papers presented at "Irregular Applications: Architectures and Algorithms" in 2017


Proceedings ArticleDOI
12 Nov 2017
TL;DR: This work proposes a novel work-efficient parallel algorithm for the DFS traversal of a directed acyclic graph (DAG) that outperforms sequential DFS on the CPU by up to 6x in the authors' experiments.
Abstract: Depth-First Search (DFS) is a pervasive algorithm, often used as a building block for topological sort, connectivity and planarity testing, among many other applications. We propose a novel work-efficient parallel algorithm for the DFS traversal of a directed acyclic graph (DAG). The algorithm traverses the entire DAG in a BFS-like fashion no more than three times. As a result, it finds the DFS pre-order (discovery) and post-order (finish time) as well as the parent relationship associated with every node in a DAG. We analyse the runtime and work complexity of this novel parallel algorithm. We also show that our algorithm is easy to implement and optimize for performance. In particular, we show that its CUDA implementation on the GPU outperforms sequential DFS on the CPU by up to 6x in our experiments.
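
The three-pass algorithm itself is not spelled out in the abstract, but its building block, a level-synchronous (BFS-like) sweep over a DAG in CSR form, is easy to sketch. The kernel below is an illustrative CUDA sketch of one such sweep, not the authors' algorithm; all names are ours.

    #include <cuda_runtime.h>

    // One BFS-like sweep over a DAG in CSR form (rowPtr/colIdx).
    // Each thread owns one frontier vertex and decrements the
    // remaining in-degree of its successors; a successor whose
    // in-degree drops to zero joins the next frontier.
    __global__ void sweepLevel(const int *rowPtr, const int *colIdx,
                               int *inDeg, const int *frontier,
                               int frontSize, int *nextFrontier,
                               int *nextSize) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= frontSize) return;
        int v = frontier[t];
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            if (atomicSub(&inDeg[u], 1) == 1) {   // last incoming edge
                int slot = atomicAdd(nextSize, 1);
                nextFrontier[slot] = u;
            }
        }
    }

A host loop relaunches this kernel with swapped frontier buffers until the frontier is empty; the paper's contribution is the extra bookkeeping that turns a constant number of such sweeps into DFS discovery and finish times.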

12 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: A load-balanced GPU kernel for computing the sparse matrix-vector (SpMV) product is proposed; making heavy use of the latest GPU programming features, it is shown to be superior to the most popular SpMV implementations.
Abstract: In this paper we propose a load-balanced GPU kernel for computing the sparse matrix-vector (SpMV) product. Making heavy use of the latest GPU programming features, we also achieve satisfactory performance for irregular and unbalanced matrices. In a performance comparison using 400 test matrices, we show the new kernel to be superior to the most popular SpMV implementations.
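
The abstract does not describe the balancing scheme itself. A common way to decouple work from row lengths, shown below as an illustrative CUDA sketch rather than the paper's kernel, is to parallelize over nonzeros instead of rows.

    // Nonzero-parallel SpMV: each thread handles one nonzero, so the
    // work is balanced no matter how skewed the row lengths are.
    // 'row' holds the COO-style row index of each nonzero, and y must
    // be zeroed beforehand. atomicAdd on double needs a
    // Pascal-or-newer GPU (compute capability 6.0+).
    __global__ void spmvNnz(int nnz, const int *row, const int *col,
                            const double *val, const double *x,
                            double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nnz)
            atomicAdd(&y[row[i]], val[i] * x[col[i]]);
    }

Production kernels typically replace the atomics with a segmented reduction, but the per-nonzero work assignment is the essence of load balancing for irregular matrices.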

11 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: A new optimization called context combining is introduced to further boost SGNS performance on multicore systems; the approach is 3.53x faster than the original multithreaded Word2Vec implementation and 1.28x faster than a recent parallel Word2Vec implementation.
Abstract: The Skip-gram with negative sampling (SGNS) method of Word2Vec is an unsupervised approach to map words in a text corpus to low-dimensional real vectors. The learned vectors capture semantic relationships between co-occurring words and can be used as inputs to many natural language processing and machine learning tasks. There are several high-performance implementations of the Word2Vec SGNS method. In this paper, we introduce a new optimization called context combining to further boost SGNS performance on multicore systems. For processing the One Billion Word benchmark dataset on a 16-core platform, we show that our approach is 3.53x faster than the original multithreaded Word2Vec implementation and 1.28x faster than a recent parallel Word2Vec implementation. We also show that our accuracy on benchmark queries is comparable to state-of-the-art implementations.
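
For readers unfamiliar with the kernel being optimized, the sketch below shows a single generic SGNS gradient step in host-side C++; it is background only, not the paper's context-combining scheme, and all names are ours.

    #include <cmath>
    #include <cstddef>

    // One SGNS update for a (word, sample) pair. label is 1 for the
    // true context word and 0 for a negative sample; lr is the
    // learning rate and dim the embedding dimension.
    void sgnsUpdate(float *wordVec, float *sampleVec,
                    std::size_t dim, int label, float lr) {
        float dot = 0.f;
        for (std::size_t k = 0; k < dim; ++k)
            dot += wordVec[k] * sampleVec[k];
        float g = lr * ((float)label - 1.f / (1.f + std::exp(-dot)));
        for (std::size_t k = 0; k < dim; ++k) {
            float w = wordVec[k];
            wordVec[k]   += g * sampleVec[k];
            sampleVec[k] += g * w;
        }
    }

Each step is a pair of length-dim vector operations; batching many steps that share data so they become denser matrix operations is the general direction such multicore optimizations take.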

11 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: This paper argues for letting thread states migrate freely from node to node, especially when migration is managed by hardware without requiring software intervention, with emphasis on supporting the growing classes of algorithms that exhibit significant sparsity, irregularity, and lack of locality in their memory reference patterns.
Abstract: Modern supercomputers have millions of cores, each capable of executing one or more threads of program execution. In these computers the site of execution for program threads rarely, if ever, changes from the node in which they were born. This paper discusses the advantages that may accrue when thread states migrate freely from node to node, especially when migration is managed by hardware without requiring software intervention. Emphasis is on supporting the growing classes of algorithms where there is significant sparsity, irregularity, and lack of locality in the memory reference patterns. Evidence is drawn from reformulation of several kernels into a migrating thread context approximating that of an emerging architecture with such capabilities.

10 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: This paper provides a quantitative study of the KNL for HPC proxy applications, including Lulesh, HPCG, AMG, and Hotspot, when using DDR4 and MCDRAM; the results indicate that HBM significantly improves the performance of memory-intensive applications, by as much as three times over DDR4 in the case of HPCG.
Abstract: The Intel Knights Landing (KNL) manycore chip includes 3D-stacked memory named MCDRAM, also known as High Bandwidth Memory (HBM), for parallel applications that need to scale to high thread counts. In this paper, we provide a quantitative study of the KNL for HPC proxy applications, including Lulesh, HPCG, AMG, and Hotspot, when using DDR4 and MCDRAM. The results indicate that HBM significantly improves the performance of memory-intensive applications, with HPCG running as much as three times faster than on DDR4; Lulesh and HPCG improve by as much as 40% and 200%, respectively. For the selected compute-intensive applications, the performance advantage of MCDRAM over DDR4 varies from 2% to 28%. We also observed that the crossover points, where MCDRAM starts outperforming DDR4, lie around 8 to 16 threads.
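
As background on how code targets MCDRAM in flat mode (our illustration, not the paper's methodology): the memkind library exposes HBM through its hbwmalloc interface, with a DDR4 fallback when no HBM is present, and unmodified binaries can instead be bound wholesale with numactl.

    #include <hbwmalloc.h>   /* from the memkind library */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 20;
        /* hbw_check_available() returns 0 when HBM can be allocated. */
        int useHbm = (hbw_check_available() == 0);
        double *a = useHbm ? (double *)hbw_malloc(n * sizeof(double))
                           : (double *)malloc(n * sizeof(double));
        if (!a) return 1;
        for (size_t i = 0; i < n; ++i) a[i] = (double)i;
        printf("a[42] = %f\n", a[42]);
        useHbm ? hbw_free(a) : free(a);
        return 0;
    }

For whole-program binding in flat mode, something like numactl --membind=<mcdram node> ./app works, since MCDRAM appears as a separate NUMA node.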

5 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: In this article, the authors investigate the use of the massive data-parallel computation capabilities of modern GPUs to solve the initial credit problem for Energy Games, presenting different parallel implementations on multi-core CPU and GPU systems.
Abstract: Quantitative games, where quantitative objectives are defined on weighted game arenas, provide natural tools for designing faithful models of embedded controllers. Instances of these games are the so-called Energy Games. Starting from a sequential baseline implementation, we investigate the use of the massive data-parallel computation capabilities of modern GPUs to solve the initial credit problem for Energy Games. We present different parallel implementations on multi-core CPU and GPU systems. Our solution outperforms the baseline implementation with speedups of up to 36x and obtains a faster convergence time on real-world graphs.
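
For context, the initial credit problem asks, for each vertex, for the least initial energy with which the controller can keep the running sum of edge weights nonnegative forever. The host-side sketch below is the classical sequential value iteration that such baselines implement; the names and the dead-end convention are ours.

    #include <vector>
    #include <algorithm>
    #include <limits>

    struct Edge { int to; long long w; };
    const long long TOP = std::numeric_limits<long long>::max();

    // f[v] = least initial credit that lets the controller survive
    // from v, or TOP if no finite credit suffices. owner[v] == 0 means
    // the controller moves (minimizes credit); 1 means the adversary
    // moves (maximizes it). Credits above 'bound' are treated as TOP.
    std::vector<long long> initialCredit(
            const std::vector<std::vector<Edge>> &adj,
            const std::vector<int> &owner, long long bound) {
        int n = (int)adj.size();
        std::vector<long long> f(n, 0);
        bool changed = true;
        while (changed) {             // Kleene iteration to a fixpoint
            changed = false;
            for (int v = 0; v < n; ++v) {
                long long best = owner[v] == 0 ? TOP : 0;
                for (const Edge &e : adj[v]) {
                    long long need = (f[e.to] == TOP) ? TOP
                                   : std::max(0LL, f[e.to] - e.w);
                    if (need > bound) need = TOP;
                    best = owner[v] == 0 ? std::min(best, need)
                                         : std::max(best, need);
                }
                if (best != f[v]) { f[v] = best; changed = true; }
            }
        }
        return f;
    }

The parallel formulations update many vertices concurrently, which is what makes convergence time on large real-world graphs the interesting metric.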

5 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: The optimal algorithm is used to scale the driving materials science application, which is shown to deliver over 17X speedup using 32 OpenMP threads on data sets containing many millions of atoms.
Abstract: In this short paper, we report the performance of multiple thread-parallel algorithms for spherical region queries on multicore architectures, motivated by a challenging data analytics application in materials science. The performance of two tree-based algorithms and of a naive algorithm is compared to identify the length scales at which these approaches perform optimally. The optimal algorithm is then used to scale the driving materials science application, which is shown to deliver over 17X speedup using 32 OpenMP threads on data sets containing many millions of atoms.
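
The naive variant is simple enough to state directly; here is an illustrative host-side OpenMP sketch (our code, not the paper's), against which the tree-based methods trade construction cost for per-query work.

    #include <vector>

    struct Atom { double x, y, z; };

    // Naive spherical region query: indices of all atoms within radius
    // r of center c. O(N) per query, but perfectly regular, which is
    // why it wins at large r. Compile with OpenMP enabled.
    std::vector<int> sphereQuery(const std::vector<Atom> &atoms,
                                 Atom c, double r) {
        std::vector<int> hits;
        double r2 = r * r;
        #pragma omp parallel
        {
            std::vector<int> local;
            #pragma omp for nowait
            for (int i = 0; i < (int)atoms.size(); ++i) {
                double dx = atoms[i].x - c.x, dy = atoms[i].y - c.y,
                       dz = atoms[i].z - c.z;
                if (dx * dx + dy * dy + dz * dz <= r2)
                    local.push_back(i);
            }
            #pragma omp critical
            hits.insert(hits.end(), local.begin(), local.end());
        }
        return hits;
    }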

3 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: This paper proposes progressive load balancing to manage progress imbalance in asynchronous algorithms dynamically, and shows that under these conditions the balanced asynchronous method outperforms synchronous, semi-synchronous and totally asynchronous implementations in terms of time to solution.
Abstract: Synchronisation in the presence of noise and hardware performance variability is a key challenge that prevents applications from scaling to large problems and machines. Using asynchronous or semi-synchronous algorithms can help overcome this issue, but at the cost of reduced stability or convergence rate. In this paper we propose progressive load balancing to manage progress imbalance in asynchronous algorithms dynamically. In our technique the balancing is done over time, not instantaneously. Using Jacobi iterations as a test case, we show that, with CPU performance variability present, this approach leads to a higher iteration rate and lower progress imbalance between parts of the solution space. We also show that under these conditions the balanced asynchronous method outperforms synchronous, semi-synchronous and totally asynchronous implementations in terms of time to solution.
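
As a reference point for the test case, one Jacobi sweep for a 1D Poisson system looks as follows (an illustrative sketch; the paper's progressive balancing, which gradually shifts rows from slow workers to fast ones instead of re-partitioning at a barrier, would sit in the layer that chooses lo and hi per worker over time).

    #include <vector>

    // One Jacobi sweep over rows [lo, hi) of the 1D Poisson system
    // -u[i-1] + 2u[i] - u[i+1] = b[i], with zero boundary values.
    // In the asynchronous setting each worker sweeps its own range at
    // its own pace, reading whatever neighbor values are current.
    void jacobiSweep(const std::vector<double> &b,
                     const std::vector<double> &u,
                     std::vector<double> &uNew, int lo, int hi) {
        for (int i = lo; i < hi; ++i) {
            double left  = (i > 0) ? u[i - 1] : 0.0;
            double right = (i + 1 < (int)u.size()) ? u[i + 1] : 0.0;
            uNew[i] = 0.5 * (b[i] + left + right);
        }
    }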

3 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: In this paper, the authors propose vertex grouping as a technique that enables a trade-off between memory consumption and work efficiency in a GPU-based iterative vertex-centric graph processing framework.
Abstract: The massive parallel processing power of GPUs has attracted researchers to develop iterative vertex-centric graph processing frameworks for GPUs. Enabling work-efficiency in these solutions, however, is not straightforward and comes at the cost of SIMD-inefficiency and load imbalance. This paper offers techniques that overcome these challenges when processing the graph on a GPU. For a SIMD-efficient kernel operation involving gathering of neighbors and performing reduction on them, we employ an effective task expansion strategy that avoids intra-warp thread underutilization. As recording vertex activeness requires additional data structures, to attenuate the graph storage overhead on limited GPU DRAM, we introduce vertex grouping as a technique that enables a trade-off between memory consumption and work efficiency in our solution. Our experiments show that these techniques provide up to 5.46x speedup over the recently proposed WS-VR [4] framework across multiple algorithms and inputs.
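
The first of those techniques, a SIMD-efficient gather-and-reduce over a vertex's neighbors, can be illustrated at warp granularity. The CUDA kernel below is a generic sketch (a plain sum, with names of our choosing), not the framework's task expansion machinery.

    // One warp cooperatively reduces over one vertex's CSR neighbor
    // list, so a long adjacency list no longer serializes on a single
    // thread. warpId is uniform across the warp, so the early return
    // and the shuffle loop are warp-uniform.
    __global__ void warpGatherReduce(const int *rowPtr,
                                     const int *colIdx,
                                     const float *val, float *out,
                                     int nVerts) {
        int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        int lane = threadIdx.x % 32;
        if (warpId >= nVerts) return;
        float acc = 0.f;
        for (int e = rowPtr[warpId] + lane; e < rowPtr[warpId + 1];
             e += 32)
            acc += val[colIdx[e]];
        for (int off = 16; off > 0; off >>= 1)  // warp-level reduction
            acc += __shfl_down_sync(0xffffffffu, acc, off);
        if (lane == 0) out[warpId] = acc;
    }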

2 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: This research focuses on mapping the compiler's instruction cost scheduling logic to hardware-managed concurrency controls in order to minimize pipeline stalls, introducing a thread context switching method that is managed directly via a set of hardware-based mechanisms coupled to the compiler's instruction scheduler.
Abstract: Given the increasing importance of efficient data-intensive computing, we find that modern processor designs are not well suited to the irregular memory access patterns found in these algorithms. This research focuses on mapping the compiler's instruction cost scheduling logic to hardware-managed concurrency controls in order to minimize pipeline stalls. In this manner, the hardware modules managing low-latency thread concurrency can be directly understood by modern compilers. We introduce a thread context switching method that is managed directly via a set of hardware-based mechanisms coupled to the compiler's instruction scheduler. As individual instructions from a thread execute, their respective cost is accumulated into a control register. Once the register reaches a pre-determined saturation point, the thread is forced to context switch. We evaluate the performance benefits of our approach using a series of 24 benchmarks, observing performance acceleration of up to 14.6X.
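
The mechanism is simple to state in software terms. Below is a toy model of the described behavior (the cost values and threshold are made up for illustration; the real mechanism lives in hardware next to the register file).

    #include <cstdint>

    // Each retired instruction deposits its compiler-assigned cost
    // into a per-thread control register; crossing the saturation
    // point forces the thread to yield its context.
    struct HwThreadModel {
        uint32_t costReg = 0;                  // accumulated cost
        static const uint32_t SATURATE = 256;  // pre-determined point
        bool retire(uint32_t instCost) {       // true => context switch
            costReg += instCost;
            if (costReg >= SATURATE) {
                costReg = 0;                   // reset on switch
                return true;
            }
            return false;
        }
    };
    // E.g., if a cache-missing load carries cost 64 and an ALU op
    // cost 1, a thread stuck on irregular loads yields quickly,
    // letting another ready thread hide the memory latency.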

2 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: This approach shows that an optimized and locality-aware parallel sFFT can perform 7x faster than the original sequential sFFT library on a multicore platform, and approximately 10x faster than the parallel FFTW.
Abstract: Fast Fourier Transform (FFT) is one of the most important numerical algorithms, widely used in numerous scientific and engineering computations. With the emergence of big data problems, however, it is challenging to acquire, process and store a sufficient amount of data to compute the FFT in the first place. The recently developed sparse FFT (sFFT) algorithm provides a solution to this problem. sFFT computes a compressed Fourier transform by using only a small subset of the input data, thus achieving significant performance improvement. While the increase in the number of cores and memory bandwidth on modern architectures provides an opportunity to improve performance through sophisticated parallel algorithm design, sFFT is inherently complex, and numerous challenges need to be addressed. Among all the challenges, sFFT falls into the category of irregular applications, in which memory access patterns are indirect and irregular and exhibit poor data locality. In this paper, we explore data layout transformation algorithms to tackle this challenge. Our approach shows that an optimized and locality-aware parallel sFFT can perform 7x faster than the original sequential sFFT library on a multicore platform. This optimized locality-aware parallel sFFT is also approximately 10x faster than the parallel FFTW.
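
The general shape of such a layout transformation, sketched generically below (our code, not the paper's sFFT-specific scheme), is to gather the irregularly indexed samples into a contiguous buffer once, so that later compute passes stream through memory with unit stride.

    #include <vector>
    #include <complex>
    #include <cstddef>

    // Pack the subset of samples a stage touches (an irregular index
    // pattern) into a contiguous buffer, trading one indirect gather
    // for unit-stride access in every subsequent pass.
    std::vector<std::complex<double>> gatherSamples(
            const std::vector<std::complex<double>> &signal,
            const std::vector<std::size_t> &indices) {
        std::vector<std::complex<double>> packed(indices.size());
        for (std::size_t k = 0; k < indices.size(); ++k)
            packed[k] = signal[indices[k]];
        return packed;
    }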