
Showing papers presented at "Irregular Applications: Architectures and Algorithms" (2018)


Proceedings ArticleDOI
01 Nov 2018
TL;DR: Demonstrates the utility and implementation of software prefetching in an unstructured finite-volume CFD code of a size and complexity representative of an industrial application, across a number of processors, and shows significant full-application speed-ups.
Abstract: Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access patterns that are often more difficult to identify in hardware. An alternative for such workloads is software prefetching, where special non-blocking instructions load data into the cache hierarchy. However, there are currently few examples in the literature of how to incorporate such software prefetches into existing applications with positive results. This paper addresses these issues by demonstrating the utility and implementation of software prefetching in an unstructured finite volume CFD code of a size and complexity representative of an industrial application, across a number of processors. We present the benefits of auto-tuning for finding the optimal prefetch distance values across different computational kernels and architectures, and demonstrate the importance of choosing the right prefetch destination across the available cache levels for best performance. We discuss the impact of the data layout on the number of prefetch instructions required in kernels with indirect access patterns, and show how to integrate them on top of existing optimisations such as vectorisation. Through this we show significant full-application speed-ups on a range of processors, such as the Intel Xeon Skylake CPU (15%), as well as on the in-order Intel Xeon Phi Knights Corner (1.99x) and the out-of-order Knights Landing (33%) many-core processors.
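A minimal sketch of the prefetching pattern the abstract describes, for an indirect-access gather loop typical of unstructured mesh codes. This is our illustration, not the paper's code: the kernel, variable names, and the prefetch distance PF_DIST are placeholders (the paper auto-tunes the distance and the target cache level per kernel and architecture). It uses the GCC/Clang __builtin_prefetch intrinsic.

```cpp
#include <cstddef>
#include <vector>

constexpr std::ptrdiff_t PF_DIST = 32;  // hypothetical value; to be auto-tuned

// Gather-style edge loop: edge_nodes[i] indexes node_data indirectly, so a
// hardware prefetcher cannot predict the access stream. `out` must be sized
// to edge_nodes.size() by the caller.
void edge_kernel(const std::vector<int>& edge_nodes,
                 const std::vector<double>& node_data,
                 std::vector<double>& out) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(edge_nodes.size());
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n) {
            // Non-blocking prefetch of the *indirectly* addressed operand.
            // The third argument (0..3) hints the destination cache level;
            // per the paper, choosing this destination also matters.
            __builtin_prefetch(&node_data[edge_nodes[i + PF_DIST]], 0, 3);
        }
        out[i] = 2.0 * node_data[edge_nodes[i]];
    }
}
```

The index array edge_nodes itself is read sequentially, so looking PF_DIST ahead into it is cheap; only the dependent, irregular load into node_data needs the explicit prefetch.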

9 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: Presents a new approach to merge sort using vector instructions that not only surpasses the SIMD-based bitonic counterpart but is over 2.94x faster than a traditional merge, merging over 300M keys per second on one thread and over 16B keys per second in parallel.
Abstract: Merging and sorting algorithms are the backbone of many modern computer applications. As such, efficient implementations are desired. Recent architectural advancements in CPUs (Central Processing Units), such as wider and more powerful vector instructions, allow for algorithmic improvements. This paper presents a new approach to merge sort using vector instructions. Traditional approaches to vectorized sorting typically utilize a bitonic sorting network (Batcher's Algorithm), which adds significant overhead. Our approach eliminates this overhead. We start with a branch-avoiding merge algorithm and then use the Merge Path algorithm to split up merging between the different SIMD lanes. Testing demonstrates that the algorithm not only surpasses the SIMD-based bitonic counterpart, but is over 2.94x faster than a traditional merge, merging over 300M keys per second in one thread and over 16B keys per second in parallel. Our new sort is over 5x faster than quicksort and 2x faster than Intel's IPP library sort, sorting over 5.3M keys per second on a single processor and over 500M keys per second in parallel, a speedup of over 2x over a traditional merge sort.
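For intuition, here is a scalar sketch (ours, not the authors' code) of the branch-avoiding merge step the abstract starts from: each iteration emits the smaller head element and advances exactly one cursor using arithmetic on the comparison result rather than an unpredictable branch. In the paper, Merge Path then assigns each SIMD lane an independent (a, b) starting position so the lanes merge disjoint segments in parallel.

```cpp
#include <cstdint>
#include <vector>

void branchless_merge(const std::vector<uint32_t>& A,
                      const std::vector<uint32_t>& B,
                      std::vector<uint32_t>& out) {
    out.resize(A.size() + B.size());
    size_t a = 0, b = 0, o = 0;
    while (a < A.size() && b < B.size()) {
        const uint32_t av = A[a], bv = B[b];
        const bool take_a = av <= bv;   // typically compiles to a flag/cmov, not a branch
        out[o++] = take_a ? av : bv;
        a += take_a;                    // advance exactly one of the two cursors
        b += !take_a;
    }
    while (a < A.size()) out[o++] = A[a++];  // drain the remaining input
    while (b < B.size()) out[o++] = B[b++];
}
```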

5 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: Observes that estimating the runtime cost and likelihood of success of a constrained walk in a labeled graph can inform key optimization decisions in a graph-pruning pattern-matching pipeline, and proposes a preliminary solution for making such estimates.
Abstract: We have developed [Reza et al. SC'18] a highly scalable algorithmic pipeline for pattern matching in labeled graphs and demonstrated it on trillion-edge graphs. This pipeline: (i) supports arbitrary search patterns, (ii) identifies all the vertices and edges that participate in matches - offering 100% precision and recall, and (iii) supports realistic data analytics scenarios. This pipeline is based on graph pruning: it decomposes the search template into individual constraints and uses them to repeatedly prune the graph to a final solution. Our current solution, however, makes a number of ad-hoc, intuition-based decisions that impact performance. In a nutshell, these relate to: (i) constraint selection - which constraints to generate? (ii) constraint ordering - in which order to use them? and (iii) individual constraint generation - how best to verify them? This position paper makes the observation that, by estimating the runtime cost and likelihood of success of a constrained walk in a labeled graph, one can inform these optimization decisions. We propose a preliminary solution to make these estimates, and demonstrate - using a prototype shared-memory implementation - that this: (i) is feasible with low overheads, and (ii) offers accurate enough information to optimize our pruning pipeline by a significant margin.
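A hedged sketch of the kind of estimate the position paper argues for (our construction, not the paper's estimator): approximate the expected number of surviving partial matches for a constrained walk from per-label vertex frequencies and the average degree, under the strong assumption that labels are distributed independently of the topology.

```cpp
#include <cstdint>
#include <vector>

struct GraphStats {
    uint64_t num_vertices;
    double   avg_degree;
    std::vector<double> label_freq;  // label_freq[l] = fraction of vertices with label l
};

// Expected number of partial matches alive after following a walk whose
// steps require the labels in walk_labels, in order (walk_labels non-empty).
// The result serves both as a success-likelihood signal and as a proxy for
// the walk's runtime cost, which is what constraint ordering needs.
double expected_matches(const GraphStats& g, const std::vector<int>& walk_labels) {
    double alive = static_cast<double>(g.num_vertices) * g.label_freq[walk_labels[0]];
    for (size_t i = 1; i < walk_labels.size(); ++i)
        alive *= g.avg_degree * g.label_freq[walk_labels[i]];
    return alive;
}
```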

5 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This work addresses the acceleration of the PageRank algorithm for web information retrieval on graphics processing units (GPUs) via a modular precision framework that adapts the data format in memory to the numerical requirements as the iteration converges.
Abstract: We address the acceleration of the PageRank algorithm for web information retrieval on graphics processing units (GPUs) via a modular precision framework that adapts the data format in memory to the numerical requirements as the iteration converges. In detail, we abandon the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank, and instead store the data in memory in specialized formats. Furthermore, we avoid data duplication by leveraging a data layout based on mantissa segmentation. Our evaluation on a V100 graphics card from NVIDIA shows acceleration factors of up to 30% with respect to the standard algorithm operating in double precision.
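An illustrative sketch of mantissa segmentation (our reading of the idea; the paper's exact formats may differ): each IEEE 754 double is split into a 32-bit "head" (sign, exponent, and the top mantissa bits) and a 32-bit "tail". Early, low-accuracy iterations read only the head array, roughly halving memory traffic; once the iteration needs full accuracy, both segments are read and recombined, so no duplicate copy of the vector is ever stored.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct SegmentedVector {
    std::vector<uint32_t> head;  // upper 32 bits of each double
    std::vector<uint32_t> tail;  // lower 32 bits of each double
};

void split(const std::vector<double>& x, SegmentedVector& s) {
    s.head.resize(x.size());
    s.tail.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        uint64_t bits;
        std::memcpy(&bits, &x[i], sizeof bits);  // type-pun safely via memcpy
        s.head[i] = static_cast<uint32_t>(bits >> 32);
        s.tail[i] = static_cast<uint32_t>(bits);
    }
}

// Low-precision read path: reconstruct from the head segment only
// (the tail mantissa bits are implicitly zero).
double read_low(const SegmentedVector& s, size_t i) {
    const uint64_t bits = static_cast<uint64_t>(s.head[i]) << 32;
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Full-precision read path: combine both segments into the original double.
double read_full(const SegmentedVector& s, size_t i) {
    const uint64_t bits = (static_cast<uint64_t>(s.head[i]) << 32) | s.tail[i];
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}
```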

5 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This study finds that initially distributing work evenly across the system is inadequate to maintain load balancing over time due to the migratory nature of Emu threads, and demonstrates that known matrix reordering techniques can improve SpMV performance on the Emu architecture by as much as 70% by encouraging more consistent load balancing.
Abstract: Achieving high performance for sparse applications is challenging due to irregular access patterns and weak locality. These properties preclude many static optimizations and degrade cache performance on traditional systems. To address these challenges, novel systems such as the Emu architecture have been proposed. The Emu design uses light-weight migratory threads, narrow memory, and near-memory processing capabilities to address weak locality and reduce the total load on the memory system. Because the Emu architecture is fundamentally different from cache-based hierarchical memory systems, it is crucial to understand the cost-benefit tradeoffs of standard sparse algorithm optimizations on Emu hardware. In this work, we explore sparse matrix-vector multiplication (SpMV) on the Emu architecture. We investigate the effects of different sparse optimizations such as dense vector data layouts, work distributions, and matrix reorderings. Our study finds that initially distributing work evenly across the system is inadequate to maintain load balance over time due to the migratory nature of Emu threads. In severe cases, matrix sparsity patterns produce hot-spots as many migratory threads converge on a single resource. We demonstrate that known matrix reordering techniques can improve SpMV performance on the Emu architecture by as much as 70% by encouraging more consistent load balance. This compares with a performance gain of no more than 16% on a cache-based memory system.
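For reference, a minimal CSR SpMV sketch with a row permutation applied (written for ordinary hardware, not Emu, and not the authors' code): perm maps new row positions to original rows, which is how a reordering such as RCM or degree sorting - assumed computed elsewhere - is consumed by the kernel. The indirect read of x through col_idx is the weak-locality access the abstract discusses.

```cpp
#include <vector>

void spmv_permuted(const std::vector<int>& row_ptr,
                   const std::vector<int>& col_idx,
                   const std::vector<double>& val,
                   const std::vector<int>& perm,   // new position -> original row
                   const std::vector<double>& x,
                   std::vector<double>& y) {
    const int n = static_cast<int>(perm.size());
    for (int r = 0; r < n; ++r) {
        const int row = perm[r];               // visit rows in reordered sequence
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += val[k] * x[col_idx[k]];     // irregular, data-dependent access
        y[row] = sum;
    }
}
```

On Emu, the same reordering changes which memory-side resources the migratory threads converge on, which is why it helps load balance far more there (up to 70%) than on a cache-based machine (up to 16%).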

5 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: Concludes that mix-and-match BFS is a competitive approach for fast graph traversal, one that is easily extended with more BFS implementations and easily adapted to other types of processors or specific types of graphs.
Abstract: It is universally accepted that the performance of graph algorithms is heavily dependent on the algorithm, the execution platform, and the structure of the input graph. This variability remains difficult to predict and hinders the choice of the right algorithm for a given problem. In this work, we focus on a case study: breadth-first search (BFS), a level-based graph traversal algorithm, running on GPUs. We first demonstrate the severity of this variability by presenting 32 variations of 5 implementation strategies for GPU-enabled BFS, and showing how selecting one single algorithm for the entire traversal can significantly limit performance. To alleviate these performance losses, we propose to mix-and-match, at runtime, different algorithms to compose the best-performing BFS traversal. Our approach is based on two novel elements: a predictive model, based on a decision tree, which dynamically selects the best-performing algorithm for each BFS level, and a quick context switch between algorithms, which limits the overhead of the combined BFS. We demonstrate empirically that our dynamic switching BFS achieves better performance, outperforming our non-switching implementations by 2x and existing state-of-the-art GPU BFS implementations by 3x. We conclude that mix-and-match BFS is a competitive approach for fast graph traversal, while being easily extended to include more BFS implementations and easily adaptable to other types of processors or specific types of graphs.
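A simplified sketch of the per-level switching idea (CPU-side and ours, not the paper's GPU code): where the paper consults a trained decision tree each level, a single hand-written rule on frontier size stands in for the model, switching between top-down and bottom-up level kernels as in direction-optimizing BFS. The graph is assumed symmetric so the bottom-up step can scan for parents.

```cpp
#include <utility>
#include <vector>

struct CSR {
    std::vector<int> row_ptr, col;  // symmetric adjacency in CSR form
};

std::vector<int> bfs_switching(const CSR& g, int src) {
    const int n = static_cast<int>(g.row_ptr.size()) - 1;
    std::vector<int> level(n, -1);
    std::vector<int> frontier{src};
    level[src] = 0;
    for (int d = 0; !frontier.empty(); ++d) {
        std::vector<int> next;
        // Stand-in for the decision-tree model: go bottom-up once the
        // frontier covers a sizeable fraction of the graph.
        if (frontier.size() * 20 < static_cast<size_t>(n)) {
            // Top-down: expand each frontier vertex's neighbours.
            for (int u : frontier)
                for (int k = g.row_ptr[u]; k < g.row_ptr[u + 1]; ++k) {
                    const int v = g.col[k];
                    if (level[v] < 0) { level[v] = d + 1; next.push_back(v); }
                }
        } else {
            // Bottom-up: each unvisited vertex searches for a visited parent.
            for (int v = 0; v < n; ++v) {
                if (level[v] >= 0) continue;
                for (int k = g.row_ptr[v]; k < g.row_ptr[v + 1]; ++k)
                    if (level[g.col[k]] == d) { level[v] = d + 1; next.push_back(v); break; }
            }
        }
        frontier = std::move(next);
    }
    return level;
}
```

The paper's "quick context switch" corresponds to the point between levels where frontier and visited representations would be translated if the two kernels used different data structures.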

4 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: Designs and implements a NUMA-aware graph processing framework that embraces design philosophies of distributed graph processing systems, in particular explicit partitioning and inter-partition communication, while exploiting optimization opportunities specific to single-node systems.
Abstract: Modern shared-memory systems embrace the NUMA architecture, which has proven more scalable than the SMP architecture. In many ways, a NUMA system resembles a shared-nothing distributed system: physically distinct processing units and memory regions. Memory accesses to remote NUMA domains are more expensive than local accesses. This poses the opportunity to transfer the know-how and design of distributed graph processing to develop shared-memory graph processing solutions optimized for NUMA systems. To this end, we explore whether a distributed-memory-style middleware that makes graph partitioning and communication between partitions explicit can improve performance on a NUMA system. We design and implement a NUMA-aware graph processing framework that embraces design philosophies of distributed graph processing systems, in particular explicit partitioning and inter-partition communication, while at the same time exploiting optimization opportunities specific to single-node systems. We demonstrate up to 13.9x speedup over a state-of-the-art NUMA-aware framework, Polymer, and up to 3.7x scalability on a four-socket NUMA machine using graphs with tens of billions of edges.
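A hedged sketch of the explicit-partitioning pattern the abstract describes (our illustration; the framework's actual structures are not public in this listing): each NUMA domain owns a contiguous, equal-sized vertex range plus its out-edges, and cross-partition updates go through per-destination message buffers that are exchanged between supersteps, so a remote domain is touched with sequential scans instead of random accesses. Binding each partition's memory and worker thread to its domain (e.g. via libnuma and thread affinity) is omitted here.

```cpp
#include <cstdint>
#include <vector>

struct Update { uint32_t dst; double value; };

struct Partition {
    uint32_t begin, end;                      // owned contiguous vertex range
    std::vector<std::vector<uint32_t>> out;   // out-edges of owned vertices
    std::vector<double> state;                // per-vertex state, local to this domain
    std::vector<std::vector<Update>> outbox;  // one buffer per destination partition
};

// One superstep: generate updates into local outboxes, then apply each buffer
// on its owner. Assumes outbox is pre-sized to parts.size() for every
// partition, and vertex v is owned by partition v / verts_per_part.
void superstep(std::vector<Partition>& parts, uint32_t verts_per_part) {
    for (auto& p : parts)
        for (uint32_t v = p.begin; v < p.end; ++v)
            for (uint32_t dst : p.out[v - p.begin])
                p.outbox[dst / verts_per_part].push_back({dst, p.state[v - p.begin]});
    for (auto& p : parts)
        for (size_t q = 0; q < parts.size(); ++q) {
            for (const Update& u : p.outbox[q])
                parts[q].state[u.dst - parts[q].begin] += u.value;  // owner-side apply
            p.outbox[q].clear();
        }
}
```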

3 citations


Proceedings ArticleDOI
12 Nov 2018
TL;DR: Presents an approach based on ad-hoc blocking and reordering strategies that enables local, independent, collective-oriented processing of small dense blocks, delivering robust preconditioners that, in the authors' experiments, obtain an average speedup of ~6x even for tough matrices from optimization problems.
Abstract: Large sparse symmetric indefinite matrices are notoriously hard to precondition. They often lack diagonal dominance and exhibit Schur complements that render zero fill-in preconditioning ineffective. Pivoting, a necessity for stable LDL^T factorizations, complicates parallel approaches that could take advantage of the latest massively parallel HPC hardware such as GPUs. We present an approach based on ad-hoc blocking and reordering strategies that allows local, independent, collective-oriented processing of small dense blocks. A hybrid block-memory layout compensates for the irregular memory access patterns found in sparse matrices. Our method allows restricted fill-in, supernodal pivoting, and a dual-threshold dropping strategy at little additional cost. It delivers robust preconditioners that in our experiments obtain an average speedup of ~6x even for tough matrices from optimization problems.
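To make the dual-threshold dropping strategy concrete, here is a sketch (ours, in the style of the classic ILUT rule, which the paper's variant may refine): within one computed factor row, entries smaller than tau times the row norm are dropped, then at most p of the survivors are kept by magnitude, bounding fill-in.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Entry { int col; double val; };

// Dual-threshold drop: relative magnitude threshold tau, then a hard cap of
// p entries per row, keeping the largest by absolute value.
void dual_threshold_drop(std::vector<Entry>& row, double tau, std::size_t p) {
    double norm = 0.0;
    for (const Entry& e : row) norm += e.val * e.val;
    const double cutoff = tau * std::sqrt(norm);
    row.erase(std::remove_if(row.begin(), row.end(),
                             [cutoff](const Entry& e) { return std::fabs(e.val) < cutoff; }),
              row.end());
    if (row.size() > p) {
        // Move the p largest-magnitude survivors to the front, then truncate.
        std::partial_sort(row.begin(), row.begin() + p, row.end(),
                          [](const Entry& a, const Entry& b) {
                              return std::fabs(a.val) > std::fabs(b.val);
                          });
        row.resize(p);
    }
}
```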

2 citations