
Showing papers by "Michael Garland" published in 2016


Proceedings ArticleDOI
Duane Merrill, Michael Garland
13 Nov 2016
TL;DR: This work presents a strictly balanced method for the parallel computation of sparse matrix-vector products (SpMV) that operates directly upon the Compressed Sparse Row (CSR) sparse matrix format without preprocessing, inspection, reformatting, or supplemental encoding.
Abstract: We present a strictly balanced method for the parallel computation of sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format without preprocessing, inspection, reformatting, or supplemental encoding. Regardless of nonzero structure, our equitable 2D merge-based decomposition tightly bounds the workload assigned to each processing element. Furthermore, our technique is suitable for recursively partitioning CSR datasets themselves into multi-scale, distributed, NUMA, and GPU environments that are constrained by fixed-size local memories. We evaluate our method on both CPU and GPU microarchitectures across a very large corpus of diverse sparse matrix datasets. We show that traditional CsrMV methods are inconsistent performers, often subject to order-of-magnitude performance variation across similarly-sized datasets. In comparison, our method provides predictable performance that is substantially uncorrelated with the distribution of nonzeros among rows and broadly improves upon that of current CsrMV methods.
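
The merge-based decomposition in this abstract can be pictured concretely: the CSR row-end offsets are logically merged with the nonzero indices 0..nnz-1, the resulting merge path of length (rows + nnz) is cut into equal-size chunks, and each chunk's starting (row, nonzero) coordinate is found by binary search along a diagonal of the merge grid. The sequential C++ sketch below follows that outline under simplifying assumptions; the names and the per-chunk carry fix-up are illustrative, not the authors' code.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Coord { int64_t row, nz; };   // position on the 2D merge grid

// Binary search along one diagonal of the grid formed by merging the
// row-end offsets row_end[0..m) with the nonzero indices 0..nnz-1.
static Coord merge_path_search(int64_t diagonal,
                               const std::vector<int64_t>& row_end,
                               int64_t nnz)
{
    int64_t lo = std::max<int64_t>(diagonal - nnz, 0);
    int64_t hi = std::min<int64_t>(diagonal, (int64_t)row_end.size());
    while (lo < hi) {
        int64_t mid = (lo + hi) / 2;
        if (row_end[mid] <= diagonal - mid - 1) lo = mid + 1; else hi = mid;
    }
    return {lo, diagonal - lo};
}

// y = A*x (y must be sized to the row count).  The merge path is split into
// num_chunks equal pieces; each piece is independent, and a row straddling a
// chunk boundary leaves a partial sum ("carry") applied in a fix-up pass.
void merge_csr_spmv(const std::vector<int64_t>& row_end,   // end offset of each row
                    const std::vector<int64_t>& col_idx,
                    const std::vector<double>&  val,
                    const std::vector<double>&  x,
                    std::vector<double>&        y,
                    int                         num_chunks)
{
    const int64_t m = (int64_t)row_end.size(), nnz = (int64_t)val.size();
    const int64_t path_len = m + nnz;
    const int64_t chunk = (path_len + num_chunks - 1) / num_chunks;
    std::vector<std::pair<int64_t, double>> carry(num_chunks);

    for (int c = 0; c < num_chunks; ++c) {   // each iteration maps to one thread on a real target
        Coord s = merge_path_search(std::min<int64_t>((int64_t)c * chunk, path_len), row_end, nnz);
        Coord e = merge_path_search(std::min<int64_t>((int64_t)(c + 1) * chunk, path_len), row_end, nnz);
        double sum = 0.0;
        int64_t nz = s.nz;
        for (int64_t r = s.row; r < e.row; ++r) {                // rows finished inside this chunk
            for (; nz < row_end[r]; ++nz) sum += val[nz] * x[col_idx[nz]];
            y[r] = sum;
            sum = 0.0;
        }
        for (; nz < e.nz; ++nz) sum += val[nz] * x[col_idx[nz]];  // trailing partial row
        carry[c] = {e.row, sum};
    }
    for (auto& [r, v] : carry)               // fix-up: add boundary partials to their rows
        if (r < m) y[r] += v;
}
```

Because every chunk covers the same amount of merge path, no chunk can be overloaded by an arbitrarily long row or by a run of empty rows, which is the balance property the abstract claims.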

95 citations


Proceedings ArticleDOI
Duane Merrill, Michael Garland
27 Feb 2016
TL;DR: It is shown that traditional CsrMV methods are inconsistent performers subject to order-of-magnitude slowdowns, whereas the performance response of the proposed method is substantially impervious to row-length heterogeneity.
Abstract: We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for general-purpose sparse linear algebra computations. Our CsrMV performs an equitable multi-partitioning of the input dataset, ensuring that no single thread can be overwhelmed by assignment to (a) arbitrarily-long rows or (b) an arbitrarily-large number of zero-length rows. This parallel decomposition requires neither offline preprocessing nor specialized/ancillary data formats. We evaluate our method on both CPU and GPU microarchitectures across an enormous corpus of diverse real-world matrix datasets. We show that traditional CsrMV methods are inconsistent performers subject to order-of-magnitude slowdowns, whereas the performance response of our method is substantially impervious to row-length heterogeneity.
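
For contrast, the "traditional CsrMV" baseline the abstract critiques is typically the row-parallel formulation, in which rows rather than work are divided among threads. A minimal sketch (illustrative only) shows why a single very long row, or a thread stuck with nothing but empty rows, produces the imbalance described here:

```cpp
#include <cstdint>
#include <vector>

// Conventional row-parallel CsrMV: rows, not work, are split evenly,
// so one thread assigned an arbitrarily long row sets the critical path.
void row_parallel_csr_spmv(const std::vector<int64_t>& row_offsets,  // size m + 1
                           const std::vector<int64_t>& col_idx,
                           const std::vector<double>&  val,
                           const std::vector<double>&  x,
                           std::vector<double>&        y)
{
    const int64_t m = (int64_t)row_offsets.size() - 1;
    #pragma omp parallel for schedule(static)
    for (int64_t r = 0; r < m; ++r) {
        double sum = 0.0;
        for (int64_t nz = row_offsets[r]; nz < row_offsets[r + 1]; ++nz)
            sum += val[nz] * x[col_idx[nz]];
        y[r] = sum;
    }
}
```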

53 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: This work defines a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process.
Abstract: Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly queried at runtime once the execution context is known. In this paper, we define a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process. We pose this as a multi-task learning problem, where each source architecture corresponds to a task; we use device features in the construction of the variant selection model. This work explores the effectiveness of multi-task learning and the impact of different strategies for device feature selection. We evaluate our approach on a set of benchmarks and a collection of six NVIDIA GPU architectures from three distinct generations. We achieve performance results that are mostly comparable to the previous approach of tuning for a single GPU architecture without having to repeat the learning phase.
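
The runtime side of variant selection can be pictured as a model query over device and input features. The sketch below is only an illustration: a nearest-neighbor lookup over (feature vector, best variant) points profiled on the source architectures stands in for the paper's multi-task learning model, and every type name is hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

struct FeatureVector { std::vector<double> v; };  // normalized device + input features

struct TrainingPoint {
    FeatureVector features;      // measured on one of the source architectures
    int           best_variant;  // index of the fastest code variant in that context
};

class VariantSelector {
public:
    explicit VariantSelector(std::vector<TrainingPoint> table) : table_(std::move(table)) {}

    // Predict a variant for an unseen (target architecture, input) context
    // without retraining: pick the variant of the closest profiled point.
    int predict(const FeatureVector& query) const {
        double best = std::numeric_limits<double>::max();
        int choice = 0;
        for (const auto& p : table_) {
            double d = 0.0;
            for (std::size_t i = 0; i < query.v.size(); ++i) {
                double diff = query.v[i] - p.features.v[i];
                d += diff * diff;
            }
            if (d < best) { best = d; choice = p.best_variant; }
        }
        return choice;
    }

private:
    std::vector<TrainingPoint> table_;
};
```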

33 citations


Patent
06 Oct 2016
TL;DR: A history-preserving data pipeline system that provides immutable and versioned datasets, making it possible to determine the data in a dataset at a point in time in the past, even if that data is no longer in the current version of the dataset.
Abstract: A history preserving data pipeline computer system and method. In one aspect, the history preserving data pipeline system provides immutable and versioned datasets. Because datasets are immutable and versioned, the system makes it possible to determine the data in a dataset at a point in time in the past, even if that data is no longer in the current version of the dataset.
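
A minimal sketch of the immutable-and-versioned dataset idea, assuming a simple in-memory store in which every write snapshots a new version rather than mutating in place (types and names are illustrative, not taken from the patent):

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Record = std::string;

class VersionedDataset {
public:
    // Writes never mutate an existing version; each commit snapshots a new one.
    int64_t commit(std::vector<Record> records) {
        versions_.emplace(++latest_, std::move(records));
        return latest_;
    }

    // Read the dataset as it existed at (or before) the given version,
    // even if that data is gone from the current version.
    const std::vector<Record>& as_of(int64_t version) const {
        auto it = versions_.upper_bound(version);  // first version newer than requested
        if (it == versions_.begin())
            throw std::out_of_range("no version at or before the requested point");
        return std::prev(it)->second;
    }

    // Latest version (assumes at least one commit has been made).
    const std::vector<Record>& current() const { return versions_.rbegin()->second; }

private:
    int64_t latest_ = 0;
    std::map<int64_t, std::vector<Record>> versions_;
};
```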

8 citations


Journal ArticleDOI
TL;DR: Surge is a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures; from high-level specifications, it automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.
Abstract: This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples the high-level specification of computations, expressed using a C++ programming interface, from low-level implementation details using two first-class constructs: schedules and policies. Schedules describe the valid ways in which data-parallel operators may be implemented, while policies encapsulate a set of parameters that govern platform-specific code generation. These two mechanisms are used to implement a code generation system that analyzes computations and automatically generates a search space of valid platform-specific implementations. An input- and architecture-adaptive autotuning system then explores this search space to find optimized implementations. We express five real-world benchmarks from domains such as machine learning and sparse linear algebra in Surge, and from these high-level specifications Surge automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.
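
Surge's actual interface is not reproduced here, but the decoupling the abstract describes can be illustrated with ordinary C++ overloading: the computation (a dot product) is written once, while a schedule tag and a policy struct decide how it is lowered. A hypothetical sketch, not the Surge API:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

struct Serial {};          // schedule: run on one core
struct ParallelChunks {};  // schedule: split into coarse chunks

struct Policy { std::size_t num_threads = 4; };  // platform-specific parameters

// High-level specification of the computation, independent of schedule/policy.
inline double dot_body(const std::vector<double>& a, const std::vector<double>& b,
                       std::size_t lo, std::size_t hi) {
    double s = 0.0;
    for (std::size_t i = lo; i < hi; ++i) s += a[i] * b[i];
    return s;
}

// Two lowerings of the same specification, selected by the schedule tag.
double dot(Serial, Policy, const std::vector<double>& a, const std::vector<double>& b) {
    return dot_body(a, b, 0, a.size());
}

double dot(ParallelChunks, Policy p, const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> partial(p.num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (a.size() + p.num_threads - 1) / p.num_threads;
    for (std::size_t t = 0; t < p.num_threads; ++t)
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk, hi = std::min(a.size(), lo + chunk);
            if (lo < hi) partial[t] = dot_body(a, b, lo, hi);
        });
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

In the real system the analogous choice is made by the autotuner, which searches the space of valid schedule/policy combinations generated from a single high-level specification.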

3 citations