
Showing papers by "Michael Garland" published in 2016


Proceedings ArticleDOI
Duane Merrill, Michael Garland
13 Nov 2016
TL;DR: This work presents a strictly balanced method for the parallel computation of sparse matrix-vector products (SpMV) that operates directly upon the Compressed Sparse Row (CSR) sparse matrix format without preprocessing, inspection, reformatting, or supplemental encoding.
Abstract: We present a strictly balanced method for the parallel computation of sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format without preprocessing, inspection, reformatting, or supplemental encoding. Regardless of nonzero structure, our equitable 2D merge-based decomposition tightly bounds the workload assigned to each processing element. Furthermore, our technique is suitable for recursively partitioning CSR datasets themselves into multi-scale, distributed, NUMA, and GPU environments that are constrained by fixed-size local memories. We evaluate our method on both CPU and GPU microarchitectures across a very large corpus of diverse sparse matrix datasets. We show that traditional CsrMV methods are inconsistent performers, often subject to order-of-magnitude performance variation across similarly-sized datasets. In comparison, our method provides predictable performance that is substantially uncorrelated with the distribution of nonzeros among rows and broadly improves upon that of current CsrMV methods.
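
The merge-based decomposition in this abstract can be pictured concretely: the CSR row-end offsets are logically merged with the nonzero indices 0..nnz-1, the resulting merge path of length (rows + nnz) is cut into equal-size chunks, and each chunk's starting (row, nonzero) coordinate is found by binary search along a diagonal of the merge grid. The sequential C++ sketch below follows that outline under simplifying assumptions; the names and the per-chunk carry fix-up are illustrative, not the authors' code.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Coord { int64_t row, nz; };   // position on the 2D merge grid

// Binary search along one diagonal of the grid formed by merging the
// row-end offsets row_end[0..m) with the nonzero indices 0..nnz-1.
static Coord merge_path_search(int64_t diagonal,
                               const std::vector<int64_t>& row_end,
                               int64_t nnz)
{
    int64_t lo = std::max<int64_t>(diagonal - nnz, 0);
    int64_t hi = std::min<int64_t>(diagonal, (int64_t)row_end.size());
    while (lo < hi) {
        int64_t mid = (lo + hi) / 2;
        if (row_end[mid] <= diagonal - mid - 1) lo = mid + 1; else hi = mid;
    }
    return {lo, diagonal - lo};
}

// y = A*x (y must be sized to the row count).  The merge path is split into
// num_chunks equal pieces; each piece is independent, and a row straddling a
// chunk boundary leaves a partial sum ("carry") applied in a fix-up pass.
void merge_csr_spmv(const std::vector<int64_t>& row_end,   // end offset of each row
                    const std::vector<int64_t>& col_idx,
                    const std::vector<double>&  val,
                    const std::vector<double>&  x,
                    std::vector<double>&        y,
                    int                         num_chunks)
{
    const int64_t m = (int64_t)row_end.size(), nnz = (int64_t)val.size();
    const int64_t path_len = m + nnz;
    const int64_t chunk = (path_len + num_chunks - 1) / num_chunks;
    std::vector<std::pair<int64_t, double>> carry(num_chunks);

    for (int c = 0; c < num_chunks; ++c) {   // each iteration maps to one thread on a real target
        Coord s = merge_path_search(std::min<int64_t>((int64_t)c * chunk, path_len), row_end, nnz);
        Coord e = merge_path_search(std::min<int64_t>((int64_t)(c + 1) * chunk, path_len), row_end, nnz);
        double sum = 0.0;
        int64_t nz = s.nz;
        for (int64_t r = s.row; r < e.row; ++r) {                // rows finished inside this chunk
            for (; nz < row_end[r]; ++nz) sum += val[nz] * x[col_idx[nz]];
            y[r] = sum;
            sum = 0.0;
        }
        for (; nz < e.nz; ++nz) sum += val[nz] * x[col_idx[nz]];  // trailing partial row
        carry[c] = {e.row, sum};
    }
    for (auto& [r, v] : carry)               // fix-up: add boundary partials to their rows
        if (r < m) y[r] += v;
}
```

Because every chunk covers the same amount of merge path, no chunk can be overloaded by an arbitrarily long row or by a run of empty rows, which is the balance property the abstract claims.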

95 citations


Proceedings ArticleDOI
Duane Merrill, Michael Garland
27 Feb 2016
TL;DR: It is shown that traditional CsrMV methods are inconsistent performers subject to order-of-magnitude slowdowns, whereas the performance response of the proposed method is substantially impervious to row-length heterogeneity.
Abstract: We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for general-purpose sparse linear algebra computations. Our CsrMV performs an equitable multi-partitioning of the input dataset, ensuring that no single thread can be overwhelmed by assignment to (a) arbitrarily-long rows or (b) an arbitrarily-large number of zero-length rows. This parallel decomposition requires neither offline preprocessing nor specialized/ancillary data formats. We evaluate our method on both CPU and GPU microarchitectures across an enormous corpus of diverse real-world matrix datasets. We show that traditional CsrMV methods are inconsistent performers subject to order-of-magnitude slowdowns, whereas the performance response of our method is substantially impervious to row-length heterogeneity.
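
For contrast, the "traditional CsrMV" baseline the abstract critiques is typically the row-parallel formulation, in which rows rather than work are divided among threads. A minimal sketch (illustrative only) shows why a single very long row, or a thread stuck with nothing but empty rows, produces the imbalance described here:

```cpp
#include <cstdint>
#include <vector>

// Conventional row-parallel CsrMV: rows, not work, are split evenly,
// so one thread assigned an arbitrarily long row sets the critical path.
void row_parallel_csr_spmv(const std::vector<int64_t>& row_offsets,  // size m + 1
                           const std::vector<int64_t>& col_idx,
                           const std::vector<double>&  val,
                           const std::vector<double>&  x,
                           std::vector<double>&        y)
{
    const int64_t m = (int64_t)row_offsets.size() - 1;
    #pragma omp parallel for schedule(static)
    for (int64_t r = 0; r < m; ++r) {
        double sum = 0.0;
        for (int64_t nz = row_offsets[r]; nz < row_offsets[r + 1]; ++nz)
            sum += val[nz] * x[col_idx[nz]];
        y[r] = sum;
    }
}
```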

53 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: This work defines a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process.
Abstract: Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly queried at runtime once the execution context is known. In this paper, we define a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process. We pose this as a multi-task learning problem, where each source architecture corresponds to a task; we use device features in the construction of the variant selection model. This work explores the effectiveness of multi-task learning and the impact of different strategies for device feature selection. We evaluate our approach on a set of benchmarks and a collection of six NVIDIA GPU architectures from three distinct generations. We achieve performance results that are mostly comparable to the previous approach of tuning for a single GPU architecture without having to repeat the learning phase.
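
The runtime side of variant selection can be pictured as a model query over device and input features. The sketch below is only an illustration: a nearest-neighbor lookup over (feature vector, best variant) points profiled on the source architectures stands in for the paper's multi-task learning model, and every type name is hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

struct FeatureVector { std::vector<double> v; };  // normalized device + input features

struct TrainingPoint {
    FeatureVector features;      // measured on one of the source architectures
    int           best_variant;  // index of the fastest code variant in that context
};

class VariantSelector {
public:
    explicit VariantSelector(std::vector<TrainingPoint> table) : table_(std::move(table)) {}

    // Predict a variant for an unseen (target architecture, input) context
    // without retraining: pick the variant of the closest profiled point.
    int predict(const FeatureVector& query) const {
        double best = std::numeric_limits<double>::max();
        int choice = 0;
        for (const auto& p : table_) {
            double d = 0.0;
            for (std::size_t i = 0; i < query.v.size(); ++i) {
                double diff = query.v[i] - p.features.v[i];
                d += diff * diff;
            }
            if (d < best) { best = d; choice = p.best_variant; }
        }
        return choice;
    }

private:
    std::vector<TrainingPoint> table_;
};
```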

33 citations


Patent
06 Oct 2016
TL;DR: A history-preserving data pipeline system that provides immutable and versioned datasets, making it possible to determine the data in a dataset at a point in time in the past, even if that data is no longer in the current version of the dataset.
Abstract: A history preserving data pipeline computer system and method. In one aspect, the history preserving data pipeline system provides immutable and versioned datasets. Because datasets are immutable and versioned, the system makes it possible to determine the data in a dataset at a point in time in the past, even if that data is no longer in the current version of the dataset.
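
A minimal sketch of the immutable-and-versioned dataset idea, assuming a simple in-memory store in which every write snapshots a new version rather than mutating in place (types and names are illustrative, not taken from the patent):

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Record = std::string;

class VersionedDataset {
public:
    // Writes never mutate an existing version; each commit snapshots a new one.
    int64_t commit(std::vector<Record> records) {
        versions_.emplace(++latest_, std::move(records));
        return latest_;
    }

    // Read the dataset as it existed at (or before) the given version,
    // even if that data is gone from the current version.
    const std::vector<Record>& as_of(int64_t version) const {
        auto it = versions_.upper_bound(version);  // first version newer than requested
        if (it == versions_.begin())
            throw std::out_of_range("no version at or before the requested point");
        return std::prev(it)->second;
    }

    // Latest version (assumes at least one commit has been made).
    const std::vector<Record>& current() const { return versions_.rbegin()->second; }

private:
    int64_t latest_ = 0;
    std::map<int64_t, std::vector<Record>> versions_;
};
```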

8 citations


Journal ArticleDOI
TL;DR: Surge is a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures; from high-level specifications, it automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.
Abstract: This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples the high-level specification of computations, expressed using a C++ programming interface, from low-level implementation details using two first-class constructs: schedules and policies. Schedules describe the valid ways in which data-parallel operators may be implemented, while policies encapsulate a set of parameters that govern platform-specific code generation. These two mechanisms are used to implement a code generation system that analyzes computations and automatically generates a search space of valid platform-specific implementations. An input- and architecture-adaptive autotuning system then explores this search space to find optimized implementations. We express five real-world benchmarks from domains such as machine learning and sparse linear algebra in Surge, and from these high-level specifications Surge automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.
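
Surge's actual interface is not reproduced here, but the decoupling the abstract describes can be illustrated with ordinary C++ overloading: the computation (a dot product) is written once, while a schedule tag and a policy struct decide how it is lowered. A hypothetical sketch, not the Surge API:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

struct Serial {};          // schedule: run on one core
struct ParallelChunks {};  // schedule: split into coarse chunks

struct Policy { std::size_t num_threads = 4; };  // platform-specific parameters

// High-level specification of the computation, independent of schedule/policy.
inline double dot_body(const std::vector<double>& a, const std::vector<double>& b,
                       std::size_t lo, std::size_t hi) {
    double s = 0.0;
    for (std::size_t i = lo; i < hi; ++i) s += a[i] * b[i];
    return s;
}

// Two lowerings of the same specification, selected by the schedule tag.
double dot(Serial, Policy, const std::vector<double>& a, const std::vector<double>& b) {
    return dot_body(a, b, 0, a.size());
}

double dot(ParallelChunks, Policy p, const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> partial(p.num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (a.size() + p.num_threads - 1) / p.num_threads;
    for (std::size_t t = 0; t < p.num_threads; ++t)
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk, hi = std::min(a.size(), lo + chunk);
            if (lo < hi) partial[t] = dot_body(a, b, lo, hi);
        });
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

In the real system the analogous choice is made by the autotuner, which searches the space of valid schedule/policy combinations generated from a single high-level specification.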

3 citations