Author

Michael E. Wolf

Bio: Michael E. Wolf is an academic researcher. The author has contributed to research in topics: Write-once & Cache pollution. The author has an hindex of 1, co-authored 1 publications receiving 958 citations.

Topics: Write-once, Cache pollution, Cache, Loop tiling, MESI protocol ...read more

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

The cache performance and optimizations of blocked algorithms

[...]

Monica D. Lam, Edward E. Rothberg, Michael E. Wolf

01 Apr 1991

TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.

...read moreread less

Abstract: Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimization to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of frying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.

...read moreread less

982 citations

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

A data locality optimizing algorithm

[...]

Michael Wolf¹, Monica S. Lam¹•Institutions (1)

Stanford University¹

01 May 1991

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.

...read moreread less

Abstract: This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations.The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. Performance evaluation indicates that locality optimization is especially crucial for scaling up the performance of parallel code.

...read moreread less

1,352 citations

Proceedings Article•DOI•

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

[...]

Shane Ryoo¹, Christopher I. Rodrigues¹, Sara S. Baghsorkhi¹, Sam S. Stone¹, David B. Kirk², Wen-mei W. Hwu¹ - Show less +2 more•Institutions (2)

University of Illinois at Urbana–Champaign¹, Nvidia²

20 Feb 2008

TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations.

...read moreread less

Abstract: GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve between a 10.5X to 457X speedup in kernel codes and between 1.16X to 431X total application speedup.

...read moreread less

993 citations

Book•

Computer Architecture, Fifth Edition: A Quantitative Approach

[...]

John L. Hennessy, David A. Patterson

29 Sep 2011

TL;DR: The Fifth Edition of Computer Architecture focuses on this dramatic shift in the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.

...read moreread less

Abstract: The computing world today is in the middle of a revolution: mobile clients and cloud computing have emerged as the dominant paradigms driving programming and hardware innovation today. The Fifth Edition of Computer Architecture focuses on this dramatic shift, exploring the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices. Each chapter includes two real-world examples, one mobile and one datacenter, to illustrate this revolutionary change. Updated to cover the mobile computing revolutionEmphasizes the two most important topics in architecture today: memory hierarchy and parallelism in all its forms.Develops common themes throughout each chapter: power, performance, cost, dependability, protection, programming models, and emerging trends ("What's Next")Includes three review appendices in the printed text. Additional reference appendices are available online.Includes updated Case Studies and completely new exercises.

...read moreread less

984 citations

Journal Article•DOI•

Compiler transformations for high-performance computing

[...]

David F. Bacon¹, Susan L. Graham¹, Oliver Sharp¹•Institutions (1)

University of California, Berkeley¹

01 Dec 1994-ACM Computing Surveys

TL;DR: This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages, such as C and Fortran, and describes the purpose of each transformation, how to determine if it is legal, and an example of its application.

...read moreread less

Abstract: In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis.This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages, such as C and Fortran. Transformations for both sequential and various types of parallel architectures are covered in depth. We describe the purpose of each transformation, explain how to determine if it is legal, and give an example of its application.Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimizations that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of optimizing compiler technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as bibliographic reference for the details of each optimization. Readers are expected to be familiar with modern computer architecture and basic program compilation techniques.

...read moreread less

946 citations

Proceedings Article•DOI•

Design and evaluation of a compiler algorithm for prefetching

[...]

Todd C. Mowry, Monica S. Lam, Anoop Gupta

01 Sep 1992

TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs-some of the programs improve by as much as a factor of two.

...read moreread less

Abstract: Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today’s high-performance processors. While prefctching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a resu 1~ care must be taken to ensure that such overheads do not exceed the benefits. This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices. Our algorithm identiEes those references that are likely to be cache misses, and issues prefetches only for them. We have implemented our algorithm in the SUfF (Stanford University Intermediate Form) optimizing compiler. By generating fully functional code, we have been able to measure not only the improvements in cache miss rates, but also the oversdl performance of a simulated system. We show that our algorithm significantly improves the execution speed of our benchmark programs-some of the programs improve by as much as a factor of two. When compared to an algorithm that indiscriminately prefetches alf array accesses, our algorithm can eliminate many of the unnecessary prefetches without any significant decrease in the coverage of the cache misses.

...read moreread less

789 citations

Collapse