Home
/
Authors
/
Joseph L. Greathouse

Author

Joseph L. Greathouse

Other affiliations: University of Michigan

Bio: Joseph L. Greathouse is an academic researcher from Advanced Micro Devices. The author has contributed to research in topics: Cache & General-purpose computing on graphics processing units. The author has an hindex of 16, co-authored 51 publications receiving 1256 citations. Previous affiliations of Joseph L. Greathouse include University of Michigan.

Papers published on a yearly basis

2023
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2008

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

TOP-PIM: throughput-oriented programmable processing in memory

[...]

Dong Ping Zhang¹, Nuwan Jayasena¹, Alexander Lyashevsky¹, Joseph L. Greathouse¹, Lifan Xu², Michael Ignatowski¹ - Show less +2 more•Institutions (2)

Advanced Micro Devices¹, University of Delaware²

23 Jun 2014

TL;DR: This work explores the use of 3D die stacking to move memory-intensive computations closer to memory and introduces a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware.

...read moreread less

Abstract: As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical to continued performance scaling. Moving computation closer to memory presents an opportunity to reduce both energy and data movement overheads. We explore the use of 3D die stacking to move memory-intensive computations closer to memory. This approach to processing in memory addresses some drawbacks of prior research on in-memory computing and is commercially viable in the foreseeable future.Because 3D stacking provides increased bandwidth, we study throughput-oriented computing using programmable GPU compute units across a broad range of benchmarks, including graph and HPC applications. We also introduce a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware. Our results show that, on average, viable PIM configurations show moderate performance losses (27%) in return for significant energy efficiency improvements (76\% reduction in EDP) relative to a representative mainstream GPU at 22nm technology. At 16nm technology, on average, viable PIM configurations are performance competitive with a representative mainstream GPU (7% speedup) and provide even greater energy efficiency improvements (85\% reduction in EDP).

...read moreread less

338 citations

Proceedings Article•DOI•

GPGPU performance and power estimation using machine learning

[...]

Gene Wu¹, Joseph L. Greathouse², Alexander Lyashevsky², Nuwan Jayasena², Derek Chiou¹ - Show less +1 more•Institutions (2)

University of Texas at Austin¹, Advanced Micro Devices²

07 Feb 2015

TL;DR: A GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware that runs as fast as, or faster than the program running natively on real hardware after an initial training phase.

...read moreread less

Abstract: Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware. The model is trained on a collection of applications that are run at numerous different hardware configurations. From the measured performance and power data, the model learns how applications scale as the GPU's configuration is changed. Hardware performance counter values are then gathered when running a new application on a single GPU configuration. These dynamic counter values are fed into a neural network that predicts which scaling curve from the training data best represents this kernel. This scaling curve is then used to estimate the performance and power of the new application at different GPU configurations. Over an 8× range of the number of CUs, a 3.3× range of core frequencies, and a 2.9× range of memory bandwidth, our model's performance and power estimates are accurate to within 15% and 10% of real hardware, respectively. This is comparable to the accuracy of cycle-level simulators. However, after an initial training phase, our model runs as fast as, or faster than the program running natively on real hardware.

...read moreread less

196 citations

Proceedings Article•DOI•

Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format

[...]

Joseph L. Greathouse¹, Mayank Daga¹•Institutions (1)

Advanced Micro Devices¹

16 Nov 2014

TL;DR: This work proposes a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs, and achieves an average speedup of 14.7× over existingCSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.

...read moreread less

Abstract: The performance of sparse matrix vector multiplication (SpMV) is important to computational scientists. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs) has poor performance due to irregular memory access patterns, load imbalance, and reduced parallelism. This has led researchers to propose new storage formats. Unfortunately, dynamically transforming CSR into these formats has significant runtime and storage overheads. We propose a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs. Our implementation addresses the aforementioned challenges by (i) efficiently accessing DRAM by streaming data into the local scratchpad memory and (ii) dynamically assigning different numbers of rows to each parallel GPU compute unit. CSR-Adaptive achieves an average speedup of 14.7 × over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.

...read moreread less

182 citations

Proceedings Article•DOI•

PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration

[...]

Bo Su¹, Junli Gu², Li Shen¹, Wei Huang², Joseph L. Greathouse², Zhiying Wang¹ - Show less +2 more•Institutions (2)

National University of Defense Technology¹, Advanced Micro Devices²

13 Dec 2014

TL;DR: This work proposes PPEP, an online PPE prediction framework that proactively and rapidly searches the DVFS space and finds that it is important to carefully consider background workloads for DVFS policies and that enabling north bridge DVFS can offer up to 20% additional energy saving or a 1.4x performance improvement.

...read moreread less

Abstract: Performance, power, and energy (PPE) are critical aspects of modern computing. It is challenging to accurately predict, in real time, the effect of dynamic voltage and frequency scaling (DVFS) on PPE across a wide range of voltages and frequencies. This results in the use of reactive, iterative, and inefficient algorithms for dynamically finding good DVFS states. We propose PPEP, an online PPE prediction framework that proactively and rapidly searches the DVFS space. PPEP uses hardware events to implement both a cycles-per-instruction (CPI) model as well as a per-core power model in order to predict PPE across all DVFS states. We verify on modern AMD CPUs that the PPEP power model achieves an average error of 4.6% (2.8% standard deviation) on 152 benchmark combinations at 5 distinct voltage-frequency states. Predicting average chip power across different DVFS states achieves an average error of 4.2% with a 3.6% standard deviation. Further, we demonstrate the usage of PPEP by creating and evaluating a highly responsive power capping mechanism that can meet power targets in a single step. PPEP also provides insights for future development of DVFS technologies. For example, we find that it is important to carefully consider background workloads for DVFS policies and that enabling north bridge DVFS can offer up to 20% additional energy saving or a 1.4x performance improvement.

...read moreread less

79 citations

Proceedings Article•DOI•

Adaptive GPU cache bypassing

[...]

Yingying Tian¹, Sooraj Puthoor², Joseph L. Greathouse², Bradford M. Beckmann², Daniel A. Jimenez¹ - Show less +1 more•Institutions (2)

Texas A&M University¹, Advanced Micro Devices²

07 Feb 2015

TL;DR: A GPU cache management technique that adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted, resulting in better performance for programs that do not use programmer-managed scratchpad memories.

...read moreread less

Abstract: Modern graphics processing units (GPUs) include hardware- controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general purpose GPU (GPGPU) comput- ing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized caches, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches be- tween many threads. Caching these streaming data struc- tures needlessly burns power while evicting data that may otherwise fit into the cache. We propose a GPU cache management technique to im- prove the efficiency of small GPU caches while further re- ducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoid- ing needless insertions and evictions while avoiding cache pollution, resulting in better performance. We show that, with a 16KB L1 data cache, dynamic bypassing achieves sim- ilar performance to a double-sized L1 cache while reducing energy consumption by 25% and power by 18%. The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.

...read moreread less

67 citations

1
2
3
4
…
5
6
7
8
9
10
11

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

National Institute of Standards and Technology における超伝導研究及び生活

[...]

尚島影

01 Oct 2001-Ieej Transactions on Fundamentals and Materials

2,687 citations

Journal Article•DOI•

PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory

[...]

Ping Chi¹, Shuangchen Li¹, Cong Xu², Tao Zhang³, Jishen Zhao¹, Yongpan Liu⁴, Yu Wang⁴, Yuan Xie¹ - Show less +4 more•Institutions (4)

University of California¹, Hewlett-Packard², Nvidia³, Tsinghua University⁴

18 Jun 2016

TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory, and distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving.

...read moreread less

Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has showed its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory. In PRIME, a portion of ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves the performance by ~2360× and the energy consumption by ~895×, across the evaluated machine learning benchmarks.

...read moreread less

1,197 citations

Proceedings Article•DOI•

A scalable processing-in-memory accelerator for parallel graph processing

[...]

Junwhan Ahn¹, Sungpack Hong², Sungjoo Yoo¹, Onur Mutlu³, Kiyoung Choi¹ - Show less +1 more•Institutions (3)

Seoul National University¹, Oracle Corporation², Carnegie Mellon University³

13 Jun 2015

TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.

...read moreread less

Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.

...read moreread less

718 citations

Proceedings Article•DOI•

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

[...]

Linghao Song¹, Xuehai Qian², Hai Li³, Yi Chen⁴•Institutions (4)

University of Pittsburgh¹, University of Southern California², Nanyang Technological University³, Chinese Academy of Sciences⁴

01 Feb 2017

TL;DR: PipeLayer is presented, a ReRAM-based PIM accelerator for CNNs that support both training and testing and proposes highly parallel design based on the notion of parallelism granularity and weight replication, which enables the highly pipelined execution of bothTraining and testing, without introducing the potential stalls in previous work.

...read moreread less

Abstract: Convolution neural networks (CNNs) are the heart of deep learning applications. Recent works PRIME [1] and ISAAC [2] demonstrated the promise of using resistive random access memory (ReRAM) to perform neural computations in memory. We found that training cannot be efficiently supported with the current schemes. First, they do not consider weight update and complex data dependency in training procedure. Second, ISAAC attempts to increase system throughput with a very deep pipeline. It is only beneficial when a large number of consecutive images can be fed into the architecture. In training, the notion of batch (e.g. 64) limits the number of images can be processed consecutively, because the images in the next batch need to be processed based on the updated weights. Third, the deep pipeline in ISAAC is vulnerable to pipeline bubbles and execution stall. In this paper, we present PipeLayer, a ReRAM-based PIM accelerator for CNNs that support both training and testing. We analyze data dependency and weight update in training algorithms and propose efficient pipeline to exploit inter-layer parallelism. To exploit intra-layer parallelism, we propose highly parallel design based on the notion of parallelism granularity and weight replication. With these design choices, PipeLayer enables the highly pipelined execution of both training and testing, without introducing the potential stalls in previous work. The experiment results show that, PipeLayer achieves the speedups of 42.45x compared with GPU platform on average. The average energy saving of PipeLayer compared with GPU implementation is 7.17x.

...read moreread less

633 citations

Proceedings Article•DOI•

Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology

[...]

Vivek Seshadri¹, Donghyuk Lee¹, Thomas Mullins¹, Hasan Hassan², Amirali Boroumand¹, Jeremie S. Kim¹, Michael Kozuch³, Onur Mutlu¹, Phillip B. Gibbons¹, Todd C. Mowry¹ - Show less +6 more•Institutions (3)

Carnegie Mellon University¹, ETH Zurich², Intel³

14 Oct 2017

TL;DR: Ambit is proposed, an Accelerator-in-Memory for bulk bitwise operations that largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area).

...read moreread less

Abstract: Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.CCS CONCEPTS• Computer systems organization → Single instruction, multiple data; • Hardware → Hardware accelerator; • Hardware → Dynamic memory;

...read moreread less

444 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse