Author

Peter Mattson

Other affiliations: Stanford University
Bio: Peter Mattson is an academic researcher from Google. The author has contributed to research on topics including Benchmark (computing) and Stream processing. The author has an h-index of 19 and has co-authored 36 publications receiving 3,430 citations. Previous affiliations of Peter Mattson include Stanford University.

Papers
Proceedings ArticleDOI
01 May 2000
TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.

1,009 citations
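To make the reordering idea above concrete, here is a minimal, hypothetical sketch (not the paper's scheduler) of a row-hit-first policy: pending DRAM requests are reordered so that references hitting a bank's currently open row issue before references that would force a precharge and activate. The request format and trace are invented for illustration.

```cpp
// Illustrative row-hit-first scheduling sketch (hypothetical, not the paper's
// scheduler). Requests to a bank's open row are issued ahead of requests that
// would require closing the row, exploiting row locality.
#include <cstdio>
#include <deque>
#include <unordered_map>

struct Request { int bank; int row; int col; };

class RowHitFirstScheduler {
public:
    void enqueue(const Request& r) { pending_.push_back(r); }

    // Pick the oldest request that hits an open row; otherwise the oldest overall.
    bool issue(Request* out) {
        if (pending_.empty()) return false;
        size_t pick = 0;
        for (size_t i = 0; i < pending_.size(); ++i) {
            auto it = open_row_.find(pending_[i].bank);
            if (it != open_row_.end() && it->second == pending_[i].row) { pick = i; break; }
        }
        *out = pending_[pick];
        pending_.erase(pending_.begin() + pick);
        open_row_[out->bank] = out->row;  // the accessed row stays open
        return true;
    }

private:
    std::deque<Request> pending_;
    std::unordered_map<int, int> open_row_;  // bank -> currently open row
};

int main() {
    RowHitFirstScheduler sched;
    // Interleaved accesses to two rows of bank 0: issued in arrival order,
    // every access would be a row miss.
    for (int i = 0; i < 4; ++i) {
        sched.enqueue({0, 0, i});
        sched.enqueue({0, 1, i});
    }
    Request r;
    int hits = 0, issued = 0, last_row = -1;
    while (sched.issue(&r)) {
        hits += (r.row == last_row);
        last_row = r.row;
        ++issued;
        std::printf("issue bank=%d row=%d col=%d\n", r.bank, r.row, r.col);
    }
    std::printf("%d of %d accesses were row hits after reordering\n", hits, issued);
    return 0;
}
```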

Journal ArticleDOI
TL;DR: The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors and can sustain 18.3 GOPS on MPEG-2 encoding.
Abstract: The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.

396 citations

Journal ArticleDOI
TL;DR: The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications.
Abstract: The demand for flexibility in media processing motivates the use of programmable processors. Stream processing bridges the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications. The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. The performance of the Imagine stream processor on these media applications is given.

335 citations
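As a minimal sketch of the stream/kernel organization described above (illustrative only, not Imagine's programming interface): data is organized as streams, computation as kernels applied element-by-element, so intermediate values stay local to the kernel and independent elements expose data parallelism. The stream names and kernel bodies below are invented for illustration.

```cpp
// Minimal stream/kernel sketch (illustrative only). A kernel consumes input
// streams and produces an output stream; intermediate values live inside the
// pipeline rather than going back to memory, and elements are independent.
#include <cstdio>
#include <functional>
#include <vector>

using Stream = std::vector<float>;

// Apply a kernel independently to each element of two input streams.
Stream run_kernel(const Stream& a, const Stream& b,
                  const std::function<float(float, float)>& kernel) {
    Stream out(a.size());
    for (size_t i = 0; i < a.size(); ++i)  // each element is independent
        out[i] = kernel(a[i], b[i]);
    return out;
}

int main() {
    Stream luma   = {16, 32, 64, 128};
    Stream motion = {1, 2, 3, 4};

    // Two kernels chained into a pipeline: the intermediate stream flows
    // directly from one kernel into the next.
    Stream diff   = run_kernel(luma, motion, [](float x, float m) { return x - m; });
    Stream scaled = run_kernel(diff, diff,   [](float d, float)   { return d * 0.5f; });

    for (float v : scaled) std::printf("%.1f ", v);
    std::printf("\n");
    return 0;
}
```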

Proceedings ArticleDOI
20 Jan 2000
TL;DR: It is shown that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance, and develops a taxonomy of register architectures by partitioning across the data-parallel, instruction-level-parallel and memory-hierarchy axes.
Abstract: Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis and image understanding, require arithmetic rates of up to 10^11 operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay and power of the arithmetic units. In this paper, we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level-parallel and memory-hierarchy axes, and by optimizing the hierarchical register organization for operation on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file area, delay and power dissipation of a media processor by factors of 195, 230 and 430 respectively. This reduction in cost is achieved with a performance degradation of only 8% on a representative set of media processing benchmarks.

315 citations
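A rough back-of-the-envelope calculation illustrates why partitioning pays off. Under the common first-order assumption (mine, not the paper's model) that register file area grows roughly with the square of its port count, splitting one central file serving many ALUs into per-ALU files with a few ports each shrinks total area by two orders of magnitude; the ALU and register counts below are hypothetical.

```cpp
// First-order cost model (an assumption for illustration, not the paper's
// model): register file area taken to grow roughly as registers * ports^2.
#include <cstdio>

double relative_area(int registers, int ports) {
    return static_cast<double>(registers) * ports * ports;  // ~ N * p^2
}

int main() {
    const int alus = 16;          // hypothetical number of arithmetic units
    const int ports_per_alu = 3;  // e.g. 2 reads + 1 write per ALU
    const int total_regs = 1024;

    // Central file: every ALU's ports hit the same structure.
    double central = relative_area(total_regs, alus * ports_per_alu);

    // Distributed files: each ALU gets a private slice with only its own ports.
    double distributed = alus * relative_area(total_regs / alus, ports_per_alu);

    std::printf("area ratio (central / distributed) = %.0fx\n", central / distributed);
    return 0;
}
```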

Proceedings ArticleDOI
30 May 2020
TL;DR: This paper presents the benchmarking method for evaluating ML inference systems, MLPerf Inference, and prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures.
Abstract: Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark’s flexibility and adaptability.

284 citations
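To show the kind of measurement such a benchmark standardizes, here is a minimal harness sketch in the spirit of the paper (hypothetical; this is not the MLPerf LoadGen API): queries are issued to a stand-in system-under-test, and throughput plus a tail-latency percentile are reported so that very different systems can be compared on equal terms.

```cpp
// Minimal inference-benchmark harness sketch (hypothetical; not MLPerf LoadGen).
// Issues queries to a stand-in model and reports throughput and p99 latency.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Stand-in for a real model; sleeps to simulate inference work.
void run_inference_stub() {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
}

int main() {
    const int queries = 200;
    std::vector<double> latencies_ms;
    latencies_ms.reserve(queries);

    auto start = Clock::now();
    for (int q = 0; q < queries; ++q) {
        auto t0 = Clock::now();
        run_inference_stub();
        auto t1 = Clock::now();
        latencies_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    double wall_s = std::chrono::duration<double>(Clock::now() - start).count();

    std::sort(latencies_ms.begin(), latencies_ms.end());
    double p99 = latencies_ms[static_cast<size_t>(0.99 * (latencies_ms.size() - 1))];

    std::printf("throughput: %.1f queries/s\n", queries / wall_s);
    std::printf("p99 latency: %.2f ms\n", p99);
    return 0;
}
```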


Cited by
Journal ArticleDOI
TL;DR: The research shows that NoC constitutes a unification of current trends of intrachip communication rather than an explicit new alternative.
Abstract: The scaling of microchip technologies has enabled large scale systems-on-chip (SoC). Network-on-chip (NoC) research addresses global communication in SoC, involving (i) a move from computation-centric to communication-centric design and (ii) the implementation of scalable communication structures. This survey presents a perspective on existing NoC research. We define the following abstractions: system, network adapter, network, and link to explain and structure the fundamental concepts. First, research relating to the actual network design is reviewed. Then system level design and modeling are discussed. We also evaluate performance analysis techniques. The research shows that NoC constitutes a unification of current trends of intrachip communication rather than an explicit new alternative.

1,720 citations

Proceedings ArticleDOI
26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Abstract: Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth rather than latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.

1,558 citations

Journal ArticleDOI
01 Aug 2004
TL;DR: This paper presents Brook for GPUs, a system for general-purpose computation on programmable graphics hardware that abstracts and virtualizes many aspects of graphics hardware, and presents an analysis of the effectiveness of the GPU as a compute engine compared to the CPU.
Abstract: In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming co-processor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.

1,288 citations
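Brook itself extends C with stream types and kernel functions; the sketch below is plain C++ (not Brook syntax or any real GPU API) and only illustrates the shape of the streaming co-processor model the abstract describes, using the SAXPY example it names: input streams are copied to the device, a per-element kernel runs over them, and the result stream is copied back. The transfer functions are hypothetical stand-ins.

```cpp
// Streaming co-processor pattern behind a Brook-style SAXPY, sketched in plain
// C++ (not Brook syntax, not a real GPU API). On a GPU, stream_read and
// stream_write would move data between host and device memory.
#include <cstdio>
#include <vector>

using Stream = std::vector<float>;

// Hypothetical stand-ins for stream transfer.
Stream stream_read(const Stream& host) { return host; }
void stream_write(const Stream& device, Stream* host) { *host = device; }

// SAXPY kernel body, applied independently to every stream element.
Stream saxpy_kernel(float a, const Stream& x, const Stream& y) {
    Stream result(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        result[i] = a * x[i] + y[i];
    return result;
}

int main() {
    Stream x_host = {1, 2, 3, 4};
    Stream y_host = {10, 20, 30, 40};

    Stream x_dev = stream_read(x_host);   // host -> co-processor
    Stream y_dev = stream_read(y_host);
    Stream r_dev = saxpy_kernel(2.0f, x_dev, y_dev);

    Stream r_host;
    stream_write(r_dev, &r_host);         // co-processor -> host
    for (float v : r_host) std::printf("%.0f ", v);
    std::printf("\n");
    return 0;
}
```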

Book ChapterDOI
08 Apr 2002
TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
Abstract: We characterize high-performance streaming applications as a new and distinct domain of programs that is becoming increasingly important. The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain. At the same time, the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations. In this paper, we motivate, describe and justify the language features of StreamIt, which include: a structured model of streams, a messaging system for control, a re-initialization mechanism, and a natural textual syntax.

1,224 citations