David Brooks

Other affiliations: IBM, Princeton University, Intel

Bio: David Brooks is an academic researcher at Harvard University. The author has contributed to research on hardware acceleration and computer science, has an h-index of 54, and has co-authored 242 publications that have received 15,012 citations. Previous affiliations of David Brooks include IBM and Princeton University.

Papers published on a yearly basis (chart omitted; years shown: 1983 and 1999 to 2023)

Papers

Proceedings Article

Wattch: a framework for architectural-level power analysis and optimizations

David Brooks¹, Vivek Tiwari², Margaret Martonosi¹

Institutions: Princeton University¹, Intel²

01 May 2000

TL;DR: Wattch is a framework for analyzing and optimizing microprocessor power dissipation at the architecture level; it opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.

Abstract: Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high accuracy by calculating power estimates for designs only after layout or floorplanning are complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities. This paper presents Wattch, a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level. Wattch is 1000X or more faster than existing layout-level power tools, and yet maintains accuracy within 10% of their estimates as verified using industry tools on leading-edge designs. This paper presents several validations of Wattch's accuracy. In addition, we present three examples that demonstrate how architects or compiler writers might use Wattch to evaluate power consumption in their design process. We see Wattch as a complement to existing lower-level tools; it allows architects to explore and cull the design space early on, using faster, higher-level tools. It also opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.
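
As a rough illustration of what an architecture-level power estimate can look like, the sketch below sums per-unit dynamic power using the standard CMOS relation P = a · C · V² · f. This is not Wattch's code or API: the `Unit` type, the unit names, and every capacitance and activity value are hypothetical placeholders chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A hardware block with a lumped switched capacitance (hypothetical values)."""
    name: str
    capacitance_f: float  # effective switched capacitance, in farads
    activity: float       # fraction of cycles the unit switches, 0.0 to 1.0


def dynamic_power_w(unit: Unit, vdd: float, freq_hz: float) -> float:
    """Standard CMOS dynamic power: P = a * C * Vdd^2 * f."""
    return unit.activity * unit.capacitance_f * vdd ** 2 * freq_hz


def total_power_w(units: list[Unit], vdd: float, freq_hz: float) -> float:
    """Sum the per-unit estimates to get a chip-level dynamic power figure."""
    return sum(dynamic_power_w(u, vdd, freq_hz) for u in units)


# Illustrative only: these units and numbers are not taken from the paper.
example_units = [
    Unit("register_file", capacitance_f=1.0e-9, activity=0.5),
    Unit("alu", capacitance_f=2.0e-9, activity=0.25),
]
example_total = total_power_w(example_units, vdd=2.0, freq_hz=1.0e9)
print(f"estimated dynamic power: {example_total:.2f} W")
```

Because the per-unit activity factors can come from a cycle-level simulator rather than from a finished layout, an estimate of this shape is available early in the design process, which is the tradeoff the abstract describes.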

2,848 citations

Proceedings Article

Dynamic thermal management for high-performance microprocessors

David Brooks¹, Margaret Martonosi¹

Institutions: Princeton University¹

20 Jan 2001

TL;DR: This work investigates dynamic thermal management as a technique to control CPU power dissipation and explores the tradeoffs between several mechanisms for responding to periods of thermal trauma and the effects of hardware and software implementations.

Abstract: With the increasing clock rate and transistor count of today's microprocessors, power dissipation is becoming a critical component of system design complexity. Thermal and power-delivery issues are becoming especially critical for high-performance computing systems. In this work, we investigate dynamic thermal management as a technique to control CPU power dissipation. With the increasing usage of clock gating techniques, the average power dissipation typically seen by common applications is becoming much less than the chip's rated maximum power dissipation. However, system designers still must design thermal heat sinks to withstand the worst-case scenario. We define and investigate the major components of any dynamic thermal management scheme. Specifically, we explore the tradeoffs between several mechanisms for responding to periods of thermal trauma and we consider the effects of hardware and software implementations. With appropriate dynamic thermal management, the CPU can be designed for a much lower maximum power rating, with minimal performance impact for typical applications.

882 citations

Proceedings Article

System level analysis of fast, per-core DVFS using on-chip switching regulators

Wonyoung Kim¹, Meeta S. Gupta¹, Gu-Yeon Wei¹, David Brooks¹

Institutions: Harvard University¹

24 Oct 2008

TL;DR: It is concluded that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.

Abstract: Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in digital systems, but the effectiveness of DVFS is hampered by slow voltage transitions that occur on the order of tens of microseconds. In addition, the recent trend towards chip-multiprocessors (CMP) executing multi-threaded workloads with heterogeneous behavior motivates the need for per-core DVFS control mechanisms. Voltage regulators that are integrated onto the same chip as the microprocessor core provide the benefit of both nanosecond-scale voltage switching and per-core voltage control. We show that these characteristics provide significant energy-saving opportunities compared to traditional off-chip regulators. However, the implementation of on-chip regulators presents many challenges including regulator efficiency and output voltage transient characteristics, which are significantly impacted by the system-level application of the regulator. In this paper, we describe and model these costs, and perform a comprehensive analysis of a CMP system with on-chip integrated regulators. We conclude that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.

758 citations

Journal Article

Minerva: enabling low-power, highly-accurate deep neural network accelerators

Brandon Reagen¹, Paul N. Whatmough¹, Robert Adolf¹, Saketh Rama¹, Hyunkwang Lee¹, Sae Kyu Lee¹, José Miguel Hernández-Lobato¹, Gu-Yeon Wei¹, David Brooks¹

Institutions: Harvard University¹

18 Jun 2016

TL;DR: Minerva is a co-design approach across the algorithm, architecture, and circuit levels for optimizing DNN hardware accelerators; fine-grained, heterogeneous datatype optimization reduces power by 1.5×, and aggressive, inline predication and pruning of small activity values further reduces power by 2.0×.

Abstract: The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
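
The three per-stage reductions quoted in the abstract can be checked against the reported 8.1× figure with a few lines of arithmetic. The sketch below assumes the stages compose multiplicatively, which the abstract implies but does not state outright; the dictionary keys are shorthand labels, not names from the paper.

```python
import math

# Per-stage power reduction factors as quoted in the abstract.
stage_factors = {
    "datatype_optimization": 1.5,
    "predication_and_pruning": 2.0,
    "sram_voltage_lowering": 2.7,
}

# Assumption: each stage scales the power left over by the previous one,
# so the combined reduction is the product of the individual factors.
combined = math.prod(stage_factors.values())
print(f"combined reduction: {combined:.1f}x")
```

Under that assumption the product matches the collective average of 8.1× reported across the five datasets.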

540 citations

Proceedings Article

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective

Kim Hazelwood¹, Sarah Bird¹, David Brooks², Soumith Chintala¹, Utku Diril¹, Dmytro Dzhulgakov¹, Mohamed Fawzy, Bill Jia¹, Yangqing Jia¹, Aditya Kalro, James Law³, Kevin M. Lee, Jason Lu, Pieter Noordhuis⁴, Misha Smelyanskiy, Liang Xiong¹, Xiaodong Wang¹

Institutions: Facebook¹, Harvard University², Google³, University of Warsaw⁴

01 Feb 2018

TL;DR: The hardware and software infrastructure that supports machine learning at global scale is described, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference.

Abstract: Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at global scale. Facebook's machine learning workloads are extremely diverse: services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.

523 citations

Cited by

Journal Article

Machine learning

Thomas G. Dietterich¹

Institutions: Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Journal Article

Cloud computing: state-of-the-art and research challenges

Qi Zhang¹, Lu Cheng¹, Raouf Boutaba¹

Institutions: University of Waterloo¹

20 Apr 2010-Journal of Internet Services and Applications

TL;DR: A survey of cloud computing is presented, highlighting its key concepts, architectural principles, state-of-the-art implementation, and research challenges, with the aim of providing a better understanding of the design challenges of cloud computing and identifying important research directions in this area.

Abstract: Cloud computing has recently emerged as a new paradigm for hosting and delivering services over the Internet. Cloud computing is attractive to business owners as it eliminates the requirement for users to plan ahead for provisioning, and allows enterprises to start from the small and increase resources only when there is a rise in service demand. However, despite the fact that cloud computing offers huge opportunities to the IT industry, the development of cloud computing technology is currently at its infancy, with many issues still to be addressed. In this paper, we present a survey of cloud computing, highlighting its key concepts, architectural principles, state-of-the-art implementation as well as research challenges. The aim of this paper is to provide a better understanding of the design challenges of cloud computing and identify important research directions in this increasingly important area.

3,465 citations

Posted Content

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Khaitan Harshit, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon

16 Apr 2017-arXiv: Hardware Architecture

TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X to 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X to 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
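
The 92 TOPS peak figure follows from the MAC count once a clock rate is fixed. The sketch below works through that arithmetic; the 700 MHz clock is an assumption that is not stated in the abstract above, as is the convention of counting each multiply-accumulate as two operations.

```python
MAC_UNITS = 65_536    # 8-bit MACs in the matrix multiply unit (from the abstract)
OPS_PER_MAC = 2       # assumption: one multiply plus one add per MAC per cycle
CLOCK_HZ = 700e6      # assumption: 700 MHz clock, not given in the abstract

# Peak throughput = MACs * operations per MAC * cycles per second.
peak_ops_per_second = MAC_UNITS * OPS_PER_MAC * CLOCK_HZ
peak_tops = peak_ops_per_second / 1e12

print(f"peak throughput: {peak_tops:.2f} TOPS")
```

With those assumptions the result is about 91.75 TOPS, which rounds to the 92 TOPS quoted in the abstract.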

3,067 citations

Proceedings Article

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi¹, Cliff Young¹, Nishant Patil¹, David A. Patterson¹, Gaurav Agrawal¹, Raminder Bajwa¹, Sarah Bates¹, Suresh Bhatia¹, Nan Boden¹, Albert T. Borchers¹, Rick Boyle¹, Pierre-luc Cantin¹, Clifford Chao¹, Christopher Aaron Clark¹, Jeremy Coriell¹, Michael J. Daley¹, Matt Dau¹, Jeffrey Dean¹, Ben Gelb¹, Tara Vazir Ghaemmaghami¹, Rajendra Gottipati¹, William John Gulland¹, Robert Hagmann¹, C. Richard Ho¹, Doug Hogberg¹, John Hu¹, Robert Hundt¹, D. Hurt¹, Julian Ibarz¹, Aaron Jaffey¹, Alek Jaworski¹, Alexander Kaplan¹, Khaitan Harshit¹, Daniel Killebrew¹, Andy Koch¹, Naveen Kumar¹, Steve Lacy¹, James Laudon¹, James Law¹, Diemthu Le¹, Chris Leary¹, Zhuyuan Liu¹, Kyle Lucke¹, Alan Lundin¹, Gordon MacKean¹, Adriana Maggiore¹, Maire Mahony¹, Kieran Miller¹, Rahul Nagarajan¹, Ravi Narayanaswami¹, Ray Ni¹, Kathy Nix¹, Thomas Norrie¹, Mark Omernick¹, Narayana Penukonda¹, Andrew Everett Phelps¹, Jonathan Ross¹, Matt Ross¹, Amir Salek¹, Emad Samadiani¹, Chris Severn¹, Gregory Sizikov¹, Matthew Snelham¹, Jed Souter¹, Dan Steinberg¹, Andy Swing¹, Mercedes Tan¹, Gregory Michael Thorson¹, Bo Tian¹, Horia Toma¹, Erick Tuttle¹, Vijay K. Vasudevan¹, Richard Walter¹, Walter Wang¹, Eric Wilcox¹, Doe Hyun Yoon¹

Institutions: Google¹

24 Jun 2017

TL;DR: The Tensor Processing Unit (TPU) is a custom ASIC, deployed in datacenters since 2015, that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit with a peak throughput of 92 TeraOps/second (TOPS).

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X to 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X to 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

2,679 citations
