Home
/
Authors
/
John L. Hennessy

Author

John L. Hennessy

Bio: John L. Hennessy is an academic researcher from National Academies. The author has contributed to research in topics: Interface (computing) & Assembly language. The author has an hindex of 7, co-authored 11 publications receiving 3071 citations.

Papers

PDF

Open Access

More filters

Book•

Computer Architecture: A Quantitative Approach, 2nd Edition

[...]

John L. Hennessy, David A. Patterson¹•Institutions (1)

Imperial College London¹

01 Jan 1996

TL;DR: A quantitative approach to computer architecture a quantitative approach 5th edition computer architecture quantitative approach solution manual computer Architecture quantitative approach solutions manual computer architecture an quantitative approach 3rd editionComputer architecture, fifth edition.

...read moreread less

Abstract: The computing world today is in the middle of a revolution: mobile clients and cloud computing have emerged as the dominant paradigms driving programming and hardware innovation today. The Fifth Edition of Computer Architecture focuses on this dramatic shift, exploring the ways in which software and technology in the cloud are accessed by cell phones, tablets, laptops, and other mobile computing devices. Each chapter includes two real-world examples, one mobile and one datacenter, to illustrate this revolutionary change. Updated to cover the mobile computing revolution Emphasizes the two most important topics in architecture today: memory hierarchy and parallelism in all its forms. Develops common themes throughout each chapter: power, performance, cost, dependability, protection, programming models, and emerging trends ("What's Next") Includes three review appendices in the printed text. Additional reference appendices are available online. Includes updated Case Studies and completely new exercises.

...read moreread less

1,626 citations

Book•

Computer Architecture, Fifth Edition: A Quantitative Approach

[...]

John L. Hennessy, David A. Patterson

29 Sep 2011

TL;DR: The Fifth Edition of Computer Architecture focuses on this dramatic shift in the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.

...read moreread less

Abstract: The computing world today is in the middle of a revolution: mobile clients and cloud computing have emerged as the dominant paradigms driving programming and hardware innovation today. The Fifth Edition of Computer Architecture focuses on this dramatic shift, exploring the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices. Each chapter includes two real-world examples, one mobile and one datacenter, to illustrate this revolutionary change. Updated to cover the mobile computing revolutionEmphasizes the two most important topics in architecture today: memory hierarchy and parallelism in all its forms.Develops common themes throughout each chapter: power, performance, cost, dependability, protection, programming models, and emerging trends ("What's Next")Includes three review appendices in the printed text. Additional reference appendices are available online.Includes updated Case Studies and completely new exercises.

...read moreread less

984 citations

Book•

Computer Organization and Design, Fourth Edition, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design)

[...]

David A. Patterson, John L. Hennessy

31 Oct 2008

TL;DR: The classic textbook for computer systems analysis and design, Computer Organization and Design, has been thoroughly updated to provide a new focus on the revolutionary change taking place in industry today: the switch from uniprocessor to multicore microprocessors.

...read moreread less

Abstract: The classic textbook for computer systems analysis and design, Computer Organization and Design, has been thoroughly updated to provide a new focus on the revolutionary change taking place in industry today: the switch from uniprocessor to multicore microprocessors. This new emphasis on parallelism is supported by updates reflecting the newest technologies with examples highlighting the latest processor designs, benchmarking standards, languages and tools. As with previous editions, a MIPS processor is the core used to present the fundamentals of hardware technologies, assembly language, computer arithmetic, pipelining, memory hierarchies and I/O. Along with its increased coverage of parallelism, this new edition offers new content on Flash memory and virtual machines as well as a new and important appendix written by industry experts covering the emergence and importance of the modern GPU (graphics processing unit), the highly parallel, highly multithreaded multiprocessor optimized for visual computing.A new exercise paradigm allows instructors to reconfigure the 600 exercises included in the book to easily generate new exercises and solutions of their own. A CD provides a toolkit of simulators and compilers along with tutorials for using them as well as additional problems and solutions, and references.

...read moreread less

395 citations

Book•

Computer Organization and Design, Revised Fourth Edition, Fourth Edition: The Hardware/Software Interface

[...]

David A. Patterson, John L. Hennessy

09 Nov 2011

TL;DR: This Fourth Revised Edition of Computer Organization and Design includes a complete set of updated and new exercises, along with improvements and changes suggested by instructors and students, covering the revolutionary change from sequential to parallel computing.

...read moreread less

Abstract: This Fourth Revised Edition of Computer Organization and Design includes a complete set of updated and new exercises, along with improvements and changes suggested by instructors and students. Focusing on the revolutionary change taking place in industry today--the switch from uniprocessor to multicore microprocessors--this classic textbook has a modern and up-to-date focus on parallelism in all its forms. Examples highlighting multicore and GPU processor designs are supported with performance and benchmarking data. As with previous editions, a MIPS processor is the core used to present the fundamentals of hardware technologies, assembly language, computer arithmetic, pipelining, memory hierarchies and I/O. Sections on the ARM and x86 architectures are also included.The companion CD provides a toolkit of simulators and compilers along with tutorials for using them, as well as advanced content for further study and a search utility for finding content on the CD and in the printed text. For the convenience of readers who have purchased an ebook edition or who may have misplaced the CD-ROM, all CD content is available as a download at bit.ly/nFXcLqThis Revised Fourth Edition of Computer Organization and Design has been updated with new exercises and improvements throughout suggested by instructors teaching from the bookCovers the revolutionary change from sequential to parallel computing, with a chapter on parallelism and sections in every chapter highlighting parallel hardware and software topicsIncludes an appendix by the Chief Scientist and the Director of Architecture of NVIDIA covering the emergence and importance of the modern GPU, describing in detail for the first time the highly parallel, highly multithreaded multiprocessor optimized for visual computing

...read moreread less

49 citations

Book•

Computer Organization and Design: The Hardware Software Interface ARM Edition

[...]

David A. Patterson, John L. Hennessy

16 Mar 2016

TL;DR: The new ARM Edition of Computer Organization and Design features a subset of the ARMv8-A architecture, which is used to present the fundamentals of hardware technologies, assembly language, computer arithmetic, pipelining, memory hierarchies, and I/O.

...read moreread less

Abstract: The new ARM Edition of Computer Organization and Design features a subset of the ARMv8-A architecture, which is used to present the fundamentals of hardware technologies, assembly language, computer arithmetic, pipelining, memory hierarchies, and I/O. With the post-PC era now upon us, Computer Organization and Design moves forward to explore this generational change with examples, exercises, and material highlighting the emergence of mobile computing and the Cloud. Updated content featuring tablet computers, Cloud infrastructure, and the ARM (mobile computing devices) and x86 (cloud computing) architectures is included. An online companion Web site provides links to a free version of the DS-5 Community Edition (a free professional quality tool chain developed by ARM), as well as additional advanced content for further study, appendices, glossary, references, and recommended reading. Covers parallelism in depth with examples and content highlighting parallel hardware and software topicsFeatures the Intel Core i7, ARM Cortex-A53, and NVIDIA Fermi GPU as real-world examples throughout the bookAdds a new concrete example, "Going Faster," to demonstrate how understanding hardware can inspire software optimizations that improve performance by 200XDiscusses and highlights the "Eight Great Ideas" of computer architecture: Performance via Parallelism; Performance via Pipelining; Performance via Prediction; Design for Moore's Law; Hierarchy of Memories; Abstraction to Simplify Design; Make the Common Case Fast; and Dependability via Redundancy. Includes a full set of updated exercises

...read moreread less

35 citations

Cited by

PDF

Open Access

More filters

Posted Content•

In-Datacenter Performance Analysis of a Tensor Processing Unit

[...]

Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Khaitan Harshit, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon - Show less +71 more

16 Apr 2017-arXiv: Hardware Architecture

TL;DR: This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU)-deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the samedatacenters.

...read moreread less

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

...read moreread less

3,067 citations

Proceedings Article•DOI•

In-Datacenter Performance Analysis of a Tensor Processing Unit

[...]

Norman P. Jouppi¹, Cliff Young¹, Nishant Patil¹, David A. Patterson¹, Gaurav Agrawal¹, Raminder Bajwa¹, Sarah Bates¹, Suresh Bhatia¹, Nan Boden¹, Albert T. Borchers¹, Rick Boyle¹, Pierre-luc Cantin¹, Clifford Chao¹, Christopher Aaron Clark¹, Jeremy Coriell¹, Michael J. Daley¹, Matt Dau¹, Jeffrey Dean¹, Ben Gelb¹, Tara Vazir Ghaemmaghami¹, Rajendra Gottipati¹, William John Gulland¹, Robert Hagmann¹, C. Richard Ho¹, Doug Hogberg¹, John Hu¹, Robert Hundt¹, D. Hurt¹, Julian Ibarz¹, Aaron Jaffey¹, Alek Jaworski¹, Alexander Kaplan¹, Khaitan Harshit¹, Daniel Killebrew¹, Andy Koch¹, Naveen Kumar¹, Steve Lacy¹, James Laudon¹, James Law¹, Diemthu Le¹, Chris Leary¹, Zhuyuan Liu¹, Kyle Lucke¹, Alan Lundin¹, Gordon MacKean¹, Adriana Maggiore¹, Maire Mahony¹, Kieran Miller¹, Rahul Nagarajan¹, Ravi Narayanaswami¹, Ray Ni¹, Kathy Nix¹, Thomas Norrie¹, Mark Omernick¹, Narayana Penukonda¹, Andrew Everett Phelps¹, Jonathan Ross¹, Matt Ross¹, Amir Salek¹, Emad Samadiani¹, Chris Severn¹, Gregory Sizikov¹, Matthew Snelham¹, Jed Souter¹, Dan Steinberg¹, Andy Swing¹, Mercedes Tan¹, Gregory Michael Thorson¹, Bo Tian¹, Horia Toma¹, Erick Tuttle¹, Vijay K. Vasudevan¹, Richard Walter¹, Walter Wang¹, Eric Wilcox¹, Doe Hyun Yoon¹ - Show less +72 more•Institutions (1)

Google¹

24 Jun 2017

TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).

...read moreread less

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

...read moreread less

2,679 citations

Proceedings Article•DOI•

McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

[...]

Sheng Li¹, Jung Ho Ahn², Richard Strong³, Jay B. Brockman¹, Dean M. Tullsen³, Norman P. Jouppi⁴ - Show less +2 more•Institutions (4)

University of Notre Dame¹, Seoul National University², University of California, San Diego³, Hewlett-Packard⁴

12 Dec 2009

TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taking into account configuring clusters with 4 cores gives thebest EDA2P and EDAP.

...read moreread less

Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area2 product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.

...read moreread less

2,487 citations

The Landscape of Parallel Computing Research: A View from Berkeley

[...]

Krste Asanovic, Ras Bodik, Bryan Catanzaro, Joseph Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Plishker, John Shalf, Samuel Williams, Katherine Yelick - Show less +7 more

18 Dec 2006

TL;DR: The parallel landscape is frame with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS each development dollar.

...read moreread less

Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. 1 The Landscape of Parallel Computing Research: A View From Berkeley • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

...read moreread less

2,262 citations

Journal Article•DOI•

Roofline: an insightful visual performance model for multicore architectures

[...]

Samuel Williams¹, Andrew Waterman², David A. Patterson²•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

01 Apr 2009-Communications of The ACM

TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.

...read moreread less

Abstract: The Roofline model offers insight on how to improve the performance of software and hardware.

...read moreread less

2,181 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse