Home
/
Authors
/
Eric C. Peterson

Author

Eric C. Peterson

Bio: Eric C. Peterson is an academic researcher from Microsoft. The author has contributed to research in topics: Energy storage & Server. The author has an hindex of 15, co-authored 47 publications receiving 2006 citations.

Topics: Energy storage, Server, Data center, Electricity generation, Chassis ...read more

Papers

PDF

Open Access

More filters

Journal Article•DOI•

A reconfigurable fabric for accelerating large-scale datacenter services

[...]

Andrew Putnam¹, Adrian M. Caulfield¹, Eric S. Chung¹, Derek Chiou², Kypros Constantinides³, John Demme⁴, Hadi Esmaeilzadeh⁵, Jeremy Fowers¹, Gopi Prashanth Gopal¹, Jan Gray¹, Michael Haselman¹, Scott Hauck⁶, Stephen F. Heil¹, Amir Hormati⁷, Joo-Young Kim¹, Sitaram Lanka¹, James R. Larus⁸, Eric C. Peterson¹, Simon Pope¹, Aaron L. Smith¹, Jason Thong¹, Phillip Yi Xiao¹, Doug Burger¹ - Show less +19 more•Institutions (8)

Microsoft¹, University of Texas at Austin², Amazon.com³, Columbia University⁴, Georgia Institute of Technology⁵, University of Washington⁶, Google⁷, École Polytechnique Fédérale de Lausanne⁸

28 Oct 2016-Communications of The ACM

TL;DR: The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.

...read moreread less

Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost It is challenging to improve all of these factors simultaneously To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA) Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth networkWe describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput In other words, the reconfigurable fabric enables the same throughput using only half the number of servers

...read moreread less

835 citations

Journal Article•DOI•

A reconfigurable fabric for accelerating large-scale datacenter services

[...]

Andrew Putnam¹, Adrian M. Caulfield¹, Eric S. Chung¹, Derek Chiou², Kypros Constantinides¹, John Demme³, Hadi Esmaeilzadeh⁴, Jeremy Fowers¹, Gopi Prashanth Gopal¹, Jan Gray¹, Michael Haselman¹, Scott Hauck⁵, Stephen F. Heil¹, Amir Hormati⁶, Joo-Young Kim¹, Sitaram Lanka¹, James R. Larus⁷, Eric C. Peterson¹, Simon Pope¹, Aaron L. Smith¹, Jason Thong¹, Phillip Yi Xiao¹, Doug Burger¹ - Show less +19 more•Institutions (7)

Microsoft¹, University of Texas at Austin², Columbia University³, Georgia Institute of Technology⁴, University of Washington⁵, Google⁶, École Polytechnique Fédérale de Lausanne⁷

14 Jun 2014

TL;DR: The requirements and architecture of the fabric are described, the critical engineering challenges and solutions needed to make the system robust in the presence of failures are detailed, and the performance, power, and resilience of the system when ranking candidate documents are measured.

...read moreread less

Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurablefabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cablesIn this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the largescale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution--- or, while maintaining equivalent throughput, reduces the tail latency by 29%

...read moreread less

712 citations

Patent•

Rack-based uninterruptible power supply

[...]

Eric C. Peterson¹, Shaun L. Harris¹, Christian L. Belady¹, Darren A. Shakib¹, Sompong Paul Olarig¹, Frank J. Wirtz¹ - Show less +2 more•Institutions (1)

Microsoft¹

09 Jun 2010

TL;DR: In this paper, a rack power unit is configured to be inserted into a device rack of a data center, and the one or more power supplies are further configured to output the DC power to a DC power bus of the device rack.

...read moreread less

Abstract: A rack power unit is configured to be inserted into a device rack of a data center. The rack power unit includes one or more power supplies and one or more battery packs. The one or more power supplies are each configured to receive power (e.g., AC power) when the apparatus is in the device rack, and convert the received power to a DC power. The one or more power supplies are further configured to output the DC power to a DC power bus of the device rack. The one or more battery packs are each configured to provide, in response to an interruption in the received power, DC power to the DC power bus of the device rack.

...read moreread less

74 citations

Proceedings Article•DOI•

Pelican: a building block for exascale cold data storage

[...]

Shobana Balakrishnan¹, Richard Black¹, Austin Donnelly¹, Paul England¹, Adam B. Glass¹, Dave Harper¹, Sergey Legtchenko¹, Aaron W. Ogus¹, Eric C. Peterson¹, Antony Rowstron¹ - Show less +6 more•Institutions (1)

Microsoft¹

06 Oct 2014

TL;DR: Pelican performs well for cold workloads, providing high throughput with acceptable latency, and is compared against a traditional resource overprovisioned storage rack using a cross-validated simulator.

...read moreread less

Abstract: A significant fraction of data stored in cloud storage is rarely accessed. This data is referred to as cold data; cost-effective storage for cold data has become a challenge for cloud providers. Pelican is a rack-scale hard-disk based storage unit designed as the basic building block for exabyte scale storage for cold data. In Pelican, server, power, cooling and interconnect bandwidth resources are provisioned by design to support cold data workloads; this right-provisioning significantly reduces Pelican's total cost of ownership compared to traditional disk-based storage.Resource right-provisioning in Pelican means only 8% of the drives can be concurrently spinning. This introduces complex resource management to be handled by the Pelican storage stack. Resource restrictions are expressed as constraints over the hard drives. The data layout and IO scheduling ensures that these constraints are not violated. We evaluate the performance of a prototype Pelican, and compare against a traditional resource overprovisioned storage rack using a cross-validated simulator. We show that compared to this overprovisioned storage rack Pelican performs well for cold workloads, providing high throughput with acceptable latency.

...read moreread less

60 citations

Journal Article•DOI•

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

[...]

Andrew Putnam¹, Adrian M. Caulfield¹, Eric S. Chung¹, Derek Chiou², Kypros Constantinides³, John Demme⁴, Hadi Esmaeilzadeh⁵, Jeremy Fowers¹, Gopi Prashanth Gopal³, Jan Gray⁶, Michael Haselman¹, Scott Hauck⁷, Stephen F. Heil¹, Amir Hormati⁸, Joo-Young Kim¹, Sitaram Lanka¹, James R. Larus⁹, Eric C. Peterson¹, Simon Pope¹, Aaron L. Smith¹, Jason Thong¹, Phillip Yi Xiao¹, Doug Burger¹ - Show less +19 more•Institutions (9)

Microsoft¹, University of Texas at Austin², Amazon.com³, Columbia University⁴, Georgia Institute of Technology⁵, University of Waterloo⁶, University of Washington⁷, Google⁸, École Polytechnique Fédérale de Lausanne⁹

01 May 2015-IEEE Micro

TL;DR: In this article, a composable, reconfigurable fabric is proposed to accelerate large-scale software services in a datacenter, which consists of a 6 x 8 2D torus of high-end field-programmable gate arrays (FPGAs) embedded into a half-rack of 48 servers.

...read moreread less

Abstract: To advance datacenter capabilities beyond what commodity server designs can provide, the authors designed and built a composable, reconfigurable fabric to accelerate large-scale software services. Each instantiation of the fabric consists of a 6 x 8 2D torus of high-end field-programmable gate arrays (FPGAs) embedded into a half-rack of 48 servers. The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.

...read moreread less

57 citations

1
2
3
4
…
5
6
7
8
9
10

Collapse

Cited by

PDF

Open Access

More filters

Posted Content•

In-Datacenter Performance Analysis of a Tensor Processing Unit

[...]

Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Khaitan Harshit, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon - Show less +71 more

16 Apr 2017-arXiv: Hardware Architecture

TL;DR: This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU)-deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the samedatacenters.

...read moreread less

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

...read moreread less

3,067 citations

Proceedings Article•DOI•

In-Datacenter Performance Analysis of a Tensor Processing Unit

[...]

Norman P. Jouppi¹, Cliff Young¹, Nishant Patil¹, David A. Patterson¹, Gaurav Agrawal¹, Raminder Bajwa¹, Sarah Bates¹, Suresh Bhatia¹, Nan Boden¹, Albert T. Borchers¹, Rick Boyle¹, Pierre-luc Cantin¹, Clifford Chao¹, Christopher Aaron Clark¹, Jeremy Coriell¹, Michael J. Daley¹, Matt Dau¹, Jeffrey Dean¹, Ben Gelb¹, Tara Vazir Ghaemmaghami¹, Rajendra Gottipati¹, William John Gulland¹, Robert Hagmann¹, C. Richard Ho¹, Doug Hogberg¹, John Hu¹, Robert Hundt¹, D. Hurt¹, Julian Ibarz¹, Aaron Jaffey¹, Alek Jaworski¹, Alexander Kaplan¹, Khaitan Harshit¹, Daniel Killebrew¹, Andy Koch¹, Naveen Kumar¹, Steve Lacy¹, James Laudon¹, James Law¹, Diemthu Le¹, Chris Leary¹, Zhuyuan Liu¹, Kyle Lucke¹, Alan Lundin¹, Gordon MacKean¹, Adriana Maggiore¹, Maire Mahony¹, Kieran Miller¹, Rahul Nagarajan¹, Ravi Narayanaswami¹, Ray Ni¹, Kathy Nix¹, Thomas Norrie¹, Mark Omernick¹, Narayana Penukonda¹, Andrew Everett Phelps¹, Jonathan Ross¹, Matt Ross¹, Amir Salek¹, Emad Samadiani¹, Chris Severn¹, Gregory Sizikov¹, Matthew Snelham¹, Jed Souter¹, Dan Steinberg¹, Andy Swing¹, Mercedes Tan¹, Gregory Michael Thorson¹, Bo Tian¹, Horia Toma¹, Erick Tuttle¹, Vijay K. Vasudevan¹, Richard Walter¹, Walter Wang¹, Eric Wilcox¹, Doe Hyun Yoon¹ - Show less +72 more•Institutions (1)

Google¹

24 Jun 2017

TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).

...read moreread less

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

...read moreread less

2,679 citations

Journal Article•DOI•

ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars

[...]

Ali Shafiee¹, Anirban Nag¹, Naveen Muralimanohar², Rajeev Balasubramonian¹, John Paul Strachan², Miao Hu², R. Stanley Williams², Vivek Srikumar¹ - Show less +4 more•Institutions (2)

University of Utah¹, Hewlett-Packard²

18 Jun 2016

TL;DR: This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.

...read moreread less

Abstract: A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks.This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.

...read moreread less

1,558 citations

Journal Article•DOI•

Hydrogen production for energy: An overview

[...]

Furat Dawood¹, Martin Anda¹, Gm. Shafiullah¹•Institutions (1)

Murdoch University¹

07 Feb 2020-International Journal of Hydrogen Energy

TL;DR: In this article, the authors presented the hydrogen-based energy system as four corners (stages) of a square shaped integrated whole to demonstrate the interconnection and interdependency of these main stages.

...read moreread less

1,090 citations

Proceedings Article•DOI•

A cloud-scale acceleration architecture

[...]

Adrian M. Caulfield¹, Eric S. Chung¹, Andrew Putnam¹, Hari Angepat¹, Jeremy Fowers¹, Michael Haselman¹, Stephen F. Heil¹, Matt Humphrey¹, Puneet Kaur¹, Joo-Young Kim¹, Lo Daniel¹, Todd Massengill¹, Kalin Ovtcharov¹, Michael K. Papamichael¹, Lisa Woods¹, Sitaram Lanka¹, Derek Chiou¹, Doug Burger¹ - Show less +14 more•Institutions (1)

Microsoft¹

15 Oct 2016

TL;DR: A new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications, and is much more scalable than prior work which used secondary rack-scale networks for inter-FPGA communication.

...read moreread less

Abstract: Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability) In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications This Configurable Cloud architecture places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate, enabling acceleration of local applications running on the server, and enabling the FPGAs to communicate directly, at datacenter scale, to harvest remote FPGAs unused by their local servers We deployed this design over a production server bed, and show how it can be used for both service acceleration (Web search ranking) and network acceleration (encryption of data in transit at high-speeds) This architecture is much more scalable than prior work which used secondary rack-scale networks for inter-FPGA communication By coupling to the network plane, direct FPGA-to-FPGA messages can be achieved at comparable latency to previous work, without the secondary network Additionally, the scale of direct inter-FPGA messaging is much larger The average round-trip latencies observed in our measurements among 24, 1000, and 250,000 machines are under 3, 9, and 20 microseconds, respectively The Configurable Cloud architecture has been deployed at hyperscale in Microsoft's production datacenters worldwide

...read moreread less

512 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse