Home
/
Authors
/
Benjamin C. Lee

Author

Benjamin C. Lee

Other affiliations: Microsoft, Stanford University, Harvard University ...read more

Bio: Benjamin C. Lee is an academic researcher from Duke University. The author has contributed to research in topics: Efficient energy use & Design space exploration. The author has an hindex of 26, co-authored 72 publications receiving 6084 citations. Previous affiliations of Benjamin C. Lee include Microsoft & Stanford University.

Papers published on a yearly basis

2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2002

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Architecting phase change memory as a scalable dram alternative

[...]

Benjamin C. Lee¹, Engin Ipek¹, Onur Mutlu², Doug Burger¹•Institutions (2)

Microsoft¹, Carnegie Mellon University²

20 Jun 2009

TL;DR: This work proposes, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM.

...read moreread less

Abstract: Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencies, high energy writes, and finite endurance.We propose, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.

...read moreread less

1,568 citations

Proceedings Article•DOI•

Better I/O through byte-addressable, persistent memory

[...]

Jeremy P. Condit¹, Edmund B. Nightingale¹, Christopher Frost², Engin Ipek¹, Benjamin C. Lee¹, Doug Burger¹, Derrick Coetzee¹ - Show less +3 more•Institutions (2)

Microsoft¹, University of California, Los Angeles²

11 Oct 2009

TL;DR: A file system and a hardware architecture that are designed around the properties of persistent, byteaddressable memory, which provides strong reliability guarantees and offers better performance than traditional file systems, even when both are run on top of byte-addressable, persistent memory.

...read moreread less

Abstract: Modern computer systems have been built around the assumption that persistent storage is accessed via a slow, block-based interface. However, new byte-addressable, persistent memory technologies such as phase change memory (PCM) offer fast, fine-grained access to persistent storage.In this paper, we present a file system and a hardware architecture that are designed around the properties of persistent, byteaddressable memory. Our file system, BPFS, uses a new technique called short-circuit shadow paging to provide atomic, fine-grained updates to persistent storage. As a result, BPFS provides strong reliability guarantees and offers better performance than traditional file systems, even when both are run on top of byte-addressable, persistent memory. Our hardware architecture enforces atomicity and ordering guarantees required by BPFS while still providing the performance benefits of the L1 and L2 caches.Since these memory technologies are not yet widely available, we evaluate BPFS on DRAM against NTFS on both a RAM disk and a traditional disk. Then, we use microarchitectural simulations to estimate the performance of BPFS on PCM. Despite providing strong safety and consistency guarantees, BPFS on DRAM is typically twice as fast as NTFS on a RAM disk and 4-10 times faster than NTFS on disk. We also show that BPFS on PCM should be significantly faster than a traditional disk-based file system.

...read moreread less

935 citations

Proceedings Article•DOI•

Accurate and efficient regression modeling for microarchitectural performance and power prediction

[...]

Benjamin C. Lee¹, David Brooks¹•Institutions (1)

Harvard University¹

20 Oct 2006

TL;DR: This paper derives and validate regression models for performance and power, and presents optimizations for a baseline regression model to obtain application-specific models to maximize accuracy in performance prediction and regional power models leveraging only the most relevant samples from the microarchitectural design space to maximizing accuracy in power prediction.

...read moreread less

Abstract: We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental challenges in microarchitectural simulation cost by reducing the number of required simulations and using simulated results more effectively via statistical modeling and inference.Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving median error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) application-specific models to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing sensitivity to the number of samples simulated for model formulation, we find fewer than 4,000 samples from a design space of approximately 22 billion points are sufficient. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.

...read moreread less

472 citations

Proceedings Article•DOI•

Understanding sources of inefficiency in general-purpose chips

[...]

Rehan Hameed¹, Wajahat Qadeer¹, Megan Wachs¹, Omid Azizi¹, Alex Solomatnikov, Benjamin C. Lee¹, Stephen Richardson¹, Christos Kozyrakis¹, Mark Horowitz¹ - Show less +5 more•Institutions (1)

Stanford University¹

19 Jun 2010

TL;DR: The sources of these performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general- Purpose CMP system and exploring methods to eliminate these overheads by transforming the CPU into a specialized system for H. 264 encoding.

...read moreread less

Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.

...read moreread less

460 citations

Journal Article•DOI•

Phase-Change Technology and the Future of Main Memory

[...]

Benjamin C. Lee¹, Ping Zhou², Jun Yang², Youtao Zhang², Bo Zhao², Engin Ipek³, Onur Mutlu⁴, Doug Burger⁵ - Show less +4 more•Institutions (5)

Stanford University¹, University of Pittsburgh², University of Rochester³, Carnegie Mellon University⁴, Microsoft⁵

01 Jan 2010-IEEE Micro

TL;DR: This article discusses how to mitigate limitations through buffer sizing, row caching, write reduction, and wear leveling, to make PCM a viable dream alternative for scalable main memories.

...read moreread less

Abstract: Phase-change may enable continued scaling of main memories, but PCM has higher access latencies, incurs higher power costs, and wears out more quickly than DRAM. This article discusses how to mitigate these limitations through buffer sizing, row caching, write reduction, and wear leveling, to make PCM a viable dream alternative for scalable main memories.

...read moreread less

454 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Machine learning

[...]

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

...read moreread less

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

...read moreread less

13,246 citations

Journal Article•DOI•

Roofline: an insightful visual performance model for multicore architectures

[...]

Samuel Williams¹, Andrew Waterman², David A. Patterson²•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

01 Apr 2009-Communications of The ACM

TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.

...read moreread less

Abstract: The Roofline model offers insight on how to improve the performance of software and hardware.

...read moreread less

2,181 citations

Journal Article•DOI•

Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

[...]

Yu-Hsin Chen¹, Tushar Krishna¹, Joel Emer¹, Vivienne Sze¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 2017-IEEE Journal of Solid-state Circuits

TL;DR: Eyeriss as mentioned in this paper is an accelerator for state-of-the-art deep convolutional neural networks (CNNs) that optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, by reconfiguring the architecture.

...read moreread less

Abstract: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring the architecture. CNNs are widely used in modern AI systems but also bring challenges on throughput and energy efficiency to the underlying hardware. This is because its computation requires a large amount of data, creating significant data movement from on-chip and off-chip that is more energy-consuming than computation. Minimizing data movement energy cost for any CNN shape, therefore, is the key to high throughput and energy efficiency. Eyeriss achieves these goals by using a proposed processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements. RS dataflow reconfigures the computation mapping of a given shape, which optimizes energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses. Compression and data gating are also applied to further improve energy efficiency. Eyeriss processes the convolutional layers at 35 frames/s and 0.0029 DRAM access/multiply and accumulation (MAC) for AlexNet at 278 mW (batch size $N = 4$ ), and 0.7 frames/s and 0.0035 DRAM access/MAC for VGG-16 at 236 mW ( $N = 3$ ).

...read moreread less

2,165 citations

Book Chapter•DOI•

A User’s Guide

[...]

Robert S. Leiken

30 Dec 2011

TL;DR: This table lists the most common surnames in the United States used to be Anglicised as "United States", then changed to "United Kingdom" in the 1990s.

...read moreread less

Abstract: OUTPU T 29 OUTPU T 30 OUTPU T 31 OUTPU T 32 OUTPU T 25 OUTPU T 26 OUTPU T 27 OUTPU T 28 OUTPU T 21 OUTPU T 22 OUTPU T 23 OUTPU T 24 OUTPU T 17 OUTPU T 18 OUTPU T 19 OUTPU T 20 OUTPU T 13 OUTPU T 14 OUTPU T 15 OUTPU T 16 OUTPU T 9 OUTPU T 10 OUTPU T 11 OUTPU T 12 OUTPU T 5 OUTPU T 6 OUTPU T 7 OUTPU T 8 OUTPU T 1 OUTPU T 2 OUTPU T 3 OUTPU T 4 29 30 31 32 25 26 27 28 21 22 23 24 17 18 19 20 13 14 15 16 9

...read moreread less

1,662 citations

Proceedings Article•DOI•

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

[...]

Tianshi Chen¹, Zidong Du¹, Ninghui Sun¹, Jia Wang¹, Chengyong Wu¹, Yunji Chen¹, Olivier Temam² - Show less +3 more•Institutions (2)

Chinese Academy of Sciences¹, French Institute for Research in Computer Science and Automation²

24 Feb 2014

TL;DR: This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.

...read moreread less

Abstract: Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

...read moreread less

1,582 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse