Author

David Black-Schaffer

Bio: David Black-Schaffer is an academic researcher from Uppsala University. The author has contributed to research in topics including caches and cache algorithms. The author has an h-index of 17 and has co-authored 78 publications receiving 913 citations. Previous affiliations of David Black-Schaffer include Stanford University and the University of Michigan.


Papers
Journal ArticleDOI
TL;DR: An efficient programmable architecture for compute-intensive embedded applications that uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data.
Abstract: We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23x greater than that of an embedded RISC processor.

72 citations

Proceedings ArticleDOI
13 Sep 2011
TL;DR: A low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity is presented.
Abstract: We present a low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity. The method is implemented on real hardware, with no modifications to the application or operating system. We accomplish this by co-running a Pirate application that "steals" cache space with the Target application. By adjusting how much space the Pirate steals during the Target's execution, and using hardware performance counters to record the Target's performance, we can accurately and efficiently capture performance data for the Target application as a function of its available shared cache. At the same time we use performance counters to monitor the Pirate to ensure that it is successfully stealing the desired amount of cache. To evaluate this approach, we show that 1) the cache available to the Target behaves as expected, 2) the Pirate steals the desired amount of cache, and 3) the Pirate does not bias the Target's performance. As a result, we are able to accurately measure the Target's performance while stealing up to an average of 6.8MB of the 8MB of cache on our Nehalem-based test system with an average measurement overhead of only 5.5%.
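A minimal sketch of the cache-stealing idea (not the authors' implementation): a Pirate process repeatedly walks a working set whose size determines roughly how much of the shared cache it keeps occupied while the Target runs on another core. The buffer size, the 64-byte line size, and the fixed (rather than run-time adjustable) steal amount are assumptions for illustration.

```c
/* Illustrative Pirate: keeps an adjustable working set resident in the
 * shared cache by touching one byte per cache line in a loop.
 * The sizes below are assumptions, not values from the paper. */
#include <stdlib.h>

#define LINE 64  /* assumed cache-line size in bytes */

int main(int argc, char **argv)
{
    /* amount of cache to "steal", e.g. 6 MB; adjusted during the Target's
       run in the real method, fixed here to keep the sketch short */
    size_t steal_bytes = (argc > 1) ? strtoull(argv[1], NULL, 0) : 6u << 20;
    size_t lines = steal_bytes / LINE;
    volatile char *buf = malloc(lines * LINE);
    if (!buf) return 1;

    for (;;) {                       /* run for as long as the Target runs */
        for (size_t i = 0; i < lines; i++)
            buf[i * LINE]++;         /* touch one byte per cache line */
    }
    /* not reached */
}
```

In the paper the stolen amount is varied over the Target's execution and verified with hardware performance counters; the sketch only shows the occupancy loop itself.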

69 citations

Proceedings ArticleDOI
24 Jan 2011
TL;DR: StatCC leverages the StatStack cache model to estimate the co-scheduled applications' cache miss ratios from their individual memory reuse distance distributions, together with a simple performance model that estimates their CPIs based on the shared cache miss ratios.
Abstract: This work presents StatCC, a simple and efficient model for estimating the shared cache miss ratios of co-scheduled applications on architectures with a hierarchy of private and shared caches. StatCC leverages the StatStack cache model to estimate the co-scheduled applications' cache miss ratios from their individual memory reuse distance distributions, and a simple performance model that estimates their CPIs based on the shared cache miss ratios. These methods are combined into a system of equations that explicitly models the CPIs in terms of the shared miss ratios and can be solved to determine both. The result is a fast algorithm with a 2% error across the SPEC CPU2006 benchmark suite compared to a simulated in-order processor and a hierarchy of private and shared caches.
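The coupling the abstract describes can be illustrated with a small fixed-point iteration. This is a simplified stand-in of my own, not the StatCC equations: each application's CPI grows with its shared-cache miss ratio, while its effective share of the cache, and hence its miss ratio, depends on how fast both applications issue memory accesses, which in turn depends on their CPIs. The miss-ratio curve and all constants below are made up for illustration.

```c
/* Illustrative fixed point in the spirit of StatCC (not its actual model):
 * CPI_i = CPI_base_i + penalty * mpi_i * miss_i, where each miss ratio
 * depends on the cache share the application effectively occupies,
 * assumed proportional to its memory access rate mpi_i / CPI_i. */
#include <stdio.h>

/* made-up miss-ratio curve: a larger effective cache share means fewer misses */
static double miss_curve(double cache_share) {
    return 0.30 * (1.0 - cache_share) + 0.02;
}

int main(void)
{
    const double cpi_base[2] = {0.8, 1.0};   /* assumed base CPIs              */
    const double mpi[2]      = {0.30, 0.20}; /* memory accesses per instruction */
    const double penalty     = 200.0;        /* assumed miss penalty in cycles  */
    double cpi[2] = {cpi_base[0], cpi_base[1]};

    for (int it = 0; it < 50; it++) {        /* iterate to a fixed point */
        double rate0 = mpi[0] / cpi[0], rate1 = mpi[1] / cpi[1];
        double share0 = rate0 / (rate0 + rate1);   /* app 0's cache share */
        double miss0 = miss_curve(share0);
        double miss1 = miss_curve(1.0 - share0);
        cpi[0] = cpi_base[0] + penalty * mpi[0] * miss0;
        cpi[1] = cpi_base[1] + penalty * mpi[1] * miss1;
    }
    printf("CPI0=%.2f CPI1=%.2f\n", cpi[0], cpi[1]);
    return 0;
}
```

StatCC itself derives the per-application miss ratios from StatStack's reuse distance distributions rather than from a hand-drawn curve, but the mutual dependence between CPIs and shared miss ratios, and the idea of solving for both simultaneously, is the same.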

56 citations

Proceedings ArticleDOI
23 Feb 2013
TL;DR: The Bandwidth Bandit is introduced, a general, quantitative profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines that accurately captures the measured application's performance as a function of its available memory bandwidth and makes it possible to determine how much the application suffers when its available bandwidth is reduced.
Abstract: On multicore processors, co-executing applications compete for shared resources, such as cache capacity and memory bandwidth. This leads to suboptimal resource allocation and can cause substantial performance loss, which makes it important to effectively manage these shared resources. This, however, requires insights into how the applications are impacted by such resource sharing. While there are several methods to analyze the performance impact of cache contention, less attention has been paid to general, quantitative methods for analyzing the impact of contention for memory bandwidth. To this end we introduce the Bandwidth Bandit, a general, quantitative profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines. The profiling data captured by the Bandwidth Bandit is presented in a bandwidth graph. This graph accurately captures the measured application's performance as a function of its available memory bandwidth, and enables us to determine how much the application suffers when its available bandwidth is reduced. To demonstrate the value of this data, we present a case study in which we use the bandwidth graph to analyze the performance impact of memory contention when co-running multiple instances of a single-threaded application.
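A rough sketch of the bandit idea, under assumptions of my own rather than the paper's code: each bandit streams through a buffer much larger than the last-level cache, so essentially every access goes to memory, and running more bandit copies (or faster ones) removes more memory bandwidth from the measured application. The buffer size and stride below are illustrative.

```c
/* Illustrative memory-bandwidth "bandit": streams over a buffer that is
 * far larger than the last-level cache, so almost every access misses and
 * consumes memory bandwidth. Co-running more copies steals more bandwidth.
 * Buffer size and stride are assumptions for illustration. */
#include <stdlib.h>

#define STRIDE 64                       /* assumed cache-line size */

int main(void)
{
    size_t bytes = 256u << 20;          /* 256 MB, assumed much larger than the LLC */
    volatile char *buf = malloc(bytes);
    if (!buf) return 1;

    unsigned long sink = 0;
    for (;;)                            /* run alongside the measured application */
        for (size_t i = 0; i < bytes; i += STRIDE)
            sink += buf[i];             /* one memory read per cache line */
    /* not reached */
}
```

Sweeping the number of co-running bandits while recording the measured application's performance counters is what produces the bandwidth graph the abstract refers to.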

53 citations


Cited by
Journal ArticleDOI
Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, D. Glasco
TL;DR: The capabilities of state-of-the-art GPU-based high-throughput computing systems are discussed and the challenges to scaling single-chip parallel-computing systems are considered, highlighting high-impact areas that the computing research community can address.
Abstract: This article discusses the capabilities of state-of-the-art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

626 citations

Proceedings ArticleDOI
19 Jun 2010
TL;DR: The sources of these performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system and exploring methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding.
Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s of fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance while coming within 3x of its energy and within comparable area.

460 citations

Proceedings ArticleDOI
15 Apr 2014
TL;DR: PALLOC is a DRAM bank-aware memory allocator that exploits the page-based virtual memory system to allocate the memory pages of each application to specific banks, thereby improving isolation on COTS multicore platforms without requiring any special hardware support.
Abstract: DRAM consists of multiple resources called banks that can be accessed in parallel and independently maintain state information. In Commercial Off-The-Shelf (COTS) multicore platforms, banks are typically shared among all cores, even though programs running on the cores do not share memory space. In this situation, memory performance is highly unpredictable due to contention in the shared banks.
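PALLOC's approach, per the summary above, is for the OS to hand each application only physical page frames whose addresses map to that application's private DRAM banks. A toy sketch of such a bank-index computation follows; the bit positions are purely illustrative, since real controllers use platform-specific and often XOR-based address mappings.

```c
/* Toy bank-index extraction for a hypothetical DRAM address mapping.
 * The bit positions are assumptions for illustration only. */
#include <stdint.h>
#include <stdio.h>

/* assume the bank index is bits 13..15 of the physical address */
static unsigned bank_of(uint64_t paddr) {
    return (unsigned)((paddr >> 13) & 0x7);
}

int main(void)
{
    /* a bank-aware allocator would give an application only frames that
       fall into its assigned banks, e.g. keep frames with bank_of() == 2 */
    uint64_t frames[] = {0x1000, 0x5000, 0x2000, 0x9000};
    for (unsigned i = 0; i < 4; i++)
        printf("frame 0x%llx -> bank %u\n",
               (unsigned long long)frames[i], bank_of(frames[i]));
    return 0;
}
```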

235 citations

Proceedings ArticleDOI
23 Jun 2013
TL;DR: The Convolution Engine, specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications, is presented and it is demonstrated that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel.
Abstract: This paper focuses on the trade-off between flexibility and efficiency in specialized computing. We observe that specialized units achieve most of their efficiency gains by tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the kernels. Hence, by identifying key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications. We present an example, the Convolution Engine (CE), specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications. CE achieves energy efficiency by capturing data reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We quantify the tradeoffs in efficiency and flexibility and demonstrate that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel. CE improves energy and area efficiency by 8-15x over a SIMD engine for most applications.
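The data-flow pattern CE targets is the familiar sliding-window convolution, in which each input pixel is reused by many neighboring windows. A plain C version of that kernel (sizes and the identity test kernel are illustrative) makes the reuse the abstract refers to concrete:

```c
/* Plain 2-D convolution: the sliding window reuses each input pixel in
 * up to K*K output computations, which is the data reuse a specialized
 * engine can capture in local storage instead of refetching from memory. */
#include <stdio.h>

#define W 8
#define H 8
#define K 3

void convolve(float in[H][W], float k[K][K], float out[H - K + 1][W - K + 1])
{
    for (int y = 0; y <= H - K; y++)
        for (int x = 0; x <= W - K; x++) {
            float acc = 0.0f;
            for (int i = 0; i < K; i++)        /* K*K window anchored at (y, x) */
                for (int j = 0; j < K; j++)
                    acc += in[y + i][x + j] * k[i][j];
            out[y][x] = acc;
        }
}

int main(void)
{
    float in[H][W];
    float kern[K][K] = {{0, 0, 0}, {0, 1, 0}, {0, 0, 0}};  /* identity kernel */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            in[y][x] = (float)(y * W + x);
    float out[H - K + 1][W - K + 1];
    convolve(in, kern, out);
    printf("out[0][0]=%.0f (expect %.0f)\n", out[0][0], in[1][1]);
    return 0;
}
```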

201 citations

Proceedings ArticleDOI
10 Mar 2004
TL;DR: This paper presents StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads, based on a probabilistic model of the cache, rather than a functional cache simulator.
Abstract: The widening memory gap reduces the performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as 10^-4. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
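The core of the probabilistic model can be sketched as follows; this is a simplified reconstruction, and the exact form may differ from the paper. For a cache of L lines with random replacement, an access whose reuse distance is d memory references misses with probability roughly 1 - (1 - 1/L)^(R*d), where R is the application's overall miss ratio, because about R*d misses occur before the data is touched again and each evicts a random line. Requiring that the average of these per-access miss probabilities equals R gives one equation in the single unknown R, which can be solved by fixed-point iteration over the sampled reuse distances:

```c
/* Sketch of a StatCache-style fixed point: solve for the miss ratio R such
 * that the average per-sample miss probability equals R. The reuse-distance
 * samples below are made up; a real run would collect them with sparse
 * hardware watchpoint sampling. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double lines = 64 * 1024;            /* assumed: 4 MB cache / 64 B lines */
    const double reuse[] = {8, 40, 120, 3000, 90000, 2.5e6};  /* made-up samples */
    const int n = (int)(sizeof reuse / sizeof reuse[0]);

    double r = 0.05;                           /* initial guess for the miss ratio */
    for (int it = 0; it < 100; it++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)            /* P(miss | reuse distance d)   */
            sum += 1.0 - pow(1.0 - 1.0 / lines, r * reuse[i]);
        r = sum / n;                           /* updated estimate of the miss ratio */
    }
    printf("estimated miss ratio: %.4f\n", r);
    return 0;
}
```

Repeating the solve for different values of L is what yields miss ratios for arbitrary cache sizes, and hence the working-set graphs, from a single profiled run.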

195 citations