Author

David Black-Schaffer

Bio: David Black-Schaffer is an academic researcher from Uppsala University. The author has contributed to research in topics including caches and cache algorithms. The author has an h-index of 17 and has co-authored 78 publications receiving 913 citations. Previous affiliations of David Black-Schaffer include Stanford University and the University of Michigan.


Papers
Journal ArticleDOI
TL;DR: An efficient programmable architecture for compute-intensive embedded applications that uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data.
Abstract: We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23x greater than that of an embedded RISC processor.

72 citations

Proceedings ArticleDOI
13 Sep 2011
TL;DR: A low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity is presented.
Abstract: We present a low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity. The method is implemented on real hardware, with no modifications to the application or operating system. We accomplish this by co-running a Pirate application that "steals" cache space with the Target application. By adjusting how much space the Pirate steals during the Target's execution, and using hardware performance counters to record the Target's performance, we can accurately and efficiently capture performance data for the Target application as a function of its available shared cache. At the same time we use performance counters to monitor the Pirate to ensure that it is successfully stealing the desired amount of cache. To evaluate this approach, we show that 1) the cache available to the Target behaves as expected, 2) the Pirate steals the desired amount of cache, and 3) the Pirate does not bias the Target's performance. As a result, we are able to accurately measure the Target's performance while stealing up to an average of 6.8MB of the 8MB of cache on our Nehalem-based test system with an average measurement overhead of only 5.5%.
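A minimal sketch of the cache-stealing idea (not the authors' implementation): a Pirate process repeatedly walks a working set whose size determines roughly how much of the shared cache it keeps occupied while the Target runs on another core. The buffer size, the 64-byte line size, and the fixed (rather than run-time adjustable) steal amount are assumptions for illustration.

```c
/* Illustrative Pirate: keeps an adjustable working set resident in the
 * shared cache by touching one byte per cache line in a loop.
 * The sizes below are assumptions, not values from the paper. */
#include <stdlib.h>

#define LINE 64  /* assumed cache-line size in bytes */

int main(int argc, char **argv)
{
    /* amount of cache to "steal", e.g. 6 MB; adjusted during the Target's
       run in the real method, fixed here to keep the sketch short */
    size_t steal_bytes = (argc > 1) ? strtoull(argv[1], NULL, 0) : 6u << 20;
    size_t lines = steal_bytes / LINE;
    volatile char *buf = malloc(lines * LINE);
    if (!buf) return 1;

    for (;;) {                       /* run for as long as the Target runs */
        for (size_t i = 0; i < lines; i++)
            buf[i * LINE]++;         /* touch one byte per cache line */
    }
    /* not reached */
}
```

In the paper the stolen amount is varied over the Target's execution and verified with hardware performance counters; the sketch only shows the occupancy loop itself.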

69 citations

Proceedings ArticleDOI
24 Jan 2011
TL;DR: StatCC leverages the StatStack cache model to estimate the co-scheduled applications' cache miss ratios from their individual memory reuse distance distributions, together with a simple performance model that estimates their CPIs based on the shared cache miss ratios.
Abstract: This work presents StatCC, a simple and efficient model for estimating the shared cache miss ratios of co-scheduled applications on architectures with a hierarchy of private and shared caches. StatCC leverages the StatStack cache model to estimate the co-scheduled applications' cache miss ratios from their individual memory reuse distance distributions, and a simple performance model that estimates their CPIs based on the shared cache miss ratios. These methods are combined into a system of equations that explicitly models the CPIs in terms of the shared miss ratios and can be solved to determine both. The result is a fast algorithm with a 2% error across the SPEC CPU2006 benchmark suite compared to a simulated in-order processor and a hierarchy of private and shared caches.
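The coupling the abstract describes can be illustrated with a small fixed-point iteration. This is a simplified stand-in of my own, not the StatCC equations: each application's CPI grows with its shared-cache miss ratio, while its effective share of the cache, and hence its miss ratio, depends on how fast both applications issue memory accesses, which in turn depends on their CPIs. The miss-ratio curve and all constants below are made up for illustration.

```c
/* Illustrative fixed point in the spirit of StatCC (not its actual model):
 * CPI_i = CPI_base_i + penalty * mpi_i * miss_i, where each miss ratio
 * depends on the cache share the application effectively occupies,
 * assumed proportional to its memory access rate mpi_i / CPI_i. */
#include <stdio.h>

/* made-up miss-ratio curve: a larger effective cache share means fewer misses */
static double miss_curve(double cache_share) {
    return 0.30 * (1.0 - cache_share) + 0.02;
}

int main(void)
{
    const double cpi_base[2] = {0.8, 1.0};   /* assumed base CPIs              */
    const double mpi[2]      = {0.30, 0.20}; /* memory accesses per instruction */
    const double penalty     = 200.0;        /* assumed miss penalty in cycles  */
    double cpi[2] = {cpi_base[0], cpi_base[1]};

    for (int it = 0; it < 50; it++) {        /* iterate to a fixed point */
        double rate0 = mpi[0] / cpi[0], rate1 = mpi[1] / cpi[1];
        double share0 = rate0 / (rate0 + rate1);   /* app 0's cache share */
        double miss0 = miss_curve(share0);
        double miss1 = miss_curve(1.0 - share0);
        cpi[0] = cpi_base[0] + penalty * mpi[0] * miss0;
        cpi[1] = cpi_base[1] + penalty * mpi[1] * miss1;
    }
    printf("CPI0=%.2f CPI1=%.2f\n", cpi[0], cpi[1]);
    return 0;
}
```

StatCC itself derives the per-application miss ratios from StatStack's reuse distance distributions rather than from a hand-drawn curve, but the mutual dependence between CPIs and shared miss ratios, and the idea of solving for both simultaneously, is the same.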

56 citations

Proceedings ArticleDOI
23 Feb 2013
TL;DR: The Bandwidth Bandit is introduced, a general, quantitative profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines that accurately captures the measured application's performance as a function of its available memory bandwidth and makes it possible to determine how much the application suffers when its available bandwidth is reduced.
Abstract: On multicore processors, co-executing applications compete for shared resources, such as cache capacity and memory bandwidth. This leads to suboptimal resource allocation and can cause substantial performance loss, which makes it important to effectively manage these shared resources. This, however, requires insights into how the applications are impacted by such resource sharing. While there are several methods to analyze the performance impact of cache contention, less attention has been paid to general, quantitative methods for analyzing the impact of contention for memory bandwidth. To this end we introduce the Bandwidth Bandit, a general, quantitative profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines. The profiling data captured by the Bandwidth Bandit is presented in a bandwidth graph. This graph accurately captures the measured application's performance as a function of its available memory bandwidth, and enables us to determine how much the application suffers when its available bandwidth is reduced. To demonstrate the value of this data, we present a case study in which we use the bandwidth graph to analyze the performance impact of memory contention when co-running multiple instances of a single-threaded application.
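A rough sketch of the bandit idea, under assumptions of my own rather than the paper's code: each bandit streams through a buffer much larger than the last-level cache, so essentially every access goes to memory, and running more bandit copies (or faster ones) removes more memory bandwidth from the measured application. The buffer size and stride below are illustrative.

```c
/* Illustrative memory-bandwidth "bandit": streams over a buffer that is
 * far larger than the last-level cache, so almost every access misses and
 * consumes memory bandwidth. Co-running more copies steals more bandwidth.
 * Buffer size and stride are assumptions for illustration. */
#include <stdlib.h>

#define STRIDE 64                       /* assumed cache-line size */

int main(void)
{
    size_t bytes = 256u << 20;          /* 256 MB, assumed much larger than the LLC */
    volatile char *buf = malloc(bytes);
    if (!buf) return 1;

    unsigned long sink = 0;
    for (;;)                            /* run alongside the measured application */
        for (size_t i = 0; i < bytes; i += STRIDE)
            sink += buf[i];             /* one memory read per cache line */
    /* not reached */
}
```

Sweeping the number of co-running bandits while recording the measured application's performance counters is what produces the bandwidth graph the abstract refers to.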

53 citations


Cited by
Journal ArticleDOI
Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, D. Glasco
TL;DR: The capabilities of state-of-the-art GPU-based high-throughput computing systems are discussed and the challenges to scaling single-chip parallel-computing systems are considered, highlighting high-impact areas that the computing research community can address.
Abstract: This article discusses the capabilities of state-of-the-art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

626 citations

Proceedings ArticleDOI
19 Jun 2010
TL;DR: The sources of these performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system and exploring methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding.
Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s of fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance while coming within 3x of its energy and within comparable area.

460 citations

Proceedings ArticleDOI
15 Apr 2014
TL;DR: PALLOC is a DRAM bank-aware memory allocator that exploits the page-based virtual memory system to allocate the memory pages of each application to specific banks, thereby improving isolation on COTS multicore platforms without requiring any special hardware support.
Abstract: DRAM consists of multiple resources called banks that can be accessed in parallel and independently maintain state information. In Commercial Off-The-Shelf (COTS) multicore platforms, banks are typically shared among all cores, even though programs running on the cores do not share memory space. In this situation, memory performance is highly unpredictable due to contention in the shared banks.
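PALLOC's approach, per the summary above, is for the OS to hand each application only physical page frames whose addresses map to that application's private DRAM banks. A toy sketch of such a bank-index computation follows; the bit positions are purely illustrative, since real controllers use platform-specific and often XOR-based address mappings.

```c
/* Toy bank-index extraction for a hypothetical DRAM address mapping.
 * The bit positions are assumptions for illustration only. */
#include <stdint.h>
#include <stdio.h>

/* assume the bank index is bits 13..15 of the physical address */
static unsigned bank_of(uint64_t paddr) {
    return (unsigned)((paddr >> 13) & 0x7);
}

int main(void)
{
    /* a bank-aware allocator would give an application only frames that
       fall into its assigned banks, e.g. keep frames with bank_of() == 2 */
    uint64_t frames[] = {0x1000, 0x5000, 0x2000, 0x9000};
    for (unsigned i = 0; i < 4; i++)
        printf("frame 0x%llx -> bank %u\n",
               (unsigned long long)frames[i], bank_of(frames[i]));
    return 0;
}
```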

235 citations

Proceedings ArticleDOI
23 Jun 2013
TL;DR: The Convolution Engine, specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications, is presented and it is demonstrated that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel.
Abstract: This paper focuses on the trade-off between flexibility and efficiency in specialized computing. We observe that specialized units achieve most of their efficiency gains by tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the kernels. Hence, by identifying key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications. We present an example, the Convolution Engine (CE), specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications. CE achieves energy efficiency by capturing data reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We quantify the tradeoffs in efficiency and flexibility and demonstrate that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel. CE improves energy and area efficiency by 8-15x over a SIMD engine for most applications.
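The data-flow pattern CE targets is the familiar sliding-window convolution, in which each input pixel is reused by many neighboring windows. A plain C version of that kernel (sizes and the identity test kernel are illustrative) makes the reuse the abstract refers to concrete:

```c
/* Plain 2-D convolution: the sliding window reuses each input pixel in
 * up to K*K output computations, which is the data reuse a specialized
 * engine can capture in local storage instead of refetching from memory. */
#include <stdio.h>

#define W 8
#define H 8
#define K 3

void convolve(float in[H][W], float k[K][K], float out[H - K + 1][W - K + 1])
{
    for (int y = 0; y <= H - K; y++)
        for (int x = 0; x <= W - K; x++) {
            float acc = 0.0f;
            for (int i = 0; i < K; i++)        /* K*K window anchored at (y, x) */
                for (int j = 0; j < K; j++)
                    acc += in[y + i][x + j] * k[i][j];
            out[y][x] = acc;
        }
}

int main(void)
{
    float in[H][W];
    float kern[K][K] = {{0, 0, 0}, {0, 1, 0}, {0, 0, 0}};  /* identity kernel */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            in[y][x] = (float)(y * W + x);
    float out[H - K + 1][W - K + 1];
    convolve(in, kern, out);
    printf("out[0][0]=%.0f (expect %.0f)\n", out[0][0], in[1][1]);
    return 0;
}
```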

201 citations

Proceedings ArticleDOI
10 Mar 2004
TL;DR: This paper presents StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads, based on a probabilistic model of the cache, rather than a functional cache simulator.
Abstract: The widening memory gap reduces the performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as 10^-4. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
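The core of the probabilistic model can be sketched as follows; this is a simplified reconstruction, and the exact form may differ from the paper. For a cache of L lines with random replacement, an access whose reuse distance is d memory references misses with probability roughly 1 - (1 - 1/L)^(R*d), where R is the application's overall miss ratio, because about R*d misses occur before the data is touched again and each evicts a random line. Requiring that the average of these per-access miss probabilities equals R gives one equation in the single unknown R, which can be solved by fixed-point iteration over the sampled reuse distances:

```c
/* Sketch of a StatCache-style fixed point: solve for the miss ratio R such
 * that the average per-sample miss probability equals R. The reuse-distance
 * samples below are made up; a real run would collect them with sparse
 * hardware watchpoint sampling. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double lines = 64 * 1024;            /* assumed: 4 MB cache / 64 B lines */
    const double reuse[] = {8, 40, 120, 3000, 90000, 2.5e6};  /* made-up samples */
    const int n = (int)(sizeof reuse / sizeof reuse[0]);

    double r = 0.05;                           /* initial guess for the miss ratio */
    for (int it = 0; it < 100; it++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)            /* P(miss | reuse distance d)   */
            sum += 1.0 - pow(1.0 - 1.0 / lines, r * reuse[i]);
        r = sum / n;                           /* updated estimate of the miss ratio */
    }
    printf("estimated miss ratio: %.4f\n", r);
    return 0;
}
```

Repeating the solve for different values of L is what yields miss ratios for arbitrary cache sizes, and hence the working-set graphs, from a single profiled run.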

195 citations