Author

Brian Greskamp

Bio: Brian Greskamp is an academic researcher from the University of Illinois at Urbana–Champaign. The author has contributed to research in topics including Microarchitecture and Iterative design. The author has an h-index of 8 and has co-authored 12 publications receiving 830 citations.

Papers
Journal ArticleDOI
TL;DR: In this paper, a microarchitecture-aware model for process variation is proposed, including both random and systematic effects, and the model is specified using a small number of highly intuitive parameters.
Abstract: Within-die parameter variation poses a major challenge to high-performance microprocessor design, negatively impacting a processor's frequency and leakage power. Addressing this problem, this paper proposes a microarchitecture-aware model for process variation, including both random and systematic effects. The model is specified using a small number of highly intuitive parameters. Using the variation model, this paper also proposes a framework to model timing errors caused by parameter variation. The model yields the failure rate of microarchitectural blocks as a function of clock frequency and the amount of variation. With the combination of the variation model and the error model, we have VARIUS, a comprehensive model that is capable of producing detailed statistics of timing errors as a function of different process parameters and operating conditions. We propose possible applications of VARIUS to microarchitectural research.
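A minimal sketch, in Python, of the kind of analysis such a model enables, assuming a toy alpha-power delay model and invented variation parameters (this is not the published VARIUS formulation): it samples per-block threshold-voltage variation with random and systematic components and estimates the chance that some block misses timing at a given frequency.

# Illustrative sketch only: a toy within-die variation model in the spirit of
# VARIUS. Parameter values and the alpha-power delay model are placeholder
# assumptions, not the published model.
import numpy as np

rng = np.random.default_rng(0)

VT_NOM = 0.30        # nominal threshold voltage (V), assumed
VDD = 1.0            # supply voltage (V), assumed
SIGMA_RAND = 0.015   # random Vt variation (V), assumed
SIGMA_SYS = 0.015    # systematic Vt variation (V), assumed
ALPHA = 1.3          # alpha-power-law exponent, assumed

def sample_block_delays(n_blocks, n_samples):
    """Sample relative critical-path delays for n_blocks under variation."""
    # Systematic component: one draw per block per sampled chip.
    vt_sys = rng.normal(0.0, SIGMA_SYS, size=(n_samples, n_blocks))
    # Random component: independent residual per block per sampled chip.
    vt_rand = rng.normal(0.0, SIGMA_RAND, size=(n_samples, n_blocks))
    vt = VT_NOM + vt_sys + vt_rand
    # Alpha-power law: delay ~ Vdd / (Vdd - Vt)^alpha, normalized to nominal.
    delay = (VDD / (VDD - vt) ** ALPHA) / (VDD / (VDD - VT_NOM) ** ALPHA)
    return delay

def failure_rate(freq_ratio, n_blocks=16, n_samples=20000):
    """Fraction of sampled chips with at least one block slower than the
    clock period implied by freq_ratio (frequency relative to nominal)."""
    delays = sample_block_delays(n_blocks, n_samples)
    period = 1.0 / freq_ratio          # nominal delay budget is 1.0
    chip_fails = (delays > period).any(axis=1)
    return chip_fails.mean()

if __name__ == "__main__":
    for f in (0.95, 1.00, 1.05, 1.10):
        print(f"freq x{f:.2f}: failure rate {failure_rate(f):.3f}")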

386 citations

Proceedings ArticleDOI
08 Nov 2008
TL;DR: An effective technique to maximize performance and minimize power in the presence of variation-induced errors, namely High-Dimensional dynamic adaptation, is introduced; it increases processor frequency by 56% on average, allowing the processor to cycle 21% faster than without variation.

Abstract: Parameter variation in integrated circuits causes sections of a chip to be slower than others. If, to prevent any resulting timing errors, we design processors for worst-case parameter values, we may lose substantial performance. An alternate approach explored in this paper is to design for values closer to nominal and provide some transistor budget to tolerate the unavoidable variation-induced errors. To assess this approach, this paper first presents a novel framework that shows how microarchitecture techniques can trade off variation-induced errors for power and processor frequency. Then, the paper introduces an effective technique to maximize performance and minimize power in the presence of variation-induced errors, namely High-Dimensional dynamic adaptation. For efficiency, the technique is implemented using a machine-learning algorithm. The results show that our best configuration increases processor frequency by 56% on average, allowing the processor to cycle 21% faster than without variation. Processor performance increases by 40% on average, resulting in a performance that is 14% higher than without variation, at only a 10.6% area cost.
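A hedged sketch of the adaptation idea, assuming just two knobs (relative frequency and supply voltage) and toy power, error, and performance models; the actual technique adapts many more dimensions and drives the search with a machine-learning algorithm rather than the exhaustive scan below.

# Illustrative sketch of dynamic adaptation under variation-induced errors:
# search several knobs at once to maximize performance under power and
# error-rate budgets. All models and constants below are placeholder
# assumptions, not the paper's formulation.
import itertools

FREQ_STEPS = [0.9, 1.0, 1.1, 1.2]     # relative frequency, assumed
VDD_STEPS = [0.9, 1.0, 1.1]           # relative supply voltage, assumed
POWER_BUDGET = 1.3                    # relative to nominal, assumed
ERROR_BUDGET = 0.02                   # tolerable error rate, assumed

def error_rate(freq, vdd):
    # Toy model: errors rise once frequency outpaces the voltage headroom.
    margin = vdd * 1.15 - freq
    return 0.0 if margin >= 0 else min(1.0, -margin * 0.5)

def power(freq, vdd):
    # Toy dynamic-power model: P ~ f * V^2.
    return freq * vdd ** 2

def performance(freq, err):
    # Each error costs a fixed recovery penalty, assumed.
    return freq * (1.0 - 10.0 * err)

def best_config():
    best = None
    for freq, vdd in itertools.product(FREQ_STEPS, VDD_STEPS):
        err = error_rate(freq, vdd)
        if power(freq, vdd) > POWER_BUDGET or err > ERROR_BUDGET:
            continue
        perf = performance(freq, err)
        if best is None or perf > best[0]:
            best = (perf, freq, vdd, err)
    return best

if __name__ == "__main__":
    perf, freq, vdd, err = best_config()
    print(f"chosen f={freq}, Vdd={vdd}, err={err:.3f}, perf={perf:.3f}")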

98 citations

Proceedings ArticleDOI
15 Sep 2007
TL;DR: This paper presents the Paceline leader-checker microarchitecture, in which a leader core runs the thread at higher-than-rated frequency while passing execution hints and prefetches to a safely-clocked checker core in the same chip multiprocessor.
Abstract: Under current worst-case design practices, manufacturers specify conservative values for processor frequencies in order to guarantee correctness. To recover some of the lost performance and improve single-thread performance, this paper presents the Paceline leader-checker microarchitecture. In Paceline, a leader core runs the thread at higher-than-rated frequency, while passing execution hints and prefetches to a safely-clocked checker core in the same chip multiprocessor. The checker redundantly executes the thread faster than without the leader, while checking the results to guarantee correctness. Leader and checker cores periodically swap functionality. The result is that the thread improves performance substantially without significantly increasing the power density or the hardware design complexity of the chip. By overclocking the leader by 30%, we estimate that Paceline improves SPECint and SPECfp performance by a geometric mean of 21% and 9%, respectively. Moreover, Paceline also provides tolerance to transient faults such as soft errors.
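A hedged, purely illustrative sketch of the leader/checker control flow, assuming chunked execution, state comparison, and rollback with an invented leader error probability; it omits the hints, prefetches, and checksum-based checking that Paceline actually relies on.

# Illustrative sketch of a leader/checker organization in the spirit of
# Paceline. The chunking, state comparison, and error injection below are
# placeholder assumptions, not the paper's microarchitecture.
import random

random.seed(1)

def run_chunk(state, overclocked):
    """Toy 'execution' of one chunk; the leader may corrupt state when
    overclocked (emulating a timing error)."""
    new_state = state + 1
    if overclocked and random.random() < 0.05:   # assumed error probability
        new_state += random.randint(1, 3)        # corrupted result
    return new_state

def paceline(n_chunks=100):
    arch_state = 0                    # last verified (checkpointed) state
    rollbacks = 0
    for _ in range(n_chunks):
        leader_state = run_chunk(arch_state, overclocked=True)
        checker_state = run_chunk(arch_state, overclocked=False)
        if leader_state == checker_state:
            arch_state = checker_state            # commit the chunk
        else:
            rollbacks += 1                        # squash and re-execute safely
            arch_state = run_chunk(arch_state, overclocked=False)
    return arch_state, rollbacks

if __name__ == "__main__":
    state, rollbacks = paceline()
    print(f"final state {state}, rollbacks {rollbacks}")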

89 citations

Proceedings ArticleDOI
06 Mar 2009
TL;DR: This paper presents a new approach where the processor itself is designed from the ground up for Timing Speculation, and introduces two techniques that, when applied under BlueShift, improve processor performance: On-demand Selective Biasing (OSB) and Path Constraint Tuning (PCT).
Abstract: Several recent processor designs have proposed to enhance performance by increasing the clock frequency to the point where timing faults occur, and by adding error-correcting support to guarantee correctness. However, such Timing Speculation (TS) proposals are limited in that they assume traditional design methodologies that are suboptimal under TS. In this paper, we present a new approach where the processor itself is designed from the ground up for TS. The idea is to identify and optimize the most frequently-exercised critical paths in the design, at the expense of the majority of the static critical paths, which are allowed to suffer timing errors. Our approach and design optimization algorithm are called BlueShift. We also introduce two techniques that, when applied under BlueShift, improve processor performance: On-demand Selective Biasing (OSB) and Path Constraint Tuning (PCT). Our evaluation with modules from the OpenSPARC T1 processor shows that, compared to conventional TS, BlueShift with OSB speeds up applications by an average of 8% while increasing the processor power by an average of 12%. Moreover, compared to a high-performance TS design, BlueShift with PCT speeds up applications by an average of 6% with an average processor power overhead of 23%, providing a way to speed up logic modules that is orthogonal to voltage scaling.
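An illustrative sketch of the path-selection idea, assuming invented path delays and per-cycle exercise rates and a simple greedy policy: optimize the hottest violating paths until the expected dynamic timing-error rate of the remaining violators falls below a target.

# Illustrative sketch of BlueShift-style path selection: rank timing paths by
# how often they are actually exercised, then mark the hottest ones for
# optimization (e.g. tighter constraints or biasing) while letting rarely
# exercised paths miss timing. Path data and the target error rate are
# invented for illustration.
from dataclasses import dataclass

@dataclass
class Path:
    name: str
    delay: float          # ns, post-synthesis delay
    exercise_rate: float  # activations per cycle, from simulation traces

def select_paths(paths, clock_period, target_error_rate):
    """Greedily pick violating paths to optimize until the expected rate of
    dynamic timing errors (sum of exercise rates of remaining violators)
    drops below the target."""
    violators = [p for p in paths if p.delay > clock_period]
    violators.sort(key=lambda p: p.exercise_rate, reverse=True)
    chosen, residual = [], sum(p.exercise_rate for p in violators)
    for p in violators:
        if residual <= target_error_rate:
            break
        chosen.append(p)
        residual -= p.exercise_rate
    return chosen, residual

if __name__ == "__main__":
    paths = [
        Path("alu_carry", 1.25, 1e-2),
        Path("bypass_mux", 1.30, 4e-3),
        Path("rare_exception", 1.40, 1e-7),
    ]
    chosen, residual = select_paths(paths, clock_period=1.20,
                                    target_error_rate=1e-4)
    print("optimize:", [p.name for p in chosen],
          "residual error rate:", residual)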

89 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: This paper proposes Dynamic Voltage Scaling for Aging Management (DVSAM), a new scheme for managing processor aging to attain higher performance or lower power consumption, and introduces the BubbleWrap many-core, a novel architecture that makes extensive use of DVSAM.

Abstract: Many-core scaling now faces a power wall. The gap between the number of cores that fit on a die and the number that can operate simultaneously under the power budget is rapidly increasing with technology scaling. In future designs, many of the cores may have to be dormant at any given time to meet the power budget. To push back the many-core power wall, this paper proposes Dynamic Voltage Scaling for Aging Management (DVSAM), a new scheme for managing processor aging to attain higher performance or lower power consumption. In addition, this paper introduces the BubbleWrap many-core, a novel architecture that makes extensive use of DVSAM. BubbleWrap identifies the most power-efficient set of cores in a variation-affected chip (the largest set that can be simultaneously powered on) and designates them as Throughput cores dedicated to parallel-section execution. The rest of the cores are designated as Expendable and are dedicated to accelerating sequential sections. BubbleWrap attains maximum sequential acceleration by sacrificing Expendable cores one at a time, running them at elevated supply voltage for a significantly shorter service life each, until they completely wear out and are discarded, figuratively as if popping bubbles in the bubble wrap that protects the Throughput cores. In simulated 32-core chips, BubbleWrap provides substantial improvements over a plain chip. For example, on average, one design runs fully-sequential applications at a 16% higher frequency, and fully-parallel ones with a 30% higher throughput.
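A hedged sketch of the Throughput/Expendable partitioning step, assuming invented per-core power values and a greedy admission policy under a fixed chip power budget; it does not model DVSAM or the wear-out schedule of Expendable cores.

# Illustrative sketch of the BubbleWrap core partitioning idea: under
# within-die variation each core has its own power cost, so pick the largest
# set of cores that fits the chip power budget (Throughput cores) and
# designate the remainder Expendable. Per-core power numbers and the greedy
# policy are placeholder assumptions.
import random

random.seed(42)

def partition_cores(core_powers, power_budget):
    """Greedily admit the lowest-power cores until the budget is exhausted."""
    order = sorted(range(len(core_powers)), key=lambda i: core_powers[i])
    throughput, used = [], 0.0
    for i in order:
        if used + core_powers[i] <= power_budget:
            throughput.append(i)
            used += core_powers[i]
    expendable = [i for i in range(len(core_powers)) if i not in throughput]
    return throughput, expendable, used

if __name__ == "__main__":
    # 32 cores whose power varies due to process variation (assumed values).
    powers = [random.uniform(0.8, 1.4) for _ in range(32)]
    budget = 24.0
    t_cores, e_cores, used = partition_cores(powers, budget)
    print(f"{len(t_cores)} Throughput cores ({used:.1f} W), "
          f"{len(e_cores)} Expendable cores")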

75 citations


Cited by

Journal ArticleDOI
01 Jun 2008
TL;DR: In a 20-core CMP, the combination of variation-aware application scheduling and LinOpt increases the average throughput by 12-17% and reduces the average ED2 by 30-38%, all relative to using variation-aware scheduling together with a simple extension to Intel's Foxton power management algorithm.

Abstract: Within-die process variation causes individual cores in a Chip Multiprocessor (CMP) to differ substantially in both static power consumed and maximum frequency supported. In this environment, ignoring variation effects when scheduling applications or when managing power with Dynamic Voltage and Frequency Scaling (DVFS) is suboptimal. This paper proposes variation-aware algorithms for application scheduling and power management. One such power management algorithm, called LinOpt, uses linear programming to find the best voltage and frequency levels for each of the cores in the CMP, maximizing throughput at a given power budget. In a 20-core CMP, the combination of variation-aware application scheduling and LinOpt increases the average throughput by 12-17% and reduces the average ED2 by 30-38%, all relative to using variation-aware scheduling together with a simple extension to Intel's Foxton power management algorithm.
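A hedged sketch of the linear-programming idea behind LinOpt, assuming invented per-core throughput weights, power coefficients, and frequency ranges, and a linearized frequency knob in place of discrete V/f levels; it uses scipy.optimize.linprog and is not the published formulation.

# Illustrative sketch of LinOpt-style per-core DVFS via linear programming:
# choose each core's frequency (a linearized stand-in for its V/f level) to
# maximize total throughput under a chip power cap. All coefficients below
# are invented for illustration.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
N = 20                                      # cores, as in the 20-core CMP

f_max = rng.uniform(0.85, 1.15, N)          # variation-limited max frequency
f_min = np.full(N, 0.5)                     # assumed minimum frequency
perf_w = rng.uniform(0.8, 1.2, N)           # per-core throughput weight
dyn_w = rng.uniform(0.9, 1.1, N)            # dynamic power per unit frequency
static_p = rng.uniform(0.2, 0.5, N)         # per-core static power
power_budget = 25.0                         # assumed chip budget

# Maximize sum(perf_w * f)  <=>  minimize -perf_w . f
res = linprog(
    c=-perf_w,
    A_ub=[dyn_w],                           # sum(dyn_w * f) <= dynamic budget
    b_ub=[power_budget - static_p.sum()],
    bounds=list(zip(f_min, f_max)),
    method="highs",
)

if res.success:
    print("throughput:", float(perf_w @ res.x))
    print("frequencies:", np.round(res.x, 2))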

351 citations

Proceedings ArticleDOI
11 Oct 2009
TL;DR: A novel technique called PRES (probabilistic replay via execution sketching) is proposed to help reproduce concurrency bugs on multi-processors; it significantly lowers the production-run recording overhead of previous approaches while still reproducing most tested bugs in fewer than 10 replay attempts.
Abstract: Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multi-processors (e.g., multi-core) is challenging. Previous techniques either incur large overhead or require new non-trivial hardware extensions. This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multi-processors. It relaxes the past (perhaps idealistic) objective of "reproducing the bug on the first replay attempt" to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as "sketches") during the production run, and (2) relying on an intelligent replayer during diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic space and reproduce the bug. With only partial information, our replayer may require more than one coordinated replay run to reproduce a bug. However, after a bug is reproduced once, PRES can reproduce it every time. We implemented PRES along with five different execution sketching mechanisms. We evaluated them with 11 representative applications, including 4 servers, 3 desktop/client applications, and 4 scientific/graphics applications, with 13 real-world concurrency bugs of different types, including atomicity violations, order violations and deadlocks. PRES (with synchronization or system call sketching) significantly lowered the production-run recording overhead of previous approaches (by up to 4416 times), while still reproducing most tested bugs in fewer than 10 replay attempts. Moreover, PRES scaled well with the number of processors; PRES's feedback generation from unsuccessful replays is critical in bug reproduction.
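A toy sketch of the PRES idea, assuming an invented two-operation program with an interleaving-dependent bug: only a partial record of the production run is kept, and the replayer systematically enumerates the unrecorded orderings until the observed failure reproduces.

# Illustrative sketch of the PRES idea: the production run records only a
# partial sketch (here, the order of two racy updates is left unrecorded),
# and the replayer enumerates the unrecorded interleavings until the failure
# reproduces. The toy "program" and bug are invented for illustration.
import itertools

def run(order):
    """Toy program with an interleaving-dependent bug: the result depends on
    the order of two updates to a shared counter."""
    shared = 0
    ops = {"A": lambda v: v + 1, "B": lambda v: v * 2}
    for op in order:
        shared = ops[op](shared)
    return shared

def reproduce(observed_failure_value):
    """Replayer: explore the unrecorded interleavings of ops A and B."""
    for attempt, order in enumerate(itertools.permutations("AB"), start=1):
        if run(order) == observed_failure_value:
            return attempt, order
    return None

if __name__ == "__main__":
    # The production run failed with value 1 (B ran before A), but only a
    # sketch was recorded, not the interleaving itself.
    print(reproduce(observed_failure_value=1))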

277 citations