Author

David William Boerstler

Bio: David William Boerstler is an academic researcher from IBM. The author has contributed to research in topics: Phase-locked loop & Clock signal. The author has an h-index of 24 and has co-authored 158 publications receiving 2423 citations.


Papers
Journal Article
TL;DR: A global clock distribution strategy implemented on several microprocessor chips is described, which consists of buffered, tunable tree networks, with the final trees all driving a common grid.
Abstract: A global clock distribution strategy used on several microprocessor chips is described. The clock network consists of buffered tunable trees or treelike networks, with the final level of trees all driving a single common grid covering most of the chip. This topology combines advantages of both trees and grids. A new tuning method was required to efficiently tune such a large strongly connected interconnect network consisting of up to 6 m of wire and modeled with 50000 resistors, capacitors, and inductors. Variations are described to handle different floor-planning styles. Global clock skew as low as 22 ps on large microprocessor chips has been measured.

311 citations

Journal Article
TL;DR: In this paper, the design challenges that current and future processors must face, with stringent power limits, high-frequency targets, and the continuing system integration trends, are reviewed, and a first-generation Cell processor is described.
Abstract: This paper reviews the design challenges that current and future processors must face, with stringent power limits, high-frequency targets, and the continuing system integration trends. This paper then describes the architecture, circuit design, and physical implementation of a first-generation Cell processor and the design techniques used to overcome the above challenges. A Cell processor consists of a 64-bit Power Architecture processor coupled with multiple synergistic processors, a flexible IO interface, and a memory interface controller that supports multiple operating systems including Linux. This multi-core SoC, implemented in 90-nm SOI technology, achieved a high clock rate by maximizing custom circuit design while maintaining reasonable complexity through design modularity and reuse.

258 citations

Proceedings Article
06 Mar 2014
TL;DR: This paper presents an iVRM system developed for the POWER8™ microprocessor, which functions as a very fast, accurate low-dropout regulator (LDO), with 90.5% peak power efficiency (only 3.1% worse than an ideal LDO).
Abstract: Integrated voltage regulator modules (iVRMs) [1] provide a cost-effective path to realizing per-core dynamic voltage and frequency scaling (DVFS), which can be used to optimize the performance of a power-constrained multi-core processor. This paper presents an iVRM system developed for the POWER8™ microprocessor, which functions as a very fast, accurate low-dropout regulator (LDO), with 90.5% peak power efficiency (only 3.1% worse than an ideal LDO). At low output voltages, efficiency is reduced but still sufficient to realize beneficial energy savings with DVFS. Each iVRM features a bypass mode so that some of the cores can be operated at maximum performance with no regulator loss. With the iVRM area including the input decoupling capacitance (DCAP) (but not the output DCAP inherent to the cores), the iVRMs achieve a power density of 34.5 W/mm², which exceeds that of inductor-based or SC converters by at least 3.4× [2].

119 citations

Journal Article
David William Boerstler
TL;DR: A phase-locked loop (PLL) clock generator/phase aligner for the POWER3 microprocessor has been designed using a 2.5-V, 0.40-µm digital CMOS6S process.
Abstract: A fully integrated, phase-locked loop (PLL) clock generator/phase aligner for the POWER3 microprocessor has been designed using a 2.5-V, 0.40-µm digital CMOS6S process. The PLL design supports multiple integer and noninteger frequency multiplication factors for both the processor clock and an L2 cache clock. The fully differential delay-interpolating voltage-controlled oscillator (VCO) is tunable over a frequency range determined by programmable frequency limit settings, enhancing yield and application flexibility. PLL lock range for the maximum VCO frequency range settings is 340-612 MHz. The charge-pump current is programmable for additional control of the PLL loop dynamics. A differential on-chip loop filter with common-mode correction improves noise rejection. Cycle-cycle jitter measurements with the microprocessor actively executing instructions were 10.0 ps rms, 80 ps peak to peak (P-P) measured from the clock tree. Cycle-cycle jitter measured for the processor in a reset state with the clock tree active was 8.4 ps rms, 62 ps P-P. PLL area is 1040 × 640 µm². Power dissipation is <100 mW.

99 citations

Journal Article
TL;DR: POWER8™ is a 12-core processor fabricated in IBM's 22 nm SOI technology with core and cache improvements driven by big data applications, providing 2.5× socket performance over POWER7+™, and power efficiency is improved with several techniques.
Abstract: POWER8™ is a 12-core processor fabricated in IBM's 22 nm SOI technology with core and cache improvements driven by big data applications, providing 2.5× socket performance over POWER7+™. Core throughput is supported by 7.6 Tb/s of off-chip I/O bandwidth, which is provided by three primary interfaces, including two new variants of Elastic Interface as well as embedded PCI Gen-3. Power efficiency is improved with several techniques. An on-chip controller based on an embedded PowerPC™ 405 processor applies per-core DVFS by adjusting DPLLs and fully integrated voltage regulators. Each voltage regulator is a highly distributed system of digitally controlled microregulators, which achieves a peak power efficiency of 90.5%. A wide frequency range resonant clock design is used in 13 clock meshes and demonstrates a minimum power savings of 4%. Power and delay efficiency is achieved through the use of pulsed-clock latches, which require statistical validation to ensure robust yield.

50 citations


Cited by
Journal Article
Y. Hoskote, Sriram R. Vangal, A. Singh, Nitin Borkar, S. Borkar
TL;DR: A multicore processor in 65-nm technology with 80 single-precision, floating-point cores delivers performance in excess of a Teraflops while consuming less than 100 W.
Abstract: A multicore processor in 65-nm technology with 80 single-precision, floating-point cores delivers performance in excess of a Teraflops while consuming less than 100 W. A 2D on-die mesh interconnection network operating at 5 GHz provides the high-performance communication fabric to connect the cores. The network delivers a bisection bandwidth of 2.56 Terabits per second and a per hop fall-through latency of 1 nanosecond.

658 citations

Proceedings Article
09 May 2012
TL;DR: DSENT, a NoC modeling tool for rapid design space exploration of electrical and opto-electrical networks, is presented and the results show the implications of different technology scenarios and the need to reduce laser and thermal tuning power in a photonic network due to their non-data-dependent nature.
Abstract: With the rise of many-core chips that require substantial bandwidth from the network on chip (NoC), integrated photonic links have been investigated as a promising alternative to traditional electrical interconnects. While numerous opto-electronic NoCs have been proposed, evaluations of photonic architectures have thus far had to use a number of simplifications, reflecting the need for a modeling tool that accurately captures the tradeoffs for the emerging technology and its impacts on the overall network. In this paper, we present DSENT, a NoC modeling tool for rapid design space exploration of electrical and opto-electrical networks. We explain our modeling framework and perform an energy-driven case study, focusing on electrical technology scaling, photonic parameters, and thermal tuning. Our results show the implications of different technology scenarios and, in particular, the need to reduce laser and thermal tuning power in a photonic network due to their non-data-dependent nature.

529 citations

Proceedings Article
24 Oct 2008
TL;DR: Regional Congestion Awareness (RCA) is proposed, a lightweight technique to improve global network balance that informs the routing policy of congestion in parts of the network beyond adjacent routers.
Abstract: Interconnection networks-on-chip (NOCs) are rapidly replacing other forms of interconnect in chip multiprocessors and system-on-chip designs. Existing interconnection networks use either oblivious or adaptive routing algorithms to determine the route taken by a packet to its destination. Despite somewhat higher implementation complexity, adaptive routing enjoys better fault tolerance characteristics, increases network throughput, and decreases latency compared to oblivious policies when faced with non-uniform or bursty traffic. However, adaptive routing can hurt performance by disturbing any inherent global load balance through greedy local decisions. To improve load balance in adaptive routing, we propose Regional Congestion Awareness (RCA), a lightweight technique to improve global network balance. Instead of relying solely on local congestion information, RCA informs the routing policy of congestion in parts of the network beyond adjacent routers. Our experiments show that RCA matches or exceeds the performance of conventional adaptive routing across all workloads examined, with a 16% average and 71% maximum latency reduction on SPLASH-2 benchmarks running on a 49-core CMP. Compared to a baseline adaptive router, RCA incurs a negligible logic and modest wiring overhead.

409 citations

Patent
05 Mar 2003
TL;DR: An Internet advertisement listings provider distributes advertisements in a bid-for-placement arrangement based on the revenue efficiency of the advertisements, where the revenue to the advertising distribution system is calculated by multiplying the click-through rate by the bid amount for each click-through.
Abstract: An Internet advertisement listings provider distributes advertisements in a bid-for-placement arrangement based on the revenue efficiency of the advertisements from the bidding advertisers, calculating the revenue to the advertising distribution system by multiplying the click-through rate by the bid amount for each click-through. Advertisers may be allowed to provide multiple advertisements to enable the advertisement listings provider to select from those various advertisements for inclusion in ranked listings based on a determined efficiency among the advertisements. The system also determines the most efficient grouping of advertisements for a limited-space output, comparing groupings of advertisements to other groups to determine the greater revenue to the distribution system.

397 citations
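
The revenue-efficiency ranking described in the patent abstract above (click-through rate multiplied by bid amount) can be illustrated with a minimal Python sketch. The `Ad` class, the function names, and the sample bids and click-through rates are hypothetical and chosen only for illustration; the patent's actual method may differ, and this sketch ranks individual ads rather than comparing whole groupings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    """Hypothetical ad record; field names are assumptions, not from the patent."""
    name: str
    bid: float  # amount the advertiser pays per click-through
    ctr: float  # estimated click-through rate, between 0 and 1


def expected_revenue(ad: Ad) -> float:
    """Revenue per impression: click-through rate times bid amount."""
    return ad.ctr * ad.bid


def rank_ads(ads: list[Ad], slots: int) -> list[Ad]:
    """Return the `slots` ads with the highest expected revenue, best first."""
    return sorted(ads, key=expected_revenue, reverse=True)[:slots]


# Made-up sample data: the highest bidder is not the most revenue-efficient.
ads = [
    Ad("A", bid=2.00, ctr=0.01),
    Ad("B", bid=0.50, ctr=0.08),
    Ad("C", bid=1.00, ctr=0.03),
]
top = rank_ads(ads, slots=2)
print([ad.name for ad in top])
```

With these sample numbers, ad A has the highest bid but the lowest expected revenue per impression, so a ranking by revenue efficiency places it below B and C.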