Author

Brian W. Curran

Bio: Brian W. Curran is an academic researcher from IBM. The author has contributed to research in topics: Microprocessor & Instruction register. The author has an h-index of 18 and has co-authored 72 publications receiving 1033 citations.


Papers
Proceedings ArticleDOI
18 Jun 2007
TL;DR: The POWER6™ microprocessor combines ultra-high frequency operation, aggressive power reduction, a highly scalable memory subsystem, and mainframe-like reliability, availability, and serviceability.
Abstract: The POWER6™ microprocessor combines ultra-high frequency operation, aggressive power reduction, a highly scalable memory subsystem, and mainframe-like reliability, availability, and serviceability. The 341mm², 700M-transistor dual-core microprocessor is fabricated in a 65nm SOI process with 10 levels of low-k copper interconnect. It operates at clock frequencies over 5GHz in high-performance applications, and consumes under 100W in power-sensitive applications.

120 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers by employing a dataflow architecture and an on-chip scratchpad hierarchy.
Abstract: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.
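The quoted peaks follow directly from the clock frequency and the number of operations completed per cycle. The sketch below is purely illustrative: the per-cycle operation counts are back-derived from the published peak figures, not taken from the paper.

```python
# Back-of-the-envelope check: peak throughput = ops per cycle x clock frequency.
# The per-cycle operation counts below are back-derived from the published peaks
# (1.5 TFLOPS fp16, 12 TOPS ternary, 24 TOPS binary at 1.5 GHz), not from the paper.

FREQ_HZ = 1.5e9  # 1.5 GHz prototype clock

ops_per_cycle = {
    "fp16": 1_000,     # 1_000 * 1.5e9 = 1.5 TFLOPS
    "ternary": 8_000,  # 8_000 * 1.5e9 = 12 TOPS
    "binary": 16_000,  # 16_000 * 1.5e9 = 24 TOPS
}

for fmt, ops in ops_per_cycle.items():
    peak_tera_ops = ops * FREQ_HZ / 1e12
    print(f"{fmt:>7}: {peak_tera_ops:.1f} tera-ops/s peak")
```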

103 citations

Journal ArticleDOI
06 Feb 1997
TL;DR: The IBM S/390 microprocessor described by the authors operates in a 10+2-way system at frequencies up to 411 MHz (2.43 ns), with extensive use of self-resetting CMOS (SRCMOS) in the on-chip L1 cache.
Abstract: A microprocessor implementing IBM S/390 architecture operates in a 10+2 way system at frequencies up to 411 MHz (2.43 ns). The chip is fabricated in a 0.2-μm Leff CMOS technology with five layers of metal and tungsten local interconnect. The chip size is 17.35 mm × 17.30 mm with about 7.8 million transistors. The power supply is 2.5 V and measured power dissipation at 300 MHz is 37 W. The microprocessor features two instruction units (IUs), two fixed point units (FXUs), two floating point units (FPUs), a buffer control element (BCE) with a unified 64-KB L1 cache, and a register unit (RU). The microprocessor dispatches one instruction per cycle. The dual instruction, fixed, and floating point units are used to check each other to increase reliability and not for improved performance. A phase-locked loop (PLL) provides a processor clock that runs at 2× the system bus frequency. High-frequency operation was achieved through careful static circuit design and timing optimization, along with limited use of dynamic circuits for highly critical functions, and several different clocking/latching strategies for cycle time reduction. Timing-driven synthesis and placement of the control logic provided the maximum flexibility with minimum turnaround time. Extensive use of self-resetting CMOS (SRCMOS) circuits in the on-chip L1 cache provides a 2.0-ns access time and up to 500 MHz operation.
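As a quick check of the timing figures above, cycle time is the reciprocal of frequency, and the stated 2× PLL ratio fixes the implied bus frequency. The snippet below is a minimal arithmetic sketch; the implied bus frequency is derived from that ratio and is not a figure reported in the paper.

```python
# Sanity-check the quoted frequency/cycle-time pair and the 2x PLL ratio.

proc_freq_hz = 411e6                      # 411 MHz maximum operating frequency
cycle_time_ns = 1e9 / proc_freq_hz
print(f"cycle time at 411 MHz: {cycle_time_ns:.2f} ns")   # ~2.43 ns, matching the abstract

# The PLL runs the processor clock at 2x the system bus frequency, so a 411 MHz
# processor clock implies a ~205.5 MHz bus (a derived number, not one from the paper).
print(f"implied bus frequency: {proc_freq_hz / 2 / 1e6:.1f} MHz")
```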

62 citations

Journal ArticleDOI
TL;DR: This paper describes the design methodology employed in the design of the S/390® Parallel Enterprise Server G4 microprocessors; practical issues of managing the complexity of a 7.8-million-transistor design and encouraging design productivity are also introduced.
Abstract: This paper describes the design methodology employed in the design of the S/390® Parallel Enterprise Server G4 microprocessors. Issues of verifying design metrics of area, power, noise, timing, testability, and functional correctness are discussed within the context of a transistor-level custom design approach. Practical issues of managing the complexity of a 7.8-million-transistor design and encouraging design productivity are introduced.

57 citations

Proceedings ArticleDOI
13 Feb 2021
TL;DR: In this article, a 4-core AI chip in 7nm EUV technology is presented that exploits cutting-edge algorithmic advances for iso-accurate models in low-precision training and inference, together with aggressive circuit/architecture optimization, to achieve leading-edge power-performance.
Abstract: Low-precision computation is the key enabling factor to achieve high compute densities (TOPS/W and TOPS/mm²) in AI hardware accelerators across cloud and edge platforms. However, robust deep learning (DL) model accuracy equivalent to high-precision computation must be maintained. Improvements in bandwidth, architecture, and power management are also required to harness the benefit of reduced precision by feeding and supporting more parallel engines to achieve high sustained utilization and optimize performance within a given product power envelope. In this work, we present a 4-core AI chip in 7nm EUV technology that exploits cutting-edge algorithmic advances for iso-accurate models in low-precision training and inference [1, 2] and aggressive circuit/architecture optimization to achieve leading-edge power-performance. The chip supports fp16 (DLFloat16 [8]) and hybrid-fp8 (hfp8) [1] formats for training and inference of DL models, as well as int4 and int2 formats for highly scaled inference.
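To make the int4/int2 "highly scaled inference" formats concrete, the sketch below shows generic symmetric integer quantization. It is only an illustration of the precision trade-off; it is not the chip's actual quantization scheme, nor the DLFloat16 or hfp8 formats cited above.

```python
import numpy as np

# Generic symmetric integer quantization, shown only to illustrate the precision
# trade-off behind int4/int2 inference formats. This is NOT the chip's actual
# quantization scheme or number format.

def quantize_symmetric(x: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1                    # 7 for int4, 1 for int2 (ternary-like)
    scale = max(float(np.max(np.abs(x))), 1e-12) / qmax   # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(16).astype(np.float32)
for bits in (4, 2):
    q, scale = quantize_symmetric(weights, bits)
    err = float(np.mean(np.abs(weights - dequantize(q, scale))))
    print(f"int{bits}: mean abs reconstruction error {err:.4f}")
```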

56 citations


Cited by
01 Jan 2010
TL;DR: The TILE64™ processor as mentioned in this paper is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications, with 64 tile processors arranged in an 8x8 array.
Abstract: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.

634 citations

Proceedings ArticleDOI
01 Feb 2008
TL;DR: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications.
Abstract: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.

587 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: A range of cache refresh and placement schemes that are sensitive to retention time are proposed, and it is shown that most of the retention time variations can be masked by the microarchitecture when using these schemes.
Abstract: Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tolerate large process variations with little or even no impact on performance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new memory architectures a promising choice for on-chip variation-tolerant cache structures required for next generation microprocessors.
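The refresh/placement idea can be pictured with a small schematic model in which each cache line carries its own retention time and is refreshed (or invalidated) before that time elapses. The policy, thresholds, and numbers below are invented for illustration and are not the paper's actual schemes.

```python
from dataclasses import dataclass

# Schematic model of a retention-time-aware refresh policy for a 3T1D cache.
# The classification threshold and retention values are illustrative only.

@dataclass
class CacheLine:
    tag: int
    retention_cycles: int   # characterized retention time of this line's cells
    age_cycles: int = 0     # cycles since the line was last written or refreshed

def advance(lines, elapsed_cycles, refresh_fraction=0.5):
    """Age every line; refresh lines nearing their retention limit, invalidate expired ones."""
    refreshed = invalidated = 0
    for line in lines:
        line.age_cycles += elapsed_cycles
        if line.age_cycles >= line.retention_cycles:
            invalidated += 1              # data assumed lost; must be refetched from L2
            line.age_cycles = 0
        elif line.age_cycles >= refresh_fraction * line.retention_cycles:
            refreshed += 1                # proactive rewrite masks the short retention time
            line.age_cycles = 0
    return refreshed, invalidated

lines = [CacheLine(tag=i, retention_cycles=1_000 + 500 * (i % 3)) for i in range(8)]
print(advance(lines, elapsed_cycles=900))   # prints the (refreshed, invalidated) counts
```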

359 citations

Journal ArticleDOI
TL;DR: The IBM S/390 G5 microprocessor in IBM's newest CMOS mainframe system provides more than twice the performance of the previous generation, the G4, and offers improved reliability and availability, along with new architectural features such as support for IEEE floating-point arithmetic and a redesigned L2 cache and processor interconnect.
Abstract: The IBM S/390 G5 microprocessor in IBM's newest CMOS mainframe system provides more than twice the performance of the previous generation, the G4. The G5 system offers improved reliability and availability, along with new architectural features such as support for IEEE floating-point arithmetic and a redesigned L2 cache and processor interconnect. The G5 system implements the ESA/390 instruction-set architecture, which is based on and compatible with the original S/360 architecture. Therefore, it has no RISC (reduced-instruction-set computing) concepts and is one of the most complex of all CISC (complex-instruction-set computing) architectures. Designers had to meet a unique set of challenges to achieve the G5's level of performance, for example achieving a very high frequency given the complexity of the architecture.

340 citations

Journal ArticleDOI
TL;DR: A global clock distribution strategy implemented on several microprocessor chips is described, which consists of buffered, tunable tree networks, with the final trees all driving a common grid.
Abstract: A global clock distribution strategy used on several microprocessor chips is described. The clock network consists of buffered tunable trees or treelike networks, with the final level of trees all driving a single common grid covering most of the chip. This topology combines advantages of both trees and grids. A new tuning method was required to efficiently tune such a large strongly connected interconnect network consisting of up to 6 m of wire and modeled with 50000 resistors, capacitors, and inductors. Variations are described to handle different floor-planning styles. Global clock skew as low as 22 ps on large microprocessor chips has been measured.
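As a small illustration of the headline metric, global clock skew is the spread between the earliest and latest clock arrivals across the chip. The arrival times in the sketch below are made-up values chosen to land near the reported 22 ps, not measurements from the paper.

```python
# Global clock skew = latest minus earliest clock arrival across the grid's
# observation points. The arrival times below are invented for illustration.

arrival_ps = {
    "sector_A": 102.0,
    "sector_B": 118.5,
    "sector_C": 109.3,
    "sector_D": 96.8,
}

skew_ps = max(arrival_ps.values()) - min(arrival_ps.values())
print(f"global clock skew: {skew_ps:.1f} ps")   # 21.7 ps with these made-up arrivals
```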

311 citations